WHLL

Overview

Recognizing the spatial information indicated by location expressions in texts is a promising direction in text comprehension. Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Tackling this task with machine learning methods requires a large-scale, annotated corpus. However, constructing such a corpus by human annotation is expensive.

We propose a method, called Wikipedia Hyperlink-based Location Linking (WHLL), for automatically constructing corpora annotated with coordinates from Wikipedia dumps. Some Wikipedia articles have coordinates (a latitude-longitude pair) annotated by human editors. We focus on hyperlinks to such articles and treat the linked string as a location expression associated with the coordinates of the linked article.

Corpus Specification

Each WHLL corpus is determined by the CirrusSearch dump and the HTML dump it is built from. We denote a created corpus as WHLL-{wikipedia_edition_code}-CS{timestamp of CirrusSearch dump}.HTML{timestamp of HTML dump}.

Example: WHLL-en-CS20230710.HTML20230701
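
The naming scheme can be taken apart mechanically. The following is a minimal sketch, not part of the released script: the function name parse_corpus_name is hypothetical, and the pattern assumes that the timestamps consist only of digits, as in the example name.

```python
import re

# Assumption: timestamps are digit-only strings, as in the example name.
_NAME_PATTERN = re.compile(
    r"WHLL-(?P<edition>.+?)-CS(?P<cs>\d+)\.HTML(?P<html>\d+)"
)

def parse_corpus_name(name: str) -> dict:
    """Split a WHLL corpus name into edition code and dump timestamps."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a WHLL corpus name: {name!r}")
    return match.groupdict()

info = parse_corpus_name("WHLL-en-CS20230710.HTML20230701")
print(info["edition"], info["cs"], info["html"])
```

Names that do not follow the scheme raise a ValueError rather than returning a partial result.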

Each corpus consists of two types of files.

coord.tsv: a list of article coordinates, one article per line.

ArticleTitle Latitude Longitude ArticleID is_redirect
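
A line of coord.tsv could be read as sketched below. This is an illustration under assumptions: the fields are taken to be tab-separated in the order listed above, the encoding of is_redirect is not specified here and is therefore kept as a raw string, and the sample line is made up rather than copied from a real corpus. CoordRow and parse_coord_line are hypothetical names.

```python
from typing import NamedTuple

class CoordRow(NamedTuple):
    title: str
    latitude: float
    longitude: float
    article_id: str
    is_redirect: str  # raw value; its encoding is an open assumption

def parse_coord_line(line: str) -> CoordRow:
    """Parse one tab-separated line of coord.tsv (assumed column order)."""
    title, lat, lon, article_id, is_redirect = line.rstrip("\n").split("\t")
    return CoordRow(title, float(lat), float(lon), article_id, is_redirect)

# Made-up sample line, for illustration only.
row = parse_coord_line("Kyoto\t35.0116\t135.7681\t12345\tFalse\n")
print(row.title, row.latitude, row.longitude)
```

If the file turns out to start with a header row, that first line would have to be skipped before calling the parser.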

*.jsonl: annotated articles, one article (a JSON object with the following fields) per line.

id: article id
title: article title
text: article body text
gold: list of location expressions, each of the form [start_pos, end_pos, expression, [latitude, longitude]]
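
The record layout above can be consumed with the standard json module. The sketch below is an assumption-laden illustration: it treats start_pos and end_pos as character offsets into text with an exclusive end, which is not stated here, so it reports whether the slice matches the expression instead of relying on it. The function name iter_gold and the sample record are made up.

```python
import json

def iter_gold(jsonl_line: str):
    """Yield (expression, latitude, longitude, span_matches) per annotation.

    span_matches is True when text[start_pos:end_pos] equals the expression,
    i.e. when the offset assumption (character offsets, exclusive end) holds.
    """
    article = json.loads(jsonl_line)
    text = article["text"]
    for start_pos, end_pos, expression, (latitude, longitude) in article["gold"]:
        yield expression, latitude, longitude, text[start_pos:end_pos] == expression

# Made-up record, for illustration only.
sample = json.dumps({
    "id": "1",
    "title": "Example",
    "text": "Kyoto is a city in Japan.",
    "gold": [[0, 5, "Kyoto", [35.0116, 135.7681]]],
})
for item in iter_gold(sample):
    print(item)
```

Running the check over a few real lines is a quick way to confirm or reject the offset convention before building on it.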

Statistical Information

Statistics of generated WHLL corpora.
Corpus Name	#Articles	#Sents.	#Words	#Chars.	#LE	#K. of LE	R. amb.	R. amb.&rec.
WHLL-en-CS20230710.HTML20230701	1,315,117	23,187,909	550,593,285	2,883,484,675	14,726,908	1,571,291	45.6%	9.9%
WHLL-ja-CS20240304.HTML20240301	200,906	3,678,314	123,648,103	214,227,083	4,151,205	245,482	29.5%	8.4%

#Sents. and #Words are calculated using Stanza tokenizers.

Note: These counts depend on the Stanza version and may differ from those reported in the paper.

LE = location expression
K. = kinds / R. = rate
amb. = ambiguous LEs

Location expressions that share the same string but have different coordinates

amb.&rec. = ambiguous and recessive LEs

Ambiguous LEs whose coordinates differ from the most frequent coordinates for that string
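
The two rates can be computed from a list of (expression, coordinates) pairs as sketched below. This is one reading of the definitions, not the script used for the table: it assumes that both rates are taken over LE occurrences rather than over kinds, and that ties for the most frequent coordinates count a single winner as dominant. The function name ambiguity_rates and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def ambiguity_rates(les):
    """Return (rate of ambiguous LEs, rate of ambiguous and recessive LEs).

    les: iterable of (expression, (latitude, longitude)) occurrences.
    Assumption: rates are fractions of occurrences, not of distinct strings.
    """
    les = list(les)
    if not les:
        return 0.0, 0.0
    coords_by_string = defaultdict(Counter)
    for expression, coord in les:
        coords_by_string[expression][tuple(coord)] += 1
    ambiguous = recessive = 0
    for counts in coords_by_string.values():
        if len(counts) < 2:
            continue  # a single coordinate pair: not ambiguous
        total = sum(counts.values())
        ambiguous += total
        # Occurrences whose coordinates are not the most frequent pair.
        recessive += total - max(counts.values())
    return ambiguous / len(les), recessive / len(les)

# Toy data, for illustration only.
toy = [
    ("Springfield", (1.0, 1.0)),
    ("Springfield", (1.0, 1.0)),
    ("Springfield", (2.0, 2.0)),
    ("Kyoto", (35.0, 135.7)),
]
print(ambiguity_rates(toy))  # (0.75, 0.25)
```

In the toy data, all three "Springfield" occurrences are ambiguous and only the one pointing to the less frequent coordinates is recessive.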

Download

Source Code

The Python script is available under the MIT license at [Source Code] or [GitHub].

Created Corpora

The text of Wikipedia is co-licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA) and the GNU Free Documentation License (GFDL).

WHLL-en-CS20230710.HTML20230701 (used for LREC-COLING 2024)
WHLL-ja-CS20240304.HTML20240301

Members

Shinsuke Mori
Hirotaka Kameko
Keisuke Shirai
Taichi Nishimura
Keyaki Ohno

Reference

Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai, Taichi Nishimura, Shinsuke Mori
LREC-COLING, 2024 (TBA)

Last Change: 2024/03/21 by Hirotaka Kameko

Kyoto University ACCMS NLP Group