Recognizing the spatial information indicated by location expressions in texts is a promising direction in text comprehension. Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Tackling this task with machine learning methods requires a large-scale, annotated corpus. However, constructing such a corpus by human annotation is expensive.
We propose Wikipedia Hyperlink-based Location Linking (WHLL), a method for automatically constructing corpora annotated with coordinates from Wikipedia dumps. Some Wikipedia articles have coordinates (a latitude-longitude pair) annotated by human editors. We focus on hyperlinks that point to such articles and treat the linked string as a location expression associated with the target article's coordinates.
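The linking idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released script: the names `LinkCollector`, `href_to_title`, and `link_locations` are hypothetical, the `./Title` link format is assumed, and the coordinate table (which in WHLL would come from a CirrusSearch dump) is replaced here by a hand-written dictionary with illustrative values.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class LinkCollector(HTMLParser):
    """Collect (anchor text, link target) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (text, href) pairs
        self._target = None   # href of the <a> currently open, if any
        self._chunks = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._target = href
                self._chunks = []

    def handle_data(self, data):
        if self._target is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            self.links.append(("".join(self._chunks), self._target))
            self._target = None


def href_to_title(href):
    # Assumption: links look like "./Some_Title"; real dumps may differ.
    return unquote(href.rsplit("/", 1)[-1]).replace("_", " ")


def link_locations(html, coords_by_title):
    """Return (linked string, (lat, lon)) for links whose target has coordinates."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    linked = []
    for text, href in parser.links:
        title = href_to_title(href)
        if title in coords_by_title:
            linked.append((text, coords_by_title[title]))
    return linked


# Hypothetical inputs: one target article has coordinates, the other does not.
coords_by_title = {"Kyoto": (35.0117, 135.7683)}  # illustrative values
html = ('<p>He moved to <a href="./Kyoto">the old capital</a> '
        'after studying <a href="./Physics">physics</a>.</p>')
locations = link_locations(html, coords_by_title)
```

In this sketch only the link to the article with coordinates survives, so the surface string "the old capital" is kept as a location expression even though it does not match the article title.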
A WHLL corpus is determined by the CirrusSearch dump and the HTML dump it is built from. We therefore name each corpus WHLL-{wikipedia_edition_code}-CS{timestamp of CirrusSearch dump}.HTML{timestamp of HTML dump}.
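As a small illustration of this naming scheme, the following Python sketch builds and parses corpus names. The helper names `whll_corpus_name` and `parse_whll_corpus_name` are hypothetical, and the pattern assumes that timestamps consist only of digits, as in the corpus names listed on this page.

```python
import re

# Assumption: timestamps are digit strings such as "20230710".
_NAME_PATTERN = re.compile(
    r"WHLL-(?P<edition>.+?)-CS(?P<cirrus>\d+)\.HTML(?P<html>\d+)"
)


def whll_corpus_name(edition, cirrus_timestamp, html_timestamp):
    """Build a corpus name from an edition code and the two dump timestamps."""
    return f"WHLL-{edition}-CS{cirrus_timestamp}.HTML{html_timestamp}"


def parse_whll_corpus_name(name):
    """Split a corpus name back into its three parts, or raise ValueError."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a WHLL corpus name: {name!r}")
    return match.group("edition"), match.group("cirrus"), match.group("html")


name = whll_corpus_name("en", "20230710", "20230701")
parts = parse_whll_corpus_name("WHLL-ja-CS20240304.HTML20240301")
```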
Each corpus consists of two types of files.
Corpus Name | #Articles | #Sents. | #Words | #Chars. | #LE | #K. of LE | R. amb. | R. amb.&rec. |
---|---|---|---|---|---|---|---|---|
WHLL-en-CS20230710.HTML20230701 | 1,315,117 | 23,187,909 | 550,593,285 | 2,883,484,675 | 14,726,908 | 1,571,291 | 45.6% | 9.9% |
WHLL-ja-CS20240304.HTML20240301 | 200,906 | 3,678,314 | 123,648,103 | 214,227,083 | 4,151,205 | 245,482 | 29.5% | 8.4% |
The Python script is available under the MIT license at [Source Code] or [GitHub].
The text of Wikipedia is co-licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA) and the GNU Free Documentation License (GFDL).
- Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
- Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai, Taichi Nishimura, Shinsuke Mori
- LREC-COLING, 2024 (TBA)