Kyoto University ACCMS NLP Group

WHLL

Overview

Recognizing the spatial information indicated by location expressions in texts is a promising direction in text comprehension. Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Tackling this task with machine learning methods requires a large-scale, annotated corpus. However, constructing such a corpus by human annotation is expensive.

We proposed a method, called Wikipedia Hyperlink-based Location Linking (WHLL), for automatically constructing corpora annotated with coordinates from Wikipedia dumps. Some Wikipedia articles have coordinates (a latitude-longitude pair) annotated by human editors. We focus on hyperlinks that point to such articles and treat the linked string as a location expression associated with the coordinates of the target article.
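The linking idea can be illustrated with a minimal sketch. This is not the released script: it assumes internal links appear as `<a href="./Title">` anchors, and the names `LinkCollector`, `link_locations`, and `coords_by_title` are hypothetical. The coordinate values are illustrative only.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class LinkCollector(HTMLParser):
    """Collect (anchor text, link target) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (text, target) pairs
        self._target = None    # target of the <a> currently open, if any
        self._text = []        # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Assumption: internal article links have the form "./Title".
            if href and href.startswith("./"):
                self._target = unquote(href[2:]).replace("_", " ")
                self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            self.links.append(("".join(self._text), self._target))
            self._target = None


def link_locations(html, coords_by_title):
    """Keep only links whose target article has coordinates."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        (text, coords_by_title[target])
        for text, target in parser.links
        if target in coords_by_title
    ]


html = (
    '<p><a href="./Kyoto">Kyoto</a> was the home of '
    '<a href="./Murasaki_Shikibu">Murasaki Shikibu</a>.</p>'
)
coords = {"Kyoto": (35.0116, 135.7681)}  # illustrative values only
annotations = link_locations(html, coords)
```

Here only the link to the article with coordinates yields an annotation. The link to the person's article is dropped because its target has no entry in `coords_by_title`.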

Corpus Specification

Each WHLL corpus is determined by the CirrusSearch dump and the HTML dump it is built from. We denote a created corpus as WHLL-{wikipedia_edition_code}-CS{timestamp of CirrusSearch dump}.HTML{timestamp of HTML dump}.
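The naming convention can be sketched as a pair of helper functions. These helpers are hypothetical and not part of the released script. The sketch assumes that both timestamps are eight-digit dates (YYYYMMDD), as in the name WHLL-en-CS20230710.HTML20230701.

```python
import re

# Assumption: both timestamps are eight digits (YYYYMMDD).
NAME_PATTERN = re.compile(
    r"^WHLL-(?P<edition>.+)-CS(?P<cs>\d{8})\.HTML(?P<html>\d{8})$"
)


def corpus_name(edition, cs_timestamp, html_timestamp):
    """Build a corpus name from its three components."""
    return f"WHLL-{edition}-CS{cs_timestamp}.HTML{html_timestamp}"


def parse_corpus_name(name):
    """Split a corpus name into edition code and dump timestamps."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a WHLL corpus name: {name!r}")
    return match.groupdict()


name = corpus_name("en", "20230710", "20230701")
parts = parse_corpus_name(name)
```

Parsing the name back recovers the edition code and the two dump timestamps, which is convenient when several corpora built from different dumps sit side by side.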

Each corpus consists of two types of files.

Statistical Information

Statistics of the generated WHLL corpora (LE: location expression).

Corpus Name                      #Articles  #Sents.     #Words       #Chars.        #LE         #K. of LE  R. amb.  R. amb.&rec.
WHLL-en-CS20230710.HTML20230701  1,315,117  23,187,909  550,593,285  2,883,484,675  14,726,908  1,571,291  45.6%    9.9%
WHLL-ja-CS20240304.HTML20240301  200,906    3,678,314   123,648,103  214,227,083    4,151,205   245,482    29.5%    8.4%

Download

Source Code

The Python script is available under the MIT license at [Source Code] or [GitHub].

Created Corpora

The text of Wikipedia is co-licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA) and the GNU Free Documentation License (GFDL).

Members

Reference

Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai, Taichi Nishimura, Shinsuke Mori
LREC-COLING, 2024 (TBA)

Last Change: 2024/03/21 by Hirotaka Kameko