Kyoto University ACCMS NLP Group

Wikitext-JA

Overview

We constructed and published a Japanese WikiText language modeling dataset. This project was modeled on "The WikiText Long Term Dependency Language Modeling Dataset".

Specifically, we collected all articles designated as 'Featured Articles' (85 articles) and 'Good Articles' (1423 articles) on Wikipedia, arranged them into datasets, and computed their statistics.

Dataset Specification

The title and section headings of each collected article are marked as in '=ノストラダムス=' and '==概要==', and each is followed by its body text.
An example is shown below.

=ノストラダムス=
ミシェル・ノストラダムス(Michel Nostradamus、1503年12月14日 ‐ 1566年7月2日)は、ルネサンス期フランスの医師、占星術師、詩人。また料理研究の著作も著している。日本では「ノストラダムスの大予言」の名で知られる詩集を著した。彼の予言は、現在に至るまで多くの信奉者を生み出し、様々な論争を引き起こしてきた。 本名はミシェル・ド・ノートルダム (Michel de Nostredame) で、これはフランス語による。よく知られるノストラダムスの名は、姓をラテン語風に綴ったものである。しばしば、「ミシェル・ド・ノストラダムス」と表記されることもあるが、後述するように適切なものではない。
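
As a rough illustration of how this layout could be read back, the following Python sketch groups lines into articles and sections. It assumes that a line wrapped in single '=' signs is an article title and that a line wrapped in two or more '=' signs is a section heading; the function name `parse_articles` and the returned dictionary layout are hypothetical and are not part of the published dataset.

```python
import re

# A heading line is a run of '=' signs, the heading text, and the same run
# of '=' signs again, e.g. '=ノストラダムス=' (title) or '==概要==' (section).
HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1$")


def parse_articles(lines):
    """Group text lines into articles, each with a title and a list of sections.

    Assumption: one '=' marks an article title, two or more mark a section.
    """
    articles = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = HEADING.match(line)
        if match and len(match.group(1)) == 1:
            # New article; the lead text goes into a section without a heading.
            articles.append({
                "title": match.group(2),
                "sections": [{"heading": None, "text": []}],
            })
        elif match and articles:
            articles[-1]["sections"].append({"heading": match.group(2), "text": []})
        elif articles:
            # Body line: attach it to the most recent section.
            articles[-1]["sections"][-1]["text"].append(line)
    return articles
```

Calling `parse_articles(open(path, encoding="utf-8"))` on a dataset file (where `path` is whatever local file name you use) would return one dictionary per article.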

As data cleansing, we also performed the following processing.
First, to make the characters in the data conform to JIS X 0208, we converted all half-width characters into full-width characters.
Characters that were difficult to substitute properly (e.g. the Arabic alphabet) were replaced with specific symbols such as '*1*'. All replaced characters are listed in 'Exception_F(G).html'.
Second, we enclosed text that was embedded as separate content in the original page in <block><block>.
Third, we replaced mathematical formulae with <math-element>.
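
The first step, converting half-width characters into full-width characters, might look like the sketch below. This is not the group's actual script: the function name `to_fullwidth` is hypothetical, and the sketch covers only printable ASCII, the space character, and half-width katakana. The numbered placeholder symbols and the tag insertion are not covered.

```python
import re
import unicodedata

# Printable ASCII (U+0021..U+007E) maps onto the Fullwidth Forms block
# (U+FF01..U+FF5E) by a fixed offset; the space maps to U+3000.
_ASCII_TO_FULL = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}
_ASCII_TO_FULL[0x20] = 0x3000

# Half-width katakana and half-width Japanese punctuation.
_HALFWIDTH_KANA = re.compile(r"[\uFF61-\uFF9F]+")


def to_fullwidth(text: str) -> str:
    """Convert half-width characters in `text` to their full-width forms."""
    # NFKC turns half-width katakana into full-width katakana and composes
    # voiced marks. It is applied only to katakana runs, because NFKC would
    # otherwise turn full-width ASCII back into half-width ASCII.
    text = _HALFWIDTH_KANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
    return text.translate(_ASCII_TO_FULL)
```

Note that this sketch converts every ASCII character, including '=' and '<'. If the heading markers and tags are meant to stay half-width, they would have to be added after the conversion or excluded from it.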

We arranged the data processed as described above and published it.
In addition, we divided the data into three datasets: a training set, a validation set, and a test set, at a ratio of 8:1:1.
These datasets are also published.
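
A minimal sketch of an 8:1:1 split over a list of articles follows. The shuffling, the seed, and the rounding rule are assumptions, since the page does not describe them. The rounding used here (one tenth, rounded down, for each of validation and test, with the remainder going to training) was chosen because it yields 69/8/8 for 85 articles and 1139/142/142 for 1423 articles, which agrees with the item counts in the statistics tables.

```python
import random


def split_8_1_1(items, seed=0):
    """Split `items` into (train, valid, test) at a ratio of roughly 8:1:1.

    Assumptions: items are shuffled with a fixed seed, and validation and
    test each receive floor(len(items) / 10) items.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_eval = len(shuffled) // 10
    valid = shuffled[:n_eval]
    test = shuffled[n_eval:2 * n_eval]
    train = shuffled[2 * n_eval:]
    return train, valid, test
```

Splitting at the article level, as sketched here, keeps all sentences of one article in the same set.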

Statistical Information

Category of Articles   #Items   Average Sentence Length   #Sent.   #Words     Vocabulary Size
Featured Articles      85       61                        35194    1318239    42113
Good Articles          1423     59                        336707   12087117   181939
Total                  1508     120                       371901   13405356   224052

Category of Articles   Category of Dataset   #Items   Average Sentence Length   #Sent.   #Words    Vocabulary Size
Featured Articles      Train                 69       62                        27397    1046764   36917
Featured Articles      Valid                 8        57                        4294     147905    10227
Featured Articles      Test                  8        58                        3503     124032    7454
Good Articles          Train                 1139     59                        267454   9589680   159807
Good Articles          Valid                 142      60                        33447    1231552   42333
Good Articles          Test                  142      57                        35806    1265839   42255

The tables above show the statistics of the datasets. We calculated the number of items, sentences, and words, the average sentence length, and the vocabulary size, and report them by article category.
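
For readers who want to compute comparable figures on their own data, a sketch of such a calculation is given below. It takes sentences that are already split and tokenized, because the page does not say which sentence splitter or tokenizer was used. It also does not say in what unit sentence length is measured; counting characters per sentence is an assumption of this sketch, as is the function name `corpus_stats`.

```python
def corpus_stats(sentences):
    """Compute basic statistics for an iterable of tokenized sentences.

    `sentences` yields lists of word strings. Average sentence length is
    measured in characters per sentence (an assumption, see above).
    """
    n_sentences = 0
    n_words = 0
    n_chars = 0
    vocabulary = set()
    for tokens in sentences:
        n_sentences += 1
        n_words += len(tokens)
        n_chars += sum(len(token) for token in tokens)
        vocabulary.update(tokens)
    return {
        "sentences": n_sentences,
        "words": n_words,
        "vocabulary_size": len(vocabulary),
        "avg_sentence_length": n_chars / n_sentences if n_sentences else 0.0,
    }
```

The vocabulary of a combined corpus is the union of the per-category vocabularies, so vocabulary sizes computed this way do not simply add up across categories.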

Download

Links

Members

Reference

Stephen Merity. The WikiText Long Term Dependency Language Modeling Dataset. September 26, 2016.

Last Change: 2019/07/04 by Akira Ogawa