Wikitext-JA

Overview

We constructed and published a Japanese WikiText language modeling dataset. This project was modeled on "The WikiText Long Term Dependency Language Modeling Dataset".

Specifically, we collected all articles designated as 'Featured Articles' (85 articles) and 'Good Articles' (1423 articles) on Wikipedia, arranged them into datasets, and computed their statistics.

Dataset Specification

The title and section headings of each collected article are written in the form '=ノストラダムス=' and '==概要==', and each is followed by the corresponding body text.
An example is shown below.

=ノストラダムス=
ミシェル・ノストラダムス（ＭｉｃｈｅｌＮｏｓｔｒａｄａｍｕｓ、１５０３年１２月１４日 ‐ １５６６年７月２日）は、ルネサンス期フランスの医師、占星術師、詩人。また料理研究の著作も著している。日本では「ノストラダムスの大予言」の名で知られる詩集を著した。彼の予言は、現在に至るまで多くの信奉者を生み出し、様々な論争を引き起こしてきた。本名はミシェル・ド・ノートルダム（ＭｉｃｈｅｌｄｅＮｏｓｔｒｅｄａｍｅ）で、これはフランス語による。よく知られるノストラダムスの名は、姓をラテン語風に綴ったものである。しばしば、「ミシェル・ド・ノストラダムス」と表記されることもあるが、後述するように適切なものではない。
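
The heading convention in the example above can be recognized with a short parser. The following is a minimal sketch, not the authors' tooling: the function name `parse_line` is hypothetical, and it assumes that the number of '=' characters on each side of a heading gives its nesting level and that the markers are ASCII '='.

```python
import re

# Assumption: a heading is a line wrapped in the same number of ASCII '='
# on both sides, e.g. '=Title=' (level 1) or '==Section==' (level 2).
HEADING = re.compile(r"^(=+)([^=].*?)(=+)$")


def parse_line(line):
    """Classify one line as ('heading', level, text) or ('body', 0, text)."""
    stripped = line.strip()
    match = HEADING.match(stripped)
    if match and match.group(1) == match.group(3):
        return ("heading", len(match.group(1)), match.group(2).strip())
    return ("body", 0, stripped)


if __name__ == "__main__":
    for sample in ["=ノストラダムス=", "==概要==", "ミシェル・ノストラダムスは医師。"]:
        print(parse_line(sample))
```

Lines that do not match the heading pattern are treated as body text, so an article can be rebuilt by grouping body lines under the most recent heading.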

As data cleansing, we also performed the following processing.
First, to make the characters in the data conform to JIS X 0208, we converted all half-width characters into full-width characters.
Characters that were difficult to substitute properly (e.g. the Arabic alphabet) were replaced with specific symbols such as '*1*'. All replaced characters are listed in 'Exception_F(G).html'.
Second, we enclosed sentences that were embedded as separate content in the original page in <block><block>.
Third, we replaced mathematical formulae with <math-element>.
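
The first cleansing step can be illustrated as follows. This is a minimal sketch under stated assumptions, not the conversion actually used for the dataset: the function name `to_full_width` is hypothetical, it handles only printable ASCII (not half-width katakana), it does not check JIS X 0208 membership, and the mapping chosen for individual symbols may differ from the one in the published data.

```python
def to_full_width(text):
    """Map printable ASCII characters to their full-width Unicode forms."""
    converted = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            # ASCII space -> ideographic space (U+3000)
            converted.append("\u3000")
        elif 0x21 <= code <= 0x7E:
            # '!'..'~' map to U+FF01..U+FF5E by a fixed offset
            converted.append(chr(code + 0xFEE0))
        else:
            # Everything else (kana, kanji, ...) is left unchanged
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    print(to_full_width("Michel 1503"))
```

Because this mapping would also convert characters such as '=' and '<', it would have to run before the heading markers and tags are added.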

We arranged the data processed as described above and published it.
In addition, we divided the data into three datasets (training, validation, and test) at a ratio of 8:1:1.
We also published these datasets.
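
The 8:1:1 division can be sketched as below. The rounding rule (validation and test each receive one tenth of the articles rounded down, and training receives the rest) is an assumption, chosen because it reproduces the per-split #Items in the statistics table (69/8/8 for Featured Articles and 1139/142/142 for Good Articles). How articles were assigned to each split is not stated on this page; the sequential slicing here is purely illustrative, and both function names are hypothetical.

```python
def split_sizes(n_articles):
    """Return (train, valid, test) article counts for an 8:1:1 split."""
    n_valid = n_articles // 10
    n_test = n_articles // 10
    n_train = n_articles - n_valid - n_test
    return n_train, n_valid, n_test


def split_articles(articles):
    """Slice a list of articles into train/valid/test parts, in order."""
    n_train, n_valid, _ = split_sizes(len(articles))
    train = articles[:n_train]
    valid = articles[n_train:n_train + n_valid]
    test = articles[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    print(split_sizes(85))
    print(split_sizes(1423))
```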

Statistical Information

Category of Articles	#Items	Average Sentence Length	#Sent.	#Words	Vocabulary Size
Featured Articles	85	61	35194	1318239	42113
Good Articles	1423	59	336707	12087117	181939
Total	1508	120	371901	13405356	224052

Category of Articles	Category of Dataset	#Items	Average Sentence Length	#Sent.	#Words	Vocabulary Size
Featured Articles	Train	69	62	27397	1046764	36917
	Valid	8	57	4294	147905	10227
	Test	8	58	3503	124032	7454
Good Articles	Train	1139	59	267454	9589680	159807
	Valid	142	60	33447	1231552	42333
	Test	142	57	35806	1265839	42255

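The tables above show the statistics of the datasets. In the Total row, every column, including Average Sentence Length and Vocabulary Size, is the sum of the two category rows.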

For each category of articles, we calculated the number of items, sentences, and words, the average sentence length, and the vocabulary size.
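
Statistics of this kind could be computed along the following lines. This is a sketch only: the function `corpus_statistics` and its `tokenize` parameter are hypothetical, and the page does not say which word segmenter was used or in what unit sentence length is measured. Since #Words divided by #Sent. is about 37 for Featured Articles while the reported average is 61, the sketch assumes sentence length is counted in characters. The default `tokenize=list` simply splits a sentence into characters and stands in for a real Japanese word segmenter.

```python
def corpus_statistics(sentences, tokenize=list):
    """Compute basic corpus statistics for a list of sentence strings.

    `tokenize` is a placeholder for a word segmenter: any callable that
    turns one sentence into a list of tokens.
    """
    vocabulary = set()
    n_words = 0
    n_chars = 0
    for sentence in sentences:
        words = tokenize(sentence)
        n_words += len(words)
        vocabulary.update(words)
        n_chars += len(sentence)
    n_sentences = len(sentences)
    # Assumption: sentence length is measured in characters
    average_length = n_chars / n_sentences if n_sentences else 0.0
    return {
        "sentences": n_sentences,
        "words": n_words,
        "average_sentence_length": average_length,
        "vocabulary_size": len(vocabulary),
    }


if __name__ == "__main__":
    print(corpus_statistics(["ab", "abc"]))
```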

Download

Featured_List.txt
Featured_Contents.txt
Exception_F.txt
List of characters that received exceptional handling in the Featured Articles text data
Train_Data_F.txt
Training dataset of Featured Articles
Valid_Data_F.txt
Validation dataset of Featured Articles
Test_Data_F.txt
Test dataset of Featured Articles
Good_List.txt
Good_Contents.txt
Exception_G.txt
List of characters that received exceptional handling in the Good Articles text data
Train_Data_G.txt
Training dataset of Good Articles
Valid_Data_G.txt
Validation dataset of Good Articles
Test_Data_G.txt
Test dataset of Good Articles

Members

Shinsuke Mori
Hirotaka Kameko
Akira Ogawa

Reference

The wikitext long term dependency language modeling dataset
Stephen Merity
September 26, 2016

Last Change: 2019/07/04 by Akira Ogawa

Kyoto University ACCMS NLP Group