Language Resources

We are developing various language resources for natural language processing (NLP) and artificial intelligence (AI). Below you will find brief explanations and links to the detailed pages. The distributed version may not be the latest. If you want the latest version or want to contribute to the data, please do not hesitate to contact us.

Japanese Dependency Corpus

We are constructing a publicly available dependency corpus in Japanese. The unit of annotation is the word, as in many other languages.

We now have about 35,000 annotated sentences taken from various sources such as blogs. The annotated data are as follows.

word segmentation
part-of-speech
pronunciation
dependency
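
As a rough illustration of how the four annotation layers could sit together on each word, here is a minimal Python sketch. The `Word` class, its field names, the POS labels, and the example sentence are all assumptions made for this illustration; they do not reflect the corpus's actual file format or tag set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One word with the four annotation layers (hypothetical field names)."""
    surface: str        # result of word segmentation
    pos: str            # part-of-speech (placeholder labels, not the real tag set)
    pronunciation: str  # pronunciation of the word
    head: int           # index of the word this one depends on, -1 for the root


def is_valid_tree(words: List[Word]) -> bool:
    """Return True if there is exactly one root and every word reaches it."""
    roots = [i for i, w in enumerate(words) if w.head == -1]
    if len(roots) != 1:
        return False
    for start in range(len(words)):
        seen = set()
        j = start
        while j != -1:
            # A repeated index means a cycle; an out-of-range index is invalid.
            if j in seen or not (0 <= j < len(words)):
                return False
            seen.add(j)
            j = words[j].head
    return True


# Illustrative sentence only: each word points at the index of its head.
sentence = [
    Word("猫", "noun", "ネコ", 1),
    Word("が", "particle", "ガ", 2),
    Word("鳴く", "verb", "ナク", -1),
]
```

Calling `is_valid_tree(sentence)` checks that the head indices form a single rooted tree, which is a common sanity check for word-level dependency annotation.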

We are also distributing a parser, EDA, trained on the data.

Flow Graph of Procedural Texts

We have proposed representing the meaning of procedural texts as flow graphs. As representative procedural texts we have adopted recipes, which are abundant on websites and in books. We defined eight types of important terms and represent the relationships among them with a directed acyclic graph (DAG). We have annotated recipes with the following information and made them public.

Important terms (8 classes; foods, tools, actions by chef, ...)
Flow graph

The framework can cover general procedural texts just by replacing "food" with "parts." Please see the details if you are interested.
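
To make the idea of a flow graph concrete, the following Python sketch stores a tiny recipe fragment as typed terms plus directed edges, and uses the standard library's `graphlib` (Python 3.9+) to check that the graph is acyclic. The class labels (`food`, `tool`, `action_by_chef`), the example terms, and the edges are hypothetical; the annotation scheme's real label names and any edge labels are not reproduced here.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Tuple

# Important terms and their classes (hypothetical labels; the corpus defines 8).
terms: Dict[str, str] = {
    "carrot": "food",
    "knife": "tool",
    "cut": "action_by_chef",
    "pot": "tool",
    "boil": "action_by_chef",
}

# Directed edges (source, destination) describing how the procedure flows.
edges: List[Tuple[str, str]] = [
    ("carrot", "cut"),
    ("knife", "cut"),
    ("cut", "boil"),
    ("pot", "boil"),
]


def topological_order(terms: Dict[str, str],
                      edges: List[Tuple[str, str]]) -> List[str]:
    """Return the terms in an order where every edge points forward."""
    predecessors = {term: set() for term in terms}
    for src, dst in edges:
        predecessors[dst].add(src)
    # static_order() raises CycleError if the graph contains a cycle.
    return list(TopologicalSorter(predecessors).static_order())


def is_dag(terms: Dict[str, str], edges: List[Tuple[str, str]]) -> bool:
    """Return True if the flow graph has no cycles."""
    try:
        topological_order(terms, edges)
        return True
    except CycleError:
        return False
```

In this representation, covering another kind of procedural text would only mean changing the class labels (for example `food` to `parts`); the graph handling stays the same.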

We are also distributing a named entity recognizer, PWNER, trained on the data, and we are developing a flow graph constructor.

UniDic++ (Tentative)

This resource is intended to enhance KyTea, a Japanese text processing tool.
word/POS/pronunciation
contexts

We are also distributing KyTea, a word segmenter, POS tagger, and pronunciation estimator, which includes this data.

Game Commentary Corpus

This is a corpus consisting of pairs of a game state and commentary sentences about it. The game is shogi (Japanese chess). The words in each sentence are identified and annotated with game term tags. We defined 21 game term types.

Game state (piece distribution, piece to drop)
Word sequence
Game terms
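
The pairing of a game state with tagged commentary can be pictured with a small Python sketch. Everything here is an assumption for illustration: the `CommentaryExample` class, the way the board and droppable pieces are encoded, the tag names, and the simplification of one tag per word with `"O"` marking words that are not game terms.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CommentaryExample:
    """One game state paired with one commentary sentence (hypothetical layout)."""
    board: Dict[str, str]           # square -> piece on that square
    droppable: Dict[str, List[str]]  # player -> captured pieces available to drop
    words: List[str]                # commentary sentence as a word sequence
    term_tags: List[str]            # one tag per word; "O" means "not a game term"


def game_terms(example: CommentaryExample) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs for the words tagged as game terms."""
    if len(example.words) != len(example.term_tags):
        raise ValueError("words and term_tags must have the same length")
    return [(w, t) for w, t in zip(example.words, example.term_tags) if t != "O"]


# Illustrative record only; the tag names are placeholders, not the 21 real types.
example = CommentaryExample(
    board={"7g": "pawn", "8h": "bishop"},
    droppable={"first": ["pawn"], "second": []},
    words=["先手", "は", "矢倉", "だ"],
    term_tags=["TURN", "O", "STRATEGY", "O"],
)
```

With this record, `game_terms(example)` returns only the words carrying a game term tag, alongside their tags.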

A word segmenter and part-of-speech tagger, KyTea, trained on this corpus is available. A term recognizer, PWNER, is also available.

Japanese Wikification Corpora

This corpus allows us to develop a tool for connecting expressions in a text to world knowledge. The annotated texts contain sentences from BCCWJ as well as from Twitter. This corpus is useful for wikification of various texts.

Any topic, not just named entities.
Exhaustive annotation, not just "important" concepts.
No NIL detection (all entities have corresponding Wikipedia articles).

We are devising a wikification tool based on it.

Collaboration with Prof. Yugo Murawaki of the Graduate School of Informatics, Kyoto University

KUSK Dataset (Kyoto Univ. Smart Kitchen Dataset)

This is a multimodal dataset of observations of activity in a kitchen, an example of operations carried out according to procedural texts. The dataset includes

video images
thermal camera
load sensor
electric power sensor (induction heater)
water flow sensor

The target recipes are the 20 recipes in "Flow Graph of Procedural Texts". (correspondence)