
EDA Dependency Parser

EDA is a word-based dependency parser.

The name EDA stands for Easily adaptable Dependency Analyzer.

Features

Note: Because we are working mainly on parsing written Japanese, EDA currently uses the restriction that dependencies must go from left to right (towards the end of the sentence). It cannot handle dependencies which go from right to left (as in casual or spoken Japanese).
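The left-to-right restriction can be checked mechanically before training or evaluation. The following Python sketch is illustrative and is not part of EDA: the function name right_to_left_arcs and the representation of a sentence as (word index, head index) pairs are assumptions made for this example. It expects fully annotated sentences and treats a head index of -1 as the root, following the tree format described later on this page.

```python
def right_to_left_arcs(arcs):
    """Return the (index, head) pairs whose head does not follow the word.

    `arcs` is a list of (word_index, head_index) pairs with 1-based
    indices. A head of -1 marks the root and is skipped.
    """
    return [(i, h) for i, h in arcs if h != -1 and h <= i]


# "私 は リンゴ を 食べ る 。" with every head to the right of its word.
ok = [(1, 2), (2, 5), (3, 4), (4, 5), (5, 6), (6, 7), (7, -1)]
# Hypothetical sentence in which word 3 depends on word 2 (right to left).
bad = [(1, 2), (2, 3), (3, 2), (4, -1)]

print(right_to_left_arcs(ok))   # []
print(right_to_left_arcs(bad))  # [(3, 2)]
```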

Please see the following papers for details.

Download and Installation

Download

Latest Release: EDA 0.2.0

Previous Releases: EDA 0.1.2, EDA 0.1.1, EDA 0.1.0

Latest Source Code (for the adventurous): Bitbucket repository

Install

You'll need gcc (version 3.4 or greater) and the Boost Libraries to build EDA. Please install Boost first! (Depending on where you've installed Boost, you may need to tell gcc where to find it: set the path to the headers in the environment variable CPATH and, if linking fails, the path to the libraries in LIBRARY_PATH.)

After downloading the source, use the following commands to extract and build it.

tar xzvf eda-0.2.0.tar.gz
cd eda-0.2.0
make

After doing this, the executables eda and train-eda will be created in the src/eda directory, and model files *.dat will be placed in the data directory. You can finish the installation by copying the executables and model files to the location of your choice.

Usage

Analyzing Text

The eda command parses Japanese text. The text must be segmented into words and tagged with part-of-speech (POS) tags before it can be parsed. I recommend KyTea for these tasks because it performs them very accurately. EDA's default models are based on KyTea's word segmentation standard and POS tagset, and EDA's default input format is KyTea's default output format, so the two programs can be connected with a pipe.

By default, the results of the dependency analysis are written to standard output, which you can redirect to a file as with any other command. If you have KyTea installed, the following command preprocesses a file with KyTea and then parses the output with the default model (for UTF-8 text).

kytea my_file.txt | eda -e my_file.tree -v jp-0.1.0-utf8-vocab-small.dat -w jp-0.1.0-utf8-weight-small.dat

EDA can also read input in its own tree format if you give the -f tree option on the command line. The data directory contains a sample tree file, which can be parsed with the following command. In this example, we redirect the output to a file.

eda -f tree -v data/jp-0.1.0-utf8-vocab-small.dat -w data/jp-0.1.0-utf8-weight-small.dat < data/sample.tree > data/sample_out.tree
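When calling eda from a script, it can help to build the argument list in one place. The Python sketch below uses a hypothetical helper, eda_command, which is not part of EDA. It relies only on the -f, -v, and -w options shown above and assumes that the eda executable is on your PATH. Nothing is executed until the list is passed to parse_file.

```python
import subprocess


def eda_command(vocab, weight, input_format=None, exe="eda"):
    """Build the argument list for an eda run.

    Only options shown on this page are used: -v (vocabulary file),
    -w (weight file), and optionally -f (input format).
    """
    cmd = [exe]
    if input_format is not None:
        cmd += ["-f", input_format]
    cmd += ["-v", vocab, "-w", weight]
    return cmd


def parse_file(in_path, out_path, cmd):
    """Run eda with stdin and stdout redirected to files (requires eda)."""
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)


cmd = eda_command(
    "data/jp-0.1.0-utf8-vocab-small.dat",
    "data/jp-0.1.0-utf8-weight-small.dat",
    input_format="tree",
)
print(" ".join(cmd))
```

Calling parse_file("data/sample.tree", "data/sample_out.tree", cmd) would then correspond to the shell command above.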

Evaluating Accuracy

There's also a script to measure the accuracy of a parsed file against a gold data file. In the example below, we measure the accuracy of the parsed file that we created with the last command against the gold data for the sample file.

perl eval.pl data/sample_test.tree data/sample_out.tree
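For readers who want to see what such a comparison involves, here is a minimal Python sketch of head attachment accuracy. It is not a reimplementation of eval.pl, whose exact metric and handling of special cases are not described on this page. The function name head_accuracy and the decision to count every word, including the final one, are assumptions.

```python
def head_accuracy(gold_heads, parsed_heads):
    """Fraction of words whose parsed head equals the gold head.

    Each argument is a list of sentences, and each sentence is a list of
    head indices, one per word, in sentence order.
    """
    if len(gold_heads) != len(parsed_heads):
        raise ValueError("the two files contain different numbers of sentences")
    correct = total = 0
    for gold, parsed in zip(gold_heads, parsed_heads):
        if len(gold) != len(parsed):
            raise ValueError("sentence lengths differ; segmentation must match")
        correct += sum(g == p for g, p in zip(gold, parsed))
        total += len(gold)
    return correct / total if total else 0.0


gold = [[2, 5, 4, 5, 6, 7, -1]]
parsed = [[2, 5, 5, 5, 6, 7, -1]]  # hypothetical parser output, one wrong head
print(f"{head_accuracy(gold, parsed):.4f}")  # 6 of 7 heads match
```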

Training Your Own Models

You can use the train-eda command to train your own models for use with eda. The default model included with the source code is trained on a mix of Japanese newspaper data and example sentences from a dictionary, and should perform reasonably well. Note that this model is based on KyTea's word segmentation standard and POS tagset. If you'd like to parse text in some specialized domain, you can train your own models as described below. (It is also possible to train a model using a different word segmentation standard or POS tagset. Just remember that the text you parse with this model must be preprocessed in the same way as the text you used to train it.)

Models consist of two files, a vocabulary file and a weight file. The following command will train a model from the annotated data file training.tree, creating the vocabulary file my_vocab.dat and weight file my_weight.dat. After training is finished, the vocabulary file and weight file may be used instead of the default model to parse text.

train-eda -t training.tree -v my_vocab.dat -w my_weight.dat -a llsgd -i 30 -c 3

You can get further information on the options for train-eda by running it without any arguments. The parameters in the example above are the ones used for experiments in the papers listed on this page. Please note that training on several thousand sentences can take a long time and will probably require several gigs of memory!

Input/Output Format

There are two input formats for the eda command: output from KyTea and EDA's tree format. Output is in tree format only. In all cases, the encoding for training and test files should be UTF-8. The Word Dependency Annotation Standard outlines how EDA handles grammatical phenomena. The default models were trained on training data prepared using this standard.

KyTea Format (default)

When this format is used, the default output of KyTea can be used as input. It is useful for parsing raw text that has been preprocessed by KyTea. This format is used by default, but can also be specified with the -f kytea option.

Tree Format (-f tree option)

This format shows the dependency tree for a sentence. It is the default output format for the eda command, and can optionally be used for its input, too. Training data for the train-eda command must use this format. Sentences begin with an ID line containing ID= followed by an ID string, and are delimited by a single blank line. Each line after the ID line contains the following five fields that describe a single word.

(Word index) (Head index) (Surface form) (POS tag) (Word class)

The index fields are integers starting from 1, and should be left-padded to three characters with 0. Each field must be separated by a space character (specifically, a half-width, or hankaku, space rather than the full-width space used in Japanese text). The head of the last word in the sentence must be set to -1. The word class (or cluster) field is required by default, but if you aren't using this field you can just set it to a dummy value such as 0 for all words.
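The field layout above can be produced programmatically. The following Python sketch is an illustration, not EDA code: format_sentence and its tuple layout are names chosen for this example. It zero-pads positive indices to three characters, right-aligns other head values such as -1 to the same width (this mirrors the samples on this page; whether EDA requires that alignment is not stated), and writes 0 as the dummy word class.

```python
def format_sentence(sent_id, words):
    """Serialize one sentence to tree-format text.

    `words` is a list of (head, surface, pos) tuples in sentence order.
    Word indices are assigned from 1, and the word class is written as
    the dummy value 0.
    """
    lines = [f"ID={sent_id}"]
    for index, (head, surface, pos) in enumerate(words, start=1):
        # Positive heads are zero-padded ("005"); -1 is right-aligned (" -1").
        head_field = f"{head:03d}" if head > 0 else f"{head:>3d}"
        lines.append(f"{index:03d} {head_field} {surface} {pos} 0")
    return "\n".join(lines) + "\n"


words = [
    (2, "私", "代名詞"),
    (5, "は", "助詞"),
    (4, "リンゴ", "名詞"),
    (5, "を", "助詞"),
    (6, "食べ", "動詞"),
    (7, "る", "語尾"),
    (-1, "。", "補助記号"),
]
print(format_sentence("000001", words), end="")
```

To write several sentences to one file, join the strings returned for each sentence with a single newline so that a blank line separates them.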

Full Annotation

ID=000001
001 002 私   代名詞  0
002 005 は   助詞   0
003 004 リンゴ 名詞   0
004 005 を   助詞   0
005 006 食べ  動詞   0
006 007 る   語尾   0
007  -1 。   補助記号 0
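Reading this format back is equally simple. The sketch below is an illustration rather than EDA's own reader: parse_tree_text and the Word tuple are hypothetical names. It splits fields on any run of whitespace so that column-aligned files like the sample above are accepted, which also means it assumes surface forms contain no spaces.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    index: int
    head: int
    surface: str
    pos: str
    word_class: str


def parse_tree_text(text):
    """Parse tree-format text into a list of (sentence_id, [Word, ...])."""
    sentences = []
    # Sentences are separated by a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if not lines[0].startswith("ID="):
            raise ValueError(f"expected an ID line, got: {lines[0]!r}")
        sent_id = lines[0][len("ID="):]
        words = []
        for ln in lines[1:]:
            index, head, surface, pos, word_class = ln.split()
            words.append(Word(int(index), int(head), surface, pos, word_class))
        if words and words[-1].head != -1:
            raise ValueError(f"sentence {sent_id}: head of the last word must be -1")
        sentences.append((sent_id, words))
    return sentences


sample = """ID=000001
001 002 私 代名詞 0
002 005 は 助詞 0
003 004 リンゴ 名詞 0
004 005 を 助詞 0
005 006 食べ 動詞 0
006 007 る 語尾 0
007  -1 。 補助記号 0
"""
for sent_id, words in parse_tree_text(sample):
    print(sent_id, [(w.surface, w.head) for w in words])
```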

Partial Annotation

ID=000001
001   0 私   代名詞  0
002 005 は   助詞   0
003   0 リンゴ 名詞   0
004   0 を   助詞   0
005   0 食べ  動詞   0
006   0 る   語尾   0
007  -1 。   補助記号 0
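In the partial example above, only the second word has a head other than 0, apart from the final -1. Assuming that a head of 0 marks a word whose head has not been annotated, which the sample suggests but this page does not state explicitly, the following Python sketch counts how many non-final words carry an annotation. The helper annotated_count is hypothetical.

```python
def annotated_count(heads):
    """Return (annotated, total) over all words except the last.

    ASSUMPTION: a head of 0 means "not annotated". The last word is
    excluded because its head is always -1.
    """
    body = heads[:-1]
    return sum(1 for h in body if h != 0), len(body)


partial = [0, 5, 0, 0, 0, 0, -1]  # heads from the partial example
full = [2, 5, 4, 5, 6, 7, -1]     # heads from the full example
print(annotated_count(partial))  # (1, 6)
print(annotated_count(full))     # (6, 6)
```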

Development Information

Development Team

Daniel Flannery (Maintenance, Documentation)
Yusuke Miyao (Lead Developer, Advisor)
Shinsuke Mori (Advisor, Power User)

Version History

0.2.0 (2013/3/29)

0.1.2 (2012/7/4)

0.1.1 (2012/5/24)

0.1.0 (2012/3/16)