EDA is a word-based dependency parser.
The name EDA stands for Easily adaptable Dependency Analyzer.
Trainable from Partially Annotated Corpora: Most parsers require every word in a training sentence to be annotated with its head, but with EDA you only need to annotate the words you care about.
Handles Non-Projective Dependencies: EDA can handle non-projective dependencies (crossing dependency arcs), as long as the dependencies go from left to right.
Note: Because we are working mainly on parsing written Japanese, EDA currently uses the restriction that dependencies must go from left to right (towards the end of the sentence). It cannot handle dependencies which go from right to left (as in casual or spoken Japanese).
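For illustration only (according to the release notes, train-eda performs a similar check on training data), here is a minimal sketch of how arcs that violate the left-to-right restriction could be detected; the function name is invented and not part of EDA:

```python
def find_violations(heads):
    """Return 1-based indices of words whose dependencies EDA cannot handle.

    `heads` lists each word's head index in sentence order; the last
    word's head is -1 (the convention used in EDA's tree format).
    """
    bad = []
    for i, h in enumerate(heads, start=1):
        if h == -1:      # root marker on the last word
            continue
        if h <= i:       # self-dependency or a right-to-left arc
            bad.append(i)
    return bad

# All arcs go left to right, so no violations:
print(find_violations([2, 5, 4, 5, 6, 7, -1]))  # []
# A right-to-left arc (word 3 depending on word 1) is rejected:
print(find_violations([2, 5, 1, 5, 6, 7, -1]))  # [3]
```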
Please see the following papers for details.
Training Dependency Parsers from Partially Annotated Corpora
Daniel Flannery, Yusuke Miyao, Graham Neubig, Shinsuke Mori
IJCNLP, pp.776-784, 2011.
A Pointwise Approach to Training Dependency Parsers from Partially Annotated Corpora
Daniel Flannery, Yusuke Miyao, Graham Neubig, Shinsuke Mori
Journal of the Association of Natural Language Processing, Vol. 19, No. 3, September 2012.
Latest Release: EDA 0.2.0
Previous Releases: EDA 0.1.2, EDA 0.1.1, EDA 0.1.0
Latest Source Code (for the adventurous): Bitbucket repository
You'll need gcc (version 3.4 or greater) and the Boost Libraries to build EDA. Please install Boost first! (Depending on where you've installed Boost, you may need to set the path to the headers in the environment variable LIBRARY_PATH.)
After downloading the source, use the following commands to extract and build it.
tar xzvf eda-0.2.0.tar.gz
cd eda-0.2.0
make
After doing this, the executables eda and train-eda will be created in the src/eda directory, and model files *.dat will be placed in the data directory. You can finish the installation by copying the executables and model files to the location of your choice.
The eda
command parses Japanese text. Japanese text must be
segmented into words and tagged with part-of-speech (POS) tags before
it can be parsed. I recommend KyTea
for these tasks because it can perform them very accurately. EDA's
default models are based on KyTea's word segmentation standard and POS
tagset, and EDA's default input format is also KyTea's default output
format. This integration makes it possible to connect these programs
with a pipeline. If you have KyTea installed, you can preprocess a
file with KyTea and then use EDA to parse the output with the default
model (for UTF-8 text) with the following command. By default, the
results of the dependency analysis are written to the standard output,
but you can redirect the output to a file like any other command.
kytea my_file.txt | eda -v jp-0.1.0-utf8-vocab-small.dat -w jp-0.1.0-utf8-weight-small.dat
You can also specify EDA's tree format by giving the -f tree
option
on the command line. The data
directory contains a sample tree file
which can be parsed with the following command. In this example, we
redirect the output to a file.
eda -f tree -v data/jp-0.1.0-utf8-vocab-small.dat -w data/jp-0.1.0-utf8-weight-small.dat < data/sample.tree > data/sample_out.tree
There's also a script to measure the accuracy of a parsed file against a gold data file. In the example below, we measure the accuracy of the parsed file that we created with the last command against the gold data for the sample file.
perl eval.pl data/sample_test.tree data/sample_out.tree
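eval.pl is the supported evaluation script. Conceptually, unlabeled accuracy over tree-format files amounts to comparing the head field of each word against the gold file; a rough sketch under that assumption (the function names are my own, not part of EDA):

```python
def read_heads(tree_text):
    """Collect the head index of every word in a tree-format string."""
    heads = []
    for line in tree_text.splitlines():
        line = line.strip()
        if not line or line.startswith("ID="):
            continue                      # skip blank and ID lines
        fields = line.split(" ")          # index, head, surface, POS, class
        heads.append(int(fields[1]))
    return heads

def attachment_accuracy(gold_text, parsed_text):
    """Fraction of words whose head matches the gold annotation."""
    gold = read_heads(gold_text)
    parsed = read_heads(parsed_text)
    assert len(gold) == len(parsed), "files must cover the same words"
    correct = sum(g == p for g, p in zip(gold, parsed))
    return correct / len(gold)

gold = "ID=000001\n001 002 私 代名詞 0\n002 003 は 助詞 0\n003 -1 。 補助記号 0\n"
parsed = "ID=000001\n001 003 私 代名詞 0\n002 003 は 助詞 0\n003 -1 。 補助記号 0\n"
print(attachment_accuracy(gold, parsed))  # 2 of 3 heads match
```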
You can use the train-eda command to train your own models for use with eda. The default model included with the source code is trained on a mix of Japanese newspaper data and example sentences from a dictionary, and should perform reasonably well. Note that this model is based on KyTea's word segmentation standard and POS tagset. If
you'd like to parse text in some specialized domain, you can train
your own models for parsing text as described below. (It is also
possible to train a model using a different word segmentation standard
or POS tagset. Just remember that the text you parse with this model
must be preprocessed in the same way as the text you used to train
it.)
Models consist of two files, a vocabulary file and a weight file. The
following command will train a model from the annotated data file training.tree, creating the vocabulary file my_vocab.dat and the weight file my_weight.dat. After training is finished, the vocabulary file and weight file may be used instead of the default model to parse text.
train-eda -t training.tree -v my_vocab.dat -w my_weight.dat -a llsgd -i 30 -c 3
You can get further information on the options for train-eda
by
running it without any arguments. The parameters in the example above
are the ones used for experiments in the papers listed on this
page. Please note that training on several thousand sentences can take a long time and will probably require several gigabytes of memory!
There are two input formats for the eda
command: output from KyTea
and EDA's tree format. Output is in tree format only. In all cases,
the encoding for training and test files should be UTF-8. The
Word Dependency Annotation Standard outlines how
EDA handles grammatical phenomena. The default models were trained on
training data prepared using this standard.
KyTea format (the -f kytea option): When this format is used, the default output of KyTea can be used as
input. It is useful for parsing raw text that has been preprocessed by
KyTea. This format is used by default, but can also be specified with the -f kytea
option.
Tree format (the -f tree option): This format shows the dependency tree for a sentence. It is
the default output format for the eda
command, and can optionally be
used for its input, too. Training data for the train-eda
command
must use this format. Sentences begin with an ID line containing ID= followed by an ID string, and are delimited by a single blank line. Each line after the ID line contains the following five fields, which describe a single word.
(Word index) (Head index) (Surface form) (POS tag) (Word class)
The index fields are integers starting from 1, and should be left-padded to three characters with 0. Fields are separated by a single space character (specifically, a hankaku (half-width) space in Japanese text). The head of the last word in the sentence must be set to -1. The word class (or cluster) field is required by default, but if you aren't using it you can set it to a dummy value such as 0 for all words.
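The formatting rules above can be applied mechanically when preparing training data. A minimal sketch (the helper name format_sentence is invented here, not part of EDA):

```python
def format_sentence(sent_id, words):
    """Render one sentence in EDA's tree format.

    `words` is a list of (head, surface, pos, word_class) tuples in
    sentence order; the last word's head should be -1.
    """
    lines = ["ID=" + sent_id]
    for index, (head, surface, pos, word_class) in enumerate(words, start=1):
        # Indices are zero-padded to three characters; -1 marks the root.
        head_field = "-1" if head == -1 else "%03d" % head
        lines.append("%03d %s %s %s %s" % (index, head_field, surface, pos, word_class))
    return "\n".join(lines)

print(format_sentence("000001", [(2, "私", "代名詞", 0), (-1, "。", "補助記号", 0)]))
```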
ID=000001
001 002 私 代名詞 0
002 005 は 助詞 0
003 004 リンゴ 名詞 0
004 005 を 助詞 0
005 006 食べ 動詞 0
006 007 る 語尾 0
007 -1 。 補助記号 0
The same sentence with only a partial annotation might look like this; words whose heads are not annotated have their head index set to 0.
ID=000001
001 0 私 代名詞 0
002 005 は 助詞 0
003 0 リンゴ 名詞 0
004 0 を 助詞 0
005 0 食べ 動詞 0
006 0 る 語尾 0
007 -1 。 補助記号 0
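Reading tree-format files back in, for example to inspect parser output, is similarly simple. A hedged sketch (parse_tree_format is an invented name, not an EDA API):

```python
def parse_tree_format(text):
    """Split tree-format text into sentences of (index, head, surface, pos, cls)."""
    sentences = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None            # blank line ends the sentence
            continue
        if line.startswith("ID="):
            current = {"id": line[3:], "words": []}
            sentences.append(current)
            continue
        index, head, surface, pos, cls = line.split(" ")
        current["words"].append((int(index), int(head), surface, pos, cls))
    return sentences

sample = "ID=000001\n001 002 私 代名詞 0\n002 -1 。 補助記号 0\n"
parsed = parse_tree_format(sample)
print(parsed[0]["id"], len(parsed[0]["words"]))  # 000001 2
```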
Daniel Flannery (Maintenance, Documentation)
Yusuke Miyao (Lead Developer, Advisor)
Shinsuke Mori (Advisor, Power User)
Changes:
- Removed the treeify.py script from the distribution.
- Added the -f option to select the input format.
- Removed the -e option used to specify the input file. Instead, input is read from the standard input.
- Added the -n option to restrict the solution search for named entities.
- Added the -V option to print the version number.
- train-eda will now check the training data for dependencies where the dependent and head are the same word, dependencies which go from right to left, and dependencies where the head does not exist in the sentence.
- Changed 000 to -1 in the output (to match the format of the training data).
- Fixed the treeify.py script to drop words containing slashes.