EDA is a word-based dependency parser.
The name EDA stands for Easily adaptable Dependency Analyzer.
Trainable from Partially Annotated Corpora: Most parsers require every word in a training sentence to be annotated with its head, but with EDA you only need to annotate the words you care about.
Handles Non-Projective Dependencies: EDA can handle non-projective dependencies (crossing dependency arcs), as long as the dependencies go from left to right.
Note: Because we are working mainly on parsing written Japanese, EDA currently uses the restriction that dependencies must go from left to right (towards the end of the sentence). It cannot handle dependencies which go from right to left (as in casual or spoken Japanese).
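For illustration only (according to the release notes, train-eda performs a similar check on training data), here is a minimal sketch of how arcs that violate the left-to-right restriction could be detected; the function name is invented and not part of EDA:

```python
def find_violations(heads):
    """Return 1-based indices of words whose dependencies EDA cannot handle.

    `heads` lists each word's head index in sentence order; the last
    word's head is -1 (the convention used in EDA's tree format).
    """
    bad = []
    for i, h in enumerate(heads, start=1):
        if h == -1:      # root marker on the last word
            continue
        if h <= i:       # self-dependency or a right-to-left arc
            bad.append(i)
    return bad

# All arcs go left to right, so no violations:
print(find_violations([2, 5, 4, 5, 6, 7, -1]))  # []
# A right-to-left arc (word 3 depending on word 1) is rejected:
print(find_violations([2, 5, 1, 5, 6, 7, -1]))  # [3]
```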
Please see the following papers for details.
Training Dependency Parsers from Partially Annotated Corpora
Daniel Flannery, Yusuke Miyao, Graham Neubig, Shinsuke Mori
IJCNLP, pp.776-784, 2011.
A Pointwise Approach to Training Dependency Parsers from Partially Annotated Corpora
Daniel Flannery, Yusuke Miyao, Graham Neubig, Shinsuke Mori
Journal of the Association of Natural Language Processing, Vol. 19, No. 3, September 2012.
Latest Release: EDA 0.2.0
Previous Releases: EDA 0.1.2, EDA 0.1.1, EDA 0.1.0
Latest Source Code (for the adventurous): Bitbucket repository
You'll need gcc (version 3.4 or greater) and the Boost Libraries to build EDA. Please install Boost first! (Depending on where you've installed Boost, you may need to set the path to the headers in the environment variable LIBRARY_PATH.)
After downloading the source, use the following commands to extract and build it.
tar xzvf eda-0.2.0.tar.gz
cd eda-0.2.0
make
After doing this, the executables eda and train-eda will be created in the src/eda directory, and model files *.dat will be placed in the data directory. You can finish the installation by copying the executables and model files to the location of your choice.
The eda
command parses Japanese text. Japanese text must be
segmented into words and tagged with part-of-speech (POS) tags before
it can be parsed. I recommend KyTea
for these tasks because it can perform them very accurately. EDA's
default models are based on KyTea's word segmentation standard and POS
tagset, and EDA's default input format is also KyTea's default output
format. This integration makes it possible to connect these programs
with a pipeline. If you have KyTea installed, you can preprocess a
file with KyTea and then use EDA to parse the output with the default
model (for UTF-8 text) with the following command. By default, the
results of the dependency analysis are written to the standard output,
but you can redirect the output to a file like any other command.
kytea my_file.txt | eda -v jp-0.1.0-utf8-vocab-small.dat -w jp-0.1.0-utf8-weight-small.dat
You can also specify EDA's tree format by giving the -f tree
option
on the command line. The data
directory contains a sample tree file
which can be parsed with the following command. In this example, we
redirect the output to a file.
eda -f tree -v data/jp-0.1.0-utf8-vocab-small.dat -w data/jp-0.1.0-utf8-weight-small.dat < data/sample.tree > data/sample_out.tree
There's also a script to measure the accuracy of a parsed file against a gold data file. In the example below, we measure the accuracy of the parsed file that we created with the last command against the gold data for the sample file.
perl eval.pl data/sample_test.tree data/sample_out.tree
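eval.pl is the supported evaluation script. Conceptually, unlabeled accuracy over tree-format files amounts to comparing the head field of each word against the gold file; a rough sketch under that assumption (the function names are my own, not part of EDA):

```python
def read_heads(tree_text):
    """Collect the head index of every word in a tree-format string."""
    heads = []
    for line in tree_text.splitlines():
        line = line.strip()
        if not line or line.startswith("ID="):
            continue                      # skip blank and ID lines
        fields = line.split(" ")          # index, head, surface, POS, class
        heads.append(int(fields[1]))
    return heads

def attachment_accuracy(gold_text, parsed_text):
    """Fraction of words whose head matches the gold annotation."""
    gold = read_heads(gold_text)
    parsed = read_heads(parsed_text)
    assert len(gold) == len(parsed), "files must cover the same words"
    correct = sum(g == p for g, p in zip(gold, parsed))
    return correct / len(gold)

gold = "ID=000001\n001 002 私 代名詞 0\n002 003 は 助詞 0\n003 -1 。 補助記号 0\n"
parsed = "ID=000001\n001 003 私 代名詞 0\n002 003 は 助詞 0\n003 -1 。 補助記号 0\n"
print(attachment_accuracy(gold, parsed))  # 2 of 3 heads match
```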
You can use the train-eda command to train your own models for use with eda. The default model included with the source code is trained on a mix of Japanese newspaper data and example sentences from a dictionary, and should perform reasonably well. Note that this model is based on KyTea's word segmentation standard and POS tagset. If
you'd like to parse text in some specialized domain, you can train
your own models for parsing text as described below. (It is also
possible to train a model using a different word segmentation standard
or POS tagset. Just remember that the text you parse with this model
must be preprocessed in the same way as the text you used to train
it.)
Models consist of two files, a vocabulary file and a weight file. The
following command will train a model from the annotated data file training.tree, creating the vocabulary file my_vocab.dat and the weight file my_weight.dat. After training is finished, the vocabulary file and weight file may be used instead of the default model to parse text.
train-eda -t training.tree -v my_vocab.dat -w my_weight.dat -a llsgd -i 30 -c 3
You can get further information on the options for train-eda
by
running it without any arguments. The parameters in the example above
are the ones used for experiments in the papers listed on this
page. Please note that training on several thousand sentences can take a long time and will probably require several gigabytes of memory!
There are two input formats for the eda
command: output from KyTea
and EDA's tree format. Output is in tree format only. In all cases,
the encoding for training and test files should be UTF-8. The
Word Dependency Annotation Standard outlines how
EDA handles grammatical phenomena. The default models were trained on
training data prepared using this standard.
KyTea format (the -f kytea option): When this format is used, the default output of KyTea can be used as
input. It is useful for parsing raw text that has been preprocessed by
KyTea. This format is used by default, but can also be specified with the -f kytea
option.
Tree format (the -f tree option): This format shows the dependency tree for a sentence. It is
the default output format for the eda
command, and can optionally be
used for its input, too. Training data for the train-eda
command
must use this format. Sentences begin with an ID line containing ID= followed by an ID string, and are delimited by a single blank line. Each line after the ID line contains the following five fields, which describe a single word.
(Word index) (Head index) (Surface form) (POS tag) (Word class)
The index fields are integers starting from 1, and should be left-padded to three characters with 0. Fields are separated by a single space character (specifically, a hankaku (half-width) space in Japanese text). The head of the last word in the sentence must be set to -1. The word class (or cluster) field is required by default, but if you aren't using it you can set it to a dummy value such as 0 for all words.
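The formatting rules above can be applied mechanically when preparing training data. A minimal sketch (the helper name format_sentence is invented here, not part of EDA):

```python
def format_sentence(sent_id, words):
    """Render one sentence in EDA's tree format.

    `words` is a list of (head, surface, pos, word_class) tuples in
    sentence order; the last word's head should be -1.
    """
    lines = ["ID=" + sent_id]
    for index, (head, surface, pos, word_class) in enumerate(words, start=1):
        # Indices are zero-padded to three characters; -1 marks the root.
        head_field = "-1" if head == -1 else "%03d" % head
        lines.append("%03d %s %s %s %s" % (index, head_field, surface, pos, word_class))
    return "\n".join(lines)

print(format_sentence("000001", [(2, "私", "代名詞", 0), (-1, "。", "補助記号", 0)]))
```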
ID=000001
001 002 私 代名詞 0
002 005 は 助詞 0
003 004 リンゴ 名詞 0
004 005 を 助詞 0
005 006 食べ 動詞 0
006 007 る 語尾 0
007 -1 。 補助記号 0
The same sentence with only a partial annotation might look like this; words whose heads are not annotated have their head index set to 0.
ID=000001
001 0 私 代名詞 0
002 005 は 助詞 0
003 0 リンゴ 名詞 0
004 0 を 助詞 0
005 0 食べ 動詞 0
006 0 る 語尾 0
007 -1 。 補助記号 0
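Reading tree-format files back in, for example to inspect parser output, is similarly simple. A hedged sketch (parse_tree_format is an invented name, not an EDA API):

```python
def parse_tree_format(text):
    """Split tree-format text into sentences of (index, head, surface, pos, cls)."""
    sentences = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None            # blank line ends the sentence
            continue
        if line.startswith("ID="):
            current = {"id": line[3:], "words": []}
            sentences.append(current)
            continue
        index, head, surface, pos, cls = line.split(" ")
        current["words"].append((int(index), int(head), surface, pos, cls))
    return sentences

sample = "ID=000001\n001 002 私 代名詞 0\n002 -1 。 補助記号 0\n"
parsed = parse_tree_format(sample)
print(parsed[0]["id"], len(parsed[0]["words"]))  # 000001 2
```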
Daniel Flannery (Maintenance, Documentation)
Yusuke Miyao (Lead Developer, Advisor)
Shinsuke Mori (Advisor, Power User)
Changes:
- Removed the treeify.py script from the distribution.
- Added the -f option to select the input format.
- Removed the -e option used to specify the input file. Instead, input is read from the standard input.
- Added the -n option to restrict the solution search for named entities.
- Added the -V option to print the version number.
- train-eda will now check the training data for dependencies where the dependent and head are the same word, dependencies which go from right to left, and dependencies where the head does not exist in the sentence.
- Changed 000 to -1 in the output (to match the format of the training data).
- Fixed the treeify.py script to drop words containing slashes.