EDA Parser

Named Entity Documentation

Overview

Named entities (NEs) are groups of words that form a logical unit and have special meaning in a system. They might be proper nouns like the names of companies and institutions, or phrases like numerical amounts or time expressions. We can think of the words in the NE as forming a self-contained subtree, so this partial structure can help us parse the rest of the sentence in which it occurs.

EDA has a feature that uses NE tags to constrain the solution search for the words that make up NEs. When this feature is used, the following restrictions are enforced:

Outgoing Dependencies: The heads for words in the NE can only be other words in the NE, except for the NE's last word. This ensures that there is only one outgoing dependency arc from the NE.
Incoming Dependencies: The heads for words preceding the NE in the sentence may not be words in the NE, except for the head word of the NE. This ensures that all dependency arcs going into the NE only go to the head word of the NE.
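
The two restrictions above can be illustrated with a small checker. This is only a sketch, not EDA's implementation: the function name `ne_violations`, the 0-based indexing, the use of -1 for the root, and the (start, end) span representation are all assumptions made for illustration. It also assumes that the head word of an NE is its last word, which follows from the first restriction.

```python
def ne_violations(heads, spans):
    """Report arcs that break the NE restrictions.

    heads[i] is the 0-based index of word i's head (-1 for the root).
    spans is a list of (start, end) pairs, end exclusive, one per NE.
    Both representations are hypothetical, chosen for this sketch.
    """
    problems = []
    for start, end in spans:
        last = end - 1  # assumed head word of the NE
        # Outgoing: every NE word except the last must attach inside the NE.
        for i in range(start, last):
            if not (start <= heads[i] < end):
                problems.append(("outgoing", i))
        # Incoming: words before the NE may only attach to the NE's head word.
        for i in range(start):
            if start <= heads[i] < end and heads[i] != last:
                problems.append(("incoming", i))
    return problems


# サラダ 油 と 混ぜ, with サラダ 油 as one NE
ok = ne_violations([1, 2, 3, -1], [(0, 2)])
bad_out = ne_violations([2, 2, 3, -1], [(0, 2)])
# 牛乳 サラダ 油 と, with サラダ 油 as one NE and 牛乳 attaching into its middle
bad_in = ne_violations([1, 2, 3, -1], [(1, 3)])
```

Here `ok` is empty, `bad_out` flags word 0 for leaving the NE early, and `bad_in` flags word 0 for attaching to a non-head word of the NE.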

Usage

This feature can be used by specifying the -n option followed by an argument that tells which tag to treat as an NE tag.

eda -n 4 -v jp-0.1.0-utf8-vocab-small.dat -w jp-0.1.0-utf8-weight-small.dat < tagged.txt

Input Format

KyTea Output

The input is KyTea's default output format. Each line contains one sentence, with words separated by half-width (ASCII) spaces. Each word is followed by one or more tags, separated by forward slashes (/). The argument to the -n option specifies which slash-separated field to use as the NE tag. Fields are counted from one, starting with the word itself, so for the following example we would specify -n 3.

打ち粉/名詞/F-B を/助詞/O し/動詞/Ac-B 生地/名詞/F-B を/助詞/O 厚/形容詞/Q-B さ/接尾辞/Q-I ５/名詞/Q-I ｍｍ/名詞/Q-I に/助詞/O のば/動詞/Ac-B し/語尾/O 型抜き/名詞/Ac-B する/動詞/O 。/補助記号/O
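
As a rough illustration of how the -n value selects a field, the sketch below splits one line of this format into (word, NE tag) pairs. The function name `read_kytea_line` is hypothetical and this is not EDA's own reader. It assumes the field index counts the word itself as field 1, and it does not handle words that themselves contain a slash.

```python
def read_kytea_line(line, ne_field):
    """Split a KyTea-style line into (surface, ne_tag) pairs.

    ne_field is 1-based and counts the surface form as field 1,
    mirroring the value passed to -n (an assumption of this sketch).
    """
    words = []
    for token in line.split(" "):  # words are separated by half-width spaces
        if not token:
            continue  # tolerate repeated or trailing spaces
        fields = token.split("/")
        if len(fields) < ne_field:
            raise ValueError(f"token {token!r} has no field {ne_field}")
        words.append((fields[0], fields[ne_field - 1]))
    return words


pairs = read_kytea_line("生地/名詞/F-B を/助詞/O 厚/形容詞/Q-B さ/接尾辞/Q-I", 3)
```

With the field index set to 3, `pairs` holds each word together with its third field, for example ("厚", "Q-B").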

Tree Format

Alternatively, you can use an extended version of EDA's tree format by specifying the -f tree option. The format is the same as the normal tree format, except that a sixth field for NE tags is allowed, so you must specify -n 6. An example of input in this format is shown below.

ID=1
001 002 ボウル　　名詞　　　0 T-B
002 005 に　　　　助詞　　　0 O
003 004 卵　　　　名詞　　　0 F-B
004 005 を　　　　助詞　　　0 O
005 006 ときほぐ　動詞　　　0 Ac-B
006 007 し　　　　動詞　　　0 O
007 013 、　　　　補助記号　0 O
008 009 牛乳　　　名詞　　　0 F-B
009 011 ・　　　　補助記号　0 O
010 011 サラダ　　名詞　　　0 F-B
011 012 油　　　　接尾辞　　0 F-I
012 013 と　　　　助詞　　　0 O
013 014 混ぜ　　　動詞　　　0 Ac-B
014  -1 る　　　　語尾　　　0 O
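
A minimal sketch of reading this six-column layout is shown below. The names `Row` and `read_tree` are hypothetical and this is not EDA's parser. It assumes columns are separated by any whitespace, including the full-width spaces used for alignment above, and it keeps the fifth column as an uninterpreted string because this section does not describe it. The example suggests that a head ID of -1 marks the root word, which is how the sketch treats it.

```python
from typing import NamedTuple


class Row(NamedTuple):
    word_id: int
    head_id: int   # -1 appears on the final word in the example above
    surface: str
    pos: str
    fifth: str     # not described in this section, kept verbatim
    ne_tag: str


def read_tree(text):
    """Parse one sentence in the extended tree format into Row records."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("ID="):
            continue  # skip blank lines and the sentence header
        fields = line.split()  # splits on full-width spaces as well
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        rows.append(Row(int(fields[0]), int(fields[1]), *fields[2:]))
    return rows


sample = (
    "ID=1\n"
    "001 002 ボウル　　名詞　　　0 T-B\n"
    "002 005 に　　　　助詞　　　0 O\n"
    "014  -1 る　　　　語尾　　　0 O\n"
)
rows = read_tree(sample)
```

Each entry in `rows` exposes the NE tag as `ne_tag`, so the tags can be collected with `[r.ne_tag for r in rows]`.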

NE Tag Format

Since NE tags vary by application, EDA uses a relaxed version of the IOB format for maximum flexibility. Only the information about the boundaries of NEs, not information about the specific tag type, is used. Any tag ending with B, I, or O is legal. The last character of the tag is checked, and tags are simply passed through to the output after parsing is completed.

A B tag indicates that the word begins an NE, while an I tag indicates that the word is inside an NE. Tags ending with O indicate words that are outside of an NE. Nesting is not allowed, so two consecutive B tags are interpreted as a single-word NE followed by another NE.
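
The boundary rules above can be sketched as a short span extractor. The function name `ne_spans` and the (start, end) output are hypothetical, and this is not EDA's code. One case is not covered by the description: an I tag with no NE currently open. The sketch assumes such a tag starts a new NE, which may differ from what EDA actually does.

```python
def ne_spans(tags):
    """Return (start, end) index pairs, end exclusive, one per NE."""
    spans = []
    start = None  # index where the currently open NE began, if any
    for i, tag in enumerate(tags):
        kind = tag[-1:]  # only the last character of the tag is inspected
        if kind not in ("B", "I", "O"):
            raise ValueError(f"illegal NE tag: {tag!r}")
        if kind == "I" and start is not None:
            continue  # still inside the open NE
        if start is not None:
            spans.append((start, i))  # a B or O closes the open NE
            start = None
        if kind in ("B", "I"):
            start = i  # B opens an NE; a stray I does too (assumption)
    if start is not None:
        spans.append((start, len(tags)))
    return spans


spans = ne_spans(["F-B", "O", "Ac-B", "F-B", "O", "Q-B", "Q-I", "Q-I", "Q-I", "O"])
```

In this example the consecutive Ac-B and F-B tags produce two separate single-word NEs, and the Q-B tag followed by three Q-I tags produces one four-word NE.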