UniDic++ (Tentative)

Introduction

A word segmenter and POS tagger built from the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and the UniDic dictionary as language resources are very accurate. Still, they are far from covering all the expressions that describe human knowledge. Terms for academic or cultural matters are typical examples, but product and service names of companies are also important. Titles and names of fictional characters or concepts in animations or novels are equally important.

We provide dictionaries and corpora that enable a word segmenter or POS tagger to deal with these terms and expressions. Words are defined according to the authentic short unit of the National Institute for Japanese Language and Linguistics (NINJAL). The unit is solid and stable, with a well-written standard reference book. BCCWJ and UniDic follow this standard. Using our dictionaries along with these language resources, you can analyze Japanese texts in various domains with high accuracy.

We call our dictionaries UniDic++ (tentative). We hope that they will be included in a future dictionary provided by NINJAL, which is why we add "tentative." For the moment, we continue to add more entries and contexts accurately.

Download

!! Under Construction !!

We distribute two kinds of resources. Each contains three files: level0 is automatically collected, level1 has been checked by machine, and level2 has been checked by hand.

UniDic+ (without context): Normal dictionaries consisting of words annotated with other information
UniDic++ (with context): In addition, the entries have left and right contexts from real use (authentication required)

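Adding words to a dictionary generally improves the accuracy of morphological analysis, but because the current accuracy is already very high, new errors in other places (side effects) are unavoidable. Attaching contexts to the entries and letting the morphological analyzer learn from them avoids these side effects (see the LREC2014 slides for details). On the other hand, the contexts are parts of the original sentences, so there is a risk of unintentionally infringing copyright. For this reason, UniDic++ with contexts requires authentication.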
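As a rough illustration of why an entry with contexts carries more information than a bare dictionary word, the sketch below turns one (left context, word, right context) triple into a partially annotated string. The notation is an assumption made here for illustration ('|' for a known word boundary, '-' for a known non-boundary, a space for an undecided position), as are the function name and the sample entry; none of them are taken from the actual UniDic++ files.

```python
def partial_annotation(left: str, word: str, right: str) -> str:
    """Mark only the boundaries that the entry itself determines.

    Assumed notation (for illustration only):
      '|' = word boundary, '-' = no boundary, ' ' = undecided.
    """
    chars = left + word + right
    start = len(left)          # gap index where the word begins
    end = start + len(word)    # gap index where the word ends
    out = []
    for i, ch in enumerate(chars):
        out.append(ch)
        if i == len(chars) - 1:
            break              # no gap after the last character
        gap = i + 1            # gap between chars[i] and chars[i + 1]
        if gap == start or gap == end:
            out.append("|")    # edge of the dictionary word
        elif start < gap < end:
            out.append("-")    # inside the dictionary word
        else:
            out.append(" ")    # context: segmentation left open
    return "".join(out)


# Hypothetical entry, not drawn from the distributed data.
print(partial_annotation("新しい", "スマホ", "を買う"))
```

Only the positions inside and at the edges of the word are fixed; the context characters stay undecided, so a learner sees the word in a realistic environment without being told anything about the segmentation of the surrounding text.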

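UniDic++ is periodically incorporated into the distributed KyTea models. The difference between the distributed version and the latest UniDic++ is small, so adding UniDic+ covers most of it. !! Under Construction !!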

Last Change: 2015/06/28 by Shinsuke MORI