Kyoto University ACCMS NLP Group

shogi corpus

Abstract

In recent years there has been a surge of interest in the generation of natural language annotations to describe digital recordings of the real world. However, images, videos, and many other forms of media have ambiguities that make symbol grounding difficult. In this task we propose to use a well-defined ''real world,'' that is, game states, to concentrate on language ambiguities. The game we focus on is shogi. We collected 742,286 commentary sentences in Japanese and then defined domain-specific named entities (NEs). We finally annotated NEs for 2,508 sentences to form our game commentary corpus, which has the following distinguishing characteristics:

Commentary

Professional players and writers provide commentaries on professional matches for shogi fans, explaining the reasoning behind moves, evaluating the current state, and suggesting probable next moves. These commentaries mostly concern the game itself, including notation for specifying moves, such as ''△1四香とすれば決戦'' (white's L1d would shift the phase to the end game), but they sometimes also include information irrelevant to the board state, such as information about the players. The sentences are almost always grammatically correct: we checked 100 randomly selected sentences and found no typos or grammatical errors.

Shogi Commentary Expressions

Commentaries contain many domain-specific expressions (words or multi-word expressions), which can be categorized into groups in much the same way as general-domain or biomedical NEs. All players are familiar with these expressions (e.g. King's Gambit and Ruy Lopez in chess) and with the category names (e.g. opening).

Move Notation
Shogi has a well-defined notation to record games. The notation of a move is decomposed into the following components. These categories are basically finite, but we include misspelled expressions as well.
Tu: Expressions indicating the turn. This category only contains ''先手'' (black), ''後手'' (white), ''▲'' (black), and ''△'' (white).
Po: Positions denoted by two numerals (one Arabic numeral for file and one Chinese numeral for rank).
Pi: Piece names including promoted ones (14 types).
Mc: Move complement. There are only two expressions: ''成る'' (promoted) and ''不成'' (non-promoted).
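The decomposition above can be sketched as a small parser. This is only an illustration, not part of the corpus tools: the names `Move` and `parse_move` are hypothetical, the piece list (including the alternative spellings ''王'' and ''竜'') is our own assumption, and the pattern also accepts the short form ''成'' and half-width digits, which the description above does not mention.

```python
import re
from typing import NamedTuple, Optional

RANKS = "一二三四五六七八九"          # Chinese numerals for ranks 1..9
TURNS = {"▲": "black", "△": "white"}  # Tu symbols
# Pi: assumed spellings; promoted multi-character names come first so that
# "成銀" is not read as a bare "成" followed by "銀".
PIECES = ["成香", "成桂", "成銀", "歩", "香", "桂", "銀", "金",
          "角", "飛", "玉", "王", "と", "馬", "龍", "竜"]

MOVE_RE = re.compile(
    r"(?P<tu>[▲△])"                      # Tu: turn
    r"(?P<file>[1-9１-９])"               # Po: file, half- or full-width digit
    r"(?P<rank>[" + RANKS + r"])"         # Po: rank
    r"(?P<pi>" + "|".join(PIECES) + r")"  # Pi: piece name
    r"(?P<mc>不成|成る?)?"                # Mc: optional move complement
)


class Move(NamedTuple):
    turn: str
    file: int
    rank: int
    piece: str
    complement: Optional[str]


def parse_move(text: str) -> Optional[Move]:
    """Split a move such as '△１四香' into Tu, Po, Pi and Mc; None if no match."""
    m = MOVE_RE.fullmatch(text)
    if m is None:
        return None
    return Move(
        turn=TURNS[m["tu"]],
        file=int(m["file"]),              # int() also accepts full-width digits
        rank=RANKS.index(m["rank"]) + 1,
        piece=m["pi"],
        complement=m["mc"],
    )
```

For instance, `parse_move("△１四香")` yields a white move to file 1, rank 4 with piece ''香'' and no complement. Misspelled notations, which the corpus deliberately includes, would need a more tolerant pattern.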
Move Descriptions
For some moves, a commentator explains their meaning using the following expressions:
Mn: Move name such as ''王手'' (check).
Me: Move evaluation such as ''好手'' (good move).
Opening Expressions
Opening sequences have set names, which appear frequently.
St: Strategy names. As with chess, shogi has many attacking formations with various names. This class is almost closed, but sometimes new openings are invented. An example is ''ゴキゲン中飛車'' (cheerful central rook).
Ca: Castle names. Defensive formations also have names. This class is also almost closed, with some exceptions like ''ミレニアム'' (Millennium formation), which arose in 2000.
Position Evaluation
The most important commentaries are those concerning evaluation of the current board state, for example ''black is winning.'' The class for this type of commentary includes adjectival expressions and simple sentences consisting of a subject and a predicate with arguments.
Ev: Evaluation expressions about the entire board. This category does not include evaluations from a specific viewpoint, which are covered by the following category.
Ee: Other evaluation expressions focusing on a certain aspect. Examples are ''駒得'' (gaining pieces) and ''配置が良い'' (pieces are well positioned).
Expressions for Description of Board Positions
Commentators use the following expressions to describe board states.
Re: Region on the board, such as ''中央'' (center), ''4筋'' (4th file), and ''3段目'' (3rd rank).
Ph: Phase of the match, such as ''序盤'' (opening), ''中盤'' (middlegame), and ''終盤'' (endgame), including vague ones such as ''終盤の入り口'' (start of endgame).
Pa: Piece attributes. Every piece has its own movement and commentators use special expressions for it. For example, ''道'' (path) is used to denote bishop's diagonal lines and rook's orthogonal lines. There are special expressions to denote relative positions of a piece like ''腹'' (belly) meaning the side squares of a piece.
Pq: Piece quantity. Usually it is a pair of a number and a counter word. This also includes expressions such as ''切れ'' (lack of) and ''豊富'' (abundant).
Describing Events Outside the Board
Commentators sometimes refer to issues outside of the board but related to the match. They can be classified into the following types:
Hu: Names of players, commentators, etc., including their titles, such as ''名人'' (champion). This category also contains expressions for groups of players and for places such as ''検討室'' (discussion room) that behave like a human. Names that appear inside expressions belonging to other types, such as the Ishida style, are excluded.
Ti: Expressions for the total time spent, the time spent on the current move and the time remaining. In addition to concrete expressions, like ''10 minutes,'' this includes abstract ones such as ''長時間'' (long time).
Actions
Unlike the general NE definitions, we decided to incorporate verbal expressions including copula verbs followed by an adjective. These include passive forms and causative forms.
Ac: Verbs whose subject is a player. The action must be related to the board, such as ''捨てる'' (sacrifice). Thus this does not include other player actions like ''close eyes.''
Ap: Verbs whose subject is a piece. For example ''下がった'' (retreated).
Ao: Other verbs. For example ''始まる'' (start), with the subject ''戦い'' (battle).
Others
Ot: Other important notions for shogi. Typical ones are noun phrases denoting the above categories themselves, like ''戦型'' (strategy). Note that this is not included in St.
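The categories listed above can be collected into a lookup table, for example to check the labels in an annotated file. This is a minimal sketch: the short glosses are our own paraphrases of the definitions above, and the names `NE_GROUPS`, `TAG_TO_GROUP` and `is_valid_label` are hypothetical. It assumes labels are written as ''O'' or as a type followed by ''-B'' or ''-I''.

```python
# Group name -> {tag: short gloss}. Glosses paraphrase the definitions above.
NE_GROUPS = {
    "Move Notation": {"Tu": "turn", "Po": "position",
                      "Pi": "piece name", "Mc": "move complement"},
    "Move Descriptions": {"Mn": "move name", "Me": "move evaluation"},
    "Opening Expressions": {"St": "strategy name", "Ca": "castle name"},
    "Position Evaluation": {"Ev": "whole-board evaluation",
                            "Ee": "aspect-specific evaluation"},
    "Description of Board Positions": {"Re": "region", "Ph": "phase",
                                       "Pa": "piece attribute",
                                       "Pq": "piece quantity"},
    "Events Outside the Board": {"Hu": "human", "Ti": "time"},
    "Actions": {"Ac": "player action", "Ap": "piece action",
                "Ao": "other action"},
    "Others": {"Ot": "other shogi notion"},
}

# Flat reverse index: tag -> group name.
TAG_TO_GROUP = {tag: group
                for group, tags in NE_GROUPS.items()
                for tag in tags}


def is_valid_label(label: str) -> bool:
    """Accept 'O', or '<Type>-B' / '<Type>-I' where <Type> is a known tag."""
    if label == "O":
        return True
    ne_type, sep, position = label.partition("-")
    return sep == "-" and ne_type in TAG_TO_GROUP and position in ("B", "I")
```

With this table, `is_valid_label("St-B")` is true, while a label without a B or I suffix, such as `"Ac"`, is rejected.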

Game Commentary Corpus

As the notation for shogi NEs, we adopt the BIO tag system. B, I, and O stand for begin, intermediate, and other, respectively.

広瀬/Hu-B は/O 対/O ゴ/St-B キゲン/St-I 中/St-I 飛車/St-I の/O 超速/St-B ▲/St-I 3七/St-I 銀/St-I 戦法/St-I を/O 採用/Ac-B し/O た/O 。/O
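A line in this word/tag format can be turned back into a list of NEs with a few lines of code. The following is a minimal sketch, assuming that tokens are separated by spaces and that every token has the form word/O, word/Type-B or word/Type-I. The function name `extract_entities` is hypothetical and is not part of the distributed corpus.

```python
from typing import List, Tuple


def extract_entities(line: str) -> List[Tuple[str, str]]:
    """Return (type, surface) pairs for each NE in a 'word/Tag' line."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_words: List[str] = []

    def flush() -> None:
        # Close the NE that is currently open, if any.
        nonlocal current_type, current_words
        if current_type is not None:
            entities.append((current_type, "".join(current_words)))
        current_type, current_words = None, []

    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token without a tag: {token!r}")
        if tag == "O":
            flush()
            continue
        ne_type, _, position = tag.partition("-")
        # A B tag, or an I tag that does not continue the open NE,
        # starts a new NE.
        if position == "B" or ne_type != current_type:
            flush()
            current_type = ne_type
        current_words.append(word)
    flush()
    return entities


example = ("広瀬/Hu-B は/O 対/O ゴ/St-B キゲン/St-I 中/St-I 飛車/St-I の/O "
           "超速/St-B ▲/St-I ３七/St-I 銀/St-I 戦法/St-I を/O 採用/Ac-B "
           "し/O た/O 。/O")
for ne_type, surface in extract_entities(example):
    print(ne_type, surface)
```

Japanese is written without spaces, so the words of one NE are concatenated directly. The B tag is what separates two adjacent NEs of the same type.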

We first segmented the sentences automatically with KyTea, a word segmentation tool trained on the general-domain corpus BCCWJ and the dictionary UniDic, which contains 212,900 words. We then supplied the results to the annotation tool. Finally, an annotator corrected the word boundaries and added a BIO tag to each word (manual). We also trained a BIO2-based NE recognizer, PWNER, and ran NE recognition with it (automatic).

Corpus Size

Statistics of our corpus are below:

Training        Precision  Recall  F-measure
BCCWJ           0.872      0.907   0.889
BCCWJ + shogi   0.983      0.983   0.983
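The F-measure column can be reproduced from the other two columns. This is a small check, assuming that the F-measure here is the balanced harmonic mean of precision and recall (F1), which is not stated explicitly above. The function name `f_measure` is our own.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "BCCWJ": (0.872, 0.907),
    "BCCWJ + shogi": (0.983, 0.983),
}
for name, (p, r) in rows.items():
    print(f"{name}: {f_measure(p, r):.3f}")
```

Under this assumption both rows agree with the reported values to three decimal places.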

Game board

  • The file contains information about the game board and features.

    Evaluation of relation between a state and s-NE

  • This dataset contains evaluations of the relation between a state and an NE.
  • We extracted states from the corpus, by full-text search and other means, using an NE as a query, and evaluated the relations according to a set of predefined choices.

    File

  • The file is not open to the public. Please contact me at any time if you need it.

    Link

  • Natural language processing
  • Old version.

    Member

    Reference

    A Japanese Chess Commentary Corpus,
    Shinsuke Mori, John Richardson, Atsushi Ushiku, Tetsuro Sasada, Hirotaka Kameko, and Yoshimasa Tsuruoka.
    LREC, 2016.

    Update

    2017/11/02 Reference added
    2017/10/02 Description on modalities
    2017/02/15 File uploaded
    2016/04/04 Open


    Last Change: 2017/11/02 by Shinsuke Mori
    Kyoto University ACCMS LSTA Lab.