Kyoto University, ACCMS, LSTA Group

Sequential Labeling tool CaPSL

CaPSL (CRF and Pointwise-based Sequential Labeling tool) is a sequential labeling tool with the following features.

Features of the tool

Download

Source Code

download

Trained Models

Applications

Usage

Requirement

We tested our tool with the following versions.

As a script

Options

Running mode
	  --train            training mode
	  --inference        inference mode
Directory of the model specification
	  --model_dir        Directory of the model (training mode: for saving / inference mode: for loading)
Hyper-parameters in training
	  --num_epoch        number of training epochs (default: 20)
	  --lr               learning rate (default: 2e-6)
	  --encoder          model for text encoding (bert/lstm, default: bert)
	  --use_dict         set to use the dictionary
	  --batch_size       batch size for training (default: 32)
	  --use_scheduler    set to use learning-rate warmup in training (linear scheduling that reaches its peak at 1/10 of the training loop)
	  --bert_dir         bert model for encoding (default: bert-base-uncased)
	  --lambda_pointwise weight for pointwise loss in training (default: 0.5)
	  --do_pre           set to train with a fixed encoder before the main training loop
	  --do_crfonly       set to train only with the CRF loss on fully annotated data after the main training loop
	  --target_tag       use only the n-th tag for training (without this option, all tags are used)
	  --quiet_train      do not print the training loss
	  --quiet_valid      do not print the validation loss or the evaluation on the validation data (if validation data is set, the model is evaluated on it for early stopping)
	  --quiet_test       do not evaluate with the test data after training
Hyper-parameters in inference
	  --eval_batch_size  batch size for inference and evaluation in training (default: 64)
File specification
	  --file_format      format of annotated data (spl/conll, default: spl)
	  --trainfile        training data
	  --dictfile         dictionary data
	  --validfile        validation data
	  --testfile         test data
	  --target           word segmented data for inference
Random seed
	  --seed             random seed (default: 42)
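
The warmup described for --use_scheduler can be illustrated with a small, self-contained sketch. It is not taken from the CaPSL source: the function name warmup_factor is hypothetical, and the linear decay after the peak is an assumption, since the option text only states that the rate rises linearly and peaks at 1/10 of the training loop.

```python
def warmup_factor(step: int, total_steps: int) -> float:
    """Return the multiplier applied to the base learning rate at `step`.

    Hypothetical helper for illustration; not part of CaPSL.
    """
    # The peak is reached after 1/10 of the training loop (at least one step).
    warmup_steps = max(1, total_steps // 10)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return step / warmup_steps
    # Assumption: linear decay from the peak down to 0 at the final step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    base_lr = 2e-6  # default value of --lr
    for step in (0, 5, 10, 55, 100):
        print(step, base_lr * warmup_factor(step, total_steps=100))
```

With 100 total steps, the multiplier rises over the first 10 steps and, under the decay assumption, falls back to zero by step 100.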

Training sample

python main.py --train --trainfile sample/train.iob2 --num_epoch 200 --model_dir /tmp/ner

Inference sample

python main.py --inference --target sample/target.raw --model_dir /tmp/ner

As a Python module

# import the function 'load_crf_model' from crf.py
import sys
sys.path.append()  # supply the directory that contains crf.py
from crf import load_crf_model

# load the trained model
model = load_crf_model()
# list of target sentences (words separated by spaces)
texts = ['10 日 午後 、 京都 大学 に 怪獣 が 出現 し た 。']
# predict one label sequence per sentence
labels = model.predict(texts)
# print each sentence as word/label pairs
for text, label in zip(texts, labels):
    print(' '.join('{}/{}'.format(t, l) for t, l in zip(text.split(' '), label)))

Data Format

spl (sentence per line) format

Write one sentence per line.
Words are separated by half-width spaces.
Tags follow each word, separated by slashes ("/").

10/T-B/O 日/T-I/O 午後/T-I/O 、/O/O 京都/O/L-B 大学/O/L-I に/O/O 怪獣/O/O が/O/O 出現/O/O し/O/O た/O/O 。/O/O

With partial annotation, an empty tag string means the word is not annotated for that tag.
If no tag from the n-th onward is annotated, the trailing "/" can be omitted.

10/T-B 日/T-I 午後/T-I 、 京都//L-B 大学//L-I に 怪獣 が 出現 し た 。
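
The spl rules above can be sketched as a small parser. This is an illustration under stated assumptions, not the reader used by CaPSL: the function name parse_spl_line and the num_tags parameter are hypothetical, unannotated tags are represented as None, and words are assumed not to contain "/" themselves.

```python
from typing import List, Optional, Tuple

# One token: the word plus one entry per tag column (None = not annotated).
Token = Tuple[str, List[Optional[str]]]


def parse_spl_line(line: str, num_tags: int) -> List[Token]:
    """Parse one spl-format sentence (hypothetical helper, not part of CaPSL)."""
    tokens: List[Token] = []
    for item in line.rstrip("\n").split(" "):
        if not item:
            continue  # ignore empty pieces produced by repeated spaces
        # Assumption: the word itself contains no "/".
        word, *tags = item.split("/")
        # Omitted trailing tags count as unannotated, so pad with empty strings.
        tags += [""] * (num_tags - len(tags))
        tokens.append((word, [tag or None for tag in tags]))
    return tokens


if __name__ == "__main__":
    for token in parse_spl_line("10/T-B 日/T-I 、 京都//L-B", num_tags=2):
        print(token)
```

Padding to a fixed number of tag columns keeps fully and partially annotated sentences in the same shape, which makes it easy to mask unannotated positions later.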

conll format

Write one word per line.
Each word is followed by its tags, separated by half-width spaces.

10 T-B O
日 T-I O
午後 T-I O
、 O O
京都 O L-B
大学 O L-I
に O O
怪獣 O O
が O O
出現 O O
し O O
た O O
。 O O

With partial annotation, an empty tag string means the word is not annotated for that tag.
(The following sample shows "␣" (U+2423) in place of each half-width space for visibility.)

10␣T-B
日␣T-I
午後␣T-I
、
京都␣␣L-B
大学␣␣L-I
に
怪獣
が
出現
し
た
。
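
A comparable sketch for the conll format follows. It is illustrative only: the function name parse_conll and the num_tags parameter are hypothetical, and treating a blank line as the sentence boundary is an assumption, since this page does not state how sentences are delimited. Lines are split on single spaces rather than on arbitrary whitespace so that an empty tag between two spaces is preserved.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, List[Optional[str]]]


def parse_conll(text: str, num_tags: int) -> List[List[Token]]:
    """Parse conll-format text into sentences (hypothetical helper, not part of CaPSL)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.split("\n"):
        if line == "":
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on single spaces so "京都  L-B" yields an empty first tag.
        word, *tags = line.split(" ")
        tags += [""] * (num_tags - len(tags))
        current.append((word, [tag or None for tag in tags]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = "10 T-B\n日 T-I\n、\n京都  L-B\n"
    for sentence in parse_conll(sample, num_tags=2):
        print(sentence)
```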

Dictionary format

Write one entry per line.
Tags and the phrase are separated by a tab, and tags are separated by slashes.
A phrase is a sequence of words separated by half-width spaces.
(The following sample shows "\t" in place of each tab for visibility.)

L/\tKyoto
/T\tlast month
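
As a rough illustration of the dictionary layout, the sketch below splits one entry into its tags and its phrase. The function name parse_dict_line is hypothetical and the code is not taken from CaPSL; an empty tag slot is represented as None on the assumption that it means no tag is given in that position.

```python
from typing import List, Optional, Tuple


def parse_dict_line(line: str) -> Tuple[List[Optional[str]], List[str]]:
    """Split a dictionary entry into (tags, words); hypothetical helper."""
    # Tags come first, then a real tab character, then the phrase.
    tag_field, phrase = line.rstrip("\n").split("\t", 1)
    # Tags are separated by slashes; an empty slot becomes None.
    tags = [tag or None for tag in tag_field.split("/")]
    # The phrase is a sequence of words separated by half-width spaces.
    words = phrase.split(" ")
    return tags, words


if __name__ == "__main__":
    print(parse_dict_line("L/\tKyoto"))
    print(parse_dict_line("/T\tlast month"))
```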

IOB2 format

Each tag is written in IOB2 format.
The script can read both prefix style (B-TAG) and suffix style (TAG-B).
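
One way to accept both styles is to normalize every tag into a (position, label) pair, as in the sketch below. This is an assumption about how such handling could look, not the logic used by CaPSL: the function name split_iob2 is hypothetical, and checking the prefix form before the suffix form is an arbitrary choice for tags that could be read either way.

```python
from typing import Optional, Tuple


def split_iob2(tag: str) -> Tuple[str, Optional[str]]:
    """Return (position, label) for an IOB2 tag in either style.

    Hypothetical helper, not part of CaPSL. Position is "B", "I", or "O".
    """
    if tag == "O":
        return "O", None
    # Prefix style: B-TAG / I-TAG (checked first by assumption).
    head, _, rest = tag.partition("-")
    if head in ("B", "I") and rest:
        return head, rest
    # Suffix style: TAG-B / TAG-I.
    rest, _, tail = tag.rpartition("-")
    if tail in ("B", "I") and rest:
        return tail, rest
    raise ValueError("not an IOB2 tag: {!r}".format(tag))


if __name__ == "__main__":
    for tag in ("B-LOC", "T-B", "L-I", "O"):
        print(tag, split_iob2(tag))
```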