Kyoto University, ACCMS, LSTA Group

Sequential Labeling tool CaPSL

CaPSL (CRF and Pointwise-based Sequential Labeling tool) is a sequential labeling tool with the following features.

We tested our tool with the following versions.

As a script


Running mode
	  --train            training mode
	  --inference        inference mode
Model directory specification
	  --model_dir        directory of the model (training mode: for saving / inference mode: for loading)
Hyper-parameters in training
	  --num_epoch        number of training epochs (default: 20)
	  --lr               learning rate (default: 2e-6)
	  --encoder          model for text encoding (bert/lstm, default: bert)
	  --use_dict         set to use a dictionary
	  --batch_size       batch size for training (default: 32)
	  --use_scheduler    set to use learning-rate warmup in training (linear schedule that reaches its peak at 1/10 of the training loop)
	  --bert_dir         BERT model for encoding (default: bert-base-uncased)
	  --lambda_pointwise weight of the pointwise loss in training (default: 0.5)
	  --do_pre           set to train with a fixed encoder before the main training loop
	  --do_crfonly       set to train with only the CRF loss on fully annotated data after the main training loop
	  --target_tag       use the n-th tag for training (without this option, all tags are used for training)
	  --quiet_train      do not print the training loss
	  --quiet_valid      do not print the validation loss or the evaluation on the validation data (if validation data is set, the model evaluates on it for early stopping)
	  --quiet_test       do not evaluate with the training data after training
Hyper-parameters in inference
	  --eval_batch_size  batch size for inference and evaluation in training (default: 64)
File specification
	  --file_format      format of annotated data (spl/conll, default: spl)
	  --trainfile        training data
	  --dictfile         dictionary data
	  --validfile        validation data
	  --testfile         test data
	  --target           word-segmented data for inference
Random seed
	  --seed             random seed (default: 42)
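
The `--use_scheduler` option above describes a linear warmup that peaks at 1/10 of the training loop. The following is a minimal sketch of such a schedule, not CaPSL's own code. The function name `warmup_lr` is hypothetical, and the linear decay to zero after the peak is an assumption, since the option text only specifies the warmup phase.

```python
def warmup_lr(step, total_steps, peak_lr=2e-6):
    """Learning rate at a given step (hypothetical helper for illustration)."""
    # The peak is reached at 1/10 of the training loop.
    warmup_steps = max(1, total_steps // 10)
    if step < warmup_steps:
        # Warmup phase: linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Assumption: linear decay from peak_lr down to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * (total_steps - step) / remaining


if __name__ == '__main__':
    for step in (0, 5, 10, 55, 100):
        print(step, warmup_lr(step, total_steps=100))
```

With `total_steps=100`, the rate rises during the first 10 steps and then falls back toward zero under the decay assumption.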

Training sample

python --train --trainfile sample/train.iob2 --num_epoch 200 --model_dir /tmp/ner

Inference sample

python --inference --target sample/target.raw --model_dir /tmp/ner

As a Python module

# import function 'load_crf_model' from
import sys
from crf import load_crf_model

# load trained model
model = load_crf_model()
# list of target sentences (split into words with space)
texts = ['10 日 午後 、 京都 大学 に 怪獣 が 出現 し た 。']
# predict one label sequence per sentence
labels = model.predict(texts)
# print each sentence as space-separated word/label pairs
for text, label in zip(texts, labels):
    print(' '.join(f'{t}/{l}' for t, l in zip(text.split(' '), label)))

Data Format

spl (sentence per line) format

Write one sentence per line.
Words are separated by halfwidth spaces.
Tags follow each word, separated by slashes ("/").

10/T-B/O 日/T-I/O 午後/T-I/O 、/O/O 京都/O/L-B 大学/O/L-I に/O/O 怪獣/O/O が/O/O 出現/O/O し/O/O た/O/O 。/O/O

In partial annotation, an empty tag string means that the word is not annotated for that tag.
If all tags from some position onward are unannotated, the corresponding "/" separators can be omitted.

10/T-B 日/T-I 午後/T-I 、 京都//L-B 大学//L-I に 怪獣 が 出現 し た 。
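
The spl rules above can be sketched as a small reader. This is an illustration only, not CaPSL's own loader: the function name `parse_spl_line`, the `num_tags` parameter, and the choice of `None` for an unannotated tag are all assumptions. It also assumes that words never contain "/".

```python
def parse_spl_line(line, num_tags):
    """Parse one spl-format sentence into (word, tags) pairs.

    Hypothetical helper: unannotated tags are returned as None.
    """
    sentence = []
    for token in line.strip().split(' '):
        # The first field is the word; the rest are its tags.
        word, *tags = token.split('/')
        # Pad tags whose trailing "/" separators were omitted.
        tags += [''] * (num_tags - len(tags))
        # An empty string means "not annotated".
        sentence.append((word, [t or None for t in tags]))
    return sentence


if __name__ == '__main__':
    line = '10/T-B 日/T-I 午後/T-I 、 京都//L-B 大学//L-I に'
    for word, tags in parse_spl_line(line, num_tags=2):
        print(word, tags)
```

Splitting on a literal halfwidth space rather than on any whitespace keeps fullwidth spaces inside a token, which matches the format description.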

conll format

Write one word per line.
Each word is followed by tags, separated by halfwidth spaces.

10 T-B O
日 T-I O
午後 T-I O
、 O O
京都 O L-B
大学 O L-I
に O O
怪獣 O O
が O O
出現 O O
し O O
た O O
。 O O

In partial annotation, an empty tag string means that the word is not annotated for that tag.
(In the following sample, "␣" (U+2423) stands for a halfwidth space for visualization.)

Dictionary format

Write one entry per line.
Tags and a phrase are separated by a tab, and tags are separated by slashes.
A phrase is a sequence of words separated by halfwidth spaces.
(In the following sample, "\t" stands for a tab for visualization.)

/T\tlast month
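
A dictionary entry in this layout could be read as in the sketch below. The function name `parse_dict_entry` is hypothetical, and treating an empty tag field as "no tag" is an assumption borrowed from the partial-annotation rule of the annotated data formats; the tool itself may interpret it differently.

```python
def parse_dict_entry(line):
    """Split one dictionary line into (tags, words).

    Hypothetical helper: empty tag fields are returned as None.
    """
    # Tags and the phrase are separated by a single tab.
    tag_field, phrase = line.rstrip('\n').split('\t', 1)
    # Tags are separated by slashes; an empty field is assumed to mean "no tag".
    tags = [t or None for t in tag_field.split('/')]
    # A phrase is a sequence of words separated by halfwidth spaces.
    words = phrase.split(' ')
    return tags, words


if __name__ == '__main__':
    # Here '\t' is a real tab character, unlike the visualized sample.
    print(parse_dict_entry('/T\tlast month'))
```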

IOB2 format

Each tag is written in IOB2 format.
Our script can read both prefix style (B-TAG) and suffix style (TAG-B).
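
As an illustration of accepting both styles, the sketch below splits a tag into its position (B, I, or O) and its label. The function name `split_iob2` is hypothetical and this is not CaPSL's own reader. It checks the prefix style first, so a tag such as `B-I` is read as position B with label I; that tie-break is an assumption.

```python
def split_iob2(tag):
    """Return (position, label) for a prefix- or suffix-style IOB2 tag."""
    if tag == 'O':
        # Outside any chunk: no label.
        return 'O', None
    if tag[:2] in ('B-', 'I-'):
        # Prefix style, e.g. B-TAG.
        return tag[0], tag[2:]
    if tag[-2:] in ('-B', '-I'):
        # Suffix style, e.g. TAG-B.
        return tag[-1], tag[:-2]
    raise ValueError(f'not an IOB2 tag: {tag!r}')


if __name__ == '__main__':
    for tag in ('T-B', 'T-I', 'B-T', 'I-T', 'O'):
        print(tag, split_iob2(tag))
```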