CaPSL (CRF and Pointwise-based Sequential Labeling tool) is a sequential labeling tool with the following features.
Temporal expression recognizer
Spatial expression recognizer
Multitask of Spatial/Temporal recognizer
Trained with annotated Mainichi Newspaper Article Corpus
Temporal expression recognizer (under construction)
Spatial expression recognizer
Multitask of Spatial/Temporal recognizer (under construction)
Trained with annotated AFP Articles
(under construction)
Rule-based approach
(under construction)
Latitude/Longitude estimation
Dictionary-based listing and optimization
Option
Running mode
--train training mode
--inference inference mode
Directory of the model specification
--model_dir Directory of the model (training mode: for saving / inference mode: for loading)
Hyper-parameters in training
--num_epoch number of training epochs (default: 20)
--lr learning rate (default: 2e-6)
--encoder model for text encoding (bert/lstm, default: bert)
--use_dict set to use the dictionary
--batch_size batch size for training (default: 32)
--use_scheduler set to use learning-rate warmup in training (linear scheduling that reaches its peak at 1/10 of the training loop)
--bert_dir bert model for encoding (default: bert-base-uncased)
--lambda_pointwise weight for pointwise loss in training (default: 0.5)
--do_pre set to train with a fixed encoder before the main training loop
--do_crfonly set to train only with the CRF loss on fully-annotated data after the main training loop
--target_tag use n-th tags for training (without this option, use all tags for training)
--quiet_train do not print the training loss
--quiet_valid do not print the validation loss or the evaluation on validation data (if validation data is set, the model is evaluated on it for early stopping)
--quiet_test do not evaluate with the training data after training
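The warmup described for --use_scheduler can be illustrated with a small sketch. It is not taken from the CaPSL code: the function name `linear_warmup_factor` is hypothetical, and the linear decay after the peak is an assumption (the option description only states that the peak is reached at 1/10 of the training loop).

```python
def linear_warmup_factor(step, total_steps):
    """Return a multiplier in [0, 1] applied to the base learning rate."""
    # The peak is reached at 1/10 of the training loop.
    warmup_steps = max(1, total_steps // 10)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return step / warmup_steps
    # Assumption: linear decay from the peak down to 0 at the last step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-6  # default value of --lr
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_warmup_factor(step, total_steps))
```

With 1000 steps, the multiplier is 0.5 at step 50, 1.0 at step 100, and (under the decay assumption) 0.5 again at step 550.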
Hyper-parameters in inference
--eval_batch_size batch size for inference and evaluation in training (default: 64)
File specification
--file_format format of annotated data (spl/conll, default: spl)
--trainfile training data
--dictfile dictionary data
--validfile validation data
--testfile test data
--target word segmented data for inference
Random seed
--seed random seed (default: 42)
Training sample
python main.py --train --trainfile sample/train.iob2 --num_epoch 200 --model_dir /tmp/ner
Inference sample
python main.py --inference --target sample/target.raw --model_dir /tmp/ner
# import function 'load_crf_model' from crf.py
import sys
sys.path.append()
from crf import load_crf_model

# load trained model
model = load_crf_model( )

# list of target sentences (split into words with space)
texts = ['10 日 午後 、 京都 大学 に 怪獣 が 出現 し た 。']
labels = model.predict(texts)
for text, label in zip(texts, labels):
    print(' '.join(['{}/{}'.format(t, l) for t, l in zip(text.split(' '), label)]))
Write one sentence per line.
Words are separated by half-width spaces.
Tags follow each word, separated by slashes ("/").
10/T-B/O 日/T-I/O 午後/T-I/O 、/O/O 京都/O/L-B 大学/O/L-I に/O/O 怪獣/O/O が/O/O 出現/O/O し/O/O た/O/O 。/O/O
In partial annotation, an empty tag means the word is not annotated for that tag.
10/T-B 日/T-I 午後/T-I 、 京都//L-B 大学//L-I に 怪獣 が 出現 し た 。
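As an illustration of the spl format, the following sketch parses one line into words and per-word tag lists. It is not the loader used by CaPSL: the function name `parse_spl_line` and the `num_tags` parameter are hypothetical, and the sketch assumes that words themselves contain no slash.

```python
def parse_spl_line(line, num_tags=2):
    """Parse one spl-format sentence into (words, tags).

    Empty or missing tags are returned as None (not annotated).
    Assumption: a word never contains '/'.
    """
    words, tags = [], []
    for token in line.strip().split(' '):
        fields = token.split('/')
        word, raw_tags = fields[0], fields[1:]
        # Pad missing trailing tags so every word has num_tags slots.
        raw_tags += [''] * (num_tags - len(raw_tags))
        words.append(word)
        tags.append([t if t else None for t in raw_tags[:num_tags]])
    return words, tags


words, tags = parse_spl_line('10/T-B 日/T-I 、 京都//L-B')
print(words)
print(tags)
```

Here `京都//L-B` yields `[None, 'L-B']`: the first tag is unannotated and the second is `L-B`.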
Write one word per line.
Each word is followed by its tags, separated by half-width spaces.
10 T-B O
日 T-I O
午後 T-I O
、 O O
京都 O L-B
大学 O L-I
に O O
怪獣 O O
が O O
出現 O O
し O O
た O O
。 O O
In partial annotation, an empty tag means the word is not annotated for that tag.
10␣T-B
日␣T-I
午後␣T-I
、
京都␣␣L-B
大学␣␣L-I
に
怪獣
が
出現
し
た
。
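The conll format can be read in a similar way. The sketch below is not CaPSL's own reader: `parse_conll` is a hypothetical name, and the use of a blank line as the sentence separator is an assumption. The ␣ marks in the sample above stand for ordinary spaces, which is what the code splits on.

```python
def parse_conll(text, num_tags=2):
    """Parse conll-format text into a list of (words, tags) sentences.

    Empty or missing tags are returned as None (not annotated).
    Assumption: sentences are separated by blank lines.
    """
    sentences, words, tags = [], [], []
    for line in text.split('\n'):
        if not line.strip():
            # A blank line closes the current sentence.
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        fields = line.split(' ')
        # Pad missing trailing tags so every word has num_tags slots.
        raw_tags = fields[1:] + [''] * (num_tags - len(fields) + 1)
        words.append(fields[0])
        tags.append([t if t else None for t in raw_tags[:num_tags]])
    if words:
        sentences.append((words, tags))
    return sentences


sample = '10 T-B\n日 T-I\n京都  L-B\n、'
for sent_words, sent_tags in parse_conll(sample):
    print(sent_words, sent_tags)
```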
Describe one entry per line.
Tags and the phrase are separated by a tab, and tags are separated by slashes.
A phrase is a sequence of words separated by halfwidth spaces.
(We show "\t" instead of a tab for visualization in the following sample.)
L/\tKyoto
/T\tlast month
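A dictionary entry in this format can be split as in the sketch below. The function name `parse_dict_line` is hypothetical, and returning an empty tag slot as None is an illustrative choice rather than CaPSL's documented behavior.

```python
def parse_dict_line(line):
    """Split one dictionary entry into (tags, phrase_words)."""
    # The tag field and the phrase are separated by a real tab character.
    tag_field, phrase = line.rstrip('\n').split('\t', 1)
    # Tags are separated by slashes; an empty slot becomes None.
    tags = [t if t else None for t in tag_field.split('/')]
    # The phrase is a sequence of words separated by half-width spaces.
    return tags, phrase.split(' ')


print(parse_dict_line('L/\tKyoto'))
print(parse_dict_line('/T\tlast month'))
```

In the sample, `Kyoto` carries a tag only in the first slot and `last month` only in the second.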
Each tag is written in IOB2 format.
The script reads both prefix-style (B-TAG) and suffix-style (TAG-B) tags.
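One way to handle both tag styles is to normalize each tag into a position and a type. The sketch below is only an illustration under that assumption: `split_iob2` is a hypothetical helper, not a function from CaPSL, and it tries the prefix style first, so an ambiguous tag such as `B-I` is read as prefix-style.

```python
def split_iob2(tag):
    """Return (position, type) for an IOB2 tag in either style.

    'B-LOC' and 'LOC-B' both give ('B', 'LOC'); 'O' gives ('O', None).
    """
    if tag == 'O':
        return 'O', None
    # Prefix style: B-TAG / I-TAG.
    head, _, rest = tag.partition('-')
    if head in ('B', 'I') and rest:
        return head, rest
    # Suffix style: TAG-B / TAG-I.
    body, _, tail = tag.rpartition('-')
    if tail in ('B', 'I') and body:
        return tail, body
    raise ValueError('not an IOB2 tag: {!r}'.format(tag))


for tag in ('T-B', 'L-I', 'B-LOC', 'O'):
    print(tag, split_iob2(tag))
```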