This is a step-by-step implementation of Chinese chat-title named entity recognition with a BERT-BiLSTM-CRF model.
For original codes and tutorials, please visit here.
For Chinese readers, please visit here.
My work focuses on named entity recognition for Chinese chat titles by fine-tuning the BERT model.
To that end, I changed several lines of code and extended the original dataset with more Chinese chat-title entities.
The BERT-BiLSTM-CRF model achieves 96.73% accuracy.
The main purpose of the job (examples):
Input One: 小贾你最近忙什么呢?
Input Two: 贾舒越
Output: 小贾 is 贾舒越
Input One: 建勋师兄你何时来实验室?
Input Two: 邸建勋
Output: 建勋师兄 is 邸建勋
Input One: 最近王宇航学习怎么样呀
Input Two: 王海生
Output: There is no match for 王海生.
Input One: 贾泽阳现在回家了嘛
Input Two: 吴泽阳
Output: There is no match for 吴泽阳.
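The intended matching behavior in the four examples above can be sketched with a toy string heuristic. To be clear, this is only an illustration that reproduces these examples; the repository itself does the matching with the trained BERT-BiLSTM-CRF model, not hand-written rules, and the mention would first be extracted from the sentence by NER.

```python
def is_match(mention: str, candidate: str) -> bool:
    """Toy heuristic: does a chat-title mention refer to a candidate full name?

    Assumes a single-character surname followed by the given name,
    which holds for the examples in this README but not in general.
    """
    surname, given = candidate[0], candidate[1:]
    idx = mention.find(given)
    if idx != -1:
        # Given name appears: accept unless it is preceded by a
        # different surname (e.g. 贾泽阳 does not match 吴泽阳).
        return idx == 0 or mention[idx - 1] == surname
    if surname in mention:
        # Surname only: accept bare nicknames like 小贾, but reject what
        # looks like a different full name (e.g. 王宇航 vs. 王海生).
        pos = mention.find(surname)
        return len(mention[pos + 1:]) == 0
    return False

print(is_match("小贾", "贾舒越"))        # nickname with matching surname
print(is_match("建勋师兄", "邸建勋"))    # given name plus honorific
print(is_match("王宇航", "王海生"))      # different full name, same surname
print(is_match("贾泽阳", "吴泽阳"))      # same given name, different surname
```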
Chinese readers can read 提取聊天对方的称谓 - 方案与deadline.pdf for the details.
pip install bert-base==0.0.7 -i https://pypi.python.org/simple
tensorflow >= 1.12.0
tensorflow-gpu >= 1.12.0 # GPU version of TensorFlow.
GPUtil >= 1.3.0 # not needed if you don't have a GPU
pyzmq >= 17.1.0 # python zmq
Download the BERT pre-trained model from here.
Be sure to place the extracted folder "chinese_L-12_H-768_A-12" in the "init_checkpoint" folder.
Download the training dataset from here.
Be sure to place "train.txt" in the "data" folder.
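For reference, bert-base expects the training data in a CoNLL-style, character-level format: one character and one BIO tag per line, separated by a space, with a blank line between sentences. The snippet below is an illustration with a generic PER tag; the actual tag set in this repository's train.txt (including the extended chat-title entities) may differ.

```
小 B-PER
贾 I-PER
你 O
最 O
近 O
忙 O
什 O
么 O
呢 O
? O
```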
Open the CMD terminal or the Anaconda Prompt, navigate to your working path, and activate the TensorFlow environment:
e.g. my working path is /Users/shuyuej/Desktop/Python-Files/Chinese-Chat-Title-NER-BERT-BiLSTM-CRF/.
Then input the command:
bert-base-ner-train -data_dir /Users/shuyuej/Desktop/Python-Files/BERT-BiLSTM-CRF-NER/data/ -output_dir /Users/shuyuej/Desktop/Python-Files/BERT-BiLSTM-CRF-NER/final_output/ -init_checkpoint /Users/shuyuej/Desktop/Python-Files/BERT-BiLSTM-CRF-NER/init_checkpoint/chinese_L-12_H-768_A-12/bert_model.ckpt -bert_config_file /Users/shuyuej/Desktop/Python-Files/BERT-BiLSTM-CRF-NER/init_checkpoint/chinese_L-12_H-768_A-12/bert_config.json -vocab_file /Users/shuyuej/Desktop/Python-Files/BERT-BiLSTM-CRF-NER/init_checkpoint/chinese_L-12_H-768_A-12/vocab.txt -batch_size 8
FYI, be sure to replace my "/Users/shuyuej/Desktop/Python-Files/BERT-BiLSTM-CRF-NER/" with your own BERT-BiLSTM-CRF-NER path.
On Windows, I use the following command line:
bert-base-ner-train -data_dir E:\BERT-BiLSTM-CRF-NER\data\ -output_dir E:\BERT-BiLSTM-CRF-NER\final_output\ -init_checkpoint E:\BERT-BiLSTM-CRF-NER\init_checkpoint\chinese_L-12_H-768_A-12\bert_model.ckpt -bert_config_file E:\BERT-BiLSTM-CRF-NER\init_checkpoint\chinese_L-12_H-768_A-12\bert_config.json -vocab_file E:\BERT-BiLSTM-CRF-NER\init_checkpoint\chinese_L-12_H-768_A-12\vocab.txt -batch_size 8
The final trained model will be in the "final_output" folder.
The test file is "test.py"; it evaluates the model on "Test-set.xlsx" and produces a result.
Before you execute the file, be sure to change the paths of the trained BERT model, the original pre-trained BERT model, and Test-set.xlsx to your own.
Then you can see the results and the power of BERT.
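The path edits might look like the following sketch; the variable names here are hypothetical (check the actual names at the top of test.py), and the placeholder paths must be replaced with your own.

```python
import os

def check_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Hypothetical variable names; point these at your own locations.
model_dir = "/path/to/final_output"  # trained model from the training step
bert_dir = "/path/to/init_checkpoint/chinese_L-12_H-768_A-12"  # pre-trained BERT
test_set = "/path/to/Test-set.xlsx"  # evaluation spreadsheet

missing = check_paths([model_dir, bert_dir, test_set])
if missing:
    print("Update these placeholders before running test.py:", missing)
```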
Another executable file is "predict-test.py", in which you can input a sentence and a name and get the match result. Be sure to change the paths in the same way as for "test.py".