Abstract:
To rapidly extract knowledge from tobacco-related scientific and technical literature, an interactive, iterative learning method for annotating and recognizing tobacco knowledge entities was used to construct an annotated corpus of tobacco-related texts. A text annotation specification suited to the tobacco domain was designed, and a BERT+CRF (Bidirectional Encoder Representations from Transformers + Conditional Random Field) deep learning model was used to recognize and pre-annotate tobacco named entities. Combined with manual proofreading, this increased the size of the original corpus and improved the performance of the model. The results showed that the annotation consistency (F1) of the corpus reached 92.4%, and that the BERT+CRF model recognized entities better than the commonly used CRF and BiLSTM+CRF named entity recognition models. This technology supports improved text analysis and knowledge mining in the tobacco domain.