Construction of a corpus for annotating tobacco-related texts

王永胜; 刘亚丽; 宗国浩; 王迪; 王锐; 王金棒; 李丰霖; 贾楠; 冯伟华

doi:10.16135/j.issn1002-0861.2023.0320

Construction of a corpus for annotating tobacco-related texts

Abstract

To enable rapid extraction of knowledge from tobacco science and technology literature, a corpus for annotating tobacco-related texts was constructed using an interactive, iterative learning method for annotating and recognizing tobacco knowledge entities. A text annotation specification suited to the tobacco domain was designed, and a BERT+CRF (Bidirectional Encoder Representations from Transformers + Conditional Random Field) deep learning model was used to recognize and pre-annotate tobacco named entities. Combined with manual proofreading, this expanded the original corpus and improved model performance. The results showed that the annotation consistency of the corpus, measured by F1 score, reached 92.4%, and that the BERT+CRF model outperformed the commonly used CRF and BiLSTM+CRF named entity recognition models. The technique can support improved text analysis and knowledge mining in the tobacco domain.
