
Construction of a corpus for annotating tobacco-related texts

  • Abstract: To quickly extract knowledge from tobacco-related scientific and technological literature, an interactive, iterative learning method for annotating and recognizing tobacco knowledge entities was used to construct an annotated text corpus for the tobacco domain. A text annotation specification suited to the tobacco domain was designed, and a BERT+CRF (Bidirectional Encoder Representations from Transformers + Conditional Random Field) deep learning network model was used to recognize and pre-annotate tobacco named entities. Combined with manual proofreading, this expanded the original corpus and improved model performance. The results showed that the annotation consistency of the corpus reached an F1 score of 92.4%, and that the BERT+CRF model recognized entities better than the commonly used CRF and BiLSTM+CRF named entity recognition models. This technique can support improved text analysis and knowledge mining in the tobacco domain.
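The abstract reports annotation consistency as an F1 value but does not say how it was computed. One common convention, assumed here rather than taken from the paper, is to treat one annotator's entities as the reference and count an entity as agreed only when its span and label match exactly. A minimal sketch under that assumption follows; the function name `agreement_f1`, the entity labels, and the sample offsets are all hypothetical.

```python
from typing import Set, Tuple

# An entity is (start offset, end offset, label). The labels used below
# are hypothetical, not taken from the paper's annotation specification.
Entity = Tuple[int, int, str]


def agreement_f1(annotator_a: Set[Entity], annotator_b: Set[Entity]) -> float:
    """F1 between two annotators, counting exact span-and-label matches.

    Annotator A is treated as the reference (an assumption; exact-match
    F1 is symmetric, so swapping the roles gives the same value).
    """
    if not annotator_a and not annotator_b:
        return 1.0  # nothing to annotate and both agree on that
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same sentence by two annotators.
a = {(0, 3, "VARIETY"), (10, 14, "DISEASE"), (20, 25, "CHEMICAL")}
b = {(0, 3, "VARIETY"), (10, 14, "DISEASE"), (30, 33, "CHEMICAL")}
score = agreement_f1(a, b)  # two of three entities agree on each side
```

Exact matching is the strictest choice; a relaxed variant that accepts overlapping spans would give a higher score on the same annotations.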

     
