Citation: WANG Yibo, PAN Wei, ZHANG Haitao, JIANG Tao. Design of information classifying and coding for tobacco industry based on hierarchical mutual information clustering[J]. Tobacco Science & Technology.

Design of information classifying and coding for tobacco industry based on hierarchical mutual information clustering

Abstract: To meet the information classification and coding needs of the National Tobacco Production, Operation and Management Integrated Platform, information systems were decomposed into three types of digital objects ("process", "entity", and "service"), and, taking the actual business of the tobacco industry into account, a hierarchical mutual information clustering (HMIC) algorithm was proposed. Text data were processed with natural language processing, the mutual information between digital objects was calculated at each classification level, and a hierarchical clustering algorithm was used to cluster the digital objects, yielding an information classification for the tobacco industry on which information coding was then based. HMIC was compared with commonly used clustering algorithms. The results showed that: 1) HMIC achieved the best information classification performance, with an overall information entropy about 8.221% lower than that of a clustering algorithm using Euclidean distance and about 2.512% lower than that of a clustering algorithm using only the mutual information matrix. 2) Studying classification and coding from the perspective of information content better distinguished the differences between categories and improved the usability of the classification and coding. The technique can support the construction of information system projects throughout their life cycle.
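The abstract does not give the HMIC formulas, so the following is only a minimal sketch of the general idea: pairwise mutual information between digital objects is turned into a distance, the objects are merged by average-linkage hierarchical clustering, and each object receives a code built from its cluster index. Everything specific here is an assumption made for illustration: the binary term-presence vectors, the object names, the normalization `1 - MI / max(H)`, and the five-digit code layout. Because it uses a single flat mutual information matrix, it is closer to the "mutual information matrix only" baseline mentioned above than to HMIC itself, whose level-wise treatment is not described in the abstract.

```python
import math
from itertools import combinations


def entropy(x):
    """Shannon entropy (bits) of a binary vector."""
    n = len(x)
    h = 0.0
    for a in (0, 1):
        p = sum(1 for xi in x if xi == a) / n
        if p > 0:
            h -= p * math.log2(p)
    return h


def mutual_information(x, y):
    """Mutual information (bits) between two equal-length binary vectors."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            p_a = sum(1 for xi in x if xi == a) / n
            p_b = sum(1 for yi in y if yi == b) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def mi_distance(x, y):
    """Assumed distance: 1 - MI / max(H(x), H(y)).

    Note that MI measures dependence of any sign, so strongly
    anti-correlated vectors would also come out as "close".
    """
    denom = max(entropy(x), entropy(y))
    if denom == 0:
        return 0.0 if x == y else 1.0
    return 1.0 - mutual_information(x, y) / denom


def agglomerative(names, vectors, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    idx = range(len(names))
    dist = {(i, j): mi_distance(vectors[i], vectors[j])
            for i, j in combinations(idx, 2)}

    def d(i, j):
        return dist[(i, j)] if i < j else dist[(j, i)]

    def linkage(c1, c2):
        return sum(d(i, j) for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [[i] for i in idx]
    while len(clusters) > k:
        # combinations() yields a < b, so deleting b keeps index a valid.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [[names[i] for i in c] for c in clusters]


def assign_codes(clusters):
    """Hypothetical code layout: 2-digit class number + 3-digit item number."""
    codes = {}
    for ci, members in enumerate(clusters, start=1):
        for ni, name in enumerate(sorted(members), start=1):
            codes[name] = f"{ci:02d}{ni:03d}"
    return codes


# Toy data (invented): presence of 8 vocabulary terms in each object's text.
names = ["leaf_purchase", "leaf_storage", "cigarette_order", "cigarette_delivery"]
vectors = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
]

clusters = agglomerative(names, vectors, k=2)
codes = assign_codes(clusters)
print(clusters)
print(codes)
```

On this toy input the two leaf-related objects end up in one class and the two cigarette-related objects in the other. Replacing `mi_distance` with a Euclidean distance on the same vectors would give the other baseline the abstract compares against.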

     
