Abstract:
To support information classification and coding for the tobacco industry's National Tobacco Production, Operation and Management Integrated Platform, information systems are decomposed into three types of digital objects, namely “process, entity, and service”, and, in light of the industry's actual business, a hierarchical mutual information clustering (HMIC) algorithm is proposed. Natural language processing is applied to text data to calculate the mutual information between digital objects at each classification level, and a hierarchical clustering algorithm then groups the digital objects to yield an information classification for the tobacco industry, on which information coding is based. The HMIC algorithm was compared with commonly used clustering algorithms. The results showed that: 1) The HMIC algorithm achieved the best classification performance, with total information entropy about 8.221% lower than that of the clustering algorithm using Euclidean distance and about 2.512% lower than that of the clustering algorithm using only the mutual information matrix. 2) Approaching information classification and coding from the perspective of information content better distinguished between categories, which improved the usability of the resulting classification and codes. The technique provides guidance for the whole life cycle of information system project construction.
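
The core idea of clustering digital objects by mutual information can be illustrated with a minimal sketch. It is not the HMIC algorithm itself, whose details the abstract does not give. It assumes each digital object is represented by a binary vector recording which documents mention it, and it merges clusters greedily by highest average mutual information. The object names, the vectors, and the helper functions `mutual_information` and `agglomerate` are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def agglomerate(names, vectors, n_clusters):
    """Greedy average-linkage clustering using mutual information as similarity."""
    idx = range(len(names))
    mi = {(i, j): mutual_information(vectors[i], vectors[j])
          for i, j in combinations(idx, 2)}

    def similarity(a, b):
        # Mean pairwise mutual information between members of two clusters.
        total = sum(mi[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in idx]
    while len(clusters) > n_clusters:
        # Merge the two clusters that share the most information.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical digital objects; each vector marks presence in 8 documents.
names = ["leaf_purchase", "leaf_grading", "retail_order", "retail_delivery"]
vectors = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
]
groups = agglomerate(names, vectors, n_clusters=2)
print(groups)
```

In this toy data the two "leaf" objects co-occur perfectly, as do the two "retail" objects, while objects from different groups are statistically independent, so the merge order follows the shared information rather than any geometric distance. Note that mutual information also rates perfectly anti-correlated vectors as highly similar, which a practical system would need to account for.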