Abstract:
To support the construction of the National Tobacco Production, Operation and Management Integrated Platform for the tobacco industry, an information classification and coding scheme was developed. Information systems were decomposed into three types of digital objects, namely "process, entity, and service", and, drawing on the actual business of the tobacco industry, a hierarchical mutual information clustering (HMIC) algorithm was proposed. Natural language processing was applied to text data to calculate the mutual information between digital objects at different classification levels, and a hierarchical clustering algorithm was then used to group the digital objects. This yielded an information classification for the tobacco industry, on which the information coding was based. The HMIC algorithm was compared with commonly used clustering algorithms, and the results showed that: 1) The HMIC algorithm performed best in information classification, reducing total information entropy by about 8.2% compared with the clustering algorithm using Euclidean distance, and by about 2.5% compared with the clustering algorithm using the mutual information matrix only. 2) From the perspective of information content, the resulting information classification and coding distinguished more clearly between categories and improved their usability. The technique provides guidance across the whole life cycle of information system project construction.
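The general idea of clustering digital objects by mutual information can be illustrated with a minimal sketch. It is not the HMIC algorithm itself, whose details the abstract does not give. The sketch assumes that each digital object is represented by a binary term-presence vector, uses the normalized distance 1 - I(X;Y)/H(X,Y), and applies plain average-linkage agglomerative clustering. The object names, the vectors, and the functions `entropy`, `mi_distance`, and `cluster` are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(counts, n):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    return -sum(c / n * log2(c / n) for c in counts if c)


def mi_distance(x, y):
    """Normalized distance 1 - I(X;Y)/H(X,Y) between two discrete sequences."""
    n = len(x)
    h_x = entropy(Counter(x).values(), n)
    h_y = entropy(Counter(y).values(), n)
    h_xy = entropy(Counter(zip(x, y)).values(), n)
    if h_xy == 0:  # both sequences are constant: treat them as identical
        return 0.0
    mutual_info = h_x + h_y - h_xy
    return 1.0 - mutual_info / h_xy


def cluster(features, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    names = list(features)
    # Pairwise distances between all objects, keyed by unordered name pair.
    dist = {frozenset(p): mi_distance(features[p[0]], features[p[1]])
            for p in combinations(names, 2)}
    clusters = [[name] for name in names]

    def linkage(a, b):
        # Mean distance over all cross-cluster pairs.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


# Hypothetical data: one entry per document, 1 if the object's term occurs in it.
features = {
    "leaf_purchase": [1, 1, 1, 1, 0, 0, 0, 0],
    "leaf_grading":  [1, 1, 1, 0, 0, 0, 0, 0],
    "sales_order":   [1, 1, 0, 0, 1, 1, 0, 0],
    "sales_invoice": [1, 1, 0, 0, 1, 0, 0, 0],
}
groups = cluster(features, k=2)
print(groups)
```

Because mutual information measures statistical dependence rather than similarity, two objects whose terms never co-occur would also be treated as close. A practical pipeline would need to account for this when building the feature vectors.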