Construction of a classification model for tobacco leaf samples from eight ecological regions with imbalanced quantities based on chemical index analysis

  • Abstract: To discriminate tobacco leaves from eight ecological regions on the basis of chemical index analysis, the synthetic minority over-sampling technique (SMOTE) was used to balance the distribution of sample numbers across the regions, and the random forest (RF) algorithm was then used to build a classification model for the eight regions. The aim was to address the poor classification performance for minority-class ecological regions caused by class imbalance in the data. The results showed that: 1) The imbalance in sample numbers led to poor recognition by the RF classification model for ecological regions Ⅱ, Ⅲ, Ⅴ and Ⅷ. 2) In five-fold cross-validation, the SMOTE-RF model achieved an average accuracy of 0.95 across the ecological regions, with precision, recall and F1 score of 0.85-0.99, 0.81-0.98 and 0.82-0.98, respectively, an improvement of 18.7%-34.2% over the RF model; it also outperformed SMOTE-LR (logistic regression), SMOTE-LDA (linear discriminant analysis), SMOTE-ANN (artificial neural network) and SMOTE-GNB (Gaussian naive Bayes). 3) Importance analysis of the chemical indexes in the SMOTE-RF model indicated that 15 chemical indexes, including total alkaloids, chlorine and solanesol, were key indexes for classifying the eight ecological regions, while 36 chemical indexes, including reducing sugars, total sugars and total nitrogen, were important classification indexes. Together, these 51 indexes form the possible characteristic indexes of the eight ecological regions. These results indicate that combining SMOTE with the RF classification algorithm enables high-quality, interpretable classification of imbalanced samples from the eight ecological regions based on tobacco leaf chemical indexes.
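The abstract does not describe how the oversampling was implemented, so the following is only a minimal sketch of the SMOTE idea: each synthetic minority sample is drawn at a random point on the line segment between a minority sample and one of its k nearest same-class neighbours. The function name `smote_oversample`, the two-feature toy data and the region labels are all illustrative assumptions, not values or code from the study. In practice one would more likely rely on an established implementation (for example, the `SMOTE` class in imbalanced-learn together with scikit-learn's `RandomForestClassifier`).

```python
import random
from collections import Counter


def smote_oversample(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class (SMOTE-style)."""
    rng = random.Random(seed)

    # Group the feature vectors by class label.
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue  # a single sample has no neighbour to interpolate towards
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            # Up to k nearest same-class neighbours (squared Euclidean distance).
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: two invented chemical indexes per sample,
# six samples from region "I" and three from region "II".
X = [
    [2.1, 0.30], [2.3, 0.28], [2.0, 0.35], [2.4, 0.31], [2.2, 0.33], [2.5, 0.29],
    [1.0, 0.50], [1.2, 0.54], [0.9, 0.48],
]
y = ["I"] * 6 + ["II"] * 3

X_bal, y_bal = smote_oversample(X, y, k=5, seed=0)
print(Counter(y), "->", Counter(y_bal))
```

The original samples are kept unchanged and only the minority class receives synthetic points, so the balanced set can be passed to any classifier. When combining oversampling with cross-validation, a common precaution is to oversample only the training folds so that synthetic points derived from test samples do not leak into training; whether the study did this is not stated in the abstract.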

     
