Abstract:
To discriminate tobacco leaves from eight ecological regions on the basis of chemical indexes, the synthetic minority over-sampling technique (SMOTE) was used to balance the number of samples across regions and was combined with the random forest (RF) algorithm to build a classification model for the eight regions, addressing the poor classification performance for minority regions caused by imbalanced sample sizes. The results showed that: 1) Imbalanced sample quantities led to poor recognition by the RF classification model for ecological regions Ⅱ, Ⅲ, Ⅴ and Ⅷ. 2) The average accuracy of the SMOTE-RF model under five-fold cross-validation across the ecological regions was 0.95, and the precisions, recall rates and F1 scores were 0.85-0.99, 0.81-0.98 and 0.82-0.98, respectively, representing increases of 18.7% to 34.2% over the RF model and better performance than SMOTE-LR, SMOTE-LDA, SMOTE-ANN and SMOTE-GNB. 3) Importance analysis of the chemical indexes in the SMOTE-RF model showed that 15 chemical indexes, such as total alkaloids, chlorine and solanesol, were key classification indexes for the eight ecological regions, while 36 chemical indexes, such as reducing sugars, total sugars and total nitrogen, were important classification indexes. Together, these 51 indexes constituted the characteristic indexes for the eight ecological regions. Combining SMOTE with the RF classification algorithm thus achieved high-quality, interpretable classification of samples from all eight ecological regions, despite imbalanced sample quantities, based on tobacco leaf chemical indexes.
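The balancing step described above can be illustrated with a minimal pure-Python sketch of the core SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest same-class neighbours. This is not the authors' implementation. The function name `smote_oversample`, the parameters `k` and `seed`, and the toy data are assumptions made for illustration, and the RF classifier is omitted. In practice, library implementations such as `SMOTE` in imbalanced-learn and `RandomForestClassifier` in scikit-learn would typically be used.

```python
import math
import random
from collections import Counter


def smote_oversample(X, y, k=5, seed=0):
    """Balance classes by interpolating between same-class neighbours (SMOTE idea)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # bring every class up to the largest class size
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if n == target or len(members) < 2:
            continue  # already balanced, or no neighbour to interpolate with
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours of the chosen sample (excluding itself)
            others = [m for m in members if m is not base]
            neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from base to mate
            X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: two "chemical indexes" per sample, three regions.
X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1], [1.3, 2.0], [0.8, 1.8],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.3],
     [9.0, 1.0], [9.2, 1.1]]
y = ["I"] * 6 + ["II"] * 3 + ["III"] * 2

X_bal, y_bal = smote_oversample(X, y, k=3)
print(Counter(y_bal))
```

After oversampling, each of the three toy regions has six samples. The original samples are kept unchanged at the front of the output, and every synthetic sample lies between two real samples of the same region.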