Abstract:
To establish a robust and accurate model for discriminating the origin of flue-cured tobacco when sample numbers are imbalanced across multiple regions, tobacco strip samples aged for three years were selected from a specific company. The samples came from 12 domestic and international growing regions. The contents of 68 chemical components, the pH values, and the dichloromethane extract yield of the samples were obtained using near-infrared rapid chemical component analysis. The Particle Swarm Optimization (PSO) algorithm was used to optimize the kernel parameters of a hybrid-kernel Support Vector Machine (SVM), yielding an origin discrimination model for imbalanced multi-region flue-cured tobacco samples. This model was then compared with Backpropagation Neural Network (BPNN), Random Forest (RF), and Fisher Discriminant Analysis (FDA) models. The results showed that: 1) The hybrid-kernel SVM model effectively learned key features and discriminated samples from different regions with high accuracy. Its overall discrimination accuracies on the training set and test set reached 99.69% and 99.59%, respectively. 2) Compared with the BPNN, RF, and FDA models, the hybrid-kernel SVM model improved overall discrimination accuracy on the test set by 4.55, 6.20, and 6.61 percentage points, respectively. 3) When predicting samples from 12 regions with highly imbalanced sample numbers, the macro recall, macro precision, and macro F1 score of the hybrid-kernel SVM model were 0.9951, 0.9985, and 0.9968, respectively. Compared with the BPNN, RF, and FDA models, the macro recall increased by 0.2991, 0.3264, and 0.4065; the macro precision increased by 0.3476, 0.2913, and 0.4124; and the macro F1 score increased by 0.3241, 0.3094, and 0.4095, respectively. Overall, the hybrid-kernel SVM model outperformed the BPNN, RF, and FDA models in discriminating tobacco samples from multiple regions with imbalanced sample numbers.
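
To illustrate why macro-averaged metrics are reported alongside overall accuracy for imbalanced regions, the following minimal Python sketch computes macro recall, macro precision, and macro F1 on a toy example. The function name `macro_scores`, the region labels, and the data are hypothetical and are not taken from the study. The sketch also assumes macro F1 is the unweighted mean of the per-class F1 scores, which is one common convention; the abstract does not state which definition was used.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro_recall, macro_precision, macro_f1).

    Each class found in y_true contributes equally, regardless of how
    many samples it has. Macro F1 is taken as the mean of per-class F1
    (an assumption; other definitions exist).
    """
    labels = sorted(set(y_true))
    true_count = Counter(y_true)   # samples per true class
    pred_count = Counter(y_pred)   # samples per predicted class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    recalls, precisions, f1s = [], [], []
    for c in labels:
        r = hits[c] / true_count[c]
        # A class that is never predicted gets precision 0 by convention.
        p = hits[c] / pred_count[c] if pred_count[c] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        recalls.append(r)
        precisions.append(p)
        f1s.append(f)

    n = len(labels)
    return sum(recalls) / n, sum(precisions) / n, sum(f1s) / n


# Hypothetical toy data: region "A" has 8 samples, region "B" only 2,
# and the classifier assigns every sample to the majority region "A".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall, precision, f1 = macro_scores(y_true, y_pred)
print(accuracy, recall, precision, f1)
```

In this toy case the overall accuracy is 0.8 even though the minority region is never recognized, while the macro recall drops to 0.5. This is the kind of gap that macro-averaged metrics expose when sample numbers differ widely between regions.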