Abstract:
To select a suitable software pipeline for analyzing large-scale resequencing data of the tobacco genome, nine software pipelines were compared. Three standalone software packages (NGS QC Toolkit, Trimmomatic and ngsShoRT) were used to filter K326 genome sequencing data. The quality-filtered reads were mapped to the Hongda reference genome with two sequence aligners, BWA and Bowtie2. SAMtools, a variant calling tool, was then used to identify SNPs from the alignments of both aligners, and GATK was additionally used to call variants from the alignments generated by BWA. In total, nine independent VCF files containing SNPs and InDels were obtained. The outputs of the nine pipelines differed significantly, and the accuracy of SNP calling ranged from 55% to 71% across the pipelines. The Trimmomatic_BWA_SAMtools pipeline offered higher efficiency, easier operation and higher precision, and was therefore considered suitable for processing large-scale genomic resequencing data.
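The count of nine pipelines follows from combining the three filtering tools with three aligner and caller pairings. The sketch below enumerates them under one reading of the abstract: SAMtools is applied to both the BWA and Bowtie2 alignments, and GATK only to the BWA alignments. The pairing list, the `enumerate_pipelines` helper and the underscore naming scheme (modeled on "Trimmomatic_BWA_SAMtools") are illustrative assumptions, not the authors' code.

```python
from itertools import product

# Read-filtering tools named in the abstract.
FILTERS = ["NGSQCToolkit", "Trimmomatic", "ngsShoRT"]

# Assumed aligner/caller pairings: SAMtools on both aligners,
# GATK on the BWA alignments only (as the abstract suggests).
ALIGNER_CALLER_PAIRS = [
    ("BWA", "SAMtools"),
    ("Bowtie2", "SAMtools"),
    ("BWA", "GATK"),
]


def enumerate_pipelines():
    """Return one label per filter/aligner/caller combination."""
    return [
        f"{flt}_{aligner}_{caller}"
        for flt, (aligner, caller) in product(FILTERS, ALIGNER_CALLER_PAIRS)
    ]


pipelines = enumerate_pipelines()
for name in pipelines:
    print(name)
```

Each label corresponds to one of the independent VCF files described above, so the list can serve as a checklist when organizing output directories for a comparison of this kind.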