Analysis workflow for genome-wide variations in TASUKE+ for RAP-DB

(version 2.0)

Genome-wide variations were detected using publicly available genome resequencing data for various rice varieties and provided in the multiple genome browser TASUKE+. Here we provide the analysis workflow for the detection of variations.

History

Reference data

  • Genome sequences [FASTA]
    • IRGSP-1.0 genome (including organella and unanchored contig sequences)
  • Gene annotation for snpEff [Annotation GTF] [Protein FASTA]
    • RAP-DB (both representative and predicted genes as of 11 Nov 2021)
    • MSU (all genes in RGAP 7)
  • Illumina adapter sequences attached with the Trimmomatic program (TruSeq2 PE + TruSeq3 PE)
  • Chromosome information (chromosome_list.csv) for TASUKE+
  • $ cat chromosome_list.csv
    chr01,43270923,16610866,17243770
    chr02,35937250,13541821,13872411
    chr03,36413819,19431743,19745569
    chr04,35502694,9744480,9973218
    chr05,29958434,12390387,12627019
    chr06,31248787,15332004,15555636
    chr07,29697621,11887856,12272916
    chr08,28443022,12847483,13061068
    chr09,23012720,2749793,3043847
    chr10,23207287,8082722,8309866
    chr11,29021106,12039480,12482616
    chr12,27531856,11761737,12103486
    
  • Interval file (IRGSP-1.0_12chrom.intervals) for GATK
  • chr01
    chr02
    chr03
    chr04
    chr05
    chr06
    chr07
    chr08
    chr09
    chr10
    chr11
    chr12
    
  • Sample map file (sample.map) for GATK. Sample ID and path to GVCF are separeted by a TAB
  • SAMD00009825    variants_SAMD00009825.g.vcf.gz
    SAMD00010635    variants_SAMD00010635.g.vcf.gz
    SAMD00010636    variants_SAMD00010636.g.vcf.gz
    ...
    ...
    
  • Sample ID list (include_sample.args) for GATK.
  • SAMD00009825
    SAMD00010635
    SAMD00010636
    ...
    ...
    

Analysis tools

  • Java (JDK 1.8.0_191)
  • FastQC v0.11.8
  • BWA (bwa v0.7.17)
  • SamTools (v1.9)
  • BamTools (as of 4 Dec 2018)
  • GATK (v4.0.11.0)/GATK (v4.2.4.0)
  • Trimmomatic (v0.38)
  • Picard (v2.18.17)
  • SnpEff (v5.0e)
  • TASUKE+ (version 20210831)

Commands and parameters used in the workflow

  1. Preprocessing of Illumina paired-end reads
  2. $ java -jar trimmomatic-0.38.jar PE \
        -phred33 read.r1.fastq.gz read.r2.fastq.gz \
        read.pe.r1.fastq.gz read.se.r1.fastq.gz read.pe.r2.fastq.gz read.se.r2.fastq.gz \
        ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:30
    
  3. Making index of the genome
  4. $ bwa index genome.fa
    $ samtools faidx genome.fa
    $ java -jar picard.jar CreateSequenceDictionary \
        REFERENCE=genome.fa \
        OUTPUT=genome.dict
    
  5. Alignment of Illumina reads to the reference genome
  6. $ bwa mem -M genome.fa read.pe.r1.fastq.gz read.pe.r2.fastq.gz \
        | samtools sort -o alignment.sort.bam -
    $ samtools index alignment.sort.bam
    
  7. Make clean BAM
  8. $ java -jar picard.jar FastqToSam \
        FASTQ=read.pe.r1.fastq.gz \
        FASTQ2=read.pe.r2.fastq.gz \
        OUTPUT=uBAM.bam \
        READ_GROUP_NAME=${SAMPLE_ID} \
        SAMPLE_NAME=${SAMPLE_ID} \
        LIBRARY_NAME=${SAMPLE_ID} \
    
    $ java -jar picard.jar MergeBamAlignment \
        ALIGNED=alignment.sort.bam \
        UNMAPPED=uBAM.bam \
        OUTPUT=alignment.merge.bam \
        REFERENCE_SEQUENCE=genome.fa
    

    *User can specify any sample ID in the variable “${SAMPLE_ID}”.

  9. Remove PCR duplicates
  10. $ java -jar picard.jar MarkDuplicates \
        INPUT=alignment.merge.bam \
        OUTPUT=alignment.rmdup.bam \
        METRICS_FILE=rmdup.matrix \
        REMOVE_DUPLICATES=true \
        MAX_RECORDS_IN_RAM=1000000 \
        TMP_DIR=./tmp
    
    $ samtools index alignment.rmdup.bam
    
  11. Variant detection by GATK HaplotypeCaller
  12. $ gatk HaplotypeCaller \
        --input alignment.rmdup.bam \
        --output variants.g.vcf.gz \
        --reference genome.fa \
        -max-alternate-alleles 2 \
        --emit-ref-confidence GVCF
    

    *Step1-6 were performed for each sample before Step7.

  13. import GVCFs into GenomicsDB
  14. $ gatk GenomicsDBImport \
      --genomicsdb-workspace-path gvcfs_db \
      --sample-name-map sample.map \
      --intervals IRGSP-1.0_12chrom.intervals
    
  15. Genotyping and filtering variants by GATK
  16. gatk GenotypeGVCFs \
      --reference genome.fa \
      --variant gendb://gvcfs_db \
      --output variants.genotype.vcf.gz
    

    * Before VariantFiltration, average and standard deviation of depth of coverage was calculated for specifying $DP_AVG_PLUS_5SD.

    gatk VariantFiltration \
      --reference genome.fa \
      --variant variants.genotype.vcf.gz
      --output variants.genotype.filtered_SNP.vcf.gz \
      --filter-expression "DP > $DP_AVG_PLUS_5SD*" --filter-name "DP" \
      --filter-expression "QD < 2.0" --filter-name "QD" \
      --filter-expression "MQ < 40.0" --filter-name "MQ" \
      --filter-expression "FS > 60.0" --filter-name "FS" \
      --filter-expression "SOR > 3.0" --filter-name "SOR" \
      --filter-expression "MQRankSum < -12.5" --filter-name "MQRS" \
      --filter-expression "ReadPosRankSum < -8.0" --filter-name "RPRS"
    
    gatk VariantFiltration \
      --reference genome.fa \
      --variant variants.genotype.vcf.gz \
      --output variants.genotype.filtered_InDel.vcf.gz \
      --filter-expression "DP > $DP_AVG_PLUS_5SD" --filter-name "DP" \
      --filter-expression "QD < 2.0" --filter-name "QD" \
      --filter-expression "FS > 200.0" --filter-name "FS" \
      --filter-expression "SOR > 10.0" --filter-name "SOR" \
      --filter-expression "ReadPosRankSum < -20.0" --filter-name "RPRS"
    
    gatk SelectVariants \
      --reference genome.fa \
      --variant variants.genotype.filtered_SNP.vcf.gz \
      --output variants.genotype.selected_SNP.vcf.gz \
      --exclude-filtered \
      --select-type-to-include SNP \
      --select-type-to-include MNP \
      --sample-name include_sample.args
    
    gatk SelectVariants \
      --reference genome.fa \
      --variant variants.genotype.filtered_InDel.vcf.gz \
      --output variants.genotype.selected_InDel.vcf.gz \
      --exclude-filtered \
      --select-type-to-include INDEL \
      --select-type-to-include MIXED \
      --sample-name include_sample.args
    
    java -jar picard.jar MergeVcfs \
      --INPUT variants.genotype.selected_SNP.vcf.gz \
      --INPUT variants.genotype.selected_InDel.vcf.gz \
      --OUTPUT variants.varonly.vcf.gz
    
  17. Run SnpEff
  18. $ java -jar snpEff.jar build -gtf22 -v RAP_MSU_on_IRGSP-1.0
    
    $ java -jar snpEff.jar \
        -ud 1000 \
        -v RAP_MSU_on_IRGSP-1.0 variants.varonly.vcf.gz | gzip -c > variants.snpEff.varonly.vcf.gz
    

    *Reference genome sequence (FASTA), gene annotation data (GTF) and protein sequences of all genes (FASTA) should be placed in appropriate directories before running SnpEff.

  19. Make depth data for TASUKE+
  20. $ perl tasuke_bamtodepth.pl \
        -s samtools -c chromosome_list.csv -i alignment.rmdup.bam \
        -o alignment_depth.tsv
    
  21. Upload variation and depth data to TASUKE+

The following two output files are uploaded to the TASUKE+ database

  • variants.snpEff.varonly.vcf.gz
  • alignment_depth.tsv

Information of rice varieties and sequence data provided in TASUKE+