Analysis workflow for genome-wide variations in TASUKE+ for RAP-DB
(version 2.0)
| Genome-wide variations were detected using publicly available genome resequencing data for various rice varieties and provided in the multiple genome browser TASUKE+. Here we provide the analysis workflow for the detection of variations. |
History
- 05/Feb/2026 TASUKE+ with 1358 rice varieties were released.
- 11/Mar/2022 Analysis workflow (version 2.0) and TASUKE+ with 685 rice varieties were released.
- 29/Aug/2019 Analysis workflow (version 1.0) and TASUKE+ with 533 rice varieties were released.
- 15/Nov/2017 TASUKE with 333 rice varieties were released.
Reference data
- Genome sequences [FASTA]
- IRGSP-1.0 genome (including organella and unanchored contig sequences)
- Gene annotation for snpEff [Annotation GTF] [Protein FASTA]
- RAP-DB (both representative and predicted genes as of 11 Nov 2021)
- MSU (all genes in RGAP 7)
- Illumina adapter sequences attached with the Trimmomatic program (TruSeq2 PE + TruSeq3 PE)
- Chromosome information (chromosome_list.csv) for TASUKE+
$ cat chromosome_list.csv chr01,43270923,16610866,17243770 chr02,35937250,13541821,13872411 chr03,36413819,19431743,19745569 chr04,35502694,9744480,9973218 chr05,29958434,12390387,12627019 chr06,31248787,15332004,15555636 chr07,29697621,11887856,12272916 chr08,28443022,12847483,13061068 chr09,23012720,2749793,3043847 chr10,23207287,8082722,8309866 chr11,29021106,12039480,12482616 chr12,27531856,11761737,12103486
chr01 chr02 chr03 chr04 chr05 chr06 chr07 chr08 chr09 chr10 chr11 chr12
SAMD00009825 variants_SAMD00009825.g.vcf.gz SAMD00010635 variants_SAMD00010635.g.vcf.gz SAMD00010636 variants_SAMD00010636.g.vcf.gz ... ...
SAMD00009825 SAMD00010635 SAMD00010636 ... ...
Analysis tools
- Java (JDK 1.8.0_191)
- FastQC v0.11.8
- BWA (bwa v0.7.17)
- SamTools (v1.9)
- BamTools (as of 4 Dec 2018)
- GATK (v4.0.11.0)/GATK (v4.2.4.0)
- Trimmomatic (v0.38)
- Picard (v2.18.17)
- SnpEff (v5.0e)
- TASUKE+ (version 20210831)
Commands and parameters used in the workflow
- Preprocessing of Illumina paired-end reads
- Making index of the genome
- Alignment of Illumina reads to the reference genome
- Make clean BAM
- Remove PCR duplicates
- Variant detection by GATK HaplotypeCaller
- import GVCFs into GenomicsDB
- Genotyping and filtering variants by GATK
- Run SnpEff
- Make depth data for TASUKE+
- Upload variation and depth data to TASUKE+
$ java -jar trimmomatic-0.38.jar PE \
-phred33 read.r1.fastq.gz read.r2.fastq.gz \
read.pe.r1.fastq.gz read.se.r1.fastq.gz read.pe.r2.fastq.gz read.se.r2.fastq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:30
$ bwa index genome.fa
$ samtools faidx genome.fa
$ java -jar picard.jar CreateSequenceDictionary \
REFERENCE=genome.fa \
OUTPUT=genome.dict
$ bwa mem -M genome.fa read.pe.r1.fastq.gz read.pe.r2.fastq.gz \
| samtools sort -o alignment.sort.bam -
$ samtools index alignment.sort.bam
$ java -jar picard.jar FastqToSam \
FASTQ=read.pe.r1.fastq.gz \
FASTQ2=read.pe.r2.fastq.gz \
OUTPUT=uBAM.bam \
READ_GROUP_NAME=${SAMPLE_ID} \
SAMPLE_NAME=${SAMPLE_ID} \
LIBRARY_NAME=${SAMPLE_ID} \
$ java -jar picard.jar MergeBamAlignment \
ALIGNED=alignment.sort.bam \
UNMAPPED=uBAM.bam \
OUTPUT=alignment.merge.bam \
REFERENCE_SEQUENCE=genome.fa
*User can specify any sample ID in the variable “${SAMPLE_ID}”.
$ java -jar picard.jar MarkDuplicates \
INPUT=alignment.merge.bam \
OUTPUT=alignment.rmdup.bam \
METRICS_FILE=rmdup.matrix \
REMOVE_DUPLICATES=true \
MAX_RECORDS_IN_RAM=1000000 \
TMP_DIR=./tmp
$ samtools index alignment.rmdup.bam
$ gatk HaplotypeCaller \
--input alignment.rmdup.bam \
--output variants.g.vcf.gz \
--reference genome.fa \
-max-alternate-alleles 2 \
--emit-ref-confidence GVCF
*Step1-6 were performed for each sample before Step7.
$ gatk GenomicsDBImport \ --genomicsdb-workspace-path gvcfs_db \ --sample-name-map sample.map \ --intervals IRGSP-1.0_12chrom.intervals
gatk GenotypeGVCFs \ --reference genome.fa \ --variant gendb://gvcfs_db \ --output variants.genotype.vcf.gz
* Before VariantFiltration, average and standard deviation of depth of coverage was calculated for specifying $DP_AVG_PLUS_5SD.
gatk VariantFiltration \ --reference genome.fa \ --variant variants.genotype.vcf.gz --output variants.genotype.filtered_SNP.vcf.gz \ --filter-expression "DP > $DP_AVG_PLUS_5SD*" --filter-name "DP" \ --filter-expression "QD < 2.0" --filter-name "QD" \ --filter-expression "MQ < 40.0" --filter-name "MQ" \ --filter-expression "FS > 60.0" --filter-name "FS" \ --filter-expression "SOR > 3.0" --filter-name "SOR" \ --filter-expression "MQRankSum < -12.5" --filter-name "MQRS" \ --filter-expression "ReadPosRankSum < -8.0" --filter-name "RPRS" gatk VariantFiltration \ --reference genome.fa \ --variant variants.genotype.vcf.gz \ --output variants.genotype.filtered_InDel.vcf.gz \ --filter-expression "DP > $DP_AVG_PLUS_5SD" --filter-name "DP" \ --filter-expression "QD < 2.0" --filter-name "QD" \ --filter-expression "FS > 200.0" --filter-name "FS" \ --filter-expression "SOR > 10.0" --filter-name "SOR" \ --filter-expression "ReadPosRankSum < -20.0" --filter-name "RPRS" gatk SelectVariants \ --reference genome.fa \ --variant variants.genotype.filtered_SNP.vcf.gz \ --output variants.genotype.selected_SNP.vcf.gz \ --exclude-filtered \ --select-type-to-include SNP \ --select-type-to-include MNP \ --sample-name include_sample.args gatk SelectVariants \ --reference genome.fa \ --variant variants.genotype.filtered_InDel.vcf.gz \ --output variants.genotype.selected_InDel.vcf.gz \ --exclude-filtered \ --select-type-to-include INDEL \ --select-type-to-include MIXED \ --sample-name include_sample.args java -jar picard.jar MergeVcfs \ --INPUT variants.genotype.selected_SNP.vcf.gz \ --INPUT variants.genotype.selected_InDel.vcf.gz \ --OUTPUT variants.varonly.vcf.gz
$ java -jar snpEff.jar build -gtf22 -v RAP_MSU_on_IRGSP-1.0
$ java -jar snpEff.jar \
-ud 1000 \
-v RAP_MSU_on_IRGSP-1.0 variants.varonly.vcf.gz | gzip -c > variants.snpEff.varonly.vcf.gz
*Reference genome sequence (FASTA), gene annotation data (GTF) and protein sequences of all genes (FASTA) should be placed in appropriate directories before running SnpEff.
$ perl tasuke_bamtodepth.pl \
-s samtools -c chromosome_list.csv -i alignment.rmdup.bam \
-o alignment_depth.tsv
The following two output files are uploaded to the TASUKE+ database
- variants.snpEff.varonly.vcf.gz
- alignment_depth.tsv
