TENOR (Transcriptome ENcyclopedia Of Rice) in RAP-DB
We published the TENOR (Transcriptome ENcyclopedia Of Rice) database to provide gene expression profiles and nucleotide-level transcriptional activity on the rice genome, based on RNA-Seq data from 140 environmental stress and plant hormone treatment conditions (Kawahara Y. et al. 2016). In the six years since its release, a large amount of RNA-Seq data has accumulated in public databases. To provide more comprehensive transcriptome information on the rice genome, we obtained publicly available RNA-Seq data, curated the meta-information (sampling, experimental, and sequencing conditions) for each sample, and analyzed the data with a standardized analysis pipeline. Currently, gene expression profiles under 565 different experimental conditions and tissues are provided in the "Expression (TENOR)" section of each transcript annotation page (e.g. Os01t0911700-01), and the corresponding transcriptional activities are provided in JBrowse. The meta-information of the samples and the details of the analysis pipeline are given below.
Reference data
- Genome sequences [FASTA]
- IRGSP-1.0 genome (including organelle and unanchored contig sequences)
- Gene annotation for StringTie in GTF (as of 11 Nov 2021)
Analysis tools
- Java (JDK 1.8.0_191)
- Trimmomatic (v0.39)
- HISAT2 (v2.2.1)
- StringTie (v2.1.6)
- SAMtools (v1.13)
- BAMscale (v1.0)
Commands and parameters used in the workflow
- Preprocessing of Illumina paired-end reads
- Building the HISAT2 index of the genome
- Alignment of Illumina reads to the reference genome
- Calculation of gene abundance (TPM)
- Generation of BigWig data for JBrowse
$ java -jar trimmomatic-0.39.jar PE \
-phred33 read.r1.fastq.gz read.r2.fastq.gz \
read.pe.r1.fastq.gz read.se.r1.fastq.gz read.pe.r2.fastq.gz read.se.r2.fastq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:15 TRAILING:15 SLIDINGWINDOW:10:15 MINLEN:30
# concatenate GTF files of RAP-DB representative and predicted genes
$ cat IRGSP-1.0_representative_transcript_exon_2021-11-11.gtf \
IRGSP-1.0_predicted_transcript_exon_2021-11-11.gtf \
> all_transcripts_exon.gtf
# make splice site data
$ python extract_splice_sites.py all_transcripts_exon.gtf > ss.tab
# make exon position data
$ python extract_exons.py all_transcripts_exon.gtf > exon.tab
# make index for hisat2
$ hisat2-build --ss ss.tab --exon exon.tab \
IRGSP-1.0_genome_M_C_unanchored.fa IRGSP-1.0_genome_M_C_unanchored
$ hisat2 -x IRGSP-1.0_genome_M_C_unanchored \
--summary-file rnaseq_summary.stats --min-intronlen 20 --max-intronlen 10000 \
--dta --new-summary -1 read.pe.r1.fastq.gz -2 read.pe.r2.fastq.gz -S alignment.sam
$ samtools sort -o alignment.sort.bam alignment.sam
$ samtools index alignment.sort.bam
$ stringtie alignment.sort.bam -e -B \
-G all_transcripts_exon.gtf \
-o sample/sample.gtf -A rnaseq_abundance_sample.stats
To obtain the expression level of each transcript in each sample, TPM values were extracted from the GTF file (sample.gtf) output by StringTie.
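The TPM extraction step is described only in prose, so the following is a minimal Python sketch of one way it could be done. It is not the script used for TENOR. It assumes that each `transcript` row of the StringTie GTF carries `transcript_id` and `TPM` attributes in column 9, written as `key "value";` pairs. The function names `read_tpm` and `merge_samples`, the `NA` placeholder for missing values, and the file paths in the usage comment are hypothetical.

```python
import csv
import re
import sys

# Assumption: attributes in column 9 are written as  key "value";  pairs.
_ATTR = re.compile(r'(\S+) "([^"]*)";')


def read_tpm(gtf_lines):
    """Return {transcript_id: TPM} taken from the transcript rows of a StringTie GTF."""
    tpm = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # exon rows carry no TPM attribute
        attrs = dict(_ATTR.findall(fields[8]))
        if "transcript_id" in attrs and "TPM" in attrs:
            tpm[attrs["transcript_id"]] = float(attrs["TPM"])
    return tpm


def merge_samples(per_sample):
    """Turn {sample: {transcript_id: TPM}} into a header and rows (transcripts x samples)."""
    samples = sorted(per_sample)
    transcripts = sorted({t for values in per_sample.values() for t in values})
    header = ["transcript_id"] + samples
    # "NA" marks a transcript that is absent from a sample's GTF.
    rows = [[t] + [per_sample[s].get(t, "NA") for s in samples] for t in transcripts]
    return header, rows


if __name__ == "__main__":
    # Hypothetical usage: python extract_tpm.py sampleA/sampleA.gtf sampleB/sampleB.gtf
    per_sample = {}
    for path in sys.argv[1:]:
        with open(path) as handle:
            per_sample[path] = read_tpm(handle)
    header, rows = merge_samples(per_sample)
    writer = csv.writer(sys.stdout, delimiter="\t")
    writer.writerow(header)
    writer.writerows(rows)
```

Because StringTie is run with `-e` against the same reference GTF for every sample, each sample should report the same set of transcripts, so the `NA` fallback is expected to be rare.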
# for standard RNA-Seq data
$ BAMscale scale --operation rna --bam ./alignment.sort.bam
# for strand-specific RNA-Seq data
$ BAMscale scale --operation strandrna --bam ./alignment.sort.bam
