TENOR (Transcriptome ENcyclopedia Of Rice) in RAP-DB


We published the TENOR (Transcriptome ENcyclopedia Of Rice) database to provide gene expression profiles and nucleotide-level transcriptional activity on the rice genome, based on RNA-Seq data from 140 environmental stress and plant hormone treatment conditions (Kawahara Y. et al. 2016). In the six years since its release, a large amount of RNA-Seq data has accumulated in public databases. To provide more comprehensive transcriptome information on the rice genome, we obtained publicly available RNA-Seq data, curated the meta-information (sampling, experimental, and sequencing conditions) for each sample, and analyzed the data with a standardized analysis pipeline. Currently, gene expression profiles under 565 different experimental conditions and tissues are provided in the "Expression (TENOR)" section of each transcript annotation page (e.g. Os01t0911700-01), and the corresponding transcriptional activities are provided in JBrowse. The meta-information of the samples and the details of the analysis pipeline are given below.

Reference data

  • Genome sequences [FASTA]
    • IRGSP-1.0 genome (including organelle and unanchored contig sequences)
  • Gene annotation for StringTie in GTF (as of 11 Nov 2021)
    • RAP-DB representative genes [GTF]
    • RAP-DB predicted genes [GTF]

Analysis tools

  • Java (JDK 1.8.0_191)
  • Trimmomatic (v0.39)
  • HISAT2 (v2.2.1)
  • StringTie (v2.1.6)
  • SAMtools (v1.13)
  • BAMscale (v1.0)

Commands and parameters used in the workflow

  1. Preprocessing of Illumina paired-end reads

        $ java -jar trimmomatic-0.39.jar PE \
        -phred33 read.r1.fastq.gz read.r2.fastq.gz \
        read.pe.r1.fastq.gz read.se.r1.fastq.gz read.pe.r2.fastq.gz read.se.r2.fastq.gz \
        ILLUMINACLIP:adapters.fa:2:30:10 LEADING:15 TRAILING:15 SLIDINGWINDOW:10:15 MINLEN:30

  2. Building the genome index

        # concatenate GTF files of RAP-DB Rep. and Pred. genes
        $ cat IRGSP-1.0_representative_transcript_exon_2021-11-11.gtf \
        IRGSP-1.0_predicted_transcript_exon_2021-11-11.gtf \
        > all_transcripts_exon.gtf

        # make splice site data
        $ python extract_splice_sites.py all_transcripts_exon.gtf > ss.tab

        # make exon position data
        $ python extract_exons.py all_transcripts_exon.gtf > exon.tab

        # make index for hisat2
        $ hisat2-build --ss ss.tab --exon exon.tab \
        IRGSP-1.0_genome_M_C_unanchored.fa IRGSP-1.0_genome_M_C_unanchored

  3. Alignment of Illumina reads to the reference genome

        $ hisat2 -x IRGSP-1.0_genome_M_C_unanchored \
        --summary-file rnaseq_summary.stats --min-intronlen 20 --max-intronlen 10000 \
        --dta --new-summary -1 read.pe.r1.fastq.gz -2 read.pe.r2.fastq.gz -S alignment.sam
        $ samtools sort -o alignment.sort.bam alignment.sam
        $ samtools index alignment.sort.bam

  4. Calculation of gene abundance (TPM)

        $ stringtie alignment.sort.bam -e -B \
        -G all_transcripts_exon.gtf \
        -o sample/sample.gtf -A rnaseq_abundance_sample.stats

    To obtain the expression level of each transcript in each sample, TPM values were extracted from the GTF files (sample.gtf) output by StringTie.

  5. Generating BigWig data for JBrowse

        # for standard RNA-Seq data
        $ BAMscale scale --operation rna --bam ./alignment.sort.bam

        # for strand-specific RNA-Seq data
        $ BAMscale scale --operation strandrna --bam ./alignment.sort.bam
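
The workflow above says that TPM values were extracted from the StringTie GTF files, but does not show how. The following is a minimal sketch of one way this could be done, not the script actually used for TENOR. It assumes that each "transcript" line of sample.gtf carries transcript_id and TPM attributes in the usual key "value"; form. The function names parse_tpm and merge_samples are hypothetical and chosen only for illustration.

```python
import re
import sys
from typing import Dict, Iterable

# Matches GTF attribute pairs such as: transcript_id "Os01t0911700-01";
_ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_tpm(lines: Iterable[str]) -> Dict[str, float]:
    """Return {transcript_id: TPM} from the lines of a StringTie GTF file."""
    tpm_by_transcript: Dict[str, float] = {}
    for line in lines:
        if line.startswith("#"):  # skip StringTie header comments
            continue
        fields = line.rstrip("\n").split("\t")
        # Only "transcript" features are assumed to carry a TPM attribute.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(_ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "TPM" in attrs:
            tpm_by_transcript[attrs["transcript_id"]] = float(attrs["TPM"])
    return tpm_by_transcript


def merge_samples(per_sample: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Rearrange {sample: {transcript: TPM}} into {transcript: {sample: TPM}}."""
    matrix: Dict[str, Dict[str, float]] = {}
    for sample, tpms in per_sample.items():
        for transcript_id, tpm in tpms.items():
            matrix.setdefault(transcript_id, {})[sample] = tpm
    return matrix


if __name__ == "__main__":
    # Usage (hypothetical file layout): python extract_tpm.py sampleA/sampleA.gtf ...
    per_sample = {}
    for path in sys.argv[1:]:
        with open(path) as handle:
            per_sample[path] = parse_tpm(handle)
    for transcript_id, values in sorted(merge_samples(per_sample).items()):
        print(transcript_id, *(values.get(p, "NA") for p in sys.argv[1:]), sep="\t")
```

Run on several sample.gtf files, the script prints one tab-separated row per transcript with one TPM column per input file, and "NA" where a transcript is missing from a sample.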
    

Meta-information of RNA-Seq samples in TENOR of RAP-DB