This workflow exemplifies the comparison of DNA expression levels in two experimental conditions from RNA-Seq data. It reimplements a study by Trapnell et al. 2012.

rna-seq_csDensity.png

Figure 1: Probability density of expression levels of transcribed genes in group C1 (blue) and group C2 (red).

Introduction

This workflow processes artificial data as produced in a typical RNA-Seq experiment. Given two conditions C1 and C2, messenger RNAs from both conditions are sequenced in order to identify differences in gene expression levels. From the differences in these expression levels one can infer a relationship between the genes exhibiting differing expression levels and the condition in question. In addition to quantifying expression levels this workflow can be used to identify new genes and splice variants.

Starting from a pair of RNA-Seq samples in FastQ format and the Drosophila melanogaster reference genome this workflow is centered around the tools TopHat for read mapping, Cufflinks for transcript assembly and quantification, and CummeRbund for visualization. The output of the workflow is a table characterizing differentially expressed genes and their splice variants. From this table a number of visualizations can be derived. Figures 1 and 2 exemplify such visualizations comparing expression levels over all identified gene variants. In Figure 3 the expression levels of one particular gene as well as its isoforms are compared. The remainder of this post covers Cuneiform implementations of the necessary computing steps and data dependencies constituting this workflow.

rna-seq_csScatter.png rna-seq_csVolcano.png

Figure 2: Scatter plot and volcano plot of expression levels of transcribed genes comparing groups C1 and C2. Points above the main diagonal of the scatter plot represent genes which are expressed higher in group C2.

rna-seq_regucalcin_expressionBarplot.png rna-seq_regucalcin_expressionBarplot_isoforms.png

Figure 3: Bar plots comparing the expression levels of the regucalcin gene and its isoforms in group C1 (blue) and C2 (red).

Task Definitions

This section of a workflow defines all tasks which can be used in the Workflow Definition section.

Utility Tasks

Utility tasks do generic data transformations like uncompressing archives or concatenating files.

gunzip

The gunzip task takes a gzipped file and decompresses it.

deftask gunzip( out( File ) : gz( File ) )in bash *{
  out=unzipped_${gz%.gz}
  gzip -c -d $gz > $out
}*

Next-Generation Sequencing Tasks

Next-Generation Sequencing tasks process sequencing data.

TopHat

Read mapping is performed using TopHat. Tophat implicitly uses Bowtie to align reads to a reference genome. Additionally, TopHat identifies transcript splice sites. The output is a BAM file containing annotated alignments which are used in downstream analysis. Note, that TopHat expects the basename of the reference genome file without the file extension fa or fasta. Thus, we create a symbolic link to the reference having a known basename.

deftask tophat-align(
    bam( File )
  : geneanno( File ) <idx( File )> fa( File )
    [fq1( File ) fq2( File )] )in bash *{

  idxname=${idx[0]%.?.ebwt}
  ln -sf $fa $idxname.fa
  tophat --bowtie1 -G $geneanno -o thout $idxname $fq1 $fq2
  bam=thout/accepted_hits.bam
}*

SAMtools

Creating an index over a BAM file.

deftask samtools-index( bai( File ) : bam( File ) )in bash *{
  bai=$bam.bai
  samtools index $bam
}*

The samtools-idxstats task collects statistics about an indexed BAM file. For each reference sequence region, the number of mapped and unmapped reads is reported. Note that SAMtools implicitly expects the alignment index to have the same name as the corresponding BAM file with the additional .bai extension. Thus, we create symbolic links to ensure that this naming convention is met.

deftask samtools-idxstats( idxstats( File ) : bam( File ) bai( File ) )in bash *{
  ln -sf $bai $bam.bai
  idxstats=idxstats.txt
  samtools idxstats $bam > $idxstats
}*

The Cufflinks package provides a number of useful operations to analyze RNA-Seq data. It creates a transcriptome assembly from alignments, reports differentially expressed genes, or annotates transcript data. Different tools from the Cufflinks package constitute the following four task definitions.

The cufflinks task consumes a set of reads aligned to a reference genome and attempts to reconstruct all transcripts from these alignments. The output of the cufflinks task is a transcriptome assembly.

deftask cufflinks( transcript( File ) : bam( File ) )in bash *{
  cufflinks -o clout $bam
  transcript=clout/transcripts.gtf
}*

The cuffmerge task consumes the transcriptome assemblies from multiple conditions and merges them together. Note that the cuffmerge command implicitly expects the reference index to have the same name as the reference FastA file with the additional .fai extension. Thus, we create symbolic links to ensure this naming convention is met.

deftask cuffmerge( merged( File ) : <transcript( File )> geneanno( File ) [fa( File ) fai( File )] )in bash *{
  ln -sf $fa genome.fa
  ln -sf $fai genome.fa.fai
  printf "%s\n" ${transcript[@]} > assemblies.txt
  cuffmerge -g $geneanno -s genome.fa -p 4 assemblies.txt
  merged=merged_asm/merged.gtf
}*

The cuffdiff task takes the merged transcriptomes of multiple conditions and reports differentially expressed genes and transcripts. Note that apart from the reference index naming convention already mentioned, the cuffdiff command expects the names of the BAM files as comma-separated lists. We construct these lists using the printf command in Bash.

deftask cuffdiff( diff( File ) : merged( File ) <bam1( File )> <bam2( File )> [fa( File ) fai( File )] )in bash *{

  b1=`printf ",%s" ${bam1[@]}`
  b1=${b1:1}

  b2=`printf ",%s" ${bam2[@]}`
  b2=${b2:1}

  ln -sf $fa genome.fa
  ln -sf $fai genome.fa.fai

  cuffdiff --no-update-check -p 4 \
  -o diff_out \
  -b genome.fa \
  -L C1,C2 \
  -u $merged \
  $b1 $b2

  diff=diff.tar
  tar cf $diff diff_out
}*

The cuffcompare task consumes a set of transcript assemblies and compares it with a given gene annotation database.

deftask cuffcompare( out( File ) : <gtf( File )> geneanno( File ) )in bash *{
  printf "%s\n" ${gtf[@]} > gtf_out_list.txt
  cuffcompare -i gtf_out_list.txt -r $geneanno

  out=comparison.csv
  for i in `ls *.tmap`
  do
    echo $i >> $out
    awk 'NR > 1 { s[$3]++ } END { for (j in s) { print j, s[j] }}' $i >> $out
  done
}*

CummeRbund

The differentially expressed genes reported in the cuffdiff task are stored in a table which can be further analyzed. CummeRbund is an R library for visualizing such reports. CummeRbund task produces a collection of publication-ready visualizations in pdf format.

The cummerbund task, produces visualizations summarizing the differences in overall gene expression levels in both conditions in the form of a density plot, volcano plot and scatter plot. Additionally, the expression levels of the regucalcin gene and its isoforms are separately compared.

deftask cummerbund(
    csdensity( File )
    scatter( File )
    volcano( File )
    regucalcin_expression( File )
    regucalcin_isoforms( File )
  : diff( File ) )in r *{

  # load libraries
  library(cummeRbund)

  # prepare data directory
  system( paste( "tar xf", diff ) )

  # prepare output directory
  res.out.dir <- "rescb"
  system( paste( "mkdir", res.out.dir ) )

  # create cummerbund database
  cuff_data <- readCufflinks( 'diff_out' )

  # plot distribution of expression levels
  csdensity <- paste( res.out.dir, "csDensity.pdf", sep="/" )
  pdf( csdensity )
  csDensity( genes( cuff_data) )
  dev.off()

  # compare the expression of each gene in both conditions in a scatter plot
  scatter <- paste( res.out.dir, "csScatter.pdf", sep="/" )
  pdf( scatter )
  csScatter( genes( cuff_data ), 'C1', 'C2' )
  dev.off()

  # compare differentially expressed genes in volcano plot
  volcano <- paste( res.out.dir, "csVolcano.pdf", sep="/" )
  pdf( volcano )
  csVolcano( genes( cuff_data ), 'C1', 'C2' )
  dev.off()

  # define gene of interest regucalcin
  mygene <- getGene( cuff_data, 'regucalcin' )

  # expression levels for gene of interest regucalcin
  regucalcin_expression <- paste( res.out.dir, "regucalcin_expressionBarplot.pdf", sep="/" )
  pdf( regucalcin_expression )
  expressionBarplot( mygene )
  dev.off()

  # individual isoform expression levels for gene of interest regucalcin
  regucalcin_isoforms <- paste( res.out.dir, "regucalcin_expressionBarplot_isoforms.pdf", sep="/" )
  pdf( regucalcin_isoforms )
  expressionBarplot( isoforms( mygene ) )
  dev.off()
}*

While in the previous cummerbund task we used the CummeRbund library to generate visualizations, we can use the same library to collect collect summarizing information about the RNA-Seq experiment. The cummerbund2 task counts the number of significant genes identified, the number of splicing isoforms, promoters and other informtation.

deftask cummerbund2( diff_genes( File ) nsig_gene nsig_isoform nsig_tss nsig_cds nsig_promoter nsig_splicing nsig_relCDS : diff( File ) )in r *{

  # load libraries
  library(cummeRbund)

  # prepare data directory
  system( paste( "tar xf", diff ) )

  # prepare output directory
  res.out.dir <- "rescb"
  system( paste( "mkdir", res.out.dir ) )

  # create cummerbund database
  cuff_data <- readCufflinks( 'diff_out' )

  gene_diff_data <- diffData( genes( cuff_data ) )
  sig_gene_data <- subset( gene_diff_data, ( significant == 'yes' ) )
  nsig_gene = nrow( sig_gene_data )

  isoform_diff_data <- diffData( isoforms( cuff_data ), 'C1', 'C2' )
  sig_isoform_data <- subset( isoform_diff_data, ( significant == 'yes' ) )
  nsig_isoform = nrow( sig_isoform_data )

  tss_diff_data <- diffData( TSS( cuff_data ), 'C1', 'C2' )
  sig_tss_data <- subset( tss_diff_data, ( significant == 'yes' ) )
  nsig_tss = nrow( sig_tss_data )

  cds_diff_data <- diffData( CDS( cuff_data ), 'C1', 'C2' )
  sig_cds_data <- subset( cds_diff_data, ( significant == 'yes' ) )
  nsig_cds = nrow( sig_cds_data )

  promoter_diff_data <- distValues( promoters( cuff_data ) )
  sig_promoter_data <- subset( promoter_diff_data, ( significant == 'yes' ) )
  nsig_promoter = nrow( sig_promoter_data )

  splicing_diff_data <- distValues( splicing( cuff_data ) )
  sig_splicing_data <- subset( splicing_diff_data, ( significant == 'yes' ) )
  nsig_splicing = nrow( sig_splicing_data )

  relCDS_diff_data <- distValues(relCDS( cuff_data ) )
  sig_relCDS_data <- subset( relCDS_diff_data, ( significant == 'yes' ) )
  nsig_relCDS = nrow( sig_relCDS_data )

  gene_diff_data <- diffData( genes( cuff_data ) )
  sig_gene_data <- subset( gene_diff_data, ( significant == 'yes' ) )
  diff_genes = 'diff_genes.txt'
  write.table( sig_gene_data, diff_genes, sep='\t', row.names = F, col.names = T, quote = F )
}*

Input Data

This workflow uses the Ensembl BDGP6 version of the Drosophila melanogaster reference genome as well as artificial sequence reads provided with the original publication. The reference genome is bundled with a gene annotation database as well as a SAMtools and Bowtie index which, alternatively, could be built from the reference genome.

geneanno = "BDGP6/Annotation/Genes/genes.gtf";

fa = "BDGP6/Sequence/WholeGenomeFasta/genome.fa";
fai = "BDGP6/Sequence/WholeGenomeFasta/genome.fa.fai";

idx = "BDGP6/Sequence/BowtieIndex/genome.1.ebwt" "BDGP6/Sequence/BowtieIndex/genome.2.ebwt"
      "BDGP6/Sequence/BowtieIndex/genome.3.ebwt" "BDGP6/Sequence/BowtieIndex/genome.4.ebwt"
      "BDGP6/Sequence/BowtieIndex/genome.rev.1.ebwt" "BDGP6/Sequence/BowtieIndex/genome.rev.2.ebwt";

fq-c1-1 = "fastq/GSM794483_C1_R1_1.fq"
          "fastq/GSM794484_C1_R2_1.fq"
          "fastq/GSM794485_C1_R3_1.fq";

fq-c1-2 = "fastq/GSM794483_C1_R1_2.fq"
          "fastq/GSM794484_C1_R2_2.fq"
          "fastq/GSM794485_C1_R3_2.fq";

fq-c2-1 = "fastq/GSM794486_C2_R1_1.fq"
          "fastq/GSM794487_C2_R2_1.fq"
          "fastq/GSM794488_C2_R3_1.fq";

fq-c2-2 = "fastq/GSM794486_C2_R1_2.fq"
          "fastq/GSM794487_C2_R2_2.fq"
          "fastq/GSM794488_C2_R3_2.fq";

Workflow Definition

The workflow determines data dependencies between task applications. Note that the order in which we write the task application in the Cuneiform script does not necessarily reflect the order in which these tasks are applied. Only data dependencies constrain the execution order.

First, we call TopHat on the FastQ reads for both conditions independently. As a result we obtain two BAM files, one for each condition.

bam-c1 = tophat-align(
           geneanno: geneanno,
           idx: idx,
           fa:  fa,
           fq1: fq-c1-1,
           fq2: fq-c1-2 );
    
bam-c2 = tophat-align(
           geneanno: geneanno,
           idx: idx,
           fa:  fa,
           fq1: fq-c2-1,
           fq2: fq-c2-2 );

Next, we call Cufflinks to generate the transcriptome assembly from the alignments, again for both conditions separately.

transcript-c1 = cufflinks( bam: bam-c1 );
transcript-c2 = cufflinks( bam: bam-c2 );

Prior to identifying differentially expressed genes with cuffdiff, we need to merge the transcripts for both conditions into a single file.

merged = cuffmerge(
          transcript: transcript-c1 transcript-c2,
          geneanno: geneanno,
          fa: fa,
          fai: fai );

After merging, the transcriptome data is ready to be analyzed with cuffdiff. It produces a table of differentially expressed genes.

diff = cuffdiff(
         merged: merged,
         bam1: bam-c1,
         bam2: bam-c2,
         fa: fa,
         fai: fai );

Visualizations are produced using CummeRbund. The figures presented in the Introduction section were produced in this step.

csdensity scatter volcano regucalcin_expression regucalcin_isoforms = cummerbund( diff: diff );

Independent from the analysis with cuffdiff, we gather statistics about the quality of the read mapping using samtools-idxstats on the pair of indexed BAM files.

bam = bam-c1 bam-c2;
bai = samtools-index( bam: bam );
stats = samtools-idxstats( bam: bam, bai: bai );

The transcriptome data produced in the cufflinks step can, furthermore be annotated using cuffcompare.

comparison = cuffcompare( gtf: transcript-c1 transcript-c2, geneanno: geneanno );

In a second application of the CummeRbund library we count the number of significant genes, isoforms, promoters, etc.

diff_genes nsig_gene nsig_isoform nsig_tss nsig_cds nsig_promoter nsig_splicing nsig_relCDS = cummerbund2( diff: diff );

Query

The worklfow query is where we determine which data we’re interested in. Only computing steps that contribute to the query are actually performed.

We gather the density, scatter and volcano plots as well as the regucalcin expression levels and its isoforms from the cummerbund application. The stats provided in samtools-idxstats help us judge whether read mapping performed reasonably well. Furthermore, we query the comparison produced in cuffcompare and the table of differential gene expressions. Eventually, we collect the summarizing info produced in cummerbund2.

csdensity scatter volcano regucalcin_expression regucalcin_isoforms stats
comparison diff_genes nsig_gene nsig_isoform nsig_tss nsig_cds nsig_promoter
nsig_splicing nsig_relCDS;