This workflow exemplifies SNP and indel calling on RNAseq data using the GATK Pipeline.

Introduction

This workflow maps RNA-Seq data to a reference and produces high-quality variant calls. It is split in two phases, where the first phase transforms the raw input data to make it usable for analysis with the Genome Analysis Toolkit (GATK) in the second phase to produce variant calls.

Starting from a pair of RNA-Seq samples in FastQ format and the HG38 reference genome this workflow is centered around the tools STAR for read mapping, Picard-Tools to add or replace read groups, sort, mark duplicates, and create an index, and GATK to reassign mapping qualities, indel realignment, base recalibration and, finally, variant calling and filtering.

The remainder of this post covers Cuneiform implementations of the necessary computing steps and data dependencies constituting this workflow.

Task Definitions

This section of a workflow defines all tasks which can be used in the Workflow Definition section. Each step creates a *.step file to help navigation within the work directory if the need arises.

Utility Tasks

Utility tasks do generic data transformations like uncompressing archives or concatenating files.

gunzip

The gunzip task takes a gzipped file and decompresses it.

deftask gunzip( decompressed( File ) : [gz( File ) name] ) in bash *{    
    touch gunzip.step
    decompressed=$name
    gzip -c -d $gz > $decompressed
}*

Mapping to the reference

The first steps align the samples to the reference data.

STAR

Read mapping is performed using STAR. This is done using a 2-pass alignment, consisting of 4 individual steps. The first task, prepareAlignment, only needs the reference file to generate genome indexes and is therefore the same for each pair of fastq-files. The second task is a pipeline of 3 steps - mapping the reads to the genomes, building new genome indexes and mapping the reads again against these new indexes.

deftask prepareAlignment( genomeDir : reference( File ) threads ) in bash *{
    src=$(pwd)
    touch twoPass.$reference.step
    genomeDir=$src/hg19_1pass
    mkdir $genomeDir
    # building genome index from reference
    STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles $reference --runThreadN $threads
}*


deftask twoPassAlignment( sam(File) : reference( File ) [mate1( File ) mate2( File )] threads genDir ) in bash *{
    src=$(pwd)
    touch twoPass.$mate1.step
    
    pass1path=$src/pass1
	mkdir $pass1path
	cd $pass1path
	STAR --genomeDir $genDir --readFilesIn $src/$mate1 $src/$mate2 --runThreadN $threads
	
    cd ..
	gDir=$src/hg19_2pass
	mkdir $gDir
	STAR --runMode genomeGenerate --genomeDir $gDir --genomeFastaFiles $src/$reference \
		--sjdbFileChrStartEnd $pass1path/SJ.out.tab --sjdbOverhang 75 --runThreadN $threads
	
	runDir=$src/pass2
	mkdir $runDir
	cd $runDir
	STAR --genomeDir $gDir --readFilesIn $src/$mate1 $src/$mate2 --runThreadN $threads
	#produces a SAM file, which we then put through the usual Picard processing steps
    sam=out.sam
    cp Aligned.out.sam $src/out.sam
}*

SAMtools

Creating an index over a BAM file.

deftask samtools-index( bai( File ) : bam( File ) )in bash *{
  bai=$bam.bai
  samtools index $bam
}*

The samtools-idxstats task collects statistics about an indexed BAM file. For each reference sequence region, the number of mapped and unmapped reads is reported. Note that SAMtools implicitly expects the alignment index to have the same name as the corresponding BAM file with the additional .bai extension. Thus, we create symbolic links to ensure that this naming convention is met.

deftask samtools-idxstats( idxstats( File ) : bam( File ) bai( File ) )in bash *{
  ln -sf $bai $bam.bai
  idxstats=idxstats.txt
  samtools idxstats $bam > $idxstats
}*

The Cufflinks package provides a number of useful operations to analyze RNA-Seq data. It creates a transcriptome assembly from alignments, reports differentially expressed genes, or annotates transcript data. Different tools from the Cufflinks package constitute the following four task definitions.

The cufflinks task consumes a set of reads aligned to a reference genome and attempts to reconstruct all transcripts from these alignments. The output of the cufflinks task is a transcriptome assembly.

deftask cufflinks( transcript( File ) : bam( File ) )in bash *{
  cufflinks -o clout $bam
  transcript=clout/transcripts.gtf
}*

The cuffmerge task consumes the transcriptome assemblies from multiple conditions and merges them together. Note that the cuffmerge command implicitly expects the reference index to have the same name as the reference FastA file with the additional .fai extension. Thus, we create symbolic links to ensure this naming convention is met.

deftask cuffmerge( merged( File ) : <transcript( File )> geneanno( File ) [fa( File ) fai( File )] )in bash *{
  ln -sf $fa genome.fa
  ln -sf $fai genome.fa.fai
  printf "%s\n" ${transcript[@]} > assemblies.txt
  cuffmerge -g $geneanno -s genome.fa -p 4 assemblies.txt
  merged=merged_asm/merged.gtf
}*

The cuffdiff task takes the merged transcriptomes of multiple conditions and reports differentially expressed genes and transcripts. Note that apart from the reference index naming convention already mentioned, the cuffdiff command expects the names of the BAM files as comma-separated lists. We construct these lists using the printf command in Bash.

deftask cuffdiff( diff( File ) : merged( File ) <bam1( File )> <bam2( File )> [fa( File ) fai( File )] )in bash *{

  b1=`printf ",%s" ${bam1[@]}`
  b1=${b1:1}

  b2=`printf ",%s" ${bam2[@]}`
  b2=${b2:1}

  ln -sf $fa genome.fa
  ln -sf $fai genome.fa.fai

  cuffdiff --no-update-check -p 4 \
  -o diff_out \
  -b genome.fa \
  -L C1,C2 \
  -u $merged \
  $b1 $b2

  diff=diff.tar
  tar cf $diff diff_out
}*

The cuffcompare task consumes a set of transcript assemblies and compares it with a given gene annotation database.

deftask cuffcompare( out( File ) : <gtf( File )> geneanno( File ) )in bash *{
  printf "%s\n" ${gtf[@]} > gtf_out_list.txt
  cuffcompare -i gtf_out_list.txt -r $geneanno

  out=comparison.csv
  for i in `ls *.tmap`
  do
    echo $i >> $out
    awk 'NR > 1 { s[$3]++ } END { for (j in s) { print j, s[j] }}' $i >> $out
  done
}*

CummeRbund

The differentially expressed genes reported in the cuffdiff task are stored in a table which can be further analyzed. CummeRbund is an R library for visualizing such reports. CummeRbund task produces a collection of publication-ready visualizations in pdf format.

The cummerbund task, produces visualizations summarizing the differences in overall gene expression levels in both conditions in the form of a density plot, volcano plot and scatter plot. Additionally, the expression levels of the regucalcin gene and its isoforms are separately compared.

deftask cummerbund(
    csdensity( File )
    scatter( File )
    volcano( File )
    regucalcin_expression( File )
    regucalcin_isoforms( File )
  : diff( File ) )in r *{

  # load libraries
  library(cummeRbund)

  # prepare data directory
  system( paste( "tar xf", diff ) )

  # prepare output directory
  res.out.dir <- "rescb"
  system( paste( "mkdir", res.out.dir ) )

  # create cummerbund database
  cuff_data <- readCufflinks( 'diff_out' )

  # plot distribution of expression levels
  csdensity <- paste( res.out.dir, "csDensity.pdf", sep="/" )
  pdf( csdensity )
  csDensity( genes( cuff_data) )
  dev.off()

  # compare the expression of each gene in both conditions in a scatter plot
  scatter <- paste( res.out.dir, "csScatter.pdf", sep="/" )
  pdf( scatter )
  csScatter( genes( cuff_data ), 'C1', 'C2' )
  dev.off()

  # compare differentially expressed genes in volcano plot
  volcano <- paste( res.out.dir, "csVolcano.pdf", sep="/" )
  pdf( volcano )
  csVolcano( genes( cuff_data ), 'C1', 'C2' )
  dev.off()

  # define gene of interest regucalcin
  mygene <- getGene( cuff_data, 'regucalcin' )

  # expression levels for gene of interest regucalcin
  regucalcin_expression <- paste( res.out.dir, "regucalcin_expressionBarplot.pdf", sep="/" )
  pdf( regucalcin_expression )
  expressionBarplot( mygene )
  dev.off()

  # individual isoform expression levels for gene of interest regucalcin
  regucalcin_isoforms <- paste( res.out.dir, "regucalcin_expressionBarplot_isoforms.pdf", sep="/" )
  pdf( regucalcin_isoforms )
  expressionBarplot( isoforms( mygene ) )
  dev.off()
}*

While in the previous cummerbund task we used the CummeRbund library to generate visualizations, we can use the same library to collect collect summarizing information about the RNA-Seq experiment. The cummerbund2 task counts the number of significant genes identified, the number of splicing isoforms, promoters and other informtation.

deftask cummerbund2( diff_genes( File ) nsig_gene nsig_isoform nsig_tss nsig_cds nsig_promoter nsig_splicing nsig_relCDS : diff( File ) )in r *{

  # load libraries
  library(cummeRbund)

  # prepare data directory
  system( paste( "tar xf", diff ) )

  # prepare output directory
  res.out.dir <- "rescb"
  system( paste( "mkdir", res.out.dir ) )

  # create cummerbund database
  cuff_data <- readCufflinks( 'diff_out' )

  gene_diff_data <- diffData( genes( cuff_data ) )
  sig_gene_data <- subset( gene_diff_data, ( significant == 'yes' ) )
  nsig_gene = nrow( sig_gene_data )

  isoform_diff_data <- diffData( isoforms( cuff_data ), 'C1', 'C2' )
  sig_isoform_data <- subset( isoform_diff_data, ( significant == 'yes' ) )
  nsig_isoform = nrow( sig_isoform_data )

  tss_diff_data <- diffData( TSS( cuff_data ), 'C1', 'C2' )
  sig_tss_data <- subset( tss_diff_data, ( significant == 'yes' ) )
  nsig_tss = nrow( sig_tss_data )

  cds_diff_data <- diffData( CDS( cuff_data ), 'C1', 'C2' )
  sig_cds_data <- subset( cds_diff_data, ( significant == 'yes' ) )
  nsig_cds = nrow( sig_cds_data )

  promoter_diff_data <- distValues( promoters( cuff_data ) )
  sig_promoter_data <- subset( promoter_diff_data, ( significant == 'yes' ) )
  nsig_promoter = nrow( sig_promoter_data )

  splicing_diff_data <- distValues( splicing( cuff_data ) )
  sig_splicing_data <- subset( splicing_diff_data, ( significant == 'yes' ) )
  nsig_splicing = nrow( sig_splicing_data )

  relCDS_diff_data <- distValues(relCDS( cuff_data ) )
  sig_relCDS_data <- subset( relCDS_diff_data, ( significant == 'yes' ) )
  nsig_relCDS = nrow( sig_relCDS_data )

  gene_diff_data <- diffData( genes( cuff_data ) )
  sig_gene_data <- subset( gene_diff_data, ( significant == 'yes' ) )
  diff_genes = 'diff_genes.txt'
  write.table( sig_gene_data, diff_genes, sep='\t', row.names = F, col.names = T, quote = F )
}*

Input Data

This workflow uses the Ensembl BDGP6 version of the Drosophila melanogaster reference genome as well as artificial sequence reads provided with the original publication. The reference genome is bundled with a gene annotation database as well as a SAMtools and Bowtie index which, alternatively, could be built from the reference genome.

geneanno = "BDGP6/Annotation/Genes/genes.gtf";

fa = "BDGP6/Sequence/WholeGenomeFasta/genome.fa";
fai = "BDGP6/Sequence/WholeGenomeFasta/genome.fa.fai";

idx = "BDGP6/Sequence/BowtieIndex/genome.1.ebwt" "BDGP6/Sequence/BowtieIndex/genome.2.ebwt"
      "BDGP6/Sequence/BowtieIndex/genome.3.ebwt" "BDGP6/Sequence/BowtieIndex/genome.4.ebwt"
      "BDGP6/Sequence/BowtieIndex/genome.rev.1.ebwt" "BDGP6/Sequence/BowtieIndex/genome.rev.2.ebwt";

fq-c1-1 = "fastq/GSM794483_C1_R1_1.fq"
          "fastq/GSM794484_C1_R2_1.fq"
          "fastq/GSM794485_C1_R3_1.fq";

fq-c1-2 = "fastq/GSM794483_C1_R1_2.fq"
          "fastq/GSM794484_C1_R2_2.fq"
          "fastq/GSM794485_C1_R3_2.fq";

fq-c2-1 = "fastq/GSM794486_C2_R1_1.fq"
          "fastq/GSM794487_C2_R2_1.fq"
          "fastq/GSM794488_C2_R3_1.fq";

fq-c2-2 = "fastq/GSM794486_C2_R1_2.fq"
          "fastq/GSM794487_C2_R2_2.fq"
          "fastq/GSM794488_C2_R3_2.fq";

Workflow Definition

The workflow determines data dependencies between task applications. Note that the order in which we write the task application in the Cuneiform script does not necessarily reflect the order in which these tasks are applied. Only data dependencies constrain the execution order.

First, we call TopHat on the FastQ reads for both conditions independently. As a result we obtain two BAM files, one for each condition.

bam-c1 = tophat-align(
           geneanno: geneanno,
           idx: idx,
           fa:  fa,
           fq1: fq-c1-1,
           fq2: fq-c1-2 );
    
bam-c2 = tophat-align(
           geneanno: geneanno,
           idx: idx,
           fa:  fa,
           fq1: fq-c2-1,
           fq2: fq-c2-2 );

Next, we call Cufflinks to generate the transcriptome assembly from the alignments, again for both conditions separately.

transcript-c1 = cufflinks( bam: bam-c1 );
transcript-c2 = cufflinks( bam: bam-c2 );

Prior to identifying differentially expressed genes with cuffdiff, we need to merge the transcripts for both conditions into a single file.

merged = cuffmerge(
          transcript: transcript-c1 transcript-c2,
          geneanno: geneanno,
          fa: fa,
          fai: fai );

After merging, the transcriptome data is ready to be analyzed with cuffdiff. It produces a table of differentially expressed genes.

diff = cuffdiff(
         merged: merged,
         bam1: bam-c1,
         bam2: bam-c2,
         fa: fa,
         fai: fai );

Visualizations are produced using CummeRbund. The figures presented in the Introduction section were produced in this step.

csdensity scatter volcano regucalcin_expression regucalcin_isoforms = cummerbund( diff: diff );

Independent from the analysis with cuffdiff, we gather statistics about the quality of the read mapping using samtools-idxstats on the pair of indexed BAM files.

bam = bam-c1 bam-c2;
bai = samtools-index( bam: bam );
stats = samtools-idxstats( bam: bam, bai: bai );

The transcriptome data produced in the cufflinks step can, furthermore be annotated using cuffcompare.

comparison = cuffcompare( gtf: transcript-c1 transcript-c2, geneanno: geneanno );

In a second application of the CummeRbund library we count the number of significant genes, isoforms, promoters, etc.

diff_genes nsig_gene nsig_isoform nsig_tss nsig_cds nsig_promoter nsig_splicing nsig_relCDS = cummerbund2( diff: diff );

Query

The worklfow query is where we determine which data we’re interested in. Only computing steps that contribute to the query are actually performed.

We gather the density, scatter and volcano plots as well as the regucalcin expression levels and its isoforms from the cummerbund application. The stats provided in samtools-idxstats help us judge whether read mapping performed reasonably well. Furthermore, we query the comparison produced in cuffcompare and the table of differential gene expressions. Eventually, we collect the summarizing info produced in cummerbund2.

csdensity scatter volcano regucalcin_expression regucalcin_isoforms stats
comparison diff_genes nsig_gene nsig_isoform nsig_tss nsig_cds nsig_promoter
nsig_splicing nsig_relCDS;