This workflow compares a case and a control sample identifying differences in DNA methylation. It reimplements a publication by Hansen et al. 2012.

Introduction

Effectively and reliably uncovering complex differential methylation patterns between e.g. cancerous and non-cancerous tissue remains challenging. A plethora of tools has been developed, used and partially being rendered legacy. Here we show the updated version of the highly cited Merman legacy algorithm Hansen et al. 2011 for differential methylation analysis with the state-of-the-art tool Bismark in combination with the sophisticated Bsseq R statistics.

Hansen et al. 2011 Merman workflow has been implemented with slight variation as described in Hansen et al. 2012. The developed Cuneiform workflow consists of various tools, algorithms and scripts, ranging from raw read trimming to the high-level final statistical analyses to identifies differentially DNA-methylated regions (cDMRs) by comparing multiple control sample to multiple case samples.

The input for this workflow are whole-genome bisulfite sequencing reads in FastQ format. Instead of using the original data from the study we create a pair of methylation samples using Sherman on chromosome 22. Reduced Representation Bisulfile Sequencing (RRBS) reads Base-Space (BS) and Abi Solid Bisulfite Color-Space (CS) sequencing technology is supported. Note, that Abi Solid data is re-written to BS prior to analysis.

Trim Galore is used to account for Illumina Reduced Representation Bisulfite Sequencing representation input FastQ data. Since the read mapper used in the original study, Merman, has stopped being supported read mapping is performed using Bismark and Bowtie2 instead. Coverage information and BED graphs are also derived using Bismark.

Non differential methylation analyses can be performed by simply de-multiplexing the R-Script analysis after smoothing.

From the SAM alignment files output by Bismark, sorted, indexed BAM files are created to be used in genome browsers like IGV.

The coverage information is used to create a report in PDF format using Bsmooth.

Task Definitions

This section of a workflow defines all tasks which can be used in the Workflow Definition section.

Utility Tasks

Utility tasks do generic data transformations like uncompressing archives or concatenating files.

untar

The untar task takes a tar archive and extracts its content. All files inside the archive are then enumerated in the output list.

deftask untar( <out( File )> : tar( File ) )in bash *{
  tar xf $tar
  out=`tar tf $tar`
}*

cat

The cat task takes a list of files and concatenates them returning a single output file.

deftask cat( out( File ) : <file( File )> )in bash *{
  cat ${file[@]} > $out
}*

gunzip

The gunzip task takes a gzipped file and decompresses it.

deftask gunzip( out( File ) : gz( File ) )in bash *{
  out=${gz%.gz}
  gzip -c -d $gz > $out
}*

Next-Generation Sequencing Tasks

Next-Generation Sequencing tasks process sequencing data.

FastQC

The fastqc task performs quality control over a sample in FastQ format. It outputs a report in HTML format summarizing the properties of the sample.

deftask fastqc( zip( File ) : fastq( File ) )in bash *{
  fastqc -f fastq --noextract -o ./ $fastq
  zip=`ls *.zip`
}*

Bismark

bismark-prepare creates a pair of Bowtie 2 indexes from a transformed reference genome in FastA format.

deftask bismark-prepare( idx( File ) : fa( File ) )in bash *{
  mkdir ref
  (cd ref; ln -s ../$fa)
  bismark_genome_preparation --bowtie2 ref
  idx=idx.tar
  tar chf $idx ref
}*

bismark-align aligns a sample in FastQ format to an index pair created with bismark-prepare. It returns a SAM file containing all alignments.

deftask bismark-align( sam( File ) : idx( File ) fastq( File  ) )in bash *{
  tar xf $idx
  bismark --sam --bowtie2 -N 1 -L 20 --phred33-quals ref $fastq
  sam=${fastq}_bismark_bt2.sam
}*

bismark-extract extracts a BED graph and coverage information from a set of alignments in SAM format.

deftask bismark-extract( bismark_cov( File ) bedgraph( File ) : sam( File ) )in bash *{
  bismark_methylation_extractor -s --comprehensive --bedGraph $sam
  bedgraph=`ls *.bedGraph.gz`
  bismark_cov=`ls *.bismark.cov.gz`
}*

Sherman

Sherman is a tool for generating artificial methylation reads from a given reference genome in FastA format. The user can specify read length (len), number of reads (nread), conversion rate (cr), and error rate (e) of the generative process.

deftask sherman( fastq( File ): len nread cr e <fa( File )> )in bash *{
  mkdir ref
  (cd ref; for i in ${fa[@]}; do ln -s ../$i; done)
  Sherman -cr $cr -e $e -l $len -n $nread --genome_folder ref
  fastq=simulated.fastq
}*

Trim Galore

Trim Galore takes a sequence sample in FastQ format and removes all Illumina-specific adapter sequences.

deftask trim-galore( trimmed( File ) : fastq( File ) )in bash *{
  trim_galore --rrbs --illumina $fastq
  trimmed=`ls *_trimmed.*`
}*

SAMtools

samtools-view converts a SAM file into a binary alignment format (BAM).

deftask samtools-view( bam( File ) : sam( File ) )in bash *{
  bam=${sam%.sam}.bam
  samtools view -bS $sam > $bam
}*

samtools-sort sorts a BAM file for chromosomes and positions.

deftask samtools-sort( sortedbam( File ) : bam( File ) )in bash *{
  sortedbam=sorted.bam
  samtools sort -m 2G $bam -o $sortedbam
}*

samtools-index creates an index over a BAM file. Such an index can be used to visualize the alignments in a genome browser like IGV.

deftask samtools-index( bai( File ): bam( File ) )in bash *{
  bai=$bam.bai
  samtools index $bam
}*

bsseq

The report task uses the R library bsseq to compare coverage information in a case and a control sample. It returns a pdf file summarizing the results.

deftask report( pdf_file( File ) : <cov_ctl( File )> <cov_case( File )> )in r *{

  library("bsseq")
      
  pdf_file = "differentially_methylated_regions.pdf"
      
  input_coverage_data = read.bismark(  
    c(
      cov_ctl,
      cov_case
    ),
    sampleNames = basename( c(
      cov_ctl,
      cov_case
    ) )
  )
      
  index_control = 1:length( cov_ctl )
  index_case    = index_control+length( cov_ctl )
      
  BS.cov = getCoverage(input_coverage_data)
  # now filter for positions with sufficient coverage
  keepLoci.ex = which(rowSums( BS.cov[, index_control ] >= 2 ) >= 2 & rowSums(BS.cov[, index_case ] >= 2) >= 2)
  input_coverage_data_filt = input_coverage_data[keepLoci.ex,]
      
  # now smoothen
  input_coverage_data_filt = BSmooth( input_coverage_data_filt, mc.cores = 1, verbose = T )
      
  input_coverage_data.tstat = BSmooth.tstat(
      
    input_coverage_data_filt[1:10**5,],
    group1 = basename( cov_ctl ),
    group2 = basename( cov_case ),
    estimate.var = "group2",
    verbose = T
  )
      
  dmrs0 = dmrFinder( input_coverage_data.tstat, cutoff = c( -1, 1 ) )
  dmrs = subset(dmrs0, n >= 3 & abs(meanDiff) >= 0.1)
  col = rep(c("red", "blue"), each = length(cov_ctl) ) # assuming only fair number of samples
  pData( input_coverage_data_filt)$col = col
    
  pdf(file = pdf_file, width = 10, height = 5)
  plotManyRegions( input_coverage_data_filt, dmrs[ 1 : dim( dmrs )[ 1 ], ], extend = 5000, addRegions = dmrs)
  dev.off()
}*

Input Data

First, we fix the parametrization for the generation of artificial sequence samples with Sherman. We set the read length to 80 base pairs, the conversion rate to 90% and the error rate to 0.01%.

len      =     80;
cr       =     90;
e        = "0.01";

Next we define the reference to be chromosome 22 of the Human genome (UCSC hg19).

fa-gz = "hg19/chr22.fa.gz";

Workflow Definition

The workflow is defined through data dependencies between task applications. In this section these data dependencies are established. Note that the order in which the task applications are written down, does not matter at all. Cuneiform determines the order of task application solely from data dependencies.

First, we decompress the reference genome in its gzipped format using the task gunzip. The result is a plain reference genome in FastA format.

fa = gunzip( gz: fa-gz );

We can now create a Bismark index from the uncompressed reference.

idx = bismark-prepare( fa: fa );

Independent from the index creation, the uncompressed reference genome is used by Sherman to create 4 sequence samples: 2 for the case and 2 for the control group.

Note that we use a different value for the number of reads for each of the sequence samples. Since Cuneiform assumes tasks to be deterministic, it would apply Sherman only once if the parameters of each call are identical. We can avoid this problem by calling Sherman 4 times with slightly different parameters. This way Cuneiform will run the task exactly 4 times instead of running it once and reusing the memoized result.

fastq_case = sherman(
  len:   len,
  nread: 2000000 2000001,
  fa:    fa,
  cr:    cr,
  e:     e );
      
fastq_ctl = sherman(
  len:   len,
  nread: 2000010 2000011,
  fa:    fa,
  cr:    cr,
  e:     e );

Trim Galore is used on each of the samples to dispose of adapter sequences in the read data.

trimmed_case = trim-galore( fastq: fastq_case );
trimmed_ctl = trim-galore( fastq: fastq_ctl );

Now Bismark is used to align the artificial reads to the index of the reference genome.

sam_case = bismark-align(
  idx:    idx,
  fastq: trimmed_case );
      
sam_ctl = bismark-align(
  idx:    idx,
  fastq: trimmed_ctl );

We use SAMtools to get a sorted, binary version of the alignments.

bam_case = samtools-sort( bam: samtools-view( sam: sam_case ) );
bam_ctl = samtools-sort( bam: samtools-view( sam: sam_ctl ) );

Furthermore we create an index from the sorted alignments to be used when browsing the alignments in IGV.

bai_case = samtools-index( bam: bam_case );
bai_ctl = samtools-index( bam: bam_ctl );

We use Bismark again to extract coverage information and a BED graph.

cov_case bedgraph_case = bismark-extract( sam: sam_case );
cov_ctl bedgraph_ctl = bismark-extract( sam: sam_ctl );

Also, we are interested in a Quality Control report derived from the input sequences.

qc = fastqc( fastq: fastq_case fastq_ctl );

The report task is used to compare coverage data from case and control group and derive a pdf report from these differences.

pdf = report(
        cov_ctl: cov_ctl
        cov_case: cov_case );

Query

The worklfow query is where we determine which values we are interested in. Only computing steps that contribute to the query are actually performed.

We collect the coverage report, the quality control report, and the sorted alignments and their indexes. The computing steps needed to calculate these variables constitute the workflow.

pdf qc bam_case bai_case bam_ctl bai_ctl;