This workflow calls variants in a sequenced sample by aligning the sample to a reference genome, identifying variants, and annotating them. It reimplements the analysis published by Koboldt et al. 2013.


Variant calling is a typical use case from the field of Next-Generation Sequencing. The workflow takes as input a genetic sample from a person and determines differences with respect to a reference human genome. Variants identified as significant are summarized in a table and annotated.

First, the sample in FastQ format undergoes quality control. This step allows an a priori assessment of the suitability of the sample. A low read quality or the presence of barcodes or adapter sequences may reduce the validity of the workflow results. Quality control with FastQC helps identify and preclude these error sources.

The most computationally demanding step is aligning each read in the sample to the human reference genome. This step is performed by the read mapper Bowtie2. The tool creates an index over each chromosome of the reference genome, which is provided in FastA format. Then, for each reference index and each sample file in FastQ format, read mapping is performed. The result is a table in BAM format stating, for each mapped read, the location it aligns to in the reference genome.

The BAM files are sorted individually and then merged so that a single BAM file per chromosome results. Then a multiple pile-up table is created for each of these BAM files. SAMtools is used to perform these transformations. The multiple pile-up is the input to the actual variant detection algorithm.
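The coordinate ordering that sorting establishes can be illustrated with plain coreutils on a toy table of chromosome/position pairs; samtools sort applies the same ordering to binary BAM records. All file names below are hypothetical.

```shell
# Toy illustration of coordinate sorting, the ordering samtools sort
# establishes on BAM records: first by chromosome, then numerically by
# position. File names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"

printf 'chr2\t500\nchr1\t900\nchr1\t100\n' > aln.tsv

# -k1,1 sorts by chromosome name, -k2,2n sorts numerically by position.
sort -k1,1 -k2,2n aln.tsv > sorted.tsv
```

This is only a conceptual sketch; on real data, samtools additionally handles the binary BAM encoding and header.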

Variant detection is performed using VarScan, which consumes a multiple pile-up and produces a variant table in VCF format. A significance level of 1% is applied to identify variants.

Finally, the variant table is annotated using ANNOVAR, which classifies variants according to their supposed effect and whether they appear in coding or non-coding regions of the human genome.

Task Definitions

This section of the workflow defines all tasks that can be used in the Workflow Definition section.

Utility Tasks

Utility tasks perform generic data transformations, like uncompressing archives or concatenating files.


The split function takes a FastQ file and splits it into partitions each containing 320,000 reads. Since a FastQ record spans four lines, this corresponds to 1,280,000 lines per partition. The function outputs a record with a single field lst, which is a list of files.

def split( file : File ) -> <lst : [File]> in Bash *{
  split -l 1280000 $file txt
  lst=`ls txt*`
}*
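A scaled-down Bash model of this partitioning shows how split produces the partition list. File names are hypothetical, and the chunk size is shrunk from 1,280,000 lines to 8 lines (2 reads) so the toy example stays small:

```shell
# Scaled-down model of the split task: a FastQ record spans 4 lines,
# so splitting at a multiple of 4 lines keeps records intact. The task
# uses -l 1280000 (320,000 reads); here we use -l 8 (2 reads) on a toy
# file. All file names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"

# Write a toy FastQ file with 6 records (24 lines).
for i in 1 2 3 4 5 6; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > sample.fastq

# Partition into chunks of 8 lines each, named txtaa, txtab, ...
split -l 8 sample.fastq txt

# Enumerate the partitions, as the task's lst output does.
n_parts=$(ls txt* | wc -l | tr -d ' ')
n_lines_aa=$(wc -l < txtaa | tr -d ' ')
```

With 24 input lines and 8 lines per chunk, this yields three partitions of two reads each.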


The untar function takes a tar archive and extracts its content. All files inside the archive are then enumerated in the output list.

def untar( tar : File ) ->
      <lst : [File]>

in Bash *{
  tar xf $tar
  lst=`tar tf $tar`
}*
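A minimal round trip shows the two tar commands the task body relies on: packing files into an archive, extracting them, and enumerating the archive content with `tar tf`. File names are hypothetical.

```shell
# Round trip for the untar task: pack two files into a tar archive,
# extract it, and enumerate its contents with `tar tf`, the same two
# commands the task body runs. File names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"

echo ">chr1" > chr1.fa
echo ">chr2" > chr2.fa
tar cf ref.tar chr1.fa chr2.fa

mkdir extracted
cd extracted
tar xf ../ref.tar          # extract the archive content
lst=$(tar tf ../ref.tar)   # enumerate all files inside the archive
n_files=$(tar tf ../ref.tar | wc -l | tr -d ' ')
```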


The gunzip function takes a gzipped file and decompresses it.

def gunzip( gz : File ) ->
      <file : File>

in Bash *{
  gzip -c -d $gz > $file
}*
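The task's core command can be exercised with a simple round trip: `gzip -c -d` writes the decompressed stream to stdout, which is then redirected into the output file. File names are hypothetical.

```shell
# Round trip for the gunzip task's core command. File names are
# hypothetical.
workdir=$(mktemp -d)
cd "$workdir"

printf 'ACGTACGT\n' > reads.txt
gzip -c reads.txt > reads.gz         # compress to a separate file
gzip -c -d reads.gz > restored.txt   # decompress, as the task does
```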

Next-Generation Sequencing Tasks

Next-Generation Sequencing tasks process sequencing data.


The fastqc function performs quality control over a sample in FastQ format. It outputs a zipped report in HTML format summarizing the properties of the sample.

def fastqc( fastq : File ) ->
      <zip : File>

in Bash *{
  fastqc -f fastq --noextract -o ./ $fastq
  zip=`ls *.zip`
}*


The bowtie2-build function creates a Bowtie2 index from a reference genome in FastA format.

def bowtie2-build( fa : File ) ->
      <idx : File>

in Bash *{
  bowtie2-build $fa bt2idx
  tar cf $idx --remove-files bt2idx.*
}*

The bowtie2-align function takes a pair of read files in FastQ format and a Bowtie2 index of a reference genome and aligns the reads against this reference. The result is an alignment file in SAM format which is converted to BAM format before it is returned.

def bowtie2-align( idx : File, fastq1 : File, fastq2 : File ) ->
      <bam : File>

in Bash *{
  tar xf $idx
  bowtie2 -D 20 -R 3 -N 0 -L 20 -i S,1,0.50 --no-unal -x bt2idx \
  -1 $fastq1 -2 $fastq2 -S - | samtools view -b - > $bam
  rm bt2idx.*
}*


The samtools-faidx function creates a SAMtools FastA index from an input FastA file. The index is essentially a text file listing each sequence name together with its length and byte offset.

def samtools-faidx( fa : File ) ->
      <fai : File>

in Bash *{
  samtools faidx $fa
  fai=$fa.fai
}*

The samtools-sort function sorts a BAM file by chromosome and position.

def samtools-sort( bam : File ) ->
      <sorted : File>

in Bash *{
  samtools sort -m 2G $bam -o $sorted
}*

The samtools-mpileup function processes a BAM file and produces a Multiple Pileup from it. A reference genome as well as a SAMtools FastA index on that reference are required.

def samtools-mpileup( bam : File, fa : File, fai : File ) ->
      <mpileup : File>

in Bash *{
  ln -sf $fai $fa.fai
  samtools mpileup -f $fa $bam > $mpileup
}*

The samtools-merge function takes a list of presorted BAM files and merges them. The merged result is also sorted.

def samtools-merge( bam-lst : [File] ) ->
      <merged : File> {

  def samtools-merge( bams : [File] ) ->
        <merged : File>

  in Bash *{
    if [ ${#bams[@]} -eq "1" ]
    then
      cp ${bams[@]} $merged
    else
      samtools merge -f $merged ${bams[@]}
    fi
  }*

  if isnil bam-lst
  then
    error "Merge list must not be empty." : <merged : File>
  else
    samtools-merge( bams = bam-lst )
  end
}
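The control flow of this task can be sketched in plain Bash, with `cp` and `cat` standing in for samtools (which needs real BAM input): reject an empty list, pass a single file through, and merge otherwise. The function name and file names here are hypothetical.

```shell
# Plain-Bash sketch of the samtools-merge guard. cat is a stand-in for
# `samtools merge -f`, which requires real BAM files. The function and
# file names are hypothetical.
merge_files() {
  out=$1; shift
  if [ $# -eq 0 ]; then
    echo "Merge list must not be empty." >&2
    return 1
  elif [ $# -eq 1 ]; then
    cp "$1" "$out"       # single input: nothing to merge
  else
    cat "$@" > "$out"    # stand-in for: samtools merge -f $out ...
  fi
}

workdir=$(mktemp -d)
cd "$workdir"
echo a > a.txt
echo b > b.txt
merge_files merged.txt a.txt b.txt   # two inputs: merge
merge_files single.txt a.txt         # one input: pass through
```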


VarScan is a variant detection tool which can process the Multiple Pileup format produced by SAMtools.

VarScan can detect single nucleotide polymorphisms:

def varscan-snp( mpileup : File ) ->
      <vcf : File>

in Bash *{
  varscan mpileup2snp $mpileup --output-vcf > $vcf
}*

And also insertions and deletions:

def varscan-indel( mpileup : File ) ->
      <vcf : File>

in Bash *{
  varscan mpileup2indel $mpileup --output-vcf > $vcf
}*


ANNOVAR is an annotation tool which annotates a list of genome positions with information taken from an annotation database. The task annovar takes a list of VCF tables and produces annotation tables for exonic and non-exonic variants. The body of the task concatenates all VCF files and pipes the output to a conversion tool which converts the VCF stream to the ANNOVAR canonical format. The output is then piped to the actual annotation tool which uses the provided annotation database to create the output tables.

def annovar( vcfs : [File], db : File, vsn : Str ) ->
      <fun : File, exonic : File>

in Bash *{
  tar xvf $db
  cat ${vcfs[@]} | \
  convert2annovar.pl -format vcf4 - | \
  annotate_variation.pl -buildver $vsn -geneanno -outfile table - db
  fun=table.variant_function
  exonic=table.exonic_variant_function
}*

Input Data

First, we define the location of the tar archive holding all chromosomes of the reference genome in separate FastA files.

let hg38-tar : File =

Next we define the locations of two pairs of paired-end sequenced read files in FastQ format. All 4 files together constitute a sample sequenced from a single individual.

let fastq1-lst-gz : [File] =
  ['kgenomes/SRR359195_1.filt.fastq.gz' : File];

let fastq2-lst-gz : [File] =
  ['kgenomes/SRR359195_2.filt.fastq.gz' : File];

The last item of importance is an annotation database which we will use to interpret the variants we identify in the sequence sample.

let build-vsn : Str =

let db : File =

Workflow Definition

The workflow is defined through data dependencies between task applications. In this section these data dependencies are established. Note that the order in which the task applications are written down does not matter. Cuneiform determines the order of task application solely from data dependencies.

First, we decompress the reference genome using the task untar. The result is a list of plain reference genome files in FastA format, one for each chromosome.

let <lst = fa-lst : [File]> =
  untar( tar = hg38-tar );

Next, we decompress the sample reads, which are stored in gzipped format, using the task gunzip. While Bowtie2 is able to process compressed reads as well, quality control with FastQC demands that we decompress the reads first.

let fastq1-lst : [File] =
  for gz : File <- fastq1-lst-gz do
    ( gunzip( gz = gz )|file ) : File
  end;

let fastq2-lst : [File] =
  for gz : File <- fastq2-lst-gz do
    ( gunzip( gz = gz )|file ) : File
  end;

The task fastqc performs quality control over all sample reads, collecting one zipped report per read file.

let qc-lst : [File] =
  for fastq : File <- ( fastq1-lst+fastq2-lst ) do
    ( fastqc( fastq = fastq )|zip ) : File
  end;

The alignment processing part describes how we arrive at a list of multiple pileup files, one for each chromosome in the reference genome. Accordingly, we iterate over the chromosomes and build a Bowtie2 index and a SAMtools index for each of them. Next, we split each sample file and align it to the current chromosome. We sort the alignments and merge them into one large alignment, from which we create a multiple pileup.

let mpileup-lst : [File] =
  for fa : File <- fa-lst do

    let <idx = idx : File> =
      bowtie2-build( fa = fa );

    let <fai = fai : File> =
      samtools-faidx( fa = fa );

    let sorted-lst : [File] =
      for fastq1 : File <- fastq1-lst, fastq2 : File <- fastq2-lst do

        let <lst = split-lst1 : [File]> = split( file = fastq1 );
        let <lst = split-lst2 : [File]> = split( file = fastq2 );

        let bam-lst : [File] =
          for split1 : File <- split-lst1,
              split2 : File <- split-lst2 do

            let <bam = bam : File> =
              bowtie2-align( idx    = idx,
                             fastq1 = split1,
                             fastq2 = split2 );

            ( samtools-sort( bam = bam )|sorted ) : File
          end;

        ( samtools-merge( bam-lst = bam-lst )|merged ) : File
      end;


    let <merged = merged : File> =
      samtools-merge( bam-lst = sorted-lst );

    ( samtools-mpileup( bam = merged,
                        fa  = fa,
                        fai = fai )|mpileup ) : File
  end;


Next we call variants on the list of multiple pileups. VarScan gives us two ways of calling variants: calling SNPs and calling indels.

let snp-lst : [File] =
  for mpileup : File <- mpileup-lst do
    ( varscan-snp( mpileup = mpileup )|vcf ) : File
  end;

let indel-lst : [File] =
  for mpileup : File <- mpileup-lst do
    ( varscan-indel( mpileup = mpileup )|vcf ) : File
  end;

Eventually, the VCF files are fed into the task annovar, which outputs tables characterizing each identified variant.

let <fun = fun : File, exonic = exonic : File> =
  annovar( vcfs = ( snp-lst+indel-lst ),
           db   = db,
           vsn  = build-vsn );


The workflow query is where we determine which values we’re interested in. Only computing steps that contribute to the query are actually performed.

We collect the VCF variant table, the annotations from coding and non-coding regions, as well as the quality control report. The computing steps needed to calculate these variables constitute the workflow.

<fun    = fun,
 exonic = exonic,
 qc-lst = qc-lst>;