This workflow calls variants in a sequence sample by aligning it to a reference genome, identifying variants and annotating them. It reimplements a publication by Koboldt et al. 2013.

Introduction

Variant calling is a typical use case from the field of Next-Generation Sequencing. The workflow takes as input a genetic sample from a person and determines differences with respect to a reference human genome. Variations that have been identified to be significant are summarized in a table and annotated.

First, the sample in FastQ format undergoes quality control. This step allows an a priori assessment of the suitability of the sample. A low read quality or the presence of barcodes or adapter sequences may reduce the validity of the workflow results. Quality control with FastQC helps identifying and precluding these error sources.

The computationally most demanding step is aligning each read in the sample to the human reference genome. This step is performed by the read mapper Bowtie2. The tool creates an index over each chromosome of the reference genome provided in FastA format. Then for each reference index and each sample file in FastQ format, read mapping is performed. The result is a table in BAM format, stating for each mapped read the location it aligns to in the reference genome.

The BAM files are sorted individually and then merged in a way that only one BAM file per chromosome results. Then a multiple pile-up table is created for each of these BAM files. SAMtools is used to perform these transformations. The multiple pile-up is the input to the actual variant detection algorithm.

Variant detection is performed using VarScan which consumes a multiple pile-up and produces a variant table in VCF format. A significance level of 1% is applied to identify variants.

Eventually, the variant table is annotated using Annovar which classifies variants according to their supposed effect and whether they appear in coding or non-coding regions of the human genome.

Task Definitions

This section of a workflow defines all tasks which can be used in the Workflow Definition section.

Utility Tasks

Utility tasks do generic data transformations like uncompressing archives or concatenating files.

untar

The untar task takes a tar archive and extracts its content. All files inside the archive are then enumerated in the output list.

def untar( tar : File ) ->
      <fileLst : [File]>

in Bash*{
  tar xf $tar
  fileLst=`tar tf $tar`
}*

gunzip

The gunzip task takes a gzipped file and decompresses it.

def gunzip( gz : File ) ->
      <file : File>

in Bash *{
  file=unzipped_${gz%.gz}
  gzip -c -d $gz > $file
}*

Next-Generation Sequencing Tasks

Next-Generation Sequencing tasks process sequencing data.

FastQC

The fastqc task performs quality control over a sample in FastQ format. It outputs a report in HTML format summarizing the properties of the sample.

def fastqc( fastq : File ) ->
      <zip : File>

in Bash *{
  fastqc -f fastq --noextract -o ./ $fastq
  zip=`ls *.zip`
}*

Bowtie2

The task bowtie2Build creates a Bowtie2 index from a reference genome in FastA format.

def bowtie2Build( fa : File ) ->
      <idx : File>

in Bash *{
  bowtie2-build $fa bt2idx
  idx=idx.tar
  tar cf $idx --remove-files bt2idx.*
}*

bowtie2Align takes a pair of read files in FastQ format and a Bowtie2 index of a reference genome and aligns the reads against this reference. The result is an alignment file in SAM format which is converted to BAM format before it is returned.

def bowtie2Align( idx : File, fastq1 : File, fastq2 : File ) ->
      <bam : File>

in Bash *{
  tar xf $idx
  bam=alignment.bam
  bowtie2 -D 5 -R 1 -N 0 -L 22 -i S,0,2.50 --no-unal -x bt2idx \
  -1 $fastq1 -2 $fastq2 -S - | samtools view -b - > $bam
  rm bt2idx.*
}*

SAMtools

The task samtoolsFaidx creates a SAMtools FastA index from an input FastA file. The index is essentially a text file listing all sequence names.

def samtoolsFaidx( fa : File ) ->
      <fai : File>

in Bash *{
  samtools faidx $fa
  fai=$fa.fai
}*

The task samtoolsSort sorts a BAM file for chromosomes and positions.

def samtoolsSort( bam : File ) ->
      <sorted : File>

in Bash *{
  sorted=sorted.bam
  samtools sort -m 2G $bam -o $sorted
}*

The task samtoolsMpileup processes a BAM file and produces a Multiple Pileup from it. A reference genome as well as a SAMtools FastA index on that reference are required.

def samtoolsMpileup( bam : File, fa : File, fai : File ) ->
      <mpileup : File>

in Bash *{
  ln -sf $fai $fa.fai
  mpileup=mpileup.csv
  samtools mpileup -f $fa $bam > $mpileup
}*

The task samtoolsMerge takes a list of presorted BAM files and merges them. The merged result is also sorted.

def samtoolsMerge( bamLst : [File] ) ->
      <merged : File>

{

  def samtoolsMerge( bamLst : [File] ) ->
        <merged : File>

  in Bash *{
    merged=merged.bam
    if [ ${#bamLst[@]} -eq "1" ]
    then
      merged=$bam
    else
      samtools merge -f $merged ${bamLst[@]}
    fi
  }*


  if isnil bamLst
  then
    error "Merge list must not be empty." : <merged : File>
  else
    samtoolsMerge( bamLst = bamLst )
  end
}

VarScan

VarScan is a variant detection tool which can process the Multiple Pileup format produced by SAMtools. We implicitly set the p-value to 99%.

def varscan( mpileup : File ) ->
      <vcf : File>

in Bash *{
  vcf=variants.vcf
  varscan mpileup2snp $mpileup --output-vcf --p-value 99e-02 > $vcf
}*

ANNOVAR

ANNOVAR is an annotation tool which annotates a list of genome positions with information taken from an annotation database. The task annovar takes a list of VCF tables and produces a single annotation table. The body of the task concatenates all VCF files and pipes the output to a conversion tool which converts the VCF stream to the ANNOVAR canonical format. The output is then piped to the actual annotation tool which uses the provided annotation database to create the output table.

def annovar( vcfLst : [File], db : Str, buildVsn : Str ) ->
      <fun : File, exonicFun : File>

in Bash *{
  fun=table.variant_function
  exonicFun=table.exonic_variant_function
  cat ${vcfLst[@]} | \
  convert2annovar.pl -format vcf4 - | \
  annotate_variation.pl -buildver $buildVsn -geneanno -outfile table - $db
}*

Input Data

First, we define the location of the tar archive holding all chromosomes of the reference genomes in separate FastA files.

let hg38Tar : File =
  'hg38/hg38.tar';

Next we define the locations of two pairs of paired-end sequenced read files in FastQ format. All 4 files together constitute a sample sequenced from a single individual.

let fastq1LstGz : [File] =
  ['kgenomes/SRR359188_1.filt.fastq.gz',
   'kgenomes/SRR359195_1.filt.fastq.gz' : File];

let fastq2LstGz : [File] =
  ['kgenomes/SRR359188_2.filt.fastq.gz',
   'kgenomes/SRR359195_2.filt.fastq.gz' : File];

The last item of importance is an annotation database which we will use to interpret the variants we identify in the sequence sample.

let db : Str =
  "/opt/data/annodb_hg38";

let buildVsn : Str =
  "hg38";

Workflow Definition

The workflow is defined through data dependencies between task applications. In this section these data dependencies are established. Note that the order in which the task applications are written down, does not matter at all. Cuneiform determines the order of task application solely from data dependencies.

First, we decompress the reference genome using the task untar. The result is a list of plain reference genome files in FastA format, one for each chromosome.

let <fileLst = faLst : [File]> =
  untar( tar = hg38Tar );

Next, we decompress the the sample reads which are stored in gzipped format using the task gunzip. While Bowtie2 is able to process also compressed reads, quality control with FastQC demands us to decompress reads first.

let fastq1Lst : [File] =
  for gz <- fastq1LstGz do
    ( gunzip( gz = gz )|file ) : File
  end;

let fastq2Lst : [File] =
  for gz <- fastq2LstGz do
    ( gunzip( gz = gz )|file ) : File
  end;

The task fastqc performs quality control over all sample reads and stores the result in a HTML report.

let qcLst : [File] =
  for fastq <- ( fastq1Lst+fastq2Lst ) do
    ( fastqc( fastq = fastq )|zip ) : File
  end;

The actual variant calling part describes how we come to a list of VCF files, one for each chromosome in the reference genome. Accordingly, we iterate over each chromosome and build a Bowtie 2 index and a SAMtools index from them. Next, for each sample we perform the alignment and sort the result. All sorted alignments for one chromosome are then merged to one large alignment for which we create a multiple pileup and call variants with VarScan.

let vcfLst : [File] =
  for fa <- faLst do

    let <idx = idx : File> =
      bowtie2Build( fa = fa );

    let <fai = fai : File> =
      samtoolsFaidx( fa = fa );

    let sortedLst : [File] =
      for fastq1 <- fastq1Lst, fastq2 <- fastq2Lst do

        let <bam = bam : File> =
          bowtie2Align( idx = idx, fastq1 = fastq1, fastq2 = fastq2 );

        ( samtoolsSort( bam = bam )|sorted ) : File

      end;

    let <merged = merged : File> =
      samtoolsMerge( bamLst = sortedLst );

    let <mpileup = mpileup : File> =
      samtoolsMpileup( bam = merged, fa = fa, fai = fai );

    ( varscan( mpileup = mpileup )|vcf ) : File

  end;

Eventually, the vcf files are fed into the task annovar which outputs a list of tables characterizing each identified variant.

let <fun = fun : File, exonicFun = exonicFun : File> =
  annovar( vcfLst = vcfLst, db = db, buildVsn = buildVsn );

Query

The workflow query is where we determine which values we’re interested in. Only computing steps that contribute to the query are actually performed.

We collect the VCF variant table, the annotations from coding and non-coding regions, as well as the quality control report. The computing steps needed to calculate these variables constitute the workflow.

<fun = fun, exonicFun = exonicFun, qcLst = qcLst>;