This workflow calls variants in a sequence sample by aligning it to a reference genome, identifying variants and annotating them. It reimplements a publication by Koboldt et al. 2013.

Introduction

Variant calling is a typical use case from the field of Next-Generation Sequencing. The workflow takes as input a genetic sample from a person and determines differences with respect to a reference human genome. Variations that have been identified to be significant are summarized in a table and annotated.

First, the sample in FastQ format undergoes quality control. This step allows an a priori assessment of the suitability of the sample. A low read quality or the presence of barcodes or adapter sequences may reduce the validity of the workflow results. Quality control with FastQC helps identifying and precluding these error sources.

The computationally most demanding step is aligning each read in the sample to the human reference genome. This step is performed by the read mapper Bowtie2. The tool creates an index over each chromosome of the reference genome provided in FastA format. Then for each reference index and each sample file in FastQ format, read mapping is performed. The result is a table in BAM format, stating for each mapped read the location it aligns to in the reference genome.

The BAM files are sorted individually and then merged in a way that only one BAM file per chromosome results. Then a multiple pile-up table is created for each of these BAM files. SAMtools is used to perform these transformations. The multiple pile-up is the input to the actual variant detection algorithm.

Variant detection is performed using VarScan which consumes a multiple pile-up and produces a variant table in VCF format. A significance level of 1% is applied to identify variants.

Eventually, the variant table is annotated using Annovar which classifies variants according to their supposed effect and whether they appear in coding or non-coding regions of the human genome.

Task Definitions

This section of a workflow defines all tasks which can be used in the Workflow Definition section.

Utility Tasks

Utility tasks do generic data transformations like uncompressing archives or concatenating files.

untar

The untar task takes a tar archive and extracts its content. All files inside the archive are then enumerated in the output list.

deftask untar( <out( File )> : tar( File ) )in bash *{
  tar xf $tar
  out=`tar tf $tar`
}*

gunzip

The gunzip task takes a gzipped file and decompresses it.

deftask gunzip( out( File ) : gz( File ) )in bash *{
  out=unzipped_${gz%.gz}
  gzip -c -d $gz > $out
}*

Next-Generation Sequencing Tasks

Next-Generation Sequencing tasks process sequencing data.

FastQC

The fastqc task performs quality control over a sample in FastQ format. It outputs a report in HTML format summarizing the properties of the sample.

deftask fastqc( zip( File ) : fastq( File ) )in bash *{
  fastqc -f fastq --noextract -o ./ $fastq
  zip=`ls *.zip`
}*

Bowtie2

The task bowtie2-build creates a Bowtie2 index from a reference genome in FastA format.

deftask bowtie2-build( idx( File ) : fa( File ) )in bash *{
  bowtie2-build $fa bt2idx
  idx=idx.tar
  tar cf $idx --remove-files bt2idx.*
}*

bowtie2-align takes a pair of read files in FastQ format and a Bowtie2 index of a reference genome and aligns the reads against this reference. The result is an alignment file in SAM format which is converted to BAM format before it is returned.

deftask bowtie2-align( bam( File )
                     : idx( File )
                       [fastq1( File ) fastq2( File )] )in bash *{
  tar xf $idx
  bam=alignment.bam
  bowtie2 -D 5 -R 1 -N 0 -L 22 -i S,0,2.50 --no-unal -x bt2idx \
  -1 $fastq1 -2 $fastq2 -S - | samtools view -b - > $bam
  rm bt2idx.*
}*

SAMtools

The task samtools-faidx creates a SAMtools FastA index from an input FastA file. The index is essentially a text file listing all sequence names.

deftask samtools-faidx( fai( File ) : fa( File ) )in bash *{
  samtools faidx $fa
  fai=$fa.fai
}*

The task samtools-sort sorts a BAM file for chromosomes and positions.

deftask samtools-sort( sortedbam( File )
                     : bam( File ) )in bash *{
  sortedbam=sorted.bam
  samtools sort -m 2G $bam -o $sortedbam
}*

The task samtools-merge takes a list of presorted BAM files and merges them. The merged result is also sorted.

deftask samtools-merge( merged( File ) : <bam( File )> )in bash *{
  merged=merged.bam
  if [ ${#bam[@]} -eq "0" ]
  then
    echo "No files to merge." >&2
    exit -1
  else
    if [ ${#bam[@]} -eq "1" ]
    then
      merged=$bam
    else
      samtools merge -f $merged ${bam[@]}
    fi
  fi
}*

The task samtools-mpileup processes a BAM file and produces a Multiple Pileup from it. A reference genome as well as a SAMtools FastA index on that reference are required.

deftask samtools-mpileup( mpileup( File )
                        : bam( File ) [fa( File ) fai( File )] )in bash *{
  ln -sf $fai $fa.fai
  mpileup=mpileup.csv
  samtools mpileup -f $fa $bam > $mpileup
}*

VarScan

VarScan is a variant detection tool which can process the Multiple Pileup format produced by SAMtools. We implicitly set the p-value to 99%.

deftask varscan( vcf( File ) : mpileup( File ) )in bash *{
  vcf=variants.vcf
  varscan mpileup2snp $mpileup --output-vcf --p-value 99e-02 > $vcf
}*

ANNOVAR

ANNOVAR is an annotation tool which annotates a list of genome positions with information taken from an annotation database. The task annovar takes a list of VCF tables and produces a single annotation table. The body of the task concatenates all VCF files and pipes the output to a conversion tool which converts the VCF stream to the ANNOVAR canonical format. The output is then piped to the actual annotation tool which uses the provided annotation database to create the output table.

deftask annovar( fun( File ) exonicfun( File )
               : <vcf( File )> db buildver )in bash *{
  fun=table.variant_function
  exonicfun=table.exonic_variant_function
  cat ${vcf[@]} | \
  convert2annovar.pl -format vcf4 - | \
  annotate_variation.pl -buildver $buildver -geneanno -outfile table - $db
}*

Subworkflow Definition

Cuneiform tasks can be written in several scripting languages, including Bash. However, we can also define tasks in Cuneiform itself. This way it is possible to group common procedures and call them as a single unit. Additionally, a task always comes with a set of rules, specifying the way the argument lists are decomposed.

per-chromosome

The sub-workflow task per-chromosome takes a list of FastA references fa, a list of SAMtools FastA indexes fai, a list of Bowtie2 indexes idx (all 3 lists are of equal length), as well as a pair of lists holding sequence reads in FastQ format. Before the task per-chromosome is applied, the list of references is decomposed and the body of the task per-chromosome is applied for each triple of corresponding reference file, SAMtools FastA index, and Bowtie2 index. The sample read pairs, in contrast, are consumed as a whole.

Inside the sub-workflow task per-chromosome another sub-workflow per-fastq is called, resulting in a list of sorted BAM files sortedbam. This list is then merged using samtools-merge and a Multiple Pileup is generated from the merged BAM file. The task varscan is called on the Multiple Pileup, identifying the variants and storing the result in a VCF file vcf.

deftask per-chromosome( vcf( File )
                      : [fa( File ) fai( File ) idx( File )]
                        <fastq1( File )> <fastq2( File )> ) {

  sortedbam = per-fastq(
    idx:    idx,
    fastq1: fastq1,
    fastq2: fastq2 );

  mergedbam = samtools-merge( bam: sortedbam );

  mpileup = samtools-mpileup(
    bam: mergedbam,
    fa:  fa,
    fai: fai );

  vcf = varscan( mpileup: mpileup );
}

per-fastq

The sub-workflow task per-fastq takes a list pair of FastQ reads, aligns each of them to a reference genome and sorts the resulting alignment BAM file.

deftask per-fastq( sortedbam
                 : idx( File ) [fastq1( File ) fastq2( File )] ) {

  bam = bowtie2-align(
    idx:    idx,
    fastq1: fastq1,
    fastq2: fastq2 );

  sortedbam = samtools-sort( bam: bam );
}

Input Data

First, we define the location of the tar archive holding all chromosomes of the reference genomes in separate FastA files.

hg38-tar = "hg38/hg38.tar";

Next we define the locations of two pairs of paired-end sequenced read files in FastQ format. All 4 files together constitute a sample sequenced from a single individual.

fastq1-gz = "kgenomes/SRR359188_1.filt.fastq.gz"
            "kgenomes/SRR359195_1.filt.fastq.gz";

fastq2-gz = "kgenomes/SRR359188_2.filt.fastq.gz"
            "kgenomes/SRR359195_2.filt.fastq.gz";

The last item of importance is an annotation database which we will use to interpret the variants we identify in the sequence sample.

db = "/opt/data/annodb_hg38";

Workflow Definition

The workflow is defined through data dependencies between task applications. In this section these data dependencies are established. Note that the order in which the task applications are written down, does not matter at all. Cuneiform determines the order of task application solely from data dependencies.

First, we decompress the reference genome using the task untar. The result is a list of plain reference genome files in FastA format, one for each chromosome.

fa = untar( tar: hg38-tar );

Next, we decompress the the sample reads which are stored in gzipped format using the task gunzip. While Bowtie2 is able to process also compressed reads, quality control with FastQC demands us to decompress reads first.

fastq1 = gunzip( gz: fastq1-gz );
fastq2 = gunzip( gz: fastq2-gz );

The task fastqc performs quality control over all sample reads and stores the result in a HTML report.

qc = fastqc( fastq: fastq1 fastq2 );

Prior to mapping the sample reads, we need to create an index of the reference genome. The task bowtie2-build is mapped to each element in the list of FastA files representing the reference genome.

bt2idx = bowtie2-build( fa: fa );

The Multiple Pileup step which is part of the per-chromosome sub-workflow requires another kind of index of the reference genome. We create it using the task samtools-faidx.

fai = samtools-faidx( fa: fa );

Next, we call the task per-chromosome which outputs a list of VCF files, one for each chromosome.

vcf = per-chromosome(
  fa:     fa,
  fai:    fai,
  idx:    bt2idx,
  fastq1: fastq1,
  fastq2: fastq2 );

Eventually, the vcf files are fed into the task annovar which outputs a list of tables characterizing each identified variant.

fun exonicfun = annovar(
  vcf:      vcf,
  db:       db,
  buildver: "hg38" );

Query

The workflow query is where we determine which values we’re interested in. Only computing steps that contribute to the query are actually performed.

We collect the VCF variant table, the annotations from coding and non-coding regions, as well as the quality control report. The computing steps needed to calculate these variables constitute the workflow.

vcf fun exonicfun qc;