This workflow calls variants in a sequence sample by aligning the reads to a reference genome, identifying variants, and annotating them. It reimplements the analysis published by Koboldt et al. (2013).

Introduction

Variant calling is a typical use case from the field of Next-Generation Sequencing. The workflow takes as input a genetic sample from a person and determines differences with respect to a human reference genome. Variants identified as significant are summarized in a table and annotated.

First, the sample in FastQ format undergoes quality control. This step allows an a priori assessment of the suitability of the sample. A low read quality or the presence of barcodes or adapter sequences may reduce the validity of the workflow results. Quality control with FastQC helps identify and preclude these error sources.

The most computationally demanding step is aligning each read in the sample to the human reference genome. This step is performed by the read mapper Bowtie2. The tool creates an index for each chromosome of the reference genome, which is provided in FastA format. Then, for each reference index and each sample file in FastQ format, read mapping is performed. The result is an alignment file in BAM format, stating for each mapped read the location at which it aligns to the reference genome.

The BAM files are sorted individually and then merged so that only one BAM file per chromosome remains. Then a multiple pileup table is created for each of these BAM files. SAMtools is used to perform these transformations. The multiple pileup is the input to the actual variant detection algorithm.

Variant detection is performed using VarScan, which consumes a multiple pileup and produces a variant table in VCF format. A significance level of 1% is applied to identify variants.

Finally, the variant table is annotated using ANNOVAR, which classifies variants according to their supposed effect and whether they appear in coding or non-coding regions of the human genome.

Task Definitions

This section of the workflow defines all tasks that can be used in the Workflow Definition section.

Utility Tasks

Utility tasks perform generic data transformations such as decompressing archives or splitting files.

split

The split function takes a FastQ file and splits it into partitions of 320,000 reads each. Since each FastQ record spans four lines, this corresponds to 1,280,000 lines per partition. The function outputs a record with a single field lst, which is a list of files.

def split( file : File ) -> <lst : [File]> in Bash *{
  # one FastQ record spans four lines, so 1,280,000 lines correspond to
  # 320,000 reads per partition
  split -l 1280000 $file txt
  lst=`ls txt*`
}*

untar

The untar function takes a tar archive and extracts its content. All files inside the archive are then enumerated in the output list.

def untar( tar : File ) ->
      <lst : [File]>

in Bash *{
  tar xf $tar
  lst=`tar tf $tar`
}*

gunzip

The gunzip function takes a gzipped file and decompresses it.

def gunzip( gz : File ) ->
      <file : File>

in Bash *{
  file=unzipped_${gz%.gz}
  gzip -c -d $gz > $file
}*

Next-Generation Sequencing Tasks

Next-Generation Sequencing tasks process sequencing data.

FastQC

The fastqc function performs quality control over a sample in FastQ format. It outputs a zipped HTML report summarizing the properties of the sample.

def fastqc( fastq : File ) ->
      <zip : File>

in Bash *{
  fastqc -f fastq --noextract -o ./ $fastq
  zip=`ls *.zip`
}*

Bowtie2

The bowtie2-build function creates a Bowtie2 index from a reference genome in FastA format.

def bowtie2-build( fa : File ) ->
      <idx : File>

in Bash *{
  bowtie2-build $fa bt2idx
  idx=idx.tar
  tar cf $idx --remove-files bt2idx.*
}*

The bowtie2-align function takes a pair of read files in FastQ format and a Bowtie2 index of a reference genome and aligns the reads against this reference. The result is an alignment file in SAM format which is converted to BAM format before it is returned.

def bowtie2-align( idx : File, fastq1 : File, fastq2 : File ) ->
      <bam : File>

in Bash *{
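  # unpack the Bowtie2 index packed by bowtie2-build; the basename bt2idx
  # must match the prefix used there and in the -x option below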
  tar xf $idx
  bam=alignment.bam
  bowtie2 -D 20 -R 3 -N 0 -L 20 -i S,1,0.50 --no-unal -x bt2idx \
  -1 $fastq1 -2 $fastq2 -S - | samtools view -b - > $bam
  rm bt2idx.*
}*

SAMtools

The samtools-faidx function creates a SAMtools FastA index from an input FastA file. The index is essentially a text file listing all sequence names together with their lengths and offsets.

def samtools-faidx( fa : File ) ->
      <fai : File>

in Bash *{
  samtools faidx $fa
  fai=$fa.fai
}*

The samtools-sort function sorts a BAM file by chromosome and position.

def samtools-sort( bam : File ) ->
      <sorted : File>

in Bash *{
  sorted=sorted.bam
  samtools sort -m 2G $bam -o $sorted
}*

The samtools-mpileup function processes a BAM file and produces a Multiple Pileup from it. A reference genome as well as a SAMtools FastA index on that reference are required.

def samtools-mpileup( bam : File, fa : File, fai : File ) ->
      <mpileup : File>

in Bash *{
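  # samtools expects the FastA index next to the reference under the name
  # $fa.fai, so link the separately provided index file to that location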
  ln -sf $fai $fa.fai
  mpileup=mpileup.csv
  samtools mpileup -f $fa $bam > $mpileup
}*

The samtools-merge function takes a list of presorted BAM files and merges them. The merged result is also sorted.

def samtools-merge( bam-lst : [File] ) ->
      <merged : File>

{

  def samtools-merge( bams : [File] ) ->
        <merged : File>

  in Bash *{
    merged=merged.bam
    if [ ${#bams[@]} -eq 1 ]
    then
      # a single pre-sorted BAM file needs no merging
      merged=${bams[0]}
    else
      samtools merge -f $merged ${bams[@]}
    fi
  }*


  if isnil bam-lst
  then
    error "Merge list must not be empty." : <merged : File>
  else
    samtools-merge( bams = bam-lst )
  end
}

VarScan

VarScan is a variant detection tool which can process the Multiple Pileup format produced by SAMtools.

VarScan can detect single nucleotide polymorphisms:

def varscan-snp( mpileup : File ) ->
      <vcf : File>

in Bash *{
  vcf=snp.vcf
  varscan mpileup2snp $mpileup --output-vcf > $vcf
}*

And also insertions and deletions:

def varscan-indel( mpileup : File ) ->
      <vcf : File>

in Bash *{
  vcf=indel.vcf
  varscan mpileup2indel $mpileup --output-vcf > $vcf
}*
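
The two tasks above rely on VarScan's default thresholds. To state the 1% significance level mentioned in the introduction explicitly, the calls could pass VarScan's --p-value option; a minimal sketch with placeholder file names, assuming a VarScan version that accepts --p-value and --output-vcf for mpileup2snp and mpileup2indel:

varscan mpileup2snp sample.mpileup --p-value 0.01 --output-vcf 1 > snp.vcf
varscan mpileup2indel sample.mpileup --p-value 0.01 --output-vcf 1 > indel.vcf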

ANNOVAR

ANNOVAR is an annotation tool which annotates a list of genome positions with information taken from an annotation database. The task annovar takes a list of VCF tables and produces two annotation tables: one covering all variants and one restricted to exonic variants. The body of the task concatenates all VCF files and pipes the output to a conversion tool which converts the VCF stream to the ANNOVAR input format. The result is then piped to the actual annotation tool which uses the provided annotation database to create the output tables.

def annovar( vcfs : [File], db : File, vsn : Str ) ->
      <fun : File, exonic : File>

in Bash *{
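  # annotate_variation.pl with -outfile table writes its results to
  # table.variant_function and table.exonic_variant_function, hence the
  # two output names assigned below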
  fun=table.variant_function
  exonic=table.exonic_variant_function
  tar xvf $db
  cat ${vcfs[@]} | \
  convert2annovar.pl -format vcf4 - | \
  annotate_variation.pl -buildver $vsn -geneanno -outfile table - db
}*

Input Data

First, we define the location of the tar archive holding all chromosomes of the reference genome in separate FastA files.

let hg38-tar : File =
  'hg38/hg38.tar';

Next, we define the locations of two pairs of paired-end sequenced read files in FastQ format. All four files together constitute a sample sequenced from a single individual.

let fastq1-lst-gz : [File] =
  ['kgenomes/SRR359188_1.filt.fastq.gz',
   'kgenomes/SRR359195_1.filt.fastq.gz' : File];

let fastq2-lst-gz : [File] =
  ['kgenomes/SRR359188_2.filt.fastq.gz',
   'kgenomes/SRR359195_2.filt.fastq.gz' : File];

The last items of importance are the genome build version and an annotation database which we will use to interpret the variants we identify in the sequence sample.

let build-vsn : Str =
  "hg38";

let db : File =
  'annovar/hg38db.tar';

Workflow Definition

The workflow is defined through data dependencies between task applications. In this section, these data dependencies are established. Note that the order in which the task applications are written down does not matter at all; Cuneiform determines the order of task application solely from data dependencies.

First, we decompress the reference genome using the task untar. The result is a list of plain reference genome files in FastA format, one for each chromosome.

let <lst = fa-lst : [File]> =
  untar( tar = hg38-tar );

Next, we decompress the sample reads, which are stored in gzipped format, using the task gunzip. While Bowtie2 is able to process compressed reads directly, quality control with FastQC requires us to decompress the reads first.

let fastq1-lst : [File] =
  for gz : File <- fastq1-lst-gz do
    ( gunzip( gz = gz )|file ) : File
  end;

let fastq2-lst : [File] =
  for gz : File <- fastq2-lst-gz do
    ( gunzip( gz = gz )|file ) : File
  end;

The task fastqc performs quality control over all sample read files and stores the results in zipped HTML reports.

let qc-lst : [File] =
  for fastq : File <- ( fastq1-lst+fastq2-lst ) do
    ( fastqc( fastq = fastq )|zip ) : File
  end;

The alignment processing part describes how we arrive at a list of multiple pileup files, one for each chromosome in the reference genome. Accordingly, we iterate over the chromosomes and build a Bowtie2 index and a SAMtools index from each of them. Next, we split each sample file and align the resulting read partitions to the current chromosome. We sort the alignments and merge them into one large alignment, from which we create a multiple pileup.

let mpileup-lst : [File] =
  for fa : File <- fa-lst do

    let <idx = idx : File> =
      bowtie2-build( fa = fa );

    let <fai = fai : File> =
      samtools-faidx( fa = fa );

    let sorted-lst : [File] =
      for fastq1 : File <- fastq1-lst, fastq2 : File <- fastq2-lst do

        let <lst = split-lst1 : [File]> = split( file = fastq1 );
        let <lst = split-lst2 : [File]> = split( file = fastq2 );

        let bam-lst : [File] =
          for split1 : File <- split-lst1,
              split2 : File <- split-lst2 do

            let <bam = bam : File> =
              bowtie2-align( idx    = idx,
                             fastq1 = split1,
                             fastq2 = split2 );

            ( samtools-sort( bam = bam )|sorted ) : File
          end;

        ( samtools-merge( bam-lst = bam-lst )|merged ) : File

      end;

    let <merged = merged : File> =
      samtools-merge( bam-lst = sorted-lst );

    ( samtools-mpileup( bam = merged,
                        fa  = fa,
                        fai = fai )|mpileup ) : File

  end;

Next, we call variants on the list of multiple pileups. VarScan gives us two ways of calling variants: calling SNPs and calling indels.

let snp-lst : [File] =
  for mpileup : File <- mpileup-lst do
    ( varscan-snp( mpileup = mpileup )|vcf ) : File
  end;

let indel-lst : [File] =
  for mpileup : File <- mpileup-lst do
    ( varscan-indel( mpileup = mpileup )|vcf ) : File
  end;

Finally, the VCF files are fed into the task annovar, which outputs tables characterizing each identified variant.

let <fun = fun : File, exonic = exonic : File> =
  annovar( vcfs = ( snp-lst+indel-lst ),
           db   = db,
           vsn  = build-vsn );

Query

The workflow query is where we determine which values we’re interested in. Only computing steps that contribute to the query are actually performed.

We collect the VCF variant table, the annotations from coding and non-coding regions, as well as the quality control report. The computing steps needed to calculate these variables constitute the workflow.

<fun    = fun,
 exonic = exonic,
 qc-lst = qc-lst>;
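
Assuming the listing above is stored in a file named variant-call.cfl (the file name is our choice, not part of the workflow) and that FastQC, Bowtie2, SAMtools, VarScan, and ANNOVAR are available on the path, the workflow can be handed to the Cuneiform interpreter, e.g.:

cuneiform variant-call.cfl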