This workflow uses GATK to call germline variants in a sequence sample. Two additional VQSR steps, one for SNPs and one for indels, filter out common variants. The workflow reimplements a study by Van der Auwera et al. (2013).

Introduction

This example workflow demonstrates the calling of germline variants using the Genome Analysis Toolkit (GATK). From sequence reads of a test individual, given as a set of FastQ files, the workflow finds differences between this test individual and a given reference. Herein, the workflow incorporates prior knowledge in the form of known variants, i.e., common polymorphisms likely to be found in a large part of the population, and filters such common polymorphisms out of the set of reported variants.

Beginning with a set of FastQ read files, the workflow aligns all reads to a reference genome (see Figure 1). This alignment step is complemented by a local realignment step which attempts to improve the alignment locally. After realignment, Base Quality Score Recalibration (BQSR) is performed, in which read quality is reestimated to account for systematic errors introduced by the sequencing method. Then the actual variant calling is performed, producing a comprehensive table of variants in VCF format. Next, these variants are filtered, retaining only variants that score high enough after applying Variant Quality Score Recalibration (VQSR) with a previously trained error model. Such an error model is trained independently for SNPs and for indels from sets of known variants (see Figures 2 and 3). The variants not filtered out in either VQSR step are reported and can be further processed, e.g., by annotating them.

gatk_recal_reads_igv.png

Figure 1: Alignments after Base Quality Score Recalibration (BQSR) viewed in IGV.

This workflow is suited to detecting germline variants. Somatic variants can also be identified, provided that the two VQSR steps are replaced with filtering steps suitable for somatic variant detection. The workflow differs from the variant calling workflow published in a previous post: that workflow applied VarScan as its variant caller, had no realignment, BQSR, or post-call filtering steps, and instead performed variant annotation with ANNOVAR, which is not shown here. The present workflow is centered around the tools BWA for read alignment, Picard tools for alignment processing, and GATK for local realignment, BQSR, variant calling, and VQSR.

Perhaps the most challenging part of establishing such a workflow is obtaining and properly using the prior knowledge in the form of known-variants VCF files. The description of the necessary processing steps, which known variants to use at each step, and where to get them is dispersed over several web pages. Below is a list of useful links which together allow the reproduction of a GATK workflow as presented here.

The remainder of this post covers Cuneiform implementations of the necessary computing steps and data dependencies constituting this workflow.

gatk_recalibrate_snp_tranches_novel_variants.png gatk_recalibrate_snp_tranches_sensitivity_specificity.png

Figure 2: (a) Size comparison among filtering tranches retaining 90%, 99%, 99.9%, and 100% of variants after applying Variant Quality Score Recalibration (VQSR) for SNPs. (b) Sensitivity vs. specificity plot for the different tranches.

gatk_recalibrate_indel_plot_qd_dp.png gatk_recalibrate_snp_plot_qd_dp.png

Figure 3: Plots comparing the coverage (DP) and quality-by-depth (QD) features (a) after VQSR for indels and (b) after VQSR for SNPs.

Task Definitions

This section of the workflow defines all tasks used in the Workflow Definition section below.

Utility Tasks

Utility tasks do generic data transformations like uncompressing archives or concatenating files.

gunzip

The gunzip task takes a gzipped file and decompresses it.

deftask gunzip( out( File ) : gz( File ) )in bash *{
  out=unzipped_${gz%.gz}
  gzip -c -d $gz > $out
}*
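The output name is derived with bash suffix stripping. The following stand-alone sketch (with a hypothetical input name) shows the naming logic in isolation:

```shell
# Naming logic of the gunzip task, in isolation (hypothetical input name).
gz="Homo_sapiens_assembly38.fasta.gz"
out="unzipped_${gz%.gz}"   # ${gz%.gz} strips the trailing .gz suffix
echo "$out"                # unzipped_Homo_sapiens_assembly38.fasta
```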

Next-Generation Sequencing Tasks

Next-Generation Sequencing tasks process sequencing data.

FastQC

The fastqc task performs quality control over a sample in FastQ format. It outputs a report in HTML format summarizing the properties of the sample.

deftask fastqc( zip( File ) : fq( File ) )in bash *{
    fastqc -f fastq --noextract -o ./ $fq
    zip=`ls *.zip`
}*

SAMtools

While Picard tools is the primary alignment processing tool in this workflow, we use SAMtools for some auxiliary tasks. The first of these creates an index of the reference genome, which GATK requires.

deftask samtools-faidx( fai( File ) : fa( File ) )in bash *{
  samtools faidx $fa
  fai=$fa.fai
}*

Also necessary for GATK is a tabix index for each of the VCF files we use as known variants. Note that tabix expects its input VCF files to be bgzip-compressed.

deftask tabix( tbi( File ) : vcf( File ) )in bash *{
  tabix $vcf
  tbi=$vcf.tbi
}*

BWA

The read aligner used in this workflow is BWA. It maps each read from the sequence sample to its most probable position in the reference genome. BWA does this in two separate steps. In the first step, an index of the reference genome is created.

deftask bwa-build( idx( File ) : fa( File ) )in bash *{
  idx=bwaidx.tar
  bwa index -p bwaidx $fa
  tar cf $idx --remove-files bwaidx.*
}*

In the next step, this index is used to align each read from a paired-end sequence sample to the reference. We use SAMtools to convert the output of BWA to the BAM format on the fly.

deftask bwa-align( bam( File ) : idx( File ) [fq1( File ) fq2( File ) group] )
in bash *{
  tar xf $idx
  bam=aligned_reads.bam
  bwa mem -t 4 \
    -R "@RG\tID:group${group}\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1" \
    bwaidx $fq1 $fq2 | samtools view -b - > $bam
}*
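The -R argument attaches read-group information to every aligned read; BWA itself expands the \t sequences into tab characters. A sketch of how the group number is interpolated into the read-group string (group number 1 assumed):

```shell
# Read-group string construction as in bwa-align (group number 1 assumed).
group=1
rg="@RG\tID:group${group}\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1"
echo "$rg"   # the \t sequences stay literal here; bwa mem expands them to tabs
```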

Picard tools

We use Picard tools as the alignment processing tool for this workflow. First, we use Picard tools to sort a BAM alignment file.

deftask picard-sort( sortedbam( File ) : bam( File ) )in bash *{
  sortedbam=sorted.bam
  picard SortSam I=$bam O=$sortedbam SO=coordinate
}*

BAM alignment files can be merged together with picard-merge. This step assumes that the alignments have been sorted with picard-sort first.

deftask picard-merge( mergedbam( File ) : <bam( File )> )in bash *{
  mergedbam=merged.bam
  picard MergeSamFiles `printf " I=%s" ${bam[@]}` SO=coordinate AS=true \
    O=$mergedbam
}*

The picard-dedup task consumes a sorted BAM alignment file and marks all duplicates.

deftask picard-dedup( dedupbam( File ) dedupmetrics( File ) : bam( File ) )
in bash *{
  dedupbam=dedup.bam
  dedupmetrics=dedupmetrics.txt
  picard MarkDuplicates I=$bam O=$dedupbam METRICS_FILE=$dedupmetrics
}*

To run, GATK needs not only the reference index created by SAMtools but also a sequence dictionary, which we create in the picard-dict task.

deftask picard-dict( dict( File ) : fa( File ) )in bash *{
  ln -s $fa ref.fa
  dict=$fa.dict
  picard CreateSequenceDictionary REFERENCE=ref.fa OUTPUT=$dict
}*
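GATK expects the FastA reference, its .fai index, and the sequence dictionary to be co-located under a common basename. The GATK tasks below arrange this with symlinks; a minimal stand-alone sketch with hypothetical file names:

```shell
# Symlink layout expected by GATK (hypothetical file names, temp directory).
tmp=$(mktemp -d) && cd "$tmp"
touch genome.fa genome.fa.fai genome.dict      # stand-ins for the real inputs
ln -sf genome.fa      ref.fa
ln -sf genome.fa.fai  ref.fa.fai
ln -sf genome.dict    ref.dict
ls ref.*                                       # ref.dict ref.fa ref.fa.fai
```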

The picard-bai task creates a BAI index from a given BAM alignment file. BAM files processed by GATK must be indexed this way. While GATK automatically creates BAI indexes for BAM files it produces, we need to create this index once for the initial BAM file to be consumed by GATK.

deftask picard-bai( bai( File ) : bam( File ) )in bash *{
  bai=$bam.bai
  picard BuildBamIndex I=$bam O=$bai
}*

GATK

Local realignment is performed in two steps. In the first step GATK determines a set of intervals in which to perform local realignment. Accordingly, the gatk-createtarget task consumes the Human reference genome, a set of alignments, as well as a set of known variant databases. Along with this input data, the task consumes indexes for all aforementioned data.

deftask gatk-createtarget(
    interval( File )
  : [fa( File ) fai( File ) dict( File )] [bam( File ) bai( File )]
    <known( File )> <tbi( File )> )in bash *{

  interval=target_intervals.list

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $bai $bam.bai

  for ((i=0; i < ${#known[@]}; i+=1))
  do
    ln -sf ${tbi[$i]} ${known[$i]}.tbi
  done

  gatk -T RealignerTargetCreator -R ref.fa -I $bam -o $interval \
    `printf " -known %s" ${known[@]}`
}*
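The backtick-printf construct expands the known bash array into one -known flag per known-variants file. Sketched in isolation with two hypothetical file names:

```shell
# Expansion of a bash array into repeated flags, as used for -known.
known=( phase1.vcf.gz mills.vcf.gz )
flags=$(printf " -known %s" "${known[@]}")
echo "$flags"   # " -known phase1.vcf.gz -known mills.vcf.gz"
```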

In the second step GATK performs local realignment in the previously determined intervals. The task gatk-realign consumes a Human reference genome, a set of alignments with a corresponding realignment interval list, as well as a set of known variant databases. It produces an indexed, locally realigned alignment set in BAM format.

deftask gatk-realign(
    realignedbam( File ) realignedbai( File )
  : [fa( File ) fai( File ) dict( File )]
    [bam( File ) bai( File ) interval( File )]
    <known( File )> <tbi( File )> )in bash *{

  realignedbam=realigned_reads.bam
  realignedbai=realigned_reads.bai

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $bai $bam.bai

  for ((i=0; i < ${#known[@]}; i+=1))
  do
    ln -sf ${tbi[$i]} ${known[$i]}.tbi
  done
  
  gatk -T IndelRealigner \
    -R ref.fa -I $bam -targetIntervals $interval -o $realignedbam \
    `printf " -known %s" ${known[@]}`
}*

Base Quality Score Recalibration (BQSR) on a set of alignments is also performed in two steps. In the first step a base recalibration table is generated. The gatk-baserecal task consumes a Human reference genome, a set of alignments, and a set of known variant databases. It produces the base recalibration table.

deftask gatk-baserecal(
    recal( File )
  : [fa( File ) fai( File ) dict( File )]
    [bam( File ) bai( File )]
    <known( File )> <tbi( File )> )in bash *{

  recal=recal_data.grp

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $bai $bam.bai

  for ((i=0; i < ${#known[@]}; i+=1))
  do
    ln -sf ${tbi[$i]} ${known[$i]}.tbi
  done

  gatk -T BaseRecalibrator -R ref.fa -I $bam -o $recal \
  `printf " -knownSites %s" ${known[@]}`
}*

In the second step a new alignment set is created containing the altered base scores as detailed in the previously generated base recalibration table. The gatk-printreads task consumes a Human reference genome, the original alignment set as well as a base recalibration table. It produces an indexed, recalibrated alignment set in BAM format.

deftask gatk-printreads(
    recalbam( File ) recalbai( File )
  : [fa( File ) fai( File ) dict( File )]
    [bam( File ) bai( File ) recal( File )] ) in bash *{

  recalbam=recal_reads.bam
  recalbai=recal_reads.bai

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $bai $bam.bai

  gatk -T PrintReads -R ref.fa -I $bam -BQSR $recal -o $recalbam
}*

The actual variant calling is performed by the gatk-haplotypecall task. This task locally reassembles haplotypes to identify germline variants in the form of SNPs or indels. The task consumes a Human reference genome, a set of alignments, and an instance of the dbSNP database. It produces an indexed set of variants in VCF format.

deftask gatk-haplotypecall(
    variants( File ) idx( File )
  : [fa( File ) fai( File ) dict( File )] [bam( File ) bai( File )]
    [dbsnp_vcf( File ) dbsnp_vcf_tbi( File )] ) in bash *{

  variants=raw_variants.vcf
  idx=raw_variants.vcf.idx

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $bai $bam.bai
  ln -sf $dbsnp_vcf_tbi $dbsnp_vcf.tbi

  gatk -T HaplotypeCaller -R ref.fa -I $bam -o $variants \
    --genotyping_mode DISCOVERY \
    --output_mode EMIT_VARIANTS_ONLY \
    --dbsnp $dbsnp_vcf
}*
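The resulting VCF file is tab-separated text with fixed leading columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). A sketch parsing one hypothetical record in bash:

```shell
# Parsing one hypothetical VCF record (tab-separated fields).
record=$(printf 'chr1\t10177\trs367896724\tA\tAC\t100.0\tPASS\tDP=20')
IFS=$'\t' read -r chrom pos id ref alt qual filter info <<< "$record"
echo "$chrom:$pos $ref>$alt ($filter)"   # chr1:10177 A>AC (PASS)
```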

Variant Quality Score Recalibration (VQSR) on a set of variants is performed in two steps. In the first step a SNP recalibration model is trained for the variant set using several known variant databases. The gatk-recal-snp task consumes a Human reference genome, the original variant set, and several known variant databases. It produces an indexed variant recalibration table, a table containing the tranches corresponding to different confidence levels, as well as several reports in PDF format.

deftask gatk-recal-snp(
    recal( File ) recal_idx( File ) tranches( File )
    tranches_report( File ) recal_report( File )
  : [fa( File ) fai( File ) dict( File )]
    [vcf( File ) vcf_idx( File )]
    [hapmap_vcf( File ) hapmap_vcf_tbi( File )]
    [omni_vcf( File ) omni_vcf_tbi( File )]
    [phase1_vcf( File ) phase1_vcf_tbi( File )]
    [dbsnp_vcf( File ) dbsnp_vcf_tbi( File )] ) in bash *{

  recal=recalibrate_SNP.recal
  recal_idx=recalibrate_SNP.recal.idx
  tranches=recalibrate_SNP.tranches
  tranches_report=recalibrate_SNP.tranches.pdf
  recal_report=recalibrate_SNP_plots.R.pdf

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $vcf_idx $vcf.idx
  ln -sf $hapmap_vcf_tbi $hapmap_vcf.tbi
  ln -sf $omni_vcf_tbi $omni_vcf.tbi
  ln -sf $phase1_vcf_tbi $phase1_vcf.tbi
  ln -sf $dbsnp_vcf_tbi $dbsnp_vcf.tbi

  gatk -T VariantRecalibrator -R ref.fa -input $vcf \
    -recalFile $recal \
    -tranchesFile $tranches \
    -rscriptFile recalibrate_SNP_plots.R \
    -mode SNP \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap_vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 $omni_vcf \
    -resource:1000g,known=false,training=true,truth=false,prior=10.0 $phase1_vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbsnp_vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP
}*

In the second step the variant recalibration table produced in the first step is used to generate a new variant set with corrected confidence score annotations. The gatk-apply-recal task consumes a Human reference genome, the original set of variants, a variant recalibration table, and a tranches file. It produces an indexed, corrected variant set in VCF format.

deftask gatk-apply-recal(
    recalvcf( File ) recalvcf_idx( File )
  : mode filter
    [fa( File ) fai( File ) dict( File )]
    [vcf( File ) vcf_idx( File )]
    [recal( File ) recal_idx( File )]
    tranches( File ) ) in bash *{

  recalvcf=recalibrated.vcf
  recalvcf_idx=recalibrated.vcf.idx

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $vcf_idx $vcf.idx
  ln -sf $recal_idx $recal.idx

  gatk -T ApplyRecalibration -R ref.fa -input $vcf \
    -mode $mode \
    --ts_filter_level $filter \
    -recalFile $recal \
    -tranchesFile $tranches \
    -o $recalvcf
}*

The previous round of VQSR applied a variant model specifically trained for SNPs. Similarly, we now create a model for indels, further narrowing the set of variants. The gatk-recal-indel task consumes a Human reference genome, the original set of variants, and several known variant databases. It produces an indexed variant recalibration table, a tranches table, as well as a report in PDF format. As with the SNP-specific variant recalibration table, the gatk-apply-recal task can be used to obtain the set of variants with corrected confidence scores.

deftask gatk-recal-indel(
    recal( File ) recal_idx( File ) tranches( File ) recal_report( File )
  : [fa( File ) fai( File ) dict( File )]
    [vcf( File ) vcf_idx( File )]
    [mills_vcf( File ) mills_vcf_tbi( File )]
    [dbsnp_vcf( File ) dbsnp_vcf_tbi( File )]
  ) in bash *{

  recal=recalibrate_INDEL.recal
  recal_idx=recalibrate_INDEL.recal.idx
  tranches=recalibrate_INDEL.tranches
  recal_report=recalibrate_INDEL_plots.R.pdf

  ln -sf $fa ref.fa
  ln -sf $fai ref.fa.fai
  ln -sf $dict ref.dict
  ln -sf $vcf_idx $vcf.idx
  ln -sf $mills_vcf_tbi $mills_vcf.tbi
  ln -sf $dbsnp_vcf_tbi $dbsnp_vcf.tbi

  gatk -T VariantRecalibrator -R ref.fa -input $vcf \
    -recalFile $recal \
    -tranchesFile $tranches \
    -rscriptFile recalibrate_INDEL_plots.R \
    -mode INDEL --maxGaussians 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 $mills_vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbsnp_vcf \
    -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum
}*

Input Data

The input data to this variant calling workflow is the Human reference genome stored in a single compressed FastA file (fa_gz), a read set of 5 pairs of compressed FastQ files (fq1 and fq2), and a number of databases containing known variants (mills_vcf, kgenomes_phase1_vcf, kgenomes_omni_vcf, dbsnp138_vcf, and hapmap_vcf). These input files are consumed by the previously defined tasks.

fa_gz = "Homo_sapiens_assembly38.fasta.gz";

group = 1 2 3 4 5;

fq1 = "kgenomes/SRR062634_1.filt.fastq.gz" "kgenomes/SRR062635_1.filt.fastq.gz"
      "kgenomes/SRR062641_1.filt.fastq.gz" "kgenomes/SRR077487_1.filt.fastq.gz"
      "kgenomes/SRR081241_1.filt.fastq.gz";

fq2 = "kgenomes/SRR062634_2.filt.fastq.gz" "kgenomes/SRR062635_2.filt.fastq.gz"
      "kgenomes/SRR062641_2.filt.fastq.gz" "kgenomes/SRR077487_2.filt.fastq.gz"
      "kgenomes/SRR081241_2.filt.fastq.gz";

% known variants
mills_vcf           = "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz";
kgenomes_phase1_vcf = "1000G_phase1.snps.high_confidence.hg38.vcf.gz";
kgenomes_omni_vcf   = "1000G_omni2.5.hg38.vcf.gz";
dbsnp138_vcf        = "Homo_sapiens_assembly38.dbsnp138.vcf.gz";
hapmap_vcf          = "hapmap_3.3.hg38.vcf.gz";
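The square brackets around [fq1( File ) fq2( File ) group] in the bwa-align signature mark these parameters as correlated, so Cuneiform combines the three lists element-wise and each read pair is aligned under its own read-group number. The pairing corresponds to this plain bash loop (abbreviated, hypothetical file names):

```shell
# Element-wise pairing of the group, fq1, and fq2 lists (hypothetical names).
group=( 1 2 )
fq1=( sampleA_1.fastq.gz sampleB_1.fastq.gz )
fq2=( sampleA_2.fastq.gz sampleB_2.fastq.gz )
for ((i=0; i<${#group[@]}; i++)); do
  echo "group ${group[$i]}: ${fq1[$i]} ${fq2[$i]}"
done
```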

Workflow Definition

The previously defined tasks are the building blocks of the workflow. Calling these tasks with the appropriate inputs constitutes a data dependency graph which can be queried later.

First, we extract the gzipped Human reference genome and create a number of indexes on it: fai lists the names of the sequences in the reference, bwaidx is the reference index used for read mapping with BWA, and dict is a sequence dictionary created with Picard tools and required by GATK.

fa     = gunzip( gz: fa_gz );
fai    = samtools-faidx( fa: fa );
bwaidx = bwa-build( fa: fa );
dict   = picard-dict( fa: fa );

Next, we create a Tabix index for each of the variant resources representing our prior knowledge.

mills_vcf_tbi           = tabix( vcf: mills_vcf );
kgenomes_phase1_vcf_tbi = tabix( vcf: kgenomes_phase1_vcf );
kgenomes_omni_vcf_tbi   = tabix( vcf: kgenomes_omni_vcf );
dbsnp138_vcf_tbi        = tabix( vcf: dbsnp138_vcf );
hapmap_vcf_tbi          = tabix( vcf: hapmap_vcf );

To assess the quality of the read set we create a FastQC report.

qc = fastqc( fq: fq1 fq2 );

We use the previously created BWA index to map the read set to the Human reference genome. The resulting alignments in BAM format are sorted independently, merged into a single BAM file, and then deduplicated using Picard tools.

aligned_reads = bwa-align(
  idx:   bwaidx,
  fq1:   fq1,
  fq2:   fq2,
  group: group );

sorted_reads    = picard-sort( bam: aligned_reads );
merged_reads    = picard-merge( bam: sorted_reads );
dedup_reads     = picard-dedup( bam: merged_reads );
dedup_reads_bai = picard-bai( bam: dedup_reads );

The deduplicated alignments are now locally realigned using GATK.

target_intervals = gatk-createtarget(
  fa:    fa,
  fai:   fai,
  dict:  dict,
  bam:   dedup_reads,
  bai:   dedup_reads_bai,
  known: kgenomes_phase1_vcf mills_vcf,
  tbi:   kgenomes_phase1_vcf_tbi mills_vcf_tbi );

realigned_reads realigned_reads_bai = gatk-realign(
  fa:       fa,
  fai:      fai,
  dict:     dict,
  bam:      dedup_reads,
  bai:      dedup_reads_bai,
  interval: target_intervals,
  known:    kgenomes_phase1_vcf mills_vcf,
  tbi:      kgenomes_phase1_vcf_tbi mills_vcf_tbi );

After local realignment, Base Quality Score Recalibration (BQSR) is performed. The gatk-baserecal step first derives a base recalibration table from the realigned BAM file; the gatk-printreads step then writes a new BAM file in which the original quality scores are replaced by the recalibrated ones.

recal_data = gatk-baserecal(
  fa:    fa,
  fai:   fai,
  dict:  dict,
  bam:   realigned_reads,
  bai:   realigned_reads_bai,
  known: kgenomes_phase1_vcf mills_vcf dbsnp138_vcf,
  tbi:   kgenomes_phase1_vcf_tbi mills_vcf_tbi dbsnp138_vcf_tbi );

recal_reads recal_reads_bai = gatk-printreads(
  fa:    fa,
  fai:   fai,
  dict:  dict,
  bam:   realigned_reads,
  bai:   realigned_reads_bai,
  recal: recal_data );

The deduplicated, realigned, base-recalibrated reads are now ready for the actual variant calling step, which is performed by GATK’s haplotype caller.

raw_variants raw_variants_idx = gatk-haplotypecall(
  fa:            fa,
  fai:           fai,
  dict:          dict,
  bam:           recal_reads,
  bai:           recal_reads_bai,
  dbsnp_vcf:     dbsnp138_vcf,
  dbsnp_vcf_tbi: dbsnp138_vcf_tbi );

In the first of two variant recalibration steps, a SNP-specific error model is trained on known variants and applied to the raw variant call set to readjust the variant quality scores.

recalibrate_snp recalibrate_snp_idx tranches_snp tranches_report recal_report_snp = gatk-recal-snp(
  fa:             fa,
  fai:            fai,
  dict:           dict,
  vcf:            raw_variants,
  vcf_idx:        raw_variants_idx,
  hapmap_vcf:     hapmap_vcf,
  hapmap_vcf_tbi: hapmap_vcf_tbi,
  omni_vcf:       kgenomes_omni_vcf,
  omni_vcf_tbi:   kgenomes_omni_vcf_tbi,
  phase1_vcf:     kgenomes_phase1_vcf,
  phase1_vcf_tbi: kgenomes_phase1_vcf_tbi,
  dbsnp_vcf:      dbsnp138_vcf,
  dbsnp_vcf_tbi:  dbsnp138_vcf_tbi );

recalibrated_snps_raw_indels_vcf recalibrated_snps_raw_indels_vcf_idx = gatk-apply-recal(
  mode:      "SNP",
  filter:    "99.5",
  fa:        fa,
  fai:       fai,
  dict:      dict,
  vcf:       raw_variants,
  vcf_idx:   raw_variants_idx,
  recal:     recalibrate_snp,
  recal_idx: recalibrate_snp_idx,
  tranches:   tranches_snp );

In the second recalibration step, an indel-specific model is created from known variants and applied to the previously SNP-recalibrated call set. The result is the final call set with variants adjusted to account for both known SNPs and indels.

recalibrate_indel recalibrate_indel_idx tranches_indel recal_report_indel = gatk-recal-indel(
  fa:            fa,
  fai:           fai,
  dict:          dict,
  vcf:           recalibrated_snps_raw_indels_vcf,
  vcf_idx:       recalibrated_snps_raw_indels_vcf_idx,
  mills_vcf:     mills_vcf,
  mills_vcf_tbi: mills_vcf_tbi,
  dbsnp_vcf:     dbsnp138_vcf,
  dbsnp_vcf_tbi: dbsnp138_vcf_tbi );

recalibrated_variants_vcf recalibrated_variants_vcf_idx = gatk-apply-recal(
  mode:      "INDEL",
  filter:    "99.0",
  fa:        fa,
  fai:       fai,
  dict:      dict,
  vcf:       recalibrated_snps_raw_indels_vcf,
  vcf_idx:   recalibrated_snps_raw_indels_vcf_idx,
  recal:     recalibrate_indel,
  recal_idx: recalibrate_indel_idx,
  tranches:  tranches_indel );

Query

The workflow query determines the output of the workflow. Only processing steps contributing to that output are actually performed.

We query the recalibrated variants, the quality control reports as well as the visualizations.

recalibrated_variants_vcf qc
tranches_report recal_report_snp recal_report_indel;