This workflow calculates the percentage of GC content in a given sequence file in FastA format. It is based on an example workflow in SciPipe by Samuel Lampa.

Introduction

This workflow takes as an input a genomic sequence file in FastA format. This file is split into many partitions and the GC and AT amount are independently counted. The result of the workflow is the share of the GC content in the overall sequence composition. Running the workflow we learn that the GC content in the Human Y chromosome is 39.2%.

This workflow has been conceived to demonstrate the definition of a simple bioinformatics use case in a given workflow language. The workflow uses only standard Linux shell commands. Thus, tool reproducibility is not an issue with this workflow. In addition, even though the workflow exploits data parallelism by partitioning the input file, it uses only little resources.

For more information about this workflow please refer to

  • the SciPipe example workflow
  • a blog post evaluating different implementations of this use case without using any workflow system

Task Definitions

In this task we introduce the task definitions which are the building blocks of the workflow. First, we present file utilities which perform the IO related parts of the workflow. In a second section, we introduce the basic arithmetics used to obtain the GC ratio.

File utilities

The file utilities introduced in this section subsume the operations on files needed to count GC and AT content.

gunzip

The gunzip task consumes a gzipped file and decompresses it.

deftask gunzip( out( File ) : gz( File ) )in bash *{
  out=unzipped_${gz%.gz}
  gzip -c -d $gz > $out
}*

split

We use the split task to partition the input data. The task consumes an input file and a parameter nline specifying the number of lines in each partition. The output is a list of partition files.

deftask split( <partlist( File )> : file( File ) nline ) in bash *{
  split -l $nline $file split_$file.
  partlist=split_*
}*

char-count

The char-count task consumes an input file and a pattern string specifying which characters to count. The output is a number denoting the number any of the characters in the pattern string were encountered reading the input file.

deftask char-count( cnt : file( File ) pattern )in bash *{
  set +o pipefail
  cnt=`cat $file | fold -w 1 | grep [$pattern] | wc -l`
}*

Arithmetics

Cuneiform is minimal in the sense that it does not provide any features exceeding workflow composition. I.e., arithmetics or boolean operations have to be performed in foreign tasks. In this section we introduce two arithmetic operations needed to calculate the GC ratio from the partition counts.

sum

The sum task takes a list of numbers and sums them up.

deftask sum( result : <n> )in perl *{
  $result = eval join '+', @n;
}*

div

The div task takes two arguments: nom, the nominator and den, the denominator of a division operation returning the quotient.

deftask div( result : nom den )in perl *{
  $result = $nom/$den;
}*

Input Data

The only input file to this workflow is the Human Y chromosome. An important parameter for the file splitting is the number of lines per partition, which we set to 100000.

fa_gz           = "Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz";
lines_per_split = 100000;

Workflow Definition

In the workflow definition section we call tasks on input data. These task calls constitute the data dependencies of the workflow. First, we decompress the input sequence file in FastA format. Then we split it into partitions with 100000 lines each. In all partitions we count the absolute frequencies of GC and AT content and sum the partition counts to obtain the overall GC and AT count. Last, we divide the GC count by overall count and assign the result to the variable gc_ratio.

fa       = gunzip( gz: fa_gz );
split_fa = split( file: fa, nline: lines_per_split );
gc       = sum( n: char-count( file: split_fa, pattern: "GC" ) );
at       = sum( n: char-count( file: split_fa, pattern: "AT" ) );
gc_ratio = div( nom: gc, den: sum( n: gc at ) );

Query

The query indicates the value we’re actually interested. Only operations that contribute to the query are actually executed.

gc_ratio;