This workflow performs a phylogenetic study of the CHASE domain, a protein domain resposible for cytokinin reception in the two-component system in late diverging plants. It is a reimplementation of a paper by Pils and Heyl 2009.

Introduction

The two-component system is a signalling pathway in some plants and bacteria for detecting certain types of signals at the cell’s membrane, relaying that signal to the inside, and up- or down-regulating target genes. As the name suggests, this signalling pathway consists of two components:

  • The histidine kinase that detects the extracellular signal and auto-phosphorylates at a specific residue in the cytoplasmic region of the protein.

  • The response regulator which undergoes a conformational change when it is phosphorylated which has the effect that a target gene is up- or down-regulated.

In plants this signal needs to be relayed by an additional histidine phosphotransferase which transfers the phosphoryl group from the phosphorylated histidine kinase to the plant cell’s nucleus where the response regulators reside.

What signal a histidine kinase responds to is determined by its receptor domain. One such receptor domain is the CHASE domain which responds to a certain type of plant hormone called cytokinins. Many late diverging plants like grass or crops are sensitive to cytokinins. Accordingly, the CHASE receptor domain has been identified in the genomes of many plants for which gene indices are available. However, many early diverging plants like some algae lack the CHASE domain which suggests that the cytokinin-specific two-component system has been acquired by plants at some point in evolutiononary process.

With gene indices available for many plants, early diverging and late diverging, we can search for homologs of the CHASE domain in these plants and perform a phylogenetic analysis. In this example workflow, we search CHASE homologs in seven different plants including three algae, a moss, a farn, the model plant Arabidopsis thaliana, and rice.

The first step is to create a HMM model for a CHASE seed alignment which we obtained from the Pfam database. With this HMM model, we use HMMER to search for CHASE homologs in the amino acid sequences of the gene indices we obtained for the aforementioned plants from the JGI, Gramene, and TAIR. Then we use MUSCLE to create a de-novo multiple sequence alignment from the putative CHASE sequences we identified (see Figure 1) and generate a phylogenetic tree from the alignment using ClustalW. Note that the originial publication uses PHYML and MrBayes to construct the phylogenetic tree. Finally, we use Phylip to generate a pdf from the generated phylogenetic tree (see Figure 2).

In the following we describe the Cuneiorm workflow that consumes the seed alignment and the gene indices and generates a phylogenetic tree in the form of a pdf document.

phylogeny_ugene_msa.png

Figure 1: Multiple sequence alignment of all protein domains identified by HMMER viewed in UGENE.

g4653.png

Figure 2: Phylogenetic tree of all CHASE homologs created with ClustalW from the multiple sequence alignment.

Task Definitions

In this section we give the task definitions that wrap the command line tools used in this workflow. All tasks are assumed to terminate, be deterministic and be side effect-free.

Utility Tasks

gunzip

The gunzip task consumes a compressed file and extracts it.

deftask gunzip( out( File ) : gz( File ) ) in bash *{
  out=unzipped_${gz%.gz}
  gzip -c -d $gz > $out
}*

cat

The cat task consumes a list of files and concatenates them to a single file.

deftask cat( out( File ) : <lst( File )> ) in bash *{
  out=out.txt
  cat ${lst[@]} > $out
}*

Bioinformatics Tools

HMMER

The hmmbuild task consumes a seed alignment in stockholm format and generates a HMM model from it.

deftask hmmbuild( hmm( File ) : sto( File ) ) in bash *{
  hmm=${sto%.sto}.hmm
  hmmbuild $hmm $sto
}*

The hmmsearch task consumes a set of sequences in FastA format and matches each sequence to a HMM model generated with hmmbuild. The task produces a multiple sequence alignment in FastA format.

deftask hmmsearch( hits( File ) : hmm( File ) fa( File ) ) in bash *{
  hits=hits_${fa%.fa}.sto
  hmmsearch -A $hits $hmm $fa
}*

The sto2fa task uses the esl-reformat command that is shipped with HMMER to extract all sequences from a multiple sequence alignment in stockholm format and stores them in FastA format. The alignment information is lost in the process.

deftask sto2fa( fa( File ) : sto( File ) ) in bash *{
  fa=${sto%.sto}.fa
  esl-reformat fasta $sto > $fa
}*

MUSCLE

The muscle task consumes a FastA file containing a set of sequences and creates a multiple sequence alignment, again in FastA format.

deftask muscle( msa( File ) : fa( File ) ) in bash *{
  msa=msa.fa
  muscle -quiet -in $fa > $msa
}*

ClustalW

The clustalw_tree task consumes a multiple sequence alignment in FastA format and produces a phylogenetic tree in Phylip format.

deftask clustalw_tree( ph( File ) : fa( File ) ) in bash *{
  ph=${fa%.fa}.ph
  clustalw -tree -infile=$fa
}*

Phylip

The phylip_drawtree task consumes a phylogenetic tree in Phylip format and produces a pdf. The Phylip tool actually generates a file in postscript format which we convert to pdf using Ghostscript.

deftask phylip_drawtree( pdf( File ) : ph( File ) )in bash *{
  pdf=tree.pdf
  ln -sf $ph intree
  echo Y | phylip drawtree
  ps2pdf plotfile $pdf
}*

Input Data

The input data for the workflow is the CHASE seed alignment as well as the amino acid sequences acquired from the gene indices of the plants under consideration.

% CHASE seed alignment
chase_seed_sto = "seed/PF03924_seed.txt";

% plant gene indices
fa_gz = "fa/Chlre4_all_proteins.fasta.gz"
        "fa/Poptr1_1_GeneModels_AllModels_20081112_aa.fasta.gz"
        "fa/proteins.Phypa1_1.AllModels.fasta.gz"
        "fa/Volca1.GeneCatalog_2007_09_13.proteins.fasta.gz"
        "fa/Selmo1_GeneModels_FilteredModels2_aa.fasta.gz"
        "fa/Arabidopsis_thaliana.TAIR10.31.pep.all.fa.gz"
        "fa/Oryza_sativa.IRGSP-1.0.31.pep.all.fa.gz";

Workflow Definition

In the workflow definition section we apply the previously defined tasks to the input data defined in the previous section. This way we define the data dependency graph that constitutes the workflow.

First we build a HMM model from the seed alignment in stockholm format.

hmm     = hmmbuild( sto: chase_seed_sto );

Also we need to extract the amino acid sequences in FastA format from the gzipped archives.

fa      = gunzip( gz: fa_gz );

Next we use HMMER to search for CHASE homologs in each plant. For each plant HMMER produces a multiple sequence alignment in stockholm format containing all sequences for which similarity with the HMM model is significant. We extract the sequences contained in this multiple sequence alignment and convert them to FastA format so that we can concatenate all matching sequences to one large FastA file.

hits_fa = cat( lst: sto2fa( sto: hmmsearch( hmm: hmm, fa: fa ) ) );

From this FastA file we generate a new multiple sequence alignment using MUSLCE.

msa     = muscle( fa: hits_fa );

Finally, we generate a phylogenetic tree from the multiple sequence alignment with ClustalW and convert it to pdf using Phylip.

tree    = clustalw_tree( fa: msa );
pdf     = phylip_drawtree( ph: tree );

Query

In the last part we query the variables we’re actually interested in. Only computing steps that actually contribute to the query are processed by the Cuneiform interpreter.

The end result of this workflow we’re interested in is the phylogenetic tree in pdf format.

pdf;