This workflow demonstrates ChIP-Seq analysis in Escherichia coli by performing peak detection in a treatment sample using MACS. It is a Cuneiform transcription of a hands-on introduction to ChIP-Seq by Morgane Thomas-Chollier which itself is based on a study by Myers et al. 2013.

Introduction

In a ChIP-Seq experiment the DNA binding sites of transcription factors or other DNA-binding proteins are first selected using chromatin immunoprecipitation and then sequenced. By comparing the sequencing coverage in such a treatment sample to the coverage observed in a control sample where no DNA selection has been performed it is possible to find the DNA binding sites of the protein in question.

In this workflow we are given the sequenced reads of a treatment and control sample of an experiment conducted in Escherichia coli. After aligning both samples to the reference sequence we can perform peak calling. In this workflow we use the peak caller MACS.

The peak caller not only identifies peak regions and summits but also outputs coverage information which can be displayed in genome browsers like IGV (see Figure 1). The data produced by this workflow is ready to be used with downstream processing like, e.g., motif discovery with RSAT. This web-based tool not only produces statistics about peak length distrubtion (see Figure 2), single and dinucleotide composition (see Figure 3) as well as their position profiles around the peak (see Figures 4 and 5) but also gathers information about the motifs discovered in the peak regions. Figure 6 shows the forward and reverse sequence logos of the most often encountered motif. Figure 7 depicts the positions of that motif relative to the peak center. Figure 8 shows the probability of encountering a certain number of occurrences of this motif per peak region.

The Cuneiform workflow described in this post consists of four parts. First, the task definitions are given. All tasks used in this workflow are defined in Bash. In the next section we define the input data consisting of two SRA samples and a reference genome. Next, the workflow structure is established by applying the previously defined tasks to the input data. In the last section the query is given which defines what the output of the workflow is.

Setting up a machine with Chef (installing software and downloading data) takes around 90 minutes. Running this example workflow on a machine with 4 cores takes around 10 minutes.

chip-seq_peaks_igv.png

Figure 1: Read alignments and coverage visualization for both treatment and control sample with annotated genes, peak regions and summits viewed in IGV.

chip-seq_peak-motifs_test_seqlen_distrib.png

Figure 2: Peak region length distribution generated with RSAT. For each peak region length the absolute frequencies are given.

chip-seq_peak-motifs_test_heatmap-1str-ovlp_1nt.png chip-seq_peak-motifs_test_heatmap-1str-ovlp_2nt.png

Figure 3: Transition frequencies for both single nucleotides and dinucleotides.

chip-seq_peak-motifs_test_profiles-1str-ovlp_1nt_ci20.png

Figure 4: Single nucleotide composition profiles. Occurrences per position around the peak center for each nucleotide.

chip-seq_peak-motifs_test_profiles-1str-ovlp_2nt_ci20.png

Figure 5: Dinucleotide composition profiles. Occurrences per position around the peak center for each nucleotide 2-gram.

chip-seq_peak-motifs_oligos-2str-noov_6nt_mkv3_pssm_count_matrices_logo_m1.png chip-seq_peak-motifs_oligos-2str-noov_6nt_mkv3_pssm_count_matrices_logo_m1_rc.png

Figure 6: Forward and reverse sequence logos for most frequent motif.

chip-seq_peak-motifs_oligos_6nt_mkv3_m1_site_distrib.png

Figure 7: Distribution of predicted sites. For each position around the peak center the absolute frequencies for the motif are given.

chip-seq_peak-motifs_oligos_6nt_mkv3_m1_sites_per_peak.png

Figure 8: Number of predicted sites per peak.

Task Definitions

In this section we give the task definitions that wrap the command line tools used in this workflow. All tasks are assumed to terminate, be deterministic and be side effect-free.

Next-Generation Sequencing Tasks

sra-tools

We use sra-tools to extract the sequence reads in FastQ format from an archive in SRA format. The -Z flag makes fastq-dump output the FastQ sequence on its standard output channel.

deftask fastq-dump( fastq( File ) : sra( File ) )in bash *{
  fastq=$sra.fastq
  fastq-dump -Z $sra > $fastq
}*

FastQC

We use the tool FastQC to generate quality reports in HTML format. These HTML reports are output as zip archives which can be decompressed and viewed in a Browser.

deftask fastqc( zip( File ) : fq( File ) )in bash *{
  fastqc -f fastq --noextract -o ./ $fq
  zip=`ls *.zip`
}*

Bowtie

We use Bowtie to perform read mapping. The read mapper first needs to create an index of the reference sequence. We create this index in the task bowtie-build. Note that we do not return the file list constituting the index but instead create a tar archive to hold the index files. This is an easy way to keep Cuneiform from altering the index filenames.

deftask bowtie-build( idx( File ) : fa( File ) )in bash *{
  bowtie-build $fa btidx
  idx=idx.tar
  tar cf $idx btidx.* --remove-files
}*

In a second step we use the previously created index to map sequence reads stored in FastQ format against the reference genome. The -S flag makes Bowtie output the alignments in SAM format while the -p 2 flag tells the Bowtie instance to use two cores.

deftask bowtie-align( sam( File ) : idx( File ) fq( File ) )in bash *{
  tar xf $idx
  sam=$fq.sam
  bowtie btidx -q $fq -v 2 -m 1 -3 1 -S -p 2 > $sam
}*

MACS

The MACS tool is used to perform peak calling. It consumes a pair of alignments one for the treatment and one for the control group and outputs peak regions and summits in bed format. In addition it creates a number of tables in xls format as well as coverage information in the form of bed graphs.

deftask macs(
    peaks( File ) summits( File ) <xls( File )>
    bedgraph_tag( File ) bedgraph_ctl( File )
  : tag_sam( File ) ctl_sam( File ) )in bash *{

  macs14 -t $tag_sam -c $ctl_sam --format SAM --gsize 4639675 --name "macs14" \
    --bw 400 --keep-dup 1 --bdg --single-profile --diag

  peaks=macs14_peaks.bed
  summits=macs14_summits.bed
  xls=(macs14_diag.xls macs14_negative_peaks.xls)
  bedgraph_tag=macs14_MACS_bedGraph/treat/macs14_treat_afterfiting_all.bdg.gz
  bedgraph_ctl=macs14_MACS_bedGraph/control/macs14_control_afterfiting_all.bdg.gz
}*

SAMtools

While SAMtools is not directly needed for peak calling, we use it to remove duplicates from the alignments generated in read mapping to obtain an alternative pair of coverage graphs using deepTools. In addition we use SAMtools to create a reference index used by bedtools to extract the FastA sequences of the identified peak regions.

Sorting is a step necessary prior to removing duplicates. The task samtools-sort consumes an alignment file in SAM format, converts it to BAM and sorts it without writing the intermediate BAM file to disk.

deftask samtools-sort( sorted_bam( File ) : sam( File ) )in bash *{
  sorted_bam=sorted.bam
  samtools view -bS $sam | samtools sort -o $sorted_bam -
}*

The task samtools-rmdup consumes a sorted alignment file in BAM format and outputs a BAM file containing only unique alignments.

deftask samtools-rmdup( dedup_bam( File ) : bam( File ) )in bash *{
  dedup_bam=dedup.bam
  samtools rmdup -s $bam $dedup_bam
}*

The task samtools-index creates an index on a given alignment file in BAM format.

deftask samtools-index( bai( File ) : bam( File ) )in bash *{
  bai=$bam.bai
  samtools index $bam $bai
}*

The task samtools-faidx creates and index on a given reference sequence in FastA format.

deftask samtools-faidx( fai( File ) : fa( File ) )in bash *{
  fai=$fa.fai
  samtools faidx $fa
}*

deepTools

While MACS provides us with viable coverage information we can use deepTools to as an alternative for generating coverage information.

deftask bamcoverage( bedgraph( File ) : bam( File ) bai( File ) )in bash *{
  bedgraph=$bam.bedgraph
  ln -sf $bai $bam.bai
  bamCoverage --bam $bam --outFileName $bedgraph --normalizeTo1x 4639675 \
    --outFileFormat bedgraph
}*

Prior to creating the coverage graph we convert to BAM, sort, deduplicate, and index the alignments obtained with Bowtie.

deftask deeptools( bedgraph( File ) : sam( File ) ) {
  sorted_bam       = samtools-sort( sam: sam );
  dedup_sorted_bam = samtools-rmdup( bam: sorted_bam );
  dedup_sorted_bai = samtools-index( bam: dedup_sorted_bam );
  bedgraph         = bamcoverage( bam: dedup_sorted_bam, bai: dedup_sorted_bai );
}

bedtools

We can use bedtools to obtain the nucleotide sequence in FastA format for all peak regions stored in a bed file. Note that we have to rename the SAMtools FastA index to match the name of the corresponding FastA file because bedtools implicitly assumes the index to find the index under this name or create a new one.

deftask bedtools-getfasta( bed_fa( File ) : fa( File ) fai( File ) bed( File ) )
in bash *{
  bed_fa=$bed.fa
  ln -sf $fai $fa.fai
  bedtools getfasta -fi $fa -bed $bed -fo $bed_fa
}*

Custom Scripts

Peak Restriction

We can shorten the peak regions in a bed file, which is essentially a table storing the start and stop position of a peak region, by adding 100 bp to the start position while subtracting 100 bp from the ending position. This can be achieved by evaluating a short expression in Perl.

deftask restrict-peaks( restricted_bed( File ) : bed( File ) )in bash *{
  restricted_bed=$bed.100.bed
  perl -lane '$start=$F[1]+100; $end = $F[2]-100 ; print "$F[0]\t$start\t$end"' \
    $bed > $restricted_bed
}*

Input Data

The input data for this workflow are a pair of sequence archive files in SRA format obtained from NCBI GEO as well as the Escherichia coli reference genome in FastA format. We are using an older version of the reference genome here, because we want to reproduce the results from the original publication as accurately as possible.

tag_sra = "sra/SRR576933.sra";
ctl_sra = "sra/SRR576938.sra";
fa      = "ref/Escherichia_coli_K_12_MG1655.fasta";

Workflow Definition

In the workflow definition section we apply the previously defined tasks to the input data defined in the previous section. This way we define the data dependency graph that constitutes the workflow.

First we extract the read set in FastQ format from the SRA archives downloaded from GEO by applying the task fastq-dump.

tag_fq = fastq-dump( sra: tag_sra );
ctl_fq = fastq-dump( sra: ctl_sra );

Now we call the fastqc task to obtain read quality reports for both the data sets in question. Note that we can map the task fastqc over the two-element list tag_fq ctl_fq just by calling the task on the list. Internally, the Cuneiform interpreter rewrites this assignment to qc = fastqc( fq: tag_fq ) fastqc( fq: ctl_fq );. Since both calls to fastqc are independent, the interpreter runs them parallely if possible.

qc = fastqc( fq: tag_fq ctl_fq );

The refererence genome needs to be indexed in two ways: First, we use the task samtools-faidx to obtain a SAMtools index for the reference. This reference index is later used together with bedtools. While bedtools creates this index if it cannot find it, it is good practice to create the reference index in an extra step, which the Cuneiform interpreter can run in parallel with other tasks having the index already created when bedtools requires it.

fai = samtools-faidx( fa: fa );

The second index over the reference is needed to perform read mapping with Bowtie. We build it from the reference genome using bowtie-build and perform read mapping for both treatment and control group using bowtie-align.

idx     = bowtie-build( fa: fa );
tag_sam = bowtie-align( idx: idx, fq: tag_fq );
ctl_sam = bowtie-align( idx: idx, fq: ctl_fq );

We use the alignments in SAM format to perform peak calling with the task macs.

peaks summits xls tag_macs_bedgraph ctl_macs_bedgraph = macs(
  tag_sam: tag_sam, ctl_sam: ctl_sam );

Now we use deepTools to generate a set of coverage graphs in addition to the ones generated with MACS. Note that both calls to deeptools are independent from the call to the macs task and, thus, can run in parallel to it.

tag_deeptools_bedgraph = deeptools( sam: tag_sam );
ctl_deeptools_bedgraph = deeptools( sam: ctl_sam );

The peak regions generated with MACS

peaks_100    = restrict-peaks( bed: peaks );

peaks_fa     = bedtools-getfasta( fa: fa, fai: fai, bed: peaks );
peaks_100_fa = bedtools-getfasta( fa: fa, fai: fai, bed: peaks_100 );

Query

In the last part we query the variables we’re actually interested in. Only computing steps that actually contribute to the query are processed by the Cuneiform interpreter.

We query the quality reports, the peak sequences in FastA format (both orginial and short), the coverage graph for the treatment as well as the control group generated by MACS and also by deepTools.

qc peaks peaks_fa peaks_100_fa
tag_macs_bedgraph ctl_macs_bedgraph
tag_deeptools_bedgraph ctl_deeptools_bedgraph;