This workflow counts the occurrences of words in a text corpus. It implements a Cuneiform example workflow first published in Bux et al. 2015.

Introduction

The canonical word count is a basic workflow often used to exemplify the fundamental concepts of a data analysis platform. Similar workflows have been published for platforms like Hadoop, Spark, and Flink.

The workflow takes a text corpus and produces a table that associates each word occurring in the corpus with a count denoting the word's absolute frequency. While the example in Bux et al. 2015 uses Python function definitions to perform the counting and aggregation of counts, here we use only standard Unix tools like uniq and awk.
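At its core, the counting idiom is the classic Unix pipeline that puts one word per line, sorts the stream, and counts adjacent duplicates with uniq -c. A minimal sketch, assuming a hypothetical plain text file corpus.txt:

# illustrative sketch; corpus.txt is a hypothetical input file
# put one word per line, normalize case, then count each distinct word
tr -s '[:space:]' '\n' < corpus.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c

The tasks below refine this idea and embed it into a Cuneiform workflow so that the counting step can run on many partitions in parallel.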

Task Definitions

Utility Tasks

unzip

The unzip function consumes a single zip file and extracts its contents into a local subdirectory dir. The extracted files are returned as a list.

def unzip( zip : File ) -> <fileLst : [File]>
in Bash *{
  # extract the archive into the subdirectory dir
  unzip -d dir $zip

  # list the extracted files, prefixing each name with dir/
  fileLst=`ls dir | awk '{print "dir/" $0}'`
}*
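The awk step prepends dir/ to each file name so that the returned paths are valid relative to the task's working directory. For a hypothetical archive texts.zip containing a.txt and b.txt, the body behaves like this sketch:

# texts.zip is a hypothetical archive containing a.txt and b.txt
unzip -d dir texts.zip
ls dir | awk '{print "dir/" $0}'
# prints:
#   dir/a.txt
#   dir/b.txt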

split

The split function consumes a single text file and splits it into partitions of at most 1024 lines each. The partitions are returned as a list.

def split( file : File ) -> <splitLst : [File]>
in Bash *{
  # partition the input into chunks of at most 1024 lines,
  # named txtaa, txtab, txtac, ...
  split -l 1024 $file txt

  # return all partitions matching the prefix
  splitLst=txt*
}*
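split names its output files by appending the suffixes aa, ab, ac, and so on to the given prefix txt, so the pattern txt* matches exactly the partitions just created. An illustrative run on a hypothetical 3000-line file:

# corpus.txt is a hypothetical 3000-line input file
split -l 1024 corpus.txt txt
ls txt*
# txtaa  txtab  txtac   (1024, 1024, and 952 lines, respectively)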

Tasks for Counting Words

count

The count function consumes a single text file and creates from it a table that associates each word occurring in the file with a count. Before counting, it replaces special characters with spaces and transforms all letters to lower-case. The resulting csv table is returned.

def count( txt : File ) -> <csv : File>
in Bash *{
  csv=count.csv

  # replace special characters with spaces, put one word per line,
  # drop empty lines, lower-case all letters, then count duplicates
  sed "s/[^a-zA-Z0-9]/ /g" < $txt | tr -s ' ' '\n' | grep -v '^$' | \
  tr '[:upper:]' '[:lower:]' | sort | uniq -c > $csv
}*
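Note the column layout produced by uniq -c: the count comes first, the word second. The join task below relies on this layout. An illustrative run of this pipeline on a hypothetical one-line sample:

# illustrative one-line sample
echo 'The cat, and the dog' | sed "s/[^a-zA-Z0-9]/ /g" | tr -s ' ' '\n' | \
grep -v '^$' | tr '[:upper:]' '[:lower:]' | sort | uniq -c
#   1 and
#   1 cat
#   1 dog
#   2 the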

join

The join function consumes a list of csv table files and aggregates them, summing the occurrence counts of each word across all partitions. The resulting csv table is returned.

def join( csvLst : [File] ) -> <csv : File>
in Bash *{
  csv=ret.csv

  # concatenate all partial tables and sum the counts per word;
  # each input line holds a count in field 1 and a word in field 2
  cat ${csvLst[@]} | awk '{a[$2]+=$1}END{for(i in a) print i "\t" a[i]}' > $csv
}*
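To see how the awk program merges partial counts, consider two hypothetical partition tables that both contain the word the. Since awk iterates over array keys in no particular order, the order of the output lines may vary:

# create two hypothetical partial count tables
printf '  2 the\n  1 cat\n' > a.csv
printf '  3 the\n  1 dog\n' > b.csv
cat a.csv b.csv | awk '{a[$2]+=$1}END{for(i in a) print i "\t" a[i]}'
# cat	1
# dog	1
# the	5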

Workflow Definition

The workflow definition introduces a variable zip which denotes a zip file containing a single text file. This text file is the corpus on which we perform word counting.

let zip : File = 'sotu/stateoftheunion1790-2014.txt.zip';

After extracting the text file, we split it into smaller partitions. This enables us to perform word counting on each partition in parallel.

let <fileLst = txtLst : [File]> =
  unzip( zip = zip );

Query

Up to now we have stated only task definitions and assignments. Neither of these triggers any computation. By querying an expression we define the workflow's goal. Only computations contributing to this goal are run.

In the query we iterate over all text files txt extracted from the original zip file. In this example, the zip archive contains just one file.

This file is split into partitions, each of which is counted using count. The resulting list of csv tables is then aggregated by calling join.

The result of the workflow is a list of csv files with just one element: the table summarizing the count of every word across all State of the Union speeches.

for txt <- txtLst do

  let <splitLst = splitLst : [File]> =
    split( file = txt );

  let countLst : [File] =
    for split <- splitLst do
      ( count( txt = split )|csv )
      : File
    end;

  ( join( csvLst = countLst )|csv )
  : File

end;
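Assuming the listing above is saved as a script file, say word-count.cfl, next to the sotu directory, it can be run with the Cuneiform interpreter:

# assumes the script was saved as word-count.cfl (hypothetical name)
cuneiform word-count.cfl

The single file in the resulting list is a tab-separated table with one word and its total count per line.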