Introduction

One of the central systems in the technology stack I work with is the Common Workflow Language (CWL).

The Common Workflow Language (CWL) is a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments.

I disagree that describing a workflow in CWL makes it scalable by itself. First, you have to define “scattering” logic, so that data that would otherwise be processed linearly can be processed in parallel. Essentially, we define how to scatter in CWL by specifying an input that is an array of something and dispatching each element of that array to a worker. Scattering over inputs like strings, numbers, and files is straightforward enough.
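For reference, a plain scatter over a File array looks something like this minimal sketch; the step and the ProcessOneFile.cwl tool it runs are hypothetical stand-ins:

# a minimal scatter over File[]; ProcessOneFile.cwl is a hypothetical single-file tool
class: Workflow
cwlVersion: v1.0
requirements:
  - class: ScatterFeatureRequirement
inputs:
  files:
    type: File[]
outputs: []
steps:
  ProcessOneFile:
    run: ProcessOneFile.cwl
    in:
      file: files
    out: []
    scatter:
      - file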

However, you may need to scatter structured information, like the paired read fastqs generated by Illumina SBS sequencing: each node should get a pair of files, not just an individual fastq. This is tricky and rather poorly documented.

In this tutorial, I will demonstrate how to pair fastq files – and, by extension, any structured data – by setting up a dummy workflow. The workflow will consist of two nodes: an ExpressionTool that groups corresponding input fastqs into pairs and a CommandLineTool that operates on a read pair. Finally, we’ll gather the output of the CommandLineTool nodes as a workflow output:

[Figure: diagram of the dummy workflow, with fastqs flowing through PairReadFiles into scattered CommandLineTool nodes and gathered read counts]

The CommandLineTool will count the number of records in a fastq file using Biopython.

Setting up the Workflow

CWL is all about defining inputs and outputs for tools that run inside Docker. We start with the boilerplate, including the ScatterFeatureRequirement that allows the workflow to scatter steps.

# Workflow.cwl.1
class: Workflow
cwlVersion: v1.0
requirements:
  - class: ScatterFeatureRequirement
inputs: []
outputs: []
steps: []

The first step in constructing a pipeline is to determine its start and end points. Our dummy pipeline will start with a set of fastq files and end with the number of records in each read pair.

# Workflow.cwl.2
class: Workflow
cwlVersion: v1.0
requirements:
  - class: ScatterFeatureRequirement
inputs:
  fastqs:
    type: File[]
outputs:
  readCounts: ?
steps: []

So, how do we go from A to B? From fastqs to readCounts?

We begin by structuring our inputs. We want to take the flat list of fastqs and output a list of records, each containing the mated R1 and R2 fastq files. Since we are just moving things around, we use an ExpressionTool rather than a CommandLineTool. This is a best practice.

Here is the code:

# PairReadFiles.cwl
class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
cwlVersion: v1.0
inputs:
  fastqs:
    type: File[]
outputs:
  pairedReads:
    type:
      type: array
      items:
        type: record
        fields:
          - name: r1
            type: File
          - name: r2
            type: File
expression: |-
  ${
      // group the fastqs by sample
      var samples = {};
      for (var i = 0; i < inputs.fastqs.length; i++) {
        var f = inputs.fastqs[i];
        // this regex will match files from basespace, e.g. SAMPLE_R1_.fastq.gz
        var reg_groups = f.basename.match(/^(.+)(_R[12]_)(.+)$/);
        var samplename = reg_groups[1];
        var flag = reg_groups[2];
        var ext = reg_groups[3];

        // the sample name plus the trailing extension uniquely identify a read pair
        var readPairId = samplename + ext;

        if (!samples[readPairId]) {
            samples[readPairId] = {r1: null, r2: null};
        }
        if (flag === "_R1_") {
            samples[readPairId].r1 = f;
        } else if (flag === "_R2_") {
            samples[readPairId].r2 = f;
        }
      }
      // pivot the object into an array of {r1, r2} records
      var pairedReads = Object.keys(samples).map(function(key){ return samples[key]; });
      // set the pairedReads record array defined in `outputs` equal to the local variable
      return {pairedReads: pairedReads};
  }

An aside…

It is instructive to see where this code “lives.” This…

class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
cwlVersion: v1.0
inputs: []
outputs:
  anOutput:
    type: int
expression: |
  ${
      return ({anOutput: 0});
  }

…gets turned into the following:

"use strict";
var inputs = {};
var self = null;
var runtime = {
  outdirSize: 1024,
  ram: 1024,
  tmpdirSize: 1024,
  cores: 1,
  tmpdir: "/private/tmp/tmpM9VDWF",
  outdir: "/private/tmp/tmpz8v2Fk",
};
(function () {
  return { anOutput: 0 };
})();

So, `inputs`, `self`, and `runtime` are all variables we can play with. Notably, `self` is an array of the globbed output files, but that only applies to CommandLineTools; it is null here, and we can ignore it.
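As a sketch of how you might use these variables, an ExpressionTool can read `runtime` directly. This example is mine, not part of the dummy workflow:

class: ExpressionTool
cwlVersion: v1.0
requirements:
  - class: InlineJavascriptRequirement
inputs: []
outputs:
  coresUsed:
    type: int
expression: |
  ${
      // report the core count the runner injected into `runtime`
      return {coresUsed: runtime.cores};
  }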

and back…

Notice the syntax to define a record array:

pairedReads:
  type:
    type: array
    items:
      type: record

I feel that first nested `type` is counter-intuitive; we’ll just have to remember it. Underneath, we define the list of fields that makes up each record in the output schema: an R1 file and an R2 file.

Next, we have the code. The expression iterates over the list of fastq files, generates a string that uniquely identifies each read pair, and groups the mated files in a JavaScript object keyed on that string. After flattening this object into a list, it returns the list as a value inside a JavaScript object literal. Finally, the values of that object get slotted into the matching outputs defined in the CWL document.
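To make that concrete, here is roughly what the expression would return for four hypothetical input files, SAMPLEA_R1_.fastq.gz, SAMPLEA_R2_.fastq.gz, SAMPLEB_R1_.fastq.gz, and SAMPLEB_R2_.fastq.gz. In reality each entry is a full CWL File object; they are abbreviated here to their basenames:

# hypothetical value of pairedReads
pairedReads:
  - r1: SAMPLEA_R1_.fastq.gz
    r2: SAMPLEA_R2_.fastq.gz
  - r1: SAMPLEB_R1_.fastq.gz
    r2: SAMPLEB_R2_.fastq.gz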

Once the read pairing node is complete, we wire it into the workflow cwl.

# Workflow.cwl.3
class: Workflow
cwlVersion: v1.0
requirements:
  - class: ScatterFeatureRequirement
inputs:
  fastqs:
    type: File[]
outputs:
  readCounts:
    type: File[]
steps:
  PairReadFiles:
    run: PairReadFiles.cwl
    in:
      fastqs: fastqs
    out:
      - pairedReads

Now we have a structured list that goes nowhere. We’ll define the next node to take this list and process it.

# ProcessMatedReadFastqs.cwl
class: CommandLineTool
cwlVersion: v1.0
requirements:
  - class: InlineJavascriptRequirement
hints:
  DockerRequirement:
    dockerPull: readpaircounter
baseCommand:
  - count_reads.py
inputs:
  pairedRead:
    inputBinding:
      position: 0
    type:
      type: record
      fields:
        - name: r1
          type: File
          inputBinding:
            prefix: --r1
            position: 1
        - name: r2
          type: File
          inputBinding:
            prefix: --r2
            position: 2
outputs:
  readCount:
    type: stdout

# Dockerfile
FROM python
RUN pip install biopython
ADD count_reads.py /usr/local/bin
RUN chmod +x /usr/local/bin/count_reads.py
CMD sh

# count_reads.py
#!/usr/bin/env python

import argparse
import gzip
import sys
from Bio import SeqIO

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--r1', required=True)
    parser.add_argument('--r2', required=True)
    args = parser.parse_args()

    fastq_record_count = 0
    for rx in (args.r1, args.r2):
        # basespace fastqs arrive gzipped, so pick an opener by extension
        opener = gzip.open if rx.endswith('.gz') else open
        with opener(rx, 'rt') as infile:
            fastq_record_count += sum(1 for record in SeqIO.parse(infile, 'fastq'))

    sys.stdout.write('%d\n' % fastq_record_count)

if __name__ == '__main__':
    main()
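For one record built from the hypothetical files above, the inputBindings would render a command line along these lines (the actual paths will be wherever the runner stages the files):

count_reads.py --r1 SAMPLEA_R1_.fastq.gz --r2 SAMPLEA_R2_.fastq.gz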

I won’t bother to explain my code here, as there is decent documentation on CommandLineTools. However, I do want to point out a difference from how we normally define inputs: instead of defining each one just below the inputs key, we put them in a record, like so:

type:
  type: record
  fields:
    - name: ...

You may notice a disconnect. In the previous node, we created a list of records, whereas here we accept only a single record. We need special wiring that breaks apart the list of records so each node receives only one. This is accomplished with the scatter feature.

# Workflow.cwl.4
class: Workflow
cwlVersion: v1.0
requirements:
  - class: ScatterFeatureRequirement
inputs:
  fastqs:
    type: File[]
outputs:
  readCounts:
    type: File[]
    outputSource: ProcessReadFile/readCount
steps:
  PairReadFiles:
    run: PairReadFiles.cwl
    in:
      fastqs: fastqs
    out:
      - pairedReads
  ProcessReadFile:
    run: ProcessMatedReadFastqs.cwl
    in:
      pairedRead: PairReadFiles/pairedReads
    out:
      - readCount
    scatter:
      - pairedRead

There ya go. We have paired our reads in a way that allows scattering across a cluster.
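If you want to try it, all that’s left is a job file listing the fastqs. Assuming the final workflow is saved as Workflow.cwl and the hypothetical filenames from earlier, something like this should work with cwltool:

# inputs.yml; run with: cwltool Workflow.cwl inputs.yml
fastqs:
  - {class: File, path: SAMPLEA_R1_.fastq.gz}
  - {class: File, path: SAMPLEA_R2_.fastq.gz}
  - {class: File, path: SAMPLEB_R1_.fastq.gz}
  - {class: File, path: SAMPLEB_R2_.fastq.gz}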

Reflections

Is it all worth it?

There is one major benefit to CWL: once written, I will be able to run this workflow anywhere, forever. You can’t say that about a lot of code.

But that assumes that CWL stabilizes as a language.

Once, I connected an enum-of-strings output to a string sink. This is fine in the latest version of cwltool, but a user later reported that he could no longer run our pipeline. To stay backwards compatible, I had to refactor the CWL documents so that both tools accepted an enum input.

That example touches one of the marginal, less well-thought-out aspects of the language, but it is illustrative. I often wonder whether the language will change enough to break compatibility with CWL documents written today. If so, what’s the point? I suggest developers and bioinformaticians wait and see before adopting CWL.

My other thought: this is very verbose. We have written one hundred and eighty-one lines of code across five files to pair read files and do some trivial processing.

Snakemake seems more promising, since its language constructs strike me as expressive and concise. I haven’t attempted to run a Snakemake workflow on a cluster, however. It supports Kubernetes, though, so any major cloud provider should work.

Day to day, I will continue to use CWL. But, as of 2018, I do not recommend it. For my next project, I will seriously consider Snakemake and, a few years down the line, re-evaluate CWL.