Efficient piping in the Common Workflow Language

Introduction

The CWL specification for CommandInputParameter includes a field called streamable, described as:

Only valid when type: File or is an array of items: File.

A value of true indicates that the file is read or written sequentially without seeking. An implementation may use this flag to indicate whether it is valid to stream file contents using a named pipe. Default: false.

This has intrigued me for a while, since one of the major bottlenecks on our HPC cluster when running CWL is network IO. In idiomatic CWL, you wrap each command in its own tool and chain the tools together with a workflow. However, when a command writes its output in an uncompressed format, passing that file between steps can overwhelm the network. So something like the following, which aligns a FASTQ file with bowtie2 and converts the result to BAM with samtools, is idiomatic but inefficient:

#align_with_bowtie.cwl
class: Workflow
cwlVersion: v1.0
inputs:
  index_basename:
    type: string
  index_dir:
    type: Directory
  fastq:
    type: File
outputs:
  bam:
    type: File
    outputSource: samtools_view/bam
steps:
  bowtie2:
    in:
      fastq: fastq
      index_dir: index_dir
      index_basename: index_basename
    out:
      - alignment
    run:
      class: CommandLineTool
      cwlVersion: v1.0
      requirements:
        - class: DockerRequirement
          dockerPull: biocontainers/bowtie2:v2.2.9_cv2
      baseCommand:
        - bowtie2
      stdout: $(inputs.fastq.nameroot).sam
      arguments:
        # bowtie2 -x expects the index prefix: <index directory>/<index basename>
        - valueFrom: $(inputs.index_dir.path)/$(inputs.index_basename)
          prefix: "-x"
          position: 0
      inputs:
        index_dir:
          type: Directory
        index_basename:
          type: string
        fastq:
          type: File
          inputBinding:
            position: 1
            prefix: "-U"
      outputs:
        alignment:
          type: stdout
  samtools_view:
    in:
      sam: bowtie2/alignment
    out:
      - bam
    run:
      class: CommandLineTool
      cwlVersion: v1.0
      requirements:
        - class: DockerRequirement
          dockerPull: biocontainers/samtools:v1.7.0_cv3
      stdout: $(inputs.sam.nameroot).bam
      baseCommand:
        - samtools
        - view
        - "-b"  # emit BAM; without -b, samtools view writes SAM
      inputs:
        sam:
          type: File
          inputBinding:
            position: 0
      outputs:
        bam:
          type: stdout

Why is this inefficient? Because (1) samtools cannot start compressing until bowtie2 has finished writing the whole SAM file, and (2) the uncompressed SAM file has to be transferred over the network between the two steps.

Our first issue would be solved by setting streamable: true on the SAM file, as the specification suggests. However, no workflow executor that I am aware of actually makes use of this setting. The folks at Arvados whom I spoke to were not aware of it.
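As a minimal sketch of where the flag would go, here is the inputs section of the samtools_view tool from the workflow above with streamable set on the sam input. This only declares that the file is read sequentially; it does not make any executor stream it, and I am assuming the rest of the tool is unchanged.

```yaml
# Variant of the samtools_view inputs: mark the SAM file as read sequentially.
# An executor that honoured the flag could feed it through a named pipe.
inputs:
  sam:
    type: File
    streamable: true
    inputBinding:
      position: 0
```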

The second issue depends on how the jobs are dispatched. It would not be an issue if both were sent to the same server, but that’s not guaranteed and varies depending on the workflow executor and its scheduling algorithm. For example, in my experience SevenBridges runs subworkflows on the same instance, which is preferable here – although they change their scheduler enough that I wouldn’t depend on it.
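To illustrate the subworkflow arrangement, the alignment workflow could be embedded as a single step of a parent workflow, roughly as sketched here. The file name align_with_bowtie.cwl is taken from the comment at the top of the earlier listing, and the step name align is hypothetical. Whether the two inner steps end up on the same instance remains up to the executor; nothing in CWL guarantees it.

```yaml
class: Workflow
cwlVersion: v1.0
requirements:
  # Needed whenever a step runs another workflow
  - class: SubworkflowFeatureRequirement
inputs:
  index_basename: string
  index_dir: Directory
  fastq: File
outputs:
  bam:
    type: File
    outputSource: align/bam
steps:
  align:  # hypothetical step name
    run: align_with_bowtie.cwl
    in:
      fastq: fastq
      index_dir: index_dir
      index_basename: index_basename
    out:
      - bam
```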

For now, we can solve both issues with a Unix pipe, like: bowtie2 -x <INDEX> -U <FASTQ_FILE> | samtools view -bS - > out.bam. However, this presents a couple of design issues for CWL:

Shell commands are disabled by default
We cannot use biocontainers alone, since we need to install multiple binaries to a docker image

To get around this, we can take advantage of the following:

Shell commands can be enabled with the ShellCommandRequirement
We can use miniconda to install multiple binaries that are compatible with one another

Applied to the earlier command, this looks like:

class: CommandLineTool
cwlVersion: v1.0
requirements:
  - class: ShellCommandRequirement
  - class: DockerRequirement
    dockerImageId: bowtie2-and-samtools
    dockerFile: |-
      FROM continuumio/miniconda3
      RUN conda config --add channels defaults \
          && conda config --add channels bioconda \
          && conda config --add channels conda-forge \
          && conda install -y bowtie2 samtools \
          && conda clean -y -pt
baseCommand:
  - bowtie2
# The pipeline ends in samtools view -b, so stdout is BAM rather than SAM
stdout: $(inputs.fastq.nameroot).bam
inputs:
  index_dir:
    type: Directory
  index_basename:
    type: string
  fastq:
    type: File
arguments:
  - valueFrom: $(inputs.index_dir.path)/$(inputs.index_basename)
    prefix: "-x"
  - prefix: "-U"
    valueFrom: $(inputs.fastq.path)
  # Left unquoted so the shell treats it as a pipe (needs ShellCommandRequirement)
  - valueFrom: "|"
    shellQuote: false
  - samtools
  - view
  - "-b"
  - "-"
outputs:
  alignment:
    type: stdout
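
For completeness, a job file for this tool might look like the sketch below. The paths and the index basename are hypothetical placeholders; only the three input names come from the tool definition above.

```yaml
# job.yml: hypothetical inputs for the piped bowtie2/samtools tool
index_dir:
  class: Directory
  path: indexes
index_basename: genome
fastq:
  class: File
  path: reads.fastq
```

With a runner such as cwltool, this would be passed alongside the tool file, for example cwltool bowtie2_samtools.cwl job.yml, where both file names are again assumptions.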

Performance considerations

This Docker image is 830 megabytes, of which 460 come from the miniconda3 base image. Depending on how Docker images are cached on your servers, building or pulling it may be slow.
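
One way to lean on image caching, sketched under assumptions: build the image once, push it to a registry your compute nodes can reach, and reference it with dockerPull instead of an inline dockerFile. The image name below is hypothetical, and whether this helps depends on how your executor and nodes cache pulled images.

```yaml
requirements:
  - class: ShellCommandRequirement
  - class: DockerRequirement
    # Hypothetical prebuilt image; replace with wherever you push your own build
    dockerPull: registry.example.org/bowtie2-and-samtools:1.0
```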

An alternative that produces smaller Docker images would be to copy the binaries from the miniconda3 image into an Alpine image. I have avoided doing so because bowtie2 and samtools have dependencies of their own that I found tricky to carry over.

That being said, this design pattern moves everything reliably into a single instance. It gives us another tool to address one of the trickier bottlenecks in CWL: network IO.