Efficient piping in the Common Workflow Language
Introduction
The CWL specification for CommandInputParameter includes a field called streamable, described as:
Only valid when type: File or is an array of items: File.
A value of true indicates that the file is read or written sequentially without seeking. An implementation may use this flag to indicate whether it is valid to stream file contents using a named pipe. Default: false.
This has intrigued me for a while, since one of the major bottlenecks on our HPC when running CWL is network IO. In idiomatic CWL, you describe each command as a tool and chain the tools together in a workflow. However, when a command writes its output in an uncompressed format, passing that output between steps can overwhelm the network. So something like the following, which aligns a fastq file with bowtie2, is idiomatic but inefficient:
#align_with_bowtie.cwl
class: Workflow
cwlVersion: v1.0
inputs:
  index_basename:
    type: string
  index_dir:
    type: Directory
  fastq:
    type: File
outputs:
  bam:
    type: File
    outputSource: samtools_view/bam
steps:
  # Step 1: align the reads and write an uncompressed SAM file
  bowtie2:
    in:
      fastq: fastq
      index_dir: index_dir
      index_basename: index_basename
    out:
      - alignment
    run:
      class: CommandLineTool
      cwlVersion: v1.0
      requirements:
        - class: DockerRequirement
          dockerPull: biocontainers/bowtie2:v2.2.9_cv2
      baseCommand:
        - bowtie2
      stdout: $(inputs.fastq.nameroot).sam
      arguments:
        - valueFrom: $(inputs.index_dir.dirname)/$(inputs.index_dir.basename)/$(inputs.index_basename)
          prefix: "-x"
          position: 0
      inputs:
        index_dir:
          type: Directory
        index_basename:
          type: string
        fastq:
          type: File
          inputBinding:
            position: 1
            prefix: "-U"
      outputs:
        alignment:
          type: stdout
  # Step 2: compress the SAM file to BAM
  samtools_view:
    in:
      sam: bowtie2/alignment
    out:
      - bam
    run:
      class: CommandLineTool
      cwlVersion: v1.0
      requirements:
        - class: DockerRequirement
          dockerPull: biocontainers/samtools:v1.7.0_cv3
      stdout: $(inputs.sam.nameroot).bam
      baseCommand:
        - samtools
        - view
        - "-b"  # without -b, samtools view writes SAM rather than BAM
      inputs:
        sam:
          type: File
          inputBinding:
            position: 0
      outputs:
        bam:
          type: stdout
Why is this inefficient? Because (1) we have to wait for bowtie2 to complete before compressing and (2) we have to transfer the uncompressed file over the network.
Our first issue would be solved by setting streamable: true for the sam file, as the specification recommends. However, no workflow executor that I am aware of actually makes use of this setting. The folks at Arvados whom I spoke to were not aware of it.
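As a minimal sketch of what that setting would look like, here is the inputs section of the samtools_view tool from the workflow above with the flag added. This only declares that the file is read sequentially. Whether anything is actually streamed is up to the executor, which may ignore the flag entirely.

```yaml
# Fragment of the samtools_view CommandLineTool (not a complete document)
inputs:
  sam:
    type: File
    streamable: true  # hint: read sequentially, so a named pipe would be valid
    inputBinding:
      position: 0
```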
The second issue depends on how the jobs are dispatched. It would not be an issue if both were sent to the same server, but that’s not guaranteed and varies depending on the workflow executor and its scheduling algorithm. For example, in my experience Seven Bridges runs subworkflows on the same instance, which is preferable here, although they change their scheduler often enough that I wouldn’t depend on it.
For now, we can solve both issues with a Unix pipe, like:
bowtie2 -x <INDEX> -U <FASTQ_FILE> | samtools view -bS - > out.bam
However, this presents a couple of design issues for CWL:
- Shell features such as pipes are disabled by default
- We cannot use a single biocontainers image, since we need both binaries installed in the same Docker image
To get around this, we can take advantage of the following:
- Shell commands can be enabled with the ShellCommandRequirement
- We can use miniconda to install multiple binaries that are compatible with one another
Applied to the earlier command, this looks like:
class: CommandLineTool
cwlVersion: v1.0
requirements:
  # Needed so that the "|" below is interpreted by a shell
  - class: ShellCommandRequirement
  - class: DockerRequirement
    dockerImageId: bowtie2-and-samtools
    # -y keeps conda from waiting for confirmation during the build
    dockerFile: |-
      FROM continuumio/miniconda3
      RUN conda config --add channels defaults \
        && conda config --add channels bioconda \
        && conda config --add channels conda-forge \
        && conda install -y bowtie2 samtools \
        && conda clean -y -pt
baseCommand:
  - bowtie2
# The pipeline ends in samtools view -b, so stdout is BAM
stdout: $(inputs.fastq.nameroot).bam
inputs:
  index_dir:
    type: Directory
  index_basename:
    type: string
  fastq:
    type: File
arguments:
  - valueFrom: $(inputs.index_dir.dirname)/$(inputs.index_dir.basename)/$(inputs.index_basename)
    prefix: "-x"
  - prefix: "-U"
    valueFrom: $(inputs.fastq)
  # Unquoted pipe: everything after it is the samtools command
  - valueFrom: "|"
    shellQuote: false
  - samtools
  - view
  - "-b"
  - "-"
outputs:
  alignment:
    type: stdout
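To try the tool above, it needs a job file supplying its three inputs. The following is only a sketch: the file name and every path and value in it are hypothetical placeholders, so substitute your own index directory, index basename, and fastq file.

```yaml
# bowtie2_job.yml (hypothetical name; all paths below are placeholders)
index_dir:
  class: Directory
  path: /path/to/bowtie2_index
index_basename: my_index
fastq:
  class: File
  path: /path/to/reads.fastq
```

Assuming the tool is saved as bowtie2_samtools.cwl (again a made-up name), it could then be run with a reference runner such as cwltool by passing the tool file followed by the job file: cwltool bowtie2_samtools.cwl bowtie2_job.yml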
Performance considerations
This Docker image is 830 megabytes, 460 of which come from the miniconda3 base image. Depending on how Docker images are cached on your server, pulling or building an image this large may be slow.
An alternative that produces smaller Docker images would be to copy the binaries from the miniconda3 image to an alpine image. I have avoided doing so because bowtie2 and samtools have dependencies of their own, which I found tricky to carry over.
That being said, this design pattern moves everything reliably into a single instance. It gives us another tool to address one of the trickier bottlenecks in CWL: network IO.
2018-03-04 19:00