You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

29 KiB

Creating a reproducible workflow with CWL


The quest for building a fully reproducible pipeline with provenance

In the quest for truely reproducible workflows I set out to create an example of a reproducible workflow using GNU Guix, IPFS and CWL. GNU Guix provides content-addressable reproducible software deployment, IPFS provides content-addressable storage and CWL can describe a workflow that can run on backends that support it. In principle, this combination of tools should be enough to provide reproducibility with provenance.

Note: this work was mostly executed during the Biohackathon 2018 in Matsue, Japan

Table of Contents   TOC

Common Workflow Language (CWL)

The Common Workflow Language (CWL) is a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry. CWL started as an answer to ad hoc scripting of pipelines in bioinformatics and come up with something portable instead. The CWL promises a future of running pipelines that others have created. Something of a must when you know that there is a reproducibility crisis with a majority of published papers today describing bioinformatics data and methods that can not be reproduced. Some good points also made in this paper on Experimenting with reproducibility: a case study of robustness in bioinformatics and in PiGx: reproducible genomics analysis pipelines with GNU Guix.


The InterPlanetary File System is a protocol and network designed to create a content-addressable, peer-to-peer method of storing and sharing hypermedia in a distributed file system. IPFS is exciting because it delivers a way of connecting different sites together (say from different institutes) that can locate and serve files in a (reasonably) scalable way. IPFS is content-addressable (more on that below) and allows for deduplication, local caching and swarm-like software downloads over the network. IPFS is free software and part of GNU Guix.

GNU Guix

GNU Guix is the package manager of the GNU Project. Originally based on the Nix package manager it has morphed into a substantial project in its own right with hundreds of committers and some 9,000 packages including a wide range of bioinformatics packages. GNU Guix can be deployed on any existing Linux distribution (I use Debian) and provides a rigorous and robust control of the dependency graph.

Why content-addressable?

The short explanation:

Content addressable files are referenced to by a hash on their contents as part of the file path/URI. For example, in the workflow below we use a file named small.chr22.fa that is used by its full path:


A hash value was computed over the fastq file and that became part of its reference. If the file changes in any way, even one single letter, the hash value changes and therefore the reference… This property quarantees you are always dealing with the same input data - a key property of any reproducible pipeline. There can be no ambuigity about file names and what they represent. Files can not change without the filename changing.

Similarly, every GNU Guix software reference includes a hash over its content. The reference to a fastq binary executable, for example, looks like


A reproducible pipeline therefore includes a unique reference to the binary tool(s). It is even better than that because all dependencies are included in the hash. Therefore the software dependency tree is carved in stone and we can recover and draw the dependency graph as is shown below.

Now this may appear a little elaborate. The good news is that most of these references are transparent. The Guix environment deals with resolving them as should become clear.

Getting started

GNU Guix installation

The first step is to install the Guix daemon. This daemon allows regular users to install software packages on any Linux distribution (Debian, Fedora and CentOS are all fine). GNU Guix does not interfere with the running Linux distribution. Installation instructions can be found here and here. The Guix deamon needs to be installed as root, but runs with user land privileges. For those who can not get root there are work arounds (including the use of Docker). And Ricardo Wurmus describes how MDC deploys GNU Guix on their HPC and here (essentially use one build host and copy files to the rest). For HPC we typically use a build host which has privileges, but all other HPC nodes simply mount one directory under /gnu/store using a network mount. More HPC blogs can be found here. If you don't think it can be done on HPC, think again: Compute Canada deploys Nix on their HPCs on over 120,000 cores. And if you can do Nix you can do Guix. Same principles.

IPFS and CWL installation

IPFS was recently added to GNU Guix. The first task was to update and add CWL to GNU Guix. This took me a few hours because quite a few dependencies had to be added in and some of these packages have 'fixated' versions and ultimately do not build on Python 3.7. Of course this should be fixed but with Guix we can introduce both older and personally updated packages no problem (fixing dependency hell). To manage all this I created a special channel and after setting up the channel (see the README) on Debian, Ubuntu, Fedora, Arch (etc.) installation should be as easy as

guix package -i cwltool -p ~/opt/cwl

Now to run the tool you need to set the paths etc with

. ~/opt/cwl/etc/profile
cwltool --help

I added the packages in these commits. E.g. update CWL. Also some packages on Guix trunk needed to be updated, including python-rdflib and python-setuptools. This leads to the following dependency graph for cwltool which is generated by Guix itself:

If Guix is correctly installed most packages get downloaded and installed as binaries. Guix only builds packages when it can not find a binary substitute. And now I can run

cwltool --version
/gnu/store/nwrvpgf3l2d5pccg997cfjq2zqj0ja0j-cwltool-1.0.20181012180214/bin/.cwltool-real 1.0


Note that the guix-cwl channel also provides a Docker image which we'll update for cwltool.

Short recap

After adding the cwl channel we can have the main tools installed in one go with

guix package -i go-ipfs cwltool -p ~/opt/cwl

Again, to make the full environment availble do

. ~/opt/cwl/etc/profile
ipfs --version
  ipfs version 0.4.19

The workflow

Choosing a CWL workflow

First I thought to run one of the pipelines from bcbio-nextgen as an example. Bcbio generates CWL which is rather convenient. But then at the BH18 there was a newly created CWL pipeline in and I decided to start from there. This particular pipeline uses github to store data and a Docker container to run a JVM tool. Good challenge to replace that with IPFS and Guix and make it fully reproducible.

Note that git does provide provenance but is not suitable for large data files. And even though Docker may provide reproducible binary blobs - it is hard to know what is in them, i.e., there is a trust issue, and it is usually impossible to recreate them exactly, which is a reproducibility issue. We can do better than that.

Add the data sources

In the next step we are going to make the data available through IPFS (we installed above).

After above installation of go-ipfs, following IPFS instructions create a data directory

mkdir /export/data/ipfs
env IPFS_PATH=/export/data/ipfs ipfs init
  initializing IPFS node at /export/data/ipfs
  generating 2048-bit RSA keypair...done
  peer identity: QmUZsWGgHmJdG2pKK52eF9kG3DQ91fHWNJXUP9fTbzdJFR

Start the daemon

env IPFS_PATH=/export/data/ipfs ipfs daemon

and we can add the data

export IPFS_PATH=/export/data/ipfs
ipfs add -r DATA2/
  added QmXwNNBT4SyWGnNogzDq8PTbtFi48Q9J6kXRWTRQGmgoNz DATA/small.ERR034597_1.fastq
  added QmcJ7P7eyMqhttSVssYhiRPUc9PxqAapVvS91Qo78xDjj3 DATA/small.ERR034597_2.fastq
  added QmfRb8TLfVnMbxauTPV2hx5EW6pYYYrCRmexcYCQyQpZjV DATA/small.chr22.fa
  added QmXaN36yNT82jQbUf2YuyV8symuF5NrdBX2hxz4mAG1Fby DATA/small.chr22.fa.amb
  added QmVM3SERieRzAdRMxpLuEKMuWT6cYkhCJsyqpGLj7qayoc DATA/small.chr22.fa.ann
  added QmfYpScLAEBXxyZmASWLJQMZU2Ze9UkV919jptGf4qm5EC DATA/small.chr22.fa.bwt
  added Qmc2P19eV77CspK8W1JZ7Y6fs2xRxh1khMsqMdfsPo1a7o DATA/small.chr22.fa.pac
  added QmV8xAwugh2Y35U3tzheZoywjXT1Kej2HBaJK1gXz8GycD DATA/
  added QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE DATA

Test a file

ipfs cat QmfRb8TLfVnMbxauTPV2hx5EW6pYYYrCRmexcYCQyQpZjV

and you should see the contents of small.chr22.fa. You can also browse to http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE


Next you ought to pin the data so it does not get garbage collected by IPFS

ipfs pin add QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE
  pinned QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE recursively

Run CWL script

Following the instructions in the original workflow README

cwltool Workflows/test-workflow.cwl Jobs/small.ERR034597.test-workflow.yml

complains we don't have Docker. Since we want to run without Docker specify

cwltool --no-container Workflows/test-workflow.cwl Jobs/small.ERR034597.test-workflow.yml

Resulting in

'fastqc' not found: [Errno 2] No such file or directory: 'fastqc': 'fastqc'

which exists in Guix, so

guix package -i fastqc -p ~/opt/cwl


fastqc       0.11.5  /gnu/store/sh0wj2c00vkkh218jb5p34gndfdmbhrf-fastqc-0.11.5

and also downloads missing fastqc dependencies


Note that the package is completely defined with its dependencies and 'content-addressable'. We can see it pulls in Java and Picard. Note also the software is made available under an 'isolated' profile in ~/opt/cwl. We are not mixing with other software setups. And, in the end, all software installed in this profile can be hosted in a (Docker) container.

After installing with Guix we can rerun the workflow and it fails at the next step with

/gnu/store/nwrvpgf3l2d5pccg997cfjq2zqj0ja0j-cwltool-1.0.20181012180214/bin/.cwltool-real 1.0
Resolved 'Workflows/test-workflow.cwl' to 'file:///export/export/local/wrk/izip/git/opensource/cwl/hacchy1983-CWL-workflows/Workflows/test-workflow.cwl'
[workflow ] start
[workflow ] starting step qc1
[step qc1] start
[job qc1] /tmp/ig4k8x8m$ fastqc \
    -o \
    . \
Started analysis of small.ERR034597_1.fastq
Approx 5% complete for small.ERR034597_1.fastq
Approx 10% complete for small.ERR034597_1.fastq
Approx 15% complete for small.ERR034597_1.fastq
Approx 20% complete for small.ERR034597_1.fastq

Error: Unable to access jarfile /usr/local/share/trimmomatic/trimmomatic.jar

Success. fastqc runs fine and now we hit the next issue. The /usr/local points out there is at least one problem :). There is also another issue in that the data files are specified from the source tree, e.g.

fq1:  # type "File"
    class: File
    path: ../DATA/small.ERR034597_1.fastq

Here, btw, you may start to appreciate the added value of a CWL workflow definition. By using an EDAM ontology CWL gets metadata describing the data format which can be used down the line. Still, we need to fetch with IPFS so the description becomes

fq1:  # type "File"
    class: File
    path: ../DATA/small.ERR034597_1.fastq

To make sure we do not fetch the old data I moved the old data files out of the way and modified the job description to use the IPFS local web server

git mv ./DATA ./DATA2
mkdir DATA
--- a/Jobs/small.ERR034597.test-workflow.yml
+++ b/Jobs/small.ERR034597.test-workflow.yml
@@ -1,10 +1,10 @@
 fq1:  # type "File"
     class: File
-    path: ../DATA/small.ERR034597_1.fastq
+    path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.ERR034597_1.fastq
 fq2:  # type "File"
     class: File
-    path: ../DATA/small.ERR034597_2.fastq
+    path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.ERR034597_2.fastq
 fadir:  # type "Directory"
     class: Directory

The http fetches can be replaced later with a direct IPFS call which will fetch files transparently from the public IPFS somewhere - much like bittorrent does - and cache locally. We will need to add that support to CWL so we can write something like

path: ipfs://QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE

This is safe because IPFS is content-addressable.

Now the directory tree looks like

├── DATA
├── DATA2
│   ├── small.chr22.fa
│   ├── small.chr22.fa.amb
│   ├── small.chr22.fa.ann
│   ├── small.chr22.fa.bwt
│   ├── small.chr22.fa.pac
│   ├──
│   ├── small.ERR034597_1.fastq
│   └── small.ERR034597_2.fastq
├── Jobs
│   ├── small.chr22.bwa-index.yml
│   └── small.ERR034597.test-workflow.yml
├── small.ERR034597_1_fastqc.html
├── Tools
│   ├── bwa-index.cwl
│   ├── bwa-mem-PE.cwl
│   ├── fastqc.cwl
│   ├── samtools-sam2bam.cwl
│   └── trimmomaticPE.cwl
└── Workflows
    └── test-workflow.cwl

and again CWL runs up to

Error: Unable to access jarfile /usr/local/share/trimmomatic/trimmomatic.jar

trimmomatic: adding a binary blob to GNU Guix

The original workflow pulls trimmomatic.jar as a Docker image. Just as an example I downloaded the jar file from source and created a GNU Guix package to make it available to the workflow.

Guix likes things to be built from source - it is a clear goal of the GNU project and the whole system is designed around that. But you can still stick in binary blobs if you want. Main thing is that they need to be available in the /gnu/store to be seen at build/install time. Here I am going to show you how to do that, but keep in mind that for reproducible pipelines this is a questionable design choice.

I created a jar download for GNU Guix, see /pjotrp/guix-notes/src/commit/00e3a6dbd2e6e5d7f81eb05476ff4c6a391dceca/ This was done by creating a Guix channel as part of the repository. The idea of the package in words is:

  1. Download the jar and compute the HASH for Guix with

guix download
  1. Check the contents of the Zip file

unzip -t /gnu/store/
   testing: Trimmomatic-0.38/trimmomatic-0.38.jar   OK
  1. On running 'guix install' Guix will unzip the file in a 'build' directory

  2. You need to tell Guix to copy the file into the target 'installation' directory - we'll copy it into lib/share/jar

  3. After installation the jar will be available in the profile under that directory path

If you want to see the actual package definition and how it is done see /pjotrp/guix-notes/src/commit/00e3a6dbd2e6e5d7f81eb05476ff4c6a391dceca/

After installing the package and updating the profile try again after updating the paths for trimmomatic in

env GUIX_PACKAGE_PATH=../../cwl//hacchy1983-CWL-workflows/ ./pre-inst-env guix package -i trimmomatic-jar -p ~/opt/cwl

# ---- Update the paths
. ~/opt/cwl/etc/profile

# ---- Run
cwltool --no-container Workflows/test-workflow.cwl Jobs/small.ERR034597.test-workflow.yml

In the next step the workflow failed because bwa was missing, so added

guix package -i bwa -p ~/opt/cwl

And then we got a different error

[E::bwa_idx_load_from_disk] fail to locate the index files

Whoah. This workflow is broken because there are no index files!

Actually if you check earlier IPFS upload you can see we added them with:

  added QmfRb8TLfVnMbxauTPV2hx5EW6pYYYrCRmexcYCQyQpZjV DATA/small.chr22.fa
  added QmXaN36yNT82jQbUf2YuyV8symuF5NrdBX2hxz4mAG1Fby DATA/small.chr22.fa.amb
  added QmVM3SERieRzAdRMxpLuEKMuWT6cYkhCJsyqpGLj7qayoc DATA/small.chr22.fa.ann
  added QmfYpScLAEBXxyZmASWLJQMZU2Ze9UkV919jptGf4qm5EC DATA/small.chr22.fa.bwt
  added Qmc2P19eV77CspK8W1JZ7Y6fs2xRxh1khMsqMdfsPo1a7o DATA/small.chr22.fa.pac
  added QmV8xAwugh2Y35U3tzheZoywjXT1Kej2HBaJK1gXz8GycD DATA/
  added QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE DATA

But the workflow does not automatically fetch them. So, let's fix that. We'll simply add them using IPFS (though we could actually recreate them using 'bwa index' instead).

diff --git a/Jobs/small.ERR034597.test-workflow.yml b/Jobs/small.ERR034597.test-workflow.yml
index 9b9b153..51f2174 100644
--- a/Jobs/small.ERR034597.test-workflow.yml
+++ b/Jobs/small.ERR034597.test-workflow.yml
@@ -6,7 +6,18 @@ fq2:  # type "File"
     class: File
     path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.ERR034597_2.fastq
-fadir:  # type "Directory"
-    class: Directory
-    path: ../DATA
-ref: small.chr22  # type "string"
+ref:  # type "File"
+    class: File
+    path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.chr22.fa
+    format:
+    secondaryFiles:
+      - class: File
+        path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.chr22.fa.amb
+      - class: File
+        path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.chr22.fa.ann
+      - class: File
+        path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.chr22.fa.bwt
+      - class: File
+        path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/small.chr22.fa.pac
+      - class: File
+        path: http://localhost:8080/ipfs/QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE/

To make the workflow work I had to replace the concept of an fa directory for bwa to using these files explicitely which better describes what is happening (as a bonus):

diff --git a/Tools/bwa-mem-PE.cwl b/Tools/bwa-mem-PE.cwl
index fc0d12d..0f87af3 100644
--- a/Tools/bwa-mem-PE.cwl
+++ b/Tools/bwa-mem-PE.cwl
@@ -19,12 +19,17 @@ requirements:
 baseCommand: [ bwa, mem ]

-  - id: fadir
-    type: Directory
-    doc: directory containing FastA file and index
   - id: ref
-    type: string
-    doc: name of reference (e.g., hs37d5)
+    type: File
+    inputBinding:
+      position: 2
+    doc: Fasta reference (e.g., hs37d5)
+    secondaryFiles:
+      - .amb
+      - .ann
+      - .bwt
+      - .pac
+      - .sa
   - id: fq1
     type: File
     format: edam:format_1930

After that we got

Final process status is success


The source and full diff can be viewed on github.

Prove results are deterministic

GNU Guix has an option to rebuild packages multiple times and compare the results. In case there is a difference the packages can not be considered deterministic. For example software builds may contain a time stamp at time of build. This is harmless, but who is to tell the difference is not caused by something else? This is why the reproducible builds project exist of which Guix is a member. See also GNU Guix Reproducible builds: a means to an end.

The CWL runner does not have such an option (yet). I ran it by hand three times. The first time capture the MD5 values with

find . -type f -print0 | xargs -0 md5sum > ~/md5sum.txt

next times check with

md5sum -c ~/md5sum.txt |grep -v OK

it complained on one file

./output.sam: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

and the @PG field in the output file contains a temporary path:

diff output.sam output.sam.2
< @PG   ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 4 /gnu/tmp/cwl/tmpdoetk_3r/stge19b3f1c-864a-478e-8aee-087a61654aba/small.chr22.fa /gnu/tmp/cwl/tmpdoetk_3r/stgd649e430-caa8-491f-8621-6a2d6c67dcb9/small.ERR034597_1.fastq.trim.1P.fastq /gnu/tmp/cwl/tmpdoetk_3r/stg8330a0f5-751e-4685-911e-52a5c93ecded/small.ERR034597_2.fastq.trim.2P.fastq
> @PG   ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 4 /gnu/tmp/cwl/tmpl860q0ng/stg2210ff0e-184d-47cb-bba3-36f48365ec27/small.chr22.fa /gnu/tmp/cwl/tmpl860q0ng/stgb694ec99-50fe-4aa6-bba4-37fa72ea7030/small.ERR034597_1.fastq.trim.1P.fastq /gnu/tmp/cwl/tmpl860q0ng/stgf3ace0cb-eb2d-4250-b8b7-eb79448a374f/small.ERR034597_2.fastq.trim.2P.fastq

To fix it we could add a step to the pipeline to filter out this field or force output to go into the same destination directory. Or tell bwa to skip the @PG field.

Determinism (and reproducibility) may break when the pipeline has software that does not behave well. Some tools give different results when run with the exact same inputs. The solution is to fix or avoid that software. Also, software may try to download inputs which can lead to different results over time (for example by including a time stamp in the output). To be stringent, it may be advisable to disable network traffic when the workflow is running. GNU Guix builds all its software without network, i.e., after downloading the files as described in the package definition the network is switched off and the build procedure runs without network in complete isolation. This guarantees software can not download non-deterministic material from the internet. It also guarantees no dependencies can 'bleed' in. This is why GNU Guix is called a 'functional package manager' - in the spirit of functional programming.

TODO A full Docker container

Capture the provenance graph

TODO GNU Guix software graph

TODO CWL provenance graph


Here we show the principle of a working reproducible pipeline. With little effort, anywone can create such a pipeline using GNU Guix, an addressable data source such as IPFS, and a CWL work flow definition that includes content-addressable references to software and data inputs (here we used IPFS for data). By running the workflow multiple times it can be asserted the outcome is deterministic and therefore reproducible.

In the process of migrating the original Docker version of this workflow it came out that not all inputs were explicitely defined.

This reproducible workflow captures the full graph, including all data, tools and cwl-runner itself! There was no need to use Docker at all. In fact, this version is better than the original Docker pipeline because both software and data are complete and guaranteed to run with the same (binary) tools.

To guarantee reproducibility it is necessary to fixate inputs and have well behaved software. With rogue or badly behaved software this may be a challenge. The good news is that such behaviour is not so common and, if so, GNU Guix + IPFS will bring out any reproducibility issues.

Based on this exercise I also conclude that CWL is a very interesting technical proposition to write pipelines that can be shared. The online documentation is a bit wanting and, for example, to figure out the use of secondaryFiles for bwa I read through a number of existing pipelines on github. With the growth of online pipelines CWL should become stronger and stronger. And with the growing support any CWL user will get the benefit of capturing provenance graphs and other goodies.

Beside improviging the documentation, I suggest CWL gets an option for checking determinism (run workflows multiple times and check results) and add some support for GNU Guix profiles - so it becomes even easier to create deterministic software deployments that are built from source, transparent and forever recreatable. It is particularly in these last two points that Docker falls short.

Finally GNU Guix comes with its own workflow language GWL which natively makes use of GNU Guix facilities. It may be worth looking into because it is both simpler and more rigorous and can be combined with CWL and in the future it may write CWL definitions.

Extra notes

Building cwltool inside a Guix container

Guix containers allow isolation of the build system

env GUIX_PACKAGE_PATH=~/izip/git/opensource/genenetwork/guix-bioinformatics/ ~/izip/git/opensource/genenetwork/guix-monza/pre-inst-env guix environment -C guix --ad-hoc cwltool coreutils python

Run the tests with

python3 build

Some network related tests may fail (6 at this point). To build CWL in a container you can do something like this:

env PYTHONPATH=here/lib/python3.6/site-packages:$PYTHONPATH python3 install --prefix here

Create dependency graph

The full package graph can be generated with

env GUIX_PACKAGE_PATH=~/izip/git/opensource/genenetwork/guix-bioinformatics ./pre-inst-env guix graph cwltool |dot -Tpdf > cwltool-package.pdf

And the full dependency graph can be generated with

env GUIX_PACKAGE_PATH=~/izip/git/opensource/genenetwork/guix-bioinformatics ./pre-inst-env guix graph  --type=references cwltool |dot -Tpdf > cwltool-references.pdf

TODO Create a Docker container