Introduction

The Common Workflow Language (CWL) describes workflows in YAML that a CWL runner can execute. Two key ideas: CWL workflows can be analysed and reasoned about (unlike shell scripts), and they separate concerns into (1) tools/scripts, (2) data and (3) the workflow itself, i.e. how the steps connect up.

CWL is also agnostic about how the underlying tools are found. Docker images are often provided as hints, but with --no-container a tool is simply invoked directly. This is great in the context of GNU Guix environments!

Install CWL using GNU Guix

You may need to install GNU Guix first; see the README at http://git.genenetwork.org/guix-bioinformatics/guix-bioinformatics

Recent versions of GNU Guix contain cwltool, the reference CWL runner:

guix pull
~/.config/guix/current/bin/guix package -A cwl
  cwltool 3.0.20201121085451      out     gnu/packages/bioinformatics.scm:2627:2

Install with

guix package -i cwltool

or in a special profile (I tend to do that)

guix package -i cwltool -p ~/opt/CWL

Set the PATH and you should be able to run cwltool

. ~/opt/CWL/etc/profile
cwltool
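
A minimal sketch of loading such a profile defensively, so that a missing profile gives a clear message instead of a confusing "command not found" later. The function name load_guix_profile is made up for this example; setting GUIX_PROFILE before sourcing follows the usual Guix convention for profiles outside the default location.

```shell
# load_guix_profile DIR: source DIR/etc/profile if it exists.
# (Hypothetical helper, not part of Guix or cwltool.)
# GUIX_PROFILE is set first so the profile script refers to DIR itself.
load_guix_profile() {
    if [ -f "$1/etc/profile" ]; then
        GUIX_PROFILE="$1"
        . "$GUIX_PROFILE/etc/profile"
    else
        echo "no Guix profile found at $1" >&2
        return 1
    fi
}
```

With the profile from above this would be called as load_guix_profile ~/opt/CWL, followed by cwltool --version to confirm the runner is found.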

Set up a more advanced workflow

Let's run the workflow that was described in "Creating a reproducible workflow with GNU Guix":

git clone https://github.com/pjotrp/CWL-workflows

Build the trimmomatic package that is included in the repo (if you are unlucky this may take a while)

cd CWL-workflows
env GUIX_PACKAGE_PATH=. guix build trimmomatic-jar

Now let's rerun the workflow as set up in the blog post above (I created a local version to skip IPFS). Make sure your PATH points to all the tools, then run:

cwltool --no-container Workflows/test-workflow.cwl Jobs/local-small.ERR034597.test-workflow.yml

The first run gives an error: ERROR 'fastqc' not found. We need to add the tool to the environment. For this I created a file .guix-deploy in the root of the repo:

cat .guix-deploy
env GUIX_PACKAGE_PATH=.:~/iwrk/opensource/guix/guix-bioinformatics/  ~/.config/guix/current/bin/guix environment -C guix --ad-hoc cwltool trimmomatic-jar bwa fastqc go-ipfs curl --network

You can see it requires guix-bioinformatics, so you may need to clone that repo first. Next, start the Guix container:

. ./.guix-deploy
cwltool --no-container Workflows/test-workflow.cwl Jobs/local-small.ERR034597.test-workflow.yml
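
To catch a missing tool before cwltool reports it, a small check along these lines can be run inside the container first. This is only a sketch: the helper name missing_tools is invented here, and it assumes each tool's executable carries the same name as the tool (true for cwltool, fastqc and bwa).

```shell
# missing_tools NAME...: print every NAME that does not resolve on PATH.
# (Hypothetical helper for illustration.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running missing_tools cwltool fastqc bwa prints nothing when all three are available, and otherwise lists the ones that still need to be added to the environment.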

Now the workflow should run fastqc. When it works it should say

<lots of output>
INFO Final process status is success

The current workflow only partly works. It now complains with

ILLUMINACLIP:/gnu/store/v2jys382g6j5b7lsxzh8v4vfhd414nhz-profile/lib/share/jar/adapters/TruSeq2-PE.fa:2:40:15. Error: Unable to access jarfile /gnu/store/v2jys382g6j5b7lsxzh8v4vfhd414nhz-profile/lib/share/jar/trimmomatic-0.38.jar

This is because I hard-coded two paths, which you need to point to your own Guix profile first:

Tools/trimmomaticPE.cwl:    valueFrom: /gnu/store/v2jys382g6j5b7lsxzh8v4vfhd414nhz-profile/lib/share/jar/trimmomatic-0.38.jar
Tools/trimmomaticPE.cwl:    valueFrom: 'ILLUMINACLIP:/gnu/store/v2jys382g6j5b7lsxzh8v4vfhd414nhz-profile/lib/share/jar/adapters/TruSeq2-PE.fa:2:40:15'

In the container the Guix profile can be found with

echo $GUIX_ENVIRONMENT

Plug it into the values above. This is not typical and I should find a proper way to do this. After modifying the source by splicing in the GUIX_ENVIRONMENT path it worked.

diff --git a/Tools/trimmomaticPE.cwl b/Tools/trimmomaticPE.cwl
index ed57eb5..aedd23a 100644
--- a/Tools/trimmomaticPE.cwl
+++ b/Tools/trimmomaticPE.cwl
@@ -55,7 +55,7 @@ outputs:

 arguments:
   - position: 1
-    valueFrom: /gnu/store/v2jys382g6j5b7lsxzh8v4vfhd414nhz-profile/lib/share/jar/trimmomatic-0.38.jar
+    valueFrom: /gnu/store/j1ljhxzaxmcqy8v6d4v1y37p48c68f5q-profile/lib/share/jar/trimmomatic-0.38.jar
   - position: 2
     valueFrom: PE
   - position: 5
@@ -67,4 +67,4 @@ arguments:
   - position: 8
     valueFrom: $(inputs.fq2.basename).trim.2U.fastq
   - position: 9
-    valueFrom: 'ILLUMINACLIP:/gnu/store/v2jys382g6j5b7lsxzh8v4vfhd414nhz-profile/lib/share/jar/adapters/TruSeq2-PE.fa:2:40:15'
+    valueFrom: 'ILLUMINACLIP:/gnu/store/j1ljhxzaxmcqy8v6d4v1y37p48c68f5q-profile/lib/share/jar/adapters/TruSeq2-PE.fa:2:40:15'
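
The same edit can be scripted rather than made by hand. The sketch below is one possible way, assuming the hard-coded paths always have the form /gnu/store/<hash>-profile and that the new profile path contains no | or & characters (which would confuse sed). The function name repoint_profile is hypothetical.

```shell
# repoint_profile FILE PROFILE: replace every /gnu/store/<hash>-profile
# prefix in FILE with PROFILE. The result is written to a temporary
# file first so FILE is only overwritten when sed succeeded.
repoint_profile() {
    file=$1
    profile=$2
    tmp=$(mktemp) || return 1
    sed "s|/gnu/store/[a-z0-9]*-profile|$profile|g" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Inside the container this would be invoked as repoint_profile Tools/trimmomaticPE.cwl "$GUIX_ENVIRONMENT", which updates both the jar path and the ILLUMINACLIP adapter path in one go.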

Try

. ./.guix-deploy
cwltool --no-container --preserve-environment GUIX_ENVIRONMENT Workflows/test-workflow.cwl Jobs/local-small.ERR034597.test-workflow.yml
(output)
INFO Final process status is success

GUIX_ENVIRONMENT

The question is how to deal with GUIX_ENVIRONMENT. cwltool has a switch `--preserve-environment ENVVAR'. The variable is then available in the environment of the tools, but it appears not to be available to the CWL parser.

To automate this I think there are two options:

  1. Add GUIX_ENVIRONMENT support to CWL
  2. Generate/patch the above CWL script before running

The second one is easy if this is part of a Guix package, but I think we need to add proper support in CWL.
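
As a rough illustration of the second option, the CWL tool description could be kept as a template and rendered just before each run. Everything in this sketch is an assumption: the @GUIX_ENVIRONMENT@ placeholder, a template file such as Tools/trimmomaticPE.cwl.in, and the function name render_cwl do not exist in the repo.

```shell
# render_cwl TEMPLATE OUTPUT: write OUTPUT with every @GUIX_ENVIRONMENT@
# token in TEMPLATE replaced by the current value of $GUIX_ENVIRONMENT.
# (Hypothetical helper; placeholder syntax is an assumption.)
render_cwl() {
    if [ -z "${GUIX_ENVIRONMENT:-}" ]; then
        echo "GUIX_ENVIRONMENT is not set; run this inside the Guix container" >&2
        return 1
    fi
    sed "s|@GUIX_ENVIRONMENT@|$GUIX_ENVIRONMENT|g" "$1" > "$2"
}
```

A run would then be render_cwl Tools/trimmomaticPE.cwl.in Tools/trimmomaticPE.cwl followed by the usual cwltool command. The template stays free of store hashes, so it does not need to be edited again when the profile changes.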