pjotrp 3 years ago
parent
commit
711e585bfa
  1. CONTAINERS.org (6 lines changed)
  2. WORKFLOW.org (58 lines changed)

6
CONTAINERS.org

@ -145,7 +145,7 @@ main system. This makes GNU Guix containers really small and fast.
*** Docker
You can create a Docker image without actually installing Docker(!)
With GNU Guix you can create a Docker image without actually installing Docker(!)
#+begin_src shell
env GUIX_PACKAGE_PATH=../guix-bioinformatics/ \
@ -158,7 +158,7 @@ env GUIX_PACKAGE_PATH=../guix-bioinformatics/ \
note we now have the -S switch which can make the /usr/bin symlink
into the profile.
This produced a file which we can load into Docker
This produced a file which we can be loaded into Docker
: docker load --input /gnu/store/0p1ianjqqzbk1rr9rycaqcjdr2s13mcj-docker-pack.tar.gz
: docker images
@ -230,7 +230,7 @@ docker pull ubuntu
The steps that follow will be somewhat similar, with each image building upon
the image before it.
The files created here can be found
The files created here can be found
[[https://github.com/fredmanglis/guix-conda-docker/][in this repository]].
The first image to be built only contains conda, and it was initialised with a

58
WORKFLOW.org

@ -1,4 +1,7 @@
* Reproducible Workflow
# -*- mode: org; coding: utf-8; -*-
#+TITLE: Creating a reproducible workflow with CWL
* Introduction
/An example of building a fully reproducible pipeline with
provenance - note this document is a work in progress - to complete in
@ -15,7 +18,26 @@ reproducibility with provenance.
/Note: this work was mostly executed during the Biohackathon 2018 in
Matsue, Japan./
** Why content-addressable?
* Table of Contents :TOC:
- [[#introduction][Introduction]]
- [[#why-content-addressable][Why content-addressable?]]
- [[#gnu-guix-installation][GNU Guix installation]]
- [[#ipfs-and-cwl-installation][IPFS and CWL installation]]
- [[#choosing-a-cwl-workflow][Choosing a CWL workflow]]
- [[#add-the-data-sources][Add the data sources]]
- [[#run-cwl-script][Run CWL script]]
- [[#adding-a-binary-blob-to-gnu-guix][Adding a binary blob to GNU Guix]]
- [[#replacing-the-binary-blob-with-a-source-build][Replacing the binary blob with a source build]]
- [[#run-the-workflow-inside-an-isolated-container-without-network][Run the workflow inside an isolated container without network]]
- [[#prove-results-are-deterministic][Prove results are deterministic]]
- [[#capture-the-provenance-graph][Capture the provenance graph]]
- [[#discussion][Discussion]]
- [[#extra-notes][Extra notes]]
- [[#building-cwltool-inside-a-guix-container][Building cwltool inside a Guix container]]
- [[#create-dependency-graph][Create dependency graph]]
- [[#create-a-docker-container][Create a Docker container]]
* Why content-addressable?
[[https://en.wikipedia.org/wiki/Content-addressable_storage][Content addressable files]] are referenced to by a hash on their
contents as part of the file path/URI. For example, in the workflow
@ -50,7 +72,7 @@ Now this may appear a little elaborate. The good news is that most of
these references are transparent. The Guix environment deals with
resolving them as should become clear.
** GNU Guix installation
* GNU Guix installation
The first step is to install the Guix daemon. This daemon allows
regular users to install software packages on any Linux distribution
@ -64,7 +86,7 @@ HPC we typically use a build host which has privileges, but all other
HPC nodes simply mount one directory under /gnu/store using a network
mount. More HPC blogs can be found [[https://guix-hpc.bordeaux.inria.fr/blog/][here]].
** IPFS and CWL installation
* IPFS and CWL installation
IPFS was recently added to GNU Guix. The first task was to update and
add CWL to GNU Guix. This took me a few hours because quite a few
@ -72,8 +94,8 @@ dependencies had to be added in and some of these packages have
'fixated' versions and ultimately do not build on Python 3.7. Of
course this should be fixed but with Guix we can introduce older
packages no problem. For this I created a special [[https://github.com/genenetwork/guix-cwl][channel]] and after
setting up the channel on Debian, Ubuntu, Fedora, Arch (whatever)
installation should be as easy as
setting up the channel (see the [[https://github.com/genenetwork/guix-cwl/blob/master/README.org][README]]) on Debian, Ubuntu, Fedora,
Arch (whatever) installation should be as easy as
: guix package -i cwltool -p ~/opt/cwl
@ -87,10 +109,10 @@ packages on Guix trunk needed to be updated, including [[https://gitlab.com/gene
and python-setuptools]]. This leads to the following dependency graph
for cwltool which is generated by Guix itself:
#+ATTR_HTML: :style margin-left: auto; margin-right: auto;
#+ATTR_HTML: :style margin-left: auto; margin-right: auto; width=100%
[[http://biogems.info/cwltool-references.svg]]
If Guix is correctly intalled most packages get downloaded and
If Guix is correctly installed most packages get downloaded and
installed as binaries. Guix only builds packages when it can not find
a binary substitute. And now I can run
@ -99,11 +121,11 @@ a binary substitute. And now I can run
Success!
We can have the main tools installed in one go with
So, after adding the cwl channel we can have the main tools installed in one go with
: guix package -i go-ipfs cwltool -p ~/opt/cwl
** Choosing a CWL workflow
* Choosing a CWL workflow
First I thought to run one of the pipelines from bcbio-nextgen as an
example. Bcbio generates CWL which is rather convenient. But then at
@ -118,7 +140,7 @@ know what is in them, i.e., there is a trust issue, and it is usually
impossible to recreate them exactly, which is a reproducibility
issue. We can do better than that.
** Add the data sources
* Add the data sources
After above installation of go-ipfs, following [[https://docs.ipfs.io/introduction/usage/][IPFS instructions]] create a data
directory
@ -160,7 +182,7 @@ Next you ought to pin the data so it does not get garbage collected by IPFS
: env IPFS_PATH=/export/data/ipfs ipfs pin add QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE
: pinned QmR81HRaDDvvEjnhrq5ftMF1KRtqp8MiwGzwZnsC4ch1tE recursively
** Run CWL script
* Run CWL script
Following the instructions in the original workflow README
@ -311,7 +333,7 @@ and CWL runs up to
: ILLUMINACLIP:/usr/local/share/trimmomatic/adapters/TruSeq2-PE.fa:2:40:15
: Error: Unable to access jarfile /usr/local/share/trimmomatic/trimmomatic.jar
** Adding a binary blob to GNU Guix
* Adding a binary blob to GNU Guix
Guix likes things to be built from source - it is a clear goal of the
GNU project and the whole system is designed around that. But you can
@ -322,26 +344,26 @@ pipelines this is a questionable design choice. Much of reproducible
science is about transparancy - and binary blobs do not cut
it. Anything that is not transparent ought to be questioned.
** Replacing the binary blob with a source build
* Replacing the binary blob with a source build
tbd
** Run the workflow inside an isolated container without network
* Run the workflow inside an isolated container without network
To really make sure no dependencies 'bleed' in and no data gets pulled
from the network we can run the workflow inside a container with no
other tools than those defined in the Guix dependency graph. In
addition the container can block the network.
** Prove results are deterministic
* Prove results are deterministic
tbd
** Capture the provenance graph
* Capture the provenance graph
tbd
** Discussion
* Discussion
Here we show the principle of a working reproducible pipeline. With
little effort, anywone can create such a pipeline using GNU Guix, an
