From ae1a7f0c8bed6b1a3445a4fac26a578851715629 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sat, 24 Sep 2016 06:59:37 +0000
Subject: Doc: Section on reproducibility - fixed SVG URLs

---
 doc/Architecture.org | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 doc/README.org       |  4 ++--
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/doc/Architecture.org b/doc/Architecture.org
index 04e05e40..ec56f9a9 100644
--- a/doc/Architecture.org
+++ b/doc/Architecture.org
@@ -2,6 +2,7 @@ * Table of Contents :TOC:
  - [[#introduction][Introduction]]
+ - [[#reproducibility-and-interoperability][Reproducibility and interoperability]]
  - [[#webserver][Webserver]]
  - [[#gnserver-rest][GnServer (REST)]]
  - [[#gnexec][GnExec]]
@@ -14,6 +15,66 @@ This document describes the architecture of GN2. Because GN2 is evolving, only a high-level overview is given here.
+* Reproducibility and interoperability
+
+Reproducible data analysis and software interoperability should be key
+goals for any system that aims to bring research groups
+together. These goals are increasingly relevant with growing data
+sizes and increasingly complex analysis pipelines. Rigor,
+reproducibility, and robustness start with data that should abide by
+Findable, Accessible, Interoperable, and Re-usable (FAIR) principles
+(see the Wilkinson Nature paper on [[http://www.nature.com/articles/sdata201618][FAIR Guiding Principles for
+scientific data management and stewardship]]).
+
+With GN2 we are meeting these requirements by assigning unique
+identifiers (cryptographic HASH values calculated over immutable data
+content and including that value in the file names or directories) and
+making these identifiers available through web interfaces (e.g.,
+through a REST API). This means that at any point in the future the
+exact same data can be retrieved using a known non-changeable
+identifier (see also
+https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org).
+
+Synchronisation, integrity checking and backups become trivial using
+these HASH values, even for very large datasets. Since everything is
+managed at the file system level we can also use Unix authorisation
+systems. HIPAA compliance is achieved by using HASH values and
+bringing the software into the controlled HIPAA environment.
+
+In the context of GeneNetwork we are using git and GitHub for version
+control of software source code
+(https://github.com/genenetwork/). Software can be treated just like
+data, i.e., git uses HASH identifiers to retrieve specific versions of
+source. That is, versions of source code are identifiable and retrievable
+and can be matched with data into an analysis pipeline. The
+combination of software and data, again, makes a unique HASH value
+which identifies the analysis pipeline.
+
+For combining runnable software and data into an analysis pipeline we
+use GNU Guix which, yet again, turns everything into a unique HASH
+value which allows for exact retrieval and reproducibility. Not only
+that, GNU Guix gives control of the software and all its dependencies,
+calculating a HASH value for all dependencies, all the way down to
+versions of R, BLAS and glibc. This way of packaging software
+ensures that identical software pipelines are easily set up on
+different systems or in the Cloud, meaning that everyone ends up using
+the exact same combination of software versions in a pipeline.
+
+For software development we use GNU Guix for integration testing and
+deployment (described in the JOSS paper). We also use automated test tools
+(Ruby mechanize) for integration testing of the web services and we
+use unit testing of all backend services.
+All our software source code
+is published as `free and open source software' (FOSS) which means
+that anyone can view code on GitHub, comment on it, or even
+contribute. GeneNetwork is becoming increasingly modular and has a
+growing number of contributors who, in principle, abide by the
+SMALL TOOLS MANIFESTO FOR BIOINFORMATICS which we wrote up
+(https://github.com/pjotrp/bioinformatics) and was signed by 51
+bioinformaticians.
+
 * Webserver
 The main [[https://github.com/genenetwork/genenetwork2][GN2 webserver]] is built on [[http://flask.pocoo.org/][Python flask]] and this GN2 source
diff --git a/doc/README.org b/doc/README.org
index 2b27d562..0f56914a 100644
--- a/doc/README.org
+++ b/doc/README.org
@@ -29,7 +29,7 @@ If you want to understand the architecture of GN2 read [[Architecture.org]]. The rest of this document is mostly on deployment of GN2.
-Large system deployments can get very [[http://biobeat.org/gn2.svg][complex]]. In this document we
+Large system deployments can get very [[http://biogems.info/contrib/genenetwork/gn2.svg][complex]]. In this document we
 explain the GeneNetwork version 2 (GN2) reproducible deployment system which is based on GNU Guix (see also Pjotr's [[https://github.com/pjotrp/guix-notes/blob/master/README.md][Guix-notes]]). The Guix system can be used to install GN with all its files and dependencies.
@@ -243,7 +243,7 @@ change the settings in etc/default_settings.py to match your path.
 Graph of all runtime dependencies as installed by GNU Guix.
 #+ATTR_HTML: :title GN2_graph
-[[http://biobeat.org/gn2.svg]]
+http://biogems.info/contrib/genenetwork/gn2.svg
 * Source deployment
-- 
cgit v1.2.3
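
Illustration (not part of the patch): the reproducibility section added to doc/Architecture.org describes identifiers built from a cryptographic hash over immutable data content, with the value included in the file name. A minimal Python sketch of that idea could look as follows. It assumes SHA-256 as the hash function and a `name-<hash>.ext` naming layout; the patch states neither, and the helper names `content_hash`, `hashed_name` and `verify` are hypothetical, not GN2 code.

```python
import hashlib
from pathlib import Path


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of the data (the algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def hashed_name(original_name: str, data: bytes, length: int = 16) -> str:
    """Build a file name that embeds a prefix of the content hash.

    The `stem-<hash>.suffix` layout is hypothetical, chosen for illustration.
    """
    path = Path(original_name)
    digest = content_hash(data)[:length]
    return f"{path.stem}-{digest}{path.suffix}"


def verify(name: str, data: bytes, length: int = 16) -> bool:
    """Check that the hash prefix embedded in the name matches the data."""
    embedded = Path(name).stem.rsplit("-", 1)[-1]
    return embedded == content_hash(data)[:length]


if __name__ == "__main__":
    payload = b"abc"
    name = hashed_name("genotypes.csv", payload)
    print(name)
    print(verify(name, payload))        # True: the content matches its identifier
    print(verify(name, b"tampered"))    # False: changed content no longer matches
```

Because the identifier is derived from the content alone, the same check serves for integrity checking after synchronisation or restore from backup: any change to the data yields a different hash and therefore a mismatch with the stored name.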