author Pjotr Prins 2016-09-24 06:59:37 +0000
committer Pjotr Prins 2016-09-25 07:38:36 +0000
commit ae1a7f0c8bed6b1a3445a4fac26a578851715629
tree e82487dd9d0a23b64e1114bcaff4bf92e81bc422 /doc/Architecture.org
parent 62cacc047ae502d7ff052466b1dcc50e4d895f3b
Doc: Section on reproducibility
- fixed SVG URLs
Diffstat (limited to 'doc/Architecture.org')
-rw-r--r-- doc/Architecture.org 61
1 files changed, 61 insertions, 0 deletions
diff --git a/doc/Architecture.org b/doc/Architecture.org
index 04e05e40..ec56f9a9 100644
--- a/doc/Architecture.org
+++ b/doc/Architecture.org
@@ -2,6 +2,7 @@
* Table of Contents :TOC:
- [[#introduction][Introduction]]
+ - [[#reproducibility-and-interoperability][Reproducibility and interoperability]]
- [[#webserver][Webserver]]
- [[#gnserver-rest][GnServer (REST)]]
- [[#gnexec][GnExec]]
@@ -14,6 +15,66 @@
This document describes the architecture of GN2. Because GN2 is
evolving, only a high-level overview is given here.
+* Reproducibility and interoperability
+
+Reproducible data analysis and software interoperability should be key
+goals for any system that aims to bring research groups
+together. These goals are increasingly relevant with growing data
+sizes and increasingly complex analysis pipelines. Rigor,
+reproducibility, and robustness start with data that abide by the
+Findable, Accessible, Interoperable, and Re-usable (FAIR) principles
+(see the Wilkinson Nature paper on [[http://www.nature.com/articles/sdata201618][FAIR Guiding Principles for
+scientific data management and stewardship]]).
+
+With GN2 we meet these requirements by assigning unique identifiers
+(cryptographic HASH values calculated over immutable data content,
+with the value included in the file or directory names) and
+making these identifiers available through web interfaces (e.g.,
+through a REST API). This means that at any point in the future the
+exact same data can be retrieved using a known, immutable
+identifier (see also
+https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org).
+
+Synchronisation, integrity checking and backups become trivial using
+these HASH values, even for very large datasets. Since everything is
+managed at the file system level we can also use Unix authorisation
+systems. HIPAA compliance is achieved by using HASH values and
+bringing the software into the controlled HIPAA environment.
+
+In the context of GeneNetwork we are using git and GitHub for version
+control of software source code
+(https://github.com/genenetwork/). Software can be treated just like
+data: git uses HASH identifiers to retrieve specific versions of
+source, so versions of source code are identifiable and retrievable
+and can be matched with data in an analysis pipeline. The
+combination of software and data, again, yields a unique HASH value
+which identifies the analysis pipeline.
+
+For combining runnable software and data into an analysis pipeline we
+use GNU Guix which, yet again, turns everything into a unique HASH
+value which allows for exact retrieval and reproducibility. Not only
+that, GNU Guix gives control of the software and all its dependencies,
+calculating a HASH value for all dependencies, all the way down to
+versions of R, BLAS and glibc. This way of packaging software
+ensures that identical software pipelines are easily set up on
+different systems or in the cloud, meaning that everyone ends up using
+the exact same combination of software versions in a pipeline.
+
+For software development we use GNU Guix for integration testing and
+deployment (described in the JOSS paper). We also use automated test tools
+(Ruby mechanize) for integration testing of the web services and we
+use unit testing of all backend services. All our software source code
+is published as `free and open source software' (FOSS) which means
+that anyone can view the code on GitHub, comment on it, or even
+contribute. GeneNetwork is becoming increasingly modular and has a
+growing number of contributors who, in principle, abide by
+THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS, which we wrote up
+(https://github.com/pjotrp/bioinformatics) and which was signed by 51
+bioinformaticians.
+
* Webserver
The main [[https://github.com/genenetwork/genenetwork2][GN2 webserver]] is built on [[http://flask.pocoo.org/][Python flask]] and this GN2 source