Diffstat (limited to 'doc')
-rw-r--r-- doc/Architecture.org 61
-rw-r--r-- doc/README.org 4
2 files changed, 63 insertions, 2 deletions
diff --git a/doc/Architecture.org b/doc/Architecture.org
index 04e05e40..ec56f9a9 100644
--- a/doc/Architecture.org
+++ b/doc/Architecture.org
@@ -2,6 +2,7 @@
* Table of Contents :TOC:
- [[#introduction][Introduction]]
+ - [[#reproducibility-and-interoperability][Reproducibility and interoperability]]
- [[#webserver][Webserver]]
- [[#gnserver-rest][GnServer (REST)]]
- [[#gnexec][GnExec]]
@@ -14,6 +15,66 @@
This document describes the architecture of GN2. Because GN2 is
evolving, only a high-level overview is given here.
+* Reproducibility and interoperability
+
+Reproducible data analysis and software interoperability should be key
+goals for any system that aims to bring research groups
+together. These goals are increasingly relevant with growing data
+sizes and increasingly complex analysis pipelines. Rigor,
+reproducibility, and robustness start with data that should abide by
+Findable, Accessible, Interoperable, and Re-usable (FAIR) principles
+(see the Wilkinson et al. paper in Scientific Data on the [[http://www.nature.com/articles/sdata201618][FAIR Guiding Principles for
+scientific data management and stewardship]]).
+
+With GN2 we meet these requirements by assigning unique identifiers
+(cryptographic HASH values that are calculated over immutable data
+content and included in file or directory names) and by making these
+identifiers available through web interfaces (e.g., through a REST
+API). This means that at any point in the future the exact same data
+can be retrieved using a known, unchanging identifier (see also
+https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org).
+
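As a minimal sketch of the identifier scheme described above: the function names, the choice of SHA-256, and the `<hash>-<name>` layout are assumptions for illustration and are not taken from the GN2 source.

```python
import hashlib
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large datasets never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashed_name(path):
    """Build a file name that embeds the content hash (hypothetical layout)."""
    path = Path(path)
    return f"{content_hash(path)}-{path.name}"
```

Because the name is derived from the content alone, two files with identical bytes get the same identifier wherever they are stored, and any change to the bytes produces a different name.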
+Synchronisation, integrity checking and backups become straightforward
+with these HASH values, even for very large datasets, because comparing
+two HASH values is enough to tell whether two copies are identical.
+Since everything is managed at the file system level we can also use
+Unix authorisation systems. HIPAA compliance is achieved by using HASH
+values and bringing the software into the controlled HIPAA environment.
+
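The integrity check mentioned above could look roughly like the following. This is a sketch only: it assumes a hypothetical file layout of `<sha256-hex>-<original name>`, which the source does not specify, and the function names are invented for illustration.

```python
import hashlib
from pathlib import Path


def verify(path):
    """Check a file against the hash embedded in its name.

    Assumes the hypothetical layout '<sha256-hex>-<original name>'.
    Reads the whole file at once; very large files would be read in chunks.
    """
    path = Path(path)
    expected, _, _ = path.name.partition("-")
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == expected


def corrupted(directory):
    """List names of files whose content no longer matches their name."""
    return sorted(p.name for p in Path(directory).iterdir()
                  if p.is_file() and not verify(p))
```

A backup or mirror can then be validated by running `corrupted()` over the target directory and checking that the result is empty.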
+In the context of GeneNetwork we use Git and GitHub for version
+control of software source code
+(https://github.com/genenetwork/). Software can be treated just like
+data: Git uses HASH identifiers to retrieve specific versions of
+source. Versions of source code are therefore identifiable and
+retrievable and can be matched with data in an analysis pipeline. The
+combination of software and data, again, yields a unique HASH value
+which identifies the analysis pipeline.
+
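One way to picture how software and data together yield a single identifier is sketched below. The scheme (SHA-256 over the source commit id followed by the sorted data hashes) and the function name are assumptions made for illustration; the source does not describe how GN2 combines them.

```python
import hashlib


def pipeline_id(source_commit, data_hashes):
    """Derive one identifier from a source revision and its input data hashes.

    Hypothetical scheme: SHA-256 over the commit id followed by the
    sorted data hashes, one per line.
    """
    digest = hashlib.sha256()
    digest.update(source_commit.encode("ascii") + b"\n")
    # Sorting makes the identifier independent of the order of the inputs.
    for data_hash in sorted(data_hashes):
        digest.update(data_hash.encode("ascii") + b"\n")
    return digest.hexdigest()
```

Changing either the source revision or any input dataset changes the resulting identifier, so the same identifier implies the same software applied to the same data.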
+For combining runnable software and data into an analysis pipeline we
+use GNU Guix which, yet again, turns everything into a unique HASH
+value which allows for exact retrieval and reproducibility. Not only
+that, GNU Guix gives control of the software and all its dependencies,
+calculating a HASH value for all dependencies, all the way down to
+versions of R, BLAS and glibc. This way of packaging software
+ensures that identical software pipelines are easily set up on
+different systems or in the cloud, so that everyone ends up using
+the exact same combination of software versions in a pipeline.
+
+For software development we use GNU Guix for integration testing and
+deployment (described in the JOSS paper). We also use automated test
+tools (Ruby mechanize) for integration testing of the web services and
+we use unit testing of all backend services. All our software source
+code is published as `free and open source software' (FOSS) which means
+that anyone can view the code on GitHub, comment on it, or even
+contribute. GeneNetwork is becoming increasingly modular and has a
+growing number of contributors who, in principle, abide by the Small
+Tools Manifesto for Bioinformatics, which we wrote up
+(https://github.com/pjotrp/bioinformatics) and which was signed by 51
+bioinformaticians.
+
* Webserver
The main [[https://github.com/genenetwork/genenetwork2][GN2 webserver]] is built on [[http://flask.pocoo.org/][Python flask]] and this GN2 source
diff --git a/doc/README.org b/doc/README.org
index 2b27d562..0f56914a 100644
--- a/doc/README.org
+++ b/doc/README.org
@@ -29,7 +29,7 @@ If you want to understand the architecture of GN2 read
[[Architecture.org]]. The rest of this document is mostly on deployment
of GN2.
-Large system deployments can get very [[http://biobeat.org/gn2.svg][complex]]. In this document we
+Large system deployments can get very [[http://biogems.info/contrib/genenetwork/gn2.svg][complex]]. In this document we
explain the GeneNetwork version 2 (GN2) reproducible deployment system
which is based on GNU Guix (see also Pjotr's [[https://github.com/pjotrp/guix-notes/blob/master/README.md][Guix-notes]]). The Guix
system can be used to install GN with all its files and dependencies.
@@ -243,7 +243,7 @@ change the settings in etc/default_settings.py to match your path.
Graph of all runtime dependencies as installed by GNU Guix.
#+ATTR_HTML: :title GN2_graph
-[[http://biobeat.org/gn2.svg]]
+http://biogems.info/contrib/genenetwork/gn2.svg
* Source deployment