From ae1a7f0c8bed6b1a3445a4fac26a578851715629 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sat, 24 Sep 2016 06:59:37 +0000
Subject: Doc: Section on reproducibility      - fixed SVG URLs

---
 doc/Architecture.org | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 doc/README.org       |  4 ++--
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/doc/Architecture.org b/doc/Architecture.org
index 04e05e40..ec56f9a9 100644
--- a/doc/Architecture.org
+++ b/doc/Architecture.org
@@ -2,6 +2,7 @@
 
 * Table of Contents                                                     :TOC:
  - [[#introduction][Introduction]]
+ - [[#reproducibility-and-interoperability][Reproducibility and interoperability]]
  - [[#webserver][Webserver]]
  - [[#gnserver-rest][GnServer (REST)]]
  - [[#gnexec][GnExec]]
@@ -14,6 +15,66 @@
 This document describes the architecture of GN2. Because GN2 is
 evolving, only a high-level overview is given here.
 
+* Reproducibility and interoperability
+
+Reproducible data analysis and software interoperability should be key
+goals for any system that aims to bring research groups
+together. These goals are increasingly relevant with growing data
+sizes and increasingly complex analysis pipelines. Rigor,
+reproducibility, and robustness start with data that should abide by
+Findable, Accessible, Interoperable, and Reusable (FAIR) principles
+(see Wilkinson et al. in Scientific Data: [[http://www.nature.com/articles/sdata201618][FAIR Guiding Principles for
+scientific data management and stewardship]]).
+
+With GN2 we meet these requirements by assigning unique
+identifiers (cryptographic HASH values calculated over immutable data
+content, with the value included in the file or directory name) and
+making these identifiers available through web interfaces (e.g.,
+through a REST API). This means that at any point in the future the
+exact same data can be retrieved using a known, immutable
+identifier (see also
+https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org).
+
+Synchronisation, integrity checking and backups become trivial using
+these HASH values, even for very large datasets. Since everything is
+managed at the file system level we can also use Unix authorisation
+systems. HIPAA compliance is achieved by using HASH values and
+bringing the software into the controlled HIPAA environment.
+
+In the context of GeneNetwork we are using git and GitHub for version
+control of software source code
+(https://github.com/genenetwork/). Software can be treated just like
+data: git uses HASH identifiers to retrieve specific versions of
+source, so versions of source code are identifiable and retrievable
+and can be matched with data into an analysis pipeline. The
+combination of software and data, again, makes a unique HASH value
+which identifies the analysis pipeline.
+
+For combining runnable software and data into an analysis pipeline we
+use GNU Guix which, yet again, turns everything into a unique HASH
+value which allows for exact retrieval and reproducibility. Not only
+that, GNU Guix gives control of the software and all its dependencies,
+calculating a HASH value for all dependencies, all the way down to
+versions of R, BLAS and glibc. This way of packaging software
+ensures that identical software pipelines are easily set up on
+different systems or in the Cloud. This means that everyone ends up using
+the exact same combination of software versions in a pipeline.
+
+For software development we use GNU Guix for integration testing and
+deployment (described in the JOSS paper). We also use automated test tools
+(Ruby mechanize) for integration testing of the web services and we
+use unit testing of all backend services. All our software source code
+is published as `free and open source software' (FOSS) which means
+that anyone can view code on GitHub, comment on it, or even
+contribute. GeneNetwork is becoming increasingly modular and has a
+growing number of contributors who, in principle, abide by the
+Small Tools Manifesto for Bioinformatics, which we wrote up
+(https://github.com/pjotrp/bioinformatics) and which was signed by 51
+bioinformaticians.
+
 * Webserver
 
 The main [[https://github.com/genenetwork/genenetwork2][GN2 webserver]] is built on [[http://flask.pocoo.org/][Python flask]] and this GN2 source
diff --git a/doc/README.org b/doc/README.org
index 2b27d562..0f56914a 100644
--- a/doc/README.org
+++ b/doc/README.org
@@ -29,7 +29,7 @@ If you want to understand the architecture of GN2 read
 [[Architecture.org]].  The rest of this document is mostly on deployment
 of GN2.
 
-Large system deployments can get very [[http://biobeat.org/gn2.svg][complex]]. In this document we
+Large system deployments can get very [[http://biogems.info/contrib/genenetwork/gn2.svg][complex]]. In this document we
 explain the GeneNetwork version 2 (GN2) reproducible deployment system
 which is based on GNU Guix (see also Pjotr's [[https://github.com/pjotrp/guix-notes/blob/master/README.md][Guix-notes]]). The Guix
 system can be used to install GN with all its files and dependencies.
@@ -243,7 +243,7 @@ change the settings in etc/default_settings.py to match your path.
 Graph of all runtime dependencies as installed by GNU Guix.
 
 #+ATTR_HTML: :title GN2_graph
-[[http://biobeat.org/gn2.svg]]
+http://biogems.info/contrib/genenetwork/gn2.svg
 
 * Source deployment
 
-- 
cgit v1.2.3