Diffstat (limited to 'doc')
-rw-r--r-- doc/Architecture.org 59
-rw-r--r-- doc/README.org 68
-rw-r--r-- doc/testing.org 43
3 files changed, 147 insertions, 23 deletions
diff --git a/doc/Architecture.org b/doc/Architecture.org
index 04e05e40..c5876196 100644
--- a/doc/Architecture.org
+++ b/doc/Architecture.org
@@ -2,6 +2,7 @@
* Table of Contents :TOC:
- [[#introduction][Introduction]]
+ - [[#reproducibility-and-interoperability][Reproducibility and interoperability]]
- [[#webserver][Webserver]]
- [[#gnserver-rest][GnServer (REST)]]
- [[#gnexec][GnExec]]
@@ -14,6 +15,64 @@
This document describes the architecture of GN2. Because GN2 is
evolving, only a high-level overview is given here.
+* Reproducibility and interoperability
+
+Reproducible data analysis and software interoperability should be key
+goals for any system that aims to bring research groups
+together. These goals are increasingly relevant with growing data
+sizes and increasingly complex analysis pipelines. Rigor,
+reproducibility, and robustness start with data that should abide by
+Findable, Accessible, Interoperable, and Re-usable (FAIR) principles
+(see Wilkinson et al., Scientific Data, [[http://www.nature.com/articles/sdata201618][FAIR Guiding Principles for
+scientific data management and stewardship]]).
+
+GeneNetwork (GN2) addresses this by assigning unique identifiers
+(cryptographic HASH values calculated over immutable data content),
+including these values in file or directory names, and making them
+available through web interfaces (e.g., through a REST API). This
+means that at any point in the future the exact same data can be
+retrieved using a known, unchangeable identifier (see also
+https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org).
+
+Synchronisation, integrity checking and backups become trivial using
+these HASH values, even for very large datasets. Since everything is
+managed at the file system level we can also use Unix authorisation
+systems. HIPAA compliance is achieved by using HASH references and
+bringing the software into the controlled HIPAA environment.
+
+In the context of GeneNetwork we are using git for version control of
+software source code (https://github.com/genenetwork/). Software can
+be treated just like data, i.e., git uses HASH identifiers to retrieve
+specific versions of source. Versions of source code are therefore
+identifiable and retrievable and can be matched with data in an
+analysis pipeline. The combination of software and data, again, yields
+a unique HASH value which identifies the analysis pipeline.
+
+For combining runnable software and data into an analysis pipeline we
+use GNU Guix which, yet again, turns everything into a unique HASH
+value which allows for exact retrieval and reproducibility. Not only
+that, GNU Guix gives control of the software and all its dependencies,
+use GNU Guix which, yet again, turns everything into a unique HASH
+value which allows for exact retrieval and reproducibility. Not only
+that, GNU Guix gives control of the software and all its dependencies,
+calculating a HASH value for all dependencies, all the way down to
+versions of R, BLAS and glibc. This way of packaging software
+ensures that identical software pipelines are easily set up on
+different systems or in the Cloud, meaning that everyone ends up using
+the exact same combination of software versions in a pipeline.
+
+For software development we use GNU Guix for integration testing and
+deployment (described in the JOSS paper). We also use automated test
+tools (Ruby mechanize) for integration testing of the web services and
+we use unit testing of all backend services. All our software source
+code is published as `free and open source software' (FOSS) which
+means that anyone can view the code on GitHub, comment on it, or even
+contribute to it. GeneNetwork is becoming increasingly modular and has
+a growing number of contributors who subscribe to the principles of
+THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS
+(https://github.com/pjotrp/bioinformatics) which we drew up and which
+was signed by over fifty bioinformaticians.
+
* Webserver
The main [[https://github.com/genenetwork/genenetwork2][GN2 webserver]] is built on [[http://flask.pocoo.org/][Python flask]] and this GN2 source
diff --git a/doc/README.org b/doc/README.org
index b3c78f29..0f56914a 100644
--- a/doc/README.org
+++ b/doc/README.org
@@ -6,7 +6,7 @@
- [[#step-1-install-gnu-guix][Step 1: Install GNU Guix]]
- [[#step-2-checkout-the-gn2-git-repositories][Step 2: Checkout the GN2 git repositories]]
- [[#step-3-authorize-the-gn-guix-server][Step 3: Authorize the GN Guix server]]
- - [[#step-4-install-and-run-gn2-][Step 4: Install and run GN2 ]]
+ - [[#step-4-install-and-run-gn2][Step 4: Install and run GN2]]
- [[#run-mysql-server][Run MySQL server]]
- [[#gn2-dependency-graph][GN2 Dependency Graph]]
- [[#source-deployment][Source deployment]]
@@ -20,6 +20,7 @@
- [[#importerror-no-module-named-jinja2][ImportError: No module named jinja2]]
- [[#error-can-not-find-directory-homegn2_data][ERROR: can not find directory $HOME/gn2_data]]
- [[#cant-run-a-module][Can't run a module]]
+ - [[#rpy2-error-show-now-found][Rpy2 error 'show' now found]]
- [[#irc-session][IRC session]]
* Introduction
@@ -28,7 +29,7 @@ If you want to understand the architecture of GN2 read
[[Architecture.org]]. The rest of this document is mostly on deployment
of GN2.
-Large system deployments can get very [[http://biobeat.org/gn2.svg][complex]]. In this document we
+Large system deployments can get very [[http://biogems.info/contrib/genenetwork/gn2.svg][complex]]. In this document we
explain the GeneNetwork version 2 (GN2) reproducible deployment system
which is based on GNU Guix (see also Pjotr's [[https://github.com/pjotrp/guix-notes/blob/master/README.md][Guix-notes]]). The Guix
system can be used to install GN with all its files and dependencies.
@@ -117,7 +118,7 @@ cd guix-gn-latest
** Step 3: Authorize the GN Guix server
GN2 has its own GNU Guix binary distribution server. To trust it you have
-to add the following key
+to add the following key
#+begin_src scheme
(public-key
@@ -136,9 +137,9 @@ guix archive --authorize
and hit Ctrl-D.
-Now you can use the substitute server to install GN2 binaries.
+Now you can use the substitute server to install GN2 binaries.
-** Step 4: Install and run GN2
+** Step 4: Install and run GN2
Since this is a quick and dirty install we are going to override the
GNU Guix package path by pointing the package path to our repository:
@@ -208,7 +209,7 @@ https://s3.amazonaws.com/genenetwork2/db_webqtl_s.zip
Check the md5sum.
After installation inflate the database binary in the MySQL directory
-(this installation path is subject to change soon)
+(this installation path is subject to change soon)
: chown -R mysql:mysql db_webqtl_s/
: chmod 700 db_webqtl_s/
@@ -242,7 +243,7 @@ change the settings in etc/default_settings.py to match your path.
Graph of all runtime dependencies as installed by GNU Guix.
#+ATTR_HTML: :title GN2_graph
-[[http://biobeat.org/gn2.svg]]
+http://biogems.info/contrib/genenetwork/gn2.svg
* Source deployment
@@ -271,10 +272,10 @@ R_LIBS_SITE are set) from the information given by guix:
Inside the repository:
: cd genenetwork2
-: ./bin/genenetwork2
+: ./bin/genenetwork2
-Will fire up your local repo http://localhost:5003/ using the
-settings in ./etc/default_settings.py. These settings may
+Will fire up your local repo http://localhost:5003/ using the
+settings in ./etc/default_settings.py. These settings may
not reflect your system. To override settings create your own from a copy of
default_settings.py and pass it into GN2 with
@@ -348,7 +349,7 @@ Make dirs
Add users
-: adduser nobody ; addgroup nobody
+: adduser nobody ; addgroup nobody
Run nginx
@@ -392,6 +393,12 @@ Make a note of the paths with
./pre-inst-env guix package --search-paths
#+end_src bash
+or, if guix is already installed, this should also work
+
+#+begin_src bash
+guix package --search-paths
+#+end_src bash
+
After setting the paths for the server
#+begin_src bash
@@ -413,7 +420,7 @@ genenetwork2
will start the default server which listens on port 5003, i.e.,
http://localhost:5003/.
-OK, we are where we were before with step 4. Only difference is that we
+OK, we are where we were before with step 4. Only difference is that we
used our own compiled guix server.
* Trouble shooting
@@ -433,7 +440,7 @@ On one system:
: export R_LIBS_SITE="$HOME/.guix-profile/site-library/"
: export GEM_PATH="$HOME/.guix-profile/lib/ruby/gems/2.2.0"
-and perhaps a few more.
+and perhaps a few more.
** ERROR: can not find directory $HOME/gn2_data
The default settings file looks in your $HOME/gn2_data. Since these
@@ -447,6 +454,21 @@ In rare cases, development modules are not brought in with Guix
because no source code is available. This can lead to missing modules
on a running server. Please check with the authors when a module
is missing.
+** Rpy2 error 'show' now found
+
+This error
+
+: __show = rpy2.rinterface.baseenv.get("show")
+: LookupError: 'show' not found
+
+means that R was updated in your path, and that Rpy2 needs to be
+recompiled against this R - don't you love informative messages?
+
+In our case it means that GN's PYTHONPATH is not in sync with
+R_LIBS_SITE. Please check your GNU Guix GN2 installation paths,
+you may need to reinstall. Note that this may be the point at which
+you want to start using profiles (see profile section).
+
* IRC session
Here an IRC session where we installed GN2 from scratch using GNU Guix
@@ -466,7 +488,7 @@ and a download of the test database.
<user01> set to the ones in ~/.guix-profile/
<pjotrp> good, and you are in gn-latest-guix repo [07:06]
<user01> yep [07:07]
-<pjotrp> git log shows
+<pjotrp> git log shows
Author: David Thompson <dthompson2@worcester.edu>
Date: Sun Mar 27 21:20:19 2016 -0400
@@ -488,7 +510,7 @@ genenetwork2-files-small 1.0 out ../guix-bioinformatics/gn/packages/g
<user01> hah, I don't have screen installed yet [07:11]
<pjotrp> comes with guix ;) [07:12]
<pjotrp> no worries, you can run it any way you want
-<pjotrp> $HOME/.guix-profile/bin/guix-daemon --build-users-group=guixbuild
+<pjotrp> $HOME/.guix-profile/bin/guix-daemon --build-users-group=guixbuild
<user01> then something's weird, because it says I don't have it
<pjotrp> oh, you need to install it first [07:13]
<pjotrp> guix package -A screen
@@ -546,11 +568,11 @@ The following derivations would be built:
<pjotrp> https://github.com/pjotrp/guix-notes/blob/master/REPRODUCIBLE.org
<pjotrp> this is exactly what we are doing now
<user01> alrighty [07:35]
-<pjotrp> To see if a remote server has a guix server running it should respond
+<pjotrp> To see if a remote server has a guix server running it should respond
[07:36]
<pjotrp> lynx http://guix.genenetwork.org:8080 --dump
<pjotrp> Resource not found: /
-<pjotrp>
+<pjotrp>
<pjotrp> you see that?
<user01> yes [07:37]
<pjotrp> good. The main hydra server is too slow. So on my laptop I forced
@@ -558,7 +580,7 @@ The following derivations would be built:
<pjotrp> env GUIX_PACKAGE_PATH=../guix-bioinformatics/ ./pre-inst-env guix
package -i genenetwork2 --dry-run
--substitute-urls="http://mirror.hydra.gnu.org"
-<pjotrp>
+<pjotrp>
<pjotrp> the list looks the same to me [07:40]
<user01> me too
<pjotrp> note that some packages will be built and some downloaded, right?
@@ -688,7 +710,7 @@ The following derivations would be built:
<pjotrp> everything should be pre-built from guix.genenetwork.org
<pjotrp> you are downloading?
<user02> yes [09:15]
-<pjotrp> cool. Maybe an idea to set up a server
+<pjotrp> cool. Maybe an idea to set up a server
<pjotrp> for your own use
<user02> Stuck at downloading preprocesscore
<pjotrp> should not [09:24]
@@ -735,7 +757,7 @@ The following derivations would be built:
<pjotrp> should be at
/gnu/store/y1f3r2xs3fhyadd46nd2aqbr2p9qv2ra-r-biocpreprocesscore-1.32.0
[09:33]
-<pjotrp>
+<pjotrp>
<user03> pjotrp: Possibly we should use the archive utility of Guix to do
deployment to avoid such out-of-sync differences :) [09:34]
<pjotrp> maybe. I did not get archive to update profiles properly [09:37]
@@ -802,7 +824,7 @@ The following derivations would be built:
<pjotrp> but do not checkout that genetwork2_diet
<pjotrp> we reverted to the main tree
<pjotrp> clone git@github.com:genenetwork/genenetwork2.git [09:53]
-<pjotrp> instead and checkout the staging branch
+<pjotrp> instead and checkout the staging branch
<pjotrp> that is effectively my branch [09:54]
<pjotrp> when that is done you should be able to fire up the webserver from
there [09:55]
@@ -825,7 +847,7 @@ The following derivations would be built:
<user01> yep
<pjotrp> that can also run on remote files over ssh
<pjotrp> that's an alternative
-<pjotrp> kudos for using emacs :), wdyt user03
+<pjotrp> kudos for using emacs :), wdyt user03
<user02> 79 minutes to go downloading the db
<pjotrp> user02: sorry about that [09:59]
<pjotrp> it is 2GB
@@ -850,7 +872,7 @@ The following derivations would be built:
--substitute-urls="http://guix.genenetwork.org:8080" [10:08]
<pjotrp> elixir 1.2.3 out
../guix-bioinformatics/gn/packages/elixir.scm:31:2
-<pjotrp>
+<pjotrp>
<pjotrp> I am building it on guix.genenetwork.org right now [10:09]
<user01> nice [10:10]
#+end_src
diff --git a/doc/testing.org b/doc/testing.org
new file mode 100644
index 00000000..1d5cc8b8
--- /dev/null
+++ b/doc/testing.org
@@ -0,0 +1,43 @@
+#+TITLE: Testing GN2
+
+* Table of Contents :TOC:
+ - [[#introduction][Introduction]]
+ - [[#run-tests][Run tests]]
+ - [[#setup][Setup]]
+ - [[#running][Running]]
+
+* Introduction
+
+For integration testing we currently use the brilliant Ruby Mechanize
+gem against the small database; a setup we call Mechanical Rob because
+it emulates someone clicking through the website and checking results.
+
+These scripts invoke calls to a running webserver and test the
+response. If a page changes or is broken, tests will fail and we are
+informed. In principle, Mechanical Rob is run before code merges are
+committed to the main server.
+
+In the future we may move to Python mechanize - it'll be easy to mix
+the Ruby and Python versions.
+
+* Run tests
+
+** Setup
+
+Mechanize is not yet included in the Guix deployment.
+
+
+** Running
+
+Run the tests from the root of the genenetwork2 source tree as, for
+example (this URL is the default),
+
+: ./bin/test-website http://localhost:5003/
+
+If you are using the small deployment database you can use
+
+: ./bin/test-website --skip -n
+
+To run individual tests on localhost you can do
+
+: ruby -Itest -Itest/lib test/lib/mapping.rb --name="/Mapping/"