#+TITLE: Installing GeneNetwork services

* Table of Contents                                                     :TOC:
 - [[#introduction][Introduction]]
 - [[#reproducibility-and-interoperability][Reproducibility and interoperability]]
 - [[#webserver][Webserver]]
 - [[#gnserver-rest][GnServer (REST)]]
 - [[#gnexec][GnExec]]
 - [[#database][Database]]
   - [[#phenotypes][Phenotypes]]
   - [[#genotypes][Genotypes]]

* Introduction

This document describes the architecture of GN2. Because GN2 is
evolving, only a high-level overview is given here.

* Reproducibility and interoperability

Reproducible data analysis and software interoperability should be key
goals for any system that aims to bring research groups
together. These goals are increasingly relevant with growing data
sizes and increasingly complex analysis pipelines. Rigor,
reproducibility, and robustness start with data that abide by the
Findable, Accessible, Interoperable, and Re-usable (FAIR) principles
(see the Wilkinson Nature paper on [[http://www.nature.com/articles/sdata201618][FAIR Guiding Principles for
scientific data management and stewardship]]).

With GN2 we meet these requirements by assigning unique
identifiers (cryptographic HASH values calculated over immutable data
content, with the value included in file or directory names) and
making these identifiers available through web interfaces (e.g.,
through a REST API). This means that at any point in the future the
exact same data can be retrieved using a known non-changeable
identifier (see also
https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org).
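
As a minimal sketch of this idea, the following Python computes a
HASH over a file's content and embeds it in the file name. The choice
of SHA-256, the 16-character prefix and the naming scheme are
assumptions for illustration only; they are not taken from the GN2
source.

#+begin_src python
import hashlib
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file's bytes (read in chunks)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashed_name(path, length=16):
    """Build a file name that embeds a prefix of the content hash.

    The '<stem>-<hash><suffix>' layout is a hypothetical naming scheme.
    """
    path = Path(path)
    return f"{path.stem}-{content_hash(path)[:length]}{path.suffix}"
#+end_src

Because the name depends only on the content, two files with the same
bytes always get the same identifier, and any change to the content
yields a different one.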

Synchronisation, integrity checking and backups become trivial using
these HASH values, even for very large datasets. Since everything is
managed at the file system level we can also use Unix authorisation
systems. HIPAA compliance is achieved by using HASH values and
bringing the software into the controlled HIPAA environment.
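
To illustrate why synchronisation and integrity checking become
simple, here is a small sketch that builds a manifest of HASH values
for a directory tree and lists the files a mirror is missing or holds
with different content. The function names and the use of SHA-256 are
assumptions; GN2 may do this differently.

#+begin_src python
import hashlib
from pathlib import Path


def manifest(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def out_of_sync(source, mirror):
    """Return the sorted names that are missing from, or differ in, the mirror."""
    return sorted(
        name for name, digest in source.items() if mirror.get(name) != digest
    )
#+end_src

Only the manifests need to be compared, so the check stays cheap even
when the files themselves are large.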

In the context of GeneNetwork we use git and GitHub for version
control of software source code
(https://github.com/genenetwork/). Software can be treated just like
data: git uses HASH identifiers to retrieve specific versions of
source, so versions of source code are identifiable and retrievable
and can be matched with data into an analysis pipeline. The
combination of software and data, again, yields a unique HASH value
which identifies the analysis pipeline.

For combining runnable software and data into an analysis pipeline we
use GNU Guix which, yet again, turns everything into a unique HASH
value which allows for exact retrieval and reproducibility. Not only
that, GNU Guix gives control of the software and all its dependencies,
calculating a HASH value for all dependencies, all the way down to
versions of R, BLAS and glibc. This way of packaging software
ensures that identical software pipelines are easily set up on
different systems or in the Cloud, so that everyone ends up using
the exact same combination of software versions in a pipeline.

For software development we use GNU Guix for integration testing and
deployment (described in the JOSS paper). We also use automated test
tools (Ruby mechanize) for integration testing of the web services and
we use unit testing of all backend services. All our software source
code is published as `free and open source software' (FOSS) which
means that anyone can view code on GitHub, comment on it, or even
contribute. GeneNetwork is becoming increasingly modular and has a
growing number of contributors who, in principle, abide by THE
SMALL TOOLS MANIFESTO FOR BIOINFORMATICS which we wrote up
(https://github.com/pjotrp/bioinformatics) and which was signed by 51
bioinformaticians.

* Webserver

The main [[https://github.com/genenetwork/genenetwork2][GN2 webserver]] is built on [[http://flask.pocoo.org/][Python Flask]] and this GN2 source
code can be found on [[https://github.com/genenetwork/genenetwork2/tree/master/wqflask/wqflask][github]] in the wqflask directory. The routing
tables are defined in [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/views.py][views.py]]. For example, the main page is loaded
from a template named [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/templates/index_page.htm][index_page.htm]] in the [[https://github.com/genenetwork/genenetwork2/tree/master/wqflask/wqflask/templates][templates]] directory. In
the template the form gets filled by a JavaScript
routine defined in [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/static/new/javascript/dataset_select_menu.js][dataset_select_menu.js]] which picks up a static JSON
file for the menu. This static file is generated with
[[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/maintenance/gen_select_dataset.py][gen_select_dataset.py]].  Note that this JSON data is served by
gn_server in the latest version, see [[#gnserver-rest][GnServer (REST)]].

When you hit a search with, for example,
'http://localhost:5003/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=&search_terms_and=MEAN%3D%2815+16%29+LRS%3D%2823+46%29+&FormID=searchResult'
the URL carries the menu items as parameters. According to the routing
table, the search is executed and Redis caching is used (we will
probably move the caching to the level of the gn_server). The logic is in
search_result.py which invokes database functions in
wqflask/dbFunction/webqtlDatabaseFunction.py, for example. The
receiving template lives at [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/templates/search_result_page.html][search_result_page.html]].
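
To make the parameters in the example URL visible, the sketch below
decodes its query string with the Python standard library. It only
illustrates what the URL carries; it is not the parsing code used by
GN2, and the helper name search_params is made up for this example.

#+begin_src python
from urllib.parse import parse_qs, urlsplit

# The example search URL from the text above.
SEARCH_URL = (
    "http://localhost:5003/search?species=mouse&group=BXD"
    "&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or="
    "&search_terms_and=MEAN%3D%2815+16%29+LRS%3D%2823+46%29+"
    "&FormID=searchResult"
)


def search_params(url):
    """Return the query parameters of a search URL as a plain dict.

    Blank values (such as search_terms_or) are kept, and '+' and
    percent-escapes are decoded.
    """
    query = urlsplit(url).query
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


params = search_params(SEARCH_URL)
#+end_src

Here params["search_terms_and"] decodes to the text
'MEAN=(15 16) LRS=(23 46) ', which corresponds to the ranges shown in
the plain English summary of the search.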

For what happens at the database level see [[database.org]].

A view consists of an HTML template with JS libraries for managing
menus, tables etc. For example, for the search results see the
[[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/templates/search_result_page.html][search_result_page.html]] which is a Flask template. The first section
puts the search in plain English, e.g. 'We searched Hippocampus
Consortium M430v2 (Jun06) PDNN to find all records with MEAN between
15 and 16 and with LRS between 23 and 46.'. Then the results are added
to a table which is displayed using a JS [[https://datatables.net/][DataTable container]].

* GnServer (REST)

The [[https://github.com/genenetwork/gn_server][GnServer REST API]] is built on high performance [[http://elixir-lang.org/][Elixir]] with [[https://github.com/falood/maru][Maru]].
GnServer mainly serves JSON requests, for example to fetch data
from the database. To get the menu data in YAML you can do something like

: curl localhost:8880/int/menu/main.json|ruby extra/json2yaml.rb

(json2yaml.rb is in the gn_server repo). For the current API definition
see [[https://github.com/genenetwork/gn_server/doc/API.md][GnServer REST API]] documentation.
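
The same request can be sketched in Python. The host, port and path
are taken from the curl example above; the function name fetch_menu
and its opener parameter are assumptions added so the call can be
exercised without a running server. Nothing is assumed here about the
structure of the returned JSON.

#+begin_src python
import json
from urllib.request import urlopen


def fetch_menu(base_url="http://localhost:8880", opener=urlopen):
    """Fetch the menu JSON from a GnServer instance and decode it.

    'opener' is any callable that takes a URL and returns a readable
    context manager; it defaults to urllib.request.urlopen.
    """
    with opener(f"{base_url}/int/menu/main.json") as response:
        return json.loads(response.read().decode("utf-8"))
#+end_src

With a server listening on localhost:8880, calling fetch_menu()
returns the decoded menu data as ordinary Python objects.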

* GnExec

GnExec, also written in Elixir, executes commands using a separate
daemon.

* Database
** Phenotypes

Phenotypes are stored in the SQL database.  For what happens at the
database level see [[database.org]]. A test database can be downloaded;
see the installation [[./README.org][instructions]].

** Genotypes

Genotypes are stored in genotype files. These are part of the GNU Guix
distribution, see the installation [[./README.org][instructions]]. Genotype files are
currently in GN1 format, and will be aligned with the [[http://kbroman.org/qtl2/pages/sampledata.html][R/qtl2 formats]].

The GN1-style format (still the default in GN2) for the stored file BXD.geno looks like:

#+begin_src js
@name:BXD
@type:riset
@mat:B
@pat:D
@het:H
@unk:U
Chr Locus cM Mb BXD1 BXD2 BXD5 BXD6 BXD8 BXD9 BXD11 BXD12 BXD13 BXD14 BXD15 BXD16 BXD18 BXD19 BXD20 BXD21 BXD22 BXD23 BXD24a BXD24 BXD25 BXD27 BXD28 BXD29 BXD30 BXD31 BXD32 BXD33 BXD34 BXD35 BXD36 BXD37 BXD38 BXD39 BXD40 BXD41 BXD42 BXD43 BXD44 BXD45 BXD48 BXD49 BXD50 BXD51 BXD52 BXD53 BXD54 BXD55 BXD56 BXD59 BXD60 BXD61 BXD62 BXD63 BXD64 BXD65 BXD66 BXD67 BXD68 BXD69 BXD70 BXD71 BXD72 BXD73 BXD74 BXD75 BXD76 BXD77 BXD78 BXD79 BXD80 BXD81 BXD83 BXD84 BXD85 BXD86 BXD87 BXD88 BXD89 BXD90 BXD91 BXD92 BXD93 BXD94 BXD95 BXD96 BXD97 BXD98 BXD99 BXD100 BXD101 BXD102 BXD103
1 rs6269442 0.0 3.482275 B B D D D B B D B B D D B D D D D B B B D B D D B B B B B B B B B D B D B B D B B H H B D B B H H B B D D D D D B B H B B B B D B D B D D D D D H B D D B D B B D D B D D B B B B B B B D
1 rs6365999 0.0 4.811062 B B D D D B B D B B D D B D D D D B B B D B D D B B B B B B B B B D B D B B D B B H H B D B B H H B B D D D D D B B H B B B B D B D B D D D D D H B D D B D B B D D B D D B B B B B B U D
...
#+end_src

This file gets loaded, for example, in the method run_rqtl_geno. For
GnServer, however, we only want to deal with standardized R/qtl
formatted data, so with gn_extra we convert the original format into
R/qtl format using geno2rqtl, with one adaptation: the geno table is
transposed, so it becomes

#+begin_src js
marker,BXD1,BXD2,BXD5,BXD6,BXD8,BXD9,BXD11,BXD12,BXD13,BXD14,BXD15,BXD16,BXD18,BXD19,BXD20,BXD21,BXD22,BXD23,BXD24a,BXD24,BXD25,BXD27,BXD28,BXD29,BXD30,BXD31,BXD32,BXD33,BXD34,BXD35,BXD36,BXD37,BXD38,BXD39,BXD40,BXD41,BXD42,BXD43,BXD44,BXD45,BXD48,BXD49,BXD50,BXD51,BXD52,BXD53,BXD54,BXD55,BXD56,BXD59,BXD60,BXD61,BXD62,BXD63,BXD64,BXD65,BXD66,BXD67,BXD68,BXD69,BXD70,BXD71,BXD72,BXD73,BXD74,BXD75,BXD76,BXD77,BXD78,BXD79,BXD80,BXD81,BXD83,BXD84,BXD85,BXD86,BXD87,BXD88,BXD89,BXD90,BXD91,BXD92,BXD93,BXD94,BXD95,BXD96,BXD97,BXD98,BXD99,BXD100,BXD101,BXD102,BXD103
1,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,B,D
2,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
3,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,D,D,B,B,H,H,B,B,B,B,H,H,B,B,D,D,D,D,B,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
...
#+end_src

i.e., individuals are columns and markers are rows. Alternatively, with marker names instead of row numbers, it could look like

#+begin_src js
marker,BXD1,BXD2,BXD5,BXD6,BXD8,BXD9,BXD11,BXD12,BXD13,BXD14,BXD15,BXD16,BXD18,BXD19,BXD20,BXD21,BXD22,BXD23,BXD24a,BXD24,BXD25,BXD27,BXD28,BXD29,BXD30,BXD31,BXD32,BXD33,BXD34,BXD35,BXD36,BXD37,BXD38,BXD39,BXD40,BXD41,BXD42,BXD43,BXD44,BXD45,BXD48,BXD49,BXD50,BXD51,BXD52,BXD53,BXD54,BXD55,BXD56,BXD59,BXD60,BXD61,BXD62,BXD63,BXD64,BXD65,BXD66,BXD67,BXD68,BXD69,BXD70,BXD71,BXD72,BXD73,BXD74,BXD75,BXD76,BXD77,BXD78,BXD79,BXD80,BXD81,BXD83,BXD84,BXD85,BXD86,BXD87,BXD88,BXD89,BXD90,BXD91,BXD92,BXD93,BXD94,BXD95,BXD96,BXD97,BXD98,BXD99,BXD100,BXD101,BXD102,BXD103
rs6269442,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,B,D
rs6365999,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
rs6376963,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,D,D,B,B,H,H,B,B,B,B,H,H,B,B,D,D,D,D,B,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
#+end_src
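
The conversion from the GN1-style file to the marker-named CSV layout
can be sketched as follows. This is an illustration only, not the
geno2rqtl implementation from gn_extra: the function names are made
up, and the sketch assumes whitespace-separated columns with the
first four being Chr, Locus, cM and Mb, as in the BXD.geno sample.

#+begin_src python
import csv
import io


def parse_geno(lines):
    """Split GN1-style .geno lines into (metadata, samples, rows).

    '@key:value' lines become metadata, the first other line is the
    header, and each remaining line yields (marker, genotype calls).
    """
    meta, samples, rows = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            key, _, value = line[1:].partition(":")
            meta[key] = value
            continue
        fields = line.split()
        if samples is None:
            samples = fields[4:]  # sample names follow Chr, Locus, cM, Mb
            continue
        rows.append((fields[1], fields[4:]))  # Locus is the marker name
    return meta, samples, rows


def to_transposed_csv(samples, rows):
    """Write one CSV row per marker, with one column per individual."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["marker", *samples])
    for marker, calls in rows:
        writer.writerow([marker, *calls])
    return out.getvalue()
#+end_src

The map positions (cM and Mb) are dropped from the genotype table
here; in the R/qtl-style layout they live in a separate map file.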

This is also the format provided by R/qtl in
https://github.com/rqtl/qtl2data/tree/master/DO_Recla which we will
use as the baseline for the REST server. In the meta JSON file the
genotype data is tagged as transposed:

#+begin_src js
{
"description": "DO data from Recla et al. (2014) Mamm Genome 25:211-222",
"crosstype": "do",
"geno": "recla_geno.csv",
"geno_transposed": true,
"founder_geno": "recla_foundergeno.csv",
"founder_geno_transposed": true,
"genotypes": {
  "1": "1",
  "2": "2",
  "3": "3"
},
"pheno": "recla_pheno.csv",
"pheno_transposed": false,
"covar": "recla_covar.csv",
"sex": {
  "covar": "Sex",
  "female": "female",
  "male": "male"
},
"x_chr": "X",
"cross_info": {
  "covar": "ngen"
},
"gmap": "recla_gmap.csv",
"pmap": "recla_pmap.csv",
"alleles": ["A", "B", "C", "D", "E", "F", "G", "H"]
}
#+end_src
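
A reader of such a meta file can use the geno_transposed flag to
decide how to interpret the genotype table. The sketch below assumes
that a missing flag means 'not transposed' (individuals as rows),
which is how we read the R/qtl2 convention; the function name
genotype_layout is hypothetical.

#+begin_src python
import json


def genotype_layout(control_text):
    """Report the genotype file and its orientation from a meta JSON string."""
    control = json.loads(control_text)
    # Assumption: an absent flag is treated as false (not transposed).
    transposed = bool(control.get("geno_transposed", False))
    return {
        "file": control["geno"],
        "rows": "markers" if transposed else "individuals",
        "columns": "individuals" if transposed else "markers",
    }
#+end_src

For the Recla meta file shown above this reports recla_geno.csv with
markers as rows and individuals as columns, matching the transposed
CSV layout.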

Meanwhile the gmap file looks like

#+begin_src js
marker,chr,pos,Mb
rs6269442,1,0.0,3.482275
rs6365999,1,0.0,4.811062
rs6376963,1,0.895,5.008089
rs3677817,1,1.185,5.176058
#+end_src