#+TITLE: Installing GeneNetwork services * Table of Contents :TOC: - [[#introduction][Introduction]] - [[#reproducibility-and-interoperability][Reproducibility and interoperability]] - [[#webserver][Webserver]] - [[#gnserver-rest][GnServer (REST)]] - [[#gnexec][GnExec]] - [[#database][Database]] - [[#phenotypes][Phenotypes]] - [[#genotypes][Genotypes]] * Introduction This document describes the architecture of GN2. Because GN2 is evolving, only a high-level overview is given here. * Reproducibility and interoperability Reproducible data analysis and software interoperability should be key goals for any system that aims to bring research groups together. These goals are increasingly relevant with growing data sizes and increasingly complex analysis pipelines. Rigor, reproducibility, and robustness starts with data that should abide by Findable, Accessible, Interoperable, and Re-usable (FAIR) principles (see the Wilkinson Nature paper on [[http://www.nature.com/articles/sdata201618][FAIR Guiding Principles for scientific data management and stewardship]]). GeneNetwork (GN2) solves this by assigning unique identifiers (cryptographic HASH values calculated over immutable data content), including these values in file or directory names, and making them available through web interfaces (e.g., through a through a REST API). This means that at any point in the future the exact same data can be retrieved using a known non-changeable identifier (see also https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org). Synchronisation, integrity checking and backups become trivial using these HASH values, even for very large datasets. Since everything is managed at the file system level we can also use Unix authorisation systems. HIPAA compliancy is achieved by using HASH references and bringing the software into the controlled HIPAA environment. In the context of GeneNetwork we are using git for version control of software source code (https://github.com/genenetwork/). Software can be treated just like data, i.e., git uses HASH identifiers to retrieve specific versions of source. I.e., versions of source code are identifiable and retrievable and can be matched with data into an analysis pipeline. The combination of software and data, again, makes a unique HASH value which identifies the analysis pipeline. For combining runnable software and data into an analysis pipeline we use GNU Guix which, yet again, turns everything into a unique HASH value which allows for exact retrieval and reproducibility. Not only that, GNU Guix gives control of the software and all its dependencies, use GNU Guix which, yet again, turns everything into a unique HASH value which allows for exact retrieval and reproducibility. Not only that, GNU Guix gives control of the software and all its dependencies, calculating a HASH value for all dependencies, all the way down to versions of R, BLAS and glibc. This way of packaging software ascertains that identical software pipelines are easily setup on different system or in the Cloud. Meaning that everyone ends up using the exact same combination of software versions in a pipeline. For software development we use GNU Guix for integration testing and deployment (described in JOSS paper). We also use automated test tools (Ruby mechanize) for integration testing of the web services and we use unit testing of all backend services. All our software source code is published as `free and open source software' (FOSS) which means that anyone can view code on github, comment on, or even contribute to. GeneNetwork is becoming increasingly modular and has a growing number of contributers who subscribe to the principles of THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS (https://github.com/pjotrp/bioinformatics) which we drew up and was signed by over fifty bioinformaticians. * Webserver The main [[https://github.com/genenetwork/genenetwork2][GN2 webserver]] is built on [[http://flask.pocoo.org/][Python flask]] and this GN2 source code can be found on [[https://github.com/genenetwork/genenetwork2/tree/master/wqflask/wqflask][github]] in the wqflask directory. The routing tables are defined in [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/views.py][views.py]]. For example the main page is loaded from a template named [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/templates/index_page.htm][index_page.html]] in the [[https://github.com/genenetwork/genenetwork2/tree/master/wqflask/wqflask/templates][templates]] directory. In the template you can find get the form gets filled by a Javascript routine defined in [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/static/new/javascript/dataset_select_menu.js][data_select_menu.js]] which picks up a static JSON file for the menu. This static file is generated with [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/maintenance/gen_select_dataset.py][gen_select_dataset.py]]. Note that this JSON data is served by gn_server in the latest version, see [[#gnserver-rest][GnServer (REST)]]. When you hit a search with, for example, 'http://localhost:5003/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=&search_terms_and=MEAN%3D%2815+16%29+LRS%3D%2823+46%29+&FormID=searchResult' it has the menu items as parameters. According to the routing table, the search is executed and Redis caching is used (we'll probably change that to the level of the gn_server). The logic is in search_result.py which invokes database functions in wqflask/dbFunction/webqtlDatabaseFunction.py, for example. The receiving template lives at [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/templates/search_result_page.html][search_result_page.html]]. For what happens at the database level see [[database.org]]. A view consists of an HTML template with JS libraries for managing menus, tables etc. For example, for the search results see the [[https://github.com/genenetwork/genenetwork2/blob/master/wqflask/wqflask/templates/search_result_page.html][search_result_page.html]] which is a Flask template. The first section puts the search in plain English, e.g. 'We searched Hippocampus Consortium M430v2 (Jun06) PDNN to find all records with MEAN between 15 and 16 and with LRS between 23 and 46.'. Then the results are added to a table which is displayed using a JS [[https://datatables.net/][DataTable container]]. * GnServer (REST) The [[https://github.com/genenetwork/gn_server][GnServer REST API]] is built on high performance [[http://elixir-lang.org/][Elixir]] with [[https://github.com/falood/maru][Maru]]. Mainly the GnServer serves JSON requests, for example to fetch data from the database. To get the menu data in YAML you can do something like : curl localhost:8880/int/menu/main.json|ruby extra/json2yaml.rb (json2yaml.rb is in the gn_server repo). For the current API definition see [[https://github.com/genenetwork/gn_server/doc/API.md][GnServer REST API]] documentation. * GnExec GnExec, also written in Elixir, executes commands using a separate daemon. * Database ** Phenotypes Phenotypes are stored in the SQL database. For what happens at the database level see [[database.org]]. A test database can be downloaded - see the installation [[./README.org][instructions]]. ** Genotypes Genotypes are stored in genotype files. These are part of the GNU Guix distribution, see the installation [[./README.org][instructions]]. Genotype files are currently in GN1 format, and will be aligned with the [[http://kbroman.org/qtl2/pages/sampledata.html][R/qtl2 formats]]. GN1-style (still default GN2) for the stored file BXD.geno: #+begin_src js @name:BXD @type:riset @mat:B @pat:D @het:H @unk:U Chr Locus cM Mb BXD1 BXD2 BXD5 BXD6 BXD8 BXD9 BXD11 BXD12 BXD13 BXD14 BX D15 BXD16 BXD18 BXD19 BXD20 BXD21 BXD22 BXD23 BXD24a BXD24 BXD25 BXD27 BXD28 BX D29 BXD30 BXD31 BXD32 BXD33 BXD34 BXD35 BXD36 BXD37 BXD38 BXD39 BXD40 BXD41 BXD4 2 BXD43 BXD44 BXD45 BXD48 BXD49 BXD50 BXD51 BXD52 BXD53 BXD54 BXD55 BXD56 BXD59 BXD60 BXD61 BXD62 BXD63 BXD64 BXD65 BXD66 BXD67 BXD68 BXD69 BXD70 BXD71 BXD72 BX D73 BXD74 BXD75 BXD76 BXD77 BXD78 BXD79 BXD80 BXD81 BXD83 BXD84 BXD85 BXD86 BXD8 7 BXD88 BXD89 BXD90 BXD91 BXD92 BXD93 BXD94 BXD95 BXD96 BXD97 BXD98 BXD99 BXD100 BXD101 BXD102 BXD103 1 rs6269442 0.0 3.482275 B B D D D B B D B B D D B D D D D B B B D B D D B B B B B B B B B D B D B B D B B H H B D B B H H B B D D D D D B B H B B B B D B D B D D D D D H B D D B D B B D D B D D B B B B B B B D 1 rs6365999 0.0 4.811062 B B D D D B B D B B D D B D D D D B B B D B D D B B B B B B B B B D B D B B D B B H H B D B B H H B B D D D D D B B H B B B B D B D B D D D D D H B D D B D B B D D B D D B B B B B B U D ... #+end_src and, for example, in the method run_rqtl_geno this file gets loaded. For GnServer, however, we only want to deal with standardized R/qtl formatted data, so with gn_extra we convert the original format into R/qtl format with geno2rqtl with one adaptation: the geno table is transposed so now becomes #+begin_src js marker,BXD1,BXD2,BXD5,BXD6,BXD8,BXD9,BXD11,BXD12,BXD13,BXD14,BXD15,BXD16,BXD18,BXD19,BXD20,BXD21,BXD22,BXD23,BXD24a,BXD24,BXD25,BXD27,BXD28,BXD29,BXD30,BXD31,BXD32,BXD33,BXD34,BXD35,BXD36,BXD37,BXD38,BXD39,BXD40,BXD41,BXD42,BXD43,BXD44,BXD45,BXD48,BXD49,BXD50,BXD51,BXD52,BXD53,BXD54,BXD55,BXD56,BXD59,BXD60,BXD61,BXD62,BXD63,BXD64,BXD65,BXD66,BXD67,BXD68,BXD69,BXD70,BXD71,BXD72,BXD73,BXD74,BXD75,BXD76,BXD77,BXD78,BXD79,BXD80,BXD81,BXD83,BXD84,BXD85,BXD86,BXD87,BXD88,BXD89,BXD90,BXD91,BXD92,BXD93,BXD94,BXD95,BXD96,BXD97,BXD98,BXD99,BXD100,BXD101,BXD102,BXD103 1,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,B,D 2,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D 3,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,D,D,B,B,H,H,B,B,B,B,H,H,B,B,D,D,D,D,B,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D ... #+end_src js i.e. individuals are columns and markers are rows. Alternatively it could look like #+begin_src js marker,BXD1,BXD2,BXD5,BXD6,BXD8,BXD9,BXD11,BXD12,BXD13,BXD14,BXD15,BXD16,BXD18,BXD19,BXD20,BXD21,BXD22,BXD23,BXD24a,BXD24,BXD25,BXD27,BXD28,BXD29,BXD30,BXD31,BXD32,BXD33,BXD34,BXD35,BXD36,BXD37,BXD38,BXD39,BXD40,BXD41,BXD42,BXD43,BXD44,BXD45,BXD48,BXD49,BXD50,BXD51,BXD52,BXD53,BXD54,BXD55,BXD56,BXD59,BXD60,BXD61,BXD62,BXD63,BXD64,BXD65,BXD66,BXD67,BXD68,BXD69,BXD70,BXD71,BXD72,BXD73,BXD74,BXD75,BXD76,BXD77,BXD78,BXD79,BXD80,BXD81,BXD83,BXD84,BXD85,BXD86,BXD87,BXD88,BXD89,BXD90,BXD91,BXD92,BXD93,BXD94,BXD95,BXD96,BXD97,BXD98,BXD99,BXD100,BXD101,BXD102,BXD103 rs6269442,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,B,D rs6365999,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D rs6376963,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,D,D,B,B,H,H,B,B,B,B,H,H,B,B,D,D,D,D,B,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D #+end_src js This is also the format provided by R/qtl in https://github.com/rqtl/qtl2data/tree/master/DO_Recla which we will use as the base line for the REST server. In the meta json file the genotype data is tagged as transposed: #+begin_src js { "description": "DO data from Recla et al. (2014) Mamm Genome 25:211-222", "crosstype": "do", "geno": "recla_geno.csv", "geno_transposed": true, "founder_geno": "recla_foundergeno.csv", "founder_geno_transposed": true, "genotypes": { "1": "1", "2": "2", "3": "3" }, "pheno": "recla_pheno.csv", "pheno_transposed": false, "covar": "recla_covar.csv", "sex": { "covar": "Sex", "female": "female", "male": "male" }, "x_chr": "X", "cross_info": { "covar": "ngen" }, "gmap": "recla_gmap.csv", "pmap": "recla_pmap.csv", "alleles": ["A", "B", "C", "D", "E", "F", "G", "H"] } #+end_src Meanwhile the gmap file looks like #+begin_src js marker,chr,pos,Mb rs6269442,1,0.0,3.482275 rs6365999,1,0.0,4.811062 rs6376963,1,0.895,5.008089 rs3677817,1,1.185,5.176058 #+end_src