aboutsummaryrefslogtreecommitdiff
path: root/README.md
blob: 02952a50d24fab471df99332e629c0ad893fa7df (about) (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
# genenetwork3

[![GeneNetwork3 CI
badge](https://ci.genenetwork.org/badge/genenetwork3.svg)](https://ci.genenetwork.org/jobs/genenetwork3)
[![GeneNetwork3 all tests CI
badge](https://ci.genenetwork.org/badge/genenetwork3-all-tests.svg)](https://ci.genenetwork.org/jobs/genenetwork3-all-tests)

GeneNetwork3 REST API for data science and machine learning

GeneNetwork3 is a light-weight back-end that serves different front-ends, including the GeneNetwork2 web UI.
Transports happen in multiple ways:

1. A REST API
2. Direct python library calls (using PYTHONPATH)

The main advantage is that the code is not cluttered by UX output and starting the webserver and running tests is *easier* than using GeneNetwork2. It allows for using Jupyter Notebooks and Pluto Notebooks as front-ends as well as using the API from R etc.

A continuously deployed instance of genenetwork3 is available at
[https://cd.genenetwork.org/](https://cd.genenetwork.org/). This instance is
redeployed on every commit provided that the [continuous integration
tests](https://ci.genenetwork.org/jobs/genenetwork3) pass.

## Configuration

The system comes with some default configurations found in **"gn3/settings.py"**
relative to the repository root.

To overwrite these settings without changing the file, you can provide a path in
the `GN3_CONF` environment variable, to a file containing those variables whose
values you want to change.

The `GN3_CONF` variable allows you to have your own environment-specific
configurations rather than being forced to conform to the defaults.

## Installation

#### GNU Guix packages

Install GNU Guix - this can be done on every running Linux system.

There are at least three ways to start GeneNetwork3 with GNU Guix:

1. Create an environment with `guix shell`
2. Create a container with `guix shell -C`
3. Use a profile and shell settings with `source ~/opt/genenetwork3/etc/profile`
4. Use the guix system container with GN3 directory mounted in

At this point we use all three for different purposes. In all cases you'll most likely need the mysql database.

#### Create an environment:

Simply load up the environment (for development purposes):

```bash
guix shell -Df guix.scm
```

Also, make sure you have the guix-bioinformatics channel set up correctly and this should work

```bash
guix shell --expose=$HOME/genotype_files/ -Df guix.scm
python3
  import redis
```

Check if guix and guix-bioinformatics channel are up-to-date with

```
guix describe
```

#### Run a Guix container with network

Containers provide full isolation from the underlying distribution. Very useful for figuring out any dependency issues:

```
guix shell -C --network --expose=$HOME/genotype_files/ -Df guix.scm
```

#### Using a Guix profile (or rolling back)

A guix profile is different from a Guix shell - it has less isolation from the underlying distribution.

Create a new profile with

```
guix package -i genenetwork3 -p ~/opt/genenetwork3
```

and load the profile settings with

```
source ~/opt/genenetwork3/etc/profile
start server...
```

Note that GN2 profiles include the GN3 profile (!). To roll genenetwork3 back you can use either in the same fashion (probably best to start a new shell first)

```
bash
source ~/opt/genenetwork2-older-version/etc/profile
set|grep store
run tests, server etc...
```

#### Troubleshooting Guix packages

If you get a Guix error, such as `ice-9/boot-9.scm:1669:16: In procedure raise-exception:
error: python-sqlalchemy-stubs: unbound variable` it typically means an update to guix latest is required (i.e., guix pull):

```
guix pull
source ~/.config/guix/current/etc/profile
```

and try again. Also make sure your ~/guix-bioinformatics is up to date.

See also instructions in [.guix.scm](.guix.scm).

#### Setting necessary configurations

These configurations should be set in an external config file, pointed to with the environment variable GN3_CONF.

- SPARQL_ENDPOINT (ex: "http://localhost:9082/sparql")
- XAPIAN_DB_PATH (ex: "/export/data/genenetwork/xapian")
- TMPDIR
- SPARQL_USER
- SPARQL_ENDPOINT (ex: "http://localhost:9082/sparql-auth/")
- GN3_SECRETS

TMPDIR also needs to be set correctly for the R script(s) because they pass results on as files on the local system (previously there was an issue with it being set to /tmp instead of ~/genenetwork3/tmp). Note that the Guix build system should take care of the paths.

### Secrets

All of GN3's secret parameters are found inside the "GN3_SECRETS".  This file should contain the following:

```
SPARQL_USER = "dba"
SPARQL_PASSWORD = "dba"
SPARQL_AUTH_URI="http://localhost:8890/sparql-auth/"
SPARQL_CRUD_AUTH_URI="http://localhost:8890/sparql-graph-crud-auth"
FAHAMU_AUTH_TOKEN="XXXXXX"
```

Note: The sparql configurations are important for running tests I.e. `pytest -k rdf`.

### Setting up Virtuoso for Local Development

GN3 uses Virtuoso to:

- Fetch metadata for the Xapian indexing script
- Fetch metadata for some end-points
- Test SPARQL queries for some unit tests

Local development setup instructions can be found [here](https://issues.genenetwork.org/topics/engineering/working-with-virtuoso-locally), while a more comprehensive tutorial is available [here](https://issues.genenetwork.org/topics/systems/virtuoso).

## Command-Line Utility Scripts

This project has a number of utility scripts that are needed in specific cirscumstances, and whose purpose is to support the operation of this application in one way or another. Have a look at the [Scripts.md file](./docs/Scripts.md] to see the details for each of the scripts that are available.

## Example cURL Commands for OAuth2

In this section, we present some example request to the API using cURL to
acquire the token(s) and access resources.

### Request Token

```sh
curl -X POST http://localhost:8080/api/oauth2/token \
    -F "username=test@development.user" -F "password=testpasswd" \
    -F "grant_type=password" \
    -F "client_id=0bbfca82-d73f-4bd4-a140-5ae7abb4a64d" \
    -F "client_secret=yadabadaboo" \
    -F "scope=profile group role resource register-client user introspect migrate-data"
```

### Access a Resource

Once you have acquired a token as above, we can now access a resource with, for
example:

```sh
curl -X GET -H "Authorization: Bearer L3Q5mvehQeSUNQQbFLfrcUEdEyoknyblXWxlpKkvdl" \
    "http://localhost:8080/api/oauth2/group/members/8f8d7640-5d51-4445-ad68-7ab217439804"
```

to get all the members of a group with the ID
`8f8d7640-5d51-4445-ad68-7ab217439804`

or:

```sh
curl -X POST "http://localhost:8080/api/oauth2/user/register" \
    -F "email=a_new@users.email" -F "password=apasswd" \
    -F "confirm_password=apasswd"
```

where
`L3Q5mvehQeSUNQQbFLfrcUEdEyoknyblXWxlpKkvdl` is the token you got in the
**Request Token** section above.

## Running Tests

(assuming you are in a guix container; otherwise use venv!)

To run tests:

```bash
export AUTHLIB_INSECURE_TRANSPORT=true
export OAUTH2_ACCESS_TOKEN_GENERATOR="tests.unit.auth.test_token.gen_token"
pytest
```

To specify unit-tests:

```bash
export AUTHLIB_INSECURE_TRANSPORT=true
export OAUTH2_ACCESS_TOKEN_GENERATOR="tests.unit.auth.test_token.gen_token"
pytest -k unit_test
```

Running pylint:
```bash
pylint $(find . -name '*.py' | xargs)
```

Running mypy(type-checker):

```bash
mypy --show-error-codes .
```

## Running the GN3 web service

To spin up the server on its own (for development):

```bash
export FLASK_DEBUG=1
export FLASK_APP="main.py"
flask run --port=8080
```

And test with

```
curl localhost:8080/api/version
"1.0"
```

To run with gunicorn

```
gunicorn --bind 0.0.0.0:8080 wsgi:app
```

consider the following options for development `--bind 0.0.0.0:$SERVER_PORT --workers=1 --timeout 180 --reload wsgi`.

And for the scalable production version run

```
gunicorn --bind 0.0.0.0:8080 --workers 8 --keep-alive 6000 --max-requests 10 --max-requests-jitter 5 --timeout 1200 wsgi:app
```

(see also the [.guix_deploy](./.guix_deploy) script)

## Using python-pip

IMPORTANT NOTE: we do not recommend using pip tools, use Guix instead

1. Prepare your system. You need to make you have python > 3.8, and
   the ability to install modules.
2. Create and enter your virtualenv:

```bash
virtualenv --python python3 venv
. venv/bin/activate
```
3. Install the required packages

```bash
# The --ignore-installed flag forces packages to
# get installed in the venv even if they existed
# in the global env
pip install -r requirements.txt --ignore-installed
```

#### A note on dependencies

Make sure that the dependencies in the `requirements.txt` file match those in
guix. To freeze dependencies:

```bash
# Consistent way to ensure you don't capture globally
# installed packages
pip freeze --path venv/lib/python3.8/site-packages > requirements.txt

```

## Logging

During development, there is periodically need to log what the application is
doing to help resolve issues.

The logging system [was initialised](https://github.com/genenetwork/genenetwork3/commit/95f067a542424b76022595a74d660a7e84158f38)
to help with this.

Now, you can simply use the `current_app.logger.*` logging methods to log out
any information you desire: e.g.

```python
from flask import current_app

...

def some_function(arg1, arg2, **args, **kwargs):
    ...
    current_app.logger.debug(f"THE KWARGS: {kwargs}")
    ...
```

## Genotype Files

You can get the genotype files from http://ipfs.genenetwork.org/ipfs/QmXQy3DAUWJuYxubLHLkPMNCEVq1oV7844xWG2d1GSPFPL and save them on your host machine at, say `$HOME/genotype_files` with something like:

```bash
$ mkdir -p $HOME/genotype_files
$ cd $HOME/genotype_files
$ yes | 7z x genotype_files.tar.7z
$ tar xf genotype_files.tar
```

The `genotype_files.tar.7z` file seems to only contain the **BXD.geno** genotype file.

## QTLReaper (rust-qtlreaper) and Trait Files

To run QTL computations, this system makes use of the [rust-qtlreaper](https://github.com/chfi/rust-qtlreaper.git) utility.

To do this, the system needs to export the trait data into a tab-separated file, that can then be passed to the utility using the `--traits` option. For more information about the available options, please [see the rust-qtlreaper](https://github.com/chfi/rust-qtlreaper.git) repository.

### Traits File Format

The traits file begins with a header row/line with the column headers. The first column in the file has the header **"Trait"**. Every other column has a header for one of the strains in consideration.

Under the **"Trait"** column, the traits are numbered from **T1** to **T<n>** where **<n>** is the count of the total number of traits in consideration.

As an example, you could end up with a trait file like the following:

```txt
Trait	BXD27	BXD32	DBA/2J	BXD21	...
T1	10.5735	9.27408	9.48255	9.18253	...
T2	6.4471	6.7191	5.98015	6.68051	...
...
```

It is very important that the column header names for the strains correspond to the genotype file used.

## Partial Correlations

The partial correlations feature depends on the following external systems to run correctly:

- Redis: Acts as a communications broker between the webserver and external processes
- `sheepdog/worker.py`: Actually runs the external processes that do the computations

These two systems should be running in the background for the partial correlations feature to work correctly.