topics/systems/mariadb/precompute-mapping-input-data.gmi


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438

# Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.

# Tags

* assigned: pjotrp
* type: precompute, gemma
* status: in progress
* priority: high
* keywords: ui, correlations

# Tasks

* [ ] Start using GEMMA for precomputed values as a background pipeline on a different machine
* [ ] Update the table values using GEMMA output (single highest score)

Above is the quick win for plugging in GEMMA values. We will make sure not to recompute the values that are already up to date.
This is achieved by naming the input and output files as a hash on their DB inputs.

Next:

* [ ] Store all GEMMA values efficiently
* [ ] Track metadata of computed datasets (in RDF?)
* [ ] Compute significance with GEMMA or other LMM (bulkLMM?)
* [ ] Store signficance and significant values for processing
* [ ] Update search & correlations to use these
* [ ] Further optimize computations so they can run continuously in the background

# Info

## Original qtlreaper version

The original reaper precompute lives in

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

This script first fetches inbredsets

```
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+
```

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today. Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == IndbredSet.Id.

There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06"

```
select max(distinct(ProbeSetFreezeId)) from ProbeSetXRef limit 5;
|                            1054 |
```

Next for this Id we fetch, known as `ProbeSetXRefInfos`:

```
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+
```

Then we fetch the trait values:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.Id = 1 limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+
```

with names:

```
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+
```

35 strains were used for this dataset:

```
select count(ProbeSetData.value) from ProbeSetData where ProbeSetData.Id = 1;
+---------------------------+
| count(ProbeSetData.value) |
+---------------------------+
|                        35 |
+---------------------------+
```

with genotypes (from files) and these phenotypes (from MySQL) qtlreaper is started and next we update the values for

```
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old          | pValue_old | mean             |
se                  | Locus      | LRS             | pValue | additive    | h2   |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
|                1 |          1 |      1 | 10.095.400 | 13.3971627898894 |      0.163 | 5.48794285714286 |
0.08525787814808819 | rs13480619 | 12.590069931048 |  0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
```

the actual update is

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

so Locus, LRS and additive fields are updated.

The old reaper scores are in

```
MariaDB [db_webqtl]> select ProbeSetId,ProbeSetFreezeId,Locus,LRS,additive from ProbeSetXRef limit 10;
+------------+------------------+----------------+------------------+--------------------+
| ProbeSetId | ProbeSetFreezeId | Locus          | LRS              | additive           |
+------------+------------------+----------------+------------------+--------------------+
|          1 |                1 | rs13480619     |  12.590069931048 |        -0.28515625 |
|          2 |                1 | rs29535974     | 10.5970737900941 | -0.116783333333333 |
|          3 |                1 | rs49742109     |  6.0970532702754 |  0.112957489878542 |
|          4 |                1 | rsm10000002321 | 11.7748675511731 | -0.157113725490196 |
|          5 |                1 | rsm10000019445 | 10.9232633740162 |  0.114764705882353 |
|          6 |                1 | rsm10000017255 | 8.45741703245224 | -0.200034412955466 |
|          7 |                1 | rs4178505      | 7.46477918183565 |  0.104331983805668 |
|          8 |                1 | rsm10000144086 | 12.1201771258006 | -0.134278431372548 |
|          9 |                1 | rsm10000014965 | 11.8837168740735 |  0.341458333333334 |
|         10 |                1 | rsm10000020208 | 10.2809848009836 | -0.173866666666667 |
+------------+------------------+----------------+------------------+--------------------+
10 rows in set (0.000 sec)
```

This means for every dataset one single maximum score gets stored(!)

From this exercise we can conclude:

* Existing precomputed values are by linear regression (old QTLreaper)
* Only one highest score is stored
* Every time the script is run *all* datasets are recomputed
* The '_old' named values in the table are not touched by the script
* h2 and mean are not updated
* No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in mariadb, but we can retain GEMMA output if stored efficiently.

## Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently @5.6Gb and 48,307,650 rows, so we could decide to add columns to track different mappers. Something funny

```
select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+
```

half the fields are used. What to think of

```
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean | se   | Locus          | LRS  | pValue | additive | h2   |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
|              589 |    8599010 | 70606737 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8593710 | 70606738 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8607637 | 70606739 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8594490 | 70606740 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8591994 | 70606741 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
```

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+
```

There is obviously more.  I think this table can use some cleaning up?

Looking at the GN1 code base ProbeSetXRef is used in a great number of files. In GN2 it is:

```
wqflask/maintenance/quantile_normalize.py
wqflask/maintenance/generate_probesetfreeze_file.py
wqflask/base/mrna_assay_tissue_data.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/base/trait.py
wqflask/wqflask/do_search.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/base/data_set/dataset.py
wqflask/wqflask/correlation/pre_computes.py
wqflask/wqflask/api/router.py
wqflask/wqflask/show_trait/show_trait.py
scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py
```

Let's visit these one by one to make sure there is no side-effect.

### wqflask/maintenance/quantile_normalize.py

Appears to be a one-off to normalize a certain dataset in `/home/zas1024/cfw_data/`.
Last time it was used is probably 2018.

### wqflask/maintenance/generate_probesetfreeze_file.py

This one is even older and probably from 2013.

### wqflask/base/mrna_assay_tissue_data.py

Another dinosaur, actually uses TissueProbeSetXRef, so not relevant.

### wqflask/wqflask/update_search_results.py

This is for global 'gene' search making sure ProbeSet.Id = ProbeSetXRef.ProbeSetId AND ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id. LRS is passed on.

### wqflask/wqflask/correlation/rust_correlation.py

Again making sure IDs match. LRS is passed on. This requires scrutiny FIXME

### wqflask/base/trait.py

Trait class defines a trait in webqtl, can be either Microarray, Published phenotype, genotype, or user input trait. Fetches LRS.

### wqflask/wqflask/do_search.py

Searches for genes with a QTL within the given LRS values

    LRS searches can take 3 different forms:
    - LRS > (or <) min/max_LRS
    - LRS=(min_LRS max_LRS)
    - LRS=(min_LRS max_LRS chromosome start_Mb end_Mb)
    where min/max_LRS represent the range of LRS scores and start/end_Mb represent
    the range in megabases on the given chromosome

### wqflask/base/data_set/mrnaassaydataset.py

Also search results.

### wqflask/base/data_set/dataset.py

ID sync

### wqflask/wqflask/correlation/pre_computes.py

ID sync

### wqflask/wqflask/api/router.py

Fetch traits and sample data API (we may use this!)

### wqflask/wqflask/show_trait/show_trait.py

Same.

### The following are a punch of scripts:

* scripts/insert_expression_data.py
* scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
* scripts/maintenance/readProbeSetSE_v7.py
* scripts/maintenance/QTL_Reaper_v6.py
* scripts/maintenance/readProbeSetMean_v7.py
* wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

## Using LRS

* wqflask/wqflask/static/new/javascript/lod_chart.js

LRS is used to display a LOD chart

and in a bunch of python files that match above list for ProbeSetXRef:

* scripts/maintenance/QTL_Reaper_v6.py
* wqflask/base/webqtlConfig.py
* wqflask/base/trait.py
* wqflask/tests/unit/wqflask/marker_regression/test_display_mapping_results.py
* wqflask/tests/unit/base/test_trait.py
* wqflask/wqflask/parser.py
* wqflask/tests/unit/wqflask/correlation/test_show_corr_results.py
* wqflask/wqflask/update_search_results.py
* wqflask/wqflask/correlation/show_corr_results.py
* wqflask/wqflask/marker_regression/run_mapping.py
* wqflask/wqflask/export_traits.py
* wqflask/wqflask/gsearch.py
* wqflask/base/data_set/phenotypedataset.py
* wqflask/base/data_set/markers.py
* wqflask/wqflask/api/router.py
* wqflask/base/data_set/mrnaassaydataset.py
* wqflask/wqflask/correlation/rust_correlation.py
* wqflask/wqflask/marker_regression/display_mapping_results.py
* wqflask/wqflask/collect.py
* wqflask/wqflask/do_search.py
* wqflask/wqflask/views.py
* wqflask/utility/Plot.py
* test/requests/main_web_functionality.py
* test/requests/correlation_tests.py

From above it can be concluded these precomputed values are used for display and for getting search results (filtering on datasets that have a QTL).
It looks safe to start replacing qtlreaper results with GEMMA results.
Only thing I am not sure about is correlations.
A cursory inspection suggests LRS is only used for final output and that makes sense if correlations are done on (expression) phenotypes.
Anyway, if things go haywire we'll find out soon (enough).

At the next stage we ignore all this and start precompute with GEMMA on the BXD.

## Precompute DB

We will use a database to track precompute updates (see below).

On the computing host (or client) we should track the following:

* time: MySQL time ProbeSetData table was last updated
* Dataset (phenotypes)
* Hash on DB inputs (for DB table updates)
* Genotypes
* Algorithm
* Hash on run inputs (phenotypes, genotypes, algorithm, invocation)
* time: of initiate run
* time: of completion
* Hostname of run (this)
* File path (this)
* Hash on output data (for validation)
* array of rec: DB hostnames + time stamps: Successfully updated DB table for these servers

The logic is that if the DB table was changed we should recompute the hash on inputs.
Note the ProbeSetData table is the largest at 200G including indices.
If that hash changed the mapping algorithm should rerun.
A new record will be created on the new inputs.
A flag is set for the updated DB for precomputed values.
We can garbage collect when multiple entries for `Dataset' exist.

What database should we use? Ultimately precompute is part of the main GN setup. So, one could argue MariaDB is a good choice.
We would like to share precompute with other setups, however.
Also we typically do not want to run the precompute on the production host.
That means we should be able to rebuild the database from a precompute output directory and feed the update to the running (production) server.
We want to track compute so we can distribute running the algorithms across servers and/or PBS.
This implies the compute machines have to be able to query the DB in some way.
Basically a machine has a 'runner' that checks the DB for updates and fetches phenotypes and genotypes.
A run is started and on completion the DB is notified and updated.
Note that a runner can not be parallel on a single results directory, so one runner per target output directory.

We can have different runners, one for local machine, one for PBS and one for remotes.

This also implies that 'state' is handled with the output dir (on a machine).
We will create a JSON record for every compute.
This state can be pushed into 'any' GN instance at any point.
For running jobs the DB can be queried through a (same) REST API.
The REST API has to handle the DB, so it has to be able to access DB and files, i.e. it should be on the same (production) machine.

On the DB we'll create a Hash table on inputs of ProbeSetData. This way we don't have to walk the giant table for every query. This can be handled through a CRON job. So there will be a table:

* Dataset
* Hash on relevant phenotypes (ProbeSetData)
* time: Job started
* host: host running job
* status: (final) job status
* time: DB updated

## Tracking updates to DB

This brings us to CRON jobs.
There are several things that ought to be updated when the DB changes. Xapian being one example and now this table.
These should run on a regular basis and only update what really changed.
We need to support that type of work out of the box with an hourly runner (for precompute) and a daily runner (for xapian).

One possibility is to use MySQL triggers to track changes to the database table.
Even a hash can be computed as suggested here:

=> https://stackoverflow.com/questions/38732185/hash-of-two-columns-in-mysql

The problem is that this relative expensive computation done on every row update. I don't think it is feasible for updating phenotypes. Arthur may find the updates go very slow. It is possible, however, to just update a boolean value, or even a time stamp, that is cheap. That will probably increase the update time by 50-100% and that may be acceptable. We can only find out by trying.

But first I am going to simply generate the hashes from the running DB by scanning the full table. If that runs in minutes we should be good with a CRON job for now. Also, in the medium term, we may replace these tables with lmdb files and it is no good to depend on MysQL/MariaDB for this type of service.

Note that ProbeSetXRef is indexed on DataId, ProbeSetFreezeId, ProbeSetId and Locus.
The following should be unique and ProbeSetFreezeId+ProbeSetId, indexed by unique DataId, is the actual dataset, so that should be reasonable fast.

`ProbeSetFreezeID` is a unique identifier to a dataset, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06".
`ProbeSetId` is the trait in the collection, e.g. '100001_at'.
`DataId` is the counter in ProbeSetXRef table.

So now this should be clear on update:

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

Every trait gets an LRS and effect size this way.


## Writing script to Hash ProbeSetData

tux02 has a running database, so it is good for trials. The first step is to update

## Preparing for GEMMA

A good dataset to take apart is

=> http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.