# Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs (linear mixed models) for compute and multiple significant scores for queries.

# Tags

* assigned: pjotrp
* type: precompute, gemma
* status: in progress
* priority: high
* keywords: ui, correlations

# Tasks

* [ ] Start using GEMMA for precomputed values as a background pipeline on a different machine
* [ ] Update the table values using GEMMA output (single highest score)

Above is the quick win for plugging in GEMMA values. We will make sure not to recompute the values that are already up to date.
This is achieved by naming the input and output files as a hash on their DB inputs.
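
For illustration, a minimal sketch of deriving such a file name, assuming guile-gcrypt is available; how the DB inputs are serialized and the naming scheme itself are placeholders, not a final convention:

```scheme
(use-modules (gcrypt hash)          ; sha256
             (gcrypt base16)        ; bytevector->base16-string
             (rnrs bytevectors))    ; string->utf8

;; trait-values stands for the DB inputs, e.g. '(("BXD1" . 6.414) ("BXD2" . 4.885))
(define (input-hash trait-values)
  (bytevector->base16-string
   (sha256 (string->utf8 (object->string trait-values)))))

;; Same inputs -> same file name -> the run can be skipped.
(define (pheno-file-name dataset trait-values)
  (string-append dataset "-" (input-hash trait-values) ".pheno"))
```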

Next:

* [ ] Store all GEMMA values efficiently
* [ ] Track metadata of computed datasets (in RDF?)
* [ ] Compute significance with GEMMA or other LMM (bulkLMM?)
* [ ] Store significance and significant values for processing
* [ ] Update search & correlations to use these
* [ ] Further optimize computations so they can run continuously in the background

# Info

## Original qtlreaper version

The original reaper precompute lives in

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

This script first fetches inbredsets

```
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+
```

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today. Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == InbredSet.Id.

There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06"

```
select max(distinct(ProbeSetFreezeId)) from ProbeSetXRef limit 5;
|                            1054 |
```

Next, for this Id we fetch what the script calls `ProbeSetXRefInfos`:

```
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+
```

Then we fetch the trait values:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.Id = 1 limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+
```

with names:

```
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+
```

35 strains were used for this dataset:

```
select count(ProbeSetData.value) from ProbeSetData where ProbeSetData.Id = 1;
+---------------------------+
| count(ProbeSetData.value) |
+---------------------------+
|                        35 |
+---------------------------+
```

With these genotypes (from files) and phenotypes (from MySQL) qtlreaper is started, and next we update the values in

```
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old          | pValue_old | mean             | se                  | Locus      | LRS             | pValue | additive    | h2   |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
|                1 |          1 |      1 | 10.095.400 | 13.3971627898894 |      0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 | 12.590069931048 |  0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
```

the actual update is

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

so Locus, LRS and additive fields are updated.

The old reaper scores are in

```
MariaDB [db_webqtl]> select ProbeSetId,ProbeSetFreezeId,Locus,LRS,additive from ProbeSetXRef limit 10;
+------------+------------------+----------------+------------------+--------------------+
| ProbeSetId | ProbeSetFreezeId | Locus          | LRS              | additive           |
+------------+------------------+----------------+------------------+--------------------+
|          1 |                1 | rs13480619     |  12.590069931048 |        -0.28515625 |
|          2 |                1 | rs29535974     | 10.5970737900941 | -0.116783333333333 |
|          3 |                1 | rs49742109     |  6.0970532702754 |  0.112957489878542 |
|          4 |                1 | rsm10000002321 | 11.7748675511731 | -0.157113725490196 |
|          5 |                1 | rsm10000019445 | 10.9232633740162 |  0.114764705882353 |
|          6 |                1 | rsm10000017255 | 8.45741703245224 | -0.200034412955466 |
|          7 |                1 | rs4178505      | 7.46477918183565 |  0.104331983805668 |
|          8 |                1 | rsm10000144086 | 12.1201771258006 | -0.134278431372548 |
|          9 |                1 | rsm10000014965 | 11.8837168740735 |  0.341458333333334 |
|         10 |                1 | rsm10000020208 | 10.2809848009836 | -0.173866666666667 |
+------------+------------------+----------------+------------------+--------------------+
10 rows in set (0.000 sec)
```

This means that for every trait in a dataset only one single maximum score gets stored(!)

From this exercise we can conclude:

* Existing precomputed values are by linear regression (old QTLreaper)
* Only one highest score is stored
* Every time the script is run *all* datasets are recomputed
* The '_old' named values in the table are not touched by the script
* h2 and mean are not updated
* No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in MariaDB, but we can retain GEMMA output if stored efficiently.

## Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently at 5.6GB and 48,307,650 rows, so we could decide to add columns to track different mappers. Something funny:

```
select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+
```

So only a bit over half of the LRS fields are filled. And what to make of this:

```
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean | se   | Locus          | LRS  | pValue | additive | h2   |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
|              589 |    8599010 | 70606737 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8593710 | 70606738 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8607637 | 70606739 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8594490 | 70606740 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8591994 | 70606741 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
```

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+
```

There is obviously more. I think this table could use some cleaning up.

Looking at the GN1 code base, ProbeSetXRef is used in a great number of files. In GN2 it is used in:

```
wqflask/maintenance/quantile_normalize.py
wqflask/maintenance/generate_probesetfreeze_file.py
wqflask/base/mrna_assay_tissue_data.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/base/trait.py
wqflask/wqflask/do_search.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/base/data_set/dataset.py
wqflask/wqflask/correlation/pre_computes.py
wqflask/wqflask/api/router.py
wqflask/wqflask/show_trait/show_trait.py
scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py
```

Let's visit these one by one to make sure there is no side-effect.

### wqflask/maintenance/quantile_normalize.py

Appears to be a one-off to normalize a certain dataset in `/home/zas1024/cfw_data/`.
The last time it was used was probably in 2018.

### wqflask/maintenance/generate_probesetfreeze_file.py

This one is even older and probably from 2013.

### wqflask/base/mrna_assay_tissue_data.py

Another dinosaur; it actually uses TissueProbeSetXRef, so it is not relevant here.

### wqflask/wqflask/update_search_results.py

This is for global 'gene' search making sure ProbeSet.Id = ProbeSetXRef.ProbeSetId AND ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id. LRS is passed on.

### wqflask/wqflask/correlation/rust_correlation.py

Again making sure IDs match. LRS is passed on. This requires scrutiny FIXME

### wqflask/base/trait.py

The Trait class defines a trait in WebQTL; it can be a microarray, published phenotype, genotype, or user-input trait. It fetches LRS.

### wqflask/wqflask/do_search.py

Searches for genes with a QTL within the given LRS values

    LRS searches can take 3 different forms:
    - LRS > (or <) min/max_LRS
    - LRS=(min_LRS max_LRS)
    - LRS=(min_LRS max_LRS chromosome start_Mb end_Mb)
    where min/max_LRS represent the range of LRS scores and start/end_Mb represent
    the range in megabases on the given chromosome

### wqflask/base/data_set/mrnaassaydataset.py

Also search results.

### wqflask/base/data_set/dataset.py

ID sync

### wqflask/wqflask/correlation/pre_computes.py

ID sync

### wqflask/wqflask/api/router.py

Fetch traits and sample data API (we may use this!)

### wqflask/wqflask/show_trait/show_trait.py

Same.

### The following are a bunch of scripts:

* scripts/insert_expression_data.py
* scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
* scripts/maintenance/readProbeSetSE_v7.py
* scripts/maintenance/QTL_Reaper_v6.py
* scripts/maintenance/readProbeSetMean_v7.py
* wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

## Using LRS

* wqflask/wqflask/static/new/javascript/lod_chart.js

LRS is used to display a LOD chart

and in a bunch of Python files that match the above list for ProbeSetXRef:

* scripts/maintenance/QTL_Reaper_v6.py
* wqflask/base/webqtlConfig.py
* wqflask/base/trait.py
* wqflask/tests/unit/wqflask/marker_regression/test_display_mapping_results.py
* wqflask/tests/unit/base/test_trait.py
* wqflask/wqflask/parser.py
* wqflask/tests/unit/wqflask/correlation/test_show_corr_results.py
* wqflask/wqflask/update_search_results.py
* wqflask/wqflask/correlation/show_corr_results.py
* wqflask/wqflask/marker_regression/run_mapping.py
* wqflask/wqflask/export_traits.py
* wqflask/wqflask/gsearch.py
* wqflask/base/data_set/phenotypedataset.py
* wqflask/base/data_set/markers.py
* wqflask/wqflask/api/router.py
* wqflask/base/data_set/mrnaassaydataset.py
* wqflask/wqflask/correlation/rust_correlation.py
* wqflask/wqflask/marker_regression/display_mapping_results.py
* wqflask/wqflask/collect.py
* wqflask/wqflask/do_search.py
* wqflask/wqflask/views.py
* wqflask/utility/Plot.py
* test/requests/main_web_functionality.py
* test/requests/correlation_tests.py

From the above it can be concluded that these precomputed values are used for display and for getting search results (filtering on datasets that have a QTL).
It looks safe to start replacing qtlreaper results with GEMMA results.
The only thing I am not sure about is correlations.
A cursory inspection suggests LRS is only used for final output, and that makes sense if correlations are done on (expression) phenotypes.
Anyway, if things go haywire we'll find out soon (enough).

At the next stage we ignore all this and start precompute with GEMMA on the BXD.

## Precompute DB

We will use a database to track precompute updates (see below).

On the computing host (or client) we should track the following:

* Dataset (phenotypes)
* time: MySQL time ProbeSetData table was last updated (if identical we can skip)
* Hash on DB inputs (for DB table updates)
* Genotypes
* Algorithm
* Hash on run inputs (phenotypes, genotypes, algorithm, invocation)
* time: of initiate run
* time: of completion
* Hostname of run (this)
* File path (this)
* Hash on output data (for validation)
* array of rec: DB hostnames + time stamps: successfully updated DB table for these servers

The logic is that if the DB table was changed we should recompute the hash on the inputs.
Note that the ProbeSetData table is the largest at 200GB including indices.
If that hash changed, the mapping algorithm should rerun.
A new record will be created for the new inputs.
A flag is set in the updated DB for the precomputed values.
We can garbage collect when multiple entries for `Dataset' exist.
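
A minimal sketch of that skip/rerun decision, assuming two hypothetical helpers: `stored-record` (the last saved metadata record for a dataset, or #f) and `current-input-hash` (the hash of today's DB inputs):

```scheme
;; Rerun the mapper only when nothing was computed yet or the inputs changed.
;; stored-record and current-input-hash are hypothetical helpers.
(define (needs-recompute? dataset)
  (let ((record (stored-record dataset))
        (hash   (current-input-hash dataset)))
    (or (not record)
        (not (string=? hash (assoc-ref record 'input-hash))))))
```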

What database should we use? Ultimately precompute is part of the main GN setup. So, one could argue MariaDB is a good choice.
We would like to share precompute with other setups, however.
Also we typically do not want to run the precompute on the production host.
That means we should be able to rebuild the database from a precompute output directory and feed the update to the running (production) server.
We want to track compute so we can distribute running the algorithms across servers and/or PBS.
This implies the compute machines have to be able to query the DB in some way.
Basically a machine has a 'runner' that checks the DB for updates and fetches phenotypes and genotypes.
A run is started and on completion the DB is notified and updated.
Note that a runner cannot run in parallel on a single results directory, so there is one runner per target output directory.

We can have different runners, one for local machine, one for PBS and one for remotes.

This also implies that 'state' is handled with the output dir (on a machine).
We will create a JSON record for every compute.
This state can be pushed into 'any' GN instance at any point.
For running jobs the DB can be queried through the same REST API.
The REST API has to handle the DB, so it has to be able to access the DB and files, i.e. it should run on the same (production) machine.
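
For illustration, a sketch of such a JSON record, built as a Guile alist and serialized with guile-json; the field names and placeholder values are illustrative, not a fixed schema:

```scheme
(use-modules (json))   ; guile-json

(define record
  '(("dataset"     . "HC_M2_0606_P")
    ("input-hash"  . "<sha256 of phenotypes+genotypes+invocation>")
    ("genotypes"   . "BXD.geno")
    ("algorithm"   . "gemma")
    ("time-start"  . "<timestamp>")
    ("time-end"    . "<timestamp>")
    ("host"        . "tux05")
    ("path"        . "<output directory>")
    ("output-hash" . "<sha256 of output>")
    ("db-updates"  . #())))   ; DB hostnames + time stamps already updated

(scm->json-string record)
```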

On the DB we'll create a Hash table on inputs of ProbeSetData. This way we don't have to walk the giant table for every query. This can be handled through a CRON job. So there will be a table:

* Dataset
* Hash on relevant phenotypes (ProbeSetData)
* time: Job started
* host: host running job
* status: (final) job status
* time: DB updated

## Tracking updates to DB

This brings us to CRON jobs.
There are several things that ought to be updated when the DB changes, Xapian being one example and now this table another.
These should run on a regular basis and only update what really changed.
We need to support that type of work out of the box with an hourly runner (for precompute) and a daily runner (for xapian).
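
One option is GNU mcron. A sketch of such job specifications; the runner script paths are hypothetical:

```scheme
;; mcron job specifications: an hourly precompute runner and a daily
;; xapian rebuild (Vixie-style time strings, command as a shell string).
(job "10 * * * *" "guile /home/gn/runners/precompute-runner.scm")
(job "30 3 * * *" "guile /home/gn/runners/xapian-runner.scm")
```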

One possibility is to use MySQL triggers to track changes to the database table.
Even a hash can be computed as suggested here:

=> https://stackoverflow.com/questions/38732185/hash-of-two-columns-in-mysql

The problem is that this relatively expensive computation is done on every row update. I don't think it is feasible for updating phenotypes. Arthur may find the updates go very slow. It is possible, however, to just update a boolean value, or even a time stamp, which is cheap. That will probably increase the update time by 50-100% and that may be acceptable. We can only find out by trying.

But first I am going to simply generate the hashes from the running DB by scanning the full table. If that runs in minutes we should be good with a CRON job for now. Also, in the medium term, we may replace these tables with lmdb files and it is no good to depend on MySQL/MariaDB for this type of service.
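
As a sketch, such a scan could combine guile-dbi (as used in the script further down) with guile-gcrypt; how the rows are serialized before hashing is an arbitrary choice here:

```scheme
(use-modules (dbi dbi)
             (gcrypt hash) (gcrypt base16)
             (rnrs bytevectors))

(define db
  (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))

;; Hash all phenotype values belonging to one DataId of ProbeSetData.
(define (data-hash data-id)
  (dbi-query db (string-append
                 "SELECT StrainId, value FROM ProbeSetData WHERE Id="
                 (number->string data-id) " ORDER BY StrainId"))
  (let loop ((row (dbi-get_row db)) (rows '()))
    (if row
        (loop (dbi-get_row db) (cons row rows))
        (bytevector->base16-string
         (sha256 (string->utf8 (object->string (reverse rows))))))))
```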

Note that ProbeSetXRef is indexed on DataId, ProbeSetFreezeId, ProbeSetId and Locus.
The combination ProbeSetFreezeId+ProbeSetId should be unique and, together with the unique DataId index, identifies the actual dataset, so lookups should be reasonably fast.

`ProbeSetFreezeId` is a unique identifier for a dataset, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06".
`ProbeSetId` is the trait in the collection, e.g. '100001_at'.
`DataId` is the counter in the ProbeSetXRef table.

So now this update should be clear:

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

Every trait gets an LRS and effect size this way.


## Writing script to Hash ProbeSetData

tux02 has a running database, so it is good for trials. The first step is to add hashes to ProbeSetData and compute them. Two questions we can ask: (1) should we host that data in the same table and (2) should we use another DB?

For (1) we should *not* use the same table. The current table is reasonably small and if we expand 28,335,955 rows on ProbeSetXRef it is going to be huge. This table is used a lot, so we should create a new table for this specific purpose. Now that means, even though this table is obviously linked to the running DB, we can actually use a different storage. And for this particular purpose I think sqlite will do great. Our preferred 'transient' databases, these days, are lmdb for disk b-trees, redis for in-RAM b-trees, sqlite for tables and virtuoso for RDF. This may change again, but for now these are great default choices. 'transient' means that the DB can be regenerated from some source of truth. This is not necessarily the case today - particularly authorization and session management are now using sqlite and redis as primary data sources. That is fine, as long as we are fully aware of the choices and what that implies.

We already use SQLite for authorization. Fred has been adding that to GN3 and now to its own authorization service:

=> https://github.com/genenetwork/gn-auth

In the README you can see the DB is located with a parameter 'AUTH_DB'. There is no example, so I checked our production setup. Zach has it set up with

```
LOGLEVEL = "DEBUG"
SECRET_KEY = '9a4a2a...'
SQL_URI = "mysql://user:pwd@127.0.0.1:3306/db_webqtl"
AUTH_DB = "/home/gn2/auth_copy.db"
AUTH_MIGRATIONS = "/home/zas1024/gn-auth/migrations"
```

The DB file is world writeable - and we should probably change that. Also it needs a better storage location. Essentially GN should have a limited number of directories that handle state:

* SQL server (mysql) - read-write
* Other databases - read-write
* Upload files - read-write
* Reference files - read-only
* Secret files - read-only
* Cache files - read-write

Within these directories there should be clear subdirectories for each use case. So, for authorization we could have `/data/gn/databases/auth/production/auth.db'. Likewise, the precompute hash table will be in `/data/gn/databases/precompute/production/hashes.db' and we'll introduce an environment variable "GN_HASH_DB".
We'll take care of that later.

For our hashing purposes we'll first create a DB named 'hashes.db'. This table can also be used for correlations etc. down the line by using the 'algorithm' field and its combined input hash. It is generic, in other words.
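
A sketch of bootstrapping hashes.db with guile-sqlite3; the column layout below is a first guess, not a final schema:

```scheme
(use-modules (sqlite3))   ; guile-sqlite3

(define db
  (sqlite-open "/data/gn/databases/precompute/production/hashes.db"))

;; Generic hash table: one row per (dataset, algorithm) compute.
(sqlite-exec db "
  CREATE TABLE IF NOT EXISTS hashes (
    id          INTEGER PRIMARY KEY,
    dataset     TEXT NOT NULL,
    algorithm   TEXT NOT NULL,
    input_hash  TEXT NOT NULL,
    time_start  TEXT,
    host        TEXT,
    status      TEXT,
    time_db_updated TEXT)")
```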

## Getting precompute running

Now that we have pretty much figured out what to do, it is time to set up the precompute. We can start with one single server that has access to MariaDB. The new Genoa tux04 will be our replacement for tux01. It has some larger Crucial P3 NVMe SSDs rated at 3.5GB/s. The Dell drive is 1.4 TB and is rated to be twice as fast. So we'll use that for MariaDB for now.

The first step is to make space for all the GN2+GN3 stuff. Next we test if we can query tux01 ports. tux04 can see tux01 because they are both inside the UTHSC firewall. I just need to allow the mysql port on tux01 and we keep using that DB for now.

The first phase is to simply start running gemma on tux05 and tux04 using the BXD. I mean, before computing 28,335,955 phenotypes we'll have to have a more serious compute setup!

```
MariaDB [db_webqtl]> select * from Strain where SpeciesId=1 limit 200;
+-----+------------+------------+-----------+--------+------------+
| Id  | Name       | Name2      | SpeciesId | Symbol | Alias      |
+-----+------------+------------+-----------+--------+------------+
|   1 | B6D2F1     | B6D2F1     |         1 | NULL   | NULL       |
|   2 | C57BL/6J   | C57BL/6J   |         1 | B6J    | P          |
|   3 | DBA/2J     | DBA/2J     |         1 | D2J    | P          |
|   4 | BXD1       | BXD1       |         1 | NULL   | NULL       |
|   5 | BXD2       | BXD2       |         1 | NULL   | NULL       |
|   6 | BXD5       | BXD5       |         1 | NULL   | NULL       |
|   7 | BXD6       | BXD6       |         1 | NULL   | NULL       |

select * from Strain where SpeciesId=1 and Name like '%BXD%';
returns 1371 rows
```

The StrainXRef table links StrainId with InbredSetId:

```
MariaDB [db_webqtl]> select * from StrainXRef limit 3;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
+-------------+----------+---------+------------------+----------------+
```

## Getting all BXD datasets

Together with the InbredSet table we have the information we need to fetch BXD datasets.

```
select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet limit 3;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName     | Name   | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
|  1 |           1 | BXD               | BXD    |         1 | BXD Family        |      2 | 1               |                    0 |           1 |           0 | BXD           |
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2 |         1 | B6D2F2 OHSU Brain |      2 | 1               |                    0 |           3 |           0 | NULL          |
|  4 |           4 | AXB/BXA           | AXBXA  |         1 | AXB/BXA Family    |      2 | 1               |                 NULL |           1 |           0 | AXB           |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
```

So

```
MariaDB [db_webqtl]> select * from StrainXRef where InbredSetID=1 limit 5;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
|           1 |        4 |      40 | Y                | RI_progeny     |
|           1 |        5 |      50 | Y                | RI_progeny     |
+-------------+----------+---------+------------------+----------------+
```

```
select * from ProbeSetData where StrainId = 4 limit 5;
```

This fetches all data where strain `BXD1' was used. That is 14,829,435 data points!
Now it should be clear that the database is designed to go from dataset -> strain and not the other way round. We can, however, walk ProbeSetXRef and get the ProbeSetId. Use that to query ProbeSetData and use the StrainId to make sure it belongs to the BXD for compute.
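
A sketch of that walk using guile-dbi; it assumes rows come back as alists keyed by column name (strings) and makes no attempt at efficiency:

```scheme
(use-modules (dbi dbi) (srfi srfi-1))

(define db
  (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))

(define (query-rows sql)
  (dbi-query db sql)
  (let loop ((row (dbi-get_row db)) (rows '()))
    (if row (loop (dbi-get_row db) (cons row rows)) (reverse rows))))

;; BXD strains that are used for mapping.
(define bxd-strain-ids
  (map (lambda (row) (assoc-ref row "StrainId"))
       (query-rows "SELECT StrainId FROM StrainXRef
                      WHERE InbredSetId=1 AND Used_for_mapping='Y'")))

;; Phenotype values for one trait (DataId), restricted to the BXD.
(define (bxd-values data-id)
  (filter (lambda (row) (member (assoc-ref row "StrainId") bxd-strain-ids))
          (query-rows (string-append
                       "SELECT StrainId, value FROM ProbeSetData WHERE Id="
                       (number->string data-id)))))
```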

Next, the fields `Locus_old | LRS_old | pValue_old' are not used as far as I can tell from the source code. So, we could use these to track updates. I am going to set these _old fields in ProbeSetXRef. After checking the backups we will first set all Locus_old to NULL. If a value is NULL we can precompute.
After the precompute, the Locus_old value is taken from the reaper compute, as well as LRS_old and pValue_old. We will add a column for additive_old.

After some thought I decided to create a new table that mirrors what we'll do with sqlite down the line. The new table will also allow for referring to time stamps and multiple hits. So, in addition to time stamps and hash values, we'll add to the above update record:

* top hit info (locus, LRS, p-value, effect)
* number of other hits
* file with other hits

And down the line we can point to a different DB table that records significant hits.

As a first check it will be interesting to see how these values differ between qtlreaper and GEMMA!

## Create Hash table

ProbeSetXRef is defined as

```
CREATE TABLE ProbeSetXRef (
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0,
  Locus_old char(20) DEFAULT NULL,
  LRS_old double DEFAULT NULL,
  pValue_old double DEFAULT NULL,
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (DataId),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Not all choices are that sane, but we'll leave it as it is for now (it is a small table).
The new table will look like this (LRS = likelihood ratio statistic):

```
CREATE TABLE PrecomputeLRS (
  id int unsigned NOT NULL DEFAULT 0,
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0, ??
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (id),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

## Preparing for GEMMA

Meanwhile I have prepared tux04 and tux05 for the new runs. The next step is to query the DB and run GEMMA.

A first script on tux02 runs:

```scheme
(use-modules (dbi dbi))

(define db_webqtl (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))
(dbi-query db_webqtl "SELECT * FROM ProbeSetXRef LIMIT 1")
(display (dbi-get_row db_webqtl))

((ProbeSetFreezeId . 1) (ProbeSetId . 1) (DataId . 1) (Locus_old . 10.095.400) (LRS_old . 13.3971627898894) (pValue_old . 0.163) (mean . 5.48794285714286) (se . 0.08525787814808819) (Locus . rs13480619) (LRS . 12.590069931048) (pValue . 0.269) (additive . -0.28515625) (h2 . 0.0))
```

Note that you can run this over a remote guile-geiser connection.
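
The step after that, sketched here, is to write a single-column phenotype file from such values and shell out to GEMMA; the file names and output prefix are placeholders, and `-lmm 2` requests the likelihood ratio test:

```scheme
(use-modules (ice-9 format))

;; One value (or "NA") per line, in the same order as the genotype file.
(define (write-pheno-file file values)
  (call-with-output-file file
    (lambda (port)
      (for-each (lambda (v) (format port "~a\n" (or v "NA"))) values))))

;; Kinship matrix first, then the LMM run; GEMMA writes into ./output/.
(define (run-gemma geno pheno prefix)
  (system* "gemma" "-g" geno "-p" pheno "-gk"
           "-o" (string-append prefix "-K"))
  (system* "gemma" "-g" geno "-p" pheno
           "-k" (string-append "output/" prefix "-K.cXX.txt")
           "-lmm" "2" "-o" prefix))
```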


## More complicated datasets

A good dataset to take apart is

=> http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.