# Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for the compute and multiple significant scores for queries.

At this stage we precompute GEMMA results and store them in a tarball or lmdb. As a project is never complete, we need to add a metadata record in each tarball that tracks the status of the 'package'. Also, to offload compute to machines without DB access, we need to prepare a first step that contains the genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship matrices with and without LOCO. See

=> /topics/data/precompute/steps

The mariadb database gets (re)updated from the computed data, parsing metadata and forcing an update. We only need to handle state to track creating batches for (offline) compute. As this can happen on a single machine we don't need to do anything really fancy: just write out the last updated trait id, i.e., DataId from ProbeSetXRef, which is indexed. Before doing anything we use Locus_old to identify updates by setting it to NULL. See the header of list-traits-to-compute.scm.
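
A minimal sketch of that batching idea in SQL (the marker role of Locus_old and the batch size are assumptions, not final design):

```
-- Mark every trait as 'needs precompute' (quick hack using the unused Locus_old column)
UPDATE ProbeSetXRef SET Locus_old=NULL;

-- Fetch the next batch of traits to compute, walking the indexed DataId
SELECT ProbeSetFreezeId, ProbeSetId, DataId
  FROM ProbeSetXRef
  WHERE Locus_old IS NULL
  ORDER BY DataId
  LIMIT 1000;
```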

# Tags

* assigned: pjotrp
* type: precompute, gemma
* status: in progress
* priority: high
* keywords: ui, correlations

# Tasks

See

=> topics/data/precompute/steps

Next, for running the full batch:

* [X] Store all GEMMA values efficiently
* [ ] Include metadata record in lmdb and as JSON file
* [ ] Include metadata record on compute status
* [ ] Remove junk from tarball - use lmdb?
* [ ] List significant markers as metadata in lmdb
* [ ] Reread below info
* [ ] Submit jobs to PBS using CCWL
* [ ] Report results to mariadb

And after:

* [ ] Track metadata of computed datasets (in RDF?)
* [ ] Compute significance with GEMMA or other LMM (bulkLMM?)
* [ ] Store significance and significant values for processing
* [ ] Update search & correlations to use these
* [ ] Further optimize computations so they can run continuously in the background

# Info

## Original qtlreaper version

The original reaper precompute lives in

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

This script first fetches inbredsets

```
 select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet LIMIT 5;
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName     | Name     | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family                                           | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
|  1 |           1 | BXD               | BXD      |         1 | BXD Family        |      2 | 1               | riset       | Reference Populations (replicate average, SE, N) |           1 |           0 | BXD           |
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2   |         1 | B6D2F2 OHSU Brain |      2 | 1               | intercross  | Crosses, AIL, HS                                 |           3 |           0 | NULL          |
|  4 |           4 | AXB/BXA           | AXBXA    |         1 | AXB/BXA Family    |      2 | 1               | NULL        | Reference Populations (replicate average, SE, N) |           1 |           0 | AXB           |
|  5 |           5 | AKXD              | AKXD     |         1 | AKXD Family       |      2 | 1               | NULL        | Reference Populations (replicate average, SE, N) |           1 |           0 | AKD           |
|  6 |           6 | B6BTBRF2          | B6BTBRF2 |         1 | B6BTBRF2          |      2 | 1               | intercross  | Crosses, AIL, HS                                 |           3 |           0 | BBT           |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
```

```
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+
```

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today (let alone the latest?). Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == InbredSet.Id.

There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06"

```
select max(distinct(ProbeSetFreezeId)) from ProbeSetXRef limit 5;
|                            1054 |
```

Next, for this Id we fetch what is known as `ProbeSetXRefInfos`:

```
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+
```

Then we fetch the trait values:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.Id = 1 limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+
```

with names:

```
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+
```

35 strains were used for this dataset:

```
select count(ProbeSetData.value) from ProbeSetData where ProbeSetData.Id = 1;
+---------------------------+
| count(ProbeSetData.value) |
+---------------------------+
|                        35 |
+---------------------------+
```

With genotypes (from files) and these phenotypes (from MySQL), qtlreaper is started and next we update the values in

```
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old          | pValue_old | mean             | se                  | Locus      | LRS             | pValue | additive    | h2   |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
|                1 |          1 |      1 | 10.095.400 | 13.3971627898894 |      0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 | 12.590069931048 |  0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
```

the actual update is

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

so Locus, LRS and additive fields are updated.

The old reaper scores are in

```
MariaDB [db_webqtl]> select ProbeSetId,ProbeSetFreezeId,Locus,LRS,additive from ProbeSetXRef limit 10;
+------------+------------------+----------------+------------------+--------------------+
| ProbeSetId | ProbeSetFreezeId | Locus          | LRS              | additive           |
+------------+------------------+----------------+------------------+--------------------+
|          1 |                1 | rs13480619     |  12.590069931048 |        -0.28515625 |
|          2 |                1 | rs29535974     | 10.5970737900941 | -0.116783333333333 |
|          3 |                1 | rs49742109     |  6.0970532702754 |  0.112957489878542 |
|          4 |                1 | rsm10000002321 | 11.7748675511731 | -0.157113725490196 |
|          5 |                1 | rsm10000019445 | 10.9232633740162 |  0.114764705882353 |
|          6 |                1 | rsm10000017255 | 8.45741703245224 | -0.200034412955466 |
|          7 |                1 | rs4178505      | 7.46477918183565 |  0.104331983805668 |
|          8 |                1 | rsm10000144086 | 12.1201771258006 | -0.134278431372548 |
|          9 |                1 | rsm10000014965 | 11.8837168740735 |  0.341458333333334 |
|         10 |                1 | rsm10000020208 | 10.2809848009836 | -0.173866666666667 |
+------------+------------------+----------------+------------------+--------------------+
10 rows in set (0.000 sec)
```

This means that for every trait in a dataset only one single maximum score gets stored(!)

From this exercise we can conclude:

* Existing precomputed values are by linear regression (old QTLreaper)
* Only one highest score is stored
* Every time the script is run *all* datasets are recomputed
* The '_old' named values in the table are not touched by the script
* h2 and mean are not updated
* No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in mariadb, but we can retain GEMMA output if stored efficiently.

## Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently @5.6Gb and 48,307,650 rows, so we could decide to add columns to track different mappers. Something funny:

```
select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+
```

so only about half the rows have an LRS. And what to think of

```
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean | se   | Locus          | LRS  | pValue | additive | h2   |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
|              589 |    8599010 | 70606737 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8593710 | 70606738 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8607637 | 70606739 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8594490 | 70606740 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8591994 | 70606741 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
```

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+
```

There is obviously more. I think this table could use some cleaning up.

In the GN1 code base ProbeSetXRef is used in a great number of files. In GN2 it is used in:

```
wqflask/maintenance/quantile_normalize.py
wqflask/maintenance/generate_probesetfreeze_file.py
wqflask/base/mrna_assay_tissue_data.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/base/trait.py
wqflask/wqflask/do_search.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/base/data_set/dataset.py
wqflask/wqflask/correlation/pre_computes.py
wqflask/wqflask/api/router.py
wqflask/wqflask/show_trait/show_trait.py
scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py
```

Let's visit these one by one to make sure there is no side-effect.

### wqflask/maintenance/quantile_normalize.py

Appears to be a one-off to normalize a certain dataset in `/home/zas1024/cfw_data/`.
The last time it was used was probably 2018.

### wqflask/maintenance/generate_probesetfreeze_file.py

This one is even older and probably from 2013.

### wqflask/base/mrna_assay_tissue_data.py

Another dinosaur, actually uses TissueProbeSetXRef, so not relevant.

### wqflask/wqflask/update_search_results.py

This is for global 'gene' search making sure ProbeSet.Id = ProbeSetXRef.ProbeSetId AND ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id. LRS is passed on.

### wqflask/wqflask/correlation/rust_correlation.py

Again making sure IDs match. LRS is passed on. This requires scrutiny FIXME

### wqflask/base/trait.py

The Trait class defines a trait in webqtl; it can be a microarray, published phenotype, genotype, or user-input trait. It fetches LRS.

### wqflask/wqflask/do_search.py

Searches for genes with a QTL within the given LRS values

    LRS searches can take 3 different forms:
    - LRS > (or <) min/max_LRS
    - LRS=(min_LRS max_LRS)
    - LRS=(min_LRS max_LRS chromosome start_Mb end_Mb)
    where min/max_LRS represent the range of LRS scores and start/end_Mb represent
    the range in megabases on the given chromosome

### wqflask/base/data_set/mrnaassaydataset.py

Also search results.

### wqflask/base/data_set/dataset.py

ID sync

### wqflask/wqflask/correlation/pre_computes.py

ID sync

### wqflask/wqflask/api/router.py

Fetch traits and sample data API (we may use this!)

### wqflask/wqflask/show_trait/show_trait.py

Same.

### The following are a bunch of scripts:

* scripts/insert_expression_data.py
* scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
* scripts/maintenance/readProbeSetSE_v7.py
* scripts/maintenance/QTL_Reaper_v6.py
* scripts/maintenance/readProbeSetMean_v7.py
* wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

## Using LRS

* wqflask/wqflask/static/new/javascript/lod_chart.js

LRS is used to display a LOD chart

and in a bunch of Python files that match the above list for ProbeSetXRef:

* scripts/maintenance/QTL_Reaper_v6.py
* wqflask/base/webqtlConfig.py
* wqflask/base/trait.py
* wqflask/tests/unit/wqflask/marker_regression/test_display_mapping_results.py
* wqflask/tests/unit/base/test_trait.py
* wqflask/wqflask/parser.py
* wqflask/tests/unit/wqflask/correlation/test_show_corr_results.py
* wqflask/wqflask/update_search_results.py
* wqflask/wqflask/correlation/show_corr_results.py
* wqflask/wqflask/marker_regression/run_mapping.py
* wqflask/wqflask/export_traits.py
* wqflask/wqflask/gsearch.py
* wqflask/base/data_set/phenotypedataset.py
* wqflask/base/data_set/markers.py
* wqflask/wqflask/api/router.py
* wqflask/base/data_set/mrnaassaydataset.py
* wqflask/wqflask/correlation/rust_correlation.py
* wqflask/wqflask/marker_regression/display_mapping_results.py
* wqflask/wqflask/collect.py
* wqflask/wqflask/do_search.py
* wqflask/wqflask/views.py
* wqflask/utility/Plot.py
* test/requests/main_web_functionality.py
* test/requests/correlation_tests.py

From above it can be concluded these precomputed values are used for display and for getting search results (filtering on datasets that have a QTL).
It looks safe to start replacing qtlreaper results with GEMMA results.
The only thing I am not sure about is correlations.
A cursory inspection suggests LRS is only used for final output and that makes sense if correlations are done on (expression) phenotypes.
Anyway, if things go haywire we'll find out soon (enough).

At the next stage we ignore all this and start precompute with GEMMA on the BXD.

## Precompute DB

We will use a database to track precompute updates (see below).

On the computing host (or client) we should track the following:

* Dataset (phenotypes)
* time: MySQL time ProbeSetData table was last updated (if identical we can skip)
* Hash on DB inputs (for DB table updates)
* Genotypes
* Algorithm
* Hash on run inputs (phenotypes, genotypes, algorithm, invocation)
* time: of initiate run
* time: of completion
* Hostname of run (this)
* File path (this)
* Hash on output data (for validation)
* array of rec: DB hostnames + time stamps: successfully updated DB table for these servers

The logic is that if the DB table was changed we should recompute the hash on inputs.
Note the ProbeSetData table is the largest at 200G including indices.
If that hash changed the mapping algorithm should rerun.
A new record will be created on the new inputs.
A flag is set for the updated DB for precomputed values.
We can garbage collect when multiple entries for `Dataset' exist.

What database should we use? Ultimately precompute is part of the main GN setup. So, one could argue MariaDB is a good choice.
We would like to share precompute with other setups, however.
Also we typically do not want to run the precompute on the production host.
That means we should be able to rebuild the database from a precompute output directory and feed the update to the running (production) server.
We want to track compute so we can distribute running the algorithms across servers and/or PBS.
This implies the compute machines have to be able to query the DB in some way.
Basically a machine has a 'runner' that checks the DB for updates and fetches phenotypes and genotypes.
A run is started and on completion the DB is notified and updated.
Note that a runner cannot run in parallel on a single results directory, so we need one runner per target output directory.

We can have different runners, one for local machine, one for PBS and one for remotes.

This also implies that 'state' is handled with the output dir (on a machine).
We will create a JSON record for every compute.
This state can be pushed into 'any' GN instance at any point.
For running jobs the DB can be queried through a (same) REST API.
The REST API has to handle the DB, so it has to be able to access DB and files, i.e. it should be on the same (production) machine.

On the DB we'll create a Hash table on inputs of ProbeSetData. This way we don't have to walk the giant table for every query. This can be handled through a CRON job. So there will be a table:

* Dataset
* Hash on relevant phenotypes (ProbeSetData)
* time: Job started
* host: host running job
* status: (final) job status
* time: DB updated

## Tracking updates to DB

This brings us to CRON jobs.
There are several things that ought to be updated when the DB changes, Xapian being one example and now this table another.
These should run on a regular basis and only update what really changed.
We need to support that type of work out of the box with an hourly runner (for precompute) and a daily runner (for xapian).

One possibility is to use MySQL triggers to track changes to the database table.
Even a hash can be computed as suggested here:

=> https://stackoverflow.com/questions/38732185/hash-of-two-columns-in-mysql

The problem is that this relatively expensive computation is done on every row update. I don't think it is feasible for updating phenotypes. Arthur may find the updates go very slow. It is possible, however, to just update a boolean value, or even a time stamp, and that is cheap. That will probably increase the update time by 50-100% and that may be acceptable. We can only find out by trying.
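
A minimal sketch of that cheap variant, assuming a (hypothetical) Changed column on ProbeSetData rather than a per-row hash:

```
-- Hypothetical: a cheap 'changed' marker instead of a per-row hash
ALTER TABLE ProbeSetData ADD COLUMN Changed TIMESTAMP NULL DEFAULT NULL;

-- Touch the marker on every row update
-- (a TIMESTAMP ... ON UPDATE CURRENT_TIMESTAMP default would do the same without a trigger)
CREATE TRIGGER ProbeSetData_touch
  BEFORE UPDATE ON ProbeSetData
  FOR EACH ROW SET NEW.Changed = NOW();
```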

But first I am going to simply generate the hashes from the running DB by scanning the full table. If that runs in minutes we should be good with a CRON job for now. Also, in the medium term, we may replace these tables with lmdb files and it is no good to depend on MySQL/MariaDB for this type of service.
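
Such a scan could look like this (a sketch; MD5 and GROUP_CONCAT are just illustrations and group_concat_max_len may need raising):

```
-- One hash per trait's worth of phenotype data
SET SESSION group_concat_max_len = 1000000;
SELECT Id AS DataId,
       MD5(GROUP_CONCAT(StrainId, ':', value ORDER BY StrainId)) AS pheno_hash
  FROM ProbeSetData
  GROUP BY Id
  LIMIT 10;
```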

Note that ProbeSetXRef is indexed on DataId, ProbeSetFreezeId, ProbeSetId and Locus.
The combination ProbeSetFreezeId+ProbeSetId should be unique and, indexed by the unique DataId, identifies the actual dataset, so that should be reasonably fast.

`ProbeSetFreezeId` is a unique identifier of a dataset, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06".
`ProbeSetId` is the trait in the collection, e.g. '100001_at'.
`DataId` is the counter in the ProbeSetXRef table.
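
A quick sanity check on that uniqueness assumption (it should return no rows):

```
-- Any duplicate (ProbeSetFreezeId, ProbeSetId) pairs would show up here
SELECT ProbeSetFreezeId, ProbeSetId, COUNT(*) AS n
  FROM ProbeSetXRef
  GROUP BY ProbeSetFreezeId, ProbeSetId
  HAVING n > 1
  LIMIT 10;
```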

So now this update should be clear:

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

Every trait gets an LRS and effect size this way.


## Writing a script to hash ProbeSetData

tux02 has a running database, so it is good for trials. The first step is to add hashes to ProbeSetData and compute them. Two questions we can ask: (1) should we host that data in the same table and (2) should we use another DB?

For (1) we should *not* use the same table. The current table is reasonably small and if we expand 28,335,955 rows on ProbeSetXRef it is going to be huge. This table is used a lot, so we should create a new table for this specific purpose. Now that means, even though this table is obviously linked to the running DB, we can actually use a different storage. And for this particular purpose I think sqlite will do great. Our preferred 'transient' databases, these days, are lmdb for disk b-trees, redis for in-RAM b-trees, sqlite for tables and virtuoso for RDF. This may change again, but for now these are great default choices. 'transient' means that the DB can be regenerated from some source of truth. This is not necessarily the case today - particularly authorization and session management are now using sqlite and redis as primary data sources. That is fine, as long as we are fully aware of the choices and what that implies.

We already use SQLite for authorization. Fred has been adding that to GN3 and now to its own authorization service:

=> https://github.com/genenetwork/gn-auth

In the README you can see the DB is located with a parameter 'AUTH_DB'. There is no example, so I checked our production setup. Zach has it set up with

```
LOGLEVEL = "DEBUG"
SECRET_KEY = '9a4a2a...'
SQL_URI = "mysql://user:pwd@127.0.0.1:3306/db_webqtl"
AUTH_DB = "/home/gn2/auth_copy.db"
AUTH_MIGRATIONS = "/home/zas1024/gn-auth/migrations"
```

The DB file is world writeable - and we should probably change that. Also it needs a better storage location. Essentially GN should have a limited number of directories that handle state:

* SQL server (mysql) - read-write
* Other databases - read-write
* Upload files - read-write
* Reference files - read-only
* Secret files - read-only
* Cache files - read-write

Within these directories there should be clear subdirectories for each use case. So, for authorization we could have `/data/gn/databases/auth/production/auth.db'. Likewise, the precompute hash table will be in `/data/gn/databases/precompute/production/hashes.db' and we'll introduce an environment variable "GN_HASH_DB".
We'll take care of that later.

For our hashing purposes we'll first create a DB named 'hashes.db'. This table can also be used for correlations etc. down the line by using the 'algorithm' field and its combined input hash. It is generic, in other words.
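
A minimal SQLite sketch of such a generic table, based on the fields listed earlier (table and column names are assumptions):

```
-- hashes.db: one row per (dataset, algorithm, input hash) combination
CREATE TABLE IF NOT EXISTS hashes (
  id INTEGER PRIMARY KEY,
  dataset TEXT NOT NULL,            -- e.g. a ProbeSetFreeze name
  algorithm TEXT NOT NULL,          -- e.g. 'gemma-loco'
  input_hash TEXT NOT NULL,         -- hash over phenotypes+genotypes+invocation
  time_started TEXT,                -- job started
  host TEXT,                        -- host running the job
  status TEXT,                      -- (final) job status
  time_db_updated TEXT,             -- when the production DB was updated
  UNIQUE(dataset, algorithm, input_hash)
);
```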

## Getting precompute running

Now that we have pretty much figured out what to do, it is time to set up precompute. We can start with one single server that has access to mariadb. The new Genoa Tux04 will be our replacement for Tux01. It has some larger Crucial P3 NVMe SSDs rated at 3.5Gb/s. The Dell drive is 1.4 TB and is rated to be twice as fast. So we'll use that for MariaDB now.

The first step is to make space for all the GN2+GN3 stuff. Next we test if we can query tux01 ports. tux04 can see tux01 because they are both inside the UTHSC firewall. I just need to allow the mysql port on tux01 and we keep using that DB for now.

The first phase is to simply start running gemma on tux05 and tux04 using the BXD. I mean, before computing 28,335,955 phenotypes we'll have to have a more serious compute setup!

```
MariaDB [db_webqtl]> select * from Strain where SpeciesId=1 limit 200;
+-----+------------+------------+-----------+--------+------------+
| Id  | Name       | Name2      | SpeciesId | Symbol | Alias      |
+-----+------------+------------+-----------+--------+------------+
|   1 | B6D2F1     | B6D2F1     |         1 | NULL   | NULL       |
|   2 | C57BL/6J   | C57BL/6J   |         1 | B6J    | P          |
|   3 | DBA/2J     | DBA/2J     |         1 | D2J    | P          |
|   4 | BXD1       | BXD1       |         1 | NULL   | NULL       |
|   5 | BXD2       | BXD2       |         1 | NULL   | NULL       |
|   6 | BXD5       | BXD5       |         1 | NULL   | NULL       |
|   7 | BXD6       | BXD6       |         1 | NULL   | NULL       |

select * from Strain where SpeciesId=1 and Name like '%BXD%';
returns 1371 rows
```

The oddly named StrainXRef table links StrainId with InbredSetId:

```
MariaDB [db_webqtl]> select * from StrainXRef limit 3;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
+-------------+----------+---------+------------------+----------------+
```

## Getting all BXD datasets

Together with the InbredSet table we have the information we need to fetch BXD datasets.

```
select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet limit 3;
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName     | Name     | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family                                           | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
|  1 |           1 | BXD               | BXD      |         1 | BXD Family        |      2 | 1               | riset       | Reference Populations (replicate average, SE, N) |           1 |           0 | BXD           |
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2   |         1 | B6D2F2 OHSU Brain |      2 | 1               | intercross  | Crosses, AIL, HS                                 |           3 |           0 | NULL          |
|  4 |           4 | AXB/BXA           | AXBXA    |         1 | AXB/BXA Family    |      2 | 1               | NULL        | Reference Populations (replicate average, SE, N) |           1 |           0 | AXB           |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
```

So

```
MariaDB [db_webqtl]> select * from StrainXRef where InbredSetID=1 limit 5;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
|           1 |        4 |      40 | Y                | RI_progeny     |
|           1 |        5 |      50 | Y                | RI_progeny     |
+-------------+----------+---------+------------------+----------------+
```

```
select * from ProbeSetData where StrainId = 4 limit 5;
```

Fetches all data where strain `BXD1' was used. That is 14,829,435 data points!
Now it should be clear the database is designed to go from dataset -> strain and not the other way round. We can, however, walk ProbeSetXRef and get the ProbeSetId. Use that to query ProbeSetData and use the StrainId to make sure the data belongs to the BXD for compute.
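
A sketch of that walk in SQL, joining ProbeSetXRef to ProbeSetData and keeping only strains that StrainXRef marks as BXD mapping individuals (InbredSetId=1):

```
-- Phenotype values for one trait (DataId=1), restricted to BXD strains used for mapping
SELECT Strain.Name, ProbeSetData.value
  FROM ProbeSetXRef
  INNER JOIN ProbeSetData ON ProbeSetData.Id = ProbeSetXRef.DataId
  INNER JOIN StrainXRef   ON StrainXRef.StrainId = ProbeSetData.StrainId
  INNER JOIN Strain       ON Strain.Id = ProbeSetData.StrainId
  WHERE ProbeSetXRef.DataId = 1
    AND StrainXRef.InbredSetId = 1
    AND StrainXRef.Used_for_mapping = 'Y';
```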

Next, the fields `Locus_old | LRS_old | pValue_old' are not used, as far as I can tell from the source code. So, we could use these to track updates. I am going to set these _old fields in ProbeSetXRef. After checking the backups we will first set all Locus_old to NULL. If a value is NULL we can precompute. After precompute, the current Locus value from the reaper compute is stored in Locus_old, as well as LRS_old and pValue_old. We will add a column for additive_old.
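
A sketch of that bookkeeping (the additive_old column does not exist yet, so the ALTER is the planned addition):

```
-- Planned: add a column to preserve the old additive effect
ALTER TABLE ProbeSetXRef ADD COLUMN additive_old DOUBLE DEFAULT NULL;

-- After precompute, move the reaper values into the _old columns for one trait
UPDATE ProbeSetXRef
  SET Locus_old=Locus, LRS_old=LRS, pValue_old=pValue, additive_old=additive
  WHERE ProbeSetId=%s AND ProbeSetFreezeId=%s;
```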

Let's fetch one where Locus_old is now NULL:

```
select ProbeSetFreezeId,ProbeSetId,DataId,Locus_old,Locus from ProbeSetXRef where Locus_old is NULL limit 1;
+------------------+------------+--------+-----------+-----------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | Locus     |
+------------------+------------+--------+-----------+-----------+
|               31 |      48552 | 458384 | NULL      | D10Mit194 |
+------------------+------------+--------+-----------+-----------+
1 row in set (0.00 sec)
```

DataId is really ProbeSetDataId, so:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.id = 458384 limit 1;
+--------+----------+--------+
| Id     | StrainId | value  |
+--------+----------+--------+
| 458384 |       42 | 10.306 |
+--------+----------+--------+
1 row in set (0.00 sec)
```

And the strain belonging to that StrainId can be found with

```
MariaDB [db_webqtl]> select * from Strain where Strain.id=42;
+----+---------+---------+-----------+--------+-------+
| Id | Name    | Name2   | SpeciesId | Symbol | Alias |
+----+---------+---------+-----------+--------+-------+
| 42 | CASE001 | CASE001 |         1 | NULL   | NULL  |
+----+---------+---------+-----------+--------+-------+
```

and

```
MariaDB [db_webqtl]> select * from StrainXRef where StrainId=42;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           2 |       42 |       1 | N                | NULL           |
+-------------+----------+---------+------------------+----------------+
```

```
MariaDB [db_webqtl]> select * from InbredSet where Id=2;;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
| Id | InbredSetId | InbredSetName     | Name   | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family           | FamilyOrder | MenuOrderId | InbredSetCode | Description |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2 |         1 | B6D2F2 OHSU Brain |      2 | 1               | intercross  | Crosses, AIL, HS |           3 |           0 | NULL          | NULL        |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
```

which is not part of the BXD (InbredSetId=1).

So, StrainXRef is crucial to get to the actual BXD. Let's look at how it is used in existing code. In GN2 it is used in 2 places only: the router and the sample list:

```
SELECT Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId = 1;
```

lists all strain *names* based on Strain.Id. The nomenclature is mind-bogglingly bad. Anyway, it confirms that we can use this to see if something is part of the BXD.

Get all the BXD used for mapping

```
SELECT StrainId,Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId = 1 AND Used_for_mapping="Y" ORDER BY StrainId;
```

After some thought I decided we will create a new table that mirrors what we'll do with sqlite down the line. The new table will also allow for referring to time stamps and multiple hits. So, in addition to time stamps and hash values, we'll add to the above update record:

* top hit info (locus, LRS, p-value, effect)
* number of other hits
* file with other hits

And down the line we can point to a different DB table that records significant hits.

As a first check it will be interesting to see how these values differ between qtlreaper and GEMMA!
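
Once the _old columns hold the reaper values and the current columns hold GEMMA results, a first comparison could be as simple as (a sketch):

```
-- Compare old reaper hits with the new GEMMA hits, largest differences first
SELECT ProbeSetFreezeId, ProbeSetId, Locus_old, LRS_old, Locus, LRS,
       ABS(LRS - LRS_old) AS delta
  FROM ProbeSetXRef
  WHERE LRS_old IS NOT NULL AND LRS IS NOT NULL
  ORDER BY delta DESC
  LIMIT 20;
```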

## Create Hash table

This new Hash table can be called 'SignificantHits' or something.

ProbeSetXRef is defined as

```
CREATE TABLE ProbeSetXRef (
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0,
  Locus_old char(20) DEFAULT NULL,
  LRS_old double DEFAULT NULL,
  pValue_old double DEFAULT NULL,
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (DataId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Not all choices are that sane, but we'll leave it like it is for now (it is a small table).
The new table will look like this (LRS = likelihood ratio statistic):

```
CREATE TABLE SignificantHits (
  Id INT UNSIGNED NOT NULL DEFAULT 0,
  ProbeSetFreezeId INT UNSIGNED NOT NULL DEFAULT 0, -- points to dataset info
  ProbeSetId INT UNSIGNED NOT NULL DEFAULT 0,       -- points to trait info
  ProbeSetDataId INT UNSIGNED NOT NULL DEFAULT 0,   -- id, strain value OR ID of the table?
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  significant double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (Id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Note the prepended Id and the rename from DataId to ProbeSetDataId because they refer to the same thing:

```
MariaDB [db_webqtl]> select max(DataId) from ProbeSetXRef limit 10;
| max(DataId) |
|    92213693 |

select max(Id) from ProbeSetData limit 3;
| max(Id)  |
| 92213693 |
```

In the future this table will be the replacement for the badly named ProbeSetXRef.

## Creating a list of work on tux04

We'll use a test table first on tux02. It should not matter for precompute because we'll regenerate hashes from the DB input and GEMMA output. I.e., we only need to recompute if something has changed in phenotypes or genotypes and that is quite rare.

We will work on tux04 and I'll first install the database there. In the process we'll need to get the basic environment going that we need for work. First I thought I'd copy the backup from space - I thought it has a 10Gbs link, but actually it uses 1Gbs and the backup is a month old (need to fix that too!).

Alright, let's use tux02 for the DB. On the new machine tux04 we disable /etc/profile and /etc/bash.bashrc - people can add those themselves and they mess up guix profiles.
After creating my environment it looks like any other machine I use and I have the same keyboard shortcuts.

Note that my UID may be wrong for Octopus, but that requires heavier surgery anyway.

Next we create a drop for the backups from another machine following:

=> ../backup_drops.gmi

Copy the backup, which takes a while over the slow network, and unpack with the great tool borg:

```
time borg extract --progress /export/backup/bacchus/drop/tux01/borg-tux01::borg-backup-mariadb-20231111-06:39-Sat
```

Next install a recent Mariadb from guix as a container using the instructions in

=> mariadb.gmi
=> setting-up-local-development-database.gmi

Move the database files into, for example, /export/mysql/var/lib/mysql and chown the files to your user account. Next

```
cd /export/mysql
mkdir tmp
mkdir run
tux04:/export/mysql$ ~/opt/guix-pull/bin/guix shell -C -N coreutils sed mariadb --share=/export/mysql/var=/var --share=/export/mysql/tmp=/tmp
    export TMPDIR=/tmp
    mysqld_safe --datadir='/var/lib/mysql/' --protocol tcp  --port=3307 --user=$USER --group=users --nowatch
```

and a client with:

```
/export/mysql$ ~/opt/guix-pull/bin/guix shell mysql -- mysql --socket=var/run/mysqld/mysqld.sock -uwebqtlout -pwebqtlout db_webqtl
```

### Modifying the tables

Set unused 'Locus_old' to NULL because we use that to update the existing table with the new values for production (note this is a quick hack discussed earlier):

```
update ProbeSetXRef set Locus_old=NULL;
```

```
SELECT DISTINCT DataId from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId = ProbeSetData.Id where StrainId>45 AND Locus_old is NULL limit 10;
```


## Preparing for GEMMA

Meanwhile I have prepared tux04 and tux05 for the new runs. Next step is to query the DB and run GEMMA.

### Testing a script to access the database

A first script on tux02 runs:

```scheme
(use-modules (dbi dbi))

(define db_webqtl (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))
(dbi-query db_webqtl "SELECT * FROM ProbeSetXRef LIMIT 1")
(display (dbi-get_row db_webqtl))

((ProbeSetFreezeId . 1) (ProbeSetId . 1) (DataId . 1) (Locus_old . 10.095.400) (LRS_old . 13.3971627898894) (pValue_old . 0.163) (mean . 5.48794285714286) (se . 0.08525787814808819) (Locus . rs13480619) (LRS . 12.590069931048) (pValue . 0.269) (additive . -0.28515625) (h2 . 0.0))
```

Note that you can run this remotely over guile-geiser.

### Install and run gemma

Now that we have a working DB query method on a compute node, we can start running GEMMA.
GEMMA is packaged as part of GNU Guix, so that part is straightforward. After the above guix pull

```
. ~/opt/guix-pull/etc/profile <- don't do this, it does not work on Debian Guix images!
git clone https://git.genenetwork.org/guix-bioinformatics/
git clone https://gitlab.inria.fr/guix-hpc/guix-past
~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -A gemma
  gemma-gn2       0.98.5-8cd4cdb  out   /home/wrk/guix-bioinformatics/gn/packages/gemma.scm:31:2
```

and install gemma with

```
wrk@tux05:~$ ~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -i gemma -p ~/opt/gemma
```

### Creating the precompute-runner

The precompute-runner can be part of `gn-guile/scripts/precompute`, i.e., the new GN guile repo's maintenance script dir:

=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute

See the README.md for more information on running the script etc.

For development I'll tunnel to the Tux02 database.

As we are doing the BXDs first, we fetch a record from ProbeSetXRef that has Locus_old set to NULL AND matches a BXD trait.

First we fetch all BXD strainids. I wrote a function `bxd-strain-id-names` for that.

Next, using

=> https://git.genenetwork.org/gn-guile/commit/?id=b1db013cc01c94e27edf982be9b027a2b0bb9712

we fetch the first BXD dataset where all strains are members of BXD:

```
((Locus . rs13480619) (DataId . 1) (ProbeSetId . 1))1
WE HAVE OUR FIRST BXD DATASET for precompute!((1 . 5.742) (2 . 5.006) (3 . 6.079) (4 . 6.414) (5 . 4.885) (6 . 4.719) (7 . 5.761) (8 . 5.604) (9 . 5.661) (10 . 5.708) (11 . 5.628) (12 . 6.325) (13 . 5.37) (14 . 6.544) (15 . 5.476) (16 . 5.248) (17 . 5.528) (19 . 5.51) (20 . 5.886) (21 . 5.177) (22 . 5.655) (23 . 5.522) (24 . 5.549) (25 . 4.588) (26 . 5.618) (28 . 6.335) (29 . 5.569) (30 . 4.422) (31 . 5.194) (35 . 4.784) (36 . 5.056) (37 . 5.869) (39 . 5.175) (40 . 5.207) (41 . 5.264))
```

If it is not a match we'll need to iterate to the next one. The current version of precompute can fetch all the BXD datasets by name and trait name.

=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute/precompute-hits.scm

For GEMMA we need to write a phenotype file which is simply a vector of values. Meanwhile, the genotype file and SNP files for the BXD are fixed (currently version 8 with 21,056 markers).

To run GEMMA we can take an example from `gemma-wrapper`, a tool I wrote to cache results:

=> https://github.com/genetics-statistics/gemma-wrapper

gemma-wrapper does a bit more work than we really need and is a bit too generic for this precompute exercise. The following things matter:

* handling of hashes to prevent recompute
* handling of the kinship matrix (K)
* handling of leave one chromosome out (LOCO)

all of which are handled by gemma-wrapper really well. We'll use gemma-wrapper for the time being to do the actual computations.

BTW the cache for GN production is currently at 190G with 40K files. That is 3 months worth of caching GEMMA as we clean up older files! For precompute we may need to be smarter because that only represents 970 association results and 400 kinship matrices. We should improve GEMMA by:

* by default only keep the required results
* provide a more compact storage of results

We should be using something like lmdb as that will speed up lookups and GEMMA itself. Simple compression will reduce output 5x for one 7Mb output. Some K matrices are huge too. The largest one runs at 0.5 Gb. GEMMA does crash once in a while, unsurprisingly.

For the BXD we need to compute at least 100K traits - I have not done a proper count yet - and we can delete output files when the database gets updated. We should, however, hang on to hits that are potentially significant, so I want to retain output, or create a new mysql table.
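
A rough way to get that count (a sketch, assuming ProbeFreeze carries the InbredSetId that ties a ProbeSetFreeze to the BXD):

```
-- Approximate number of BXD traits to precompute
SELECT COUNT(*)
  FROM ProbeSetXRef
  INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.Id = ProbeSetXRef.ProbeSetFreezeId
  INNER JOIN ProbeFreeze    ON ProbeFreeze.Id = ProbeSetFreeze.ProbeFreezeId
  WHERE ProbeFreeze.InbredSetId = 1;
```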

## Run gemma-wrapper

We can run a local version of gemma-wrapper with

```
.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper ruby -- ruby /gemma-wrapper/bin/gemma-wrapper
```

this is useful because we want to change gemma-wrapper itself. Mostly to produce less bulky output and add some metadata for downstream processing (i.e. to update the database itself).
Note how versatile `guix shell` is for development!

gemma-wrapper shows the following output:

```
NOTE: gemma-wrapper is soon to be replaced
Using GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020

Found 0.98.3, comparing against expected v0.98.4
GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020
WARNING: GEMMA version is out of date. Update GEMMA to 0.98.4!
```

Some irony there. We are going to replace gemma-wrapper at some point. And the gemma that is packaged in guix by default is older than what we want to use. Though the release notes point out mostly maintenance updates:

=> https://github.com/genetics-statistics/GEMMA/releases

We can pull in a local gemma too using the same `--expose` trick above. So, let's use that for now:

```
.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper --expose=/home/wrk/iwrk/opensource/code/genetics/gemma/=/gemma ruby -- /gemma/bin/gemma
```

This does not work out of the box because I typically build gemma on a different Guix profile, so it won't find the libraries (RPATH is fixed). There are two options: build a static version of gemma or override LD_LIBRARY_PATH. The first used to work, but Guix no longer carries static libs for openblas, so I would need to provide those myself. Overriding LD_LIBRARY_PATH is a bit inconvenient, but in this case doable. We have to run from inside the container and set:

```
env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib/ /gemma/bin/gemma

GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022

 type ./gemma -h [num] for detailed help
 options:
  1: quick guide
  2: file I/O related
  3: SNP QC
  ...
```

and success. With gemma-wrapper we can pick up the latest gemma by setting GEMMA_COMMAND:

```
(system "env GEMMA_COMMAND=/gemma/bin/gemma /gemma-wrapper/bin/gemma-wrapper")
gemma-wrapper 0.99.6 (Ruby 3.2.2) by Pjotr Prins 2017-2022

NOTE: gemma-wrapper is soon to be replaced
Using GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022
```

All set for actual compute!

## Running gemma

The following command fails because we don't have a pheno file yet:

```
gemma-wrapper --debug -- -gk -g BXD.8_geno.txt.gz -p pheno.txt -a BXD.8_snps.txt
```

Now the pheno file needs to match the number of individuals (genotypes) in the geno file which, presumably, is generated from the file `BXD.8.geno`. It has the header:

```
BXD1	BXD2	BXD5	BXD6	BXD8	BXD9	BXD11	BXD12	BXD13	BXD14	BXD15	BXD16	BXD18	BXD19	BXD20	BXD21	BXD22	BXD23	BXD24	BXD24a	BXD25	BXD27	BXD28	BXD29	BXD30	BXD31	BXD32	BXD33	BXD34	BXD35	BXD36	BXD37	BXD38	BXD39	BXD40	BXD41	BXD42	BXD43	BXD44	BXD45	BXD48	BXD48a	BXD49	BXD50	BXD51	BXD52	BXD53	BXD54	BXD55	BXD56	BXD59	BXD60	BXD61	BXD62	BXD63	BXD64	BXD65	BXD65a	BXD65b	BXD66	BXD67	BXD68	BXD69	BXD70	BXD71	BXD72	BXD73	BXD73a	BXD73b	BXD74	BXD75	BXD76	BXD77	BXD78	BXD79	BXD81	BXD83	BXD84	BXD85	BXD86	BXD87	BXD88	BXD89	BXD90	BXD91	BXD93	BXD94	BXD95	BXD98	BXD99	BXD100	BXD101	BXD102	BXD104	BXD105	BXD106	BXD107	BXD108	BXD109	BXD110	BXD111	BXD112	BXD113	BXD114	BXD115	BXD116	BXD117	BXD119	BXD120	BXD121	BXD122	BXD123	BXD124	BXD125	BXD126	BXD127	BXD128	BXD128a	BXD130	BXD131	BXD132	BXD133	BXD134	BXD135	BXD136	BXD137	BXD138	BXD139	BXD141	BXD142	BXD144	BXD145	BXD146	BXD147	BXD148	BXD149	BXD150	BXD151	BXD152	BXD153	BXD154	BXD155	BXD156	BXD157	BXD160	BXD161	BXD162	BXD165	BXD168	BXD169	BXD170	BXD171	BXD172	BXD173	BXD174	BXD175	BXD176	BXD177	BXD178	BXD180	BXD181	BXD183	BXD184	BXD186	BXD187	BXD188	BXD189	BXD190	BXD191	BXD192	BXD193	BXD194	BXD195	BXD196	BXD197	BXD198	BXD199	BXD200	BXD201	BXD202	BXD203	BXD204	BXD205	BXD206	BXD207	BXD208	BXD209	BXD210	BXD211	BXD212	BXD213	BXD214	BXD215	BXD216	BXD217	BXD218	BXD219	BXD220	C57BL/6JxBXD037F1	BXD001xBXD065aF1	BXD009xBXD170F1	BXD009xBXD172F1	BXD012xBXD002F1	BXD012xBXD021F1	BXD020xBXD012F1	BXD021xBXD002F1	BXD024xBXD034F1	BXD032xBXD005F1	BXD032xBXD028F1	BXD032xBXD65bF1	BXD034xBXD024F1	BXD034xBXD073F1	BXD055xBXD074F1	BXD055xBXD65bF1	BXD061xBXD071F1	BXD062xBXD077F1	BXD065xBXD077F1	BXD069xBXD090F1	BXD071xBXD061F1	BXD073bxBXD065F1	BXD073bxBXD077F1	BXD073xBXD034F1	BXD073xBXD065F1	BXD073xBXD077F1	BXD074xBXD055F1	BXD077xBXD062F1	BXD083xBXD045F1	BXD087xBXD100F1	BXD065bxBXD055F1	BXD102xBXD077F1	BXD102xBXD73bF1	BXD170xBXD172F1	BXD172xBXD197F1	BXD197xBXD009F1	BXD197xBXD170F1
```

with 235 named BXD mice. Now when we check the DB we get

```
select * from StrainXRef WHERE InbredSetId=1 AND Used_for_mapping="Y"
276 rows
```

so it looks like we need some extra logic to fetch the used individuals from the actual genotype file(s).

For development I am running:

```
.guix-shell ruby --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper --expose=/home/wrk/iwrk/opensource/code/genetics/gemma/=/gemma -- guile -L . -s ./scripts/precompute/precompute-hits.scm
```

## Writing the phenotype file

For gemma we need to feed a phenotype file that has only the individuals that are in the genotype file (the other 'missing' phenotype values should be NAs).

The genotype header in GN2 is read from a `BXD.json` file. Code is at
wqflask/wqflask/marker_regression/run_mapping.py:get_genofile_samplelist(dataset)

Zach writes:

> The JSON files were basically just a "stopgap" solution to allow for
> multiple genotype files for any given group (which previously wasn't
> possible), and they don't even exist for most groups - the code uses
> them only if they exist for a given group. They contain the
> title/label, filename, and sample lists for situations where the other
> genotype files only contain a subset of the samples from the full
> sample list, but otherwise (and usually) the samplelist is just fetched
> from the .geno file. It's also possible to fetch that subset of samples
> from the other/alternate .geno files, but it's easier to just fetch the
> samples as a list from that .json file than it is to parse it out of
> the other .geno files (and the JSON file would still be used to to get
> the "title" and filenames, so it's already being read in the code).
> Basically the JSON files are just a sort of "genotype file metadata"
> thing to deal with the existence of multiple genotype files for a
> single group and provide the "attributes" of the files in question
> (namely the title and filename - the sample lists were only added later
> once we started to get genotype files that didn't contain the full
> samplelist for the group in question). The "titles" appear in the
> drop-down on the trait page* - in their absence it will just use
> "{group_name}.geno" (for example BXD.geno). So if you deleted BXD.json,
> mapping would still work for BXD, but it would just default to BXD.geno
> and wouldn't give the option of other files. This information could
> probably also be stored somehow in the .geno file, but I'm not sure how
> to easily do so since that's a weird proprietary format and would
> seemingly require some sort of direct parsing in the GN code (while
> JSON is more simple to deal with).

=> https://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P

> If you scroll down to mapping and Inspect Element on the
> Genotypes drop-down you'll see how the values are the actual filenames
> while the text is the title from the JSON file

So this is genotype metadata which could go into virtuoso (where all metadata belongs).
Anyway, we'll get there when the time comes, but for now I can use it to see what genotypes we have in the genotype file and write the phenotype file accordingly.

I note we also have a GenoData table in mariadb which has allele info. The script

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/load_genotypes.py

loads these from the BXD-style .geno files. As far as I can tell GenoData is not used in GN2+GN3. Otherwise the .geno files are basically used to fetch metadata on genotypes.

Genotype state lives in 4 places. Time to create a 5th one with lmdb ;). At least the BXD.json file lines up with BXD.8.geno and the BIMBAM version with 235 inds.

Using this information we created our first phenotype file and GEMMA run!

# Successfully running gemma-wrapper

Running the wrapper is a two step process:

```
env TMPDIR=tmp ruby ./bin/gemma-wrapper --force --json \
        --loco -- \
        -g test/data/input/BXD_geno.txt.gz \
        -p test/data/input/BXD_pheno.txt \
        -a test/data/input/BXD_snps.txt \
        -gk > K.json
env TMPDIR=tmp ruby ./bin/gemma-wrapper --json --input K.json \
        -- \
        -g test/data/input/BXD_geno.txt.gz \
        -p test/data/input/BXD_pheno.txt \
        -a test/data/input/BXD_snps.txt \
        -lmm -maf 0.1 > G.json
```

The current LOCO approach leads to two files for every per-chromosome GRM and two files for every per-chromosome association output. In this case 82 files with a total of 13Mb (4Mb compressed).
That is a bit insane if you know the input is 300K, even knowing disk space is cheap(!) Cheap is not always cheap, because we still need to process the data, and with growing datasets the size grows rapidly overall.

So, this is the right time to put gemma-wrapper on a diet. The GRM files are largest. Currently we create kinship files for every population subset that is used and that may change once we simply reduce the final GRM by removing cols/rows. But that is one exercise we want to prove first using our precompute exercise. In this case we will simply compress the kinship files and that halves the size with zip. xz compression brings it down to 1/4. That is impressive by itself. I also checked lzma and bzip2 and they were no better. So, with gemma-wrapper we can now store the GRMs in an xz archive. For the assoc files we will cat them into one file and compress that too, reducing the size to 1/7th. As noted above, the current cache size for GN is 190Gb for 3 months. We can reduce that significantly and that will speed up lookups. Decompression with xz is very fast.

# Storing GRM output

The current version of gemma-wrapper stores the per-chromosome GRMs in separate files. The first fix was to store them in a single xz archive. gemma-wrapper already uses a temporary directory, so that was straightforward. Next I had to tell gemma-wrapper not to recompute when the xz archive exists. This now works.
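
The logic is roughly the following - shown here as a Python sketch with hypothetical file names, while gemma-wrapper itself implements this in Ruby:

```
# Sketch: pack per-chromosome GRM files into one .tar.xz and skip the work
# when the archive already exists.
import glob
import os
import tarfile

def archive_grms(workdir, archive):
    if os.path.exists(archive):
        return False  # archive already there: no recompute, no repack
    grms = sorted(glob.glob(os.path.join(workdir, "*.cXX.txt")))  # assumed naming
    with tarfile.open(archive, "w:xz") as tar:
        for path in grms:
            tar.add(path, arcname=os.path.basename(path))
    return True

# Hypothetical usage:
# archive_grms("tmp", "BXD-GRM-loco.tar.xz")
```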

The next step is to use the archive to check the GWA run. In the final step we will archive the results and write an lmdb file for further processing. If we can make that really small we'll retain it instead of the full archive. This code is part of gemma-wrapper right now:

=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma2lmdb.py

I looked at using SQLite - which would have been a bit easier - but the size would have been at least double that of lmdb (floats are stored as 8 bytes in SQLite).

Inside the lmdb file we also store the contents of the log files as a JSON 'meta' record. A metadata record is important because it allows for quick assessments later, such as how long the compute took at each step and how much memory was used. We collect metadata at every step, which means we pass around a growing JSON object. We should also be able to expose the JSON easily so it can be parsed with python, ruby, jq, etc. With metadata we can store some extra information on the ProbeSet:

```
MariaDB [db_webqtl]> select Id,Chr,Mb,Name,Symbol,description from ProbeSet where Id=1 limit 1;
+----+------+-----------+-----------+--------+---------------------------------+
| Id | Chr  | Mb        | Name      | Symbol | description                     |
+----+------+-----------+-----------+--------+---------------------------------+
|  1 | 9    | 44.970689 | 100001_at | Cd3g   | CD3d antigen, gamma polypeptide |
+----+------+-----------+-----------+--------+---------------------------------+
```

and the dataset:

```
MariaDB [db_webqtl]> select * from ProbeSetFreeze WHERE Name='HC_M2_0606_P' limit 5;
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
| Id  | ProbeFreezeId | AvgID | Name         | Name2                              | FullName                                   | ShortName                         | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale |
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
| 112 |            30 |     2 | HC_M2_0606_P | Hippocampus_M430_V2_BXD_PDNN_Jun06 | Hippocampus Consortium M430v2 (Jun06) PDNN | Hippocampus M430v2 BXD 06/06 PDNN | 2016-02-11 |         1 |      2 |               0 |                 | log2      |
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
```

gemma-wrapper outputs JSON - but it is fairly elaborate so we'll reduce it to a minimum. At the time of lmdb creation we will pass in a small JSON file that describes the gemma-wrapper run.
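
A minimal sketch of what such an lmdb file could contain - the key layout and the reduced meta fields below are assumptions for illustration, not the actual format used by gemma2lmdb.py:

```
# Sketch: store a reduced 'meta' JSON record plus per-SNP assoc values in lmdb.
import json
import struct
import lmdb

def write_lmdb(path, meta, assoc_rows):
    """assoc_rows: iterable of (chr, pos, af, beta, se, p_lrt) tuples."""
    env = lmdb.open(path, map_size=64 * 1024 * 1024)
    with env.begin(write=True) as txn:
        txn.put(b"meta", json.dumps(meta).encode())
        for chrom, pos, af, beta, se, p_lrt in assoc_rows:
            key = f"{chrom}:{pos}".encode()
            # float32 keeps each record at 16 bytes (SQLite would use 8-byte REALs)
            txn.put(key, struct.pack("<ffff", af, beta, se, p_lrt))
    env.close()

# Hypothetical usage:
# write_lmdb("trait.mdb", {"trait": "101500_at", "population": "BXD"},
#            [("1", 69315198, 0.28, 0.495, 0.134, 0.000625)])
```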

# Storing assoc output

To kick off the precompute we added new nodes to the Octopus cluster, doubling its capacity. In the next step we have to compress the output of GEMMA so we can keep it forever. For this we want to retain the peaks (obviously), but we also want to retain the 'shape' of the distribution - i.e., the QTL profile with sign. This shape we can use for correlations and potentially some AI-style mining, the way it is presented in AraQTL.

For the sign we can use the SNP additive effect estimate. The se of Beta is a function of the MAF of the SNP. So if you want to present Beta as the SNP additive effect for a standardized genotype, use Beta/se; otherwise Beta is the SNP additive effect for the original, unstandardized genotype. Beta is obtained while controlling for population structure. For the effect sign we also need to check the incoming genotypes, because they may have been switched (see the sketch below).
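
As a small sketch of that choice (a hypothetical helper, not GN code): given Beta and its se from a GEMMA assoc row, we either report Beta as-is or Beta/se for a standardized genotype, and flip the sign if the genotype coding was switched:

```
# Sketch: two ways of reporting the SNP additive effect from a GEMMA assoc row.
def snp_effect(beta, se, standardized=False, genotype_flipped=False):
    """Return Beta (original genotype coding) or Beta/se (standardized genotype);
    flip the sign back when the incoming genotype coding was switched."""
    effect = beta / se if standardized else beta
    return -effect if genotype_flipped else effect

# Example with beta=0.495, se=0.134 (values from the notes below):
# snp_effect(0.495, 0.134)                     # -> 0.495
# snp_effect(0.495, 0.134, standardized=True)  # -> ~3.7
```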

Anyway, we can consider compressing the shape the way a CD-ROM is compressed.
For now we will compress the storage using xz, as discussed above. In the next step, after running the precompute and storing the highest hit in the DB, we will start using lmdb to store assoc values.

The precompute runs on tux04 with

```
tux04:~/services/gn-guile$ . .guix-shell ruby --expose=/home/wrk/services/gemma-wrapper=/gemma-wrapper --share=/export2/precompute-gemma -- env TMPDIR=/export2/precompute-gemma guile -L . -s ./scripts/precompute/precompute-hits.scm
```

## Final tweaks

The precompute runs and updates the DB. We are creating a new precompute table that contains the highest hits, so we can track the status of the runs. From this table we can update ProbeSetXRef - rather than updating it directly.

That way we can also track the top hits for different computations.

For one result we have

```js
  "meta": {
    "type": "gemma-wrapper",
    "version": "0.99.7-pre1",
    "population": "BXD",
    "name": "HC_U_0304_R",
    "trait": "101500_at",
    "url": "https://genenetwork.org/show_trait?trait_id=101500_at&dataset=HC_U_0304_R",
    "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
    "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-gemma-GWA.tar.xz",
    "dataid": 75629,
    "probesetid": 1097,
    "probesetfreezeid": 7
  }
```

Continue with steps:

=> /topics/data/precompute/steps


# Notes

1	rsm10000000577	3.20	Chr1: 69.315198	-0.248

=> chr,pos,af,beta,se,l_mle,p_lrt
1,69315198,0.2800000011920929,0.49527430534362793,0.13411599397659302,100000.0,0.0006254952750168741
Math.log(0.0006254952750168741,10)
=> -3.203775966613006
-0.4952743053436279/2
=> -0.24763715267181394


5	rsm10000001990	3.11	Chr4: 81.464858	0.282

4,81464858,0.36500000953674316,-0.5631462931632996,0.15581850707530975,100000.0,0.0007759517757222056
Math.log(0.0007759517757222056,10)
=> -3.1101652686754857
 0.563146/2
=> 0.281573
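
The same arithmetic as these scratch notes, as a small Python sketch (a hypothetical helper; the halving and the sign flip simply reproduce how GN displays the additive effect in the two examples above):

```
# Sketch: derive GN's displayed LOD and additive effect from a GEMMA assoc row.
import math

def lod_score(p_lrt):
    return -math.log10(p_lrt)

def displayed_effect(beta):
    return -beta / 2

# rsm10000000577: p_lrt=0.0006254952750168741, beta=0.49527430534362793
# lod_score(0.0006254952750168741)      # -> 3.203775966613006  (GN shows 3.20)
# displayed_effect(0.49527430534362793) # -> -0.2476...          (GN shows -0.248)
# rsm10000001990: beta=-0.5631462931632996 -> displayed_effect ~ 0.2816 (GN shows 0.282)
```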

The likelihood ratio is bounded between zero and one, and the p_lrt value runs between 0.0 and 1.0. The smaller the p-value, the larger the -log10 value (basically the number of leading digits: log10(0.000001) is -6).

The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. In our case l_mle is always exactly 10^5 or 10^-5 for some reason - these are the boundary values GEMMA allows for lambda, the ratio of the genetic to the residual variance. According to the GEMMA documentation, hitting the upper bound means the signal is essentially a pure genetic effect.

## Xapian

We want to add the log (LOD) values, the effect size and the number of individuals to the Xapian search index.
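
A sketch of how that could look with the Xapian Python bindings - the value slot numbers are arbitrary placeholders and would have to match GN's actual indexing code:

```
# Sketch: attach LOD, additive effect and sample count to a Xapian document
# as sortable values (slot numbers are placeholders).
import xapian

LOD_SLOT, EFFECT_SLOT, NIND_SLOT = 10, 11, 12

def add_mapping_values(doc, lod, effect, n_individuals):
    doc.add_value(LOD_SLOT, xapian.sortable_serialise(lod))
    doc.add_value(EFFECT_SLOT, xapian.sortable_serialise(effect))
    doc.add_value(NIND_SLOT, xapian.sortable_serialise(float(n_individuals)))

# Hypothetical usage:
# doc = xapian.Document()
# add_mapping_values(doc, 3.20, -0.248, 71)
# Queries can then range-filter or sort on these slots.
```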

## NAs in GN

A note from Zach:

On Sat, Dec 09, 2023 at 06:09:56PM -0600, Zachary Sloan wrote:
>    (After typing the rest of this out, I realized that part of the
>    confusion might be about how locations are stored. We don't actually
>    database locations in the ProbeSetXRef table - we only database the
>    peak Locus marker name. This is then cross-referenced against the Geno
>    table, where the actual Location is stored. This is the main source of
>    the problem. So I think the best short-term solution might be to just
>    directly database the locations in the ProbeSetXRef table. Those
>    locations might become out of date, but as you mention they'd still
>    probably be in the same ballpark.)

It is logical to store the location together with the peak - if the location changes we should recompute. It also suggests that we should track the version of the genotypes in that table.


## More complicated datasets

A good dataset to take apart is

=> http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.

## Variations

This paper discusses a number of approaches that may be interesting:

=> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-023-00331-3 Automated quantitative trait locus analysis (AutoQTL)

## Fetch strain IDs

The following join will fetch the StrainId values in a dataset:

```
MariaDB [db_webqtl]> select StrainId, Locus, DataId, ProbeSetId, ProbeSetFreezeId from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL ORDER BY DataId LIMIT 5;
+----------+-----------+--------+------------+------------------+
| StrainId | Locus     | DataId | ProbeSetId | ProbeSetFreezeId |
+----------+-----------+--------+------------+------------------+
|        1 | rs6394483 | 115467 |      13825 |                8 |
|        2 | rs6394483 | 115467 |      13825 |                8 |
|        3 | rs6394483 | 115467 |      13825 |                8 |
|        4 | rs6394483 | 115467 |      13825 |                8 |
|        5 | rs6394483 | 115467 |      13825 |                8 |
+----------+-----------+--------+------------+------------------+
5 rows in set (0.205 sec)
```

## Count data sets

Using the above join we can count the number of times BXD1 (StrainId=1) was used:

```
MariaDB [db_webqtl]> select count(StrainId) from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL AND StrainId=1 ORDER BY DataId LIMIT 5;
+-----------------+
| count(StrainId) |
+-----------------+
|        10432197 |
+-----------------+
1 row in set (39.545 sec)
```

Or the main BXDs