topics/systems/mariadb/precompute-mapping-input-data.gmi


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161

# Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.

# Tags

* assigned: pjotrp
* type: precompute, gemma
* status: in progress
* priority: high
* keywords: ui, correlations

# Tasks

* [ ] Start using GEMMA for precomputed values as a background pipeline on a different machine
* [ ] Update the table values using GEMMA output (single highest score)

Above is the quick win for plugging in GEMMA value. We will make sure not to recompute the values that are already up to date.

Next:

* [ ] Track metadata of computed datasets in RDF
* [ ] Store all GEMMA values efficiently
* [ ] Compute significance with GEMMA or other LMM (bulkLMM?)
* [ ] Store signficance and significant values for processing
* [ ] Update search & correlations to use these
* [ ] Further optimize computations so they can run continuously in the background

# Info

## Original qtlreaper version

The original reaper precompute lives in

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

This script first fetches inbredsets

```
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+
```

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today. Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == IndbredSet.Id.

Next for this Id we fetch, known as `ProbeSetXRefInfos`:

```
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+
```

Then we fetch the trait values:

```
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+
```

with genotypes and these phenotypes qtlreaper is started and next we update the values for

```
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old          | pValue_old | mean             |
se                  | Locus      | LRS             | pValue | additive    | h2   |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
|                1 |          1 |      1 | 10.095.400 | 13.3971627898894 |      0.163 | 5.48794285714286 |
0.08525787814808819 | rs13480619 | 12.590069931048 |  0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
```

the actual update is

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

so Locus, LRS and additive fields are updated.

From this exercise we can conclude:

* Existing precomputed values are by linear regression (old QTLreaper)
* Only one highest score is stored
* Every time the script is run *all* datasets are recomputed
* The '_old' named values in the table are not touched by the script
* h2 and mean are not updated
* No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in mariadb, but we can retain GEMMA output if stored efficiently.

## Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently @5.6Gb and 48,307,650 rows, so we could decide to add columns to track different mappers. Something funny

```
select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+
```

half the fields are used. What to think of

```
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean | se   | Locus          | LRS  | pValue | additive | h2   |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
|              589 |    8599010 | 70606737 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8593710 | 70606738 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8607637 | 70606739 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8594490 | 70606740 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
|              589 |    8591994 | 70606741 | NULL      |    NULL |       NULL |    0 | NULL | rsm10000000001 |    0 |   NULL |        0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
```

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+
```

There is obviously more.  I think this table can use some cleaning up?

## Preparing for GEMMA

A good dataset to take apart is

=> http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.