# Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.

# Tags

* assigned: pjotrp
* type: precompute, gemma
* status: in progress
* priority: high
* keywords: ui, correlations

# Tasks

* [ ] Start using GEMMA for precomputed values as a background pipeline on a different machine
* [ ] Update the table values using GEMMA output (single highest score)

Above is the quick win for plugging in GEMMA values. We will make sure not to recompute values that are already up to date. This is achieved by naming the input and output files after a hash on their DB inputs.

Next:

* [ ] Store all GEMMA values efficiently
* [ ] Track metadata of computed datasets (in RDF?)
* [ ] Compute significance with GEMMA or another LMM (bulkLMM?)
* [ ] Store significance and significant values for processing
* [ ] Update search & correlations to use these
* [ ] Further optimize computations so they can run continuously in the background

# Info

## Original qtlreaper version

The original reaper precompute lives in

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

This script first fetches the inbredsets

```
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+
```

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today. Next it sets the Id for ProbeSetFreeze, which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == InbredSet.Id.
There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06"

```
select max(distinct(ProbeSetFreezeId)) from ProbeSetXRef limit 5;
| 1054 |
```

Next, for this Id we fetch what is known as `ProbeSetXRefInfos`:

```
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+
```

Then we fetch the trait values:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.Id = 1 limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+
```

with names:

```
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+
```

35 strains were used for this dataset:

```
select count(ProbeSetData.value) from ProbeSetData where ProbeSetData.Id = 1;
+---------------------------+
| count(ProbeSetData.value) |
+---------------------------+
|                        35 |
+---------------------------+
```

With these genotypes (from files) and phenotypes (from MySQL) qtlreaper is run, and next we update the values in

```
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old          | pValue_old | mean             | se                  | Locus      | LRS             | pValue | additive    | h2   |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
|                1 |          1 |      1 | 10.095.400 | 13.3971627898894 |      0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 | 12.590069931048 |  0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
```

The actual update is

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

so the Locus, LRS and additive fields are updated.
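To plug GEMMA into this flow, the quick-win pipeline can keep the exact same UPDATE and only swap the mapping step. Below is a minimal sketch, assuming a DB-API connection to db_webqtl and a hypothetical `run_gemma()` helper that maps one phenotype vector and returns the top hit; it is not the final pipeline.

```python
# Sketch only: mirrors the qtlreaper flow, but takes the top hit from GEMMA.
# `conn` is any DB-API connection to db_webqtl; `run_gemma` is a hypothetical
# helper that maps one (strain, value) list and returns (locus, lrs, additive).

def precompute_dataset(conn, run_gemma, probesetfreeze_id):
    cur = conn.cursor()
    cur.execute("SELECT ProbeSetId, DataId FROM ProbeSetXRef "
                "WHERE ProbeSetFreezeId=%s", (probesetfreeze_id,))
    for probeset_id, data_id in cur.fetchall():
        cur.execute("SELECT Strain.Name, ProbeSetData.value "
                    "FROM Strain, ProbeSetData "
                    "WHERE Strain.Id = ProbeSetData.StrainId "
                    "AND ProbeSetData.Id = %s", (data_id,))
        phenotypes = cur.fetchall()              # [(strain_name, value), ...]
        locus, lrs, additive = run_gemma(phenotypes)
        cur.execute("UPDATE ProbeSetXRef SET Locus=%s, LRS=%s, additive=%s "
                    "WHERE ProbeSetId=%s AND ProbeSetFreezeId=%s",
                    (locus, lrs, additive, probeset_id, probesetfreeze_id))
    conn.commit()
```

The queries and the UPDATE are the ones shown above; only the source of the scores changes.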
The old reaper scores are in

```
MariaDB [db_webqtl]> select ProbeSetId,ProbeSetFreezeId,Locus,LRS,additive from ProbeSetXRef limit 10;
+------------+------------------+----------------+------------------+--------------------+
| ProbeSetId | ProbeSetFreezeId | Locus          | LRS              | additive           |
+------------+------------------+----------------+------------------+--------------------+
|          1 |                1 | rs13480619     |  12.590069931048 | -0.28515625        |
|          2 |                1 | rs29535974     | 10.5970737900941 | -0.116783333333333 |
|          3 |                1 | rs49742109     |  6.0970532702754 |  0.112957489878542 |
|          4 |                1 | rsm10000002321 | 11.7748675511731 | -0.157113725490196 |
|          5 |                1 | rsm10000019445 | 10.9232633740162 |  0.114764705882353 |
|          6 |                1 | rsm10000017255 | 8.45741703245224 | -0.200034412955466 |
|          7 |                1 | rs4178505      | 7.46477918183565 |  0.104331983805668 |
|          8 |                1 | rsm10000144086 | 12.1201771258006 | -0.134278431372548 |
|          9 |                1 | rsm10000014965 | 11.8837168740735 |  0.341458333333334 |
|         10 |                1 | rsm10000020208 | 10.2809848009836 | -0.173866666666667 |
+------------+------------------+----------------+------------------+--------------------+
10 rows in set (0.000 sec)
```

This means that for every trait in a dataset a single maximum score gets stored(!)

From this exercise we can conclude:

* Existing precomputed values are by linear regression (old QTLreaper)
* Only one highest score is stored
* Every time the script is run *all* datasets are recomputed
* The '_old' named values in the table are not touched by the script
* h2 and mean are not updated
* No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in MariaDB, but we can retain GEMMA output if stored efficiently.

## Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently @5.6GB and 48,307,650 rows, so we could decide to add columns to track different mappers. Something funny:

```
select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+
```

Only about half the LRS fields are filled (28,335,955 of 48,307,650 rows). What to think of

```
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean | se   | Locus          | LRS  | pValue | additive | h2   |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
|              589 |    8599010 | 70606737 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8593710 | 70606738 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8607637 | 70606739 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8594490 | 70606740 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8591994 | 70606741 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
```

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+
```

There is obviously more. I think this table could use some cleaning up. Looking at the GN1 code base, ProbeSetXRef is used in a great number of files.
In GN2 it is:

```
wqflask/maintenance/quantile_normalize.py
wqflask/maintenance/generate_probesetfreeze_file.py
wqflask/base/mrna_assay_tissue_data.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/base/trait.py
wqflask/wqflask/do_search.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/base/data_set/dataset.py
wqflask/wqflask/correlation/pre_computes.py
wqflask/wqflask/api/router.py
wqflask/wqflask/show_trait/show_trait.py
scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py
```

Let's visit these one by one to make sure there is no side effect.

### wqflask/maintenance/quantile_normalize.py

Appears to be a one-off to normalize a certain dataset in `/home/zas1024/cfw_data/`. It was probably last used in 2018.

### wqflask/maintenance/generate_probesetfreeze_file.py

This one is even older and probably from 2013.

### wqflask/base/mrna_assay_tissue_data.py

Another dinosaur; it actually uses TissueProbeSetXRef, so it is not relevant.

### wqflask/wqflask/update_search_results.py

This is for the global 'gene' search, making sure ProbeSet.Id = ProbeSetXRef.ProbeSetId AND ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id. LRS is passed on.

### wqflask/wqflask/correlation/rust_correlation.py

Again making sure IDs match. LRS is passed on. This requires scrutiny FIXME

### wqflask/base/trait.py

The Trait class defines a trait in webqtl; it can be a microarray, published phenotype, genotype, or user-input trait. It fetches LRS.

### wqflask/wqflask/do_search.py

Searches for genes with a QTL within the given LRS values. LRS searches can take 3 different forms:

- LRS > (or <) min/max_LRS
- LRS=(min_LRS max_LRS)
- LRS=(min_LRS max_LRS chromosome start_Mb end_Mb)

where min/max_LRS represent the range of LRS scores and start/end_Mb represent the range in megabases on the given chromosome.

### wqflask/base/data_set/mrnaassaydataset.py

Also search results.

### wqflask/base/data_set/dataset.py

ID sync.

### wqflask/wqflask/correlation/pre_computes.py

ID sync.

### wqflask/wqflask/api/router.py

Fetch traits and sample data API (we may use this!).

### wqflask/wqflask/show_trait/show_trait.py

Same.
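Of the files above, do_search.py is the main consumer that actually filters on the precomputed LRS. As a rough illustration (not the actual GN2 code), the three search forms could translate into SQL conditions along these lines, assuming a marker table (here called `Geno`) that carries Chr and Mb for each locus:

```python
# Illustration only -- not the real do_search.py implementation.
# Builds a WHERE fragment for the three LRS search forms described above;
# the numeric arguments are assumed to be validated already.

def lrs_clause(min_lrs, max_lrs, chromosome=None, start_mb=None, end_mb=None):
    clause = (f"ProbeSetXRef.LRS >= {min_lrs} "
              f"AND ProbeSetXRef.LRS <= {max_lrs}")
    if chromosome is not None:
        # Restrict the peak locus to a genomic window (third search form)
        clause += (f" AND ProbeSetXRef.Locus = Geno.Name"
                   f" AND Geno.Chr = '{chromosome}'"
                   f" AND Geno.Mb >= {start_mb} AND Geno.Mb <= {end_mb}")
    return clause
```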
### The following are a bunch of scripts:

* scripts/insert_expression_data.py
* scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
* scripts/maintenance/readProbeSetSE_v7.py
* scripts/maintenance/QTL_Reaper_v6.py
* scripts/maintenance/readProbeSetMean_v7.py
* wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

## Using LRS

* wqflask/wqflask/static/new/javascript/lod_chart.js

LRS is used to display a LOD chart, and in a bunch of Python files that match the above list for ProbeSetXRef:

* scripts/maintenance/QTL_Reaper_v6.py
* wqflask/base/webqtlConfig.py
* wqflask/base/trait.py
* wqflask/tests/unit/wqflask/marker_regression/test_display_mapping_results.py
* wqflask/tests/unit/base/test_trait.py
* wqflask/wqflask/parser.py
* wqflask/tests/unit/wqflask/correlation/test_show_corr_results.py
* wqflask/wqflask/update_search_results.py
* wqflask/wqflask/correlation/show_corr_results.py
* wqflask/wqflask/marker_regression/run_mapping.py
* wqflask/wqflask/export_traits.py
* wqflask/wqflask/gsearch.py
* wqflask/base/data_set/phenotypedataset.py
* wqflask/base/data_set/markers.py
* wqflask/wqflask/api/router.py
* wqflask/base/data_set/mrnaassaydataset.py
* wqflask/wqflask/correlation/rust_correlation.py
* wqflask/wqflask/marker_regression/display_mapping_results.py
* wqflask/wqflask/collect.py
* wqflask/wqflask/do_search.py
* wqflask/wqflask/views.py
* wqflask/utility/Plot.py
* test/requests/main_web_functionality.py
* test/requests/correlation_tests.py

From the above it can be concluded that these precomputed values are used for display and for getting search results (filtering on datasets that have a QTL). It looks safe to start replacing qtlreaper results with GEMMA results. The only thing I am not sure about is correlations. A cursory inspection suggests LRS is only used for final output, and that makes sense if correlations are done on (expression) phenotypes. Anyway, if things go haywire we'll find out soon (enough).

At the next stage we ignore all this and start precompute with GEMMA on the BXD.

## Precompute DB

We will use a database to track precompute updates (see below). On the computing host (or client) we should track the following:

* Dataset (phenotypes)
* time: MySQL time the ProbeSetData table was last updated (if identical we can skip)
* Hash on DB inputs (for DB table updates)
* Genotypes
* Algorithm
* Hash on run inputs (phenotypes, genotypes, algorithm, invocation)
* time: of run start
* time: of completion
* Hostname of run (this)
* File path (this)
* Hash on output data (for validation)
* array of rec: DB hostnames + time stamps: successfully updated DB table for these servers

(A sketch of such a record is given below.)

The logic is that if the DB table was changed we should recompute the hash on inputs. Note that the ProbeSetData table is the largest at 200GB including indices. If that hash changed, the mapping algorithm should rerun. A new record will be created on the new inputs. A flag is set when a DB has been updated with the precomputed values. We can garbage collect when multiple entries for `Dataset` exist.

What database should we use? Ultimately precompute is part of the main GN setup, so one could argue MariaDB is a good choice. We would like to share precompute with other setups, however. Also, we typically do not want to run the precompute on the production host. That means we should be able to rebuild the database from a precompute output directory and feed the update to the running (production) server. We want to track compute so we can distribute running the algorithms across servers and/or PBS.
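As a concrete sketch of such a tracking record, here is roughly what one compute could produce, written as a Python dict that dumps straight to JSON. All field names and values are placeholders, not a fixed schema.

```python
import json

# Hypothetical tracking record for one precompute run -- a sketch of the
# fields listed above, not a fixed schema. All values are placeholders.
record = {
    "dataset": "HC_M2_0606_P",                      # phenotype dataset
    "probesetdata_updated": "2023-10-01 03:11:00",  # MySQL time ProbeSetData last changed
    "hash_db_inputs": "sha256:<hash of DB inputs>",
    "genotypes": "BXD.geno",
    "algorithm": "gemma (LMM)",
    "hash_run_inputs": "sha256:<phenotypes+genotypes+algorithm+invocation>",
    "time_run_started": "2023-10-02 01:00:00",
    "time_run_completed": "2023-10-02 01:03:21",
    "hostname": "tux05",                            # host that ran the job
    "file_path": "/precompute/<hash>/output",       # output path on that host
    "hash_output": "sha256:<hash of output data>",
    "db_updates": [                                 # servers whose tables were updated
        {"db_host": "tux01", "time": "2023-10-02 04:00:00"},
    ],
}

print(json.dumps(record, indent=2))
```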
This implies the compute machines have to be able to query the DB in some way. Basically a machine has a 'runner' that checks the DB for updates and fetches phenotypes and genotypes. A run is started and on completion the DB is notified and updated. Note that a runner cannot run in parallel on a single results directory, so there is one runner per target output directory. We can have different runners: one for the local machine, one for PBS and one for remotes. This also implies that 'state' is handled with the output dir (on a machine). We will create a JSON record for every compute. This state can be pushed into 'any' GN instance at any point.

For running jobs the DB can be queried through a (same) REST API. The REST API has to handle the DB, so it has to be able to access DB and files, i.e. it should be on the same (production) machine.

On the DB we'll create a hash table on the inputs of ProbeSetData. This way we don't have to walk the giant table for every query. This can be handled through a CRON job. So there will be a table:

* Dataset
* Hash on relevant phenotypes (ProbeSetData)
* time: Job started
* host: host running job
* status: (final) job status
* time: DB updated

## Tracking updates to DB

This brings us to CRON jobs. There are several things that ought to be updated when the DB changes, Xapian being one example and now this table. These should run on a regular basis and only update what really changed. We need to support that type of work out of the box with an hourly runner (for precompute) and a daily runner (for Xapian).

One possibility is to use MySQL triggers to track changes to the database table. Even a hash can be computed, as suggested here:

=> https://stackoverflow.com/questions/38732185/hash-of-two-columns-in-mysql

The problem is that this relatively expensive computation is done on every row update. I don't think it is feasible for updating phenotypes; Arthur may find the updates go very slow. It is possible, however, to just update a boolean value, or even a time stamp, which is cheap. That will probably increase the update time by 50-100% and that may be acceptable. We can only find out by trying. But first I am going to simply generate the hashes from the running DB by scanning the full table. If that runs in minutes we should be good with a CRON job for now. Also, in the medium term, we may replace these tables with lmdb files and it is no good to depend on MySQL/MariaDB for this type of service.

Note that ProbeSetXRef is indexed on DataId, ProbeSetFreezeId, ProbeSetId and Locus. The combination ProbeSetFreezeId+ProbeSetId should be unique and, indexed by the unique DataId, identifies the actual dataset entry, so that should be reasonably fast. `ProbeSetFreezeId` is a unique identifier for a dataset, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06". `ProbeSetId` is the trait in the collection, e.g. '100001_at'. `DataId` is the counter in the ProbeSetXRef table. So now this update should be clear:

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

Every trait gets an LRS and effect size this way.

## Writing script to Hash ProbeSetData

tux02 has a running database, so it is good for trials. The first step is to add hashes to ProbeSetData and compute them. Two questions we can ask: (1) should we host that data in the same table and (2) should we use another DB? For (1) we should *not* use the same table. The current table is reasonably small and if we expand the 28,335,955 rows on ProbeSetXRef it is going to be huge.
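As for generating the hashes themselves by scanning the running DB (mentioned above), a minimal sketch, assuming a DB-API connection and SHA-256 over the ordered (StrainId, value) pairs per trait. Where these hashes get stored is decided next.

```python
import hashlib

# Sketch: compute one hash per trait (DataId) by scanning ProbeSetData.
# Any edit to a phenotype value changes the hash; dataset-level hashes
# can be built by combining the per-trait hashes. `conn` is a DB-API
# connection to db_webqtl.

def hash_probesetdata(conn, data_id):
    cur = conn.cursor()
    cur.execute("SELECT StrainId, value FROM ProbeSetData "
                "WHERE Id=%s ORDER BY StrainId", (data_id,))
    h = hashlib.sha256()
    for strain_id, value in cur.fetchall():
        h.update(f"{strain_id}:{value};".encode())
    return h.hexdigest()
```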
ProbeSetXRef is used a lot, so we should create a new table for this specific purpose. Now that means, even though this table is obviously linked to the running DB, we can actually use different storage. And for this particular purpose I think SQLite will do great.

Our preferred 'transient' databases, these days, are lmdb for disk b-trees, redis for in-RAM b-trees, sqlite for tables and virtuoso for RDF. This may change again, but for now these are great default choices. 'Transient' means that the DB can be regenerated from some source of truth. This is not necessarily the case today - particularly authorization and session management are now using sqlite and redis as primary data sources. That is fine, as long as we are fully aware of the choices and what that implies.

We already use SQLite for authorization. Fred has been adding that to GN3 and now to its own authorization service:

=> https://github.com/genenetwork/gn-auth

In the README you can see the DB is located with a parameter 'AUTH_DB'. There is no example, so I checked our production setup. Zach has it set up with

```
LOGLEVEL = "DEBUG"
SECRET_KEY = '9a4a2a...'
SQL_URI = "mysql://user:pwd@127.0.0.1:3306/db_webqtl"
AUTH_DB = "/home/gn2/auth_copy.db"
AUTH_MIGRATIONS = "/home/zas1024/gn-auth/migrations"
```

The DB file is world writable - we should probably change that. Also, it needs a better storage location. Essentially GN should have a limited number of directories that handle state:

* SQL server (mysql) - read-write
* Other databases - read-write
* Upload files - read-write
* Reference files - read-only
* Secret files - read-only
* Cache files - read-write

Within these directories there should be clear subdirectories for each use case. So, for authorization we could have `/data/gn/databases/auth/production/auth.db`. Likewise, the precompute hash table will be in `/data/gn/databases/precompute/production/hashes.db` and we'll introduce an environment variable "GN_HASH_DB". We'll take care of that later.

For our hashing purposes we'll first create a DB named 'hashes.db'. This table can also be used for correlations etc. down the line by using the 'algorithm' field and its combined input hash. It is generic, in other words.

## Getting precompute running

Now that we have pretty much figured out what to do, it is time to set up precompute. We can start with one single server that has access to MariaDB. The new Genoa machine tux04 will be our replacement for tux01. It has some larger Crucial P3 NVMe SSDs rated at 3.5GB/s. The Dell drive is 1.4TB and is rated to be twice as fast, so we'll use that for MariaDB for now. The first step is to make space for all the GN2+GN3 stuff. Next we test if we can query the tux01 ports. tux04 can see tux01 because they are both inside the UTHSC firewall; I just need to allow the MySQL port on tux01 and we keep using that DB for now.

The first phase is to simply start running GEMMA on tux05 and tux04 using the BXD. I mean, before computing 28,335,955 phenotypes we'll have to have a more serious compute setup!
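For that first phase, a single GEMMA run per trait looks roughly like the sketch below. It assumes BIMBAM-format genotype and phenotype files have already been written; the eventual pipeline may well wrap this differently (e.g. caching the kinship matrix per genotype set).

```python
import subprocess

# Sketch of one GEMMA invocation for a BXD trait, assuming BIMBAM-format
# genotype (geno) and phenotype (pheno) files already exist on disk.
# Results land in <outdir>/<name>.assoc.txt with per-SNP p-values and effects.

def run_gemma(geno, pheno, name, outdir="output"):
    # Kinship (relatedness) matrix -- reusable across traits of one dataset
    subprocess.run(["gemma", "-g", geno, "-p", pheno, "-gk", "1",
                    "-outdir", outdir, "-o", name + "_K"], check=True)
    # LMM association scan (-lmm 2 runs the likelihood ratio test)
    subprocess.run(["gemma", "-g", geno, "-p", pheno,
                    "-k", f"{outdir}/{name}_K.cXX.txt",
                    "-lmm", "2", "-outdir", outdir, "-o", name], check=True)
    return f"{outdir}/{name}.assoc.txt"
```

For the quick win we only need the top hit per trait, which can be read straight from the assoc file.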
To see which strains are involved, list the mouse strains (SpeciesId=1):

```
MariaDB [db_webqtl]> select * from Strain where SpeciesId=1 limit 200;
+-----+------------+------------+-----------+--------+------------+
| Id  | Name       | Name2      | SpeciesId | Symbol | Alias      |
+-----+------------+------------+-----------+--------+------------+
|   1 | B6D2F1     | B6D2F1     |         1 | NULL   | NULL       |
|   2 | C57BL/6J   | C57BL/6J   |         1 | B6J    | P          |
|   3 | DBA/2J     | DBA/2J     |         1 | D2J    | P          |
|   4 | BXD1       | BXD1       |         1 | NULL   | NULL       |
|   5 | BXD2       | BXD2       |         1 | NULL   | NULL       |
|   6 | BXD5       | BXD5       |         1 | NULL   | NULL       |
|   7 | BXD6       | BXD6       |         1 | NULL   | NULL       |

select * from Strain where SpeciesId=1 and Name like '%BXD%';
returns 1371 rows
```

The StrainXRef table links StrainId with InbredSetId:

```
MariaDB [db_webqtl]> select * from StrainXRef limit 3;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
+-------------+----------+---------+------------------+----------------+
```

## Getting all BXD datasets

Together with the InbredSet table we have the information we need to fetch the BXD datasets.

```
select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet limit 3;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName | Name | SpeciesId | FullName | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
| 1 | 1 | BXD | BXD | 1 | BXD Family | 2 | 1 | 0 | 1 | 0 | BXD |
| 2 | 2 | B6D2F2 OHSU Brain | B6D2F2 | 1 | B6D2F2 OHSU Brain | 2 | 1 | 0 | 3 | 0 | NULL |
| 4 | 4 | AXB/BXA | AXBXA | 1 | AXB/BXA Family | 2 | 1 | NULL | 1 | 0 | AXB |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
```

So

```
MariaDB [db_webqtl]> select * from StrainXRef where InbredSetID=1 limit 5;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
|           1 |        4 |      40 | Y                | RI_progeny     |
|           1 |        5 |      50 | Y                | RI_progeny     |
+-------------+----------+---------+------------------+----------------+
```

```
select * from ProbeSetData where StrainId = 4 limit 5;
```

This fetches all data where strain `BXD1` was used. That is 14,829,435 data points! Now it should be clear that the database is designed to go from dataset -> strain and not the other way round. We can, however, walk ProbeSetXRef to get the ProbeSetId, use that to query ProbeSetData, and use the StrainId to make sure it belongs to the BXD for compute.

Next, the fields `Locus_old`, `LRS_old` and `pValue_old` are not used, as far as I can tell from the source code. So we could use these to track updates. I am going to set these _old fields in ProbeSetXRef. After checking the backups we will first set all Locus_old to NULL.
If a value is NULL we can precompute. After precompute, the current Locus_old value is taken from the reaper compute, as well as LRS_old and pValue_old. We will add a column for additive_old.

After some thought I decided to create a new table that mirrors what we'll do with SQLite down the line. The new table will also allow for referring to time stamps and multiple hits. So, in addition to time stamps and hash values, we'll add to the above update record:

* top hit info (locus, LRS, p-value, effect)
* number of other hits
* file with other hits

And down the line we can point to a different DB table that records significant hits. As a first check it will be interesting to see how these values differ between qtlreaper and GEMMA!

## Create Hash table

ProbeSetXRef is defined as

```
CREATE TABLE ProbeSetXRef (
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0,
  Locus_old char(20) DEFAULT NULL,
  LRS_old double DEFAULT NULL,
  pValue_old double DEFAULT NULL,
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (DataId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Not all choices are that sane, but we'll leave it as it is for now (it is a small table). The new table will look like (LRS = likelihood ratio statistic)

```
CREATE TABLE PrecomputeLRS (
  id int unsigned NOT NULL DEFAULT 0,
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0, -- ??
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

## Preparing for GEMMA

Meanwhile I have prepared tux04 and tux05 for the new runs. The next step is to query the DB and run GEMMA. A first script on tux02 runs:

```scheme
(use-modules (dbi dbi))
(define db_webqtl (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))
(dbi-query db_webqtl "SELECT * FROM ProbeSetXRef LIMIT 1")
(display (dbi-get_row db_webqtl))

((ProbeSetFreezeId . 1) (ProbeSetId . 1) (DataId . 1) (Locus_old . 10.095.400) (LRS_old . 13.3971627898894) (pValue_old . 0.163) (mean . 5.48794285714286) (se . 0.08525787814808819) (Locus . rs13480619) (LRS . 12.590069931048) (pValue . 0.269) (additive . -0.28515625) (h2 . 0.0))
```

Note that you can run this remotely over guile-geiser.

## More complicated datasets

A good dataset to take apart is

=> http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.
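To handle such mixed datasets, the precompute has to restrict the phenotype vector to the BXD strains that are actually used for mapping. A minimal sketch of that selection, assuming a DB-API connection and the StrainXRef/ProbeSetData layout shown earlier:

```python
# Sketch: fetch only values of strains flagged for mapping in the BXD
# (InbredSetId=1) for one trait's DataId. `conn` is a DB-API connection.

def bxd_mapping_values(conn, data_id, inbredset_id=1):
    cur = conn.cursor()
    cur.execute(
        "SELECT Strain.Name, ProbeSetData.value "
        "FROM ProbeSetData, StrainXRef, Strain "
        "WHERE ProbeSetData.Id = %s "
        "AND Strain.Id = ProbeSetData.StrainId "
        "AND StrainXRef.StrainId = ProbeSetData.StrainId "
        "AND StrainXRef.InbredSetId = %s "
        "AND StrainXRef.Used_for_mapping = 'Y'",
        (data_id, inbredset_id))
    return cur.fetchall()  # [(strain_name, value), ...] for mapping strains only
```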