author: Pjotr Prins, 2024-01-24 09:38:30 +0100
committer: Pjotr Prins, 2024-02-09 16:23:38 +0100
commit: 0900f9653f12bb2b4a27b56532251ef32b5cb366
tree: ff1750dd8d6e71a7cca0d9da51e5c85b906d4782
parent: 5e87433fab6b45362c27510ed725a233c37d9b34
precompute
-rw-r--r-- topics/systems/mariadb/precompute-mapping-input-data.gmi | 88
1 file changed, 86 insertions, 2 deletions
diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi
index 58b7b5c..968277f 100644
--- a/topics/systems/mariadb/precompute-mapping-input-data.gmi
+++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi
@@ -20,7 +20,7 @@ This is achieved by naming the input and output files as a hash on their DB inpu
Next:
-* [ ] Store all GEMMA values efficiently
+* [X] Store all GEMMA values efficiently
* [ ] Track metadata of computed datasets (in RDF?)
* [ ] Compute significance with GEMMA or other LMM (bulkLMM?)
* [ ] Store significance and significant values for processing
@@ -1007,7 +1007,35 @@ So, this is the right time to put gemma-wrapper on a diet. The GRM files are lar
The current version of gemma-wrapper stores per chromosome GRMs in separate files. The first fix was to store them in an xz archive. gemma-wrapper already uses a temporary directory so, that was straightforward. Next I had to tell gemma-wrapper not to recompute when the xz archive exists. This now works.
-Next step is to use the archive to check the GWA run. In the final step we will archive results and write an lmdb file for further processing. If we can make it really small we'll retain that instead of the full archive.
+Next step is to use the archive to check the GWA run. In the final step we will archive results and write an lmdb file for further processing. If we can make it really small we'll retain that instead of the full archive. This code is part of gemma-wrapper right now:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma2lmdb.py
+
+I looked at using SQLite, which would have been a bit easier, but the file would have been at least double the size of the lmdb one (SQLite stores floats as 8 bytes).
+
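The size argument above can be made concrete with a small sketch. It assumes a hypothetical record layout: a 5-byte key (chromosome and position) and five 4-byte floats for af, beta, se, l_mle and p_lrt. This is an illustration only and not necessarily the layout that gemma2lmdb.py uses; the names KEY, VALUE, pack_row and unpack_row are made up for the example.

```python
import struct

# Hypothetical layout (an assumption, not taken from gemma2lmdb.py):
# key   = chromosome (1 byte) + position (4 bytes), big-endian so that
#         byte-wise key order follows chromosome and position
# value = af, beta, se, l_mle, p_lrt as five 4-byte floats
KEY = struct.Struct(">BL")
VALUE = struct.Struct("=fffff")


def pack_row(chr_, pos, af, beta, se, l_mle, p_lrt):
    """Pack one association row into a (key, value) pair of bytes."""
    # chr_ must be an integer; X, Y etc. would need a numeric code (assumption)
    return KEY.pack(chr_, pos), VALUE.pack(af, beta, se, l_mle, p_lrt)


def unpack_row(key, value):
    """Return (chr, pos, af, beta, se, l_mle, p_lrt) from a packed pair."""
    return KEY.unpack(key) + VALUE.unpack(value)


key, value = pack_row(1, 69315198, 0.28, 0.49527430534362793,
                      0.13411599397659302, 100000.0, 0.0006254952750168741)
print(len(key), len(value))
print(unpack_row(key, value))
```

With 4-byte floats the value takes 20 bytes per SNP, where five 8-byte floats would take 40. The price is precision: a value such as 0.28 comes back as the nearest 4-byte float.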
+Inside the lmdb file we also store the results of the log files as a JSON 'meta' record. A metadata record is important because it allows for quick assessments later, such as how long the compute took at each step and how much memory was used. We collect metadata at every step, which means that we pass around a growing JSON object. We should also be able to expose the JSON easily so it can be parsed with Python, Ruby, jq etc. With metadata we can store some extra information on the ProbeSet:
+
+```
+MariaDB [db_webqtl]> select Id,Chr,Mb,Name,Symbol,description from ProbeSet where Id=1 limit 1;
++----+------+-----------+-----------+--------+---------------------------------+
+| Id | Chr  | Mb        | Name      | Symbol | description                     |
++----+------+-----------+-----------+--------+---------------------------------+
+|  1 | 9    | 44.970689 | 100001_at | Cd3g   | CD3d antigen, gamma polypeptide |
++----+------+-----------+-----------+--------+---------------------------------+
+```
+
+and the dataset:
+
+```
+MariaDB [db_webqtl]> select * from ProbeSetFreeze WHERE Name='HC_M2_0606_P' limit 5;
++-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
+| Id  | ProbeFreezeId | AvgID | Name         | Name2                              | FullName                                   | ShortName                         | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale |
++-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
+| 112 |            30 |     2 | HC_M2_0606_P | Hippocampus_M430_V2_BXD_PDNN_Jun06 | Hippocampus Consortium M430v2 (Jun06) PDNN | Hippocampus M430v2 BXD 06/06 PDNN | 2016-02-11 |         1 |      2 |               0 |                 | log2      |
++-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
+```
+
+gemma-wrapper outputs JSON - but it is fairly elaborate so we'll reduce it to a minimum. At the time of lmdb creation we will pass in a small JSON file that describes the gemma-wrapper run.
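The idea of a metadata object that grows with every step can be sketched as follows. This is a minimal illustration and not gemma-wrapper's actual code: run_step, the "steps" list and the "seconds" field are hypothetical names, and only timing is recorded here (memory use is left out).

```python
import json
import time


def run_step(meta, name, func):
    """Run func() and append the elapsed time for this step to meta."""
    start = time.perf_counter()
    result = func()
    elapsed = round(time.perf_counter() - start, 3)
    # 'steps' and 'seconds' are illustrative field names (assumption)
    meta.setdefault("steps", []).append({"name": name, "seconds": elapsed})
    return result


# Start from a small description of the run and let every step add to it
meta = {"type": "gemma-wrapper", "population": "BXD"}
run_step(meta, "grm", lambda: sum(range(1000)))  # stand-in for the GRM step
run_step(meta, "gwa", lambda: sum(range(1000)))  # stand-in for the GWA step

# Serialised form that could be stored as the 'meta' record
meta_bytes = json.dumps(meta, sort_keys=True).encode("utf-8")
print(meta_bytes.decode("utf-8"))
```

Because the record is plain JSON, the same bytes can be read back with jq, Ruby or Python without knowing anything about the rest of the file.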
# Storing assoc output
@@ -1018,8 +1046,64 @@ For the sign we can use the SNP additive effect estimate. The se of Beta is a fu
Anyway, we can consider compressing the shape the way a CDROM is compressed.
For now, in the next step we will compress the storage using xz - as discussed above. In the next step, after running the precompute and storing the highest hit in the DB, we will start using lmdb to store assoc values.
+The precompute runs on tux04 with
+
+```
+tux04:~/services/gn-guile$ . .guix-shell ruby --expose=/home/wrk/services/gemma-wrapper=/gemma-wrapper --share=/export2/precompute-gemma -- env TMPDIR=/export2/precompute-gemma guile -L . -s ./scripts/precompute/precompute-hits.scm
+```
+
+## Final tweaks
+
+The precompute runs and updates the DB. We are creating a new precompute table that contains the highest hits, so we can track status of the runs. From this we can update ProbeSetXRef - rather than doing it directly.
+
+That way we can also track the top hits for different computations.
+
+For one result we have
+
+```js
+ "meta": {
+ "type": "gemma-wrapper",
+ "version": "0.99.7-pre1",
+ "population": "BXD",
+ "name": "HC_U_0304_R",
+ "trait": "101500_at",
+ "url": "https://genenetwork.org/show_trait?trait_id=101500_at&dataset=HC_U_0304_R",
+ "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
+ "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-gemma-GWA.tar.xz",
+ "dataid": 75629,
+ "probesetid": 1097,
+ "probesetfreezeid": 7
+ }
+```
+
+
+
# Notes
+1 rsm10000000577 3.20 Chr1: 69.315198 -0.248
+
+=> chr,pos,af,beta,se,l_mle,p_lrt
+1,69315198,0.2800000011920929,0.49527430534362793,0.13411599397659302,100000.0,0.0006254952750168741
+Math.log(0.0006254952750168741,10)
+=> -3.203775966613006
+-0.4952743053436279/2
+=> -0.24763715267181394
+
+
+5 rsm10000001990 3.11 Chr4: 81.464858 0.282
+
+4,81464858,0.36500000953674316,-0.5631462931632996,0.15581850707530975,100000.0,0.0007759517757222056
+Math.log(0.0007759517757222056,10)
+=> -3.1101652686754857
+ 0.563146/2
+=> 0.281573
+
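The two checks above can be repeated in Python. The sketch assumes, based only on these two rows, that GN displays -log10(p_lrt) and minus half of GEMMA's beta; gn_display is a hypothetical helper name.

```python
import math


def gn_display(beta, p_lrt):
    """Return (-log10 of the p-value, additive effect) rounded like the GN rows.

    Assumption from the two rows above: additive effect = -beta / 2.
    """
    return round(-math.log10(p_lrt), 2), round(-beta / 2, 3)


# beta and p_lrt taken from the two GEMMA output lines above
print(gn_display(0.49527430534362793, 0.0006254952750168741))
print(gn_display(-0.5631462931632996, 0.0007759517757222056))
```

Both rows agree with the values GN shows (3.20 and -0.248, 3.11 and 0.282), including the flipped sign of the effect.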
+The likelihood ratio is bounded between zero and one.
+The p_lrt runs between 0.0 and 1.0. The smaller the number, the higher the -log10 value (roughly the number of leading zeros; 0.000001 gives 6).
+
+The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. In our case l_mle is always exactly 10^5 or -10^5 for some reason. According to the GEMMA documentation a positive value means it is a pure genetic effect.
+
+
## NAs in GN
A note from Zach: