author    Pjotr Prins 2025-08-09 10:22:37 +0200
committer Pjotr Prins 2026-01-05 11:12:10 +0100
commit    0dca72dff0d57ad300f5f97899b91edd4c491403 (patch)
tree      fd56ae2ae014006f45356c7e3d6ea98ef6674966 /topics/systems/mariadb/precompute-publishdata.gmi
parent    4f9eb9b88766dc25ffabe62ed54b8af96eb3a32e (diff)
On precompute
Diffstat (limited to 'topics/systems/mariadb/precompute-publishdata.gmi')
-rw-r--r-- topics/systems/mariadb/precompute-publishdata.gmi | 63
1 file changed, 62 insertions(+), 1 deletion(-)
diff --git a/topics/systems/mariadb/precompute-publishdata.gmi b/topics/systems/mariadb/precompute-publishdata.gmi
index 3cb2f0e..f9e108e 100644
--- a/topics/systems/mariadb/precompute-publishdata.gmi
+++ b/topics/systems/mariadb/precompute-publishdata.gmi
@@ -19,7 +19,8 @@ So we can convert a .geno file to BIMBAM. I need to extract GN traits to a R/qtl
 * [X] Run gemma-wrapper
 * [X] We should map by trait-id, data id is not intuitive: curl http://127.0.0.1:8091/dataset/bxd-publish/values/8967044.json > 10002-pheno.json
 * [X] Check why Zach/GN JSON file lists different mappable BXDs
-* [ ] Add batch run and some metadata
+* [X] Update DB on run-server
+* [X] Add batch run and some metadata so we can link back from results
 * [ ] Create a DB/table containing hits and old reaper values
 * [ ] Update PublishXRef and store old reaper value(?)
 * [ ] Correctly handle escalating errors
@@ -874,3 +875,63 @@ Luckily this is a perfect match:
 ```
 
 The lmdb file contains the full vector and compresses to 100K. For 13K traits that equals about 1.3GB.
+
+First I wanted to check how Zach's list of mappable inds compares to mine. A quick Ruby REPL exercise shows:
+
+```
+require 'json'
+zach = JSON.parse(File.read('BXD.json'))
+pj = JSON.parse(File.read('BXD.geno.json'))
+s1 = zach["genofile"][0]["sample_list"]
+=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",...
+s2 = pj["samples"]
+=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",...
+s1.size()
+=> 235
+s2.size()
+=> 237
+s2 - s1
+=> ["BXD077xBXD065F1", "BXD065xBXD102F1"]
+```
+
+So it turns out the newer .geno file contains these two F1 inds that are *not* in Zach's BXD.json, and that confused the hell out of my scripts ;). The GN2 webserver probably uses the header of the .geno file to fetch the correct number of inds. The trait page also lists these two, so (I guess) the BXD.json file ought to be updated.
+
+Now that is explained, we are good.
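+
+Until BXD.json is updated, a small guard in the scripts avoids this surprise. A minimal Ruby sketch, reusing the file names from the REPL exercise above (the intersection guard itself is my own suggestion, not existing pipeline code):
+
+```
+require 'json'
+
+# Load both sample lists, as in the REPL exercise
+zach = JSON.parse(File.read('BXD.json'))
+pj = JSON.parse(File.read('BXD.geno.json'))
+s1 = zach["genofile"][0]["sample_list"]
+s2 = pj["samples"]
+
+# Only map the inds both lists agree on; report the stragglers
+shared = s1 & s2
+extra = (s2 - s1) + (s1 - s2)
+warn "Ignoring #{extra.size} unshared inds: #{extra}" unless extra.empty?
+```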
+
+## Running at scale
+
+In the next step we need to batch run GEMMA. Initially we'll run on one server; gemma-wrapper takes care of running each job only once, so we can restart the pipeline at any point (later we'll move to ravanan to run on the cluster). At this point the API uses the data id to return the trait values. I think that is not so intuitive, so I modified the endpoint to give the same results for:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json > 10002-pheno.json
+curl http://127.0.0.1:8091/dataset/bxd-publish/dataid/values/8967044.json > 10002-pheno.json
+```
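+
+A quick way to convince ourselves the two endpoints really agree (a sketch; the URLs are the ones above, the comparison itself is mine):
+
+```
+require 'json'
+require 'open-uri'
+
+# Fetch trait 10002 by trait id and by its data id; the vectors should match
+by_trait = JSON.parse(URI.open('http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json').read)
+by_dataid = JSON.parse(URI.open('http://127.0.0.1:8091/dataset/bxd-publish/dataid/values/8967044.json').read)
+raise 'endpoint mismatch' unless by_trait == by_dataid
+```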
+
+Now that works, we can fetch the list of all BXDPublish datasets from the endpoint I wrote earlier:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/list > bxd-publish.json
+[
+  {
+    "Id": 10001,
+    "PhenotypeId": 4,
+    "DataId": 8967043
+  },
+  {
+    "Id": 10002,
+    "PhenotypeId": 10,
+    "DataId": 8967044
+  },
+  {
+    "Id": 10003,
+    "PhenotypeId": 15,
+    "DataId": 8967045
+  },
+```
+
+So we can use this to create our batch list. There are 13711 datasets listed in this DB. We can use jq to extract all Ids:
+
+```
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+```
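+
+Since one of the todo items above is linking back from results, it is worth keeping the trait-id to data-id mapping from the same JSON around. A minimal sketch (my own bookkeeping, not an existing script):
+
+```
+require 'json'
+
+# Map trait Id -> DataId so results can be linked back later
+list = JSON.parse(File.read('bxd-publish.json'))
+dataid_for = list.to_h { |rec| [rec["Id"], rec["DataId"]] }
+dataid_for[10002]  # => 8967044
+```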
+
+All set to run our first batch! Now we replicate our guix environment for gemma-wrapper, start the gn-guile server and fire up a batch script that pulls the data from the database and runs gemma for every trait.
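+
+A sketch of such a batch script in Ruby (the json-to-bimbam conversion step is a hypothetical helper, and the gemma-wrapper arguments are illustrative rather than the exact invocation):
+
+```
+require 'open-uri'
+
+File.readlines('ids.txt').each do |line|
+  id = line.strip
+  fn = "#{id}-pheno.json"
+  # Pull trait values from the gn-guile endpoint, skipping already fetched traits
+  unless File.exist?(fn)
+    File.write(fn, URI.open("http://127.0.0.1:8091/dataset/bxd-publish/values/#{id}.json").read)
+  end
+  # Convert to BIMBAM and hand off to gemma-wrapper, which caches results,
+  # so a restart only computes what is missing (helper name and flags assumed)
+  system("./json-to-bimbam #{fn} #{id}.pheno.txt") or next
+  system("gemma-wrapper -- -g BXD.geno.bimbam -p #{id}.pheno.txt -gk") or warn "gemma failed for #{id}"
+end
+```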