author: Pjotr Prins, 2023-12-24 15:33:38 +0100
committer: Pjotr Prins, 2023-12-24 15:33:38 +0100
commit: de78d7e63e47fa08dd2f5d1b9e2fb8fac47e4104
tree: b38a08d6e9e7dc5c9bb13d734d5a52a81230faa9 /topics/systems
parent: 9ed1495c1aedbea61f86da3437ec0b6093ef832c
GRM
Diffstat (limited to 'topics/systems')
-rw-r--r-- topics/systems/mariadb/precompute-mapping-input-data.gmi | 4
1 files changed, 4 insertions, 0 deletions
diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi
index e0e5962..7896c87 100644
--- a/topics/systems/mariadb/precompute-mapping-input-data.gmi
+++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi
@@ -1003,6 +1003,10 @@ That is a bit insane if you know the input is 300K, even knowing disk space is c
So, this is the right time to put gemma-wrapper on a diet. The GRM files are the largest. Currently we create kinship files for every population subset that is used, and that may change once we simply reduce the final GRM by removing cols/rows. But that is something we want to prove first using our precompute exercise. In this case we will simply compress the kinship files, and that halves the size with zip. xz compression brings it down to 1/4. That is impressive by itself. I also checked lzma and bzip2 and they were no better. So, with gemma-wrapper we can now store the GRMs in an xz archive. For the assoc files we will concatenate them into one file and compress that too, reducing the size to 1/7th. As noted above, the current cache size for GN is 190GB for 3 months. We can reduce that significantly and that will speed up lookups. Decompression with xz is very fast.
+# Storing GRM output
+
+gemma-wrapper stores per-chromosome GRMs in separate files. The first fix was to store them in an xz archive. gemma-wrapper already uses a temporary directory, so that should be straightforward.
+
# Storing assoc output
To kick off precompute we added new nodes to the Octopus cluster, doubling its capacity. In the next step we have to compress the output of GEMMA so we can keep it forever. For this we want to have the peaks (obviously), but we also want to retain the 'shape' of the distribution - i.e., the QTL with sign. This shape we can use for correlations and potentially some AI-style mining. This is the way it is presented in AraQTL.
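
As an illustration of the GRM storage step added in this patch, here is a minimal sketch of packing per-chromosome GRM files from a temporary directory into one xz-compressed tar archive and reading a single file back. It is not gemma-wrapper's own code: the function names `archive_grms` and `read_grm`, the `*.cXX.txt` file pattern and the archive name are assumptions made for the example, and only the Python standard library is used.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_grms(grm_dir: Path, archive: Path) -> list[str]:
    """Pack every *.cXX.txt file in grm_dir into one xz-compressed tar archive.

    The file pattern is an assumption for this sketch; adjust it to the
    names the kinship step actually writes.
    """
    names = []
    with tarfile.open(archive, "w:xz") as tar:
        for path in sorted(grm_dir.glob("*.cXX.txt")):
            # Store only the base name so the archive is independent of
            # the (temporary) directory the files were written to.
            tar.add(path, arcname=path.name)
            names.append(path.name)
    return names


def read_grm(archive: Path, name: str) -> str:
    """Return the text of one GRM file stored in the xz archive."""
    with tarfile.open(archive, "r:xz") as tar:
        member = tar.extractfile(name)
        if member is None:
            raise KeyError(f"{name} is not a regular file in {archive}")
        return member.read().decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        grm_dir = Path(tmp) / "grm"
        grm_dir.mkdir()
        # Two tiny made-up kinship matrices stand in for real GRM output.
        (grm_dir / "chr1.cXX.txt").write_bytes(b"1.0\t0.5\n0.5\t1.0\n")
        (grm_dir / "chr2.cXX.txt").write_bytes(b"1.0\t0.2\n0.2\t1.0\n")
        archive = Path(tmp) / "grm.tar.xz"
        print(archive_grms(grm_dir, archive))
        print(read_grm(archive, "chr2.cXX.txt"))
```

Writing the archive outside the GRM directory keeps it from being picked up by the file pattern, and storing base names means a lookup only needs the archive path and the file name.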