On GEMMA

author: Pjotr Prins 2025-11-11 11:08:13 +0100
committer: Pjotr Prins 2026-01-05 11:12:11 +0100
commit: d61a0458ada13801642fcfa7242089fe6c4cc021 (patch)
tree: 31c3d949fc92ca7479a8d72035fd9e0ecd705c52
parent: 64064864eff8641e3259c3dd6bb53ea1b195aaa1 (diff)
download: gn-gemtext-d61a0458ada13801642fcfa7242089fe6c4cc021.tar.gz
2 files changed, 171 insertions, 4 deletions
diff --git a/issues/genetics/speeding-up-gemma.gmi b/issues/genetics/speeding-up-gemma.gmi
new file mode 100644
index 0000000..306735a
--- /dev/null
+++ b/issues/genetics/speeding-up-gemma.gmi
@@ -0,0 +1,113 @@
+# Speeding up GEMMA
+
+GEMMA is slow, but usually fast enough. Earlier I wrote gemma-wrapper to speed things up. In genenetwork.org, by using gemma-wrapper with LOCO, most traits are mapped in a few seconds on a a large server (30 individuals x 200K markers). By expanding makers to over 1 million, however, runtimes degrade to 6 minutes. Increasing the number of individuals to 1000 may slow mapping down to hour(s). As we are running 'precompute' on 13K traits - and soon maybe millions - it would be beneficial to reduce runtimes again.
+
+One thing to look at is Sen's bulklmm. It can do phenotypes in parallel, provided there is no missing data. This is perfect for permutations which we'll also do. For multiple phenotypes it is a bit tricky however, because you'll have to mix and match experiments to show the same individuals (read samples).
+
+So the approach is to first analyze steps in GEMMA and see where it is particularly inefficient. Maybe we can do something about that. I note I started the pangemma effort (and mgamma effort before). The idea is to use a propagator network for incremental improvements and also to introduce a new build system and testing framework. In parallel we'll try to scale out on HPC.
+
+Anyway, there is no such thing as a free lunch. So, let's dive in.
+
+# Description
+
+# Tags
+
+* assigned: pjotrp
+* type: feature
+* priority: high
+
+# Tasks
+
+* [X] Try gzipped version
+* [X] Run without debug
+* [ ] Optimize openblas for target architecture
+* [ ] Try a faster malloc library for GEMMA
+* [ ] Use lmdb for genotypes
+* [ ] Other improvements...
+
+# Analysis
+
+As a test case we'll take on of the runs:
+
+```
+time -v /bin/gemma -loco 11 -k /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -o 680029457111fdd460990f95853131c87ea20c57.11.assoc.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -outdir /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/d20251111-588798-f81icw
+```
+
+which I simplify to
+
+```
+/bin/time -v /bin/gemma -loco 11 -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -debug
+Reading Files ...
+number of total individuals = 143
+number of analyzed individuals = 20
+number of total SNPs/var        = 13209385
+number of SNPS for K            = 12376792
+number of SNPS for GWAS         =   832593
+number of analyzed SNPs         = 13111938
+```
+
+The timer says:
+
+```
+User time (seconds): 365.33
+System time (seconds): 16.59
+Percent of CPU this job got: 128%
+Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.01
+Average shared text size (kbytes): 0
+Average unshared data size (kbytes): 0
+Average stack size (kbytes): 0
+Average total size (kbytes): 0
+Maximum resident set size (kbytes): 11073412
+Average resident set size (kbytes): 0
+Major (requiring I/O) page faults: 0
+Minor (reclaiming a frame) page faults: 5756557
+Voluntary context switches: 1365
+Involuntary context switches: 478
+Swaps: 0
+File system inputs: 0
+File system outputs: 143704
+Socket messages sent: 0
+Socket messages received: 0
+Signals delivered: 0
+Page size (bytes): 4096
+Exit status: 0
+```
+
+The genotype file is unzipped at 30G. Let's try running the gzipped version (which will be beneficial on a compute cluster anyhow) which comes in at 9.2G. We know that Gemma is not the most efficient when it comes to IO. So testing is crucial.
+Critically the run gets slower:
+
+```
+Percent of CPU this job got: 118%
+Elapsed (wall clock) time (h:mm:ss or m:ss): 7:43.56
+```
+
+The problem is that unzip runs on a single thread in GEMMA.
+
+## Running without debug
+
+Without the debug swith gemma runs at the same speed with 128% CPU. That won't help much.
+
+## Optimizing GEMMA+OpenBLAS+GSL
+
+Compiling with optimization can be low hanging fruit - despite the fact that we seem to be IO bound at 128% CPU. Still, aggressive compiler optimizations may make a difference. The current build reads:
+
+```
+GEMMA Version    = 0.98.6 (2022-08-05)
+Build profile    = /gnu/store/8rvid272yb53bgascf5c468z0jhsyflj-profile
+GCC version      = 14.3.0
+GSL Version      = 2.8
+OpenBlas         = OpenBLAS 0.3.30  - OpenBLAS 0.3.30 DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=128
+arch           = Cooperlake
+threads        = 96
+parallel type  = threaded
+```
+
+this uses the gemma-gn2 package in
+
+=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/gemma.scm#n27
+
+which is currently not built with arch optimizations (even though Cooperlake suggests differently). Another potential optimization is to use a fast malloc library. We do, however, already compile with a recent gcc, thanks to Guix. No need to improve on that.
+
+## Use lmdb for genotypes
+
+Rather than focussing on gzip, another potential improvement is to use lmdb with mmap.
diff --git a/topics/genetics/test-pangenome-derived-genotypes.gmi b/topics/genetics/test-pangenome-derived-genotypes.gmi
index 6fb936f..b4256bb 100644
--- a/topics/genetics/test-pangenome-derived-genotypes.gmi
+++ b/topics/genetics/test-pangenome-derived-genotypes.gmi
@@ -9,9 +9,10 @@ For the BXD we have 23M markers(!) whereof 8M *not* on the reference genome.
 
 # Tasks
 
-* [ ] Reintroduce nodes that were not annotated for position
+* [ ] Use ravanan/CWL to push to Octopus
+* [ ] Reintroduce nodes that were not annotated for position (Flavia)
 * [ ] GWA plotter
-* [ ] Speed up IO for GEMMA
+* [ ] Speed up IO for GEMMA by using lmdb for genotypes and marker file
 * [ ] Reduce GEMMA GRM RAM requirements (not urgent)
 * [ ] Fix -lmm 4 ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt
 
@@ -249,7 +250,7 @@ Now we have confirmation and all the pieces we can run the same set with gemma-w
 
 ## gemma-wrapper
 
-The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix.
+The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix. I replaced that by using our pfff hashing for larger files.
 
 ```
 /bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json
@@ -266,4 +267,57 @@ gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores.
 
 Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one.
 
-After some fixes we now have loco in an lmdb output. The mdb file comes in at 693Mb. That will make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (4.0) made the size 17Mb. That makes it 215Gb total. Which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now down to 7Mb. That'll total some 100Gb for 13K traits.
+After some fixes we now have loco in an lmdb output. The mdb file comes in at 693Mb. That will make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (4.0) made the size 17Mb. That makes it 215Gb total. Which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now down to 7Mb and 3.2Mb compressed. That'll total less than 100Gb for 13K traits. Good.
+
+## Final hookup
+
+Now gemma-wrapper works (and test results are confirmed) we have to wire it up to fetch traits from the DB. We also have to make sure the trait values align with the individuals in the genotype file. Earlier I was running the script gemma-batch-run.sh:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-batch-run.sh
+
+which looks like:
+
+```
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+# ---- Compute GRM
+./bin/gemma-wrapper --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+# ---- For all entries run LMM
+for id in 'cat ids.txt' ; do
+  echo Precomputing $id
+  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
+    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+    ./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+  fi
+done
+```
+
+We already have ids.txt and the GRM. What is required is the trait values from the DB. What we need to do is run gn-guile somewhere with access to the DB. Also I need to make sure the current gemma-wrapper tar-balls up the result.
+
+OK, we are running. Looks like the smaller datasets only use 11Gb RES RAM per chromosome. Which means we can run two computes in parallel on this machine.
+
+The first run came through! I forgot the --reduce flag, so it came as 190Mb. I'll fix that. 34 individuals ran in 7 minutes.
+We are currently runnings at a trait in 6 min. We can double that on this machine.
+
+The following puzzles me a bit
+
+```
+## number of analyzed individuals = 31
+## number of covariates = 1
+## number of phenotypes = 1
+## leave one chromosome out (LOCO) =       14
+## number of total SNPs/var        = 13209385
+## number of SNPS for K            = 12322657
+## number of SNPS for GWAS         =   886728
+## number of analyzed SNPs         = 13122153
+```
+
+why is the number of SNPs for GWAS low? Perhaps a threshold of 10% for maf is a bit stringent ;).
+
+Anyway, we are running traits and the first 500 we'll use for analysis.
+
+Meanwhile I'll look at deploying on octopus and maybe speeding up GEMMA. See
+
+=> issues/genetics/speeding-up-gemma
author	Pjotr Prins	2025-11-11 11:08:13 +0100
committer	Pjotr Prins	2026-01-05 11:12:11 +0100
commit	d61a0458ada13801642fcfa7242089fe6c4cc021 (patch)
tree	31c3d949fc92ca7479a8d72035fd9e0ecd705c52
parent	64064864eff8641e3259c3dd6bb53ea1b195aaa1 (diff)
download	gn-gemtext-d61a0458ada13801642fcfa7242089fe6c4cc021.tar.gz