On GEMMA

author: Pjotr Prins 2025-11-11 11:08:13 +0100
committer: Pjotr Prins 2026-01-05 11:12:11 +0100
commit: d61a0458ada13801642fcfa7242089fe6c4cc021 (patch)
tree: 31c3d949fc92ca7479a8d72035fd9e0ecd705c52 /issues
parent: 64064864eff8641e3259c3dd6bb53ea1b195aaa1 (diff)
download: gn-gemtext-d61a0458ada13801642fcfa7242089fe6c4cc021.tar.gz
1 files changed, 113 insertions, 0 deletions
diff --git a/issues/genetics/speeding-up-gemma.gmi b/issues/genetics/speeding-up-gemma.gmi
new file mode 100644
index 0000000..306735a
--- /dev/null
+++ b/issues/genetics/speeding-up-gemma.gmi
@@ -0,0 +1,113 @@
+# Speeding up GEMMA
+
+GEMMA is slow, but usually fast enough. Earlier I wrote gemma-wrapper to speed things up. In genenetwork.org, by using gemma-wrapper with LOCO, most traits are mapped in a few seconds on a large server (30 individuals x 200K markers). By expanding markers to over 1 million, however, runtimes degrade to 6 minutes. Increasing the number of individuals to 1000 may slow mapping down to hour(s). As we are running 'precompute' on 13K traits - and soon maybe millions - it would be beneficial to reduce runtimes again.
+
+One thing to look at is Sen's bulklmm. It can do phenotypes in parallel, provided there is no missing data. This is perfect for permutations which we'll also do. For multiple phenotypes it is a bit tricky however, because you'll have to mix and match experiments to show the same individuals (read samples).
+
+So the approach is to first analyze steps in GEMMA and see where it is particularly inefficient. Maybe we can do something about that. I note I started the pangemma effort (and mgamma effort before). The idea is to use a propagator network for incremental improvements and also to introduce a new build system and testing framework. In parallel we'll try to scale out on HPC.
+
+Anyway, there is no such thing as a free lunch. So, let's dive in.
+
+# Description
+
+# Tags
+
+* assigned: pjotrp
+* type: feature
+* priority: high
+
+# Tasks
+
+* [X] Try gzipped version
+* [X] Run without debug
+* [ ] Optimize openblas for target architecture
+* [ ] Try a faster malloc library for GEMMA
+* [ ] Use lmdb for genotypes
+* [ ] Other improvements...
+
+# Analysis
+
+As a test case we'll take one of the runs:
+
+```
+time -v /bin/gemma -loco 11 -k /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -o 680029457111fdd460990f95853131c87ea20c57.11.assoc.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -outdir /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/d20251111-588798-f81icw
+```
+
+which I simplify to
+
+```
+/bin/time -v /bin/gemma -loco 11 -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -debug
+Reading Files ...
+number of total individuals = 143
+number of analyzed individuals = 20
+number of total SNPs/var        = 13209385
+number of SNPS for K            = 12376792
+number of SNPS for GWAS         =   832593
+number of analyzed SNPs         = 13111938
+```
+
+The timer says:
+
+```
+User time (seconds): 365.33
+System time (seconds): 16.59
+Percent of CPU this job got: 128%
+Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.01
+Average shared text size (kbytes): 0
+Average unshared data size (kbytes): 0
+Average stack size (kbytes): 0
+Average total size (kbytes): 0
+Maximum resident set size (kbytes): 11073412
+Average resident set size (kbytes): 0
+Major (requiring I/O) page faults: 0
+Minor (reclaiming a frame) page faults: 5756557
+Voluntary context switches: 1365
+Involuntary context switches: 478
+Swaps: 0
+File system inputs: 0
+File system outputs: 143704
+Socket messages sent: 0
+Socket messages received: 0
+Signals delivered: 0
+Page size (bytes): 4096
+Exit status: 0
+```
+
+Uncompressed, the genotype file is 30G. Let's try the gzipped version, which comes in at 9.2G and will be beneficial on a compute cluster anyhow. We know that GEMMA is not the most efficient when it comes to IO, so testing is crucial.
+Critically, the run gets slower:
+
+```
+Percent of CPU this job got: 118%
+Elapsed (wall clock) time (h:mm:ss or m:ss): 7:43.56
+```
+
+The problem is that gzip decompression runs on a single thread in GEMMA.
+
+## Running without debug
+
+Without the debug switch gemma runs at the same speed with 128% CPU, so dropping it won't help much.
+
+## Optimizing GEMMA+OpenBLAS+GSL
+
+Compiling with optimization can be low-hanging fruit - despite the fact that we seem to be IO bound at 128% CPU. Still, aggressive compiler optimizations may make a difference. The current build reads:
+
+```
+GEMMA Version    = 0.98.6 (2022-08-05)
+Build profile    = /gnu/store/8rvid272yb53bgascf5c468z0jhsyflj-profile
+GCC version      = 14.3.0
+GSL Version      = 2.8
+OpenBlas         = OpenBLAS 0.3.30  - OpenBLAS 0.3.30 DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=128
+arch           = Cooperlake
+threads        = 96
+parallel type  = threaded
+```
+
+this uses the gemma-gn2 package in
+
+=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/gemma.scm#n27
+
+which is currently not built with arch optimizations (even though Cooperlake suggests differently). Another potential optimization is to use a fast malloc library. We do, however, already compile with a recent gcc, thanks to Guix. No need to improve on that.
+
+## Use lmdb for genotypes
+
+Rather than focussing on gzip, another potential improvement is to use lmdb with mmap.
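
The mmap idea above can be sketched as follows. This is a minimal sketch, not GEMMA or lmdb code: it uses Python's standard mmap module in place of lmdb (which itself reads through a memory map), and the fixed-width float32 layout, the file name genotypes.bin and the helper names are all hypothetical. It only shows why a mapped binary layout lets us jump to any marker without parsing text or decompressing.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: one record per marker, one float32 per individual.
N_INDIVIDUALS = 4  # toy value; the run above analyzes 20 individuals
RECORD = struct.Struct(f"<{N_INDIVIDUALS}f")


def write_genotypes(path, rows):
    """Write each marker's genotype row as a fixed-width binary record."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(RECORD.pack(*row))


def read_marker(mm, index):
    """Decode marker `index` directly from the mapping at its byte offset."""
    return RECORD.unpack_from(mm, index * RECORD.size)


# Toy genotype rows (made-up values, not taken from the pangenome file).
rows = [
    (0.0, 1.0, 2.0, 1.0),
    (2.0, 2.0, 0.0, 1.0),
    (1.0, 0.0, 0.0, 2.0),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genotypes.bin")
    write_genotypes(path, rows)
    with open(path, "rb") as fh:
        # Map the whole file read-only; pages are loaded only when touched.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n_markers = len(mm) // RECORD.size
            third = read_marker(mm, 2)  # no scan over markers 0 and 1

print(n_markers, third)
```

With fixed-width records the offset of a marker is just its index times the record size, so the kernel pages in only what is touched. A real lmdb store would add keys and transactions on top of that.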