PanGEMMA: Genome-wide Efficient Mixed Model Association
This repository is used to rewrite and modernize the original GEMMA tool. The idea is to upgrade the software while keeping it running, using ideas from Hanson and Sussman's book Software Design for Flexibility: How to Avoid Programming Yourself into a Corner. It is a work in progress (WIP). For more information, see PanGEMMA design.
GEMMA is the original software toolkit for fast application of linear mixed models (LMMs) and related models to genome-wide association studies (GWAS) and other large-scale data sets. You can find the original code on GitHub. It may even build from this repo for the time being.
NOTE: As of December 2024, main software development has moved to PanGEMMA! PanGEMMA is essentially a fork of GEMMA that is meant to scale up for pangenomics. We are also taking the opportunity to revamp the code base. GEMMA itself is in maintenance mode.
Check out RELEASE-NOTES.md to see what's new in each release.
- Key features
- Installation
- Precompiled binaries
- Run GEMMA
- Debugging and optimization
- Help
- Citing PanGEMMA
- License
- Optimizing performance
- Building from source
- Input data formats
- Reporting a GEMMA bug or issue
- Check list:
- Code of conduct
- Credits
Key features
- Fast association tests implemented using the univariate linear mixed model (LMM). In GWAS, this can correct for population structure and sample non-exchangeability. It also provides estimates of the proportion of variance in phenotypes explained by available genotypes (PVE), often called "chip heritability" or "SNP heritability".
- Fast association tests for multiple phenotypes implemented using a multivariate linear mixed model (mvLMM). In GWAS, this can correct for population structure and sample non-exchangeability jointly across multiple complex phenotypes.
- Bayesian sparse linear mixed model (BSLMM) for estimating PVE, phenotype prediction, and multi-marker modeling in GWAS.
- Estimation of variance components ("chip/SNP heritability") partitioned by different SNP functional categories from raw (individual-level) data or summary data. For raw data, HE regression or the REML AI algorithm can be used to estimate variance components. For summary data, GEMMA uses the MQS algorithm.
Installation
WIP
Precompiled binaries
WIP
Run GEMMA
PanGEMMA (for now) maintains a version of gemma and may support new features. Run the legacy version with:
gemma -h
# compute Kinship matrix
gemma -g ./example/mouse_hs1940.geno.txt.gz -p ./example/mouse_hs1940.pheno.txt \
-gk -o mouse_hs1940
# run univariate LMM
gemma -g ./example/mouse_hs1940.geno.txt.gz \
-p ./example/mouse_hs1940.pheno.txt -n 1 -a ./example/mouse_hs1940.anno.txt \
-k ./output/mouse_hs1940.cXX.txt -lmm -o mouse_hs1940_CD8_lmm
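The two commands above are linked by a file name: the kinship step is run with `-o mouse_hs1940`, and the LMM step then reads `./output/mouse_hs1940.cXX.txt` through `-k`. The following is a minimal sketch of chaining them in a script, assuming that naming pattern holds for any prefix. `PREFIX`, `kinship_file` and `run` are illustrative names and not part of GEMMA. By default the script only prints the commands; set `RUN=1` to execute them.

```shell
#!/bin/sh
# Sketch: chain the kinship and LMM steps shown above.
# PREFIX, kinship_file and run are hypothetical helpers, not GEMMA features.
PREFIX=mouse_hs1940

# Assumption taken from the commands above: a run with "-gk -o <prefix>"
# leaves the kinship matrix at ./output/<prefix>.cXX.txt
kinship_file() {
    printf './output/%s.cXX.txt\n' "$1"
}

# Print each command; execute it only when RUN=1 is set.
run() {
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

# Step 1: compute the kinship matrix
run gemma -g ./example/mouse_hs1940.geno.txt.gz \
    -p ./example/mouse_hs1940.pheno.txt -gk -o "$PREFIX"

# Step 2: run the univariate LMM with the kinship matrix from step 1
run gemma -g ./example/mouse_hs1940.geno.txt.gz \
    -p ./example/mouse_hs1940.pheno.txt -n 1 \
    -a ./example/mouse_hs1940.anno.txt \
    -k "$(kinship_file "$PREFIX")" -lmm -o "${PREFIX}_CD8_lmm"
```

Deriving the `-k` path from the prefix keeps the two steps in sync when the prefix changes.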
Debugging and optimization
We use Guix for debugging and development. Try something like:
LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib gdb --args ./build/bin/Debug/gemma -g ./example/mouse_hs1940.geno.txt.gz \
-p ./example/mouse_hs1940.pheno.txt -n 1 -a ./example/mouse_hs1940.anno.txt \
-k ./output/mouse_hs1940.cXX.txt -lmm -o mouse_hs1940_CD8_lmm
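The command above relies on `$GUIX_ENVIRONMENT` being set, which is the case inside a Guix shell. Outside of one, the variable is empty and `LD_LIBRARY_PATH` silently becomes `/lib`. A small guard, sketched below, can catch that before gdb starts. `guix_lib_path` is a hypothetical helper name, not something shipped with PanGEMMA.

```shell
# Sketch: print the library path, or fail when not inside a Guix shell.
# guix_lib_path is an illustrative helper, not part of PanGEMMA.
guix_lib_path() {
    if [ -z "${GUIX_ENVIRONMENT:-}" ]; then
        echo "GUIX_ENVIRONMENT is not set; enter the Guix shell first" >&2
        return 1
    fi
    printf '%s/lib\n' "$GUIX_ENVIRONMENT"
}

# Usage (not executed here): start gdb only when the path resolves.
#   lib=$(guix_lib_path) && LD_LIBRARY_PATH=$lib gdb --args ./build/bin/Debug/gemma -h
```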
Citing PanGEMMA
PanGEMMA is not published yet.
But if you use GEMMA for published work, please cite our paper:
- Xiang Zhou and Matthew Stephens (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44, 821–824.
If you use the multivariate linear mixed model (mvLMM) in your research, please cite:
- Xiang Zhou and Matthew Stephens (2014). Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods 11, 407–409.
If you use the Bayesian sparse linear mixed model (BSLMM), please cite:
- Xiang Zhou, Peter Carbonetto and Matthew Stephens (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9, e1003264.
And if you use the variance component estimation based on summary statistics, please cite:
- Xiang Zhou (2016). A unified framework for variance component estimation with summary statistics in genome-wide association studies. Annals of Applied Statistics, in press.
License
Copyright (C) 2012–2025, Pjotr Prins & Xiang Zhou
The PanGEMMA source code is free software: you can redistribute it under the terms of the GNU General Public License. All the files in this project are part of PanGEMMA. This project is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See file LICENSE for the full text of the license.
The original source code for GEMMA that is part of PanGEMMA is distributed under the same GPL license.
The source code for the gzstream zlib wrapper (still) included in GEMMA is distributed under the GNU Lesser General Public License, either version 2.1 of the License, or (at your option) any later version.
Code of conduct
By being part of the PanGEMMA community and communicating with its members, you implicitly agree to abide by the code of conduct as published by the Software Carpentry initiative.
Credits
The PanGEMMA software was developed by:
Pjotr Prins
Dept. of Genetics, Genomics and Informatics
University of Tennessee Health Science Center
The GEMMA software was developed by:
Xiang Zhou
Dept. of Biostatistics
University of Michigan
with (early) contributions from Peter Carbonetto, Tim Flutre, Matthew Stephens, and others.
