# PanGEMMA: Genome-wide Efficient Mixed Model Association

This repository is used to rewrite and modernize the original GEMMA tool. The idea is to upgrade the software while keeping it running, using ideas from Hanson and Sussman's book *Software Design for Flexibility: How to Avoid Programming Yourself into a Corner*. It is a work in progress (WIP). For more information, see [PanGEMMA design](./doc/code/pangemma.md).

GEMMA is the original software toolkit for fast application of linear mixed models (LMMs) and related models to genome-wide association studies (GWAS) and other large-scale data sets. You can find the original code on [GitHub](https://github.com/genetics-statistics/GEMMA).

Check out [RELEASE-NOTES.md](./RELEASE-NOTES.md) to see what's new in each release.

* [Key features](#key-features)
* [Installation](#installation)
  * [Precompiled binaries](#precompiled-binaries)
* [Run GEMMA](#run-gemma)
  * [Debugging and optimization](#debugging-and-optimization)
* [Help](#help)
* [Citing PanGEMMA](#citing-pangemma)
* [License](#license)
* [Optimizing performance](#optimizing-performance)
* [Building from source](#building-from-source)
* [Input data formats](#input-data-formats)
* [Contributing code, reporting a PanGEMMA bug or issue](#contributing-code-reporting-a-pangemma-bug-or-issue)
* [Code of conduct](#code-of-conduct)
* [Credits](#credits)

## Key features

1. Fast association tests implemented using the univariate linear mixed model (LMM). In GWAS, this can correct for population structure and sample non-exchangeability. It also provides estimates of the proportion of variance in phenotypes explained by available genotypes (PVE), often called "chip heritability" or "SNP heritability".

2. Fast association tests for multiple phenotypes implemented using a multivariate linear mixed model (mvLMM). In GWAS, this can correct for population structure and sample non-exchangeability jointly across multiple complex phenotypes.

3. Bayesian sparse linear mixed model (BSLMM) for estimating PVE, phenotype prediction, and multi-marker modeling in GWAS.

4. Estimation of variance components ("chip/SNP heritability") partitioned by different SNP functional categories, from raw (individual-level) data or summary data. For raw data, either HE regression or the REML AI algorithm can be used to estimate variance components. For summary data, GEMMA uses the MQS algorithm.
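
For reference, the univariate LMM behind feature 1 is usually written as below. This summarizes the model as given in Zhou and Stephens (2012) and the GEMMA manual; it is not a description of the PanGEMMA internals.

$$
\mathbf{y} = \mathbf{W}\boldsymbol{\alpha} + \mathbf{x}\beta + \mathbf{u} + \boldsymbol{\epsilon},
\qquad
\mathbf{u} \sim \mathrm{MVN}_n(\mathbf{0},\, \lambda \tau^{-1} \mathbf{K}),
\qquad
\boldsymbol{\epsilon} \sim \mathrm{MVN}_n(\mathbf{0},\, \tau^{-1} \mathbf{I}_n)
$$

Here $\mathbf{y}$ is the vector of phenotypes for $n$ individuals, $\mathbf{W}$ is the matrix of covariates (including a column of ones) with coefficients $\boldsymbol{\alpha}$, $\mathbf{x}$ is the vector of genotypes at the marker being tested with effect size $\beta$, $\mathbf{K}$ is the relatedness (kinship) matrix, $\tau^{-1}$ is the variance of the residual errors, and $\lambda$ is the ratio between the two variance components. The association test at each marker is a test of $H_0: \beta = 0$.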

## Installation

WIP

### Precompiled binaries

WIP

## Run GEMMA

GEMMA is run from the command line. To list the available options, run

```sh
gemma -h
```

A typical example would be

```sh
# compute the kinship matrix; -gk writes it to ./output/mouse_hs1940.cXX.txt
gemma -g ./example/mouse_hs1940.geno.txt.gz -p ./example/mouse_hs1940.pheno.txt \
    -gk -o mouse_hs1940
# run the univariate LMM on the first phenotype column (-n 1), using that kinship matrix (-k)
gemma -g ./example/mouse_hs1940.geno.txt.gz \
    -p ./example/mouse_hs1940.pheno.txt -n 1 -a ./example/mouse_hs1940.anno.txt \
    -k ./output/mouse_hs1940.cXX.txt -lmm -o mouse_hs1940_CD8_lmm
```

The example files used above are in the git repository and can also be downloaded from
[GitHub](https://github.com/genetics-statistics/GEMMA/tree/master/example).
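
After the LMM run you will usually want to look at the strongest associations. The helper below is a minimal sketch, not part of GEMMA: `top_hits` is a hypothetical function name, and it assumes the results are a tab-separated table with a header row containing a `p_wald` column (as the GEMMA manual describes for the default Wald test), written to a file such as `./output/mouse_hs1940_CD8_lmm.assoc.txt`. It also assumes a `sort` that supports `-g` (GNU and BSD do).

```sh
# top_hits FILE [N]: print the header plus the N rows with the smallest p_wald.
# Assumption: FILE is tab-separated and its header row names a "p_wald" column.
top_hits() {
    file=$1
    n=${2:-10}
    # Look up the p_wald column index from the header instead of hard-coding it.
    col=$(head -n 1 "$file" | tr '\t' '\n' | grep -n -x 'p_wald' | cut -d: -f1)
    if [ -z "$col" ]; then
        echo "top_hits: no p_wald column in $file" >&2
        return 1
    fi
    head -n 1 "$file"
    # -g compares general numeric values, so 1.2e-08 sorts before 0.01.
    tail -n +2 "$file" | sort -t "$(printf '\t')" -k "${col},${col}g" | head -n "$n"
}

# Usage (the file name is an assumption, see above):
#   top_hits ./output/mouse_hs1940_CD8_lmm.assoc.txt 20
```

Looking the column up by name keeps the helper working if the set of output columns differs between test types or versions.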

### Debugging and optimization

GEMMA has a wide range of debugging options, which are listed in the help output:

```
 DEBUG OPTIONS

 -check                   enable checks (slower)
 -no-fpe-check            disable hardware floating point checking
 -strict                  strict mode will stop when there is a problem
 -silence                 silent terminal display
 -debug                   debug output
 -debug-data              debug data output
 -debug-dump              -debug-data, but store the data to files (grep write() calls for messages/names)
 -nind       [num]        read up to num individuals
 -issue      [num]        enable tests relevant to issue tracker
 -legacy                  run gemma in legacy mode
```

Typically, when running GEMMA you should use `-debug`, which includes
relevant checks. When compiled for debugging, the debug version of
GEMMA gives more information.
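
As an illustration, the univariate LMM run from the previous section can be repeated with `-debug` added. This is a sketch under stated assumptions: `lmm_debug` and the `GEMMA` variable are hypothetical conveniences introduced here (the variable only names the binary to run), and the kinship matrix from the earlier `-gk` step is assumed to exist already.

```sh
# GEMMA is a hypothetical convenience variable naming the binary to run;
# it defaults to "gemma" on the PATH.
GEMMA=${GEMMA:-gemma}

# lmm_debug PREFIX: univariate LMM on the example data with -debug enabled.
# Assumes ./output/mouse_hs1940.cXX.txt was created by the -gk step.
lmm_debug() {
    "$GEMMA" -g ./example/mouse_hs1940.geno.txt.gz \
        -p ./example/mouse_hs1940.pheno.txt -n 1 \
        -a ./example/mouse_hs1940.anno.txt \
        -k ./output/mouse_hs1940.cXX.txt \
        -lmm -debug -o "$1"
}

# Usage: lmm_debug mouse_hs1940_CD8_lmm_debug
```

Setting `GEMMA=echo` before calling the function prints the arguments instead of running the analysis, which is a quick way to check the command line.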

For performance, note that `-check` enables extra checks and is slower, while
`-no-fpe-check` disables hardware floating point checking. Also check
the build optimization notes in [INSTALL.md](INSTALL.md).

## Help

+ [The GEMMA manual](doc/manual.pdf).

+ [Detailed example with HS mouse data](example/demo.txt).

+ [Tutorial on GEMMA for genome-wide association
analysis](https://github.com/rcc-uchicago/genetic-data-analysis-2).

## Citing PanGEMMA

PanGEMMA is not published yet.

But if you use GEMMA for published work, please cite our paper:

+ Xiang Zhou and Matthew Stephens (2012). [Genome-wide efficient
mixed-model analysis for association studies.](http://doi.org/10.1038/ng.2310)
*Nature Genetics* **44**, 821–824.

If you use the multivariate linear mixed model (mvLMM) in your
research, please cite:

+ Xiang Zhou and Matthew Stephens (2014). [Efficient multivariate linear
mixed model algorithms for genome-wide association
studies.](http://doi.org/10.1038/nmeth.2848)
*Nature Methods* **11**, 407–409.

If you use the Bayesian sparse linear mixed model (BSLMM), please cite:

+ Xiang Zhou, Peter Carbonetto and Matthew Stephens (2013). [Polygenic
modeling with bayesian sparse linear mixed
models.](http://doi.org/10.1371/journal.pgen.1003264) *PLoS Genetics*
**9**, e1003264.

And if you use the variance component estimation based on summary
statistics, please cite:

+ Xiang Zhou (2016). [A unified framework for variance component
estimation with summary statistics in genome-wide association
studies.](https://doi.org/10.1101/042846) *Annals of Applied Statistics*, in press.

## License

PanGEMMA Copyright (C) 2012–2025, Pjotr Prins, Xiang Zhou and others (see the source file headers and git log).

The *PanGEMMA* and *GEMMA* source code repository is free software: you can redistribute it under the terms of the [GNU General Public License](http://www.gnu.org/licenses/gpl.html). All the files in this project are part of *GEMMA*. This project is distributed in the hope that it will be useful, but **without any warranty**; without even the implied warranty of **merchantability or fitness for a particular purpose**. See file [LICENSE](LICENSE) for the full text of the license.

Both the source code for the [gzstream zlib wrapper](http://www.cs.unc.edu/Research/compgeom/gzstream/) and [shUnit2](https://github.com/genenetwork/shunit2) unit testing framework included in GEMMA are distributed under the [GNU Lesser General Public License](contrib/shunit2-2.0.3/doc/LGPL-2.1), either version 2.1 of the License, or (at your option) any later revision.

The source code for the included [Catch](http://catch-lib.net) unit testing framework is distributed under the [Boost Software License version 1](https://github.com/philsquared/Catch/blob/master/LICENSE.txt).

## Optimizing performance

Precompiled binaries and libraries may not be optimal for your particular hardware. See [INSTALL.md](INSTALL.md) for tips on speeding up GEMMA.

## Building from source

More information on source code, dependencies and installation can be found in [INSTALL.md](INSTALL.md).

## Input data formats

## Contributing code, reporting a PanGEMMA bug or issue

WIP

## Code of conduct

By using GEMMA and communicating with its community you implicitly agree to abide by the [code of conduct](https://software-carpentry.org/conduct/) as published by the Software Carpentry initiative.

## Credits

The *PanGEMMA* software was developed by:

[Pjotr Prins](http://thebird.nl/)<br>
Dept. of Genetics, Genomics and Informatics<br>
University of Tennessee Health Science Center<br>

The *GEMMA* software was developed by:

[Xiang Zhou](http://www.xzlab.org)<br>
Dept. of Biostatistics<br>
University of Michigan<br>

with contributions from Peter Carbonetto, Tim Flutre, Matthew Stephens,
and [others](https://github.com/genetics-statistics/GEMMA/graphs/contributors).