# PanGEMMA

We are rewriting and modernizing the much beloved GEMMA tool that is, for example, core to GeneNetwork.org.
The idea is to upgrade the software while keeping it running, using ideas from Hanson and Sussman's book on *Software Design for Flexibility: How to Avoid Programming Yourself into a Corner*.
This is not the first attempt. In fact, quite a few efforts have been started, but none really finished!

We want to keep the tool's heart beating while upgrading the environment, taking inspiration from fetal heart development. The human heart is one of the first organs to form and function during embryogenesis. By the end of gestational week 3, passive oxygen diffusion becomes insufficient to support the metabolism of the developing embryo, and thus the fetal heart becomes vital for oxygen and nutrient distribution. The first heart beat, via the *primitive heart tube*, begins at gestational day 22, followed by active fetal blood circulation by the end of week 4. Early heart development involves several types of progenitor cells that are derived from the mesoderm, proepicardium, and neural crest. This eventually leads to the formation of the 4-chambered heart by gestational week 7 via heart looping and complex cellular interactions in utero (e.g., [Tan and Lewandowski](https://doi.org/10.1159/000501906)).

What we will do is create components and wire them together, allowing for sharing RAM between components. Each component may have multiple implementations. We will introduce a DSL for orchestration and we may introduce a propagator network to run components in parallel and test them for correctness. At the same time, the core functionality of GEMMA will keep going while we swap components in and out. See also [propagators](https://groups.csail.mit.edu/mac/users/gjs/propagators/) and [examples](https://github.com/namin/propagators/blob/master/examples/multiple-dwelling.scm).

We want PanGEMMA to be able to run on high performance computing (HPC) architectures, including GPU targets. This implies the core project should have few dependencies and should compile easily from C.

# Innovation

* Split functionality into components
* Wire components up so they can be tested and replaced
* New components may run in parallel

## Breaking with the past

The original gemma source base is considered stable and will be maintained - mostly to prevent bit rot. See https://github.com/genetics-statistics/GEMMA. To move forward we forked pangemma to be able to break with the past.

Even so, pangemma is supposed to be able to run the same steps as the original gemma, and hopefully improve on them.

## Adapting the old gemma code

For running the old gemma code we need to break up the code base into pieces to run in a propagator network (see below).
Initially we can use the file system to pass state around. That will break up the implicit global state that gemma carries right now, which makes working with the code rather tricky.
Note that we don't have the goal of making gemma more efficient because people can still use the old stable code base.
Essentially we are going to add flags to the binary that will run gemma partially by exporting and importing intermediate state.

Later, when things work in a propagator network, we can create alternatives that pass state around in memory.

# A simple propagator network

We will create cells that hold basic computations. We won't do a full propagator setup, though we may do a full implementation later.
For now we use a network of cells - essentially a dependency graph of computation. Cells can tell other cells that they require them and that allows for alternate paths. E.g. to create a kinship matrix:

```
(define-cell genotypes)
(define-cell kinship-matrix (e:kinship genotypes))
(run)
(content kinship-matrix)
```

Essentially, e:kinship gets run when genotypes are available. It is kind of reversed programming. Now say we want to add an input and a filter:


```
(define-cell bimbam-genofile)
(define-cell input-genotypes (e:read-bimbam bimbam-genofile))
(define-cell genotypes (e:freq-filter input-genotypes))
```

Now you can see some logic building up to get from file to genotypes. Next we can add a different file format:

```
(define-cell plink-genofile)
(define-cell input-genotypes (e:read-plink plink-genofile))
```

and we have created another 'route' to get to the kinship matrix.

```
(add-content bimbam-genofile "test.bimbam")
(run)
```

runs one path and

```
(add-content plink-genofile "test.plink")
(add-content bimbam-genofile "test.bimbam")
(run)
```

will run both.

This simple example shows how complex running logic can be described simply, without (1) having to predict how people will use the software and (2) writing complex if/then statements.
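
The routes above can be mimicked in a few lines of plain Python. This is only a rough sketch of the idea and not the pangemma implementation: the `Cell` class, the `run` function and the reader stubs are hypothetical names made up for illustration, and the rule that the first route to finish fills the cell is an assumption (a full propagator setup would merge or cross-check the contents instead).

```python
# Sketch of alternate routes into one cell. All names are hypothetical.

class Cell:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.routes = []          # alternatives: (function, input cells)

    def add_route(self, func, *inputs):
        self.routes.append((func, inputs))


def run(cells):
    """Fire each route once all its inputs are filled; return a log of fired routes."""
    fired, log = set(), []
    progress = True
    while progress:
        progress = False
        for cell in cells:
            for idx, (func, inputs) in enumerate(cell.routes):
                key = (cell.name, idx)
                if key in fired or any(c.value is None for c in inputs):
                    continue
                result = func(*(c.value for c in inputs))
                fired.add(key)
                log.append(f"{func.__name__} -> {cell.name}")
                if cell.value is None:   # assumption: first finished route wins
                    cell.value = result
                progress = True
    return log


# Stubs standing in for e:read-bimbam, e:read-plink and e:freq-filter.
def read_bimbam(path):
    return f"genotypes from {path}"

def read_plink(path):
    return f"genotypes from {path}"

def freq_filter(genotypes):
    return f"filtered {genotypes}"


def build_network():
    names = ("bimbam-genofile", "plink-genofile", "input-genotypes", "genotypes")
    cells = {name: Cell(name) for name in names}
    cells["input-genotypes"].add_route(read_bimbam, cells["bimbam-genofile"])
    cells["input-genotypes"].add_route(read_plink, cells["plink-genofile"])
    cells["genotypes"].add_route(freq_filter, cells["input-genotypes"])
    return cells


# Only the bimbam file is known: one path runs.
one = build_network()
one["bimbam-genofile"].value = "test.bimbam"
one_path = run(one.values())

# Both files are known: both readers run.
both = build_network()
both["plink-genofile"].value = "test.plink"
both["bimbam-genofile"].value = "test.bimbam"
both_paths = run(both.values())

print(one_path)
print(both_paths)
```

Note that there is no branching on file type anywhere: which readers run follows purely from which cells have content.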

Why does this matter for gemma? It will allow us to run old parts of gemma as part of the network in addition to new parts - and potentially validate them. It also allows us to create multiple implementations, such as for CPU and GPU, that can run in parallel and validate each other's outcomes.
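
As a hedged illustration of that validation idea, the Python sketch below runs two interchangeable implementations of the same step concurrently and checks that they agree within a tolerance. Both kinship functions are toy stand-ins (a plain `G * G^T / markers` product, not GEMMA's actual kinship computation), and `validated` is a hypothetical helper, not an existing pangemma function.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def kinship_reference(genotypes):
    """Stand-in for the existing code path: explicit loops over all pairs."""
    n, m = len(genotypes), len(genotypes[0])
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for s in range(m):
                k[i][j] += genotypes[i][s] * genotypes[j][s]
            k[i][j] /= m
    return k


def kinship_alternative(genotypes):
    """Stand-in for a new component (say, a GPU port): exploits symmetry."""
    n, m = len(genotypes), len(genotypes[0])
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(genotypes[i], genotypes[j]))
            k[i][j] = k[j][i] = dot / m
    return k


def validated(implementations, data, rel_tol=1e-9):
    """Run all implementations concurrently and raise if any two disagree."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda f: f(data), implementations))
    first = results[0]
    for other in results[1:]:
        for row_a, row_b in zip(first, other):
            for a, b in zip(row_a, row_b):
                if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12):
                    raise ValueError("implementations disagree")
    return first


# Three individuals (rows), three markers (columns).
genotypes = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
k = validated([kinship_reference, kinship_alternative], genotypes)
print(k)
```

Comparing with a tolerance rather than exact equality is a deliberate choice here, since CPU and GPU code paths can differ in floating point rounding.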

## Create the first cells

Let's start with default GEMMA behaviour and create the first cells to get to exporting the kinship-matrix above.

```
(content kinship-matrix)
```

We'll break this down to inspecting individual cells afterwards. The genofile cells contain a file name (string).
When the file gets read we will capture the state in a file and carry the file name of the new file forward.
In this setup cells simply represent file names (for state). This allows us to simply read and write files in the C code.
Later, when we wire up new propagators we may carry state in memory. The whole point of using this approach is that we really don't have to care!
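
To make the "cells are file names" idea concrete, here is a small Python sketch. It is an illustration under assumptions, not the pangemma C code: the function names, the output file names and the toy comma-separated layout (marker name followed by one genotype per individual) are all hypothetical, and the filter simply drops markers where every individual has the same genotype.

```python
import os
import tempfile

workdir = tempfile.mkdtemp(prefix="pangemma-sketch-")


def read_genofile(genofile):
    """Parse the toy genotype file, write the state to a new file, return its name."""
    out = os.path.join(workdir, "input-genotypes.txt")
    with open(genofile) as src, open(out, "w") as dst:
        for line in src:
            fields = line.strip().split(",")
            # hypothetical layout: marker name, then one genotype per individual
            dst.write(" ".join(fields[1:]) + "\n")
    return out


def drop_monomorphic(statefile):
    """Toy filter: keep only markers that vary between individuals."""
    out = os.path.join(workdir, "genotypes.txt")
    with open(statefile) as src, open(out, "w") as dst:
        for line in src:
            if len(set(line.split())) > 1:
                dst.write(line)
    return out


# Create a tiny input file so the sketch is self-contained.
genofile = os.path.join(workdir, "test-geno.csv")
with open(genofile, "w") as f:
    f.write("rs1,0,1,2\nrs2,1,1,1\nrs3,2,0,1\n")

# Each "cell" holds nothing but the name of the file carrying its state.
input_genotypes = read_genofile(genofile)
genotypes = drop_monomorphic(input_genotypes)

with open(genotypes) as f:
    kept = f.read().splitlines()
print(kept)
```

Because every step only takes and returns a file name, a step could later be swapped for one that keeps its state in memory without the wiring around it having to change.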