# PanGEMMA

We are rewriting and modernizing the much-beloved GEMMA tool that is,
for example, core to GeneNetwork.org.
The idea is to upgrade the software and keep it going, using ideas
from Hanson and Sussman's book *Software Design for Flexibility:
How to Avoid Programming Yourself into a Corner*.
This is not the first attempt; in fact, quite a few efforts have been
started, but none really finished!

We want to keep the heart of GEMMA beating while upgrading the
environment, taking inspiration from fetal heart development: The human
heart is one of the first organs to form and function during
embryogenesis. By the end of gestational week 3, passive oxygen
diffusion becomes insufficient to support metabolism of the developing
embryo, and thus the fetal heart becomes vital for oxygen and nutrient
distribution. The initiation of the first heartbeat via the
*primitive heart tube* begins at gestational day 22, followed by
active fetal blood circulation by the end of week 4. The start of
early heart development involves several types of progenitor cells
that are derived from the mesoderm, proepicardium, and neural crest.
This eventually leads to the formation of the 4-chambered heart by
gestational week 7 via heart looping and complex cellular interactions
in utero (e.g., [Tan and
Lewandowski](https://doi.org/10.1159/000501906)).

What we will do is create components and wire them together, allowing
for sharing RAM between components. Each component may have multiple
implementations. We will introduce a DSL for orchestration and we may
introduce a propagator network to run components in parallel and test
them for correctness. At the same time, the core functionality of
GEMMA will keep going while we swap components in and out. See also
[propagators](https://groups.csail.mit.edu/mac/users/gjs/propagators/)
and [examples](https://github.com/namin/propagators/blob/master/examples/multiple-dwelling.scm).

Propagators are [Unix pipes on steroids](https://github.com/pjotrp/ruby-propagator-network/blob/main/README.md).

We want PanGEMMA to be able to run on high-performance computing (HPC)
architectures, including GPUs. This implies the core project should
have few dependencies and compile easily from C.

# Innovation

* Split functionality into components
* Wire components up so they can be tested and replaced
* New components may run in parallel

## Breaking with the past

The original gemma source base is considered stable and will be
maintained - mostly to prevent bit rot. See
https://github.com/genetics-statistics/GEMMA. To move forward we
forked GEMMA as pangemma so we can break with the past.

Even so, pangemma is supposed to be able to run the same steps as the
original gemma - and hopefully improve on them.

## Adapting the old gemma code

For running the old gemma code we need to break up the code base into
pieces that can run in a propagator network (see below).
Initially we can use the file system to pass state around. That will
break up the implicit global state that gemma carries right now, which
makes working with the code rather tricky.
Note that making gemma more efficient is not a goal here, because
people can still use the old stable code base.
Essentially we are going to add flags to the binary that will run
gemma partially by exporting and importing intermediate state.
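
As a rough sketch, using the `-checkpoint` switch that is introduced
further below, such a partial run could look something like this (the
flags other than `-checkpoint` are standard GEMMA options; the exact
interface may still change):

```
# run old gemma only up to the genotype-reading step and stop there
./bin/gemma -g ./example/mouse_hs1940.geno.txt.gz \
            -p ./example/mouse_hs1940.pheno.txt \
            -gk -o partial_run -checkpoint read-geno-file
```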

Later, when things work in a propagator network, we can create
alternatives that pass state around in memory.

# A simple propagator network

We will create cells that hold basic computations and functions. We
won't do a full propagator setup, though we may do a full
implementation later.
For now we use a network of cells - essentially a dependency graph of
computations. Cells can tell other cells that they require them, which
allows for alternate paths. E.g., to create a kinship matrix:

```
(define-cell genotypes)
(define-cell kinship-matrix (e:kinship genotypes))
(run)
(content kinship-matrix)
```

Essentially e:kinship gets run when genotypes become available. It is
kinda reversed programming. Now say we want to add an input and a
filter:


```
(define-cell bimbam-genofile)
(define-cell input-genotypes (e:read-bimbam bimbam-genofile))
(define-cell genotypes (e:freq-filter input-genotypes))
```

Now you can see some logic building up to get from a file to genotypes.
Next we can add a different file format:

```
(define-cell plink-genofile)
(define-cell input-genotypes (e:read-plink plink-genofile))
```

and we have created another 'route' to get to the kinship matrix.

```
(add-content bimbam-genofile "test.bimbam")
(run)
```

runs one path and

```
(add-content plink-genofile "test.plink")
(add-content bimbam-genofile "test.bimbam")
(run)
```

will run both.

This simple example shows how complex execution logic can be
described without (1) predicting how people will use the software and
(2) writing complex if/then statements.

Why does this matter for GEMMA? It will allow us to run old parts of
GEMMA as part of the network in addition to new parts - and
potentially validate them. It also allows us to create multiple
implementations, such as for CPU and GPU, that can run in parallel and
validate each other's outcomes.

## Create the first cells

Let's start with default GEMMA behaviour and create the first cells to
get to exporting the kinship-matrix above.

```
(content kinship-matrix)
```

We'll break this down by inspecting the cells one level at a time. The
genofile cells contain a file name (a string).
When the file is read we will capture the state in a file and carry
the file name of the new file forward.
In this setup cells simply represent file names (for state).
This allows us to read and write plain files in the C code.
Later, when we wire up new propagators, we may carry this state in
memory. The whole point of using this approach is that we really don't
have to care!

Our first implementation of the cell will therefore just contain a
string referring to a file.
Cells can be works-in-progress and incrementally improved.
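
To make this concrete, a first file-backed cell in Guile could be as
small as the sketch below. The names (`make-cell`, `cell-content`,
`add-content!`) are illustrative, not a final API:

```
;; Minimal sketch of a file-backed cell: it is either empty ('nothing)
;; or holds a string naming the file on disk that captures the state.
(define (make-cell) (list 'nothing))
(define (cell-content cell) (car cell))
(define (cell-empty? cell) (eq? (cell-content cell) 'nothing))
(define (add-content! cell filename) (set-car! cell filename))
```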

## Create propagators

A propagator contains a list of input cells and an output cell, so we
wire up the graph by adding inputs to propagators.
Every propagator also has state: it may be idle, computing, or done.
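
A minimal Guile record for this could look like the following sketch
(illustrative only):

```
(use-modules (srfi srfi-9))   ; define-record-type

;; A propagator knows its input cells, its output cell, the function
;; to run, and its own state: 'idle, 'computing or 'done.
(define-record-type <propagator>
  (make-propagator inputs output fn state)
  propagator?
  (inputs propagator-inputs)
  (output propagator-output)
  (fn     propagator-fn)
  (state  propagator-state set-propagator-state!))

;; Wiring the kinship example from above: the graph is built simply by
;; listing the input cells, e.g.
;;   (make-propagator (list genotypes) kinship-matrix compute-kinship 'idle)
```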

## The runner

The runner visits the list of propagators and checks whether the
inputs are complete and whether they have changed. On detecting a
change, it triggers a computation to update the output cell.
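
Building on the sketches above, a naive round-robin runner could look
like this (change detection is left out; each propagator fires once,
as soon as all its inputs are filled):

```
(use-modules (srfi srfi-1))   ; any, fold

(define (ready? prop)
  (and (eq? (propagator-state prop) 'idle)
       (not (any cell-empty? (propagator-inputs prop)))))

;; Keep visiting all propagators until a full pass makes no progress.
(define (run-network propagators)
  (let loop ()
    (define progress
      (fold (lambda (prop acc)
              (if (ready? prop)
                  (begin
                    (set-propagator-state! prop 'computing)
                    (add-content! (propagator-output prop)
                                  (apply (propagator-fn prop)
                                         (map cell-content
                                              (propagator-inputs prop))))
                    (set-propagator-state! prop 'done)
                    #t)
                  acc))
            #f propagators))
    (when progress (loop))))
```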

## Setting check points in GEMMA

GEMMA is quite stateful in its original design. We want to break the
work up into chunks by setting 'check points'. For example, the actual
kinship multiplication could start at a 'start-compute-kinship' check
point and end at a 'compute-kinship' check point. To not have to deal
with state too much, we can simply let GEMMA run from the start of the
program until 'compute-kinship' to get a kinship propagator. The output
will be a kinship file. Similarly, we can run until 'filter-genotypes',
which is a step earlier. The output of these propagators can be used by
other PanGEMMA propagators as input for comparison and continuation.
All the original GEMMA does here is act as a reference for alternative
implementations of these chunks. Speed is not a concern, though down
the line there may be opportunities to start computing from some of
these check points (using intermediate output).
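
As a sketch of how such a chunk could be wrapped, a propagator body
can shell out to the original gemma and hand the resulting file name
to its output cell. The helper below is hypothetical; the
`-checkpoint` switch is introduced in the next section, the other
flags are standard gemma options, and the output location is an
assumption to be matched to whatever file the check point actually
leaves behind:

```
;; Hypothetical Guile helper: run old gemma up to a named check point
;; and return the file it leaves behind, so that the output cell can
;; hold the file name (state lives on disk, as described above).
(define (run-gemma-until checkpoint genofile phenofile outprefix)
  (system* "./bin/gemma"
           "-g" genofile "-p" phenofile "-gk"
           "-checkpoint" checkpoint
           "-o" outprefix)
  ;; assumption: default GEMMA output location for the kinship matrix
  (string-append "output/" outprefix ".cXX.txt"))
```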

So, let's start with a first check point implementation for 'read-bimbam-file'.

## read-bimbam-file

Reading the BIMBAM genotype file happens in the `ReadFile_geno`
function in `gemma_io.cpp`. Of course all it does is read a file, so
the 'output' of this step is the input file itself. But just for the
sake of a simple pilot we'll add the check point at the end of the
function, where it will exit GEMMA.
We'll add a CLI switch `-checkpoint read-geno-file` which will force the exit.

```C++
checkpoint("read-geno-file",file_geno);
```

It passes in the output file (the input file in this case), which is
used to feed the calling propagator. Some outputs may consist of
multiple files - in that case we can pass multiple file names. GEMMA
then exits with:

```
**** Checkpoint reached: read-geno-file (normal exit)
```

## List check points

When you compile PanGEMMA with debug information

```
make debug
```

and run a computation with the `-debug` switch, it should output the check points it passes. You should see something like

```
**** DEBUG: checkpoint read-geno-file passed with ./example/mouse_hs1940.geno.txt.gz in src/gemma_io.cpp at line 874 in ReadFile_geno
**** DEBUG: checkpoint bimbam-kinship-matrix passed with kinship.txt in src/gemma_io.cpp at line 1598 in BimbamKin
```
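
A run that passes these two check points could look something like
this, assuming the example data that ships with GEMMA (exact flags may
differ for your analysis):

```
./bin/gemma -g ./example/mouse_hs1940.geno.txt.gz \
            -p ./example/mouse_hs1940.pheno.txt \
            -gk -debug
```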

## Filtering steps

Note that both PLINK and BIMBAM input files have their own kinship computation, with some filtering(!).
Similarly, read-geno-file also filters, for example on MAF, and it is well known that the old GEMMA will read the genotype file multiple times for different purposes. With growing geno files this is becoming highly inefficient.

In my new propagator setup these filtering steps should go into their own functions or propagators.

To refactor this at read-geno-file we can start by writing out the filtered genotype file at the checkpoint. That will be our baseline 'output'. Next we write an alternative path and make sure the outputs are the same! Sounds easy, no?
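
In propagator terms the comparison could be wired up along these lines
(the e:-style names below are placeholders for propagators we still
have to write):

```
;; Baseline: old gemma run up to the read-geno-file check point, which
;; writes out its filtered genotype file.
(define-cell baseline-filtered (e:gemma-until-checkpoint bimbam-genofile))

;; Alternative path: separate read and MAF-filter propagators.
(define-cell raw-genotypes (e:read-bimbam bimbam-genofile))
(define-cell new-filtered  (e:maf-filter raw-genotypes))

;; A comparison propagator that checks that both routes agree.
(define-cell outputs-agree (e:same-genotypes? baseline-filtered new-filtered))
```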


# Other

## Example

I created a very minimalistic example in Ruby with a simple
round-robin scheduler:

=> https://github.com/pjotrp/ruby-propagator-network/blob/main/propagator.rb

## Running tasks in parallel

In principle propnets make it trivially easy to run tasks in parallel.
When its inputs are complete a propagator goes into the 'computing'
state and a process can be run (forked or on PBS) that executes a
command. The output file can be picked up to make sure the propagator
completes and switches to the 'done' state. Note that actors and
goblins would be perfect parallel companions to propagators, with the
potential advantage of keeping results in RAM.

We are not going there right now, but it is one of the important
reasons to build this setup.
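
Even so, a rough sketch of the fork-based variant in Guile, sticking
with the record sketched earlier (the `propagator-command` and
`propagator-outfile` accessors are hypothetical):

```
;; Launch a ready propagator as a child process that runs its command.
(define (launch! prop)
  (set-propagator-state! prop 'computing)
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin                                  ; child: run the command
          (apply system* (propagator-command prop))
          (primitive-exit 0))
        pid)))                                  ; parent: keep the pid

;; When the child has finished and the expected output file exists,
;; the propagator switches to 'done.
(define (reap! prop pid)
  (waitpid pid)
  (if (file-exists? (propagator-outfile prop))
      (set-propagator-state! prop 'done)
      (set-propagator-state! prop 'idle)))      ; or flag an error
```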

## Why Guile and not Ruby or Python?

The above example in Ruby is rather nice and we can build on that initially.
Ruby's multithreading capabilities (and Python's), however, are
hampered by the layout of the interpreters and by modules that cannot
be re-entered. Even with the slow disappearance of the global interpreter
lock (GIL) these languages don't do great in that area. You can work
around it by using copy-on-write processes (like I did with the really
fast bioruby-vcf), but that gets clunky fast.