doc: merged edits by Rob W. Williams

author: Pjotr Prins 2025-01-04 00:22:55 -0600
committer: Pjotr Prins 2025-01-04 00:22:55 -0600
commit: dee2dee77ecc8575f747951d94e4cc96192b3fa2 (patch)
tree: b55f8c169e714a34454e0b7de9fc3526142f167d /doc
parent: efbeb0d120cec56ce1d6edbcf1674e68695fdb43 (diff)
download: pangemma-dee2dee77ecc8575f747951d94e4cc96192b3fa2.tar.gz
1 files changed, 126 insertions, 39 deletions
diff --git a/doc/code/pangemma.md b/doc/code/pangemma.md
index fa3b984..618dd1d 100644
--- a/doc/code/pangemma.md
+++ b/doc/code/pangemma.md
@@ -1,16 +1,43 @@
 # PanGEMMA
 
-We are rewriting and modernizing the much beloved GEMMA tool that is, for example, core to GeneNetwork.org.
-The idea it to upgrade the software, but keeping it going using ideas from Hanson and Sussman's book on *Software Design for Flexibility: How to Avoid Programming Yourself into a Corner*.
-This is not the first attempt, in fact quite a few efforts have started, but none really finished!
-
-We want to keep the tool heart beating while upgrading the environment taking inspiration from fetal heart development: The human heart is one of the first organs to form and function during embryogenesis. By the end of gestational week 3, passive oxygen diffusion becomes insufficient to support metabolism of the developing embryo, and thus the fetal heart becomes vital for oxygen and nutrient distribution. The initiation of the first heart beat via the *primitive heart tube* begins at gestational day 22, followed by active fetal blood circulation by the end of week 4. The start of early heart development involves several types of progenitor cells that are derived from the mesoderm, proepicardium, and neural crest. This eventually leads to the formation of the 4-chambered heart by gestational week 7 via heart looping and complex cellular interactions in utero (e.g., [Tan and Lewandowski](https://doi.org/10.1159/000501906)).
-
-What we will do is create components and wire them together, allowing for sharing RAM between components. Each component may have multiple implementations. We will introduce a DSL for orchestration and we may introduce a propagator network to run components in parallel and test them for correctness. At the same time, the core functionality of GEMMA will keep going while we swap components in and out. See also [propagators](https://groups.csail.mit.edu/mac/users/gjs/propagators/) and [examples](https://github.com/namin/propagators/blob/master/examples/multiple-dwelling.scm).
+We are rewriting and modernizing the much beloved GEMMA tool that is,
+for example, core to GeneNetwork.org.
+The idea is to upgrade the software, and keep it going using ideas
+from Hanson and Sussman's book on *Software Design for Flexibility:
+How to Avoid Programming Yourself into a Corner*.
+This is not the first attempt, in fact quite a few efforts have
+started, but none really finished!
+
+We want to keep the heart of GEMMA beating while upgrading the
+environment taking inspiration from fetal heart development: The human
+heart is one of the first organs to form and function during
+embryogenesis. By the end of gestational week 3, passive oxygen
+diffusion becomes insufficient to support metabolism of the developing
+embryo, and thus the fetal heart becomes vital for oxygen and nutrient
+distribution. The initiation of the first heart beat via the
+*primitive heart tube* begins at gestational day 22, followed by
+active fetal blood circulation by the end of week 4. The start of
+early heart development involves several types of progenitor cells
+that are derived from the mesoderm, proepicardium, and neural crest.
+This eventually leads to the formation of the 4-chambered heart by
+gestational week 7 via heart looping and complex cellular interactions
+in utero (e.g., [Tan and
+Lewandowski](https://doi.org/10.1159/000501906)).
+
+What we will do is create components and wire them together, allowing
+for sharing RAM between components. Each component may have multiple
+implementations. We will introduce a DSL for orchestration and we may
+introduce a propagator network to run components in parallel and test
+them for correctness. At the same time, the core functionality of
+GEMMA will keep going while we swap components in and out. See also
+[propagators](https://groups.csail.mit.edu/mac/users/gjs/propagators/)
+and [examples](https://github.com/namin/propagators/blob/master/examples/multiple-dwelling.scm).
 
 Propagators are Unix pipes on steroids.
 
-We want PanGEMMA to be able to run on high peformance computing (HPC) architectures, including GPU targets. This implies the core project can have few dependencies and should easily compile from C.
+We want PanGEMMA to be able to run on high-performance computing (HPC)
+architectures, including GPUs. This implies the core project can have
+few dependencies and should easily compile from C.
 
 # Innovation
 
@@ -20,23 +47,37 @@ We want PanGEMMA to be able to run on high peformance computing (HPC) architectu
 
 ## Breaking with the past
 
-The original gemma source base is considered stable and will be maintained - mostly to prevent bit rot. See https://github.com/genetics-statistics/GEMMA. To move forward we forked pangemma to be able to break with the past.
+The original gemma source base is considered stable and will be
+maintained - mostly to prevent bit rot. See
+https://github.com/genetics-statistics/GEMMA. To move forward we
+forked pangemma to be able to break with the past.
 
-Even so, pangemma is supposed to be able to run the same steps as the original gemma. And hopefully improve things.
+Even so, pangemma is supposed to be able to run the same steps as the
+original gemma. And hopefully improve things.
 
 ## Adapting the old gemma code
 
-For running the old gemma code we need to break up the code base into pieces to run in a propagator network (see below).
-Initially we can use the file system to pass state around. That will break up the implicit global state that gemma carries right now and makes working with the code rather tricky.
-Note that we don't have the goal of making gemma more efficient because people can still use the old stable code base.
-Essentially we are going to add flags to the binary that will run gemma partially by exporting and importing intermediate state.
+For running the old gemma code we need to break up the code base into
+pieces to run in a propagator network (see below).
+Initially we can use the file system to pass state around. That will
+break up the implicit global state that gemma carries right now and
+makes working with the code rather tricky.
+Note that we don't have the goal of making gemma more efficient
+because people can still use the old stable code base.
+Essentially we are going to add flags to the binary that will run
+gemma partially by exporting and importing intermediate state.
 
-Later, when things work in a propagator network, we can create alternatives that pass state around in memory.
+Later, when things work in a propagator network, we can create
+alternatives that pass state around in memory.
 
 # A simple propagator network
 
-We will create cells that hold basic computations. We won't do a full propagator setup, though we may do a full implementation later.
-For now we use a network of cells - essentially a dependency graph of computation. Cells can tell other cells that they require them and that allows for alternate paths. E.g. to create a kinship matrix:
+We will create cells that hold basic computations and functions. We
+won't do a full propagator setup, though we may do a full
+implementation later.
+For now we use a network of cells - essentially a dependency graph of
+computations. Cells can tell other cells that they require them and
+that allows for alternate paths. E.g. to create a kinship matrix:
 
 ```
 (define-cell genotypes)
@@ -45,7 +86,9 @@ For now we use a network of cells - essentially a dependency graph of computatio
 (content kinship-matrix)
 ```
 
-essentially e:kinship gets run when genotypes are available. It is kinda reversed programming. Now say we want to add an input and a filter:
+essentially e:kinship gets run when genotypes are available. It is
+kinda reversed programming. Now say we want to add an input and a
+filter:
 
 
 ```
@@ -54,7 +97,8 @@ essentially e:kinship gets run when genotypes are available. It is kinda reverse
 (define-cell genotypes (e:freq-filter input-genotypes))
 ```
 
-now you can see some logic building up to get from file to genotypes. Next we can add a different file format:
+now you can see some logic building up to get from file to genotypes.
+Next we can add a different file format:
 
 ```
 (define-cell plink-genofile)
@@ -78,52 +122,83 @@ runs one path and
 
 will run both.
 
-This simple example shows how simple complex running logic can be described without (1) predicting how people will use the software and (2) no complex if/then statements.
+This simple example shows how complex running logic can be
+described without (1) predicting how people will use the software and
+(2) no complex if/then statements.
 
-Why does this matter for gemma? It will allow us to run old parts of gemma as part of the network in addition to new parts - and potentially validate them. It also allows us to create multiple implementations, such as for CPU and GPU, that can run in parallel and validate each other's outcomes.
+Why does this matter for GEMMA? It will allow us to run old parts of
+GEMMA as part of the network in addition to new parts - and
+potentially validate them. It also allows us to create multiple
+implementations, such as for CPU and GPU, that can run in parallel and
+validate each other's outcomes.
 
 ## Create the first cells
 
-Let's start with default GEMMA behaviour and create the first cells to get to exporting the kinship-matrix above.
+Let's start with default GEMMA behaviour and create the first cells to
+get to exporting the kinship-matrix above.
 
 ```
 (content kinship-matrix)
 ```
-
-We'll break down to inspecting cell levels after. The genofile cells contain a file name (string).
-When the file gets read we will capture the state in a file and carry the file name of the new file forward.
+We'll break this down by inspecting cell levels next. The genofile
+cells contain a file name (string).
+When the file is read we will capture the state in a file and carry
+the file name of the new file forward.
 In this setup cells simply represent file names (for state).
 This allows us to simply read and write files in the C code.
-Later, when we wire up new propagators we may carry state in memory. The whole point of using this approach is that we really don't have to care!
+Later, when we wire up new propagators we may carry this state in
+memory. The whole point of using this approach is that we really don't
+have to care!
 
-Our first simple implementation of the cell will simply contain a string referring to a file.
-Cells can be work in progress and incrementally improved.
+Our first simple implementation of the cell will simply contain a
+string referring to a file.
+Cells can be works-in-progress and incrementally improved.
 
 ## Create propagators
 
-A propagator contains a list of inputs and an output cell. So we wire up the graph by adding inputs to propagators.
-Every propagator has state (too). I.e. it may be idle, computing and done.
+A propagator contains a list of inputs and an output cell. So we wire
+up the graph by adding inputs to propagators.
+Every propagator has state (too), that is, it may be idle, computing, or done.
 
 ## The runner
 
-The runner visits the list of propagators and checks wether the inputs are complete and whether they have changed. On change computation has to happen updating the output cell.
+The runner visits the list of propagators and checks whether the
+inputs are complete and whether they have changed. On detecting a
+change, a computation has to be triggered to update the output cell.
 
 ## Setting check points in GEMMA
 
-GEMMA is quite stateful in its original design. We want to break the work up into chunks setting 'check points'. For example the actual kinship multiplication could start as 'start-compute-kinship' and end with 'compute-kinship' check points. To not have to deal with state too much we can simply let gemma run from the start of the program until 'compute-kinship' to have a kinship-propagator. The output will be a kinship file. Similarly we can run until 'filter-genotypes' that is a step earlier. The output of these propagators can be used by other pangemma propagators as input for comparison and continuation. All the original GEMMA does is act as a reference for alternative implementation of these chunks. Speed is not a concern though there may be opportunities to start compute after some of these check points (using intermediate output) down the line.
+GEMMA is quite stateful in its original design. We want to break the
+work up into chunks setting 'check points'. For example the actual
+kinship multiplication could start as 'start-compute-kinship' and end
+with 'compute-kinship' check points. To not have to deal with state
+too much we can simply let GEMMA run from the start of the program
+until 'compute-kinship' to have a kinship-propagator. The output will
+be a kinship file. Similarly we can run until 'filter-genotypes' that
+is a step earlier. The output of these propagators can be used by
+other PanGEMMA propagators as input for comparison and continuation.
+All the original GEMMA does is act as a reference for alternative
+implementation of these chunks. Speed is not a concern though there
+may be opportunities to start compute after some of these check-points
+(using intermediate output) down the line.
 
 So, let's start with a first check point implementation for 'read-bimbam-file'.
 
 ## read-bimbam-file
 
-Reading the bimbam file happens in the `ReadFile_bim' function in `gemma_io.cpp'. Of course all it does is read a file - which is the same as any output. But just for the sake of a simple pilot we'll add the check point at the end of the function that will exit GEMMA.
+Reading the bimbam file happens in the `ReadFile_bim' function in
+`gemma_io.cpp'. Of course all it does is read a file - which is the
+same as any output. But just for the sake of a simple pilot we'll add
+the check point at the end of the function that will exit GEMMA.
 We'll add a CLI switch `-checkpoint read-geno-file' which will force the exit.
 
 ```C++
 checkpoint("read-geno-file",file_geno);
 ```
 
-It passes in the outputfile (the infile in this case), that is used to feed the calling propagator. Some of the outfiles may be composed of multiple outputs - in that case we may add filenames. And exits with:
+It passes in the outputfile (the infile in this case), that is used to
+feed the calling propagator. Some of the outfiles may be composed of
+multiple outputs - in that case we may add filenames. And exits with:
 
 ```
 **** Checkpoint reached: read-geno-file (normal exit)
@@ -133,18 +208,30 @@ It passes in the outputfile (the infile in this case), that is used to feed the
 
 ## Example
 
-I created a very minimalistic example in Ruby with a simple round robin scheduler:
+I created a very minimalistic example in Ruby with a simple
+round-robin scheduler:
 
 => https://github.com/pjotrp/ruby-propagator-network/blob/main/propagator.rb
 
 ## Running tasks in parallel
 
-In principle propnets make it trivially easy to run tasks in parallel. When inputs are complete the propagator goes into a
-'compute' state and a process can be run (forked or on PBS) that executes a command. The output file can be picked up to make sure the propagator completes and switches do the 'done' state. Note that actors and goblins would be perfect parallel companions to propagators with the potential advantage of keeping results in RAM.
+In principle propnets make it trivially easy to run tasks in parallel.
+When inputs are complete the propagator goes into a
+'compute' state and a process can be run (forked or on PBS) that
+executes a command. The output file can be picked up to make sure the
+propagator completes and switches to the 'done' state. Note that
+actors and goblins would be perfect parallel companions to propagators
+with the potential advantage of keeping results in RAM.
 
-We are not going there right now, but it is one of the important reasons to build this setup.
+We are not going there right now, but it is one of the important
+reasons to build this setup.
 
 ## Why guile and not ruby or python?
 
 Above example in Ruby is rather nice and we can build on that initially.
-Ruby's multithreading capabilities (and that of python), however, are hampered by the layout of the interpreters and modules that can not be re-entered. Even with the slow disappearance of the global interpreter lock (GIL) these languages don't do great in that area. You can work around it by using copy-on-write processes (like I did with the really fast bioruby-vcf), but that gets clunky fast.
+Ruby's multithreading capabilities (and that of Python), however, are
+hampered by the layout of the interpreters and modules that can not be
+re-entered. Even with the slow disappearance of the global interpreter
+lock (GIL) these languages don't do great in that area. You can work
+around it by using copy-on-write processes (like I did with the really
+fast bioruby-vcf), but that gets clunky fast.
author	Pjotr Prins	2025-01-04 00:22:55 -0600
committer	Pjotr Prins	2025-01-04 00:22:55 -0600
commit	dee2dee77ecc8575f747951d94e4cc96192b3fa2 (patch)
tree	b55f8c169e714a34454e0b7de9fc3526142f167d /doc
parent	efbeb0d120cec56ce1d6edbcf1674e68695fdb43 (diff)
download	pangemma-dee2dee77ecc8575f747951d94e4cc96192b3fa2.tar.gz