Diffstat (limited to 'doc/code/pangemma.md')
-rw-r--r-- | doc/code/pangemma.md | 21 |
1 files changed, 21 insertions, 0 deletions
diff --git a/doc/code/pangemma.md b/doc/code/pangemma.md
index 6640a80..1c3fed9 100644
--- a/doc/code/pangemma.md
+++ b/doc/code/pangemma.md
@@ -224,9 +224,30 @@ and run a computation with the '-debug' switch it should output the check-points
 note that both plink and BIMBAM input files have their own kinship computation with some filtering(!). Similarly read-geno-file also filters on MAF, for example, and it is well known that the old GEMMA will read the genotype file multiple times for different purposes. With growing geno files this is becoming highly inefficient.
+Not all is bad though, MAF filtering happens in two places -- in geno and bed file readers -- so
+
 In my new propagator setup these filtering steps should go in their own functions or propagators.
+The important thing to note here is that we are *not* going to change the original GEMMA code. We are adding a parallel path to the old code and both can be run at *any* time.
+I deal with legacy software all the time and one typical issue is that we'll update the old software and everything looks fine (according to our test suite). And then when the software is out there complaints come in.
+Typically edge cases, but still. What happens is that we ask people to use older versions of the software and they lose some of the new facilities. Also we 'git bisect' and remove the offending code. In other words, we are back at where we were.
+
+Not great. To refactor this at read-geno-file we can start to write out the filtered-genotype file at the checkpoint. That will be our baseline 'output'. Next we write an alternative path and make sure the outputs are the same! Sounds easy, no?
+It should be with a propagator infrastructure. The fun part is that both paths can run at the same time (for a while), even in production, so we pick up any conflicts.
+
+## Adding some propagator plumbing
+
+I am highly resistant to writing C++ code. I have probably written most of the lines of code in my life in C++, so I should know ;)
+As we are still in the proof-of-concept stage I am going to use Ruby for the first plumbing (I do like that language, despite its warts) and maybe throw some scheme in for writing the new lmdb support. One fun side of propagators is that I don't need to worry too much about languages as people can trivially update any propagator in any way they like. We'll replace the Ruby propnet engine with something that can handle parallelism, e.g., when we want to start using actors to run computations.
+
+Alright, to duplicate the above filtering routine we are going to add an alternative path for GEMMA's read-geno-file, splitting the two tasks into two steps:
+
+    read-geno-file (with filtering) -> list-of-filtered-genotypes
+    rb-read-geno-file -> rb-filter-geno-file -> ''
+
+'read-geno-file' translates to GEMMA's read-geno-file and rb-read-geno-file is the alternate path. list-of-filtered-genotypes will be a file that we can compare between the two paths.
+These steps should be so simple that anyone can replace them with, say, a python version. So, what can go wrong?
 # Other
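
The alternate path described in the diff above (rb-read-geno-file -> rb-filter-geno-file -> list-of-filtered-genotypes) could be sketched in Ruby roughly as follows. This is a minimal illustration, not the code from the commit. The method names, the record layout, the `compare_paths` helper and the 0.01 MAF cut-off are all assumptions made for the example. It also assumes a BIMBAM-style mean genotype line (SNP id, two alleles, then one dosage in 0..2 per individual, with `NA` for missing) and models only the MAF filter, while the legacy reader may apply further filters.

```ruby
# Hypothetical sketch: names mirror the steps in the text, not GEMMA's code.
MAF_THRESHOLD = 0.01 # assumed cut-off for this example

# rb-read-geno-file: parse lines into records, with no filtering at all.
def rb_read_geno_file(lines)
  lines.map do |line|
    id, allele1, allele2, *values = line.strip.split(/[,\s]+/)
    genotypes = values.map { |v| v == "NA" ? nil : Float(v) }
    { id: id, alleles: [allele1, allele2], genotypes: genotypes }
  end
end

# Minor allele frequency from dosages in 0..2, ignoring missing values.
def maf(genotypes)
  observed = genotypes.compact
  return 0.0 if observed.empty?

  freq = observed.sum / (2.0 * observed.size)
  [freq, 1.0 - freq].min
end

# rb-filter-geno-file: a separate step that only decides what to keep.
def rb_filter_geno_file(records, threshold: MAF_THRESHOLD)
  records.select { |record| maf(record[:genotypes]) >= threshold }
end

# list-of-filtered-genotypes: the artefact that both paths have to agree on.
def list_of_filtered_genotypes(records)
  records.map { |record| record[:id] }
end

# Report SNP ids that only one of the two paths kept.
def compare_paths(legacy_ids, new_ids)
  { only_legacy: legacy_ids - new_ids, only_new: new_ids - legacy_ids }
end
```

In use, the legacy path would write its list at the checkpoint, and `compare_paths` would be run against `list_of_filtered_genotypes(rb_filter_geno_file(rb_read_geno_file(lines)))`. Any non-empty entry in the result marks a SNP where the two paths disagree. Keeping reading and filtering as separate methods is what lets either step be swapped for a version in another language.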