From e58c198e1e5fd8f94cdcaf3c621931ee3a21e2f4 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sat, 4 Jan 2025 02:53:42 -0600
Subject: Added checkpoints for file readers

---
 doc/code/pangemma.md | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
 (limited to 'doc/code')

diff --git a/doc/code/pangemma.md b/doc/code/pangemma.md
index 223e2ae..6640a80 100644
--- a/doc/code/pangemma.md
+++ b/doc/code/pangemma.md
@@ -204,7 +204,7 @@ multiple outputs - in that case we may add filenames. And exits with:
 **** Checkpoint reached: read-geno-file (normal exit)
 ```
 
-# List check points
+## List check points
 
 When you compile PanGEMMA with debug information
 
@@ -216,8 +216,19 @@ and run a computation with the '-debug' switch it should output the check-points
 
 ```
 **** DEBUG: checkpoint read-geno-file passed with ./example/mouse_hs1940.geno.txt.gz in src/gemma_io.cpp at line 874 in ReadFile_geno
+**** DEBUG: checkpoint bimbam-kinship-matrix passed with kinship.txt in src/gemma_io.cpp at line 1598 in BimbamKin
 ```
 
+## Filtering steps
+
+note that both plink and BIMBAM input files have their own kinship computation with some filtering(!).
+Similarly read-geno-file also filters on MAF, for example, and it is well known that the old GEMMA will read the genotype file multiple times for different purposes. With growing geno files this is becoming highly inefficient.
+
+In my new propagator setup these filtering steps should go in their own functions or propagators.
+
+To refactor this at read-geno-file we can start to write out the filtered-genotype file at the checkpoint. That will be our base line 'output'. Next we write an alternative path and make sure the outputs are the same! Sounds easy, no?
+
+
 # Other
 
 ## Example
-- 
cgit v1.2.3
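
Note on the refactoring plan in "Filtering steps": the idea of writing a baseline filtered-genotype output and then checking that an alternative path produces the same result can be sketched as below. This is a minimal illustration in Python, not PanGEMMA code. The function names, the row layout (SNP id plus a list of 0..2 dosages, with `None` for missing) and the 0.05 threshold are all assumptions made for the example.

```python
import hashlib


def maf(dosages):
    """Minor allele frequency from 0..2 dosages; missing (None) values are skipped."""
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    p = sum(called) / (2.0 * len(called))
    return min(p, 1.0 - p)


def filter_baseline(rows, min_maf):
    """Hypothetical stand-in for the filtering currently done inside the reader."""
    return [(snp, g) for snp, g in rows if maf(g) >= min_maf]


def filter_alternative(rows, min_maf):
    """Hypothetical new path: the same filter as its own streaming step."""
    for snp, g in rows:
        if maf(g) >= min_maf:
            yield snp, g


def digest(rows):
    """Hash the serialized rows so two outputs can be compared cheaply."""
    h = hashlib.sha256()
    for snp, g in rows:
        values = ",".join("NA" if d is None else f"{d:g}" for d in g)
        h.update(f"{snp},{values}\n".encode())
    return h.hexdigest()


# Toy data (made up for illustration only).
rows = [
    ("rs1", [0, 1, 2, 1]),     # MAF 0.5
    ("rs2", [0, 0, 0, 0]),     # MAF 0.0, monomorphic
    ("rs3", [2, 2, None, 2]),  # MAF 0.0 once the missing call is skipped
    ("rs4", [0, 0, 1, 0]),     # MAF 0.125
]

baseline = filter_baseline(rows, 0.05)
alternative = list(filter_alternative(rows, 0.05))

# The refactoring is only accepted when both paths agree exactly.
assert digest(baseline) == digest(alternative)
```

Comparing digests rather than full files keeps the check cheap for large genotype files; a byte-for-byte comparison of the written outputs would serve the same purpose.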