From f4c00ec1e03932bdf610ed41a652090b30ea3bf2 Mon Sep 17 00:00:00 2001 From: Pjotr Prins Date: Wed, 22 Nov 2023 18:18:47 +0100 Subject: Precompute progress --- .../mariadb/precompute-mapping-input-data.gmi | 107 +++++++++++++++++++++ 1 file changed, 107 insertions(+) (limited to 'topics/systems') diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi index 9ea8b3e..85a11ee 100644 --- a/topics/systems/mariadb/precompute-mapping-input-data.gmi +++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi @@ -813,6 +813,113 @@ If it is not a match we'll need to iterate to the next one. The current version => https://git.genenetwork.org/gn-guile/tree/scripts/precompute/precompute-hits.scm +For GEMMA we need to write a phenotype file which is simply a vector of values. Meanwhile, the genotype file and SNP files for the BXD are fixed (currently version 8 with 21.056 markers). + +To run GEMMA we can take an example from `gemma-wrapper`, a tool I wrote to cache results: + +=> https://github.com/genetics-statistics/gemma-wrapper + +gemma-wrapper does a bit more work than we really need and is a bit too generic for this precompute exercise. The following things matter: + +* handling of hashes to prevent recompute +* handling of the kinship matrix (K) +* handling of leave one chromosome out (LOCO) + +all of which are handled by gemma-wrapper really well. We'll use gemma-wrapper for the time being to do the actual computations. + +BTW the cache for GN production is currently at 190G with 40K files. That is 3 months worth of caching GEMMA as we clean up older files! For precompute we may need to be smarter because that only represents 970 association results and 400 kinship matrices. We should improve GEMMA by: + +* by default only keep the required results +* provide a more compact storage of results + +We should be using something like lmdb as that will speed up lookups and GEMMA itself. Simple compression will reduce output 5x for one 7Mb output. Some K matrices are huge too. The largest one runs at 0.5 Gb. GEMMA does crash once in a while, unsurprisingly. + +For the BXD we need to compute at least 100K traits - I have not done a proper count yet - and we can delete output files when the database gets updated. We should, however, hang on to hits that are potentially significant, so I want to retain output, or create a new mysql table. + +## Run gemma-wrapper + +We can run a local version of gemma-wrapper with + +``` +.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper ruby -- ruby /gemma-wrapper/bin/gemma-wrapper +``` + +this is useful because we want to change gemma-wrapper itself. Mostly to produce less bulky output and add some metadata for downstream processing (i.e. to update the database itself). +Note how versatile `guix shell` is for development! + +gemma-wrapper shows the following output: + +``` +NOTE: gemma-wrapper is soon to be replaced +Using GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020 + +Found 0.98.3, comparing against expected v0.98.4 +GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020 +WARNING: GEMMA version is out of date. Update GEMMA to 0.98.4! +``` + +Some irony there. We are going to replace gemma-wrapper at some point. And the gemma that is packaged in guix by default it older that what we want to use. Though the release notes point out mostly maintenance updates: + +=> https://github.com/genetics-statistics/GEMMA/releases + +We can pull in a local gemma too using the same `--expose` trick above. So, let's use that for now: + +``` +.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper --expose=/home/wrk/iwrk/opensource/code/genetics/gemma/=/gemma ruby -- /gemma/bin/gemma +``` + +This does not work out of the box because I typically build gemma on a different Guix profile, so it won't find the libraries (RPATH is fixed). There are two options, build a static version of gemma or override the LD_LIBRARY_PATH. The first used to work, but Guix no longer carries static libs for openblas, so I would need to provide those myself. Overriding LD_LIBRARY_PATH is a bit inconvenient, but in this case doable. We have to run from inside the container and set: + +``` +env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib/ /gemma/bin/gemma + +GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022 + + type ./gemma -h [num] for detailed help + options: + 1: quick guide + 2: file I/O related + 3: SNP QC + ... +``` + +and success. With gemma wrapper we can pick up the latest gemma by setting the GEMMA_COMMAND + +``` +(system "env GEMMA_COMMAND=/gemma/bin/gemma /gemma-wrapper/bin/gemma-wrapper") +gemma-wrapper 0.99.6 (Ruby 3.2.2) by Pjotr Prins 2017-2022 + +NOTE: gemma-wrapper is soon to be replaced +Using GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022 +``` + +All set for actual compute! + +## Running gemma + +The following command fails because we don't have a pheno file yet: + +``` +gemma-wrapper --debug -- -gk -g BXD.8_geno.txt.gz -p pheno.txt -a BXD.8_snps.txt +``` + +Now the pheno file needs to match the number of genotypes in the geno file which, presumably, is generated from the file `BXD.8.gene`. It has the header: + +``` +BXD1 BXD2 BXD5 BXD6 BXD8 BXD9 BXD11 BXD12 BXD13 BXD14 BXD15 BXD16 BXD18 BXD19 BXD20 BXD21 BXD22 BXD23 BXD24 BXD24a BXD25 BXD27 BXD28 BXD29 BXD30 BXD31 BXD32 BXD33 BXD34 BXD35 BXD36 BXD37 BXD38 BXD39 BXD40 BXD41 BXD42 BXD43 BXD44 BXD45 BXD48 BXD48a BXD49 BXD50 BXD51 BXD52 BXD53 BXD54 BXD55 BXD56 BXD59 BXD60 BXD61 BXD62 BXD63 BXD64 BXD65 BXD65a BXD65b BXD66 BXD67 BXD68 BXD69 BXD70 BXD71 BXD72 BXD73 BXD73a BXD73b BXD74 BXD75 BXD76 BXD77 BXD78 BXD79 BXD81 BXD83 BXD84 BXD85 BXD86 BXD87 BXD88 BXD89 BXD90 BXD91 BXD93 BXD94 BXD95 BXD98 BXD99 BXD100 BXD101 BXD102 BXD104 BXD105 BXD106 BXD107 BXD108 BXD109 BXD110 BXD111 BXD112 BXD113 BXD114 BXD115 BXD116 BXD117 BXD119 BXD120 BXD121 BXD122 BXD123 BXD124 BXD125 BXD126 BXD127 BXD128 BXD128a BXD130 BXD131 BXD132 BXD133 BXD134 BXD135 BXD136 BXD137 BXD138 BXD139 BXD141 BXD142 BXD144 BXD145 BXD146 BXD147 BXD148 BXD149 BXD150 BXD151 BXD152 BXD153 BXD154 BXD155 BXD156 BXD157 BXD160 BXD161 BXD162 BXD165 BXD168 BXD169 BXD170 BXD171 BXD172 BXD173 BXD174 BXD175 BXD176 BXD177 BXD178 BXD180 BXD181 BXD183 BXD184 BXD186 BXD187 BXD188 BXD189 BXD190 BXD191 BXD192 BXD193 BXD194 BXD195 BXD196 BXD197 BXD198 BXD199 BXD200 BXD201 BXD202 BXD203 BXD204 BXD205 BXD206 BXD207 BXD208 BXD209 BXD210 BXD211 BXD212 BXD213 BXD214 BXD215 BXD216 BXD217 BXD218 BXD219 BXD220 C57BL/6JxBXD037F1 BXD001xBXD065aF1 BXD009xBXD170F1 BXD009xBXD172F1 BXD012xBXD002F1 BXD012xBXD021F1 BXD020xBXD012F1 BXD021xBXD002F1 BXD024xBXD034F1 BXD032xBXD005F1 BXD032xBXD028F1 BXD032xBXD65bF1 BXD034xBXD024F1 BXD034xBXD073F1 BXD055xBXD074F1 BXD055xBXD65bF1 BXD061xBXD071F1 BXD062xBXD077F1 BXD065xBXD077F1 BXD069xBXD090F1 BXD071xBXD061F1 BXD073bxBXD065F1 BXD073bxBXD077F1 BXD073xBXD034F1 BXD073xBXD065F1 BXD073xBXD077F1 BXD074xBXD055F1 BXD077xBXD062F1 BXD083xBXD045F1 BXD087xBXD100F1 BXD065bxBXD055F1 BXD102xBXD077F1 BXD102xBXD73bF1 BXD170xBXD172F1 BXD172xBXD197F1 BXD197xBXD009F1 BXD197xBXD170F1 +``` + +with 235 named BXD mice. Now when we check the DB we get + +``` +select * from StrainXRef WHERE InbredSetId=1 AND Used_for_mapping="Y" +276 rows +``` + +so it looks like we need some extra logic. + + + ## More complicated datasets A good dataset to take apart is -- cgit v1.2.3