First permutations work

author: Pjotr Prins 2024-08-23 15:17:34 +0200
committer: Pjotr Prins 2024-08-23 15:17:34 +0200
commit: 8d7f7db90298b900ba82a908cb0dd8a6ab185a6b (patch)
tree: 6a99810a0dc00c32b12ecc71ec9227a834f9b4ea
parent: 41ae3e92af3ed8a7ee6bfe03bec0c4981f68c2a8 (diff)
download: gn-gemtext-8d7f7db90298b900ba82a908cb0dd8a6ab185a6b.tar.gz
1 files changed, 70 insertions, 2 deletions
diff --git a/topics/lmms/gemma/permutations.gmi b/topics/lmms/gemma/permutations.gmi
index 96af97d..af79740 100644
--- a/topics/lmms/gemma/permutations.gmi
+++ b/topics/lmms/gemma/permutations.gmi
@@ -6,7 +6,7 @@ in certain cases (genotypes have to be complete, for one).
 
 Because of many changes gemma-wrapper is not working for permutations. I have a few steps to take care of:
 
-* [ ] read R/qtl2 format for phenotype
+* [X] read R/qtl2 format for phenotype
 
 # R/qtl2 and GEMMA formats
 
@@ -56,7 +56,7 @@ OK, let's try. The first step is to convert the phenotype file to something GEMM
 
 * [X] Read the JSON layout file - 'sample_list' is essentially the header of the BIMBAM geno file
 * [X] Use the R/qtl2-style phenotype file to write a correct GEMMA pheno file (multi column)
-* [ ] Compare results with GN pheno output
+* [X] Compare results with GN pheno output
 
 Running GEMMA by hand it complained
 
@@ -147,6 +147,74 @@ and that matches Dave's output for LOCO and marker rs3718618. All good, so far.
 
 Now we have gemma-wrapper working we need to fix it to work with the latest type of files.
 
+* [X] randomize phenotypes using -n switch
+* [ ] Permute gemma and collect results
+* [ ] Unseed randomizer or make it an option
+* [ ] Fix tmpdir
+* [X] Show final score
+* [ ] Compare small and large BXD set
+
+For the first one, the --permutate-phenotype switch takes the input pheno file. Because we pick a column with gemma we can randomize all input lines together. So, in the above example, we shuffle BXD_pheno_Dave-GEMMA.txt. Interestingly it looks like we are already shuffling by line in gemma-wrapper.
+
+The good news is that it runs, but the outcome is wrong:
+
+```
+["95 percentile (significant) ", 1000.0, -3.0]
+["67 percentile (suggestive)  ", 1000.0, -3.0]
+```
+
+Inspecting the phenotype files they are shuffled, e.g.
+
+```
+BXD073xBXD065F1 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
+BXD49 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
+BXD86 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
+BXD161  15.623  142.908997  4.0 2350.637939 3294.824951 1.452 2.08  20.416365 0.363636  142.909088  15.622727 4.0 2350.638672 3294.825928 1.45
+1636  2.079909  0.363636  33.545448 2.125 2.0 2.375 1.25  44.5
+BXD154  20.143  195.5 4.75  1533.689941 4568.76416  0.727 2.213748  27.9275 0.75  195.5 20.142857 4.75  1533.690796 4568.76416  0.72675 2.2137
+48  0.75  54.5  0.75  1.75  3.0 1.5 33.0
+```
+
+which brings out an interesting point. Most BXDs in the genotype file are missing from this experiment. We are computing LOD scores as if we have a full BXD population. So, what we are saying here is that if we have all BXD genotypes and we randomly assign phenotypes against a subset, what is the chance we get a hit at random. I don't think this is a bad assumption, but it not exactly what Gary Churchill had in mind in his 1994 paper:
+
+=> https://pubmed.ncbi.nlm.nih.gov/7851788/ Empirical threshold values for quantitative trait mapping
+
+The idea is to shuffle genotypes against phenotypes. If there is a high correlation we get a result. The idea is to break the correlation and that should work for both the large and the small BXD set. Scoring the best 'random' result out of 1000 permutations at, say 95% highest, sets the significance level.
+With our new precompute we should be able to show the difference. Anyway, that is one problem, the other is that the stats somehow do not add up to the final result. Score min is set at
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/7769f209bcaff2472ba185234fad47985e59e7a3/bin/gemma-wrapper#L667
+
+The next line says 'if false'. Alright, that explains part of it at least as the next block was disabled for slurm and is never run. I should rip the slurm stuff out, actually, as Arun has come up with a much better solution. But that is for later.
+
+Disabling that permutation stopped with
+
+```
+Add parallel job: time -v /bin/gemma -loco X -k 02fe8482913a998e6e9559ff5e3f1b89e904d59d.X.cXX.txt.cXX.txt -o 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt -p phenotypes-1 -lmm 9 -g BXD-test.txt -n 5 -a BXD.8_snps.txt -outdir /tmp/d20240823-4481-xfrnp6
+DEBUG: Reading 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt.1.assoc.txt
+./bin/gemma-wrapper:672:in `foreach': No such file or directory @ rb_sysopen - 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt.1.assoc.txt (Errno::ENOENT)
+```
+
+so it created a file, but can't find it because outdir is not shared. Now tmpdir is in the outer block so the file should still exist. For troubleshooting the first step is to seed the randomizer (seed) so we get the same run every time.
+It turns out there are a number of problems. First of all the permutation output was numbered and the result was not found. Fixing that gave a first result without the -parallel switch:
+
+```
+[0.0008489742, 0.03214928, 0.03426648, 0.0351207, 0.0405179, 0.04688354, 0.0692488, 0.1217158, 0.1270747, 0.1880325]
+["95 percentile (significant) ", 0.0008489742, 3.1]
+["67 percentile (suggestive)  ", 0.0351207, 1.5]
+```
+
+That is pleasing and it suggests that we have a significant result for the trait of interest: `volume of the first tumor that developed`. Running LOCO withouth parallel is slow (how did we survive in the past!).
+
+The 100 run shows
+
+```
+[0.0001626146, 0.0001993085, 0.000652191, 0.0007356249, 0.0008489742, 0.0009828207, 0.00102203, 0.001091924, 0.00117823, 0.001282312, 0.001471041, 0.001663572, 0.001898194, 0.003467039, 0.004655921, 0.005284387, 0.005628393, 0.006319995, 0.006767502, 0.007752473, 0.008757406, 0.008826192, 0.009018125, 0.009735282, 0.01034488, 0.01039465, 0.0122644, 0.01231366, 0.01265093, 0.01317425, 0.01348443, 0.013548, 0.01399461, 0.01442383, 0.01534904, 0.01579931, 0.01668551, 0.01696015, 0.01770371, 0.01838937, 0.01883068, 0.02011034, 0.02234977, 0.02362105, 0.0242342, 0.02520063, 0.02536663, 0.0266905, 0.02932001, 0.03116032, 0.03139836, 0.03176087, 0.03214928, 0.03348359, 0.03426648, 0.0351207, 0.03538503, 0.0354338, 0.03609931, 0.0371134, 0.03739827, 0.03787489, 0.04022586, 0.0405179, 0.04056273, 0.04076034, 0.04545012, 0.04588635, 0.04688354, 0.04790254, 0.05871501, 0.05903692, 0.05904868, 0.05978341, 0.06103624, 0.06396175, 0.06628317, 0.06640048, 0.06676557, 0.06848021, 0.0692488, 0.07122914, 0.07166011, 0.0749728, 0.08174019, 0.08188341, 0.08647539, 0.0955264, 0.1019648, 0.1032776, 0.1169525, 0.1182405, 0.1217158, 0.1270747, 0.1316735, 0.1316905, 0.1392859, 0.1576149, 0.1685975, 0.1880325]
+["95 percentile (significant) ", 0.0009828207, 3.0]
+["67 percentile (suggestive)  ", 0.01442383, 1.8]
+```
+
+Not too far off!
+
 ## Covariates
 
 - [ ] Try covariates Dave
author	Pjotr Prins	2024-08-23 15:17:34 +0200
committer	Pjotr Prins	2024-08-23 15:17:34 +0200
commit	8d7f7db90298b900ba82a908cb0dd8a6ab185a6b (patch)
tree	6a99810a0dc00c32b12ecc71ec9227a834f9b4ea
parent	41ae3e92af3ed8a7ee6bfe03bec0c4981f68c2a8 (diff)
download	gn-gemtext-8d7f7db90298b900ba82a908cb0dd8a6ab185a6b.tar.gz