More on permutations - heading for bulklmm

author: Pjotr Prins 2024-09-27 10:19:17 +0200
committer: Pjotr Prins 2024-09-27 10:19:20 +0200
commit: e6c06a661946dfc3f7a5cc0d45d844f318d97702 (patch)
tree: d1804a2a3474d7395f3d33d6721de3f46211db82
parent: 013c004d2f67045324b3c8272aabf48c85fd9f35 (diff)
download: gn-gemtext-e6c06a661946dfc3f7a5cc0d45d844f318d97702.tar.gz
1 files changed, 379 insertions, 3 deletions
diff --git a/topics/lmms/gemma/permutations.gmi b/topics/lmms/gemma/permutations.gmi
index a3968fe..99e3951 100644
--- a/topics/lmms/gemma/permutations.gmi
+++ b/topics/lmms/gemma/permutations.gmi
@@ -227,7 +227,7 @@ It is fun to see that when I did a second run the
 [100, ["95 percentile (significant) ", 0.0002998286, 3.5], ["67 percentile (suggestive)  ", 0.01167864, 1.9]]
 ```
 
-significance value was 3.5. Still, our hit is whopper.
+significance value was 3.5. Still, our hit is whopper - based on this.
 
 ## Run permutations in parallel
 
@@ -491,6 +491,7 @@ GEMMA in GN2 shows the same result when setting MAF to 0.05 or 0.1 (you can try
 
 That leads me to think that we only need to check for epoch when we have a single *low* MAF hit, say 0.01 for 28 mice. As we actively filter on MAF right now we won't likely see an epoch hit.
 
+
 ## Protocol for permutations
 
 First we run GEMMA just without LOCO using default settings that GN uses
@@ -541,6 +542,7 @@ Now that works we can move to a full LOCO
 ```
 ./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5 -a BXD.8_snps.txt > K.json
 ./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 5
+./bin/./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb
 ```
 
 and we get
@@ -588,7 +590,7 @@ Note that at this step we actually create a full GRM. Reducing happens in the ne
 Note the use of '-n' switch. We should change that.
 
 ```
-./bin/gemma-values.py /tmp/test/8599834ee474b9da9ff39cc4954d662518a6b5c8.mdb --sort
+./bin/./bin/view-gemma-mdb /tmp/test/8599834ee474b9da9ff39cc4954d662518a6b5c8.mdb --sort
 ```
 
 Look for rs3718618 at 69216071 and I am currently getting the wrong result for trait 21529 and it is not clear why that is:
@@ -601,6 +603,25 @@ chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
 18,69216071,?,0.462,10.8099,93.3936,0.0,0.8097,0.09
 ```
 
+The failing command is:
+
+```
+/bin/gemma -loco 18 -k /tmp/test/reduced-GRM-18.txt.tmp -o 69170e8a2d2f08905daa14461eca1d82a676b4c4.18.assoc.txt -p /tmp/test/reduced-pheno.txt.tmp -n 2 -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 -outdir /tmp/test
+```
+
+produces
+
+```
+18  rs3718618       69216071        0       X       Y       0.462   -2.161984e+01   9.339365e+01    1.000000e-05    8.097026e-01
+```
+
+The pheno file looks correct, so it has to be the reduced GRM. And this does not look good either:
+
+```
+number of SNPS for K            =     7070
+number of SNPS for GWAS         =      250
+```
+
 When running GEMMA on genenetwork.org we get a peak for LOCO at that position for rs3718618. I note that the non-LOCO version at 4.1 vs 4.5 for LOCO has a higher peak. We should compute the significance for both!
 
 Now, when I run the non-LOCO version by hand I get
@@ -610,6 +631,359 @@ Now, when I run the non-LOCO version by hand I get
 => 4.123063165904243
 ```
 
+## Finally
+
+So, we rolled back to not using reduced phenotypes for now.
+
+For trait 21529 after 1000 permutations we get for LOCO:
+
+```
+["95 percentile (significant) ", 1.051208e-05, 5.0]
+["67 percentile (suggestive)  ", 0.0001483188, 3.8]
+```
+
+which means our GWA hit is at 4.5 is not so close to being significant.
+
+Next I made sure the phenotypes got shuffled against the BXD used - which is arguably the right thing to do.
+It should not have a huge impact because the BXDs share haplotypes - so randomized association should end up in the same ball park. The new result after 1000 permutations is:
+
+```
+["95 percentile (significant) ", 8.799303e-06, 5.1]
+["67 percentile (suggestive)  ", 0.0001048443, 4.0]
+```
+
+## More for Dave
+
+
+Run and permute:
+
+```
+./bin/gemma-wrapper --lmdb --debug --phenotypes BXD_pheno_matched.txt --verbose --force --loco  --json --input K.json -- -g BXD.geno.txt -a BXD.8. -lmm 9 -maf 0.05 -n 2 -p BXD_pheno_matched.txt
+./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 2 --verbose --force --loco --json --input K.json -- -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05
+```
+
+```
+21526 How old was the mouse when a tumor was first detected?
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+14,99632276,?,0.462,-0.6627,0.3322,100000.0,0.0003,3.56
+14,99694520,?,0.462,-0.6627,0.3322,100000.0,0.0003,3.56
+17,80952261,?,0.538,0.6528,0.3451,100000.0,0.0005,3.31
+["95 percentile (significant) ", 6.352578e-06, 5.2]
+["67 percentile (suggestive)  ", 0.0001007502, 4.0]
+```
+
+```
+24406 What was the weight of the first tumor that developed, at death?
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+11,9032629,?,0.536,0.1293,0.0562,100000.0,0.0,4.36
+11,9165457,?,0.536,0.1293,0.0562,100000.0,0.0,4.36
+11,11152439,?,0.5,0.126,0.0562,100000.0,0.0001,4.21
+11,11171143,?,0.5,0.126,0.0562,100000.0,0.0001,4.21
+11,11525458,?,0.5,0.126,0.0562,100000.0,0.0001,4.21
+11,8786241,?,0.571,0.1203,0.0581,100000.0,0.0002,3.78
+11,8836726,?,0.571,0.1203,0.0581,100000.0,0.0002,3.78
+11,19745817,?,0.536,0.1183,0.061,100000.0,0.0003,3.46
+11,19833554,?,0.536,0.1183,0.061,100000.0,0.0003,3.46
+["95 percentile (significant) ", 1.172001e-05, 4.9]
+["67 percentile (suggestive)  ", 0.0001175644, 3.9]
+```
+
+```
+27515 No description
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+4,103682035,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,103875085,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104004372,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104156915,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104166428,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104584276,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,103634906,?,0.519,-0.1497,0.0733,100000.0,0.0002,3.67
+4,103640707,?,0.519,-0.1497,0.0733,100000.0,0.0002,3.67
+["95 percentile (significant) ", 7.501004e-06, 5.1]
+["67 percentile (suggestive)  ", 7.804668e-05, 4.1]
+```
+
+## Dealing with significance
+
+Now the significance thresholds appear to be a bit higher than we expect. So, let's see what is going on. First I check the randomization of the phenotypes. That looks great. There are 1000 different phenotype files and they randomized only the BXD we used. Let's zoom in on our most interesting 27515. When running in GN2 I get more hits - they are at the same level, but somehow SNPs have dropped off. In those runs our SNP of interest shows only a few higher values:
+
+```
+./6abd89211d93b0d03dc4281ac3a0abe7fc10da46.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.932957e-01 7.337327e-02    1.000000e+05    2.700506e-04
+./b6e58d6092987d0c23ae1735d11d4a293782c511.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.413067e-01 6.416133e-02    1.000000e+05    5.188637e-04
+./4266656951ab0c5f3097ddb4bf917448d7542dd5.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   2.757074e-01  6.815899e-02    1.000000e+05    2.365318e-04
+./265e44a4c078d2a608b7117bbdcb9be36f56c7de.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   2.358494e-01  5.743872e-02    1.000000e+05    1.996261e-04
+napoli:/export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/tmp/test$ rg 103682035 .|grep 5$
+./b29f08a4b1061301d52f939087f1a4c1376256f0.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.841255e-01 6.194426e-02    1.000000e+05    5.220922e-05
+./3e5b12e9b7478b127b47c23ccdfba2127cf7e2b2.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.813968e-01 6.379554e-02    1.000000e+05    8.533857e-05
+```
+
+but none as high as the original hit of 5.57
+
+```
+irb(main):001:0> -Math.log10(2.700506e-04)
+=> 3.5685548534637
+irb(main):002:0> -Math.log10(5.220922e-05)
+=> 4.282252795052573
+irb(main):003:0> -Math.log10(8.533857e-05)
+=> 4.06885463879464
+```
+
+All good. This leaves two things to look into. First, I see less hits than with GN2(!). Second, qnorm gives a higher peak in GN2.
+
+* [X] Check for number of SNPs
+
+The number of SNPs is not enough:
+
+```
+GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022
+Reading Files ...
+## number of total individuals = 237
+## number of analyzed individuals = 26
+## number of covariates = 1
+## number of phenotypes = 1
+## leave one chromosome out (LOCO) =        1
+## number of total SNPs/var        =    21056
+## number of SNPS for K            =     6684
+## number of SNPS for GWAS         =      636
+## number of analyzed SNPs         =    21056
+```
+
+Even when disabling MAF filtering we still see a subset of SNPs. I am wondering what GN2 does here.
+
+## Missing SNPs
+
+In our results we miss SNPs that are listed on GN2, but do appear in our genotypes, e.g.
+
+```
+BXD.8_snps.txt
+19463:rsm10000013598, 69448067, 18
+```
+
+First of all we find we used a total of 6360 SNPs out of the original 21056. For this SNP the genotype files show:
+
+```
+BXD_geno.txt
+19463:rsm10000013598, X, Y, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1, 0.5, 1, 1, 1, 1, 0, 1, 0, 1, 0.5, 0, 0, 0, 1, 0.5, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0.5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 1, 0, 0, 0, 1, 1, 1, 0.5, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 1, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0.5, 1, 0.5, 0, 0.5
+```
+
+and in our updated
+
+```
+BXD.geno.txt
+rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1
+```
+
+That looks good. Turns out we need the annotation file(?!)
+
+I figured out where the missing SNPs went. Turns out that, if you pass in an annotation file, and if it is not complete, GEMMA drops the non-annotated SNPs unceremoniously. Getting the right annotation file fixed it. GEMMA should obviously not behave like that ;). Anyway, I am in sync with GN2 now. Unfortunately, with permutations, the significance threshold did not change much (which kinda makes sense).
+
+I want to see why gemma is giving this number. If I can't find it fast I'll try to run bulklmm or R/qtl2 lmm instead and see if they disagree with gemma and if we can get close to what Rob expects.
+
+
+```
+gemma -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 22
+gemma -lmm 9 -g BXD.geno.txt -p BXD_pheno_matched.txt -k output/result.cXX.txt -n 22
+```
+
+Now that works we can move to a full LOCO
+
+```
+./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5  -a BXD.8_snps.txt > K.json
+./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 5
+./bin/./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb
+```
+
+and we get
+
+```
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+18,69216071,rs3718618,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69448067,rsm10000013598,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69463065,rsm10000013599,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69803489,rsm10000013600,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69825784,rs50446650,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69836387,rsm10000013601,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,68188822,rsm10000013579,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68189477,rs29539715,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68195226,rsm10000013580,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68195289,rsm10000013581,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68195758,rsm10000013582,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68454446,rs30216358,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68514475,rs6346101,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68521138,rsm10000013583,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68526029,rs29984158,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68542739,rsm10000013584,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68543456,rsm10000013585,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68564736,rsm10000013586,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68565230,rsm10000013587,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+```
+
+which is in line with GN2.
+
+Run and permute:
+
+```
+./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 2 --verbose --force --loco --json --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05
+```
+
+* [X] Test significance effect for higher and lower MAF than 0.05
+
+Lower MAF increases significance thresholds?
+
+```
+0.05?
+["95 percentile (significant) ", 6.268117e-06, 5.2]
+["67 percentile (suggestive)  ", 7.457537e-05, 4.1]
+
+0.01
+["95 percentile (significant) ", 5.871237e-06, 5.2]
+["67 percentile (suggestive)  ", 7.046853e-05, 4.2]
+```
+
+* [ ] Check distribution of hits with permutations
+
+## What about significance
+
+What we are trying to do here is to decide on a significance level that says that the chance of a hit caused by a random event is less that 1 in a thousand. We are currently finding levels of 5.0 and from earlier work it should be less than 4.0. We are essentially following Gary Churchill's '94 paper: ``Empirical threshold values for quantitative trait mapping''. The significance level depends on the shape of the data - i.e., the shape of both genotypes and the trait under study. If the significance level is 5.0 it means that we can expect alpha=0.05 or 5% of random trait vectors can be expected to show a LOD score of 5 or higher.
+
+What GEMMA does is look for a correlation between a marker, e.g.
+
+```
+BXD.geno.txt
+rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1
+```
+
+and a trait that is measured for a limited number against these individuals/strains/genometypes. We also correct for kinship between the individuals, but that is tied to the individuals, so we can ignore that for now. So you get a vector of:
+
+```
+marker rsm10000013598
+ind  trait
+0     8.1
+0     7.9
+2     12.3
+2     13.4
+```
+
+We permute the data after breaking the correlation between left and right columns. When running 1000 permutations for this particular hit we find that the shuffled never gets a higher value then for our main run. That is comforting because random permutations are always less correlated (for this marker).
+
+If we do this genome-wide we also see a randomly positioned highest hit across all chromosomes after shuffling the trait vector and our hit never appears the highest. E.g.
+
+```
+[10, ["2", "rs13476914", "170826974"], ["95 percentile (significant) ", 1.870138e-05, 4.7], ["67 percentile (suggestive)  ", 6.3797e-05, 4.2]]
+[11, ["6", "rsm10000004149", "25227945"], ["95 percentile (significant) ", 1.870138e-05, 4.7], ["67 percentile (suggestive)  ", 6.3797e-05, 4. 2]]
+[12, ["9", "rsm10000006852", "81294046"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 4.216931e-05, 4.4]]
+[13, ["2", "rsm10000001382", "57898368"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 6.3797e-05, 4. 2]]
+[14, ["1", "rsm10000000166", "94030054"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 6.3797e-05, 4. 2]]
+[15, ["X", "rsm10000014672", "163387262"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 6.3797e-05, 4 .2]]
+```
+
+### Shuffling a normally distributed trait
+
+
+So the randomization works well. Still, or 95% is close to 5.0 and that is by chance. What happens when we change the shape of the data? Let's create a new trait, so the distribution is random and normal:
+
+```
+> rnorm(25, mean = 10, sd = 2)
+ [1] 10.347116  9.475156 11.747876 10.969742 11.374611 12.283834 11.499779
+ [8] 11.123520 10.830300 11.640049 10.392085 11.586836 11.540470 10.700869
+[15]  8.802858 10.238498 11.099536  8.832104  6.463636 10.347956 11.222558
+[22]  8.658024  7.796304 10.684967  9.540483
+```
+
+These random trait values renders a hit of -Math.log10(8.325683e-04) = 3.0! Now we permute and we get:
+
+["95 percentile (significant) ", 5.22093e-06, 5.3]
+["67 percentile (suggestive)  ", 7.303966e-05, 4.1]
+
+So the shape of a normally distribute trait gives a higher threshold - it is easier to get a hit by chance.
+
+### Genotypes
+
+So 95% of random shuffled trait runs still gives us 5.x. So this has to be a property of the genotypes in conjunction with the method GEMMA applies. With regard to genotypes, the BXD are not exactly random because they share markers from two parents which run along haplotypes. I.e. we are dealing with a patchwork of similar genotypes. You may expect that would suppress the chance of finding random hits. Let's try to prove that by creating fully random genotypes and an extreme haplotype set. And, for good measure something in between.
+
+* [X] Fully random genotypes
+
+In the next phase we are going to play a bit with the haplotypes. First we fully randomize the genotype matrix. This way we break all haplotypes. As BIMBAM is a simple format we'll just modify an existing BIMBAM file. It looks like
+
+```
+rs3677817,X,Y,1.77,0.42,0.18,0.42,1.42,0.34,0.69,1.57,0.52,0.1,0.37,1.27,0.62,1.87,1.71,1.65,1.83,0.04,1.05,0.52,1.92,0.57,0.61,0.11,1.49,1.07,1.48,1.7,0.5,1.75,1.74,0.29,0.37,1.78,1.91,1.37,1.64,0.32,0.09,1.21,1.58,0.4,1.0,0.62,1.1,0.7,0.35,0.86,0.7,0.46,1.14,0.04,1.87,1.96,0.61,1.34,0.63,1.04,1.95,0.22,0.54,0.31,0.14,0.95,1.45,0.93,0.37,0.79,1.37,0.87,1.79,0.41,1.73,1.25,1.49,1.57,0.39,1.61,0.37,1.85,1.83,1.71,1.5,1.78,1.34,1.29,1.41,1.54,1.05,0.3,0.87,1.85,0.5,0.19,1.54,0.53,0.26,1.47,0.67,0.84,0.18,0.79,0.68,1.48,0.4,1.83,1.76,1.09,0.2,1.48,0.24,0.53,0.41,1.24,1.38,1.31,1.73,0.52,1.86,1.21,0.58,1.68,0.79,0.4,1.41,0.07,0.57,0.42,0.47,0.49,0.05,0.77,1.33,0.15,1.41,0.03,0.24,1.66,1.39,2.0,0.23,1.4,1.05,0.79,0.51,0.66,1.24,0.29,1.12,0.46,0.92,1.12,1.53,1.78,1.22,1.35,0.1,0.43,0.41,1.89,0.09,0.13,1.04,0.24,1.4,1.25,0.24,0.26,0.31,0.36,0.31,1.34,1.23,1.91,0.7,0.08,1.43,0.17,1.9,0.06,1.42,1.94,0.43,0.54,1.96,1.29,0.64,0.82,1.85,1.63,0.23,1.79,0.52,1.65,1.43,0.95,1.13,0.59,0.07,0.66,1.79,0.92,1.89,1.2,0.51,0.18,0.96,0.44,0.46,0.88,0.39,0.89,1.68,0.07,1.46,1.61,1.73,0.56,1.33,1.67,0.16,1.78,0.61,1.55,0.88,0.15,1.98,1.96,0.61,0.04,0.12,1.4,1.65,0.71,1.3,1.85,0.49
+```
+
+We'll stick in the old hit for good measure and run our genotypes:
+
+```
+./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.rand.txt -p BXD_pheno_matched.txt -n 5  -a BXD.8_snps.txt > K.json
+./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.rand.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 22
+./bin/./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb
+./bin/view-gemma-mdb /tmp/e279abbebee8e41d7eb9dae...-gemma-GWA.tar.xz --anno BXD.8_snps.txt|head -20
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+X,139258413,rsm10000014629,0.496,0.2248,0.093,100000.0,0.0,4.58
+6,132586518,rsm10000003691,0.517,0.2399,0.1068,100000.0,0.0001,4.17
+2,161895805,rs27350606,0.585,-0.2303,0.1059,100000.0,0.0001,4.0
+X,47002415,rsm10000014323,0.562,-0.1904,0.0877,100000.0,0.0001,3.99
+3,32576363,rsm10000001568,0.468,-0.2251,0.104,100000.0,0.0001,3.97
+14,19281191,rs52350512,0.5,-0.2454,0.1154,100000.0,0.0001,3.88
+7,111680092,rs32385258,0.536,0.2022,0.0968,100000.0,0.0002,3.79
+4,151267320,rsm10000002095,0.604,-0.2257,0.1102,100000.0,0.0002,3.69
+2,157353289,rs27323024,0.455,0.2188,0.1072,100000.0,0.0002,3.67
+19,56503719,rsm10000013894,0.617,0.2606,0.1302,100000.0,0.0003,3.58
+```
+
+Interestingly our trait did not do that well:
+
+```
+18,69448067,rsm10000013598,0.635,0.0941,0.0774,100000.0,0.0167,1.78
+```
+
+It shows how large the impact of the GRM is. We can run our permutations.
+
+```
+./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 22 --verbose --force --loco --json --input K.json -- -g BXD.geno.rand.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05
+["95 percentile (significant) ", 1.478479e-07, 6.8]
+["67 percentile (suggestive)  ", 1.892087e-06, 5.7]
+```
+
+Well that went through the roof :). It makes sense when you think about it. Randomizing genotypes of 21K SNPs gives you a high chance of finding SNPs that correlate with the trait. Let's go the other way and give 20% of indidivuals the exact same haplotypes, basically copying
+
+```
+rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1
+```
+
+```
+./bin/bimbam-rewrite.py --inject inject.geno.txt BXD.geno.txt --perc=20 > BXD.geno.20.txt
+rg -c "2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1" BXD.geno.20.txt
+4276
+```
+
+so 4K out of 20K SNPs has identical haplotypes which correlate with our trait of interest:
+
+```
+["95 percentile (significant) ", 5.16167e-06, 5.3]
+["67 percentile (suggestive)  ", 6.163728e-05, 4.2]
+```
+
+and at 40% haplotype injection we get
+
+```
+["95 percentile (significant) ", 3.104788e-06, 5.5]
+["67 percentile (suggestive)  ", 7.032406e-05, 4.2]
+```
+
+* [X] Haplotype equal genotypes 20% and 40%
+
+All looks interesting, but does not help.
+
+Also when we halve the number of SNPs the results are similar:
+
+```
+["95 percentile (significant) ", 6.026549e-06, 5.2]
+["67 percentile (suggestive)  ", 8.571557e-05, 4.1]
+```
+
+Even though the threshold is high, it is kind of interesting to see that no matter what you do you end up similar levels. After a meeting with Rob and Saunak the latter pointed out that these numbers are not completely surprising. For LMMs we need to use an adaptation - i.e. shuffle the trait values after rotation and transformation and then reverse that procedure. The good news is that BulkLMM contains that method and thresholds will be lower.
+
+
+* [ ] Run bulklmm
 
 
 ## Dealing with epoch
@@ -626,9 +1000,11 @@ We have two or more possible solutions to deal with hierarchy in the population.
 
 ## Later
 
+* [ ] Check running or trait without LOCO with both standard and random GRMs
+* [ ] Test non-loco effect for rsm10000013598 - looks too low and does not agree with GN2
+* [X] Try qnorm run
 * [ ] Fix non-use of MAF in GN for non-LOCO
 * [ ] Fix running of -p switch when assoc cache exists (bug)
-* [ ] Compare significance for non-LOCO
 
 Quantile-Based Permutation Thresholds for Quantitative Trait Loci Hotspots
 https://academic.oup.com/genetics/article/191/4/1355/5935078
author	Pjotr Prins	2024-09-27 10:19:17 +0200
committer	Pjotr Prins	2024-09-27 10:19:20 +0200
commit	e6c06a661946dfc3f7a5cc0d45d844f318d97702 (patch)
tree	d1804a2a3474d7395f3d33d6721de3f46211db82
parent	013c004d2f67045324b3c8272aabf48c85fd9f35 (diff)
download	gn-gemtext-e6c06a661946dfc3f7a5cc0d45d844f318d97702.tar.gz