diff options
| author | Pjotr Prins | 2025-11-04 17:01:23 +0100 |
|---|---|---|
| committer | Pjotr Prins | 2026-01-05 11:12:11 +0100 |
| commit | 7837327c7b97284bdc98e657a45d48e863beb7c8 (patch) | |
| tree | 5fb19287a048953a6af39d9a7c724f89ae4abe04 | |
| parent | 851d7ada61e9dacbcf04d57b08e2db85d1f1af69 (diff) | |
| download | gn-gemtext-7837327c7b97284bdc98e657a45d48e863beb7c8.tar.gz | |
Continuing with precompute
| -rw-r--r-- | topics/data/epochs.gmi | 5 | ||||
| -rw-r--r-- | topics/genetics/test-pangenome-derived-genotypes.gmi | 195 | ||||
| -rw-r--r-- | topics/systems/mariadb/precompute-publishdata.gmi | 6 |
3 files changed, 206 insertions, 0 deletions
diff --git a/topics/data/epochs.gmi b/topics/data/epochs.gmi new file mode 100644 index 0000000..5ffe795 --- /dev/null +++ b/topics/data/epochs.gmi @@ -0,0 +1,5 @@ +# Epochs + +In the 2019 BXD paper epochs are brought up. Basically, even though the BXD are 'immortal' with identical children, mutations do creep in. An epoch is a period of mice and we track the years a mouse was used. So a BXD1 breeding started at 1971 and production in 2001. In GN we don't make a distinction (per se), but obviously these are (slightly) different mice today. Ashbrook et al. find some interesting results that differ in epochs. + +In GN epochs are currently handled as a trait. This can help with covariate mapping. For a different epoch, however, the genotypes should also be adapted. The effect on the kinship matrix will be minor, but genotypes can be used for fine mapping. With pangenome derived genotypes it should get even more interesting. diff --git a/topics/genetics/test-pangenome-derived-genotypes.gmi b/topics/genetics/test-pangenome-derived-genotypes.gmi new file mode 100644 index 0000000..48111d4 --- /dev/null +++ b/topics/genetics/test-pangenome-derived-genotypes.gmi @@ -0,0 +1,195 @@ +# Test pangenome derived genotypes + +Here we follow up on the work we did on precompute PublishData: + +=> ../genetics/systems/mariadb/precompute-publishdata + +But now run against pangenome derived genotypes. +For the BXD we have 23M markers(!) whereof 8M *not* on the reference genome. + +# Tasks + +* [ ] Reintroduce nodes that were not annotated for position + +# Prior work + +For the first traits (presented at CTC'25) gemma was run as + +``` +echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..." +gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \ + -p ${PHENO_FILE} \ + -gk \ + -o percentile_result > percentile.kinship.143.txt + +echo "[$(date)] Kinship matrix calculation completed for PERCENTILE." +echo "[$(date)] Starting association analysis for PERCENTILE..." +gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \ + -p ${PHENO_FILE} \ + -k ./output/percentile_result.cXX.txt \ + -lmm 4 \ + -maf 0.05 \ + -o percentile_association > percentile.assoc.143.txt +``` + +Note no LOCO. + +The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything in RAM, it is a bit large for my workstation. I opted to use tux04 since no one is using it. Had to reboot the machine because it is unreliable and had crashed. + +There I rebuilt gemma and set up a first run: + +``` +tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$ +/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk +``` + +Without LOCO this took about 18 minutes (186% CPU), 110Gb of RAM. We ought to work on this ;) Next + +``` +/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05 +``` + +To run gemma on the current 23M BXD pangenome derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). 13K traits would be 43 months on a single machine. We'll need something better. As Rob writes: + +> The huge majority of variants will have r2 of 1 with hundreds ir thousands of neighbors. This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions. + +A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows: + +``` +chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle l_mle p_wald p_lrt p_score +-9 A1-0 -9 0 A T 0.171 -nan -nan -nan 1.000000e+05 1.000000e+05 -nan -nan -nan +-9 A2-0 -9 0 A T 0.170 -nan -nan -nan 1.000000e+05 1.000000e+05 -nan -nan -nan +``` + +Turns out I was using the wrong pheno file. Let's try again. + +``` +/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05 +``` + +As a check I can diff against the original output. So, I replicated the original run! It also ran faster at 400% CPU in 35 minutes. + +(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things). + +## Moving to gemma-wrapper + +gemma-wrapper has extra facilities, such as LOCO and caching and lmdb output. Last time we used it in + +=> ../genetics/systems/mariadb/precompute-publishdata + +in a guix container it looked like + +``` +#! /bin/env sh + +export TMPDIR=./tmp +curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json +jq ".[] | .Id" < bxd-publish.json > ids.txt +./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json + +for id in 'cat ids.txt' ; do + echo Precomputing $id + if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then + curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json + ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt + ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json + fi +done +``` + +Let's try running the big stuff instead: + +``` +./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz -gk +``` + +## Individuals + +gemma does not really track individuals. The order of genotype columns should just be the same as in the pheno file. +In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following + +``` +{ + "type": "gn-geno-to-gemma", + "genofile": "BXD.geno", + "samples": [ + "BXD1", + "BXD2", + "BXD5", +... + ], + "numsamples": 237, + "header": [ + "# File name: BXD_experimental_DGA_7_Dec_2021", +... +``` + +To get this + +``` +cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e s,$,\"\,,| cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e "s,$,\"\\,," > bxd_inds.list.txt +"BXD100", +"BXD101", +"BXD102", +``` + +Next I turned it into a JSON file by hand as 'bxd_inds.list.json'. + +## Markers + +With GEMMA marker names are listed in the geno file. GEMMA also can use a SNP file that gives the chromosome and location. +Without the SNP filegemma-wrapper complains it needs the SNP/marker annotation file. This is logical because for LOCO it needs to know what chromosome a marker is on. + +The next step is to take the nodes file that and extract all rows from the genotype file that match nodes with chromosomes defined. Andrea is going to deliver all positions for all nodes, but for now we can use what we have. Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt: + +``` +mm10#1#chr3 23209565 93886997 +mm10#1#chr3 23209564 93886999 +mm10#1#chr3 23209563 93887016 +... +``` + +In the genotype file we find, for example + +``` +A23209564-0, A, T, 1.919141867395325, 0.9306930597711228, 1.8201319833577734, 0.7607260422339468, 1.427392726736106, 1.2310230984252724, 1.6633662444541875, 0.6105610229068721, ... +``` + +bit funny, but you get the idea. So we can take the mm10 file and write out the genotype file again for all matching nodes with a matching SNP file that should contain for this node: + +``` +A23209564-0 93886999 3 +``` + +To rewrite above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like + +``` +#! ruby + +ARGF.each_line do |line| + tag,name,pos = line.strip.split(/\t/) + tag =~ /chr(.*)$/ + chrom = $1 + print "A#{name}-0\t#{pos}\t#{chrom}\n" +end +``` + +Now, another problem is that not all SNPs have a position in the genotype file (yet). As we can't display them I can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. That throwaway script looks like + +``` +bimbamfn = ARGV.shift +snpfn = ARGV.shift +snps = {} +open(snpfn).each_line do |snpl| + name = snpl.split(/\t/)[0] + snps[name] = 1 +end +open(bimbamfn).each_line do |line| + marker = line.split(/[,\s]/)[0] + if snps[marker] + print line + end +end +``` + +takes a while to run, but as this is a one-off that does not matter. diff --git a/topics/systems/mariadb/precompute-publishdata.gmi b/topics/systems/mariadb/precompute-publishdata.gmi index 483e019..6b15a2d 100644 --- a/topics/systems/mariadb/precompute-publishdata.gmi +++ b/topics/systems/mariadb/precompute-publishdata.gmi @@ -3356,3 +3356,9 @@ We can also look for details like skewness by adding ?traitid gnt:traitId ?trait ; gnt:skew ?skew . ``` + +# Testing pangenome derived genotypes + +We continue testing new genotypes in this document: + +=> ../genetics/test-pangenome-derived-genotypes |
