From 5682cd079485237b64407115f247c74ede078644 Mon Sep 17 00:00:00 2001 From: Pjotr Prins Date: Wed, 19 Nov 2025 20:27:04 +0100 Subject: Pangenome genotypes --- issues/genetics/speeding-up-gemma.gmi | 10 +- .../genetics/test-pangenome-derived-genotypes.gmi | 323 ++++++++++++++++++++- 2 files changed, 329 insertions(+), 4 deletions(-) diff --git a/issues/genetics/speeding-up-gemma.gmi b/issues/genetics/speeding-up-gemma.gmi index 306735a..2743f3e 100644 --- a/issues/genetics/speeding-up-gemma.gmi +++ b/issues/genetics/speeding-up-gemma.gmi @@ -20,9 +20,9 @@ Anyway, there is no such thing as a free lunch. So, let's dive in. * [X] Try gzipped version * [X] Run without debug +* [ ] Use lmdb for genotypes * [ ] Optimize openblas for target architecture * [ ] Try a faster malloc library for GEMMA -* [ ] Use lmdb for genotypes * [ ] Other improvements... # Analysis @@ -110,4 +110,10 @@ which is currently not built with arch optimizations (even though Cooperlake sug ## Use lmdb for genotypes -Rather than focussing on gzip, another potential improvement is to use lmdb with mmap. +Rather than focussing on gzip, another potential improvement is to use lmdb with mmap. We am not going to upgrade the original gemma code (which is in maintenance mode). We are going to upgrade the new pangemma project instead: + +=> https://git.genenetwork.org/pangemma/ + +Reason being that this is our experimental project. + +So I just managed to build pangemma/gemma in Guix. Next step is to introduce lmdb genotypes. diff --git a/topics/genetics/test-pangenome-derived-genotypes.gmi b/topics/genetics/test-pangenome-derived-genotypes.gmi index a08c57d..5594c69 100644 --- a/topics/genetics/test-pangenome-derived-genotypes.gmi +++ b/topics/genetics/test-pangenome-derived-genotypes.gmi @@ -460,7 +460,7 @@ Lists: | http://genenetwork.org/id/A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 | "A8828461-0" | "1" | 3304440 | ``` -## Annotating QTLs +## Scoring/annotating QTL Next step is annotating the QTL in RDF. Earlier I wrote a script rdf-analyse-gemma-hits. It uses rapper to read two RDF files (two runs) and annotates the QTL and differences between the files. The code is not pretty: @@ -526,10 +526,329 @@ sys 0m0.000s To get the trait info for 400 traits takes a second. So, that is no big deal. To get the 6K SNPs for one trait also takes a second. Hmmm. That takes hours, compared to the minutes for direct RDF parsing. Before lmdb comes to the rescue we should try running in on the virtuoso server itself. For curl we get 0.5s. Which makes it two hours for 13K traits. But when we run the query using isql it runs in 70ms which totals 15 minutes. That is perfectly fine for running the whole set! -One way is to simply script isql from the command line. It also turns out the ODBC interface can be used from python or ruby. Here an example in R: +One way is to simply script isql from the command line. Meanwhile, it also turns out the ODBC interface can be used from python or ruby. Here an example in R: => https://cran.stat.auckland.ac.nz/web/packages/virtuoso/index.html Not sure if that is fast enough, but perhaps worth trying. So, now we have a way to query the data around a trait in seconds. This means I can rewrite the QTL generator to go by trait. This also allows for a quick turnaround during development (good!). Also I want two scripts: one for computing the QTL and one for annotating the differences. + +Alright. The first script should simply to fetch a trait with its markers from SPARQL and score the QTL (as RDF output). The new script is at + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/sparql-qtl-detect.rb + +First, the query for one trait looks like: + +``` +PREFIX dct: +PREFIX gn: +PREFIX owl: +PREFIX gnc: +PREFIX gnt: +PREFIX sdmx-measure: +PREFIX skos: +PREFIX rdf: +PREFIX rdfs: +PREFIX xsd: +PREFIX qb: +PREFIX xkos: +PREFIX pubmed: + +SELECT ?lod ?af ?nodeid ?chr ?pos FROM WHERE { +?traitid a gnt:mappedTrait; + gnt:run gn:test ; + gnt:traitId "10002" . +?snp gnt:mappedSnp ?traitid ; + gnt:locus ?locus ; + gnt:lodScore ?lod ; + gnt:af ?af . +?locus rdfs:label ?nodeid ; + gnt:chr ?chr ; + gnt:pos ?pos . +} ORDER BY DESC(?lod) +``` + +rendering some 22K markers for trait 10002 as a TSV: + +``` +"lod" "af" "nodeid" "chr" "pos" +7.5 0.547 "A13459298-0" "8" 98658490 +7.1 0.154 "A13402313-0" "8" 96798487 +7 0.432 "A13446492-0" "8" 97355019 +7 0.263 "A13387873-0" "8" 94934820 +7 0.585 "A4794343-0" "1" 172265488 +... +``` + +Earlier with precompute for trait 10002 we got: + +``` +[10002,HK] =>{"1"=>[#], "8"=>[#]} +[10002,LOCO] =>{"1"=>[#, #], "8"=>[#]} +``` + +so the hits are in range, but the LOD may be inflated because of the number of markers. Anyway, this point we are merely concerned with scoring QTL. The first script is simply: + +``` +qtls = QTL::QRanges.new("10002","test") +CSV.foreach(fn,headers: true, col_sep: "\t") do |hit| + qlocus = QTL::QLocus.new(hit["nodeid"],hit["chr"],hit["pos"].to_i,hit["af"].to_f,hit["lod"].to_f) + qtls.add_locus(qlocus) +end +print qtls +``` + +and prints a long list of QTL containing a single hit. + +``` +[10002,test] =>{"1"=>[#, #, #, #, #, #,... +``` + +For trait 10002 tweaking thresholds and rebinning we get + +``` +# +# +# +# +# +# +# +# +# +# +``` + +with a LOD>5.5 cut-off. That seems justified because LOD scores are inflated. Compare this with the earlier mapping using 'traditional' genotypes: + +``` +[10002,LOCO] =>{ +"1"=>[#, + #], +"8"=>[#]} +``` + +we can see the significance of chr8 has gone up with pangenome mapping (relative to chr1) and we find 2 QTL now on chr8, a new one to the left. Chr1 looks similar. We have some other candidates that may or may not be relevant (all narrow!). + +Note this *is* a random trait(!) and suggests the landscape of QTLs will change pretty dramatically. Note also that Andrea will give new genotypes and smoothing to follow. But it is encouraging. + +I played a bit with the QTL output, and for now settled on tracking nodes that have a LOD>5.0. We drop QTL based on the following: + +``` +qtl.lod.max < 6.0 or (qtl.lod.max < 7.5 - qtl.snps.size/2) +``` + +I.e. a single SNP QTL has to have a LOD of 7.0. A 2-SNP QTL has to have a LOD of 6.5. This begets + +``` +[10002,test] =>{ +"1"=>[#], +"4"=>[#], +"8"=>[#], +"10"=>[#], +"12"=>[#]} +``` + +which are all worth considering (I think). Obviously we could annotate all QTL in RDF triples and filter on that using SPARQL. But this makes processing a bit faster without having to deal with too much noise. We can fine tune later. + +Now two more steps to go: + +* [ ] Fetch all mapped traits using SPARQL and write RDF +* [ ] Compare QTL between datasets and annotate new hits + +## Fetch all mapped traits + +``` +SELECT * FROM WHERE { +?traitid a gnt:mappedTrait; + gnt:run gn:test ; + gnt:traitId "10002" . +?snp gnt:mappedSnp ?traitid ; + gnt:locus ?locus ; + gnt:lodScore ?lod ; + gnt:af ?af . +?locus rdfs:label ?nodeid ; + gnt:chr ?chr ; + gnt:pos ?pos . +} ORDER BY DESC(?lod) +``` + +The first step is to fetch this data. Let's try SPARQL over the web first. + +## Compare QTL sets + +The previous code I wrote to compare QTLs essentially walks the QTLs and annotates a new QTL if there is no overlap between the two sets. Again, this code is too convoluted: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/18e7a3ac8a11becba84325499116621ad095f28e/lib/qtlrange.rb#L190 + +The principle is straightforward, however. The code for reading the SPARQL output for a trait is + +``` + CSV.foreach(fn,headers: true, col_sep: "\t") do |hit| + trait_id = hit["traitid"] if not trait_id + lod = hit["lod"].to_f + if lod > 5.0 # set for pangenome input + qlocus = QTL::QLocus.new(hit["snp"],hit["chr"],hit["pos"].to_f/10**6,hit["af"].to_f,lod) + qtls.add_locus(qlocus) + end + end +``` + +So we can use SPARQL to build two sets on the fly and then run the diff. + +Actually, when thinking about this I realised it should not be too hard to do in SPARQL to find the 'new' QTL. + +``` + +SELECT * WHERE { +?traitid a gnt:mappedTrait ; + gnt:traitId "10002" . +} +http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d +http://genenetwork.org/id/HK_trait_BXDPublish_10002_gemma_GWA_hk_assoc_txt +http://genenetwork.org/id/GEMMAMapped_test_LOCO_BXDPublish_10002_gemma_GWA_82087f23 +``` + +lists the three versions of compute for traits. To fetch all QTL for first mapping: + +``` +SELECT ?qtl ?lod ?chr ?start ?stop (count(?snp) as ?snps) WHERE { +?traitid a gnt:mappedTrait ; + gnt:traitId "10002" . +?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?start ; + gnt:qtlStop ?stop ; + gnt:qtlLOD ?lod . +?qtl gnt:mappedSnp ?snp . +} +``` + +gets 3 QTL. Now I did not store HK in RDF, but to show the filtering principle we can fetch two traits and compare QTL. +The following gets two QTL from trait "10002" on CHR1 and holds that against that of trait "10079": + +``` +SELECT ?t ?s1 ?e1 ?t2 ?s2 ?e2 WHERE { + ?traitid a gnt:mappedTrait ; + gnt:traitId ?t . + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?s1 ; + gnt:qtlStop ?e1 . + { + SELECT * WHERE { + ?tid a gnt:mappedTrait ; + gnt:traitId "10079" ; + gnt:traitId ?t2 . + ?qtl2 gnt:mappedQTL ?tid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?s2 ; + gnt:qtlStop ?e2 . + } + } + FILTER (?t = "10002") . +} LIMIT 10 + +"10002",171.172,183.154,"10079",172.235,172.235 +"10002",72.2551,73.3771,"10079",172.235,172.235 +``` + +Note we pivot on two traits and one chromosome, so we find all pairs. +To say if a QTL is *new* or different we can add another FILTER + +``` +FILTER ((?s2 > ?s1 && ?e2 > ?e1) || (?s2 < ?s1 && ?e2 < ?e1)) . +"t","s1","e1","t2","s2","e2" +"10002",72.2551,73.3771,"10079",172.235,172.235 +``` + +that says that this ?qtl2 does not overlap with ?qtl. I.e. here it is a new QTL! + +This new insight means we should should store *all* QTL in RDF, including the single SNP ones, because it is easy to filter on them. Note that there may be a more elegant way to query traits pairwise. This is just the first thing that worked. It may need more tuning if there are more than two QTL on a chromosome. E.g. the comparison between 10002 and 10413 finds: + +``` +"t","s1","e1","t2","s2","e2" +"10002",72.2551,73.3771,"10413",32.3113,42.4624 +"10002",171.172,183.154,"10413",171.04,171.041 +"10002",171.172,183.154,"10413",32.3113,42.4624 +"10002",72.2551,73.3771,"10413",171.04,171.041 +``` + +I.e. it does find new QTL here and you still need to do a little set analysis. In words you should be able to "remove all overlapping QTL from a chromosome". Maybe we can filter the other way - select overlapping QTL and remove those from the result set. + +``` +BIND ((?s2 >= ?s1 && ?e2 <= ?e1) || (?s1 >= ?s2 && ?e1 <= ?e2) as ?overlap) . +"10002",171.172,183.154,"10079",172.235,172.235,1 +"10002",72.2551,73.3771,"10079",172.235,172.235,0 +``` + +now drop all ?t's that are overlapping. It appears to work with: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/show-qtls-two-traits.sparql + +I'll need to test it on the pangenome set. + +# Listing QTL + +To get all QTL from a run you can use something like + +``` +SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE { + ?traitid a gnt:mappedTrait ; + gnt:traitId ?t . + MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL + # ?traitid gnt:run gn:test . # use if you want the new QTL + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlLOD ?lod ; + gnt:qtlStart ?s ; + gnt:qtlStop ?e . + ?qtl gnt:mappedSnp ?snp . + FILTER (?t = "10002" && ?lod >= 5.0 ) . +} LIMIT 100 +``` + +Note we filter on a trait name and LOD score. + +For panQTL (gnt:run == gn:test) this results in + +``` +"t" "lod" "snps" "chr" "start" "end" +"10002" 6.4 3 "15" 87.671663 98.028911 +"10002" 6.4 12 "4" 56.498971 147.86044 +"10002" 7 69 "1" 3.099543 192.718161 +"10002" 7.5 2774 "8" 34.303454 116.023702 +"10002" 6.2 7 "10" 82.334108 105.062097 +"10002" 6.2 2 "3" 47.644513 82.451061 +"10002" 6.2 1 "3" 130.145235 130.145235 +"10002" 6 2 "10" 13.442071 13.442088 +"10002" 6.2 9 "12" 21.707644 72.57041 +``` + +For the traditional genotypes (gnt:run != gn:test) + +``` +"t" "lod" "snps" "chr" "start" "end" +"10002" 5.3 91 "1" 171.172 183.154 +"10002" 5.1 15 "1" 72.2551 73.3771 +``` + + +# Listing SNPs + +Now we have all QTLs in the DB, as well as underlying SNPs, one interesting question to ask is what SNPs are repeated across our traits. This, if you remember, is the key idea of reversed genetics. +Of course, with our pangenome-derived genotypes, we now have thousands of SNPs per trait. Let's see if we can rank them by number of traits. + +For our 1000 traits we map about 7.7M snps with a LOD>5. + + +# Using sparql from emacs + +Note: if you are doing SPARQL quite a bit, I recommend using sparql-mode in emacs! It is easy, faster and you can use git :) + +=> https://github.com/ljos/sparql-mode + +``` +M-x sparql-query-region [ENTER] http://sparql-test.genenetwork.org/sparql/ [ENTER] +``` -- cgit 1.4.1