| author | Pjotr Prins | 2025-11-29 13:02:46 +0100 |
|---|---|---|
| committer | Pjotr Prins | 2026-01-05 11:12:11 +0100 |
| commit | e5c8bfbb81d223f28bd97c3c025ff06c911a4b83 | |
| tree | 6545e251d7cb8274194cb14c6ad123d4d0c78454 | |
| parent | 6acdea0683ef4655ae6c9a98187e58edc11c8bb1 | |
| download | gn-gemtext-e5c8bfbb81d223f28bd97c3c025ff06c911a4b83.tar.gz | |
Recomputed QTL
| -rw-r--r-- | issues/genetics/speeding-up-gemma.gmi | 8 |
| -rw-r--r-- | topics/genetics/test-pangenome-derived-genotypes.gmi | 107 |
2 files changed, 112 insertions, 3 deletions
diff --git a/issues/genetics/speeding-up-gemma.gmi b/issues/genetics/speeding-up-gemma.gmi
index a6a959b..96046a0 100644
--- a/issues/genetics/speeding-up-gemma.gmi
+++ b/issues/genetics/speeding-up-gemma.gmi
@@ -30,6 +30,14 @@ There is no such thing as a free lunch. So, let's dive in.
 * [ ] Fix sqrt(NaN) when running big file example with -debug
 * [ ] Other improvements...
 
+# Summary
+
+Convert a geno file to mdb with
+
+```
+./bin/geno2mdb.rb mouse_hs1940.geno.txt --eval Gf # convert to floating point
+```
+
 # Analysis
 
 As a test case we'll take one of the runs:
diff --git a/topics/genetics/test-pangenome-derived-genotypes.gmi b/topics/genetics/test-pangenome-derived-genotypes.gmi
index 5594c69..4f806ee 100644
--- a/topics/genetics/test-pangenome-derived-genotypes.gmi
+++ b/topics/genetics/test-pangenome-derived-genotypes.gmi
@@ -18,6 +18,107 @@ For the BXD we have 23M markers(!) whereof 8M *not* on the reference genome.
 * [ ] Reduce GEMMA GRM RAM requirements (not urgent)
 * [ ] Fix -lmm 4 ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt
 
+# Summary
+
+To get the mapping and generate the assoc output in mdb format we run a variant of gemma-wrapper
+
+```
+gemma-batch-run.sh
+```
+
+Next we convert that output to RDF with
+
+```
+../bin/gemma-mdb-to-rdf.rb --header > output.ttl
+../bin/gemma-mdb-to-rdf.rb --anno snps-matched.txt.mdb tmp/panlmm/*-gemma-GWA.tar.xz >> test-run-3000.ttl
+serdi -i turtle -o ntriples test-run-3000.ttl > test-run-3000.n3
+```
+
+(serdi does better than rapper with huge files) and
+copy the file to the virtuoso instance and load it with isql:
+
+```
+cd /export/guix-containers/virtuoso/data/virtuoso/ttl/
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+SQL> ld_dir('/export/data/virtuoso/ttl','test-run-3000.n3','http://pan-test.genenetwork.org');
+Done. -- 3 msec.
+# for testing the validity and optionally delete problematic ones:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+SQL> DELETE from DB.DBA.LOAD_LIST where LL_STATE = 1;
+# commit changes
+SQL> rdf_loader_run ();
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
+34200686
+```
+
+As a test, fetch a table of the traits with their SNPs
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+    gnt:run gn:test .
+?snp gnt:mappedSnp ?traitid ;
+    gnt:locus ?locus ;
+    gnt:lodScore ?lod ;
+    gnt:af ?af .
+?locus rdfs:label ?nodeid ;
+    gnt:chr ?chr ;
+    gnt:pos ?pos .
+FILTER (contains(?nodeid,"Marker") && ?pos < 1000)
+} LIMIT 100
+```
+
+OK, we are ready to run a little workflow. First create a sorted list of IDs.
+
+```
+SELECT DISTINCT ?trait FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+    gnt:run gn:test ;
+    gnt:traitId ?trait.
+}
+```
+
+Sort that list and save it as 'pan-ids-sorted.txt'.
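+
+A minimal sketch of that sort step (assuming the query result was saved with one trait ID per line in a file called pan-ids.txt; the file name is illustrative):
+
+```
+# lexical sort, dropping duplicate IDs
+sort -u pan-ids.txt > pan-ids-sorted.txt
+```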
+
+Next run
+
+```
+../../bin/workflow/qtl-detect-batch-run.sh
+```
+
+and load those in virtuoso. List new QTL
+
+```
+SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
+    ?traitid a gnt:mappedTrait ;
+        gnt:traitId ?t .
+    MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
+    # ?traitid gnt:run gn:test . # use if you want the new QTL
+    ?qtl gnt:mappedQTL ?traitid ;
+        gnt:qtlChr ?chr ;
+        gnt:qtlLOD ?lod ;
+        gnt:qtlStart ?s ;
+        gnt:qtlStop ?e .
+    ?qtl gnt:mappedSnp ?snp .
+    FILTER (?t = "10002" && ?lod >= 5.0 ) .
+} LIMIT 100
+```
+
 # Prior work
 
 For the first traits (presented at CTC'25) gemma was run as
@@ -654,8 +755,8 @@ which are all worth considering (I think). Obviously we could annotate all QTL i
 
 Now two more steps to go:
 
-* [ ] Fetch all mapped traits using SPARQL and write RDF
-* [ ] Compare QTL between datasets and annotate new hits
+* [X] Fetch all mapped traits using SPARQL and write RDF
+* [X] Compare QTL between datasets and annotate new hits
 
 ## Fetch all mapped traits
 
@@ -840,7 +941,7 @@ For the traditional genotypes (gnt:run != gn:test)
 
 Now that we have all QTLs in the DB, as well as the underlying SNPs, one interesting question to ask is which SNPs are repeated across our traits. This, if you remember, is the key idea of reverse genetics. Of course, with our pangenome-derived genotypes, we now have thousands of SNPs per trait. Let's see if we can rank them by number of traits.
 
-For our 1000 traits we map about 7.7M snps with a LOD>5.
+For our 1000 traits we map about 7.7M snps with a LOD>5
 
 # Using sparql from emacs
 
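
Ranking the SNPs by the number of traits they map in, as discussed above, could start from a query along the following lines. This is only a sketch that reuses the gnt:mappedSnp, gnt:locus and gnt:lodScore predicates from the queries in this diff; the actual schema for these runs may differ:

```
PREFIX gnt: <http://genenetwork.org/term/>

# count, for every marker, how many distinct traits have an association at LOD >= 5
SELECT ?locus (COUNT(DISTINCT ?traitid) AS ?ntraits)
FROM <http://pan-test.genenetwork.org> WHERE {
  ?snp gnt:mappedSnp ?traitid ;
       gnt:locus ?locus ;
       gnt:lodScore ?lod .
  FILTER (?lod >= 5.0)
} GROUP BY ?locus ORDER BY DESC(?ntraits) LIMIT 50
```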
