Diffstat (limited to 'topics')
-rw-r--r-- topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi | 18
-rw-r--r-- topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi | 74
-rw-r--r-- topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi | 102
-rw-r--r-- topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi | 127
-rw-r--r-- topics/ADR/gn3/000-add-test-cases-for-rdf.gmi | 21
-rw-r--r-- topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi | 49
-rw-r--r-- topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi | 32
-rw-r--r-- topics/ai/aider.gmi | 19
-rw-r--r-- topics/ai/gn_agent.gmi | 790
-rw-r--r-- topics/ai/ontogpt.gmi | 7
-rw-r--r-- topics/authentication/architecture.gmi | 15
-rw-r--r-- topics/authentication/development-guide.gmi | 60
-rw-r--r-- topics/authentication/permission_hooks.gmi | 62
-rw-r--r-- topics/biohackathon/biohackrxiv2024.gmi | 7
-rw-r--r-- topics/data/R-qtl2-format-notes.gmi (renamed from topics/R-qtl2-format-notes.gmi) | 39
-rw-r--r-- topics/data/epochs.gmi | 153
-rw-r--r-- topics/data/precompute/steps.gmi | 28
-rw-r--r-- topics/database/mariadb-database-architecture.gmi | 830
-rw-r--r-- topics/database/setting-up-local-development-database.gmi | 7
-rw-r--r-- topics/database/sql.svg | 2558
-rw-r--r-- topics/deploy/configuring-nginx-on-host.gmi | 220
-rw-r--r-- topics/deploy/deployment.gmi | 9
-rw-r--r-- topics/deploy/genecup.gmi | 69
-rw-r--r-- topics/deploy/installation.gmi | 2
-rw-r--r-- topics/deploy/machines.gmi | 10
-rw-r--r-- topics/deploy/our-virtuoso-instances.gmi | 2
-rw-r--r-- topics/deploy/paths-in-flask-applications.gmi | 22
-rw-r--r-- topics/deploy/setting-up-or-migrating-production-across-machines.gmi | 202
-rw-r--r-- topics/deploy/uthsc-email.gmi | 64
-rw-r--r-- topics/deploy/uthsc-vpn-with-free-software.gmi | 68
-rw-r--r-- topics/deploy/uthsc-vpn.scm | 154
-rw-r--r-- topics/documentation/guides_vs_references.gmi | 24
-rw-r--r-- topics/editing/case-attributes.gmi | 110
-rw-r--r-- topics/editing/case_attributes.gmi | 180
-rw-r--r-- topics/engineering/improving-wiki-rif-search-in-genenetwork.gmi | 119
-rw-r--r-- topics/engineering/instrumenting-ram-usage.gmi | 32
-rw-r--r-- topics/engineering/setting-up-a-basic-pre-commit-hook-for-linting-scheme-files.gmi | 31
-rw-r--r-- topics/engineering/using-architecture-decision-records-in-genenetwork.gmi | 56
-rw-r--r-- topics/engineering/working-with-virtuoso-locally.gmi | 70
-rw-r--r-- topics/genenetwork-releases.gmi | 77
-rw-r--r-- topics/genenetwork/Case_Attributes_GN2 | 2
-rw-r--r-- topics/genenetwork/genenetwork-services.gmi | 122
-rw-r--r-- topics/genenetwork/genenetwork-streaming-functionality.gmi | 43
-rw-r--r-- topics/genenetwork/publications-on-genenetwork.gmi | 14
-rw-r--r-- topics/genenetwork/starting_gn1.gmi | 4
-rw-r--r-- topics/genetics/pangenotypes.gmi | 52
-rw-r--r-- topics/genetics/standards/gemma-genotype-format.gmi | 99
-rw-r--r-- topics/genetics/test-pangenome-derived-genotypes.gmi | 1005
-rw-r--r-- topics/genome-browser/hoot-genome-browser.gmi | 21
-rw-r--r-- topics/gn-learning-team/next-steps.gmi | 48
-rw-r--r-- topics/gn-uploader/genome-details.gmi | 42
-rw-r--r-- topics/gn-uploader/genotypes-assemblies-markers-and-genenetwork.gmi | 40
-rw-r--r-- topics/gn-uploader/samplelist-details.gmi | 17
-rw-r--r-- topics/gn-uploader/types-of-data.gmi | 63
-rw-r--r-- topics/guix/genenetwork-fixating-guix.gmi | 34
-rw-r--r-- topics/guix/guix-profiles.gmi | 27
-rw-r--r-- topics/guix/packages.gmi | 16
-rw-r--r-- topics/gunicorn/deploying-app-under-url-prefix.gmi | 121
-rw-r--r-- topics/hpc/octopus/slurm-user-guide.gmi | 1
-rw-r--r-- topics/lmms/bulklmm/readme.gmi | 1
-rw-r--r-- topics/lmms/gemma/permutations.gmi | 1014
-rw-r--r-- topics/lmms/rqtl2/genenetwork-rqtl2-implementation.gmi | 71
-rw-r--r-- topics/lmms/rqtl2/gn-rqtl-design-implementation.gmi | 203
-rw-r--r-- topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi | 84
-rw-r--r-- topics/lmms/rqtl2/using-rqtl2.gmi | 44
-rw-r--r-- topics/meetings/gn-kilifi-2025-standup.gmi | 177
-rw-r--r-- topics/meetings/gn-nairobi-2025.gmi | 17
-rw-r--r-- topics/meetings/jnduli_bmunyoki.gmi | 457
-rw-r--r-- topics/octopus/lizardfs/lizard-maintenance.gmi (renamed from topics/octopus/lizardfs/README.gmi) | 113
-rw-r--r-- topics/octopus/maintenance.gmi | 98
-rw-r--r-- topics/octopus/moosefs/moosefs-maintenance.gmi | 252
-rw-r--r-- topics/octopus/octopussy-needs-love.gmi | 266
-rw-r--r-- topics/octopus/recent-rust.gmi | 76
-rw-r--r-- topics/octopus/set-up-guix-for-new-users.gmi | 38
-rw-r--r-- topics/octopus/slurm-upgrade.gmi | 89
-rw-r--r-- topics/pangenome/impg/impg-agc-bindings.gmi | 246
-rw-r--r-- topics/programming/autossh-for-keeping-ssh-tunnels.gmi | 65
-rw-r--r-- topics/programming/better-logging.gmi | 17
-rw-r--r-- topics/rust/guix-rust-bootstrap.gmi | 173
-rw-r--r-- topics/systems/backup-drops.gmi | 87
-rw-r--r-- topics/systems/backups-with-borg.gmi | 449
-rw-r--r-- topics/systems/ci-cd.gmi | 84
-rw-r--r-- topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi | 61
-rw-r--r-- topics/systems/dns-changes.gmi | 1
-rw-r--r-- topics/systems/hpc/octopus-maintenance.gmi | 69
-rw-r--r-- topics/systems/hpc/performance.gmi | 17
-rw-r--r-- topics/systems/linux/GPU-on-balg01.gmi | 201
-rw-r--r-- topics/systems/linux/add-boot-partition.gmi | 52
-rw-r--r-- topics/systems/linux/adding-nvidia-drivers-penguin2.gmi | 74
-rw-r--r-- topics/systems/mariadb/mariadb.gmi | 11
-rw-r--r-- topics/systems/mariadb/precompute-mapping-input-data.gmi | 33
-rw-r--r-- topics/systems/mariadb/precompute-publishdata.gmi | 3370
-rw-r--r-- topics/systems/migrate-p2.gmi | 12
-rw-r--r-- topics/systems/restore-backups.gmi | 2
-rw-r--r-- topics/systems/screenshot-github-webhook.png | bin 0 -> 177112 bytes
-rw-r--r-- topics/systems/security.gmi | 61
-rw-r--r-- topics/systems/synchronising-the-different-environments.gmi | 68
-rw-r--r-- topics/systems/update-production-checklist.gmi | 197
-rw-r--r-- topics/systems/virtuoso.gmi | 40
-rw-r--r-- topics/testing/mechanical-rob.gmi | 73
-rw-r--r-- topics/xapian/xapian-indexing.gmi | 42
101 files changed, 16454 insertions(+), 341 deletions(-)
diff --git a/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi b/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi
new file mode 100644
index 0000000..05b2b6a
--- /dev/null
+++ b/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi
@@ -0,0 +1,18 @@
+# [gn-guile/ADR-000] Extend Markdown Editor to push to Git Bare Repo
+
+* author: bonfacem
+* status: accepted
+* reviewed-by: alexm, jnduli
+
+## Context
+
+The gn-guile markdown editor currently reads from normal git repositories.  However, for GN's self-hosted git repository, we use bare repositories.  Bare repositories only store the git objects, therefore we can't edit files directly.
+
+## Decision
+
+gn-guile and the cgit instance run on the same server.  We will have one normal repository, configured by "CURRENT_REPO_PATH", which holds the raw files, and one bare repository, configured by "CGIT_REPO_PATH".  We will make edits in the normal repository and, once that is done, push locally to the cgit instance.
+
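+A minimal sketch of the flow (the paths and branch name here are hypothetical; the real values come from "CURRENT_REPO_PATH" and "CGIT_REPO_PATH"):
+
+```
+cd "$CURRENT_REPO_PATH"            # normal repository holding the raw files
+# ... the markdown editor writes its changes here ...
+git add topics/example.gmi
+git commit -m "Edit topics/example.gmi via the markdown editor"
+git push "$CGIT_REPO_PATH" master  # local push to the bare repository served by cgit
+```
+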
+## Consequences
+
+* When creating the gn-guile container, this introduces extra complexity: we will have to make sure that the container has the correct write access to the bare repository.
+* With this, we are coupled to our GN git set-up.
diff --git a/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..1e3ee6a
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,74 @@
+# [gn-transform-databases/ADR-000] Remodel GeneRIF Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+In RDF 1.1 Turtle, you have to use a QName as the subject.  As such, you cannot have a string literal as the subject.  In simpler terms, this is not possible:
+
+```
+"Unique expression signature of a system that includes the subiculum, layer 6 in cortex ventral and lateral to dorsal striatum, and the endopiriform nucleus. Expression in cerebellum is apparently limited to Bergemann glia ABA" dct:created "2007-08-31T13:00:47"^^xsd:datetime .
+```
+
+As of commit "397745b554e0", a work-around was to manually create a unique identifier for each comment in the GeneRIF table.  This identifier was created by combining GeneRIF.Id with GeneRIF.VersionId.  One challenge with this is that we create some coupling with MySQL's unique generation of the GeneRIF.Id column.  Here's an example of the resulting (snipped) Turtle entries:
+
+```
+gn:wiki-352-0 rdfs:comment "Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia." .
+gn:wiki-352-0 rdf:type gnc:GNWikiEntry .
+gn:wiki-352-0 gnt:symbol gn:symbolPitpna .
+gn:wiki-352-0 dct:created "2006-03-10T15:39:29"^^xsd:datetime .
+gn:wiki-352-0 gnt:belongsToSpecies gn:Mus_musculus .
+gn:wiki-352-0 dct:hasVersion "0"^^xsd:int .
+gn:wiki-352-0 dct:identifier "352"^^xsd:int .
+gn:wiki-352-0 gnt:initial "BAH" .
+gn:wiki-352-0 foaf:mbox "XXX@XXX.XXX" .
+gn:wiki-352-0 dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) .
+gn:wiki-352-0 gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) .
+```
+
+## Decision
+
+We want to avoid manually generating a unique identifier for each WIKI comment.  We should instead have that UID be a blank node reference that we don't care about and use predicateObjectLists as an idiom for representing string literals that can't be subjects.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList Predicate Object Lists
+
+The above transform (gn:wiki-352-0) would now be represented as:
+
+```
+[ rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en]  rdf:type gnc:GNWikiEntry ;
+	gnt:belongsToSpecies gn:Mus_musculus ;
+	dct:created "2006-03-10 12:39:29"^^xsd:datetime ;
+	dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) ;
+	foaf:mbox <XXX@XXX.XXX> ;
+	dct:identifier "352"^^xsd:integer ;
+	dct:hasVersion "0"^^xsd:integer ;
+	gnt:initial "BAH" ;
+	gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) ;
+	gnt:symbol gn:symbolPitpna .
+```
+
+The above can be loosely translated as:
+
+```
+_:comment rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en .
+_:comment rdf:type gnc:GNWikiEntry .
+_:comment dct:created "2006-03-10 12:39:29"^^xsd:datetime .
+_:comment dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) .
+_:comment foaf:mbox <XXX@XXX.XXX> .
+_:comment dct:identifier "352"^^xsd:integer .
+_:comment dct:hasVersion "0"^^xsd:integer .
+_:comment gnt:initial "BAH" .
+_:comment gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) .
+_:comment gnt:symbol gn:symbolPitpna .
+```
+
+## Consequences
+
+* Update SPARQL on tux02 and tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
+* Reduction in size of the final output, and faster transform time, because predicateObjectLists produce more terse RDF.
+
+## Rejection Rationale
+
+This proposal was rejected because relying on blank nodes as identifiers is opaque and not human-readable.  We want to use human-readable identifiers where possible.
diff --git a/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..073525a
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,102 @@
+# [gn-transform-databases/ADR-001] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+We can model RIF comments using predicateObject lists as described in:
+
+=> https://issues.genenetwork.org/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists [ADR/gn-transform-databases] Remodel GeneRIF Metadata Using predicateObject Lists
+
+However, currently for NCBI RIFs we represent comments as blank nodes:
+
+```
+gn:symbolsspA rdfs:comment [
+	rdf:type gnc:NCBIWikiEntry ;
+	rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+	gnt:belongsToSpecies gn:Mus_musculus ;
+	skos:notation taxon:511145 ;
+	gnt:hasGeneId generif:944744 ;
+	dct:hasVersion '1'^^xsd:int ;
+	dct:references pubmed:97295 ;
+	...
+	dct:references pubmed:15361618 ;
+	dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+	rdf:type gnc:NCBIWikiEntry ;
+	rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+	gnt:belongsToSpecies gn:Mus_musculus ;
+	skos:notation taxon:511145 ;
+	gnt:hasGeneId generif:944780 ;
+	dct:hasVersion '1'^^xsd:int ;
+	dct:references pubmed:320034 ;
+	...
+	dct:references pubmed:16369539 ;
+	dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+
+```
+
+Here we see a lot of duplicated entries for the same symbols.  For the above 2 entries, everything is exactly the same except for the "gnt:hasGeneId" and "dct:references" predicates.
+
+## Decision
+
+We use predicateObjectLists with blankNodePropertyLists as an idiom to represent the GeneRIF comments.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList predicateObjectList
+=> https://www.w3.org/TR/turtle/#grammar-production-blankNodePropertyList blankNodePropertyList
+
+In so doing, we can de-duplicate the entries demonstrated above.  A representation of the above RDF Turtle triples would be:
+
+```
+[ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ]
+rdf:type gnc:NCBIWikiEntry ;
+dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+gnt:belongsToSpecies gn:Mus_musculus ;
+skos:notation taxon:511145 ;
+dct:hasVersion '1'^^xsd:int ;
+rdfs:seeAlso [
+	gnt:hasGeneId generif:944744 ;
+	gnt:symbol gn:symbolsspA ;
+	dct:references ( pubmed:97295 ... pubmed:15361618 ) ;
+] ;
+rdfs:seeAlso [
+	gnt:hasGeneId generif:944780 ;
+	gnt:symbol gn:symbolaraC ;
+	dct:references ( pubmed:320034 ... pubmed:16369539 ) ;
+] .
+```
+
+The above would translate to the following triples:
+
+```
+_:comment rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string .
+_:comment rdf:type gnc:NCBIWikiEntry .
+_:comment dct:created "2007-11-06T00:39:00"^^xsd:datetime .
+_:comment gnt:belongsToSpecies gn:Mus_musculus .
+_:comment skos:notation taxon:511145 .
+_:comment dct:hasVersion '1'^^xsd:int .
+_:comment rdfs:seeAlso _:metadata1 .
+_:comment rdfs:seeAlso _:metadata2 .
+_:metadata1 gnt:hasGeneId generif:944744 .
+_:metadata1 gnt:symbol gn:symbolsspA .
+_:metadata1 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+_:metadata2 gnt:hasGeneId generif:944780 .
+_:metadata2 gnt:symbol gn:symbolaraC .
+_:metadata2 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+Beyond that, we intentionally use an RDF collection (an ordered list) to store the PubMed references.
+
+## Consequences
+
+* De-duplication of comments during the transform while retaining the integrity of the RIF metadata.
+* Because of the terseness, there is less work during the I/O-heavy transform.
+* Update SPARQL on tux02 and tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
+
+## Rejection Rationale
+
+This proposal was rejected because relying on blank nodes as identifiers is opaque and not human-readable.  We want to use human-readable identifiers where possible.
diff --git a/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
new file mode 100644
index 0000000..ac06fc1
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
@@ -0,0 +1,127 @@
+# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact
+
+* author: bonfacem
+* status: proposal
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol:
+
+```
+gn:symbolsspA rdfs:comment [
+	rdf:type gnc:NCBIWikiEntry ;
+	rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+	gnt:belongsToSpecies gn:Mus_musculus ;
+	skos:notation taxon:511145 ;
+	gnt:hasGeneId generif:944744 ;
+	dct:hasVersion '1'^^xsd:int ;
+	dct:references pubmed:97295 ;
+	...
+	dct:references pubmed:15361618 ;
+	dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+	rdf:type gnc:NCBIWikiEntry ;
+	rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+	gnt:belongsToSpecies gn:Mus_musculus ;
+	skos:notation taxon:511145 ;
+	gnt:hasGeneId generif:944780 ;
+	dct:hasVersion '1'^^xsd:int ;
+	dct:references pubmed:320034 ;
+	...
+	dct:references pubmed:16369539 ;
+	dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+```
+
+Moreover, we also store all the different versions of a comment:
+
+```
+mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G
+*************************** 1. row ***************************
+ SpeciesId: 1
+     TaxID: 7955
+    GeneId: 323473
+    symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+   comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells
+ VersionId: 1
+*************************** 2. row ***************************
+ SpeciesId: 1
+     TaxID: 7955
+    GeneId: 323473
+    symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+   comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons
+ VersionId: 2
+```
+
+## Decision
+
+First, we should only store the latest version of a given RIF entry and ignore all other versions.  RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns: SpeciesId, GeneId, PubMed_ID, createtime, and VersionId.  Since we are storing the latest version of a given RIF entry, we drop the version identifier during the RDF transform.
+
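+A sketch of the kind of query that keeps only the latest version (assuming MariaDB and the columns shown above; not necessarily the query used by the actual transform):
+
+```
+-- Keep only the row with the highest VersionId per unique RIF entry.
+SELECT g.*
+FROM GeneRIF_BASIC g
+JOIN (SELECT SpeciesId, GeneId, PubMed_ID, createtime,
+             MAX(VersionId) AS VersionId
+      FROM GeneRIF_BASIC
+      GROUP BY SpeciesId, GeneId, PubMed_ID, createtime) latest
+USING (SpeciesId, GeneId, PubMed_ID, createtime, VersionId);
+```
+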
+We use a unique identifier for a given comment, and use that as a triple's QName:
+
+> gn:rif-<speciesId>-<GeneId>
+
+Finally, instead of:
+
+```
+<symbol> predicate <comment metadata>
+```
+
+We use:
+
+```
+<comment-uid> predicate object ;
+              ... (more metadata) .
+```
+
+An example triple would take the form:
+
+```
+gn:rif-1-511145 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145 skos:notation taxon:511145 .
+gn:rif-1-511145 rdfs:seeAlso [
+    gnt:hasGeneId generif:944744 ;
+    gnt:symbol "sspA" ;
+    dct:references ( pubmed:97295 ... pubmed:15361618 )
+] .
+gn:rif-1-511145 rdfs:seeAlso [
+    gnt:hasGeneId generif:944780 ;
+    gnt:symbol "araC" ;
+    dct:references ( pubmed:320034 ... pubmed:16369539 )
+] .
+```
+
+To efficiently store GeneIds, symbols and references, we use blank nodes.  This reduces redundancy and simplifies the triples compared to including these details within the subject, as in the more verbose alternative below:
+
+```
+gn:rif-1-511145-944744 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944744 skos:notation taxon:511145 .
+gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 .
+gn:rif-1-511145-944744 gnt:symbol "sspA" .
+gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+
+gn:rif-1-511145-944780 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944780 skos:notation taxon:511145 .
+gn:rif-1-511145-944780 gnt:hasGeneId generif:944780 .
+gn:rif-1-511145-944780 gnt:symbol "araC" .
+gn:rif-1-511145-944780 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+## Consequences
+
+* More complex SQL query required for the transform.
+* De-duplication of RIF entries during the transform.
+* Because of the terseness, there is less work during the I/O-heavy transform.
+* Update SPARQL on tux02 and tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
diff --git a/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi b/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi
new file mode 100644
index 0000000..43ac2ba
--- /dev/null
+++ b/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi
@@ -0,0 +1,21 @@
+# [gn3/ADR-000] Add RDF Test Cases
+
+* author: bonfacem
+* status: proposed
+* reviewed-by: jnduli
+
+## Context
+
+We have no way of ensuring the integrity of our SPARQL queries in GN3.  As such, GN3 is fragile to breaking changes when the TTL files are updated.
+
+## Decision
+
+In Virtuoso, we load all our data into a default named graph: <http://genenetwork.org>.  For SPARQL/RDF tests, we should upload test TTL files to a test named graph: <http://cd-test.genenetwork.org>, and run our RDF unit tests against that named graph.
+
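+As a sketch, a test TTL file could be loaded into the test graph through Virtuoso's graph CRUD endpoint (assuming a local instance with the default "dba" credentials; "test-data.ttl" is a stand-in file name):
+
+```
+curl --digest --user dba:dba -T test-data.ttl \
+  "http://localhost:8890/sparql-graph-crud-auth?graph-uri=http://cd-test.genenetwork.org"
+```
+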
+## Consequences
+
+* Extra bootstrapping to load ttl files when running the test.
+* Extra documentation to GN developers on how to run virtuoso locally to get the tests running.
+* Testing against gn-machines to make sure that all things run accordingly.
+* Extra maintenance costs to keep the TTL files in lockstep with the latest RDF changes during re-modeling.
+* Improvement in GN3 reliability.
diff --git a/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi b/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi
new file mode 100644
index 0000000..0910415
--- /dev/null
+++ b/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi
@@ -0,0 +1,49 @@
+# [gn3/ADR-001] Remove Stack Traces in GN3
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: jnduli, zach, pjotr, fredm
+
+## Context
+
+Currently, GN3 error responses include stack traces:
+
+```
+def add_trace(exc: Exception, jsonmsg: dict) -> dict:
+    """Add the traceback to the error handling object."""
+    return {
+        **jsonmsg,
+        "error-trace": "".join(traceback.format_exception(exc))
+    }
+
+
+def page_not_found(pnf):
+    """Generic 404 handler."""
+    current_app.logger.error("Handling 404 errors", exc_info=True)
+    return jsonify(add_trace(pnf, {
+        "error": pnf.name,
+        "error_description": pnf.description
+    })), 404
+
+
+def internal_server_error(pnf):
+    """Generic 500 handler."""
+    current_app.logger.error("Handling internal server errors", exc_info=True)
+    return jsonify(add_trace(pnf, {
+        "error": pnf.name,
+        "error_description": pnf.description
+    })), 500
+```
+
+
+## Decision
+
+Stack traces have the potential to allow malicious actors to compromise our system by giving them more context.  As such, we should send a useful description of what went wrong, log the stack traces in our logs, and send an appropriate error status code.  We can then use the logs to troubleshoot the system.
+
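+A sketch of what a handler would then look like (hypothetical; it mirrors the handlers quoted above, minus the trace):
+
+```
+def internal_server_error(exc):
+    """Generic 500 handler: log the trace server-side, return a terse body."""
+    current_app.logger.error("Handling internal server errors", exc_info=True)
+    return jsonify({
+        "error": exc.name,
+        "error_description": exc.description,
+    }), 500
+```
+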
+## Consequences
+
+* Lockstep update in GN2 UI on how we handle GN3 errors.
+
+## Rejection Rationale
+
+The proposal to remove stack traces from error responses was rejected because they are essential for troubleshooting, especially when issues are difficult to reproduce or production logs are inaccessible.  Stack traces provide immediate error context, and removing them would complicate debugging by requiring additional effort to link logs with specific requests; a trade-off we are not willing to make at the moment.
diff --git a/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi b/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi
new file mode 100644
index 0000000..a8026ce
--- /dev/null
+++ b/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi
@@ -0,0 +1,32 @@
+# [gn3/ADR-002] Move RDF Test Cases to Build Container
+
+* author: bonfacem
+* status: accepted
+* reviewed-by: jnduli
+
+## Context
+
+GN3 RDF tests are run against the CD's virtuoso instance.  As such, we need to set special parameters when running tests:
+
+```
+SPARQL_USER = "dba"
+SPARQL_PASSWORD = "dba"
+SPARQL_AUTH_URI="http://localhost:8890/sparql-auth/"
+SPARQL_CRUD_AUTH_URI="http://localhost:8890/sparql-graph-crud-auth"
+FAHAMU_AUTH_TOKEN="XXXXXX"
+```
+
+This extra bootstrapping when running tests needs care, and it locks tests to CD or to special configuration when running locally.  This leads to fragile tests that cause CD to break.  Moreover, to add tests to CD, we would have to add extra g-exps to gn-machines.
+
+This ADR is related to:
+
+=> /topics/ADR/gn3/000-add-test-cases-for-rdf.gmi gn3/ADR-000.
+
+## Decision
+
+Move tests to the test build phase of building the genenetwork3 package.  These tests are added in the ".guix/genenetwork3-all-tests.scm" file instead of the main "genenetwork3" package definition in guix-bioinformatics.  This way, we have all our "light" tests, i.e. unit tests, running in guix-bioinformatics, while all our heavier tests, in this case the RDF tests, run in CD.
+
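+A rough sketch of what ".guix/genenetwork3-all-tests.scm" could contain (hypothetical; the actual file defines the real phases and the Virtuoso setup):
+
+```
+;; Hypothetical sketch: a package variant whose extra phase runs the
+;; heavier RDF tests against a Virtuoso started in the background.
+(define genenetwork3-all-tests
+  (package
+    (inherit genenetwork3)
+    (name "genenetwork3-all-tests")
+    (arguments
+     (substitute-keyword-arguments (package-arguments genenetwork3)
+       ((#:phases phases)
+        #~(modify-phases #$phases
+            (add-after 'check 'rdf-tests
+              (lambda _
+                ;; start a local virtuoso and load the test TTL files here,
+                ;; then run only the RDF test suite:
+                (invoke "pytest" "-k" "rdf")))))))))
+```
+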
+## Consequences
+
+* Extra bootstrapping to gn3's .guix/genenetwork3-package.scm to get tests working.
+* GN3 RDF tests refactoring to use a virtuoso instance running in the background while tests are running.
diff --git a/topics/ai/aider.gmi b/topics/ai/aider.gmi
new file mode 100644
index 0000000..00845c8
--- /dev/null
+++ b/topics/ai/aider.gmi
@@ -0,0 +1,19 @@
+# Aider
+
+=> https://aider.chat/
+
+```
+apt-get install python3-venv # or use guix
+python3 -m venv ~/opt/python-aider
+~/opt/python-aider/bin/python3 -m pip install aider-install
+export PATH="/home/wrk/.local/bin:$PATH"
+~/opt/python-aider/bin/aider-install
+```
+
+Installed 1 executable: aider
+Executable directory /home/wrk/.local/bin is already in PATH
+
+```
+aider --model sonnet --api-key anthropic=sk-ant...
+aider --model gpt-4o --openai-api-key aa...
+```
diff --git a/topics/ai/gn_agent.gmi b/topics/ai/gn_agent.gmi
new file mode 100644
index 0000000..2b789c9
--- /dev/null
+++ b/topics/ai/gn_agent.gmi
@@ -0,0 +1,790 @@
+# Build an AI system for GN
+
+## Tags
+* type: feature
+* assigned: johannesm
+* priority: medium
+* status: in progress
+* keywords: llm, rag, ai, agent
+
+## Description
+
+The aim is to build an AI system/agent/RAG able to digest mapping results and metadata in GN for scaling up analyses. This is not quite possible at the moment, given that one still needs to dig through and compare that type of information manually. And the data in GN is somewhat big for such an approach :)
+
+I made an attempt at using deep learning for my Masters project. It could work, but required further processing of the results for interpretation. Not quite handy! Instead, we want a system which takes care of all the work (at least most of it) and that we can understand. This is how transformers and LLMs came into the picture.
+
+This work is an extension of the GNQA system initiated by Shelby and Pjotr. 
+
+## Tasks
+* [X] Look for transformer model ready for use and try
+* [X] Build a RAG system and test with small corpus of mapping results
+* [X] Experiment with actual mapping results and metadata
+* [X] Move from RAG to agent
+* [X] Optimize AI system
+* [] Scale analysis to more data
+* [] Compare performance of open LLMs with Claude in the system
+
+
+
+
+### Look for transformer model ready for use and try
+
+Given the success of transformers, I was first encouraged by Pjotr to look for a model that can support different types of data, i.e. numerical (mapping results) vs textual (metadata).
+
+I found TAPAS which:
+* takes data of different types in tabular format
+* takes a query or question in the form of text
+* performs operations on rows of the data table
+* retrieves relevant information
+* returns an answer to the original query
+
+Experiments were ongoing when Rob found, with the help of Claude, that this architecture would not go far. I know, we used an AI to assist our work on AI (at least we did not ask an AI to do the job from the get-go :))
+But it was a good point. TAPAS is relatively old and a lot of progress has been made with LLMs and agents since!
+
+To take advantage of all the progress made with LLMs, I needed to find a way to have only text data. LLMs are trained to understand and work with text. Metadata, being RDF, is already in text format. I only needed to convert the mapping results to text. It is a detour worth taking if it can give more flexibility and save development time!
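+
+For instance, one mapping-result row can be flattened into a sentence (a hypothetical sketch; the field names are made up, and the actual corpus format is shown later):
+
+```
+def row_to_text(row: dict) -> str:
+    # row: one association result, e.g.
+    # {"marker": "D12mit280", "chr": "12", "pos": 3010274, "lod": 4.5}
+    return (f"Marker {row['marker']} on chromosome {row['chr']} "
+            f"at position {row['pos']} has a LOD score of {row['lod']}.")
+```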
+
+### Build a RAG system and test with a small corpus of mapping results
+
+I have read a number of books and found that RAG systems are pretty easy to design with LangChain. A RAG is made of 2 components:
+* search and retrieval -> need a keyword search algorithm or embedding model
+* response generation -> need an LLM
+
+The system takes a query and returns a response based on the information retrieved. Very effective at increasing the factuality of LLM responses!
+
+I wrote a first draft with:
+
+1. BM25 keyword search algorithm and Qwen3-Embedding-0.6B model
+This is a hybrid search subsystem. It searches both by keyword and by meaning. It should be more scalable. The subsystem:
+* takes the query
+* processes it
+* extracts the most similar documents or data chunks
+* passes the results to the LLM for further processing
+
+2. Qwen2.5-72B-Instruct LLM
+This is the response subsystem. It:
+* takes the original query
+* takes retrieved information
+* reasons using a tree-of-thought mechanism for validation
+* generates an answer to the query using the retrieved information
+
+Small bonus
+I suspected that more than one query might be required to address some questions. I needed to find a way to make the system conversational, so that one could ask many questions without information being lost in between. This came down to adding memory to the system. So, I added a small model for memory propagation. The model was:
+3. Phi-3-mini-4k-instruct
+It uses this model together with the ConversationSummaryMemory class from LangChain to summarize key information from each interaction so it remains accessible between sessions.
+
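+A minimal sketch of how such a summary memory can be wired up (SUMMARY_MODEL stands for the Phi-3 model above; the "chat_history" key is what the rest of the system reads back):
+
+```
+from langchain.memory import ConversationSummaryMemory
+
+# Summary memory backed by the small model: after each turn it condenses
+# the interaction into a running summary stored under "chat_history".
+memory = ConversationSummaryMemory(llm=SUMMARY_MODEL, memory_key="chat_history")
+```
+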
+Time to test!
+
+* Query 1
+
+```
+Question: What is the lod for trait leptin receptor EPFLMouseLiverCDEx0413 at position 100?
+
+Response:
+**Expert 1:**
+The question asks for the LOD score at position 100. From the relevant information, I see that there is a LOD score of 1 at position 100.
+
+**Expert 2:**
+I agree with Expert 1. The relevant information clearly states that the LOD score at position 100 is 1.
+
+**Expert 3:**
+I concur with both Experts 1 and 2. The LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is indeed 1.
+
+**Conclusion:**
+All experts agree that the LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is 1. <|end|>
+```
+
+Since the corpus used for testing is small, I could inspect to confirm the accuracy of the response generated by the RAG. Lookup revealed that it was correct, yah!
+
+* Query 2
+
+```
+Question: Which position of leptin receptor EPFLMouseLiverCDEx0413 was examined in the previous prompt?
+
+Response:
+**Expert 1:**
+The previous prompt asked for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100.
+
+**Expert 2:**
+I agree with Expert 1. The previous prompt specifically asked about the LOD score at position 100. Thus, the position examined is 100.
+
+**Expert 3:**
+I concur with both Experts 1 and 2. The previous prompt clearly specified that it was asking for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100.
+
+**Conclusion:**
+All experts agree that the position of leptin receptor EPFLMouseLiverCDEx0413 examined in the previous prompt is 100. <|end|>
+```
+
+This was also correct. I was very excited, Rob too!
+
+I received feedback that made me clarify the meaning of position. I rebuilt the small corpus using a BXD dataset of traits I had run GEMMA on for my Masters project, making sure to use actual marker positions this time and also including marker names.
+
+Let's experiment again!
+
+I got results similar to the ones above, with the exception that marker positions are now real and marker names are supported.
+
+I faced a challenge though :(
+
+For queries that require combining different data chunks or documents (non-atomic queries), the system does not perform well. For example, for the query
+
+* How many traits hepatic nuclear factor 4 are in the datasets?
+the system was confused. Even after prompt engineering, the answer generated was not accurate. And for the query
+
+* Identify 2 traits that have similar lod values on chromosome 1 position 3010274
+the system sometimes missed both, or caught only 1 trait having a lod value at that position.
+
+This is probably because the system cannot execute more than one retrieval run. To get there, I need to make the RAG more autonomous: this is how the concept of agent came up.
+
+
+### Experiment with actual mapping results and metadata
+
+Getting to an agent required more reading. In the meantime, I decided to get actual mapping results and metadata for experimentation. It would be sad to proceed if the system were actually not compatible with the data to be used in production :)
+
+I waited for Pjotr to precompute GEMMA association results and export them with metadata to an endpoint. The RDF schema was very interesting to learn, and Bonz did some work on that in the past :)
+
+You can check out recent developments of Pjotr's work here:
+=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-publishdata
+
+For Bonz work, see:
+=> https://github.com/genenetwork/gn-docs/blob/master/rdf-documentation/phenotype-metadata.md
+
+Anyway, it took some time, but I finally got a glimpse of the data.
+
+This started with the metadata from an old endpoint created by Bonz. I also had to learn SPARQL - I was quite new to it!
+
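+To give a flavor, a query like the following (a hypothetical sketch, using the gnt: namespace that appears later in this document) lists traits and their loci:
+
+```
+PREFIX gnt: <http://genenetwork.org/term/>
+SELECT ?trait ?locus WHERE {
+  ?trait gnt:locus ?locus .
+} LIMIT 10
+```
+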
+We thought LLMs could directly make sense of RDF data (it is still in text format), but it turns out they cannot. They can recognize that it is RDF, but in between all the URIs they start making mistakes quite quickly. Instead of using RDF natively, we decided to use LLMs to first convert RDF data - whether metadata or mapping results - to natural text before using it with the RAG system. The system should do better this way, and we confirmed that!
+
+Pjotr made available the first version of the global endpoint. Nothing should stop me now :) I wrote a script to fetch metadata from the endpoint. I had not been sharing my code so far; let me fix that right now. You can follow this link for the script I was referring to above:
+=> https://github.com/johanmed/gn-rag/blob/main/fetch_metadata.py
+
+Pjotr also made available the ttl files in my home directory on balg01 - full flexibility!
+
+I naturalized some RDF triples. The corpus now looked like the example below:
+
+```
+The phenotype identified as BXD_19926, or abbreviated as TAG(48:3)_HFD, is part of the EPFL LISP3 Study. It's a measurement of Liver Lipidomics: TAG48:3_HFD, which has side chain FA composition [TG12:0_18:1_18:2,TG14:0_16:0_18:3,TG14:0_16:1_18:2,TG14:0_16:2_18:1,TG14:1_16:0_18:2,TG14:1_16:1_18:1,TG16:0_16:1_16:2,TG16:1_16:1_16:1] belonging to the lipid class triacylglycerol TAG. This phenotype was measured by MS [R.U.AUC] in overnight fasted male mice at 29 weeks of age, who were on a high fat diet (60% kCal/fat HarlanTD.06414).
+The phenotype was submitted by Jha P and contributed to by Auwerx J. It belongs to the group setBxd and has a sequence number of 1. The mean value for this phenotype is 0.00410489, with an additive effect of 0.00066854. The LOD score for this phenotype is 2.91779.
+The locus associated with this phenotype is Rsm10000001653.
+```
+
+It was time to proceed to testing. The results I show below are not exactly from the RAG system I explained above; I was improving the RAG in parallel. So watch out - I will explain everything in the next task :)
+
+For now, you can have a look at the first implementation at:
+=> https://github.com/johanmed/gn-rag/commit/2cf0b74442e8f7e3a67d563b882f3ab25a4ceb6d
+
+The goal was to try a complex query the previous system failed on. You can see that the question was indeed not atomic. 
+
+```
+Question: Identify traits having a lod score > 4.0
+
+Response:
+In addressing the query to identify traits with a LOD score exceeding 4.0, it's pertinent to highlight three mapped loci that meet these criteria: Rsm10000011038_BXDPublish_17086_gemma_GWA_19ca1aeb (LOD: 4.5), Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548 (LOD: 4.14) and Rsm10000000559 (LOD: 4.61). These findings underscore the significance of these loci in genetic studies, given their high LOD scores indicative of strong linkage evidence.
+```
+
+It is already better. At least, the system combined information from different data or document chunks. Retrieved chunks were:
+
+```
+[[Document(metadata={}, page_content='\nThe mapped locus associated with Rsm10000001536 has a LOD score of 2.07, and the mapped SNP is identified as GEMMAMapped_LOCO_BXDPublish_24451_gemma_GWA_aecf628e. The effect size for this locus is 0.01, with an allele frequency (AF) of 0.333.\n                '), Document(metadata={}, page_content='\nThe mapped locus Rsm10000011536 is associated with a LOD score of 5.69, an effect size of 0.385 and an allele frequency of 0.526. This locus has been mapped to the SNP GEMMAMapped_LOCO_BXDPublish_20320_gemma_GWA_6832c0e4.\n                '), Document(metadata={}, page_content='\nThe mapped locus, Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548, has an effect size of -3.137 and a LOD score of 4.14. This locus is associated with the mapped SNP GEMMAMapped_LOCO_BXDPublish_12641_gemma_GWA_8c8d0548, and it has an allele frequency of 0.556.\n                '), Document(metadata={}, page_content='\nIn plain English, this data refers to a mapped locus associated with the Rsm10000011038_BXDPublish_17086_gemma_GWA_19ca1aeb identifier. This locus is linked to the Rsm10000011038 identifier, has an effect size of -0.048, a LOD score of 4.5, and an allele frequency (AF) of 0.167. The mapped SNP associated with this data can be found under the GEMMAMapped_LOCO_BXDPublish_17086_gemma_GWA_19ca1aeb identifier.\n                '), Document(metadata={}, page_content='\n                In plain English, the data describes a genetic locus identified as Rsm10000000559. This locus was mapped through an effect size of -34.191, with an allele frequency of 0.438. The mapping achieved a LOD score of 4.61, indicating the statistical significance of this genetic association. The mapped locus is associated with a specific SNP (Single Nucleotide Polymorphism) identified as GEMMAMapped_LOCO_BXDPublish_12016_gemma_GWA_bc6adcae.\n                ')]]
+```
+
+### Move from RAG to agent
+
+This is where I made the system more autonomous, i.e. agentic. I am now going to explain how I did it. I read a couple of sources and found that a RAG system built with LangChain can be made agentic by using LangGraph. This creates a graph structure which splits the task among different nodes or agents. Each agent achieves a specific subtask, and a final node manages the integration.
+
+Checkout this commit to see the results:
+=> https://github.com/johanmed/gn-rag/commit/ecde30a31588605358007cc39df25976b9c2e295
+
+You can clearly see differences between *rag_langchain.py* and *rag_langgraph.py*
+
+Basically,
+
+```
+    def ask_question(self, question: str):
+        start=time.time()
+        memory_var=self.memory.load_memory_variables({})
+        chat_history=memory_var.get('chat_history', '')
+        result=self.retrieval_chain.invoke(
+            {'question': question,
+             'input': question,
+             'chat_history': chat_history})
+        answer=result.get("answer")
+        citations=result.get("context")
+        self.memory.save_context(
+            {'input': question},
+            {'answer': answer})
+        # Close LLMs
+        GENERATIVE_MODEL.client.close()
+        SUMMARY_MODEL.client.close()
+        end=time.time()
+        print(f'ask_question: {end-start}')
+        return {
+            "question": question,
+            "answer": answer,
+            "citations": citations,
+        }
+```
+
+became:
+
+```
+    def retrieve(self, state: State) -> dict:
+        # Define graph node for retrieval
+        prompt = f"""
+        You are powerful data retriever and you strictly return
+        what is asked for.
+        Retrieve relevant documents for the query below,
+        excluding these documents: {state.get('seen_documents', [])}
+        Query: {state['input']}"""
+        retrieved_docs = self.ensemble_retriever.invoke(prompt)
+        return {"input": state["input"],
+                "context": retrieved_docs,
+                "digested_context": state.get("digested_context", []),
+                "result_count": state.get("result_count", 0),
+                "target": state.get("target", 3),
+                "max_iterations": state.get("max_iterations", 5),
+                "should_continue": "naturalize",
+                "iterations": state.get("iterations", 0) + 1, # Add one per run
+                "chat_history": state.get("chat_history", []),
+                "answer": state.get("answer", ""),
+                "seen_documents": state.get("seen_documents", [])}
+
+    def manage(self, state:State) -> dict:
+        # Define graph node for task orchestration
+        context = state.get("context", [])
+        digested_context = state.get("digested_context", [])
+        answer = state.get("answer", "")
+        iterations = state.get("iterations", 0)
+        chat_history = state.get("chat_history", [])
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        max_iterations = state.get("max_iterations", 5)
+        should_continue = state.get("should_continue", "retrieve")
+        # Orchestration logic
+        if iterations >= max_iterations or result_count >= target:
+            should_continue = "summarize"
+        elif should_continue == "retrieve":
+            # Reset fields
+            context = []
+            digested_context = []
+            answer = ""
+        elif should_continue == "naturalize" and not context:
+            should_continue = "retrieve"  # Can't naturalize without context
+            context = []
+            digested_context = []
+            answer = ""
+        elif should_continue == "analyze" and \
+             (not context or not digested_context):
+            should_continue = "retrieve"  # Can't analyze without context
+            context = []
+            digested_context = []
+            answer = ""
+        elif should_continue == "check_relevance" and not answer:
+            should_continue = "analyze"  # Can't check relevance without answer
+        elif should_continue not in ["retrieve", \
+                "naturalize", "check_relevance", "analyze", "summarize"]:
+            should_continue = "summarize"  # Fallback
+        return {"input": state["input"],
+                "should_continue": should_continue,
+                "result_count": result_count,
+                "target": target,
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "context": context,
+                "digested_context": digested_context,
+                "chat_history": chat_history,
+                "answer": answer,
+                "seen_documents": state.get("seen_documents", [])}
+
+    def analyze(self, state:State) -> dict:
+        # Define graph node for analysis and text generation
+        context = "\n".join(state.get("digested_context", []))
+        existing_history="\n".join(state.get("chat_history", [])) \
+            if state.get("chat_history") else ""
+        iterations = state.get("iterations", 0)
+        max_iterations = state.get("max_iterations", 5)
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        if not context: # Cannot proceed without context
+            should_continue = "summarize" if iterations >= max_iterations \
+                or result_count >= target else "retrieve"
+            response = ""
+        else:
+            prompt = f"""
+             <|im_start|>system
+             You are an experienced analyst that can use available information
+             to provide accurate and concise feedback.
+             <|im_end|>
+             <|im_start|>user
+             Answer the question below using following information.
+             Context: {context}
+             History: {existing_history}
+             Question: {state["input"]}
+             Answer:
+             <|im_end|>
+             <|im_start|>assistant"""
+            response = GENERATIVE_MODEL.invoke(prompt)
+            if not response or not isinstance(response, str) or \
+                    response.strip() == "": # Need valid generation
+                should_continue = "summarize" if iterations >= max_iterations \
+                    or result_count >= target else "retrieve"
+                response = ""  # Ensure a clean state
+            else:
+                should_continue = "check_relevance"
+        return {"input": state["input"],
+                "answer": response,
+                "should_continue": should_continue,
+                "context": state.get("context", []),
+                "digested_context": state.get("digested_context", []),
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "result_count": result_count,
+                "target": target,
+                "chat_history": state.get("chat_history", []),
+                "seen_documents": state.get("seen_documents", [])}
+
+    
+    def summarize(self, state:State) -> dict:
+        # Define node for summarization
+        existing_history = state.get("chat_history", [])
+        current_interaction=f"""
+            User: {state["input"]}\nAssistant: {state["answer"]}"""
+        full_context = "\n".join(existing_history) + "\n" + \
+            current_interaction if existing_history else current_interaction
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        iterations = state.get("iterations", 0)
+        max_iterations = state.get("max_iterations", 5)
+        prompt = f"""
+            <|system|>
+            You are an excellent and concise summary maker.
+            <|end|>
+            <|user|>
+            Summarize in bullet points the conversation below.
+            Follow this format: input - answer
+            Conversation: {full_context}
+            <|end|>
+            <|assistant|>"""
+        summary = GENERATIVE_MODEL.invoke(prompt).strip() # central task
+        if not summary or not isinstance(summary, str) or summary.strip() == "":
+            summary = f"- {state['input']} - No valid answer generated"
+        should_continue="end" if result_count >= target or \
+            iterations >= max_iterations else "retrieve"
+        updated_history = existing_history + [summary] # update chat_history
+        print(f"\nChat history in summarize: {updated_history}")
+        return {"input": state["input"],
+                "answer": summary,
+                "should_continue": should_continue,
+                "context": state.get("context", []),
+                "digested_context": state.get("digested_context", []),
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "result_count": result_count,
+                "target": target,
+                "chat_history": updated_history,
+                "seen_documents": state.get("seen_documents", [])}
+
+    def check_relevance(self, state:State) -> dict:
+        # Define node to check relevance of retrieved data
+        context = "\n".join(state.get("digested_context", []))
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        iterations = state.get("iterations", 0)
+        max_iterations = state.get("max_iterations", 5)
+        seen_documents = state.get("seen_documents", [])
+        prompt = f"""
+            <|system|>
+            You are an expert in evaluating data relevance. You do it seriously.
+            <|end|>
+            <|user|>
+            Assess if the provided answer is relevant to the query.
+            Return only yes or no. Nothing else.
+            Answer: {state["answer"]}
+            Query: {state["input"]}
+            Context: {context}
+            <|end|>
+            <|assistant|>"""
+        assessment = GENERATIVE_MODEL.invoke(prompt).strip()
+        if assessment=="yes":
+            result_count = result_count + 1
+            should_continue = "summarize"
+        elif result_count >= target or iterations >= max_iterations:
+            should_continue = "summarize"
+        else:
+            should_continue = "retrieve"
+            seen_documents.extend([doc.page_content for doc in \
+                state.get("context", [])])
+        return {"input": state["input"],
+                "context": state.get("context", []),
+                "digested_context": state.get("digested_context", []),
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "answer": state["answer"],
+                "result_count": result_count,
+                "target": target,
+                "seen_documents": seen_documents,
+                "chat_history": state.get("chat_history", []),
+                "should_continue": should_continue}
+        
+    def route_manage(self, state: State) -> str:
+            should_continue = state.get("should_continue", "retrieve")
+            iterations = state.get("iterations", 0)
+            max_iterations = state.get("max_iterations", 5)
+            result_count = state.get("result_count", 0)
+            target = state.get("target", 3)
+            context = state.get("context", [])
+            digested_context = state.get("digested_context", [])
+            answer = state.get("answer", "")
+            # Validate state and enforce termination
+            if iterations >= max_iterations or result_count >= target:
+                return "summarize"
+            if should_continue not in ["retrieve", "naturalize", \
+                    "check_relevance", "analyze", "summarize"]:
+                return "summarize"  # Fallback to summarize
+            return should_continue
+
+    def initialize_langgraph_chain(self) -> Any:
+        graph_builder = StateGraph(State)
+        graph_builder.add_node("manage", self.manage)
+        graph_builder.add_node("retrieve", self.retrieve)
+        graph_builder.add_node("naturalize", self.naturalize)
+        graph_builder.add_node("check_relevance", self.check_relevance)
+        graph_builder.add_node("analyze", self.analyze)
+        graph_builder.add_node("summarize", self.summarize)
+        graph_builder.add_edge(START, "manage")
+        graph_builder.add_edge("retrieve", "naturalize")
+        graph_builder.add_edge("naturalize", "analyze")
+        graph_builder.add_edge("analyze", "check_relevance")
+        graph_builder.add_edge("check_relevance", "manage")
+        graph_builder.add_edge("summarize", END)
+        graph_builder.add_conditional_edges(
+            "manage",
+            self.route_manage,
+            {"retrieve": "retrieve",
+             "naturalize": "naturalize",
+             "check_relevance": "check_relevance",
+             "analyze": "analyze",
+             "summarize": "summarize"})
+        graph=graph_builder.compile()
+        return graph
+
+    async def invoke_langgraph(self, question: str) -> Any:
+        graph = self.initialize_langgraph_chain()
+        initial_state = {
+            "input": question,
+            "chat_history": [],
+            "context": [],
+            "digested_context": [],
+            "seen_documents": [],
+            "answer": "",
+            "iterations": 0,
+            "result_count": 0,
+            "should_continue": "retrieve",
+            "target": 3,  # stop once this many relevant results have been collected
+            "max_iterations": 5  # hard cap on retrieval rounds before summarizing
+        }
+        result = await graph.ainvoke(initial_state) # Run graph asynchronously
+        return result
+
+    
+    def answer_question(self, question: str) -> Any:
+        start = time.time()
+        result = asyncio.run(self.invoke_langgraph(question))
+        end = time.time()
+        print(f'answer_question: {end-start}')
+        return {"result": result["chat_history"],
+                "state": result}
+```
+
+As mentioned above, we quickly spotted the need for the naturalization of RDF triples. This explains the addition of a naturalization node to the graph:
+
+```
+    def naturalize(self, state: State) -> dict:
+        # Define graph node for RDF naturalization
+        prompt = f"""
+        <|im_start|>system
+        You are extremely good at naturalizing RDF and inferring meaning.
+        <|im_end|>
+        <|im_start|>user
+        Take element in the list of RDF triples one by one and
+        make it sounds like Plain English. Repeat for each the subject
+        which is at the start. You should return a list. Nothing else.
+        List: ["Entity http://genenetwork.org/id/traitBxd_20537 \
+        \nhas http://purl.org/dc/terms/isReferencedBy of \
+        http://genenetwork.org/id/unpublished22893", "has \
+        http://genenetwork.org/term/locus of \
+        http://genenetwork.org/id/Rsm10000002554"]
+        <|im_end|>
+        <|im_start|>assistant
+        New list: ["traitBxd_20537 isReferencedBy unpublished22893", \
+        "traitBxd_20537 has a locus Rsm10000002554"]
+        <|im_end|>
+        <|im_start|>user
+        Take element in the list of RDF triples one by one and
+        make it sounds like Plain English. Repeat for each the subject
+        which is at the start. You should return a list. Nothing else.
+        List: {state.get("context", [])}
+        <|im_start|>end
+        <|im_start|>assistant"""
+        response = GENERATIVE_MODEL.invoke(prompt)
+        print(f"Response in naturalize: {response}")
+        if isinstance(response, str):
+            start=response.find("[")
+            end=response.rfind("]") + 1 # offset by 1 to make slicing
+            response=json.loads(response[start:end])
+        else:
+            response=[]
+        return {"input": state["input"],
+                "context": state.get("context", []),
+                "digested_context": response,
+                "result_count": state.get("result_count", 0),
+                "target": state.get("target", 3),
+                "max_iterations": state.get("max_iterations", 5),
+                "should_continue": "analyze",
+                "iterations": state.get("iterations", 0),
+                "chat_history": state.get("chat_history", []),
+                "answer": state.get("answer", ""),
+                "seen_documents": state.get("seen_documents", [])}
+```
+
+The next step is to compare the RAG system (rag_langchain.py) to the agent system (rag_langgraph.py) on some queries.
+
+* Query 1
+
+```
+Question: What is the lod score of BXDPublish_10187_gemma_GWA at D12mit280?
+
+RAG response: 3.01245
+
+Agent response: 5.21
+```
+
+Hmm, the result of the RAG system is dubious. I did not expect long floating-point numbers for the LOD scores. Fortunately, Pjotr's endpoint is useful for making quick queries against the RDF data. By the way, one can see how complementary SPARQL queries and LLM calls are. It was very thoughtful to pass through RDF :)
+
+Using the endpoint, I fetched all LOD scores for BXDPublish_10187_gemma_GWA at D12mit280, and only the agent system's result turned out to be correct.
+
+* Query 2
+
+```
+Question: I want you look for D12mit280. You are allowed to initiate many rounds of retrieval until you get 20 relevant results. Next, extract the lod score and trait for each result. List for me traits that have a lod score > 4.0. Join to the list the corresponding lod scores so I can confirm. Show results using the following format: trait - lod score
+
+RAG response: It seems there's a misunderstanding. The provided data doesn't include any results related to "D12mit280". I can only provide information or perform tasks based on the given context and data. If you have another query or need clarification on something else, feel free to ask.
+
+Agent response: It appears there are no results for D12mit280 with a lod score > 4.0 in the provided entities. If you have more data or different criteria, please let me know so I can assist further.
+
+```
+
+This was surprising. My hypothesis was that the node in charge of retrieval was not fetching relevant documents for the LLM to produce a sound answer. There is only one way to confirm: check the documents retrieved.
+
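+One quick way to do that is to dump what the retriever hands to the LLM (a minimal sketch, assuming the ensemble_retriever shown further below and LangChain's standard retriever interface):
+
+```
+# Hypothetical debugging snippet: inspect the retrieved documents
+docs = ensemble_retriever.invoke(
+    "What is the lod score of BXDPublish_10187_gemma_GWA at D12mit280?")
+for doc in docs:
+    print(doc.metadata, doc.page_content[:200])  # first 200 characters of each hit
+```
+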
+Printing the documents after retrieval confirmed that hypothesis. I also noticed a couple of other problems in the execution of the program: nodes were mostly not executed in the order I expected. Consequently, I embarked on a long quest of incremental improvements :)
+
+### Optimize AI system
+
+A couple of things needed to be optimized: retrieval, node orchestration and GPU acceleration.
+
+* Retrieval
+
+Let's start with retrieval. I played with different parameters of the retriever. It was an EnsembleRetriever using both keyword and semantic search as illustrated below:
+```
+ensemble_retriever = EnsembleRetriever(
+    retrievers=[
+        self.chroma_db.as_retriever(search_kwargs={"k": 10}),  # semantic search
+        bm25_retriever,  # keyword (BM25) search
+    ],
+    weights=[0.4, 0.6],
+)
+```
+I tried different combinations of weights to arrive at this selection, but more rigorous work is needed to systematically identify the best hyperparameters for retrieval; a sketch of such a search follows.
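+
+A minimal sweep sketch, assuming a small hand-labelled evaluation set (eval_set, question/expected-snippet pairs) which does not exist yet:
+
+```
+# Hypothetical hyperparameter sweep over k and the ensemble weights
+best = None
+for k in (5, 10, 20):
+    for w in (0.3, 0.4, 0.5, 0.6, 0.7):
+        retriever = EnsembleRetriever(
+            retrievers=[
+                self.chroma_db.as_retriever(search_kwargs={"k": k}),
+                bm25_retriever,
+            ],
+            weights=[w, 1 - w],
+        )
+        # Hit rate: how many questions retrieve their expected document
+        hits = sum(
+            any(expected in doc.page_content for doc in retriever.invoke(question))
+            for question, expected in eval_set
+        )
+        if best is None or hits > best[0]:
+            best = (hits, k, w)
+print(f"Best (hits, k, weight): {best}")
+```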
+
+* Node orchestration
+
+Moving on to node orchestration. It took me some time and reflection to realize that the nodes I had at that point only make sense when executed sequentially. Analysis (the analyze node) should always be followed by relevance checking (the check_relevance node) and then summarization (the summarize node), in that order. Any other sequence of execution is not useful. I had to modify the code to enforce this and prevent the graph from entering unnecessary loops :) A sketch of the wiring is shown below.
+
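+Enforcing that order is just a matter of fixed edges in the graph (a minimal sketch, assuming LangGraph's StateGraph API; `builder` and the node names are illustrative):
+
+```
+# Hypothetical wiring on the StateGraph builder, before compiling
+builder.add_edge("retrieve", "analyze")
+builder.add_edge("analyze", "check_relevance")
+builder.add_edge("check_relevance", "summarize")
+```
+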
+But this also highlighted other limitations of the system: a lack of flexibility and a lack of autonomy.
+
+To address the lack of flexibility, I introduced a new node to split a query into multiple queries that can be solved independently and asynchronously. The split_query node works as follows:
+```
+def split_query(self, query: str) -> list[str]:
+
+        prompt = f"""
+            <|im_start|>system
+            You are a very powerful task generator.
+        
+            Split the query into task and context based on tags.
+            Based on the context, ask relevant questions that help achieve the task. Make sure the subquestions are atomic and do not rely on each other.
+            Return only the subquestions.
+            Return strictly a JSON list of strings, nothing else.
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            Task: Identify traits with a lod score > 3.0 for the marker Rsm10000011643. Tell me what marker Rsm10000011643 is involved in biology.
+            Context: A trait has a long name and contain generally strings like GWA or GEMMA. The goal is to know the biological processes which might be related to the marker previously mentioned.
+        
+            Result:
+            <|im_end|>
+            <|im_start|>assistant
+            ["What traits (containing GWA or GEMMA) have a lod score > 3.0 at Rsm10000011643?", "Which biological processes are related to Rsm10000011643?"]
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            {query}
+            Result:
+            <|im_end|>
+            <|im_start|>assistant"""
+
+        with self.generative_lock:
+            response = GENERATIVE_MODEL.invoke(prompt)
+        print(f"Subqueries in split_query: {response}")
+
+        if isinstance(response, str):
+            start = response.find("[")
+            end = response.rfind("]") + 1
+            subqueries = json.loads(response[start:end])
+        else:
+            subqueries = [query]
+
+        return subqueries
+
+```
+
+Another node is needed to reconcile the answers generated for each subquery. This motivated the addition of the finalize node:
+```
+def finalize(self, query: str, subqueries: list[str], answers: list[str]) -> dict:
+
+        prompt = f"""
+            <|im_start|>system
+            You are an experienced biology scientist. Given the subqueries and corresponding answers, generate a comprehensive explanation to address the query using all information provided.
+            Ensure the response is insightful, concise, and draws logical inferences where possible.
+            Do not modify entity names such as trait and marker.
+            Make sure to link based on what is common in the answers.
+            Provide only the story, nothing else.
+            Do not repeat answers. Use only 200 words max.
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            Identify two traits related to diabetes.
+            Compare their lod scores at Rsm149505.
+            Subqueries:
+            ["Identify two traits related to diabetes",
+            "Compare lod scores of same traits at Rsm149505"]
+            Answers:
+            ["Traits A and B are related to diabetes", \
+            "The lod score at Rsm149505 is 2.3 and 3.4 for trait A and B"]
+            Conclusion:
+            <|im_end|>
+            <|im_start|>assistant
+            Traits A and B are related to diabetes and have a lod score of\
+            2.3 and 3.4 at Rsm149505. The two traits could interact via a\
+            gene close to the marker Rsm149505.
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            {query}
+            Subqueries:
+            {subqueries}
+            Answers:
+            {answers}
+            Conclusion:
+            <|im_end|>
+            <|im_start|>assistant"""
+        with self.generative_lock:
+            response = GENERATIVE_MODEL.invoke(prompt)
+        print(f"Response in finalize: {response}")
+
+        final_answer = (
+            response
+            if response
+            else "Sorry, we are unable to provide an overall answer "
+            "due to lack of relevant data."
+        )
+
+        return final_answer
+```
+
+The system could now take a multi-faceted query, split it into multiple subqueries, and address each of them asynchronously, running retrieve, analyze, check_relevance and summarize sequentially. The results are combined at the end before feedback is given to the user.
+
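+That fan-out/fan-in shape can be expressed compactly with asyncio (a minimal sketch; answer_subquery, which would run retrieval, analysis, check_relevance and summarize for one subquery, is a hypothetical helper):
+
+```
+import asyncio
+
+async def answer_query(self, query: str) -> str:
+    subqueries = self.split_query(query)
+    # Fan out: solve each subquery independently and concurrently
+    answers = await asyncio.gather(
+        *(self.answer_subquery(sq) for sq in subqueries)
+    )
+    # Fan in: reconcile the partial answers into one response
+    return self.finalize(query, subqueries, list(answers))
+```
+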
+Time to make the system really agentic - so far it is not truly agentic because it lacks autonomy! An agentic system requires access to many tools and a core LLM that can reason on its own about the sequence of tools to call in order to solve a problem. This sounds scary, but it is not if well designed :) I was also planning to add some safeguards to prevent infinite looping, which could consume a lot of tokens very quickly.
+
+What I did was register the graph I had so far as a subgraph of a bigger graph (the real AI system). This arm of the AI system is called researcher and has the following definition:
+```
+def researcher(self, state: AgentState) -> Any:
+        # Use the original question at the start; the latest message afterwards
+        if len(state.messages) < 3:
+            input = state.messages[0]
+        else:
+            input = state.messages[-1]
+        input = input.content
+        logging.info(f"Input in researcher: {input}")
+        result = self.manage_subtasks(input)
+        logging.info(f"Result in researcher: {result}")
+
+        return {
+            "messages": [result],
+        }
+```
+
+I also designed a planner, a reflector and a supervisor that the system can use. As the names indicate, the planner helps with planning the steps to take to solve the problem. The reflector provides feedback and helps improve the output of the researcher. The supervisor is the core handler: it manages interactions between the planner, researcher and reflector.
+
+You can inspect the design code for the planner, reflector and supervisor below:
+```
+def planner(self, state: AgentState) -> Any:
+    input = [self.plan_system_prompt] + state.messages
+    result = plan(background=input)
+    answer = result.get("answer")
+    return {
+        "messages": [answer],
+    }
+
+def reflector(self, state: AgentState) -> Any:
+    # Swap message roles so the model critiques its own output as a user would
+    trans_map = {AIMessage: HumanMessage, HumanMessage: AIMessage}
+    translated_messages = [self.refl_system_prompt, state.messages[0]] + [
+        trans_map[msg.__class__](content=msg.content) for msg in state.messages[1:]
+    ]
+    result = tune(background=translated_messages)
+    answer = result.get("answer")
+    answer = (
+        "Progress has been made. Use now all the resources to address this new suggestion: "
+        + answer
+    )
+    return {
+        "messages": [HumanMessage(answer)],
+    }
+
+def supervisor(self, state: AgentState) -> Any:
+    messages = [
+        ("system", self.sup_system_prompt1),
+        *state.messages,
+        ("system", self.sup_system_prompt2),
+    ]
+
+    # Safeguard against endless loops: stop once the conversation grows too long
+    if len(messages) > self.max_global_visits:
+        return {"next": "end"}
+
+    result = supervise(background=messages)
+    next = result.get("next")
+
+    return {
+        "next": next,
+    }
+```
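+
+For completeness, here is how such a supervisor-centred graph could be wired up (a minimal sketch, assuming LangGraph's StateGraph API; the node and edge names are illustrative, not the exact ones in gn-rag):
+
+```
+from langgraph.graph import StateGraph, START, END
+
+builder = StateGraph(AgentState)
+builder.add_node("supervisor", self.supervisor)
+builder.add_node("planner", self.planner)
+builder.add_node("researcher", self.researcher)
+builder.add_node("reflector", self.reflector)
+
+builder.add_edge(START, "supervisor")
+# The supervisor decides which arm runs next, or ends the run
+builder.add_conditional_edges(
+    "supervisor",
+    lambda state: state.next,
+    {"planner": "planner", "researcher": "researcher",
+     "reflector": "reflector", "end": END},
+)
+# Every arm reports back to the supervisor
+for node in ("planner", "researcher", "reflector"):
+    builder.add_edge(node, "supervisor")
+
+graph = builder.compile()
+```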
+
+* GPU acceleration
+
+The last point is GPU acceleration. Pjotr installed a GPU on balg01 for this purpose. You can check out the details here:
+=> https://issues.genenetwork.org/topics/systems/linux/GPU-on-balg01
+
+The GPU is automatically used for LLM-related work, and at first I simply relied on that. Later, I learnt about SGLang, which allows deploying an LLM server with even faster inference. The code for deploying the server is here:
+=> https://github.com/johanmed/gn-rag/blob/543a7835f5620a541cdb679b852c91e62bca2698/src/agent_system/config.sh
+
+With DSPy, I can switch between virtually any model, closed or open. Consequently, I added support for DSPy. For details, check out the following commit:
+=> https://github.com/johanmed/gn-rag/commit/ec0d8ffc174cca0ccf32cb98d82ebdc7106b4ac2
+
+Small gotcha: for models served locally with SGLang, not all open models can be run given the VRAM (GPU RAM) constraint. It took some experiments to find workable models that are fine-tuned for instruction following and have decent performance. At the time of writing, I am working with Qwen/Qwen2.5-7B-Instruct, fetched via HuggingFace, as the LLM. There is also an embedding model, but I have not added GPU acceleration support for it, to keep memory management simple. We have limited resources for now :)
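+
+Pointing DSPy at the locally served model looks roughly like this (a minimal sketch; the port and endpoint follow SGLang's OpenAI-compatible defaults and are assumptions, not the exact gn-rag configuration):
+
+```
+import dspy
+
+# SGLang exposes an OpenAI-compatible endpoint; 30000 is its default port
+lm = dspy.LM(
+    "openai/Qwen/Qwen2.5-7B-Instruct",
+    api_base="http://localhost:30000/v1",
+    api_key="EMPTY",  # local server, no real key needed
+)
+dspy.configure(lm=lm)
+```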
+
+I also performed a series of refactoring and formatting passes to improve the readability of the source code. Find it here:
+=> https://github.com/johanmed/gn-rag/tree/main/src
+
+### Scale analysis to more data
diff --git a/topics/ai/ontogpt.gmi b/topics/ai/ontogpt.gmi
new file mode 100644
index 0000000..94bd165
--- /dev/null
+++ b/topics/ai/ontogpt.gmi
@@ -0,0 +1,7 @@
+# OntoGPT
+
+Install OntoGPT in a Python virtual environment and set the OpenAI API key:
+
+```
+python3 -m venv ~/opt/ontogpt
+~/opt/ontogpt/bin/python3 -m pip install ontogpt
+runoak set-apikey -e openai
+```
diff --git a/topics/authentication/architecture.gmi b/topics/authentication/architecture.gmi
index 931f9cb..2200745 100644
--- a/topics/authentication/architecture.gmi
+++ b/topics/authentication/architecture.gmi
@@ -54,13 +54,14 @@ Users are granted privileges (see "Privileges" section) to act upon resources, t
 
 Examples of "types" of resources on the system:
 
-- system: The system itself
-- group: Collection of users considered a group
-- genotype: A resource representing a genotype trait
-- phenotype: A resource representing a phenotype trait
-- mrna: A resource representing a collection of mRNA Assay traits
-- inbredset-group: A resource representing an InbredSet group
-
+* system: The system itself
+* group: Collection of users considered a group
+* genotype: A resource representing a genotype trait
+* phenotype: A resource representing a phenotype trait
+* mrna: A resource representing a collection of mRNA Assay traits
+* inbredset-group: A resource representing an InbredSet group
+
+----
 * TODO: Figure out a better name/description for "InbredSet group" -- so far, I have "a classification/grouping of traits/datasets within a species". Another is to use the term "population".
 
 ## Users
diff --git a/topics/authentication/development-guide.gmi b/topics/authentication/development-guide.gmi
new file mode 100644
index 0000000..840c26b
--- /dev/null
+++ b/topics/authentication/development-guide.gmi
@@ -0,0 +1,60 @@
+# GN-AUTH FAQ
+
+## Tags
+
+* type: docs, documentation
+* status: ongoing, open
+* keywords: authentication, authorisation, docs, documentation
+* author: @jnduli
+
+## Quick configuration for local development
+
+Save a `local_settings.conf` file that has the contents:
+
+```
+SQL_URI = "mysql://user:password@localhost/db_name" # mysql uri
+AUTH_DB = "/absolute/path/to/auth.db" # path to sqlite db file
+GN_AUTH_SECRETS = "/absolute/path/to/secrets/secrets.conf"
+```
+
+The `GN_AUTH_SECRETS` path has two functions:
+
+* It contains the `SECRET_KEY` we use in our application
+* The folder containing this file is used to store our jwks.
+
+An example is:
+
+```
+SECRET_KEY = "qQIrgiK29kXZU6v8D09y4uw_sk8I4cqgNZniYUrRoUk"
+```
+
+## Quick setup CLI commands
+
+```
+export FLASK_DEBUG=1 AUTHLIB_INSECURE_TRANSPORT=1 OAUTHLIB_INSECURE_TRANSPORT=1 FLASK_APP=gn_auth/wsgi
+export GN_AUTH_CONF=/absolute/path/to/local_settings.conf
+flask init-dev-clients --client-uri "http://localhost:port"
+flask init-dev-users
+flask assign-system-admin 0ad1917c-57da-46dc-b79e-c81c91e5b928
+```
+
+## Handling verification for users in local development
+
+* Run `flask init-dev-users`, which will create a verified local user.
+* Alternatively, run `UPDATE users SET verified=1` on the sqlite3 auth database (see the sketch below).
+
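+A minimal sketch of that manual verification step, assuming Python's built-in sqlite3 module and the AUTH_DB path from your local_settings.conf:
+
+```
+import sqlite3
+
+# Path is hypothetical: use the AUTH_DB value from local_settings.conf
+with sqlite3.connect("/absolute/path/to/auth.db") as conn:
+    conn.execute("UPDATE users SET verified=1")  # mark all local users verified
+```
+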
+## Errors related to unsupported clients/redirect URIs for a client
+
+Rerun
+
+```
+FLASK_DEBUG=1 AUTHLIB_INSECURE_TRANSPORT=1 OAUTHLIB_INSECURE_TRANSPORT=1 \
+  GN_AUTH_CONF=/absolute/path/to/local_settings.conf FLASK_APP=gn_auth/wsgi \
+  flask init-dev-clients --client-uri "http://localhost:port_you_use_for_gn2"
+```
+
+This will update your clients list to include all the redirect URIs we want.
+
+## 500 Server Error: INTERNAL SERVER ERROR
+
+When you see the error: `500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:8081/auth/token`, restart the gn2 server.
diff --git a/topics/authentication/permission_hooks.gmi b/topics/authentication/permission_hooks.gmi
new file mode 100644
index 0000000..dd475b6
--- /dev/null
+++ b/topics/authentication/permission_hooks.gmi
@@ -0,0 +1,62 @@
+# Permission Hooks System Design
+## Status: Draft
+
+## Objective
+
+We want to achieve:
+
+* Default permissions for users that come from `.edu` domains.
+* Support for visitors to the website.
+
+This should be dynamic and easily maintainable.
+
+## Design
+
+### Events
+
+* Use middleware to plug into the various parts of a request's life cycle. We'll plug into `after_request` for providing default permissions.
+* Create a hook which contains: the event to handle, the part of the life cycle the hook plugs into, and the actual functions to call.
+* Events can be identified using their `request.base_url` parameter.
+* Each hook registers itself to the global set of hooks (TODO: figure out how to automatically handle the registration).
+
+
+```
+@app.after_request
+def handle_hooks(response):
+    # Flask hands after_request handlers the response; it must be returned
+    for hook in HOOKS:
+        if hook.lifecycle() == "after_request" and hook.can_handle():
+            hook.run()
+    return response
+
+
+class RegistrationHook:
+
+    def can_handle(self):
+        return request.base_url.endswith("register")
+
+    def lifecycle(self):
+        return "after_request"
+
+    def run(self):
+        ...
+
+
+HOOKS = [RegistrationHook(), ...]
+```
+
+### Privilege Hooks
+
+* After login/registration, use the email to get extra privileges assigned to a user. We use `login` too to ensure that all users have the most up-to-date roles and privileges.
+* This means that any user gets assigned these privileges and normal workflows can happen.
+
+### Storage
+
+* Create a new role that contains the default privileges we want to assign to users depending on their domain.
+* This role will link up with the privileges to be assigned to said user.
+* Example privileges we may want to add to users in the `.edu` domain:
+  * group:resource:edit-resource
+  * system:inbreadset:apply-case-attribute-edit
+  * system:inbreadset:edit-case-attribute
+  * system:inbreadset:view-case-attribute
+* Create an extra table that provides a link between some `email identifier` and the role we'd like to pre-assign. We can use Python regexes for the email identifier, e.g. `.*\.edu$` or `.*\.uthsc\.edu$` (see the sketch below).
+* This will be the table used by the Registration Hook.
+* This also allows us to edit roles/privileges without code releases.
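+
+A minimal sketch of how such a rule table could be matched in Python (the patterns and role names are hypothetical):
+
+```
+import re
+
+# Hypothetical rules: email pattern -> role to pre-assign
+EMAIL_ROLE_RULES = (
+    (re.compile(r".*\.edu$"), "edu-default-role"),
+    (re.compile(r".*\.uthsc\.edu$"), "uthsc-default-role"),
+)
+
+def roles_for_email(email: str) -> list[str]:
+    """Return all pre-assigned roles whose pattern matches the email."""
+    return [role for pattern, role in EMAIL_ROLE_RULES if pattern.match(email)]
+```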
diff --git a/topics/biohackathon/biohackrxiv2024.gmi b/topics/biohackathon/biohackrxiv2024.gmi
new file mode 100644
index 0000000..a159ec4
--- /dev/null
+++ b/topics/biohackathon/biohackrxiv2024.gmi
@@ -0,0 +1,7 @@
+# BioHackRxiv
+
+We have a hacking week in Barcelona to work on BioHackRXiv.
+
+# Tasks
+
+* [ ] ORCIDs for authors in PDF
diff --git a/topics/R-qtl2-format-notes.gmi b/topics/data/R-qtl2-format-notes.gmi
index e0109b1..3397b5e 100644
--- a/topics/R-qtl2-format-notes.gmi
+++ b/topics/data/R-qtl2-format-notes.gmi
@@ -1,4 +1,4 @@
-# R/qtl2 Format Notes
+# R/qtl2 and GEMMA Format Notes
 
 This document is mostly to help other non-biologists figure out their way around the format(s) of the R/qtl2 files. It mostly deals with the meaning/significance of the various fields.
 
@@ -12,6 +12,39 @@ and
 
 We are going to consider the "non-transposed" form here, for ease of documentation: simply flip the meanings as appropriate for the transposed files.
 
+To convert between formats we should probably use Python, as it can serve as the 'esperanto' between them.
+
+## Control files
+
+Both GN and R/qtl2 have control files. For GN it basically describes the individuals (genometypes) and looks like:
+
+```js
+{
+        "mat": "C57BL/6J",
+        "pat": "DBA/2J",
+        "f1s": ["B6D2F1", "D2B6F1"],
+        "genofile" : [{
+                "title" : "WGS-based (Mar2022)",
+                "location" : "BXD.8.geno",
+                "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44",
+ ...]}]}
+```
+
+In gn-guile this gets parsed in gn/data/genotype.scm to fetch the individuals that match the genotype and phenotype layouts.
+
+## pheno files and phenotypes
+
+The standard GEMMA input files are not very good for troubleshooting. R/qtl2 at least has the individual or genometype ID on every line:
+
+```
+id,bolting_days,seed_weight,seed_area,ttl_seedspfruit,branches,height,pc_seeds_aborted,fruit_length
+MAGIC.1,15.33,17.15,0.64,45.11,10.5,NA,0,14.95
+MAGIC.2,22,22.71,0.75,49.11,4.33,42.33,1.09,13.27
+MAGIC.3,23,21.03,0.68,57,4.67,50,0,13.9
+```
+
+This is a good standard and can be matched against the control files.
+
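+A minimal cross-check sketch in Python (the 'esperanto' mentioned above); the file names BXD.json and pheno.csv are hypothetical:
+
+```
+import csv
+import json
+
+# Individuals declared in the GN control file
+with open("BXD.json") as f:
+    control = json.load(f)
+samples = set(control["genofile"][0]["sample_list"])
+
+# Individuals appearing in the R/qtl2-style pheno file
+with open("pheno.csv") as f:
+    ids = {row["id"] for row in csv.DictReader(f)}
+
+print("individuals missing from the control file:", sorted(ids - samples))
+```
+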
 ## geno files
 
 > The genotype data file is a matrix of individuals × markers. The first column is the individual IDs; the first row is the marker names.
@@ -22,10 +55,6 @@ For GeneNetwork, this means that the first column contains the Sample names (pre
 
 The first column of the gmap/pmap file contains genetic marker values. There are no Individuals/samples (or strains) here.
 
-## pheno files
-
-The first column is the list of individuals (samples/strains) whereas the first column is the list of phenotypes.
-
 ## phenocovar files
 
 These seem to contain extra metadata for the phenotypes.
diff --git a/topics/data/epochs.gmi b/topics/data/epochs.gmi
new file mode 100644
index 0000000..3e8b676
--- /dev/null
+++ b/topics/data/epochs.gmi
@@ -0,0 +1,153 @@
+# Epochs
+
+In the 2019 BXD paper epochs are brought up. Basically, even though the BXD are 'immortal' with identical children, mutations do creep in. An epoch is a period of mice, and we track the years a mouse was used. BXD1 breeding, for example, started in 1971 and production in 2001. In GN we don't make a distinction (per se), but obviously these are (slightly) different mice today. Ashbrook et al. find some interesting results that differ between epochs.
+
+In GN epochs are currently handled as a trait. This can help with covariate mapping. For a different epoch, however, the genotypes should also be adapted. The effect on the kinship matrix will be minor, but genotypes can be used for fine mapping. With pangenome derived genotypes it should get even more interesting.
+
+# Fetching data
+
+Tracking the epochs happens in a spreadsheet. According to track changes, only one item was changed in two years: BXD10 was marked as extinct.
+
+In the GN SQL database Epoch with its RRID is stored as a CaseAttribute:
+
+```
+MariaDB [db_webqtl]> select * from CaseAttribute LIMIT 3;
++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| InbredSetId | CaseAttributeId | Name   | Description                                                                                                                                                 |
++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|           1 |               1 | Status | Live= Available at JAX, Cryo=Cryopreserved only, Extinct                                                                                                    |
+|           1 |              36 | RRID   | Research resource identifier given by SciCrunch.org                                                                                                         |
+|           1 |              37 | Epoch  | BXD family subgroups. Each number with common parents. Epoch1(BXD1-32), Epoch2-6 (BXD33-220). See Ashbrook et al. https://pubmed.ncbi.nlm.nih.gov/33472028/ |
++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+```
+
+And
+
+```
+MariaDB [db_webqtl]> select * from CaseAttributeXRefNew LIMIT 40;
++-------------+----------+-----------------+------------+
+| InbredSetId | StrainId | CaseAttributeId | Value      |
++-------------+----------+-----------------+------------+
+|           1 |        1 |               1 | Live       |
+|           1 |        1 |              36 | JAX:100006 |
+|           1 |        1 |              37 | 0          |
+|           1 |        1 |              40 |            |
+|           1 |        2 |               1 | Live       |
+|           1 |        2 |              36 | JAX:000664 |
+|           1 |        2 |              37 | 0          |
+|           1 |        2 |              40 | 69         |
+|           1 |        3 |               1 | Live       |
+|           1 |        3 |              36 | JAX:000671 |
+|           1 |        3 |              37 | 0          |
+|           1 |        3 |              40 | 108        |
+|           1 |        4 |               1 | Live       |
+...
+```
+
+I am not going to comment on this table architecture, other than that RDF is a much better fit.
+
+For extracting this data, the SQL table is probably the best source of 'truth' as it is seen by users on a regular basis. But, at this point, we'll just use the spreadsheet. Generating something like:
+
+```
+gn:Bxd14
+                                dct:description "BXD014/TyJ" ;
+                                gnt:epoch 1 ;
+                                gnt:availability "Cryorecovery" ;
+                                gnt:method "B6 female to D2 male F2 intercross" ;
+                                gnt:M_origin "B6" ;
+                                gnt:Y_origin "D2" ;
+                                gnt:JAX "000329" ;
+                                gnt:start_year 1971 ;
+                                gnt:age_seq_ind 271 ;
+                                gnt:birth_seq_ind "2/18/2016" ;
+                                gnt:availability_2023 "Cryorecovery" ;
+                                gnt:has_genotypes true ;
+                                rdfs:label "BXD14" .
+gn:Bxd65
+                                dct:description "BXD065/RwwJ" ;
+                                gnt:epoch 3 ;
+                                gnt:availability "Available" ;
+                                gnt:method "Advanced intercross progeny of B6 female to D2 male" ;
+                                gnt:M_origin "B6" ;
+                                gnt:Y_origin "D2" ;
+                                gnt:JAX "007110" ;
+                                gnt:start_year 1999 ;
+                                gnt:age_seq_ind 46 ;
+                                gnt:birth_seq_ind "9/18/2016" ;
+                                gnt:availability_2023 "Available" ;
+                                gnt:has_genotypes true ;
+                                rdfs:label "BXD65" .
+etc.
+```
+
+# Approach
+
+## Fetching data
+
+To get at the epochs we'll need to fetch the sample/ind names (such as BXD73b) from GN.
+
+For every dataset we can fetch samples+values with
+
+```
+curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+{"BXD40":-1.631969,"BXD68":-2.721761,"BXD43":-2.290135,"BXD44":-2.512057,"BXD48":-3.128819 ...
+```
+
+These are also stored in the pangemma output lmdb files. We don't want to store all values in RDF, as they are only used for compute and can easily be fetched on demand from GN. We do want to access the sample names, but that list is not necessarily unique to a single trait. In fact, a trait should reference an experiment/dataset that has the samples/inds. Usually they will use the same animals. To not complicate things we'll just point to the samples with something like
+
+```
+traitid gn:sample gn:BXD40 .
+```
+
+Currently RDF contains
+
+```
+gn:Bxd12 rdfs:label "BXD12" .
+gn:Bxd12 rdf:type gnc:strain .
+gn:Bxd12 gnt:belongsToSpecies gn:Mus_musculus .
+```
+
+and traits have
+
+```
+gn:traitBxd_10002 rdf:type gnc:Phenotype .
+gn:traitBxd_10002 gnt:belongsToGroup gn:setBxd .
+gn:traitBxd_10002 gnt:traitId "10002" .
+gn:traitBxd_10002 skos:altLabel "BXD_10002" .
+gn:traitBxd_10002 dct:description "Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]" .
+gn:traitBxd_10002 gnt:abbreviation "ADJCBLWT" .
+gn:traitBxd_10002 gnt:submitter "robwilliams" .
+gn:traitBxd_10002 gnt:mean "52.22058767430923"^^xsd:double .
+gn:traitBxd_10002 gnt:locus gn:Rsm10000005699 .
+gn:traitBxd_10002 gnt:lodScore "4.779380894726979"^^xsd:double .
+gn:traitBxd_10002 gnt:additive "2.0817857571428617"^^xsd:double .
+gn:traitBxd_10002 gnt:sequence "1"^^xsd:integer .
+gn:traitBxd_10002 dct:isReferencedBy pubmed:11438585 .
+```
+
+Ignore the capitalization and some naming (gnc:strain should be gnc:sample); we'll fix that. For now we can find some trait info and link the individuals up with a trait.
+
+The query we want to write is something like
+
+```
+SELECT * WHERE {
+  ?traitid a gnc:Phenotype;
+  gnt:traitId "10002" ;
+  gnt:belongsToGroup gn:setBxd ;
+  gnt:traitId ?trait ;
+  dct:isReferencedBy ?pubmed .
+  OPTIONAL {
+    ?traitid dct:description ?descr ;
+    gnt:sample_id ?sampleid .
+    ?sampleid rdfs:label ?sample .
+    }
+} LIMIT 10
+```
+
+So, for every trait/sample combination we need to add
+
+```
+gn:traitBxd_10002 gnt:sample_id gn:Bxd12 .
+```
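+
+Such a query can be run programmatically (a minimal sketch, assuming the SPARQLWrapper package; the endpoint URL and prefix URIs are assumptions, adjust to your Virtuoso instance):
+
+```
+from SPARQLWrapper import SPARQLWrapper, JSON
+
+sparql = SPARQLWrapper("http://localhost:8890/sparql")  # hypothetical endpoint
+sparql.setQuery("""
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+SELECT ?traitid ?sample WHERE {
+  ?traitid a gnc:Phenotype ;
+           gnt:sample_id ?sampleid .
+  ?sampleid rdfs:label ?sample .
+} LIMIT 10
+""")
+sparql.setReturnFormat(JSON)
+for row in sparql.query().convert()["results"]["bindings"]:
+    print(row["traitid"]["value"], row["sample"]["value"])
+```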
diff --git a/topics/data/precompute/steps.gmi b/topics/data/precompute/steps.gmi
index 75e3bfd..ac03d1a 100644
--- a/topics/data/precompute/steps.gmi
+++ b/topics/data/precompute/steps.gmi
@@ -2,7 +2,8 @@
 
 At this stage precompute fetches a trait from the DB and runs GEMMA. Next it tar balls up the vector for later use. It also updates the database with the latest info.
 
-To actually kick off compute on machines that do not access the DB I realize now we need a step-wise approach. Basically you want to shift files around without connecting to a DB. And then update the DB whenever it is convenient. So we are going to make it a multi-step procedure. I don't have to write all code because we have a working runner. I just need to chunk the work.
+To actually kick off compute on machines that cannot access the DB, I now realize we need a step-wise approach: shift files around without connecting to a DB, and update the DB whenever it is convenient. So we are going to make it a multi-step procedure and chunk the work.
 
 We will track precompute steps here. We will have:
 
@@ -13,8 +14,18 @@ We will track precompute steps here. We will have:
 Trait archives will have steps for
 
 * [X] step p1: list-traits-to-compute
-* [+] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper
-* [ ] step p3: gemma-to-lmdb: create a clean vector
+* [X] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper
+* [X] step p3: gemma-to-lmdb: create a clean vector
+
+Start precompute
+
+* [ ] Fetch traits on tux04
+* [ ] Set up runner on tux04 and others
+* [ ] Run on Octopus
+
+Work on published data
+
+* [ ] Fetch traits
 
 The DB itself can be updated from these
 
@@ -22,8 +33,11 @@ The DB itself can be updated from these
 
 Later
 
+* [ ] Rqtl2: Compute Rqtl2 vector
 * [ ] bulklmm: Compute bulklmm vector
 
+Interestingly, this work coincides with Arun's work on CWL. Rather than trying to write a workflow in bash, we'll use ccwl and accompanying tools to scale up the effort.
+
 # Tags
 
 * assigned: pjotrp
@@ -36,10 +50,10 @@ Later
 
 * [ ] Check Artyoms LMDB version for kinship and maybe add LOCO
 * [+] Create JSON metadata controller for every compute incl. type of content
-* [+] Create genotype archive
-* [+] Create kinship archive
+* [X] Create genotype archive
+* [X] Create kinship archive
 * [+] Create trait archives
-* [+] Kick off lmm9 step
+* [X] Kick off lmm9 step
 * [ ] Update DB step v1
 
 # Step p1: list traits to compute
@@ -62,7 +76,7 @@ At this point we can write
 {"2":9.40338,"3":10.196,"4":10.1093,"5":9.42362,"6":9.8285,"7":10.0808,"8":9.17844,"9":10.1527,"10":10.1167,"11":9.88551,"13":9.58127,"15":9.82312,"17":9.88005,"19":10.0761,"20":10.2739,"21":9.54171,"22":10.1056,"23":10.5702,"25":10.1433,"26":9.68685,"28":9.98464,"29":10.132,"30":9.96049,"31":10.2055,"35":10.1406,"36":9.94794,"37":9.96864,"39":9.31048}
 ```
 
-Note that it (potentially) includes the parents. Also the strain-id is a string and we may want to plug in the strain name. To allow for easy comparison downstream. Finally we may want to store a checksum of sorts. In Guile this can be achieved with:
+Note that it (potentially) includes the parents; that is corrected when generating the phenotype file for GEMMA. Also, the strain-id is a string and we may want to plug in the strain name to allow for easy comparison downstream. Finally, we may want to store a checksum of sorts. In Guile this can be achieved with:
 
 ```scheme
 (use-modules  (rnrs bytevectors)
diff --git a/topics/database/mariadb-database-architecture.gmi b/topics/database/mariadb-database-architecture.gmi
new file mode 100644
index 0000000..0454d71
--- /dev/null
+++ b/topics/database/mariadb-database-architecture.gmi
@@ -0,0 +1,830 @@
+# MariaDB Database Architecture
+
+The GeneNetwork database is running on MariaDB and the layout is almost carved in stone because so much code depends on it.
+We are increasingly moving material out into lmdb (genotypes and phenotypes) and virtuoso (all types of metadata), but this is proving a lengthy and rather tedious process. We also run redis for caching, sqlite for authentication, and xapian for search!
+
+In this document we'll discuss where things are, where they ought to go, and how the nomenclature should change.
+
+An SVG of the SQL layout can be found here:
+
+=> https://raw.githubusercontent.com/genenetwork/gn-gemtext-threads/main/topics/database/sql.svg
+
+# Nomenclature
+
+These are the terms we use:
+
+* Genotypes
+* Case or genometype: individual, strain, sample
+* ProbeData: Now almost obsolete. [Comment by RWW perhaps for a footnote: In GeneNetwork 1 we had built and maintained a table for individual "Probe level" data simply because the Affymetrix data sets were so large. For example, the BXD Family "UMUTAffy Hippocampus Exon (Feb09) RMA" array data consists of 1.236 million "probesets", each of which is a summary of many individual probe assays (ProbeData): a total of 4.5 million probes (see https://www.thermofisher.com/order/catalog/product/900817). In GN1 we built a special interface to interrogate these 4.5 million individual probes, which is extremely useful for studying the fine structure of mRNA expression. We thought it best to split these very large "probe-level" data sets from the much smaller and more widely used "ProbeSetData". The term "Probe" in this particular context (Affymetrix Exon arrays) refers to short nucleotide probes used by Affymetrix and other microarray vendors. Affymetrix "Exon"-type arrays consist of 25 nt hybridization probes that target relatively specific parts of RNAs, mainly exons but also many intronic sequences.]
+* ProbeSetData: trait/sample values almost exclusively used for molecular data types (mRNA, protein, methylation assays, metabolomics, etc). [Comment by RWW perhaps for a footnote: The term "ProbeSetData" should ideally be changed to "High_Content_Data_Assays". In 2003 the only high content data assays we had were Affymetrix microarrays that measured mRNA level, and the vendor called their assays "ProbeSets". We used this now obsolete term. Most ProbeSetData in GN1 and GN2 as of 2024 are measurements of molecular traits that can be tagged to a single genome location: the location of the gene from which the mRNA and its derivative protein are transcribed and translated, or, in the case of epigenomic studies, the site at which the genome is methylated. When these three types of molecular traits are mapped, we typically add a mark to all graphic output maps that highlights the location of the "parent" gene. For example, the sonic hedgehog gene in mice is located on chromosome 5 at about 28.457 Mb on the mm10 assembly (aka GRCm38). When we measure the expression of Shh mRNA, we place a purple triangle at the coordinate of the Shh gene. Two notes: 1. There are at least three ProbeSetData types that do NOT have parent genes: metabolomic data, metagenomic data, and new high-content brain connectome data. When we do NOT know the location of a parent gene, we should NOT place any mark along the X-axis. 2. Ideally GN databases would define the TYPE of high-content data, so that the code could fork to the correct GUI for that particular data type. Connectome data for the brain is an example of a data type that is very large (40,000 measurements per brain), that is truly high-content data, but that is NOT molecular. Time series data may also fall into this category.]
+* ProbeSetFreeze: points to datasets
+
+## More on naming
+
+Naming convention-wise there is a confusing use of id and data-id in particular. We should stick to the table-id naming.
+
+# The small test database (2GB)
+
+The default install comes with a smaller database which includes a
+number of the BXDs and the Human liver dataset (GSE9588).
+
+It can be downloaded from:
+
+=> https://files.genenetwork.org/database/
+
+Try the latest one first.
+
+# GeneNetwork database
+
+Estimated table sizes, with a metadata comment for the important tables:
+
+```
+select table_name,round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` from information_schema.TABLES where table_schema = "db_webqtl" order by data_length;
+```
+
+```
++-------------------------+------------+
+| table_name              | Size in MB | Should be named:
++-------------------------+------------+
+| PublishData             |      22.54 | ClassicTraitValues  <- data-id, strain-id, value (3M traits)
+| PublishSE               |       4.71 | ClassicTraitValueError (300K traits) <- data-id, strain-id, value
+| PublishXRef             |       2.18 | List of publications <- id, data-id, inbred-id, pheno-id, pub-id
+| ProbeSetData            |   59358.80 | BulkTraitValues     <- id, strain, value
+| ProbeSetSE              |   14551.02 | BulkTraitValueError <- SE values aligns with ProbeSetData
+| ProbeSetXRef            |    4532.89 | PrecomputedLRS      <- precomputed LRS values, pointing to dataset+trait
+| ProbeSet                |    2880.21 | ProbeSetInfo        <- over utilized mRNA probeset description, e.g. 100001_at comes with sequence info
+| ProbeSetFreeze          |       0.22 | DatasetInfo         <- dataset description, e.g. "Hippocampus_BXD_Jun06" - probesetfreezeid points to dataset, shortname, public?
+| Probe                   |    2150.30 | ProbeInfo           <- Probe trait info incl sequence, id, probeset-id
+| ProbeFreeze             |       0.06 | Dataset names       <- Similar to ProbesetFreeze, id, chip-id, inbredset-id, tissue-id
+| Phenotype               |       6.50 | PhenotypeMeta       <- "Hippocampus weight", id, prepublish short-name, postpublish short-name
+| ProbeXRef               |     743.38 | ProbeFreezeDataIDs  <- link ProbeFreeze-Id,Probe-Id with Data-Id
+| Datasets                |       2.31 | DatasetMeta         <- "Data generated by...", investigator-id, publication title
+| NStrain                 |       4.80 | StrainCountDataId   <- Strains used in dataset, count, strain-id, data-id
+| Strain                  |       1.07 | StrainNames         <- with species ID and alias, id, species-id, name
+| TissueProbeSetData      |      74.42 |                     <- link Id,TissueID with value
+| TissueProbeSetXRef      |      14.73 | TissueGeneTable? <- data-id, gene-id, mean, symbol, TissueProbeSetFreezeId | ProbesetId | DataId
+| TissueProbeSetFreeze    |       0.01 | tissueprobefreeze-id
+| InbredSet               |       0.01 | InbredSetMeta -> Id,SpeciesId,FullName
+| ProbeData               |   22405.44 | (OLD?) mRNAStrainValues used for partial correlations <- id, strain, value = individual probe data (mRNA) [GN1,GN3]
+| ProbeSE                 |    6263.83 | (OLD?) Trait Error  <- trait SE aligns with ProbeData? [GN3]
++-------------------------+------------+
+```
+Less commonly used tables:
+
+```
++-------------------------+------------+
+| table_name              | Size in MB |
++-------------------------+------------+
+| LCorrRamin3             |   18506.53 |
+| SnpAll                  |   15484.67 |
+| SnpPattern              |    9177.05 |
+| QuickSearch             |    5972.86 |
+| GenoData                |    3291.91 | Strain by genotype - only used in GN1
+| CeleraINFO_mm6          |     989.80 |
+| pubmedsearch            |    1032.50 |
+| GeneRIF_BASIC           |     448.54 |
+| BXDSnpPosition          |     224.44 |
+| EnsemblProbe            |     133.66 |
+| EnsemblProbeLocation    |     105.49 |
+| Genbank                 |      37.71 |
+| AccessLog               |      42.38 |
+| GeneList                |      34.11 |
+| Geno                    |      33.90 | Marker probe info (incl. sequence)
+| MachineAccessLog        |      28.34 |
+| IndelAll                |      22.42 |
+| ProbeH2                 |      13.26 |
+| GenoXRef                |      22.83 |
+| TempData                |       8.35 |
+| GeneList_rn3            |       5.54 |
+| GORef                   |       4.97 |
+| temporary               |       3.59 |
+| InfoFiles               |       3.32 |
+| Publication             |       3.42 |
+| Homologene              |       5.69 |
+| GeneList_rn33           |       2.61 |
+| GeneRIF                 |       2.18 |
+| Vlookup                 |       1.87 |
+| H2                      |       2.18 |
+| IndelXRef               |       2.91 |
+| GeneMap_cuiyan          |       0.51 |
+| user_collection         |       0.30 |
+| CaseAttributeXRef       |       0.44 |
+| StrainXRef              |       0.56 |
+| GeneIDXRef              |       0.77 |
+| Docs                    |       0.17 |
+| News                    |       0.17 |
+| GeneRIFXRef             |       0.24 |
+| Sample                  |       0.06 |
+| login                   |       0.06 |
+| user                    |       0.04 |
+| TableFieldAnnotation    |       0.05 |
+| DatasetMapInvestigator  |       0.05 |
+| User                    |       0.04 |
+| TableComments           |       0.02 |
+| Investigators           |       0.02 |
+| DBList                  |       0.03 |
+| Tissue                  |       0.02 |
+| GeneChip                |       0.01 |
+| GeneCategory            |       0.01 |
+| SampleXRef              |       0.01 |
+| SnpAllele_to_be_deleted |       0.00 |
+| Organizations           |       0.01 |
+| PublishFreeze           |       0.00 |
+| GenoFreeze              |       0.00 | Used for public/private
+| Chr_Length              |       0.01 |
+| SnpSource               |       0.00 |
+| AvgMethod               |       0.00 |
+| Species                 |       0.00 |
+| Dataset_mbat            |       0.00 |
+| TissueProbeFreeze       |       0.00 |
+| EnsemblChip             |       0.00 |
+| UserPrivilege           |       0.00 |
+| CaseAttribute           |       0.00 |
+| MappingMethod           |       0.00 |
+| DBType                  |       0.00 |
+| InfoFilesUser_md5       |       0.00 |
+| GenoCode                |       0.00 |
+| DatasetStatus           |       0.00 |
+| GeneChipEnsemblXRef     |       0.00 |
+| GenoSE                  |       0.00 |
+| user_openids            |       0.00 |
+| roles_users             |       0.00 |
+| role                    |       0.00 |
+| Temp                    |       NULL |
++-------------------------+------------+
+97 rows in set, 1 warning (0.01 sec)
+```
+
+All *Data tables are large
+
+## Tables containing trait values
+
+A trait on GN is defined by a trait-id with a dataset-id.
+
+=> https://genenetwork.org/show_trait?trait_id=10031&dataset=BXDPublish
+
+The trait-id can also be a probe name
+
+=> https://genenetwork.org/show_trait?trait_id=1441566_at&dataset=HC_M2_0606_P
+
+One of the more problematic aspects of GN is that there are two tables containing trait values (actually there are three!). ProbeSetData mostly contains expression data. PublishData contains 'classical' phenotypes. ProbeData is considered defunct.
+
+So, a set of trait values gets described by the dataset+probe (trait_id), OR by BXDPublish (which is its own table) and an identifier, here 10031; see the sketch below.
+
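+A minimal lookup sketch, assuming the pymysql package and local db_webqtl credentials (the connection details are hypothetical):
+
+```
+import pymysql
+
+conn = pymysql.connect(host="localhost", user="webqtluser",
+                       password="change-me", database="db_webqtl")
+with conn.cursor() as cur:
+    # PublishXRef maps the trait Id to a DataId; PublishData holds the
+    # per-strain values for that DataId (InbredSetId 1 = BXD)
+    cur.execute("""
+        SELECT Strain.Name, PublishData.value
+        FROM PublishXRef
+        JOIN PublishData ON PublishData.Id = PublishXRef.DataId
+        JOIN Strain ON Strain.Id = PublishData.StrainId
+        WHERE PublishXRef.Id = %s AND PublishXRef.InbredSetId = %s
+    """, (10031, 1))
+    for name, value in cur.fetchall():
+        print(name, value)
+```
+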
+OK, let's look at the ProbeSetData (expression) traits:
+
+```
+MariaDB [db_webqtl]> select * from ProbeSetData limit 5;
++----+----------+-------+
+| Id | StrainId | value |
++----+----------+-------+
+|  1 |        1 | 5.742 |
+|  1 |        2 | 5.006 |
+|  1 |        3 | 6.079 |
+|  1 |        4 | 6.414 |
+|  1 |        5 | 4.885 |
++----+----------+-------+
+5 rows in set (0.193 sec)
+MariaDB [db_webqtl]> select * from ProbeData limit 5;
++--------+----------+---------+
+| Id     | StrainId | value   |
++--------+----------+---------+
+| 503636 |       42 | 11.6906 |
+| 503636 |       43 | 11.4205 |
+| 503636 |       44 | 11.2491 |
+| 503636 |       45 | 11.2373 |
+| 503636 |       46 | 12.0471 |
++--------+----------+---------+
+5 rows in set (0.183 sec)
+```
+
+ProbeSet describes ProbeSetData. I.e., every probe ID comes with a sequence (microarray) etc.
+
+As for duplicated data: duplicated or "detached" data happens sometimes, though that's not related to the PublishData/ProbeSetData distinction (unless this is done deliberately for some reason). I believe that whether data is entered as one or the other primarily comes down to the desire/need to divide it into datasets (or by tissue) within a group, with mRNA expression data just being the most common reason for this. I've encountered a situation before with Arthur where there was data in ProbeSetData that wasn't also in ProbeSetXRef.
+
+Can you give an example of exactly what you mean? PublishData would be stuff like sex, weight, etc. (is this what you mean?) while ProbeSetData is used for mRNA expression data (except for a few situations where it isn't, lol).
+
+That being said, *functionally*, I think the only real distinction (aside from what metadata is displayed) is that "ProbeSet" data has extra levels of "granularity": it is also organized by tissue type and can be split into "datasets", while "PublishData" traits are only associated with a Group (InbredSet in the DB). That's why some non-mRNA expression data is still classified as "ProbeSet" - I think it's basically just a way to separate it into datasets within a group, often for specific tissues.
+
+So the organization is something like this:
+
+```
+Group -> PublishData
+Group -> Tissue -> Dataset -> ProbeSetData
+```
+
+## ProbeData
+
+[OBSOLETE] ProbeData, meanwhile, is a table with fine-grained probe-level Affymetrix data only. It contained 1 billion rows in March 2016. This table may be *deleted* later since it is only used by the Probe Table display in GN1; it is not used in GN2.
+"ProbeData" should probably be "AssayData" or something more neutral.
+
+In comparison, the "ProbeSetData" table contains more molecular assay data, including probe set data, RNA-seq data, proteomic data, and metabolomic data: 2.5 billion rows as of March 2016.
+ProbeData contains only Affymetrix probe-level data (e.g. Exon array probes and M430 probes).
+
+"StrainId" should be "CaseId" or "SampleId" or "GenometypeId", see nomenclature above.
+
+```
+select * from ProbeData limit 2;
++--------+----------+---------+
+| Id     | StrainId | value   |
++--------+----------+---------+
+| 503636 |       42 | 11.6906 |
+| 503636 |       43 | 11.4205 |
++--------+----------+---------+
+2 rows in set (0.00 sec)
+
+select count(*) from ProbeData limit 2;
++-----------+
+| count(*)  |
++-----------+
+| 976753435 |
++-----------+
+1 row in set (0.00 sec)
+```
+
+## PublishData
+
+These are the classic phenotypes under BXDPublish.
+
+```
+MariaDB [db_webqtl]> select * from PublishData where StrainId=5 limit 5;
++---------+----------+------------+
+| Id      | StrainId | value      |
++---------+----------+------------+
+| 8967043 |        5 |  49.000000 |
+| 8967044 |        5 |  50.099998 |
+| 8967045 |        5 | 403.000000 |
+| 8967046 |        5 |  45.500000 |
+| 8967047 |        5 |  44.900002 |
++---------+----------+------------+
+5 rows in set (0.265 sec)
+MariaDB [db_webqtl]> select * from PublishSE where StrainId=5 limit 5;
++---------+----------+-------+
+| DataId  | StrainId | error |
++---------+----------+-------+
+| 8967043 |        5 |  1.25 |
+| 8967044 |        5 |  0.71 |
+| 8967045 |        5 |   8.6 |
+| 8967046 |        5 |  1.23 |
+| 8967047 |        5 |  1.42 |
++---------+----------+-------+
+5 rows in set (0.203 sec)
+MariaDB [db_webqtl]> select * from PublishXRef limit 2;
++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Id    | InbredSetId | PhenotypeId | PublicationId | DataId  | mean              | Locus          | LRS              | additive         | Sequence | comments                                                                                                                                                     |
++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| 10001 |           1 |           4 |           116 | 8967043 | 52.13529418496525 | rs48756159     | 13.4974911471087 | 2.39444435069444 |        1 | robwilliams modified post_publication_description at Mon Jul 30 14:58:10 2012
+robwilliams modified post_publication_description at Sat Jan 30 13:48:49 2016
+ |
+| 10002 |           1 |          10 |           116 | 8967044 | 52.22058767430923 | rsm10000005699 |  22.004269639323 | 2.08178575714286 |        1 | robwilliams modified phenotype at Thu Oct 28 21:43:28 2010
+                                                                                                  |
++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
+2 rows in set (0.328 sec)
+```
+
+## ProbeSet
+
+
+Comment: PLEASE CHANGE TABLE NAME and rework fields carefully. This is a terrible table, but it works well (RWW March 2016). It is used in combination with the crucial TRAIT DATA and ANALYSIS pages in GN1 and GN2. It is also used by annotators using the UPDATE INFO AND DATA web form to correct and update annotation. It is used by Arthur to enter new annotation files and metadata for arrays, genes, proteins, metabolites. The main problem with this table is that it is doing too much work, and at the same time not enough: it is huge, but does not track changes. The plan is to migrate to lmdb for that.
+
+Initially (2003) this table contained only Affymetrix ProbeSet data for mouse (U74aV2 initially). Many other array platforms for different species were added. At least four other major categories of molecular assays have been added since about 2010:
+
+1. RNA-seq annotation and sequence data for transcripts using ENSEMBL identifiers or NCBI NM_XXXXX and NR_XXXXX type identifiers
+
+2. Protein and peptide annotation and sequence data (see BXD Liver Proteome data, SRM and SWATH type data) with identifiers such as "abcb10_q9ji39_t311" for SRM data and "LLGNMIVIVLGHHLGKDFTPAAQAA" for SWATH data where the latter is just the peptide fragment that has been quantified. Data first entered in 2015 for work by Rudi Aebersold and colleagues.
+
+3. Metabolite annotation and metadata (see BXD Liver Metabolome data) with identifiers that are usually Mass charge ratios such as "149.0970810_MZ"
+
+4. Epigenomic and methylome data (e.g. Human CANDLE Methylation data with identifiers such as "cg24523000")
+
+It would make good sense to break this table into four or more types of molecular assay metadata or annotation tables) (AssayRNA_Anno, AssayProtein_Anno, AssayMetabolite_Anno, AssayEpigenome_Anno, AssayMetagenome_Anno), since these assays will have many differences in annotation content compared to RNAs (RWW).
+
+Some complex logic is used to update the contents of this table when annotators modify and correct the information (for example, updating gene symbols). These features were requested by Rob so that annotating one gene symbol in one species would annotate all gene symbols in the same species based on a common NCBI GeneID number. For example, changing the gene alias for one ProbeSet.Id will change the list of aliases in all instances with the same gene symbol.
+
+If the ProbeSet.BlatSeq (or is this ProbeSet.TargetSeq?) is identical between different ProbeSet.Ids, then annotation is forced to be the same even if the symbol or geneID is different. This "feature" was implemented when we found many probe sets with identical sequence but different annotations and identifiers.
+
+
+```
+select count(*) from ProbeSet limit 5;
++----------+
+| count(*) |
++----------+
+|  4351030 |
++----------+
+| Id   | ChipId | Name     | TargetId | Symbol | description                                  | Chr  | Mb        | alias    | GeneId | GenbankId | SNP  | BlatSeq                                                                                                                                                                     |TargetSeq                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | UniGeneId | Strand_Probe | Strand_Gene | OMIM   | comments | Probe_set_target_region | Probe_set_specificity | Probe_set_BLAT_score | Probe_set_Blat_Mb_start | Probe_set_Blat_Mb_end | Probe_set_strand | Probe_set_Note_by_RW | flag | Symbol_H | description_H | chromosome_H | MB_H | alias_H | GeneId_H | chr_num | name_num | Probe_Target_Description | RefSeq_TranscriptId | Chr_mm8 | Mb_mm8    | Probe_set_Blat_Mb_start_mm8 | Probe_set_Blat_Mb_end_mm8 | HomoloGeneID | Biotype_ENS | ProteinID | ProteinName | Flybase_Id | HMDB_ID | Confidence | ChEBI_ID | ChEMBL_ID | CAS_number | PubChem_ID | ChemSpider_ID | UNII_ID | EC_number | KEGG_ID | Molecular_Weight | Nugowiki_ID | Type | Tissue | PrimaryName | SecondaryNames | PeptideSequence |
++------+--------+----------+----------+--------+----------------------------------------------+------+-----------+----------+--------+-----------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------+-------------+--------+----------+-------------------------+-----------------------+----------------------+-------------------------+-----------------------+------------------+----------------------+------+----------+---------------+--------------+------+---------+----------+---------+----------+--------------------------+---------------------+---------+-----------+-----------------------------+---------------------------+--------------+-------------+-----------+-------------+------------+---------+------------+----------+-----------+------------+------------+---------------+---------+-----------+---------+------------------+-------------+------+--------+-------------+----------------+-----------------+
+| 7282 |      1 | 93288_at | NULL     | Arpc2  | actin related protein 2/3 complex, subunit 2 | 1    | 74.310961 | AK008777 | 76709  | AI835883  |    0 | CCGACTTCCTTAAGGTGCTCAACCGGACTGCTTGCTACTGGATAATCGTGAGGGATTCTCCATTTGGGTTCCATTTTGTACGAGTTTGGCAAATAACCTGCAGAAACGAGCTGTGCTTGCAAGGACTTGATAGTTCCTAATCCTTTTCCAAGCTGTTTGCTTTGCAATATGT | ccgacttccttaaggtgctcaaccgtnnnnnnccnannnnccnagaaaaaagaaatgaaaannnnnnnnnnnnnnnnnnnttcatcccgctaactcttgggaactgaggaggaagcgctgtcgaccgaagnntggactgcttgctactggataatcgtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnntgagggattctccatttgggttccattttgtacgagtttggcaaataacctgcagaaacgagctgtgcttgcaaggacttgatagttcctaagaattanaanaaaaaaaanaanttccacttgatcaanttaattcccttttatttttcctccctcantccccttccttttccaagctgtttgctttgcaatatgt                                                                                                                                                                                                                                     | Mm.337038 | +            |             | 604224 |          | NULL                    |                  8.45 |                  169 |               74.310961 |              74.31466 | NULL             | NULL                 | 3    | NULL     | NULL          | NULL         | NULL | NULL    | NULL     |       1 |    93288 | NULL                     | XM_129773           | 1       | 74.197594 |                   74.197594 |                 74.201293 | 4187         | NULL        | NULL      | NULL        | NULL       | NULL    |       NULL |     NULL | NULL      | NULL       |       NULL |          NULL | NULL    | NULL      | NULL    |             NULL |        NULL | NULL | NULL   | NULL        | NULL           | NULL            |
++------+--------+----------+----------+--------+----------------------------------------------+------+-----------+----------+--------+-----------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------+-------------+--------+----------+-------------------------+-----------------------+----------------------+-------------------------+-----------------------+------------------+----------------------+------+----------+---------------+--------------+------+---------+----------+---------+----------+--------------------------+---------------------+---------+-----------+-----------------------------+---------------------------+--------------+-------------+-----------+-------------+------------+---------+------------+----------+-----------+------------+------------+---------------+---------+-----------+---------+------------------+-------------+------+--------+-------------+----------------+-----------------+
+2 rows in set (0.00 sec)
+```
+
+** ProbeSetXRef (phenotypes/dataset_name.json)
+
+For every probe set (read: dataset measuring point):
+
+```
+select * from ProbeSetXRef;
+| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean             | se   | Locus      | LRS               | pValue | additive              | h2   |
+|              112 |     123528 | 23439389 | NULL      |    NULL |       NULL |  6.7460707070707 | NULL | rs6239372  |  10.9675593568894 |  0.567 |    0.0448545966228878 | NULL |
+|              112 |     123527 | 23439388 | NULL      |    NULL |       NULL | 6.19416161616162 | NULL | rs13476936 |  10.9075670392762 |  0.567 |   -0.0358456732993988 | NULL |
+```
+
+where ProbeSetFreezeId identifies the dataset (experiment), ProbeSetId refers to the probe set information (measuring point), and DataId points to the data point. The other values are used for search, for example:
+
+```
+SELECT distinct ProbeSet.Name as TNAME,
+  ProbeSetXRef.Mean as TMEAN, ProbeSetXRef.LRS as TLRS,
+  ProbeSetXRef.PVALUE as TPVALUE, ProbeSet.Chr_num as TCHR_NUM,
+  ProbeSet.Mb as TMB, ProbeSet.Symbol as TSYMBOL,
+  ProbeSet.name_num as TNAME_NUM
+FROM ProbeSetXRef, ProbeSet
+WHERE ProbeSet.Id = ProbeSetXRef.ProbeSetId
+  and ProbeSetXRef.ProbeSetFreezeId = 112
+  ORDER BY ProbeSet.symbol ASC limit 5;
+| TNAME      | TMEAN            | TLRS               | TPVALUE               | TCHR_NUM | TMB        | TSYMBOL       | TNAME_NUM |
+| 1445618_at | 7.05679797979798 |   13.5417452764616 |                  0.17 |        8 |  75.077895 | NULL          |   1445618 |
+| 1452452_at |            7.232 |   30.4944361132252 | 0.0000609756097560421 |       12 |    12.6694 | NULL          |   1452452 |
+```
+
+# ProbeSetData
+
+ProbeSetData contains the main molecular data: probe sets, metabolome, etc.
+
+Almost all important molecular assay data is in this table, including probe set data, RNA-seq data, proteomic data, and metabolomic data: about 2.5 billion rows as of March 2016. In comparison, ProbeData contains data only for Affymetrix probe-level data (e.g. Exon array probes and M430 probes).
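+
+For illustration: given a DataId from ProbeSetXRef (e.g. 23439389 above), the per-strain values can be fetched directly from ProbeSetData. A minimal sketch, using the column names that appear in the join query further below:
+
+```
+select StrainId, value
+from ProbeSetData
+where Id = 23439389;
+```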
+
+# Strain
+
+```
+select * from Strain limit 5;
++----+----------+----------+-----------+--------+-------+
+| Id | Name     | Name2    | SpeciesId | Symbol | Alias |
++----+----------+----------+-----------+--------+-------+
+|  1 | B6D2F1   | B6D2F1   |         1 | NULL   | NULL  |
+|  2 | C57BL/6J | C57BL/6J |         1 | B6J    | NULL  |
+|  3 | DBA/2J   | DBA/2J   |         1 | D2J    | NULL  |
+|  4 | BXD1     | BXD1     |         1 | NULL   | NULL  |
+|  5 | BXD2     | BXD2     |         1 | NULL   | NULL  |
++----+----------+----------+-----------+--------+-------+
+```
+
+```
+show indexes from Strain;
++--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
++--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+| Strain |          0 | PRIMARY  |            1 | Id          | A         |       14368 |     NULL | NULL   |      | BTREE      |         |               |
+| Strain |          0 | Name     |            1 | Name        | A         |       14368 |     NULL | NULL   | YES  | BTREE      |         |               |
+| Strain |          0 | Name     |            2 | SpeciesId   | A         |       14368 |     NULL | NULL   |      | BTREE      |         |               |
+| Strain |          1 | Symbol   |            1 | Symbol      | A         |       14368 |     NULL | NULL   | YES  | BTREE      |         |               |
++--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+```
+
+A typical query may look like:
+
+```
+SELECT Strain.Name, ProbeSetData.value, ProbeSetSE.error, ProbeSetData.Id
+FROM (ProbeSetData, ProbeSetFreeze, Strain, ProbeSet, ProbeSetXRef)
+LEFT JOIN ProbeSetSE ON
+  (ProbeSetSE.DataId = ProbeSetData.Id AND ProbeSetSE.StrainId = ProbeSetData.StrainId)
+WHERE
+  ProbeSetFreeze.name = 'B139_K_1206_M' AND
+  ProbeSetXRef.ProbeSetId = ProbeSet.Id AND
+  ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id AND
+  ProbeSetXRef.DataId = ProbeSetData.Id AND
+  ProbeSetData.StrainId = Strain.Id
+ORDER BY Strain.Name;
+
++-------+-------+-------+----------+
+| Name  | value | error | Id       |
++-------+-------+-------+----------+
+| SM001 |  38.3 |  NULL | 25309550 |
+| SM001 |   2.7 |  NULL | 25309520 |
+| SM001 |  20.3 |  NULL | 25309507 |
+| SM001 | 125.8 |  NULL | 25309511 |
+| SM001 |   8.2 |  NULL | 25309534 |
++-------+-------+-------+----------+
+5 rows in set (22.28 sec)
+```
+
+# ProbeSetFreeze
+
+```
+select * from ProbeSetFreeze limit 5;
++----+---------------+-------+-------------+---------------------------------+---------------------------------------------+-------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
+| Id | ProbeFreezeId | AvgID | Name        | Name2                           | FullName                                    | ShortName               | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale |
++----+---------------+-------+-------------+---------------------------------+---------------------------------------------+-------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
+|  1 |             3 |     1 | Br_U_0803_M | BXDMicroArray_ProbeSet_August03 | UTHSC Brain mRNA U74Av2 (Aug03) MAS5        | Brain U74Av2 08/03 MAS5 | 2003-08-01 |      NULL |      0 |               0 | NULL            | log2      |
+|  2 |            10 |     1 | Br_U_0603_M | BXDMicroArray_ProbeSet_June03   | UTHSC Brain mRNA U74Av2 (Jun03) MAS5        | Brain U74Av2 06/03 MAS5 | 2003-06-01 |      NULL |      0 |               0 | NULL            | log2      |
+|  3 |             8 |     1 | Br_U_0303_M | BXDMicroArray_ProbeSet_March03  | UTHSC Brain mRNA U74Av2 (Mar03) MAS5        | Brain U74Av2 03/03 MAS5 | 2003-03-01 |      NULL |      0 |               0 | NULL            | log2      |
+|  4 |             5 |     1 | Br_U_0503_M | BXDMicroArray_ProbeSet_May03    | UTHSC Brain mRNA U74Av2 (May03) MAS5        | Brain U74Av2 05/03 MAS5 | 2003-05-01 |      NULL |      0 |               0 | NULL            | log2      |
+|  5 |             4 |     1 | HC_U_0303_M | GNFMicroArray_ProbeSet_March03  | GNF Hematopoietic Cells U74Av2 (Mar03) MAS5 | GNF U74Av2 03/03 MAS5   | 2003-03-01 |      NULL |      0 |               0 | NULL            | log2      |
++----+---------------+-------+-------------+---------------------------------+---------------------------------------------+-------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
+```
+
+# ProbeSetXRef
+
+```
+select * from ProbeSetXRef limit 5;
++------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+-----------------+--------------------+--------+----------------------+------+
+| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old            | pValue_old | mean              | se                  | Locus           | LRS                | pValue | additive             | h2   |
++------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+-----------------+--------------------+--------+----------------------+------+
+|                1 |          1 |      1 | 10.095.400 |   13.3971627898894 |      0.163 |  5.48794285714286 | 0.08525787814808819 | rs13480619      | 12.590069931048001 |  0.269 |          -0.28515625 | NULL |
+|                1 |          2 |      2 | D15Mit189  | 10.042057464356201 |      0.431 |  9.90165714285714 |  0.0374686634976217 | CEL-17_50896182 |   10.5970737900941 |  0.304 | -0.11678333333333299 | NULL |
+|                1 |          3 |      3 | D5Mit139   |   5.43678531742749 |      0.993 |  7.83948571428571 |  0.0457583416912569 | rs13478499      |    6.0970532702754 |  0.988 |    0.112957489878542 | NULL |
+|                1 |          4 |      4 | D1Mit511   |   9.87815279480766 |      0.483 | 8.315628571428569 |  0.0470396593931327 | rs6154379       | 11.774867551173099 |  0.286 |   -0.157113725490196 | NULL |
+|                1 |          5 |      5 | D16H21S16  | 10.191723834264499 |      0.528 |  9.19345714285714 |  0.0354801718293322 | rs4199265       | 10.923263374016202 |  0.468 |  0.11476470588235299 | NULL |
++------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+-----------------+--------------------+--------+----------------------+------+
+```
+
+
+Note that the following unbounded query is very slow:
+
+```
+select max(value) from ProbeSetData;
++------------+
+| max(value) |
++------------+
+|   26436006 |
++------------+
+1 row in set (2 min 16.31 sec)
+```
+
+which, in some form, is used in the search page; see [[https://github.com/genenetwork/genenetwork2_diet/blob/master/wqflask/wqflask/do_search.py#L811][the search code]].
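+
+A bounded variant that only scans one dataset's values through the cross-reference table should behave much better. A minimal sketch, reusing ProbeSetFreezeId 112 from above:
+
+```
+select max(ProbeSetData.value)
+from ProbeSetData
+  join ProbeSetXRef on ProbeSetXRef.DataId = ProbeSetData.Id
+where ProbeSetXRef.ProbeSetFreezeId = 112;
+```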
+
+
+*** Comments
+
+I think the ProbeSetData table should be generalized to a 'phenotypes' table with a 'sample_id' column and a 'value' column.
+
+A new table 'samples' will link each sample against an 'experiment' and an 'individual', which in turn can link to a 'strain'.
+
+Experiment is meant here in a wide sense; GTEx could be one - I don't want to use the term dataset ;)
+
+This means a (slight) reordering:
+
+```
+phenotypes:  (id), sample_id, value
+samples:     experiment_id, individual_id
+experiments: name, version
+individual:  strain_id
+strains:     species_id
+species:     ...
+```
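+
+A minimal DDL sketch of that proposal (names and types are illustrative, not an agreed design; the remaining tables would follow the same pattern):
+
+```
+create table phenotypes (
+  id        bigint unsigned auto_increment primary key,
+  sample_id bigint unsigned not null,   -- references samples(id)
+  value     double not null
+);
+
+create table samples (
+  id            bigint unsigned auto_increment primary key,
+  experiment_id int unsigned not null,  -- references experiments(id)
+  individual_id int unsigned not null   -- references individuals(id)
+);
+```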
+
+ProbeData is also interesting because it has the same structure as ProbeSetData, but only contains microarray data. These tables should become one (when we clear up the cross-referencing) as they both contain phenotype values. Both are large tables.
+
+PublishData is another phenotype table containing only values, which can be merged into that same table. This data does not require probe set annotations(!)
+
+=> https://genenetwork.org/show_trait?trait_id=10031&dataset=BXDPublish
+
+So we have phenotype data in 3 tables with exactly the same
+layout. There is also TissueProbeSet*, but we'll ignore those for
+now. I think we should merge these into one and have the sample ref
+refer to the type of data (probeset, probe, metabolomics,
+whatever). These are all phenotype values and by having them split
+into different tables they won't play well when looking for
+correlations.
+
+ProbeSet contains the metadata on the probes and should (eventually)
+move into NoSQL. There is plenty of redundancy in that table now.
+
+I know it is going to be a pain to reorganize the database, but if we
+want to use it in the long run we are going to have to simplify it.
+
+# ProbeSetFreeze and ProbeFreeze (/dataset/name.json)
+
+GN_SERVER: /dataset/HC_M2_0606_P.json
+
+ProbeSetFreeze contains DataSet information, such as the name and full name of
+datasets, as well as whether they are public and how the data is
+scaled:
+
+```
+select * from ProbeSetFreeze;
+| Id  | ProbeFreezeId | AvgID | Name         | Name2                              | FullName                                      | ShortName                                     | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale |
+| 112 |            30 |     2 | HC_M2_0606_P | Hippocampus_M430_V2_BXD_PDNN_Jun06 | Hippocampus Consortium M430v2 (Jun06) PDNN    | Hippocampus M430v2 BXD 06/06 PDNN             | 2006-06-23 |      NULL |      2 |               0 | NULL            | log2      |
+```
+
+Another table contains a tissue reference and a back reference to the cross
+type:
+
+```
+select * from ProbeFreeze;
+| Id  | ProbeFreezeId | ChipId | TissueId | Name                                        | FullName | ShortName | CreateTime | InbredSetId |
+|  30 |            30 |      4 |        9 | Hippocampus Consortium M430v2 Probe (Jun06) |          |           | 2006-07-07 |           1 |
+```
+
+NOTE: these tables can probably be merged into one.
+
+```
+show indexes from ProbeSetFreeze;
++----------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+| Table          | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
++----------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+| ProbeSetFreeze |          0 | PRIMARY   |            1 | Id          | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
+| ProbeSetFreeze |          0 | FullName  |            1 | FullName    | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
+| ProbeSetFreeze |          0 | Name      |            1 | Name        | A         |           2 |     NULL | NULL   | YES  | BTREE      |         |               |
+| ProbeSetFreeze |          1 | NameIndex |            1 | Name2       | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
++----------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+```
+
+# ProbeSetSE
+
+```
+select * from ProbeSetSE limit 5;
++--------+----------+----------+
+| DataId | StrainId | error    |
++--------+----------+----------+
+|      1 |        1 | 0.681091 |
+|      1 |        2 | 0.361151 |
+|      1 |        3 | 0.364342 |
+|      1 |        4 | 0.827588 |
+|      1 |        5 | 0.303492 |
++--------+----------+----------+
+```
+
+# More information
+
+For the other tables, you may check the GN2/doc/database.org document (the starting point for this document).
+
+# Contributions regarding data upload to the GeneNetwork webserver
+* Ideas shared by the GeneNetwork team to facilitate the process of uploading data to production
+
+## Quality check and integrity of the data to be uploaded to gn2
+
+* A note to add (from Arthur): Some datasets have ProbeSet IDs like {chr_3020701, chr_3020851, etc}. This is not an acceptable way to name probeset IDs, so the data provider needs to understand what format GN2 expects for the ProbeSet IDs in their dataset
+* Also, for the annotation file, among other important columns, it is crucial that there are description, alias, and location columns, and the formatting should be exactly as found in public repositories such as NCBI, Ensembl, etc. For instance, for description: `X-linked Kx blood group related 4`, and aliases: `XRG4; Gm210; mKIAA1889` as in
+=> https://www.ncbi.nlm.nih.gov/gene/497097
+
+## Valid ProbeSetIDs
+
+* The official ProbeSet IDs would be the ones from the vendor. This would also identify the platform used to generate the data {Novogene-specific platform}; for instance, `NovaSeqPE150` for the MBD UTHSC mouse sequencing dataset
+* NB: if the vendor does not provide the official names as expected, we can use the platform plus the numbering order of the file to generate probeset IDs, for instance `NseqPE150_000001` to `NseqPE150_432694` for samples 1 to 432694 (see the sketch after this list)
+* Avoid IDs with meaning, e.g. =chr1_3020701= → Chromosome 1 at 3020701 base pairs. Prefer IDs with no meaning
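+
+A minimal sketch of generating such padded IDs with MariaDB's SEQUENCE engine (the prefix and range are taken from the example above):
+
+```
+select concat('NseqPE150_', lpad(seq, 6, '0')) as ProbeSetId
+from seq_1_to_432694;
+```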
+
+## The importance of having unique identifiers within a platform
+
+* Unique identifiers solve the hurdles that come with having duplicate genes. So, the QA tools in place should ensure the uploaded dataset adheres to the requirements mentioned
+* However, newer RNA-seq data sets generated by sequencing do not usually have an official vendor identifier. The identifier is usually based on the NCBI mRNA model (NM_XXXXXX) that was used to evaluate an expression and on the sequence that is involved, usually the start and stop nucleotide positions based on a specific genome assembly or just a suffix to make sure it is unique. In this case, you are looking at mRNA assays for a single transcript, but different parts of the transcript that have different genome coordinates. We now typically use ENSEMBL identifiers.
+* The mouse version of the sonic hedgehog gene as an example: `ENSMUST00000002708` or `ENSMUSG00000002633` sources should be fine. The important thing is to know the provenance of the ID—who is in charge of that ID type?
+* When a mRNA assay is super precise (one exon only or a part of the 5' UTR), then we should use exon identifiers from ENSEMBL probably.
+* Ideally, we should enter the sequence's first and last 100 nt in GeneNetwork for verification and alignment. We did this religiously for arrays, but have started to get lazy now. The sequence is the ultimate identifier
+* For methylation arrays and CpG assays, we can use this format `cg14050475` as seen in MBD UTHSC Ben's data
+* For metabolites like isoleucine, the ID we have been using is the mass-to-charge (MZ) ratio, such as `130.0874220_MZ`
+* For protein and peptide identifiers we have used the official Protein ID followed by an underscore character and then some or all of the sequence. This is then followed by another underscore and a number. Evan to confirm, but the suffix number is the charge state if I remember correctly
+```
+Q9JHJ3_LLHTADVCQLEVALVGASPR_3
+A2A8E1_TIVEFECR_2
+A2A8E1_ATLENVTNLRPVGEDFR_3
+A2A8E1_ENSIDILSSTIK_2
+```
+* But in older protein expression databases Evan and the team used a different method
+```
+abcb10_q9ji39_t311
+abcb10_q9ji39_t312
+```
+* The above is just the gene symbol, then the protein ID; we are not sure what t311 and t312 mean
+* Ideally, these IDs should be explained to some extent when they embed some information
+
+
+
+## BXD individuals
+
+* Basically groups (represented by the InbredSet tables) are primarily defined by their list of samples/strains (represented by the Strain tables). When we create a new group, it's because we have data with a distinct set of samples/strains from any existing groups.
+* So when we receive data for BXD individuals, as far as the database is concerned they are a completely separate group (since the list of samples is new/distinct from any other existing groups). We can choose to also enter it as part of the "generic" BXD group (by converting it to strain means/SEs using the strain of each individual, assuming it's provided like in the files Arthur was showing us).
+* This same logic could apply to other groups as well - we could choose to make one group the "strain mean" group for another set of groups that contain sample data for individuals. But the database doesn't reflect the relationship between these groups
+* As far as the database is concerned, there is no distinction between strain means and individual sample data - they're all rows in the ProbeSetData/PublishData tables. The only difference is that strain mean data will probably also have an SE value in the ProbeSetSE/PublishSE tables and/or an N (number of individuals per strain) value in the NStrain table
+* As for what this means for the uploader - I think it depends on whether Rob/Arthur/etc want to give users the ability to simultaneously upload both strain mean and individual data. For example, if someone uploads data for BXD individuals, do we want the uploader to both create a new group for this (or add to an existing BXD individuals group) and calculate the strain means/SE and enter those into the "main" BXD group? My personal feeling is that it's probably best to postpone that for later and only upload the data with the specific set of samples indicated in the file, since the alternative would add extra complexity to the uploading process that could always be added later (the user would need to select "the group the strains are from" as a separate option)
+* The relationship is sorta captured in the CaseAttribute and CaseAttributeXRefNew tables (which contain sample metadata), but only in the form of the metadata that is sometimes displayed as extra columns in the trait page table - this data isn't used in any queries/analyses currently (outside of some JS filters run on the table itself) and isn't that important as part of the uploading process (or at least can be postponed)
+
+## Individual datasets and derivative datasets in gn2
+* Individual dataset reflects the actual data provided or submitted by the investigator (user). Derivative datasets include the processed information from the individual dataset, as in the case of the average datasets.
+* An example of an individual dataset would look something like this (MBD dataset):
+```
+#+begin_example
+sample, strain, Sex, Age,…
+FEB0001,BXD48a,M,63,…
+FEB0002,BXD48a,M,15,…
+FEB0003,BXD48a,F,22,…
+FEB0004,BXD16,M,39,…
+FEB0005,BXD16,F,14,…
+⋮
+#+end_example
+```
+* The strain column above has repetitive values. Each value has a one-to-many relationship with values in the sample column. From this dataset, there can be several derivatives (a SQL sketch for the average/SE derivatives follows this list). For example:
+- Sex-based categories
+- Average data (3 sample values averaged to one strain value)
+- Standard error table computed for the averages
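+
+A minimal SQL sketch of the average/SE derivation, assuming the individual data has been loaded into a hypothetical staging table =individual_data= with =strain= and =value= columns:
+
+```
+select strain,
+       avg(value) as mean_value,
+       -- standard error = sample standard deviation / sqrt(n)
+       stddev_samp(value) / sqrt(count(*)) as se
+from individual_data
+group by strain;
+```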
+
+## Saving data to the database
+* Strain table schema
+```
+#+begin_src sql
+  MariaDB [db_webqtl]> DESC Strain;
+  +-----------+----------------------+------+-----+---------+----------------+
+  | Field     | Type                 | Null | Key | Default | Extra          |
+  +-----------+----------------------+------+-----+---------+----------------+
+  | Id        | int(20)              | NO   | PRI | NULL    | auto_increment |
+  | Name      | varchar(100)         | YES  | MUL | NULL    |                |
+  | Name2     | varchar(100)         | YES  |     | NULL    |                |
+  | SpeciesId | smallint(5) unsigned | NO   |     | 0       |                |
+  | Symbol    | varchar(20)          | YES  | MUL | NULL    |                |
+  | Alias     | varchar(255)         | YES  |     | NULL    |                |
+  +-----------+----------------------+------+-----+---------+----------------+
+  6 rows in set (0.00 sec)
+#+end_src
+```
+* For the *individual data*, the =sample= field would be saved as =Name= and the =strain= would be saved as =Name2=. These records would then all be linked to an inbredset group (population?) in the =InbredSet= table via the =StrainXRef= table, whose schema is as follows:
+```
+#+begin_src sql
+  MariaDB [db_webqtl]> DESC StrainXRef;
+  +------------------+----------------------+------+-----+---------+-------+
+  | Field            | Type                 | Null | Key | Default | Extra |
+  +------------------+----------------------+------+-----+---------+-------+
+  | InbredSetId      | smallint(5) unsigned | NO   | PRI | 0       |       |
+  | StrainId         | int(20)              | NO   | PRI | NULL    |       |
+  | OrderId          | int(20)              | YES  |     | NULL    |       |
+  | Used_for_mapping | char(1)              | YES  |     | N       |       |
+  | PedigreeStatus   | varchar(255)         | YES  |     | NULL    |       |
+  +------------------+----------------------+------+-----+---------+-------+
+  5 rows in set (0.00 sec)
+#+end_src
+```
+* Where the =InbredSetId= comes from the =InbredSet= table and the =StrainId= comes from the =Strain= table. The *individual data* would be linked to an inbredset group that is for individuals
+* For the *average data*, the only value to save would be the =strain= field, which would be saved as =Name= in the =Strain= table and linked to an InbredSet group that is for averages
+*Question 01*: How do we distinguish the inbredset groups?
+*Answer*: The =Family= field is useful for this.
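+
+For instance, a minimal sketch for inspecting how groups cluster by family (using the =InbredSet= schema shown further below):
+
+```
+select Id, Name, FullName, Family
+from InbredSet
+order by Family, Name;
+```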
+
+*Question 02*: If you have more derived "datasets", e.g. males-only, females-only, under-10-years, 10-to-25-years, etc. How would the =Strains= table handle all those differences?
+
+## Metadata
+* The data we looked at had =gene id= and =gene symbol= fields. These fields were used to fetch the *Ensembl ID* and *descriptions* from [[https://www.ncbi.nlm.nih.gov/][NCBI]] and the [[https://useast.ensembl.org/][Ensembl Genome Browser]]
+
+## Files for mapping
+* Files used for mapping need to be in =bimbam= or =.geno= formats. We would need to do conversions to at least one of these formats where necessary
+
+## Annotation files
+* Consider the following schema of DB tables
+```
+#+begin_src sql
+  MariaDB [db_webqtl]> DESC InbredSet;
+  +-----------------+----------------------+------+-----+---------+----------------+
+  | Field           | Type                 | Null | Key | Default | Extra          |
+  +-----------------+----------------------+------+-----+---------+----------------+
+  | Id              | smallint(5) unsigned | NO   | PRI | NULL    | auto_increment |
+  | InbredSetId     | int(5) unsigned      | NO   |     | NULL    |                |
+  | InbredSetName   | varchar(100)         | YES  |     | NULL    |                |
+  | Name            | char(30)             | NO   |     |         |                |
+  | SpeciesId       | smallint(5) unsigned | YES  |     | 1       |                |
+  | FullName        | varchar(100)         | YES  |     | NULL    |                |
+  | public          | tinyint(3) unsigned  | YES  |     | 2       |                |
+  | MappingMethodId | char(50)             | YES  |     | 1       |                |
+  | GeneticType     | varchar(255)         | YES  |     | NULL    |                |
+  | Family          | varchar(100)         | YES  |     | NULL    |                |
+  | FamilyOrder     | int(5)               | YES  |     | NULL    |                |
+  | MenuOrderId     | double               | NO   |     | NULL    |                |
+  | InbredSetCode   | varchar(5)           | YES  |     | NULL    |                |
+  | Description     | longtext             | YES  |     | NULL    |                |
+  +-----------------+----------------------+------+-----+---------+----------------+
+  ⋮
+  MariaDB [db_webqtl]> DESC Strain;
+  +-----------+----------------------+------+-----+---------+----------------+
+  | Field     | Type                 | Null | Key | Default | Extra          |
+  +-----------+----------------------+------+-----+---------+----------------+
+  | Id        | int(20)              | NO   | PRI | NULL    | auto_increment |
+  | Name      | varchar(100)         | YES  | MUL | NULL    |                |
+  | Name2     | varchar(100)         | YES  |     | NULL    |                |
+  | SpeciesId | smallint(5) unsigned | NO   |     | 0       |                |
+  | Symbol    | varchar(20)          | YES  | MUL | NULL    |                |
+  | Alias     | varchar(255)         | YES  |     | NULL    |                |
+  +-----------+----------------------+------+-----+---------+----------------+
+  ⋮
+  MariaDB [db_webqtl]> DESC StrainXRef;
+  +------------------+----------------------+------+-----+---------+-------+
+  | Field            | Type                 | Null | Key | Default | Extra |
+  +------------------+----------------------+------+-----+---------+-------+
+  | InbredSetId      | smallint(5) unsigned | NO   | PRI | 0       |       |
+  | StrainId         | int(20)              | NO   | PRI | NULL    |       |
+  | OrderId          | int(20)              | YES  |     | NULL    |       |
+  | Used_for_mapping | char(1)              | YES  |     | N       |       |
+  | PedigreeStatus   | varchar(255)         | YES  |     | NULL    |       |
+  +------------------+----------------------+------+-----+---------+-------+
+#+end_src
+```
+
+* The =StrainXRef= table creates a link between the Samples/cases/individuals (stored in the =Strain= table) to the group (population?) they belong to in the =InbredSet= table
+* Steps to prepare the TSV file for entering samples/cases into the database are as follows (a SQL sketch of these steps follows the list):
+- Clean up =Name= of the samples/cases/individuals in the file:
+  - Names should have no spaces
+  - Names should have the same number of characters: pad shorter ones, e.g. *SampleName12* → *SampleName012*, to fit in with other names if, say, the samples range from 1 to 999. In the same vein, you'd rename *SampleName1* to *SampleName001*
+- Order samples by the names
+- Create a new column, say, =orderId= in the TSV, and assign the order *1, 2, 3, …, n* for the rows, from the first to the "n^{th}" row. The order of the strains is very important and must be maintained
+- Retrieve the largest current =Id= value in the =Strain= table
+- Increment by one (1) and assign that to the first row of your ordered data
+  - Assign subsequent rows the subsequent ID values, e.g. assuming the largest =Id= value in the =Strain= table was *23*, the first row of the new data would have the id *24*, the second row *25*, the third *26*, and so on
+- Get the =InbredSetId= for your samples' data. Add a new column in the data and copy this value for all rows
+- Enter data into the =Strain= table
+- Using the previously computed strain ID values, and the =InbredSetId= previously copied, enter data into the =StrainXRef= table
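+
+A minimal SQL sketch of those steps (the sample names, the id *23*, and the InbredSetId *99* are illustrative):
+
+```
+-- retrieve the largest current Id in Strain; say it returns 23
+select max(Id) from Strain;
+
+-- insert the ordered samples with the next Ids
+insert into Strain (Id, Name, Name2, SpeciesId, Symbol)
+values (24, 'SampleName001', 'BXD48a', 1, 'BXD48a'),
+       (25, 'SampleName002', 'BXD48a', 1, 'BXD48a');
+
+-- cross-reference the samples to their InbredSet group
+insert into StrainXRef (InbredSetId, StrainId, OrderId, Used_for_mapping)
+values (99, 24, 1, 'Y'),
+       (99, 25, 2, 'Y');
+```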
+
+* Some notes on the data:
+- The =Symbol= field in the =Strain= table corresponds to the =Strain= field in the annotation file
+- The =used_for_mapping= field should be set to ~Y~ unless otherwise informed
+- The =PedigreeStatus= field is unknown to us for now: set to ~NULL~
+
+* Annotation file format
+The important fields are:
+- =ChipId=: The platform that the data was collected from/with
+Consider the following table:
+```
+#+begin_src sql
+    MariaDB [db_webqtl]> DESC GeneChip;
+    +---------------+----------------------+------+-----+---------+----------------+
+    | Field         | Type                 | Null | Key | Default | Extra          |
+    +---------------+----------------------+------+-----+---------+----------------+
+    | Id            | smallint(5) unsigned | NO   | PRI | NULL    | auto_increment |
+    | GeneChipId    | int(5)               | YES  |     | NULL    |                |
+    | GeneChipName  | varchar(200)         | YES  |     | NULL    |                |
+    | Name          | char(30)             | NO   |     |         |                |
+    | GeoPlatform   | char(15)             | YES  |     | NULL    |                |
+    | Title         | varchar(100)         | YES  |     | NULL    |                |
+    | SpeciesId     | int(5)               | YES  |     | 1       |                |
+    | GO_tree_value | varchar(50)          | YES  |     | NULL    |                |
+    +---------------+----------------------+------+-----+---------+----------------+
+  #+end_src
+```
+ Some of the important fields highlighted were:
+ - =GeoPlatform=: Links the details of the platform in our database with NCBI's [[https://www.ncbi.nlm.nih.gov/geo/][Gene Ontology Omnibus (GEO)]] system. This is not always possible, but where we can, it would be nice to have this field populated
+ - =GO_tree_value=:  This is supposed to link the detail we have with some external system "GO" (presumably Gene Ontology). I have not figured this one out on my own and will need to follow up on it.
+ - =Name=: The name corresponds to the =ProbeSetId=, and we want this to be the same value as the identifier on the [[https://www.ensembl.org][Ensembl genome browser]], e.g. For a gene, say =Shh=, for *mouse*, we want the =Name= value to be a variation on [[https://useast.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000002633;r=5:28661813-28672254;t=ENSMUST00000002708][*ENSMUSG00000002633*]]
+ - =Probe_set_Blat_Mb_start=/=Probe_set_Blat_Mb_end=: In Byron's and Beni's data, these correspond to the =geneStart= and =geneEnd= fields respectively. These are the positions, in megabasepairs, that the gene begins and ends at, respectively.
+ - =Mb=: This is the =geneStart=/=Probe_set_Blat_Mb_start= value divided by *1000000*; e.g. a =geneStart= of *74310961* base pairs gives an =Mb= of *74.310961*, as in the ProbeSet example above. (*Note to self*: the Probe_set_Blat_Mb_* fields above might not be in megabase pairs; please confirm)
+ - =Strand_Probe= and =Strand_Gene=: These fields' values are simply ~+~ or ~-~. If these values are missing, you can [[https://ftp.ncbi.nih.gov/gene/README][retrieve them from NCBI]], specifically from the =orientation= field of seemingly any text file with the field
+ - =Chr=: This is the chromosome on which the gene is found
+
+* The final annotation file will have (at minimum) the following fields (or their
+analogs):
+- =StrainName=
+- =OrderId=
+- =StrainId=: from the database
+- =InbredSetId=: from the database
+- =Symbol=: This could be named =Strain=
+- =GeneChipId=: from the database
+- =EnsemblId=: from the Ensembl genome browser
+- =Probe_set_Blat_Mb_start=: possible analogue is =geneStart=
+- =Probe_set_Blat_Mb_end=: possible analogue is =geneEnd=
+- =Mb=
+- =Strand_Probe=
+- =Strand_Gene=
+- =Chr=
+
+*  =.geno= Files
+- The =.geno= files have sample names, not the strain/symbol. The =Locus= field in the =.geno= file corresponds to the **marker**. =.geno= files are used with =QTLReaper=
+- The sample names in the ~.geno~ files *MUST* be in the same order as the
+strains/symbols for that species. For example, if the data format is as follows:
+```
+#+begin_example
+SampleName,Strain,…
+⋮
+BJCWI0001,BXD40,…
+BJCWI0002,BXD40,…
+BJCWI0003,BXD33,…
+BJCWI0004,BXD50,…
+BJCWI0005,BXD50,…
+⋮
+#+end_example
+```
+and the order of strains is as follows:
+```
+#+begin_example
+…,BXD33,…,BXD40,…,BXD50,…
+#+end_example
+```
+then the ~.geno~ file generated from this data should have a form such as shown
+below:
+```
+#+begin_example
+…,BJCWI0003,…,BJCWI0001,BJCWI0002,…,BJCWI0004,BJCWI0005,…
+#+end_example
+```
+The order of samples that belong to the same strain is irrelevant - they share the same data, i.e. the order below is also valid:
+```
+#+begin_example
+…,BJCWI0003,…,BJCWI0002,BJCWI0001,…,BJCWI0004,BJCWI0005,…
+#+end_example
+```
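+For reference, a minimal ~.geno~ sketch from memory (header directives and columns may differ per species; treat this as an illustration, not a specification):
+```
+#+begin_example
+@name:BXD
+@type:riset
+@mat:B
+@pat:D
+@het:H
+@unk:U
+Chr	Locus	cM	Mb	BJCWI0003	BJCWI0001	BJCWI0002
+1	rs31443144	1.50	3.010274	B	D	D
+#+end_example
+```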
+* =BimBam= Files
+- Used with =GEMMA=
+* Case Attributes
+- These are metadata about every case/sample/individual in an InbredSet group. The metadata is any data that has nothing to do with phenotypes (phenotypes being e.g. height, weight, etc.) and that is useful for researchers to have in order to make sense of the data.
+- Examples of case attributes:
+ - Treatment
+ - Sex (Really? Isn't sex an expression of genes?)
+ - batch
+ - Case ID, etc
+
+* Summary steps to load data to the database
+- [x] Create *InbredSet* group (think population)
+- [x] Load the strains/samples data
+- [x] Load the sample cross-reference data to link the samples to their
+ InbredSet group
+- Load the case-attributes data
+- [x] Load the annotation data (into the ProbeSet table)
+- [x] Create the study for the data (At around this point, the InbredSet group
+  will show up in the UI).
+- [x] Create the Dataset for the data
+- [x] Load the *Log2* data (ProbeSetData and ProbeSetXRef tables)
+- [x] Compute means (an SQL query was used — this could be pre-computed in code
+  and entered along with the data)
+- [x] Run QTLReaper
diff --git a/topics/database/setting-up-local-development-database.gmi b/topics/database/setting-up-local-development-database.gmi
index 3b743b9..9ebb48b 100644
--- a/topics/database/setting-up-local-development-database.gmi
+++ b/topics/database/setting-up-local-development-database.gmi
@@ -41,7 +41,12 @@ Setting up mariadb in a Guix container is the preferred and easier method. But,
 ```
 $ sudo $(./containers/db-container.sh)
 ```
-You should now be able to connect to the database using
+By default, MariaDB only allows passwordless root login from the local machine. So, enter the container using guix container exec and set the root password to a blank string.
+```
+$ mysql -u root
+MariaDB [(none)]> SET PASSWORD = PASSWORD("");
+```
+You should now be able to connect to the database from outside the container using
 ```
 $ mysql --protocol tcp -u root
 ```
diff --git a/topics/database/sql.svg b/topics/database/sql.svg
new file mode 100644
index 0000000..b7ab96e
--- /dev/null
+++ b/topics/database/sql.svg
@@ -0,0 +1,2558 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+ "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<!-- Generated by graphviz version 2.49.0 (20210828.1703)
+ -->
+<!-- Title: schema Pages: 1 -->
+<svg width="13704pt" height="5921pt"
+ viewBox="0.00 0.00 13703.50 5921.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
+<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 5917)">
+<title>schema</title>
+<polygon fill="white" stroke="transparent" points="-4,4 -4,-5917 13699.5,-5917 13699.5,4 -4,4"/>
+<!-- NStrain -->
+<g id="node1" class="node">
+<title>NStrain</title>
+<polygon fill="white" stroke="transparent" points="6648.5,-1918 6648.5,-2008 6775.5,-2008 6775.5,-1918 6648.5,-1918"/>
+<polygon fill="#df65b0" stroke="transparent" points="6652,-1984 6652,-2005 6773,-2005 6773,-1984 6652,-1984"/>
+<polygon fill="none" stroke="black" points="6652,-1984 6652,-2005 6773,-2005 6773,-1984 6652,-1984"/>
+<text text-anchor="start" x="6655" y="-1990.8" font-family="Times,serif" font-size="14.00">NStrain (9 MiB)</text>
+<text text-anchor="start" x="6692.5" y="-1968.8" font-family="Times,serif" font-size="14.00">count</text>
+<text text-anchor="start" x="6688" y="-1947.8" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="6683" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<polygon fill="none" stroke="black" points="6648.5,-1918 6648.5,-2008 6775.5,-2008 6775.5,-1918 6648.5,-1918"/>
+</g>
+<!-- Strain -->
+<g id="node40" class="node">
+<title>Strain</title>
+<polygon fill="lightgrey" stroke="transparent" points="5728.5,-765.5 5728.5,-918.5 5843.5,-918.5 5843.5,-765.5 5728.5,-765.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="5732,-894 5732,-915 5841,-915 5841,-894 5732,-894"/>
+<polygon fill="none" stroke="black" points="5732,-894 5732,-915 5841,-915 5841,-894 5732,-894"/>
+<text text-anchor="start" x="5735" y="-900.8" font-family="Times,serif" font-size="14.00">Strain (2 MiB)</text>
+<polygon fill="green" stroke="transparent" points="5732,-873 5732,-892 5841,-892 5841,-873 5732,-873"/>
+<text text-anchor="start" x="5769" y="-878.8" font-family="Times,serif" font-size="14.00">Alias</text>
+<polygon fill="green" stroke="transparent" points="5732,-852 5732,-871 5841,-871 5841,-852 5732,-852"/>
+<text text-anchor="start" x="5765" y="-857.8" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="green" stroke="transparent" points="5732,-831 5732,-850 5841,-850 5841,-831 5732,-831"/>
+<text text-anchor="start" x="5760.5" y="-836.8" font-family="Times,serif" font-size="14.00">Name2</text>
+<polygon fill="green" stroke="transparent" points="5732,-810 5732,-829 5841,-829 5841,-810 5732,-810"/>
+<text text-anchor="start" x="5759.5" y="-815.8" font-family="Times,serif" font-size="14.00">Symbol</text>
+<text text-anchor="start" x="5779" y="-794.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="5751.5" y="-773.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<polygon fill="none" stroke="black" points="5728.5,-765.5 5728.5,-918.5 5843.5,-918.5 5843.5,-765.5 5728.5,-765.5"/>
+</g>
+<!-- NStrain&#45;&gt;Strain -->
+<g id="edge1" class="edge">
+<title>NStrain:StrainId&#45;&gt;Strain</title>
+<path fill="none" stroke="black" d="M6651,-1930C6610.43,-1930 6653.88,-1233.5 6631,-1200 6450.66,-935.96 6033.45,-866.5 5861.83,-848.81"/>
+<polygon fill="black" stroke="black" points="5861.92,-845.3 5851.62,-847.79 5861.23,-852.27 5861.92,-845.3"/>
+</g>
+<!-- roles_users -->
+<g id="node2" class="node">
+<title>roles_users</title>
+<polygon fill="white" stroke="transparent" points="7071.5,-4853 7071.5,-4922 7204.5,-4922 7204.5,-4853 7071.5,-4853"/>
+<polygon fill="#f1eef6" stroke="transparent" points="7075,-4897.5 7075,-4918.5 7202,-4918.5 7202,-4897.5 7075,-4897.5"/>
+<polygon fill="none" stroke="black" points="7075,-4897.5 7075,-4918.5 7202,-4918.5 7202,-4897.5 7075,-4897.5"/>
+<text text-anchor="start" x="7078" y="-4904.3" font-family="Times,serif" font-size="14.00">roles_users (0 B)</text>
+<text text-anchor="start" x="7114" y="-4882.3" font-family="Times,serif" font-size="14.00">role_id</text>
+<text text-anchor="start" x="7112.5" y="-4861.3" font-family="Times,serif" font-size="14.00">user_id</text>
+<polygon fill="none" stroke="black" points="7071.5,-4853 7071.5,-4922 7204.5,-4922 7204.5,-4853 7071.5,-4853"/>
+</g>
+<!-- role -->
+<g id="node58" class="node">
+<title>role</title>
+<polygon fill="white" stroke="transparent" points="7093.5,-3249 7093.5,-3339 7184.5,-3339 7184.5,-3249 7093.5,-3249"/>
+<polygon fill="#f1eef6" stroke="transparent" points="7097,-3315 7097,-3336 7182,-3336 7182,-3315 7097,-3315"/>
+<polygon fill="none" stroke="black" points="7097,-3315 7097,-3336 7182,-3336 7182,-3315 7097,-3315"/>
+<text text-anchor="start" x="7106" y="-3321.8" font-family="Times,serif" font-size="14.00">role (0 B)</text>
+<text text-anchor="start" x="7099" y="-3299.8" font-family="Times,serif" font-size="14.00">description</text>
+<text text-anchor="start" x="7119.5" y="-3278.8" font-family="Times,serif" font-size="14.00">name</text>
+<text text-anchor="start" x="7117.5" y="-3257.8" font-family="Times,serif" font-size="14.00">the_id</text>
+<polygon fill="none" stroke="black" points="7093.5,-3249 7093.5,-3339 7184.5,-3339 7184.5,-3249 7093.5,-3249"/>
+</g>
+<!-- roles_users&#45;&gt;role -->
+<g id="edge2" class="edge">
+<title>roles_users:role_id&#45;&gt;role</title>
+<path fill="none" stroke="black" d="M7203,-4885.5C7242.13,-4885.5 7161.86,-3639.62 7142.89,-3353.21"/>
+<polygon fill="black" stroke="black" points="7146.37,-3352.78 7142.22,-3343.03 7139.39,-3353.24 7146.37,-3352.78"/>
+</g>
+<!-- User -->
+<g id="node60" class="node">
+<title>User</title>
+<polygon fill="white" stroke="transparent" points="7244,-3175.5 7244,-3412.5 7354,-3412.5 7354,-3175.5 7244,-3175.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="7247,-3388 7247,-3409 7351,-3409 7351,-3388 7247,-3388"/>
+<polygon fill="none" stroke="black" points="7247,-3388 7247,-3409 7351,-3409 7351,-3388 7247,-3388"/>
+<text text-anchor="start" x="7250" y="-3394.8" font-family="Times,serif" font-size="14.00">User (28 KiB)</text>
+<text text-anchor="start" x="7260" y="-3372.8" font-family="Times,serif" font-size="14.00">createtime</text>
+<text text-anchor="start" x="7273" y="-3351.8" font-family="Times,serif" font-size="14.00">disable</text>
+<text text-anchor="start" x="7279" y="-3330.8" font-family="Times,serif" font-size="14.00">email</text>
+<text text-anchor="start" x="7265.5" y="-3309.8" font-family="Times,serif" font-size="14.00">grpName</text>
+<text text-anchor="start" x="7292" y="-3288.8" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="7268" y="-3267.8" font-family="Times,serif" font-size="14.00">lastlogin</text>
+<text text-anchor="start" x="7279" y="-3246.8" font-family="Times,serif" font-size="14.00">name</text>
+<text text-anchor="start" x="7264.5" y="-3225.8" font-family="Times,serif" font-size="14.00">password</text>
+<text text-anchor="start" x="7267" y="-3204.8" font-family="Times,serif" font-size="14.00">privilege</text>
+<text text-anchor="start" x="7273" y="-3183.8" font-family="Times,serif" font-size="14.00">user_ip</text>
+<polygon fill="none" stroke="black" points="7244,-3175.5 7244,-3412.5 7354,-3412.5 7354,-3175.5 7244,-3175.5"/>
+</g>
+<!-- roles_users&#45;&gt;User -->
+<g id="edge3" class="edge">
+<title>roles_users:user_id&#45;&gt;User</title>
+<path fill="none" stroke="black" d="M7139,-4854.5C7139,-4323.12 7232.06,-3695.19 7276.24,-3427.05"/>
+<polygon fill="black" stroke="black" points="7279.74,-3427.32 7277.92,-3416.88 7272.83,-3426.18 7279.74,-3427.32"/>
+</g>
+<!-- SnpAllRat -->
+<g id="node3" class="node">
+<title>SnpAllRat</title>
+<polygon fill="white" stroke="transparent" points="2716,-702.5 2716,-981.5 2876,-981.5 2876,-702.5 2716,-702.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="2719,-957 2719,-978 2873,-978 2873,-957 2719,-957"/>
+<polygon fill="none" stroke="black" points="2719,-957 2719,-978 2873,-978 2873,-957 2719,-957"/>
+<text text-anchor="start" x="2722" y="-963.8" font-family="Times,serif" font-size="14.00">SnpAllRat (908 MiB)</text>
+<text text-anchor="start" x="2772" y="-941.8" font-family="Times,serif" font-size="14.00">Alleles</text>
+<text text-anchor="start" x="2749" y="-920.8" font-family="Times,serif" font-size="14.00">Chromosome</text>
+<text text-anchor="start" x="2728" y="-899.8" font-family="Times,serif" font-size="14.00">ConservationScore</text>
+<text text-anchor="start" x="2768.5" y="-878.8" font-family="Times,serif" font-size="14.00">Domain</text>
+<text text-anchor="start" x="2764" y="-857.8" font-family="Times,serif" font-size="14.00">Function</text>
+<text text-anchor="start" x="2777.5" y="-836.8" font-family="Times,serif" font-size="14.00">Gene</text>
+<text text-anchor="start" x="2788.5" y="-815.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="2767" y="-794.8" font-family="Times,serif" font-size="14.00">Position</text>
+<text text-anchor="start" x="2761" y="-773.8" font-family="Times,serif" font-size="14.00">SnpName</text>
+<text text-anchor="start" x="2771" y="-752.8" font-family="Times,serif" font-size="14.00">Source</text>
+<text text-anchor="start" x="2761" y="-731.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="2758.5" y="-710.8" font-family="Times,serif" font-size="14.00">Transcript</text>
+<polygon fill="none" stroke="black" points="2716,-702.5 2716,-981.5 2876,-981.5 2876,-702.5 2716,-702.5"/>
+</g>
+<!-- Species -->
+<g id="node33" class="node">
+<title>Species</title>
+<polygon fill="lightgrey" stroke="transparent" points="2734,-201 2734,-396 2858,-396 2858,-201 2734,-201"/>
+<polygon fill="#f1eef6" stroke="transparent" points="2737,-371.5 2737,-392.5 2855,-392.5 2855,-371.5 2737,-371.5"/>
+<polygon fill="none" stroke="black" points="2737,-371.5 2737,-392.5 2855,-392.5 2855,-371.5 2737,-371.5"/>
+<text text-anchor="start" x="2740" y="-378.3" font-family="Times,serif" font-size="14.00">Species (796 B)</text>
+<polygon fill="green" stroke="transparent" points="2737,-350.5 2737,-369.5 2855,-369.5 2855,-350.5 2737,-350.5"/>
+<text text-anchor="start" x="2761" y="-356.3" font-family="Times,serif" font-size="14.00">FullName</text>
+<polygon fill="green" stroke="transparent" points="2737,-329.5 2737,-348.5 2855,-348.5 2855,-329.5 2737,-329.5"/>
+<text text-anchor="start" x="2754.5" y="-335.3" font-family="Times,serif" font-size="14.00">MenuName</text>
+<polygon fill="green" stroke="transparent" points="2737,-308.5 2737,-327.5 2855,-327.5 2855,-308.5 2737,-308.5"/>
+<text text-anchor="start" x="2747.5" y="-314.3" font-family="Times,serif" font-size="14.00">SpeciesName</text>
+<text text-anchor="start" x="2788.5" y="-293.3" font-family="Times,serif" font-size="14.00">Id</text>
+<polygon fill="green" stroke="transparent" points="2737,-266.5 2737,-285.5 2855,-285.5 2855,-266.5 2737,-266.5"/>
+<text text-anchor="start" x="2774.5" y="-272.3" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="2767.5" y="-251.3" font-family="Times,serif" font-size="14.00">OrderId</text>
+<text text-anchor="start" x="2761" y="-230.3" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="2752.5" y="-209.3" font-family="Times,serif" font-size="14.00">TaxonomyId</text>
+<polygon fill="none" stroke="black" points="2734,-201 2734,-396 2858,-396 2858,-201 2734,-201"/>
+</g>
+<!-- SnpAllRat&#45;&gt;Species -->
+<g id="edge4" class="edge">
+<title>SnpAllRat:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M2874,-735C2906.96,-735 2860.65,-539.2 2826.56,-410.18"/>
+<polygon fill="black" stroke="black" points="2829.87,-409 2823.92,-400.23 2823.1,-410.8 2829.87,-409"/>
+</g>
+<!-- SampleXRef -->
+<g id="node4" class="node">
+<title>SampleXRef</title>
+<polygon fill="white" stroke="transparent" points="3272,-3259.5 3272,-3328.5 3426,-3328.5 3426,-3259.5 3272,-3259.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="3275,-3304 3275,-3325 3423,-3325 3423,-3304 3275,-3304"/>
+<polygon fill="none" stroke="black" points="3275,-3304 3275,-3325 3423,-3325 3423,-3304 3275,-3304"/>
+<text text-anchor="start" x="3278" y="-3310.8" font-family="Times,serif" font-size="14.00">SampleXRef (4 KiB)</text>
+<text text-anchor="start" x="3296" y="-3288.8" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text>
+<text text-anchor="start" x="3315" y="-3267.8" font-family="Times,serif" font-size="14.00">SampleId</text>
+<polygon fill="none" stroke="black" points="3272,-3259.5 3272,-3328.5 3426,-3328.5 3426,-3259.5 3272,-3259.5"/>
+</g>
+<!-- ProbeFreeze -->
+<g id="node42" class="node">
+<title>ProbeFreeze</title>
+<polygon fill="white" stroke="transparent" points="2611,-1855 2611,-2071 2777,-2071 2777,-1855 2611,-1855"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="2614,-2047 2614,-2068 2774,-2068 2774,-2047 2614,-2047"/>
+<polygon fill="none" stroke="black" points="2614,-2047 2614,-2068 2774,-2068 2774,-2047 2614,-2047"/>
+<text text-anchor="start" x="2617" y="-2053.8" font-family="Times,serif" font-size="14.00">ProbeFreeze (30 KiB)</text>
+<text text-anchor="start" x="2670" y="-2031.8" font-family="Times,serif" font-size="14.00">ChipId</text>
+<text text-anchor="start" x="2652" y="-2010.8" font-family="Times,serif" font-size="14.00">CreateTime</text>
+<text text-anchor="start" x="2659" y="-1989.8" font-family="Times,serif" font-size="14.00">FullName</text>
+<text text-anchor="start" x="2686.5" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="2651" y="-1947.8" font-family="Times,serif" font-size="14.00">InbredSetId</text>
+<text text-anchor="start" x="2672.5" y="-1926.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="2641" y="-1905.8" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text>
+<text text-anchor="start" x="2653" y="-1884.8" font-family="Times,serif" font-size="14.00">ShortName</text>
+<text text-anchor="start" x="2663.5" y="-1863.8" font-family="Times,serif" font-size="14.00">TissueId</text>
+<polygon fill="none" stroke="black" points="2611,-1855 2611,-2071 2777,-2071 2777,-1855 2611,-1855"/>
+</g>
+<!-- SampleXRef&#45;&gt;ProbeFreeze -->
+<g id="edge5" class="edge">
+<title>SampleXRef:ProbeFreezeId&#45;&gt;ProbeFreeze</title>
+<path fill="none" stroke="black" d="M3274,-3292C3032.87,-3292 3338.17,-2922.26 3158,-2762 3097.26,-2707.98 2852.39,-2782.55 2794,-2726 2622.74,-2560.12 2641.84,-2254.55 2669,-2085.12"/>
+<polygon fill="black" stroke="black" points="2672.47,-2085.6 2670.63,-2075.16 2665.56,-2084.47 2672.47,-2085.6"/>
+</g>
+<!-- Sample -->
+<g id="node95" class="node">
+<title>Sample</title>
+<polygon fill="white" stroke="transparent" points="3653.5,-1792 3653.5,-2134 3782.5,-2134 3782.5,-1792 3653.5,-1792"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="3657,-2110 3657,-2131 3780,-2131 3780,-2110 3657,-2110"/>
+<polygon fill="none" stroke="black" points="3657,-2110 3657,-2131 3780,-2131 3780,-2110 3657,-2110"/>
+<text text-anchor="start" x="3660" y="-2116.8" font-family="Times,serif" font-size="14.00">Sample (53 KiB)</text>
+<text text-anchor="start" x="3704.5" y="-2094.8" font-family="Times,serif" font-size="14.00">Age</text>
+<text text-anchor="start" x="3688" y="-2073.8" font-family="Times,serif" font-size="14.00">CELURL</text>
+<text text-anchor="start" x="3686.5" y="-2052.8" font-family="Times,serif" font-size="14.00">CHPURL</text>
+<text text-anchor="start" x="3676.5" y="-2031.8" font-family="Times,serif" font-size="14.00">CreateTime</text>
+<text text-anchor="start" x="3688" y="-2010.8" font-family="Times,serif" font-size="14.00">DATURL</text>
+<text text-anchor="start" x="3688" y="-1989.8" font-family="Times,serif" font-size="14.00">EXPURL</text>
+<text text-anchor="start" x="3687" y="-1968.8" font-family="Times,serif" font-size="14.00">FromSrc</text>
+<text text-anchor="start" x="3711" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="3680.5" y="-1926.8" font-family="Times,serif" font-size="14.00">ImageURL</text>
+<text text-anchor="start" x="3697" y="-1905.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="3688" y="-1884.8" font-family="Times,serif" font-size="14.00">RPTURL</text>
+<text text-anchor="start" x="3705" y="-1863.8" font-family="Times,serif" font-size="14.00">Sex</text>
+<text text-anchor="start" x="3689" y="-1842.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<text text-anchor="start" x="3678" y="-1821.8" font-family="Times,serif" font-size="14.00">TissueType</text>
+<text text-anchor="start" x="3688.5" y="-1800.8" font-family="Times,serif" font-size="14.00">TXTURL</text>
+<polygon fill="none" stroke="black" points="3653.5,-1792 3653.5,-2134 3782.5,-2134 3782.5,-1792 3653.5,-1792"/>
+</g>
+<!-- SampleXRef&#45;&gt;Sample -->
+<g id="edge6" class="edge">
+<title>SampleXRef:SampleId&#45;&gt;Sample</title>
+<path fill="none" stroke="black" d="M3424,-3271C3878.8,-3271 3810.34,-2508.42 3752.65,-2148.25"/>
+<polygon fill="black" stroke="black" points="3756.08,-2147.55 3751.03,-2138.24 3749.17,-2148.67 3756.08,-2147.55"/>
+</g>
+<!-- GeneIDXRef -->
+<g id="node5" class="node">
+<title>GeneIDXRef</title>
+<polygon fill="white" stroke="transparent" points="7441,-4842.5 7441,-4932.5 7613,-4932.5 7613,-4842.5 7441,-4842.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="7444,-4908.5 7444,-4929.5 7610,-4929.5 7610,-4908.5 7444,-4908.5"/>
+<polygon fill="none" stroke="black" points="7444,-4908.5 7444,-4929.5 7610,-4929.5 7610,-4908.5 7444,-4908.5"/>
+<text text-anchor="start" x="7447" y="-4915.3" font-family="Times,serif" font-size="14.00">GeneIDXRef (220 KiB)</text>
+<text text-anchor="start" x="7502.5" y="-4893.3" font-family="Times,serif" font-size="14.00">human</text>
+<text text-anchor="start" x="7503.5" y="-4872.3" font-family="Times,serif" font-size="14.00">mouse</text>
+<text text-anchor="start" x="7516" y="-4851.3" font-family="Times,serif" font-size="14.00">rat</text>
+<polygon fill="none" stroke="black" points="7441,-4842.5 7441,-4932.5 7613,-4932.5 7613,-4842.5 7441,-4842.5"/>
+</g>
+<!-- MachineAccessLog -->
+<g id="node6" class="node">
+<title>MachineAccessLog</title>
+<polygon fill="white" stroke="transparent" points="7647,-4811 7647,-4964 7861,-4964 7861,-4811 7647,-4811"/>
+<polygon fill="#df65b0" stroke="transparent" points="7650,-4939.5 7650,-4960.5 7858,-4960.5 7858,-4939.5 7650,-4939.5"/>
+<polygon fill="none" stroke="black" points="7650,-4939.5 7650,-4960.5 7858,-4960.5 7858,-4939.5 7650,-4939.5"/>
+<text text-anchor="start" x="7653" y="-4946.3" font-family="Times,serif" font-size="14.00">MachineAccessLog (23 MiB)</text>
+<text text-anchor="start" x="7714.5" y="-4924.3" font-family="Times,serif" font-size="14.00">accesstime</text>
+<text text-anchor="start" x="7732" y="-4903.3" font-family="Times,serif" font-size="14.00">action</text>
+<text text-anchor="start" x="7728" y="-4882.3" font-family="Times,serif" font-size="14.00">data_id</text>
+<text text-anchor="start" x="7734.5" y="-4861.3" font-family="Times,serif" font-size="14.00">db_id</text>
+<text text-anchor="start" x="7747" y="-4840.3" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="7715.5" y="-4819.3" font-family="Times,serif" font-size="14.00">ip_address</text>
+<polygon fill="none" stroke="black" points="7647,-4811 7647,-4964 7861,-4964 7861,-4811 7647,-4811"/>
+</g>
+<!-- metadata_audit -->
+<g id="node7" class="node">
+<title>metadata_audit</title>
+<polygon fill="white" stroke="transparent" points="292.5,-1897 292.5,-2029 479.5,-2029 479.5,-1897 292.5,-1897"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="296,-2005 296,-2026 477,-2026 477,-2005 296,-2005"/>
+<polygon fill="none" stroke="black" points="296,-2005 296,-2026 477,-2026 477,-2005 296,-2005"/>
+<text text-anchor="start" x="299" y="-2011.8" font-family="Times,serif" font-size="14.00">metadata_audit (16 KiB)</text>
+<text text-anchor="start" x="349.5" y="-1989.8" font-family="Times,serif" font-size="14.00">dataset_id</text>
+<text text-anchor="start" x="365" y="-1968.8" font-family="Times,serif" font-size="14.00">editor</text>
+<text text-anchor="start" x="379.5" y="-1947.8" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="337.5" y="-1926.8" font-family="Times,serif" font-size="14.00">json_diff_data</text>
+<text text-anchor="start" x="344.5" y="-1905.8" font-family="Times,serif" font-size="14.00">time_stamp</text>
+<polygon fill="none" stroke="black" points="292.5,-1897 292.5,-2029 479.5,-2029 479.5,-1897 292.5,-1897"/>
+</g>
+<!-- Datasets -->
+<g id="node16" class="node">
+<title>Datasets</title>
+<polygon fill="lightgrey" stroke="transparent" points="305,-660.5 305,-1023.5 469,-1023.5 469,-660.5 305,-660.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="308,-999 308,-1020 466,-1020 466,-999 308,-999"/>
+<polygon fill="none" stroke="black" points="308,-999 308,-1020 466,-1020 466,-999 308,-999"/>
+<text text-anchor="start" x="326.5" y="-1005.8" font-family="Times,serif" font-size="14.00">Datasets (4 MiB)</text>
+<polygon fill="green" stroke="transparent" points="308,-978 308,-997 466,-997 466,-978 308,-978"/>
+<text text-anchor="start" x="344.5" y="-983.8" font-family="Times,serif" font-size="14.00">AboutCases</text>
+<polygon fill="green" stroke="transparent" points="308,-957 308,-976 466,-976 466,-957 308,-957"/>
+<text text-anchor="start" x="310" y="-962.8" font-family="Times,serif" font-size="14.00">AboutDataProcessing</text>
+<polygon fill="green" stroke="transparent" points="308,-936 308,-955 466,-955 466,-936 308,-936"/>
+<text text-anchor="start" x="334.5" y="-941.8" font-family="Times,serif" font-size="14.00">AboutPlatform</text>
+<polygon fill="green" stroke="transparent" points="308,-915 308,-934 466,-934 466,-915 308,-915"/>
+<text text-anchor="start" x="343" y="-920.8" font-family="Times,serif" font-size="14.00">AboutTissue</text>
+<polygon fill="green" stroke="transparent" points="308,-894 308,-913 466,-913 466,-894 308,-894"/>
+<text text-anchor="start" x="325.5" y="-899.8" font-family="Times,serif" font-size="14.00">Acknowledgment</text>
+<polygon fill="green" stroke="transparent" points="308,-873 308,-892 466,-892 466,-873 308,-873"/>
+<text text-anchor="start" x="358" y="-878.8" font-family="Times,serif" font-size="14.00">Citation</text>
+<polygon fill="green" stroke="transparent" points="308,-852 308,-871 466,-871 466,-852 308,-852"/>
+<text text-anchor="start" x="341" y="-857.8" font-family="Times,serif" font-size="14.00">Contributors</text>
+<text text-anchor="start" x="352" y="-836.8" font-family="Times,serif" font-size="14.00">DatasetId</text>
+<polygon fill="green" stroke="transparent" points="308,-810 308,-829 466,-829 466,-810 308,-810"/>
+<text text-anchor="start" x="338" y="-815.8" font-family="Times,serif" font-size="14.00">DatasetName</text>
+<text text-anchor="start" x="328.5" y="-794.8" font-family="Times,serif" font-size="14.00">DatasetStatusId</text>
+<polygon fill="green" stroke="transparent" points="308,-768 308,-787 466,-787 466,-768 308,-768"/>
+<text text-anchor="start" x="320" y="-773.8" font-family="Times,serif" font-size="14.00">ExperimentDesign</text>
+<polygon fill="green" stroke="transparent" points="308,-747 308,-766 466,-766 466,-747 308,-747"/>
+<text text-anchor="start" x="350.5" y="-752.8" font-family="Times,serif" font-size="14.00">GeoSeries</text>
+<text text-anchor="start" x="336" y="-731.8" font-family="Times,serif" font-size="14.00">InvestigatorId</text>
+<polygon fill="green" stroke="transparent" points="308,-705 308,-724 466,-724 466,-705 308,-705"/>
+<text text-anchor="start" x="365.5" y="-710.8" font-family="Times,serif" font-size="14.00">Notes</text>
+<text text-anchor="start" x="330.5" y="-689.8" font-family="Times,serif" font-size="14.00">PublicationTitle</text>
+<polygon fill="green" stroke="transparent" points="308,-663 308,-682 466,-682 466,-663 308,-663"/>
+<text text-anchor="start" x="352" y="-668.8" font-family="Times,serif" font-size="14.00">Summary</text>
+<polygon fill="none" stroke="black" points="305,-660.5 305,-1023.5 469,-1023.5 469,-660.5 305,-660.5"/>
+</g>
+<!-- metadata_audit&#45;&gt;Datasets -->
+<g id="edge7" class="edge">
+<title>metadata_audit:dataset_id&#45;&gt;Datasets</title>
+<path fill="none" stroke="black" d="M478,-1994C525.38,-1994 453.11,-1365.95 412.1,-1037.71"/>
+<polygon fill="black" stroke="black" points="415.55,-1037.1 410.84,-1027.61 408.61,-1037.97 415.55,-1037.1"/>
+</g>
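+<!-- Illustrative query for the metadata_audit to Datasets edge: list edit events per
+     dataset. Assumes metadata_audit.dataset_id references Datasets.DatasetId as drawn.
+       SELECT d.DatasetName, m.editor, m.time_stamp
+       FROM metadata_audit m
+       JOIN Datasets d ON d.DatasetId = m.dataset_id;
+-->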
+<!-- GenoXRef -->
+<g id="node8" class="node">
+<title>GenoXRef</title>
+<polygon fill="white" stroke="transparent" points="4464,-3228 4464,-3360 4614,-3360 4614,-3228 4464,-3228"/>
+<polygon fill="#df65b0" stroke="transparent" points="4467,-3336 4467,-3357 4611,-3357 4611,-3336 4467,-3336"/>
+<polygon fill="none" stroke="black" points="4467,-3336 4467,-3357 4611,-3357 4611,-3336 4467,-3336"/>
+<text text-anchor="start" x="4470" y="-3342.8" font-family="Times,serif" font-size="14.00">GenoXRef (14 MiB)</text>
+<text text-anchor="start" x="4528" y="-3320.8" font-family="Times,serif" font-size="14.00">cM</text>
+<text text-anchor="start" x="4514.5" y="-3299.8" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="4489" y="-3278.8" font-family="Times,serif" font-size="14.00">GenoFreezeId</text>
+<text text-anchor="start" x="4513" y="-3257.8" font-family="Times,serif" font-size="14.00">GenoId</text>
+<text text-anchor="start" x="4472.5" y="-3236.8" font-family="Times,serif" font-size="14.00">Used_for_mapping</text>
+<polygon fill="none" stroke="black" points="4464,-3228 4464,-3360 4614,-3360 4614,-3228 4464,-3228"/>
+</g>
+<!-- Geno -->
+<g id="node46" class="node">
+<title>Geno</title>
+<polygon fill="white" stroke="transparent" points="4245,-671 4245,-1013 4383,-1013 4383,-671 4245,-671"/>
+<polygon fill="#df65b0" stroke="transparent" points="4248,-989 4248,-1010 4380,-1010 4380,-989 4248,-989"/>
+<polygon fill="none" stroke="black" points="4248,-989 4248,-1010 4380,-1010 4380,-989 4248,-989"/>
+<text text-anchor="start" x="4262" y="-995.8" font-family="Times,serif" font-size="14.00">Geno (39 MiB)</text>
+<text text-anchor="start" x="4300.5" y="-973.8" font-family="Times,serif" font-size="14.00">Chr</text>
+<text text-anchor="start" x="4279" y="-952.8" font-family="Times,serif" font-size="14.00">Chr_mm8</text>
+<text text-anchor="start" x="4283" y="-931.8" font-family="Times,serif" font-size="14.00">chr_num</text>
+<text text-anchor="start" x="4275.5" y="-910.8" font-family="Times,serif" font-size="14.00">Comments</text>
+<text text-anchor="start" x="4306.5" y="-889.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="4263" y="-868.8" font-family="Times,serif" font-size="14.00">Marker_Name</text>
+<text text-anchor="start" x="4302" y="-847.8" font-family="Times,serif" font-size="14.00">Mb</text>
+<text text-anchor="start" x="4280.5" y="-826.8" font-family="Times,serif" font-size="14.00">Mb_2016</text>
+<text text-anchor="start" x="4280.5" y="-805.8" font-family="Times,serif" font-size="14.00">Mb_mm8</text>
+<text text-anchor="start" x="4292.5" y="-784.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="4279" y="-763.8" font-family="Times,serif" font-size="14.00">Sequence</text>
+<text text-anchor="start" x="4289" y="-742.8" font-family="Times,serif" font-size="14.00">Source</text>
+<text text-anchor="start" x="4284.5" y="-721.8" font-family="Times,serif" font-size="14.00">Source2</text>
+<text text-anchor="start" x="4279" y="-700.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="4250" y="-679.8" font-family="Times,serif" font-size="14.00">used_by_geno_file</text>
+<polygon fill="none" stroke="black" points="4245,-671 4245,-1013 4383,-1013 4383,-671 4245,-671"/>
+</g>
+<!-- GenoXRef&#45;&gt;Geno -->
+<g id="edge9" class="edge">
+<title>GenoXRef:GenoId&#45;&gt;Geno</title>
+<path fill="none" stroke="black" d="M4612,-3261C4626.31,-3261 4580.57,-1213.56 4576,-1200 4540.22,-1093.91 4460.35,-992.99 4398.15,-925.69"/>
+<polygon fill="black" stroke="black" points="4400.41,-922.99 4391.03,-918.06 4395.29,-927.76 4400.41,-922.99"/>
+</g>
+<!-- GenoFreeze -->
+<g id="node82" class="node">
+<title>GenoFreeze</title>
+<polygon fill="white" stroke="transparent" points="4407,-1855 4407,-2071 4559,-2071 4559,-1855 4407,-1855"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="4410,-2047 4410,-2068 4556,-2068 4556,-2047 4410,-2047"/>
+<polygon fill="none" stroke="black" points="4410,-2047 4410,-2068 4556,-2068 4556,-2047 4410,-2047"/>
+<text text-anchor="start" x="4413" y="-2053.8" font-family="Times,serif" font-size="14.00">GenoFreeze (2 KiB)</text>
+<text text-anchor="start" x="4422.5" y="-2031.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text>
+<text text-anchor="start" x="4431.5" y="-2010.8" font-family="Times,serif" font-size="14.00">confidentiality</text>
+<text text-anchor="start" x="4441" y="-1989.8" font-family="Times,serif" font-size="14.00">CreateTime</text>
+<text text-anchor="start" x="4448" y="-1968.8" font-family="Times,serif" font-size="14.00">FullName</text>
+<text text-anchor="start" x="4475.5" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="4440" y="-1926.8" font-family="Times,serif" font-size="14.00">InbredSetId</text>
+<text text-anchor="start" x="4461.5" y="-1905.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="4461" y="-1884.8" font-family="Times,serif" font-size="14.00">public</text>
+<text text-anchor="start" x="4442" y="-1863.8" font-family="Times,serif" font-size="14.00">ShortName</text>
+<polygon fill="none" stroke="black" points="4407,-1855 4407,-2071 4559,-2071 4559,-1855 4407,-1855"/>
+</g>
+<!-- GenoXRef&#45;&gt;GenoFreeze -->
+<g id="edge8" class="edge">
+<title>GenoXRef:GenoFreezeId&#45;&gt;GenoFreeze</title>
+<path fill="none" stroke="black" d="M4466,-3282C4346.95,-3282 4432.68,-2411.13 4468.93,-2085.19"/>
+<polygon fill="black" stroke="black" points="4472.41,-2085.56 4470.04,-2075.24 4465.45,-2084.79 4472.41,-2085.56"/>
+</g>
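+<!-- The two GenoXRef edges above form the usual XRef pattern: GenoXRef ties a marker
+     (GenoId) to a genotype dataset (GenoFreezeId). A sketch, assuming GenoId and
+     GenoFreezeId reference Geno.Id and GenoFreeze.Id respectively:
+       SELECT gf.Name AS dataset, g.Marker_Name, g.Chr, g.Mb, x.cM
+       FROM GenoXRef x
+       JOIN Geno g ON g.Id = x.GenoId
+       JOIN GenoFreeze gf ON gf.Id = x.GenoFreezeId;
+-->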
+<!-- TissueProbeSetXRef -->
+<g id="node9" class="node">
+<title>TissueProbeSetXRef</title>
+<polygon fill="white" stroke="transparent" points="6347,-4748 6347,-5027 6563,-5027 6563,-4748 6347,-4748"/>
+<polygon fill="#df65b0" stroke="transparent" points="6350,-5002.5 6350,-5023.5 6560,-5023.5 6560,-5002.5 6350,-5002.5"/>
+<polygon fill="none" stroke="black" points="6350,-5002.5 6350,-5023.5 6560,-5023.5 6560,-5002.5 6350,-5002.5"/>
+<text text-anchor="start" x="6353" y="-5009.3" font-family="Times,serif" font-size="14.00">TissueProbeSetXRef (9 MiB)</text>
+<text text-anchor="start" x="6441.5" y="-4987.3" font-family="Times,serif" font-size="14.00">Chr</text>
+<text text-anchor="start" x="6430.5" y="-4966.3" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="6414.5" y="-4945.3" font-family="Times,serif" font-size="14.00">description</text>
+<text text-anchor="start" x="6429" y="-4924.3" font-family="Times,serif" font-size="14.00">GeneId</text>
+<text text-anchor="start" x="6443" y="-4903.3" font-family="Times,serif" font-size="14.00">Mb</text>
+<text text-anchor="start" x="6421.5" y="-4882.3" font-family="Times,serif" font-size="14.00">Mb_2016</text>
+<text text-anchor="start" x="6435" y="-4861.3" font-family="Times,serif" font-size="14.00">Mean</text>
+<text text-anchor="start" x="6362.5" y="-4840.3" font-family="Times,serif" font-size="14.00">Probe_Target_Description</text>
+<text text-anchor="start" x="6415.5" y="-4819.3" font-family="Times,serif" font-size="14.00">ProbesetId</text>
+<text text-anchor="start" x="6428" y="-4798.3" font-family="Times,serif" font-size="14.00">Symbol</text>
+<text text-anchor="start" x="6367.5" y="-4777.3" font-family="Times,serif" font-size="14.00">TissueProbeSetFreezeId</text>
+<text text-anchor="start" x="6419" y="-4756.3" font-family="Times,serif" font-size="14.00">useStatus</text>
+<polygon fill="none" stroke="black" points="6347,-4748 6347,-5027 6563,-5027 6563,-4748 6347,-4748"/>
+</g>
+<!-- TissueProbeSetFreeze -->
+<g id="node23" class="node">
+<title>TissueProbeSetFreeze</title>
+<polygon fill="white" stroke="transparent" points="4747,-3165 4747,-3423 4977,-3423 4977,-3165 4747,-3165"/>
+<polygon fill="#f1eef6" stroke="transparent" points="4750,-3399 4750,-3420 4974,-3420 4974,-3399 4750,-3399"/>
+<polygon fill="none" stroke="black" points="4750,-3399 4750,-3420 4974,-3420 4974,-3399 4750,-3399"/>
+<text text-anchor="start" x="4753" y="-3405.8" font-family="Times,serif" font-size="14.00">TissueProbeSetFreeze (228 B)</text>
+<text text-anchor="start" x="4801.5" y="-3383.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text>
+<text text-anchor="start" x="4840" y="-3362.8" font-family="Times,serif" font-size="14.00">AvgID</text>
+<text text-anchor="start" x="4810.5" y="-3341.8" font-family="Times,serif" font-size="14.00">confidentiality</text>
+<text text-anchor="start" x="4820" y="-3320.8" font-family="Times,serif" font-size="14.00">CreateTime</text>
+<text text-anchor="start" x="4827" y="-3299.8" font-family="Times,serif" font-size="14.00">FullName</text>
+<text text-anchor="start" x="4854.5" y="-3278.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="4840.5" y="-3257.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="4836" y="-3236.8" font-family="Times,serif" font-size="14.00">Name2</text>
+<text text-anchor="start" x="4840" y="-3215.8" font-family="Times,serif" font-size="14.00">public</text>
+<text text-anchor="start" x="4821" y="-3194.8" font-family="Times,serif" font-size="14.00">ShortName</text>
+<text text-anchor="start" x="4786.5" y="-3173.8" font-family="Times,serif" font-size="14.00">TissueProbeFreezeId</text>
+<polygon fill="none" stroke="black" points="4747,-3165 4747,-3423 4977,-3423 4977,-3165 4747,-3165"/>
+</g>
+<!-- TissueProbeSetXRef&#45;&gt;TissueProbeSetFreeze -->
+<g id="edge11" class="edge">
+<title>TissueProbeSetXRef:TissueProbeSetFreezeId&#45;&gt;TissueProbeSetFreeze</title>
+<path fill="none" stroke="black" d="M6349,-4780.5C5901.77,-4780.5 6243.92,-4188.23 5938,-3862 5667.77,-3573.83 5217.81,-3404.02 4995.17,-3333.49"/>
+<polygon fill="black" stroke="black" points="4995.98,-3330.08 4985.39,-3330.41 4993.88,-3336.75 4995.98,-3330.08"/>
+</g>
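+<!-- Sketch for the edge above, assuming TissueProbeSetFreezeId references
+     TissueProbeSetFreeze.Id:
+       SELECT f.FullName, x.Symbol, x.Mean
+       FROM TissueProbeSetXRef x
+       JOIN TissueProbeSetFreeze f ON f.Id = x.TissueProbeSetFreezeId;
+-->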
+<!-- ProbeSE -->
+<g id="node78" class="node">
+<title>ProbeSE</title>
+<polygon fill="white" stroke="transparent" points="6992,-1918 6992,-2008 7122,-2008 7122,-1918 6992,-1918"/>
+<polygon fill="#ce1256" stroke="transparent" points="6995,-1984 6995,-2005 7119,-2005 7119,-1984 6995,-1984"/>
+<polygon fill="none" stroke="black" points="6995,-1984 6995,-2005 7119,-2005 7119,-1984 6995,-1984"/>
+<text text-anchor="start" x="6998" y="-1990.8" font-family="Times,serif" font-size="14.00">ProbeSE (3 GiB)</text>
+<text text-anchor="start" x="7032.5" y="-1968.8" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="7038.5" y="-1947.8" font-family="Times,serif" font-size="14.00">error</text>
+<text text-anchor="start" x="7027.5" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<polygon fill="none" stroke="black" points="6992,-1918 6992,-2008 7122,-2008 7122,-1918 6992,-1918"/>
+</g>
+<!-- TissueProbeSetXRef&#45;&gt;ProbeSE -->
+<g id="edge10" class="edge">
+<title>TissueProbeSetXRef:ProbesetId&#45;&gt;ProbeSE</title>
+<path fill="none" stroke="black" d="M6561,-4822.5C6998.45,-4822.5 6458.97,-4163.43 6776,-3862 6844.63,-3796.75 6923.59,-3897.22 6986,-3826 7107.35,-3687.52 7069.01,-2322.6 7059.04,-2022.25"/>
+<polygon fill="black" stroke="black" points="7062.53,-2021.9 7058.7,-2012.02 7055.54,-2022.13 7062.53,-2021.9"/>
+</g>
+<!-- Homologene -->
+<g id="node10" class="node">
+<title>Homologene</title>
+<polygon fill="white" stroke="transparent" points="7895,-4842.5 7895,-4932.5 8055,-4932.5 8055,-4842.5 7895,-4842.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="7898,-4908.5 7898,-4929.5 8052,-4929.5 8052,-4908.5 7898,-4908.5"/>
+<polygon fill="none" stroke="black" points="7898,-4908.5 7898,-4929.5 8052,-4929.5 8052,-4908.5 7898,-4908.5"/>
+<text text-anchor="start" x="7901" y="-4915.3" font-family="Times,serif" font-size="14.00">Homologene (3 MiB)</text>
+<text text-anchor="start" x="7949" y="-4893.3" font-family="Times,serif" font-size="14.00">GeneId</text>
+<text text-anchor="start" x="7923" y="-4872.3" font-family="Times,serif" font-size="14.00">HomologeneId</text>
+<text text-anchor="start" x="7931.5" y="-4851.3" font-family="Times,serif" font-size="14.00">TaxonomyId</text>
+<polygon fill="none" stroke="black" points="7895,-4842.5 7895,-4932.5 8055,-4932.5 8055,-4842.5 7895,-4842.5"/>
+</g>
+<!-- PublishData -->
+<g id="node11" class="node">
+<title>PublishData</title>
+<polygon fill="white" stroke="transparent" points="5091,-1918 5091,-2008 5257,-2008 5257,-1918 5091,-1918"/>
+<polygon fill="#df65b0" stroke="transparent" points="5094,-1984 5094,-2005 5254,-2005 5254,-1984 5094,-1984"/>
+<polygon fill="none" stroke="black" points="5094,-1984 5094,-2005 5254,-2005 5254,-1984 5094,-1984"/>
+<text text-anchor="start" x="5097" y="-1990.8" font-family="Times,serif" font-size="14.00">PublishData (34 MiB)</text>
+<text text-anchor="start" x="5166.5" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="5144.5" y="-1947.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<text text-anchor="start" x="5154.5" y="-1926.8" font-family="Times,serif" font-size="14.00">value</text>
+<polygon fill="none" stroke="black" points="5091,-1918 5091,-2008 5257,-2008 5257,-1918 5091,-1918"/>
+</g>
+<!-- PublishData&#45;&gt;Strain -->
+<g id="edge12" class="edge">
+<title>PublishData:StrainId&#45;&gt;Strain</title>
+<path fill="none" stroke="black" d="M5255,-1951C5275.87,-1951 5264.11,-1218.38 5274,-1200 5368.85,-1023.7 5593.45,-915.93 5711.13,-869.6"/>
+<polygon fill="black" stroke="black" points="5712.4,-872.86 5720.45,-865.97 5709.86,-866.34 5712.4,-872.86"/>
+</g>
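+<!-- Sketch for the PublishData to Strain edge: phenotype values per strain. Strain's
+     columns are not shown in this part of the diagram; Strain.Id as join key and
+     Strain.Name are assumptions.
+       SELECT s.Name AS strain, p.value
+       FROM PublishData p
+       JOIN Strain s ON s.Id = p.StrainId;
+-->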
+<!-- ProbeSetXRef -->
+<g id="node12" class="node">
+<title>ProbeSetXRef</title>
+<polygon fill="white" stroke="transparent" points="3033.5,-4737.5 3033.5,-5037.5 3200.5,-5037.5 3200.5,-4737.5 3033.5,-4737.5"/>
+<polygon fill="#ce1256" stroke="transparent" points="3037,-5013.5 3037,-5034.5 3198,-5034.5 3198,-5013.5 3037,-5013.5"/>
+<polygon fill="none" stroke="black" points="3037,-5013.5 3037,-5034.5 3198,-5034.5 3198,-5013.5 3037,-5013.5"/>
+<text text-anchor="start" x="3040" y="-5020.3" font-family="Times,serif" font-size="14.00">ProbeSetXRef (2 GiB)</text>
+<text text-anchor="start" x="3088.5" y="-4998.3" font-family="Times,serif" font-size="14.00">additive</text>
+<text text-anchor="start" x="3093" y="-4977.3" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="3108" y="-4956.3" font-family="Times,serif" font-size="14.00">h2</text>
+<text text-anchor="start" x="3096.5" y="-4935.3" font-family="Times,serif" font-size="14.00">Locus</text>
+<text text-anchor="start" x="3082.5" y="-4914.3" font-family="Times,serif" font-size="14.00">Locus_old</text>
+<text text-anchor="start" x="3102.5" y="-4893.3" font-family="Times,serif" font-size="14.00">LRS</text>
+<text text-anchor="start" x="3088.5" y="-4872.3" font-family="Times,serif" font-size="14.00">LRS_old</text>
+<text text-anchor="start" x="3097.5" y="-4851.3" font-family="Times,serif" font-size="14.00">mean</text>
+<text text-anchor="start" x="3052.5" y="-4830.3" font-family="Times,serif" font-size="14.00">ProbeSetFreezeId</text>
+<text text-anchor="start" x="3077" y="-4809.3" font-family="Times,serif" font-size="14.00">ProbeSetId</text>
+<text text-anchor="start" x="3093" y="-4788.3" font-family="Times,serif" font-size="14.00">pValue</text>
+<text text-anchor="start" x="3079" y="-4767.3" font-family="Times,serif" font-size="14.00">pValue_old</text>
+<text text-anchor="start" x="3109.5" y="-4746.3" font-family="Times,serif" font-size="14.00">se</text>
+<polygon fill="none" stroke="black" points="3033.5,-4737.5 3033.5,-5037.5 3200.5,-5037.5 3200.5,-4737.5 3033.5,-4737.5"/>
+</g>
+<!-- ProbeSetXRef&#45;&gt;ProbeSE -->
+<g id="edge14" class="edge">
+<title>ProbeSetXRef:ProbeSetId&#45;&gt;ProbeSE</title>
+<path fill="none" stroke="black" d="M3199,-4812.5C4021.93,-4812.5 3996.77,-4088.2 4788,-3862 4841.88,-3846.6 6765.02,-3865.27 6805,-3826 6889.39,-3743.1 6769.62,-2854.79 6843,-2762 6880.46,-2714.64 6934.85,-2771.97 6974,-2726 7149.11,-2520.43 7098.76,-2161.98 7070.36,-2022.18"/>
+<polygon fill="black" stroke="black" points="7073.73,-2021.18 7068.27,-2012.1 7066.87,-2022.6 7073.73,-2021.18"/>
+</g>
+<!-- ProbeSetFreeze -->
+<g id="node90" class="node">
+<title>ProbeSetFreeze</title>
+<polygon fill="white" stroke="transparent" points="2639.5,-3144 2639.5,-3444 2838.5,-3444 2838.5,-3144 2639.5,-3144"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="2643,-3420 2643,-3441 2836,-3441 2836,-3420 2643,-3420"/>
+<polygon fill="none" stroke="black" points="2643,-3420 2643,-3441 2836,-3441 2836,-3420 2643,-3420"/>
+<text text-anchor="start" x="2646" y="-3426.8" font-family="Times,serif" font-size="14.00">ProbeSetFreeze (171 KiB)</text>
+<text text-anchor="start" x="2679" y="-3404.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text>
+<text text-anchor="start" x="2717.5" y="-3383.8" font-family="Times,serif" font-size="14.00">AvgID</text>
+<text text-anchor="start" x="2688" y="-3362.8" font-family="Times,serif" font-size="14.00">confidentiality</text>
+<text text-anchor="start" x="2697.5" y="-3341.8" font-family="Times,serif" font-size="14.00">CreateTime</text>
+<text text-anchor="start" x="2703" y="-3320.8" font-family="Times,serif" font-size="14.00">DataScale</text>
+<text text-anchor="start" x="2704.5" y="-3299.8" font-family="Times,serif" font-size="14.00">FullName</text>
+<text text-anchor="start" x="2732" y="-3278.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="2718" y="-3257.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="2713.5" y="-3236.8" font-family="Times,serif" font-size="14.00">Name2</text>
+<text text-anchor="start" x="2704.5" y="-3215.8" font-family="Times,serif" font-size="14.00">OrderList</text>
+<text text-anchor="start" x="2686.5" y="-3194.8" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text>
+<text text-anchor="start" x="2717.5" y="-3173.8" font-family="Times,serif" font-size="14.00">public</text>
+<text text-anchor="start" x="2698.5" y="-3152.8" font-family="Times,serif" font-size="14.00">ShortName</text>
+<polygon fill="none" stroke="black" points="2639.5,-3144 2639.5,-3444 2838.5,-3444 2838.5,-3144 2639.5,-3144"/>
+</g>
+<!-- ProbeSetXRef&#45;&gt;ProbeSetFreeze -->
+<g id="edge13" class="edge">
+<title>ProbeSetXRef:ProbeSetFreezeId&#45;&gt;ProbeSetFreeze</title>
+<path fill="none" stroke="black" d="M3036,-4833.5C2816.79,-4833.5 2907.79,-4076.99 2865,-3862 2837.79,-3725.3 2803.24,-3570.92 2777.19,-3457.81"/>
+<polygon fill="black" stroke="black" points="2780.6,-3456.98 2774.94,-3448.03 2773.77,-3458.56 2780.6,-3456.98"/>
+</g>
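+<!-- Sketch for ProbeSetXRef to ProbeSetFreeze: per-dataset QTL summary fields.
+     Assumes ProbeSetFreezeId references ProbeSetFreeze.Id.
+       SELECT f.Name AS dataset, x.Locus, x.LRS, x.mean, x.additive
+       FROM ProbeSetXRef x
+       JOIN ProbeSetFreeze f ON f.Id = x.ProbeSetFreezeId;
+-->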
+<!-- TraitMetadata -->
+<g id="node13" class="node">
+<title>TraitMetadata</title>
+<polygon fill="white" stroke="transparent" points="8089,-4853 8089,-4922 8267,-4922 8267,-4853 8089,-4853"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="8092,-4897.5 8092,-4918.5 8264,-4918.5 8264,-4897.5 8092,-4897.5"/>
+<polygon fill="none" stroke="black" points="8092,-4897.5 8092,-4918.5 8264,-4918.5 8264,-4897.5 8092,-4897.5"/>
+<text text-anchor="start" x="8095" y="-4904.3" font-family="Times,serif" font-size="14.00">TraitMetadata (16 KiB)</text>
+<text text-anchor="start" x="8162" y="-4882.3" font-family="Times,serif" font-size="14.00">type</text>
+<text text-anchor="start" x="8158.5" y="-4861.3" font-family="Times,serif" font-size="14.00">value</text>
+<polygon fill="none" stroke="black" points="8089,-4853 8089,-4922 8267,-4922 8267,-4853 8089,-4853"/>
+</g>
+<!-- TissueProbeSetData -->
+<g id="node14" class="node">
+<title>TissueProbeSetData</title>
+<polygon fill="white" stroke="transparent" points="2313.5,-1918 2313.5,-2008 2538.5,-2008 2538.5,-1918 2313.5,-1918"/>
+<polygon fill="#df65b0" stroke="transparent" points="2317,-1984 2317,-2005 2536,-2005 2536,-1984 2317,-1984"/>
+<polygon fill="none" stroke="black" points="2317,-1984 2317,-2005 2536,-2005 2536,-1984 2317,-1984"/>
+<text text-anchor="start" x="2320" y="-1990.8" font-family="Times,serif" font-size="14.00">TissueProbeSetData (33 MiB)</text>
+<text text-anchor="start" x="2419" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="2395" y="-1947.8" font-family="Times,serif" font-size="14.00">TissueID</text>
+<text text-anchor="start" x="2407" y="-1926.8" font-family="Times,serif" font-size="14.00">value</text>
+<polygon fill="none" stroke="black" points="2313.5,-1918 2313.5,-2008 2538.5,-2008 2538.5,-1918 2313.5,-1918"/>
+</g>
+<!-- Tissue -->
+<g id="node79" class="node">
+<title>Tissue</title>
+<polygon fill="lightgrey" stroke="transparent" points="2372.5,-755 2372.5,-929 2497.5,-929 2497.5,-755 2372.5,-755"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="2376,-905 2376,-926 2495,-926 2495,-905 2376,-905"/>
+<polygon fill="none" stroke="black" points="2376,-905 2376,-926 2495,-926 2495,-905 2376,-905"/>
+<text text-anchor="start" x="2381" y="-911.8" font-family="Times,serif" font-size="14.00">Tissue (11 KiB)</text>
+<text text-anchor="start" x="2390.5" y="-889.8" font-family="Times,serif" font-size="14.00">BIRN_lex_ID</text>
+<text text-anchor="start" x="2378" y="-868.8" font-family="Times,serif" font-size="14.00">BIRN_lex_Name</text>
+<text text-anchor="start" x="2428" y="-847.8" font-family="Times,serif" font-size="14.00">Id</text>
+<polygon fill="green" stroke="transparent" points="2376,-821 2376,-840 2495,-840 2495,-821 2376,-821"/>
+<text text-anchor="start" x="2414" y="-826.8" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="green" stroke="transparent" points="2376,-800 2376,-819 2495,-819 2495,-800 2376,-800"/>
+<text text-anchor="start" x="2391" y="-805.8" font-family="Times,serif" font-size="14.00">Short_Name</text>
+<text text-anchor="start" x="2405" y="-784.8" font-family="Times,serif" font-size="14.00">TissueId</text>
+<text text-anchor="start" x="2391.5" y="-763.8" font-family="Times,serif" font-size="14.00">TissueName</text>
+<polygon fill="none" stroke="black" points="2372.5,-755 2372.5,-929 2497.5,-929 2497.5,-755 2372.5,-755"/>
+</g>
+<!-- TissueProbeSetData&#45;&gt;Tissue -->
+<g id="edge15" class="edge">
+<title>TissueProbeSetData:TissueID&#45;&gt;Tissue</title>
+<path fill="none" stroke="black" d="M2537,-1951C2587.33,-1951 2488.08,-1216.42 2449.46,-943.5"/>
+<polygon fill="black" stroke="black" points="2452.87,-942.61 2448,-933.2 2445.94,-943.59 2452.87,-942.61"/>
+</g>
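+<!-- Sketch for TissueProbeSetData to Tissue. The Tissue node shows both Id and
+     TissueId columns; joining on Tissue.Id is an assumption.
+       SELECT t.Name AS tissue, d.value
+       FROM TissueProbeSetData d
+       JOIN Tissue t ON t.Id = d.TissueID;
+-->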
+<!-- DBType -->
+<g id="node15" class="node">
+<title>DBType</title>
+<polygon fill="white" stroke="transparent" points="8304.5,-3259.5 8304.5,-3328.5 8421.5,-3328.5 8421.5,-3259.5 8304.5,-3259.5"/>
+<polygon fill="#f1eef6" stroke="transparent" points="8308,-3304 8308,-3325 8419,-3325 8419,-3304 8308,-3304"/>
+<polygon fill="none" stroke="black" points="8308,-3304 8308,-3325 8419,-3325 8419,-3304 8308,-3304"/>
+<text text-anchor="start" x="8311" y="-3310.8" font-family="Times,serif" font-size="14.00">DBType (99 B)</text>
+<text text-anchor="start" x="8356" y="-3288.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="8342" y="-3267.8" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="none" stroke="black" points="8304.5,-3259.5 8304.5,-3328.5 8421.5,-3328.5 8421.5,-3259.5 8304.5,-3259.5"/>
+</g>
+<!-- DatasetStatus -->
+<g id="node20" class="node">
+<title>DatasetStatus</title>
+<polygon fill="lightgrey" stroke="transparent" points="305.5,-264 305.5,-333 468.5,-333 468.5,-264 305.5,-264"/>
+<polygon fill="#f1eef6" stroke="transparent" points="309,-308.5 309,-329.5 466,-329.5 466,-308.5 309,-308.5"/>
+<polygon fill="none" stroke="black" points="309,-308.5 309,-329.5 466,-329.5 466,-308.5 309,-308.5"/>
+<text text-anchor="start" x="312" y="-315.3" font-family="Times,serif" font-size="14.00">DatasetStatus (40 B)</text>
+<text text-anchor="start" x="329" y="-293.3" font-family="Times,serif" font-size="14.00">DatasetStatusId</text>
+<polygon fill="green" stroke="transparent" points="309,-266.5 309,-285.5 466,-285.5 466,-266.5 309,-266.5"/>
+<text text-anchor="start" x="315" y="-272.3" font-family="Times,serif" font-size="14.00">DatasetStatusName</text>
+<polygon fill="none" stroke="black" points="305.5,-264 305.5,-333 468.5,-333 468.5,-264 305.5,-264"/>
+</g>
+<!-- Datasets&#45;&gt;DatasetStatus -->
+<g id="edge16" class="edge">
+<title>Datasets:DatasetStatusId&#45;&gt;DatasetStatus</title>
+<path fill="none" stroke="black" d="M467,-798C557.78,-798 449.28,-471.63 404.55,-347.04"/>
+<polygon fill="black" stroke="black" points="407.75,-345.6 401.06,-337.38 401.16,-347.97 407.75,-345.6"/>
+</g>
+<!-- Investigators -->
+<g id="node71" class="node">
+<title>Investigators</title>
+<polygon fill="lightgrey" stroke="transparent" points="88,-117 88,-480 258,-480 258,-117 88,-117"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="91,-455.5 91,-476.5 255,-476.5 255,-455.5 91,-455.5"/>
+<polygon fill="none" stroke="black" points="91,-455.5 91,-476.5 255,-476.5 255,-455.5 91,-455.5"/>
+<text text-anchor="start" x="94" y="-462.3" font-family="Times,serif" font-size="14.00">Investigators (22 KiB)</text>
+<polygon fill="green" stroke="transparent" points="91,-434.5 91,-453.5 255,-453.5 255,-434.5 91,-434.5"/>
+<text text-anchor="start" x="144" y="-440.3" font-family="Times,serif" font-size="14.00">Address</text>
+<polygon fill="green" stroke="transparent" points="91,-413.5 91,-432.5 255,-432.5 255,-413.5 91,-413.5"/>
+<text text-anchor="start" x="158" y="-419.3" font-family="Times,serif" font-size="14.00">City</text>
+<polygon fill="green" stroke="transparent" points="91,-392.5 91,-411.5 255,-411.5 255,-392.5 91,-392.5"/>
+<text text-anchor="start" x="144" y="-398.3" font-family="Times,serif" font-size="14.00">Country</text>
+<polygon fill="green" stroke="transparent" points="91,-371.5 91,-390.5 255,-390.5 255,-371.5 91,-371.5"/>
+<text text-anchor="start" x="152" y="-377.3" font-family="Times,serif" font-size="14.00">Email</text>
+<polygon fill="green" stroke="transparent" points="91,-350.5 91,-369.5 255,-369.5 255,-350.5 91,-350.5"/>
+<text text-anchor="start" x="134.5" y="-356.3" font-family="Times,serif" font-size="14.00">FirstName</text>
+<text text-anchor="start" x="122" y="-335.3" font-family="Times,serif" font-size="14.00">InvestigatorId</text>
+<polygon fill="green" stroke="transparent" points="91,-308.5 91,-327.5 255,-327.5 255,-308.5 91,-308.5"/>
+<text text-anchor="start" x="136.5" y="-314.3" font-family="Times,serif" font-size="14.00">LastName</text>
+<text text-anchor="start" x="119.5" y="-293.3" font-family="Times,serif" font-size="14.00">OrganizationId</text>
+<polygon fill="green" stroke="transparent" points="91,-266.5 91,-285.5 255,-285.5 255,-266.5 91,-266.5"/>
+<text text-anchor="start" x="150.5" y="-272.3" font-family="Times,serif" font-size="14.00">Phone</text>
+<polygon fill="green" stroke="transparent" points="91,-245.5 91,-264.5 255,-264.5 255,-245.5 91,-245.5"/>
+<text text-anchor="start" x="153.5" y="-251.3" font-family="Times,serif" font-size="14.00">State</text>
+<polygon fill="green" stroke="transparent" points="91,-224.5 91,-243.5 255,-243.5 255,-224.5 91,-224.5"/>
+<text text-anchor="start" x="161" y="-230.3" font-family="Times,serif" font-size="14.00">Url</text>
+<text text-anchor="start" x="138.5" y="-209.3" font-family="Times,serif" font-size="14.00">UserDate</text>
+<text text-anchor="start" x="136.5" y="-188.3" font-family="Times,serif" font-size="14.00">UserLevel</text>
+<text text-anchor="start" x="134.5" y="-167.3" font-family="Times,serif" font-size="14.00">UserName</text>
+<text text-anchor="start" x="139.5" y="-146.3" font-family="Times,serif" font-size="14.00">UserPass</text>
+<polygon fill="green" stroke="transparent" points="91,-119.5 91,-138.5 255,-138.5 255,-119.5 91,-119.5"/>
+<text text-anchor="start" x="143" y="-125.3" font-family="Times,serif" font-size="14.00">ZipCode</text>
+<polygon fill="none" stroke="black" points="88,-117 88,-480 258,-480 258,-117 88,-117"/>
+</g>
+<!-- Datasets&#45;&gt;Investigators -->
+<g id="edge17" class="edge">
+<title>Datasets:InvestigatorId&#45;&gt;Investigators</title>
+<path fill="none" stroke="black" d="M307,-735C252.81,-735 218.24,-610.26 197.82,-494.3"/>
+<polygon fill="black" stroke="black" points="201.22,-493.45 196.07,-484.19 194.32,-494.64 201.22,-493.45"/>
+</g>
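+<!-- The two Datasets edges above resolve a dataset's status and owning investigator.
+     A combined sketch, assuming the key columns are as labelled in each node:
+       SELECT d.DatasetName, s.DatasetStatusName, i.FirstName, i.LastName
+       FROM Datasets d
+       JOIN DatasetStatus s ON s.DatasetStatusId = d.DatasetStatusId
+       JOIN Investigators i ON i.InvestigatorId = d.InvestigatorId;
+-->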
+<!-- IndelAll -->
+<g id="node17" class="node">
+<title>IndelAll</title>
+<polygon fill="white" stroke="transparent" points="3168,-692 3168,-992 3302,-992 3302,-692 3168,-692"/>
+<polygon fill="#df65b0" stroke="transparent" points="3171,-968 3171,-989 3299,-989 3299,-968 3171,-968"/>
+<polygon fill="none" stroke="black" points="3171,-968 3171,-989 3299,-989 3299,-968 3171,-968"/>
+<text text-anchor="start" x="3174" y="-974.8" font-family="Times,serif" font-size="14.00">IndelAll (17 MiB)</text>
+<text text-anchor="start" x="3188" y="-952.8" font-family="Times,serif" font-size="14.00">Chromosome</text>
+<text text-anchor="start" x="3227.5" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="3181" y="-910.8" font-family="Times,serif" font-size="14.00">InDelSequence</text>
+<text text-anchor="start" x="3206.5" y="-889.8" font-family="Times,serif" font-size="14.00">Mb_end</text>
+<text text-anchor="start" x="3185" y="-868.8" font-family="Times,serif" font-size="14.00">Mb_end_2016</text>
+<text text-anchor="start" x="3202.5" y="-847.8" font-family="Times,serif" font-size="14.00">Mb_start</text>
+<text text-anchor="start" x="3181" y="-826.8" font-family="Times,serif" font-size="14.00">Mb_start_2016</text>
+<text text-anchor="start" x="3213.5" y="-805.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="3219.5" y="-784.8" font-family="Times,serif" font-size="14.00">Size</text>
+<text text-anchor="start" x="3203" y="-763.8" font-family="Times,serif" font-size="14.00">SourceId</text>
+<text text-anchor="start" x="3200" y="-742.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="3210.5" y="-721.8" font-family="Times,serif" font-size="14.00">Strand</text>
+<text text-anchor="start" x="3217.5" y="-700.8" font-family="Times,serif" font-size="14.00">Type</text>
+<polygon fill="none" stroke="black" points="3168,-692 3168,-992 3302,-992 3302,-692 3168,-692"/>
+</g>
+<!-- IndelAll&#45;&gt;Species -->
+<g id="edge18" class="edge">
+<title>IndelAll:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M3170,-746C3144.8,-746 3164.16,-541.49 3151,-520 3088.71,-418.27 2960,-356.26 2875.88,-324.91"/>
+<polygon fill="black" stroke="black" points="2876.95,-321.58 2866.36,-321.42 2874.55,-328.15 2876.95,-321.58"/>
+</g>
+<!-- GORef -->
+<g id="node18" class="node">
+<title>GORef</title>
+<polygon fill="white" stroke="transparent" points="8459.5,-4842.5 8459.5,-4932.5 8576.5,-4932.5 8576.5,-4842.5 8459.5,-4842.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="8463,-4908.5 8463,-4929.5 8574,-4929.5 8574,-4908.5 8463,-4908.5"/>
+<polygon fill="none" stroke="black" points="8463,-4908.5 8463,-4929.5 8574,-4929.5 8574,-4908.5 8463,-4908.5"/>
+<text text-anchor="start" x="8466" y="-4915.3" font-family="Times,serif" font-size="14.00">GORef (2 MiB)</text>
+<text text-anchor="start" x="8497" y="-4893.3" font-family="Times,serif" font-size="14.00">genes</text>
+<text text-anchor="start" x="8492.5" y="-4872.3" font-family="Times,serif" font-size="14.00">goterm</text>
+<text text-anchor="start" x="8511.5" y="-4851.3" font-family="Times,serif" font-size="14.00">id</text>
+<polygon fill="none" stroke="black" points="8459.5,-4842.5 8459.5,-4932.5 8576.5,-4932.5 8576.5,-4842.5 8459.5,-4842.5"/>
+</g>
+<!-- Publication -->
+<g id="node19" class="node">
+<title>Publication</title>
+<polygon fill="lightgrey" stroke="transparent" points="2531.5,-723.5 2531.5,-960.5 2682.5,-960.5 2682.5,-723.5 2531.5,-723.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="2535,-936 2535,-957 2680,-957 2680,-936 2535,-936"/>
+<polygon fill="none" stroke="black" points="2535,-936 2535,-957 2680,-957 2680,-936 2535,-936"/>
+<text text-anchor="start" x="2538" y="-942.8" font-family="Times,serif" font-size="14.00">Publication (7 MiB)</text>
+<polygon fill="green" stroke="transparent" points="2535,-915 2535,-934 2680,-934 2680,-915 2535,-915"/>
+<text text-anchor="start" x="2577" y="-920.8" font-family="Times,serif" font-size="14.00">Abstract</text>
+<polygon fill="green" stroke="transparent" points="2535,-894 2535,-913 2680,-913 2680,-894 2535,-894"/>
+<text text-anchor="start" x="2579" y="-899.8" font-family="Times,serif" font-size="14.00">Authors</text>
+<polygon fill="green" stroke="transparent" points="2535,-873 2535,-892 2680,-892 2680,-873 2535,-873"/>
+<text text-anchor="start" x="2581.5" y="-878.8" font-family="Times,serif" font-size="14.00">Journal</text>
+<polygon fill="green" stroke="transparent" points="2535,-852 2535,-871 2680,-871 2680,-852 2535,-852"/>
+<text text-anchor="start" x="2584" y="-857.8" font-family="Times,serif" font-size="14.00">Month</text>
+<polygon fill="green" stroke="transparent" points="2535,-831 2535,-850 2680,-850 2680,-831 2535,-831"/>
+<text text-anchor="start" x="2586" y="-836.8" font-family="Times,serif" font-size="14.00">Pages</text>
+<polygon fill="green" stroke="transparent" points="2535,-810 2535,-829 2680,-829 2680,-810 2535,-810"/>
+<text text-anchor="start" x="2566" y="-815.8" font-family="Times,serif" font-size="14.00">PubMed_ID</text>
+<polygon fill="green" stroke="transparent" points="2535,-789 2535,-808 2680,-808 2680,-789 2535,-789"/>
+<text text-anchor="start" x="2591" y="-794.8" font-family="Times,serif" font-size="14.00">Title</text>
+<polygon fill="green" stroke="transparent" points="2535,-768 2535,-787 2680,-787 2680,-768 2535,-768"/>
+<text text-anchor="start" x="2581" y="-773.8" font-family="Times,serif" font-size="14.00">Volume</text>
+<polygon fill="green" stroke="transparent" points="2535,-747 2535,-766 2680,-766 2680,-747 2535,-747"/>
+<text text-anchor="start" x="2591.5" y="-752.8" font-family="Times,serif" font-size="14.00">Year</text>
+<text text-anchor="start" x="2600" y="-731.8" font-family="Times,serif" font-size="14.00">Id</text>
+<polygon fill="none" stroke="black" points="2531.5,-723.5 2531.5,-960.5 2682.5,-960.5 2682.5,-723.5 2531.5,-723.5"/>
+</g>
+<!-- PublishFreeze -->
+<g id="node21" class="node">
+<title>PublishFreeze</title>
+<polygon fill="white" stroke="transparent" points="3246.5,-1855 3246.5,-2071 3415.5,-2071 3415.5,-1855 3246.5,-1855"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="3250,-2047 3250,-2068 3413,-2068 3413,-2047 3250,-2047"/>
+<polygon fill="none" stroke="black" points="3250,-2047 3250,-2068 3413,-2068 3413,-2047 3250,-2047"/>
+<text text-anchor="start" x="3253" y="-2053.8" font-family="Times,serif" font-size="14.00">PublishFreeze (6 KiB)</text>
+<text text-anchor="start" x="3271" y="-2031.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text>
+<text text-anchor="start" x="3280" y="-2010.8" font-family="Times,serif" font-size="14.00">confidentiality</text>
+<text text-anchor="start" x="3289.5" y="-1989.8" font-family="Times,serif" font-size="14.00">CreateTime</text>
+<text text-anchor="start" x="3296.5" y="-1968.8" font-family="Times,serif" font-size="14.00">FullName</text>
+<text text-anchor="start" x="3324" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="3288.5" y="-1926.8" font-family="Times,serif" font-size="14.00">InbredSetId</text>
+<text text-anchor="start" x="3310" y="-1905.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="3309.5" y="-1884.8" font-family="Times,serif" font-size="14.00">public</text>
+<text text-anchor="start" x="3290.5" y="-1863.8" font-family="Times,serif" font-size="14.00">ShortName</text>
+<polygon fill="none" stroke="black" points="3246.5,-1855 3246.5,-2071 3415.5,-2071 3415.5,-1855 3246.5,-1855"/>
+</g>
+<!-- InbredSet -->
+<g id="node28" class="node">
+<title>InbredSet</title>
+<polygon fill="lightgrey" stroke="transparent" points="3781.5,-692 3781.5,-992 3928.5,-992 3928.5,-692 3781.5,-692"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="3785,-968 3785,-989 3926,-989 3926,-968 3785,-968"/>
+<polygon fill="none" stroke="black" points="3785,-968 3785,-989 3926,-989 3926,-968 3785,-968"/>
+<text text-anchor="start" x="3788" y="-974.8" font-family="Times,serif" font-size="14.00">InbredSet (10 KiB)</text>
+<text text-anchor="start" x="3810" y="-952.8" font-family="Times,serif" font-size="14.00">FamilyOrder</text>
+<text text-anchor="start" x="3848" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="3801.5" y="-910.8" font-family="Times,serif" font-size="14.00">InbredSetCode</text>
+<text text-anchor="start" x="3812.5" y="-889.8" font-family="Times,serif" font-size="14.00">InbredSetId</text>
+<text text-anchor="start" x="3798.5" y="-868.8" font-family="Times,serif" font-size="14.00">InbredSetName</text>
+<text text-anchor="start" x="3789" y="-847.8" font-family="Times,serif" font-size="14.00">MappingMethodId</text>
+<text text-anchor="start" x="3807" y="-826.8" font-family="Times,serif" font-size="14.00">MenuOrderId</text>
+<polygon fill="green" stroke="transparent" points="3785,-800 3785,-819 3926,-819 3926,-800 3785,-800"/>
+<text text-anchor="start" x="3834" y="-805.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="3833.5" y="-784.8" font-family="Times,serif" font-size="14.00">public</text>
+<text text-anchor="start" x="3820.5" y="-763.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<polygon fill="green" stroke="transparent" points="3785,-737 3785,-756 3926,-756 3926,-737 3785,-737"/>
+<text text-anchor="start" x="3831" y="-742.8" font-family="Times,serif" font-size="14.00">Family</text>
+<polygon fill="green" stroke="transparent" points="3785,-716 3785,-735 3926,-735 3926,-716 3785,-716"/>
+<text text-anchor="start" x="3820.5" y="-721.8" font-family="Times,serif" font-size="14.00">FullName</text>
+<polygon fill="green" stroke="transparent" points="3785,-695 3785,-714 3926,-714 3926,-695 3785,-695"/>
+<text text-anchor="start" x="3810.5" y="-700.8" font-family="Times,serif" font-size="14.00">GeneticType</text>
+<polygon fill="none" stroke="black" points="3781.5,-692 3781.5,-992 3928.5,-992 3928.5,-692 3781.5,-692"/>
+</g>
+<!-- PublishFreeze&#45;&gt;InbredSet -->
+<g id="edge19" class="edge">
+<title>PublishFreeze:InbredSetId&#45;&gt;InbredSet</title>
+<path fill="none" stroke="black" d="M3414,-1930C3454.58,-1930 3409.48,-1229.81 3437,-1200 3485.84,-1147.1 3703.73,-1210.15 3759,-1164 3805.64,-1125.05 3830.2,-1064.45 3842.93,-1006.34"/>
+<polygon fill="black" stroke="black" points="3846.42,-1006.79 3845.03,-996.28 3839.56,-1005.36 3846.42,-1006.79"/>
+</g>
+<!-- TissueProbeFreeze -->
+<g id="node22" class="node">
+<title>TissueProbeFreeze</title>
+<polygon fill="white" stroke="transparent" points="4631,-1865.5 4631,-2060.5 4837,-2060.5 4837,-1865.5 4631,-1865.5"/>
+<polygon fill="#f1eef6" stroke="transparent" points="4634,-2036 4634,-2057 4834,-2057 4834,-2036 4634,-2036"/>
+<polygon fill="none" stroke="black" points="4634,-2036 4634,-2057 4834,-2057 4834,-2036 4634,-2036"/>
+<text text-anchor="start" x="4637" y="-2042.8" font-family="Times,serif" font-size="14.00">TissueProbeFreeze (116 B)</text>
+<text text-anchor="start" x="4710" y="-2020.8" font-family="Times,serif" font-size="14.00">ChipId</text>
+<text text-anchor="start" x="4692" y="-1999.8" font-family="Times,serif" font-size="14.00">CreateTime</text>
+<text text-anchor="start" x="4699" y="-1978.8" font-family="Times,serif" font-size="14.00">FullName</text>
+<text text-anchor="start" x="4726.5" y="-1957.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="4691" y="-1936.8" font-family="Times,serif" font-size="14.00">InbredSetId</text>
+<text text-anchor="start" x="4712.5" y="-1915.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="4693" y="-1894.8" font-family="Times,serif" font-size="14.00">ShortName</text>
+<text text-anchor="start" x="4704.5" y="-1873.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<polygon fill="none" stroke="black" points="4631,-1865.5 4631,-2060.5 4837,-2060.5 4837,-1865.5 4631,-1865.5"/>
+</g>
+<!-- TissueProbeFreeze&#45;&gt;InbredSet -->
+<g id="edge20" class="edge">
+<title>TissueProbeFreeze:InbredSetId&#45;&gt;InbredSet</title>
+<path fill="none" stroke="black" d="M4633,-1940C4550.53,-1940 4633.54,-1259.07 4576,-1200 4521.75,-1144.31 4299.4,-1194.77 4228,-1164 4116.11,-1115.79 4013.14,-1021.68 3943.86,-947.77"/>
+<polygon fill="black" stroke="black" points="3946.22,-945.17 3936.85,-940.23 3941.1,-949.94 3946.22,-945.17"/>
+</g>
+<!-- TissueProbeSetFreeze&#45;&gt;TissueProbeFreeze -->
+<g id="edge21" class="edge">
+<title>TissueProbeSetFreeze:TissueProbeFreezeId&#45;&gt;TissueProbeFreeze</title>
+<path fill="none" stroke="black" d="M4862,-3167C4862,-2762.54 4789.57,-2285.87 4753.68,-2074.48"/>
+<polygon fill="black" stroke="black" points="4757.13,-2073.88 4752,-2064.61 4750.23,-2075.06 4757.13,-2073.88"/>
+</g>
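+<!-- Sketch for the freeze hierarchy above, assuming TissueProbeFreezeId references
+     TissueProbeFreeze.Id:
+       SELECT psf.FullName AS probeset_freeze, pf.FullName AS probe_freeze
+       FROM TissueProbeSetFreeze psf
+       JOIN TissueProbeFreeze pf ON pf.Id = psf.TissueProbeFreezeId;
+-->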
+<!-- ProbeXRef -->
+<g id="node24" class="node">
+<title>ProbeXRef</title>
+<polygon fill="white" stroke="transparent" points="4805,-4842.5 4805,-4932.5 4969,-4932.5 4969,-4842.5 4805,-4842.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="4808,-4908.5 4808,-4929.5 4966,-4929.5 4966,-4908.5 4808,-4908.5"/>
+<polygon fill="none" stroke="black" points="4808,-4908.5 4808,-4929.5 4966,-4929.5 4966,-4908.5 4808,-4908.5"/>
+<text text-anchor="start" x="4811" y="-4915.3" font-family="Times,serif" font-size="14.00">ProbeXRef (229 MiB)</text>
+<text text-anchor="start" x="4862.5" y="-4893.3" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="4834" y="-4872.3" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text>
+<text text-anchor="start" x="4858.5" y="-4851.3" font-family="Times,serif" font-size="14.00">ProbeId</text>
+<polygon fill="none" stroke="black" points="4805,-4842.5 4805,-4932.5 4969,-4932.5 4969,-4842.5 4805,-4842.5"/>
+</g>
+<!-- Probe -->
+<g id="node41" class="node">
+<title>Probe</title>
+<polygon fill="white" stroke="transparent" points="6860.5,-3186 6860.5,-3402 6969.5,-3402 6969.5,-3186 6860.5,-3186"/>
+<polygon fill="#ce1256" stroke="transparent" points="6864,-3378 6864,-3399 6967,-3399 6967,-3378 6864,-3378"/>
+<polygon fill="none" stroke="black" points="6864,-3378 6864,-3399 6967,-3399 6967,-3378 6864,-3378"/>
+<text text-anchor="start" x="6867" y="-3384.8" font-family="Times,serif" font-size="14.00">Probe (2 GiB)</text>
+<text text-anchor="start" x="6891" y="-3362.8" font-family="Times,serif" font-size="14.00">E_GSB</text>
+<text text-anchor="start" x="6890.5" y="-3341.8" font-family="Times,serif" font-size="14.00">E_NSB</text>
+<text text-anchor="start" x="6887" y="-3320.8" font-family="Times,serif" font-size="14.00">ExonNo</text>
+<text text-anchor="start" x="6908" y="-3299.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="6894" y="-3278.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="6875" y="-3257.8" font-family="Times,serif" font-size="14.00">ProbeSetId</text>
+<text text-anchor="start" x="6880.5" y="-3236.8" font-family="Times,serif" font-size="14.00">Sequence</text>
+<text text-anchor="start" x="6873" y="-3215.8" font-family="Times,serif" font-size="14.00">SerialOrder</text>
+<text text-anchor="start" x="6904" y="-3194.8" font-family="Times,serif" font-size="14.00">Tm</text>
+<polygon fill="none" stroke="black" points="6860.5,-3186 6860.5,-3402 6969.5,-3402 6969.5,-3186 6860.5,-3186"/>
+</g>
+<!-- ProbeXRef&#45;&gt;Probe -->
+<g id="edge23" class="edge">
+<title>ProbeXRef:ProbeId&#45;&gt;Probe</title>
+<path fill="none" stroke="black" d="M4967,-4854.5C5534.68,-4854.5 5262.79,-4114.96 5771,-3862 5877.2,-3809.14 6749.63,-3905.13 6838,-3826 6950.47,-3725.29 6951.4,-3539.28 6936.93,-3416.33"/>
+<polygon fill="black" stroke="black" points="6940.37,-3415.61 6935.68,-3406.11 6933.42,-3416.47 6940.37,-3415.61"/>
+</g>
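+<!-- Sketch for ProbeXRef to Probe, assuming ProbeId references Probe.Id:
+       SELECT p.Name, p.Sequence, p.Tm
+       FROM ProbeXRef x
+       JOIN Probe p ON p.Id = x.ProbeId;
+-->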
+<!-- ProbeXRef&#45;&gt;ProbeFreeze -->
+<g id="edge22" class="edge">
+<title>ProbeXRef:ProbeFreezeId&#45;&gt;ProbeFreeze</title>
+<path fill="none" stroke="black" d="M4807,-4875.5C3968.98,-4875.5 3960.35,-4248.91 3217,-3862 3179.88,-3842.68 3157.46,-3857.58 3130,-3826 2809.52,-3457.41 3148.75,-3152.22 2855,-2762 2836.07,-2736.85 2811.36,-2752.26 2794,-2726 2665.13,-2531.04 2665.79,-2246.15 2679.06,-2085.66"/>
+<polygon fill="black" stroke="black" points="2682.59,-2085.53 2679.95,-2075.27 2675.61,-2084.93 2682.59,-2085.53"/>
+</g>
+<!-- Publication_Test -->
+<g id="node25" class="node">
+<title>Publication_Test</title>
+<polygon fill="white" stroke="transparent" points="8610.5,-4769 8610.5,-5006 8797.5,-5006 8797.5,-4769 8610.5,-4769"/>
+<polygon fill="#df65b0" stroke="transparent" points="8614,-4981.5 8614,-5002.5 8795,-5002.5 8795,-4981.5 8614,-4981.5"/>
+<polygon fill="none" stroke="black" points="8614,-4981.5 8614,-5002.5 8795,-5002.5 8795,-4981.5 8614,-4981.5"/>
+<text text-anchor="start" x="8617" y="-4988.3" font-family="Times,serif" font-size="14.00">Publication_Test (7 MiB)</text>
+<text text-anchor="start" x="8674" y="-4966.3" font-family="Times,serif" font-size="14.00">Abstract</text>
+<text text-anchor="start" x="8676" y="-4945.3" font-family="Times,serif" font-size="14.00">Authors</text>
+<text text-anchor="start" x="8697" y="-4924.3" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="8678.5" y="-4903.3" font-family="Times,serif" font-size="14.00">Journal</text>
+<text text-anchor="start" x="8681" y="-4882.3" font-family="Times,serif" font-size="14.00">Month</text>
+<text text-anchor="start" x="8683" y="-4861.3" font-family="Times,serif" font-size="14.00">Pages</text>
+<text text-anchor="start" x="8663" y="-4840.3" font-family="Times,serif" font-size="14.00">PubMed_ID</text>
+<text text-anchor="start" x="8688" y="-4819.3" font-family="Times,serif" font-size="14.00">Title</text>
+<text text-anchor="start" x="8678" y="-4798.3" font-family="Times,serif" font-size="14.00">Volume</text>
+<text text-anchor="start" x="8688.5" y="-4777.3" font-family="Times,serif" font-size="14.00">Year</text>
+<polygon fill="none" stroke="black" points="8610.5,-4769 8610.5,-5006 8797.5,-5006 8797.5,-4769 8610.5,-4769"/>
+</g>
+<!-- DBList -->
+<g id="node26" class="node">
+<title>DBList</title>
+<polygon fill="white" stroke="transparent" points="8301,-4821.5 8301,-4953.5 8425,-4953.5 8425,-4821.5 8301,-4821.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="8304,-4929.5 8304,-4950.5 8422,-4950.5 8422,-4929.5 8304,-4929.5"/>
+<polygon fill="none" stroke="black" points="8304,-4929.5 8304,-4950.5 8422,-4950.5 8422,-4929.5 8304,-4929.5"/>
+<text text-anchor="start" x="8307" y="-4936.3" font-family="Times,serif" font-size="14.00">DBList (99 KiB)</text>
+<text text-anchor="start" x="8344.5" y="-4914.3" font-family="Times,serif" font-size="14.00">Code</text>
+<text text-anchor="start" x="8327.5" y="-4893.3" font-family="Times,serif" font-size="14.00">DBTypeId</text>
+<text text-anchor="start" x="8331" y="-4872.3" font-family="Times,serif" font-size="14.00">FreezeId</text>
+<text text-anchor="start" x="8355.5" y="-4851.3" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="8341.5" y="-4830.3" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="none" stroke="black" points="8301,-4821.5 8301,-4953.5 8425,-4953.5 8425,-4821.5 8301,-4821.5"/>
+</g>
+<!-- DBList&#45;&gt;DBType -->
+<g id="edge24" class="edge">
+<title>DBList:DBTypeId&#45;&gt;DBType</title>
+<path fill="none" stroke="black" d="M8423,-4897.5C8462.94,-4897.5 8383.01,-3608.94 8366.07,-3342.76"/>
+<polygon fill="black" stroke="black" points="8369.55,-3342.4 8365.42,-3332.64 8362.57,-3342.84 8369.55,-3342.4"/>
+</g>
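+<!-- Sketch for DBList to DBType, assuming DBTypeId references DBType.Id:
+       SELECT t.Name AS db_type, l.Code, l.Name
+       FROM DBList l
+       JOIN DBType t ON t.Id = l.DBTypeId;
+-->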
+<!-- H2 -->
+<g id="node27" class="node">
+<title>H2</title>
+<polygon fill="white" stroke="transparent" points="8831.5,-4832 8831.5,-4943 8922.5,-4943 8922.5,-4832 8831.5,-4832"/>
+<polygon fill="#df65b0" stroke="transparent" points="8835,-4918.5 8835,-4939.5 8920,-4939.5 8920,-4918.5 8835,-4918.5"/>
+<polygon fill="none" stroke="black" points="8835,-4918.5 8835,-4939.5 8920,-4939.5 8920,-4918.5 8835,-4918.5"/>
+<text text-anchor="start" x="8838" y="-4925.3" font-family="Times,serif" font-size="14.00">H2 (2 MiB)</text>
+<text text-anchor="start" x="8853" y="-4903.3" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="8856.5" y="-4882.3" font-family="Times,serif" font-size="14.00">H2SE</text>
+<text text-anchor="start" x="8856" y="-4861.3" font-family="Times,serif" font-size="14.00">HPH2</text>
+<text text-anchor="start" x="8859" y="-4840.3" font-family="Times,serif" font-size="14.00">ICH2</text>
+<polygon fill="none" stroke="black" points="8831.5,-4832 8831.5,-4943 8922.5,-4943 8922.5,-4832 8831.5,-4832"/>
+</g>
+<!-- InbredSet&#45;&gt;Species -->
+<g id="edge25" class="edge">
+<title>InbredSet:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M3784,-767C3728.83,-767 3795.51,-561.36 3759,-520 3641.66,-387.09 3085.79,-325.05 2876.21,-306.09"/>
+<polygon fill="black" stroke="black" points="2876.47,-302.6 2866.2,-305.19 2875.85,-309.57 2876.47,-302.6"/>
+</g>
+<!-- DatasetMapInvestigator -->
+<g id="node29" class="node">
+<title>DatasetMapInvestigator</title>
+<polygon fill="white" stroke="transparent" points="8,-1918 8,-2008 258,-2008 258,-1918 8,-1918"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="11,-1984 11,-2005 255,-2005 255,-1984 11,-1984"/>
+<polygon fill="none" stroke="black" points="11,-1984 11,-2005 255,-2005 255,-1984 11,-1984"/>
+<text text-anchor="start" x="14" y="-1990.8" font-family="Times,serif" font-size="14.00">DatasetMapInvestigator (28 KiB)</text>
+<text text-anchor="start" x="98" y="-1968.8" font-family="Times,serif" font-size="14.00">DatasetId</text>
+<text text-anchor="start" x="125.5" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="82" y="-1926.8" font-family="Times,serif" font-size="14.00">InvestigatorId</text>
+<polygon fill="none" stroke="black" points="8,-1918 8,-2008 258,-2008 258,-1918 8,-1918"/>
+</g>
+<!-- DatasetMapInvestigator&#45;&gt;Datasets -->
+<g id="edge26" class="edge">
+<title>DatasetMapInvestigator:DatasetId&#45;&gt;Datasets</title>
+<path fill="none" stroke="black" d="M256,-1973C277.48,-1973 271.49,-1221.19 275,-1200 283.9,-1146.31 298.97,-1089.52 315.22,-1037.42"/>
+<polygon fill="black" stroke="black" points="318.6,-1038.33 318.27,-1027.74 311.93,-1036.23 318.6,-1038.33"/>
+</g>
+<!-- DatasetMapInvestigator&#45;&gt;Investigators -->
+<g id="edge27" class="edge">
+<title>DatasetMapInvestigator:InvestigatorId&#45;&gt;Investigators</title>
+<path fill="none" stroke="black" d="M133,-1920C133,-1405.22 153.42,-798.72 165.08,-494.41"/>
+<polygon fill="black" stroke="black" points="168.59,-494.29 165.48,-484.16 161.59,-494.02 168.59,-494.29"/>
+</g>
+<!-- Docs -->
+<g id="node30" class="node">
+<title>Docs</title>
+<polygon fill="white" stroke="transparent" points="8956.5,-4832 8956.5,-4943 9075.5,-4943 9075.5,-4832 8956.5,-4832"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="8960,-4918.5 8960,-4939.5 9073,-4939.5 9073,-4918.5 8960,-4918.5"/>
+<polygon fill="none" stroke="black" points="8960,-4918.5 8960,-4939.5 9073,-4939.5 9073,-4918.5 8960,-4918.5"/>
+<text text-anchor="start" x="8963" y="-4925.3" font-family="Times,serif" font-size="14.00">Docs (148 KiB)</text>
+<text text-anchor="start" x="8989" y="-4903.3" font-family="Times,serif" font-size="14.00">content</text>
+<text text-anchor="start" x="8997" y="-4882.3" font-family="Times,serif" font-size="14.00">entry</text>
+<text text-anchor="start" x="9009.5" y="-4861.3" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="9001.5" y="-4840.3" font-family="Times,serif" font-size="14.00">title</text>
+<polygon fill="none" stroke="black" points="8956.5,-4832 8956.5,-4943 9075.5,-4943 9075.5,-4832 8956.5,-4832"/>
+</g>
+<!-- Phenotype -->
+<g id="node31" class="node">
+<title>Phenotype</title>
+<polygon fill="lightgrey" stroke="transparent" points="2910,-713 2910,-971 3134,-971 3134,-713 2910,-713"/>
+<polygon fill="#df65b0" stroke="transparent" points="2913,-947 2913,-968 3131,-968 3131,-947 2913,-947"/>
+<polygon fill="none" stroke="black" points="2913,-947 2913,-968 3131,-968 3131,-947 2913,-947"/>
+<text text-anchor="start" x="2955" y="-953.8" font-family="Times,serif" font-size="14.00">Phenotype (9 MiB)</text>
+<text text-anchor="start" x="3014.5" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="2915" y="-910.8" font-family="Times,serif" font-size="14.00">Post_publication_abbreviation</text>
+<text text-anchor="start" x="2918" y="-889.8" font-family="Times,serif" font-size="14.00">Pre_publication_abbreviation</text>
+<polygon fill="green" stroke="transparent" points="2913,-863 2913,-882 3131,-882 3131,-863 2913,-863"/>
+<text text-anchor="start" x="2958.5" y="-868.8" font-family="Times,serif" font-size="14.00">Authorized_Users</text>
+<polygon fill="green" stroke="transparent" points="2913,-842 2913,-861 3131,-861 3131,-842 2913,-842"/>
+<text text-anchor="start" x="2988.5" y="-847.8" font-family="Times,serif" font-size="14.00">Lab_code</text>
+<polygon fill="green" stroke="transparent" points="2913,-821 2913,-840 3131,-840 3131,-821 2913,-821"/>
+<text text-anchor="start" x="2949.5" y="-826.8" font-family="Times,serif" font-size="14.00">Original_description</text>
+<polygon fill="green" stroke="transparent" points="2913,-800 2913,-819 3131,-819 3131,-800 2913,-800"/>
+<text text-anchor="start" x="2998" y="-805.8" font-family="Times,serif" font-size="14.00">Owner</text>
+<polygon fill="green" stroke="transparent" points="2913,-779 2913,-798 3131,-798 3131,-779 2913,-779"/>
+<text text-anchor="start" x="2919.5" y="-784.8" font-family="Times,serif" font-size="14.00">Post_publication_description</text>
+<polygon fill="green" stroke="transparent" points="2913,-758 2913,-777 3131,-777 3131,-758 2913,-758"/>
+<text text-anchor="start" x="2922.5" y="-763.8" font-family="Times,serif" font-size="14.00">Pre_publication_description</text>
+<polygon fill="green" stroke="transparent" points="2913,-737 2913,-756 3131,-756 3131,-737 2913,-737"/>
+<text text-anchor="start" x="2985.5" y="-742.8" font-family="Times,serif" font-size="14.00">Submitter</text>
+<polygon fill="green" stroke="transparent" points="2913,-716 2913,-735 3131,-735 3131,-716 2913,-716"/>
+<text text-anchor="start" x="3002" y="-721.8" font-family="Times,serif" font-size="14.00">Units</text>
+<polygon fill="none" stroke="black" points="2910,-713 2910,-971 3134,-971 3134,-713 2910,-713"/>
+</g>
+<!-- SnpPattern -->
+<g id="node32" class="node">
+<title>SnpPattern</title>
+<polygon fill="white" stroke="transparent" points="9110,-3866 9110,-5909 9294,-5909 9294,-3866 9110,-3866"/>
+<polygon fill="#ce1256" stroke="transparent" points="9113,-5884.5 9113,-5905.5 9291,-5905.5 9291,-5884.5 9113,-5884.5"/>
+<polygon fill="none" stroke="black" points="9113,-5884.5 9113,-5905.5 9291,-5905.5 9291,-5884.5 9113,-5884.5"/>
+<text text-anchor="start" x="9134" y="-5891.3" font-family="Times,serif" font-size="14.00">SnpPattern (8 GiB)</text>
+<text text-anchor="start" x="9150.5" y="-5869.3" font-family="Times,serif" font-size="14.00">129P2/OlaHsd</text>
+<text text-anchor="start" x="9155.5" y="-5848.3" font-family="Times,serif" font-size="14.00">129S1/SvImJ</text>
+<text text-anchor="start" x="9153.5" y="-5827.3" font-family="Times,serif" font-size="14.00">129S2/SvHsd</text>
+<text text-anchor="start" x="9156.5" y="-5806.3" font-family="Times,serif" font-size="14.00">129S4/SvJae</text>
+<text text-anchor="start" x="9145" y="-5785.3" font-family="Times,serif" font-size="14.00">129S5/SvEvBrd</text>
+<text text-anchor="start" x="9158" y="-5764.3" font-family="Times,serif" font-size="14.00">129S6/SvEv</text>
+<text text-anchor="start" x="9149.5" y="-5743.3" font-family="Times,serif" font-size="14.00">129T2/SvEmsJ</text>
+<text text-anchor="start" x="9165" y="-5722.3" font-family="Times,serif" font-size="14.00">129X1/SvJ</text>
+<text text-anchor="start" x="9192" y="-5701.3" font-family="Times,serif" font-size="14.00">A/J</text>
+<text text-anchor="start" x="9181.5" y="-5680.3" font-family="Times,serif" font-size="14.00">AKR/J</text>
+<text text-anchor="start" x="9115" y="-5659.3" font-family="Times,serif" font-size="14.00">B6A6_Esline_Regeneron</text>
+<text text-anchor="start" x="9164" y="-5638.3" font-family="Times,serif" font-size="14.00">BALB/cByJ</text>
+<text text-anchor="start" x="9173" y="-5617.3" font-family="Times,serif" font-size="14.00">BALB/cJ</text>
+<text text-anchor="start" x="9176" y="-5596.3" font-family="Times,serif" font-size="14.00">BPH/2J</text>
+<text text-anchor="start" x="9177.5" y="-5575.3" font-family="Times,serif" font-size="14.00">BPL/1J</text>
+<text text-anchor="start" x="9176" y="-5554.3" font-family="Times,serif" font-size="14.00">BPN/3J</text>
+<text text-anchor="start" x="9148.5" y="-5533.3" font-family="Times,serif" font-size="14.00">BTBRT&lt;+&gt;tf/J</text>
+<text text-anchor="start" x="9170.5" y="-5512.3" font-family="Times,serif" font-size="14.00">BUB/BnJ</text>
+<text text-anchor="start" x="9135.5" y="-5491.3" font-family="Times,serif" font-size="14.00">C2T1_Esline_Nagy</text>
+<text text-anchor="start" x="9171" y="-5470.3" font-family="Times,serif" font-size="14.00">C3H/HeJ</text>
+<text text-anchor="start" x="9163" y="-5449.3" font-family="Times,serif" font-size="14.00">C3HeB/FeJ</text>
+<text text-anchor="start" x="9164" y="-5428.3" font-family="Times,serif" font-size="14.00">C57BL/10J</text>
+<text text-anchor="start" x="9159" y="-5407.3" font-family="Times,serif" font-size="14.00">C57BL/6ByJ</text>
+<text text-anchor="start" x="9168.5" y="-5386.3" font-family="Times,serif" font-size="14.00">C57BL/6J</text>
+<text text-anchor="start" x="9140" y="-5365.3" font-family="Times,serif" font-size="14.00">C57BL/6JBomTac</text>
+<text text-anchor="start" x="9157.5" y="-5344.3" font-family="Times,serif" font-size="14.00">C57BL/6JCrl</text>
+<text text-anchor="start" x="9142" y="-5323.3" font-family="Times,serif" font-size="14.00">C57BL/6JOlaHsd</text>
+<text text-anchor="start" x="9154" y="-5302.3" font-family="Times,serif" font-size="14.00">C57BL/6NCrl</text>
+<text text-anchor="start" x="9150.5" y="-5281.3" font-family="Times,serif" font-size="14.00">C57BL/6NHsd</text>
+<text text-anchor="start" x="9162.5" y="-5260.3" font-family="Times,serif" font-size="14.00">C57BL/6NJ</text>
+<text text-anchor="start" x="9150.5" y="-5239.3" font-family="Times,serif" font-size="14.00">C57BL/6NNIH</text>
+<text text-anchor="start" x="9153" y="-5218.3" font-family="Times,serif" font-size="14.00">C57BL/6NTac</text>
+<text text-anchor="start" x="9162.5" y="-5197.3" font-family="Times,serif" font-size="14.00">C57BLKS/J</text>
+<text text-anchor="start" x="9164" y="-5176.3" font-family="Times,serif" font-size="14.00">C57BR/cdJ</text>
+<text text-anchor="start" x="9178" y="-5155.3" font-family="Times,serif" font-size="14.00">C57L/J</text>
+<text text-anchor="start" x="9182.5" y="-5134.3" font-family="Times,serif" font-size="14.00">C58/J</text>
+<text text-anchor="start" x="9167.5" y="-5113.3" font-family="Times,serif" font-size="14.00">CALB/RkJ</text>
+<text text-anchor="start" x="9170" y="-5092.3" font-family="Times,serif" font-size="14.00">CAST/EiJ</text>
+<text text-anchor="start" x="9181.5" y="-5071.3" font-family="Times,serif" font-size="14.00">CBA/J</text>
+<text text-anchor="start" x="9186.5" y="-5050.3" font-family="Times,serif" font-size="14.00">CE/J</text>
+<text text-anchor="start" x="9157.5" y="-5029.3" font-family="Times,serif" font-size="14.00">CZECHII/EiJ</text>
+<text text-anchor="start" x="9176.5" y="-5008.3" font-family="Times,serif" font-size="14.00">DBA/1J</text>
+<text text-anchor="start" x="9176.5" y="-4987.3" font-family="Times,serif" font-size="14.00">DBA/2J</text>
+<text text-anchor="start" x="9170.5" y="-4966.3" font-family="Times,serif" font-size="14.00">DDK/Pas</text>
+<text text-anchor="start" x="9135.5" y="-4945.3" font-family="Times,serif" font-size="14.00">DDY/JclSidSeyFrkJ</text>
+<text text-anchor="start" x="9148.5" y="-4924.3" font-family="Times,serif" font-size="14.00">EL/SuzSeyFrkJ</text>
+<text text-anchor="start" x="9183.5" y="-4903.3" font-family="Times,serif" font-size="14.00">Fline</text>
+<text text-anchor="start" x="9176" y="-4882.3" font-family="Times,serif" font-size="14.00">FVB/NJ</text>
+<text text-anchor="start" x="9154" y="-4861.3" font-family="Times,serif" font-size="14.00">HTG/GoSfSnJ</text>
+<text text-anchor="start" x="9185" y="-4840.3" font-family="Times,serif" font-size="14.00">I/LnJ</text>
+<text text-anchor="start" x="9162.5" y="-4819.3" font-family="Times,serif" font-size="14.00">ILS/IbgTejJ</text>
+<text text-anchor="start" x="9164" y="-4798.3" font-family="Times,serif" font-size="14.00">IS/CamRkJ</text>
+<text text-anchor="start" x="9162.5" y="-4777.3" font-family="Times,serif" font-size="14.00">ISS/IbgTejJ</text>
+<text text-anchor="start" x="9176.5" y="-4756.3" font-family="Times,serif" font-size="14.00">JF1/Ms</text>
+<text text-anchor="start" x="9178" y="-4735.3" font-family="Times,serif" font-size="14.00">KK/HlJ</text>
+<text text-anchor="start" x="9162.5" y="-4714.3" font-family="Times,serif" font-size="14.00">LEWES/EiJ</text>
+<text text-anchor="start" x="9186.5" y="-4693.3" font-family="Times,serif" font-size="14.00">LG/J</text>
+<text text-anchor="start" x="9184" y="-4672.3" font-family="Times,serif" font-size="14.00">Lline</text>
+<text text-anchor="start" x="9187.5" y="-4651.3" font-family="Times,serif" font-size="14.00">LP/J</text>
+<text text-anchor="start" x="9173.5" y="-4630.3" font-family="Times,serif" font-size="14.00">MA/MyJ</text>
+<text text-anchor="start" x="9172.5" y="-4609.3" font-family="Times,serif" font-size="14.00">MAI/Pas</text>
+<text text-anchor="start" x="9167" y="-4588.3" font-family="Times,serif" font-size="14.00">MOLF/EiJ</text>
+<text text-anchor="start" x="9164" y="-4567.3" font-family="Times,serif" font-size="14.00">MOLG/DnJ</text>
+<text text-anchor="start" x="9168.5" y="-4546.3" font-family="Times,serif" font-size="14.00">MRL/MpJ</text>
+<text text-anchor="start" x="9169.5" y="-4525.3" font-family="Times,serif" font-size="14.00">MSM/Ms</text>
+<text text-anchor="start" x="9160.5" y="-4504.3" font-family="Times,serif" font-size="14.00">NOD/ShiLtJ</text>
+<text text-anchor="start" x="9171.5" y="-4483.3" font-family="Times,serif" font-size="14.00">NON/LtJ</text>
+<text text-anchor="start" x="9172.5" y="-4462.3" font-family="Times,serif" font-size="14.00">NOR/LtJ</text>
+<text text-anchor="start" x="9167" y="-4441.3" font-family="Times,serif" font-size="14.00">NZB/BlNJ</text>
+<text text-anchor="start" x="9174" y="-4420.3" font-family="Times,serif" font-size="14.00">NZL/LtJ</text>
+<text text-anchor="start" x="9164.5" y="-4399.3" font-family="Times,serif" font-size="14.00">NZO/HlLtJ</text>
+<text text-anchor="start" x="9166.5" y="-4378.3" font-family="Times,serif" font-size="14.00">NZW/LacJ</text>
+<text text-anchor="start" x="9187" y="-4357.3" font-family="Times,serif" font-size="14.00">O20</text>
+<text text-anchor="start" x="9192" y="-4336.3" font-family="Times,serif" font-size="14.00">P/J</text>
+<text text-anchor="start" x="9169" y="-4315.3" font-family="Times,serif" font-size="14.00">PERA/EiJ</text>
+<text text-anchor="start" x="9168.5" y="-4294.3" font-family="Times,serif" font-size="14.00">PERC/EiJ</text>
+<text text-anchor="start" x="9187.5" y="-4273.3" font-family="Times,serif" font-size="14.00">PL/J</text>
+<text text-anchor="start" x="9170" y="-4252.3" font-family="Times,serif" font-size="14.00">PWD/PhJ</text>
+<text text-anchor="start" x="9170" y="-4231.3" font-family="Times,serif" font-size="14.00">PWK/PhJ</text>
+<text text-anchor="start" x="9185.5" y="-4210.3" font-family="Times,serif" font-size="14.00">Qsi5</text>
+<text text-anchor="start" x="9171.5" y="-4189.3" font-family="Times,serif" font-size="14.00">RBA/DnJ</text>
+<text text-anchor="start" x="9186.5" y="-4168.3" font-family="Times,serif" font-size="14.00">RF/J</text>
+<text text-anchor="start" x="9179" y="-4147.3" font-family="Times,serif" font-size="14.00">RIIIS/J</text>
+<text text-anchor="start" x="9171.5" y="-4126.3" font-family="Times,serif" font-size="14.00">SEA/GnJ</text>
+<text text-anchor="start" x="9171.5" y="-4105.3" font-family="Times,serif" font-size="14.00">SEG/Pas</text>
+<text text-anchor="start" x="9185" y="-4084.3" font-family="Times,serif" font-size="14.00">SJL/J</text>
+<text text-anchor="start" x="9166.5" y="-4063.3" font-family="Times,serif" font-size="14.00">SKIVE/EiJ</text>
+<text text-anchor="start" x="9185" y="-4042.3" font-family="Times,serif" font-size="14.00">SM/J</text>
+<text text-anchor="start" x="9180.5" y="-4021.3" font-family="Times,serif" font-size="14.00">SnpId</text>
+<text text-anchor="start" x="9168.5" y="-4000.3" font-family="Times,serif" font-size="14.00">SOD1/EiJ</text>
+<text text-anchor="start" x="9164.5" y="-3979.3" font-family="Times,serif" font-size="14.00">SPRET/EiJ</text>
+<text text-anchor="start" x="9183" y="-3958.3" font-family="Times,serif" font-size="14.00">ST/bJ</text>
+<text text-anchor="start" x="9179.5" y="-3937.3" font-family="Times,serif" font-size="14.00">SWR/J</text>
+<text text-anchor="start" x="9151.5" y="-3916.3" font-family="Times,serif" font-size="14.00">TALLYHO/JngJ</text>
+<text text-anchor="start" x="9172" y="-3895.3" font-family="Times,serif" font-size="14.00">WSB/EiJ</text>
+<text text-anchor="start" x="9153" y="-3874.3" font-family="Times,serif" font-size="14.00">ZALENDE/EiJ</text>
+<polygon fill="none" stroke="black" points="9110,-3866 9110,-5909 9294,-5909 9294,-3866 9110,-3866"/>
+</g>
+<!-- AccessLog -->
+<g id="node34" class="node">
+<title>AccessLog</title>
+<polygon fill="white" stroke="transparent" points="9328,-4842.5 9328,-4932.5 9482,-4932.5 9482,-4842.5 9328,-4842.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="9331,-4908.5 9331,-4929.5 9479,-4929.5 9479,-4908.5 9331,-4908.5"/>
+<polygon fill="none" stroke="black" points="9331,-4908.5 9331,-4929.5 9479,-4929.5 9479,-4908.5 9331,-4908.5"/>
+<text text-anchor="start" x="9334" y="-4915.3" font-family="Times,serif" font-size="14.00">AccessLog (46 MiB)</text>
+<text text-anchor="start" x="9365.5" y="-4893.3" font-family="Times,serif" font-size="14.00">accesstime</text>
+<text text-anchor="start" x="9398" y="-4872.3" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="9366.5" y="-4851.3" font-family="Times,serif" font-size="14.00">ip_address</text>
+<polygon fill="none" stroke="black" points="9328,-4842.5 9328,-4932.5 9482,-4932.5 9482,-4842.5 9328,-4842.5"/>
+</g>
+<!-- GeneRIF -->
+<g id="node35" class="node">
+<title>GeneRIF</title>
+<polygon fill="white" stroke="transparent" points="3576.5,-692 3576.5,-992 3709.5,-992 3709.5,-692 3576.5,-692"/>
+<polygon fill="#df65b0" stroke="transparent" points="3580,-968 3580,-989 3707,-989 3707,-968 3580,-968"/>
+<polygon fill="none" stroke="black" points="3580,-968 3580,-989 3707,-989 3707,-968 3580,-968"/>
+<text text-anchor="start" x="3583" y="-974.8" font-family="Times,serif" font-size="14.00">GeneRIF (2 MiB)</text>
+<text text-anchor="start" x="3610" y="-952.8" font-family="Times,serif" font-size="14.00">comment</text>
+<text text-anchor="start" x="3604.5" y="-931.8" font-family="Times,serif" font-size="14.00">createtime</text>
+<text text-anchor="start" x="3617.5" y="-910.8" font-family="Times,serif" font-size="14.00">display</text>
+<text text-anchor="start" x="3623.5" y="-889.8" font-family="Times,serif" font-size="14.00">email</text>
+<text text-anchor="start" x="3636" y="-868.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="3622.5" y="-847.8" font-family="Times,serif" font-size="14.00">initial</text>
+<text text-anchor="start" x="3602" y="-826.8" font-family="Times,serif" font-size="14.00">PubMed_ID</text>
+<text text-anchor="start" x="3619" y="-805.8" font-family="Times,serif" font-size="14.00">reason</text>
+<text text-anchor="start" x="3608.5" y="-784.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="3617.5" y="-763.8" font-family="Times,serif" font-size="14.00">symbol</text>
+<text text-anchor="start" x="3617.5" y="-742.8" font-family="Times,serif" font-size="14.00">user_ip</text>
+<text text-anchor="start" x="3610" y="-721.8" font-family="Times,serif" font-size="14.00">versionId</text>
+<text text-anchor="start" x="3618.5" y="-700.8" font-family="Times,serif" font-size="14.00">weburl</text>
+<polygon fill="none" stroke="black" points="3576.5,-692 3576.5,-992 3709.5,-992 3709.5,-692 3576.5,-692"/>
+</g>
+<!-- GeneRIF&#45;&gt;Species -->
+<g id="edge28" class="edge">
+<title>GeneRIF:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M3579,-788C3549.14,-788 3577.82,-543.18 3559,-520 3471.93,-412.76 3053.77,-338.32 2876.12,-311.02"/>
+<polygon fill="black" stroke="black" points="2876.46,-307.54 2866.05,-309.49 2875.41,-314.46 2876.46,-307.54"/>
+</g>
+<!-- ProbeData -->
+<g id="node36" class="node">
+<title>ProbeData</title>
+<polygon fill="white" stroke="transparent" points="5291,-1918 5291,-2008 5443,-2008 5443,-1918 5291,-1918"/>
+<polygon fill="#ce1256" stroke="transparent" points="5294,-1984 5294,-2005 5440,-2005 5440,-1984 5294,-1984"/>
+<polygon fill="none" stroke="black" points="5294,-1984 5294,-2005 5440,-2005 5440,-1984 5294,-1984"/>
+<text text-anchor="start" x="5297" y="-1990.8" font-family="Times,serif" font-size="14.00">ProbeData (10 GiB)</text>
+<text text-anchor="start" x="5359.5" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="5337.5" y="-1947.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<text text-anchor="start" x="5347.5" y="-1926.8" font-family="Times,serif" font-size="14.00">value</text>
+<polygon fill="none" stroke="black" points="5291,-1918 5291,-2008 5443,-2008 5443,-1918 5291,-1918"/>
+</g>
+<!-- ProbeData&#45;&gt;Strain -->
+<g id="edge29" class="edge">
+<title>ProbeData:StrainId&#45;&gt;Strain</title>
+<path fill="none" stroke="black" d="M5441,-1951C5461.87,-1951 5451.21,-1219.36 5459,-1200 5511.05,-1070.73 5632.85,-959.15 5712.21,-896.58"/>
+<polygon fill="black" stroke="black" points="5714.51,-899.22 5720.23,-890.3 5710.2,-893.71 5714.51,-899.22"/>
+</g>
+<!-- AvgMethod -->
+<g id="node37" class="node">
+<title>AvgMethod</title>
+<polygon fill="lightgrey" stroke="transparent" points="982.5,-786.5 982.5,-897.5 1133.5,-897.5 1133.5,-786.5 982.5,-786.5"/>
+<polygon fill="#f1eef6" stroke="transparent" points="986,-873 986,-894 1131,-894 1131,-873 986,-873"/>
+<polygon fill="none" stroke="black" points="986,-873 986,-894 1131,-894 1131,-873 986,-873"/>
+<text text-anchor="start" x="989" y="-879.8" font-family="Times,serif" font-size="14.00">AvgMethod (792 B)</text>
+<text text-anchor="start" x="1010" y="-857.8" font-family="Times,serif" font-size="14.00">AvgMethodId</text>
+<text text-anchor="start" x="1051" y="-836.8" font-family="Times,serif" font-size="14.00">Id</text>
+<polygon fill="green" stroke="transparent" points="986,-810 986,-829 1131,-829 1131,-810 986,-810"/>
+<text text-anchor="start" x="1037" y="-815.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="1007.5" y="-794.8" font-family="Times,serif" font-size="14.00">Normalization</text>
+<polygon fill="none" stroke="black" points="982.5,-786.5 982.5,-897.5 1133.5,-897.5 1133.5,-786.5 982.5,-786.5"/>
+</g>
+<!-- GeneRIFXRef -->
+<g id="node38" class="node">
+<title>GeneRIFXRef</title>
+<polygon fill="white" stroke="transparent" points="3003,-1918 3003,-2008 3175,-2008 3175,-1918 3003,-1918"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="3006,-1984 3006,-2005 3172,-2005 3172,-1984 3006,-1984"/>
+<polygon fill="none" stroke="black" points="3006,-1984 3006,-2005 3172,-2005 3172,-1984 3006,-1984"/>
+<text text-anchor="start" x="3009" y="-1990.8" font-family="Times,serif" font-size="14.00">GeneRIFXRef (82 KiB)</text>
+<text text-anchor="start" x="3030.5" y="-1968.8" font-family="Times,serif" font-size="14.00">GeneCategoryId</text>
+<text text-anchor="start" x="3050.5" y="-1947.8" font-family="Times,serif" font-size="14.00">GeneRIFId</text>
+<text text-anchor="start" x="3055.5" y="-1926.8" font-family="Times,serif" font-size="14.00">versionId</text>
+<polygon fill="none" stroke="black" points="3003,-1918 3003,-2008 3175,-2008 3175,-1918 3003,-1918"/>
+</g>
+<!-- GeneRIFXRef&#45;&gt;GeneRIF -->
+<g id="edge31" class="edge">
+<title>GeneRIFXRef:GeneRIFId&#45;&gt;GeneRIF</title>
+<path fill="none" stroke="black" d="M3173,-1951C3214.74,-1951 3168.49,-1230.49 3197,-1200 3252.21,-1140.95 3497.53,-1216.51 3559,-1164 3604.75,-1124.91 3627.15,-1064.28 3637.64,-1006.19"/>
+<polygon fill="black" stroke="black" points="3641.12,-1006.59 3639.34,-996.14 3634.22,-1005.42 3641.12,-1006.59"/>
+</g>
+<!-- GeneCategory -->
+<g id="node73" class="node">
+<title>GeneCategory</title>
+<polygon fill="white" stroke="transparent" points="3373.5,-807.5 3373.5,-876.5 3542.5,-876.5 3542.5,-807.5 3373.5,-807.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="3377,-852 3377,-873 3540,-873 3540,-852 3377,-852"/>
+<polygon fill="none" stroke="black" points="3377,-852 3377,-873 3540,-873 3540,-852 3377,-852"/>
+<text text-anchor="start" x="3380" y="-858.8" font-family="Times,serif" font-size="14.00">GeneCategory (5 KiB)</text>
+<text text-anchor="start" x="3451" y="-836.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="3437" y="-815.8" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="none" stroke="black" points="3373.5,-807.5 3373.5,-876.5 3542.5,-876.5 3542.5,-807.5 3373.5,-807.5"/>
+</g>
+<!-- GeneRIFXRef&#45;&gt;GeneCategory -->
+<g id="edge30" class="edge">
+<title>GeneRIFXRef:GeneCategoryId&#45;&gt;GeneCategory</title>
+<path fill="none" stroke="black" d="M3173,-1973C3215.97,-1973 3169.76,-1233.22 3197,-1200 3241.84,-1145.31 3299.78,-1211.69 3352,-1164 3430.43,-1092.39 3450.94,-961.62 3456.23,-891.11"/>
+<polygon fill="black" stroke="black" points="3459.75,-890.96 3456.93,-880.75 3452.77,-890.49 3459.75,-890.96"/>
+</g>
+<!-- CaseAttribute -->
+<g id="node39" class="node">
+<title>CaseAttribute</title>
+<polygon fill="lightgrey" stroke="transparent" points="1168,-797 1168,-887 1334,-887 1334,-797 1168,-797"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="1171,-863 1171,-884 1331,-884 1331,-863 1171,-863"/>
+<polygon fill="none" stroke="black" points="1171,-863 1171,-884 1331,-884 1331,-863 1171,-863"/>
+<text text-anchor="start" x="1174" y="-869.8" font-family="Times,serif" font-size="14.00">CaseAttribute (2 KiB)</text>
+<polygon fill="green" stroke="transparent" points="1171,-842 1171,-861 1331,-861 1331,-842 1171,-842"/>
+<text text-anchor="start" x="1209.5" y="-847.8" font-family="Times,serif" font-size="14.00">Description</text>
+<polygon fill="green" stroke="transparent" points="1171,-821 1171,-840 1331,-840 1331,-821 1171,-821"/>
+<text text-anchor="start" x="1243.5" y="-826.8" font-family="Times,serif" font-size="14.00">Id</text>
+<polygon fill="green" stroke="transparent" points="1171,-800 1171,-819 1331,-819 1331,-800 1171,-800"/>
+<text text-anchor="start" x="1229.5" y="-805.8" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="none" stroke="black" points="1168,-797 1168,-887 1334,-887 1334,-797 1168,-797"/>
+</g>
+<!-- Strain&#45;&gt;Species -->
+<g id="edge32" class="edge">
+<title>Strain:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M5731,-777C5128.52,-777 4994.43,-618.17 4400,-520 3817.59,-423.81 3111.33,-337.05 2876.33,-308.98"/>
+<polygon fill="black" stroke="black" points="2876.51,-305.48 2866.17,-307.77 2875.68,-312.43 2876.51,-305.48"/>
+</g>
+<!-- Probe&#45;&gt;ProbeSE -->
+<g id="edge33" class="edge">
+<title>Probe:ProbeSetId&#45;&gt;ProbeSE</title>
+<path fill="none" stroke="black" d="M6968,-3261C6999.5,-3261 7043.75,-2274.36 7054.55,-2022.15"/>
+<polygon fill="black" stroke="black" points="7058.05,-2022.23 7054.98,-2012.09 7051.06,-2021.93 7058.05,-2022.23"/>
+</g>
+<!-- ProbeFreeze&#45;&gt;InbredSet -->
+<g id="edge34" class="edge">
+<title>ProbeFreeze:InbredSetId&#45;&gt;InbredSet</title>
+<path fill="none" stroke="black" d="M2775,-1951C2816.74,-1951 2764.69,-1229.71 2794,-1200 2866.79,-1126.23 3641.27,-1223.68 3726,-1164 3778.21,-1127.22 3809.31,-1065.62 3827.82,-1006.16"/>
+<polygon fill="black" stroke="black" points="3831.27,-1006.83 3830.79,-996.25 3824.56,-1004.82 3831.27,-1006.83"/>
+</g>
+<!-- ProbeFreeze&#45;&gt;Tissue -->
+<g id="edge35" class="edge">
+<title>ProbeFreeze:TissueId&#45;&gt;Tissue</title>
+<path fill="none" stroke="black" d="M2613,-1867C2575.92,-1867 2609.31,-1231.02 2589,-1200 2568.75,-1169.06 2537.32,-1192.7 2514,-1164 2463.47,-1101.8 2444.56,-1011.96 2437.81,-943.13"/>
+<polygon fill="black" stroke="black" points="2441.29,-942.77 2436.9,-933.13 2434.32,-943.41 2441.29,-942.77"/>
+</g>
+<!-- BXDSnpPosition -->
+<g id="node43" class="node">
+<title>BXDSnpPosition</title>
+<polygon fill="white" stroke="transparent" points="5476.5,-1886.5 5476.5,-2039.5 5681.5,-2039.5 5681.5,-1886.5 5476.5,-1886.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="5480,-2015 5480,-2036 5679,-2036 5679,-2015 5480,-2015"/>
+<polygon fill="none" stroke="black" points="5480,-2015 5480,-2036 5679,-2036 5679,-2015 5480,-2015"/>
+<text text-anchor="start" x="5483" y="-2021.8" font-family="Times,serif" font-size="14.00">BXDSnpPosition (230 MiB)</text>
+<text text-anchor="start" x="5566" y="-1999.8" font-family="Times,serif" font-size="14.00">Chr</text>
+<text text-anchor="start" x="5572.5" y="-1978.8" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="5567.5" y="-1957.8" font-family="Times,serif" font-size="14.00">Mb</text>
+<text text-anchor="start" x="5546" y="-1936.8" font-family="Times,serif" font-size="14.00">Mb_2016</text>
+<text text-anchor="start" x="5545.5" y="-1915.8" font-family="Times,serif" font-size="14.00">StrainId1</text>
+<text text-anchor="start" x="5545.5" y="-1894.8" font-family="Times,serif" font-size="14.00">StrainId2</text>
+<polygon fill="none" stroke="black" points="5476.5,-1886.5 5476.5,-2039.5 5681.5,-2039.5 5681.5,-1886.5 5476.5,-1886.5"/>
+</g>
+<!-- BXDSnpPosition&#45;&gt;Strain -->
+<g id="edge36" class="edge">
+<title>BXDSnpPosition:StrainId1&#45;&gt;Strain</title>
+<path fill="none" stroke="black" d="M5680,-1919C5699.98,-1919 5696.36,-1219.8 5699,-1200 5711.36,-1107.45 5738.02,-1004.03 5758.6,-932.42"/>
+<polygon fill="black" stroke="black" points="5762.04,-933.11 5761.46,-922.54 5755.32,-931.17 5762.04,-933.11"/>
+</g>
+<!-- BXDSnpPosition&#45;&gt;Strain -->
+<g id="edge37" class="edge">
+<title>BXDSnpPosition:StrainId2&#45;&gt;Strain</title>
+<path fill="none" stroke="black" d="M5680,-1898C5699.4,-1898 5696.43,-1219.22 5699,-1200 5711.39,-1107.46 5738.05,-1004.03 5758.62,-932.43"/>
+<polygon fill="black" stroke="black" points="5762.06,-933.12 5761.48,-922.54 5755.34,-931.17 5762.06,-933.12"/>
+</g>
+<!-- GeneRIF_BASIC -->
+<g id="node44" class="node">
+<title>GeneRIF_BASIC</title>
+<polygon fill="white" stroke="transparent" points="531.5,-744.5 531.5,-939.5 734.5,-939.5 734.5,-744.5 531.5,-744.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="535,-915 535,-936 732,-936 732,-915 535,-915"/>
+<polygon fill="none" stroke="black" points="535,-915 535,-936 732,-936 732,-915 535,-915"/>
+<text text-anchor="start" x="538" y="-921.8" font-family="Times,serif" font-size="14.00">GeneRIF_BASIC (275 MiB)</text>
+<text text-anchor="start" x="600" y="-899.8" font-family="Times,serif" font-size="14.00">comment</text>
+<text text-anchor="start" x="594.5" y="-878.8" font-family="Times,serif" font-size="14.00">createtime</text>
+<text text-anchor="start" x="607.5" y="-857.8" font-family="Times,serif" font-size="14.00">GeneId</text>
+<text text-anchor="start" x="592" y="-836.8" font-family="Times,serif" font-size="14.00">PubMed_ID</text>
+<text text-anchor="start" x="598.5" y="-815.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="607.5" y="-794.8" font-family="Times,serif" font-size="14.00">symbol</text>
+<text text-anchor="start" x="612.5" y="-773.8" font-family="Times,serif" font-size="14.00">TaxID</text>
+<text text-anchor="start" x="599.5" y="-752.8" font-family="Times,serif" font-size="14.00">VersionId</text>
+<polygon fill="none" stroke="black" points="531.5,-744.5 531.5,-939.5 734.5,-939.5 734.5,-744.5 531.5,-744.5"/>
+</g>
+<!-- GeneRIF_BASIC&#45;&gt;Species -->
+<g id="edge38" class="edge">
+<title>GeneRIF_BASIC:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M733,-819C766.29,-819 728.98,-544.05 752,-520 890.33,-375.45 2354.35,-314.96 2715.71,-302.17"/>
+<polygon fill="black" stroke="black" points="2715.96,-305.66 2725.83,-301.81 2715.71,-298.67 2715.96,-305.66"/>
+</g>
+<!-- GeneList_rn33 -->
+<g id="node45" class="node">
+<title>GeneList_rn33</title>
+<polygon fill="white" stroke="transparent" points="9516.5,-4737.5 9516.5,-5037.5 9691.5,-5037.5 9691.5,-4737.5 9516.5,-4737.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="9520,-5013.5 9520,-5034.5 9689,-5034.5 9689,-5013.5 9520,-5013.5"/>
+<polygon fill="none" stroke="black" points="9520,-5013.5 9520,-5034.5 9689,-5034.5 9689,-5013.5 9520,-5013.5"/>
+<text text-anchor="start" x="9523" y="-5020.3" font-family="Times,serif" font-size="14.00">GeneList_rn33 (2 MiB)</text>
+<text text-anchor="start" x="9578" y="-4998.3" font-family="Times,serif" font-size="14.00">cdsEnd</text>
+<text text-anchor="start" x="9574" y="-4977.3" font-family="Times,serif" font-size="14.00">cdsStart</text>
+<text text-anchor="start" x="9559" y="-4956.3" font-family="Times,serif" font-size="14.00">chromosome</text>
+<text text-anchor="start" x="9566" y="-4935.3" font-family="Times,serif" font-size="14.00">exonCount</text>
+<text text-anchor="start" x="9569.5" y="-4914.3" font-family="Times,serif" font-size="14.00">exonEnds</text>
+<text text-anchor="start" x="9565" y="-4893.3" font-family="Times,serif" font-size="14.00">exonStarts</text>
+<text text-anchor="start" x="9560.5" y="-4872.3" font-family="Times,serif" font-size="14.00">geneSymbol</text>
+<text text-anchor="start" x="9597.5" y="-4851.3" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="9587.5" y="-4830.3" font-family="Times,serif" font-size="14.00">kgID</text>
+<text text-anchor="start" x="9579.5" y="-4809.3" font-family="Times,serif" font-size="14.00">NM_ID</text>
+<text text-anchor="start" x="9581" y="-4788.3" font-family="Times,serif" font-size="14.00">strand</text>
+<text text-anchor="start" x="9583" y="-4767.3" font-family="Times,serif" font-size="14.00">txEnd</text>
+<text text-anchor="start" x="9578.5" y="-4746.3" font-family="Times,serif" font-size="14.00">txStart</text>
+<polygon fill="none" stroke="black" points="9516.5,-4737.5 9516.5,-5037.5 9691.5,-5037.5 9691.5,-4737.5 9516.5,-4737.5"/>
+</g>
+<!-- Geno&#45;&gt;Species -->
+<g id="edge39" class="edge">
+<title>Geno:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M4247,-704C4089.83,-704 4091.63,-576.6 3945,-520 3561.93,-372.13 3067.37,-320.3 2876.27,-305.04"/>
+<polygon fill="black" stroke="black" points="2876.28,-301.52 2866.03,-304.23 2875.73,-308.5 2876.28,-301.52"/>
+</g>
+<!-- Organizations -->
+<g id="node47" class="node">
+<title>Organizations</title>
+<polygon fill="white" stroke="transparent" points="90,-4 90,-73 256,-73 256,-4 90,-4"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="93,-48.5 93,-69.5 253,-69.5 253,-48.5 93,-48.5"/>
+<polygon fill="none" stroke="black" points="93,-48.5 93,-69.5 253,-69.5 253,-48.5 93,-48.5"/>
+<text text-anchor="start" x="96" y="-55.3" font-family="Times,serif" font-size="14.00">Organizations (3 KiB)</text>
+<text text-anchor="start" x="119.5" y="-33.3" font-family="Times,serif" font-size="14.00">OrganizationId</text>
+<text text-anchor="start" x="105.5" y="-12.3" font-family="Times,serif" font-size="14.00">OrganizationName</text>
+<polygon fill="none" stroke="black" points="90,-4 90,-73 256,-73 256,-4 90,-4"/>
+</g>
+<!-- StrainXRef -->
+<g id="node48" class="node">
+<title>StrainXRef</title>
+<polygon fill="white" stroke="transparent" points="4871,-1897 4871,-2029 5019,-2029 5019,-1897 4871,-1897"/>
+<polygon fill="#df65b0" stroke="transparent" points="4874,-2005 4874,-2026 5016,-2026 5016,-2005 4874,-2005"/>
+<polygon fill="none" stroke="black" points="4874,-2005 4874,-2026 5016,-2026 5016,-2005 4874,-2005"/>
+<text text-anchor="start" x="4877" y="-2011.8" font-family="Times,serif" font-size="14.00">StrainXRef (1 MiB)</text>
+<text text-anchor="start" x="4902" y="-1989.8" font-family="Times,serif" font-size="14.00">InbredSetId</text>
+<text text-anchor="start" x="4916.5" y="-1968.8" font-family="Times,serif" font-size="14.00">OrderId</text>
+<text text-anchor="start" x="4890" y="-1947.8" font-family="Times,serif" font-size="14.00">PedigreeStatus</text>
+<text text-anchor="start" x="4915.5" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<text text-anchor="start" x="4878.5" y="-1905.8" font-family="Times,serif" font-size="14.00">Used_for_mapping</text>
+<polygon fill="none" stroke="black" points="4871,-1897 4871,-2029 5019,-2029 5019,-1897 4871,-1897"/>
+</g>
+<!-- StrainXRef&#45;&gt;InbredSet -->
+<g id="edge40" class="edge">
+<title>StrainXRef:InbredSetId&#45;&gt;InbredSet</title>
+<path fill="none" stroke="black" d="M4873,-1994C4828.88,-1994 4884.67,-1231.72 4854,-1200 4805.57,-1149.92 4292.6,-1190.1 4228,-1164 4115.23,-1118.43 4012.54,-1024.28 3943.58,-949.66"/>
+<polygon fill="black" stroke="black" points="3945.94,-947.05 3936.6,-942.05 3940.78,-951.79 3945.94,-947.05"/>
+</g>
+<!-- StrainXRef&#45;&gt;Strain -->
+<g id="edge41" class="edge">
+<title>StrainXRef:StrainId&#45;&gt;Strain</title>
+<path fill="none" stroke="black" d="M5017,-1930C5057.58,-1930 5018.82,-1233.98 5041,-1200 5195.5,-963.36 5553.55,-879.5 5710.26,-853.43"/>
+<polygon fill="black" stroke="black" points="5710.98,-856.86 5720.28,-851.79 5709.85,-849.95 5710.98,-856.86"/>
+</g>
+<!-- SnpSource -->
+<g id="node49" class="node">
+<title>SnpSource</title>
+<polygon fill="white" stroke="transparent" points="9726,-4832 9726,-4943 9870,-4943 9870,-4832 9726,-4832"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="9729,-4918.5 9729,-4939.5 9867,-4939.5 9867,-4918.5 9729,-4918.5"/>
+<polygon fill="none" stroke="black" points="9729,-4918.5 9729,-4939.5 9867,-4939.5 9867,-4918.5 9729,-4918.5"/>
+<text text-anchor="start" x="9732" y="-4925.3" font-family="Times,serif" font-size="14.00">SnpSource (1 KiB)</text>
+<text text-anchor="start" x="9758.5" y="-4903.3" font-family="Times,serif" font-size="14.00">DateAdded</text>
+<text text-anchor="start" x="9752.5" y="-4882.3" font-family="Times,serif" font-size="14.00">DateCreated</text>
+<text text-anchor="start" x="9790.5" y="-4861.3" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="9776.5" y="-4840.3" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="none" stroke="black" points="9726,-4832 9726,-4943 9870,-4943 9870,-4832 9726,-4832"/>
+</g>
+<!-- user_openids -->
+<g id="node50" class="node">
+<title>user_openids</title>
+<polygon fill="white" stroke="transparent" points="9904.5,-4853 9904.5,-4922 10049.5,-4922 10049.5,-4853 9904.5,-4853"/>
+<polygon fill="#f1eef6" stroke="transparent" points="9908,-4897.5 9908,-4918.5 10047,-4918.5 10047,-4897.5 9908,-4897.5"/>
+<polygon fill="none" stroke="black" points="9908,-4897.5 9908,-4918.5 10047,-4918.5 10047,-4897.5 9908,-4897.5"/>
+<text text-anchor="start" x="9911" y="-4904.3" font-family="Times,serif" font-size="14.00">user_openids (0 B)</text>
+<text text-anchor="start" x="9939.5" y="-4882.3" font-family="Times,serif" font-size="14.00">openid_url</text>
+<text text-anchor="start" x="9951.5" y="-4861.3" font-family="Times,serif" font-size="14.00">user_id</text>
+<polygon fill="none" stroke="black" points="9904.5,-4853 9904.5,-4922 10049.5,-4922 10049.5,-4853 9904.5,-4853"/>
+</g>
+<!-- GeneMap_cuiyan -->
+<g id="node51" class="node">
+<title>GeneMap_cuiyan</title>
+<polygon fill="white" stroke="transparent" points="10084,-4832 10084,-4943 10290,-4943 10290,-4832 10084,-4832"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="10087,-4918.5 10087,-4939.5 10287,-4939.5 10287,-4918.5 10087,-4918.5"/>
+<polygon fill="none" stroke="black" points="10087,-4918.5 10087,-4939.5 10287,-4939.5 10287,-4918.5 10087,-4918.5"/>
+<text text-anchor="start" x="10090" y="-4925.3" font-family="Times,serif" font-size="14.00">GeneMap_cuiyan (376 KiB)</text>
+<text text-anchor="start" x="10160" y="-4903.3" font-family="Times,serif" font-size="14.00">GeneID</text>
+<text text-anchor="start" x="10180" y="-4882.3" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="10160" y="-4861.3" font-family="Times,serif" font-size="14.00">Symbol</text>
+<text text-anchor="start" x="10141.5" y="-4840.3" font-family="Times,serif" font-size="14.00">TranscriptID</text>
+<polygon fill="none" stroke="black" points="10084,-4832 10084,-4943 10290,-4943 10290,-4832 10084,-4832"/>
+</g>
+<!-- InfoFilesUser_md5 -->
+<g id="node52" class="node">
+<title>InfoFilesUser_md5</title>
+<polygon fill="white" stroke="transparent" points="10324,-4853 10324,-4922 10520,-4922 10520,-4853 10324,-4853"/>
+<polygon fill="#f1eef6" stroke="transparent" points="10327,-4897.5 10327,-4918.5 10517,-4918.5 10517,-4897.5 10327,-4897.5"/>
+<polygon fill="none" stroke="black" points="10327,-4897.5 10327,-4918.5 10517,-4918.5 10517,-4897.5 10327,-4897.5"/>
+<text text-anchor="start" x="10330" y="-4904.3" font-family="Times,serif" font-size="14.00">InfoFilesUser_md5 (96 B)</text>
+<text text-anchor="start" x="10387.5" y="-4882.3" font-family="Times,serif" font-size="14.00">Password</text>
+<text text-anchor="start" x="10385" y="-4861.3" font-family="Times,serif" font-size="14.00">Username</text>
+<polygon fill="none" stroke="black" points="10324,-4853 10324,-4922 10520,-4922 10520,-4853 10324,-4853"/>
+</g>
+<!-- PublishXRef -->
+<g id="node53" class="node">
+<title>PublishXRef</title>
+<polygon fill="lightgrey" stroke="transparent" points="2811.5,-1834 2811.5,-2092 2968.5,-2092 2968.5,-1834 2811.5,-1834"/>
+<polygon fill="#df65b0" stroke="transparent" points="2815,-2068 2815,-2089 2966,-2089 2966,-2068 2815,-2068"/>
+<polygon fill="none" stroke="black" points="2815,-2068 2815,-2089 2966,-2089 2966,-2068 2815,-2068"/>
+<text text-anchor="start" x="2818" y="-2074.8" font-family="Times,serif" font-size="14.00">PublishXRef (2 MiB)</text>
+<text text-anchor="start" x="2861.5" y="-2052.8" font-family="Times,serif" font-size="14.00">additive</text>
+<text text-anchor="start" x="2853.5" y="-2031.8" font-family="Times,serif" font-size="14.00">comments</text>
+<text text-anchor="start" x="2866" y="-2010.8" font-family="Times,serif" font-size="14.00">DataId</text>
+<polygon fill="green" stroke="transparent" points="2815,-1984 2815,-2003 2966,-2003 2966,-1984 2815,-1984"/>
+<text text-anchor="start" x="2883" y="-1989.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="2847.5" y="-1968.8" font-family="Times,serif" font-size="14.00">InbredSetId</text>
+<text text-anchor="start" x="2869.5" y="-1947.8" font-family="Times,serif" font-size="14.00">Locus</text>
+<text text-anchor="start" x="2875.5" y="-1926.8" font-family="Times,serif" font-size="14.00">LRS</text>
+<text text-anchor="start" x="2870.5" y="-1905.8" font-family="Times,serif" font-size="14.00">mean</text>
+<text text-anchor="start" x="2845" y="-1884.8" font-family="Times,serif" font-size="14.00">PhenotypeId</text>
+<polygon fill="green" stroke="transparent" points="2815,-1858 2815,-1877 2966,-1877 2966,-1858 2815,-1858"/>
+<text text-anchor="start" x="2843" y="-1863.8" font-family="Times,serif" font-size="14.00">PublicationId</text>
+<text text-anchor="start" x="2855.5" y="-1842.8" font-family="Times,serif" font-size="14.00">Sequence</text>
+<polygon fill="none" stroke="black" points="2811.5,-1834 2811.5,-2092 2968.5,-2092 2968.5,-1834 2811.5,-1834"/>
+</g>
+<!-- PublishXRef&#45;&gt;Publication -->
+<g id="edge44" class="edge">
+<title>PublishXRef:PublicationId&#45;&gt;Publication</title>
+<path fill="none" stroke="black" d="M2814,-1867C2776.93,-1867 2815.52,-1230.19 2794,-1200 2767.79,-1163.23 2729.57,-1197.23 2699,-1164 2651.77,-1112.67 2628.61,-1038.69 2617.34,-974.68"/>
+<polygon fill="black" stroke="black" points="2620.73,-973.78 2615.62,-964.5 2613.83,-974.94 2620.73,-973.78"/>
+</g>
+<!-- PublishXRef&#45;&gt;InbredSet -->
+<g id="edge42" class="edge">
+<title>PublishXRef:InbredSetId&#45;&gt;InbredSet</title>
+<path fill="none" stroke="black" d="M2967,-1973C3009.96,-1973 2955.99,-1230.74 2986,-1200 3043.5,-1141.1 3658.94,-1211.74 3726,-1164 3777.91,-1127.05 3808.95,-1065.59 3827.5,-1006.29"/>
+<polygon fill="black" stroke="black" points="3830.95,-1006.99 3830.49,-996.41 3824.25,-1004.97 3830.95,-1006.99"/>
+</g>
+<!-- PublishXRef&#45;&gt;Phenotype -->
+<g id="edge43" class="edge">
+<title>PublishXRef:PhenotypeId&#45;&gt;Phenotype</title>
+<path fill="none" stroke="black" d="M2967,-1888C2986.12,-1888 2984.78,-1219.08 2986,-1200 2990.55,-1129.04 2998.2,-1050.39 3005.28,-985.01"/>
+<polygon fill="black" stroke="black" points="3008.76,-985.37 3006.37,-975.05 3001.8,-984.61 3008.76,-985.37"/>
+</g>
+<!-- RatSnpPattern -->
+<g id="node54" class="node">
+<title>RatSnpPattern</title>
+<polygon fill="white" stroke="transparent" points="10554,-4517 10554,-5258 10748,-5258 10748,-4517 10554,-4517"/>
+<polygon fill="#df65b0" stroke="transparent" points="10557,-5233.5 10557,-5254.5 10745,-5254.5 10745,-5233.5 10557,-5233.5"/>
+<polygon fill="none" stroke="black" points="10557,-5233.5 10557,-5254.5 10745,-5254.5 10745,-5233.5 10557,-5233.5"/>
+<text text-anchor="start" x="10560" y="-5240.3" font-family="Times,serif" font-size="14.00">RatSnpPattern (202 MiB)</text>
+<text text-anchor="start" x="10638" y="-5218.3" font-family="Times,serif" font-size="14.00">ACI</text>
+<text text-anchor="start" x="10628.5" y="-5197.3" font-family="Times,serif" font-size="14.00">ACI_N</text>
+<text text-anchor="start" x="10629.5" y="-5176.3" font-family="Times,serif" font-size="14.00">BBDP</text>
+<text text-anchor="start" x="10639.5" y="-5155.3" font-family="Times,serif" font-size="14.00">BN</text>
+<text text-anchor="start" x="10630" y="-5134.3" font-family="Times,serif" font-size="14.00">BN_N</text>
+<text text-anchor="start" x="10625" y="-5113.3" font-family="Times,serif" font-size="14.00">BUF_N</text>
+<text text-anchor="start" x="10632.5" y="-5092.3" font-family="Times,serif" font-size="14.00">F344</text>
+<text text-anchor="start" x="10623" y="-5071.3" font-family="Times,serif" font-size="14.00">F344_N</text>
+<text text-anchor="start" x="10634" y="-5050.3" font-family="Times,serif" font-size="14.00">FHH</text>
+<text text-anchor="start" x="10635.5" y="-5029.3" font-family="Times,serif" font-size="14.00">FHL</text>
+<text text-anchor="start" x="10640" y="-5008.3" font-family="Times,serif" font-size="14.00">GK</text>
+<text text-anchor="start" x="10643.5" y="-4987.3" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="10641" y="-4966.3" font-family="Times,serif" font-size="14.00">LE</text>
+<text text-anchor="start" x="10634" y="-4945.3" font-family="Times,serif" font-size="14.00">LEW</text>
+<text text-anchor="start" x="10640" y="-4924.3" font-family="Times,serif" font-size="14.00">LH</text>
+<text text-anchor="start" x="10641.5" y="-4903.3" font-family="Times,serif" font-size="14.00">LL</text>
+<text text-anchor="start" x="10640" y="-4882.3" font-family="Times,serif" font-size="14.00">LN</text>
+<text text-anchor="start" x="10620.5" y="-4861.3" font-family="Times,serif" font-size="14.00">M520_N</text>
+<text text-anchor="start" x="10632.5" y="-4840.3" font-family="Times,serif" font-size="14.00">MHS</text>
+<text text-anchor="start" x="10632.5" y="-4819.3" font-family="Times,serif" font-size="14.00">MNS</text>
+<text text-anchor="start" x="10629" y="-4798.3" font-family="Times,serif" font-size="14.00">MR_N</text>
+<text text-anchor="start" x="10634.5" y="-4777.3" font-family="Times,serif" font-size="14.00">SBH</text>
+<text text-anchor="start" x="10634.5" y="-4756.3" font-family="Times,serif" font-size="14.00">SBN</text>
+<text text-anchor="start" x="10634.5" y="-4735.3" font-family="Times,serif" font-size="14.00">SHR</text>
+<text text-anchor="start" x="10625" y="-4714.3" font-family="Times,serif" font-size="14.00">SHRSP</text>
+<text text-anchor="start" x="10629.5" y="-4693.3" font-family="Times,serif" font-size="14.00">SnpId</text>
+<text text-anchor="start" x="10640.5" y="-4672.3" font-family="Times,serif" font-size="14.00">SR</text>
+<text text-anchor="start" x="10641.5" y="-4651.3" font-family="Times,serif" font-size="14.00">SS</text>
+<text text-anchor="start" x="10633.5" y="-4630.3" font-family="Times,serif" font-size="14.00">WAG</text>
+<text text-anchor="start" x="10634" y="-4609.3" font-family="Times,serif" font-size="14.00">WKY</text>
+<text text-anchor="start" x="10625" y="-4588.3" font-family="Times,serif" font-size="14.00">WKY_N</text>
+<text text-anchor="start" x="10636.5" y="-4567.3" font-family="Times,serif" font-size="14.00">WLI</text>
+<text text-anchor="start" x="10634" y="-4546.3" font-family="Times,serif" font-size="14.00">WMI</text>
+<text text-anchor="start" x="10628" y="-4525.3" font-family="Times,serif" font-size="14.00">WN_N</text>
+<polygon fill="none" stroke="black" points="10554,-4517 10554,-5258 10748,-5258 10748,-4517 10554,-4517"/>
+</g>
+<!-- Genbank -->
+<g id="node55" class="node">
+<title>Genbank</title>
+<polygon fill="white" stroke="transparent" points="769,-797 769,-887 911,-887 911,-797 769,-797"/>
+<polygon fill="#df65b0" stroke="transparent" points="772,-863 772,-884 908,-884 908,-863 772,-863"/>
+<polygon fill="none" stroke="black" points="772,-863 772,-884 908,-884 908,-863 772,-863"/>
+<text text-anchor="start" x="775" y="-869.8" font-family="Times,serif" font-size="14.00">Genbank (37 MiB)</text>
+<text text-anchor="start" x="832.5" y="-847.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="805" y="-826.8" font-family="Times,serif" font-size="14.00">Sequence</text>
+<text text-anchor="start" x="805" y="-805.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<polygon fill="none" stroke="black" points="769,-797 769,-887 911,-887 911,-797 769,-797"/>
+</g>
+<!-- Genbank&#45;&gt;Species -->
+<g id="edge45" class="edge">
+<title>Genbank:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M909,-809C941.22,-809 910.62,-543.18 933,-520 1058.95,-389.57 2375.45,-319.21 2715.96,-303.1"/>
+<polygon fill="black" stroke="black" points="2716.17,-306.6 2725.99,-302.63 2715.84,-299.61 2716.17,-306.6"/>
+</g>
+<!-- EnsemblChip -->
+<g id="node56" class="node">
+<title>EnsemblChip</title>
+<polygon fill="white" stroke="transparent" points="1780.5,-786.5 1780.5,-897.5 1945.5,-897.5 1945.5,-786.5 1780.5,-786.5"/>
+<polygon fill="#f1eef6" stroke="transparent" points="1784,-873 1784,-894 1943,-894 1943,-873 1784,-873"/>
+<polygon fill="none" stroke="black" points="1784,-873 1784,-894 1943,-894 1943,-873 1784,-873"/>
+<text text-anchor="start" x="1787" y="-879.8" font-family="Times,serif" font-size="14.00">EnsemblChip (296 B)</text>
+<text text-anchor="start" x="1856" y="-857.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="1842" y="-836.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="1815" y="-815.8" font-family="Times,serif" font-size="14.00">ProbeSetSize</text>
+<text text-anchor="start" x="1846" y="-794.8" font-family="Times,serif" font-size="14.00">Type</text>
+<polygon fill="none" stroke="black" points="1780.5,-786.5 1780.5,-897.5 1945.5,-897.5 1945.5,-786.5 1780.5,-786.5"/>
+</g>
+<!-- LCorrRamin3 -->
+<g id="node57" class="node">
+<title>LCorrRamin3</title>
+<polygon fill="white" stroke="transparent" points="10782.5,-4842.5 10782.5,-4932.5 10945.5,-4932.5 10945.5,-4842.5 10782.5,-4842.5"/>
+<polygon fill="#ce1256" stroke="transparent" points="10786,-4908.5 10786,-4929.5 10943,-4929.5 10943,-4908.5 10786,-4908.5"/>
+<polygon fill="none" stroke="black" points="10786,-4908.5 10786,-4929.5 10943,-4929.5 10943,-4908.5 10786,-4908.5"/>
+<text text-anchor="start" x="10789" y="-4915.3" font-family="Times,serif" font-size="14.00">LCorrRamin3 (2 GiB)</text>
+<text text-anchor="start" x="10834" y="-4893.3" font-family="Times,serif" font-size="14.00">GeneId1</text>
+<text text-anchor="start" x="10834" y="-4872.3" font-family="Times,serif" font-size="14.00">GeneId2</text>
+<text text-anchor="start" x="10845" y="-4851.3" font-family="Times,serif" font-size="14.00">value</text>
+<polygon fill="none" stroke="black" points="10782.5,-4842.5 10782.5,-4932.5 10945.5,-4932.5 10945.5,-4842.5 10782.5,-4842.5"/>
+</g>
+<!-- UserPrivilege -->
+<g id="node59" class="node">
+<title>UserPrivilege</title>
+<polygon fill="white" stroke="transparent" points="7239,-4842.5 7239,-4932.5 7407,-4932.5 7407,-4842.5 7239,-4842.5"/>
+<polygon fill="#f1eef6" stroke="transparent" points="7242,-4908.5 7242,-4929.5 7404,-4929.5 7404,-4908.5 7242,-4908.5"/>
+<polygon fill="none" stroke="black" points="7242,-4908.5 7242,-4929.5 7404,-4929.5 7404,-4908.5 7242,-4908.5"/>
+<text text-anchor="start" x="7245" y="-4915.3" font-family="Times,serif" font-size="14.00">UserPrivilege (224 B)</text>
+<text text-anchor="start" x="7246.5" y="-4893.3" font-family="Times,serif" font-size="14.00">download_result_priv</text>
+<text text-anchor="start" x="7258" y="-4872.3" font-family="Times,serif" font-size="14.00">ProbeSetFreezeId</text>
+<text text-anchor="start" x="7298.5" y="-4851.3" font-family="Times,serif" font-size="14.00">UserId</text>
+<polygon fill="none" stroke="black" points="7239,-4842.5 7239,-4932.5 7407,-4932.5 7407,-4842.5 7239,-4842.5"/>
+</g>
+<!-- UserPrivilege&#45;&gt;User -->
+<g id="edge46" class="edge">
+<title>UserPrivilege:UserId&#45;&gt;User</title>
+<path fill="none" stroke="black" d="M7323,-4844.5C7323,-4319.22 7309.04,-3693.9 7302.41,-3426.66"/>
+<polygon fill="black" stroke="black" points="7305.91,-3426.44 7302.16,-3416.53 7298.91,-3426.61 7305.91,-3426.44"/>
+</g>
+<!-- GeneChip -->
+<g id="node61" class="node">
+<title>GeneChip</title>
+<polygon fill="lightgrey" stroke="transparent" points="1980,-744.5 1980,-939.5 2116,-939.5 2116,-744.5 1980,-744.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="1983,-915 1983,-936 2113,-936 2113,-915 1983,-915"/>
+<polygon fill="none" stroke="black" points="1983,-915 1983,-936 2113,-936 2113,-915 1983,-915"/>
+<text text-anchor="start" x="1986" y="-921.8" font-family="Times,serif" font-size="14.00">GeneChip (9 KiB)</text>
[sql.svg, continued: Graphviz-generated entity-relationship diagram of the GeneNetwork MariaDB schema. This stretch of the SVG draws the tail of the GeneChip node (GeneChipId, GeneChipName, GeoPlatform, GO_tree_value, Id, Name, SpeciesId, Title) and the table nodes IndelXRef (1 MiB), user (64 KiB), PublishSE (3 MiB), EnsemblProbe (94 MiB), InfoFiles (4 MiB), Vlookup (120 KiB), user_collection (60 KiB), pubmedsearch (586 MiB), EnsemblProbeLocation (99 MiB), ProbeSetSE (7 GiB), TableComments (34 KiB), Dataset_mbat (764 B), CaseAttributeXRefNew (5 MiB), GenoCode (40 B), Temp (1 MiB), GenoData (10 GiB), ProbeSetData (62 GiB), CeleraINFO_mm6 (780 MiB), TableFieldAnnotation (43 KiB), ProbeSet (2 GiB), GenoFile (332 B), TempData (11 MiB), CaseAttributeXRef (753 KiB), temporary (4 MiB) and the opening of Chr_Length, each node listing the table's columns under a header that gives its on-disk size. It also draws the foreign-key edges for these nodes and for the neighbouring Investigators, ProbeSE, GenoFreeze and ProbeSetFreeze nodes, pointing into Species, Strain, Datasets, InbredSet, AvgMethod, GeneChip, Tissue, InfoFiles, Probe, Organizations, CaseAttribute, NStrain, Genbank, ProbeFreeze and ProbeSetFreeze (for example InfoFiles.SpeciesId → Species, ProbeSetData.StrainId → Strain, CaseAttributeXRef.ProbeSetFreezeId → ProbeSetFreeze).]
+<polygon fill="none" stroke="black" points="1371,-894 1371,-915 1515,-915 1515,-894 1371,-894"/>
+<text text-anchor="start" x="1374" y="-900.8" font-family="Times,serif" font-size="14.00">Chr_Length (2 KiB)</text>
+<text text-anchor="start" x="1417.5" y="-878.8" font-family="Times,serif" font-size="14.00">Length</text>
+<text text-anchor="start" x="1396" y="-857.8" font-family="Times,serif" font-size="14.00">Length_2016</text>
+<text text-anchor="start" x="1396" y="-836.8" font-family="Times,serif" font-size="14.00">Length_mm8</text>
+<text text-anchor="start" x="1421.5" y="-815.8" font-family="Times,serif" font-size="14.00">Name</text>
+<text text-anchor="start" x="1414.5" y="-794.8" font-family="Times,serif" font-size="14.00">OrderId</text>
+<text text-anchor="start" x="1408" y="-773.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<polygon fill="none" stroke="black" points="1368,-765.5 1368,-918.5 1518,-918.5 1518,-765.5 1368,-765.5"/>
+</g>
+<!-- Chr_Length&#45;&gt;Species -->
+<g id="edge84" class="edge">
+<title>Chr_Length:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M1516,-777C1544.63,-777 1515.78,-541.23 1535,-520 1694.07,-344.29 2463.44,-308.31 2715.71,-301.19"/>
+<polygon fill="black" stroke="black" points="2716,-304.69 2725.9,-300.91 2715.81,-297.69 2716,-304.69"/>
+</g>
+<!-- GenoSE -->
+<g id="node93" class="node">
+<title>GenoSE</title>
+<polygon fill="white" stroke="transparent" points="6848.5,-1918 6848.5,-2008 6957.5,-2008 6957.5,-1918 6848.5,-1918"/>
+<polygon fill="#f1eef6" stroke="transparent" points="6852,-1984 6852,-2005 6955,-2005 6955,-1984 6852,-1984"/>
+<polygon fill="none" stroke="black" points="6852,-1984 6852,-2005 6955,-2005 6955,-1984 6852,-1984"/>
+<text text-anchor="start" x="6855" y="-1990.8" font-family="Times,serif" font-size="14.00">GenoSE (0 B)</text>
+<text text-anchor="start" x="6879" y="-1968.8" font-family="Times,serif" font-size="14.00">DataId</text>
+<text text-anchor="start" x="6885" y="-1947.8" font-family="Times,serif" font-size="14.00">error</text>
+<text text-anchor="start" x="6874" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text>
+<polygon fill="none" stroke="black" points="6848.5,-1918 6848.5,-2008 6957.5,-2008 6957.5,-1918 6848.5,-1918"/>
+</g>
+<!-- GenoSE&#45;&gt;Strain -->
+<g id="edge85" class="edge">
+<title>GenoSE:StrainId&#45;&gt;Strain</title>
+<path fill="none" stroke="black" d="M6851,-1930C6810.42,-1930 6850.14,-1232.62 6826,-1200 6591.69,-883.44 6059.6,-845.25 5861.86,-842.35"/>
+<polygon fill="black" stroke="black" points="5861.61,-838.85 5851.57,-842.23 5861.53,-845.85 5861.61,-838.85"/>
+</g>
+<!-- ProbeH2 -->
+<g id="node94" class="node">
+<title>ProbeH2</title>
+<polygon fill="white" stroke="transparent" points="5788.5,-4832 5788.5,-4943 5921.5,-4943 5921.5,-4832 5788.5,-4832"/>
+<polygon fill="#df65b0" stroke="transparent" points="5792,-4918.5 5792,-4939.5 5919,-4939.5 5919,-4918.5 5792,-4918.5"/>
+<polygon fill="none" stroke="black" points="5792,-4918.5 5792,-4939.5 5919,-4939.5 5919,-4918.5 5792,-4918.5"/>
+<text text-anchor="start" x="5795" y="-4925.3" font-family="Times,serif" font-size="14.00">ProbeH2 (9 MiB)</text>
+<text text-anchor="start" x="5846" y="-4903.3" font-family="Times,serif" font-size="14.00">h2</text>
+<text text-anchor="start" x="5802.5" y="-4882.3" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text>
+<text text-anchor="start" x="5827" y="-4861.3" font-family="Times,serif" font-size="14.00">ProbeId</text>
+<text text-anchor="start" x="5831" y="-4840.3" font-family="Times,serif" font-size="14.00">weight</text>
+<polygon fill="none" stroke="black" points="5788.5,-4832 5788.5,-4943 5921.5,-4943 5921.5,-4832 5788.5,-4832"/>
+</g>
+<!-- ProbeH2&#45;&gt;Probe -->
+<g id="edge87" class="edge">
+<title>ProbeH2:ProbeId&#45;&gt;Probe</title>
+<path fill="none" stroke="black" d="M5920,-4864.5C6401.38,-4864.5 5940.09,-4144.3 6330,-3862 6421.67,-3795.63 6755.1,-3903.04 6838,-3826 6948.34,-3723.46 6950.01,-3538.6 6936.27,-3416.32"/>
+<polygon fill="black" stroke="black" points="6939.72,-3415.69 6935.07,-3406.16 6932.76,-3416.5 6939.72,-3415.69"/>
+</g>
+<!-- ProbeH2&#45;&gt;ProbeFreeze -->
+<g id="edge86" class="edge">
+<title>ProbeH2:ProbeFreezeId&#45;&gt;ProbeFreeze</title>
+<path fill="none" stroke="black" d="M5791,-4885.5C5212.27,-4885.5 5503.91,-4120.25 4986,-3862 4899.92,-3819.08 3329.71,-3886.58 3255,-3826 2877.83,-3520.19 3360.75,-3094.62 3007,-2762 2937.05,-2696.23 2860.62,-2795.13 2794,-2726 2629.09,-2554.88 2645.25,-2253.02 2670.34,-2085.17"/>
+<polygon fill="black" stroke="black" points="2673.84,-2085.47 2671.89,-2075.05 2666.92,-2084.41 2673.84,-2085.47"/>
+</g>
+<!-- MappingMethod -->
+<g id="node96" class="node">
+<title>MappingMethod</title>
+<polygon fill="white" stroke="transparent" points="12923.5,-4853 12923.5,-4922 13110.5,-4922 13110.5,-4853 12923.5,-4853"/>
+<polygon fill="#f1eef6" stroke="transparent" points="12927,-4897.5 12927,-4918.5 13108,-4918.5 13108,-4897.5 12927,-4897.5"/>
+<polygon fill="none" stroke="black" points="12927,-4897.5 12927,-4918.5 13108,-4918.5 13108,-4897.5 12927,-4897.5"/>
+<text text-anchor="start" x="12930" y="-4904.3" font-family="Times,serif" font-size="14.00">MappingMethod (100 B)</text>
+<text text-anchor="start" x="13010" y="-4882.3" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="12996" y="-4861.3" font-family="Times,serif" font-size="14.00">Name</text>
+<polygon fill="none" stroke="black" points="12923.5,-4853 12923.5,-4922 13110.5,-4922 13110.5,-4853 12923.5,-4853"/>
+</g>
+<!-- SnpAll -->
+<g id="node97" class="node">
+<title>SnpAll</title>
+<polygon fill="white" stroke="transparent" points="1552,-524 1552,-1160 1746,-1160 1746,-524 1552,-524"/>
+<polygon fill="#ce1256" stroke="transparent" points="1555,-1136 1555,-1157 1743,-1157 1743,-1136 1555,-1136"/>
+<polygon fill="none" stroke="black" points="1555,-1136 1555,-1157 1743,-1157 1743,-1136 1555,-1136"/>
+<text text-anchor="start" x="1593.5" y="-1142.8" font-family="Times,serif" font-size="14.00">SnpAll (11 GiB)</text>
+<text text-anchor="start" x="1603.5" y="-1120.8" font-family="Times,serif" font-size="14.00">3Prime_UTR</text>
+<text text-anchor="start" x="1603.5" y="-1099.8" font-family="Times,serif" font-size="14.00">5Prime_UTR</text>
+<text text-anchor="start" x="1625" y="-1078.8" font-family="Times,serif" font-size="14.00">Alleles</text>
+<text text-anchor="start" x="1602" y="-1057.8" font-family="Times,serif" font-size="14.00">Chromosome</text>
+<text text-anchor="start" x="1581" y="-1036.8" font-family="Times,serif" font-size="14.00">ConservationScore</text>
+<text text-anchor="start" x="1621.5" y="-1015.8" font-family="Times,serif" font-size="14.00">Domain</text>
+<text text-anchor="start" x="1603.5" y="-994.8" font-family="Times,serif" font-size="14.00">Downstream</text>
+<text text-anchor="start" x="1630.5" y="-973.8" font-family="Times,serif" font-size="14.00">Exon</text>
+<text text-anchor="start" x="1630.5" y="-952.8" font-family="Times,serif" font-size="14.00">Gene</text>
+<text text-anchor="start" x="1641.5" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="1612" y="-910.8" font-family="Times,serif" font-size="14.00">Intergenic</text>
+<text text-anchor="start" x="1626.5" y="-889.8" font-family="Times,serif" font-size="14.00">Intron</text>
+<text text-anchor="start" x="1591.5" y="-868.8" font-family="Times,serif" font-size="14.00">Non_Splice_Site</text>
+<text text-anchor="start" x="1557" y="-847.8" font-family="Times,serif" font-size="14.00">Non_Synonymous_Coding</text>
+<text text-anchor="start" x="1620" y="-826.8" font-family="Times,serif" font-size="14.00">Position</text>
+<text text-anchor="start" x="1599" y="-805.8" font-family="Times,serif" font-size="14.00">Position_2016</text>
+<text text-anchor="start" x="1639.5" y="-784.8" font-family="Times,serif" font-size="14.00">Rs</text>
+<text text-anchor="start" x="1614" y="-763.8" font-family="Times,serif" font-size="14.00">SnpName</text>
+<text text-anchor="start" x="1624" y="-742.8" font-family="Times,serif" font-size="14.00">Source</text>
+<text text-anchor="start" x="1614" y="-721.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="1609.5" y="-700.8" font-family="Times,serif" font-size="14.00">Splice_Site</text>
+<text text-anchor="start" x="1602" y="-679.8" font-family="Times,serif" font-size="14.00">Start_Gained</text>
+<text text-anchor="start" x="1611.5" y="-658.8" font-family="Times,serif" font-size="14.00">Start_Lost</text>
+<text text-anchor="start" x="1603.5" y="-637.8" font-family="Times,serif" font-size="14.00">Stop_Gained</text>
+<text text-anchor="start" x="1613.5" y="-616.8" font-family="Times,serif" font-size="14.00">Stop_Lost</text>
+<text text-anchor="start" x="1575" y="-595.8" font-family="Times,serif" font-size="14.00">Synonymous_Coding</text>
+<text text-anchor="start" x="1611.5" y="-574.8" font-family="Times,serif" font-size="14.00">Transcript</text>
+<text text-anchor="start" x="1558.5" y="-553.8" font-family="Times,serif" font-size="14.00">Unknown_Effect_In_Exon</text>
+<text text-anchor="start" x="1613" y="-532.8" font-family="Times,serif" font-size="14.00">Upstream</text>
+<polygon fill="none" stroke="black" points="1552,-524 1552,-1160 1746,-1160 1746,-524 1552,-524"/>
+</g>
+<!-- SnpAll&#45;&gt;Species -->
+<g id="edge88" class="edge">
+<title>SnpAll:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M1744,-725C1789.75,-725 1732.61,-554.2 1763,-520 1889.95,-377.13 2495.01,-320.73 2715.44,-304.71"/>
+<polygon fill="black" stroke="black" points="2715.91,-308.18 2725.64,-303.98 2715.41,-301.2 2715.91,-308.18"/>
+</g>
+<!-- GeneInfo -->
+<g id="node98" class="node">
+<title>GeneInfo</title>
+<polygon fill="white" stroke="transparent" points="2150,-671 2150,-1013 2338,-1013 2338,-671 2150,-671"/>
+<polygon fill="#df65b0" stroke="transparent" points="2153,-989 2153,-1010 2335,-1010 2335,-989 2153,-989"/>
+<polygon fill="none" stroke="black" points="2153,-989 2153,-1010 2335,-1010 2335,-989 2153,-989"/>
+<text text-anchor="start" x="2178" y="-995.8" font-family="Times,serif" font-size="14.00">GeneInfo (23 MiB)</text>
+<text text-anchor="start" x="2226.5" y="-973.8" font-family="Times,serif" font-size="14.00">Alias</text>
+<text text-anchor="start" x="2215.5" y="-952.8" font-family="Times,serif" font-size="14.00">BlatSeq</text>
+<text text-anchor="start" x="2230.5" y="-931.8" font-family="Times,serif" font-size="14.00">Chr</text>
+<text text-anchor="start" x="2218" y="-910.8" font-family="Times,serif" font-size="14.00">GeneId</text>
+<text text-anchor="start" x="2189.5" y="-889.8" font-family="Times,serif" font-size="14.00">HomoloGeneID</text>
+<text text-anchor="start" x="2236.5" y="-868.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="2232" y="-847.8" font-family="Times,serif" font-size="14.00">Mb</text>
+<text text-anchor="start" x="2221" y="-826.8" font-family="Times,serif" font-size="14.00">OMIM</text>
+<text text-anchor="start" x="2159" y="-805.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_end</text>
+<text text-anchor="start" x="2155" y="-784.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_start</text>
+<text text-anchor="start" x="2209" y="-763.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="2197.5" y="-742.8" font-family="Times,serif" font-size="14.00">Strand_Gene</text>
+<text text-anchor="start" x="2195" y="-721.8" font-family="Times,serif" font-size="14.00">Strand_Probe</text>
+<text text-anchor="start" x="2217" y="-700.8" font-family="Times,serif" font-size="14.00">Symbol</text>
+<text text-anchor="start" x="2224" y="-679.8" font-family="Times,serif" font-size="14.00">TaxId</text>
+<polygon fill="none" stroke="black" points="2150,-671 2150,-1013 2338,-1013 2338,-671 2150,-671"/>
+</g>
+<!-- GeneInfo&#45;&gt;Species -->
+<g id="edge89" class="edge">
+<title>GeneInfo:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M2336,-767C2363.53,-767 2339.64,-542.84 2355,-520 2438.32,-396.09 2612.85,-338.66 2715.61,-314.65"/>
+<polygon fill="black" stroke="black" points="2716.66,-318 2725.63,-312.36 2715.1,-311.18 2716.66,-318"/>
+</g>
+<!-- GeneList_rn3 -->
+<g id="node99" class="node">
+<title>GeneList_rn3</title>
+<polygon fill="white" stroke="transparent" points="552,-1718.5 552,-2207.5 718,-2207.5 718,-1718.5 552,-1718.5"/>
+<polygon fill="#df65b0" stroke="transparent" points="555,-2183 555,-2204 715,-2204 715,-2183 555,-2183"/>
+<polygon fill="none" stroke="black" points="555,-2183 555,-2204 715,-2204 715,-2183 555,-2183"/>
+<text text-anchor="start" x="558" y="-2189.8" font-family="Times,serif" font-size="14.00">GeneList_rn3 (5 MiB)</text>
+<text text-anchor="start" x="589.5" y="-2167.8" font-family="Times,serif" font-size="14.00">chromosome</text>
+<text text-anchor="start" x="621.5" y="-2146.8" font-family="Times,serif" font-size="14.00">flag</text>
+<text text-anchor="start" x="595.5" y="-2125.8" font-family="Times,serif" font-size="14.00">genBankID</text>
+<text text-anchor="start" x="576" y="-2104.8" font-family="Times,serif" font-size="14.00">geneDescription</text>
+<text text-anchor="start" x="609" y="-2083.8" font-family="Times,serif" font-size="14.00">geneID</text>
+<text text-anchor="start" x="591" y="-2062.8" font-family="Times,serif" font-size="14.00">geneSymbol</text>
+<text text-anchor="start" x="628" y="-2041.8" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="607" y="-2020.8" font-family="Times,serif" font-size="14.00">identity</text>
+<text text-anchor="start" x="618" y="-1999.8" font-family="Times,serif" font-size="14.00">kgID</text>
+<text text-anchor="start" x="601.5" y="-1978.8" font-family="Times,serif" font-size="14.00">ProbeSet</text>
+<text text-anchor="start" x="616" y="-1957.8" font-family="Times,serif" font-size="14.00">qEnd</text>
+<text text-anchor="start" x="615" y="-1936.8" font-family="Times,serif" font-size="14.00">qSize</text>
+<text text-anchor="start" x="612" y="-1915.8" font-family="Times,serif" font-size="14.00">qStart</text>
+<text text-anchor="start" x="615.5" y="-1894.8" font-family="Times,serif" font-size="14.00">score</text>
+<text text-anchor="start" x="601.5" y="-1873.8" font-family="Times,serif" font-size="14.00">sequence</text>
+<text text-anchor="start" x="618" y="-1852.8" font-family="Times,serif" font-size="14.00">span</text>
+<text text-anchor="start" x="598.5" y="-1831.8" font-family="Times,serif" font-size="14.00">specificity</text>
+<text text-anchor="start" x="611.5" y="-1810.8" font-family="Times,serif" font-size="14.00">strand</text>
+<text text-anchor="start" x="613.5" y="-1789.8" font-family="Times,serif" font-size="14.00">txEnd</text>
+<text text-anchor="start" x="612.5" y="-1768.8" font-family="Times,serif" font-size="14.00">txSize</text>
+<text text-anchor="start" x="609" y="-1747.8" font-family="Times,serif" font-size="14.00">txStart</text>
+<text text-anchor="start" x="602" y="-1726.8" font-family="Times,serif" font-size="14.00">unigenID</text>
+<polygon fill="none" stroke="black" points="552,-1718.5 552,-2207.5 718,-2207.5 718,-1718.5 552,-1718.5"/>
+</g>
+<!-- GeneList_rn3&#45;&gt;Genbank -->
+<g id="edge90" class="edge">
+<title>GeneList_rn3:genBankID&#45;&gt;Genbank</title>
+<path fill="none" stroke="black" d="M716,-2130C741.84,-2130 729.38,-1225.22 735,-1200 738.81,-1182.91 745.09,-1180.48 751,-1164 783.34,-1073.83 811.09,-965.96 826.65,-901.05"/>
+<polygon fill="black" stroke="black" points="830.13,-901.54 829.04,-891 823.32,-899.92 830.13,-901.54"/>
+</g>
+<!-- News -->
+<g id="node100" class="node">
+<title>News</title>
+<polygon fill="white" stroke="transparent" points="13145,-4842.5 13145,-4932.5 13269,-4932.5 13269,-4842.5 13145,-4842.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="13148,-4908.5 13148,-4929.5 13266,-4929.5 13266,-4908.5 13148,-4908.5"/>
+<polygon fill="none" stroke="black" points="13148,-4908.5 13148,-4929.5 13266,-4929.5 13266,-4908.5 13148,-4908.5"/>
+<text text-anchor="start" x="13151" y="-4915.3" font-family="Times,serif" font-size="14.00">News (167 KiB)</text>
+<text text-anchor="start" x="13191" y="-4893.3" font-family="Times,serif" font-size="14.00">date</text>
+<text text-anchor="start" x="13182.5" y="-4872.3" font-family="Times,serif" font-size="14.00">details</text>
+<text text-anchor="start" x="13200" y="-4851.3" font-family="Times,serif" font-size="14.00">id</text>
+<polygon fill="none" stroke="black" points="13145,-4842.5 13145,-4932.5 13269,-4932.5 13269,-4842.5 13145,-4842.5"/>
+</g>
+<!-- login -->
+<g id="node101" class="node">
+<title>login</title>
+<polygon fill="white" stroke="transparent" points="13303.5,-4800.5 13303.5,-4974.5 13414.5,-4974.5 13414.5,-4800.5 13303.5,-4800.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="13307,-4950.5 13307,-4971.5 13412,-4971.5 13412,-4950.5 13307,-4950.5"/>
+<polygon fill="none" stroke="black" points="13307,-4950.5 13307,-4971.5 13412,-4971.5 13412,-4950.5 13307,-4950.5"/>
+<text text-anchor="start" x="13310" y="-4957.3" font-family="Times,serif" font-size="14.00">login (52 KiB)</text>
+<text text-anchor="start" x="13315.5" y="-4935.3" font-family="Times,serif" font-size="14.00">assumed_by</text>
+<text text-anchor="start" x="13352.5" y="-4914.3" font-family="Times,serif" font-size="14.00">id</text>
+<text text-anchor="start" x="13321" y="-4893.3" font-family="Times,serif" font-size="14.00">ip_address</text>
+<text text-anchor="start" x="13323" y="-4872.3" font-family="Times,serif" font-size="14.00">session_id</text>
+<text text-anchor="start" x="13322.5" y="-4851.3" font-family="Times,serif" font-size="14.00">successful</text>
+<text text-anchor="start" x="13321" y="-4830.3" font-family="Times,serif" font-size="14.00">timestamp</text>
+<text text-anchor="start" x="13343.5" y="-4809.3" font-family="Times,serif" font-size="14.00">user</text>
+<polygon fill="none" stroke="black" points="13303.5,-4800.5 13303.5,-4974.5 13414.5,-4974.5 13414.5,-4800.5 13303.5,-4800.5"/>
+</g>
+<!-- GeneList -->
+<g id="node102" class="node">
+<title>GeneList</title>
+<polygon fill="white" stroke="transparent" points="1017.5,-1582 1017.5,-2344 1164.5,-2344 1164.5,-1582 1017.5,-1582"/>
+<polygon fill="#df65b0" stroke="transparent" points="1021,-2320 1021,-2341 1162,-2341 1162,-2320 1021,-2320"/>
+<polygon fill="none" stroke="black" points="1021,-2320 1021,-2341 1162,-2341 1162,-2320 1021,-2320"/>
+<text text-anchor="start" x="1026" y="-2326.8" font-family="Times,serif" font-size="14.00">GeneList (37 MiB)</text>
+<text text-anchor="start" x="1064.5" y="-2304.8" font-family="Times,serif" font-size="14.00">AlignID</text>
+<text text-anchor="start" x="1065" y="-2283.8" font-family="Times,serif" font-size="14.00">cdsEnd</text>
+<text text-anchor="start" x="1043.5" y="-2262.8" font-family="Times,serif" font-size="14.00">cdsEnd_2016</text>
+<text text-anchor="start" x="1043.5" y="-2241.8" font-family="Times,serif" font-size="14.00">cdsEnd_mm8</text>
+<text text-anchor="start" x="1061" y="-2220.8" font-family="Times,serif" font-size="14.00">cdsStart</text>
+<text text-anchor="start" x="1039.5" y="-2199.8" font-family="Times,serif" font-size="14.00">cdsStart_2016</text>
+<text text-anchor="start" x="1039.5" y="-2178.8" font-family="Times,serif" font-size="14.00">cdsStart_mm8</text>
+<text text-anchor="start" x="1044.5" y="-2157.8" font-family="Times,serif" font-size="14.00">Chromosome</text>
+<text text-anchor="start" x="1023" y="-2136.8" font-family="Times,serif" font-size="14.00">Chromosome_mm8</text>
+<text text-anchor="start" x="1053" y="-2115.8" font-family="Times,serif" font-size="14.00">exonCount</text>
+<text text-anchor="start" x="1031.5" y="-2094.8" font-family="Times,serif" font-size="14.00">exonCount_mm8</text>
+<text text-anchor="start" x="1056.5" y="-2073.8" font-family="Times,serif" font-size="14.00">exonEnds</text>
+<text text-anchor="start" x="1035" y="-2052.8" font-family="Times,serif" font-size="14.00">exonEnds_mm8</text>
+<text text-anchor="start" x="1052" y="-2031.8" font-family="Times,serif" font-size="14.00">exonStarts</text>
+<text text-anchor="start" x="1031" y="-2010.8" font-family="Times,serif" font-size="14.00">exonStarts_mm8</text>
+<text text-anchor="start" x="1050.5" y="-1989.8" font-family="Times,serif" font-size="14.00">GenBankID</text>
+<text text-anchor="start" x="1031.5" y="-1968.8" font-family="Times,serif" font-size="14.00">GeneDescription</text>
+<text text-anchor="start" x="1064.5" y="-1947.8" font-family="Times,serif" font-size="14.00">GeneID</text>
+<text text-anchor="start" x="1046" y="-1926.8" font-family="Times,serif" font-size="14.00">GeneSymbol</text>
+<text text-anchor="start" x="1084" y="-1905.8" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="1056" y="-1884.8" font-family="Times,serif" font-size="14.00">Info_mm9</text>
+<text text-anchor="start" x="1074.5" y="-1863.8" font-family="Times,serif" font-size="14.00">kgID</text>
+<text text-anchor="start" x="1066.5" y="-1842.8" font-family="Times,serif" font-size="14.00">NM_ID</text>
+<text text-anchor="start" x="1056.5" y="-1821.8" font-family="Times,serif" font-size="14.00">ProteinID</text>
+<text text-anchor="start" x="1063" y="-1800.8" font-family="Times,serif" font-size="14.00">RGD_ID</text>
+<text text-anchor="start" x="1056.5" y="-1779.8" font-family="Times,serif" font-size="14.00">SpeciesId</text>
+<text text-anchor="start" x="1067" y="-1758.8" font-family="Times,serif" font-size="14.00">Strand</text>
+<text text-anchor="start" x="1045.5" y="-1737.8" font-family="Times,serif" font-size="14.00">Strand_mm8</text>
+<text text-anchor="start" x="1068.5" y="-1716.8" font-family="Times,serif" font-size="14.00">TxEnd</text>
+<text text-anchor="start" x="1047" y="-1695.8" font-family="Times,serif" font-size="14.00">TxEnd_2016</text>
+<text text-anchor="start" x="1047" y="-1674.8" font-family="Times,serif" font-size="14.00">TxEnd_mm8</text>
+<text text-anchor="start" x="1064" y="-1653.8" font-family="Times,serif" font-size="14.00">TxStart</text>
+<text text-anchor="start" x="1043" y="-1632.8" font-family="Times,serif" font-size="14.00">TxStart_2016</text>
+<text text-anchor="start" x="1043" y="-1611.8" font-family="Times,serif" font-size="14.00">TxStart_mm8</text>
+<text text-anchor="start" x="1057" y="-1590.8" font-family="Times,serif" font-size="14.00">UnigenID</text>
+<polygon fill="none" stroke="black" points="1017.5,-1582 1017.5,-2344 1164.5,-2344 1164.5,-1582 1017.5,-1582"/>
+</g>
+<!-- GeneList&#45;&gt;Species -->
+<g id="edge92" class="edge">
+<title>GeneList:SpeciesId&#45;&gt;Species</title>
+<path fill="none" stroke="black" d="M1020,-1783C987.59,-1783 1012.7,-1229.81 1000,-1200 991.25,-1179.47 973.39,-1184.68 965,-1164 938.08,-1097.7 917.52,-573.54 965,-520 1083.11,-386.82 2377.72,-318.63 2715.68,-303.02"/>
+<polygon fill="black" stroke="black" points="2716.05,-306.51 2725.88,-302.55 2715.73,-299.51 2716.05,-306.51"/>
+</g>
+<!-- GeneList&#45;&gt;Genbank -->
+<g id="edge91" class="edge">
+<title>GeneList:GenBankID&#45;&gt;Genbank</title>
+<path fill="none" stroke="black" d="M1020,-1994C975.87,-1994 1023.12,-1237.58 1000,-1200 982.29,-1171.21 954.25,-1190.29 933,-1164 870.98,-1087.29 850.32,-970.88 843.44,-901.34"/>
+<polygon fill="black" stroke="black" points="846.89,-900.65 842.48,-891.02 839.92,-901.3 846.89,-900.65"/>
+</g>
+<!-- GeneChipEnsemblXRef -->
+<g id="node103" class="node">
+<title>GeneChipEnsemblXRef</title>
+<polygon fill="white" stroke="transparent" points="1750,-1928.5 1750,-1997.5 1976,-1997.5 1976,-1928.5 1750,-1928.5"/>
+<polygon fill="#f1eef6" stroke="transparent" points="1753,-1973 1753,-1994 1973,-1994 1973,-1973 1753,-1973"/>
+<polygon fill="none" stroke="black" points="1753,-1973 1753,-1994 1973,-1994 1973,-1973 1753,-1973"/>
+<text text-anchor="start" x="1756" y="-1979.8" font-family="Times,serif" font-size="14.00">GeneChipEnsemblXRef (36 B)</text>
+<text text-anchor="start" x="1808" y="-1957.8" font-family="Times,serif" font-size="14.00">EnsemblChipId</text>
+<text text-anchor="start" x="1820.5" y="-1936.8" font-family="Times,serif" font-size="14.00">GeneChipId</text>
+<polygon fill="none" stroke="black" points="1750,-1928.5 1750,-1997.5 1976,-1997.5 1976,-1928.5 1750,-1928.5"/>
+</g>
+<!-- GeneChipEnsemblXRef&#45;&gt;EnsemblChip -->
+<g id="edge93" class="edge">
+<title>GeneChipEnsemblXRef:EnsemblChipId&#45;&gt;EnsemblChip</title>
+<path fill="none" stroke="black" d="M1974,-1961C2027,-1961 1909.96,-1154.89 1873.44,-911.66"/>
+<polygon fill="black" stroke="black" points="1876.86,-910.9 1871.91,-901.53 1869.94,-911.94 1876.86,-910.9"/>
+</g>
+<!-- GeneChipEnsemblXRef&#45;&gt;GeneChip -->
+<g id="edge94" class="edge">
+<title>GeneChipEnsemblXRef:GeneChipId&#45;&gt;GeneChip</title>
+<path fill="none" stroke="black" d="M1974,-1940C1994.57,-1940 1996.24,-1220.49 1998,-1200 2005.12,-1117.24 2018.29,-1024.34 2029.33,-954.05"/>
+<polygon fill="black" stroke="black" points="2032.84,-954.27 2030.95,-943.85 2025.93,-953.18 2032.84,-954.27"/>
+</g>
+<!-- SnpAllele_to_be_deleted -->
+<g id="node104" class="node">
+<title>SnpAllele_to_be_deleted</title>
+<polygon fill="white" stroke="transparent" points="13448.5,-4842.5 13448.5,-4932.5 13687.5,-4932.5 13687.5,-4842.5 13448.5,-4842.5"/>
+<polygon fill="#d7b5d8" stroke="transparent" points="13452,-4908.5 13452,-4929.5 13685,-4929.5 13685,-4908.5 13452,-4908.5"/>
+<polygon fill="none" stroke="black" points="13452,-4908.5 13452,-4929.5 13685,-4929.5 13685,-4908.5 13452,-4908.5"/>
+<text text-anchor="start" x="13455" y="-4915.3" font-family="Times,serif" font-size="14.00">SnpAllele_to_be_deleted (3 KiB)</text>
+<text text-anchor="start" x="13551" y="-4893.3" font-family="Times,serif" font-size="14.00">Base</text>
+<text text-anchor="start" x="13561" y="-4872.3" font-family="Times,serif" font-size="14.00">Id</text>
+<text text-anchor="start" x="13554.5" y="-4851.3" font-family="Times,serif" font-size="14.00">Info</text>
+<polygon fill="none" stroke="black" points="13448.5,-4842.5 13448.5,-4932.5 13687.5,-4932.5 13687.5,-4842.5 13448.5,-4842.5"/>
+</g>
+</g>
+</svg>
diff --git a/topics/deploy/configuring-nginx-on-host.gmi b/topics/deploy/configuring-nginx-on-host.gmi
new file mode 100644
index 0000000..cb1c497
--- /dev/null
+++ b/topics/deploy/configuring-nginx-on-host.gmi
@@ -0,0 +1,220 @@
+# Configuring Nginx on the Host System
+
+## Tags
+
+* type: doc, docs, documentation
+* keywords: deploy, deployment, deploying, nginx, guix, guix container, guix system container
+* status: in progress
+
+## Introduction
+
+We deploy the GeneNetwork system within GNU Guix system containers. All the configuration and HTTPS certificates are handled from within the container, so all the host has to do is pass the traffic on to the system container.
+
+This document shows you how to set up the host system to forward all the necessary traffic, so that you do not run into all the problems that we did when figuring this stuff out :-).
+
+## Ports and Domains
+
+In your system container, there are certain ports that are defined for various traffic. The most important ones, and the ones we will deal with, are for HTTP and HTTPS. The ideas should translate for most other ports.
+
+For the examples in this document, we will assume the following ports are defined in the Guix system container:
+* HTTP on port 9080
+* HTTPS on port 9081
+
+## HTTPS Traffic
+
+### Nginx --with-stream_ssl_preread_module
+
+We handle all the necessary traffic details (e.g. SSL/TLS termination) within the container, and only need the host to forward the traffic.
+
+In order to achieve this, your Nginx will need to be compiled with the
+=> https://nginx.org/en/docs/stream/ngx_stream_ssl_preread_module.html Nginx Stream SSL Preread Module.
+
+Now, because we are awesome, we include
+=> https://git.genenetwork.org/gn-machines/tree/nginx-preread.scm a definition for nginx compiled with the module.
+Simply install it on your host by doing something like:
+
+```
+$ git clone https://git.genenetwork.org/gn-machines
+$ cd gn-machines
+$ ./nginx-preread-deploy.sh
+```
+
+That will install nginx at "/usr/local/sbin/nginx".
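+
+You can confirm that the installed binary was indeed built with the preread module, since nginx's -V flag lists its compile-time options:
+
+```
+$ /usr/local/sbin/nginx -V 2>&1 | grep -o stream_ssl_preread_module
+stream_ssl_preread_module
+```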
+
+Now, we comment out or delete any lines that load nginx modules for the previously existing nginx. Comment out or delete the following line in your "/etc/nginx/nginx.conf" file if it exists:
+
+```
+include /etc/nginx/modules-enabled/*.conf;
+```
+
+This is necessary since the nginx we installed from guix comes with all the modules we need; even if it did not, it could not successfully use the host's modules anyway. You would need to modify the nginx config yourself to add any missing modules for the nginx from guix. How to do that is outside the scope of this document, but it should not be particularly difficult.
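+
+If you prefer not to edit the file by hand, a one-liner such as the following (a sketch; adjust it if your include line differs) comments it out in place:
+
+```
+$ sudo sed -i 's|^include /etc/nginx/modules-enabled/.*|# &|' /etc/nginx/nginx.conf
+```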
+
+Set up your init system to use the nginx from guix. Assuming systemd, you need to have something like the following in your "/etc/systemd/system/nginx.service" unit file:
+
+```
+[Unit]
+Description=nginx web server (from Guix, not the host)
+After=network.target
+
+[Service]
+Type=forking
+PIDFile=/run/nginx.pid
+ExecStartPre=/usr/local/sbin/nginx -q -t -c /etc/nginx/nginx.conf -e /var/log/nginx/error.log
+ExecStart=/usr/local/sbin/nginx -c /etc/nginx/nginx.conf -p /var/run/nginx -e /var/log/nginx/error.log
+ExecReload=/usr/local/sbin/nginx -c /etc/nginx/nginx.conf -s reload -e /var/log/nginx/error.log
+ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
+TimeoutStopSec=5
+KillMode=mixed
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Awesome. Now enable the unit file:
+
+```
+$ sudo systemctl enable nginx.service
+```
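+
+Then reload systemd and (re)start the service, checking that the Guix-built binary is the one actually running:
+
+```
+$ sudo systemctl daemon-reload
+$ sudo systemctl restart nginx.service
+$ systemctl status nginx.service
+```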
+
+### Forwarding the HTTPS Traffic
+
+Now that we have nginx in place, we can forward HTTPS traffic for all the domains we want. In "/etc/nginx/nginx.conf" we add:
+
+```
+# Forward some HTTPS connections into existing guix containers
+stream {
+    upstream my-container {
+        # This is our Guix system container
+        server 127.0.0.1:9081;
+    }
+
+    upstream host-https {
+        # Forward any https traffic for any previously existing domains on the
+        # host itself.
+        server 127.0.0.1:6443;
+    }
+
+    map $ssl_preread_server_name $upstream {
+        yourdomain1.genenetwork.org my-container;
+        yourdomain2.genenetwork.org my-container;
+        default host-https;
+    }
+
+    server {
+        listen 443;
+        proxy_pass $upstream;
+        ssl_preread on;
+    }
+}
+```
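+
+To check that the stream block routes by server name as expected, you can connect locally with an explicit SNI value (the domains are the placeholders from the map above):
+
+```
+$ openssl s_client -connect 127.0.0.1:443 -servername yourdomain1.genenetwork.org </dev/null
+```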
+
+## HTTP Traffic
+
+You will need to pass the HTTP traffic on to the container so that HTTP-dependent traffic (e.g. setting up the SSL certificates using the ACME protocol) is successfully handled.
+
+You have two options to do this:
+* Add a separate server block in `/etc/nginx/sites-available/` (or other configured directory)
+* Add the server block directly in `/etc/nginx/nginx.conf` (or your main nginx config file, if it is not the standard one mentioned here).
+
+The configuration to add is as follows:
+
+```
+server {
+    ## Forward HTTP traffic to container
+    ## Without this, the HTTP calls will fall through to the defaults in
+    ## /etc/nginx/sites-enabled/ leading to http-dependent traffic, like
+    ## that of the ACME client, failing.
+    server_name yourdomain1.genenetwork.org yourdomain2.genenetwork.org …;
+    listen 80;
+    location / {
+        proxy_pass http://127.0.0.1:9080;
+        proxy_set_header Host $host;
+    }
+}
+```
+
+Do please replace the "yourdomain*" parts in the example above as appropriate for your scenario. The ellipsis (…) indicates optional extra domains you might need to configure.
+
+Without this, the "Run ACME Client" step below will fail.
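+
+A quick way to verify the HTTP forwarding before running the ACME client is to request one of the domains over plain HTTP from the host itself; curl's --resolve flag avoids having to touch DNS:
+
+```
+$ curl -I --resolve yourdomain1.genenetwork.org:80:127.0.0.1 http://yourdomain1.genenetwork.org/
+```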
+
+## Run ACME Client
+
+Now that all traffic is set up, and you can reach your sites using both HTTP and HTTPS (you have tested your sites, right? right?), we can request the SSL certificates from Let's Encrypt so that we no longer see the "Self-signed Certificate" warning.
+
+You need to get into your system container to do this. The steps are as follows:
+
+=> https://git.genenetwork.org/gn-machines/tree/README.org#n61 Figure out which process is your container
+=> https://git.genenetwork.org/gn-machines/tree/README.org#n55 Get a shell into the container
+=> https://guix-forge.systemreboot.net/manual/dev/en/#section-acme-service Run "/usr/bin/acme renew" to get your initial SSL certificates from Let's Encrypt.
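+
+Putting those steps together, a session might look roughly like this (the PID lookup and shell path are illustrative and depend on your container):
+
+```
+$ sudo pgrep -af shepherd                # locate the container's init process
+$ sudo guix container exec <pid> /run/current-system/profile/bin/bash --login
+# /usr/bin/acme renew
+```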
+
+At this point, the traffic portion of the configuration is done.
+
+## Sample "/etc/nginx/nginx.conf"
+
+```
+user www-data;
+worker_processes auto;
+pid /run/nginx.pid;
+# include /etc/nginx/modules-enabled/*.conf;
+
+access_log /var/log/nginx/access.log;
+error_log /var/log/nginx/error.log error;
+
+events {
+    worker_connections 768;
+    # multi_accept on;
+}
+
+stream {
+    upstream my-container {
+        # This is our Guix system container
+        server 127.0.0.1:9081;
+    }
+
+    upstream host-https {
+        # Forward any https traffic for any previously existing domains on the
+        # host itself.
+        server 127.0.0.1:6443;
+    }
+
+    map $ssl_preread_server_name $upstream {
+        yourdomain1.genenetwork.org my-container;
+        yourdomain2.genenetwork.org my-container;
+        default host-https;
+    }
+
+    server {
+        listen 443;
+        proxy_pass $upstream;
+        ssl_preread on;
+    }
+}
+
+http {
+    ##
+    # Basic Settings
+    ##
+    
+    ⋮
+    
+    include /etc/nginx/conf.d/*.conf;
+    server {
+        ## Forward HTTP traffic to container
+        ## Without this, the HTTP calls will fall through to the defaults in
+        ## /etc/nginx/sites-enabled/ leading to http-dependent traffic, like
+        ## that of the ACME client, failing.
+        server_name yourdomain1.genenetwork.org yourdomain2.genenetwork.org …;
+        listen 80;
+        location / {
+            proxy_pass http://127.0.0.1:9080;
+            proxy_set_header Host $host;
+        }
+    }
+    include /etc/nginx/sites-enabled/*;
+    
+    ⋮
+}
+
+⋮
+
+```
diff --git a/topics/deploy/deployment.gmi b/topics/deploy/deployment.gmi
index b844821..74fd6f0 100644
--- a/topics/deploy/deployment.gmi
+++ b/topics/deploy/deployment.gmi
@@ -1,14 +1,21 @@
 # Deploy GeneNetwork
 
+## Tags
+
+* type: doc, docs, documentation
+* keywords: deploy, deployment, deploying, guix, guix container, guix system container
+* status: in progress
+
 # Description
 
 This page attempts to document the deployment process we have for GeneNetwork. We use Guix system containers for deployment of CI/CD and the Guix configuration for the CI/CD container should be considered the authoritative reference.
 
-=> https://github.com/genenetwork/genenetwork-machines/blob/main/genenetwork-development.scm
+=> https://git.genenetwork.org/gn-machines/tree/genenetwork-development.scm
 
 See also
 
 => ./guix-system-containers-and-how-we-use-them
+=> ./configuring-nginx-on-host
 
 ## genenetwork2
 
diff --git a/topics/deploy/genecup.gmi b/topics/deploy/genecup.gmi
index c5aec17..fc93d07 100644
--- a/topics/deploy/genecup.gmi
+++ b/topics/deploy/genecup.gmi
@@ -53,3 +53,72 @@ and port forward:
 ssh -L 4200:127.0.0.1:4200 -f -N server
 curl localhost:4200
 ```
+
+# Troubleshooting
+
+## Moving the PubMed dir
+
+After moving the PubMed dir, GeneCup stopped displaying some of the connections. This can be reproduced by running the standard example on the home page - the result should look like the image on the right of the home page.
+
+After fixing the paths and restarting the service there still was no result.
+
+GeneCup is currently managed by shepherd as the user shepherd. Stop the service as that user:
+
+```
+shepherd@tux02:~$ herd stop genecup
+guile: warning: failed to install locale
+Service genecup has been stopped.
+```
+
+Now the service looks stopped, but it is still running, so you need to kill it by hand:
+
+```
+shepherd@tux02:~$ ps xau|grep genecup
+shepherd  89524  0.0  0.0  12780   944 pts/42   S+   00:32   0:00 grep genecup
+shepherd 129334  0.0  0.7 42620944 2089640 ?    Sl   Mar05  66:30 /gnu/store/1w5v338qk5m8khcazwclprs3znqp6f7f-python-3.10.7/bin/python3 /gnu/store/a6z0mmj6iq6grwynfvkzd0xbbr4zdm0l-genecup-latest-with-tensorflow-native-HEAD-of-master-branch/.server.py-real
+shepherd@tux02:~$ kill -9 129334
+shepherd@tux02:~$ ps xau|grep genecup
+shepherd  89747  0.0  0.0  12780   944 pts/42   S+   00:32   0:00 grep genecup
+shepherd@tux02:~$
+```
+
+The log file lives in "~/logs" and can be followed with:
+
+```
+shepherd@tux02:~/logs$ tail -f genecup.log
+```
+
+We were getting errors on a reload, and I had to fix the following exports:
+
+```
+shepherd@tux02:~/shepherd-services$ grep export run_genecup.sh
+export EDIRECT_PUBMED_MASTER=/export3/PubMed
+export TMPDIR=/export/ratspub/tmp
+export NLTK_DATA=/export3/PubMed/nltk_data
+```
+
+See
+
+=> https://git.genenetwork.org/gn-shepherd-services/commit/?id=cd4512634ce1407b14b0842b0ef6a9cd35e6d46c
+
+The symlink from /export2 is not honoured by the guix container.
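+
+After fixing the exports, restart the service as the shepherd user:
+
+```
+shepherd@tux02:~$ herd start genecup
+```
+
+Now the service works.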
+
+Note we have deprecation warnings that need to be addressed in the future:
+
+```
+2025-04-22 00:40:07 /home/shepherd/services/genecup/guix-past/modules/past/packages/python.scm:740:19: warning: 'texlive-union' is deprecated,
+ use 'texlive-updmap.cfg' instead
+2025-04-22 00:40:07 guix build: warning: 'texlive-latex-base' is deprecated, use 'texlive-latex-bin' instead
+2025-04-22 00:40:15 updating checkout of 'https://git.genenetwork.org/genecup'...
+/gnu/store/9lbn1l04y0xciasv6zzigqrrk1bzz543-tensorflow-native-1.9.0/lib/python3.10/site-packages/tensorflow/python/framewo
+rk/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
+2025-04-22 00:40:38   _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
+2025-04-22 00:40:38 /gnu/store/9lbn1l04y0xciasv6zzigqrrk1bzz543-tensorflow-native-1.9.0/lib/python3.10/site-packages/tensorflow/python/framewo
+rk/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
+2025-04-22 00:40:38   _np_qint32 = np.dtype([("qint32", np.int32, 1)])
+2025-04-22 00:40:38 /gnu/store/9lbn1l04y0xciasv6zzigqrrk1bzz543-tensorflow-native-1.9.0/lib/python3.10/site-packages/tensorflow/python/framewo
+rk/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
+2025-04-22 00:40:38   np_resource = np.dtype([("resource", np.ubyte, 1)])
+2025-04-22 00:40:39 /gnu/store/7sam0mr9kxrd4p7g1hlz9wrwag67a6x6-python-flask-sqlalchemy-2.5.1/lib/python3.10/site-packages/flask_sqlalchemy/__
+init__.py:872: FSADeprecationWarning: SQLALCHEMY_TRACK_MODIFICATIONS adds significant overhead and will be disabled by default in the future. Set it to True or False to suppress this warning.
+```
diff --git a/topics/deploy/installation.gmi b/topics/deploy/installation.gmi
index 757d848..d6baa79 100644
--- a/topics/deploy/installation.gmi
+++ b/topics/deploy/installation.gmi
@@ -319,7 +319,7 @@ Currently we have two databases for deployment,
 from BXD mice and 'db_webqtl_plant' which contains all plant related
 material.
 
-Download one database from
+Download a recent database from
 
 => https://files.genenetwork.org/database/
 
diff --git a/topics/deploy/machines.gmi b/topics/deploy/machines.gmi
index d610c9f..a7c197c 100644
--- a/topics/deploy/machines.gmi
+++ b/topics/deploy/machines.gmi
@@ -2,17 +2,19 @@
 
 ```
 - [ ] bacchus             172.23.17.156   (00:11:32:ba:7f:17) -  1 Gbs
-- [X] lambda01            172.23.18.212   (7c:c2:55:11:9c:ac)
+- [ ] penguin2
+- [X] lambda01            172.23.18.212   (7c:c2:55:11:9c:ac) - currently 172.23.17.41
 - [X] tux03i              172.23.17.181   (00:0a:f7:c1:00:8d) - 10 Gbs
   [X] tux03               128.169.5.101   (00:0a:f7:c1:00:8b) -  1 Gbs
-- [ ] tux04i              172.23.17.170   (14:23:f2:4f:e6:10)
-- [ ] tux04               128.169.5.119   (14:23:f2:4f:e6:11)
+- [X] tux04i              172.23.17.170   (14:23:f2:4f:e6:10)
+- [X] tux04               128.169.5.119   (14:23:f2:4f:e6:11)
 - [X] tux05               172.23.18.129   (14:23:f2:4f:35:00)
 - [X] tux06               172.23.17.188   (14:23:f2:4e:29:10)
 - [X] tux07               172.23.17.191   (14:23:f2:4e:7d:60)
 - [X] tux08               172.23.17.186   (14:23:f2:4f:4e:b0)
 - [X] tux09               172.23.17.182   (14:23:f2:4e:49:10)
 - [X] space               128.169.5.175   (e4:3d:1a:80:6c:40)
+- [ ] space-i             172.23.18.153   (cc:48:3a:13:db:4c)
 - [ ] octopus01f          172.23.18.221   (2c:ea:7f:60:bf:61)
 - [ ] octopus02f          172.23.22.159   (2c:ea:7f:60:bd:61)
 - [ ] octopus03f          172.23.19.187   (2c:ea:7f:60:ac:2b)
@@ -25,6 +27,8 @@ c for console or control
 
 ```
 - [ ] DNS entries no longer visible
+- [X] penguin2-c        172.23.31.83
+- [ ] octolair01        172.23.16.228
 - [X] lambda01-c        172.23.17.173   (3c:ec:ef:aa:e5:50)
 - [X] tux01-c           172.23.31.85    (58:8A:5A:F9:3A:22)
 - [X] tux02-c           172.23.30.40    (58:8A:5A:F0:E6:E4)
diff --git a/topics/deploy/our-virtuoso-instances.gmi b/topics/deploy/our-virtuoso-instances.gmi
index 0336018..3ac56ae 100644
--- a/topics/deploy/our-virtuoso-instances.gmi
+++ b/topics/deploy/our-virtuoso-instances.gmi
@@ -9,6 +9,8 @@ We run three instances of virtuoso.
 The public SPARQL endpoint is accessible at
 => https://sparql.genenetwork.org/sparql
 
+These are now generally run as part of genenetwork2 containers(!)
+
 ## Configuration
 
 All our virtuoso instances are deployed in Guix system containers. The configuration for these containers is at
diff --git a/topics/deploy/paths-in-flask-applications.gmi b/topics/deploy/paths-in-flask-applications.gmi
new file mode 100644
index 0000000..77bc201
--- /dev/null
+++ b/topics/deploy/paths-in-flask-applications.gmi
@@ -0,0 +1,22 @@
+# Paths in Flask Applications
+
+## Tags
+
+* type: doc, docs, documentation
+* assigned: fredm
+* keywords: application paths, flask, absolute path, relative path
+
+## Content
+
+Always build and use absolute paths for the resources you use in your application. Assuming that the application will always be run with the root of the application's repository/package as the working directory is a recipe for failure.
+
+To demonstrate, see the following issue:
+=> /issues/genenetwork2/haley-knott-regression-mapping-error
+
+In this case, the path issue was not caught in the CI/CD environment since it runs the application with the repository root as its working directory. This issue will also not show up in most development environments, since it is easier to run the application from the root of the repository than to set up the PYTHONPATH variables.
+
+In the new containers making use of the "(genenetwork services genenetwork)" module in gn-machines[fn:1], the working directory where the application is invoked has no relation to the application's package; in fact, the working directory is the root of the container's file system ("/").
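+
+A minimal sketch of the pattern (the resource name here is purely illustrative):
+
+```
+import os
+
+# Resolve resources relative to this module's file, never the
+# process's current working directory.
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+GENOTYPE_FILE = os.path.join(BASE_DIR, "data", "genotypes.txt")  # hypothetical resource
+```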
+
+# Footnotes
+
+[fn:1] https://git.genenetwork.org/gn-machines/
diff --git a/topics/deploy/setting-up-or-migrating-production-across-machines.gmi b/topics/deploy/setting-up-or-migrating-production-across-machines.gmi
new file mode 100644
index 0000000..631a000
--- /dev/null
+++ b/topics/deploy/setting-up-or-migrating-production-across-machines.gmi
@@ -0,0 +1,202 @@
+# Setting Up or Migrating Production Across Machines
+
+## Tags
+
+* type: documentation, docs, doc
+* status: in-progress
+* assigned: fredm
+* priority: undefined
+* keywords: migration, production, genenetwork
+* interested-parties: pjotrp, zachs
+
+## Introduction
+
+Recent events (late 2024 and early 2025) have led to us needing to move the production system from one machine to another several times, due to machine failures, disk space constraints, security concerns, and the like.
+
+A number of tasks are necessary for a successful migration. Each of the following sections details one such task.
+
+## Copy Over Auth Database
+
+We need to synchronise the authorisation database. We can copy this over from the production system, or from the backups.
+
+* TODO: Indicate where the backups for the auth database are here!
+
+Steps (flesh out better; a rough sketch follows this list):
+
+* Extract backup (or copy from existing production system)
+* Stop the (new) container (if it's running)
+* Back up the (new) container's auth-db file
+* Place the auth db file in the correct place in the container's filesystem
+* Back up the existing secrets
+* Login to the `/auth/admin/dashboard` of the auth server (e.g. https://cd.genenetwork.org/auth/admin/dashboard)
+* If a client with the CLIENT_ID in the secrets file exists:
+* 1. Update the URIs for that client
+* 2. Click on the "Change Secret" button and generate a new secret. Replace the secret in the secrets file with the newly generated secret
+* If a client with the CLIENT_ID in the secrets file DOES NOT exist, register a new client, setting up the appropriate URIs and endpoints, and then add/replace both the CLIENT_ID and CLIENT_SECRET in the secrets file
+* Restart the (new) container
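+
+A rough sketch of the copy-over portion of these steps, assuming the bindings described below and an illustrative database file name and unit name:
+
+```
+$ sudo systemctl stop genenetwork-production.service   # hypothetical unit name
+$ cp /var/lib/genenetwork/sqlite/gn-auth/auth.db{,.bak}
+$ scp production-host:/var/lib/genenetwork/sqlite/gn-auth/auth.db \
+    /var/lib/genenetwork/sqlite/gn-auth/
+$ sudo systemctl start genenetwork-production.service
+```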
+
+## Set Up the Database
+
+=> /topics/systems/restore-backups Extract the latest database from the backups.
+=> /topics/deploy/installation Configure MariaDB according to this document.
+
+## Set Up the File System
+
+* TODO: List the necessary directories and describe what purpose each serves. This will be from the perspective of the container; actual paths on the host system are left to the builder's choice, and can vary wildly.
+* TODO: Prefer explicit binding rather than implicit: it makes the shell scripts longer, but no assumptions have to be made, and everything is explicitly spelled out.
+
+The container(s) need access to various files and directories from the host system in order to work correctly.
+
+Filesystem bindings could be linked to wildly different paths on different physical host machines; therefore, we shall examine the bindings from the point of view of the paths within the container, rather than forcing a particular file system layout on the host systems themselves.
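+
+For instance, with `guix system container`, bindings are spelled out explicitly with --share (read-write) and --expose (read-only); a hypothetical invocation, with made-up host paths and config file name, might look like:
+
+```
+$ script=$(guix system container production.scm \
+    --share=/home/gn/var-genenetwork=/var/genenetwork \
+    --expose=/home/gn/virtuoso-data=/export/data/virtuoso)
+$ sudo $script
+```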
+
+Each of the sections below details a specific binding:
+
+### /var/genenetwork
+
+This binding must be READWRITE within the container.
+
+The purpose is to hold various files that are specific to the genenetwork system(s). Examples of the files are:
+
+* "gn-meta" and "synteny" files for GN3
+* genotype files
+* session files for various systems (GN2, gn-uploader, etc.)
+
+### /var/lib/acme
+
+This binding must be READWRITE within the container.
+
+This is used to store TLS certificates for the various services within the container by the ACME (Automatic Certificate Management Environment) script.
+
+### /var/lib/redis
+
+This binding must be READWRITE within the container.
+
+This is used by the redis daemon to persist its state(s).
+
+### /var/lib/virtuoso
+
+This binding must be READWRITE within the container.
+
+Used by the virtuoso daemon to save its state, and maybe some log files.
+
+### /export/data/virtuoso/
+
+This binding must be READONLY within the container. (Really?)
+
+This is used for importing data into virtuoso, say by sharing Turtle (TTL) files within the binding.
+
+---
+At this point the binding is READONLY because any TTL files to load are imported from outside the container. If the transformation of data from MariaDB to TTL form is built into the production containers down the line, then this might change to READWRITE to allow the transformation tool to write to it.
+
+### /var/log
+
+This binding must be READWRITE within the container.
+
+Allows logs from various services running in the container to be accessible on the host system. This is useful for debugging issues with the running systems.
+
+### /etc/genenetwork
+
+This binding must be READWRITE within the container.
+
+Useful for storing various configuration files/data for the service(s) running inside the running container.
+
+### /var/lib/xapian
+
+This binding must be READWRITE within the container.
+
+Stores the processed search indexes for the xapian search system.
+
+### /var/lib/genenetwork/sqlite/gn-auth
+
+This binding must be READWRITE within the container.
+
+The authorisation database is stored here. The directory needs to be writable to avoid permissions issues within the container when attempting to write data into the database.
+
+### /var/lib/genenetwork/sqlite/genenetwork3
+
+This binding must be READWRITE within the container.
+
+This stores various SQLite databases in use with GN3. These are:
+
+* Database for the GNQA system
+* ...
+
+### /run/mysqld
+
+This binding must be READWRITE within the container.
+
+This binding is the link to the host directory that holds the socket file for the running MariaDB instance.
+
+### /opt/gn/tmp
+
+This binding must be READWRITE within the container.
+
+Holds temporary files for the various services that run within the container. Some of the generated files from various services are also stored here.
+
+**PROPOSAL**: Move all generated files here, or have a dedicated directory for holding generated files?
+
+
+### /var/genenetwork/sessions
+
+This binding must be READWRITE within the container.
+
+Holds session files for various services within the container. See also the /var/genenetwork binding.
+
+### /var/lib/genenetwork/uploader
+
+This binding must be READWRITE within the container.
+
+**gn-uploader** specific data files. Types of data files that could go here are:
+
+* File uploads
+* (Reusable) Cache files and generated files
+* ... others?
+
+### /var/lib/genenetwork/sqlite/gn-uploader
+
+This binding must be READWRITE within the container.
+
+Holds various SQLite databases used with the **gn-uploader** service, e.g.:
+
+* Background jobs database
+* ...
+
+### /var/lib/genenetwork/gn-guile
+
+This binding must be READWRITE within the container.
+
+Various data files for the **gn-guile** service, such as:
+
+* The bare **gn-docs** repository (Previously bound at `/export/data/gn-docs`: now deprecated).
+
+## Redis
+
+We currently (2025-06-11) use Redis for:
+
+* Tracking user collection (this will be moved to SQLite database)
+* Tracking background jobs (this is being moved out to SQLite databases)
+* Tracking running-time (not sure what this is about)
+* Others?
+
+We do need to copy over the redis save file whenever we do a migration, at least until the user collections and background jobs features have been moved completely out of Redis.
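+
+The copy itself is straightforward (a sketch: dump.rdb is Redis's default save file, and /var/lib/redis is the binding described above):
+
+```
+# with redis (or the whole container) stopped on both ends:
+$ scp old-host:/var/lib/redis/dump.rdb /var/lib/redis/
+```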
+
+## Container Configurations: Secrets
+
+* TODO: Detail how to extract/restore the existing secrets configurations in the new machine
+
+## Build Production Container
+
+* TODO: Add notes on building
+* TODO: Add notes on setting up systemd
+
+## NGINX
+
+* TODO: Add notes on streaming and configuration of it thereof
+
+## SSL Certificates
+
+* TODO: Add notes on acquisition and setup of SSL certificates
+
+## DNS
+
+* TODO: Migrate DNS settings
diff --git a/topics/deploy/uthsc-email.gmi b/topics/deploy/uthsc-email.gmi
new file mode 100644
index 0000000..05f2ba5
--- /dev/null
+++ b/topics/deploy/uthsc-email.gmi
@@ -0,0 +1,64 @@
+# UTHSC E-mail
+
+Like many organizations, UT uses Outlook and Exchange for their E-mail. Thanks to mobile support, it is possible to work with E-mail using other tools outside the UT network.
+
+## Prospect E-mail client
+
+People have had success using Prospect as an E-mail client. You can follow the instructions on the UT website, which are similar to those for Android support.
+
+## Davmail IMAP bridge
+
+An interesting solution is to create an IMAP bridge. It is a little slower, but it works! That way you can use your favorite E-mail tool (and filters!).
+
+I have had success setting up davmail with the following settings and testing with thunderbird first:
+
+```
+apt install davmail openjfx thunderbird
+```
+
+Start davmail with
+
+```
+davmail -d
+```
+
+Stop davmail and edit the "~/.davmail.properties" file with the following:
+
+```
+davmail.mode=O365Interactive
+davmail.url=https://outlook.office365.com/EWS/Exchange.asmx
+davmail.oauth.clientId=d3590ed6-52b3-4102-aeff-aad2292ab01c
+davmail.enableOauth2=true
+davmail.oauth.deviceCode=true
+davmail.oauth.enableOauth2=true
+davmail.oauth.redirectUri=urn:ietf:wg:oauth:2.0:oob
+davmail.oauth.tenantId=common
+davmail.imapPort=1143
+davmail.smtpPort=1025
+davmail.logFilePath=/home/yours/.davmail/davmail.log
+log4j.logger.httpclient.wire=DEBUG
+log4j.rootLogger=DEBUG
+log4j.logger.org.apache.http.wire=DEBUG
+```
+
+Restart davmail and point thunderbird to
+
+```
+IMAP Server: localhost:1143
+SMTP Server: localhost:1025
+Username: your-email@uthsc.edu
+```
+
+Note that you should set the UT password in the 2FA browser when it pops up. Do *not* set it in Thunderbird, even when it asks for it to send out SMTP.
+
+When something fails, check the log in ~/.davmail/davmail.log
+
+## Using Mutt
+
+Some useful links:
+
+=> https://jonathanh.co.uk/blog/exchange-mutt/
+=> https://movementarian.org/blog/posts/mutt-and-office365/
+=> https://www.vanormondt.net/~peter/blog/2021-03-16-mutt-office365-mfa.html
+
+If someone can get the last one to work, we won't even need davmail any more!
diff --git a/topics/deploy/uthsc-vpn-with-free-software.gmi b/topics/deploy/uthsc-vpn-with-free-software.gmi
index 344772c..aeba322 100644
--- a/topics/deploy/uthsc-vpn-with-free-software.gmi
+++ b/topics/deploy/uthsc-vpn-with-free-software.gmi
@@ -6,10 +6,24 @@ It is possible to connect to the UTHSC VPN using only free software. For this, y
 
 To connect, run openconnect-sso as follows. A browser window will pop up for you to complete the Duo authentication. Once done, you will be connected to the VPN.
 ```
-$ openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC
+$ openconnect-sso --server vpn-server --authgroup UTHSC
 ```
 Note that openconnect-sso should be run as a regular user, not as root. After passing Duo authentication, openconnect-sso will try to gain root privileges to set up the network routes. At that point, it will prompt you for your password using sudo.
 
+## Recommended way
+
+The recommended way is to use Arun's G-expression setup with Guix (see below). It should just work, provided you have the certificate chain (which you can get from the browser or from one of us) and point to the right server. Simply run:
+
+```
+$(guix build -f uthsc-vpn.scm)
+```
+
+See
+
+=> ./uthsc-vpn.scm
+
+Get the final details from us; UT does not like it when we put them online, even though there is no real risk.
+
 ## Avoid tunneling all your network traffic through the VPN (aka Split Tunneling)
 
 openconnect, by default, tunnels all your traffic through the VPN. This is not good for your privacy. It is better to tunnel only the traffic destined to the specific hosts that you want to access. This can be done using the vpn-slice script.
@@ -17,7 +31,7 @@ openconnect, by default, tunnels all your traffic through the VPN. This is not g
 
 For example, to connect to the UTHSC VPN but only access the hosts tux01 and tux02e through the VPN, run the following command.
 ```
-$ openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC -- --script 'vpn-slice tux01 tux02e'
+$ openconnect-sso --server vpn-server --authgroup UTHSC -- --script 'vpn-slice tux01 tux02e'
 ```
 The vpn-slice script looks up the hostnames tux01 and tux02e on the VPN DNS and adds /etc/hosts entries and routes to your system. vpn-slice can also set up more complicated routes. To learn more, read the vpn-slice documentation.
 
@@ -44,50 +58,50 @@ export OPENSSL_CONF=/tmp/openssl.cnf
 ```
 Then, run the openconnect-sso client as usual.
 
-## Putting it all together using Guix G-expressions
+## Misconfigured UTHSC TLS certificate
 
-Remembering to do all these steps is a hassle. Writing a shell script to automate this is a good idea, but why write shell scripts when we have G-expressions! Here's a G-expression script that I prepared earlier.
-=> uthsc-vpn.scm
-Download it, tweak the %hosts variable to specify the hosts you are interested in, and run it like so:
+The UTHSC TLS certificate does not validate on some systems. You can work around this by downloading the certificate chain and adding it to your system:
+* Navigate with a browser to https://vpn-server/. Inspect the certificate in the browser (lock icon next to the search bar) and export a .pem file
+* Move it to /usr/local/share/ca-certificates (with .crt extension) or equivalent
+* On Debian/Ubuntu, update the certificate store with update-ca-certificates (see the sketch below)
+You should see
 ```
-$(guix build -f uthsc-vpn.scm)
+Updating certificates in /etc/ssl/certs...
+1 added, 0 removed; done.
 ```
+Thanks Niklas. See also
+=> https://superuser.com/a/719047/914881
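+
+For reference, the install-and-update steps might look like this (the certificate filename is an example):
+
+```
+sudo cp uthsc-vpn.pem /usr/local/share/ca-certificates/uthsc-vpn.crt
+sudo update-ca-certificates
+```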
 
-# Troubleshooting
-
-Older versions would not show a proper dialog for sign-in. Try
-
+However, adding certificates to your system manually is not good security practice. It is better to limit the added certificate to the openconnect process. You can do this using the REQUESTS_CA_BUNDLE environment variable like so:
 ```
-export QTWEBENGINE_CHROMIUM_FLAGS=--disable-seccomp-filter-sandbox
+REQUESTS_CA_BUNDLE=/path/to/uthsc/certificate.pem openconnect-sso --server vpn-server --authgroup UTHSC
 ```
 
-## Update certificate
-
-When the certificate expires you can download the new one with:
+## Putting it all together using Guix G-expressions
 
-* Navigate with browser to https://uthscvpn1.uthsc.edu/. Inspect the certificate in the browser (lock icon next to search bar) and export .pem file
-* Move it to /usr/local/share/ca-certificates (with .crt extension) or equivalent
-* On Debian/Ubuntu update the certificate store with update-ca-certificates
+Remembering to do all these steps is a hassle. Writing a shell script to automate this is a good idea, but why write shell scripts when we have G-expressions! Here's a G-expression script that I prepared earlier.
+=> uthsc-vpn.scm
+Download it, download the UTHSC TLS certificate chain to uthsc-certificate.pem, tweak the %hosts variable to specify the hosts you are interested in, and run it like so:
+```
+$(guix build -f uthsc-vpn.scm)
+```
 
-You should see
+To add a route by hand afterwards, you can do:
 
 ```
-Updating certificates in /etc/ssl/certs...
-1 added, 0 removed; done.
+ip route add 172.23.17.156 dev tun0
 ```
 
-Thanks Niklas. See also
-
-=> https://superuser.com/a/719047/914881
+# Troubleshooting
 
-On GUIX you may need to point to the updated certificates file with:
+Older versions would not show a proper dialog for sign-in. Try
 
 ```
-env REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt  openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC
+export QTWEBENGINE_CHROMIUM_FLAGS=--disable-seccomp-filter-sandbox
 ```
 
 ## Acknowledgement
 
-Many thanks to Pjotr Prins and Erik Garrison without whose earlier work this guide would not be possible.
+Many thanks to Arun, Pjotr and Erik without whose earlier work this guide would not be possible.
 => https://github.com/pjotrp/linux-at-university-of-tennessee
 => https://github.com/ekg/openconnect-sso-docker
diff --git a/topics/deploy/uthsc-vpn.scm b/topics/deploy/uthsc-vpn.scm
index c714731..f204cdf 100644
--- a/topics/deploy/uthsc-vpn.scm
+++ b/topics/deploy/uthsc-vpn.scm
@@ -1,11 +1,30 @@
-(use-modules ((gnu packages guile-xyz) #:select (guile-ini guile-lib guile-smc))
-             ((gnu packages vpn) #:select (openconnect-sso vpn-slice))
-             (guix gexp))
+(use-modules ((gnu packages check)
+              #:select (python-pytest python-pytest-asyncio python-pytest-httpserver))
+             ((gnu packages freedesktop) #:select (python-pyxdg))
+             ((gnu packages python-build) #:select (python-poetry-core python-toml))
+             ((gnu packages python-crypto)
+              #:select (python-keyring python-pyotp))
+             ((gnu packages python-web) #:select (python-requests python-urllib3))
+             ((gnu packages python-xyz)
+              #:select (python-attrs python-charset-normalizer
+                                     python-colorama python-prompt-toolkit python-pysocks
+                                     python-structlog))
+             ((gnu packages guile-xyz) #:select (guile-ini guile-lib guile-smc))
+             ((gnu packages qt) #:select (python-pyqt-6 python-pyqtwebengine-6))
+             ((gnu packages vpn) #:select (openconnect vpn-slice))
+             ((gnu packages xml) #:select (python-lxml-4.9))
+             (guix build-system pyproject)
+             (guix build-system python)
+             (guix download)
+             (guix gexp)
+             (guix git-download)
+             ((guix licenses) #:prefix license:)
+             (guix packages))
 
 ;; Put in the hosts you are interested in here.
 (define %hosts
   (list "octopus01"
-        "tux01.genenetwork.org"))
+        "spacex"))
 
 (define (ini-file name scm)
   "Return a file-like object representing INI file with @var{name} and
@@ -19,6 +38,127 @@
                        (call-with-output-file #$output
                          (cut scm->ini #$scm #:port <>))))))
 
+(define python-urllib3-1.26
+  (package
+    (inherit python-urllib3)
+    (version "1.26.15")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (pypi-uri "urllib3" version))
+       (sha256
+        (base32
+         "01dkqv0rsjqyw4wrp6yj8h3bcnl7c678qkj845596vs7p4bqff4a"))))
+    (build-system python-build-system)))
+
+(define python-charset-normalizer-2.10
+  (package
+    (inherit python-charset-normalizer)
+    (version "2.1.0")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (pypi-uri "charset-normalizer" version))
+       (sha256
+        (base32 "04zlajr77f6c7ai59l46as1idi0jjgbvj72lh4v5wfpz2s070pjp"))))
+    (build-system python-build-system)
+    (arguments (list))
+    (native-inputs
+     (modify-inputs (package-native-inputs python-charset-normalizer)
+       (delete "python-setuptools")))))
+
+(define python-requests-2.28
+  (package
+    (inherit python-requests)
+    (name "python-requests")
+    (version "2.28.1")
+    (source (origin
+              (method url-fetch)
+              (uri (pypi-uri "requests" version))
+              (sha256
+               (base32
+                "10vrr7bijzrypvms3g2sgz8vya7f9ymmcv423ikampgy0aqrjmbw"))))
+    (build-system python-build-system)
+    (arguments (list #:tests? #f))
+    (native-inputs (list))
+    (propagated-inputs
+     (modify-inputs (package-propagated-inputs python-requests)
+       (replace "python-charset-normalizer" python-charset-normalizer-2.10)
+       (replace "python-urllib3" python-urllib3-1.26)))))
+
+(define-public openconnect-sso
+  (package
+    (name "openconnect-sso")
+    ;; 0.8.0 was released in 2021, the latest update on master HEAD is from
+    ;; 2023.
+    (properties '((commit . "94128073ef49acb3bad84a2ae19fdef926ab7bdf")
+                  (revision . "0")))
+    (version (git-version "0.8.0"
+                          (assoc-ref properties 'revision)
+                          (assoc-ref properties 'commit)))
+    (source
+      (origin
+        (method git-fetch)
+        (uri (git-reference
+               (url "https://github.com/vlaci/openconnect-sso")
+              (commit (assoc-ref properties 'commit))))
+        (file-name (git-file-name name version))
+        (sha256
+         (base32 "08cqd40p9vld1liyl6qrsdrilzc709scyfghfzmmja3m1m7nym94"))))
+    (build-system pyproject-build-system)
+    (arguments
+     `(#:phases
+       (modify-phases %standard-phases
+          (add-after 'unpack 'use-poetry-core
+            (lambda _
+              ;; Patch to use the core poetry API.
+              (substitute* "pyproject.toml"
+                (("poetry.masonry.api")
+                 "poetry.core.masonry.api"))))
+         (add-after 'unpack 'patch-openconnect
+           (lambda* (#:key inputs #:allow-other-keys)
+             (substitute* "openconnect_sso/app.py"
+               (("\"openconnect\"")
+                (string-append "\""
+                               (search-input-file inputs "/sbin/openconnect")
+                               "\""))))))))
+    (inputs
+     (list openconnect
+           python-attrs
+           python-colorama
+           python-keyring
+           python-lxml-4.9
+           python-prompt-toolkit
+           python-pyotp
+           python-pyqt-6
+           python-pyqtwebengine-6
+           python-pysocks
+           python-pyxdg
+           python-requests
+           python-structlog
+           python-toml))
+    (native-inputs
+     (list python-poetry-core
+           python-pytest
+           python-pytest-asyncio
+           python-pytest-httpserver))
+    (home-page "https://github.com/vlaci/openconnect-sso")
+    (synopsis "OpenConnect wrapper script supporting Azure AD (SAMLv2)")
+    (description
+     "This package provides a wrapper script for OpenConnect supporting Azure AD
+(SAMLv2) authentication to Cisco SSL-VPNs.")
+    (license license:gpl3)))
+
+;; Login to the UTHSC VPN fails with an SSLV3_ALERT_HANDSHAKE_FAILURE
+;; on newer python-requests.
+(define openconnect-sso-uthsc
+  (package
+    (inherit openconnect-sso)
+    (name "openconnect-sso-uthsc")
+    (inputs
+     (modify-inputs (package-inputs openconnect-sso)
+       (replace "python-requests" python-requests-2.28)))))
+
 (define uthsc-vpn
   (with-imported-modules '((guix build utils))
     #~(begin
@@ -34,8 +174,10 @@
                                  ("system_default" . "system_default_sect"))
                                 ("system_default_sect"
                                  ("Options" . "UnsafeLegacyRenegotiation")))))
-        (invoke #$(file-append openconnect-sso "/bin/openconnect-sso")
-                "--server" "uthscvpn1.uthsc.edu"
+        (setenv "REQUESTS_CA_BUNDLE"
+                #$(local-file "uthsc-certificate.pem"))
+        (invoke #$(file-append openconnect-sso-uthsc "/bin/openconnect-sso")
+                "--server" "$vpn-server" ; ask us for end-point or see UT docs
                 "--authgroup" "UTHSC"
                 "--"
                 "--script" (string-join (cons #$(file-append vpn-slice "/bin/vpn-slice")
diff --git a/topics/documentation/guides_vs_references.gmi b/topics/documentation/guides_vs_references.gmi
new file mode 100644
index 0000000..7df0be2
--- /dev/null
+++ b/topics/documentation/guides_vs_references.gmi
@@ -0,0 +1,24 @@
+# Guides Vs References
+
+Before coming up with docs, figure out their use: either a guide (provides solutions to problems encountered) or a reference (similar to man pages, where we provide detailed explanations).
+
+## For guides:
+
+* Be as brief as possible, providing reference links for users that want to explore, i.e. don't aim for completeness, but rather practicality.
+* Prefer providing code or command snippets where possible.
+* Preferably have another team member review the docs. This helps eliminate blindspots due to our current knowledge.
+* Organize the document in such a way that it starts with the most actionable steps.
+* Avoid stream of consciousness writing.
+
+### Example
+
+Wrong:
+
+When setting up guix OS, I couldn't get `tmux` to start, getting `tmux: invalid LC_ALL, LC_CTYPE or LANG`. Running `locale -a` failed too. It took me a while to figure out the solution for this problem, and I attempted to reinstall `glibc-locales` which didn't help. After a lot of research, I found that the root cause was that my applications were built on a different version of `glibc`. I ran `guix update` and the problem disappeared.
+
+Correct:
+
+`tmux` failing with `tmux: invalid LC_ALL, LC_CTYPE or LANG` could be caused by having packages built on a different version of `glibc`. Attempt:
+
+> locale -a # should also fail
+> guix update # rebuilds your packages with your current glibc
diff --git a/topics/editing/case-attributes.gmi b/topics/editing/case-attributes.gmi
new file mode 100644
index 0000000..1a86131
--- /dev/null
+++ b/topics/editing/case-attributes.gmi
@@ -0,0 +1,110 @@
+# Editing Case-Attributes
+
+## Tags
+
+* type: document
+* keywords: case-attribute, editing
+* assigned: fredm, zachs, acenteno, bonfacem
+* status: requirements gathering
+
+## Introduction
+
+Case-attributes are metadata for samples.  They include: sex, age, etc. of the various individuals, and exist separately from "normal" traits mainly because they're non-numeric.  On the GN2 trait page, they are shown as extra columns under the "Reviews and Edit Data" section.
+
+Case-attributes are determined at the group-level.  E.g. for BXD, case attributes would apply at the level of each sample, across all BXD data.  Every strain has a unique attribute and it's fixed, not variable.
+
+We need to differentiate these two things:
+
+* Case-Attribute labels/names/categories (e.g. Sex, Height, Cage-handler, etc)
+* Case-Attribute values (e.g. Male/Female, 20cm, Frederick, etc.)
+
+Currently, both labels and values are set at the group level:
+
+=> https://github.com/genenetwork/genenetwork1/blob/0f170f0b748a4e10eaf8538f6bcbf88b573ce8e7/web/webqtl/showTrait/DataEditingPage.py Case-Attributes on GeneNetwork1
+is a good starting point to help with understanding how case-attributes were implemented and how they worked.
+
+A critical bug existed where editing one case-attribute affected all case-attributes defined for a group.
+
+Case attributes can have the following data-types:
+
+* Free-form text (no constraints) - see the `Status` column
+* Enumerations - textual data, but where the user can only pick from specific values
+* Links - The value displayed also acts as a link - e.g. the 'JAX:*' values in the `RRID` column
+
+## HOWTO
+
+Example SQL query to fetch case-attribute data:
+
+```
+SELECT
+	caxrn.*, ca.Name AS CaseAttributeName,
+	ca.Description AS CaseAttributeDescription,
+	iset.InbredSetId AS OrigInbredSetId
+FROM
+	CaseAttribute AS ca INNER JOIN CaseAttributeXRefNew AS caxrn
+	ON ca.Id=caxrn.CaseAttributeId
+INNER JOIN
+      StrainXRef AS sxr
+      ON caxrn.StrainId=sxr.StrainId
+INNER JOIN
+      InbredSet AS iset
+      ON sxr.InbredSetId=iset.InbredSetId
+WHERE
+	caxrn.value != 'x'
+	AND caxrn.value IS NOT NULL;
+```
+
+CaseAttributeXRefNew differs from CaseAttributeXRef:
+
+```
+mysql> describe CaseAttributeXRef;
++------------------+----------------------+------+-----+---------+-------+
+| Field            | Type                 | Null | Key | Default | Extra |
++------------------+----------------------+------+-----+---------+-------+
+| ProbeSetFreezeId | smallint(5) unsigned | NO   | PRI | 0       |       |
+| StrainId         | smallint(5) unsigned | NO   | PRI | 0       |       |
+| CaseAttributeId  | smallint(5)          | NO   | PRI | 0       |       |
+| Value            | varchar(100)         | NO   |     |         |       |
++------------------+----------------------+------+-----+---------+-------+
+4 rows in set (0.01 sec)
+
+mysql> describe CaseAttributeXRefNew;
++-----------------+------------------+------+-----+---------+-------+
+| Field           | Type             | Null | Key | Default | Extra |
++-----------------+------------------+------+-----+---------+-------+
+| InbredSetId     | int(5) unsigned  | NO   | PRI | NULL    |       |
+| StrainId        | int(20) unsigned | NO   | PRI | NULL    |       |
+| CaseAttributeId | int(5) unsigned  | NO   | PRI | NULL    |       |
+| Value           | varchar(100)     | NO   |     | NULL    |       |
++-----------------+------------------+------+-----+---------+-------+
+4 rows in set (0.01 sec)
+```
+
+=> https://github.com/genenetwork/genenetwork3/blob/dd0b29c07017ec398c447ca683dd4b4be18d73b7/scripts/update-case-attribute-tables-20230818 Script to update CaseAttribute and CaseAttributeXRefNew table
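+
+As a sketch, editing a single case-attribute value would key on the composite (InbredSetId, StrainId, CaseAttributeId) shown above; the database name and all identifiers and values below are hypothetical:
+
+```
+mysql db_webqtl -e "UPDATE CaseAttributeXRefNew SET Value='F' WHERE InbredSetId=1 AND StrainId=2 AND CaseAttributeId=3;"
+```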
+
+## Tasks
+
+* @bmunyoki: Model case-attributes correctly in RDF.
+* @bmunyoki, @zachs: Implement case-attributes editing in GN3 that correctly models case-attributes at the group-level, i.e. CRUD operations with the correct authorization.  People who can edit sample data should not automatically be able to edit case-attributes, because case-attributes are defined at the group level, and editing them affects all samples in the group.
+* @rob: Confirm to team whether "N" and "SE" are case-attributes.  @bmunyoki AFAICT, no.
+
+
+A possible set of privileges, subject to discussion:
+
+* group:resource:add-case-attributes - Allows user to add a completely new case attribute
+* group:resource:edit-case-attributes - Allows user to edit an existing case attribute
+* group:resource:delete-case-attributes - Allows user to delete an existing case attribute
+* group:resource:view-case-attributes - Allows user to view case attributes and their value
+
+Given groups are not directly linked to any auth resource, we may introduce some level of indirection.  Adding a new resource type that handles groups may solve this.
+
+## See Also
+
+=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$myIoafLp_dIONnyNvEI0k2xf3Y8-LyiI_mkP2vBN08o?via=matrix.org Discussion on Case-Attributes Editing in Matrix
+=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$P6SNnpY-nAZsDr3VZlRi05m6MT32lXBsCl-BYLh-YLM?via=matrix.org More Discussion on Matrix
+=> /issues/case-attr-edit-error Case Attribute Editing Problems
+=> /issues/fix-case-attribute-work Fix Case Attribute Work (Same Columns)
+=> /issues/fix-case-attribute-editing Editing Case Attribute
+=> /issues/consecutive-crud-applications-when-uploading-data Fix Case Attribute Work (Consecutive CRUD applications)
+=> /issues/edit-metadata-bugs Cannot Edit Metadata of BXD Traits Effectively
+=> /topics/data-uploads/datasets Some Historical Context
diff --git a/topics/editing/case_attributes.gmi b/topics/editing/case_attributes.gmi
deleted file mode 100644
index 5a11026..0000000
--- a/topics/editing/case_attributes.gmi
+++ /dev/null
@@ -1,180 +0,0 @@
-# Editing Case-Attributes
-
-## Tags
-
-* type: document
-* keywords: case-attribute, editing
-* assigned: fredm, zachs, acenteno
-* status: requirements gathering
-
-## Introduction
-
-Case-Attributes are essentially the metadata for the samples. In the GN2 system, they are the extra columns in the table in the "Reviews and Edit Data" accordion tab besides the value and its error margin.
-
-To quote @zachs
-
-> "Case Attributes" are basically just sample metadata. So stuff like the sex, age, etc of the various individuals (and exist separately from "normal" traits mainly because they're non-numeric)
-
-They are the metadata for the various sample in a trait. The case attributes are determined at the group-level:
-
-> Since they're metadata (or "attributes" in this case) for samples, they're group-level so for BXD, case attributes would apply at the level of each sample, across all BXD data
-
-Also From email:
-> Every strain has a unique attribute and it's fixed, not variable.
-
-## Direction
-
-We need to differentiate two things:
-* Case-Attribute labels/names/categories (e.g. Sex, Height, Cage-handler, etc)
-* Case-Attribute values (e.g. Male/Female, 20cm, Frederick, etc.)
-
-As is currently implemented (as of before 2023-08-31), both the labels and values are set at group level.
-
-A look at
-=> https://github.com/genenetwork/genenetwork1/blob/0f170f0b748a4e10eaf8538f6bcbf88b573ce8e7/web/webqtl/showTrait/DataEditingPage.py Case-Attributes on GeneNetwork1
-is a good starting point to help with understanding how case-attributes were implemented and how they worked.
-
-## Status
-
-There is code that existed for the case-attributes editing, but it had a critical bug where the data for existing attributes would be deleted/replaced randomly when one made a change. This lead to a pause in this effort.
-
-The chosen course of action will, however, not make use of this existing code. Instead, we will reimplement the feature with code in GN3, exposing the data and its editing via API endpoints.
-
-## Database
-
-The existing database tables of concern to us are:
-
-* InbredSet
-* CaseAttribute
-* StrainXRef
-* Strain
-* CaseAttributeXRefNew
-
-We can fetch case-attribute data from the database with:
-
-```
-SELECT
-	caxrn.*, ca.Name AS CaseAttributeName,
-	ca.Description AS CaseAttributeDescription,
-	iset.InbredSetId AS OrigInbredSetId
-FROM
-	CaseAttribute AS ca INNER JOIN CaseAttributeXRefNew AS caxrn
-	ON ca.Id=caxrn.CaseAttributeId
-INNER JOIN
-      StrainXRef AS sxr
-      ON caxrn.StrainId=sxr.StrainId
-INNER JOIN
-      InbredSet AS iset
-      ON sxr.InbredSetId=iset.InbredSetId
-WHERE
-	caxrn.value != 'x'
-	AND caxrn.value IS NOT NULL;
-```
-
-which gives us all the information we need to rework the database schema.
-
-Since the Case-Attributes are group-level, we need to move the `InbredSetId` to the `CaseAttribute` table from the `CaseAttributeXRefNew` table.
-
-For more concrete relationship declaration, we can have the `CaseAttributeXRefNew` table have it primary key be composed of the `InbredSetId`, `StrainId` and `CaseAttributeId`. That has the added advantage that we can index the table on `InbredSetId` and `StrainId`.
-
-That leaves the `CaseAttribute` table with the following columns:
-
-* InbredSetId: Foreign Key from `InbredSet` table
-* Id: The CaseAttribute identifier
-* Name: Textual name for the Case-Attribute
-* Description: Textual description fro the case-attribute
-
-while the `CaseAttributeXRefNew` table ends up with the following columns:
-
-* InbredSetId: Foreign Key from `InbredSet` table
-* StrainId: The strain
-* CaseAttributeId: The case-attribute identifier
-* Value: The value for the case-attribute for this specific strain
-
-There will not be any `NULL` values allowed for any of the columns in both tables. If a strain has no value, we simply delete the corresponding record from the `CaseAttributeXRefNew` table.
-
-To that end, the following script has been added to ease the migration of the table schemas:
-=> https://github.com/genenetwork/genenetwork3/blob/dd0b29c07017ec398c447ca683dd4b4be18d73b7/scripts/update-case-attribute-tables-20230818
-The script is meant to be run only once, and makes the changes mentioned above for both tables.
-
-## Data Types
-
-> ... (and exist separately from "normal" traits mainly because they're non-numeric)
-
-The values for Case-Attributes are non-numeric data. This will probably be mostly textual data.
-
-As an example:
-=> https://genenetwork.org/show_trait?trait_id=10010&dataset=BXDPublish Trait Data and Analysis for BXD_10010
-we see Case-Attributes as:
-
-* Free-form text (no constraints) - see the `Status` column
-* Enumerations - textual data, but where the user can only pick from specific values
-* Links - The value displayed also acts as a link - e.g. the 'JAX:*' values in the `RRID` column
-
-
-=> https://genenetwork.org/show_trait?trait_id=10002&dataset=CCPublish For this trait
-
-We see:
-* Numeric data - see the `N` and `SE` columns
-though that might be a misunderstanding of the quote
-
-> In the following link for example, every column after Value is a case attribute - https://genenetwork.org/show_trait?trait_id=10010&dataset=BXDPublish
-
-**TODO**: Verify whether `N` and `SE` are Case-Attributes
-
-## Authorisation
-
-From email:
-> it's probably not okay to let anyone who can edit sample data for a trait also edit case attributes, since they're group level
-
-and from matrix:
-> The weird bug aside, Bonface had (mostly) successfully implemented editing these through the CSV files in the same way as any other sample data, but for authorization reasons this probably doesn't make sense (since a user having access to editing sample data for specific traits doesn't imply that they'd have access for editing case attributes across the entire group)
-
-From this, it implies we might need a new set of privileges for dealing with case-attributes, e.g.
-* group:resource:add-case-attributes - Allows user to add a completely new case attribute
-* group:resource:edit-case-attributes - Allows user to edit an existing case attribute
-* group:resource:delete-case-attributes - Allows user to delete an existing case attribute
-* group:resource:view-case-attributes - Allows user to view case attributes and their value
-
-Considering, however, that groups (InbredSets) are not directly linked to any auth resource, this might mean some indirection of sorts, or maybe add a new resource type that handles groups.
-
-## Features
-
-* Editing existing case-attributes: YES
-* Adding new case attributes: ???
-* Deleting existing case attributes: ???
-
-Strains/samples are shared across traits. The values for the case attributes are the same for a particular strain/sample for all traits within a particular InbredSet (group).
-
-## Related and Unsynthesised Chats
-
-=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$myIoafLp_dIONnyNvEI0k2xf3Y8-LyiI_mkP2vBN08o?via=matrix.org
-```
-Zachary SloanZ
-I'm pretty sure multiple phenotypes and mRNA datasets can belong to the same experiment (and definitely for the purposes of case attributes
-since the mRNA datasets are split by tissue
-genotype traits should all be considered part of the same "experiment" (at least as long as we're still only databasing a single genotype file for each group)
-
-pjotrp
-: Case attribute editing will still need to be group level, at least until the whole feature is completely changed. Since they're basically just phenotypes we choose to show in the trait page table, and phenotypes are at the group level
-```
-
-=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$P6SNnpY-nAZsDr3VZlRi05m6MT32lXBsCl-BYLh-YLM?via=matrix.org
-```
-Zachary SloanZ
-21:14
-Groups are defined by their list of samples/strains, and the "case attributes" are just "the characteristics of those samples/strains we choose to show on the trait page" (if we move away from the "group" concept entirely that could change, but if we did that we probably would also replace "case attributes" with something else because the way that's implemented is kind of weird to begin with)
-ZB
-```
-
-## Related issues
-
-=> /issues/case-attr-edit-error
-=> /issues/fix-case-attribute-work
-=> /issues/fix-case-attribute-editing
-=> /issues/consecutive-crud-applications-when-uploading-data
-=> /issues/edit-metadata-bugs
-
-## References
-
-=> /topics/data-uploads/datasets
diff --git a/topics/engineering/improving-wiki-rif-search-in-genenetwork.gmi b/topics/engineering/improving-wiki-rif-search-in-genenetwork.gmi
new file mode 100644
index 0000000..74e7178
--- /dev/null
+++ b/topics/engineering/improving-wiki-rif-search-in-genenetwork.gmi
@@ -0,0 +1,119 @@
+# Improving RIF+WIKI Search
+
+* author: bonfacem
+* reviewed-by: jnduli
+
+At the time of this writing, WIKI and/or RIF search is extremely slow in MySQL, e.g. searching "WIKI=nicotine MEAN=(12.103 12.105)" causes an Nginx time-out in Genenetwork2.  This blog discusses how we improved the WIKI+RIF search using XAPIAN, and some of our key learnings.
+
+### TLDR; Key Learnings from Adding RIF+WIKI to the Index
+
+* xapian-compacting is IO bound.
+* Instrument your indexing script and choose a parallel process count that fits your needs.
+* Do NOT store positional data unless you need it.
+* Consider stemming your data and removing stop-words ahead of indexing.
+
+### Slow MySQL Performance
+
+When indexing genes, we have a complex query [0] which returns 48,308,714 rows.
+
+Running an "EXPLAIN" on [0] yields:
+
+```
+1  +------+-------------+----------------+--------+-----------------------------+------------------+---------+------------------------------------------------------------+-------+-------------+
+2  | id   | select_type | table          | type   | possible_keys               | key              | key_len | ref                                                        | rows  | Extra       |
+3  +------+-------------+----------------+--------+-----------------------------+------------------+---------+------------------------------------------------------------+-------+-------------+
+4  |    1 | SIMPLE      | ProbeSetFreeze | ALL    | PRIMARY                     | NULL             | NULL    | NULL                                                       | 931   |             |
+5  |    1 | SIMPLE      | ProbeFreeze    | eq_ref | PRIMARY                     | PRIMARY          | 2       | db_webqtl.ProbeSetFreeze.ProbeFreezeId                     | 1     | Using where |
+6  |    1 | SIMPLE      | Tissue         | eq_ref | PRIMARY                     | PRIMARY          | 2       | db_webqtl.ProbeFreeze.TissueId                             | 1     |             |
+7  |    1 | SIMPLE      | InbredSet      | eq_ref | PRIMARY                     | PRIMARY          | 2       | db_webqtl.ProbeFreeze.InbredSetId                          | 1     | Using where |
+8  |    1 | SIMPLE      | Species        | eq_ref | PRIMARY                     | PRIMARY          | 2       | db_webqtl.InbredSet.SpeciesId                              | 1     |             |
+9  |    1 | SIMPLE      | ProbeSetXRef   | ref    | ProbeSetFreezeId,ProbeSetId | ProbeSetFreezeId | 2       | db_webqtl.ProbeSetFreeze.Id                                | 27287 |             |
+10 |    1 | SIMPLE      | ProbeSet       | eq_ref | PRIMARY                     | PRIMARY          | 4       | db_webqtl.ProbeSetXRef.ProbeSetId                          | 1     |             |
+11 |    1 | SIMPLE      | Geno           | eq_ref | species_name                | species_name     | 164     | db_webqtl.InbredSet.SpeciesId,db_webqtl.ProbeSetXRef.Locus | 1     | Using where |
++------+-------------+----------------+--------+-----------------------------+------------------+---------+------------------------------------------------------------+-------+-------------+
+```
+
+From the above table, we note that we have "ref" under the "type" column in line 9.  The "type" column describes how the rows are found in the table (i.e. the join type) [2].  In this case, "ref" means a non-unique index or prefix is used to find all the rows, which we can see by running "SHOW INDEXES FROM ProbeSetXRef" (note the Non-unique value of 1 for ProbeSetFreezeId):
+
+```
++--------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+| Table        | Non_unique | Key_name         | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
++--------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+| ProbeSetXRef |          0 | PRIMARY          |            1 | DataId           | A         |    46061750 |     NULL | NULL   |      | BTREE      |         |               |
+| ProbeSetXRef |          1 | ProbeSetFreezeId |            1 | ProbeSetFreezeId | A         |        1688 |     NULL | NULL   |      | BTREE      |         |               |
+| ProbeSetXRef |          1 | ProbeSetId       |            1 | ProbeSetId       | A         |    11515437 |     NULL | NULL   |      | BTREE      |         |               |
+| ProbeSetXRef |          1 | Locus_2          |            1 | Locus            | A         |        1806 |        5 | NULL   | YES  | BTREE      |         |               |
++--------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+```
+
+We get a performance hit on the join: "INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id" since ProbeSetXRef.ProbeSetFreezeId is a non-unique index.  What this means for our query is that for each row scanned in the ProbeSetFreeze table, there are several rows in the ProbeSetXRef table that will satisfy the JOIN condition.  This is analogous to nested loops in programming.
+
+In the RIF Search, we append "INNER JOIN GeneRIF_BASIC ON GeneRIF_BASIC.symbol = ProbeSet.Symbol" to [0].  Running an EXPLAIN on this new query yields:
+
+```
+1   +------+-------------+----------------+--------+---------------------------------------+--------------+---------+------------------------------------------------------------+---------+-----------------------+
+2   | id   | select_type | table          | type   | possible_keys                         | key          | key_len | ref                                                        | rows    | Extra                 |
+3   +------+-------------+----------------+--------+---------------------------------------+--------------+---------+------------------------------------------------------------+---------+-----------------------+
+4   |    1 | SIMPLE      | GeneRIF_BASIC  | index  | NULL                                  | symbol       | 777     | NULL                                                       | 1366287 | Using index           |
+5   |    1 | SIMPLE      | ProbeSet       | ref    | PRIMARY,symbol_IDX,ft_ProbeSet_Symbol | symbol_IDX   | 403     | func                                                       | 1       | Using index condition |
+6   |    1 | SIMPLE      | ProbeSetXRef   | ref    | ProbeSetFreezeId,ProbeSetId           | ProbeSetId   | 4       | db_webqtl.ProbeSet.Id                                      | 4       |                       |
+7   |    1 | SIMPLE      | ProbeSetFreeze | eq_ref | PRIMARY                               | PRIMARY      | 2       | db_webqtl.ProbeSetXRef.ProbeSetFreezeId                    | 1       |                       |
+8   |    1 | SIMPLE      | ProbeFreeze    | eq_ref | PRIMARY                               | PRIMARY      | 2       | db_webqtl.ProbeSetFreeze.ProbeFreezeId                     | 1       | Using where           |
+9   |    1 | SIMPLE      | InbredSet      | eq_ref | PRIMARY                               | PRIMARY      | 2       | db_webqtl.ProbeFreeze.InbredSetId                          | 1       | Using where           |
+10  |    1 | SIMPLE      | Tissue         | eq_ref | PRIMARY                               | PRIMARY      | 2       | db_webqtl.ProbeFreeze.TissueId                             | 1       |                       |
+11  |    1 | SIMPLE      | Species        | eq_ref | PRIMARY                               | PRIMARY      | 2       | db_webqtl.InbredSet.SpeciesId                              | 1       |                       |
+12  |    1 | SIMPLE      | Geno           | eq_ref | species_name                          | species_name | 164     | db_webqtl.InbredSet.SpeciesId,db_webqtl.ProbeSetXRef.Locus | 1       | Using where           |
+13  +------+-------------+----------------+--------+---------------------------------------+--------------+---------+------------------------------------------------------------+---------+-----------------------+
+```
+
+From the above, we see that we have an extra "ref" on line 5, which adds extra overhead.  Additionally, under the "ref" column we now see "func", with "Using index condition" under the "Extra" column.  This means that we are using some function during this join [3].  Specifically, this is because the "symbol" column in the GeneRIF_BASIC table is indexed, but the "Symbol" column in the ProbeSet table is not.  Regardless, this increases the performance of the query by some orders of magnitude.
+
+### Adding RIF+WIKI Search to the Existing Gene Index
+
+Our current indexer[5] works by indexing the results from [0] in chunks of 100,000 into separate xapian databases stored in different directories.  This happens by spawning different child processes from the main indexer script.  The final step in this process is to compact all the different databases into one database.
+
+To add RIF+WIKI indices to the existing gene index, we built a global cache.  In each child process, we fetch the relevant RIF+WIKI entry from this cache and index it.  This increased our indexing time and space consumption.  At one point we ran out of RAM, causing an intermittent outage on 2024-06-21 (search for "Outage for 2024-06-20" in the following link):
+
+=> https://issues.genenetwork.org/topics/meetings/jnduli_bmunyoki Meeting notes
+
+When troubleshooting the outage, we realized the indexing script had consumed all the RAM.  The child processes spawned by the index script each consumed around 3GB of RAM, and the combined RAM usage of all the child processes exceeded the system RAM.  To remedy this, we settled on a total child process count of 67, limiting the number of spawned children and putting a cap on the total amount of RAM the indexing script could consume.  You can see the fix in this commit:
+
+=> https://github.com/genenetwork/genenetwork3/commit/99d0d1200d7dcd81e27ce65ab84bab145d9ae543 feat: set 67 parallel processes to run in prod
+
+To improve our indexing speed, we attempted to parallelise the compacting.  Parallelising reduced the compacting time somewhat, but nothing significant.  On a SATA drive, compacting 3 databases (each itself compacted from 50 databases) and then merging those was significantly faster than compacting one database from 150 databases at once.  The conclusion we drew from this was that the compacting process is IO bound.  This is useful data because it informs the type of drive you would want to run the indexing script on; in our case, an NVMe drive is an ideal candidate because of its fast IO speeds.
+
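+A sketch of this two-level, parallel compaction using the xapian-compact tool (directory names and shard counts are illustrative):
+
+```
+# compact three batches of 50 shard databases in parallel
+xapian-compact build/{0..49}    tmp/part-0 &
+xapian-compact build/{50..99}   tmp/part-1 &
+xapian-compact build/{100..149} tmp/part-2 &
+wait
+# then merge the three parts into the final database
+xapian-compact tmp/part-0 tmp/part-1 tmp/part-2 final-db
+```
+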
+To reduce the index script's space consumption and improve its performance, we first removed stop-words and the most common words from the global cache, and stemmed words from the other documents.  This reduced the space footprint to 152 Gb, which was still unacceptable for our needs.  Further research into how xapian indexing works pointed us to positional data in the XAPIAN index.  In XAPIAN, positional data allows phrase searches such as: "nicotine NEAR mouse", which loosely translates to "search for the term nicotine occurring near the term mouse."  We noticed that the RIF+WIKI search does not need this type of query, a trade-off we were willing to make for faster search and a smaller XAPIAN database.  The impact of dropping positional data from the RIF+WIKI data was immediate: our indexing time on the NVMe drive dropped to a record low of 1 hour and 9 minutes, with a size of 73 Gb!  The table below summarizes our findings:
+
+
+```
+|                                                 | Indexing Time (min) | Space (Gb) | % Inc Size (from G+P) | % Inc Time |
+|-------------------------------------------------|---------------------|------------|-----------------------|------------|
+| G+P (no stop-words, no stemming, pos. data)     | 75                  | 60         | 0                     | 0          |
+| G+P+W+R (no stop-words, no stemming, pos. data) | 429                 | 152        | 153.3                 | 472        |
+| G+P+W+R (stop-words, stemming, no pos. data)    | 69                  | 73         | 21.6                  | -8         |
+
+Key:
+----
+G: Genes
+P: Phenotypes
+W: Wiki
+R: RIF
+```
+
+### Some Example Searches
+
+With RIF+WIKI search added, here are some searches you can try out on the CD GeneNetwork instance:
+
+* wiki:nicotine AND mean:12.103..12.105
+* rif:isoneuronal AND mean:12.103..12.105
+* species:mouse wiki:addiction rif:heteroneuronal mean:12.103..12.105
+* symbol:shh rif:glioma wiki:nicotine
+
+### References
+
+=> https://github.com/genenetwork/genenetwork3/blob/52cd294c2f1d06dddbd6ff613b11f8bc43066038/scripts/index-genenetwork#L54-L89 [0] Gene Indexing SQL Query
+=> https://mariadb.com/kb/en/explain/ [1] MariaDB EXPLAIN
+=> https://stackoverflow.com/a/4528433 [2] What does eq_ref and ref types mean in MySQL explain?
+=> https://planet.mysql.com/entry/?id=29724 [3] The meaning of ref=func in MySQL EXPLAIN
+=> https://issues.genenetwork.org/topics/engineering/instrumenting-ram-usage [4] Instrument RAM Usage
+=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/index-genenetwork#L54 [5] index-genenetwork
diff --git a/topics/engineering/instrumenting-ram-usage.gmi b/topics/engineering/instrumenting-ram-usage.gmi
new file mode 100644
index 0000000..4f7ab96
--- /dev/null
+++ b/topics/engineering/instrumenting-ram-usage.gmi
@@ -0,0 +1,32 @@
+# Instrumenting RAM usage
+
+* author: bonfacem
+* reviewed-by: jnduli
+
+On 2024-06-21, TUX02 experienced an outage because we ran out of RAM on the server.  Here we outline how to instrument processes that consume RAM, in particular, what to watch out for.
+
+=> https://issues.genenetwork.org/topics/meetings/jnduli_bmunyoki Meeting Notes
+
+The output of "free -m -h" looks like:
+
+```
+              total        used        free      shared  buff/cache   available
+Mem:           251G         88G         57G        6.2G        105G        155G
+Swap:           29G         20G        9.8G
+```
+
+When running "free", you can refresh the output regularly.  As an example, to get human readable output every 2 seconds:
+
+> free -m -h -s 2
+
+It's tempting to check the "free" column to see how much RAM is left.  However, this number is skewed by disk caching, and disk caching doesn't prevent applications from getting the memory they want[1].  What we need to watch instead are:
+
+* available: Make sure this is within acceptable thresholds.
+* swap used: Make sure this does not change significantly.
+
+Also, use htop/top and filter for the process you are monitoring (preferably ordering by RAM usage) to see how much RAM a process and its children (if any) consume.
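+
+For example, a quick way to list the biggest RAM consumers (GNU ps assumed):
+
+```
+ps aux --sort=-%mem | head -n 10
+```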
+
+## References
+
+=> https://www.linuxatemyram.com/index.html [0] Linux ate my ram!
+=> https://www.linuxatemyram.com/play.html [1] Experiments and fun with Linux disk cache
diff --git a/topics/engineering/setting-up-a-basic-pre-commit-hook-for-linting-scheme-files.gmi b/topics/engineering/setting-up-a-basic-pre-commit-hook-for-linting-scheme-files.gmi
new file mode 100644
index 0000000..5324de8
--- /dev/null
+++ b/topics/engineering/setting-up-a-basic-pre-commit-hook-for-linting-scheme-files.gmi
@@ -0,0 +1,31 @@
+# Setting Up a Basic Pre-Commit Hook for Linting Scheme Files
+
+* author: bonfacem
+* reviewed-by: jnduli
+
+Git executes hooks before/after events such as: commit, push and receive.  A pre-commit hook runs before a commit is finalized [0].  This post shows how to create a pre-commit hook for linting scheme files using `guix style`.
+
+```
+# Step 1: Create the hook
+touch .git/hooks/pre-commit
+
+# Step 2: Make the hook executable
+chmod +x .git/hooks/pre-commit
+```
+
+Step 3: Copy the following into .git/hooks/pre-commit:
+
+```
+#!/bin/sh
+
+# Run guix style on staged .scm files
+for file in $(git diff --cached --name-only --diff-filter=ACM | grep "\.scm$"); do
+  if ! guix style --whole-file "$file"; then
+    echo "Linting failed for $file. Please fix the errors and try again."
+    exit 1
+  fi
+  git add "$file"
+done
+```
+
+## References:
+
+=> https://www.slingacademy.com/article/git-pre-commit-hook-a-practical-guide-with-examples/ [0] Git Pre-Commit Hook: A Practical Guide (with Examples)
diff --git a/topics/engineering/using-architecture-decision-records-in-genenetwork.gmi b/topics/engineering/using-architecture-decision-records-in-genenetwork.gmi
new file mode 100644
index 0000000..43d344c
--- /dev/null
+++ b/topics/engineering/using-architecture-decision-records-in-genenetwork.gmi
@@ -0,0 +1,56 @@
+# Using Architecture Decision Records at GeneNetwork
+
+* author: bonfacem
+* reviewed-by: fredm, jnduli
+
+> One of the hardest things to track during the life of a project is the motivation behind certain decisions.  A new person coming on to a project may be perplexed, baffled, delighted, or infuriated by some past decision.
+> -- Michael Nygard
+
+When building or maintaining software, there are often moments when we ask, "What were they thinking?"  This happens when we are trying to figure out why something was done a certain way, leading to speculation, humor, or criticism[0].  Given the constraints we face when writing code, it's important to make sure that important decisions are well-documented and transparent.  Architecture Decision Records (ADRs) are one such tool.  They provide a structured way to capture the reasoning behind key decisions.
+
+ADRs consist of 4 key sections [0]:
+
+* Status: An ADR begins with a proposed status.  After discussions, it will be accepted or rejected.  It is also possible for a decision to be superseded by a newer ADR later on.
+* Context: The context section outlines the situation or problem, providing the background and constraints relevant to the decision.  This section is meant to frame the issue concisely, not as a lengthy blog post or detailed explanation.
+* Decision: This section clearly defines the chosen approach and the specific actions that will be taken to address the issue.
+* Consequences: This part lays out the impact or outcomes of the decision, detailing the expected results and potential trade-offs.
+
+Optionally, when an ADR is rejected, you can add a section:
+
+* Rejection Rationale: Briefly provides some context for why the ADR was rejected.
+
+At GeneNetwork, we manage ADRs within our issue tracker, organizing them under the path "/topics/ADR/<project-name>/XXX-name.gmi".  The "XXX" is a three-digit number, allowing easy, chronological ordering of the proposals as they are created.
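+
+For example, a (hypothetical) third ADR for the gn-uploader project would live at:
+
+```
+topics/ADR/gn-uploader/002-move-background-jobs-to-sqlite.gmi
+```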
+
+Here is a template for a typical ADR in Genenetwork:
+
+```
+# [<project>/ADR-<XXX>] Title
+
+* author: author-name
+* status: proposed
+* reviewed-by: A, B, C
+
+## Context
+
+Some context.
+
+## Decision
+
+Decisions.
+
+## Consequences
+
+Consequences.
+```
+
+Here are some examples of Genenetwork specific ADRs:
+
+=> https://issues.genenetwork.org/topics/ADR/gn3/000-add-test-cases-for-rdf [gn3/ADR-000] Add RDF Test Case
+=> https://issues.genenetwork.org/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response [gn3/ADR-001] Remove Stack Traces in GN3
+
+### References
+
+=> https://www.oreilly.com/library/view/mastering-api-architecture/9781492090625/ [0] Gough, J., Bryant, D., & Auburn, M.  (2022).  Mastering API Architecture: Design, Operate, and Evolve API-based Systems.  O'Reilly Media, Incorporated.
+=> https://adr.github.io/ [1] Architectural Decision Records.  Homepage of the ADR GitHub organization
+=> https://docs.aws.amazon.com/prescriptive-guidance/latest/architectural-decision-records/adr-process.html [2] Amazon's ADR process
+=> https://cloud.google.com/architecture/architecture-decision-records [3] Google Cloud Center Architecture Decision Records Overview
diff --git a/topics/engineering/working-with-virtuoso-locally.gmi b/topics/engineering/working-with-virtuoso-locally.gmi
new file mode 100644
index 0000000..af249a5
--- /dev/null
+++ b/topics/engineering/working-with-virtuoso-locally.gmi
@@ -0,0 +1,70 @@
+# Working with Virtuoso for Local Development
+
+* author: bonfacem
+* reviewed-by: jnduli
+
+Using guix, install the Virtuoso server:
+
+```
+guix install virtuoso-ose # or any other means to install virtuoso
+cd /path/to/virtuoso/database/folder
+cp $HOME/.guix-profile/var/lib/virtuoso/db/virtuoso.ini ./virtuoso.ini
+# modify the virtuoso.ini file to save files to the folder you'd prefer
+virtuoso-t +foreground +wait +debug
+```
+
+## Common Virtuoso Operations
+
+Use isql to load up data:
+
+```
+isql
+# subsequent commands run in the isql prompt
+# this folder is relative to the folder virtuoso was started from
+ld_dir ('path/to/folder/with/ttls', '*.ttl', 'http://genenetwork.org');
+rdf_loader_run();
+checkpoint;
+```
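+
+To verify the load, one can count the triples in the graph (default port and credentials assumed):
+
+```
+isql 1111 dba dba exec="SPARQL SELECT COUNT(*) WHERE { GRAPH <http://genenetwork.org> { ?s ?p ?o } };"
+```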
+
+Add data using HTTP:
+
+```
+# Replace dba:dba with <user>:<password>
+curl --digest --user 'dba:dba' --verbose --url \
+"http://localhost:8890/sparql-graph-crud-auth?graph=http://genenetwork.org" \
+-T test-data.ttl
+```
+
+Delete data using HTTP:
+
+```
+# Replace dba:dba with <user>:<password>
+curl --digest --user 'dba:dba' --verbose --url \
+"http://localhost:8890/sparql-graph-crud-auth?graph=http://genenetwork.org" -X DELETE
+```
+
+Query the graph data:
+
+```
+curl --verbose --url \
+"http://localhost:8890/sparql-graph-crud?graph=http://genenetwork.org"
+```
+
+Check out more cURL examples here:
+
+=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtGraphProtocolCURLExamples SPARQL 1.1 Graph Store HTTP Protocol cURL Example Collection
+
+## Setting Passwords
+
+Virtuoso's default user is "dba" and its default password is "dba".  To change a password, use isql to run:
+
+```
+set password "dba" "dba";
+CHECKPOINT;
+```
+
+## More
+
+Read a more complete tutorial on Virtuoso here:
+
+=> https://issues.genenetwork.org/topics/systems/virtuoso Virtuoso
diff --git a/topics/genenetwork-releases.gmi b/topics/genenetwork-releases.gmi
new file mode 100644
index 0000000..e179629
--- /dev/null
+++ b/topics/genenetwork-releases.gmi
@@ -0,0 +1,77 @@
+# GeneNetwork Releases
+
+## Tags
+
+* status: open
+* priority:
+* assigned:
+* type: documentation
+* keywords: documentation, docs, release, releases, genenetwork
+
+## Introduction
+
+The sections that follow note down the commits used for various stable (and stable-ish) releases of genenetwork.
+
+The tagging of the commits will need to distinguish repository-specific tags from overall system tags.
+
+In this document, we only concern ourselves with the overall system tags, that shall have the template:
+
+```
+genenetwork-system-v<major>.<minor>.<patch>[-<commit>]
+```
+
+The portions in angle brackets will be replaced with the actual version numbers.
+
+## genenetwork-system-v1.0.0
+
+This is the first, guix-system-container-based, stable release of the entire genenetwork system.
+The commits involved are:
+
+=> https://github.com/genenetwork/genenetwork2/commit/314c6d597a96ac903071fcb6e50df3d9e88935e9 GN2: 314c6d5
+=> https://github.com/genenetwork/genenetwork3/commit/0d902ec267d96b87648669a7a43b699c8a22a3de GN3: 0d902ec
+=> https://git.genenetwork.org/gn-auth/commit/?id=8e64f7f8a392b8743a4f36c497cd2ec339fcfebc gn-auth: 8e64f7f
+=> https://git.genenetwork.org/gn-libs/commit/?id=72a95f8ffa5401649f70978e863dd3f21900a611 gn-libs: 72a95f8
+
+The guix channels used for deployment of the system above are as follows:
+
+```
+(list (channel
+       (name 'guix-bioinformatics)
+       (url "https://git.genenetwork.org/guix-bioinformatics/")
+       (branch "master")
+       (commit
+        "039a3dd72c32d26b9c5d2cc99986fd7c968a90a5"))
+      (channel
+       (name 'guix-forge)
+       (url "https://git.systemreboot.net/guix-forge/")
+       (branch "main")
+       (commit
+        "bcb3e2353b9f6b5ac7bc89d639e630c12049fc42")
+       (introduction
+        (make-channel-introduction
+         "0432e37b20dd678a02efee21adf0b9525a670310"
+         (openpgp-fingerprint
+          "7F73 0343 F2F0 9F3C 77BF  79D3 2E25 EE8B 6180 2BB3"))))
+      (channel
+       (name 'guix-past)
+       (url "https://gitlab.inria.fr/guix-hpc/guix-past")
+       (branch "master")
+       (commit
+        "5fb77cce01f21a03b8f5a9c873067691cf09d057")
+       (introduction
+        (make-channel-introduction
+         "0c119db2ea86a389769f4d2b9c6f5c41c027e336"
+         (openpgp-fingerprint
+          "3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5"))))
+      (channel
+       (name 'guix)
+       (url "https://git.savannah.gnu.org/git/guix.git")
+       (branch "master")
+       (commit
+        "2394a7f5fbf60dd6adc0a870366adb57166b6d8b")
+       (introduction
+        (make-channel-introduction
+         "9edb3f66fd807b096b48283debdcddccfea34bad"
+         (openpgp-fingerprint
+          "BBB0 2DDF 2CEA F6A8 0D1D  E643 A2A0 6DF2 A33A 54FA")))))
+```
diff --git a/topics/genenetwork/Case_Attributes_GN2 b/topics/genenetwork/Case_Attributes_GN2
new file mode 100644
index 0000000..52a956f
--- /dev/null
+++ b/topics/genenetwork/Case_Attributes_GN2
@@ -0,0 +1,2 @@
+# Update Case Attributes to capture hierarchy info 
+## The following provides guidelines and insight regarding case attributes as used in GeneNetwork Webservice searches 
diff --git a/topics/genenetwork/genenetwork-services.gmi b/topics/genenetwork/genenetwork-services.gmi
new file mode 100644
index 0000000..717fdd8
--- /dev/null
+++ b/topics/genenetwork/genenetwork-services.gmi
@@ -0,0 +1,122 @@
+# GeneNetwork Services
+
+## Tags
+
+* type: documentation
+* keywords: documentation, docs, doc, services, genenetwork services
+
+## GeneNetwork Core Services
+
+GeneNetwork is composed of a number of different services. This document attempts to list all the services that make up GeneNetwork, and the links that give access to them.
+
+### GeneNetwork2
+
+This is the main user-interface to the entire GeneNetwork system.
+
+#### Links
+
+=> https://github.com/genenetwork/genenetwork2 Repository
+=> https://genenetwork.org/ GN2 on production
+=> https://fallback.genenetwork.org/ GN2 on old production
+=> https://cd.genenetwork.org/ GN2 on CI/CD
+=> https://staging.genenetwork.org/ GN2 on staging
+
+### GeneNetwork3
+
+This is the main API server for GeneNetwork.
+
+#### Links
+
+=> https://github.com/genenetwork/genenetwork3 Repository
+=> https://genenetwork.org/api3/ GN3 on production
+=> https://fallback.genenetwork.org/api3/ GN3 on old production
+=> https://cd.genenetwork.org/api3/ GN3 on CI/CD
+=> https://staging.genenetwork.org/api3/ GN3 on staging
+
+### Sparql Service
+
+The SPARQL service is served by a Virtuoso-OSE instance.
+
+=> https://issues.genenetwork.org/topics/deploy/our-virtuoso-instances We have notes on our virtuoso instances here.
+
+
+#### Links
+
+=> https://github.com/genenetwork/genenetwork3 Repository
+=> https://sparql.genenetwork.org/sparql/ sparql-service on production
+* ??? sparql-service on old production
+* ??? sparql-service on CI/CD
+* ??? sparql-service on staging
+
+### GN-Auth
+
+This is the authorisation server for the GeneNetwork system.
+
+#### Links
+
+=> https://git.genenetwork.org/gn-auth/ Repository
+=> https://auth.genenetwork.org/ gn-auth on production
+=> https://fallback.genenetwork.org/gn-auth/ gn-auth on old production
+* ??? gn-auth on CI/CD
+=> https://staging-auth.genenetwork.org/ gn-auth on staging
+
+### GN-Uploader
+
+This service is to be used for uploading data to GeneNetwork. It is currently in development (best case, alpha).
+
+#### Links
+
+=> https://git.genenetwork.org/gn-uploader/ Repository
+* ??? gn-uploader on production
+* ??? gn-uploader on old production
+* ??? gn-uploader on CI/CD
+=> https://staging-uploader.genenetwork.org/ gn-uploader on staging
+
+### Aliases Server
+
+An extra server to respond with aliases for genetic (etc.) symbols.
+
+This is currently a project in Racket, but we should probably pull the features in this repository into one of the others (probably GeneNetwork3) and retire this repository.
+
+#### Links
+
+=> https://github.com/genenetwork/gn3 Repository
+=> https://genenetwork.org/gn3/ aliases-server on production
+=> https://fallback.genenetwork.org/gn3/ aliases-server on old production
+=> https://cd.genenetwork.org/gn3/ aliases-server on CI/CD
+=> https://staging.genenetwork.org/gn3/ aliases-server on staging
+
+### Markdown Editing Server
+
+#### Links
+
+=> https://git.genenetwork.org/gn-guile/ Repository
+=> https://genenetwork.org/facilities/ markdown-editing-server on production
+=> https://fallback.genenetwork.org/facilities/ markdown-editing-server on old production
+=> https://cd.genenetwork.org/facilities/ markdown-editing-server on CI/CD
+=> https://staging.genenetwork.org/facilities/ markdown-editing-server on staging
+
+## Support Services
+
+These are other services that support the development and maintenance of the core services.
+
+### Issue Tracker
+
+We use a text-based issue tracker that is accessible via
+=> https://issues.genenetwork.org/
+
+The repository for this service is at
+=> https://github.com/genenetwork/gn-gemtext-threads/
+
+### Repositories Server
+
+This is where a lot of the genenetwork repositories live. You can access it at
+=> https://git.genenetwork.org/
+
+### Continuous Integration Service
+
+…
+
+=> https://ci.genenetwork.org/
+
+### …
diff --git a/topics/genenetwork/genenetwork-streaming-functionality.gmi b/topics/genenetwork/genenetwork-streaming-functionality.gmi
new file mode 100644
index 0000000..4f81eea
--- /dev/null
+++ b/topics/genenetwork/genenetwork-streaming-functionality.gmi
@@ -0,0 +1,43 @@
+# Genenetwork Streaming Functionality
+
+## Tags
+* type: documentation
+* Keywords: documentation, docs, genenetwork, streaming
+
+### Introduction
+Genenetwork implements streaming functionality that logs results from a running external process to a terminal emulator.
+
+The streaming functionality can be divided into several sections.
+
+### Streaming UI
+The terminal emulator is implemented using the `xterm.js` library and
+logs results from the GN3 API.
+
+See:
+=> https://github.com/xtermjs/xterm.js
+
+### Streaming API
+This is the main endpoint for streaming.
+
+See reference:
+=> https://github.com/genenetwork/genenetwork3/gn3/api/streaming.py
+
+### How to Integrate
+
+#### Import the `enable_streaming` Decorator
+
+```
+from gn3.computations.streaming import enable_streaming
+```
+
+#### Apply the Decorator to Your Endpoint that Runs an External Process
+
+Note: To run the external process, use the `run_process` function,
+which captures the `stdout` in a file identified by the `run_id`.
+
+```
+@app.route('/your-endpoint')
+@enable_streaming
+def your_endpoint(streaming_output_file):
+    # `command` and `run_id` are placeholders: the external command to run and
+    # the identifier under which its stdout is captured (see note above)
+    run_process(command, streaming_output_file, run_id)
+```
diff --git a/topics/genenetwork/publications-on-genenetwork.gmi b/topics/genenetwork/publications-on-genenetwork.gmi
new file mode 100644
index 0000000..aea1f63
--- /dev/null
+++ b/topics/genenetwork/publications-on-genenetwork.gmi
@@ -0,0 +1,14 @@
+# Publications on Genenetwork
+
+## Tags
+
+* type: documentation
+* keywords: documentation, docs, doc, publications
+
+## Important points
+
+A publication can relate to more than one Dataset (or family), e.g. you can have a publication with phenotypes from both the BXD and CXB populations. From @robw:
+
+```
+Yes. A single publication can make use of several different families of strains. Our hippocampus paper with Rupert included the BXD, CXB, and Mouse Diversity panels. We (awkwardly) put them all into BXDs with "Other" and also in CXB and also in Mouse Diversity. Definitely not optimal, but the code would have been way more work than just entering it in three ways.
+```
diff --git a/topics/genenetwork/starting_gn1.gmi b/topics/genenetwork/starting_gn1.gmi
index efbfd0f..e31061f 100644
--- a/topics/genenetwork/starting_gn1.gmi
+++ b/topics/genenetwork/starting_gn1.gmi
@@ -51,9 +51,7 @@ On an update of guix the build may fail. Try
    #######################################'
    #      Environment Variables - private
    #########################################
-   # sql_host = '[1]tux02.uthsc.edu'
-   # sql_host = '128.169.4.67'
-   sql_host = '172.23.18.213'
+   sql_host = '170.23.18.213' 
    SERVERNAME = sql_host
    MYSQL_SERVER = sql_host
    DB_NAME = 'db_webqtl'
diff --git a/topics/genetics/pangenotypes.gmi b/topics/genetics/pangenotypes.gmi
new file mode 100644
index 0000000..9b3d534
--- /dev/null
+++ b/topics/genetics/pangenotypes.gmi
@@ -0,0 +1,52 @@
+# Pangenotypes
+
+Here we discuss different storage solutions for pangenotypes.
+
+## GRG format
+
+Looking for graph genotyping solutions, I ran into Genotype Representation Graphs (GRG):
+
+=> https://pmc.ncbi.nlm.nih.gov/articles/PMC11071416/
+
+It has a binary storage format that represents something like:
+
+```
+# GRG file example: genotype graph
+# Nodes section: NODE <id> <label> allele=<genotype>
+NODE 1 GeneA allele=AA
+NODE 2 GeneB allele=AG
+NODE 3 GeneC allele=GG
+NODE 4 GeneD allele=AA
+NODE 5 GeneE allele=AG
+
+# Edges section: EDGE <from_id> <to_id>
+EDGE 1 2
+EDGE 1 3
+EDGE 2 4
+EDGE 3 4
+EDGE 4 5
+EDGE 5 1
+```
+
+the tooling
+
+=> https://github.com/aprilweilab/grgl.git
+
+builds with
+
+```
+guix shell -C -N coreutils gcc-toolchain make cmake openssl nss-certs git pkg-config zlib
+```
+
+I did some tests and read the source code. The nice thing is that they have very similar ideas. Unfortunately the implementation is not what we want. I wonder why people always reinvent data structures :/. To get an idea:
+
+=> https://github.com/aprilweilab/grgl/blob/main/src/serialize.cpp
+
+I would like to take similar ideas and turn them into an efficient in-memory graph structure that is easily extensible. RDF is key for extensions (and queries). A fast RDF implementation we are going to try is
+
+=> https://pyoxigraph.readthedocs.io/en/stable/index.html
+
+Toshiaki pointed out we should look at qlever instead:
+
+=> https://github.com/ad-freiburg/qlever
diff --git a/topics/genetics/standards/gemma-genotype-format.gmi b/topics/genetics/standards/gemma-genotype-format.gmi
new file mode 100644
index 0000000..6ca5998
--- /dev/null
+++ b/topics/genetics/standards/gemma-genotype-format.gmi
@@ -0,0 +1,99 @@
+# PanGEMMA Genotype Format
+
+Here we describe the genotype DB format that is used by GN and pangemma. Essentially it contains the genotypes as markers x samples (rows x cols). Unlike some earlier formats it also carries metadata and allows for tracking changes to the genotypes.
+
+The current reference implementation for creating the file lives at
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/geno2mdb.rb
+
+Note that we'll likely create new versions in python, guile and/or rust.
+
+# Storage
+
+We use the LMDB b-tree format to store and retrieve records based on an index. LMDB is very fast as it uses the memory map facilities of the underlying operating system.
+
+=> https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database
+
+LMDB supports multiple 'tables' in one file. We also use a metadata table named 'info'. Another table named 'track-changes' keeps track of modifications to the genotypes. This allows the genotypes to change over time - still giving people access to the original information if they need it.
+
+# Genotypes in the 'geno' table
+
+Genotypes are stored as fixed size rows of genotypes. Genotypes can be represented as 4-byte floats 'f*' or a list of bytes 'C*' (note these format specifiers come from Ruby's pack - Python has similar but slightly different specifiers). The idea is that storing floats gives enough precision for probabilities, while single bytes can represent all other cases. In the future we may add 2-byte integers, but that is probably not necessary.
+
+For the float version we use NaN to designate a missing value (NA).
+
+For the byte version we use the value 255 or 0xFF to designate a missing value (NA). The other 255 values (including 0) are used either as an index - so A,B,H could be 0,1,2 - or to project a range of values onto. In many cases 255 values is enough to represent the genotype variation in a population. Otherwise opt for the float option.
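+
+A minimal sketch of the two encodings in Ruby (the values here are made up for illustration):
+
+```
+row_f = [0.0, 1.0, Float::NAN, 2.0].pack("f*")  # float encoding, NaN designates NA
+row_b = [0, 1, 255, 2].pack("C*")               # byte encoding, 255 designates NA
+row_f.unpack("f*")                              # => [0.0, 1.0, NaN, 2.0]
+```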
+
+The index to the rows is currently built out of keys. These keys hold the chromosome number as a single byte 'C', the position as a 4-byte long integer 'L>' and the row number in the original file as a 4-byte long 'L>'. These numbers are stored big-endian ('>'), so the byte-wise sort order of the keys always matches the numeric order(!).
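+
+And a sketch of the key packing, following the 'CL>L>' layout described above (example values are made up):
+
+```
+CHRPOS_PACK = "CL>L>"
+chr, pos, lineno = 3, 93886999, 23209564
+key = [chr, pos, lineno].pack(CHRPOS_PACK)  # 9-byte key that sorts correctly byte-wise
+key.unpack(CHRPOS_PACK)                     # => [3, 93886999, 23209564]
+```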
+
+# Metadata in the 'info' table
+
+The default metadata is stored in the info table as
+
+```
+meta = {
+  "type" => "gemma-geno",
+  "format" => options[:format],
+  "version" => 1.0,
+  "eval" => EVAL.to_s,
+  "key-format" => CHRPOS_PACK,
+  "rec-format" => PACK,
+  "geno" => json
+}
+```
+
+where CHRPOS_PACK gives the key layout 'CL>L>' and PACK the genotype list, e.g. 'f*'. The format line gives the 'standard' storage type, e.g. 'Gf' for the floats and eval is the command used to transform values. The only field we really have to use for unpacking the data is format or rec-format because key-format does not change. The info table has some extra records that may be used:
+
+```
+  info['numsamples'] = [numsamples].pack("Q") # uint64
+  info['nummarkers'] = [geno.size].pack("Q")
+  info['meta'] = meta.to_json.to_s
+  info['format'] = options[:format].to_s
+  info['options'] = options.to_s
+```
+
+where 'numsamples' and 'nummarkers' are counts. 'meta' reflects the above JSON record. 'format' mirrors the format in the meta record and 'options' shows the options as they were fed to the program that generated the file.
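+
+A sketch of reading the metadata back, assuming 'info' is the opened LMDB info table (e.g. through the lmdb gem):
+
+```
+require 'json'
+
+numsamples = info['numsamples'].unpack1("Q")  # uint64 count
+meta       = JSON.parse(info['meta'])
+rec_format = meta["rec-format"]               # e.g. 'f*', used to unpack the rows
+```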
+
+# Tracking changes
+
+Note: this is a proposal and has not yet been implemented. The idea is to store records by time stamp. Each record describes a change, so the genotypes can be rolled forward at the user's wish. In case of a replacement it could be:
+
+```
+timestamp =>
+{
+  "marker" => name,
+  "chr" => chr,
+  "pos" => pos,
+  "line" => line,
+  "action" => "update",
+  "author" => author,
+  "genotypes" => list
+```
+
+Where list contains the *updated* genotypes.
+Likewise for a marker insertion or deletion.
+
+The track changes can also specify that a change only applies to a trait, a list of traits, a specific set of samples, or a group. E.g.
+
+```
+timestamp =>
+{
+  "marker" => name,
+  "chr" => chr,
+  "pos" => pos,
+  "line" => line,
+  "action" => "update",
+  "author" => author,
+  "genotypes" => list,
+  "for-traits" => list,
+  "for-samples" => list,
+  "for-group" => name
+}
+```
+
+The 'geno' database will therefore always hold the *first* version. These records make it possible to roll forward on changes and present an updated genotype matrix. The original genotypes are retained. This, naturally, can be handled in a cache, so any rewritten genotype files will be available in cache for a period of time.
+In the future a tool, such as GEMMA, could support dynamic application of these edits. That way we only have to cache the latest version.
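+
+A hypothetical sketch of the roll-forward, assuming 'changes' maps timestamps to the records above and 'geno' maps marker names to genotype rows:
+
+```
+# apply each change record on top of the original 'geno' rows, oldest first
+changes.sort.each do |_timestamp, rec|
+  geno[rec["marker"]] = rec["genotypes"] if rec["action"] == "update"
+end
+```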
+
+This way users may be able to select changes (i.e. pick and choose), use all (latest) or use original (init).
+
+For the editing we should provide an API.
diff --git a/topics/genetics/test-pangenome-derived-genotypes.gmi b/topics/genetics/test-pangenome-derived-genotypes.gmi
new file mode 100644
index 0000000..3a8473a
--- /dev/null
+++ b/topics/genetics/test-pangenome-derived-genotypes.gmi
@@ -0,0 +1,1005 @@
+# Test pangenome derived genotypes
+
+Here we follow up on the work we did on precompute PublishData:
+
+=> ../systems/mariadb/precompute-publishdata
+
+But now run against pangenome derived genotypes.
+For the BXD we have 23M markers(!), of which 8M are *not* on the reference genome.
+
+# Tasks
+
+* [ ] Document lmdb geno and marker information
+* [ ] Extract epoch information
+* [ ] Add BED file and link SNPS
+* [ ] Check MAF filter - it may be too stringent
+* [ ] Use ravanan/CWL to push to Octopus
+* [ ] Reintroduce nodes that were not annotated for position (Flavia)
+* [ ] GWA plotter
+* [ ] Speed up IO for GEMMA by using lmdb for genotypes and marker file
+* [ ] Use a 1.5 LOD drop to compute QTLs instead of a fixed 50M distance
+* [ ] Reduce GEMMA GRM RAM requirements (not urgent)
+* [ ] Fix -lmm 4 ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt
+
+# Summary
+
+To get the mapping and generate the assoc output in mdb format we run a variant of gemma-wrapper.
+
+The workflow essentially is:
+
+* capture the significant markers from GEMMA's mdb output (as created by gemma-wrapper)
+* These are transformed into RDF using the 'gemma-mdb-to-rdf.rb' script
+* Next we upload that RDF into virtuoso
+* from there download a table of start-stop data using SPARQL
+* We compute QTL locations using 'sparql-qtl-detect.rb'
+* Upload that RDF also into virtuoso
+
+For mapping, virtuoso contains four important TTL files:
+
+* marker positions in pangenome-marker graph
+* mapped markers in pangenome-mapped graph
+* computed QTL positions in pangenome-qtl graph
+* trait values in traits graph (not yet implemented)
+
+The GEMMA mappings themselves are driven by the script
+
+```
+gemma-batch-run.sh
+```
+
+Next we convert that output to RDF with
+
+```
+../bin/gemma-mdb-to-rdf.rb --header > output.ttl
+time ../bin/gemma-mdb-to-rdf.rb --anno snps-matched.txt.mdb tmp/panlmm/*-gemma-GWA.tar.xz >> output.ttl # two hours for 7000 traits
+time serdi -i turtle -o ntriples output.ttl > output.n3
+```
+
+(note that n3 files are less error prone and serdi handles huge files better than rapper). Next copy the file to the virtuoso instance and load it with isql (it may be worth search-and-replacing the gnt:run tag to something more descriptive):
+
+```
+cd /export/guix-containers/virtuoso/data/virtuoso/ttl/
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+SQL> ld_dir('/export/data/virtuoso/ttl','test-run-3000.n3','http://pan-test.genenetwork.org');
+Done. -- 3 msec.
+# for testing the validity and optional delete problematic ones:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+SQL> DELETE from DB.DBA.LOAD_LIST where LL_STATE = 1;
+# commit changes
+SQL> rdf_loader_run (); // about 1 min per GB n3
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
+34200686
+```
+
+Note it may be a good idea to drop graphs first. That is why we have separate subgraph spaces for every large TTL file:
+
+```
+log_enable(3,1);
+SQL> SPARQL CLEAR GRAPH  <http://pan-test.genenetwork.org>;
+SQL> SPARQL CLEAR GRAPH  <http://pan-mapped.genenetwork.org>; // 10 min
+SQL> SPARQL CLEAR GRAPH  <http://pangenome-marker.genenetwork.org>;
+SQL> ld_dir('/export/data/virtuoso/ttl','pangenome-markers.n3','http://pangenome-marker.genenetwork.org');
+SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
+```
+
+For pangenomes we have a marker file, a mapped-marker file and a QTL file (matching the graphs listed above).
+
+As a test, fetch a table of the traits with their SNPs:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT * FROM <http://pangenome-mapped.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lod ;
+        gnt:af ?af .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+FILTER (contains(?nodeid,"Marker") && ?pos < 1000)
+} LIMIT 100
+```
+
+OK, we are ready to run a little workflow. First create a sorted list of IDs.
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT DISTINCT ?trait FROM <http://pangenome-mapped.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId ?trait.
+}
+```
+
+See also
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/list-traits.sparql
+
+Sort that list and save as 'pan-ids-sorted.txt'. Next run
+
+```
+../../bin/workflow/qtl-detect-batch-run.sh
+```
+
+and load those into virtuoso. To list the new QTL:
+
+```
+SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
+  ?traitid a gnt:mappedTrait ;
+  gnt:traitId ?t .
+  MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
+  # ?traitid gnt:run gn:test . # use if you want the new QTL
+  ?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlLOD ?lod ;
+  gnt:qtlStart ?s ;
+  gnt:qtlStop ?e .
+  ?qtl gnt:mappedSnp ?snp .
+  FILTER (?t = "10002" && ?lod >= 5.0 ) .
+} LIMIT 100
+```
+
+# Prior work
+
+For the first traits (presented at CTC'25) gemma was run as
+
+```
+echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..."
+gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
+        -p ${PHENO_FILE} \
+              -gk \
+              -o percentile_result > percentile.kinship.143.txt
+
+echo "[$(date)] Kinship matrix calculation completed for PERCENTILE."
+echo "[$(date)] Starting association analysis for PERCENTILE..."
+gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
+        -p ${PHENO_FILE} \
+              -k ./output/percentile_result.cXX.txt \
+              -lmm 4 \
+              -maf 0.05 \
+              -o percentile_association > percentile.assoc.143.txt
+```
+
+Note no LOCO.
+
+The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything in RAM, it is a bit large for my workstation. I opted to use tux04 since no one is using it. Had to reboot the machine because it is unreliable and had crashed.
+
+There I rebuilt gemma and set up a first run:
+
+```
+tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$
+/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk
+```
+
+Without LOCO this took about 18 minutes (186% CPU) and 110Gb of RAM. We ought to work on this ;) Next
+
+```
+/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05
+```
+
+To run gemma on the current 23M BXD pangenome derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). 13K traits would be 43 months on a single machine. We'll need something better. As Rob writes:
+
+> The huge majority of variants will have r2 of 1 with hundreds or thousands of neighbors. This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions.
+
+A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows:
+
+```
+chr     rs      ps      n_miss  allele1 allele0 af      beta    se      logl_H1 l_remle l_mle   p_wald  p_lrt   p_score
+-9      A1-0    -9      0       A       T       0.171   -nan    -nan    -nan    1.000000e+05    1.000000e+05    -nan  -nan     -nan
+-9      A2-0    -9      0       A       T       0.170   -nan    -nan    -nan    1.000000e+05    1.000000e+05    -nan  -nan     -nan
+```
+
+Turns out I was using the wrong pheno file. Let's try again.
+
+```
+/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05
+```
+
+As a check I can diff against the original output. So, I replicated the original run! It also ran faster at 400% CPU in 35 minutes.
+
+(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things).
+
+## Moving to gemma-wrapper
+
+gemma-wrapper has extra facilities, such as LOCO, caching and lmdb output. Last time we used it in
+
+=> ../genetics/systems/mariadb/precompute-publishdata
+
+in a guix container it looked like
+
+```
+#! /bin/env sh
+
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
+    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+    ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
+    ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+  fi
+done
+```
+
+Let's try running the big stuff instead:
+
+```
+./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz  -gk
+```
+
+## Individuals
+
+gemma does not really track individuals. The order of the genotype columns should simply be the same as in the pheno file.
+In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following:
+
+```
+{
+  "type": "gn-geno-to-gemma",
+  "genofile": "BXD.geno",
+  "samples": [
+    "BXD1",
+    "BXD2",
+    "BXD5",
+...
+  ],
+  "numsamples": 237,
+  "header": [
+    "# File name: BXD_experimental_DGA_7_Dec_2021",
+...
+```
+
+To get this
+
+```
+cut -f 1 143samples.pc-list.tsv | sed -e 's,_.*,,' -e 's,^,",' -e 's,$,"\,,' > bxd_inds.list.txt
+"BXD100",
+"BXD101",
+"BXD102",
+```
+
+Next I turned it into a JSON file by hand as 'bxd_inds.list.json'.
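+
+Alternatively, a small Ruby sketch could generate the JSON directly (file names taken from above; this only produces the sample list, not the full gn-geno-to-gemma record):
+
+```
+require 'json'
+
+inds = File.readlines("143samples.pc-list.tsv").map do |line|
+  line.split("\t")[0].sub(/_.*/, "")  # strip everything from the underscore
+end
+File.write("bxd_inds.list.json", JSON.pretty_generate(inds))
+```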
+
+## Markers
+
+With GEMMA, marker names are listed in the geno file. GEMMA can also use a SNP file that gives the chromosome and location.
+Without the SNP file gemma-wrapper complains it needs the SNP/marker annotation file. This is logical, because for LOCO it needs to know what chromosome a marker is on.
+
+The next step is to take the nodes file and extract all rows from the genotype file that match nodes with chromosomes defined. Andrea is going to deliver positions for all nodes, but for now we can use what we have. Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt:
+
+```
+mm10#1#chr3     23209565        93886997
+mm10#1#chr3     23209564        93886999
+mm10#1#chr3     23209563        93887016
+...
+```
+
+In the genotype file we find, for example
+
+```
+A23209564-0, A, T, 1.919141867395325,  0.9306930597711228,  1.8201319833577734,  0.7607260422339468,  1.427392726736106,  1.2310230984252724,  1.6633662444541875,  0.6105610229068721, ...
+```
+
+a bit funny, but you get the idea. So we can take the mm10 file and write out the genotype file again for all matching nodes, together with a matching SNP file that should contain for this node:
+
+```
+A23209564-0        93886999        3
+```
+
+To rewrite the above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like
+
+```
+#!/usr/bin/env ruby
+
+# rewrite 'mm10#1#chr3 <name> <pos>' lines as 'A<name>-0 <pos> <chr>'
+ARGF.each_line do |line|
+  tag,name,pos = line.strip.split(/\t/)
+  tag =~ /chr(.*)$/
+  chrom = $1
+  print "A#{name}-0\t#{pos}\t#{chrom}\n"
+end
+```
+
+Now, another problem is that not all SNPs have a position in the genotype file (yet). As we can't display them I can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. That throwaway script looks like
+
+```
+#!/usr/bin/env ruby
+
+# keep only BIMBAM rows whose marker appears in the SNP annotation file
+bimbamfn = ARGV.shift
+snpfn = ARGV.shift
+snps = {}
+open(snpfn).each_line do |snpl|
+  name = snpl.split(/\t/)[0]
+  snps[name] = 1
+end
+open(bimbamfn).each_line do |line|
+  marker = line.split(/[,\s]/)[0]
+  print line if snps[marker]
+end
+```
+
+It takes a while to run, but as this is a one-off that does not matter. Reducing the file leads to 13667900 markers with genotypes. The original SNP file has 14927024 lines. Hmmm. The overlap is therefore not perfect (we have more annotations than genotypes now). To check this I'll run a diff.
+
+```
+cut -f 1 -d "," 143samples.percentile.bimbam.bimbam-reduced > 143samples.percentile.bimbam.bimbam-reduced-markers
+sort 143samples.percentile.bimbam.bimbam-reduced-markers > markers-sorted.txt
+diff --speed-large-files  143samples.percentile.bimbam.bimbam-reduced-markers markers-sorted.txt
+< A80951-0
+< A80952-0
+< A80953-0
+...
+cut -f 1 snps.txt |sort > snps-col1-sorted.txt
+diff --speed-large-files snps-col1-sorted.txt markers-sorted.txt
+241773d228996
+< A10314686-0
+241777d228999
+< A10314689-0
+241781d229002
+< A10314692-0
+grep A10314686 snps-col1-sorted.txt markers-sorted.txt
+snps-col1-sorted.txt:A10314686-0
+snps-col1-sorted.txt:A10314686-0
+markers-sorted.txt:A10314686-0
+```
+
+Ah, we have duplicate annotation lines in the SNP file.
+
+```
+grep A10314686-0 snps.txt
+A10314686-0     20257882        8
+A10314686-0     20384895        8
+grep A10314692-0 snps.txt
+A10314692-0     20257575        8
+A10314692-0     20384588        8
+```
+
+so the same node is considered two SNPs. This is due to the node covering multiple individuals (paths). It turns out a chunk of them map to different chromosomes too. I think we ought to drop them until we have a better understanding of what they represent (they may be mismapping artifacts).
+
+I updated the script. Now I see it skips A280000 because there is no marker annotation for that node. Good. Also the number of genotype markers got further reduced to 13209385.
+I checked the gemma code and the SNP annotation file should match the genotype file line for line. Unsurprising, perhaps, but now I need to rewrite both. After adapting the script we now have two files with the same number of lines.
+
+Rerunning with the new files:
+
+```
+gemma -g new-genotypes.txt -p pheno_filtered_143.txt -gk
+gemma -g new-genotypes.txt -p pheno_filtered_143.txt -k output/result.cXX.txt -maf 0.05 -lmm 4 -a snps-matched.txt
+```
+
+And, even though the results differ somewhat in size -- due to the different number of markers -- the results look very similar to what was produced before. Good!
+
+Now we have confirmation and all the pieces we can run the same set with gemma-wrapper and LOCO.
+
+## gemma-wrapper
+
+The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix. I replaced that by using our pfff hashing for larger files.
+
+```
+/bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json
+```
+
+For this computation each gemma process maxed out at 80Gb RAM (total 640Gb). We are really hitting limits here. In the near future we need to check why so much data is retained. As we only have ~150 individuals it must be a marker thing.
+
+```
+/bin/time -v ../bin/gemma-wrapper -v --json --lmdb --loco --input K.json -- -g new-genotypes.txt -p pheno_filtered_143.txt -a snps-matched.txt -debug -maf 0.05 -lmm 9 > GWA.json
+```
+
+This time gemma requires only 25Gb per chromosome, so we can run it in one go in RAM on this large server. Much of the time is spent in IO, so I think that when we start using mmap (lmdb) we can speed it up significantly.
+gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores.
+
+Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one.
+
+After some fixes we now have LOCO in an lmdb output. The mdb file comes in at 693Mb. That will make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (above 4.0) made the size 17Mb. That makes it 215Gb total, which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now it is down to 7Mb, and 3.2Mb compressed. That'll total less than 100Gb for 13K traits. Good.
+
+## Final hookup
+
+Now that gemma-wrapper works (and the test results are confirmed) we have to wire it up to fetch traits from the DB. We also have to make sure the trait values align with the individuals in the genotype file. Earlier I was running the script gemma-batch-run.sh:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-batch-run.sh
+
+which looks like:
+
+```
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+# ---- Compute GRM
+./bin/gemma-wrapper --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+# ---- For all entries run LMM
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
+    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+    ./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+  fi
+done
+```
+
+We already have ids.txt and the GRM. What is required is the trait values from the DB. What we need to do is run gn-guile somewhere with access to the DB. Also I need to make sure the current gemma-wrapper tar-balls up the result.
+
+OK, we are running. Looks like the smaller datasets only use 11Gb RES RAM per chromosome. Which means we can run two computes in parallel on this machine.
+
+The first run came through! I forgot the --reduce flag, so it came out at 190Mb. I'll fix that. 34 individuals ran in 7 minutes.
+We are currently running at a trait per 6 minutes. We can double that on this machine.
+
+The following puzzles me a bit
+
+```
+## number of analyzed individuals = 31
+## number of covariates = 1
+## number of phenotypes = 1
+## leave one chromosome out (LOCO) =       14
+## number of total SNPs/var        = 13209385
+## number of SNPS for K            = 12322657
+## number of SNPS for GWAS         =   886728
+## number of analyzed SNPs         = 13122153
+```
+
+why is the number of SNPs for GWAS low? Perhaps a threshold of 10% for maf is a bit stringent. See below.
+
+Anyway, we are running traits and we'll use the first 500 for analysis.
+
+Meanwhile I'll look at deploying on octopus and maybe speeding up GEMMA. See
+
+=> issues/genetics/speeding-up-gemma
+
+# MAF
+
+GEMMA has a MAF filter. For every SNP a maf is computed by summing the geno values:
+
+```
+maf += geno
+```
+
+When all genotype values are added up, the sum is divided by 2x the number of individuals (minus the missing ones):
+
+```
+maf /= 2.0 * (double)(ni_test - n_miss);
+```
+
+and this is held against the maf passed on the command line. The 2.0 therefore assumes all values are between 0 and 2.
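+
+A sketch of the same computation in Ruby (GEMMA itself does this in C++; the values here are made up):
+
+```
+genos = [0.0, 1.0, 2.0, Float::NAN, 1.5]  # one marker row on the 0..2 scale
+valid = genos.reject(&:nan?)
+maf = valid.sum / (2.0 * valid.size)      # (0+1+2+1.5) / (2*4) = 0.5625
+```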
+
+Actually, I now realise we are using LOCO, so the number of SNPs for GWAS is the count on a single chromosome. That makes sense!
+Still, we have to be careful about the MAF range. In our genotype file the values are between 0 and 2, so that is fine in itself.
+
+# RDF
+
+Next step is to generate RDF. The SNP annotation was slow, so I moved that to lmdb. Parsing 400 traits now takes 3 minutes. The RDF file is under 1Gb and the SNP annotation RDF is 330Mb. Not too bad!
+
+```
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-snps-400.n3','http://pan-test.genenetwork.org');
+Done. -- 3 msec.
+# for testing the validity and optional delete problematic ones:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+# commit changes
+SQL> rdf_loader_run ();
+SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-400.n3','http://pan-test.genenetwork.org');
+SQL> rdf_loader_run ();
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
+34200686
+```
+
+Or in the web interface:
+
+```
+SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o }
+```
+
+## Query
+
+The RDF is formed as:
+
+```
+gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10383 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10383;
+      gnt:loco true;
+      gnt:run gn:test;
+      gnt:time "2025/11/10 08:12";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10383";
+      gnt:nind 14;
+      gnt:mean 18.0;
+      gnt:std 10.9479;
+      gnt:skew 0.3926;
+      gnt:kurtosis -1.1801;
+      skos:altLabel "BXD_10383";
+      gnt:filename "0233fa0cf277ee7d749de08b32f97c8be6478639-BXDPublish-10383-gemma-GWA.tar.xz";
+      gnt:hostname "napoli";
+      gnt:user "wrk".
+gn:A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639;
+      gnt:locus gn:A8828461_0;
+      gnt:lodScore 4.8;
+      gnt:af 0.536;
+      gnt:effect -32.859.
+```
+
+and SNPs are annotated as
+
+```
+gn:A8828461_0 a gnt:marker;
+                 rdfs:label "A8828461-0";
+                 gnt:chr  "1";
+                 gnt:pos  3304440.
+gn:A8828464_0 a gnt:marker;
+                 rdfs:label "A8828464-0";
+                 gnt:chr  "1";
+                 gnt:pos  3304500.
+```
+
+To get all tested traits you can list:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
+?trait a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId ?traitid ;
+         gnt:kurtosis ?kurtosis .
+} limit 100
+```
+
+To get all SNPs for trait "10381":
+
+```
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId "10381" .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+}
+```
+
+Lists:
+
+```
+| http://genenetwork.org/id/A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 | "A8828461-0" | "1" | 3304440 |
+```
+
+## Scoring/annotating QTL
+
+Next step is annotating the QTL in RDF. Earlier I wrote a script rdf-analyse-gemma-hits. It uses rapper to read two RDF files (two runs) and annotates the QTL and differences between the files. The code is not pretty:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/bin/rdf-analyse-gemma-hits.rb#L1
+
+The supporting library is a bit better:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/lib/qtlrange.rb#L1
+
+Basically we have a QTL locus (QLocus) that tracks chr,pos,af and lod for each hit.
+QRange is a set of QLocus which also tracks some stats chr,min,max,snps,max_af,lod.
+It can compute whether two QTL (QRange) overlap.
+Next we have a container that tracks the QTL (QRanges) on a chromosome.
+
+Finally there is a diff function that can show the differences on a chromosome (QRanges) for two mapped traits.
+
+Maybe the naming could be a bit better, but the code is clear as it stands. One thing to note is that we use a fixed distance MAX_SNP_DISTANCE_BPS of 50M that decides whether a SNP falls in the same QTL. As Rob and Flavia pointed out, it would be worth trying to base this on the drop in LOD scores (1.5 from the top) instead.
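+
+A simplified sketch of the binning idea behind it (hypothetical names; the real implementation lives in qtlrange.rb):
+
+```
+MAX_SNP_DISTANCE_BPS = 50_000_000
+
+# group hits on one chromosome into QTL ranges: a hit starts a new range
+# when it lies more than MAX_SNP_DISTANCE_BPS beyond the previous hit
+ranges = []
+hits.sort_by(&:pos).each do |hit|
+  if ranges.empty? or hit.pos - ranges.last.last.pos > MAX_SNP_DISTANCE_BPS
+    ranges << [hit]
+  else
+    ranges.last << hit
+  end
+end
+```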
+
+So, the library is fine, but the calling program is not great. The reason is that I parse RDF directly, teasing apart the logic we do in the above SPARQL. I track state in dictionaries (hashes of hashes) and the result ends up convoluted. There is also a lot of state in RAM. I chose direct RDF parsing because it makes for easier development. The downside is that I need to parse the whole file to make sure I have everything related to a trait. Fetching SNP results from SPARQL directly is slow too. I am in a bind.
+
+Using curl:
+
+```
+time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId ?trait ; gnt:kurtosis ?k . }
+"
+```
+
+
+```
+time curl -G http:///sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId \"10001\" ; gnt:kurtosis ?k . ?snp gnt:mappedSnp ?traitid ; gnt:locus ?locus . }
+"  > test.out
+real    0m1.612s
+user    0m0.020s
+sys     0m0.000s
+```
+
+Getting the trait info for 400 traits takes a second, so that is no big deal. Getting the 6K SNPs for one trait also takes a second. Hmmm. Over 13K traits that adds up to hours, compared to minutes for direct RDF parsing. Before lmdb comes to the rescue we should try running it on the virtuoso server itself. With curl we get 0.5s per query, which makes it two hours for 13K traits. But when we run the query using isql it runs in 70ms, which totals 15 minutes. That is perfectly fine for running the whole set!
+
+One way is to simply script isql from the command line. Meanwhile, it also turns out the ODBC interface can be used from Python or Ruby. Here is an example in R:
+
+=> https://cran.stat.auckland.ac.nz/web/packages/virtuoso/index.html
+
+Not sure if that is fast enough, but perhaps worth trying.
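+
+For scripting isql directly, a hypothetical sketch in Ruby (it assumes Virtuoso isql's exec= argument and a helper per_trait_sparql that renders the per-trait SELECT shown earlier):
+
+```
+File.readlines("pan-ids-sorted.txt").each do |id|
+  query = per_trait_sparql(id.strip)  # hypothetical helper building the SELECT
+  system("isql", "-S", "8891", "exec=#{query}")
+end
+```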
+
+So, now we have a way to query the data around a trait in seconds. This means I can rewrite the QTL generator to go by trait. This also allows for a quick turnaround during development (good!). Also I want two scripts: one for computing the QTL and one for annotating the differences.
+
+Alright. The first script should simply fetch a trait with its markers from SPARQL and score the QTL (as RDF output). The new script is at
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/sparql-qtl-detect.rb
+
+First, the query for one trait looks like:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT ?lod ?af ?nodeid ?chr ?pos FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId "10002" .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lod ;
+        gnt:af ?af .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+} ORDER BY DESC(?lod)
+```
+
+rendering some 22K markers for trait 10002 as a TSV:
+
+```
+"lod"   "af"    "nodeid"        "chr"   "pos"
+7.5     0.547   "A13459298-0"   "8"     98658490
+7.1     0.154   "A13402313-0"   "8"     96798487
+7       0.432   "A13446492-0"   "8"     97355019
+7       0.263   "A13387873-0"   "8"     94934820
+7       0.585   "A4794343-0"    "1"     172265488
+...
+```
+
+Earlier with precompute for trait 10002 we got:
+
+```
+[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
+```
+
+so the hits are in range, but the LOD may be inflated because of the number of markers. Anyway, at this point we are merely concerned with scoring QTL. The first script is simply:
+
+```
+require 'csv'
+
+qtls = QTL::QRanges.new("10002","test")
+# fn is the path to the TSV rendered by the SPARQL query above
+CSV.foreach(fn, headers: true, col_sep: "\t") do |hit|
+  qlocus = QTL::QLocus.new(hit["nodeid"], hit["chr"], hit["pos"].to_i, hit["af"].to_f, hit["lod"].to_f)
+  qtls.add_locus(qlocus)
+end
+print qtls
+```
+
+and prints a long list of QTL, most of them containing a single hit.
+
+```
+[10002,test] =>{"1"=>[#<QRange Chr1 𝚺1 3099543..3099543 AF=0.583 LOD=5.8..5.8>, #<QRange Chr1 𝚺1 65908328..65908328 AF=0.627 LOD=5.7..5.7>, #<QRange Chr1 𝚺1 81604902..81604902 AF=0.451 LOD=5.5..5.5>, #<QRange Chr1 𝚺2 85087169..85087177 AF=0.781 LOD=5.5..5.6>, #<QRange Chr1 𝚺1 93740525..93740525 AF=0.762 LOD=6.5..6.5>, #<QRange Chr1 𝚺1 114086053..114086053 AF=0.568 LOD=5.7..5.7>,...
+```
+
+For trait 10002 tweaking thresholds and rebinning we get
+
+```
+#<QRange Chr8 𝚺2 34.303454..35.675301 AF=0.571 LOD=5.7..5.8>
+#<QRange Chr8 𝚺621 91.752748..102.722635 AF=0.663 LOD=5.6..7.5>
+#<QRange Chr1 𝚺16 65.908328..175.232335 AF=0.781 LOD=5.6..7.0>
+#<QRange Chr4 𝚺5 56.498971..126.135422 AF=0.657 LOD=5.6..6.4>
+#<QRange Chr12 𝚺3 23.037869..58.306731 AF=0.643 LOD=5.8..6.2>
+#<QRange Chr10 𝚺2 13.442071..13.442088 AF=0.641 LOD=5.8..6.0>
+#<QRange Chr10 𝚺3 94.246536..103.438796 AF=0.608 LOD=5.9..6.2>
+#<QRange Chr3 𝚺2 47.644513..82.451061 AF=0.548 LOD=5.7..6.2>
+#<QRange Chr9 𝚺2 97.445077..120.263403 AF=0.717 LOD=5.8..5.8>
+#<QRange Chr11 𝚺2 27.4058..56.30011 AF=0.559 LOD=5.7..5.7>
+```
+
+with a LOD>5.5 cut-off. That seems justified because LOD scores are inflated. Compare this with the earlier mapping using 'traditional' genotypes:
+
+```
+[10002,LOCO] =>{
+"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>,
+      #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>],
+"8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
+```
+
+we can see the significance of chr8 has gone up with pangenome mapping (relative to chr1) and we now find 2 QTL on chr8, with a new one to the left. Chr1 looks similar. We have some other candidates that may or may not be relevant (all narrow!).
+
+Note this *is* a random trait(!) and it suggests the landscape of QTLs will change pretty dramatically. Note also that Andrea will deliver new genotypes, with smoothing to follow. But it is encouraging.
+
+I played a bit with the QTL output, and for now settled on tracking nodes that have a LOD>5.0. We drop QTL based on the following:
+
+```
+qtl.lod.max < 6.0 or (qtl.lod.max < 7.5 - qtl.snps.size/2.0)
+```
+
+I.e. a single-SNP QTL has to have a LOD of 7.0 and a 2-SNP QTL a LOD of 6.5. This yields
+
+```
+[10002,test] =>{
+"1"=>[#<QRange Chr1 𝚺69 3.099543..192.718161 AF=0.781 LOD=5.1..7.0>],
+"4"=>[#<QRange Chr4 𝚺12 56.498971..147.86044 AF=0.676 LOD=5.1..6.4>],
+"8"=>[#<QRange Chr8 𝚺2774 34.303454..116.023702 AF=0.899 LOD=5.1..7.5>],
+"10"=>[#<QRange Chr10 𝚺7 82.334108..105.062097 AF=0.623 LOD=5.1..6.2>],
+"12"=>[#<QRange Chr12 𝚺9 21.707644..72.57041 AF=0.77 LOD=5.1..6.2>]}
+```
+
+which are all worth considering (I think). Obviously we could annotate all QTL in RDF triples and filter on those using SPARQL. But this approach makes processing a bit faster without having to deal with too much noise. We can fine-tune later.
+
+Now two more steps to go:
+
+* [X] Fetch all mapped traits using SPARQL and write RDF
+* [X] Compare QTL between datasets and annotate new hits
+
+## Fetch all mapped traits
+
+```
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId "10002" .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lod ;
+        gnt:af ?af .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+} ORDER BY DESC(?lod)
+```
+
+The first step is to fetch this data. Let's try SPARQL over the web first.
+
+## Compare QTL sets
+
+The previous code I wrote to compare QTLs essentially walks the QTLs and annotates a new QTL if there is no overlap between the two sets. Again, this code is too convoluted:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/18e7a3ac8a11becba84325499116621ad095f28e/lib/qtlrange.rb#L190
+
+The principle is straightforward, however. The code for reading the SPARQL output for a trait is
+
+```
+  CSV.foreach(fn,headers: true, col_sep: "\t") do |hit|
+    trait_id = hit["traitid"] if not trait_id
+    lod = hit["lod"].to_f
+    if lod > 5.0 # set for pangenome input
+      qlocus = QTL::QLocus.new(hit["snp"],hit["chr"],hit["pos"].to_f/10**6,hit["af"].to_f,lod)
+      qtls.add_locus(qlocus)
+    end
+  end
+```
+
+So we can use SPARQL to build two sets on the fly and then run the diff.
+
+Actually, when thinking about this I realised it should not be too hard to find the 'new' QTL directly in SPARQL.
+
+```
+SELECT * WHERE {
+?traitid a gnt:mappedTrait ;
+            gnt:traitId "10002" .
+}
+http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d
+http://genenetwork.org/id/HK_trait_BXDPublish_10002_gemma_GWA_hk_assoc_txt
+http://genenetwork.org/id/GEMMAMapped_test_LOCO_BXDPublish_10002_gemma_GWA_82087f23
+```
+
+lists the three versions of compute for this trait. To fetch all QTL for the first mapping:
+
+```
+SELECT ?qtl ?lod ?chr ?start ?stop (count(?snp) as ?snps) WHERE {
+?traitid a gnt:mappedTrait ;
+  gnt:traitId "10002" .
+?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlStart ?start ;
+  gnt:qtlStop ?stop ;
+  gnt:qtlLOD ?lod .
+?qtl gnt:mappedSnp ?snp .
+}
+```
+
+gets 3 QTL. Now, I did not store HK in RDF, but to show the filtering principle we can fetch two traits and compare QTL.
+The following gets two QTL from trait "10002" on chr1 and holds them against those of trait "10079":
+
+```
+SELECT ?t ?s1 ?e1 ?t2 ?s2 ?e2 WHERE {
+  ?traitid a gnt:mappedTrait ;
+  gnt:traitId ?t .
+  ?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlStart ?s1 ;
+  gnt:qtlStop ?e1 .
+  {
+    SELECT * WHERE {
+      ?tid a gnt:mappedTrait ;
+      gnt:traitId "10079" ;
+      gnt:traitId ?t2 .
+      ?qtl2 gnt:mappedQTL ?tid ;
+      gnt:qtlChr ?chr ;
+      gnt:qtlStart ?s2 ;
+      gnt:qtlStop ?e2 .
+    }
+  }
+  FILTER (?t = "10002") .
+} LIMIT 10
+
+"10002",171.172,183.154,"10079",172.235,172.235
+"10002",72.2551,73.3771,"10079",172.235,172.235
+```
+
+Note we pivot on two traits and one chromosome, so we find all pairs.
+To tell whether a QTL is *new* or different we can add another FILTER
+
+```
+FILTER ((?s2 > ?s1 && ?e2 > ?e1) || (?s2 < ?s1 && ?e2 < ?e1)) .
+"t","s1","e1","t2","s2","e2"
+"10002",72.2551,73.3771,"10079",172.235,172.235
+```
+
+that says that this ?qtl2 does not overlap with ?qtl. I.e. here it is a new QTL!
+
+This new insight means we should store *all* QTL in RDF, including the single-SNP ones, because it is easy to filter on them. Note that there may be a more elegant way to query traits pairwise; this is just the first thing that worked. It may need more tuning if there are more than two QTL on a chromosome. E.g. the comparison between 10002 and 10413 finds:
+
+```
+"t","s1","e1","t2","s2","e2"
+"10002",72.2551,73.3771,"10413",32.3113,42.4624
+"10002",171.172,183.154,"10413",171.04,171.041
+"10002",171.172,183.154,"10413",32.3113,42.4624
+"10002",72.2551,73.3771,"10413",171.04,171.041
+```
+
+I.e. it does find new QTL here, but you still need to do a little set analysis. In words: you should be able to "remove all overlapping QTL from a chromosome". Maybe we can filter the other way - select overlapping QTL and remove those from the result set.
+
+```
+BIND ((?s2 >= ?s1 && ?e2 <= ?e1) || (?s1 >= ?s2 && ?e1 <= ?e2) as ?overlap) .
+"10002",171.172,183.154,"10079",172.235,172.235,1
+"10002",72.2551,73.3771,"10079",172.235,172.235,0
+```
+
+now drop all ?t's that are overlapping. It appears to work with:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/show-qtls-two-traits.sparql
+
+I'll need to test it on the pangenome set.
+
+# Listing QTL
+
+To get all QTL from a run you can use something like
+
+```
+SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
+  ?traitid a gnt:mappedTrait ;
+  gnt:traitId ?t .
+  MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
+  # ?traitid gnt:run gn:test . # use if you want the new QTL
+  ?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlLOD ?lod ;
+  gnt:qtlStart ?s ;
+  gnt:qtlStop ?e .
+  ?qtl gnt:mappedSnp ?snp .
+  FILTER (?t = "10002" && ?lod >= 5.0 ) .
+} LIMIT 100
+```
+
+Note we filter on a trait name and LOD score.
+
+For panQTL (gnt:run == gn:test) this results in
+
+```
+"t"	"lod"	"snps"	"chr"	"start"	"end"
+"10002"	6.4	3	"15"	87.671663	98.028911
+"10002"	6.4	12	"4"	56.498971	147.86044
+"10002"	7	69	"1"	3.099543	192.718161
+"10002"	7.5	2774	"8"	34.303454	116.023702
+"10002"	6.2	7	"10"	82.334108	105.062097
+"10002"	6.2	2	"3"	47.644513	82.451061
+"10002"	6.2	1	"3"	130.145235	130.145235
+"10002"	6	2	"10"	13.442071	13.442088
+"10002"	6.2	9	"12"	21.707644	72.57041
+```
+
+For the traditional genotypes (gnt:run != gn:test)
+
+```
+"t"	"lod"	"snps"	"chr"	"start"	"end"
+"10002"	5.3	91	"1"	171.172	183.154
+"10002"	5.1	15	"1"	72.2551	73.3771
+```
+
+
+# Listing SNPs
+
+Now that we have all QTLs in the DB, as well as the underlying SNPs, one interesting question to ask is which SNPs are repeated across our traits. This, if you remember, is the key idea of reverse genetics.
+Of course, with our pangenome-derived genotypes, we now have thousands of SNPs per trait. Let's see if we can rank them by the number of traits.
+
+For our 1000 traits we map about 7.7M SNPs with a LOD>5.
+
+
+# Using sparql from emacs
+
+Note: if you are doing SPARQL quite a bit, I recommend using sparql-mode in emacs! It is easy, fast, and you can use git :)
+
+=> https://github.com/ljos/sparql-mode
+
+```
+M-x sparql-query-region [ENTER] http://sparql-test.genenetwork.org/sparql/ [ENTER]
+```
diff --git a/topics/genome-browser/hoot-genome-browser.gmi b/topics/genome-browser/hoot-genome-browser.gmi
new file mode 100644
index 0000000..219fda5
--- /dev/null
+++ b/topics/genome-browser/hoot-genome-browser.gmi
@@ -0,0 +1,21 @@
+# Hoot Genome Browser
+
+Together with Andrew we have created a genome browser that runs in WASM. Safari recently (2025-09) added support that is critical for hoot, so we should have it in all important browsers now!
+
+With this task tracker we want to embed the existing browser in GN and add tracks for mapped QTL.
+
+# Tags
+
+* assigned: andrewt, pjotrp
+* priority: high
+* status: open, in progress
+* keywords: mapping
+
+# Tasks
+
+* [ ] Embed hoot browser in GN2 as a pilot
+  + [ ] Guix package for JS and minimal JBrowse2?
+  + [ ] Embedding code in GN2
+* [ ] Create two tracks for QTL comparisons - vector data available
+* [ ] Create BED file for matched QTL - use SPARQL live?
+* [ ] Annotated SNPs
diff --git a/topics/gn-learning-team/next-steps.gmi b/topics/gn-learning-team/next-steps.gmi
new file mode 100644
index 0000000..b427923
--- /dev/null
+++ b/topics/gn-learning-team/next-steps.gmi
@@ -0,0 +1,48 @@
+# Next steps
+
+Wednesday we had a wrap-up meeting of the gn-learning efforts.
+
+## Data uploading
+
+The goal of these meetings was to learn how to upload data into GN. In the process Felix has become the de facto uploader, next to Arthur. A C. elegans dataset was uploaded and Felix is preparing
+
+* More C. elegans
+* HSRat
+* Killifish
+* Medaka
+
+Updates are here:
+
+=> https://issues.genenetwork.org/tasks/felixl
+
+We'll keep focussing on that work and hopefully we'll get more parties interested in doing some actual work down the line.
+
+## Hosting GN in Wageningen
+
+Harm commented that he thought these meetings were valuable; in particular, we learnt a lot about the ins and outs of GN. Harm suggests we focus on hosting GN in Wageningen for C. elegans and Arabidopsis.
+Pjotr says that is a priority this year, even if we start on a privately hosted machine in NL. Wageningen requires Docker images and Bonface says that is possible - with some work. So:
+
+* Host GN in NL
+* Make GN specific for C.elegans and Arabidopsis - both trim and add datasets
+* Create Docker container
+* Host Docker container in Wageningen
+* Present to other parties in Wageningen
+
+Having the above datasets will help this effort succeed.
+
+## AI
+
+Harm is also very interested in the AI efforts and wants to pursue that in the context of the above server - i.e., functionality arrives when it lands in GN.
+
+## Wormbase
+
+Jameson suggests we can work with Wormbase and the CaeNDR folks once we have a running system. Interactive data analysis is very powerful and could run in conjunction with those sites.
+
+=> https://caendr.org/
+=> https://wormbase.org/
+
+Other efforts are Flybase and Arabidopsis Magic which we can host, in principle.
+
+## Mapping methods
+
+Jameson will continue with his work on residuals.
diff --git a/topics/gn-uploader/genome-details.gmi b/topics/gn-uploader/genome-details.gmi
new file mode 100644
index 0000000..f8a12f6
--- /dev/null
+++ b/topics/gn-uploader/genome-details.gmi
@@ -0,0 +1,42 @@
+# Genome Details
+
+This file is probably misnamed.
+
+*TODO*: Update name once we know where this fits
+
+## Tags
+
+* type: documentation, doc, docs
+* assigned: fredm
+* priority: docs
+* status: open
+* keywords: gn-uploader, uploader, genome
+
+## Location
+
+### centiMorgan (cM)
+
+We no longer use centiMorgans in GeneNetwork.
+
+From the email threads:
+
+```
+> …
+> Sorry, we now generally do not use centimorgans. Chr 19 is 57 cM
+> using markers that exclude telomeres in most crosses.
+> …
+```
+
+and
+
+```
+> …
+> I know that cM is a bit more variable because it's not a direct measurement, …
+> …
+```
+
+### Megabasepairs (Mbp)
+
+The uploader will store any provided physical location values (in megabasepairs) in the
+=> https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#Geno Geno table
+specifically in the `Mb` field of that table.
diff --git a/topics/gn-uploader/genotypes-assemblies-markers-and-genenetwork.gmi b/topics/gn-uploader/genotypes-assemblies-markers-and-genenetwork.gmi
new file mode 100644
index 0000000..db0ddf3
--- /dev/null
+++ b/topics/gn-uploader/genotypes-assemblies-markers-and-genenetwork.gmi
@@ -0,0 +1,40 @@
+# Genotypes, Assemblies, Markers and GeneNetwork
+
+## Tags
+
+* type: documentation, docs, doc
+* keywords: genotype, assembly, markers, data, database, genenetwork, uploader
+
+## Markers
+
+```
+The marker is the SNP…
+
+— Rob (Paraphrased)
+```
+
+SNPs (Single Nucleotide Polymorphisms) are specific locations of interest within the genome, where the base pair can take different forms.
+
+A SNP and its immediate neighbourhood (a number of base pairs before and after the SNP) form a sequence that is effectively the marker, e.g. for mouse (Mus musculus) you could have the following sequence from the GRCm38 genome assembly (mm10):
+
+```
+GAGATAAAGATGGGTCCCTTGGCACAGGACTGGCCCACATTTCCaatataaattacaacaattttttttaaatttttaaaCAAAACAAGCATCTCACACAC/TTGAAAAAGAAGATGCATTCAAAGAAAATAGATGTTTCAATGTATTTAAGATAATCAAGAGATAACCATGACCATATCATGAGGAAACTTAAGAATTGGCA
+```
+
+where the position with `C/T` represents the SNP of interest and thus the marker.
+
+You can search this on the UCSC Genome Browser, specifically the 
+=> https://genome.ucsc.edu/cgi-bin/hgBlat BLAT search
+to get the name of the marker, and some extra details regarding it.
+
+## Genome Assemblies
+
+The genome assembly used will "determine" the position of the marker on the genome — newer assemblies will (generally) give a better position accounting for more of the issues discovered in older assemblies.
+
+With most of the newer assemblies, the positions do not shift very drastically.
+
+## GeneNetwork
+
+Currently (September 2024), GeneNetwork uses the GRCm38 (mm10) assembly for mice.
+
+Unfortunately, since the system was built for mice, the tables (e.g. Geno table) do not account for the fact that you could have markers (and other data) from species other than Mus musculus. You thus have the Geno table with fields like `Mb_mm8`, `Chr_mm8` which are very mouse-specific.
diff --git a/topics/gn-uploader/samplelist-details.gmi b/topics/gn-uploader/samplelist-details.gmi
new file mode 100644
index 0000000..2e64d8a
--- /dev/null
+++ b/topics/gn-uploader/samplelist-details.gmi
@@ -0,0 +1,17 @@
+# Explanation of how Sample Lists are handled in GN2 (and may be handled moving forward)
+
+## Tags
+
+* status: open
+* assigned: fredm, zsloan
+* priority: medium
+* type: documentation
+* keywords: strains, gn-uploader
+
+## Description
+
+Regarding the order of samples/strains, it can basically be whatever we decide it is. It just needs to stay consistent (like if there are multiple genotype files). It only really affects how the strains are displayed, and any other genotype files we use for mapping need to share the same order.
+
+I think this is the case regardless of whether it's strains or individuals (and both the code and files make no distinction). Sometimes it just logically makes sense to sort them in a particular way for display purposes (like BXD1, BXD2, etc), but technically everything would still work the same if you swapped those columns across all genotype files. Users would be confused about why BXD2 is before BXD1, but everything would still work and all calculations would give the same results.
+
+zsloan's proposal for handling sample lists in the future is to just store them in a JSON file in the genotype_files/genotype directory.
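+
+As a sketch of what such a file could look like (the name, path and layout here are hypothetical, not an agreed format):
+
+```python
+# Sketch: store and read the canonical sample order for a genotype group.
+import json
+
+def write_samplelist(path, group, samples):
+    with open(path, "w") as f:
+        json.dump({"group": group, "samples": samples}, f, indent=2)
+
+def read_samplelist(path):
+    with open(path) as f:
+        return json.load(f)["samples"]
+
+# hypothetical location inside genotype_files/genotype
+write_samplelist("genotype_files/genotype/BXD-samples.json",
+                 "BXD", ["BXD1", "BXD2", "BXD5"])
+```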
diff --git a/topics/gn-uploader/types-of-data.gmi b/topics/gn-uploader/types-of-data.gmi
new file mode 100644
index 0000000..1f53dec
--- /dev/null
+++ b/topics/gn-uploader/types-of-data.gmi
@@ -0,0 +1,63 @@
+# Types of Data in GeneNetwork
+
+## Tags
+
+* assigned:
+* priority:
+* status: open
+* type: documentation
+* keywords: gn-uploader, uploader, genenetwork, documentation, doc, docs, data, data type, types of data
+
+## Description
+
+There are five (5) main types of data in GeneNetwork:
+
+* Classical Phenotypes (PublishData)
+* High Content Data
+* Genotype Data
+* Cofactors and Attributes
+* Metadata
+
+### Classical Phenotypes
+
+This is usually low-content data, e.g. body weight, tail length, etc.
+
+This is currently saved in the `Publish*` tables in the database.
+
+This data is saved as-is, i.e. not log-transformed.
+
+### High Content Data
+
+This includes mainly molecular data such as
+* mRNA assay data
+* genetic expression data
+* probes
+* tissue type and data
+
+These data are saved in the `ProbeSet*` database tables (and other closely related tables like the `Tissue*` tables - fred added this: verify).
+
+These could be saved in the database in a log-transformed form - verify.
+
+How do you check for log-transformation in the data?
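+
+A quick heuristic (a sketch; the thresholds are assumptions, not GN policy) is that log2-transformed expression values sit in a narrow, non-negative range, roughly 0-20, while raw intensities are large and right-skewed:
+
+```python
+# Sketch: guess whether expression values look log2-transformed.
+def looks_log_transformed(values):
+    return min(values) >= 0 and max(values) < 30
+
+print(looks_log_transformed([6.2, 7.9, 12.3]))   # True: plausible log2 scale
+print(looks_log_transformed([1200.0, 54000.0]))  # False: raw intensities
+```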
+
+### Genotype Data
+
+This is core data, and all other data seem to rely on its existence.
+
+Useful for:
+* correlations, cofactor and PheWAS computations.
+* mapping purposes
+* search and display
+* editing and curation
+
+### Cofactors and Attributes
+
+This data can be alphanumeric (a mix of numerical and non-numerical values).
+
+It is not intended for mapping.
+
+### Metadata
+
+This data should (ideally) always accompany any and all of the data types above. It provides contextual information regarding the data it accompanies, and is useful for search and other contextualising operations.
+
+It is alphanumeric data, and mostly cannot be used for numeric computations.
diff --git a/topics/guix/genenetwork-fixating-guix.gmi b/topics/guix/genenetwork-fixating-guix.gmi
new file mode 100644
index 0000000..844b0fd
--- /dev/null
+++ b/topics/guix/genenetwork-fixating-guix.gmi
@@ -0,0 +1,34 @@
+# Fixating Guix for GN
+
+The GeneNetwork services depend on a rather complicated Guix deployment. The problem is not guix, but GN itself :)
+But we were getting bitten by updates on upstream, as well as updates on our different targets/services.
+
+# Using channels that affect GN production
+
+To avoid duplication of work and unknown rabbit holes we decided to fixate guix trunk and other dependencies by using Guix channels. This means all GN development happens on a single version of Guix! That version is defined here:
+
+=> https://git.genenetwork.org/gn-machines/tree/.guix-channel
+
+Note that guix-forge and guix-bioinformatics are *also* fixated. The idea is that we only upgrade GN packages in gn-machines itself by inheriting definitions. E.g.
+
+=> https://git.genenetwork.org/gn-machines/tree/guix/gn-machines/genenetwork.scm
+
+We will probably get rid of the guix-past and guix-rust-past-crates sub-channels soon by removing the packages that depend on them (genenetwork1 will get its own tree, and @alexm will upgrade the rust packages).
+
+If someone wants to update the guix channel or the guix-bioinformatics channel they should not update this file. The one in charge is @fredm. Fred has to be in control because we don't want to break production. It is forbidden to touch this channel file.
+
+People can patch the packages and gn-machines, but if it involves CI/CD and/or production in any way, Fred will have to know about it.
+
+# Service level channels
+
+For individual services, such as genenetwork2, genenetwork3, gn-auth, etc., we have local channel files. These should mirror the above gn-machines channel file to make sure we can migrate your code easily. E.g.
+
+=> https://github.com/genenetwork/genenetwork3/blob/main/.guix-channel
+
+Should match
+
+=> https://git.genenetwork.org/gn-machines/tree/.guix-channel
+
+If that is not the case we have a major problem! So before sending patches to Fred make sure the channels match.
+
+To be honest, I think we should fetch these channels automagically from gn-machines as a first step.
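+
+A minimal sketch of such a check (this helper is hypothetical; it compares two channel files after stripping Scheme comments and collapsing whitespace):
+
+```python
+# check-channels.py (hypothetical): compare a service's .guix-channel
+# against the one in gn-machines before sending patches.
+import sys
+
+def normalized(path):
+    with open(path) as f:
+        code = [line.split(";")[0] for line in f]  # drop ';' comments
+    return " ".join(" ".join(code).split())        # collapse whitespace
+
+a, b = sys.argv[1], sys.argv[2]
+if normalized(a) != normalized(b):
+    sys.exit(f"{a} and {b} differ - fix this before sending patches!")
+print("channels match")
+```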
diff --git a/topics/guix/guix-profiles.gmi b/topics/guix/guix-profiles.gmi
index 578bb82..15f7683 100644
--- a/topics/guix/guix-profiles.gmi
+++ b/topics/guix/guix-profiles.gmi
@@ -15,9 +15,9 @@ Note that a recently tested channel can be fetched from cd.genenetwork.org. That
 Alternatively put the following into a channels.scm file.
 ```
 (list (channel
-       (name 'gn-bioinformatics)
-       (url "https://gitlab.com/genenetwork/guix-bioinformatics")
-       (branch "master")))
+       (name 'gn-machines)
+       (url "https://git.genenetwork.org/gn-machines")
+       (branch "main")))
 ```
 Build a profile using
 ```
@@ -55,23 +55,6 @@ And everything should be in the PATH, PYTHONPATH etc.
 
 From time to time, updates to upstream Guix break our guix-bioinformatics channel. As a result, `guix pull` breaks and our work comes to a standstill for a few days until this can be fixed. While it is important to gradually move towards the latest and greatest, we would like to avoid being at the mercy of externalities and would prefer to update in a more controlled way without interrupting everyone's work.
 
-To this end, we hard-code the guix-bioinformatics channel to depend on a *specific* commit of upstream Guix that is tied to guix-bioinformatics, for example:
+To this end, we hard-code the guix-bioinformatics channel to depend on a specific commit of upstream Guix that is tied to guix-bioinformatics. This is why the recommended channels.scm file above does not include a %default-guix-channel. The drawback is that your entire system will be stuck at that specific commit of upstream Guix - unless you use a separate profile, which is why we highly recommend a dedicated `guix pull` profile for GeneNetwork work, as described above.
 
-```
-(list (channel
-        (name 'gn-bioinformatics)
-        (url "https://gitlab.com/genenetwork/guix-bioinformatics")
-        (branch "master")
-        (commit
-          "feff05b47c305d8c944499fbc00fd2126f2b881d")))
-```
-
-This is why the recommended channels.scm file above does not include a %default-guix-channel. However, this comes with the drawback that your entire system will be stuck at that specific commit of upstream Guix (but not if you use another profile as described above). We highly recommend using a separate `guix pull` profile specifically for GeneNetwork work, as described above.
-
-This scheme also comes with the added bonus that all members on the team and the continuous integration system will be using exactly the same Guix. Above channels.scm file is only exposed on a *succesful* build.
-
-## Notes
-
-We recently had to switch to gitlab because our git server went down on Penguin2. We may move to a cgit solution soon, see
-
-=> ../issues/cant-use-guix-bioinformatics-with-guix-pull.gmi
+This scheme also comes with the added bonus that all members on the team and the continuous integration system will be using exactly the same Guix.
\ No newline at end of file
diff --git a/topics/guix/packages.gmi b/topics/guix/packages.gmi
index a52f49b..b4a393c 100644
--- a/topics/guix/packages.gmi
+++ b/topics/guix/packages.gmi
@@ -2,15 +2,27 @@
 
 To deploy GN we have packages in Guix itself (that comes with a distribution), in guix-bioinformatics and in guix-past (for older packages).
 
+When you develop a new package it is best to run against a recent version of guix. Note that with GeneNetwork this is different as we 'fixate' guix at an older version. See
+
+=> genenetwork-fixating-guix
+
 Typically run a guix pull to get the latest guix:
 
 ```
 mkdir -p ~/opt
-guix package -i guix -p ~/opt/guix
+guix pull -p ~/opt/guix-pull
+unset GUIX_PROFILE # for Debian
+source ~/opt/guix-pull/etc/profile
 ```
 
 and checkout guix-past and guix-bioinformatics using git.
 
+Note that a codeberg pull may be faster (Guix recently moved main development to Codeberg):
+
+```
+guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix
+```
+
 Now Guix should be happy with
 
 ```
@@ -20,7 +32,7 @@ genenetwork2            3.11-2.1328932  out     /home/wrk/guix-bioinformatics/gn
 genenetwork3            0.1.0-2.e781996 out     /home/wrk/guix-bioinformatics/gn/packages/genenetwork.scm:107:4
 ```
 
-and we can try building
+Note that using the -L switch is a bit of a hack (normally we use channels). We can try building
 
 ```
 ~/opt/guix-pull/bin/guix build -L ~/guix-bioinformatics/ -L ~/guix-past/modules/ genenetwork2
diff --git a/topics/gunicorn/deploying-app-under-url-prefix.gmi b/topics/gunicorn/deploying-app-under-url-prefix.gmi
new file mode 100644
index 0000000..b2e382f
--- /dev/null
+++ b/topics/gunicorn/deploying-app-under-url-prefix.gmi
@@ -0,0 +1,121 @@
+# Deploying Your Flask Application Under a URL Prefix With GUnicorn
+
+## TAGS
+
+* type: doc, documentation, docs
+* author: fredm, zachs
+* keywords: flask, gunicorn, SCRIPT_NAME, URL prefix
+
+## Introduction
+
+You have your application and are ready to deploy it; however, for some reason, you want to deploy it under a URL prefix rather than at a top-level domain.
+
+This short article details the things you need to set up.
+
+## Set up Your WebServer (Nginx)
+
+You need to tell your webserver to serve the application under a particular URL prefix. You do this using that webserver's reverse-proxying configuration; for this article, we will use Nginx as the server.
+
+Normally, you'd simply do something like:
+
+```
+server {
+    server_name your.server.domain
+
+    ⋮
+
+    location /the-prefix/ {
+        proxy_pass    http://127.0.0.1:8080/;
+        proxy_set_header Host $host;
+        ⋮
+    }
+
+    ⋮
+}
+```
+
+Here, your top-level domain will be https://your.server.domain and you therefore want to access your shiny new application at https://your.server.domain/the-prefix/
+
+For a simple application, with no sessions or anything, this should mostly work, though you might run into trouble with things like static files (e.g. CSS, JS, etc.) if the application does not use the same ones as the application on the TLD.
+
+If you are using sessions, you might also run into an issue where there is an interaction in the session management of both applications, especially if the application on the TLD makes use of services from the application at the url prefix. This is mostly due to redirects from the url-prefix app getting lost and hitting the TLD app.
+
+To fix this, we change the configuration above to:
+
+```
+server {
+    server_name your.server.domain
+
+    ⋮
+
+    location /the-prefix/ {
+        proxy_pass    http://127.0.0.1:8080/the-prefix/;
+        proxy_set_header Host $host;
+        ⋮
+    }
+
+    ⋮
+}
+```
+
+but now, you get errors, since there is no endpoint in your shiny new app that is at the route /the-prefix/***.
+
+Enter GUnicorn!
+
+## Setting up SCRIPT_NAME for GUnicorn
+
+### The "Hacky" Way
+
+At the point of invocation of GUnicorn, we set the SCRIPT_NAME environment variable to the value "/the-prefix" — note that there is no trailing slash; this is very important. You should now have something like:
+
+```
+$ export SCRIPT_NAME="/the-prefix"
+$ gunicorn --bind 0.0.0.0:8082 --workers …
+```
+
+The first line tells GUnicorn what the URL prefix is. It will use this to compute what URL to pass to the flask application.
+
+For example, say you try accessing the endpoint
+
+```
+https://your.server.domain/the-prefix/auth/authorise?response_type=code&client_id=some-id&redirect_uri=some-uri
+```
+
+GUnicorn will split that URL into two parts using the value of the SCRIPT_NAME environment variable, giving you:
+
+* https://your.server.domain
+* /auth/authorise?response_type=code&client_id=some-id&redirect_uri=some-uri
+
+It will then pass on the second part to flask. This is why the value of SCRIPT_NAME should not have a trailing slash.
+
+Note that using the SCRIPT_NAME environment variable is a convenience feature provided by GUnicorn, not a WSGI feature. If you ever change your WSGI server, there is no guarantee this fix will work.
+
+### Using WSGI Routing MiddleWare
+
+A better way is to make use of a WSGI routing middleware. You could do this by defining a separate WSGI entry point in your application's repository.
+
+```
+# wsgi_url_prefix.py
+from werkzeug.wrappers import Response
+from werkzeug.middleware.dispatcher import DispatcherMiddleware
+
+from app import create_app
+
+def init_prefixed_app(theapp):
+    # Mount the real application under /the-prefix; any request that
+    # does not match the prefix falls through to a plain 404 response.
+    theapp.wsgi_app = DispatcherMiddleware(
+        Response("Not Found", 404),
+        {
+            "/the-prefix": theapp.wsgi_app
+        })
+    return theapp
+
+
+app = init_prefixed_app(create_app())
+```
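+
+With this entry point, GUnicorn can serve the prefixed application directly (e.g. `gunicorn --bind 0.0.0.0:8082 wsgi_url_prefix:app`), and the SCRIPT_NAME environment variable is no longer needed.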
+
+## References
+
+=> https://docs.gunicorn.org/en/latest/faq.html#how-do-i-set-script-name
+=> https://dlukes.github.io/flask-wsgi-url-prefix.html
+=> https://www.reddit.com/r/Python/comments/juwj3x/comment/gchdsld/
diff --git a/topics/hpc/octopus/slurm-user-guide.gmi b/topics/hpc/octopus/slurm-user-guide.gmi
index f7ea6d4..d0a3cc4 100644
--- a/topics/hpc/octopus/slurm-user-guide.gmi
+++ b/topics/hpc/octopus/slurm-user-guide.gmi
@@ -37,7 +37,6 @@ To get a shell prompt on one of the nodes (useful for testing your environment)
 srun -N 1 --mem=32G --pty /bin/bash
 ```
 
-
 # Differences
 
 ## Guix (look ma, no modules)
diff --git a/topics/lmms/bulklmm/readme.gmi b/topics/lmms/bulklmm/readme.gmi
new file mode 100644
index 0000000..8bd96a8
--- /dev/null
+++ b/topics/lmms/bulklmm/readme.gmi
@@ -0,0 +1 @@
+This is a stub
diff --git a/topics/lmms/gemma/permutations.gmi b/topics/lmms/gemma/permutations.gmi
new file mode 100644
index 0000000..4c8932a
--- /dev/null
+++ b/topics/lmms/gemma/permutations.gmi
@@ -0,0 +1,1014 @@
+# Permutations
+
+Currently we use gemma-wrapper to compute the significance level - by shuffling the phenotype vector 1000x.
+As this is a lengthy procedure we have not incorporated it into the GN web service. The new bulklmm may work
+in certain cases (genotypes have to be complete, for one).
+
+Because of many changes gemma-wrapper is not working for permutations. I have a few steps to take care of:
+
+* [X] read R/qtl2 format for phenotype
+
+# R/qtl2 and GEMMA formats
+
+See
+
+=> data/R-qtl2-format-notes
+
+# One-offs
+
+## Phenotypes
+
+For a study Dave handed me phenotype and covariate files for the BXD. Phenotypes look like:
+
+```
+
+Record ID,21526,21527,21528,21529,21530,21531,21532,21537,24398,24401,24402,24403,24404,24405,24406,24407,24408,24412,27513,27514,27515,27516,27517
+BXD1,18.5,161.5,6.5,1919.450806,3307.318848,0.8655,1.752,23.07,0.5,161.5,18.5,6.5,1919.450806,3307.318848,0.8655,1.752,0.5,32,1.5,1.75,2.25,1.25,50
+BXD100,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x
+BXD101,20.6,176.199997,4.4,2546.293945,4574.802734,1.729,3.245,25.172001,0.6,176.199997,20.6,4.4,2546.294189,4574.802734,1.7286,3.2446,0.6,32,1.875,2.375,2.75,1.75,38
+BXD102,18.785,159.582993,6.167,1745.671997,4241.505859,0.771,2.216,22.796667,0.25,159.583328,18.785,6.166667,1745.672485,4241.506348,0.770667,2.216242,0.25,28.08333,1.5,2,2.875,1.5,28.5
+...
+```
+
+which is close to the R/qtl2 format. GEMMA meanwhile expects a tab-delimited file where x=NA. You can pass in the column number with the -n switch. One thing GEMMA lacks is the first ID column, which has to align with the genotype file. The BIMBAM geno format, again, does not contain the IDs. See
+
+=> http://www.xzlab.org/software/GEMMAmanual.pdf
+
+What we need to do is create and use R/qtl2 format files, because they can be error-checked on IDs, and then convert those, again, to BIMBAM for use by GEMMA. In the past I wrote Python converters for gemma2lib:
+
+=> https://github.com/genetics-statistics/gemma2lib
+
+I kinda abandoned the project, but you can see a lot of functionality, e.g.
+
+=> https://github.com/genetics-statistics/gemma2lib/blob/master/gemma2/format/bimbam.py
+
+We also have bioruby-table as a generic command line tool
+
+=> https://github.com/pjotrp/bioruby-table
+
+which is an amazingly flexible tool and can probably do the same. I kinda abandoned that project too. You know, bioinformatics is a graveyard of projects :/
+
+OK, let's try. The first step is to convert the phenotype file to something GEMMA can use. We have to make sure that the individuals align with the genotype file(!). So, because we work with GN's GEMMA files, the steps are (a sketch follows the list):
+
+* [X] Read the JSON layout file - 'sample_list' is essentially the header of the BIMBAM geno file
+* [X] Use the R/qtl2-style phenotype file to write a correct GEMMA pheno file (multi column)
+* [X] Compare results with GN pheno output
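+
+A minimal sketch of these steps (file names are illustrative; simplified from what bin/rqtl2-pheno-to-gemma.py ended up doing, with no error checking):
+
+```python
+# Sketch: align an R/qtl2-style pheno CSV with the BIMBAM sample order.
+import csv
+import json
+
+with open("BXD.geno.json") as f:
+    samples = json.load(f)["sample_list"]      # header of the BIMBAM geno file
+
+with open("BXD_pheno_Dave.csv") as f:
+    reader = csv.reader(f)
+    ncols = len(next(reader)) - 1              # skip the "Record ID" header
+    rows = {row[0]: row[1:] for row in reader}
+
+with open("BXD_pheno_matched.txt", "w") as out:
+    for s in samples:                          # one row per genotyped sample
+        values = rows.get(s, ["x"] * ncols)    # missing sample: all NA
+        out.write("\t".join("NA" if v == "x" else v for v in values) + "\n")
+```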
+
+Running GEMMA by hand, it complained
+
+```
+## number of total individuals = 235
+## number of analyzed individuals = 26
+## number of covariates = 1
+## number of phenotypes = 1
+## number of total SNPs/var        =    21056
+## number of analyzed SNPs         =    21056
+Calculating Relatedness Matrix ...
+rsm10000000001, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0.5, 0, 1, 0, 1, 0.5, 0, 1, 0, 0, 0, 1, 1, 0, 0.5, 1, 1, 0.5, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0.5, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0.5, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0.5, 0, 0, 0.5, 0, 1, 0, 1, 0, 0, 1, 0.5, 0, 1, 0, 0.5, 1, 1, 1, 1, 0.5, 0, 0, 0.5, 1, 0.5, 0.5, 0.5, 1, 0.5, 1, 0.5, 0.5, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0.5, 0, 0, 1, 0, 0.5, 1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5
+237 != 235
+WARNING: Columns in geno file do not match # individuals in phenotypes
+ERROR: Enforce failed for not enough genotype fields for marker in src/gemma_io.cpp at line 1470 in BimbamKin
+```
+
+GEMMA on production is fine. So, I counted BXDs. For comparison, GN's pheno outputs 241 BXDs. Dave's pheno file has 241 BXDs (good). But when using my script we get 235 BXDs. Ah, apparently they are different from what we use on GN because GN does not use the parents and the F1s for GEMMA. So, my script should complain when a match is not made. Turns out the JSON file only contains 235 'mappable' BXDs and refers to BXD.8 which is from Apr 26, 2023. The header says `BXD_experimental_DGA_7_Dec_2021` and GN says WGS March 2022. So which one is it? I'll just go with the latest, but genotype naming is problematic and the headers are not updated.
+
+> MOTTO: Always complain when there are problems!
+
+Luckily GEMMA complained, but the script should have also complained. The JSON file with 235 genometypes does not represent the actual 237 genometypes. We'll work on that in the next section.
+
+Meanwhile let's add this code to gemma-wrapper. The code can be found here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rqtl2-pheno-to-gemma.py
+
+## Genotypes
+
+The pheno script now errors with
+
+```
+ERROR: sets differ {'BXD065xBXD102F1', 'C57BL/6J', 'DBA/2J', 'BXD077xBXD065F1', 'D2B6F1', 'B6D2F1'}
+```
+
+Since these are parents and F1s, and are all NAs in Dave's phenotypes, they are easy to remove. So, now we have 235 samples in the phenotype file and 237 genometypes in the genotype file (according to GEMMA). A quick check shows that BXD.geno has 236 genometypes. Same for the bimbam on production. We now have 3 values: 235, 236 and 237. The question is why these do not agree.
+
+### Genotype probabilities for GEMMA
+
+Another problem on production is that we are not using the standard GEMMA values. So GEMMA complains with
+
+```
+WARNING: The maximum genotype value is not 2.0 - this is not the BIMBAM standard and will skew l_lme and effect sizes
+```
+
+This explains why we divide the effect size by 2 in the GN production code. Maybe it is a better idea to fix the geno files!
+
+* [X] Generate BIMBAM file from GENO .geno files (via R/qtl2)
+* [X] Check bimbam files on production
+
+So we need to convert .geno files as they are the current source of genotypes in GN and contain the sample names that we need to align with pheno files. For this we'll output two files - one JSON file with metadata and sample names and the actual BIMBAM file GEMMA requires. I notice that I actually never had the need to parse a geno file! Zach wrote a tool `gn2/maintenance/convert_geno_to_bimbam.py` that also writes the GN JSON file and I'll take some ideas from that. We'll also need to convert to R/qtl2 as that is what Dave can use and then on to BIMBAM. So, let's add that code to gemma-wrapper again.
+
+This is another tool at
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py
+
+where the generated JSON file helps create the pheno file. We ended up with 237 genometypes/samples to match the genotype file and all of Dave's samples matched. Also, now I was able to run GEMMA successfully and passed in the pheno column number with
+
+```
+gemma -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5
+gemma -lmm 9 -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -k output/result.cXX.txt -n 5
+```
+
+The pheno file can include the sample names as long as there are no spaces in them. For marker rs3718618 we get the values `-9 0 X Y 0.317 7.930689e+02 1.779940e+02 1.000000e+05 7.532662e-05`. The last value translates to
+
+```
+-Math.log10(7.532662e-05) => 4.123051519468808
+```
+
+and that matches GN's run of GEMMA without LOCO.
+
+The next step is to make the -n switch run with LOCO on gemma-wrapper.
+
+```
+./bin/gemma-wrapper --loco --json --  -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > K.json
+./bin/gemma-wrapper --keep --force --json --loco --input K.json -- -lmm 9 -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > GWA.json
+```
+
+Checking the output we get
+
+```
+-Math.log10(3.191755e-05) => 4.495970452606926
+```
+
+and that matches Dave's output for LOCO and marker rs3718618. All good, so far. Next step permute.
+
+## Permute
+
+Now we have gemma-wrapper working we need to fix it to work with the latest type of files.
+
+* [X] randomize phenotypes using -n switch
+* [X] Permute gemma and collect results
+* [X] Unseed randomizer or make it an option
+* [X] Fix tmpdir
+* [X] Show final score
+* [X] Compare small and large BXD set
+
+For the first one, the --permutate-phenotype switch takes the input pheno file. Because we pick a column with gemma we can randomize all input lines together. So, in the above example, we shuffle BXD_pheno_Dave-GEMMA.txt. Interestingly it looks like we are already shuffling by line in gemma-wrapper.
+
+The good news is that it runs, but the outcome is wrong:
+
+```
+["95 percentile (significant) ", 1000.0, -3.0]
+["67 percentile (suggestive)  ", 1000.0, -3.0]
+```
+
+Inspecting the phenotype files they are shuffled, e.g.
+
+```
+BXD073xBXD065F1 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
+BXD49 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
+BXD86 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
+BXD161  15.623  142.908997  4.0 2350.637939 3294.824951 1.452 2.08  20.416365 0.363636  142.909088  15.622727 4.0 2350.638672 3294.825928 1.451636  2.079909  0.363636  33.545448 2.125 2.0 2.375 1.25  44.5
+BXD154  20.143  195.5 4.75  1533.689941 4568.76416  0.727 2.213748  27.9275 0.75  195.5 20.142857 4.75  1533.690796 4568.76416  0.72675 2.213748  0.75  54.5  0.75  1.75  3.0 1.5 33.0
+```
+
+which brings out an interesting point. Most BXDs in the genotype file are missing from this experiment. We are computing LOD scores as if we have a full BXD population. So, what we are saying here is: if we have all BXD genotypes and we randomly assign phenotypes against a subset, what is the chance we get a hit at random? I don't think this is a bad assumption, but it is not exactly what Gary Churchill had in mind in his 1994 paper:
+
+=> https://pubmed.ncbi.nlm.nih.gov/7851788/ Empirical threshold values for quantitative trait mapping
+
+The idea is to shuffle phenotypes against genotypes: if there is a high correlation we get a result, and shuffling breaks that correlation, which should work for both the large and the small BXD set. Scoring the best 'random' result out of 1000 permutations at, say, the 95% level sets the significance threshold.
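+
+In pseudo-Python the whole procedure looks something like this (a sketch of the idea only; `run_gwas` is a hypothetical stand-in for a full GEMMA scan, not a real function):
+
+```python
+# Sketch of the Churchill & Doerge (1994) permutation threshold idea.
+import random
+
+def significance_thresholds(genotypes, phenotypes, n_perm=1000):
+    best = []
+    for _ in range(n_perm):
+        # break the genotype-phenotype correlation
+        shuffled = random.sample(phenotypes, len(phenotypes))
+        # keep the best (smallest) p-value over all markers
+        best.append(min(run_gwas(genotypes, shuffled)))
+    best.sort()
+    # the p-values that only 5% (significant) or 33% (suggestive)
+    # of the permutations manage to beat
+    return best[int(0.05 * n_perm)], best[int(0.33 * n_perm)]
+```
+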
+With our new precompute we should be able to show the difference. Anyway, that is one problem, the other is that the stats somehow do not add up to the final result. Score min is set at
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/7769f209bcaff2472ba185234fad47985e59e7a3/bin/gemma-wrapper#L667
+
+The next line says 'if false'. Alright, that explains part of it at least as the next block was disabled for slurm and is never run. I should rip the slurm stuff out, actually, as Arun has come up with a much better solution. But that is for later.
+
+Disabling that, the permutation run stopped with
+
+```
+Add parallel job: time -v /bin/gemma -loco X -k 02fe8482913a998e6e9559ff5e3f1b89e904d59d.X.cXX.txt.cXX.txt -o 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt -p phenotypes-1 -lmm 9 -g BXD-test.txt -n 5 -a BXD.8_snps.txt -outdir /tmp/d20240823-4481-xfrnp6
+DEBUG: Reading 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt.1.assoc.txt
+./bin/gemma-wrapper:672:in `foreach': No such file or directory @ rb_sysopen - 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt.1.assoc.txt (Errno::ENOENT)
+```
+
+So it created a file, but it can't find it because outdir is not shared. Now tmpdir is in the outer block, so the file should still exist. For troubleshooting, the first step is to seed the randomizer so we get the same run every time.
+It turns out there are a number of problems. First of all the permutation output was numbered and the result was not found. Fixing that gave a first result without the -parallel switch:
+
+```
+[0.0008489742, 0.03214928, 0.03426648, 0.0351207, 0.0405179, 0.04688354, 0.0692488, 0.1217158, 0.1270747, 0.1880325]
+["95 percentile (significant) ", 0.0008489742, 3.1]
+["67 percentile (suggestive)  ", 0.0351207, 1.5]
+```
+
+That is pleasing and it suggests that we have a significant result for the trait of interest: `volume of the first tumor that developed`. Running LOCO without parallel is slow (how did we survive in the past!).
+
+The 100 run shows
+
+```
+[0.0001626146, 0.0001993085, 0.000652191, 0.0007356249, 0.0008489742, 0.0009828207, 0.00102203, 0.001091924, 0.00117823, 0.001282312, 0.001471041, 0.001663572, 0.001898194, 0.003467039, 0.004655921, 0.005284387, 0.005628393, 0.006319995, 0.006767502, 0.007752473, 0.008757406, 0.008826192, 0.009018125, 0.009735282, 0.01034488, 0.01039465, 0.0122644, 0.01231366, 0.01265093, 0.01317425, 0.01348443, 0.013548, 0.01399461, 0.01442383, 0.01534904, 0.01579931, 0.01668551, 0.01696015, 0.01770371, 0.01838937, 0.01883068, 0.02011034, 0.02234977, 0.02362105, 0.0242342, 0.02520063, 0.02536663, 0.0266905, 0.02932001, 0.03116032, 0.03139836, 0.03176087, 0.03214928, 0.03348359, 0.03426648, 0.0351207, 0.03538503, 0.0354338, 0.03609931, 0.0371134, 0.03739827, 0.03787489, 0.04022586, 0.0405179, 0.04056273, 0.04076034, 0.04545012, 0.04588635, 0.04688354, 0.04790254, 0.05871501, 0.05903692, 0.05904868, 0.05978341, 0.06103624, 0.06396175, 0.06628317, 0.06640048, 0.06676557, 0.06848021, 0.0692488, 0.07122914, 0.07166011, 0.0749728, 0.08174019, 0.08188341, 0.08647539, 0.0955264, 0.1019648, 0.1032776, 0.1169525, 0.1182405, 0.1217158, 0.1270747, 0.1316735, 0.1316905, 0.1392859, 0.1576149, 0.1685975, 0.1880325]
+["95 percentile (significant) ", 0.0009828207, 3.0]
+["67 percentile (suggestive)  ", 0.01442383, 1.8]
+```
+
+Not too far off!
+
+The command was
+
+```
+./bin/gemma-wrapper --debug --no-parallel --keep --force --json --loco --input K.json --permutate 100 --permute-phenotype BXD_pheno_Dave-GEMMA.txt -- -lmm 9 -g BXD-test.txt -n 5 -a BXD.8_snps.txt
+```
+
+It is fun to see that when I did a second run the
+
+```
+[100, ["95 percentile (significant) ", 0.0002998286, 3.5], ["67 percentile (suggestive)  ", 0.01167864, 1.9]]
+```
+
+significance value was 3.5. Still, our hit is a whopper - based on this.
+
+## Run permutations in parallel
+
+Next I introduced and fixed parallel support for permutations; now we can run gemma LOCO at decent speed - about 1 permutation per 3s! That is one trait per hour on my machine.
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/a8d3922a21c7807a9f20cf9ffb62d8b16f18c591
+
+Now we can run 1000 permutations in an hour, rerunning above we get
+
+```
+["95 percentile (significant) ", 0.0006983356, 3.2]
+["67 percentile (suggestive)  ", 0.01200505, 1.9]
+```
+
+which proves that 100 permutations is not enough. It is a bit crazy to think that 5% of randomized phenotypes will get a LOD score of 3.2 or higher!
+
+Down the line I can use Arun's CWL implementation to fire this on a cluster. Coming...
+
+## Reduce genotypes for permutations
+
+In the next phase we need to check if shuffling the full set of BXDs makes sense for computing permutations. Since I wrote a script for this exercise to transform BIMBAM genotypes I can reuse that:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/a8d3922a21c7807a9f20cf9ffb62d8b16f18c591/bin/gn-geno-to-gemma.py#L31
+
+If we check the sample names we can write a reduced genotype matrix. Use that to compute the GRM. Next permute with the smaller BXD sample set and genotypes.
+
+Instead of modifying above script I decided to add another one
+
+```
+bimbam-filter.py --json BXD.geno.json --sample-file BXD_pheno_Dave-GEMMA-samples.txt BXD_geno.txt > BXD_geno-samples.txt
+```
+
+which takes as inputs the JSON file from gn-geno-to-gemma and the GEMMA input file. This avoids mixing targets and keeps the code simple. Now create the GRM with
+
+```
+./bin/gemma-wrapper --loco --json --  -gk -g BXD_geno-samples.txt -p BXD_pheno_Dave-GEMMA-samples.txt -n 5 -a BXD.8_snps.txt > K-samples.json
+./bin/gemma-wrapper --keep --force --json --loco --input K-samples.json -- -lmm 9 -g BXD_geno-samples.txt -p BXD_pheno_Dave-GEMMA-samples.txt -n 5 -a BXD.8_snps.txt > GWA-samples.json
+```
+
+Now the hit got reduced:
+
+```
+-Math.log10(1.111411e-04)
+=> 3.9541253091741235
+```
+
+and with 1000 permutations
+
+```
+./bin/gemma-wrapper --debug --parallel --keep --force --json --loco --input K-samples.json --permutate 1000 --permute-phenotype BXD_pheno_Dave-GEMMA-samples.txt -- -lmm 9 -g BXD_geno-samples.txt -n 5 -a BXD.8_snps.txt
+["95 percentile (significant) ", 0.0004184217, 3.4]
+["67 percentile (suggestive)  ", 0.006213012, 2.2]
+```
+
+we are still significant, though the question now is why the results differ so much compared to using the full BXD genotypes.
+
+## Why do we have a difference with the full BXD genotypes?
+
+GEMMA strips out the missing phenotypes in a list. Only the actual phenotypes are used. We need to check how the GRM is used and what genotypes are used by GEMMA. For the GRM the small genotype file compares vs the large:
+
+```
+Samples           small    large
+BXD1  <->  BXD1   0.248    0.253
+BXD24 <->  BXD24  0.255    0.248
+BXD1  <->  BXD24 -0.040   -0.045
+BXD1  <->  BXD29  0.010    0.009
+```
+
+You can see there is a small difference in the computation of K even though it looks pretty close. This is logical because with the full BXD set all genotypes are used. With a smaller BXD set only those genotypes are used. We expect a difference in values, but not much of a difference in magnitude (shift). The only way to prove that K impacts the outcome is to take the larger matrix and reduce it to the smaller one using those values. I feel another script coming ;)
+
+Above numbers are without LOCO. With LOCO on CHR18
+
+```
+Samples            small    large
+BXD1  <->  BXD1    0.254    0.248
+BXD1  <->  BXD24  -0.037    -0.042
+```
+
+again a small shift. OK, let's try computing with a reduced matrix and compare results for rs3718618. Example:
+
+```
+gemma -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt -o full-bxd
+gemma -lmm 9 -k output/full-bxd.cXX.txt -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt -o full-bxd
+```
+
+we get three outcomes where full-bxd is the full set,
+```
+output/full-bxd.assoc.txt:18              rs3718618 7.532662e-05
+output/full-reduced-bxd.assoc.txt:18      rs3718618 2.336439e-04
+output/small-bxd.assoc.txt:18             rs3718618 2.338226e-04
+```
+
+even without LOCO you can see a huge jump for the full BXD kinship matrix, just looking at our hit rs3718618:
+
+```
+-Math.log10(7.532662e-05)
+=> 4.123051519468808
+-Math.log10(2.338226e-04)
+=> 3.631113514641496
+```
+
+With LOCO the difference may be even greater.
+
+So, which one to use? Truth is that the GRM is a blunt instrument. Essentially every combination of two samples/strains/genometypes gets compressed into a single number that gives a distance between the genomes. This number represents a hierarchy of relationships computed in differences in DNA (haplotypes) between those individuals. The more DNA variation is represented in the calculation, the more 'fine tuned' this GRM matrix becomes. Instinctively the larger matrix, or full BXD population, is a better estimate of distance between the individuals than just using a subset of DNA.
+
+So, I still endorse using the full BXD set for computing the GRM. To run GEMMA, I have just proven we can use the reduced GRM, which will be quite a bit faster too, as the results are the same. For permutations we *should* use the reduced form of the full BXD GRM, as it does not make sense to shuffle phenotypes against BXDs we don't use. So I need to recompute that.
+
+## Recomputing significance with the reduced GRM matrix
+
+* [ ] Recompute significance with reduced GRM
+
+I can reuse the script I wrote for the previous section.
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/grm-filter.py
+
+So, the idea is to rerun permutations with the small set, but with the reduced GRM from the full BXD population. That ought to be straightforward by using the new matrix as an input for GWA. The only problem is that LOCO generates a GRM for every chromosome, so we need to make gemma-wrapper aware of the matrix reduction. As the reduction is fast we can do it for every run of gemma-wrapper and destroy it automatically with tmpdir. So (a sketch of the reduction itself follows the list):
+
+* [X] Compute the full GRM for every LOCO (if not cached) - already part of gemma-wrapper
+* [X] Run through GRMs and reduce them in tmpdir
+* [X] Plug new GRM name into computations - which really updates the JSON file that is input for GWA
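+
+Reducing a GRM itself is just selecting the matching rows and columns. A minimal sketch of the idea behind grm-filter.py (simplified; the numbers below are illustrative):
+
+```python
+# Sketch: reduce a full GRM to the samples that are actually phenotyped.
+def reduce_grm(grm, all_samples, keep_samples):
+    idx = [all_samples.index(s) for s in keep_samples]
+    return [[grm[i][j] for j in idx] for i in idx]
+
+full = [[0.253, -0.045, 0.009],   # BXD1
+        [-0.045, 0.248, 0.011],   # BXD24
+        [0.009,  0.011, 0.250]]   # BXD29
+print(reduce_grm(full, ["BXD1", "BXD24", "BXD29"], ["BXD1", "BXD29"]))
+# [[0.253, 0.009], [0.009, 0.25]]
+```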
+
+The interesting bit is that GEMMA requires input of phenotypes, but does not use them to compute the GRM.
+
+After giving it some thought we want GRM reduction to work in production GN because of the speed benefit. That means modifying gemma-wrapper to take a list of samples/genometypes as input - and we'll output that with GN. It is a good idea anyhow because it can give us some improved error feedback down the line.
+
+We'll use the --input switch to gemma-wrapper by providing the full list of genometypes that are used to compute the GRM and the 'reduced' list of genometypes that are used to reduce the GRM and compute GWA after.
+So the first step is to create this JSON input file. We already created the "gn-geno-to-gemma" output that has a full list of samples as parsed from the GN .geno file. Now we need a script to generate the reduced samples JSON and merge that to "gn-geno-to-gemma-reduced" by adding a "samples-reduced" vector.
+
+The rqtl2-pheno-to-gemma.py script I wrote above already takes the "gn-geno-to-gemma" JSON. It now adds to the JSON:
+
+```
+  "samples-column": 2,
+  "samples-reduced": {
+    "BXD1": 18.5,
+    "BXD24": 27.510204,
+    "BXD29": 17.204,
+    "BXD43": 21.825397,
+    "BXD44": 23.454,
+    "BXD60": 22.604,
+    "BXD63": 19.171,
+    "BXD65": 21.607,
+    "BXD66": 17.056999,
+    "BXD70": 17.962999,
+    "BXD73b": 20.231001,
+    "BXD75": 19.952999,
+    "BXD78": 19.514,
+    "BXD83": 18.031,
+    "BXD87": 18.258715,
+    "BXD89": 18.365,
+    "BXD90": 20.489796,
+    "BXD101": 20.6,
+    "BXD102": 18.785,
+    "BXD113": 24.52,
+    "BXD124": 21.762142,
+    "BXD128a": 18.952,
+    "BXD154": 20.143,
+    "BXD161": 15.623,
+    "BXD210": 23.771999,
+    "BXD214": 19.533117
+  },
+  "numsamples-reduced": 26
+```
+
+which is kinda cool because now I can reduce and write the pheno file in one go. Implementation:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rqtl2-pheno-to-gemma.py
+
+OK, we are going to input the resulting JSON file into gemma-wrapper. At the GRM stage we ignore the reduction but we need to add these details to the outgoing JSON. So the following commands can run:
+
+```
+./bin/gemma-wrapper --loco --json --input BXD_pheno_Dave-GEMMA.txt.json -- -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > K.json
+```
+
+where K.json has a json["input"] which essentially is above structure.
+
+```
+./bin/gemma-wrapper --keep --force --json --loco --input K.json -- -lmm 9 -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > GWA.json
+```
+
+Now I have to deal with phenotype files as they are rewritten. We should still cater for `-p` for GEMMA. We already have `--permute-phenotypes filen` for gemma-wrapper. Now we are adding `--phenotypes` to gemma-wrapper which replaces both!
+Note that we can use -p if --phenotypes is NOT defined. Problem is we have a few paths now:
+
+* [X] Check phenotypes are directly passed into GEMMA with -p switch
+* [X] Check phenotypes are passed in as a file with --phenotypes switch
+* [X] Check phenotypes are coming in using the JSON file
+
+Fixed the first one with
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/2b7570a7f0ba0d1080c730b208823c0622dd8f2c
+
+though that does not do caching (yet). Next, when doing LOCO, I noticed xz is phenomenally slow. Turns out it was not xz, but when using `tar -C` we switch into the path and somehow xz kept growing its output.
+
+At this point David told me that we don't have to do epoch or covariates. So it is just the traits. After getting side-tracked by a slow-running Python program for haplotype assessment we start up again.
+
+So, now we can pass in a trait using JSON. This is probably not a great idea when you have a million values, but for our purposes it will do. K.json contains the reduced samples. Next GWA is run on that. I had to fix minor niggles and get `parallel` to give more useful debug info.
+
+Next write the pheno file and pass it in!
+
+```
+./bin/gemma-wrapper  --debug --verbose --force  --loco --json --lmdb --input K.json -- -g test/data/input/BXD_geno.txt.gz  -a test/data/input/BXD_snps.txt  -lmm 9 -maf 0.05 -n 2 -debug
+```
+
+Note the '-n 2' switch to get the second generated column in the phenotype file. We had our first successful run! To run permutations I get:
+
+```
+./bin/gemma-wrapper:722:in `<main>': You should supply --permute-phenotypes with gemma-wrapper --permutate (RuntimeError)
+```
+
+and, of course, as this reduced file is generated it is not available yet. That was an easy fix/hack. Next I got
+
+```
+./bin/gemma-wrapper:230:in `block in <main>': Do not use the GEMMA -p switch with gemma-wrapper if you are using JSON phenotypes!
+```
+
+Hmm. This is a bit harder. The call to GWAS takes a kinship matrix and it gets reduced with every permutation. That is probably OK because it runs quickly, but I'll need to remove the -p switch... OK. Done that and permutations are running in a second for 28 BXD! That implies computing significance in the web service comes into view - especially if we use a cluster on the backend.
+
+It is interesting to see that 60% of time is spent in the kernel - which means still heavy IO on GEMMA's end - even with the reduced data:
+
+```
+%Cpu0  : 39.1 us, 51.0 sy
+%Cpu1  : 34.0 us, 54.8 sy
+%Cpu2  : 35.8 us, 54.5 sy
+%Cpu3  : 37.5 us, 49.8 sy
+%Cpu4  : 36.0 us, 53.3 sy
+%Cpu5  : 29.5 us, 57.9 sy
+%Cpu6  : 42.7 us, 44.7 sy
+%Cpu7  : 35.9 us, 52.2 sy
+%Cpu8  : 27.0 us, 60.7 sy
+%Cpu9  : 24.5 us, 63.2 sy
+%Cpu10 : 29.8 us, 58.9 sy
+%Cpu11 : 25.3 us, 62.7 sy
+%Cpu12 : 28.1 us, 58.9 sy
+%Cpu13 : 34.2 us, 52.8 sy
+%Cpu14 : 34.6 us, 52.2 sy
+%Cpu15 : 37.5 us, 51.8 sy
+```
+
+There is room for more optimization.
+
+The good news is that, for a peak we have, we find that it is statistically significant:
+
+```
+["95 percentile (significant) ", 0.0004945423, 3.3]
+["67 percentile (suggestive)  ", 0.009975183, 2.0]
+```
+
+Even though the permutation count was low, there was actually a real bug: it turns out I only picked the values from the X chromosome (ugh!). It looks different now.
+
+For the peaks of
+
+=> https://genenetwork.org/show_trait?trait_id=21526&dataset=BXDPublish
+
+after 1000 permutations (I tried a few times) the significance threshold with MAF 0.05 ends up at approximately
+
+```
+["95 percentile (significant) ", 1.434302e-05, 4.8]
+["67 percentile (suggestive)  ", 0.0001620244, 3.8]
+```
+
+If so, it means that for this trait BXD_21526 the peaks on chr 14 at LOD 3.5 are not significant, but close to suggestive (aligning with Dave's findings and comments). It is interesting to see the numbers quickly stabilize by 100 permutations (see attached). Now, this is before correcting for epoch effects and other covariates. And I took the data from Dave as-is (the distribution looks fairly normal). Also there is a problem with MAF I have to look into:
+
+GEMMA in GN2 shows the same result when setting MAF to 0.05 or 0.1 (you can try that). The GN2 GEMMA code for LOCO does pass in -maf (though I see that non-LOCO does not - ugh again). I need to run GEMMA to see if the output should differ and I'll need to see the GN2 logs to understand what is happening. Maybe it just says that the hits are haplotype driven - and that kinda makes sense because there is a range of them.
+
+That leads me to think that we only need to check for epoch when we have a single *low* MAF hit, say 0.01 for 28 mice. As we actively filter on MAF right now we won't likely see an epoch hit.
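+
+For reference, a sketch of how a minor allele frequency falls out of 0/1/2-coded BIMBAM genotypes (this is the textbook computation, not GEMMA's actual code):
+
+```python
+# Sketch: minor allele frequency from 0/1/2 genotype codes.
+def maf(genotypes):
+    calls = [g for g in genotypes if g is not None]  # drop missing values
+    af = sum(calls) / (2 * len(calls))               # allele frequency
+    return min(af, 1 - af)
+
+# a single heterozygous carrier among 28 mice: MAF ~ 0.018,
+# so a -maf 0.05 filter drops exactly this kind of single low-MAF hit
+print(maf([0] * 27 + [1]))
+```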
+
+
+## Protocol for permutations
+
+First we run GEMMA without LOCO, using the default settings that GN uses:
+
+```
+# Convert the GN geno file to BIMBAM geno file
+./bin/gn-geno-to-gemma.py BXD.geno > BXD.geno.txt
+# Match pheno file
+./bin/rqtl2-pheno-to-gemma.py BXD_pheno_Dave.csv --json BXD.geno.json > BXD_pheno_matched.txt
+  Wrote GEMMA pheno 237 from 237 with genometypes (rows) and 24 collections (cols)!
+gemma -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5
+gemma -lmm 9 -g BXD.geno.txt -p BXD_pheno_matched.txt -k output/result.cXX.txt -n 5
+```
+
+So far the output is correct.
+
+```
+-Math.log10(7.532460e-05)
+=> 4.123063165904243
+```
+
+Try with gemma-wrapper
+
+```
+./bin/gemma-wrapper --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5 -a BXD.8_snps.txt > K.json
+cp output/bab43175329bd14d485e582b7ad890cf0ec28915.cXX.txt /tmp
+```
+
+Works, but the following failed without the -n switch:
+
+```
+./bin/gemma-wrapper --debug --verbose --force --json --lmdb --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -p BXD_pheno_matched.txt -n 5
+```
+
+and worked with it. That is logical; if you see output like
+
+```
+19      rs30886715      46903165        0       X       Y       0.536   0.000000e+00    0.000000e+00    1.000000e-05    1.000000e+00
+19      rs6376540       46905638        0       X       Y       0.536   0.000000e+00    0.000000e+00    1.000000e-05    1.000000e+00
+19      rs50610897      47412184        0       X       Y       0.538   0.000000e+00    0.000000e+00    1.000000e-05    1.000000e+00
+```
+
+it means the phenotype column that was parsed has empty values - in this case the BXD strain names. GEMMA should show a meaningful error.
+
+Now that this works, we can move to a full LOCO run:
+
+
+```
+./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5 -a BXD.8_snps.txt > K.json
+./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 5
+./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb
+```
+
+and we get
+
+```
+18,69216071,rs3718618,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69825784,rs50446650,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,68189477,rs29539715,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+```
+
+When we converted BXD.geno to its BIMBAM BXD.geno.txt we also got a BXD.geno.json file which contains a list of the individuals/genometypes that were used in the genotype file.
+
+Now we reduce the traits file to something GEMMA can use for permutations - adding the trait number and outputting BXD_pheno_Dave.csv.json:
+
+```sh
+./bin/rqtl2-pheno-to-gemma.py BXD_pheno_Dave.csv --json BXD.geno.json -n 5 > BXD_pheno_matched-5.txt
+```
+
+The matched file should be identical to the earlier BXD_pheno_matched.txt file. Meanwhile, if you inspect the JSON file you should see
+
+```
+jq < BXD_pheno_Dave.csv.json
+  "samples-column": 5,
+  "trait": "21529",
+  "samples-reduced": {
+    "BXD1": 1919.450806,
+    "BXD101": 2546.293945,
+    "BXD102": 1745.671997,
+```
+
+So far we are OK!
+
+At this point we have a reduced sample set, a BIMBAM file and a phenotype file GEMMA can use!
+
+```
+./bin/gemma-wrapper --loco --json --input BXD_pheno_Dave.csv.json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -a BXD.8_snps.txt -n 5 > K.json
+```
+
+Note that at this step we actually create a full GRM. Reducing happens in the next mapping stage.
+
+```
+./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt  -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05
+```
+
+Note the use of the '-n' switch. We should change that.
+
+```
+./bin/view-gemma-mdb /tmp/test/8599834ee474b9da9ff39cc4954d662518a6b5c8.mdb --sort
+```
+
+Look for rs3718618 at 69216071: I am currently getting the wrong result for trait 21529 and it is not clear why:
+
+```
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+16,88032783,?,0.538,-134.1339,75.7837,0.0,0.0009,3.02
+16,88038734,?,0.538,-134.1339,75.7837,0.0,0.0009,3.02
+(...)
+18,69216071,?,0.462,10.8099,93.3936,0.0,0.8097,0.09
+```
+
+The failing command is:
+
+```
+/bin/gemma -loco 18 -k /tmp/test/reduced-GRM-18.txt.tmp -o 69170e8a2d2f08905daa14461eca1d82a676b4c4.18.assoc.txt -p /tmp/test/reduced-pheno.txt.tmp -n 2 -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 -outdir /tmp/test
+```
+
+produces
+
+```
+18  rs3718618       69216071        0       X       Y       0.462   -2.161984e+01   9.339365e+01    1.000000e-05    8.097026e-01
+```
+
+The pheno file looks correct, so it has to be the reduced GRM. And this does not look good either:
+
+```
+number of SNPS for K            =     7070
+number of SNPS for GWAS         =      250
+```
+
+When running GEMMA on genenetwork.org we get a peak for LOCO at that position for rs3718618. I note that LOCO, at 4.5 vs 4.1 for the non-LOCO version, gives a higher peak. We should compute the significance for both!
+
+Now, when I run the non-LOCO version by hand I get
+
+```
+-Math.log10(7.532460e-05)
+=> 4.123063165904243
+```
+
+## Finally
+
+So, we rolled back to not using reduced phenotypes for now.
+
+For trait 21529 after 1000 permutations we get for LOCO:
+
+```
+["95 percentile (significant) ", 1.051208e-05, 5.0]
+["67 percentile (suggestive)  ", 0.0001483188, 3.8]
+```
+
+which means our GWA hit at 4.5 is not so close to being significant.
+
+Next I made sure the phenotypes got shuffled against the BXD used - which is arguably the right thing to do.
+It should not have a huge impact because the BXDs share haplotypes - so randomized association should end up in the same ball park. The new result after 1000 permutations is:
+
+```
+["95 percentile (significant) ", 8.799303e-06, 5.1]
+["67 percentile (suggestive)  ", 0.0001048443, 4.0]
+```
+
+## More for Dave
+
+
+Run and permute:
+
+```
+./bin/gemma-wrapper --lmdb --debug --phenotypes BXD_pheno_matched.txt --verbose --force --loco  --json --input K.json -- -g BXD.geno.txt -a BXD.8. -lmm 9 -maf 0.05 -n 2 -p BXD_pheno_matched.txt
+./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 2 --verbose --force --loco --json --input K.json -- -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05
+```
+
+```
+21526 How old was the mouse when a tumor was first detected?
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+14,99632276,?,0.462,-0.6627,0.3322,100000.0,0.0003,3.56
+14,99694520,?,0.462,-0.6627,0.3322,100000.0,0.0003,3.56
+17,80952261,?,0.538,0.6528,0.3451,100000.0,0.0005,3.31
+["95 percentile (significant) ", 6.352578e-06, 5.2]
+["67 percentile (suggestive)  ", 0.0001007502, 4.0]
+```
+
+```
+24406 What was the weight of the first tumor that developed, at death?
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+11,9032629,?,0.536,0.1293,0.0562,100000.0,0.0,4.36
+11,9165457,?,0.536,0.1293,0.0562,100000.0,0.0,4.36
+11,11152439,?,0.5,0.126,0.0562,100000.0,0.0001,4.21
+11,11171143,?,0.5,0.126,0.0562,100000.0,0.0001,4.21
+11,11525458,?,0.5,0.126,0.0562,100000.0,0.0001,4.21
+11,8786241,?,0.571,0.1203,0.0581,100000.0,0.0002,3.78
+11,8836726,?,0.571,0.1203,0.0581,100000.0,0.0002,3.78
+11,19745817,?,0.536,0.1183,0.061,100000.0,0.0003,3.46
+11,19833554,?,0.536,0.1183,0.061,100000.0,0.0003,3.46
+["95 percentile (significant) ", 1.172001e-05, 4.9]
+["67 percentile (suggestive)  ", 0.0001175644, 3.9]
+```
+
+```
+27515 No description
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+4,103682035,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,103875085,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104004372,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104156915,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104166428,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,104584276,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57
+4,103634906,?,0.519,-0.1497,0.0733,100000.0,0.0002,3.67
+4,103640707,?,0.519,-0.1497,0.0733,100000.0,0.0002,3.67
+["95 percentile (significant) ", 7.501004e-06, 5.1]
+["67 percentile (suggestive)  ", 7.804668e-05, 4.1]
+```
+
+## Dealing with significance
+
+Now the significance thresholds appear to be a bit higher than we expect. So, let's see what is going on. First I check the randomization of the phenotypes. That looks great. There are 1000 different phenotype files and they randomize only the BXDs we used. Let's zoom in on our most interesting trait, 27515. When running in GN2 I get more hits - they are at the same level, but somehow SNPs have dropped off. In those runs our SNP of interest shows only a few higher values:
+
+```
+./6abd89211d93b0d03dc4281ac3a0abe7fc10da46.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.932957e-01 7.337327e-02    1.000000e+05    2.700506e-04
+./b6e58d6092987d0c23ae1735d11d4a293782c511.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.413067e-01 6.416133e-02    1.000000e+05    5.188637e-04
+./4266656951ab0c5f3097ddb4bf917448d7542dd5.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   2.757074e-01  6.815899e-02    1.000000e+05    2.365318e-04
+./265e44a4c078d2a608b7117bbdcb9be36f56c7de.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   2.358494e-01  5.743872e-02    1.000000e+05    1.996261e-04
+napoli:/export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/tmp/test$ rg 103682035 .|grep 5$
+./b29f08a4b1061301d52f939087f1a4c1376256f0.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.841255e-01 6.194426e-02    1.000000e+05    5.220922e-05
+./3e5b12e9b7478b127b47c23ccdfba2127cf7e2b2.4.assoc.txt.assoc.txt:4      rs28166983      103682035       0       X       Y       0.481   -2.813968e-01 6.379554e-02    1.000000e+05    8.533857e-05
+```
+
+but none are as high as the original hit of 5.57:
+
+```
+irb(main):001:0> -Math.log10(2.700506e-04)
+=> 3.5685548534637
+irb(main):002:0> -Math.log10(5.220922e-05)
+=> 4.282252795052573
+irb(main):003:0> -Math.log10(8.533857e-05)
+=> 4.06885463879464
+```
+
+All good. This leaves two things to look into. First, I see fewer hits than with GN2(!). Second, qnorm gives a higher peak in GN2.
+
+* [X] Check for number of SNPs
+
+The number of SNPs used is not enough - only 636 of the 21056 make it into the GWAS:
+
+```
+GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022
+Reading Files ...
+## number of total individuals = 237
+## number of analyzed individuals = 26
+## number of covariates = 1
+## number of phenotypes = 1
+## leave one chromosome out (LOCO) =        1
+## number of total SNPs/var        =    21056
+## number of SNPS for K            =     6684
+## number of SNPS for GWAS         =      636
+## number of analyzed SNPs         =    21056
+```
+
+Even when disabling MAF filtering we still see only a subset of SNPs being used. I wonder what GN2 does here.
+
+## Missing SNPs
+
+In our results we miss SNPs that are listed on GN2, but do appear in our genotypes, e.g.
+
+```
+BXD.8_snps.txt
+19463:rsm10000013598, 69448067, 18
+```
+
+First of all we find we used a total of 6360 SNPs out of the original 21056. For this SNP the genotype files show:
+
+```
+BXD_geno.txt
+19463:rsm10000013598, X, Y, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1, 0.5, 1, 1, 1, 1, 0, 1, 0, 1, 0.5, 0, 0, 0, 1, 0.5, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0.5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 1, 0, 0, 0, 1, 1, 1, 0.5, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 1, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0.5, 1, 0.5, 0, 0.5
+```
+
+and in our updated
+
+```
+BXD.geno.txt
+rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1
+```
+
+That looks good. Turns out we need the annotation file(?!)
+
+I figured out where the missing SNPs went. It turns out that if you pass in an annotation file and it is not complete, GEMMA unceremoniously drops the non-annotated SNPs. Getting the right annotation file fixed it. GEMMA should obviously not behave like that ;). Anyway, I am in sync with GN2 now. Unfortunately, with permutations, the significance threshold did not change much (which kind of makes sense).
+
+I want to see why gemma is giving this number. If I can't find it fast I'll try to run bulklmm or R/qtl2 lmm instead and see if they disagree with gemma and if we can get close to what Rob expects.
+
+
+```
+gemma -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 22
+gemma -lmm 9 -g BXD.geno.txt -p BXD_pheno_matched.txt -k output/result.cXX.txt -n 22
+```
+
+Now that this works, we can move to a full LOCO run:
+
+```
+./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5  -a BXD.8_snps.txt > K.json
+./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 5
+./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb
+```
+
+and we get
+
+```
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+18,69216071,rs3718618,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69448067,rsm10000013598,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69463065,rsm10000013599,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69803489,rsm10000013600,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69825784,rs50446650,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,69836387,rsm10000013601,0.635,-195.5784,82.1243,100000.0,0.0,4.5
+18,68188822,rsm10000013579,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68189477,rs29539715,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68195226,rsm10000013580,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68195289,rsm10000013581,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68195758,rsm10000013582,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68454446,rs30216358,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68514475,rs6346101,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68521138,rsm10000013583,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68526029,rs29984158,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68542739,rsm10000013584,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68543456,rsm10000013585,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68564736,rsm10000013586,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+18,68565230,rsm10000013587,0.596,-189.7332,79.7479,100000.0,0.0,4.49
+```
+
+which is in line with GN2.
+
+Run and permute:
+
+```
+./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 2 --verbose --force --loco --json --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05
+```
+
+* [X] Test significance effect for higher and lower MAF than 0.05
+
+Lower MAF increases significance thresholds?
+
+```
+0.05?
+["95 percentile (significant) ", 6.268117e-06, 5.2]
+["67 percentile (suggestive)  ", 7.457537e-05, 4.1]
+
+0.01
+["95 percentile (significant) ", 5.871237e-06, 5.2]
+["67 percentile (suggestive)  ", 7.046853e-05, 4.2]
+```
+
+* [ ] Check distribution of hits with permutations
+
+## What about significance
+
+What we are trying to do here is decide on a significance level that says the chance of a hit caused by a random event is less than 1 in a thousand. We are currently finding levels of 5.0, while from earlier work it should be less than 4.0. We are essentially following Gary Churchill's '94 paper ``Empirical threshold values for quantitative trait mapping''. The significance level depends on the shape of the data - i.e., the shape of both the genotypes and the trait under study. A significance threshold of 5.0 at alpha=0.05 means that 5% of random trait vectors can be expected to show a LOD score of 5 or higher; with 1000 permutations, that threshold is simply the 50th-highest genome-wide peak.
+
+What GEMMA does is look for a correlation between a marker, e.g.
+
+```
+BXD.geno.txt
+rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1
+```
+
+and a trait that is measured for a limited number of these individuals/strains/genometypes. We also correct for kinship between the individuals, but that is tied to the individuals, so we can ignore it for now. So you get a vector of:
+
+```
+marker rsm10000013598
+ind  trait
+0     8.1
+0     7.9
+2     12.3
+2     13.4
+```
+
+We permute the data by breaking the correlation between the left and right columns. When running 1000 permutations for this particular hit we find that the shuffled trait never gets a higher value than our main run. That is comforting, because random permutations are always less correlated (for this marker).
+
+If we do this genome-wide, after shuffling the trait vector the highest hit lands at a random position across the chromosomes, and our hit never comes out on top. E.g.
+
+```
+[10, ["2", "rs13476914", "170826974"], ["95 percentile (significant) ", 1.870138e-05, 4.7], ["67 percentile (suggestive)  ", 6.3797e-05, 4.2]]
+[11, ["6", "rsm10000004149", "25227945"], ["95 percentile (significant) ", 1.870138e-05, 4.7], ["67 percentile (suggestive)  ", 6.3797e-05, 4. 2]]
+[12, ["9", "rsm10000006852", "81294046"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 4.216931e-05, 4.4]]
+[13, ["2", "rsm10000001382", "57898368"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 6.3797e-05, 4. 2]]
+[14, ["1", "rsm10000000166", "94030054"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 6.3797e-05, 4. 2]]
+[15, ["X", "rsm10000014672", "163387262"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive)  ", 6.3797e-05, 4 .2]]
+```
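+
+For reference, the two threshold lines are just quantiles over these per-permutation genome-wide best hits. A minimal sketch of the computation in Python (the input file name is hypothetical; gemma-wrapper does this internally):
+
+```python
+import numpy as np
+
+# Hypothetical input: the genome-wide minimum p-value from each of the
+# 1000 permuted runs (one value per permutation).
+p_min = np.loadtxt("permutation_minima.txt")
+
+# "95 percentile (significant)" is the 5th percentile of the minimum
+# p-values (small p means high -logP); "67 percentile (suggestive)" is
+# the 33rd percentile.
+significant = np.quantile(p_min, 0.05)
+suggestive = np.quantile(p_min, 1 - 0.67)
+
+print(["95 percentile (significant) ", significant, round(-np.log10(significant), 1)])
+print(["67 percentile (suggestive)  ", suggestive, round(-np.log10(suggestive), 1)])
+```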
+
+### Shuffling a normally distributed trait
+
+
+So the randomization works well. Still, our 95th percentile is close to 5.0, and that is from chance alone. What happens when we change the shape of the data? Let's create a new trait whose distribution is random and normal:
+
+```
+> rnorm(25, mean = 10, sd = 2)
+ [1] 10.347116  9.475156 11.747876 10.969742 11.374611 12.283834 11.499779
+ [8] 11.123520 10.830300 11.640049 10.392085 11.586836 11.540470 10.700869
+[15]  8.802858 10.238498 11.099536  8.832104  6.463636 10.347956 11.222558
+[22]  8.658024  7.796304 10.684967  9.540483
+```
+
+These random trait values render a hit of -Math.log10(8.325683e-04) = 3.0! Now we permute and we get:
+
+```
+["95 percentile (significant) ", 5.22093e-06, 5.3]
+["67 percentile (suggestive)  ", 7.303966e-05, 4.1]
+```
+
+So the shape of a normally distributed trait gives a higher threshold - it is easier to get a hit by chance.
+
+### Genotypes
+
+So 95% of randomly shuffled trait runs still give us 5.x. This has to be a property of the genotypes in conjunction with the method GEMMA applies. With regard to genotypes, the BXD are not exactly random, because they share markers from two parents which run along haplotypes - i.e., we are dealing with a patchwork of similar genotypes. You might expect that to suppress the chance of finding random hits. Let's try to prove that by creating fully random genotypes and an extreme haplotype set. And, for good measure, something in between.
+
+* [X] Fully random genotypes
+
+In the next phase we are going to play a bit with the haplotypes. First we fully randomize the genotype matrix; this way we break all haplotypes. As BIMBAM is a simple format, we'll just modify an existing BIMBAM file. It looks like:
+
+```
+rs3677817,X,Y,1.77,0.42,0.18,0.42,1.42,0.34,0.69,1.57,0.52,0.1,0.37,1.27,0.62,1.87,1.71,1.65,1.83,0.04,1.05,0.52,1.92,0.57,0.61,0.11,1.49,1.07,1.48,1.7,0.5,1.75,1.74,0.29,0.37,1.78,1.91,1.37,1.64,0.32,0.09,1.21,1.58,0.4,1.0,0.62,1.1,0.7,0.35,0.86,0.7,0.46,1.14,0.04,1.87,1.96,0.61,1.34,0.63,1.04,1.95,0.22,0.54,0.31,0.14,0.95,1.45,0.93,0.37,0.79,1.37,0.87,1.79,0.41,1.73,1.25,1.49,1.57,0.39,1.61,0.37,1.85,1.83,1.71,1.5,1.78,1.34,1.29,1.41,1.54,1.05,0.3,0.87,1.85,0.5,0.19,1.54,0.53,0.26,1.47,0.67,0.84,0.18,0.79,0.68,1.48,0.4,1.83,1.76,1.09,0.2,1.48,0.24,0.53,0.41,1.24,1.38,1.31,1.73,0.52,1.86,1.21,0.58,1.68,0.79,0.4,1.41,0.07,0.57,0.42,0.47,0.49,0.05,0.77,1.33,0.15,1.41,0.03,0.24,1.66,1.39,2.0,0.23,1.4,1.05,0.79,0.51,0.66,1.24,0.29,1.12,0.46,0.92,1.12,1.53,1.78,1.22,1.35,0.1,0.43,0.41,1.89,0.09,0.13,1.04,0.24,1.4,1.25,0.24,0.26,0.31,0.36,0.31,1.34,1.23,1.91,0.7,0.08,1.43,0.17,1.9,0.06,1.42,1.94,0.43,0.54,1.96,1.29,0.64,0.82,1.85,1.63,0.23,1.79,0.52,1.65,1.43,0.95,1.13,0.59,0.07,0.66,1.79,0.92,1.89,1.2,0.51,0.18,0.96,0.44,0.46,0.88,0.39,0.89,1.68,0.07,1.46,1.61,1.73,0.56,1.33,1.67,0.16,1.78,0.61,1.55,0.88,0.15,1.98,1.96,0.61,0.04,0.12,1.4,1.65,0.71,1.3,1.85,0.49
+```
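+
+Judging from the values above, the randomization simply replaces every genotype with a uniform random draw on [0, 2]. A minimal sketch of that idea in Python (not the actual rewrite script; it reads a BIMBAM file on stdin):
+
+```python
+import random
+import sys
+
+# Replace every genotype in a BIMBAM mean-genotype file with a uniform
+# random draw on [0, 2]. This keeps the marker layout intact while
+# destroying all haplotype and allele-frequency structure.
+for line in sys.stdin:
+    fields = line.strip().split(",")
+    marker, a1, a2, genos = fields[0], fields[1], fields[2], fields[3:]
+    randomized = [str(round(random.uniform(0.0, 2.0), 2)) for _ in genos]
+    print(",".join([marker, a1, a2] + randomized))
+```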
+
+We'll stick in the old hit for good measure and run our genotypes:
+
+```
+./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.rand.txt -p BXD_pheno_matched.txt -n 5  -a BXD.8_snps.txt > K.json
+./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.rand.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 22
+./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb
+./bin/view-gemma-mdb /tmp/e279abbebee8e41d7eb9dae...-gemma-GWA.tar.xz --anno BXD.8_snps.txt|head -20
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+X,139258413,rsm10000014629,0.496,0.2248,0.093,100000.0,0.0,4.58
+6,132586518,rsm10000003691,0.517,0.2399,0.1068,100000.0,0.0001,4.17
+2,161895805,rs27350606,0.585,-0.2303,0.1059,100000.0,0.0001,4.0
+X,47002415,rsm10000014323,0.562,-0.1904,0.0877,100000.0,0.0001,3.99
+3,32576363,rsm10000001568,0.468,-0.2251,0.104,100000.0,0.0001,3.97
+14,19281191,rs52350512,0.5,-0.2454,0.1154,100000.0,0.0001,3.88
+7,111680092,rs32385258,0.536,0.2022,0.0968,100000.0,0.0002,3.79
+4,151267320,rsm10000002095,0.604,-0.2257,0.1102,100000.0,0.0002,3.69
+2,157353289,rs27323024,0.455,0.2188,0.1072,100000.0,0.0002,3.67
+19,56503719,rsm10000013894,0.617,0.2606,0.1302,100000.0,0.0003,3.58
+```
+
+Interestingly our trait did not do that well:
+
+```
+18,69448067,rsm10000013598,0.635,0.0941,0.0774,100000.0,0.0167,1.78
+```
+
+This shows how large the impact of the GRM is. Now we can run our permutations:
+
+```
+./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 22 --verbose --force --loco --json --input K.json -- -g BXD.geno.rand.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05
+["95 percentile (significant) ", 1.478479e-07, 6.8]
+["67 percentile (suggestive)  ", 1.892087e-06, 5.7]
+```
+
+Well, that went through the roof :). It makes sense when you think about it: randomizing the genotypes of 21K SNPs gives you a high chance of finding SNPs that correlate with the trait. Let's go the other way and give 20% of the markers the exact same haplotype, basically copying:
+
+```
+rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1
+```
+
+```
+./bin/bimbam-rewrite.py --inject inject.geno.txt BXD.geno.txt --perc=20 > BXD.geno.20.txt
+rg -c "2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1" BXD.geno.20.txt
+4276
+```
+
+so 4K out of 20K SNPs have an identical haplotype which correlates with our trait of interest:
+
+```
+["95 percentile (significant) ", 5.16167e-06, 5.3]
+["67 percentile (suggestive)  ", 6.163728e-05, 4.2]
+```
+
+and at 40% haplotype injection we get
+
+```
+["95 percentile (significant) ", 3.104788e-06, 5.5]
+["67 percentile (suggestive)  ", 7.032406e-05, 4.2]
+```
+
+* [X] Haplotype equal genotypes 20% and 40%
+
+This all looks interesting, but it does not help.
+
+When we halve the number of SNPs the results are similar too:
+
+```
+["95 percentile (significant) ", 6.026549e-06, 5.2]
+["67 percentile (suggestive)  ", 8.571557e-05, 4.1]
+```
+
+Even though the threshold is high, it is interesting to see that no matter what you do, you end up at similar levels. After a meeting with Rob and Saunak, the latter pointed out that these numbers are not completely surprising. For LMMs we need to use an adaptation - i.e., shuffle the trait values after rotation and transformation, and then reverse that procedure. The only extra requirement is an assumption of normality, which Churchill's method does not need. The good news is that BulkLMM contains that method, and thresholds will be lower. The bad news is that I'll have to adapt it, because it does not handle missing data.
+
+Oh yes, rereading the Churchill paper from 1994 I now realise he also suggests an at-marker significance method that will end up lower - we saw that already in an earlier comparison. Saunak, however, says that we *should* use experiment-wide thresholds.
+
+## BulkLMM
+
+* [ ] Run bulklmm
+
+
+## Dealing with epoch
+
+Rob pointed out that the GRM does not necessarily represent epoch and that may influence the significance level. I.e. we should check for that. I agree that the GRM distances are not precise enough (blunt instrument) to capture a few variants that appeared in a new epoch of mice. I.e., the mice from the 90s may be different from the mice today in a few DNA variants that won't be reflected in the GRM.
+
+* [ ] Deal with epoch
+
+We have two or more possible solutions to deal with hierarchy in the population, such as adding epoch as a covariate (see the next section).
+
+## Covariates
+
+* [ ] Try covariates Dave
+
+## Later
+
+* [ ] Check running our trait without LOCO with both standard and random GRMs
+* [ ] Test non-loco effect for rsm10000013598 - looks too low and does not agree with GN2
+* [X] Try qnorm run
+* [ ] Fix non-use of MAF in GN for non-LOCO
+* [ ] Fix running of -p switch when assoc cache exists (bug)
+
+Relevant reading: Quantile-Based Permutation Thresholds for Quantitative Trait Loci Hotspots, by Karl, Ritsert et al. (Genetics, 2012)
+=> https://academic.oup.com/genetics/article/191/4/1355/5935078
diff --git a/topics/lmms/rqtl2/genenetwork-rqtl2-implementation.gmi b/topics/lmms/rqtl2/genenetwork-rqtl2-implementation.gmi
new file mode 100644
index 0000000..452930f
--- /dev/null
+++ b/topics/lmms/rqtl2/genenetwork-rqtl2-implementation.gmi
@@ -0,0 +1,71 @@
+# Implementation of QTL Analysis Using r-qtl2 in GeneNetwork
+## Tags
+
+* Assigned: alexm
+* Keywords: RQTL, GeneNetwork2, implementation
+* Type: Feature
+* Status: In Progress
+
+## Description
+
+This document outlines the implementation of a QTL analysis tool in GeneNetwork using r-qtl2 (see docs: https://kbroman.org/qtl2/) and explains what the script does.  
+This PR contains the implementation of the r-qtl2 script for genenetwork:  
+=> https://github.com/genenetwork/genenetwork3/pull/201
+
+## Tasks
+
+The script currently aims to achieve the following:
+
+* [x] Parsing arguments required for the script
+* [x] Data validation for the script
+* [x] Generating the cross file
+* [x] Reading the cross file
+* [x] Calculating genotype probabilities
+* [x] Performing Geno Scan (scan1) using HK, LOCO, etc.
+* [x] Finding LOD peaks
+* [x] Performing permutation tests
+* [x] Conducting QTL analysis for multiparent populations
+* [ ] Generating required plots
+
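+Under the hood these steps map onto a handful of r-qtl2 calls. A minimal sketch, not the wrapper itself, using the grav2 example data that ships with qtl2 (parameters here are illustrative):
+
+```r
+library(qtl2)
+
+# Read a cross from a control file (the grav2 example bundled with qtl2).
+grav2 <- read_cross2(system.file("extdata", "grav2.zip", package = "qtl2"))
+
+# Genotype probabilities on a 1 cM pseudomarker grid.
+map <- insert_pseudomarkers(grav2$gmap, step = 1)
+pr <- calc_genoprob(grav2, map, error_prob = 0.002)
+
+# Genome scan (Haley-Knott) of the first phenotype, plus a permutation test.
+pheno <- grav2$pheno[, 1, drop = FALSE]
+out <- scan1(pr, pheno)
+operm <- scan1perm(pr, pheno, n_perm = 100)
+thr <- summary(operm, alpha = 0.05)  # 5% genome-wide threshold
+
+# Report LOD peaks above the permutation threshold.
+find_peaks(out, map, threshold = as.numeric(thr))
+```
+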
+## How to Run the Script
+
+The script requires an input file containing all the necessary data to generate the control file. Example:
+
+```json
+{
+    "crosstype": "riself",
+    "geno_file": "grav2_geno.csv",
+    "geno_map_file": "grav2_gmap.csv",
+    "pheno_file": "grav2_pheno.csv",
+    "phenocovar_file": "grav2_phenocovar.csv"
+}
+
+```
+In addition, the following parameters are required:
+
+* --output_file: The file path where the output for the script will be generated.
+* --directory: The workspace directory where the control file will be generated.
+
+Optional parameters:
+
+* --cores: The number of cores to use (set to 0 for using all cores).
+* --method: The scanning method to use (e.g., Haley-Knott, Linear Mixed Model, or LMM with Leave-One-Chromosome-Out).
+* --pstrata: Use permutation strata.
+* --threshold: Minimum LOD score for a peak.
+
+
+An example of how to run the script:
+
+```sh
+Rscript rqtl2_wrapper.R --input_file [file_path] --directory [workspace_dir] --output_file [file_path] --nperm 100 --cores 3
+```
+## Related issues
+
+=> https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2
+=> ./using-rqtl2
+=> ./gn-rqtl-design-implementation
diff --git a/topics/lmms/rqtl2/gn-rqtl-design-implementation.gmi b/topics/lmms/rqtl2/gn-rqtl-design-implementation.gmi
new file mode 100644
index 0000000..f37da42
--- /dev/null
+++ b/topics/lmms/rqtl2/gn-rqtl-design-implementation.gmi
@@ -0,0 +1,203 @@
+# RQTL Implementation for GeneNetwork Design Proposal
+
+## Tags
+
+* Assigned: alexm
+* Keywords: RQTL, GeneNetwork2, Design
+* Type: Enhancements
+* Status: In Progress
+
+## Description
+
+This document outlines the design proposal for the re-implementation of the RQTL feature in GeneNetwork, also providing a console view to track the stdout from the external process.
+
+### Problem Definition
+
+The current RQTL implementation faces the following challenges:
+
+- Lack of adequate error handling for the API and scripts.
+
+- Insufficient separation of concerns between GN2 and GN3.
+
+- Lack of a way for users to track the progress of the r-qtl script being executed.
+
+- No clear, well-defined way in which the r-qtl script is executed.
+
+We will address these challenges and add enhancements by:
+
+- Rewriting the R script using r-qtl2 instead of r-qtl.
+
+- Establishing clear separation of concerns between GN2 and GN3, eliminating file path transfers between the two.
+
+- Implementing better error handling for both the API and the RQTL script.
+
+- Running the script as a job in a task queue.
+
+- Piping stdout from the script to the browser through a console for real-time monitoring.
+
+- Improving the overall design and architecture of the system.
+
+
+
+## High-Level Design
+This is divided into three major components:
+
+* GN3 RQTL-2 Script implementation
+* RQTL API
+* Monitoring system for the rqtl script
+
+
+### GN3 RQTL-2 Script implementation
+We currently have a script written with rqtl:
+=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/rqtl_wrapper.R
+There is a newer implementation, r-qtl2, a reimplementation of the QTL analysis software R/qtl designed to better handle high-dimensional data and complex cross designs.
+To see the difference between the two, see the documentation:
+=> https://kbroman.org/qtl2/assets/vignettes/rqtl_diff.html
+We aim to implement a separate script using r-qtl2 while maintaining the one implemented using rqtl1 (rqtl).
+(TODO) This probably needs to be split into a new issue, once we know enough, to capture each computation step in the R script.
+
+### RQTL API
+
+
+This component will serve as the entry point for running RQTL in GN3. At this stage, we need to improve the overall architecture and error handling. This process will be divided into the following steps:
+
+- Data Validation:
+In this step, we must validate that all required data to run RQTL is provided in the JSON format. This includes the mapping method, genotype file, phenotype file, etc. Please refer to the r-qtl2 documentation for an overview of the requirements:
+=> https://rqtl.org/
+
+- Data Preprocessing:
+During this stage, we will transform the data into a format that R can understand. This includes converting boolean values to the appropriate representations, preparing the RQTL command with all required values, and adding defaults where necessary.
+
+- Data Computation:
+In this stage, we will pass the RQTL script command to the task queue to run as a job.
+
+- Output Data Processing:
+In this step, we need to retrieve the results output by the script in a specified format, such as JSON or CSV, and process the data. This may include outputs like RQTL pair scans and generated diagrams. Please refer to the documentation for an overview:
+=> https://rqtl.org/
+
+
+
+**Subtasks:**
+
+- [ ] Add the RQTL API endpoint (10%)
+- [ ] Input data validation (15%)
+- [ ] Input data processing (20%)
+- [ ] Passing data to the R script for the computation (40%)
+- [ ] Output data processing (80%)
+- [ ] Add unit tests for this module (100%)
+
+
+### Monitoring system for the rqtl script
+
+This component involves creating a monitoring system to track the state of the external process and output relevant information to the user.
+We need a way to determine the status of the current job, for example QUEUED, STARTED, IN PROGRESS, COMPLETED (see the deep dive for more on this).
+
+
+## Deep Dive
+
+
+### Running the External Script
+The RQTL implementation is in R, and we need a strategy for executing this script as an external process. This can be subdivided into several key steps:
+
+- **Task Queue Integration**:
+  - We will utilize a task queue system to manage script execution. We already have an implementation in GN3:
+  https://github.com/genenetwork/genenetwork3/blob/0820295202c2fe747c05b93ce0f1c5a604442f69/gn3/commands.py#L101
+
+- **Job Submission**: 
+  - Each API call will create a new job in the task queue, which will handle the execution of the R script.
+
+- **Script Execution**: 
+  - This stage involves executing the R script in a controlled environment, ensuring all necessary dependencies are loaded.
+
+- **Monitoring and Logging**: 
+
+- The system will include monitoring tools to track the status of each job. Users will receive real-time updates on job progress and logs for the current task.
+
+In this stage, we can have different states for the current job, such as QUEUED, IN PROGRESS, and COMPLETED. 
+
+We need to output to the user which stage of computation we are currently on during the script
+execution.
+
+- During the QUEUED state, the standard output (stdout) should display the command to be executed along with all its arguments.
+
+- During the STARTED stage, the stdout should notify the user that execution has begun. 
+
+- In the IN PROGRESS stage, we need to fetch logs from the script being executed at each computation step. Please refer to this documentation for an overview of the different computations we shall have:
+=> https://rqtl.org/
+
+- During the DONE step, the system should output the results from the R/qtl script to the user.
+
+
+- **Result Retrieval**: 
+  - Once the R script completes (either successfully or with an error), results will be returned to the API call.
+
+- **Error Handling**: 
+  - Better error handling will be implemented to manage potential issues during script execution. This includes capturing errors from the R script and providing meaningful feedback to users through the application.
+
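+A minimal sketch of how these job states could be tracked, assuming a Redis-backed job store broadly similar in spirit to gn3's commands.py (the key layout and function names below are hypothetical, not the actual gn3 API):
+
+```python
+from enum import Enum
+from subprocess import PIPE, STDOUT, Popen
+
+import redis
+
+
+class JobState(Enum):
+    QUEUED = "QUEUED"
+    STARTED = "STARTED"
+    IN_PROGRESS = "IN_PROGRESS"
+    COMPLETED = "COMPLETED"
+    ERROR = "ERROR"
+
+
+def run_rqtl_job(job_id: str, cmd: list, conn: redis.Redis) -> None:
+    """Run the R script as a subprocess, streaming state and stdout to Redis."""
+    key = f"rqtl2:jobs:{job_id}"  # hypothetical key layout
+    conn.hset(key, mapping={"state": JobState.QUEUED.value, "cmd": " ".join(cmd)})
+    proc = Popen(cmd, stdout=PIPE, stderr=STDOUT, text=True)
+    conn.hset(key, "state", JobState.STARTED.value)
+    for line in proc.stdout:  # the UI console polls/streams these lines
+        conn.hset(key, "state", JobState.IN_PROGRESS.value)
+        conn.rpush(f"{key}:stdout", line)
+    state = JobState.COMPLETED if proc.wait() == 0 else JobState.ERROR
+    conn.hset(key, "state", state.value)
+```
+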
+### Additional Error Handling Considerations
+This will involve:
+* API error handling
+* Error handling within the R script
+
+## Additional UI Considerations
+We need to rethink where to output the external process stdout in the UI. Currently, we can add flags to the URL to enable this functionality, e.g., `URL/page&flags&console=1`.
+The design suggestion is to output the results in a terminal emulator, for example xterm.js (see https://xtermjs.org/). A related implementation already exists in GN2:
+=> https://github.com/genenetwork/genenetwork2/blob/abe324888fc3942d4b3469ec8d1ce2c7dcbd8a93/gn2/wqflask/templates/wgcna_setup.html#L89
+
+### Design Suggestions
+#### With HTMX, offer a split screen
+This will include an output page and a monitoring system page.
+
+#### Popup button for preview
+A button that allows users to preview and hide the console output.
+
+## Long-Term Goals
+We aim to run computations on clusters rather than locally. This project will serve as a pioneer for that approach.
+
+## Related Issues
+=> https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2
+
+### Tasks
+
+Stage 1 (20%)
+
+- [x] Implement the rqtl script using rqtl2
+
+Stage 2 (40%)
+
+- [ ] Implement the RQTL API endpoints
+- [ ] Validation and preprocessing for data from the client
+- [ ] Implement state-of-the-art error handling
+- [ ] Add unit tests for the RQTL API module
+- [ ] Make improvements to the current R script if possible
+
+Stage 3 (60%)
+
+- [ ] Task queue integration (refer to the Deep Dive section)
+- [ ] Implement a monitoring and logging system for job execution (refer to the Deep Dive section)
+- [ ] Fetch results from running jobs
+- [ ] Process output from the external script
+
+Stage 4 (80%)
+
+- [ ] Implement a console preview UI for user feedback
+- [ ] Refactor the GN2 UI
+
+Stage 5 (100%)
+
+- [ ] Run this computation on clusters
\ No newline at end of file
diff --git a/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi b/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi
new file mode 100644
index 0000000..8e5332a
--- /dev/null
+++ b/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi
@@ -0,0 +1,84 @@
+# R/qtl2 LMDB Adapter
+## Tags
+
+* assigned: alexm
+* priority: medium
+* type: feature, documentation
+* status: WIP
+* keywords: rqtl2, lmdb, adapter, cross
+
+## Description
+We want to add support for reading crosses from LMDB.
+Currently, R/qtl2 (https://kbroman.org/qtl2/) only supports reading from CSV files.
+
+## Tasks
+
+* [x] Dump genotypes to LMDB
+* [x] Dump cross metadata to LMDB
+* [-] Create a `read_lmdb_cross` adapter
+* [ ] Dump phenotypes to LMDB
+
+## Using the Adapter
+
+### Dumping the Genotypes
+You can find the `lmdb_matrix.py` script here:
+
+=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/lmdb_matrix.py
+
+```sh
+guix shell python-click python-lmdb python-wrapper python-numpy -- \
+     python lmdb_matrix.py import-genotype \
+     <path-to-genotype-file> <path-to-lmdb-store>
+```
+
+### Dumping the Cross Metadata
+
+The script can be found here:
+=> https://github.com/genenetwork/genenetwork3/pull/235/files   # lmdb_cross_metadata.py
+
+You need to provide a cross file path. The currently supported formats are JSON and YAML.
+
+Example:
+
+```sh
+guix shell python-click python-lmdb python-wrapper python-pyyaml -- \
+     python dump_metadata.py dump-cross [LMDB_PATH] [CROSS_FILE_PATH] --file-format yaml/json
+
+# Example
+
+python dump_metadata.py dump-cross "./test_lmdb_data" "./cross_file.json"
+```
+
+### Running the R/qtl2 LMDB Adapter Script
+
+The script `rqtl_lmdb_adapter.r` can be found here:
+=> https://github.com/genenetwork/genenetwork3/pull/235/files   # rqtl_lmdb_adapter.r
+
+```sh
+guix shell r r-thor r-rjson r-qtl2 -- \
+     Rscript [PATH_TO_ADAPTER_SCRIPT] [LMDB_PATH]
+
+# Example
+Rscript rqtl_lmdb_adapter.r ./lmdb_path
+```
+### Using this with R/qtl2: example
+
+```r
+cross <- read_lmdb_cross(LMDB_DB_PATH)
+summary(cross)
+cat("Is this cross okay:", check_cross2(cross), "\n")
+warnings()  # print warnings, for debugging purposes only
+pr <- calc_genoprob(cross)
+out <- scan1(pr, cross$pheno, cores = 4)
+par(mar = c(5.1, 4.1, 1.1, 1.1))
+ymx <- maxlod(out)  # overall maximum LOD, used to scale the y-axis
+plot(out, cross$gmap, lodcolumn = 1, col = "slateblue", ylim = c(0, ymx * 1.02))  # test generating QTL plots
+```
+
+
+### References
+=> https://kbroman.org/qtl2/assets/vignettes/developer_guide.html
+
diff --git a/topics/lmms/rqtl2/using-rqtl2.gmi b/topics/lmms/rqtl2/using-rqtl2.gmi
new file mode 100644
index 0000000..7f671ba
--- /dev/null
+++ b/topics/lmms/rqtl2/using-rqtl2.gmi
@@ -0,0 +1,44 @@
+# R/qtl2
+
+# Tags
+
+* assigned: pjotrp, alexm
+* priority: high
+* type: enhancement
+* status: open
+* keywords: database, gemma, reaper, rqtl2
+
+# Description
+
+R/qtl2 handles multi-parent populations, such as DO, HS rat and the collaborative cross (CC). It also comes with an LMM implementation. Here we describe using and embedding R/qtl2 in GN2.
+
+# Tasks
+
+
+## R/qtl2
+
+R/qtl2 is packaged in guix and can be run in a shell with
+
+
+```
+guix shell -C r r-qtl2
+R
+library(qtl2)
+```
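+
+Once inside R, a quick smoke test against the example data bundled with qtl2 confirms the install works (a minimal sketch; the iron dataset ships with the package):
+
+```r
+library(qtl2)
+
+# Load the bundled iron intercross and run a tiny scan as a smoke test.
+iron <- read_cross2(system.file("extdata", "iron.zip", package = "qtl2"))
+pr <- calc_genoprob(iron, iron$gmap)
+out <- scan1(pr, iron$pheno)
+maxlod(out)  # highest LOD score in the scan
+```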
+
+R/qtl2 also comes with many tests. When starting up with development tools in a checked-out R/qtl2 git repo:
+
+```sh
+cd qtl2
+guix shell -C -D r r-qtl2 r-devtools make coreutils gcc-toolchain
+make test
+Warning: Your system is mis-configured: '/var/db/timezone/localtime' is not a symlink
+i Testing qtl2
+Error in dyn.load(dll_copy_file) :
+unable to load shared object '/tmp/RtmpWaf4td/pkgload31850824d/qtl2.so': /gnu/store/hs6jjk97kzafl3qn4wkdc8l73bfqqmqh-gfortran-11.4.0-lib/lib/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /tmp/RtmpWaf4td/pkgload31850824d/qtl2.so)
+Calls: <Anonymous> ... <Anonymous> -> load_dll -> library.dynam2 -> dyn.load
+Execution halted
+make: *** [Makefile:9: test] Error 1
+```
+
+Not sure what the problem is yet; the error suggests a libstdc++ version mismatch between the toolchain that built qtl2.so and the gfortran runtime in the profile.
diff --git a/topics/meetings/gn-kilifi-2025-standup.gmi b/topics/meetings/gn-kilifi-2025-standup.gmi
new file mode 100644
index 0000000..c20e847
--- /dev/null
+++ b/topics/meetings/gn-kilifi-2025-standup.gmi
@@ -0,0 +1,177 @@
+# Stand-up Notes
+
+## 25-8-2025
+### Felix
+* Completed first draft of the abstract
+* HS genotyping; polishing founders vcf file:
+* - sorting chromosome order
+* - removing duplicate markers
+
+### Johannes
+* Extending the rag system, make it more of an agent:
+* - fixing bugs
+* - reduce runtime
+
+### Bonface
+* CD went down!
+* revert it to last commit
+* set it up locally (not very easy)
+* coordinate with Fred and Aarun for help
+
+## 27-8-2025
+### Felix
+* updating the abstract: meeting with Kauthar, for more tips and insight
+* HS rats genotyping
+* still no significant output
+
+### Bonface
+* fixed GN3, GN guile, and GN auth
+* reproduce GN2 error, not yet fixed though
+
+### Johannes
+* LLM transformer is taking too long to run
+* change input documentations to observe results
+
+## 29-8-2025
+### Felix
+* Still on the abstract
+* Errors with hs genotyping script
+
+### Johannes
+*  Still on the rag system: experimenting with documents as input
+
+### Bonface
+* CD down again, tux02 is the culprit
+* Running gn2 outside container in meantime
+
+## 1-9-2025
+### Felix
+* improving the smoothing scripts for hs: building a snakemake pipeline; redrafting the abstract introductory part
+
+### Bonface
+* looked at deployment to fix...,
+* installed GPU drivers for penguin2 with Alex
+* it has Tesla K8 GPU drivers, but hardware installation is important for the meantime
+
+### Alex
+* rqtl2/lmdb runs and rebooted
+* dumping bxd phenotype dataset for testing
+
+### Johannes
+* no big updates
+* literature review on rag systems and how they work
+
+## 3-9-2025
+### Felix
+* Improved the inferring script
+* Abstract on hold first
+
+### Johannes
+* fix documents using agents: the issue is parallelization
+* reviewing fahamu AI (the python part)
+
+### Bonface
+* Gn2 finally runs
+* Documenting the progress and steps
+* Gn-auth is broken; Fred to solve it
+* Updated tesla k8 gpu drivers
+
+
+### Alex
+* Rqtl-wrapper for BXD is done
+* focusing on cross information; lmdb
+
+## 5-9-2025
+### Felix
+* managed to generate haplotype blocks for hs rats
+
+### Johannes
+* RAG experimentation script on balg01 running slow; preparing for MSc defense
+
+### Alex
+* dumping phenotypes to lmdb
+
+### Bonface
+* install drivers on tesla k8
+* review Johannes scripts
+* fixing gn2
+* review Alex's patches (pull requests via email)
+
+* AOB;
+* API tokens for RAG and coding
+
+## 10-9-2025
+### Felix
+* Managed to run gemma and generate plots for the original hs genotype data and hs smoothed genotype data
+* Still working on fine-tuning the statistical metrics to suit the plots
+
+### Bonface
+* All the GN machines are up and running, except GN Auth
+* Sent documentation to Pjotr on the progress with troubleshooting GN machines
+
+### Alex
+* writing queries to move phenotypes matrices from gn2 db to lmdb
+
+### Johannes
+* defense
+
+## 17-9-2025
+### Felix
+* Completed abstract writing
+* Working on generating genotype files for hs in gn2 format
+
+### Bonface
+* Gn machines are up and running
+* Still some issues with manually restarting the container
+* Worked with Alex to set up an external bootable ssd for Johannes
+
+### Alex
+* Succeeded dumping phenotypes to lmdb
+* Been reading literature around phenotypes in gn2
+* Assisted Bonz in setting up external bootable ssd for Johannes
+
+### Johannes
+* Managed to get access to bux01 server for the llms work
+
+## 19-09-2025
+### Felix
+* Finalized HS genotypes/haplotypes
+
+### Bonface
+* GN machines are running on CD
+* troubleshooting assisted by Aaron
+
+### Johannes
+* Experimenting with RAG systems
+
+### Alex
+* Writing LMDB scripts to dump phenotypes to LMDB from the common gn DB
+
+## [Review] October 2025
+Metric-wise plans and goals to achieve:
+
+### Bonface
+Thoughts: Good start after relocation.
+
+* CD/CI: Make sure that tests/infra is super stable
+* Look at suggested forge/guix-bioinformatics upgrades that affect prod
+* Support work for GNQA: sane gn-qna(fahamu)/AI deploys; Review Johannes' work; Help with tuning models; play catch up (cover up knowledge gap); figure out how to compare the different models.
+* Follow up on my phd
+
+### Alex
+Thoughts: very productive
+
+* Rqtl2 lmdb adapter running on production; depends on traction with Rqtl2 upstream by Karl Broman to gn-bioinformatics
+* Collaborate with Felix on his scripts
+
+### Johannes
+
+Thoughts: Not very productive; looking forward to more improvement
+
+* Experimenting on the qtl data with the AI system being currently built; also in need of optimization
+* Comparing LLMs/GNQA agents; getting assistance from Shelby's work
+
+### Felix
+* HS genotypes to be fully supported in GN2
+* Poster presentation
+* PhD: ML objective (ML fundamentals); Complete 1st manuscript draft; Process my student pass
diff --git a/topics/meetings/gn-nairobi-2025.gmi b/topics/meetings/gn-nairobi-2025.gmi
new file mode 100644
index 0000000..fb357a5
--- /dev/null
+++ b/topics/meetings/gn-nairobi-2025.gmi
@@ -0,0 +1,17 @@
+# Meeting Notes
+
+## 2025-01-10
+* @flisso: Prepare gn-uploader presentation for KEMRI.
+* @flisso: Put c-elegans dataset to staging.
+* @flisso: PHEWAS --- extract phenotypes from genenetwork and analyse them using PHEWAS.
+* @alexm: Clean up R/Qtl 1.
+* @alexm: Add R/Qtl 2 in gn.
+* @alexm: Fix UI issues around GN AI.
+* @bonfacem: Fix UI for group pages.
+* @bonfacem: Add git hooks to cd container for self-hosted repositories.
+* @bonfacem: Share developer work container and have Alex test it out.
+* @bonfacem: Prepare RDF presentation for KEMRI.
+
+Nice to have:
+* @bonfacem: Start dataset metadata editing work.
+* @flisso: Write PhD concept note.
diff --git a/topics/meetings/jnduli_bmunyoki.gmi b/topics/meetings/jnduli_bmunyoki.gmi
index 5af7221..26621d1 100644
--- a/topics/meetings/jnduli_bmunyoki.gmi
+++ b/topics/meetings/jnduli_bmunyoki.gmi
@@ -1,5 +1,462 @@
 # Meeting Notes
 
+## 2024-10-15
+* DONE: @flisso: Follow up with the Medaka team on verification of genotype sample names
+* DONE: @flisso: Understand uploader scripts and help improve then.
+* CANCELLED: @flisso: Set up virtuoso.  @bonfacem shall share notes on this.
+* NOT DONE: @flisso: Write PhD concept note.
+* DONE: @alexm @jnduli: R/Qtl script.
+* DONE: @bonfacem: Test the production container locally and provide @fredm some feedback.
+* DONE: @bonfacem: Wrap-up re-writing gn-guile to be part of genenetwork-webservices.
+* NOT DONE: @bonfacem: Start dataset metadata editing work.
+
+## 2024-10-08
+* NOT DONE: @bonfacem RIF Indexing for RIF page in Xapian.
+* IN PROGRESS: @bonfacem: Test the production container locally and provide @fredm some feedback.
+* IN PROGRESS: @bonfacem: Re-writing gn-guile to be part of genenetwork-webservices.
+* NOT DONE: @shelbys @bonfacem: Getting RDF into R2R.
+* NOT DONE: @flisso: Follow up with the Medaka team on verification of genotype sample names.  NOTE: Medaka team are yet to respond.
+* IN PROGRESS: @flisso: Figure out how to add C Elegans data in staging. NOTE: Got access to staging server.  Ran example tests.  Still working on some errors.
+* NOT DONE: @flisso: Set up virtuoso.  @bonfacem shall share notes on this.
+* NOT DONE: @flisso: Write PhD concept note. NOTE: Doing some lit review.
+* @shelbys: Be able to test things on lambda01 for LLM tests.
+* @alexm @jnduli: R/Qtl script.
+
+## 2024-10-18
+* IN-PROGRESS: @priscilla @flisso: Set up mariadb and virtuoso to test out some GN3 endpoints. NOTE: Mariadb set-up
+* NOT DONE: @priscilla @flisso @bmunyoki: Improve docs while hacking on the above.
+* DONE: @jnduli Remove gn-auth code from GN3.
+* DONE: @jnduli Resolve current issue with broken auth in gn-qa.
+* DONE: @jnduli @alexm Work on the R/Qtl design doc.
+* IN-PROGRESS: @alexm: R/Qtl script.  NOTE: Reviewed by @jnduli.
+* DONE: @flisso MIKK genotyping.  NOTE: Verification pending from Medaka team.
+* DONE: @flisso Make sure we have C Elegans and HS Rats dataset to testing, and have the genotyping pipeline working.  NOTE: Issues with tux02 staging server.
+* DONE: @shelbys: Modify existing Grant write-up for pangenomes.  NOTES: Some more edits to be done.
+* NOT DONE: @shelbys @bonfacem: Getting RDF into R2R.
+* NOT DONE: @bonfacem RIF Indexing for RIF page in Xapian.
+* DONE: @bonfacem Work on properly containerizing gn-guile.  NOTE: Send in patches to @alexm, @aruni, and @fredm to review later today.
+* DONE: @bonfacem: Fix the virtuoso CI job in CD: NOTE: I'm awaiting feedback from @arun/@fredm.
+
+## 2024-10-11
+* WIP @priscilla @flisso: Try out API endpoints that don't require auth.  NOTE: Priscilla got to set-up guix channels for gn3.  Felix ran into problems.  Priscilla set up the MySQL in her Ubuntu system.
+* NOT DONE: @jnduli Harden hook system for gn-auth.
+* WIP: @jnduli Remove gn-auth code from GN3.  NOTE: Sent latest patches to Fred.  Running issue, some patches may have caused gn-qa to fail.
+* DONE: @jnduli @bonfacem Finish up RIF Editing project.
+* NOT DONE: @jnduli @alexm Create an issue describing the monitoring system.
+* NOT DONE: @jnduli @alexm Create issue on prompt engineering in GN to improve what we already have.
+* WIP: @alex Work on R/Qtl.  NOTE: @jnduli/@bonfacem help out with this.  NOTE: Finished writing the design doc for gn-qa.
+* DONE: Looked at documentation for R/Qtl.
+* NOT DONE: @alex: Review @bmunyoki's work on RIF/Indexing.
+* WIP: @flisso: Make sure we have C Elegans dataset and MIKK genotypes to production.   NOTE: Issues with data entry scripts.  Fred/Zach working to set up test environment.
+* WIP: @flisso: MIKK genotyping.  NOTE: Still testing the pipeline.  Halfway there.
+* NOT DONE: @flisso: Make sure we have HS Rats in testing stage.
+* WIP: @flisso: Make progress in learning back-end coding WRT GN.  NOTE: Issue setting up GN3.
+* WIP: @shelbys: Modify existing Grant write-up for pangenomes.  NOTE: Reviewed by Pj and Eric.  More mods based of feedback.  Paper got accepted by BioArxiv.  Added some docs to R2R evaluation code.
+* DONE: @shelbys: Finish getting all the R2R scores from the first study. NOTE: Got scores for all the scores from first papers using R2R instead of Fahamu.
+* NOT DONE: @bonfacem RIF Indexing for RIF page in Xapian.
+* WIP: @bonfacem Work on properly containerizing gn-guile.
+* DONE: @bonfacem Fix the gn-transform-database in CI.  Sent patches to Arun for review.
+* DONE: @bonfacem Fixed broken utf-8 characters in gn-gemtext.
+
+## 2024-10-04
+* IN PROGRESS: @priscilla @bonfacem Setting up GN3.  @priscilla try out API endpoints that don't require auth. NOTE: @priscilla Able to set up guix as a package manager.  Trouble with Guix set-up with GN3.  @bonfacem good opportunity to improve docs in GN3.
+* IN PROGRESS: @jnduli Harden hook system for gn-auth.
+* IN PROGRESS: @jnduli Remove gn-auth code from GN3.
+* DONE: @jnduli Finish UI changes for RIF editing.  NOTE: Demo done in GN Learning team.
+* IN PROGRESS: @alex Work on R/Qtl.  NOTE: Met with Karl Broman/PJ.  Been reading the docs.  Will track this issue in GN.
+* NOT DONE: @alex @bonfacem Work on properly containerizing gn-guile.
+* DONE: @bonfacem API/Display of NCBI Rif metadata.
+* IN PROGRESS: @bonfacem @alex RIF Indexing for RIF page in Xapian.
+* IN PROGRESS: @flisso Push data to production.  Commence work on Arabidopsis data and HS Rats data.  NOTE: C. elegans is in the process of being pushed to the testing server, then later production.  WIP with HS Rats data in collab with Palmer.
+* DONE: @flisso: Learning how to use SQL WRT C Elegans data.
+* IN PROGRESS: @shelbys Re-formatting grant to use pangenomes.  Waiting for Garisson for feedback.
+* DONE: @shelbys Got the R2R for the human generated questions.  TODO: Run this for GPT 4.0 model.
+
+## 2024-09-27
+
+* DONE: @jnduli @bonfacem @alex Look at base files refactor and merge work.
+* DONE: @priscilla continue to upload more papers. NOTE: Uploaded an extra 200 papers.
+* NOT DONE: @priscilla @flisso Set up GN3.  Goal is to be able to query some APIs in cURL.
+* IN PROGRESS: @jnduli Improve hook systems for gn-auth. NOTE: Still figuring out a cleaner implementation for some things.
+* IN PROGRESS: @jnduli Trying to remove auth code GN3 code.  NOTE: Idea, though unsure about safety.  @fred to review work and make sure things are safe.
+* DONE: @jnduli @bonfacem @alex Push most recent changes to production.  Figure out what needs doing.  NOTE: @Zach is in charge of deployment.  @fredm is working on the production container.
+* DONE: @alex Close down remaining issues on issue tracker.  NOTE: Merged work on cleaning up base files.  Few more minor modifications to the UI.
+* NOT DONE: @alex investigate the dumped static files for encoding issues.
+* IN PROGRESS: @bonfacem NCBI Metadata - Modelling and Display.  NOTE: Done with the modelling.  Almost done with API/UI work.
+* DONE: @bonfacem Fix broken CD tests.  NOTE: We have tests running inside the guix build phase.
+* IN-PROGRESS: @flisso Continue work on uploading datasets: C Elegans and MIKK.  NOTE: Managed to create data files that need to be uploaded to the testing gn2 stage server.
+* NOT DONE: @flisso @jnduli help @flisso with SQL.
+
+## 2024-09-20
+* NOT DONE: @priscilla @flisso @bmunyoki @jnduli set up GN ecosystem and review UI PRs
+* DONE: @priscilla continue to upload more papers. NOTE: Shared access to drive to @bmunyoki.  We are at 800 papers.
+* DONE: @bmunyoki update tux02/01 with recent RIF modifications
+* DONE: @jnduli Finish up experiments on hook system.  NOTE: Patches got merged.  Needs to make some things more concrete.
+* NOT DONE: @alex @bonfacem investigate the dumped static files for encoding issues.
+* DONE: Refactoring base files for GN2.
+* IN PROGRESS: @flisso: Continue work on uploading datasets: C Elegans and MIKK.  Note: Waiting for the original MIKK genotype file from the Medaka team.  C Elegans yet to process the annotation file---some info is missing.
+* NOT DONE: @flisso: Do code reviews on Sarthak's script.
+* NOT DONE: @bmunyoki NCBI Metadata - Modelling and Display.
+* DONE: @bmunyoki update tux02/01 with recent RIF modifications.  NOTE: CD tests are broken and need to be fixed.
+
+## 2024-09-13
+* NOT DONE: @jnduli @bmunyoki fetch ncbi metadata and display them in GN2
+* DONE: @jnduli @bmunyoki add auth layer to edit rifs functionality
+* DONE: @jnduli complete design doc for hooks system for gn-auth.  NOTE: More experimentation with this.
+* DONE: @jnduli @alex bug fixes for LLM integration.
+* DONE: @priscilla added more papers to the LLM ~ 250 papers.
+* NOT DONE: @priscilla @flisso @bmunyoki @jnduli set up GN ecosystem and review UI PRs
+* DONE: @bmunyoki modify edit api to also write to RIF
+* NOT DONE: @bmunyoki update tux02/01 with recent RIF modifications
+* DONE: @bmunyoki Add test cases for RDF
+* DONE: @alex Bug fix for session expiry.
+* DONE: @alex Update links for static content to use self-hosted git repo.
+* IN PROGRESS: @flisso Upload C Elegans Dataset.  Nb: MIKK one has some issues, so work is paused for now. NOTE: Waiting for annotation and phenotype file for the C Elegans Dataset.
+* DONE @flisso: Reviewed  gemma wrapper scripts.
+
+
+Nice to have:
+* @bmunyoki build system container for gn-guile and write documentation for creating containers
+
+## 2024-09-06
+
+* DONE: @bmunyoki Replicate GN1 WIKI+RIF in GN2.
+* DONE: @bmunyoki update server to include latest code changes
+* IN PROGRESS: @bmunyoki modify edit api to also write to RIF
+* NOT DONE: @bmunyoki build system container for gn-guile and write documentation for creating containers
+* DONE: @bmunyoki @flisso update case attributes to capture hierarchy info
+* DONE: @bmunyoki prepare presentation for RIF work to GN learning team (goal is to present on Wednesday next week)
+* NOT DONE: @bmunyoki update tux02/01 with recent RIF modifications
+* NOT DONE: @jnduli @bmunyoki fetch ncbi metadata and display them in GN2
+* NOT DONE: @jnduli complete design doc for hooks system for gn-auth; Focus for next week.
+* DONE: @alexm @jnduli integrate LLM in GN2 and GN3: On the look-out for bug-fixes.
+* IN PROGRESS: @jnduli add auth layer to edit rifs functionality
+* DONE: @flisso generate genotype file on Medaka fish dataset: @arthur to have a look at this.
+* IN PROGRESS: @flisso code reviews for gemma-wrapper with @pjotr
+* DONE: @flisso update gemtext documentation
+* DONE: @flisso help Masters students with their proposal defences
+* @priscilla add more papers to LLM
+* NOT DONE: @priscilla @flisso @bmunyoki @jnduli set up GN ecosystem and review UI PRs
+
+
+## 2024-09-02 (Sync with @flisso+@bonfacem)
+
+### Case-Attributes
+
+* @bmunyoki understood case attributes by reverse-engineering the relevant tables from GeneNetwork's database.
+
+* One source of confusion for @bmunyoki is that we have the same "CaseAttribute.Name" that applies to different strains.  Example Query:
+
+```
+SELECT * FROM CaseAttribute JOIN CaseAttributeXRef ON CaseAttribute.CaseAttributeId = CaseAttributeXRef.CaseAttributeId WHERE CaseAttribute.Name = "Sex"\G
+```
+
+* @rob wants fine-grained access control with case attributes.
+
+* @flisso, case-attributes are GN invention.  Case Attributes are extra metadata about a given dataset beyond the phenotype measurements.  E.g.  We can have the phenotype: "Central nervous system"; whereby we collect the values, and SE.  However, we can also collect extra metadata like "Body Weight", "Sex", "Status", etc, and in GN terminology, that extra metadata is called Case Attributes.
+
+* @bmunyoki.  Most of the confusion around case-attributes is because of how we store case-attributes.  We don't have unique identifiers for case-attributes.
+
+## 2024-08-30
+
+* IN PROGRESS: @bmunyoki Replicate GN1 WIKI+RIF in GN2.
+* DONE: @bmunyoki and @alex help Alex deploy gn-guile code on tux02, run this in a tmux session.
+* DONE: @bmunyoki api for history for all tasks
+* DONE: @bmunyoki UI layer for RDF history
+* @bmunyoki modify edit api to also write to RIF
+* @bmunyoki build system container for gn-guile and write documentation for creating containers
+* NOT DONE: @jnduli complete design doc for hooks system for gn-auth
+* DONE: @alexm @jnduli create branches to testing for LLM in GN2 and GN3
+* IN PROGRESS: @alexm @jnduli integrate LLM in GN2 and GN3
+* IN PROGRESS: @jnduli add auth layer to edit rifs functionality
+* DONE: @bmunyoki @felix sync on case attributes and document
+* DONE: @flisso managed to upload <TODO> dataset to production
+
+
+### nice to haves
+
+* nice_to_have: @bmunyoki experiment and document updating gn-bioinformatics set up packages (to support rshiny)
+
+## 2024-08-23
+* @shelby re-ingest data and run RAGAs against the queries already in the system to perform comparison with new papers.
+* @shelby figure out Claude Sonnet stuff.
+* IN PROGRESS: @felix @fred push RQTL bundles to uploader, also includes metadata.
+* IN PROGRESS: @felix look for means to fix metadata challenge ie. trouble associating data we upload and metadata that provides descriptions.
+* DONE: @bmunyoki API: Get all RIF metadata by symbols from rdf.
+* NOT DONE: @bmunyoki UI: Modify traits page to have "GN2 (GeneWiki)", to be picked after RDF is updated in tux02
+* DONE: @bmunyoki UI: Integrate with API
+* NOT DONE: @bmunyoki Replicate GN1 WIKI+RIF in GN2.
+* IN PROGRESS: @bmunyoki and @alex help Alex deploy gn-guile code on tux02.
+* DONE: @bmunyoki @jnduli review gn2 UI change for markdown editor
+* NOT DONE: @bmunyoki create template for bio paper
+* DONE: @alex sync with Boni to set up gn-guile
+* DONE: @alex @bmunyoki @jnduli sync to plan out work for llm integration
+* DONE: @jnduli edit WIKI+RIF 
+* NOT DONE: @jnduli set up gn-uploader locally and improve docs
+* NOT DONE: @jnduli complete design doc for hooks system for gn-auth
+* DONE: @felix to document email threads on gemtext
+
+## 2024-08-22
+
+=> https://issues.genenetwork.org/issues/edit-rif-metadata APIs for wiki editing and broke down the wiki-editing task into sub-projects.
+
+## 2024-08-20
+
+Integrating GNQA into the GN2 website: how will it work?
+
+1. Have the context information displayed to the right of the GN2 xapian search page
+2. When someone clicks the context info page, it opens the search from GNQA which has all the references.
+3. Cache queries since many searches are the same.
+
+Problems:
+
+1. search has xapian specific terminology. How do we handle this? Remove xapian prefixes and provide the key words to search.
+2. how do we handle cache expiry?
+    - no expiry for now.
+    - store them in a database table.
+    - every quarter year, the search can be updated.
+    - group:bxd, species: mouse -> bxd mouse
+      mouse bxd: -> when caching, the ordering of the search terms shouldn't matter much.
+
+Game Plan:
+
+1. Productionize the code relating to LLM search. Get the code for LLMs merged into main branch.
+2. UI changes to show the search context from LLM.
+3. Figuring out caching:
+    - database table structure
+    - cache expiry (use 1 month for now)
+    - modify LLM search to pick from cache if it exists.
+4. Have another qa branch that fixes all errors since we had the freeze.
+5. Only logged in users will have access to this functionality.
+
+## 2024-08-16
+* @jnduli Fix failing unit tests on GN-Auth.
+* @jnduli Exploring Mechanical Rob for Integration Tests.  GN-Auth should be as stable as possible.
+* @jnduli Research e-mail patch workflow and propose a sane workflow for GN through an engineering blog post.
+* @jnduli Help @alexm with auth work.
+* @felix @fred push RQTL bundles to uploader.
+* @felix look for means to fix metadata challenge ie. trouble associating data we upload and metadata that provides descriptions.
+* @felix @jnduli programming learning: started building a web server to learn backend using Flask.
+* @felix @jnduli Read Shelby's paper and provide feedback by the end of Saturday.
+
+## 2024-08-16
+* DONE: @jnduli Fix failing unit tests on GN-Auth.
+* NOT DONE: @jnduli Exploring Mechanical Rob for Integration Tests. GN-Auth should be as stable as possible.
+* NOT DONE: @jnduli Research e-mail patch workflow and propose a sane workflow for GN through an engineering blog post.
+* DONE: @jnduli Help @alexm with auth work.
+* IN PROGRESS: @felix @fred push RQTL bundles to uploader, also includes metadata.
+* IN PROGRESS: @felix look for means to fix metadata challenge ie. trouble associating data we upload and metadata that provides descriptions.
+* DONE: @felix @jnduli programming learning: started building a web server to learn backend using Flask. Learning html and css and will share the progress with this.
+* DONE: @felix ~@jnduli~ Read Shelby's paper and provide feedback by the end of Saturday.
+* DONE: @felix tested the time tracker script.
+* IN PROGRESS: @bmunyoki implementation code work to edit Rif + WIki SQL n RDF data. We'll break this down.
+* @bmunyoki and @alex help Alex deploy gn-guile code on tux02.
+* NOT DONE: @bmunyoki Replicate GN1 WIKI+RIF in GN2.
+* @shelby @bonfacem @alex Integrate QNQA Search to global search.
+* @shelby handling edits with the current open paper
+
+Nice To Have:
+* DEPRIORITIZED: @felix figure out how to fix large data uploads ie. most data sets are large e.g. 45GB. Uploader cannot handle these large files.
+* DONE: @felix Try out John's time tracking tool and provide feedback.
+* @shelby run RAGAs against the queries already in the system to perform comparison with new papers: re-ingesting, now at 1500 papers.
+* @bmunyoki Send out emails to the culprit on failing tests in CI/CD.
+
+## 2024-08-15
+### RTF Editing (bmunyoki+alexm)
+
+In our static content, we don't really store RTF; instead we store HTML. As an example, compare these two documents and note their difference:
+
+=> https://github.com/bitfocus/rtf2text/blob/master/sample.rtf [Proper RTF] sample.rtf
+=> https://github.com/genenetwork/gn-docs/blob/master/general/datasets/Br_u_1203_rr/acknowledgment.rtf [GN] acknowledgement.rtf
+
+* TODO @alexm Rename all the *rtf to *html during transform to make things clearer.  Send @bonfacem a PR.
+
+## 2024-08-13
+### Markdown Editor (bmunyoki+alexm)
+
+* @alexm @bonfacem Tested the Markdown Editor locally and it works fine.  Only issue is that someone can make edits without logging in.
+* API end-points to be only exposed locally.
+* @alexm: Fix minor bug when showing the diff.  Have a back arrow.
+* @bonfacem, @alexm: Deploy gn-guile; make sure it's only exposed locally.
+* [blocking] @alexm having issues setting up gn-auth.  @jnduli to help out to set up gn-auth and work out any quirks.  @alexm to make sure you can't make edits without being logged in.
+* @bmunyoki to merge the ge-editor UI work once basic auth is figured out.
+* [nice-to-have] @alexm work on packaging: "diff2html-ui.min.js", "diff.min.js", "marked.min.js", "index.umd.js", "diff2html.min.js".
+* [nice-to-have] @alexm to check-out djlint for linting jinja templates.
+* @bonfacem share pre-commit hooks for setting up djlint and auto-pep8.
+* [nice-to-have] @alexm to checkout:
+
+> djlint gn2/wqflask/templates/gn_editor.html --profile=jinja --reformat --format-css --format-js
+=> https://www.djlint.com/ dj Lint; Lint & Format HTML Templates
+
+## 2024-08-09
+
+* @shelby figure out Claude Sonnet stuff: NOT DONE, main focus was on the paper
+* @shelby planning session for next work and tasks for Priscilla.  DONE: Priscilla was given some work.  Loop in Priscilla for our meetings.
+* @shelby format output for ingested paper so that we can test the RAG engine.  IN PROGRESS.  Most focus has been on editing paper and some funding pursuit.
+* @shelby run RAGAs against the queries already in the system to perform comparison with new papers.  NOT DONE.
+* @bmunyoki implementation code work to edit RIF + Wiki SQL and RDF data.  IN PROGRESS.  Updated the RDF transform for GeneWiki; now we can do a single GET for a single comment in RDF.
+* @bmunyoki @shelby group paper on dissertation to target Arxiv.  NOT DONE.
+* @bmunyoki and @alex help Alex deploy gn-guile code on tux02.  NOT DONE.  Currently auth is a blocker.
+* @bmunyoki review UI code editor work.  DONE.
+* @alex address comments in UI work.  DONE.
+* @felix @fred push RQTL bundles to uploader.  In Progress: OOM Killer killing upload process. 
+* @felix look for means to fix the metadata challenge, i.e. trouble associating data we upload with the metadata that describes it. The metadata doesn't meet requirements.  In Progress: Some things to be confirmed with Rob/PJ on coming up with a good format for adding metadata. NOT DONE.
+* @felix figure out how to fix large data uploads, i.e. most data sets are large (e.g. 45GB) and the uploader cannot handle these large files.
+* @felix @jnduli programming learning: started building a web server to learn backend using Flask.  NOT DONE.
+* @felix (@bmunyoki / @alex) learning emacs to figure out how to track time. @jnduli shared his time-tracking tool with @felix.  DONE.
+* @jnduli fix group creation bug in gn-auth. DONE: Group creation wasn't exactly a bug; updated docs, and fixed the masquerade API.
+* @jnduli edit rif metadata using gn3. NOT DONE
+* @jnduli update documentation for gn-auth setup.  DONE
+* @jnduli investigate more bugs related to gn-auth.  DONE
+
+Note: When setting up sync between @jnduli and @felix, add @bmunyoki too.
+
+
+## 2024-08-02
+
+* DONE: @bmunyoki virtuoso and xapian updated in prod
+* @bmunyoki code work to edit RIF + Wiki SQL and RDF data: WIP; we have the desired API, but we still need to implement the code.
+* NOT DONE: @bmunyoki group paper on dissertation to target Arxiv
+* DONE: @bmunyoki fix case insensitivity in Xapian search
+* DONE: @jnduli review Alex patches
+* DONE: @bmunyoki: updated gn2 and gn3 on the git.genenetwork server. Shared QA code with @shelby on a special branch.
+* @bmunyoki @jnduli: fixed a minor xapian bug related to stemming.
+* @shelby figure out Claude Sonnet stuff: NOT DONE, main focus was on the paper
+* IN PROGRESS: @shelby edit paper with @pjtor
+* @shelby planning session for next work and tasks for Priscilla.
+* @shelby use RAGAS to test R2R with the new papers (follow up on the ingestion of papers tasks)
+* @shelby and @boni to discuss R2R and interfacing with Virtuoso: deprioritized, we'll figure out interfacing with R2R. Implementation to happen later.
+* DONE: @jnduli get up to speed on gn-auth
+* @alex have an instance of gn-guile running on production: Code is in prod, but needs to liaise with Boni to get this working.
+* @jgart getting genecup and rshiny containers to run as normal users instead of root. Options: use libvirt's APIs; run podman/docker as a normal user; or rewrite the services as guix home services. There is no workaround for system containers, since guix by default needs root to run one. We also need sudo because our system containers are defined in systemd units that must run as root. Whether we need systemd at all is an open question.
+
+### Meeting with Sevila on Masters Papers
+
+- Mainly stylistic changes were provided.
+- Provide an email explaining how long ethical review took, so that he can follow up on the unexpected delays.
+- Met up with Dr Betsy: once defences are done in October (hopefully), Boni may get his degree before graduation next year, which would facilitate his applying for a PhD.
+
+### Guix Root Container
+
+- With docker, to prevent the need for sudo, we usually create a docker group and add the users that need to run it to this group. Can this happen in guix?
+- Guix has a guix group. Why haven't we done this??? @jgart and @boni
+
+## 2024-07-26
+Plan for this week:
+
+* NOT DONE, needs a meeting: @bmunyoki virtuoso and xapian are up-to-date in prod. Boni doesn't have root access in production, so coordination with Fred and Zach is causing delays.
+* API design DONE, actual code incomplete: @bmunyoki update RIF+WIKI on SQL and RDF from the gn2 website
+* DONE: @bmunyoki and @shelby review dissertation for Masters
+* DONE, needs to review new changes: @bmunyoki and @jgart to review patches for `genecup` and `rshiny`.
+* @bmunyoki and @jnduli to review patches for markdown parser
+* DONE, patches sent. @alexm add validation and document to markdown parser.
+* DONE: @shelby ingest ageing data to RAG, 10% left to complete.
+* DONE: @shelby do another round of editing on the AI paper
+* IN PROGRESS: @shelby RAG engine only works with OpenAI, figure out Claude Sonnet integration
+* IN PROGRESS: @jnduli get up to speed on gn-auth
+* @jgart enabling acme service in genecup and rshiny containers.
+* @jnduli and @bmunyoki to attempt to get familiar with R2R
+
+Nice to have:
+* @bmunyoki fix CI job for GN transformer database i.e. instead of checksums just run full job once per month: scheme script created that dumps the example files, next step is to create Gexp that runs this script. Bandwidth constraints.
+
+## 2024-07-23
+### LLM Meeting (@shelby+@bmunyoki)
+* There's no clear way of ingesting human-readable data with context into the RAG Graph from RDF.
+* What specific graph should we ingest into the RAG Graph from RDF?   @bmunyoki suggested RIF, PubMed Metadata.  We'll figure this out.
+* @bonfacem recommended: Much better to work with SPARQL than directly with TTL files.
+* We've uploaded rdf triples, yet they loose their strength as the RAG system is not undergirded with a knowledge graph.  @bonfacem should read the following for more context and should reach out to @shelby on how to move forward with SPARQL more concretely:
+
+=> https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph#r2r-knowledge-graph-configuration
+
+* We need to test the knowledge graph backend of R2R to see how feasible it is to use with the existing data (RDF).
+* Fahamu just stored the object and lost the subject+predicate
+* Loop in Alex.
+
+
+## 2024-07-19
+Plan for this week:
+
+* DONE: @jgart getting the `genecup` app to run in a guix container, i.e. `gunicorn service` should then run `genecup`, similar to how gn2 and gn-uploader work. Patches sent to Boni include `genecup` and `rshiny`, and the container patches are tested.
+* @jgart enable acme certificates for the `genecup` container: Should just enable a single form; let's use arun's email since it's what we use for all our services. Reverse proxy happens inside the container. Add a comment explaining that this shouldn't be a standard python set up.
+* IN PROGRESS: @bmunyoki virtuoso and xapian are up-to-date in prod
+* NOT DONE: @bmunyoki update RIF+WIKI on SQL and RDF from gn2 website
+* IN PROGRESS: @bmunyoki fix CI job for GN transformer database, i.e. instead of checksums just run the full job once per month: scheme script created that dumps the example files; next step is to create a Gexp that runs this script. Bandwidth constraints.
+* @bmunyoki and @shelby review dissertation for Masters: @bonz needs to send updated version. Also reviewed another masters by Johannes.
+* ON HOLD: @alexm rewrite UI code using htmx
+* IN PROGRESS: @alexm address review comments in markdown parser. API endpoints are getting reimplemented. Needs to add validation and documentation and send v2 patches for review.
+* DONE: @shelby compile ingesting 500 more papers into RAG engine
+* @shelby ingesting ageing research into the RAG engine: diabetes research is ingested, ageing will be done later.
+* NOT DONE: @shelby RAG engine only works with OpenAI, figure out Claude Sonnet integration
+* DONE: @shelby @bmunyoki @alexm to define the problem with RDF triple stores
+* DONE: @jnduli finish up on RIF update
+* IN PROGRESS: @jnduli get up to speed on gn-auth
+
+AOB
+
+* RAG engine uses R2R for the integration. It would be great if we could integrate this into guix. @shelby will send @jgart the paper on how we use the RAG.
+
+
+## 2024-07-12
+
+Plan for this week:
+
+* @shelby use Claude Sonnet with R2R RAG engine with 1000 papers and fix bugs: 500 papers ingested into R2R, remaining with 500.
+* @shelby final run through for paper 1 before Pjotr's review. DONE, configurations fixed. New repo gnai that contains the results and will contain R2R stuff.
+* NOT DONE: @shelby and @bmunyoki review dissertation paper for Masters
+* @shelby @bmunyoki @alexm to define the problem with RDF triple stores
+* @alexm integrate the markdown parser: DONE, patches sent to Boni
+* @alexm rewrite UI code using htmx: NOT DONE
+* @bmunyoki investigate why xapian index isn't getting rebuilt: DONE
+* @bmunyoki investigate discrepancies between wiki and rif search: DONE, get this to prod to be tested
+* @jnduli update the generif_basic table from NCBI: IN PROGRESS.
+* @jnduli blog post of preference for documentation: DONE.
+
+We have qa.genenetwork.com. We need to have this set up at `qa.genenetwork.com/paper1` so that we always have the system that was used for this. How?
+
+Nice to Haves
+
+* @bmunyoki Nice-to-have tag for paper1: fix this with Boni and iron it out later.
+* @bmunyoki fix CI job that transforms gn tables to TTL: Move this to running a cron job once per month instead of relying on checksums.
+
+
+## 2024-06-24
+
+Plan for this Week:
+
+* CANCELED: @bmunyoki Remove boolean prefixes from search where it makes sense.
+* DONE: @bmunyoki GeneWiki + GeneRIF search in production. Mostly needs to be run in prod to see impact.
+* DONE: @jnduli Children process termination when we kill the main index-genenetwork script
+* CANCELED: @bmunyoki Follow up on getting virtuoso child pages in production
+* IN PROGRESS @alexm push endpoints for editing and making commits for markdown files
+* DONE: @all Reply to survey from Shelby
+* DONE: @jnduli Fix JS import orders (without messing up the rest of Genenetwork)
+* DONE: @jnduli fix search results when nothing is found
+* CANCELED: @jnduli test out running guix cron jobs locally
+* NOT DONE: @jnduli mention our indexing documentation in the gn2 README
+
+Note: For qa.genenetwork.com, we chose to pause work on this until papers are done.
+
+Review for last week
+
+* DONE: @bmunyoki rebuild guix container with new mcron changes
+* WIP: @jnduli attempts to make UI change that shows all supported keys in the search: Blocked because our JS imports aren't ordered correctly and using `boolean_prefixes` means our searches don't work as we'd expect.
+* WIP: @bmunyoki create an issue with all the problems experienced with search and potential solutions. Make sure it has replication steps, and plans for solutions. Issue was created but we need to get a better understanding for how cis and trans searches work.
+* TODO: @bmunyoki and @jnduli genewiki indexing: PR for WIKI indexing is completed, but we didn't test it out due to the outage caused by our script exhausting RAM. We don't have a way to easily instrument how much RAM our process uses, or to kill the process when it misbehaves.
+* DONE: @bmunyoki demos and documents how to run and test the guix cron job for indexing
+* DONE: @bmunyoki trains @jnduli on how to review patchsets from emails
+* DONE: @jnduli Follow up notes on setting up local index-genenetwork search
+* DONE: @alexm handling with graduation, AFK
+* TODO: @bmunyoki follow up with Rob to make sure he tests search after everything is complete: he got some feedback, and Rob is out of town but wants RIF and Wiki search by July 2nd.
+
+Nice to haves:
+
+* TODO: minor: bonfacem makes sure that mypy/pylint in CI runs against the index-genenetwork script.
+* TODO: @bmunyoki follow up: how do we make sure that xapian prefix changes in code retrigger xapian indexing?
+    - howto: for xapian prefix changes, let's maintain a hash of the file and store it in xapian (see the sketch below).
+    - howto: for RDF changes, since we have TTL files, we trigger the script whenever they change. It would also be nice to automatically load the data into virtuoso when a file changes.
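+
+A possible sketch of the first howto (for simplicity this stores the hash in a plain file next to the index rather than inside xapian; the hash file name is made up):
+
+```
+# re-trigger indexing when the prefix definitions change
+hash=$(sha256sum index-genenetwork | cut -d' ' -f1)
+if [ "$hash" != "$(cat /var/cache/xapian-prefixes.sha256 2>/dev/null)" ]; then
+    ./index-genenetwork            # rebuild the xapian index
+    echo "$hash" > /var/cache/xapian-prefixes.sha256
+fi
+```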
+
+
 ## 2024-06-21
 ### Outage for 2024-06-20
 
diff --git a/topics/octopus/lizardfs/README.gmi b/topics/octopus/lizardfs/lizard-maintenance.gmi
index 78316ef..a34ef3e 100644
--- a/topics/octopus/lizardfs/README.gmi
+++ b/topics/octopus/lizardfs/lizard-maintenance.gmi
@@ -1,4 +1,4 @@
-# Information about lizardfs, and some usage suggestions
+# Lizard maintenance
 
 On the octopus cluster the lizardfs head node is on octopus01, with disks being added mainly from the other nodes. SSDs are added to the lizardfs-chunkserver.service systemd service and SDDs added to the lizardfs-chunkserver-hdd.service. The storage pool is available on all nodes at /lizardfs, with the default storage option of "slow", which corresponds to two copies of the data, both on SDDs.
 
@@ -73,6 +73,17 @@ Chunks deletion state:
         2ssd    7984    -       -       -       -       -       -       -       -       -       -
 ```
 
+This looks good for fast:
+
+```
+Chunks replication state:
+        Goal    0       1       2       3       4       5       6       7       8       9       10+
+        slow    -       137461  448977  -       -       -       -       -       -       -       -
+        fast    6133152 -       5       -       -       -       -       -       -       -       -
+```
+
+This table essentially says that slow and fast are replicating data (if they are in column 0 it is OK!).
+
 To query how the individual disks are filling up and if there are any errors:
 
 List all disks
@@ -83,17 +94,62 @@ lizardfs-admin list-disks octopus01 9421 | less
 
 Other commands can be found with `man lizardfs-admin`.
 
+## Info
+
+```
+lizardfs-admin info octopus01 9421
+LizardFS v3.12.0
+Memory usage:   2.5GiB
+
+Total space:    250TiB
+Available space:        10TiB
+Trash space:    510GiB
+Trash files:    188
+Reserved space: 21GiB
+Reserved files: 18
+FS objects:     7369883
+Directories:    378782
+Files:  6858803
+Chunks: 9100088
+Chunk copies:   20017964
+Regular copies (deprecated):    20017964
+```
+
+```
+lizardfs-admin chunks-health  octopus01 9421
+Chunks availability state:
+        Goal    Safe    Unsafe  Lost
+        slow    1323220 1       -
+        fast    6398524 -       5
+
+Chunks replication state:
+        Goal    0       1       2       3       4       5       6       7       8       9       10+
+        slow    -       218663  1104558 -       -       -       -       -       -       -       -
+        fast    6398524 -       5       -       -       -       -       -       -       -       -
+
+Chunks deletion state:
+        Goal    0       1       2       3       4       5       6       7       8       9       10+
+        slow    -       104855  554911  203583  76228   39425   19348   8659    3276    20077   292859
+        fast    6380439 18060   30      -       -       -       -       -       -       -       -
+```
 
 ## Deleted files
 
-Lizardfs also keeps deleted files, by default for 30 days. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with:
+Lizardfs also keeps deleted files, by default for 30 days in `/mnt/lizardfs-meta/trash`. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with:
 
 ```
 $ mfsmount /path/to/unused/mount -o mfsmeta
 ```
 
 For more information see the lizardfs documentation online
-=> https://dev.lizardfs.com/docs/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory
+=> https://lizardfs-docs.readthedocs.io/en/latest/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory
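+
+To actually restore something, a minimal sketch (the encoded file name is hypothetical; in the meta mount, deleted files appear under trash/ with their original path, '/' replaced by '|'):
+
+```
+ls /mnt/lizardfs-meta/trash
+# moving an entry into the undel/ subdirectory restores it to its original location
+mv '/mnt/lizardfs-meta/trash/00000345|home|user|data.txt' /mnt/lizardfs-meta/trash/undel/
+```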
+
+## Start lizardfs-mount (lizardfs reader daemon) after a system reboot
+
+```
+sudo bash
+systemctl daemon-reload
+systemctl restart lizardfs-mount
+systemctl status lizardfs-mount
+```
 
 ## Gotchas
 
@@ -179,3 +235,54 @@ KeyringMode=inherit
 [Install]
 WantedBy=multi-user.target
 ```
+
+# To deplete and remove a drive in LizardFS
+
+**1. Mark the chunkserver (or specific disk) for removal**
+
+Edit the chunkserver's disk configuration file (typically `/etc/lizardfs/mfshdd.cfg`) and prefix the drive path with an asterisk:
+
+```
+*/mnt/disk_to_remove
+```
+
+**2. Restart the chunkserver process on the node**
+
+```bash
+systemctl stop lizardfs-chunkserver
+systemctl start lizardfs-chunkserver
+```
+
+**3. Monitor the evacuation progress**
+
+The master will begin migrating chunks off the marked drive. You can monitor progress with:
+
+```bash
+lizardfs-admin list-disks octopus01 9421
+lizardfs-admin list-disks octopus01 9421|grep 172.23.19.59 -A 7
+172.23.19.59:9422:/mnt/sdc/lizardfs_vol/
+        to delete: yes
+        damaged: no
+        scanning: no
+        last error: no errors
+        total space: 3.6TiB
+        used space: 3.4TiB
+        chunks: 277k
+```
+
+Look for the disk showing evacuation status. The "to delete" chunks count should decrease over time as data is replicated elsewhere.
+
+You can also check the CGI web interface if you have it running; it shows disk status and chunk counts.
+
+**4. Remove the drive once empty**
+
+Once all chunks have been evacuated (the disk shows 0 chunks or is marked as empty), you can safely:
+
+1. Remove the line from `mfshdd.cfg` entirely
+2. Reload the configuration again
+3. Physically remove or repurpose the drive
+
+**Important notes:**
+- Ensure you have enough free space on other disks to absorb the migrating chunks
+- The evacuation time depends on the amount of data and network/disk speed
+- Don't forcibly remove a drive before evacuation completes, or you risk data loss if replication goals aren't met
diff --git a/topics/octopus/maintenance.gmi b/topics/octopus/maintenance.gmi
new file mode 100644
index 0000000..00cc575
--- /dev/null
+++ b/topics/octopus/maintenance.gmi
@@ -0,0 +1,98 @@
+# Octopus/Tux maintenance
+
+## To remember
+
+`fdisk -l` to see disk models
+`lsblk -nd` to see mounted disks
+
+## Status
+
+octopus02
+- Devices: 2 3.7T SSDs + 2 894.3G SSDs + 2 4.6T HDDs
+- **Status: Slurm not OK, LizardFS not OK**
+- Notes:
+  - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`,
+  - **I don't see 2 drives that are physically mounted**
+
+octopus03
+- Devices: 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: **I don't see 2 drives that are physically mounted**
+
+octopus04
+- Devices: 4 7.3 T SSDs (Neil) + 1 4.6T HDD + 1 3.7T SSD + 2 894.3G SSDs
+- Status: Slurm NO, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus05
+- Devices: 1 7.3 T SSDs (Neil) + 5 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: no
+
+octopus06
+- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus07
+- Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: **I don't see 1 device that is physically mounted**
+
+octopus08
+- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus09
+- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus10
+- Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: **I don't see 1 device that is physically mounted**
+
+octopus11
+- Devices: 1 7.3 T SSDs (Neil) + 5 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: no
+
+tux05
+- Devices: 1 3.6T NVMe + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS OK (we don't share anything)
+- Notes: **I don't have a picture to confirm physically mounted devices**
+
+tux06
+- Devices: 2 3.6 T SSDs (1 from Neil) + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS (we don't share anything)
+- Notes:
+  - **Last picture reports 1 7.3 T SSD (Neil) that is missing**
+  - **Disk /dev/sdc: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+  - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+
+tux07
+- Devices: 3 3.6 T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS
+- Notes:
+  - **Disk /dev/sdb: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+  - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): mounted at /mnt/sdb and shared on LIZARDFS: TO CHECK BECAUSE IT HAS NO PARTITIONS**
+
+tux08
+- Devices: 3 3.6 T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS
+- Notes: no
+
+tux09
+- Devices: 1 3.6 T SSDs + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS
+- Notes: **I don't see 1 device that is physically mounted**
+
+## Neil disks
+- four 8TB SSDs on the right of octopus04
+- one 8TB SSD in the left slot of octopus05
+- six 8TB SSDs bottom-right slot of octopus06,07,08,09,10,11
+- one 4TB NVMe and one 8TB SSDs on tux06, NVME in the bottom-right of the group of 4 on the left, SSD on the bottom-left of the group of 4 on the right
+- one 4TB NVMe on tux07, on the top-left of the group of 4 on the right
+- one 4TB NVMe on tux08, on the top-left of the group of 4 on the right
diff --git a/topics/octopus/moosefs/moosefs-maintenance.gmi b/topics/octopus/moosefs/moosefs-maintenance.gmi
new file mode 100644
index 0000000..1032cde
--- /dev/null
+++ b/topics/octopus/moosefs/moosefs-maintenance.gmi
@@ -0,0 +1,252 @@
+# Moosefs
+
+We use moosefs as a network distributed storage system with redundancy. The setup is to use SSDs for fast access and spinning storage for redundancy/backups (in turn these are in RAID5 configuration). In addition we'll experiment with a non-redundant fast storage access using the fastest drives and network connections.
+
+# Configuration
+
+## Ports
+
+We should use different ports than lizard. Lizard uses 9419-24 by default. So let's use ports starting at 9519.
+
+* 9519 for moose meta logger
+* 9520 for chunk server connections
+* 9521 for mount connections
+* 9522 for slow HDD chunks (HDD)
+* 9523 for replicating SSD chunks (SSD)
+* 9524 for fast non-redundant SSD chunks (FAST)
+
+## Topology
+
+Moosefs uses topology to decide where to fetch data. We can host the slow spinning HDD drives in a 'distant' location, so that data is fetched last.
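+
+A minimal sketch of what that could look like in mfstopology.cfg (the subnets are the ones from our mfsexports.cfg below; the switch ids are made up):
+
+```
+# local switch: nodes with fast SSD chunkservers
+172.23.21.0/24   1
+172.23.22.0/24   1
+# 'distant' switch: slow spinning HDD chunkservers, fetched last
+172.23.17.0/24   9
+```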
+
+## Disks
+
+Some disks are slower than others. To test we can do:
+
+```
+root@octopus03:/export# dd if=/dev/zero of=test1.img bs=1G count=1
+1+0 records in
+1+0 records out
+1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.20529 s, 487 MB/s
+/sbin/sysctl -w vm.drop_caches=3
+root@octopus03:/export#  dd if=test1.img of=/dev/null bs=1G count=1
+1+0 records in
+1+0 records out
+1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.649035 s, 1.7 GB/s
+rm test1.img
+```
+
+Above is on a RAID5 setup. Typical values are:
+
+```
+                       Write         Read
+Octopus Dell NVME      1.2 GB/s      2.0 GB/s
+Octopus03 RAID5        487 MB/s      1.7 GB/s
+Octopus01 RAID5        127 MB/s      163 MB/s
+Samsung SSD 870        408 MB/s      565 MB/s
+```
+
+The fast mount then shows up as:
+
+```
+mfs#octopus03:9521   3.7T  4.0G  3.7T   1% /moosefs-fast
+```
+
+## Command line
+
+```
+. /usr/local/guix-profiles/moosefs/etc/profile
+mfscli -H octopus03 -P 9521 -SCS
+```
+
+## Config
+
+```
+root@octopus03:/etc/mfs# diff example/mfsexports.cfg.sample mfsexports.cfg
+2c2,4
+< *                     /       rw,alldirs,admin,maproot=0:0
+---
+> 172.23.21.0/24                       /       rw,alldirs,maproot=0,ignoregid
+> 172.23.22.0/24                       /       rw,alldirs,maproot=0,ignoregid
+> 172.23.17.0/24                       /       rw,alldirs,maproot=0,ignoregid
+```
+
+```
+root@octopus03:/etc/mfs# diff example/mfsmaster.cfg.sample mfsmaster.cfg
+4a5,10
+> ## Only one metadata server in LizardFS shall have 'master' personality.
+> PERSONALITY = master
+>
+> ## Password for administrative connections and commands.
+> ADMIN_PASSWORD = nolizard
+>
+6c12
+< # WORKING_USER = nobody
+---
+> WORKING_USER = mfs
+9c15
+< # WORKING_GROUP =
+---
+> WORKING_GROUP = mfs
+27c33
+< # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs
+---
+> DATA_PATH = /export/var/lib/mfs
+34c40
+< # EXPORTS_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfsexports.cfg
+---
+> EXPORTS_FILENAME = /etc/mfs/mfsexports.cfg
+87c93
+< # MATOML_LISTEN_PORT = 9419
+---
+> MATOML_LISTEN_PORT = 9519
+103c109
+< # MATOCS_LISTEN_PORT = 9420
+---
+> MATOCS_LISTEN_PORT = 9520
+219c225
+< # MATOCL_LISTEN_PORT = 9421
+---
+> MATOCL_LISTEN_PORT = 9521
+```
+
+```
+root@octopus03:/etc/mfs# cat mfsgoals.cfg
+# safe - 2 copies, 1 on slow disk, 1 on fast disk
+11 slow: HDD SSD
+
+# Fast storage - 1 copy on fast disks, no redundancy
+12 fast: FAST
+```
+
+```
++++ b/mfs/mfschunkserver-fast.cfg
+ # user to run daemon as (default is nobody)
+-# WORKING_USER = nobody
++WORKING_USER = mfs
+
+ # group to run daemon as (optional - if empty then default user group will be used)
+-# WORKING_GROUP =
++WORKING_GROUP = mfs
+
+ # name of process to place in syslog messages (default is mfschunkserver)
+ # SYSLOG_IDENT = mfschunkserver
+@@ -28,6 +28,7 @@
+
+ # where to store daemon lock file (default is /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs)
+ # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs
++DATA_PATH=/var/lib/mfs
+
+ # when set to one chunkserver will not abort start even when incorrect entries are found in 'mfshdd.cfg' file
+ # ALLOW_STARTING_WITH_INVALID_DISKS = 0
+@@ -41,6 +42,7 @@
+
+ # alternate location/name of mfshdd.cfg file (default is /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg); this
+file will be re-read on each process reload, regardless if the path was changed
+ # HDD_CONF_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg
++HDD_CONF_FILENAME = /etc/mfs/mfsdisk-fast.cfg
+
+ # speed of background chunk tests in MB/s per disk (formally entry defined in mfshdd.cfg). Value can be given as a decimal number (default is
+1.0)
+ # deprecates: HDD_TEST_FREQ (if HDD_TEST_SPEED is not defined, but there is redefined HDD_TEST_FREQ, then HDD_TEST_SPEED = 10 / HDD_TEST_FREQ)
+@@ -109,10 +111,10 @@
+ # BIND_HOST = *
+
+ # MooseFS master host, IP is allowed only in single-master installations (default is mfsmaster)
+-# MASTER_HOST = mfsmaster
++MASTER_HOST = octopus03
+
+ # MooseFS master command port (default is 9420)
+-# MASTER_PORT = 9420
++MASTER_PORT = 9520
+
+ # timeout in seconds for master connections. Value >0 forces given timeout, but when value is 0 then CS asks master for timeout (default is 0
+- ask master)
+ # MASTER_TIMEOUT = 0
+@@ -134,5 +136,5 @@
+ # CSSERV_LISTEN_HOST = *
+
+ # port to listen for client (mount) connections (default is 9422)
+-# CSSERV_LISTEN_PORT = 9422
++CSSERV_LISTEN_PORT = 9524
+```
+
+```
++++ b/mfs/mfsmount.cfg
+mfsmaster=octopus03,nosuid,nodev,noatime,nosuid,mfscachemode=AUTO,mfstimeout=30,mfswritecachesize=2048,mfsreadaheadsize=2048,mfsport=9521
+/moosefs-fast
+```
+
+## systemd
+
+
+```
+root@octopus03:/etc# cat systemd/system/moosefs-master.service
+[Unit]
+Description=MooseFS master server daemon
+Documentation=man:mfsmaster
+After=network.target
+Wants=network-online.target
+
+[Service]
+Type=forking
+TimeoutSec=0
+ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -d start -c /etc/mfs/mfsmaster.cfg -x
+ExecStop=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg stop
+ExecStop=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg reload
+ExecReload=/bin/kill -HUP $MAINPID
+User=mfs
+Group=mfs
+Restart=on-failure
+RestartSec=60
+OOMScoreAdjust=-999
+
+[Install]
+WantedBy=multi-user.target
+```
+
+```
+ cat systemd/system/moosefs-mount.service
+[Unit]
+Description=Moosefs mounts
+After=syslog.target network.target
+
+[Service]
+Type=forking
+TimeoutSec=600
+ExecStart=/usr/local/guix-profiles/moosefs/bin/mfsmount -c /etc/mfs/mfsmount.cfg
+ExecStop=/usr/bin/umount /moosefs-fast
+
+[Install]
+WantedBy=multi-user.target
+root@octopus04:/etc# cat systemd/system/moosefs-chunkserver-fast.service
+[Unit]
+Description=MooseFS Chunkserver (Fast)
+After=network.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfschunkserver -f -c /etc/mfs/mfschunkserver-fast.cfg
+User=mfs
+Group=mfs
+Restart=on-failure
+RestartSec=5
+LimitNOFILE=65535
+
+[Install]
+WantedBy=multi-user.target
+```
diff --git a/topics/octopus/octopussy-needs-love.gmi b/topics/octopus/octopussy-needs-love.gmi
new file mode 100644
index 0000000..8c6315d
--- /dev/null
+++ b/topics/octopus/octopussy-needs-love.gmi
@@ -0,0 +1,266 @@
+# Octopussy needs love
+
+At UTHSC, Memphis, TN, around October 2020 Efraim and I installed Octopus on Debian+Guix with lizard as a distributed network storage system and slurm for job control. Around October 2023 we added 5 genoa tux05-09 machines, doubling the cluster in size. See
+
+=> https://genenetwork.org/gn-docs/facilities
+
+Octopus made a lot of work possible that we can't really do on larger HPCs, and it led to a bunch of high-impact studies and publications, particularly on pangenomics.
+
+In the coming period we want to replace lizard with moosefs. Lizard is no longer maintained, and as it was a fork of Moose, it is only logical to go forward with that one. We also looked at Ceph, but apparently Ceph is not great for systems that carry no redundancy. So far, lizard has been using redundancy, but we figure we can do without it if the occasional (cheap) SSD goes bad.
+
+We also need to look at upgrading some of the Dell BIOS - particularly tux05-09 - as they can be occasionally problematic with non-OEM SSDs.
+
+On the worker nodes it may be wise to upgrade Debian. Followed by an upgrade to the head nodes and other supporting machines. Even though we rely on Guix for latest and greatest, there may be good upgrades in the underlying Linux kernel and drivers.
+
+Our Slurm batch system is up-to-date because we run it completely on Guix and Arun supports the latest and greatest.
+
+Another thing we ought to fix is introducing centralized user management. So far we have had few users and just got by, but sometimes it bites us that users have different UIDs on the nodes.
+
+## Architecture overview
+
+* O1 is the old head node hosting lizardfs - will move to a compute
+* O2 is the old backup hosting the lizardfs shadow - will move to compute
+* O3 is the new head node hosting moosefs
+* O4 is the backup head node hosting moosefs shadow - will act as a compute node too
+
+All the other nodes are for compute. O1 and O4 will be the last nodes to remain on older Debian. They will handle the last bits of lizard.
+
+# Tasks
+
+* [X] Create moosefs package
+* [X] Install moosefs
+* [X] Upgrade bios (all tuxes)
+* [ ] Migrate lizardfs nodes to moosefs (one at a time)
+* [ ] Add server monitoring with sheepdog
+* [ ] Upgrade Debian
+* [ ] Maybe, just maybe, boot the nodes from a central server
+* [ ] Introduce centralized user management
+
+# Progress
+
+## Lizardfs and Moosefs
+
+Our Lizard documentation lives at
+
+=> lizardfs/lizard-maintenance
+
+Efraim wrote a lizardfs package for Guix at the time in guix-bioinformatics, but we ended up deploying with Debian. Going back now, the package does not look too taxing (I think we dropped it because the Guix system configuration did not play well).
+
+=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/file-systems.scm
+
+Looking at the Debian package
+
+=> https://salsa.debian.org/debian/moosefs
+
+It carries no special patches, but has a few nice hints in *.README.debian. I think it is worth trying to write a Guix package so we can easily upgrade (even on an aging Debian). Future proofing is key.
+
+The following built moosefs in a guix shell:
+
+```
+guix shell -C -D -F coreutils make autoconf automake fuse libpcap zlib pkg-config python libtool gcc-toolchain
+autoreconf -f -i
+make
+```
+
+Next I created a guix package that installs with:
+
+```
+guix build -L ~/guix-bioinformatics -L ~/guix-past/modules moosefs
+```
+
+See
+
+=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=236903baaab0f84f012a55700c1917265a2b701c
+
+Next stop testing and deploying!
+
+## Choosing a head node
+
+Currently octopus01 is the head node. It probably is a good idea to change that, so we can safely upgrade the new server. The first choice would be octopus02 (o2). We can mirror the moose daemons on octopus01 (o1) later. Let's see what that looks like.
+
+A quick assessment of o1 shows that we have 14T storage on o1 that takes care of /home and /gnu. But only 1.2T is used.
+
+o2 also has quite a few disks (up 1417 days!), but a bunch of SSDs appear to error out. E.g.
+
+```
+Sep 04 07:44:56 octopus02 mfschunkserver[22766]: can't create lock file /mnt/sdd1/lizardfs_vol/.lock, marking hdd as damaged: Input/output error
+UUID=277c05de-64f5-48a8-8614-8027a53be212 /mnt/sdd1 xfs rw,exec,nodev,noatime,nodiratime,largeio,inode64 0 1
+```
+
+Lizard also complains 4 SSDs have been wiped out.
+We'll need to reboot the server to see what storage still may work. The slurm connection appears to be misconfigured:
+
+```
+[2025-12-20T09:36:27.846] error: service_connection: slurm_receive_msg: Insane message length
+[2025-12-20T09:36:28.415] error: unpackstr_xmalloc: Buffer to be unpacked is too large (1700881509 > 1073741824)
+[2025-12-20T09:36:28.415] error: unpacking header
+[2025-12-20T09:36:28.415] error: destroy_forward: no init
+[2025-12-20T09:36:28.415] error: slurm_receive_msg_and_forward: [[nessus6.uthsc.edu]:35553] failed: Message receive failure
+```
+
+Looks like Andrea is the only one using the machine right now, though some others are logged in. Before rebooting I'll block users, ask Andrea to move off, and deplete slurm and lizard. But o2 is a large-RAM machine, so we should not use it as a head node.
+
+Let's take a look at o3. This one has less RAM. Flavia is running some tools, but I don't think the machine is really used right now. Slurm is running, but shows similar configuration issues to o2. Let's take a look at slurm:
+
+=> ../systems/hpc/octopus-maintenance
+=> ../hpc/octopus/slurm-user-guide
+
+Alright, I depleted and removed slurm from o3. I think it would be wise to also deplete the lizard drives on that machine.
+
+The big users on lizard are:
+
+```
+1.6T    dashbrook
+1.8T    pangenomes
+2.1T    erikg
+3.4T    aruni
+3.4T    junh
+8.4T    hchen
+9.2T    salehi
+13T     guarracino
+16T     flaviav
+```
+
+It seems we can clean some of that up! We have some backup storage that we can use. Alternatively, move to ISAAC.
+
+We'll slowly start depleting the lizard. See also
+
+=> lizardfs/lizard-maintenance
+
+O3 has 4 lizard drives. We'll start by depleting one.
+
+
+# O2
+
+```
+172.23.22.159:9422:/mnt/sde1/lizardfs_vol/
+        to delete: no
+        damaged: yes
+        scanning: no
+        last error: no errors
+        total space: 0B
+        used space: 0B
+        chunks: 0
+172.23.22.159:9422:/mnt/sdd1/lizardfs_vol/
+        to delete: no
+        damaged: yes
+        scanning: no
+        last error: no errors
+        total space: 0B
+        used space: 0B
+        chunks: 0
+172.23.22.159:9422:/mnt/sdc1/lizardfs_vol/
+        to delete: no
+        damaged: yes
+        scanning: no
+        last error: no errors
+        total space: 0B
+        used space: 0B
+        chunks: 0
+```
+
+Stopped the chunk server.
+sde remounted after xfs_repair. The others were not visible, so rebooted. The folloing storage should add to the total again:
+
+```
+/dev/sdc1            4.6T  3.9T  725G  85% /mnt/sdc1
+/dev/sdd1            4.6T  4.2T  428G  91% /mnt/sdd1
+/dev/sdf1            4.6T  4.2T  358G  93% /mnt/sdf1
+/dev/sde             3.7T  3.7T  4.0G 100% /mnt/sde
+/dev/sdg1            3.7T  3.7T  3.9G 100% /mnt/sdg1
+```
+
+After adding this storage and people removing material it starts to look better:
+
+```
+mfs#octopus01:9421   171T   83T   89T  49% /lizardfs
+```
+
+# O3
+
+I have marked the disks (4x4T) on o3 for deletion - that will subtract 7T. This in preparation for upgrading Linux and migrating those disks to moosefs. Continue below.
+
+# T5
+
+T5 requires a new bios - it has the same one as the unreliable T4. I also need to see if there are any disks in the bios we don't see right now. T5 has two small fast SSDs and one larger one (3.5T).
+
+I managed to install the new bios, but I had trouble getting into linux because of some network/driver issues. ipmi was suspect. Finally managed rescue mode by adding 'systemd.unit=emergency.target' in the grub line. 'single' is no longer enough (grrr). One to keep in mind.
+
+Had to disable ipmi modules. See my idrac.org.
+
+# T6
+
+Tux06 (T6) contains two unused drives that appear to have contained XFS. xfs_repair did not really help...
+The BIOS on T6 is newer than on T4+T5. That probably explains why the higher T numbers have no disk issues, while T4+T5 had problems with non-OEM SSDs! Anyway, while I was at it, I updated the BIOS for all.
+
+T6 has 4 SSDs, 2x 3.5T. Both unused. The lizard chunk server is failing, so might as well disable it.
+
+I am using T6 to test network boots because it is not serving lizard.
+
+# T7
+
+On T7 root was full(!?). Culprit was Andrea with /tmp/sweepga_genomes_111850/.
+T7 has 3x3.5T with one unused.
+
+# T8
+
+T8 has 3x3.5T, all used. After the BIOS upgrade the efi partition did not boot. After a few reboots it did get into grub and I made a copy of the efi partition on sdd (just in case).
+
+# T9
+
+T9 has 1x3.5T. Used. I had to reduce HDD_LEAVE_SPACE_DEFAULT to give the chunkserver some air.
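+
+For reference, that knob lives in the chunkserver config; a sketch (the value here is made up, and the shipped default leaves more headroom):
+
+```
+# /etc/mfs/mfschunkserver.cfg
+HDD_LEAVE_SPACE_DEFAULT = 1GiB
+```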
+
+# O3 + O4
+
+Back to O3, our future head node. lizard has mostly been depleted. Though every drive has a few chunks left. I just pulled down the chunkserver and lizard appears to be fine (no errors). Good!
+
+Next install Linux. I have two routes, one is using debootstrap, the other is via PXE. I want to try the latter.
+
+So far, I managed to boot into ipxe on Octopus.
+The linux kernel loads over http, but it does not show output. Likely I need to:
+
+* [X] Build ipxe with serial support
+* [X] Test the installer with serial support
+* [X] Add NFS support
+* [X] debootstrap install of new Debian on /export/nfs/nodes/debian14
+* [X] Make available through NFS and boot through IPXE
+
+I managed to boot T6 over the network.
+Essentially we have a running Debian last stable on T6 that is completely run over NFS!
+In the next steps I need to figure out:
+
+* [X] Mount NFS with root access
+* [ ] Every PXE node needs its own hard disk configuration
+* [ ] Mount NFS from octopus01
+* [ ] Start slurm
+
+We can have this as a test node pretty soon.
+But first we have to start moosefs and migrate data.
+
+I am doing some small tests and will put (old) T6 back on slurm again.
+
+To get every node booted with its own version of fstab and state logging on a local disk, we need to pull a trick with initrd.
+
+Basically the NFS boot initrd needs to contain a script that invokes changes for every node. The node hostname and primary partition can be passed on from ipxe on the kernel command line, e.g. myhost=client01 localdisk=/dev/sda1. So that is the differentiator. The script in /etc/nodes/initramfs-tools/update-node-etc will remount /tmp and /var onto $localdisk and copy /etc there too. Next it will symlink a few files, such as /etc/hostname and /etc/fstab, to adjust for local settings. A rough sketch follows below.
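+
+A rough sketch of what that script could look like (untested; everything beyond the names mentioned above is hypothetical):
+
+```
+#!/bin/sh
+# /etc/nodes/initramfs-tools/update-node-etc (sketch)
+# read myhost= and localdisk= passed on the kernel command line by ipxe
+for arg in $(cat /proc/cmdline); do
+    case "$arg" in
+        myhost=*)    HOST="${arg#myhost=}" ;;
+        localdisk=*) DISK="${arg#localdisk=}" ;;
+    esac
+done
+mkdir -p /mnt/local
+mount "$DISK" /mnt/local
+# give every node its own writable /tmp and /var on the local disk
+mount --bind /mnt/local/tmp /tmp
+mount --bind /mnt/local/var /var
+# copy /etc to the local disk and adjust the per-node settings
+cp -a /etc /mnt/local/
+echo "$HOST" > /mnt/local/etc/hostname
+ln -sf /mnt/local/etc/hostname /etc/hostname
+ln -sf /mnt/local/etc/fstab /etc/fstab
+```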
+
+This way we will deploy all nodes centrally. One aspect is that we don't need dynamic user management as it is centrally orchestrated! The user files can be copied from the head node when they change.
+
+O4 is going to be the backup head node. It will act as a compute node too, until we need it as the head node. O4 is currently not on the slurm queue.
+
+* [X] Update guix on O1
+* [X] Install guix moosefs
+* [X] Start moosefs master on O3
+* [X] Start moosefs metalogger on O4
+* [ ] Check moosefs logging facilities
+* [ ] See if we can mark drives so it is easier to track them
+* [ ] Test broken (?) /dev/sdf on octopus03
+
+We can start the moose master on O3. We should use different ports than lizard. Lizard uses 9419-24 by default. So let's use ports starting at 9519. See
+
+=> moosefs/moosefs-maintenance.gmi
+
+# P2
+
+Penguin2 has 80T of spinning disk storage. We are going to use that for redundancy. Basically these disks get a moosefs goal of HDD 'slow' and we'll configure them on a remote rack - so chunks get fetched from local chunk servers (first). This will gain us 40T of immediate storage. Adding more spinning disks will free up SSDs further.
+
+* [X] P2 Update Guix
+* [X] Install moosefs
+* [ ] Create HDD chunk server
diff --git a/topics/octopus/recent-rust.gmi b/topics/octopus/recent-rust.gmi
new file mode 100644
index 0000000..7ce8968
--- /dev/null
+++ b/topics/octopus/recent-rust.gmi
@@ -0,0 +1,76 @@
+# Use a recent Rust on Octopus
+
+
+For impg we currently need a rust that is more recent than what we have in Debian
+or Guix. No panic, because Rust has few requirements.
+
+Install latest rust using the script
+
+```
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
+
+Set path
+
+```
+. ~/.cargo/env
+```
+
+Update rust
+
+```
+rustup default stable
+```
+
+Running the update looks like this:
+
+```
+octopus01:~/tmp/impg$ . ~/.cargo/env
+octopus01:~/tmp/impg$ rustup default stable
+info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu'
+info: latest update on 2025-05-15, rust version 1.87.0 (17067e9ac 2025-05-09)
+info: downloading component 'cargo'
+info: downloading component 'clippy'
+info: downloading component 'rust-docs'
+info: downloading component 'rust-std'
+info: downloading component 'rustc'
+(...)
+```
+
+and build the package
+
+```
+octopus01:~/tmp/impg$ cargo build
+```
+
+Since we are not in guix we get the local dependencies:
+
+```
+octopus01:~/tmp/impg$ ldd target/debug/impg
+  linux-vdso.so.1 (0x00007ffdb266a000)
+  libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe404001000)
+  librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe403ff7000)
+  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe403fd6000)
+  libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe403fd1000)
+  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe403e11000)
+  /lib64/ld-linux-x86-64.so.2 (0x00007fe404682000)
+```
+
+Logging in on another octopus - say octopus02 - you can run impg from this directory:
+
+```
+octopus02:~$ ~/tmp/impg/target/debug/impg
+Command-line tool for querying overlaps in PAF files
+
+Usage: impg <COMMAND>
+
+Commands:
+  index      Create an IMPG index
+  partition  Partition the alignment
+  query      Query overlaps in the alignment
+  stats      Print alignment statistics
+
+Options:
+  -h, --help     Print help
+  -V, --version  Print version
+```
diff --git a/topics/octopus/set-up-guix-for-new-users.gmi b/topics/octopus/set-up-guix-for-new-users.gmi
new file mode 100644
index 0000000..f459559
--- /dev/null
+++ b/topics/octopus/set-up-guix-for-new-users.gmi
@@ -0,0 +1,38 @@
+# Set up Guix for new users
+
+This document describes how to set up Guix for new users on a machine in which Guix is already installed (such as octopus01).
+
+## Create a per-user profile for yourself by running your first guix pull
+
+"Borrow" some other user's guix to run guix pull. In the example below, we use root's guix, but it might as well be any guix.
+```
+$ /var/guix/profiles/per-user/root/current-guix/bin/guix pull
+```
+This should create your very own Guix profile at ~/.config/guix/current. You may invoke guix from this profile as
+```
+$ ~/.config/guix/current/bin/guix ...
+```
+But, you'd normally want to make this more convenient. So, add ~/.config/guix/current/bin to your PATH. To do this, add the following to your ~/.profile
+```
+GUIX_PROFILE=~/.config/guix/current
+. $GUIX_PROFILE/etc/profile
+```
+Thereafter, you may run any guix command simply as
+```
+$ guix ...
+```
+
+## Pulling from a different channels.scm
+
+By default, guix pull pulls the latest commit of the main upstream Guix channel. You may want to pull from additional channels as well. Put the channels you want into ~/.config/guix/channels.scm, and then run guix pull. For example, here's a channels.scm if you want to use the guix-bioinformatics channel.
+```
+$ cat ~/.config/guix/channels.scm
+(list (channel
+       (name 'gn-bioinformatics)
+       (url "https://git.genenetwork.org/guix-bioinformatics")
+       (branch "master")))
+```
+And,
+```
+$ guix pull
+```
diff --git a/topics/octopus/slurm-upgrade.gmi b/topics/octopus/slurm-upgrade.gmi
new file mode 100644
index 0000000..822f68e
--- /dev/null
+++ b/topics/octopus/slurm-upgrade.gmi
@@ -0,0 +1,89 @@
+# How to upgrade slurm on octopus
+
+This document closely mirrors the official upgrade guide. The official upgrade guide is very thorough. Please refer to it and update this document if something is not clear.
+=> https://slurm.schedmd.com/upgrades.html Official slurm upgrade guide
+
+## Preparation
+
+It is possible to upgrade slurm in-place without upsetting running jobs. But, for our small cluster, we don't mind a little downtime. So, it is simpler if we schedule some downtime with other users and make sure there are no running jobs.
+
+slurm can only be upgraded safely in small version increments. For example, it is safe to upgrade version 18.08 to 19.05 or 20.02, but not to 20.11 or later. This compatibility information is in the RELEASE_NOTES file of the slurm git repo, with the git tag corresponding to the target version checked out. Any configuration file changes are also outlined in this file.
+=> https://github.com/SchedMD/slurm/ slurm git repository
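+
+For example, to read the notes for a target version (the tag name below is hypothetical):
+
+```
+$ git clone https://github.com/SchedMD/slurm
+$ cd slurm
+$ git tag --list 'slurm-*'        # find the tag for the target version
+$ git checkout slurm-23-02-7-1    # hypothetical target version
+$ less RELEASE_NOTES
+```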
+
+## Backup
+
+Stop the slurmdbd, slurmctld and slurmd services.
+```
+# systemctl stop slurmdbd slurmctld slurmd slurmrestd
+```
+Backup the slurm StateSaveLocation (/var/spool/slurmd/ctld in our case) and the slurm configuration directory.
+```
+# cp -av /var/spool/slurmd/ctld /somewhere/safe/
+# cp -av /etc/slurm /somewhere/safe/
+```
+Backup the slurmdbd MySQL database. Enter the password when prompted. The password is specified in StoragePass of /etc/slurm/slurmdbd.conf.
+```
+$ mysqldump -u slurm -p --databases slurm_acct_db > /somewhere/safe/slurm_acct_db.sql
+```
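+Should you ever need to roll back, the dump can be restored with something like:
+
+```
+$ mysql -u slurm -p < /somewhere/safe/slurm_acct_db.sql
+```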
+
+## Upgrade slurm on octopus01 (the head node)
+
+Clone the gn-machines git repo.
+```
+$ git clone https://git.genenetwork.org/gn-machines
+```
+Edit slurm.scm to build the version of slurm you are upgrading to. Ensure it builds successfully using
+```
+$ guix build -f slurm.scm
+```
+Upgrade slurm.
+```
+# ./slurm-head-deploy.sh
+```
+Make any configuration file changes outlined in RELEASE_NOTES. Next, run the slurmdbd daemon, wait for it to start up successfully and then exit with Ctrl+C. During upgrades, slurmdbd may take extra time to update the database. This may cause systemd to timeout and kill slurmdbd. So, we do it this way, instead of simply starting the slurmdbd systemd service.
+```
+# sudo -u slurm slurmdbd -D
+```
+Reload the new systemd configuration files. Then, start the slurmdbd, slurmctld and slurmd services one at a time ensuring that each starts up correctly before proceeding on to the next.
+```
+# systemctl daemon-reload
+# systemctl start slurmdbd
+# systemctl start slurmctld
+# systemctl start slurmd
+# systemctl start slurmrestd
+```
+
+## Upgrade slurm on the worker nodes
+
+Repeat the steps below on every worker node.
+
+Stop the slurmd service.
+```
+# systemctl stop slurmd
+```
+Upgrade slurm, passing slurm-worker-deploy.sh the slurm store path obtained from building slurm using guix build on octopus01. Recall that you cannot invoke guix build on the worker nodes.
+```
+# ./slurm-worker-deploy.sh /gnu/store/...-slurm
+```
+Copy over any configuration file changes from octopus01. Then, reload the new systemd configuration files and start slurmd.
+```
+# systemctl daemon-reload
+# systemctl start slurmd
+```
+
+## Tip: Running the same command on all worker nodes
+
+It is a lot of typing to run the same command on all worker nodes. You could make this a little less cumbersome with the following bash for loop.
+```
+for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09;
+do
+    ssh $node your command
+done
+```
+You can even do this for sudo commands using the -S flag of sudo that makes it read the password from stdin. Assuming your password is in the pass password manager, the bash for loop would then look like:
+```
+for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09;
+do
+    pass octopus | ssh $node sudo -S your command
+done
+```
\ No newline at end of file
diff --git a/topics/pangenome/impg/impg-agc-bindings.gmi b/topics/pangenome/impg/impg-agc-bindings.gmi
new file mode 100644
index 0000000..2451c0a
--- /dev/null
+++ b/topics/pangenome/impg/impg-agc-bindings.gmi
@@ -0,0 +1,246 @@
+# IMPG AGC bindings
+
+In this document we will create a build setup that allows us to use AGC (a C++ library) from a recent Rust compiler. The original binding proves tricky. So we break it down into parts. Also we try out the new Rust cargo support in Guix.
+
+Fortunately the AGC include file contains a limited list of functions that have C ABI bindings:
+
+```c
+EXTERNC agc_t* agc_open(char* fn, int prefetching);
+EXTERNC int agc_close(agc_t* agc);
+EXTERNC int agc_get_ctg_len(const agc_t *agc, const char *sample, const char *name);
+EXTERNC int agc_get_ctg_seq(const agc_t *agc, const char *sample, const char *name, int start, int end, char *buf);
+EXTERNC int agc_n_sample(const agc_t* agc);
+EXTERNC int agc_n_ctg(const agc_t *agc, const char *sample);
+EXTERNC char* agc_reference_sample(const agc_t* agc);
+EXTERNC char **agc_list_sample(const agc_t *agc, int *n_sample);
+EXTERNC char **agc_list_ctg(const agc_t *agc, const char *sample, int *n_ctg);
+EXTERNC int agc_list_destroy(char **list);
+EXTERNC int agc_string_destroy(char *sample);
+```
+
+Even for a C++ library it is very thoughtful to provide a C ABI! Both the current Rust binding and the Python example in AGC actually use the C++ class - which means they need to build against a matching C++ source tree.
+It should be straightforward to create a Rust module that calls into the shared library directly using the C ABI, instead of importing and building all the source code.
+
+One early choice is a separation of concerns. We will try to build the library independently of the Rust package. This follows a standard model. For example cargo should not build zlib - it is provided by the environment. The bindings, meanwhile, are defined and built in cargo.
+
+# Tasks
+
+* [X] Fix AGC passing exceptions through C ABI
+* [X] Get guix to compile impg (here testlibagc) with AGC
+* [ ] Add optimization
+* [ ] Make sure spoa build in spoa-rs is optimized
+* [ ] Create static binary for distribution
+* [ ] Create singularity example
+
+# Steps
+
+## Setting up Guix with rust
+
+Guix provides a reproducible build environment. If you get over the fact that it is Lisp, it proves a remarkably nice way to handle dependencies. The first step is to set up guix so you get a recent set of dependencies. For this run guix pull and set it up in a profile
+
+```sh
+guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix
+```
+
+it takes a few minutes. Next set the environment
+
+```sh
+unset GUIX_PROFILE
+. ~/opt/guix-pull/etc/profile
+```
+
+and list the packages
+
+```sh
+guix package -A rust
+rust                    1.85.1                  rust-src,tools,out,cargo        gnu/packages/rust.scm:1454:4
+```
+
+should show a recent edition of rust (typically about half a year old, the rust-team in guix is now working on 1.89). Note you can also pull an older version of guix (and rust) by passing in the git hash value of the codeberg repo. This allows you to go back to the dependency tree of, say, three months ago. It allows for a level of sanity not seen in other software deployment systems.
+
+Note that we tend not to be too recent with packages as Guix is used to deploy *stable* systems. If you want a more recent version of rust you can write your own guix package - it is not that hard. We may attempt it later for this exercise.
+
+Note also that newbies run guix-pull too often. I typically do it every three months, or so. So the slowness of guix-pull should not really count.
+
+One thing that is a bit funny now is that we currently can't list most cargo packages in guix, because the crates are now 'local' to a package. We have to check the source tree:
+
+=> https://codeberg.org/guix/guix/src/branch/master/gnu/packages/rust-crates.scm
+
+## Building AGC in guix
+
+AGC is a C++ program with a C ABI. The README suggests there are no dependencies, but that is misleading: it sources other dependencies and builds them (a bit like git submodules). I managed to build AGC using a guix shell with:
+
+```sh
+guix shell -C guix gcc-toolchain make libdeflate pkg-config xz mimalloc coreutils sed minizip-ng lzlib zlib:static zstd:static zstd:lib zstd zlib
+make PLATFORM=avx2 libagc
+```
+
+Note it pulls in too much. To make it compile, the patch I applied is:
+
+```diff
+--- a/agc/makefile
++++ b/agc/makefile
+@@ -14,14 +14,14 @@ $(call SET_SRC_OBJ_BIN,src,obj,bin)
+
+ # *** Project configuration
+ $(call CHECK_NASM)
+-$(call ADD_MIMALLOC, $(3RD_PARTY_DIR)/mimalloc)
++# $(call ADD_MIMALLOC, $(3RD_PARTY_DIR)/mimalloc)
+ $(call PROPOSE_ISAL, $(3RD_PARTY_DIR)/isa-l)
+-$(call PROPOSE_ZLIB_NG, $(3RD_PARTY_DIR)/zlib-ng)
+-$(call CHOOSE_GZIP_DECOMPRESSION)
+-$(call ADD_LIBDEFLATE, $(3RD_PARTY_DIR)/libdeflate)
+-$(call ADD_LIBZSTD, $(3RD_PARTY_DIR)/zstd)
++# $(call PROPOSE_ZLIB_NG, $(3RD_PARTY_DIR)/zlib-ng)
++# $(call CHOOSE_GZIP_DECOMPRESSION)
++# $(call ADD_LIBDEFLATE, $(3RD_PARTY_DIR)/libdeflate)
++# $(call ADD_LIBZSTD, $(3RD_PARTY_DIR)/zstd)
+ $(call ADD_RADULS_INPLACE,$(3RD_PARTY_DIR)/raduls-inplace)
+-$(call ADD_PYBIND11,$(3RD_PARTY_DIR)/pybind11/include)
++# $(call ADD_PYBIND11,$(3RD_PARTY_DIR)/pybind11/include)
+ $(call SET_STATIC, $(STATIC_LINK))
+
+ $(call SET_C_CPP_STANDARDS, c11, c++20)
+@@ -57,7 +57,7 @@ $(OUT_BIN_DIR)/agc: \
+        $(CXX) -o $@  \
+        $(MIMALLOC_OBJ) \
+        $(OBJ_APP) $(OBJ_CORE) $(OBJ_COMMON) \
+-       $(LIBRARY_FILES) $(LINKER_FLAGS) $(LINKER_DIRS)
++       $(LIBRARY_FILES) -lzstd -lz -ldeflate $(LINKER_FLAGS) $(LINKER_DIRS)
+
+ libagc: $(OUT_BIN_DIR)/libagc
+ $(OUT_BIN_DIR)/libagc:
+```
+
+This essentially disables the 3rd-party dependency builds in favour of the Guix-provided libraries.
+
+Note that Bioconda installs AGC as a binary:
+
+=> https://github.com/bioconda/bioconda-recipes/blob/master/recipes/agc/meta.yaml
+
+So it circumvents building AGC by downloading the provided static binaries. It only downloads the binary, though, not the library.
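+
+If you only want the agc binary locally, something like the following should work (a sketch, assuming a conda setup with the bioconda channel enabled):
+
+```sh
+conda install -c bioconda agc
+```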
+
+## The current cargo package
+
+The current cargo bindings package, named agc-rs, in its turn vendors in the AGC github repository, similar to git submodules. It is somewhat ironic that we left git submodules for something that is no better (maybe even worse, because it pins a versioned branch/tag rather than a hash -- who is to say what happened upstream).
+
+## Changes
+
+So we propose to take a different approach to distributing software. The first premise is that we will prepare pre-built *binaries* for external use that can be handled by conda and singularity. Both these deployers can handle external dependencies, so we can just use a standard AGC build/distribution. That is key to staying sane - do not have cargo build AGC itself, as it is just a library with a decent C ABI.
+
+To make it work with Rust we can create a cargo module that binds to the C ABI using FFI (and not care where the AGC library comes from). One great feature is that we can use the C ABI without having to generate bindings using clang and all that. A C ABI binding can be written and maintained by hand in Rust.
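+
+To get a feel for how small that surface is, you can list the exported C symbols of the shared library (a sketch; the path to libagc.so depends on where you built it):
+
+```sh
+# dynamically exported symbols; C ABI functions appear unmangled
+nm -D --defined-only path/to/libagc.so | grep -i agc
+```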
+
+For C++-only libraries the story gets a bit harder. If the C++ interface is rich, it may be best to use a bindings generator. In general, though, it should be possible to write a small C ABI shim that calls into the C++ code. This means we can take the same deployment approach (in general) for pure C++ libraries, provided we can write a short C ABI. I have done this for vcflib, for example, to write the Zig version of vcflib:
+
+=> https://github.com/vcflib/vcflib/blob/master/src/vcf-c-api.cpp
+
+To support AGC in Rust we need to:
+
+* [X] Create a Rust binding that uses the AGC C ABI instead of the C++ one, so we can use a statically built AGC lib and don't need the source tree for cargo
+
+We will also write a
+
+* [ ] Guix build to create the optimized AGC static lib
+* [ ] Guix build that creates an optimized impg
+
+And that last one allows us to distribute prebuilt binaries in CONDA and apptainer/singularity/docker.
+
+Note that this is the same approach as taken by
+
+=> https://github.com/rust-lang/libz-sys/blob/main/build.rs
+
+which binds against libz. It *optionally* builds the source tree of zlib, which is included as a submodule:
+
+=> https://github.com/rust-lang/libz-sys/tree/main/src
+
+In our case, a rebuild can be useful when the AGC lib cannot be found. Note that the cargo edition of libz-sys does not invoke make or cmake - it builds zlib by 'hand'!
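+
+The probe itself can be very simple. A minimal sketch of the idea in shell (a build.rs would do the equivalent before deciding to rebuild from source):
+
+```sh
+# try to link an empty program against libagc; the exit status tells us if it is present
+echo 'int main(){return 0;}' | cc -x c - -lagc -o /dev/null && echo "libagc found"
+```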
+
+There is also libz-rs, but that is a somewhat typical Rust rewrite of libz:
+
+=> https://github.com/trifectatechfoundation/zlib-rs
+
+I also took a quick look at the rust spoa crate. Here a build is always forced, but I don't think it actually optimizes the build. I added a note to my tasks.
+
+## First guix package by Fred
+
+Fred drafted a first guix package which can build impg with
+
+```
+guix build -L .guix/modules -f guix.scm
+
+/gnu/store/cdjiq6aalpc849hl8irmbn8xax9mq2b6-impg-0.3.1/bin/impg
+Command-line tool for querying overlaps in PAF files
+
+Usage: impg <COMMAND>
+
+Commands:
+  index       Create an IMPG index
+  lace        Lace files together (graphs or VCFs)
+  partition   Partition the alignment
+  query       Query overlaps in the alignment
+  similarity  Compute pairwise similarity between sequences in a region
+  stats       Print alignment statistics
+
+Options:
+  -h, --help     Print help
+  -V, --version  Print version
+```
+
+It builds against rust 1.85 and uses the new cargo support in Guix. It does not have to rebuild the cargo packages already in guix. Nice and a good start!
+
+=> https://github.com/pangenome/impg/blob/f5ebaf8b511ee06bdeb193ef509836c26cd4793a/.guix/modules/impg/impg.scm#L4
+
+We'll still need to add AGC, static output and optimizations.
+
+## Adding a guix package for AGC
+
+As a first step we build a package for AGC that compiles libagc.a using AVX2:
+
+=> https://github.com/pjotrp/impg/commit/ed16948cc4145ff933a19ba54c3bc1fe4cec709f
+
+We used the vendored-in sources for raduls-inplace and isa-l. I am not sure they are really required, but it is harmless here.
+
+## Make sure libagc.a is linked to impg
+
+To create a rust package for binding libagc it is worth reading:
+
+=> https://doc.rust-lang.org/cargo/reference/build-scripts.html#a-sys-packages
+
+* The library crate should link to the native library libfoo. This will often probe the current system for libfoo before resorting to building from source.
+* The library crate should provide declarations for types and functions in libfoo, but not higher-level abstractions.
+
+So we should create an agc-rs crate that provides a high-level interface to the upcoming libagc-sys crate. No wonder these crates proliferate.
+
+## Using a linked libagc.so
+
+I managed to create a crate that binds libagc.so for use from Rust:
+
+=> https://github.com/pjotrp/libagc-sys
+
+See also the included test in lib.rs. It binds against the updated agc:
+
+=> https://github.com/refresh-bio/agc/compare/main...pjotrp:agc:main
+
+which contains the fixes that prevent C++ exceptions from passing through the C ABI.
+I also fixed one function and added a shared lib as output.
+
+Finally, rather than messing with the impg code tree (which keeps changing), I created a test crate that mirrors impg:
+
+=> https://github.com/pjotrp/testlibagc
+
+which can be built and run with
+
+```
+cargo build --release
+target/release/testagc-sys
+Number of samples: 4
+```
+
+At least we have a reference implementation for binding successfully against a shared C library with a very *light* and standardised interface. It obviously also works in Guix. We can use it to benchmark against the new (impressive) Rust implementation by Erik. It also acts as a template for future bindings.
+
+Note that we should discourage C++ bindings, mostly because there is no standard C++ ABI (in contrast to the C one). So avoid the cxx crates - unless you really know what you are doing.
+
+Potential future work is:
+
+- [ ] Optimized runtime
+- [ ] Static binary for distribution
diff --git a/topics/programming/autossh-for-keeping-ssh-tunnels.gmi b/topics/programming/autossh-for-keeping-ssh-tunnels.gmi
new file mode 100644
index 0000000..a977232
--- /dev/null
+++ b/topics/programming/autossh-for-keeping-ssh-tunnels.gmi
@@ -0,0 +1,65 @@
+# Using autossh to Keep SSH Tunnels Alive
+
+## Tags
+* keywords: ssh, autossh, tunnel, alive
+
+
+## TL;DR
+
+```
+guix package -i autossh  # Install autossh with Guix
+autossh -M 0 -o "ServerAliveInterval 60" -o "ServerAliveCountMax 5" -L 4000:127.0.0.1:3306 alexander@remoteserver.org
+```
+
+## Introduction
+
+Autossh is a utility for automatically restarting SSH sessions and tunnels if they drop or become inactive. It's particularly useful for long-lived tunnels in unstable network environments.
+
+See official docs:
+
+=> https://www.harding.motd.ca/autossh/
+
+## Installing autossh
+
+Install autossh using Guix:
+
+```
+guix package -i autossh
+```
+
+Basic usage:
+
+```
+autossh [-V] [-M monitor_port[:echo_port]] [-f] [SSH_OPTIONS]
+```
+
+## Examples
+
+### Keep a database tunnel alive with autossh
+
+Forward a remote MySQL port to your local machine:
+
+**Using plain SSH:**
+
+```
+ssh -L 5000:localhost:3306 alexander@remoteserver.org
+```
+
+**Using autossh:**
+
+```
+autossh -L 5000:localhost:3306 alexander@remoteserver.org
+```
+
+### A better option
+
+```
+autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -L 5000:localhost:3306 alexander@remoteserver.org
+```
+
+#### Option explanations:
+
+- `-M 0`: disables autossh's own monitoring port; connection health is left to SSH's keepalives (the two options below).
+- `ServerAliveInterval`: Seconds between sending keepalive packets to the server (default: 0, i.e. disabled).
+- `ServerAliveCountMax`: Number of unanswered keepalive packets before SSH disconnects (default: 3).
+
+You can also configure these options in your `~/.ssh/config` file to simplify command-line usage.
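+
+For example (a sketch; the 'dbtunnel' alias and the server details are placeholders):
+
+```
+Host dbtunnel
+    HostName remoteserver.org
+    User alexander
+    ServerAliveInterval 30
+    ServerAliveCountMax 3
+    LocalForward 5000 localhost:3306
+```
+
+after which the whole tunnel reduces to `autossh -M 0 dbtunnel`.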
diff --git a/topics/programming/better-logging.gmi b/topics/programming/better-logging.gmi
index dca8c0d..d80bb0d 100644
--- a/topics/programming/better-logging.gmi
+++ b/topics/programming/better-logging.gmi
@@ -1,14 +1,17 @@
-# Improving Logging in GN2
+# Improving Alerting/Logging in GN2
 
-## What Are We Trying To Solve?
+## Problem Statement
 
-We prioritise maintaining user functionality over speed in GN [with time this speed will be improved].  As such we should be pay more attention at not breaking any currently working GN2 functionality.  And when/if we do, trouble-shooting should be easy.  On this front, one way is to stream-line logging in both GN2/GN3 and make it more script friendly - only report when something fails, not to instrument variables - and in so doing make the process of monitoring easier.
+Current logging in the genenetwork ecosystem is noisy and difficult to parse programmatically, which makes it hard to:
+
+* Integrate logs into some observability pipeline (E.g. sheepdog).
+* Troubleshoot issues as they occur.  We always learn of bugs from users.
 
 ## Goals
 
-* Have script-friendly error/info logs.
-* Remove noise from GN2.
-* Separate logging into different files: error logs, info logs.  Add this somewhere with Flask itself instead of re-directing STDOUT to a file.
+* Standardize logging format and config across GN2 flask apps and gn-guile.
+* Adopt structured logging.
+* Extend sheepdog to be able to parse gn logs and send alerts via e-mail or matrix.
 
 ### Non-goals
 
@@ -27,3 +30,5 @@ We prioritise maintaining user functionality over speed in GN [with time this sp
 ## Resources
 
 => https://realpython.com/python-logging/ Logging in Python
+=> https://signoz.io/guides/python-logging-best-practices/ Python Logging Best Practices - Obvious and Not-So-Obvious
+=> https://signoz.io/blog/what-is-opentelemetry/ What is OpenTelemetry
diff --git a/topics/rust/guix-rust-bootstrap.gmi b/topics/rust/guix-rust-bootstrap.gmi
new file mode 100644
index 0000000..cd3c322
--- /dev/null
+++ b/topics/rust/guix-rust-bootstrap.gmi
@@ -0,0 +1,173 @@
+# Guix Rust Bootstrap
+
+To develop Rust code you often need a recent edition of rust. With Guix this is possible because you don't depend on the underlying Linux distribution to provide recent versions of glibc and other libraries. Here is a recipe that should work anywhere on Linux.
+
+I succeeded in running the latest Rust on Octopus and building packages with guix.
+
+To make it work the following steps are required:
+
+* Update guix with guix pull if your guix is older than 3 months
+* Unset GUIX_PROFILE on some systems
+* Set your updated guix profile vars
+* Create a container that has all dependencies for rust itself
+* Run rustup
+* Run cargo with LD_LIBRARY_PATH set to $GUIX_ENVIRONMENT/lib
+
+# Get Guix updated
+
+It is important to have a recent version of Guix. This is achieved with 'guix pull', after which we make sure it works.
+
+
+```sh
+mkdir -p ~/opt
+guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix
+```
+
+It takes a few minutes. Next, set the environment
+
+```sh
+unset GUIX_PROFILE
+. ~/opt/guix-pull/etc/profile
+```
+
+This will point the path at a recent guix. You can verify with
+
+```
+guix describe
+  guix 772c456
+    repository URL: https://codeberg.org/guix/guix
+    branch: master
+    commit: 772c456717e755829397a6ff6dba4c1e135426d8
+```
+
+which can be validated against the Guix tree. Running
+
+
+```sh
+guix package -A rust
+rust                    1.85.1                  rust-src,tools,out,cargo       gnu/packages/rust.scm:1454:4
+```
+
+shows the current *stable* version in Guix. Now, of course, we want something more recent: the latest rust.
+
+# Update Rust and Cargo to latest (stable)
+
+The trick is to set up a container with Rust in your git working directory:
+
+```
+mkdir -p ~/.cargo ~/.rustup # to prevent rebuilds
+guix shell --share=$HOME/.cargo  --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+. ~/.cargo/env
+rustup default stable
+```
+
+Now rustc shows it is recent:
+
+```
+rustc --version
+rustc 1.90.0 (1159e78c4 2025-09-14)
+```
+
+Next run 'cargo build' with:
+
+```
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib cargo build
+  Compiling libagc-sys v0.1.0 (/home/wrk/iwrk/opensource/code/pangenome/libagc-sys)
+    Finished 'dev' profile [unoptimized + debuginfo] target(s) in 0.06s
+$ ./target/debug/libagc-sys
+./target/debug/libagc-sys: error while loading shared libraries: libgcc_s.so.1: cannot open shared object file: No such file or directory
+$ env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ./target/debug/libagc-sys
+Hello, world!
+```
+
+and your source should build and run. Note the libgcc_s.so.1 error above: without LD_LIBRARY_PATH the binary fails to start.
+
+## What if you get a libgcc or librt error?
+
+The problem is that cargo picks up the wrong libgcc:
+
+```
+$ ls /gnu/store/*/lib/libgcc_s.so.1
+/gnu/store/m2vhzr0dy352cn59sgcklcaykprrr4j6-gcc-14.3.0-lib/lib/libgcc_s.so.1
+/gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/lib/libgcc_s.so.1
+/gnu/store/v3bq3shn333kh7m6gj3r58l0v7mkn4in-profile/lib/libgcc_s.so.1
+/gnu/store/xm7i1gvi0i9pyndlkv627r08rsw1ny96-gcc-15.2.0-lib/lib/libgcc_s.so.1
+```
+
+This is because Guix itself builds on an older libgcc and librt. You need to tell the dynamic linker explicitly which libraries to load for the rustup-installed cargo:
+
+```
+ldd ~/.cargo/bin/cargo
+        linux-vdso.so.1 (0x00007ffd409b2000)
+        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd2cf433000)
+        librt.so.1 => /lib/librt.so.1 (0x00007fd2cf42e000)
+```
+
+in the container:
+
+```
+ls -l /lib/libgcc_s.so.1
+lrwxrwxrwx 1 65534 overflow 82 Jan  1  1970 /lib/libgcc_s.so.1 -> /gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/lib/libgcc_s.so.1
+```
+
+which happens to be the one in $GUIX_ENVIRONMENT/lib! So setting the library path solves it.
+
+The reason we don't get the automatic library resolution that you normally have in guix is that we updated rust by *hand* using rustup. Guix has no control over this process.
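+
+A quick sanity check is to inspect what a binary will actually resolve (the binary path is a placeholder for whatever cargo just built):
+
+```sh
+# confirm libgcc/librt resolve from the Guix profile, not the host
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ldd target/debug/yourbinary | grep -E 'libgcc|librt'
+```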
+
+# spoa-rs on octopus01
+
+I just did the above to build spoa-rs. I only had to add cmake to the shell packages.
+
+# sweepga on octopus01
+
+I just built sweepga. I only had to add clang to the shell:
+
+```sh
+guix shell --share=$HOME/.cargo  --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl zlib cmake clang
+. ~/.cargo/env
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib cargo build
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ./target/debug/sweepga
+[sweepga::start::0.000*1.00] 2025-10-11 15:27:28 | ./target/debug/sweepga
+[sweepga::detect::0.000*1.00] Using .1aln workflow (FastGA native format)
+[sweepga] ERROR: No valid input provided
+```
+
+To run on the cluster you likely don't want to use the container. Make a note of GUIX_ENVIRONMENT:
+
+```
+echo $GUIX_ENVIRONMENT/
+/gnu/store/6khi7iv7l75595hwlfc1nwmdcv72m24s-profile/
+```
+
+It has your libs! So, outside the container you can run
+
+```
+export GUIX_ENVIRONMENT=/gnu/store/6khi7iv7l75595hwlfc1nwmdcv72m24s-profile
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib /home/wrk/tmp/sweepga/target/debug/sweepga
+```
+
+# Updating the container
+
+Now your build may fail because you are missing a crucial library or tool. This is a feature of guix containers, as it makes dependencies explicit.
+
+Just add them to the guix shell command. Let's say we add zlib
+
+```
+guix shell --share=$HOME/.cargo  --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl zlib
+```
+
+# Troubleshooting
+
+## Collisions
+
+Guix may complain about collisions. These are mostly naming issues:
+
+```
+warning: collision encountered:
+  /gnu/store/nym6kiinrg2mb8z4lwnvfx5my8df9vrs-glibc-for-fhs-2.41/bin/ldd
+  /gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/bin/ldd
+warning: choosing /gnu/store/nym6kiinrg2mb8z4lwnvfx5my8df9vrs-glibc-for-fhs-2.41/bin/ldd
+```
+
+It will link one of them into your environment. You can still use both tools via their full paths, and the warning can normally be ignored.
diff --git a/topics/systems/backup-drops.gmi b/topics/systems/backup-drops.gmi
index 191b185..a29e605 100644
--- a/topics/systems/backup-drops.gmi
+++ b/topics/systems/backup-drops.gmi
@@ -4,6 +4,10 @@ To make backups we use a combination of sheepdog, borg, sshfs, rsync. sheepdog i
 
 This system proves pretty resilient over time. Only on the synology server I can't get it to work because of some CRON permission issue.
 
+For doing the actual backups see
+
+=> ./backups-with-borg.gmi
+
 # Tags
 
 * assigned: pjotrp
@@ -13,7 +17,7 @@ This system proves pretty resilient over time. Only on the synology server I can
 
 ## Borg backups
 
-It is advised to use a backup password and not store that on the remote.
+Despite our precautions it is advised to use a backup password and *not* store that on the remote.
 
 ## Running sheepdog on rabbit
 
@@ -59,14 +63,14 @@ where remote can be an IP address.
 
 Warning: if you introduce this `AllowUsers` command all users should be listed or people may get locked out of the machine.
 
-Next create a special key on the backup machine's ibackup user (just hit enter):
+Next create a special password-less key on the backup machine's ibackup user (just hit enter):
 
 ```
 su ibackup
 ssh-keygen -t ecdsa -f $HOME/.ssh/id_ecdsa_backup
 ```
 
-and copy the public key into the remote /home/bacchus/.ssh/authorized_keys
+and copy the public key into the remote /home/bacchus/.ssh/authorized_keys.
 
 Now test it from the backup server with
 
@@ -82,13 +86,20 @@ On the drop server you can track messages by
 tail -40 /var/log/auth.log
 ```
 
+or on recent linux with systemd
+
+```
+journalctl -r
+```
+
 Next
 
 ```
 ssh -v -i ~/.ssh/id_ecdsa_backup bacchus@dropserver
 ```
 
-should give a Broken pipe(!). In auth.log you may see something like
+should give a Broken pipe(!) or -- more recently -- it says `This service allows sftp connections only`.
+When running sshd with a verbose switch you may see something like
 
 fatal: bad ownership or modes for chroot directory component "/export/backup/"
 
@@ -106,10 +117,23 @@ So, as root
 ```
 cd /export
 mkdir -p backup/bacchus/drop
-chown bacchus.bacchus backup/bacchus/drop/
+chown bacchus:bacchus backup/bacchus/drop/
 chmod 0700 backup/bacchus/drop/
 ```
 
+Another error may be:
+
+```
+fusermount3: mount failed: Operation not permitted
+```
+
+This means you need to set the suid bit on the fusermount3 command. A bit nasty in Guix.
+
+```
+apt-get install fuse3 sshfs   # the package may be called 'fuse' on older systems
+chmod 4755 /usr/bin/fusermount3
+```
+
 If auth.log says error: /dev/pts/11: No such file or directory on ssh, or received disconnect (...) disconnected by user we are good to go!
 
 Note: at this stage it may pay to track the system log with
@@ -171,3 +195,56 @@ sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/.
 The recent scripts can be found at
 
 => https://github.com/genenetwork/gn-deploy-servers/blob/master/scripts/tux01/backup_drop.sh
+
+# borg-borg
+
+
+Backups work for production according to sheepdog. They run at 5am CST. Which (I guess) is OK. On the remote server we are going to forward the backup to a server on a different continent at 4pm GMT. I have been running that by hand lately, so time to sheepdog it!
+
+The manual command is
+
+```
+rsync -e "ssh -i ~/.ssh/id_ecdsa_borgborg" -vaP tux03 $HOST:/export/backup/bacchus/drop/
+```
+
+With sheepdog we can make it:
+
+```
+sheepdog_run.rb -v --tag "drop-mount-$name" -c "sshfs -o $SFTP_SETTING,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@$host:/ ~/mnt/$name"
+sheepdog_run.rb --always -v --tag "drop-rsync-$name" -c "rsync -vrltDP borg/* ~/mnt/$name/drop/$HOST/ --delete"
+sheepdog_run.rb -v --tag "drop-unmount-$name" -c "fusermount -u ~/mnt/$name"
+```
+
+For some reason this took a while to figure out. Part of it is that the machine on the other end has a rather slow CPU: an
+Intel(R) Celeron(R) CPU J1900 @ 1.99GHz, launched over 10 years ago. We still use it because of its low energy consumption. Once it starts pumping a file it is up to speed:
+
+```
+tux03/tux03-containers/data/0/239
+    154,501,120  29%   11.20MB/s    0:00:32
+```
+
+So one backup of a backup has started running and I made it a CRON job. Next stop is borgborg on the receiving HOST. The CRON job looks like
+
+```
+0 3 * * * env BORG_PASSPHRASE=none /home/wrk/iwrk/deploy/deploy/bin/sheepdog_borg.rb -t borgborg --always -v -b /export/backup/bacchus/borgborg/drop /export/backup/bacchus/drop --args '--stats' >> ~/cron.log 2>&1
+```
+
+Note the backups are already password protected - no need to do that again. This backup is going to go onto optical media twice a year with the password printed on the backup. That should keep it for 100 years.
+
+You can track this backup progress daily on the sheepdog status
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+I.e., in reverse order, the flow is:
+
+```
+2025-09-18 08:35:00 +0200	FAIL	host	borgborg-backup
+2025-09-18 16:19:45 -0500	SUCCESS	balg01	drop-rsync-zero
+2025-09-18 05:59:46 +0000	SUCCESS	tux03	mariadb-check
+2025-09-18 05:26:01 +0000	SUCCESS	tux03	drop-rsync-balg01
+2025-09-18 05:25:48 +0000	SUCCESS	tux03	borg-tux03-sql-backup
+2025-09-18 04:44:38 +0000	SUCCESS	tux03	mariabackup-make-consistent
+2025-09-18 04:44:25 +0000	SUCCESS	tux03	mariabackup-dump
+```
+
+The borgborg should be fixed now. I am missing the container backups. What is going on there? These were last backed up on 'Sun, 2025-09-14 00:00:52'. Ah, I set the CRON job to run once a week. That should be fixed now and it should show up.
diff --git a/topics/systems/backups-with-borg.gmi b/topics/systems/backups-with-borg.gmi
new file mode 100644
index 0000000..dbd9192
--- /dev/null
+++ b/topics/systems/backups-with-borg.gmi
@@ -0,0 +1,449 @@
+# Borg backups
+
+We use borg for backups. Borg is an amazing tool and after 25+ years of making backups it just feels right.
+With the new tux04 production install we need to organize backups off-site. The first step is to create a
+borg runner using sheepdog -- sheepdog is what we use for monitoring success/failure.
+Sheepdog essentially wraps a Unix command and sends a report to a local or remote redis instance.
+Sheepdog also includes a web server for output:
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+which I run on one of my machines.
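+
+A minimal sketch of what that wrapping looks like (the tag is arbitrary; real invocations appear later in this document):
+
+```
+sheepdog_run.rb -v --tag "my-job" -c "echo hello"
+```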
+
+# Tags
+
+* assigned: pjotrp
+* keywords: systems, backup, sheepdog, database
+
+# Install borg
+
+Usually I use a version of borg from guix. This should really be done as the borg user (ibackup).
+
+```
+ibackup@tux03:~$ mkdir ~/opt
+ibackup@tux03:~$ guix package -i borg -p ~/opt/borg
+~/opt/borg/bin/borg --version
+  1.2.2
+```
+
+# Create a new backup dir and user
+
+The backup should live on a *different* disk from the things we back up, so when that disk fails we have the other. In fact, in 2025 we had a corruption of the backups(!). We could recover from the original data + older backups. Not great. But if it had all been on the same disk it would have been worse.
+
+The SQL database lives on /export and the containers live on /export2. /export3 is a largish slow drive, so perfect.
+
+By convention I point /export/backup to the real backup dir on /export3/backup/borg/. Another convention is that we use an ibackup user which has the backup passphrase in ~/.borg-pass. As root:
+
+```
+mkdir /export/backup/borg
+chown ibackup:ibackup /export/backup/borg
+chown ibackup:ibackup /home/ibackup/.borg-pass
+su ibackup
+```
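+
+The /export/backup convention itself is just a symlink (a sketch; adjust to wherever the backup disk actually lives):
+
+```
+ln -s /export3/backup /export/backup
+```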
+
+Now you should be able to load the passphrase and create the backup dir
+
+```
+id
+  uid=1003(ibackup)
+. ~/.borg-pass
+cd /export/backup/borg
+~/opt/borg/bin/borg init --encryption=repokey-blake2 genenetwork
+```
+
+Note that we typically start from an existing backup. These go back a long time.
+
+Now we can run our first backup. Note that ibackup should be a member of the mysql and gn groups:
+
+```
+mysql:x:116:ibackup
+```
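+
+If that group membership is missing it can be added in the usual way (a sketch, assuming standard Debian tooling; log in again for it to take effect):
+
+```
+usermod -aG mysql ibackup
+```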
+
+# First backup
+
+Run the backup the first time:
+
+```
+id
+  uid=1003(ibackup) groups=1003(ibackup),116(mysql)
+~/opt/borg/bin/borg create --progress --stats genenetwork::first-backup /export/mysql/database/*
+```
+
+You may first need to update permissions to give group access:
+
+```
+chmod g+rx -R /var/lib/mysql/*
+```
+
+When that works borg reports:
+
+```
+Archive name: first-backup
+Archive fingerprint: 376d32fda9738daa97078fe4ca6d084c3fa9be8013dc4d359f951f594f24184d
+Time (start): Sat, 2025-02-08 04:46:48
+Time (end):   Sat, 2025-02-08 05:30:01
+Duration: 43 minutes 12.87 seconds
+Number of files: 799
+Utilization of max. archive size: 0%
+------------------------------------------------------------------------------
+                       Original size      Compressed size    Deduplicated size
+This archive:              534.24 GB            238.43 GB            237.85 GB
+All archives:              534.24 GB            238.43 GB            238.38 GB
+                       Unique chunks         Total chunks
+Chunk index:                  200049               227228
+------------------------------------------------------------------------------
+```
+
+50% compression is not bad. Borg is incremental, so it will only back up differences next round.
+
+Once borg works we could run a CRON job. But we should use the sheepdog monitor to make sure failing backups do not go unnoticed.
+
+# Using the sheepdog
+
+=> https://github.com/pjotrp/deploy sheepdog code
+
+## Clone sheepdog
+
+=> https://github.com/pjotrp/deploy#install sheepdog install
+
+Essentially clone the repo so it shows up in ~/deploy
+
+```
+cd /home/ibackup
+git clone https://github.com/pjotrp/deploy.git
+/export/backup/scripts/tux04/backup-tux04.sh
+```
+
+## Setup redis
+
+All sheepdog messages get pushed to redis. You can run it locally or remotely.
+
+By default we use redis, but syslog and others may also be used. The advantage of redis is that it is not bound to the same host, can cross firewalls using an ssh reverse tunnel, and is easy to query.
+
+=> https://github.com/pjotrp/deploy#install sheepdog install
+
+In our case we use redis on a remote host and the results get displayed by a webserver. Also some people get E-mail updates on failure. The configuration is in
+
+```
+/home/ibackup# cat .config/sheepdog/sheepdog.conf
+{
+  "redis": {
+    "host"  : "remote-host",
+    "password": "something"
+  }
+}
+```
+
+If you see localhost with port 6377 it is probably a reverse tunnel setup:
+
+=> https://github.com/pjotrp/deploy#redis-reverse-tunnel
+
+Update the fields according to what we use. The main thing is that this is the definition of the sheepdog->redis connector. If you also use sheepdog as another user, you'll need to add a config for that user too.
+
+Sheepdog should show a warning when you configure redis and it is not connecting.
+
+## Scripts
+
+Typically I run the cron job from root CRON so people can find it. Still it is probably a better idea to use an ibackup CRON. In my version a script is run that also captures output:
+
+```cron root
+0 6 * * * /bin/su ibackup -c /export/backup/scripts/tux04/backup-tux04.sh >> ~/cron.log 2>&1
+```
+
+The script contains something like
+
+```bash
+#! /bin/bash
+if [ "$EUID" -eq 0 ]
+  then echo "Please do not run as root. Run as: su ibackup -c $0"
+  exit
+fi
+rundir=$(dirname "$0")
+# ---- for sheepdog
+source $rundir/sheepdog_env.sh
+cd $rundir
+sheepdog_borg.rb -t borg-tux04-sql --group ibackup -v -b /export/backup/borg/genenetwork /export/mysql/database/*
+```
+
+and the accompanying sheepdog_env.sh:
+
+```
+export GEM_PATH=/home/ibackup/opt/deploy/lib/ruby/vendor_ruby
+export PATH=/home/ibackup/opt/deploy/deploy/bin:/home/wrk/opt/deploy/bin:$PATH
+```
+
+If it reports
+
+```
+/export/backup/scripts/tux04/backup-tux04.sh: line 11: /export/backup/scripts/tux04/sheepdog_env.sh: No such file or directory
+```
+
+you need to install sheepdog first.
+
+If all shows green (and takes some time) we made a backup. Check the backup with
+
+```
+ibackup@tux04:/export/backup/borg$ borg list genenetwork/
+first-backup                         Sat, 2025-02-08 04:39:50 [58715b883c080996ab86630b3ae3db9bedb65e6dd2e83977b72c8a9eaa257cdf]
+borg-tux04-sql-20250209-01:43-Sun    Sun, 2025-02-09 01:43:23 [5e9698a032143bd6c625cdfa12ec4462f67218aa3cedc4233c176e8ffb92e16a]
+```
+and you should see the latest. The contents with all files should be visible with
+
+```
+borg list genenetwork::borg-tux04-sql-20250209-01:43-Sun
+```
+
+Make sure you do not just see a symlink.
+
+# More backups
+
+Our production server runs databases and file stores that need to be backed up too.
+
+# Drop backups
+
+Once backups work it is useful to copy them to a remote server, so when the machine stops functioning we have another chance at recovery. See
+
+=> ./backup-drops.gmi
+
+# Recovery
+
+With tux04 we ran into a problem where all disks were getting corrupted(!) Probably due to the RAID controller, but we still need to figure that one out.
+
+Anyway, we have to assume the DB is corrupt, files are corrupt AND the backups are corrupt. Borg backup has checksums, which you can verify with
+
+```
+borg check repo
+```
+
+It has a --repair switch, which we needed to use to remove some faults in the backup itself:
+
+```
+borg check --repair repo
+```
+
+# Production backups
+
+Now backups were supposed to run, but they don't show up yet. Ah, it is not yet 3am CST. Meanwhile we drop the backups on another server. Just in case we lose *both* drives on the production server and/or the server itself. To achieve this we have set up a user 'bacchus' with limited permissions on the remote. All bacchus can do is copy the files across. So, we add an ssh key and invoke the commands:
+
+```
+sheepdog_run.rb -v --tag "drop-mount-$name" -c "sshfs -o $SFTP_SETTING,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@$host:/ ~/mnt/$name"
+sheepdog_run.rb --always -v --tag "drop-rsync-$name" -c "rsync -vrltDP borg/* ~/mnt/$name/drop/$HOST/ --delete"
+sheepdog_run.rb -v --tag "drop-unmount-$name" -c "fusermount -u ~/mnt/$name"
+```
+
+essentially mounting the remote dir, rsyncing files across, and unmounting - all monitored by sheepdog. Copying files over sshfs is not the fastest route, but it is very secure because of the limited permissions. On the remote we have space, and for now we'll use the old backups as a starting point. When it works I'll disable and remove the old tux04 backups. Actually, I'll disable the cron job now and make sure mariadb did not start (so no one can use that by mistake). All checked!
+
+Meanwhile the system log at the point of failure shows no information. This means it is a hard crash the Linux kernel is not even aware of, which suggests it is not a kernel/driver/software issue on our end. It really sucks. We'll work on it:
+
+=> tux04-disk-issues
+
+OK, so I prepared the old production backups on the remote and we ran an update by hand. After some fiddling with permissions it worked:
+
+```
+ibackup@tux03:/export/backup/scripts/tux03$ ./backup_drop_balg01.sh
+fusermount: entry for /home/ibackup/mnt/balg01 not found in /etc/mtab
+{:cmd=>"sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@balg01.genenetwork.org:/ ~/mnt/balg01", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :verbose=>true, :tag=>"drop-mount-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"}
+sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@balg01.genenetwork.org:/ ~/mnt/balg01
+No event to report <sheepdog_run>
+{:cmd=>"rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :always=>true, :verbose=>true, :tag=>"drop-rsync-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"}
+rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete
+sending incremental file list
+deleting genenetwork/integrity.1148
+(...)
+sent 22,153,007 bytes  received 352 bytes  3,408,209.08 bytes/sec
+total size is 413,991,028,933  speedup is 18,687.51
+{:time=>"2025-09-12 07:51:52 +0000", :elapsed=>5, :user=>"ibackup", :host=>"tux03", :command=>"rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete", :tag=>"drop-rsync-balg01", :stdout=>nil, :stderr=>nil, :status=>0, :err=>"SUCCESS"}
+Pushing out event <sheepdog_run> to <localhost:6377>
+{:cmd=>"fusermount -u ~/mnt/balg01", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :verbose=>true, :tag=>"drop-unmount-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"}
+fusermount -u ~/mnt/balg01
+No event to report <sheepdog_run>
+```
+
+And on the remote I can see the added backup:
+
+```
+tux03-new Wed, 2025-09-10 04:33:21 [dd4bbdc30898327b62d8ccdc63c5285f916d5643bffe942b73561fe297540eae]
+```
+
+All good. Now we add this to CRON and track sheepdog to see if there are problems popping up. It now confirms: 'SUCCESS	tux03	drop-rsync-balg01'.
+
+The backup drop setup is documented here:
+
+=> https://issues.genenetwork.org/topics/systems/backup-drops
+
+I am looking into setting up the backups again. Tux04 crashed a few days ago, yet again, so we were saved from that debacle! I rebooted to get at the old backups (they are elsewhere, but that is the latest). Setting up backups is slightly laborious, described here:
+
+=> https://issues.genenetwork.org/topics/systems/backups-with-borg
+
+we use sheepdog for monitoring
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+Code:
+
+=> https://github.com/pjotrp/deploy
+
+a tool that does a lot of checks in the background every day! Compressed backup sizes:
+
+```
+283G    genenetwork
+103G    tux04-containers
+```
+
+the local network speed between tux04 and tux03 is 100 Mbps. Not bad, but it takes more than an hour to move the data across.
+
+First manual backup worked:
+
+```
+ibackup@tux03:/export/backup/borg$ borg create genenetwork::tux03-new /export/mariadb/export/backup/mariadb/latest --stats --progress
+Archive name: tux03-new
+Archive fingerprint: dd4bbdc30898327b62d8ccdc63c5285f916d5643bffe942b73561fe297540eae
+Time (start): Wed, 2025-09-10 09:33:21
+Time (end):   Wed, 2025-09-10 10:02:52
+Duration: 29 minutes 31.00 seconds
+Number of files: 907
+Utilization of max. archive size: 0%
+------------------------------------------------------------------------------
+                       Original size      Compressed size    Deduplicated size
+This archive:              536.84 GB            238.56 GB              3.68 MB
+All archives:               65.60 TB             29.15 TB            303.71 GB
+
+                       Unique chunks         Total chunks
+Chunk index:                  253613             24717056
+------------------------------------------------------------------------------
+```
+
+Next we set up sheepdog for monitoring automated backups. Next to the
+code repos we have a script repo at
+'tux02.genenetwork.org:/home/git/pjotrp/gn-deploy-servers' which
+currently handles monitoring for our servers, including: bacchus epysode
+octopus01 penguin2 rabbit shared thebird tux01 tux02 tux04. Now tux03. The main backup script looks like
+
+```
+rm -rf $backupdir/latest
+tag="mariabackup-dump"
+sheepdog_run.rb --always -v --tag $tag -c "mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELEC
+--kill-long-queries-timeout=120 --target-dir=$backupdir/latest/ --user=webqtlout --password=webqtlout"
+tag="mariabackup-make-consistent"
+sheepdog_run.rb --always -v --tag $tag -c "mariabackup --prepare --target-dir=$backupdir/latest/"
+sheepdog_borg.rb -t borg-tux04-sql --always --group ibackup -v -b /export/backup/borg/genenetwork $backupdir --args '--stats'
+```
+
+What it does is make a full copy of the mariadb databases and make sure it is consistent. Next we use borg to make a backup. The reason a DB needs a consistent copy is that the running DB may change during the backup - and that is no good! We use sheepdog to monitor these commands, i.e. on failure we get notified. First we run it by hand to make sure it works. The first errors, for example:
+
+```
+ibackup@tux03:/export/backup/scripts/tux03$ ./backup.sh
+{:cmd=>"mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=/export/backup/mariadb/latest/ --user=webqtlout --password=webqtlout", :channel=>"run", :host=>"localhost", : port=>6379, :always=>true, :verbose=>true, :tag=>"mariabackup-dump", :config=>"/home/ibackup/.redis.conf"} mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-di r=/export/backup/mariadb/latest/ --user=webqtlout --password=webqtlout
+[00] 2025-09-10 10:31:19 Connecting to MariaDB server host: localhost, user: webqtlout, password: set, port: not set, s
+ocket: not set
+[00] 2025-09-10 10:31:19 Using server version 10.11.11-MariaDB-0+deb12u1-log
+(...)
+[00] 2025-09-10 10:31:19 InnoDB: Using liburing
+[00] 2025-09-10 10:31:19 mariabackup: The option "innodb_force_recovery" should only be used with "--prepare".
+[00] 2025-09-10 10:31:19 mariabackup: innodb_init_param(): Error occurred.
+```
+
+The good thing is that the actual command is listed, so we can fix things a step at a time.
+
+```
+mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=/export/backup/mariadb/latest/ --user=webqtlout --password=*
+```
+
+I had to disable 'innodb_force_recovery=1' to make it work. Also permissions have to allow the backup user with 'chmod u+rX -R /var/lib/mysql/*'.
+
+Now that that works, I need to make sure sheepdog can send its updates to the remote machine (in NL). It is a bit complicated, because we set up an ssh tunnel that can only run redis commands. It looks like:
+
+```
+3 * * * * /usr/bin/ssh -i ~/.ssh/id_ecdsa_sheepdog -f -NT -o ServerAliveInterval=60 -L 6377:127.0.0.1:6379 redis-tun@sheepdog.genenetwork.org >> tunnel.log 2>&1
+```
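+
+A quick way to verify that the tunnel actually carries redis (assuming redis-cli is available; the password is whatever the sheepdog config holds):
+
+```
+redis-cli -p 6377 -a '<password>' ping   # should reply PONG
+```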
+
+Now when I run sheepdog_status it reports
+
+```
+2025-09-10 06:01:02 -0500 (@tux04) FAIL 1 <00m00s> mariadb-test02
+2025-09-10 06:01:02 -0500 (@tux04) FAIL 1 <00m00s> mariadb-test01
+```
+
+which is correct because I switched mariadb off on tux04!
+
+Now Mariadb on tux03 is showing errors. The problem is that it actually is in an inconsistent state (sigh). Basically I am getting endless errors like:
+
+```
+Retrying read of log at LSN=1537842295040
+Retrying read of log at LSN=1537842295040
+Retrying read of log at LSN=1537842295040
+```
+
+There is a way to fix the replay log - probably harmless in our case.
+
+But what we *should* do is move this database out of the way - I may need it for Arthur - and do a proper backup recovery. I fired off an E-mail to Arthur and started recovery. It also takes an hour to extract a borg backup of this size. Meanwhile I keep GN running in parallel using the old DB. A bit of extra work, but less work than trying to recover from a broken DB. The good thing is we get to test backups. Btw, this is exactly why it is *not* easy to migrate/update/copy/sync databases by 'just copying files' - they are too easily in an inconsistent state. There was some E-mail thread about that this year. Maybe it is a flaw of mysql/mariadb that the replay log is inconsistent when it is left open.
+
+```
+ibackup@tux03:/export/mariadb/restore$ borg extract /export/backup/borg/genenetwork::borg-tux04-sql-20250906-04:16-Sat --progress
+ 71.1% Extracting: export/backup/mariadb/latest/db_webqtl/ProbeSetData.MYI
+```
+
+So we rolled back the DB until further complaints. And made a new backup... This is how we keep ourselves busy.
+
+Turns out the new backup is problematic too! It completes, but still has redo issues. It ends with:
+
+```
+Redo log (from LSN 1537842295024 to 1537842295040) was copied.
+```
+
+The error was
+
+```
+Retrying read of log at LSN=1537842295040
+```
+
+so it is the last record (or all of them!). Cranky. I used
+
+```
+RESET MASTER
+```
+
+to clear out the redo log. It says 'Log flushed up to   1537842295040'. Good. Try another backup. Still not working. The mysql log says '[Warning] Could not read packet: fd: 24  state: 1  read_length: 4  errno: 11  vio_errno: 1158  length: 0'. But this does not appear to be related.
+
+```
+perror 11
+OS error code  11:  Resource temporarily unavailable
+```
+
+hmmm. Still not related. The error relates to the file:
+
+```
+ls -l /proc/574984/fd|grep '24 '
+lrwx------ 1 mysql mysql 64 Sep 11 07:46 124 -> /export/mariadb/export/backup/mariadb/latest/db_webqtl/IndelAll.ibd
+```
+
+Probably a good idea to check all tables! OK, let's test this table first.
+
+```
+mysqlcheck -c db_webqtl -u webqtlout -pwebqtlout IndelAll
+db_webqtl.IndelAll                                 OK
+```
+
+looks OK. Try all
+
+```
+time mysqlcheck -c -u webqtlout -pwebqtlout db_webqtl
+real    33m39.642s
+```
+
+all tables are good. Alright, I think we can make backups and the warning may go away with a future mariadb version. My assessment is that this Warning is harmless. Let's move forward by setting up sheepdog and borg backup. First backup run should show up soon as 'SUCCESS	tux03	borg-tux03-sql-backup' in
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+Now that it works, I add it as a CRON job to run daily. Sheepdog will tell me whether we are healthy or not.
+
+
+
+## Backups (part 3)
+
+As an aside: last night, according to sheepdog, tux03 made a perfect backup run and dropped the data on a server in a different location.
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+There is more to do, however. First of all we don't back up everything. We should also back up the containers and the state of the machine. Finally, we need to make sure the backups themselves are backed up(!). The reason is that if a backup is corrupted, the corruption will just propagate - it has happened to us. A backup of a backup will have sane versions from before the corruption. These days you also have to anticipate bad actors injecting stuff, which you won't find if they penetrated the backup system. We are quite keen on having offline backups for that reason alone.
+
+For backup of the containers we need to run as root (unfortunately). I see now we did not have a proper backup on tux04 - the last one was from 2025-03-04. We generate these containers, but it is still a bad idea not to back up the small databases. Anyway, first add the containers and more state to the backup. I set it up and added the CRON job. See if it pops up on sheepdog.
diff --git a/topics/systems/ci-cd.gmi b/topics/systems/ci-cd.gmi
index 6aa17f2..e20a37a 100644
--- a/topics/systems/ci-cd.gmi
+++ b/topics/systems/ci-cd.gmi
@@ -1,4 +1,6 @@
-# CI/ CD for genetwork projects
+# CI/CD for genetwork projects
+
+Continuous integration (CI) and continuous deployment (CD) are critical parts of making sure software development does not mess up our deployment(s).
 
 We need various levels of tests to be run, from unit tests to the more complicated ones like integration, performance, regression, etc tests, and of course, they cannot all be run for each and every commit, and will thus need to be staggered across the entire deployment cycle to help with quick iteration of the code.
 
@@ -22,16 +24,86 @@ As part of the CI/CD effort, it is necessary that there is
 GeneNetwork is interested in doing two things on every commit (or
 periodically, say, once an hour/day):
 
-* CI: run unit tests
+* CI: run unit tests on git push
 * CD: rebuild and redeploy a container running GN3
 
-Arun has figured out the CI part. It runs a suitably configured laminar CI service in a Guix container created with `guix system container'. A cron job periodically triggers the laminar CI job (note: this is no longer true).
+Arun has figured out the CI part. It runs a suitably configured laminar CI service in a Guix container created with `guix system container'.
 
 => https://git.systemreboot.net/guix-forge/about/
 
-CD hasn't been figured out. Normally, Guix VMs and containers created by `guix system` can only access the store read-only. Since containers don't have write access to the store, you cannot `guix build' from within a container or deploy new containers from within a container. This is a problem for CD. How do you make Guix containers have write access to the store?
+We have the quick-running tests, e.g. unit tests, run on each commit to branch "main". Once those are successful, the CI/CD system we choose should automatically pick the latest commit that passed the quick-running tests for further testing and deployment.
+Once the next battery of tests has passed, the CI/CD system will create a build/artifact to be deployed to staging and have the next battery of tests run against it. If that passes, then that artifact can be deployed to production with details of the commit and deployment dependencies.
+
+## Adding a web-hook
+
+### Github hooks
+
+IIRC actions run artifacts inside github's infrastructure. We use webhooks instead, e.g.:
+
+Update the hook at
+
+=> https://github.com/genenetwork/genenetwork3/settings/hooks
+
+A web hook basically calls an endpoint on a git push event. The webhook for genenetwork3 has not been called recently (ever? it says: 'This hook has never been triggered.'). The webhook for genenetwork2, however, has been called.
+
+=> ./screenshot-github-webhook.png
+
+To trigger CI manually, run this with the project name:
+
+```
+curl https://ci.genenetwork.org/hooks/example-gn3
+```
+
+I just tested and it appeared this triggered a redeploy of gn2:
+
+```
+curl -XGET "https://ci.genenetwork.org/hooks/genenetwork2
+```
+
+For gemtext we have a github hook that adds a forge-project and looks like
+
+```lisp
+(define gn-gemtext-threads-project
+  (forge-project
+   (name "gn-gemtext-threads")
+   (repository "https://github.com/genenetwork/gn-gemtext-threads/")
+   (ci-jobs (list (forge-laminar-job
+                   (name "gn-gemtext-threads")
+                   (run (with-packages (list nss-certs openssl)
+                          (with-imported-modules '((guix build utils))
+                            #~(begin
+                                (use-modules (guix build utils))
+
+                                (setenv "LC_ALL" "en_US.UTF-8")
+                                (invoke #$(file-append tissue "/bin/tissue")
+                                        "pull" "issues.genenetwork.org"))))))))
+   (ci-jobs-trigger 'webhook)))
+```
+
+The normal trigger is automatic: you push code to either of the two repos (three? I'll verify), GN2 and GN3, and laminar runs the jobs, updates the code in the container, and restarts services as appropriate.
+
+If you want to trigger the CI manually, there are webhooks for that which can be called with something like:
+
+```
+curl -XGET "https://ci.genenetwork.org/hooks/genenetwork2"
+```
+
+for GN2. Change the part after /hooks/ for each of the different repos as follows:
+
+```
+GN2: /genenetwork2
+GN3: /genenetwork3
+gn-auth: /gn-auth (I need to verify this)
+gn-uploader: Does not exist right now
+```
+
+Guix forge can be found at
+
+=> https://git.systemreboot.net/guix-forge/
+
+### git.genenetwork.org hooks
 
-Another alternative for CI/ CID were to have the quick running tests, e.g unit tests, run on each commit to branch "main". Once those are successful, the CI/CD system we choose should automatically pick the latest commit that passed the quick running tests for for further testing and deployment, maybe once an hour or so. Once the next battery of tests is passed, the CI/CD system will create a build/artifact to be deployed to staging and have the next battery of tests runs against it. If that passes, then that artifact could be deployed to production, and details on the commit and
+TBD
 
 #### Possible Steps
 
@@ -81,7 +153,7 @@ Below are some possible steps (and tasks) to undertake for automated deployment
 * Generate guix declaration for re-generating the release
 * Archive container image, documentation and guix declaration for possible rollback
 
-#### Some Work Done
+#### See also
 
 => /topics/systems/gn-services GN Services
 
diff --git a/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi b/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi
index 131474c..f3cbbd6 100644
--- a/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi
+++ b/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi
@@ -1,12 +1,59 @@
 # Debugging and developing code
 
-Once we get to the stage of having a working system container it would be nice to develop code against it. The idea is to take an existing running system container and start modifying code *inside* the container by brining in an external path.
+Once we get to the stage of having a working system container it would be nice to develop code against it. The idea is to take an existing running system container and start modifying code *inside* the container by bringing in an external path.
 
-First build and start a guix system container as described in
+In principle we'll build guix system containers as described in
 
-=> /topics/guix/guix-system-containers-and-how-we-use-them.gmi
+=> /topics/guix/guix-system-containers-and-how-we-use-them
 
-The idea is to do less `guix pull' and system container builds, so as to speed up development. The advantage of using an existing system container is that the full deployment is the same on our other running systems! No more path hacks, in other words.
+The idea is to minimise `guix pull' and system container builds, so as to speed up development. The advantage of using an existing system container is that the full deployment is the same on our other running systems! No more path hacks, in other words.
+
+## Philosophy
+
+For development containers we will:
+
+* Use sane default values - for URLs, paths etc.
+* Add services incrementally (i.e., not one big blob)
+* Run tests inside the container (not during build time)
+* Build indexes etc. outside the container - or make it optional
+
+Also:
+
+* We should be able to run gn3 and gn-guile (aka gn4) as a guix shell without anything else
+* We should be able to run gn2 with only gn3 and/or gn-guile as a guix shell with external DBs.
+* We should be able to run gn2+gn3+gn-guile as a system container with external DBs.
+* We should be able to run gn-auth with gn2 as a system container
+* We should be able to run the uploader as a system container
+
+I.e. no https and no authentication by default (as long as we run on localhost). The localhost URLs and file paths can be defaults, because there will only be one development container running on a single machine.
+
+System containers are a bit overkill for development. Still, in some cases we'll need a system container. For example when testing integration of gn-auth, uploader, gn2 etc. We have the CD deployment that gets updated when git repos change. We also have a development container written by @bonz that needs to be improved and documented.
+
+=> https://git.genenetwork.org/gn-machines/tree/genenetwork-local-container.scm?h=gn-local-development-container
+
+Note it is on a special branch for now.
+
+Databases and files will simply be shared on default paths - /export/guix-containers/gndev/...
+If you need different combinations, it should be relatively easy to compose a new shell or system container.
+
+# Tags
+
+* type: bug
+* status: open
+* priority: high
+* assigned: pjotrp
+* interested: pjotrp,bonfacem,fredm
+* keywords: development, deployment, server
+
+# Tasks
+
+Create a dev environment for:
+
+* [ ] GN3
+* [ ] gn-guile
+* [ ] GN2
+* [ ] gn-auth
+* [ ] gn-uploader
 
 # GN3 in system container
 
@@ -258,6 +305,12 @@ guix-vm-run:
   $cmd
 ```
 
+## Virtuoso in a system container
+
+See
+
+=> ./virtuoso
+
 # Troubleshooting
 
 ## Updating the VM does not show latest fixes
diff --git a/topics/systems/dns-changes.gmi b/topics/systems/dns-changes.gmi
index 7f1d8f1..30aae58 100644
--- a/topics/systems/dns-changes.gmi
+++ b/topics/systems/dns-changes.gmi
@@ -27,6 +27,7 @@ We are moving thing to a new DNS hosting service. We have accounts on both. To m
 * Import DNS settings on DNSimple (cut-N-paste)
   + Edit delegation - make sure the delegation box is set
 => https://support.dnsimple.com/articles/delegating-dnsimple-registered
+  + Registration menu item comes up after transfer...
 * Approve transfer on GoDaddy a few minutes later (!!), see
   + https://dcc.godaddy.com/control/transfers
 * Add DNSSec
diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi
index a0a2f16..d034575 100644
--- a/topics/systems/hpc/octopus-maintenance.gmi
+++ b/topics/systems/hpc/octopus-maintenance.gmi
@@ -2,10 +2,23 @@
 
 ## Slurm
 
-Status of slurm
+Status of slurm (as of 2025-12)
 
 ```
 sinfo
+workers*     up   infinite      8   idle octopus[03,05-11]
+allnodes     up   infinite      3  alloc tux[06,08-09]
+allnodes     up   infinite     11   idle octopus[02-03,05-11],tux[05,07]
+tux          up   infinite      3  alloc tux[06,08-09]
+tux          up   infinite      2   idle tux[05,07]
+1tbmem       up   infinite      1   idle octopus02
+headnode     up   infinite      1   idle octopus01
+highmem      up   infinite      2   idle octopus[02,11]
+386mem       up   infinite      6   idle octopus[03,06-10]
+lowmem       up   infinite      7   idle octopus[03,05-10]
+```
+
+```
 sinfo -R
 squeue
 ```
@@ -29,7 +42,7 @@ UnkillableStepProgram   = (null)
 UnkillableStepTimeout   = 60 sec
 ```
 
-check valid configuration with `slurmd -C` and update nodes with
+check valid configuration with 'slurmd -C' and update nodes with
 
 ```
 scontrol reconfigure
@@ -45,13 +58,13 @@ Basically the root user can copy across.
 
 ## Execute binaries on mounted devices
 
-To avoid `./scratch/script.sh: Permission denied` on `device_file`:
+To avoid './scratch/script.sh: Permission denied' on 'device_file':
 
-- `sudo bash`
-- `ls /scratch -l` to check where `/scratch` is
-- `vim /etc/fstab`
-- replace `noexec` with `exec` for `device_file`
-- `mount -o remount [device_file]` to remount the partition with its new configuration.
+- 'sudo bash'
+- 'ls /scratch -l' to check where '/scratch' is
+- 'vim /etc/fstab'
+- replace 'noexec' with 'exec' for 'device_file'
+- 'mount -o remount [device_file]' to remount the partition with its new configuration.
 
 Some notes:
 
@@ -67,7 +80,7 @@ x-systemd.device-timeout=
 10.0.0.110:/export/3T  /mnt/3T  nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
 
 
-## Installation of `munge` and `slurm` on a new node
+## Installation of 'munge' and 'slurm' on a new node
 
 Current nodes in the pool have:
 
@@ -78,7 +91,7 @@ sbatch --version
     slurm-wlm 18.08.5-2
 ```
 
-To install `munge`, go to `octopus01` and run:
+To install 'munge', go to 'octopus01' and run:
 
 ```shell
 guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
@@ -86,7 +99,7 @@ guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
 systemctl status munge # to check if the service is running and where its service file is
 ```
 
-We need to setup the rights for `munge`:
+We need to setup the rights for 'munge':
 
 ```shell
 sudo bash
@@ -100,7 +113,7 @@ mkdir -p /var/lib/munge
 chown munge:munge /var/lib/munge/
 
 mkdir -p /etc/munge
-# copy `munge.key` (from a working node) to `/etc/munge/munge.key`
+# copy 'munge.key' (from a working node) to '/etc/munge/munge.key'
 chown -R munge:munge /etc/munge
 
 mkdir -p /run/munge
@@ -112,7 +125,7 @@ chown munge:munge /var/log/munge
 mkdir -p /var/run/munge # todo: not sure why it needs such a folder
 chown munge:munge /var/run/munge
 
-# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service`
+# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service'
 
 systemctl daemon-reload
 systemctl enable munge
@@ -120,25 +133,25 @@ systemctl start munge
 systemctl status munge
 ```
 
-To test the new installation, go to `octopus01` and then:
+To test the new installation, go to 'octopus01' and then:
 
 ```shell
 munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
 ```
 
-If you get `STATUS: Rewound credential (16)`, it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
+If you get 'STATUS: Rewound credential (16)', it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
 
 ```shell
 sudo date MMDDhhmmYYYY.ss
 ```
 
-To install `slurm`, go to `octopus01` and run:
+To install 'slurm', go to 'octopus01' and run:
 
 ```shell
 guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
 ```
 
-We need to setup the rights for `slurm`:
+We need to setup the rights for 'slurm':
 
 ```shell
 sudo bash
@@ -152,8 +165,8 @@ mkdir -p /var/lib/slurm
 chown munge:munge /var/lib/slurm/
 
 mkdir -p /etc/slurm
-# copy `slurm.conf` to `/etc/slurm/slurm.conf`
-# copy `cgroup.conf` to `/etc/slurm/cgroup.conf`
+# copy 'slurm.conf' to '/etc/slurm/slurm.conf'
+# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf'
 
 chown -R slurm:slurm /etc/slurm
 
@@ -163,7 +176,7 @@ chown slurm:slurm /run/slurm
 mkdir -p /var/log/slurm
 chown slurm:slurm /var/log/slurm
 
-# copy `slurm.service` to `/etc/systemd/system/slurm.service`
+# copy 'slurm.service' to '/etc/systemd/system/slurm.service'
 
 /export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information
 
@@ -173,12 +186,24 @@ systemctl start slurm
 systemctl status slurm
 ```
 
-On `octopus01` (the master):
+On 'octopus01' (the master):
 
 ```shell
 sudo bash
 
-# add the new node to `/etc/slurm/slurm.conf`
+# add the new node to '/etc/slurm/slurm.conf'
 
 systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
 ```
+
+
+# Removing a node
+
+We are removing octopus03 (o3) so it can become the new head node:
+
+```
+scontrol update nodename=octopus03 state=drain reason="removing"
+scontrol show node octopus03 | grep State
+scontrol update nodename=octopus03 state=down reason="removed"
+  State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+```
diff --git a/topics/systems/hpc/performance.gmi b/topics/systems/hpc/performance.gmi
index ce6a111..ee604b5 100644
--- a/topics/systems/hpc/performance.gmi
+++ b/topics/systems/hpc/performance.gmi
@@ -12,6 +12,23 @@ For disk speeds make sure there is no load and run
 hdparm -Ttv /dev/sdc1
 ```
 
+Cheap and cheerful:
+
+Write test:
+
+```
+dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct
+```
+
+Read test:
+
+```
+/sbin/sysctl -w vm.drop_caches=3
+dd if=./test of=/dev/null bs=512k count=2048
+```
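+
+A rough Python equivalent of the read test (a sketch; assumes the ./test file from the write test exists and that the caches were dropped first):
+
+```
+import time
+
+CHUNK = 512 * 1024
+start, nread = time.time(), 0
+with open("./test", "rb") as fh:
+    while block := fh.read(CHUNK):  # stream the file in 512k chunks
+        nread += len(block)
+elapsed = time.time() - start
+print(f"{nread / elapsed / 1e6:.1f} MB/s over {nread / 1e6:.0f} MB")
+```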
+
+
+
 ## Networking
 
 To check the network devices installed use
diff --git a/topics/systems/linux/GPU-on-balg01.gmi b/topics/systems/linux/GPU-on-balg01.gmi
new file mode 100644
index 0000000..d0cb3fc
--- /dev/null
+++ b/topics/systems/linux/GPU-on-balg01.gmi
@@ -0,0 +1,201 @@
+# Installing GPU on Balg01 server
+
+lspci shows the card, an L4
+
+=> https://www.techpowerup.com/gpu-specs/l4.c4091
+
+```
+lspci|grep NVIDIA
+NVIDIA Corporation AD104GL
+```
+
+The machine had raspi and Tesla support installed (?!), so I removed that:
+
+```
+apt-get remove firmware-nvidia-tesla-gsp
+```
+
+Disabled nouveau drivers
+
+```/etc/modprobe.d/blacklist-nouveau.conf
+blacklist nouveau
+options nouveau modeset=0
+```
+
+```
+dpkg --purge raspi-firmware
+update-initramfs -u
+reboot (can skip for a bit)
+```
+
+## Create fallback boot partition
+
+Well, before rebooting I should have created another fallback boot partition with a more recent Debian.
+Unfortunately I had not prepared space on one of the disks (something I normally do). It turned out /dev/sdc on /export3 was not really used lately, so I could move that data and reuse that partition.
+
+```
+/dev/sdc1       1.8T  552G  1.2T  33% /export3
+```
+
+It is a very slow drive (btw), not sure why. I ran badblocks but it did not make a difference. The logs show:
+
+```
+Oct 04 09:34:37 balg01 kernel: I/O error, dev sdc, sector 23392285 op 0x9:(WRITE_ZEROES) flags 0x8000000 >
+O
+```
+
+but it looks more like a driver problem than an actual disk error. Well, maybe on the new Debian install it will be fine.
+At this point it is just a fallback boot partition we are installing, so no real worries.
+
+After setting things up with debootstrap, grub etc. the old partition came back fine and I verified I can also boot into the new Debian install. Especially with remote servers this is a great comfort.
+
+## CUDA continued
+
+Now that we have a fallback boot partition it is a bit easier to mess with CUDA drivers.
+
+To install the CUDA drivers you may need to disable 'secure boot' in the bios.
+
+```
+apt install build-essential gcc make cmake dkms
+apt install linux-headers-$(uname -r)
+```
+
+In the driver selector for Debian, choose data center and the L series: Driver Version: 580.95.05, CUDA Toolkit: 13.0, Release Date: Wed Oct 01, 2025, File Size: 844.44 MB.
+
+Note I installed the nvidia-open drivers. If things are not working we should look at the proprietary stuff. I used the 'local repository installation' instructions from
+
+=> https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#debian-installation
+
+
+```
+apt-get install nvidia-libopencl1 nvidia-open nvidia-driver-cuda
+```
+
+The first one is to prevent
+
+```
+libnppc11 : Conflicts: nvidia-libopencl1
+```
+
+Now this should run:
+
+```
+balg01:~# nvidia-smi
+Sat Oct  4 11:56:19 2025
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
++-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  NVIDIA L4                      Off |   00000000:81:00.0 Off |                    0 |
+| N/A   57C    P0             29W /   72W |       0MiB /  23034MiB |      2%      Default |
+|                                         |                        |                  N/A |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|  No running processes found                                                             |
++-----------------------------------------------------------------------------------------+
+```
+
+## Testing GPU
+
+
+Using Guix Python I ran:
+
+```
+pip install "gpu-benchmark-tool[nvidia]"
+```
+
+Of course it downloads a ridiculous number of binaries... But then we can run:
+
+```
+export PATH=/home/wrk/.local/bin:$PATH
+gpu-benchmark benchmark --duration=30
+```
+
+That did not work. The CUDA samples are packaged in Debian and require building:
+
+```
+apt-get install nvidia-cuda-samples nvidia-cuda-toolkit-gcc
+cd /usr/share/doc/nvidia-cuda-toolkit/examples/Samples/6_Performance/transpose
+export CUDA_PATH=/usr
+make
+./transpose
+> [NVIDIA L4] has 58 MP(s) x 128 (Cores/MP) = 7424 (Cores)
+> Compute performance scaling factor = 1.00
+...
+Test passed
+```
+
+Note that this removed nvidia-smi. Let's look at versions:
+
+```
+pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb
+pool/contrib/n/nvidia-cuda-samples/nvidia-cuda-samples_11.8~dfsg-2_all.deb
+pool/non-free/n/nvidia-cuda-toolkit/nvidia-cuda-toolkit-gcc_11.8.0-5~deb12u1_amd64.deb
+pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb
+```
+
+while
+
+```
+Filename: ./nvidia-open_580.95.05-1_amd64.deb
+Package: nvidia-driver-cuda
+Version: 580.95.05-1
+Section: NVIDIA
+Source: nvidia-graphics-drivers
+Provides: nvidia-cuda-mps, nvidia-smi
+```
+
+and it turns out to be a mixture. I have to take real care not to mix in Debian packages! For example this package is a Debian original:
+
+```
+ii  nvidia-cuda-gdb                             11.8.86~11.8.0-5~deb12u1                amd64        NVIDIA CUDA Debugger (GDB)
+```
+
+```
+apt remove --purge nvidia-* cuda-* libnvidia-*
+```
+
+says
+
+```
+Note, selecting 'libnvidia-gpucomp' instead of 'libnvidia-gpucomp-580.95.05'
+```
+
+To view installed packages belonging to Debian itself:
+
+```
+dpkg -l|grep nvid|grep deb12
+dpkg -l|grep cuda|grep deb12
+```
+
+Let's reinstall and make sure only NVIDIA packages are used:
+
+```
+wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
+dpkg -i cuda-keyring_1.1-1_all.deb
+apt-get update
+apt-get install cuda-toolkit cuda-compiler-12-2
+```
+
+Now we have:
+
+```
+/usr/local/cuda-12.3/bin/nvcc --version
+nvcc: NVIDIA (R) Cuda compiler driver
+Copyright (c) 2005-2023 NVIDIA Corporation
+Built on Wed_Nov_22_10:17:15_PST_2023
+```
+
+## PyTorch
+
+The CUDA environment variables for PyTorch are probably useful:
+
+=> https://docs.pytorch.org/docs/stable/cuda_environment_variables.html
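+
+A minimal check that PyTorch actually sees the L4 (a sketch; assumes a CUDA-enabled torch install):
+
+```
+import torch
+
+print(torch.cuda.is_available())      # expect True
+print(torch.cuda.get_device_name(0))  # expect "NVIDIA L4"
+print(torch.version.cuda)             # CUDA version torch was built against
+```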
diff --git a/topics/systems/linux/add-boot-partition.gmi b/topics/systems/linux/add-boot-partition.gmi
new file mode 100644
index 0000000..564e044
--- /dev/null
+++ b/topics/systems/linux/add-boot-partition.gmi
@@ -0,0 +1,52 @@
+# Add (2nd) boot and other partitions
+
+As we handle machines remotely it is often useful to have a secondary boot partition that can be used from grub.
+
+Basically, create a similarly sized boot partition on a different disk and copy the running one over with:
+
+```
+parted -a optimal /dev/sdb
+(parted) p
+Model: NVMe CT4000P3SSD8 (scsi)
+Disk /dev/sdb: 4001GB
+Sector size (logical/physical): 512B/512B
+Partition Table: gpt
+Disk Flags:
+
+Number  Start   End     Size    File system  Name  Flags
+ 1      32.0GB  4001GB  3969GB  ext4         bulk
+
+(parted) rm 1
+mklabel gpt
+mkpart fat23 1 1GB
+set 1 esp on
+align-check optimal 1
+mkpart ext4 1GB 32GB
+mkpart swap 32GB 48GB
+set 2 boot on # other flags are raid, swap, lvm
+set 3 swap on
+mkpart scratch 48GB 512GB
+mkpart ceph 512GB -1
+```
+
+We also took the opportunity to create a new scratch partition (for moving things around) and a ceph partition (for testing).
+Resulting in
+
+```
+Number  Start   End     Size    File system  Name     Flags
+ 1      1049kB  1000MB  999MB                fat23    boot, esp
+ 2      1000MB  24.0GB  23.0GB               ext4     boot, esp
+ 3      24.0GB  32.0GB  8001MB               swap     swap
+ 4      32.0GB  512GB   480GB   ext4         scratch
+ 5      512GB   4001GB  3489GB               ceph
+```
+
+Now that the drive is ready we can copy the existing boot partitions. Make sure you don't get the direction wrong and that the target partitions are larger (see the sketch after the dd commands).
+Here the original boot disk is /dev/sda (894GB). We copy that to the new disk /dev/sdb (3.64TB):
+
+```
+root@tux05:/home/wrk# dd if=/dev/sda1 of=/dev/sdb1
+root@tux05:/home/wrk# dd if=/dev/sda2 of=/dev/sdb2
+```
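+
+A quick sanity-check sketch for the size requirement (partition sizes under /sys are given in 512-byte sectors):
+
+```
+def size_bytes(dev):  # e.g. "sda1"
+    with open(f"/sys/class/block/{dev}/size") as fh:
+        return int(fh.read()) * 512
+
+for src, dst in [("sda1", "sdb1"), ("sda2", "sdb2")]:
+    assert size_bytes(dst) >= size_bytes(src), f"{dst} is smaller than {src}"
+```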
+
+Next, test mount the dirs and reboot. You may want to run e2fsck and resize2fs on the new partitions (or their equivalent if you use xfs or something).
diff --git a/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi
new file mode 100644
index 0000000..81e721f
--- /dev/null
+++ b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi
@@ -0,0 +1,74 @@
+# GPU Graphics Driver Set-Up
+
+Tux02 has the Tesla K80 (GK210GL) GPU.  For machine learning, we want the official proprietary NVIDIA drivers.
+
+## Installation
+
+* Debian 12 moved the NVIDIA driver into the non-free-firmware repo.  Add the following to "/etc/apt/sources.list" and run "sudo apt update":
+
+```
+deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
+```
+
+* Make sure the correct kernel headers are installed:
+
+```
+sudo apt install linux-headers-$(uname -r)
+```
+
+* Install "nvidia-tesla-470-driver"⁰ (The NVIDIA line-up of programmable "Tesla" devices, used primarily for simulations and large-scale calculations, also require separate driver packages to function correctly compared to the consumer-grade GeForce GPUs that are instead targeted for desktop and gaming usage)¹:
+
+```
+sudo apt purge 'nvidia-*'
+sudo apt install nvidia-tesla-470-driver
+```
+
+* Blacklist nouveau since it conflicts with NVIDIA's driver, and regenerate the initramfs with "sudo update-initramfs -u":
+
+```
+echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
+echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
+```
+
+* Reboot and test the nvidia drivers:
+
+```
+sudo reboot
+nvidia-smi
+
+# optional if you want to use nvidia-cuda-toolkit
+sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit
+```
+
+## Issues
+
+Holding off on reboot until I check in with the rest of the team regarding some initramfs raspi hook:
+
+```
+update-initramfs: Generating /boot/initrd.img-6.1.0-9-amd64
+raspi-firmware: missing /boot/firmware, did you forget to mount it?
+run-parts: /etc/initramfs/post-update.d//z50-raspi-firmware exited with return code 1
+dpkg: error processing package initramfs-tools (--configure):
+ installed initramfs-tools package post-installation script subprocess returned error exit status 1
+Processing triggers for libgdk-pixbuf-2.0-0:amd64 (2.42.10+dfsg-1+deb12u1) ...
+Errors were encountered while processing:
+ initramfs-tools
+```
+
+Removed the firmware by running:
+
+```
+sudo apt purge raspi-firmware
+
+# Configure all packages that are installed but not yet fully configured
+sudo dpkg --configure -a
+
+# Update initramfs since we updated our drivers
+sudo update-initramfs -u
+```
+
+## References
+
+=> https://us.download.nvidia.com/XFree86/Linux-x86_64/470.129.06/README/supportedchips.html ⁰ Nvidia 470.129.06 Supported Chipsets.
+=> https://wiki.debian.org/NvidiaGraphicsDrivers#Tesla_Drivers ¹ Debian Tesla Drivers.
+=> https://wiki.debian.org/NvidiaGraphicsDrivers/Configuration ² NVIDIA Proprietary Driver: Configuration.
diff --git a/topics/systems/mariadb/mariadb.gmi b/topics/systems/mariadb/mariadb.gmi
index ae0ab19..ec8b739 100644
--- a/topics/systems/mariadb/mariadb.gmi
+++ b/topics/systems/mariadb/mariadb.gmi
@@ -16,6 +16,8 @@ To install Mariadb (as a container) see below and
 Start the client and:
 
 ```
+mysql
+show databases
 MariaDB [db_webqtl]> show binary logs;
 +-----------------------+-----------+
 | Log_name              | File_size |
@@ -60,4 +62,11 @@ Stop the running mariadb-guix.service. Restore the latest backup archive and ove
 => https://www.borgbackup.org/ Borg
 => https://borgbackup.readthedocs.io/en/stable/ Borg documentation
 
-#
+# Upgrade mariadb
+
+It is wise to upgrade mariadb once in a while. In a disaster recovery it is also better to move forward in versions than backward.
+Before upgrading, make sure there is a decent backup of the current setup.
+
+See also
+
+=> issues/systems/tux04-disk-issues.gmi
diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi
index 0c89fe5..3442d4e 100644
--- a/topics/systems/mariadb/precompute-mapping-input-data.gmi
+++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi
@@ -2,7 +2,7 @@
 
 GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.
 
-At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that track the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See
+At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that tracks the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See
 
 => /topics/data/precompute/steps
 
@@ -43,16 +43,41 @@ And after:
 
 # Info
 
-## Original qtlreaper version
+## Original qtlreaper version for PublishData
+
+See the writeup at
+
+=> ./precompute-publishdata
+
+## Original qtlreaper version for ProbeSetData
 
 The original reaper precompute lives in
 
 => https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py
 
-This script first fetches inbredsets
+More recent incarnations are at v8, including a PublishData version that can be found in
+
+=> https://github.com/genenetwork/genenetwork2/tree/testing/scripts/maintenance
+
+Note that the locations are on space:
+
+```
+cd /mount/space2/lily-clone/acenteno/GN-Data
+ls -l
+python QTL_Reaper_v8_space_good.py 116
+--
+python UPDATE_Mean_MySQL_tab.py
+cd /mount/space2/lily-clone/gnshare/gn/web/webqtl/maintainance
+ls -l
+python QTL_Reaper_cal_lrs.py 7
+```
+
+The first task is to prepare an update script that can run a set at a time and compute GEMMA output (instead of reaper).
+
+The script first fetches inbredsets
 
 ```
- select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet LIMIT 5;
+select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet LIMIT 5;
 +----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
 | Id | InbredSetId | InbredSetName     | Name     | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family                                           | FamilyOrder | MenuOrderId | InbredSetCode |
 +----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
diff --git a/topics/systems/mariadb/precompute-publishdata.gmi b/topics/systems/mariadb/precompute-publishdata.gmi
new file mode 100644
index 0000000..74c278f
--- /dev/null
+++ b/topics/systems/mariadb/precompute-publishdata.gmi
@@ -0,0 +1,3370 @@
+# Precompute PublishData
+
+Based on QTL_Reaper_cal_lrs.py, aka QTL_Reaper_v8_PublishXRef.py. This script simply updates the PublishXRef table with the highest hit as computed by qtlreaper.
+
+In a first attempt to update the database we are going to do just that using GEMMA.
+
+For the new script we will pass in the genotype file as well as the phenotype file, so gemma-wrapper can process them. I wrote quite a few scripts already:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/tree/master/bin
+
+So we can convert a .geno file to BIMBAM. I need to extract GN traits to an R/qtl2 or lmdb trait format file and use that as input.
+
+* [X] Visit use of PublishXRef
+* [X] geno -> BIMBAM (BXD first)
+* [X] Get PublishData trait(s) and convert to gemma, R/qtl2 or lmdb
+* - [X] see scripts/lmdb-publishdata-export.scm
+* - [X] see scripts for ProbeSetData
+* - [X] Make sure the BXDs are mappable
+* [X] Run gemma-wrapper
+* [X] We should map by trait-id, data id is not intuitive: curl http://127.0.0.1:8091/dataset/bxd-publish/values/8967044.json > 10002-pheno.json
+* [X] Check why Zach/GN JSON file lists different mappable BXDs
+* [X] Update DB on run-server
+* [X] Add batch run and some metadata so we can link back from results
+* [X] Create a DB/table containing hits and old reaper values
+* [X] Convert this info to RDF
+* [X] Run virtuoso server
+* [X] When loading traits compute mean, se, skew, kurtosis and store them as metadata in lmdb (see the sketch after this list)
+* [ ] Why is X not showing in LMM precompute for trait 51064
+* [X] Correctly handle Infinite LOD
+* [X] Ask interesting questions about the overlap between reaper and gemma
+* [ ] Update PublishXRef and store old reaper value(?)
+* [ ] Correctly handle gn-guile escalating errors
+* [X] RDF point back to original data file
+* [ ] Fix Infinity also in LMM run (156 SNPs only)
+* [ ] Make time stamp, host, user a compute 'origin' block in RDF
+* [X] RDF mark QTL
+* [ ] Make sure the trait fetcher handles authorization or runs localhost only
+* [ ] gemma-wrapper --force does not work for GRM and re-check GRM does not change on phenotype
+* [ ] Use SNP URIs when possible (instead of inventing our own) - and BED information so we can locate them
+* [ ] Check lmdb duplicate key warning
+* [ ] run gemma with pangenome-derived genotypes
+* [ ] run gemma with qnorm
+* [ ] run gemma with sex covariate
+* [ ] run gemma again with the hit as a covariate
+* [ ] Check invalid data sets/traits and feed them to Rob/Arthur
+* [ ] Add metadata for bimodality indicator in addition to kurtosis (see below)
+* [ ] Provide SPARQL to find QTL and return metadata about traits
+* [ ] Provide PheWAS examples
+* [ ] Add BED information on Genes
+* [ ] Update Xapian search - also to handle gene aliases
+* [ ] Create GN UI with Zach
+
+For the last item we should probably add a few columns. Initially we'll only store the maximum hit.
+
+After
+
+* [ ] provide distributed storage of files using https
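+
+As a sketch of the mean/se/skew/kurtosis metadata computation from the checklist above (scipy here is an assumption; any stats library would do):
+
+```
+import numpy as np
+from scipy import stats
+
+values = np.array([61.4, 49.0, 62.5, 53.1, 59.1])  # e.g. trait values from PublishData
+meta = {
+    "mean": float(np.mean(values)),
+    "se": float(stats.sem(values)),
+    "skew": float(stats.skew(values)),
+    "kurtosis": float(stats.kurtosis(values)),  # excess kurtosis
+}
+print(meta)
+```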
+
+# Visit use of PublishXRef
+
+In GN2 this table is used in search, auth, and the router. For search it is to look for trait hits (logically). For the router it is to fetch trait info as well as dataset info.
+
+In GN3 this table is used for partial correlations. Also to fetch API trait info and to build the search index.
+
+In GN1 usage is similar.
+
+# geno -> BIMBAM
+
+We can use the script in gemma-wrapper
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py
+
+There is probably something similar in GN2. And I have another version somewhere.
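+
+The conversion itself is roughly this (a sketch of what gn-geno-to-gemma.py does, not its actual code; the column layout is assumed to be the common BXD one with B/D/H/U calls):
+
+```
+CODE = {"B": "0", "H": "1", "D": "2", "U": "NA"}  # maternal, het, paternal, unknown
+
+def geno_to_bimbam(path):
+    for line in open(path):
+        line = line.strip()
+        if not line or line.startswith(("#", "@")) or line.startswith("Chr"):
+            continue  # skip comments, @-metadata and the header row
+        fields = line.split("\t")  # Chr, Locus, cM, Mb, <strain calls...>
+        marker, calls = fields[1], fields[4:]
+        # BIMBAM mean genotype row: marker, two (dummy) allele labels, dosages
+        yield ",".join([marker, "X", "Y"] + [CODE.get(c, "NA") for c in calls])
+
+# for row in geno_to_bimbam("BXD.geno"): print(row)
+```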
+
+To identify the geno file the reaper script uses
+
+```python
+cursor.execute('select Id, Name from InbredSet')
+results = cursor.fetchall()
+InbredSets = {}
+for item in results:
+	InbredSets[item[0]] = genotypeDir+str(item[1])+'.geno'
+```
+
+which assumes one single geno file for the BXD that is indexed by the InbredSetID (a number). Note it ignores the many genotype files we have per inbredset (today). Also there is a funny hardcoded
+
+```python
+	if InbredSetId==3:
+		InbredSetId=1
+```
+
+(no comment).
+
+Later we'll output to lmdb when GEMMA supports it.
+
+There are about 100 InbredSets. Genotype files can be found on production in
+/export/guix-containers/genenetwork/var/genenetwork/genotype-files/genotype. For the BXD alone there are
+
+```
+BXD.2.geno               BXD-Heart-Metals_old.geno   BXD-Micturition.6.geno
+BXD.4.geno               BXD-JAX-AD.4.geno           BXD-Micturition.8.geno
+BXD.5.geno               BXD-JAX-AD.8.geno           BXD-Micturition.geno
+BXD.6.geno               BXD-JAX-AD.geno             BXD-Micturition_old.4.geno
+BXD.7.geno               BXD-JAX-AD_old.geno         BXD-Micturition_old.6.geno
+BXD.8.geno               BXD-JAX-OFS.geno            BXD-Micturition_old.geno
+BXD-AE.4.geno            BXD-Longevity.4.geno        BXD_mm8.geno
+BXD-AE.8.geno            BXD-Longevity.8.geno        BXD-NIA-AD.4.geno
+BXD-AE.geno              BXD-Longevity.9.geno        BXD-NIA-AD.8.geno
+BXD-AE_old.geno          BXD-Longevity.array.geno    BXD-NIA-AD.geno
+BXD-Bone.geno            BXD-Longevity.classic.geno  BXD-NIA-AD_old2.geno
+BXD-Bone_orig.geno       BXD-Longevity.geno          BXD-NIA-AD_old.geno
+BXD.geno                 BXD-Longevity_old.4.geno    BXD_Nov_23_2010_before_polish_101_102_103.geno
+BXD-Harvested.geno       BXD-Longevity_old.8.geno    BXD_Nov_24_2010_before_polish_55_81.geno
+BXD-Heart-Metals.4.geno  BXD-Longevity_old.geno      BXD_old.geno
+BXD-Heart-Metals.8.geno  BXD-MBD-UTHSC.geno          BXD_unsure.geno
+BXD-Heart-Metals.geno    BXD-Micturition.4.geno      BXD_UT-SJ.geno
+```
+
+Not really reflected in the DB:
+
+```
+MariaDB [db_webqtl]> select Id, Name from InbredSet where name like '%BXD%';
++----+------------------+
+| Id | Name             |
++----+------------------+
+|  1 | BXD              |
+| 58 | BXD-Bone         |
+| 64 | BXD-Longevity    |
+| 68 | BXD_Dev          |
+| 76 | DOD-BXD-GWI      |
+| 84 | BXD-Heart-Metals |
+| 86 | BXD-AE           |
+| 91 | BXD-Micturition  |
+| 92 | BXD-JAX-AD       |
+| 93 | BXD-NIA-AD       |
+| 94 | CCBXD-TM         |
+| 96 | BXD-JAX-OFS      |
+| 97 | BXD-MBD-UTHSC    |
++----+------------------+
+```
+
+Bit of a mess. Looks like some files are discarded. Let's see what the reaper script does.
+
+We should also look into distributed storage. One option is webdav.
+
+# Get PublishData trait(s) and convert to R/qtl2 or lmdb
+
+Let's see how the scripts do it. Note that we already did that for the probeset script in
+
+=> precompute-mapping-input-data
+
+The code is reflected in
+
+=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute/list-traits-to-compute.scm
+
+Now I need to do the exact same thing, but for PublishData.
+
+Let's connect to a remote GN DB:
+
+```
+ssh -L 3306:127.0.0.1:3306 -f -N tux02.genenetwork.org
+```
+
+and follow
+
+=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v8_PublishXRef.py
+
+The script takes a number of 'PublishFreezeId' values. Alternatively it picks them up by SpeciesId (hard effing coded, of course).
+
+=> https://github.com/genenetwork/genenetwork2/blob/fcde38b0f37f12508a01b16b7820029aa951bded/scripts/maintenance/QTL_Reaper_v8_PublishXRef.py#L62
+
+Next it picks the geno file from the InbredSetID with
+
+```
+select InbredSetId  from PublishFreeze  where PublishFreeze.Id = 1;
+```
+
+Here we are initially going to focus on BXD=1 datasets only.
+
+```
+MariaDB [db_webqtl]> select Id,InbredSetId  from PublishFreeze  where InbredSetId = 1;
++----+-------------+
+| Id | InbredSetId |
++----+-------------+
+|  1 |           1 |
++----+-------------+
+```
+
+(we are halfway through the script now). Next we capture some metadata:
+
+```
+MariaDB [db_webqtl]> select PhenotypeId, Locus, DataId, Phenotype.Post_publication_description from PublishXRef, Phenotype where PublishXRef.PhenotypeId = Phenotype.Id and InbredSetId=1 limit 5;
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+| PhenotypeId | Locus          | DataId  | Post_publication_description                                                                                               |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+|           4 | rs48756159     | 8967043 | Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]                       |
+|          10 | rsm10000005699 | 8967044 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]                 |
+|          15 | rsm10000013713 | 8967045 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] |
+|          20 | rs48756159     | 8967046 | Central nervous system, morphology: Cerebellum volume [mm3]                                                                |
+|          25 | rsm10000005699 | 8967047 | Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3]                       |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+```
+
+It captures the LRS:
+
+```
+MariaDB [db_webqtl]> select LRS from PublishXRef where PhenotypeId=4 and InbredSetId=1;
++--------------------+
+| LRS                |
++--------------------+
+| 13.497491147108706 |
++--------------------+
+```
+
+and finally the trait values that are used for mapping
+
+```
+select Strain.Name, PublishData.value from Strain, PublishData where Strain.Id = PublishData.StrainId and PublishData.Id = 8967043;
++-------+-----------+
+| Name  | value     |
++-------+-----------+
+| BXD1  | 61.400002 |
+| BXD2  | 49.000000 |
+| BXD5  | 62.500000 |
+| BXD6  | 53.099998 |
+| BXD8  | 59.099998 |
+| BXD9  | 53.900002 |
+| BXD11 | 53.099998 |
+| BXD12 | 45.900002 |
+| BXD13 | 48.400002 |
+| BXD14 | 49.400002 |
+| BXD15 | 47.400002 |
+| BXD16 | 56.299999 |
+| BXD18 | 53.599998 |
+| BXD19 | 50.099998 |
+| BXD20 | 48.200001 |
+| BXD21 | 50.599998 |
+| BXD22 | 53.799999 |
+| BXD23 | 48.599998 |
+| BXD24 | 54.900002 |
+| BXD25 | 49.599998 |
+| BXD27 | 47.400002 |
+| BXD28 | 51.500000 |
+| BXD29 | 50.200001 |
+| BXD30 | 53.599998 |
+| BXD31 | 49.700001 |
+| BXD32 | 56.000000 |
+| BXD33 | 52.099998 |
+| BXD34 | 53.700001 |
+| BXD35 | 49.700001 |
+| BXD36 | 44.500000 |
+| BXD38 | 51.099998 |
+| BXD39 | 54.900002 |
+| BXD40 | 49.900002 |
+| BXD42 | 59.400002 |
++-------+-----------+
+```
+
+Note that we need to filter out the parents - the original reaper script does not do that! My gn-guile code does handle that:
+
+```
+SELECT StrainId,Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId =1 AND Used_for_mapping<>'Y' limit 5;
++----------+----------+
+| StrainId | Name     |
++----------+----------+
+|        1 | B6D2F1   |
+|        2 | C57BL/6J |
+|        3 | DBA/2J   |
+|      150 | A/J      |
+|      151 | AXB1     |
++----------+----------+
+etc.
+```
+
+Also Bonface's script
+
+=> https://git.genenetwork.org/gn-guile/tree/scripts/lmdb-publishdata-export.scm
+
+has an interesting query:
+
+```
+MariaDB [db_webqtl]>
+SELECT DISTINCT PublishFreeze.Name, PublishXRef.Id FROM PublishData
+  INNER JOIN Strain ON PublishData.StrainId = Strain.Id
+  INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId
+  INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId
+  LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId
+  LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId
+  WHERE PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1
+  ORDER BY PublishFreeze.Id, PublishXRef.Id limit 5;
++------------+-------+
+| Name       | Id    |
++------------+-------+
+| BXDPublish | 10001 |
+| BXDPublish | 10002 |
+| BXDPublish | 10003 |
+| BXDPublish | 10004 |
+| BXDPublish | 10005 |
++------------+-------+
+5 rows in set (0.239 sec)
+```
+
+That shows we have 13689 BXDPublish datasets. It also has:
+
+```
+SELECT
+JSON_ARRAYAGG(JSON_ARRAY(Strain.Name, PublishData.Value)) AS data,
+ MD5(JSON_ARRAY(Strain.Name, PublishData.Value)) as md5hash
+FROM
+    PublishData
+    INNER JOIN Strain ON PublishData.StrainId = Strain.Id
+    INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId
+    INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId
+LEFT JOIN PublishSE ON
+    PublishSE.DataId = PublishData.Id AND
+    PublishSE.StrainId = PublishData.StrainId
+LEFT JOIN NStrain ON
+    NStrain.DataId = PublishData.Id AND
+    NStrain.StrainId = PublishData.StrainId
+WHERE
+    PublishFreeze.Name = "BXDPublish" AND
+    PublishFreeze.public > 0 AND
+    PublishData.value IS NOT NULL AND
+    PublishFreeze.confidentiality < 1
+ORDER BY
+    LENGTH(Strain.Name), Strain.Name LIMIT 5;
+```
+
+Best to pipe that to a file. It outputs JSON and an MD5SUM straight from mariadb. Interesting.
+
+Finally, let's have a look at the existing GN API:
+
+```
+SELECT Strain.Name, Strain.Name2, PublishData.value, PublishData.Id, PublishSE.error, NStrain.count
+FROM
+    (PublishData, Strain, PublishXRef, PublishFreeze)
+LEFT JOIN PublishSE ON
+    (PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId)
+LEFT JOIN NStrain ON
+    (NStrain.DataId = PublishData.Id AND
+    NStrain.StrainId = PublishData.StrainId)
+WHERE
+    PublishXRef.InbredSetId = 1 AND
+    PublishXRef.PhenotypeId = 4 AND
+    PublishData.Id = PublishXRef.DataId AND
+    PublishData.StrainId = Strain.Id AND
+    PublishXRef.InbredSetId = PublishFreeze.InbredSetId AND
+    PublishFreeze.public > 0 AND
+    PublishFreeze.confidentiality < 1
+ORDER BY
+    Strain.Name;
++-------+-------+-----------+---------+-------+-------+
+| Name  | Name2 | value     | Id      | error | count |
++-------+-------+-----------+---------+-------+-------+
+| BXD1  | BXD1  | 61.400002 | 8967043 |  2.38 | NULL  |
+| BXD11 | BXD11 | 53.099998 | 8967043 |   1.1 | NULL  |
+| BXD12 | BXD12 | 45.900002 | 8967043 |  1.09 | NULL  |
+| BXD13 | BXD13 | 48.400002 | 8967043 |  1.63 | NULL  |
+...
+```
+
+which actually blocks non-public sets and shows the standard error, as well as counts when available(?). It does not exclude the parents for mapping (btw). That probably happens on the mapping page itself.
+
+Probably the most elegant query is in the GN3 API:
+
+```
+SELECT st.Name, ifnull(pd.value, 'x'), ifnull(ps.error, 'x'), ifnull(ns.count, 'x')
+    FROM PublishFreeze pf JOIN PublishXRef px ON px.InbredSetId = pf.InbredSetId
+        JOIN PublishData pd ON pd.Id = px.DataId JOIN Strain st ON pd.StrainId = st.Id
+        LEFT JOIN PublishSE ps ON ps.DataId = pd.Id AND ps.StrainId = pd.StrainId
+        LEFT JOIN NStrain ns ON ns.DataId = pd.Id AND ns.StrainId = pd.StrainId
+    WHERE px.PhenotypeId = 4 limit 5;
++------+-----------------------+-----------------------+-----------------------+
+| Name | ifnull(pd.value, 'x') | ifnull(ps.error, 'x') | ifnull(ns.count, 'x') |
++------+-----------------------+-----------------------+-----------------------+
+| BXD1 | 61.400002             | 2.38                  | x                     |
+| BXD2 | 49.000000             | 1.25                  | x                     |
+| BXD5 | 62.500000             | 2.32                  | x                     |
+| BXD6 | 53.099998             | 1.22                  | x                     |
+| BXD8 | 59.099998             | 2.07                  | x                     |
++------+-----------------------+-----------------------+-----------------------+
+```
+
+written by Zach and Bonface. See
+
+=> https://github.com/genenetwork/genenetwork3/blame/main/gn3/db/sample_data.py
+
+
+
+We can get a list of the 13689 BXD datasets we can use. Note that we start with public data because we'll feed it to AI and all privacy will be gone after. We'll design a second API that makes use of Fred's authentication/authorization later.
+Let's start with the SQL statement listed above.
+
+
+We can run mysql through an ssh tunnel with
+
+```
+ssh -L 3306:127.0.0.1:3306 -f -N tux02.genenetwork.org
+mysql -A -h 127.0.0.1 -uwebqtlout -pwebqtlout db_webqtl
+```
+
+and test the query, i.e.
+
+```
+MariaDB [db_webqtl]> SELECT DISTINCT PublishFreeze.Name, PublishXRef.Id FROM PublishData
+    ->   INNER JOIN Strain ON PublishData.StrainId = Strain.Id
+    ->   INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId
+    ->   INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId
+    ->   LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId
+    ->   LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId
+    ->   WHERE PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1
+    ->   ORDER BY PublishFreeze.Id, PublishXRef.Id limit 5;
++------------+-------+
+| Name       | Id    |
++------------+-------+
+| BXDPublish | 10001 |
+| BXDPublish | 10002 |
+| BXDPublish | 10003 |
+| BXDPublish | 10004 |
+| BXDPublish | 10005 |
+```
+
+Let's take this apart a little. First of all PublishFreeze has only one record for BXDPublish where ID=1. PublishData may be used to check valid fields, but the real information is in PublishXRef. A simple
+
+```
+ select count(*) from PublishXRef WHERE InbredSetId=1;
++----------+
+| count(*) |
++----------+
+|    13711 |
++----------+
+```
+
+counts a few extra datasets (it was 13689). It may mean that PublishXRef contains some records that are still not public? Anyway,
+let's go for the full dataset for precompute right now. We'll add an API endpoint to gn-guile so it can be used later.
+
+Note that in GN2 the menu search
+
+=> https://genenetwork.org/search?species=mouse&group=BXD&type=Phenotypes&dataset=BXDPublish&search_terms_or=*&search_terms_and=&accession_id=None&FormID=searchResult
+
+gives 13,729 entries, including the recent BXD_51094. That is because the production database is newer. If we look at our highest records:
+
+```
+select * from PublishXRef WHERE InbredSetId=1 ORDER BY ID DESC limit 3;
++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+
+| Id    | InbredSetId | PhenotypeId | PublicationId | DataId   | mean              | Locus          | LRS                | additive           | Sequence | comments |
++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+
+| 51060 |           1 |       45821 |         39794 | 41022015 |              NULL | rsm10000000968 | 13.263934206457122 | 2.1741201177177185 |        1 |          |
+| 51049 |           1 |       45810 |         39783 | 41022004 | 8.092333210508029 | rsm10000014174 |   16.8291804498215 | 18.143229769230775 |        1 |          |
+| 51048 |           1 |       45809 |         39782 | 41022003 | 6.082199917286634 | rsm10000009222 | 14.462661474938166 |  4.582111488461538 |        1 |          |
++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+
+```
+
+You can see they match that list (51060 got updated on production). The ID matches record BXD_51060 on the production search table.
+We can look at the DataId with
+
+```
+select Id,PhenotypeId,DataId from PublishXRef WHERE InbredSetId=1 ORDER BY ID DESC limit 3;
++-------+-------------+----------+
+| Id    | PhenotypeId | DataId   |
++-------+-------------+----------+
+| 51060 |       45821 | 41022015 |
+| 51049 |       45810 | 41022004 |
+| 51048 |       45809 | 41022003 |
++-------+-------------+----------+
+```
+
+And get the actual values with
+
+```
+select * from PublishData WHERE Id=41022003;
++----------+----------+-----------+
+| Id       | StrainId | value     |
++----------+----------+-----------+
+| 41022003 |        2 |  9.136000 |
+| 41022003 |        3 |  4.401000 |
+| 41022003 |        9 |  4.360000 |
+| 41022003 |       29 | 15.745000 |
+| 41022003 |       98 |  4.073000 |
+| 41022003 |       99 | -0.580000 |
+```
+
+which match the values on
+
+=> https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish
+
+The PhenotypeId is useful for some metadata:
+
+
+```
+select * from Phenotype WHERE ID=45809;
+| 45809 | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight]  | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight]  | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight]  | [ug/mg wet weight] | Fe300-120CTXMale             | Fe300-120CTXMale              | NULL     | acenteno  | Jones B | joneslab         |
+```
+
+Since I am going for the simpler query I'll add an API endpoint named
+datasets/bxd-publish/list (so others can use that too). We'll return
+tuples for each entry so we can extend it later. First we need the
+DataID so we can point into PublishData. We expect the endpoint to
+return something like
+
+```
++-------+-------------+----------+
+| Id    | PhenotypeId | DataId   |
++-------+-------------+----------+
+| 51060 |       45821 | 41022015 |
+| 51049 |       45810 | 41022004 |
+| 51048 |       45809 | 41022003 |
+...
+```
+
+Alright, let's write some code. The following patch returns this on the endpoint:
+
+```
+[
+  {
+    "Id": 10001,
+    "PhenotypeId": 4,
+    "DataId": 8967043
+  },
+  {
+    "Id": 10002,
+    "PhenotypeId": 10,
+    "DataId": 8967044
+  },
+  {
+    "Id": 10003,
+    "PhenotypeId": 15,
+    "DataId": 8967045
+  },
+...
+```
+
+in about 3 seconds. It will run a lot faster on a local network. But for our purpose it is fine. The code I wrote is here:
+
+=> https://git.genenetwork.org/gn-guile/commit/?id=1590be15f85e30d7db879c19d2d3b4bed201556a
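+
+Consuming the new endpoint from Python looks something like this (a sketch; host, port and endpoint name as introduced above):
+
+```
+import json
+from urllib.request import urlopen
+
+with urlopen("http://127.0.0.1:8091/datasets/bxd-publish/list") as resp:
+    traits = json.load(resp)  # [{"Id": 10001, "PhenotypeId": 4, "DataId": 8967043}, ...]
+print(len(traits), "traits; first DataId:", traits[0]["DataId"])
+```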
+
+Note the simple SQL query (compared to the first one).
+The next step is to fetch the trait values we can feed to GEMMA. The full query using the PhenotypeId and DataId in GN is:
+
+```
+SELECT Strain.Name, Strain.Name2, PublishData.value, PublishData.Id, PublishSE.error, NStrain.count
+  FROM
+      (PublishData, Strain, PublishXRef, PublishFreeze)
+  LEFT JOIN PublishSE ON
+      (PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId)
+  LEFT JOIN NStrain ON
+      (NStrain.DataId = PublishData.Id AND
+      NStrain.StrainId = PublishData.StrainId)
+  WHERE
+      PublishXRef.InbredSetId = 1 AND
+      PublishXRef.PhenotypeId = 4 AND
+      PublishData.Id = PublishXRef.DataId AND
+      PublishData.StrainId = Strain.Id AND
+      PublishXRef.InbredSetId = PublishFreeze.InbredSetId AND
+      PublishFreeze.public > 0 AND
+      PublishFreeze.confidentiality < 1;
++-------+-------+-----------+---------+-------+-------+
+| Name  | Name2 | value     | Id      | error | count |
++-------+-------+-----------+---------+-------+-------+
+| BXD1  | BXD1  | 61.400002 | 8967043 |  2.38 | NULL  |
+| BXD2  | BXD2  | 49.000000 | 8967043 |  1.25 | NULL  |
+| BXD5  | BXD5  | 62.500000 | 8967043 |  2.32 | NULL  |
+| BXD6  | BXD6  | 53.099998 | 8967043 |  1.22 | NULL  |
+...
+```
+
+(result includes parents). We can simplify this for GEMMA because it only wants the name and (mean) value.
+
+The short version when you have the data ID is:
+
+```
+SELECT Strain.Name, PublishData.value FROM Strain, PublishData WHERE PublishData.Id=41022003 and Strain.Id=PublishData.StrainId;
++----------+-----------+
+| Name     | value     |
++----------+-----------+
+| C57BL/6J |  9.136000 |
+| DBA/2J   |  4.401000 |
+| BXD9     |  4.360000 |
+| BXD32    | 15.745000 |
+| BXD43    |  4.073000 |
+| BXD44    | -0.580000 |
+| BXD48    | -1.810000 |
+| BXD51    |  4.294000 |
+| BXD60    | -0.208000 |
+| BXD62    | -0.013000 |
+| BXD63    |  3.221000 |
+| BXD66    |  2.472000 |
+| BXD69    | 12.886000 |
+| BXD70    | -1.973000 |
+| BXD78    | 19.511999 |
+| BXD79    |  7.845000 |
+| BXD73a   |  3.201000 |
+| BXD87    | -3.054000 |
+| BXD48a   | 11.585000 |
+| BXD100   |  7.088000 |
+| BXD102   |  8.485000 |
+| BXD124   | 13.442000 |
+| BXD170   | -1.274000 |
+| BXD172   | 18.587000 |
+| BXD186   | 10.634000 |
++----------+-----------+
+```
+
+which matches GN perfectly (some individuals were added). Alright, let's add an endpoint for this named
+'dataset/bxd-publish/values/dataid/41022003'. Note we only deal with public data (so far). Later we may come up with more generic
+endpoints and authorization. At this point the API is either on the local network (this one is) or public.
+
+The first version returns this data from the endpoint:
+
+```
+time curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003
+[{"Name":"C57BL/6J","value":9.136},{"Name":"DBA/2J","value":4.401},{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745},{"Name":"BXD43","value":4.073},{"Name":"BXD44","value":-0.58},{"Name":"BXD48","value":-1.81},{"Name":"BXD51","value":4.294},{"Name":"BXD60","value":-0.208},{"Name":"BXD62","value":-0.013},{"Name":"BXD63","value":3.221},{"Name":"BXD66","value":2.472},{"Name":"BXD69","value":12.886},{"Name":"BXD70","value":-1.973},{"Name":"BXD78","value":19.511999},{"Name":"BXD79","value":7.845},{"Name":"BXD73a","value":3.201},{"Name":"BXD87","value":-3.054},{"Name":"BXD48a","value":11.585},{"Name":"BXD100","value":7.088},{"Name":"BXD102","value":8.485},{"Name":"BXD124","value":13.442},{"Name":"BXD170","value":-1.274},{"Name":"BXD172","value":18.587},{"Name":"BXD186","value":10.634}]
+real    0m0.537s
+user    0m0.002s
+sys     0m0.005s
+```
+
+Note it includes the parents. We should drop them. In this case we can simply check for (string-contains name "BXD"). The database records allow for a filter, so we get:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/mapping/values/41022003
+[{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745},{"Name":"BXD43","value":4.073},{"Name":"BXD44","value":-0.58},{"Name":"BXD48","value":-1.81},{"Name":"BXD51","value":4.294},{"Name":"BXD60","value":-0.208},{"Name":"BXD62","value":-0.013},{"Name":"BXD63","value":3.221},{"Name":"BXD66","value":2.472},{"Name":"BXD69","value":12.886},{"Name":"BXD70","value":-1.973},{"Name":"BXD78","value":19.511999},{"Name":"BXD79","value":7.845},{"Name":"BXD73a","value":3.201},{"Name":"BXD87","value":-3.054},{"Name":"BXD48a","value":11.585},{"Name":"BXD100","value":7.088},{"Name":"BXD102","value":8.485},{"Name":"BXD124","value":13.442},{"Name":"BXD170","value":-1.274},{"Name":"BXD172","value":18.587},{"Name":"BXD186","value":10.634}]
+```
+
+That code went in as
+
+=> https://git.genenetwork.org/gn-guile/commit/?id=9ad0793eb477611c700f4a5b02f60ac793bfae96
+
+It took a bit longer than I wanted because I made a mistake converting the results to a hash table. It broke the JSON conversion and the error was not so helpful.
+
+To write a CSV it turns out I had already written
+
+=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm?id=9ad0793eb477611c700f4a5b02f60ac793bfae96#n18
+
+which takes the GN BXD.json file and our trait file. BXD.json captures the genotype information GN has:
+
+```
+{
+        "mat": "C57BL/6J",
+        "pat": "DBA/2J",
+        "f1s": ["B6D2F1", "D2B6F1"],
+        "genofile" : [{
+                "title" : "WGS-based (Mar2022)",
+                "location" : "BXD.8.geno",
+                "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44", "BXD45", "BXD48", "BXD48a", "BXD49", "BXD50", "BXD51", "BXD52", "BXD53", "BXD54", "BXD55", "BXD56", "BXD59", "BXD60", "BXD61",
+(...)
+"BXD065xBXD077F1", "BXD069xBXD090F1", "BXD071xBXD061F1", "BXD073bxBXD065F1", "BXD073bxBXD077F1", "BXD073xBXD034F1", "BXD073xBXD065F1", "BXD073xBXD077F1", "BXD074xBXD055F1", "BXD077xBXD062F1", "BXD083xBXD045F1", "BXD087xBXD100F1", "BXD065bxBXD055F1", "BXD102xBXD077F1", "BXD102xBXD73bF1", "BXD170xBXD172F1", "BXD172xBXD197F1", "BXD197xBXD009F1", "BXD197xBXD170F1"]
+```
+
+The code maps the trait values I generated against these columns to see which individuals overlap, which corrects for unmappable individuals (anyway).
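+
+In sketch form that alignment is (the strain names here are illustrative):
+
+```
+sample_list = ["BXD1", "BXD2", "BXD5", "BXD9"]  # column order from BXD.json / the geno file
+trait = {"BXD9": 4.36, "BXD1": 61.4}            # values from the endpoint
+
+# GEMMA matches phenotypes by position, not by name, so emit one value
+# per geno individual, with NA where the trait has no data for that strain
+pheno_column = [str(trait.get(s, "NA")) for s in sample_list]
+print("\n".join(pheno_column))
+```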
+
+The function 'write-pheno-file', listed above, does not work, however, because of the format of the endpoint. Remember it generates
+
+```
+[{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745}...]
+```
+
+While this function expects the shorter
+
+```
+{"BXD9":4.36,"BXD23":15.745...}
+```
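+
+Converting between the two layouts is a one-liner either way (a sketch):
+
+```
+records = [{"Name": "BXD9", "value": 4.36}, {"Name": "BXD32", "value": 15.745}]
+short = {rec["Name"]: rec["value"] for rec in records}
+print(short)  # {'BXD9': 4.36, 'BXD32': 15.745}
+```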
+
+Now, for endpoints there is no real standard. We have written ideas up here:
+
+=> https://git.genenetwork.org/gn-docs/tree/api
+
+and, most recently
+
+=> https://git.genenetwork.org/gn-docs/tree/api/GN-REST-API-v2.md
+
+Where I make a case for having the metadata as a separate endpoint that can be reasoned on by people and machines (and AI).
+That means I should default to the short version of the data and describe that layout using metadata. This we can do later.
+
+I modified the endpoint to return the shorter hash:
+
+```
+time curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003
+{"BXD9":4.36,"BXD23":15.745...}
+```
+
+Next, to align with
+
+=> https://github.com/genenetwork/gn-docs/blob/master/api/GN-REST-API-v2.md
+
+I gave the API the json extension, so we have http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.json
+
+This allows writing a special handler for GEMMA output (the .gemma extension), downloading the pheno file with:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.gemma
+NA
+NA
+NA
+NA
+NA
+4.36
+NA
+NA
+NA
+NA
+(...)
+```
+
+that GEMMA can use directly; it matches the order of the individuals in the BXD.8.geno file, and the founders/parents are not included. Note that all of this now only works for the BXD (on PublishData) and I am using BXD.json as described in
+
+=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-mapping-input-data
+
+I.e., it is Zach's listed stopgap solution. Code is here:
+
+=> https://git.genenetwork.org/gn-guile/log/
+
+The next step is to run gemma, as we are on par with my earlier work on ProbeSetData. I wrote a gemma runner for that too at
+
+=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm#n79
+
+Now here I use guile to essentially script running GEMMA. There is no real advantage to that, so I will simply tell gemma-wrapper to use the output of the above .gemma endpoint to fetch the trait values. Basically gemma-wrapper can specify the standard gemma -p switch, or pass in --phenotypes, which is used for permutations.
+
+Now the new method we want to introduce is that the trait values are read from a REST API, instead of a file. The dirty way is to provide that functionality directly to gemma-wrapper, but we plan to get rid of that code (useful as it is -- it duplicates what Arun's ravanan does and ravanan has the advantage that it can be run on a cluster).
+
+So we simply download the data and write it to a file with a small script. To run:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.gemma > 41022003-pheno.txt
+```
+
+Next we create a container for gemma-wrapper (which includes the gemma that GN uses):
+
+```
+. .guix-deploy
+env TMPDIR=tmp ruby ./bin/gemma-wrapper --force --json \
+        --loco -- \
+        -g BXD.8_geno.txt.gz \
+        -p 41022003-pheno.txt \
+        -a BXD.8_snps.txt \
+        -gk > K.json
+```
+
+This bailed out with:
+
+```
+Executing: parallel --results /tmp/test --joblog /tmp/test/5f3849a9e61b70e3d562b20c5eade5a699923c68-parallel.log < /tmp/test/parallel-commands.txt
+Command exited with non-zero status 20
+```
+
+When running an individual chromosome (from the parallel log) we get two warnings and an error:
+
+```
+**** WARNING: The maximum genotype value is not 2.0 - this is not the BIMBAM standard and will skew l_lme and effect sizes
+**** WARNING: Columns in geno file do not match # individuals in phenotypes
+ERROR: Enforce failed for not enough genotype fields for marker in src/gemma_io.cpp at line 1470 in BimbamKin
+```
+
+Looks familiar!
+The first warning we'll ignore for now, as we just want the hits initially. The second warning relates to the error: there is a mismatch in the number of individuals.
+
+This topic I have covered in the past, particularly trying to debug Dave's conflicting results:
+
+=> https://issues.genenetwork.org/topics/lmms/gemma/permutations
+
+It makes somewhat depressing reading, though we have a solution.
+
+Note we only have to do the correct conversion once (basically the code I wrote earlier
+to fetch BXD traits needs to work with the latest BXD genotypes).
+The real problem is that gemma itself does not compare individual names (at all), so any corrections need to be done beforehand. In this case our pheno file contains 212 individuals from the earlier BXD.json file.
+
+```
+wc -l 41022003-pheno.txt
+212 41022003-pheno.txt
+```
+
+And that is off. Let's try the tool I wrote during that exercise. It can create a different json file after parsing BXD.geno,
+which has in its header:
+
+> # Date Modified: April 23, 2024 by Arthur Centeno, Suheeta Roy. March 22, 2022 by Rob Williams, David Ashbrook, and Danny Arends to remove excessive cross-over events in strains BXD42 (Chr9), BXD81 (Chrs1, 5, 10), BXD99 (Chr1), and BXD100 (Chrs2 and 6); and to add Taar1 maker on Chr 10 for T. Phillips-Richards.   Jan 19, 2017: Danny Arends computed BXD cM values and recombinations between markers. Rob W. Williams fixed errors on most chromosomes and added Affy eQTL markers. BXD223 now has been added based on David Ashbrook's spreadsheet genotype information.
+
+```
+md5sum BXD.geno
+  a78aa312b51ac15dd8ece911409c5b98  BXD.geno
+gemma-wrapper$ ./bin/gn-geno-to-gemma.py BXD.geno > BXD.geno.txt
+```
+
+creates a .json file (that is different from Zach/GN's) and a bimbam file GEMMA can use. Now in the next step I need to adapt the above code to use this format. What I *should* have done, instead of writing gemma phenotypes directly, is write the R/qtl2 format that includes the individual names (so we can compare and validate those) and *then* parse that data against our new JSON file created by gn-geno-to-gemma.py using the rqtl2-pheno-to-gemma.py script. Both Python scripts are already part of gemma-wrapper:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rqtl2-pheno-to-gemma.py
+
+The idea is to either create the R/qtl2 API endpoint, or adapt the 2nd script to take the endpoint as input and then correct for GEMMA's requirements.
+
+OK, I updated the endpoints and the code for rqtl2-pheno-to-gemma.py so it accepts a URL instead of a file. So the idea is
+to run:
+
+```
+./bin/rqtl2-pheno-to-gemma.py BXD_pheno_Dave.csv --json BXD.geno.json > BXD_pheno_matched.txt
+```
+
+A line in BXD_pheno_Dave.csv is:
+
+```
+BXD113,24.52,205.429001,3.643,2203.312012,3685.907959,1.199,2.019,29.347143,0.642857,205.428574,24.520409,3.642857,2203.312012,3685.908203,1.198643,2.018643,0.642857,33.785709,1.625,2,1.625,1,22.75
+```
+
+Now if I read the R/qtl2 docs, they say:
+
+> We split the numeric phenotypes from the mixed-mode covariates, as two separate CSV files. Each file forms a matrix of individuals × phenotypes (or covariates), with the first column being individual IDs and the first row being phenotype or covariate names. Sex and line IDs (if needed) can be columns in the covariate data.
+
+This differs from the BXD Dave layout (it is transposed). Karl added in the docs:
+
+> All of these CSV files may be transposed relative to the form described below. You just need to include, in the control file, a line like: "geno_transposed: true".
+
+So, OK, we can use the transposed form. First we make it possible to parse the JSON:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.json > 41022003-pheno.json
+jq < 41022003-pheno.json
+{
+  "C57BL/6J": 9.136,
+  "DBA/2J": 4.401,
+  "BXD9": 4.36,
+  "BXD32": 15.745,
+(...)
+```
+
+Note that it includes the parents. Feed it to:
+
+```
+./bin/rqtl2-pheno-to-gemma.py 41022003-pheno.json --json BXD.geno.json
+```
+
+where BXD.geno.json is not the Zach/GN json file, but the actual BXDs in GEMMA's bimbam file.
+
+One question is why Zach's JSON file gives a different number of mappable BXDs. I made a note of that to check.
+
+I wrote a new script and we had our first GEMMA run with lmdb output:
+
+```
+wrk@napoli /export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper [env]$ tar tvf /tmp/3fddda2374509c7b346>
+-rw-r--r-- wrk/users    294912 2025-08-06 05:49 3fddda2374509c7b346b7819ae358ed23be9cb46-gemma-GWA.mdb
+```
+
+The script is just 10 lines of code (after the command line handler)
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-pheno-to-gemma.rb
+
+Excellent, now we can run gemma and the next step is to look at the largest hit.
+
+So the trait we try to run is 41022003 = https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish. The inputs match up. When we run GEMMA in GN it has a 4.0 score on chr 12 and 3.9 on chr 19.
+
+Running gemma-wrapper we get
+
+```
+LOCO K computation with caching and JSON output
+
+gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json
+
+LMM's using the K's captured in K.json using the --input switch
+
+gemma-wrapper --json --force --lmdb --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+```
+
+We can view the lmdb file with something like:
+
+```
+./bin/view-gemma-mdb --sort /tmp/66b8c19be87e9566358ce904682a56250eb05748-gemma-GWA.tar.xz --anno BXD.8_snps.txt > test.out
+/tmp/3fddda2374509c7b346b7819ae358ed23be9cb46-gemma-GWA.tar.xz
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+7,67950073,rsm10000004928,0.543,1.5226,1.3331,100000.0,0.0002,3.79
+7,68061665,rs32453663,0.543,1.5226,1.3331,100000.0,0.0002,3.79
+7,68111284,rs32227186,0.543,1.5226,1.3331,100000.0,0.0002,3.79
+19,30665443,rsm10000014129,0.522,2.2128,1.0486,100000.0,0.0002,3.77
+19,30671753,rs31207057,0.522,2.2128,1.0486,100000.0,0.0002,3.77
+12,40785621,rsm10000009222,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40786657,rs29124638,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40842857,rs13481410,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40887762,rsm10000009223,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40887894,rsm10000009224,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40900825,rs50979658,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,41054766,rs46705481,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+```
+
+Interestingly the hits are very similar to what is on production now, though not the same! That suggests I am not using the production database for this recent dataset. Let's try an older one. BXD_10002 has data id 8967044:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/8967044.json > 10002-pheno.json
+./bin/gn-pheno-to-gemma.rb -p 10002-pheno.json --geno-json BXD.geno.json > 10002-pheno.txt
+gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p 10002-pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json
+gemma-wrapper --json --force --lmdb --loco --input K.json -- -g BXD.geno.txt -p 10002-pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+./bin/view-gemma-mdb --sort /tmp/c4ffedf358698814c6e29a54a2a51cb6c66328d0-gemma-GWA.tar.xz --anno BXD.8_snps.txt > test.out
+```
+
+Luckily this is a perfect match:
+
+```
+1,179861787,rsm10000000444,0.559,0.8837,0.3555,100000.0,0.0,4.99
+1,179862838,rs30712622,0.559,0.8837,0.3555,100000.0,0.0,4.99
+1,179915631,rsm10000000787,0.559,0.8837,0.3555,100000.0,0.0,4.99
+1,179919811,rsm10000000788,0.559,0.8837,0.3555,100000.0,0.0,4.99
+(...)
+8,94479237,rs32095272,0.441,1.0456,0.4362,100000.0,0.0,4.75
+8,94765445,rsm10000005684,0.441,1.0456,0.4362,100000.0,0.0,4.75
+8,94785223,rsm10000005685,0.441,1.0456,0.4362,100000.0,0.0,4.75
+8,94840921,rsm10000005686,0.441,1.0456,0.4362,100000.0,0.0,4.75
+```
+
+The lmdb file contains the full vector and compresses to 100K. For 13K traits that equals about 1Gb.
+
+First I wanted to check how Zach's list of mappable inds compares to mine. A simple REPL exercise shows:
+
+```
+zach = JSON.parse(File.read('BXD.json'))
+pj = JSON.parse(File.read('BXD.geno.json'))
+s1 = zach["genofile"][0]["sample_list"]
+=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",...
+s2 = pj["samples"]
+=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",...
+s1.size()
+=> 235
+s2.size()
+=> 237
+ s2-s1
+=> ["BXD077xBXD065F1", "BXD065xBXD102F1"]
+```
+
+So it turns out the newer geno file contains these two new inds that are missing from Zach's BXD.json, and that confuses the hell out of my scripts ;). The GN2 webserver probably uses the header of the geno file to fetch the correct number. The trait page also lists these inds, so (I guess) the BXD.json file ought to be updated.
+
+Now that is explained and we are good.
+
+## Running at scale
+
+In the next step we need to batch run GEMMA. Initially we'll run on one server. gemma-wrapper takes care of running only once, so we can restart the pipeline at any point (we'll move to ravanan afterwards to run on the cluster). At this point the API uses the dataid to return the trait values. I think that is not so intuitive, so I modified the endpoint to give the same results for:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json > 10002-pheno.json
+curl http://127.0.0.1:8091/dataset/bxd-publish/dataid/values/8967044.json > 10002-pheno.json
+```
+
+Now that this works we can get a list of all BXDPublish datasets from the endpoint I wrote earlier:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/list > bxd-publish.json
+[
+  {
+    "Id": 10001,
+    "PhenotypeId": 4,
+    "DataId": 8967043
+  },
+  {
+    "Id": 10002,
+    "PhenotypeId": 10,
+    "DataId": 8967044
+  },
+  {
+    "Id": 10003,
+    "PhenotypeId": 15,
+    "DataId": 8967045
+  },
+```
+
+so we can use this to create our batch list. There are 13711 datasets listed in this DB. We can use jq to extract all Ids:
+
+```
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+```
+
+All set to run our first batch! Now we replicate our gemma-wrapper guix environment, start the gn-guile server, and fire up a batch script that pulls the data from the database and runs gemma for every trait.
+
+
+To get precompute going we need a server set up with a recent database. I don't want to use the production server. The fastest other server we have is balg01, and it is not busy right now, so let's use that. First we recover a DB from our backup, as described in
+
+=> topics/systems/mariadb/precompute-mapping-input-data
+
+(btw, that example shows we started on precompute in November 2023, 1.5 years ago). On that server mariadb is running as
+/usr/local/guix-profiles/gn-latest/bin/mariadbd --datadir=/export/mariadb/tux01. We can simply overwrite that database as it
+is an installation from Feb 18, 2024. We extract:
+
+```
+borg extract --progress /export/backup/bacchus/drop/tux04/genenetwork::borg-tux04-sql-20250807-04:16-Thu
+```
+
+After extracting the backup we need to update permissions and point mariadb to the new dir: balg01:/export/mariadb/tux04/latest/.
+After restarting the DB it all appears to work.
+
+Before I move the code across we need to make sure metadata on the traits gets added to the lmdb mapping data. I actually wrote the code for that here. This adds the metadata to lmdb:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/a0eb8ed829072cb539b32affe135a7930989ca30/bin/gemma2lmdb.py#L99
+
+gemma-wrapper writes data like this:
+
+```
+  "meta": {
+    "type": "gemma-wrapper",
+    "version": "0.99.7-pre1",
+    "population": "BXD",
+    "name": "HC_U_0304_R",
+    "trait": "101500_at",
+    "url": "https://genenetwork.org/show_trait?trait_id=101500_at&dataset=HC_U_0304_R",
+    "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
+    "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-gemma-GWA.tar.xz",
+    "dataid": 75629,
+    "probesetid": 1097,
+    "probesetfreezeid": 7
+    }
+```
+
+This was done for probesetdata and needs to be adapted for our BXD PublishData exercise. Also I want the archive_GWA file name to include the trait name/ID so we can find it quickly on the storage (without having to parse/query all lmdb files).
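+
+The naming I have in mind looks something like this sketch (the pattern is an assumption at this point, mirroring the archive names shown above):
+
+```
+def archive_gwa_name(content_hash, dataset, traitid):
+    """Compose a GWA archive name that carries the dataset and trait ID,
+    so a trait's archive can be located without opening any lmdb files."""
+    return f"{content_hash}-{dataset}-{traitid}-gemma-GWA.tar.xz"
+
+# e.g. archive_gwa_name("779a54a59e4cd03608178db4068791db4ca44ab3", "BXDPublish", 10002)
+```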
+
+From the gemma-wrapper invocation you can see I added a few switches to pass in this information:
+
+=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm#n97
+
+```
+        --meta NAME                  Pass in metadata as JSON file
+        --population NAME            Add population identifier to metadata
+        --name NAME                  Add dataset identifier to metadata
+        --id ID                      Add identifier to metadata
+        --trait TRAIT                Add trait identifier to metadata
+```
+
+We can add BXD as the population and BXDPublish as the dataset identifier, set id to the dataid and the trait id to the PublishXRefID, and point it back to GN, so we can click:
+
+=>  https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish
+
+Another thing I want to add is the existing qtlreaper hit values. That way we can assess where using gemma over qtlreaper had the biggest impact. To achieve this we will create a new API endpoint that can serve that data. Remember we get the trait values with:
+
+=> http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json
+
+so we can add an endpoint that lists the mapping results
+
+=> http://127.0.0.1:8091/dataset/bxd-publish/trait-hits/10002.json
+
+We will also have:
+
+=> http://127.0.0.1:8091/dataset/bxd-publish/trait-info/10002.json
+
+That will return more metadata and point into our RDF store. Note that this is now all very specific to bxd-publish. Later we'll have to think how to generalise these endpoints. We are just moving forward to do the BXD precompute run.
+
+Interestingly GN2 shows this information (well, only the highest hit) on the search page, but not on the trait page. As we can get hits from multiple sources we should (eventually) account for that with something like:
+
+```
+=> http://127.0.0.1:8091/dataset/bxd-publish/trait-hits/10002.json
+{ "qtlreaper-hk":
+  {
+    [
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., }
+    ]
+  }
+  "gemma-loco":
+  {
+    [
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., }
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., }
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., }
+    ]
+  }
+}
+```
+
+Eventually we may list gemma, Rqtl2 hits with and without LOCO and with and without covariates. Once we build this support we can adapt our search tools.
+
+Obviously this won't fit the current PublishXRef format, so -- for now -- we will just mirror its contents:
+
+```
+{ "qtlreaper-hk":
+  {
+    [
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., }
+    ]
+  }
+}
+```
+
+To get compute going I am going to skip the above, because we can update the lmdb files later.
+The first fix is to add the trait name to the file names and the following record to lmdb:
+
+  "meta": {
+    "type": "gemma-wrapper",
+    "version": "0.99.7-pre1",
+    "population": "BXD",
+    "name": "BXDPublish",
+    "table": "PublishData",
+    "traitid": 10002, // aka PublishXrefId
+    "url": "https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish,
+    "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
+    "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-BXDPublish-10002-gemma-GWA.tar.xz",
+    "dataid": 8967044,
+    }
+
+This required modifications to gemma-wrapper.
+
+Running:
+
+```
+gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json
+gemma-wrapper --json --force --lmdb --population BXD --name BXDPublish --trait 10002 --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+```
+
+begets '66b8c19be87e9566358ce904682a56250eb05748-BXDPublish-10002-gemma-GWA.tar.xz'. When I check the metadata in the lmdb file it is set to:
+
+```
+"meta": {"type": "gemma-wrapper", "version": "1.00-pre1", "population": "BXD", "name": "BXDPublish", "trait": "10002", "geno_filename": "BXD.geno.txt", "geno_hash": "3b65ed252fa47270a3ea867409b0bdc5700ad6f6", "loco": true, "url": "https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish", "archive_GRM": "185eb08dc3897c7db5d7ea987170898035768f93-gemma-cXX.tar.xz", "archive_GWA":"66b8c19be87e9566358ce904682a56250eb05748-BXDPublish-10002-gemma-GWA.tar.xz", "table": "PublishData", "traitid": 10002, "dataid": 0}
+```
+
+which is good enough (for now). I may still add the dataid, but it requires a SQL call. Code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/49587523fc93bdcf0265da9da97f8d6d2a9e1008
+
+I should note that up to this point I would have had no advantage from AI programming. I know there are topics I'll work on where I may benefit, but this type of architecting, with very little code writing, does not really lend itself to AI. I certainly have the intention of using AI! For the next steps, unfortunately, there is still little to be gained. Where we'll probably gain is:
+
+- Using the RDF data store and documenting the endpoint(s)
+- Refactoring some of GN2's code to introduce lmdb
+- Deduplicating GN2/GN3 SQL code
+- Improving the REST API and writing documentation and tests
+- Analysing existing code bases, such as GEMMA itself
+
+Next step is getting the data churn going! After that we'll list all the hits which requires processing the lmdb output.
+
+Precompute of 13K traits has its first test run on balg01.
+
+It is going at 30 gemma runs per minute, so perhaps 8 hours for the full run if it keeps going. But I am hitting errors.
+
+After that we will digest hits from the precomputed vectors in lmdb.
+
+## Yesterday's tux02 crash
+
+All servers work on tux02 except for BNW.
+
+I tried to restart BNW, but it is giving an error, including the mystifying shepherd error (that I have as a sticker on my laptop):
+
+> 2025-08-11 01:13:41 error in finalization thread: Success
+
+It is on our end, so no need to ping Yan. I'll fix it when I have time (I did, see below).
+
+## Precompute
+
+To get precompute up and running I need to create the environment on balg01. The DB I updated a few days ago, so that should be fine.
+
+First we check out the guile webserver:
+
+```
+git clone tux02.genenetwork.org:/home/git/public/gn-guile gn-guile-8092
+```
+
+Now gn-guile is already running serving aliases, so we want to run this one as an internal endpoint for now, with something like:
+
+```
+unset GUIX_PROFILE
+. /usr/local/guix-profiles/guix-pull/etc/profile
+guix shell -L ~/guix-bioinformatics --container --network --file=guix.scm -- guile -L . --fresh-auto-compile -e main web/webserver.scm 8092
+```
+
+So, this renders:
+
+```
+curl http://127.0.0.1:8092/dataset/bxd-publish/values/10002.json
+{"BXD1":54.099998,"BXD2":50.099998,"BXD5":53.299999,"BXD6":55.099998
+```
+
+Next step is to set up gemma-wrapper. Now this failed because guix was not happy. We have been updating things these last weeks. Rather than trying to align with recent changes I could have rolled back to the version I am using on my desktop. But I decided not to let those bits rot and updated guix from
+
+guix describe Thu Mar 14 21:33:55 2024
+
+to
+
+guix describe Sun Aug 10 18:18:20 2025
+
+I should use a newer version first! Let's try:
+
+```
+guix pull --url=https://codeberg.org/guix/guix  -p ~/opt/guix-pull
+```
+
+(that took a while, so I took the opportunity to fix BNW -- turns out someone disabled BNW in shepherd by creating a systemd version that did not start properly).
+
+After the pull there were quite a few problems with gemma dependencies that needed fixing. First problem
+
+```
+guix package: warning: failed to load '(gn packages gemma)':
+In procedure abi-check: #<record-type <git-reference>>: record ABI mismatch; recompilation needed
+```
+
+required
+
+```
+find ~/.cache/guile -name "*.go" -delete
+```
+
+I also had to point guix-past to the new codeberg record! And now, magically, things started working.
+
+So, now I have an identical setup on my desktop and on the balg server. Next is to write a script that will batch run gemma-wrapper for every BXD PublishData ID. We created that list with jq earlier.
+
+```
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+```
+
+For every ID in that list we are going to fetch the trait values with
+
+```
+#!/usr/bin/env sh
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+# compute the (LOCO) GRMs once; they are reused for every trait below
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+  ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
+  ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+done
+```
+
+I hard copied the following files
+
+```
+BXD.geno.json
+BXD.geno.txt
+BXD.8_snps.txt
+```
+
+One thing I need to check is that the GRM is actually a constant. I forgot what GEMMA does.
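+
+A quick way to check is to compare the GRM files (or their hashes) from two trait runs; a minimal sketch with hypothetical file names:
+
+```
+import hashlib
+
+def sha1_of(path):
+    """Hash a file so GRMs from two runs can be compared cheaply."""
+    with open(path, "rb") as f:
+        return hashlib.sha1(f.read()).hexdigest()
+
+# GRMs produced by two different trait runs. If GEMMA drops individuals
+# with missing phenotypes before computing K, the hashes will differ
+# whenever the sample sets differ.
+print(sha1_of("trait1-cXX.txt") == sha1_of("trait2-cXX.txt"))
+```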
+
+We hit an error
+
+```
+/gnu/store/vvl1g1l0j19w39kry2xcsawvlhbyb87j-ruby-3.4.4/lib/ruby/3.4.0/json/common.rb:221:in 'JSON::Ext::Parser.parse':
+unexpected token at '' (JSON::ParserError)
+FATAL ERROR: gemma-wrapper bailed out with pid 340588 exit 20
+./bin/gemma-wrapper:494:in 'block (2 levels) in <main>'
+./bin/gemma-wrapper:479:in 'IO.open'
+./bin/gemma-wrapper:479:in 'block in <main>'
+./bin/gemma-wrapper:832:in '<main>'Precomputing 10137
+```
+
+The JSON file for 10136 is empty. Hmmm.
+
+I also see
+
+```
+WARNING: failed to update lmdb record with key b'\r\x02n\x7f\x10' -- probably a duplicate 13:40795920 (b'\r':40795920)
+```
+
+For the first error, the webserver actually stopped on `In procedure accept: Too many open files`. The problem looks similar to
+
+=> https://issues.guix.gnu.org/60226
+
+and Arun's patch
+
+=> https://cgit.git.savannah.gnu.org/cgit/guix/mumi.git/commit/?id=897967a84d3f51da2b1cc8c3ee942fd14f4c669b
+
+I raised ulimit, but may need to restart the webserver several times. We are computing though:
+
+```
+-rw-r--r-- 1 wrk wrk   82968 Aug 11 05:16 ab51d69f79601cfa7399feebca619ea1a71c1270-BXDPublish-10146-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   82772 Aug 11 05:16 e6739ace8ca4931fc51baa1844b3b5ceac592104-BXDPublish-10147-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   81848 Aug 11 05:16 60880fc7e8c86dffb17f28664e478204ea26f827-BXDPublish-10148-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   79336 Aug 11 05:16 c914d6221b004dec98d60e08c0fdf8791c09cb41-BXDPublish-10149-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   83536 Aug 11 05:16 3d72b19730edab29bdc593cb6a1a86dd789d351f-BXDPublish-10150-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   69060 Aug 11 05:16 0e965f1778425071a5497d0fe69f2dc2e534ef60-BXDPublish-10151-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   69072 Aug 11 05:16 4de26e62a75727bc7edd6b266dfcd7753d185f1a-BXDPublish-10152-gemma-GWA.tar.xz
+(...)
+```
+
+There are some scarily small datasets:
+
+```
+GET /dataset/bxd-publish/values/10198.json
+;;; ("8967240")
+
+;;; ((("C57BL/6J" . 1.62) ("BXD1" . 2.37) ("BXD5" . 2.73) ("BXD9" . 3.52) ("BXD11" . 0.18) ("BXD12" . 3.69) ("BXD16" . 0.29) ("BXD21" . 2.34) ("BXD27" . 3.38) ("BXD32" . 0.24)))
+```
+
+i.e. https://genenetwork.org/show_trait?trait_id=10198&dataset=BXDPublish
+
+Not sure we should be running GEMMA on those!
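+
+A guard in the batch loop could skip such traits; a minimal sketch (the threshold is an assumption, not an established cut-off):
+
+```
+import json
+import sys
+
+MIN_SAMPLES = 10  # assumed cut-off, to be decided
+
+with open("pheno.json") as f:
+    values = json.load(f)
+
+if len(values) < MIN_SAMPLES:
+    print(f"Skipping trait with only {len(values)} individuals", file=sys.stderr)
+    sys.exit(0)
+```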
+
+
+The computation initially stopped at 70% (we are now at 98%).
+
+To get past 70% I ran the webserver without fibers, as suggested by Arun's patch:
+
+=> https://cgit.git.savannah.gnu.org/cgit/guix/mumi.git/commit/?id=897967a84d3f51da2b1cc8c3ee942fd14f4c669b
+
+We were getting errors like `In procedure accept: Too many open files` with GET /dataset/bxd-publish/values/23486.json.
+
+After removing fibers, precompute just continued where it left off. As it should. The fix is:
+
+=> https://git.genenetwork.org/gn-guile/commit/?id=289da2e13e07928cdb8a1d165483a3a3cd9ae1c6
+
+Now that is running I want to make sure I can point back to metadata and perhaps fetch some information to enrich our lmdb files for further processing.
+
+Next we capture some metadata:
+
+```
+MariaDB [db_webqtl]> select PhenotypeId, Locus, DataId, Phenotype.Post_publication_description from PublishXRef, Phenotype where PublishXRef.PhenotypeId = Phenotype.Id and InbredSetId=1 limit 5;
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+| PhenotypeId | Locus          | DataId  | Post_publication_description                                                                                               |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+|           4 | rs48756159     | 8967043 | Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]                       |
+|          10 | rsm10000005699 | 8967044 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]                 |
+|          15 | rsm10000013713 | 8967045 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] |
+|          20 | rs48756159     | 8967046 | Central nervous system, morphology: Cerebellum volume [mm3]                                                                |
+|          25 | rsm10000005699 | 8967047 | Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3]                       |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+```
+
+The qtlreaper hits are also of interest. Note Bonz has brilliantly captured this in RDF, see
+
+=> https://github.com/genenetwork/gn-docs/blob/master/rdf-documentation/phenotype-metadata.md
+
+which is parseable by machines(!). Let's try to use RDF first. The query:
+
+```
+SELECT * WHERE {
+    <http://genenetwork.org/id/traitBxd_10002> ?p ?o .
+}
+```
+
+renders
+
+```
+"http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://genenetwork.org/category/Phenotype"
+"http://genenetwork.org/term/belongsToGroup","http://genenetwork.org/id/setBxd"
+"http://www.w3.org/2004/02/skos/core#altLabel","BXD_10002"
+"http://purl.org/dc/terms/description","Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]"
+"http://genenetwork.org/term/abbreviation","ADJCBLWT"
+"http://genenetwork.org/term/additive",2.08179
+"http://genenetwork.org/term/locus","http://genenetwork.org/id/Rsm10000005699"
+"http://genenetwork.org/term/lodScore",4.77938
+"http://genenetwork.org/term/mean",52.2206
+"http://genenetwork.org/term/sequence",1
+"http://genenetwork.org/term/submitter","robwilliams"
+"http://genenetwork.org/term/traitId","10002"
+"http://purl.org/dc/terms/isReferencedBy","http://rdf.ncbi.nlm.nih.gov/pubmed/11438585"
+```
+
+which covers pretty much what we need. Note that this is coming from our public endpoint and can be used to instruct AI agents(!)
+
+Now we want to fetch these values for all these traitBxd (yes, we need to fix some naming) with a single query:
+
+```
+SELECT count(*) WHERE {
+    ?s gnt:belongsToGroup gn:setBxd.
+} limit 5
+```
+
+returns 14039 traits. Good! Let's get all properties
+
+```
+
+SELECT * WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs;
+         dct:description ?descr.
+} limit 50
+```
+
+[Try](https://sparql.genenetwork.org/sparql?default-graph-uri=&query=%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E+%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E+%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E+%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E+%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E+%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E+%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E+%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E+%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E+%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E+%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E+%0D%0A%0D%0A%0D%0A%0D%0ASELECT+*+WHERE+%7B%0D%0A++++%3Fs+gnt%3AbelongsToGroup+gn%3AsetBxd%3B%0D%0A+++++++++gnt%3AtraitId+%3Fid%3B%0D%0A+++++++++gnt%3Alocus+%3Flocus%3B%0D%0A+++++++++%23+gnt%3Achr+%3Fchr%3B%0D%0A+++++++++%23+gnt%3Apos+%3Fpos%3B%0D%0A+++++++++gnt%3AlodScore+%3Flrs%3B%0D%0A+++++++++dct%3Adescription+%3Fdescr.%0D%0A%7D+limit+50&format=text%2Fhtml&timeout=0&signal_void=on)
+
+If we want to get the chr+location we can query one:
+
+```
+SELECT * WHERE {
+gn:Rs47436964 ?p ?o.
+}
+```
+
+renders
+
+```
+http://www.w3.org/2000/01/rdf-schema#label 	"rs47436964"
+chr "12"
+mb 	65.0498
+```
+
+Now the label is not so interesting, so in one query we can do:
+
+```
+SELECT ?id ?lod ?chr ?mb ?descr WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lod;
+         dct:description ?descr.
+    ?locus gnt:chr ?chr;
+         gnt:mb ?mb.
+} order by desc(?lod) limit 50
+```
+
+which gets, for example, a massive reaper HK QTL at
+
+```
+"21588" 34.558 "12" 116.67 "Cofactor, genetics, genomics: Structural variants SVs on chromosome 12, raw uncorrected sum of calls using LongRanger on linked-read sequencing data [n]"
+```
+
+The description of the phenotype is unfortunate. I think it is a synthetic QTL. The title is "SVs_Chr12". Luckily most traits give more of an idea of what they are about.
+
+[SPARQL](https://sparql.genenetwork.org/sparql?default-graph-uri=&query=%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E+%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E+%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E+%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E+%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E+%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E+%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E+%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E+%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E+%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E+%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E+%0D%0A%0D%0A%0D%0A%0D%0ASELECT+%3Fid+%3Flrs+%3Fchr+%3Fmb+%3Fdescr+WHERE+%7B%0D%0A++++%3Fs+gnt%3AbelongsToGroup+gn%3AsetBxd%3B%0D%0A+++++++++gnt%3AtraitId+%3Fid%3B%0D%0A+++++++++gnt%3Alocus+%3Flocus%3B%0D%0A+++++++++gnt%3AlodScore+%3Flrs%3B%0D%0A+++++++++dct%3Adescription+%3Fdescr.%0D%0A++++%3Flocus+gnt%3Achr+%3Fchr%3B%0D%0A+++++++++++++++gnt%3Amb+%3Fmb.%0D%0A%7D+order+by+desc%28%3Flrs%29+limit+50&format=text%2Fhtml&timeout=0&signal_void=on)
+
+Running this query on all 13K traits takes just a second! I'll share the resulting 3Mb TSV. Note that there is no code necessary to get to this point! Just SPARQL queries on a public endpoint.
+
+Now, what we want to do is take these results and combine them with the full vector data stored in lmdb.
+The first thing we can do is list the top hit from every trait and combine that with the above data. That way we can quickly assess which trait hits will change using GEMMA instead of HK reaper. One thing to note is the formula LRS/4.6=LOD. The GN2 interface shows LRS.
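+
+In code that conversion is simply (2 ln 10 is about 4.61):
+
+```
+import math
+
+def lrs_to_lod(lrs):
+    """LOD = LRS / (2 ln 10), i.e. roughly LRS / 4.61."""
+    return lrs / (2 * math.log(10))
+
+print(round(lrs_to_lod(18.4), 2))  # -> 4.0
+```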
+
+Meanwhile I am waiting for precompute. Most of it is done, but some interesting errors:
+
+```
+Precomputing 20484
+;;; ("41012208")
+SQL Connection ERROR! file not found
+```
+
+especially since it appears this is a cache hit. OK, I'll check tomorrow. For now we have 12837 completed vectors!
+After some reruns we have 13491 vectors, i.e. 98% of BXD PublishData.
+
+Some remaining problems:
+
+```
+Executing: parallel --results /tmp/test --joblog /tmp/test/79d6dbd2fbd55b159c35d903ba10d9cab14f7816-parallel.log < /tmp/test/parallel-commands.txt
+Command exited with non-zero status 20
+```
+
+For these traits the values are all 1.0:
+
+```
+BXD1    1.0
+BXD2    1.0
+BXD5    1.0
+BXD6    1.0
+BXD8    1.0
+BXD9    1.0
+BXD11   1.0
+BXD12   1.0
+BXD13   1.0
+BXD14   1.0
+BXD15   1.0
+BXD16   1.0
+BXD18   1.0
+BXD19   1.0
+```
+
+We'll look into those later.
+
+Next step is to collect all the highest hits and we can do that with
+
+```
+./bin/view-gemma-mdb --sort tmp/tmp/9179b...923f181-gemma-GWA.mdb --anno BXD.8_snps.txt |head -2
+Reading tmp/tmp/9179b192fc1c19142d97607b64c04bf5a923f181-gemma-GWA.mdb...
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+10,125580028,rsm10000007478,0.655,0.014,0.0134,100000.0,0.0005,3.34
+```
+
+That is great, but now we need to put the data in a place where we can analyse it -- and the difference with qtlreaper. We could do a one-off using some tabular format, but that would mean redoing things later to get it into SQL and/or present it some other way. So, basically, we need a flexible storage format that allows us to query things -- without predicting how people want to use that data and -- importantly -- lets machines do it. Here comes RDF as the solution. As Mark Wilkinson has it: in my lab we only do RDF. No hacks (please).
+
+So, let's adapt the output of view-gemma-mdb and convert that to RDF. Bonz has done many such exercises in
+
+=> https://git.genenetwork.org/gn-transform-databases/tree/
+
+e.g. for the earlier phenotypes RDF+SPARQL we used to get the reaper values
+
+=> https://git.genenetwork.org/gn-transform-databases/tree/examples/phenotype.scm
+
+In this code SQL queries are embedded. I would argue these need to be replaced with REST API calls. But hey.
+
+The first step is to have ./bin/view-gemma-mdb include the ID and some other metadata as fields -- metadata we so thoughtfully included in the mdb file. This results in:
+
+```
+Reading /tmp/tmphvi6grqm/2b8e7c7cfe98f7e44bb2f07f057cc1adedf29c38-gemma-GWA.mdb...
+name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+BXDPublish,22200,1,4858261,rsm10000000111,0.5,0.0246,0.0537,100000.0,0.0192,1.72
+BXDPublish,22200,1,182581091,rsm10000000451,0.548,-0.009,0.0537,100000.0,0.139,0.86
+BXDPublish,22200,1,182635325,rsm10000000452,0.548,-0.009,0.0537,100000.0,0.139,0.86
+```
+
+Now remember the HK reaper data is already in RDF. If we push this data in we should be able to query the combined datasets. Let's convert this to RDF that looks like:
+
+```
+gn:GEMMAMappedLOCO_22200 a gnt:mappedTrait;
+                         rdfs:label "GEMMA trait 22200 mapped with LOCO (defaults)";
+                         gnt:LOCO true;
+                         gnt:belongsToGroup gn:setBxd;
+                         gnt:traitId "22200";
+                         skos:altLabel "BXD_22200";
+                         gnt:locus gn:rsm10000000111;
+                         gnt:lodScore 1.72;
+                         gnt:af 0.5;
+                         gnt:effect 0.0246.
+```
+
+If the marker is not yet defined we can add:
+
+```
+gn:rsm10000000111        a gnt:marker;
+                         rdfs:label "rsm10000000111";
+                         gnt:chr  "1";
+                         gnt:mb   4.858261;
+                         gnt:pos  4858261.
+```
+
+This means we can pivot on the trait id between reaper and gemma results. It will also be easy to store multiple GEMMA hits. I note that GEMMA does not store the mean value. We can fetch that from the trait values.
+
+Rob wrote:
+
+> We will want to harvest the sample size for each trait. That will be a critical parameter for filtering. Knowing the skew and kurtosis also highly valuable in filtering and diagnostics. Many users forget to log their data and this introduces serious problems since you have a tail of outliers. Obviously a dumb mistake to have traits with all values of 1. Perhaps you can assign the task of fixing/deleting that traits to Arthur and me. Just send a list.
+
+I'll make a list to send to Arthur and you -- it is one of my tasks. With regard to trait info we should compute that as metadata when doing the precompute (as we have the trait values at that point!). I have added that to the task list.
+
+=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-publishdata
+
+We'll do a rerun with this data soon, as it only took a day.
+
+Alright, I am keen to move forward on our precompute, because this is the fun phase. Getting the metadata in place should be easy now that we are on RDF. First we are going to simply mirror PublishXRef information for HK reaper and GEMMA runs. Reaper is already in RDF (mostly), so let's add some functionality to gemma-wrapper.
+
+The viewer for 1e59d19a679359516ecd97cf20375c80e987ee3e-BXDPublish-22282-gemma-GWA.tar.xz gives:
+
+```
+name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+BXDPublish,22282,5,110385941,rs29780222,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+BXDPublish,22282,5,110421808,rsm10000002804,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+BXDPublish,22282,5,110479038,rsm10000002805,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+BXDPublish,22282,5,110515858,rs33083878,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+```
+
+Note that the sorting is arbitrary because the -logP values are identical! My take is that we should include all hits (read: SNP names) for comparison with HK reaper. We will be able to parse range locations, so we can check 50K base pairs up- and downstream too.
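+
+For the up/downstream check, a minimal sketch (positions in base pairs; the data structures are assumptions, not an existing API):
+
+```
+WINDOW = 50_000  # base pairs up- and downstream
+
+def reaper_hit_near_gemma(reaper_snp, gemma_snps):
+    """True if any GEMMA hit on the same chromosome lies within WINDOW bp
+    of the reaper SNP. SNPs are (chr, pos) tuples."""
+    chrom, pos = reaper_snp
+    return any(c == chrom and abs(p - pos) <= WINDOW for c, p in gemma_snps)
+
+# Positions taken from the table above:
+print(reaper_hit_near_gemma(("5", 110385941), [("5", 110421808)]))  # True
+```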
+
+For SNPs we should look at using existing URIs instead of inventing new ones. I'll make a note of that too (to move forward). Looking at the first hit, rs29780222, some googling finds https://www.informatics.jax.org/marker/MGI:1925270. I need to check what the GN database knows about it. Adding a BED file to RDF makes sense. Yet another task to add.
+
+OK, back to focussing on generating RDF with what we have now. A first attempt is
+
+```
+gn:GEMMAMapped_LOCO_e987ee3e_BXDPublish_22282_gemma_GWA a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 22282 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_22282;
+      gnt:loco true;
+      gnt:time "2025/08/11 10:15";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "22282";
+      skos:altLabel "BXD_22282";
+      gnt:locus gn:rs29780222;
+      gnt:lodScore 4.51;
+      gnt:af 0.484;
+      gnt:effect -0.08.
+```
+
+which looks nice already. We want to support more SNPs, however, so we split those up, and now this dataset shows 84 SNPs at a logP cut-off of 4.0. We'll improve on that later (and will use precompute to estimate levels for the BXD). We always show the single highest score, no matter what. The cool thing is that we have *all* peaks now in RDF and we can query that:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 22282 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_22282;
+      gnt:loco true;
+      gnt:time "2025/08/11 10:15";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "22282";
+      skos:altLabel "BXD_22282".
+gn:rs29780222_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rs29780222;
+      gnt:lodScore 4.51;
+      gnt:af 0.484;
+      gnt:effect -0.08.
+gn:rsm10000002804_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rsm10000002804;
+      gnt:lodScore 4.51;
+      gnt:af 0.484;
+      gnt:effect -0.08.
+(...)
+gn:rs33400361_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rs33400361;
+      gnt:lodScore 4.07;
+      gnt:af 0.452;
+      gnt:effect -0.078.
+gn:rsm10000002851_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rsm10000002851;
+      gnt:lodScore 4.07;
+      gnt:af 0.452;
+      gnt:effect -0.078.
+```
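+
+A sketch of what the emitter does per row (the real converter, gemma-mdb-to-rdf.rb linked below, is in Ruby; this Python just mirrors the shape of the turtle above):
+
+```
+def locus_to_turtle(dataset, traitid, marker, af, beta, lod, run_id):
+    """Emit one gnt:mappedLocus block, mirroring the turtle above. run_id
+    is the short hash that distinguishes runs (e.g. 'e987ee3e')."""
+    trait = f"gn:GEMMAMapped_LOCO_{dataset}_{traitid}_gemma_GWA_{run_id}"
+    subject = f"gn:{marker}_{dataset}_{traitid}_gemma_GWA_{run_id}"
+    return (f"{subject} a gnt:mappedLocus;\n"
+            f"      gnt:mappedSnp {trait};\n"
+            f"      gnt:locus gn:{marker};\n"
+            f"      gnt:lodScore {lod};\n"
+            f"      gnt:af {af};\n"
+            f"      gnt:effect {beta}.\n")
+
+print(locus_to_turtle("BXDPublish", 22282, "rs29780222", 0.484, -0.08, 4.51, "e987ee3e"))
+```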
+
+Next step is to use rapper to see if this is valid RDF.
+
+```
+rapper --input turtle test.ttl
+```
+
+For this one trait: rapper: Parsing returned 513 triples. It may look like a lot of data, but RDF stores are pretty good at creating small enough representations. All identifiers are stored once as a string and referenced by 64-bit pointers.
+
+For the locus I notice Bonz capitalized the SNP identifiers. We don't want that. But I'll stick it in for now. The code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-mdb-to-rdf.rb
+
+Basically we run
+
+```
+rm test.rdf
+for x in tmp/*.xz ; do
+    env GEM_PATH=tmp/ruby GEM_HOME=tmp/ruby ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.rdf
+done
+```
+
+for the 98% of BXD PublishData; that rendered 1512885 triples. It needs some minor fixes, such as a LOD of infinity and the use of '?' for an unknown locus.
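+
+Those fixes are straightforward; a sketch of the guards (the cap value is an assumption):
+
+```
+import math
+
+def lod_literal(lod, cap=300.0):
+    """Clamp infinite or NaN LOD scores to a finite cap so the emitted
+    turtle stays parseable."""
+    if math.isinf(lod) or math.isnan(lod):
+        return cap
+    return round(lod, 2)
+
+def locus_uri(name):
+    """Return None for the unknown locus '?' so the caller can skip the
+    triple instead of minting a bogus gn: identifier."""
+    return None if name == "?" else f"gn:{name}"
+```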
+
+To load the file on production:
+
+```
+guix shell -C -N virtuoso-ose -- isql
+# or
+/gnu/store/9d81kdw2frn6b3fwqphsmkssc9zblir1-virtuoso-ose-7.2.11/bin/isql -u dba -P "*" -S 8981
+OpenLink Virtuoso Interactive SQL (Virtuoso)
+Version 07.20.3238 as of Jan  1 1970
+Type HELP; for help and EXIT; to exit.
+Connected to OpenLink Virtuoso
+Driver: 07.20.3238 OpenLink Virtuoso ODBC Driver
+ld_dir("/home/wrk/","test.ttl","http://pjotr.genenetwork.org")
+SQL> rdf_loader_run();
+Done. -- 13 msec.
+SQL> checkpoint;
+Done. -- 243 msec.
+SQL>
+```
+
+But it doesn't show. Same for:
+
+```
+root@tux04:/export/guix-containers/genenetwork/data/virtuoso/ttl# curl --digest -v --user 'dba:*' --url "http://localhost:8982/sparql-graph-crud-auth?graph=http://pjotr.genenetwork.org" -T test.ttl
+```
+
+
+I tried to upload to production, but this crashed the virtuoso server :/.
+So I built a new virtuoso instance using gn-machines:
+
+=> https://git.genenetwork.org/gn-machines/commit/?id=90fa4fdacffe26c57649cb0515d0679ca19c27cc
+
+Now we can run isql locally as
+
+```
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+
+SQL> ld_dir('/export/data/virtuoso/ttl','test.n3','http://pjotr.genenetwork.org');
+Done. -- 3 msec.
+# for testing the validity and optional delete problematic ones:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+# commit changes
+SQL> rdf_loader_run ();
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o };
+15
+```
+
+If an error exists all uploads will be blocked unless DB.DBA.LOAD_LIST is emptied (DELETE).
+An error may look like:
+
+```
+ERROR  : Character data are not allowed here by XML structure rules
+at line 2 column 3 of source text
+@prefix dct: <http://purl.org/dc/terms/> .
+```
+
+I don't know why, but only n3 triples appeared to work. The full manual is here:
+
+=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Virtuoso bulk uploader
+
+## Fixing hanging virtuoso on production
+
+Going back to production I cleaned up the DB.DBA.LOAD_LIST as described above. Running isql can be done outside the container:
+
+```
+guix shell virtuoso-ose -- isql 8981
+SQL> DELETE from DB.DBA.LOAD_LIST;
+SQL> checkpoint;
+```
+
+SPARQL queries inside isql are fast:
+
+```
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o };
+1206882
+SQL> SPARQL SELECT count(*) FROM <http://genenetwork.org> WHERE { ?s ?p ?o };
+46982542
+```
+
+The web socket is not connected. This does not respond:
+
+```
+curl http://localhost:8982/sparql/
+```
+
+herd stop/start of virtuoso made no difference. Nor did nginx or nscd. Hmm. After restarting the full container it starts up at:
+
+```
+root@tux04:/export/guix-containers/genenetwork/var/log# tail virtuoso.log
+  2025-08-17 07:47:07 07:47:07 HTTP server online at localhost:9893
+  2025-08-17 07:47:07 07:47:07 Server online at localhost:9892 (pid 43)
+curl localhost:9893/sparql
+```
+
+Aha, the domain is pointing to the wrong virtuoso server... I modified nginx on tux04 and, at least, we have SPARQL running on http. For https nginx is pointing to https://127.0.0.1:8993. Hmmm. That is not the same as what the logs tell me. Looks like there is still some problem with the production container. Well, we can solve that later.
+
+I'll first run virtuoso on a server. Starting from a guix from half a year ago:
+
+```
+. /usr/local/guix-profiles/guix-pull-3-link/etc/profile
+cd ~/gn-machines
+./virtuoso-deploy.sh
+curl localhost:8892/sparql/
+```
+
+Configure nginx to listen
+
+```
+server {
+  server_name sparql-test.genenetwork.org;
+  listen 80;
+  access_log /var/log/nginx/sparql-test-access.log;
+  error_log /var/log/nginx/sparql-test-error.log;
+  location / {
+    proxy_pass http://localhost:8892;
+    proxy_set_header Host $host;
+  }
+}
+```
+
+I added a DNS entry and we should be able to see:
+
+=> http://sparql-test.genenetwork.org/sparql/
+
+Now I need to load the important data into this SPARQL server. On tux02 I find a recent set:
+
+```
+     4096 Dec  5  2024 wip
+   260886 Jul 21 19:57 schema.ttl
+443454617 Jul 21 19:57 generif-old.ttl
+    44902 Jul 21 19:57 classification.ttl
+339900838 Jul 21 19:58 genelist.ttl
+ 42509383 Jul 21 19:58 genbank.ttl
+152936953 Jul 21 19:58 genotype.ttl
+  1460511 Jul 21 19:58 dataset-metadata.ttl
+700627810 Jul 21 19:58 generif.ttl
+ 10491221 Jul 21 19:58 strains.ttl
+     1388 Jul 21 19:58 species.ttl
+ 23495986 Jul 21 19:58 publication.ttl
+    16879 Jul 21 19:58 tissue.ttl
+ 18537935 Jul 21 19:58 phenotype.ttl
+root@tux02:/export/data/genenetwork-virtuoso# du -sh .
+1.7G    .
+```
+
+Which is about 2Gb uncompressed. Not bad. To load the ttl files I have to move them into
+/export/guix-containers/virtuoso/data/virtuoso/ttl.
+
+```
+guix shell virtuoso-ose -- isql 8891 exec="ld_dir('/export/data/virtuoso/ttl','*.ttl','http://genenetwork.org');"
+guix shell virtuoso-ose -- isql 8891 exec="rdf_loader_run();"
+```
+
+That takes a few minutes for 29746544 triples. Not bad at all!
+
+```
+guix shell virtuoso-ose -- isql 8891 exec="SELECT * FROM DB.DBA.load_list;"
+guix shell virtuoso-ose -- isql 8891 exec="checkpoint;"
+```
+
+Let's list all the tissues we have with
+
+```
+SELECT * WHERE {
+  ?s rdf:type gnc:tissue .
+  ?s rdfs:label ?o .
+}
+"http://genenetwork.org/id/tissueA1c" "Primary Auditory (A1) Cortex mRNA"
+"http://genenetwork.org/id/tissueAcc" "Anterior Cingulate Cortex mRNA"
+"http://genenetwork.org/id/tissueAdr" "Adrenal Gland mRNA"
+"http://genenetwork.org/id/tissueAmg" "Amygdala mRNA"
+"http://genenetwork.org/id/tissueBebv"  "Lymphoblast B-cell mRNA"
+"http://genenetwork.org/id/tissueBla" "Bladder mRNA"
+(...)
+```
+
+=> http://sparql-test.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E%0D%0A%0D%0ASELECT+*+WHERE+%7B%0D%0A%3Fs+rdf%3Atype+gnc%3Atissue+.%0D%0A%3Fs+rdfs%3Alabel+%3Fo+.%0D%0A%7D%0D%0A&format=text%2Fhtml&timeout=0&signal_void=on Try it!
+
+## Getting to our first PublishData queries
+
+Next we need to upload our fresh PublishData RDF. We generated that with:
+
+```
+rm test.rdf ; for x in tmp/*.xz ; do ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.ttl; done
+```
+
+Takes 10 minutes. rapper still returns an error for 'gnt:lodScore Infinity;'. I'll fix that down the line.
+
+Put test.ttl in /export/guix-containers/virtuoso/data/virtuoso/ttl and use the isql commands to update virtuoso. I use a separate graph named 'http://pjotr.genenetwork.org' so we can easily delete the triples.
+
+```
+guix shell virtuoso-ose -- isql 8891 exec="ld_dir('/export/data/virtuoso/ttl','test.ttl','http://pjotr.genenetwork.org'); rdf_loader_run();"
+```
+
+OK, we have the data together. Time for our first queries. Interesting questions are:
+
+* How many hits do we have for qtlreaper and how many for gemma in total
+* How many hits do we have for qtlreaper and how many for gemma that have a hit of 4.0 or higher
+* How many of these hits for qtlreaper differ from those of gemma
+* What datasets have been mapped in qtlreaper, but not in gemma
+
+### How many hits do we have for qtlreaper and how many for gemma in total
+
+Remember we had this query for reaper:
+
+```
+SELECT * WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs;
+         dct:description ?descr.
+} limit 5
+"http://genenetwork.org/id/traitBxd_10001","10001","http://genenetwork.org/id/Rs48756159",2.93169,"Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]"
+"http://genenetwork.org/id/traitBxd_10002","10002","http://genenetwork.org/id/Rsm10000005699",4.77938,"Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]"
+"http://genenetwork.org/id/traitBxd_10003","10003","http://genenetwork.org/id/Rsm10000013713",3.38682,"Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg]"
+"http://genenetwork.org/id/traitBxd_10004","10004","http://genenetwork.org/id/Rs48756159",2.56076,"Central nervous system, morphology: Cerebellum volume [mm3]"
+"http://genenetwork.org/id/traitBxd_10005","10005","http://genenetwork.org/id/Rsm10000005699",5.02907,"Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3]"
+```
+
+we can run a similar query for GEMMA results with trait id "10001" and locus names.
+
+```
+SELECT * WHERE {
+    ?s gnt:mappedSnp ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs.
+    filter(?lrs > 4.0).
+} limit 5
+```
+
+To count the datasets for GEMMA:
+
+```
+SELECT count(*) WHERE {
+  ?id gnt:name "BXDPublish" .
+} limit 5
+```
+
+To count the total number of hits we have 13576 reaper hits and 231911 GEMMA hits. For GEMMA we have 13491 uniquely mapped datasets.
+
+### Count hits that are significant
+
+For GEMMA 223232 hits are 4.0 or higher. For reaper we count 1098. Almost all reaper values are between 2.0 and 4.0. When we count GEMMA below 4.0 we get 8679 datasets - and that makes sense, because for gemma we list all SNPs that score over 4.0, while for the datasets below that we only list the highest SNP. In both cases the majority of traits are below our threshold.
+
+### Start looking at the difference
+
+For every reaper SNP 'locus' we want to find the GEMMA sets that contain that particular SNP. In other words, those are the hits that GEMMA found that compare with qtlreaper. We pivot on SNP ?locus and ?traitid.
+
+```
+SELECT count(*) WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?gemma gnt:mappedSnp ?id2;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs2.
+    ?id2 gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    filter(?lrs2 >= 4.0).
+} limit 5
+```
+
+Now we find 4222 overlapping traits, of which 2924 have a gemma lod score >= 4.0, and reaper has 892 > 4.0 (out of 1098). That implies that some 200 significant reaper scores find (completely) different SNPs in GEMMA.
+
+The next step is to list these differences. That is a reverse query. In plain English it should be something like:
+
+> List all sets where reaper has a SNP (r_snp) that does not appear in its GEMMA computation (g_snps).
+
+This is rather hard to do in SPARQL. We can make a list, however, of the overlapping traits with a lod score > 4.0 with:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT ?traitid WHERE {
+   # --- get the reaper SNPs
+    ?r_trait gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?snp.
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:locus ?snp;
+         gnt:lodScore ?g_lrs.
+    filter(?g_lrs >= 4.0).
+} limit 5
+```
+
+Resulting in 2925 overlapping results. For example, it lists trait
+
+=> https://genenetwork.org/show_trait?trait_id=12014&dataset=BXDPublish
+
+where both reaper and gemma show a top hit for rs13478947.
+
+We can also count the distinct traits where the reaper locus is missing from the significant GEMMA hits:
+
+```
+SELECT count(distinct ?traitid) WHERE {
+   # --- get the reaper SNPs
+    ?r_trait gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?snp.
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:lodScore ?g_lrs.
+    MINUS { ?g_snp gnt:locus ?snp . }
+    filter(?g_lrs >= 4.0).
+}
+```
+
+Now we can make a second list for all gemma results where g_lrs > 4.0. The difference is our set.
+
+```
+SELECT DISTINCT ?traitid WHERE {
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:locus ?snp;
+         gnt:lodScore ?g_lrs.
+    filter(?g_lrs >= 4.0).
+}
+```
+
+One example is trait 23777, where reaper has rsm10000008413; gemma ranks that SNP too, but with an LRS of 3.44 it is below the threshold. That does not make such a strong case, because both results are on Chr11 and not too far from each other (58 vs 73 Mb). Still, it may be a difference of interest. GEMMA's main hit rs13480386 is also ranked by reaper (in GN2).
+I think we need to refine our method. Peaks on Chr9 and 15 are also of interest.
+
+See
+
+=> https://genenetwork.org/show_trait?trait_id=23777&dataset=BXDPublish
+
+Another trait, 14905, shows a whopper on Chr4 with gemma and one on Chr8 with reaper.
+This is rather a good example. To improve the power of our search I think I should extend the GEMMA results with all hits above 3.0. That greatly increases the chance that a reaper marker is seen. To do an even better job we should run a reaper precompute and also store the highest ranked markers (rather than one single hit). That way we get a true picture of the overlap and differences. While we are at it, we should store the trait values with the sample size etc.
+
+But first let's try finding those that differ on chromosome hits:
+
+Hmmm, the following is not working quite right because it shows all the differences, with 200K results. I tried:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT DISTINCT ?traitid ?chr1 ?chr2 ?url ?descr WHERE {
+   # --- get the reaper SNPs
+    ?r_trait gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?snp ;
+         dct:description ?descr.
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:lodScore ?g_lrs ;
+         gnt:locus ?snp2 .
+    # --- get Chr positions of both snps
+    ?snp gnt:chr ?chr1 .
+    ?snp2 gnt:chr ?chr2 .
+    MINUS { ?g_snp gnt:locus ?snp . }
+    filter(?g_lrs >= 4.0).
+    filter(?chr2 != ?chr1) .
+    BIND(REPLACE(?traitid, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url)
+} LIMIT 15
+```
+
+What I am trying is set analysis, and SPARQL is so powerful that you actually try it there, but it is far simpler to do in any programming language. I tooted about this rediscovery:
+
+=> https://genomic.social/@pjotrprins@mastodon.social/115059451578588805
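+
+For the record, the same set analysis is a few lines of Python, assuming two TSV exports (traitid, locus, LOD) from the SPARQL endpoint (file names and column layout are hypothetical):
+
+```
+import csv
+from collections import defaultdict
+
+# Reaper top hits: traitid -> locus (one row per trait).
+reaper = {}
+with open("reaper-hits.tsv") as f:
+    for traitid, locus, lod in csv.reader(f, delimiter="\t"):
+        reaper[traitid] = locus
+
+# GEMMA hits: traitid -> set of loci scoring LOD >= 4.0.
+gemma = defaultdict(set)
+with open("gemma-hits.tsv") as f:
+    for traitid, locus, lod in csv.reader(f, delimiter="\t"):
+        if float(lod) >= 4.0:
+            gemma[traitid].add(locus)
+
+# Traits where gemma is significant but misses the reaper locus.
+differ = [t for t, locus in reaper.items()
+          if gemma.get(t) and locus not in gemma[t]]
+print(len(differ))
+```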
+
+I created a list for Rob using some simple shell commands, so he can see what the challenge is. I wrote:
+
+> Attached a list of traits that show a reaper SNP that is not significant (LOD 4.0) for GEMMA and still show a significant hit for GEMMA. You can test run them on GN2 and see that the story is ambiguous. To do a proper job we should store more hits for GEMMA (say from LOD 3.0) and do a precompute exercise with reaper storing all top hits. That way we can probably do better and even get a list for Claude.
+
+So, rerunning GEMMA and reaper are on the books. While we are at it we can adapt reruns for
+
+* qnormalized data
+* auto winsorizing
+* sex covariate
+* run gemma without LOCO
+* cis covariate, using the current hit and recomputing with that as a covariate
+* epistatic covariates
+
+and that should all be reasonably easy for the 13K traits.
+
+## More metadata
+
+But first we set up a new run with more metadata. In the lmdb files we should add the trait values, the mean, SE, skew, kurtosis, and any DOIs.
+
+gemma-wrapper can take trait values as produced by our gn-guile endpoint (in .json). The first step is to add these values to the metadata. The existing permutate switch takes a pheno file and outputs it during a run. We can use that to pass in the pheno file.
+
+
+Next we should write out the GEMMA phenotypes to make sure they align. We have essentially moved the functionality from gn-pheno-to-gemma.rb into gemma-wrapper, so we need to pass in the geno information too.
+
+The command becomes
+
+```
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes 10002-pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+```
+
+We now store the trait values into the metadata and they go into lmdb!
+
+```
+  "meta": {
+    "type": "gemma-wrapper",
+    "version": "1.00-pre1",
+    "population": "BXD",
+    "name": "BXDPublish",
+    "trait": "1",
+    "geno_filename": "BXD.geno.txt",
+    "geno_hash": "3b65ed252fa47270a3ea867409b0bdc5700ad6f6",
+    "loco": true,
+    "url": "https://genenetwork.org/show_trait?trait_id=1&dataset=BXDPublish",
+    "archive_GRM": "185eb08dc3897c7db5d7ea987170898035768f93-gemma-cXX.tar.xz",
+    "archive_GWA": "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-1-gemma-GWA.tar.xz",
+    "trait_values": {
+      "BXD1": 54.099998,
+      "BXD2": 50.099998,
+      "BXD5": 53.299999,
+...
+```
+
+Commit is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/9ad5f762823031da08fc51c2a6adae983e6e8314
+
+Now, gemma2lmdb is written in Python, so we can make use of SciPy functions on the trait values.
+
+So, for example, we can compute:
+
+```
+mean= 52.22058749999999  std= 2.968538937833582  kurtosis= 0.03143766680654192  skew= -0.1315270039489698
+for
+[54.099998, 50.099998, 53.299999, 55.099998, 57.299999, 51.200001, 53.599998, 46.799999, 50.599998, 49.299999, 45.700001, 52.5, 52.0, 51.099998, 52.400002, 49.0, 51.599998, 50.700001, 55.5, 52.599998, 53.099998, 53.5, 53.200001, 58.700001, 50.799999, 53.299999, 51.900002, 54.099998, 52.299999, 46.099998, 51.799999, 57.0, 48.599998, 56.599998]
+```
+
+Using
+
+=> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html
+=> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html
+
+The actual code is in the gemma-wrapper repo.
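+
+A minimal sketch of that computation (using a few of the trait values shown above; this is not the gemma-wrapper code itself):
+
+```
+# Summary statistics as stored in the lmdb metadata. Note that scipy's
+# kurtosis() defaults to Fisher (excess) kurtosis, matching the small
+# value reported above.
+import numpy as np
+from scipy.stats import kurtosis, skew
+
+values = [54.099998, 50.099998, 53.299999, 55.099998, 57.299999]  # truncated
+print("mean=", np.mean(values), " std=", np.std(values),
+      " kurtosis=", kurtosis(values), " skew=", skew(values))
+```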
+
+I'll set up a new run and export to RDF. Some additions first.
+
+Even though we store trait values, I should add the number of individuals too. We store that as nind.
+
+Now that we have these metrics: no metadata is complete without its publication. PublishXRef contains a PublicationID. It points into the Publication table, which contains, for example:
+
+```
+| Id  | PubMed_ID | Abstract | Authors | Title | Journal | Volume | Pages | Month | Year |
+| 116 |  11438585 | To discover genes influencing cerebellum development, we conducted a complex trait analysis of variation in the size of the adult mouse cerebellum. We analyzed two sets of recombinant inbred BXD strains and an F2 intercross of the common inbred strains, C57BL/6J and DBA/2J. We measured cerebellar size as the weight or volume of fixed or histologically processed tissue. Among BXD recombinant inbred strains, the cerebellum averages 52 mg (12.4% of the brain) and ranges 18 mg in size. In F2 mice, the cerebellum averages 62 mg (12.9% of the brain) and ranges approximately 20 mg in size. Five quantitative trait loci (QTLs) that significantly control variation in cerebellar size were mapped to chromosomes 1 (Cbs1a), 8 (Cbs8a), 14 (Cbs14a), and 19 (Cbs19a, Cbs19b). In combination, these QTLs can shift cerebellar size to an appreciable 35% of the observed range. To assess regional genetic control of the cerebellum, we also measured the volume of the cell-rich, internal granule layer (IGL) in a set of BXD strains. The IGL ranges from 34 to 43% of total cerebellar volume. The QTL Cbs8a is significantly linked to variation in IGL volume and is suggestively linked to variation in the number of cerebellar folia. The QTLs we have discovered are among the first loci shown to modulate the size and architecture of the adult mouse cerebellum. | Airey DC, Lu L, Williams RW | Genetic control of the mouse cerebellum: identification of quantitative trait loci modulating size and architecture | J Neuroscience | 21     | 5099-5109 | NULL  | 2001 |
+```
+
+That is a nice example.
+But we also find many publications without abstracts, e.g. | 7276 |     15792 | NULL | Williams EG, Andreux P, Houtkooper R, Auwerx J | Recombinant Inbred BXD Mice as a Model for the Metabolic Syndrome.
+
+In fact, 22K entries out of 29K lack an abstract. Also, I can't find this last paper by Evan Williams. The closest is "Systems Genetics of Metabolism: The Use of the BXD Murine Reference Panel for Multiscalar Integration of Traits", which is probably worth reading.
+
+=> https://www.cell.com/cell/pdfExtended/S0092-8674(12)01007-0?__cf_chl_tk=kYZ49R4P29zOzYPeuWdrXVJC61HyhpHwFtq8lS2_rlk-1756022056-1.0.1.1-uY.PpAbgi8FO54P4_wYp_f6Nm84CdfHNQEI1WOmngFE
+
+I have no idea where the number 15792 comes from. It is not a pubmed ID. Some quick checks:
+
+```
+MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 limit 3;
++----------+
+|      427 |
++----------+
+MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 and Pubmed_ID<99999 limit 3;
++----------+
+|        2 |
++----------+
+MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 and Pubmed_ID<999999 limit 3;
++----------+
+|       10 |
++----------+
+select count(*) from Publication WHERE NOT Abstract is NULL limit 3;
++----------+
+|     6750 |
++----------+
+```
+
+So, out of 29K entries we have a very limited number of useful PMIDs, but we have some 6750 abstracts - mostly related to the BXD. Meanwhile some 16572 entries (about half) appear to have valid titles. Almost all records have authors, however.
+
+It really is a bit of a mess. What we need to do is harvest what we have, then collect PubMed IDs for the missing BXD PublishData records and use those to fetch up-to-date abstracts and author lists. We can even adapt my PubMed script that I use for BibTeX. A search for just the combination of these authors
+
+```
+pubmed2bib.sh 'Williams EG, Andreux P, Houtkooper R, Auwerx J  [au]'
+```
+
+renders
+
+```
+@article{Andreux:2012,
+  keywords     = { },
+  pmid         = {22939713},
+  pmcid        = {3604687},
+  note         = {{PMC3604687}},
+  IDS          = {PMC3604687, PMID:22939713},
+  author       = {Andreux, P. A. and Williams, E. G. and Koutnikova, H. and Houtkooper, R. H. and Champy, M. F. and Henry, H. and Schoonjans, K. and Williams, R. W. and Auwerx, J.},
+  title        = {{Systems genetics of metabolism: the use of the BXD murine reference panel for multiscalar integration of traits}},
+  journal      = {Cell},
+  year         = {2012},
+  volume       = {150},
+  number       = {6},
+  pages        = {1287-1299},
+  doi          = {10.1016/j.cell.2012.08.012},
+  url          = {http://www.ncbi.nlm.nih.gov/pubmed/22939713},
+  abstract     = {Metabolic homeostasis is achieved by complex molecular and cellular networks that differ significantly among individuals and are difficult to model with genetically engineered lines of mice optimized to study single gene function. Here, we systematically acquired metabolic phenotypes by using the EUMODIC EMPReSS protocols across a large panel of isogenic but diverse strains of mice (BXD type) to study the genetic control of metabolism. We generated and analyzed 140 classical phenotypes and deposited these in an open-access web service for systems genetics (www.genenetwork.org). Heritability, influence of sex, and genetic modifiers of traits were examined singly and jointly by using quantitative-trait locus (QTL) and expression QTL-mapping methods. Traits and networks were linked to loci encompassing both known variants and novel candidate genes, including alkaline phosphatase (ALPL), here linked to hypophosphatasia. The assembled and curated phenotypes provide key resources and exemplars that can be used to dissect complex metabolic traits and disorders.},
+}
+```
+
+So, yes, it is the likely candidate. We can use this information to suggest updates. It just proves again how useful manual curation is.
+
+Note that this information is collected at the experiment level (rather than the trait level), so it really does not belong in the GEMMA lmdb data. Every trait has an entry in PublishXRef that points back to the Publication ID. So we can pick it up later (and fix it!).
+
+# Rerun GEMMA precompute
+
+Let's set up a full rerun for the 13K BXD PublishData entries with this new information. That should allow us to see how skew, kurtosis and sample size affect the outcome. Remember we have the batch run script:
+
+```
+#!/usr/bin/env sh
+
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
+    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+    ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
+    ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+  fi
+done
+```
+
+That can be simplified because gemma-wrapper now replaces gn-pheno-to-gemma.rb. First, Guix had to install SciPy, which pulls in Inkscape and Jupyter among other things. It is really too much! But at least Guix makes it easy to reproduce my desktop environment on the server. Now we get a beautiful record in every lmdb GEMMA run:
+
+```
+"archive_GWA": "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10001-gemma-GWA.tar.xz", "trait_values": {"BXD
+1": 61.400002, "BXD2": 49.0, "BXD5": 62.5, "BXD6": 53.099998, "BXD8": 59.099998, "BXD9": 53.900002, "BXD11": 53.099998,
+ "BXD12": 45.900002, "BXD13": 48.400002, "BXD14": 49.400002, "BXD15": 47.400002, "BXD16": 56.299999, "BXD18": 53.599998
+, "BXD19": 50.099998, "BXD20": 48.200001, "BXD21": 50.599998, "BXD22": 53.799999, "BXD23": 48.599998, "BXD24": 54.90000
+2, "BXD25": 49.599998, "BXD27": 47.400002, "BXD28": 51.5, "BXD29": 50.200001, "BXD30": 53.599998, "BXD31": 49.700001, "
+BXD32": 56.0, "BXD33": 52.099998, "BXD34": 53.700001, "BXD35": 49.700001, "BXD36": 44.5, "BXD38": 51.099998, "BXD39": 5
+4.900002, "BXD40": 49.900002, "BXD42": 59.400002}, "table": "PublishData", "traitid": 10001, "dataid": 0}}, "nind": 34,
+ "mean": 52.1353, "std": 4.1758, "skew": 0.6619, "kurtosis": 0.0523,
+```
+
+and the job is running....
+
+Next stop is to rerun reaper and variations on GEMMA. Last night the run halted at 9K: the webserver gave an SQL error and just stopped/waited. As it is not using threads it will block. It says: "SQL Connection ERROR! file not found".
+
+# HK
+
+We want to rerun reaper to get more top ranked hits (and peaks). I also realize GEMMA can do a likelihood ratio (LR) test, and it would be interesting to see how that differs from reaper. The '-lm' switch says:
+
+```
+ -lm       [num]          specify analysis options (default 1).
+          options: 1: Wald test
+                   2: Likelihood ratio test
+                   3: Score test
+                   4: 1-3
+```
+
+The documentation points out that we don't need a GRM. Exactly. Now we could try to embed this in gemma-wrapper, but that is overkill. Part of the complexity of gemma-wrapper is related to handling the GRM with LOCO. Here we have a simple command that needs to be iterated. We don't need to record trait values, kurtosis etc. because that is already part of the previous exercise (and is constant). So the main complications are to create the trait vector, run gemma, and write an lmdb file. For now this will be a one-off, so we are not going to bother with caching and all that.
+
+```
+gemma -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lm 2 -o trait-BXDPublish-$id-gemma-GWA-hk
+```
+
+This produces a file
+
+```
+chr rs  ps  n_mis n_obs allele1 allele0 af  p_lrt
+1 rsm10000000001  3001490 0 237 X Y 0.527 -nan
+1 rs31443144  3010274 0 237 X Y 0.525 -nan
+1 rs6269442 3492195 0 237 X Y 0.525 -nan
+1 rs32285189  3511204 0 237 X Y 0.525 -nan
+```
+
+Hmm. All p_lrt are NaN. Oh, I need to make sure the second column is used:
+
+```
+gemma -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -lm 2 -o tmp/trait-BXDPublish-$id-gemma-GWA-hk
+chr rs  ps  n_mis n_obs allele1 allele0 af  p_lrt
+1 rsm10000000001  3001490 0 23  X Y 0.739 8.331149e-01
+1 rs31443144  3010274 0 23  X Y 0.739 8.331149e-01
+1 rs6269442 3492195 0 23  X Y 0.739 8.331149e-01
+1 rs32285189  3511204 0 23  X Y 0.739 8.331149e-01
+1 rs258367496 3659804 0 23  X Y 0.739 8.331149e-01
+```
+
+Much better! Now we need to turn this into an lmdb file. We could adapt gemma2lmdb.py to do that. But I am not going to. The attraction of repurposing code is always there, but it means diluting the meaning of the code - basically adding if-then blocks - and making the code less readable. This is one reason the Linux kernel does not share code between device drivers. Even for these simple tools I prefer to split code out, at the risk of not being DRY. I hope you can see what I mean with:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma2lmdb.py
+
+which is now pretty straightforward for parsing the LMM output of GEMMA into lmdb. We are going to do the same thing for a simpler output. But while writing it, it suddenly struck me that we don't need lmdb here in the first place! lmdb is for the full vector output and there is no reason to retain that here. All we want is the top hits. Great, that simplifies matters even more. Which, btw, points out how baffling it is to me that people think they can replace programmers with AI. Well, maybe for the obvious code... You can just see how much code will be garbage.
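+
+A minimal sketch of that top-hit extraction (column names taken from the assoc.txt output above; the file name and threshold are assumptions):
+
+```
+# Keep only the top hits from a GEMMA -lm run, skipping NaN p-values.
+import math
+
+def top_hits(fn, lod_threshold=3.0):
+    hits = []
+    with open(fn) as f:
+        header = f.readline().split()
+        for line in f:
+            rec = dict(zip(header, line.split()))
+            p = float(rec["p_lrt"])   # float("-nan") parses to NaN
+            if math.isnan(p):
+                continue              # GEMMA emits -nan for failed tests
+            lod = -math.log10(p)
+            if lod >= lod_threshold:
+                hits.append((rec["rs"], rec["chr"], lod))
+    return sorted(hits, key=lambda h: -h[2])
+
+print(top_hits("trait-BXDPublish-1-gemma-GWA-hk.assoc.txt"))
+```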
+
+Now we have the same idea in gemma-mdb-to-rdf.rb - and for the same reason as before I am not going to adapt that code.
+
+Fun fact: HK returns the same hits for the GEMMA and reaper versions. Good. GEMMA's p_lrt of 2.720446e-06 corresponds to a -log10 (LOD) of 5.56, and the 4.61 multiplier renders an LRS of 25 where GN2 shows an LRS of 22. Oh well, we are not too concerned, as long as the ranking is correct.
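+
+For reference, the arithmetic behind that conversion (the 4.61 multiplier is 2*ln(10), the standard LOD-to-LRS factor):
+
+```
+import math
+
+p_lrt = 2.720446e-06
+lod = -math.log10(p_lrt)        # ~5.57
+lrs = lod * 2 * math.log(10)    # 2*ln(10) ~= 4.61, so LRS ~= 25.6
+print(lod, lrs)
+```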
+
+So for GN trait
+
+=> https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish
+
+we now get for GEMMA HK:
+
+```
+gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedTrait;
+        rdfs:label "GEMMA_BXDPublish output/trait-BXDPublish-1-gemma-GWA-hk.assoc.txt trait HK mapped";
+        gnt:GEMMA_HK true;
+        gnt:belongsToGroup gn:setBxd;
+        gnt:trait gn:publishXRef_1;
+        gnt:time "2025-08-25 10:14:23 +0000";
+        gnt:belongsToGroup gn:setBxd;
+        gnt:name "BXDPublish";
+        gnt:traitId "1";
+        skos:altLabel "BXD_1".
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005699 ;
+       gnt:lodScore 5.6 .
+gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs47899232 ;
+       gnt:lodScore 5.6 .
+gn:rs3661882_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs3661882_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs3661882 ;
+       gnt:lodScore 5.3 .
+gn:rs33490412_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs33490412_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs33490412 ;
+       gnt:lodScore 5.3 .
+gn:rsm10000005703_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005703_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005703 ;
+       gnt:lodScore 5.3 .
+(...)
+```
+
+Code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/a17901d927d21a1686c0ac0d1552695f0096b84b
+
+Generate RDF, including skew, kurtosis etc.:
+
+```
+./bin/gemma-mdb-to-rdf.rb --header > test.ttl
+time for x in tmp/*.xz ; do
+    ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.ttl
+done
+```
+
+Renders
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10001 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10001;
+      gnt:loco true;
+      gnt:time "2025/08/24 08:22";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10001";
+      gnt:nind 34;
+      gnt:mean 52.1353;
+      gnt:std 4.1758;
+      gnt:skew 0.6619;
+      gnt:kurtosis 0.0523;
+      skos:altLabel "BXD_10001".
+gn:Rsm10000005700_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d;
+      gnt:locus gn:Rsm10000005700;
+      gnt:lodScore 6.2;
+      gnt:af 0.382;
+      gnt:effect 1.626.
+gn:Rs32133186_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d;
+      gnt:locus gn:Rs32133186;
+      gnt:lodScore 6.2;
+      gnt:af 0.382;
+      gnt:effect 1.626.
+...
+```
+
+A funny thing is that the hash values are now all the same, because gemma-wrapper no longer includes the trait values in the hash. That is a harmless bug that I'll fix for the next run.
+
+The GEMMA run ended up generating 1,576,110 triples. The gemma-mdb-to-rdf script took 42 minutes.
+
+After GEMMA LMM completed its run we set up the HK run which should reflect reaper.
+
+# On bimodality (of trait values)
+
+Kurtosis is not a great predictor of bimodality.
+
+=> https://aldenbradford.com/bimodality.html
+
+Rob says that for the BXD, bimodality works best. Maybe annotate with the dip statistic:
+
+=> https://skeptric.com/dip-statistic/
+
+We'll skip it for now - I added a task above.
+
+# Combine results
+
+First we upload the data into virtuoso after dropping the old graph. We can do this again, now introducing new subgraphs:
+
+```
+rapper -i turtle test.ttl > test.n3
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+SQL> log_enable(3,1);
+SQL> DELETE FROM rdf_quad WHERE g = iri_to_id ('http://pjotr.genenetwork.org');
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o };
+  0
+SQL> ld_dir('/export/data/virtuoso/ttl','test.n3','http://lmm2.genenetwork.org');
+  Done. -- 3 msec.
+# for testing the validity and optionally deleting problematic entries:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+# commit changes
+SQL> rdf_loader_run ();
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://lmm2.genenetwork.org> WHERE { ?s ?p ?o };
+  1576102
+```
+
+and after HK we are at 6838444 triples for this exercise. Note that you can clean up the load list with:
+
+```
+DELETE from DB.DBA.LOAD_LIST;
+```
+
+
+Let's list all the tissues we have with
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT * WHERE { ?s rdf:type gnc:tissue . ?s rdfs:label ?o . }
+
+"http://genenetwork.org/id/tissueA1c" "Primary Auditory (A1) Cortex mRNA"
+"http://genenetwork.org/id/tissueAcc" "Anterior Cingulate Cortex mRNA"
+"http://genenetwork.org/id/tissueAdr" "Adrenal Gland mRNA"
+"http://genenetwork.org/id/tissueAmg" "Amygdala mRNA"
+"http://genenetwork.org/id/tissueBebv"  "Lymphoblast B-cell mRNA"
+"http://genenetwork.org/id/tissueBla" "Bladder mRNA"
+(...)
+```
+
+Two other quick queries confirm that our data is loaded correctly. One quick test we want to do is to see whether all reaper hits overlap with GEMMA_HK. That would be a comfort.
+
+The reaper hits are found with
+
+```
+SELECT * WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs;
+         dct:description ?descr.
+} limit 50
+```
+
+The HK hits are defined as
+
+```
+gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedTrait;
+        rdfs:label "GEMMA_BXDPublish output/trait-BXDPublish-1-gemma-GWA-hk.assoc.txt trait HK mapped";
+        gnt:GEMMA_HK true;
+        gnt:belongsToGroup gn:setBxd;
+        gnt:trait gn:publishXRef_1;
+        gnt:time "2025-08-25 10:14:23 +0000";
+        gnt:belongsToGroup gn:setBxd;
+        gnt:name "BXDPublish";
+        gnt:traitId "1";
+        skos:altLabel "BXD_1".
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005699 ;
+       gnt:lodScore 5.6 .
+gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs47899232 ;
+       gnt:lodScore 5.6 .
+```
+
+So the overlapping hits can be counted with
+
+```
+SELECT count(*) WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?gemma gnt:mappedSnp ?id2;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs2.
+    ?id2 gnt:name "BXDPublish" ;
+        gnt:GEMMA_HK true;
+        gnt:traitId ?traitid.
+} limit 5
+```
+
+Unfortunately I made a mistake mapping the SNPs. They should have linked back to the trait. So instead of:
+
+```
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+```
+
+I should have generated
+
+```
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+
+```
+
+Doh! These SNPs are dangling now. It is a bit hard to see sometimes with these identifiers. OK, set up another RDF generation run.
+Now I see it shows an error for a few traits, e.g.
+
+```
+./bin/gemma2rdf.rb:74:in "initialize": No such file or directory @ rb_sysopen - ./tmp/trait-BXDPublish-18078-gemma-GWA-hk.assoc.txt (Errno::ENOENT)
+```
+
+That is for later (again), as the majority of traits comes through.
+
+```
+SQL> ld_dir('/export/data/virtuoso/ttl','gemma-GWA-hk.ttl','http://hk.genenetwork.org');
+SQL> rdf_loader_run ();
+SQL> SPARQL SELECT count(*) FROM <http://hk.genenetwork.org> WHERE { ?s ?p ?o };
+  5262347
+```
+
+Try again
+
+```
+SELECT ?traitid ?locus ?lrs ?lrs2 WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?trait gnt:GEMMA_HK true;
+        gnt:traitId ?traitid.
+    # filter(?lrs2 >= 4.0).
+    ?snp gnt:mappedSnp ?trait ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lrs2 .
+}
+"traitid","locus","lrs","lrs2"
+"21188","http://genenetwork.org/id/Rs31400538",2.73982,3.42
+"21194","http://genenetwork.org/id/Rs29514307",3.94845,4.7
+"21199","http://genenetwork.org/id/Rs50530980",2.60066,3.27
+"21203","http://genenetwork.org/id/Rs13483656",2.57406,3.24
+"21205","http://genenetwork.org/id/Rsm10000000057",2.90985,3.6
+"21210","http://genenetwork.org/id/Rsm10000000182",2.67097,3.34
+"21217","http://genenetwork.org/id/Rs29525970",3.80402,4.54
+"21220","http://genenetwork.org/id/Rs46586055",2.50946,3.17
+"21221","http://genenetwork.org/id/Rs47967883",2.54473,3.21
+"21223","http://genenetwork.org/id/Rs29327089",3.94623,4.69
+"21230","http://genenetwork.org/id/Rs30026335",2.78151,3.46
+"21238","http://genenetwork.org/id/Rs32170136",2.83393,3.52
+"21267","http://genenetwork.org/id/Rsm10000000063",2.54818,3.21
+```
+
+The count(*) variant of this query counts 9261 overlapping SNPs. So, about 4000 traits do not map exactly. Also interesting is that the GEMMA HK LRS/LOD is consistently higher than reaper's.
+
+Among the non-overlapping traits we find, for example, 10023, which has no significant HK hit. GEMMA_HK simply ignores it, while for reaper Bonz included the lodScore of 1.77. If we count the significant hits for reaper (LOD>3.0) we find 4541 hits. Out of these, 4506 overlap with GEMMA_HK. That is practically perfect!
+
+```
+SELECT ?traitid WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?trait gnt:GEMMA_HK true;
+        gnt:traitId ?traitid.
+    filter(?lrs >= 3.0).
+    ?snp gnt:mappedSnp ?trait ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lrs2 .
+}
+```
+
+Essentially every reaper result is replicated in GEMMA_HK and now we have all SNPs that can be compared against the LMM results.
+
+# On Normality
+
+But first we want to take a look at normality for the datasets, now that we have stored nind, mean, std, skew and kurtosis. At this stage let's just count datasets. Out of 13427 GEMMA LMM traits, 12416 have more than 16 individuals. When looking at abs(skew)<0.8 we have 7691 fairly normal traits. Adding abs(kurtosis)<1.0 we have 6289 traits. So about half of them are fairly normal. If we quantile normalize these vectors it may therefore have some impact. Let that be another task I add above (run gemma with qnorm).
+
+The query was
+
+```
+SELECT count(*) WHERE {
+    ?trait gnt:loco true;
+        gnt:traitId ?traitid;
+        gnt:nind ?nind;
+        gnt:skew ?skew;
+        gnt:kurtosis ?kurtosis.
+    filter(?nind > 16 and abs(?skew) < 0.8 and abs(?kurtosis) < 1.0).
+} LIMIT 40
+```
+
+# Pubmed
+
+As an aside, I made an interesting discovery. Some of the PubMed IDs that I thought were wrong may actually be OK. Maybe Bonz did some screening, because his RDF differs from what is in MySQL.
+
+# Preparing for comparison
+
+OK, we are finally at the point where we can compare the LMM results with HK (read: reaper). This is a 'set analysis', because we want to see which SNPs differ between the two results for every trait, and highlight those where the peaks are different. We have captured in RDF all the SNPs that are considered (fairly) significant for both LMM and HK.
+
+The easiest way is to capture all SNPs and write the analysis in code. There may be a way to do this in SPARQL, but it would take me more time and we would end up with less flexibility. There are two main ways to go about it. I can dump a table with all SNPs using SPARQL itself and process the tabular data (this, btw, may be a good input for AI). Another option is to use an RDF library and parse the RDF triples directly (without Virtuoso in the middle). That should allow for quicker processing and also a shorter turnaround if I need to modify the RDF (the cycle of updating, uploading, checking and writing SPARQL queries is quite long). There is one thing in writing software that is very important: you want a quick turnaround, otherwise you are just staring at a prompt ;). So it pays to learn these shortcuts. It also allows accessing lmdb files and even SQL if useful.
+Note that we can still use SPARQL to output RDF triples. So if we want more powerful filtering and/or to add metadata it will all work.
+
+## Reading RDF
+
+So, I wrote a first script to digest our RDF from GEMMA. The RDF library in Guix is a bit old, so we had to upgrade it in Guix.
+
+For testing I created a small TTL file and converted it to N3 with rapper:
+
+```
+rapper -i turtle test-2000.ttl > test-2000.n3
+```
+
+What we want to do is walk the dataset and harvest SNPs that belong to a run. As a start.
+
+First I needed to add the relevant RDF packages to Guix.
+
+=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=fcbe2919a1e4b168e8ec9ac995a6512360d56ac8
+
+The following code fetches all traits with all SNPs:
+
+```
+  require 'rdf'
+  # GNT vocabulary for the gnt: prefix used in the queries above
+  GNT = RDF::Vocabulary.new("http://genenetwork.org/term/")
+
+  graph = RDF::Graph.load(fn)
+  # Find all mapped traits
+  datasets = graph.query(RDF::Query.new {
+                           pattern [:dataset, RDF.type, GNT.mappedTrait]
+                         })
+  datasets.each { |trait|
+    p "-------"
+    p trait.dataset
+    # Collect the SNPs mapped to this trait
+    snps = graph.query(RDF::Query.new {
+                         pattern [ :snp, GNT.mappedSnp, trait.dataset ]
+                       })
+    p snps
+  }
+```
+
+Resulting in
+
+```
+"-------"
+#<RDF::URI:0x9ec0 URI:http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10007_gemma_GWA_7c00f36d>
+[#<RDF::Query::Solution:0x9ed4({:snp=>#<RDF::URI:0x9ee8 URI:http://genenetwork.org/id/Rsm10000005697_BXDPublish_10007_gemma_GWA_7c00f36d>})>]
+```
+
+In the next step we want to do somewhat more sophisticated queries. This library has SPARQL support with the graph in RAM, but I want to try the native interface first.
+
+The first hurdle was that loading RDF triples is extremely slow. So I wanted to try the RDF Raptor C extension, but that sent me down a temporary Guix rabbit hole because nss-certs had moved. Also, the raptor gem was ancient and showing errors, so I updated it to the latest GitHub code.
+
+Anyway, guix-bioinformatics was updated to support that. Next I tried loading with raptor and that made the difference. At least the triples are now read in minutes rather than hours, but the next step, building the large graph, takes a lot of time too. This sucks.
+
+Creating and inspecting each statement is fast enough; statements look like:
+
+```
+#<RDF::Statement:0x7a8(<http://genenetwork.org/id/HK_trait_BXDPublish_10001_gemma_GWA_hk_assoc_txt> <http://genenetwork.org/term/trait> <http://genenetwork.org/id/publishXRef_10001> .)>
+```
+
+So, rather than including all triples, we first filter out the ones we are not interested in, and that speeds things up. That worked until I included all SNPs. Are we stuck here? These libraries may be too slow. Analysing 200K triples took forever. Constructing the graph through an enumerator is a really slow step. The graph query is also slow. But adding the raptor-read triples to an array took only 7s. That makes it pretty clear we should process the 'raw' data directly.
+
+The current script collects all SNPs by GEMMA trait:
+
+```
+time ./bin/rdf-analyse-gemma-hits.rb test.nt
+Parsing test.nt...
+
+real    0m12.314s
+user    0m12.117s
+sys     0m0.196s
+```
+
+Next stop: we make it a set and do the same for HK. Then we can do set analysis. The first round is pretty impressive; it looks like trait 10001 has exactly the same SNPs for HK and GEMMA. That is a nice confirmation. Actually 10001 is an interesting test case, because in GN you can see HK and GEMMA find different secondary peaks:
+
+=> https://genenetwork.org/show_trait?trait_id=10001&dataset=BXDPublish
+
+At the GEMMA threshold we set (LOD>4.0) all hits are on chr8 and they overlap with HK. Down the line we could look at lower values, but let's stick with this for now.
+
+For 10004 we find some different SNPs. The mapping looks similar in GN:
+
+=> https://genenetwork.org/show_trait?trait_id=10004&dataset=BXDPublish
+
+The difference is:
+
+```
+["10004", #<Set: {#<RDF::URI:0x1a18 URI:http://genenetwork.org/id/Rs47899232>, #<RDF::URI:0x1a54 URI:http://genenetwork.org/id/Rsm10000005699>, #<RDF::URI:0xf78 URI:http://genenetwork.org/id/Rsm10000005700>, #<RDF::URI:0xf3c URI:http://genenetwork.org/id/Rs32133186>, #<RDF::URI:0xf00 URI:http://genenetwork.org/id/Rs32818171>, #<RDF::URI:0xec4 URI:http://genenetwork.org/id/Rsm10000005701>, #<RDF::URI:0xe88 URI:http://genenetwork.org/id/Rsm10000005702>, #<RDF::URI:0xdd4 URI:http://genenetwork.org/id/Rsm10000005703>, #<RDF::URI:0xfb4 URI:http://genenetwork.org/id/Rs33490412>, #<RDF::URI:0xff0 URI:http://genenetwork.org/id/Rs3661882>, #<RDF::URI:0x102c URI:http://genenetwork.org/id/Rsm10000005704>, #<RDF::URI:0x1068 URI:http://genenetwork.org/id/Rs32579649>, #<RDF::URI:0x10a4 URI:http://genenetwork.org/id/Rsm10000005705>}>]
+```
+
+This locus, Rs47899232, is not in my test set, so it looks like it is under the threshold. If you look at Chr8 you can see the GEMMA hit shifted somewhat to the right, from HK Chr8: 68.799000 to LOCO Chr8: 95.704608. The LOCO hit is also visible in HK, but dropped below significance.
+
+So we can do this analysis now! But just looking at SNPs is going to be laborious. At this stage we are mostly interested in the highest peak and whether it changed. What we need to do is capture regions, i.e. chromosome position ranges, and map out whether they moved.
+
+In the next phase I am going to take all SNP positions and map their region (+- 10,000 bps). For every trait we'll have a list of *regions* linked to significant hits. If these regions differ, then the peaks differ, and we can highlight them.
+
+# Getting SNPs and their positions
+
+To get SNPs and their positions a simple SPARQL query will do. Bonz has created a TTL, e.g.
+
+```
+gn:Rs47899232 rdf:type gnc:Genotype .
+gn:Rs47899232 rdfs:label "rs47899232" .
+gn:Rs47899232 gnt:chr "8" .
+gn:Rs47899232 gnt:mb "95.704608"^^xsd:double .
+gn:Rs47899232 gnt:belongsToSpecies gn:Mus_musculus .
+gn:Rs47899232 gnt:chrNum "0"^^xsd:int .
+gn:Rsm10000005700 rdf:type gnc:Genotype .
+gn:Rsm10000005700 rdfs:label "rsm10000005700" .
+gn:Rsm10000005700 gnt:chr "8" .
+gn:Rsm10000005700 gnt:mb "95.712996"^^xsd:double .
+gn:Rsm10000005700 gnt:belongsToSpecies gn:Mus_musculus .
+gn:Rsm10000005700 gnt:chrNum "0"^^xsd:int .
+```
+
+A few things are a bit puzzling, but at this stage all we care about are the identifier, label, chr and mb. GN, for some reason, tracks mb as a floating point number. I don't like that, but it will work for tracking positions. To get a table we use the following query:
+
+```
+SELECT * WHERE {
+    ?snp a gnc:Genotype;
+             gnt:belongsToSpecies gn:Mus_musculus ;
+             rdfs:label ?name ;
+             gnt:chr ?chr ;
+             gnt:mb ?mb .
+
+}
+```
+
+We save that as a TSV and have 120K SNPs formatted like:
+
+```
+"http://genenetwork.org/id/Rs47899232"   "rs47899232"    "8"     95.7046
+```
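+
+A minimal sketch for turning that TSV into a position lookup (the file name is an assumption):
+
+```
+# Load the SNP table (URI, name, chr, mb) into a dict keyed by SNP URI.
+import csv
+
+positions = {}
+with open("snp-positions.tsv") as f:
+    for uri, name, chrom, mb in csv.reader(f, delimiter="\t"):
+        positions[uri] = (name, chrom, float(mb))
+
+print(positions["http://genenetwork.org/id/Rs47899232"])
+# ('rs47899232', '8', 95.7046)
+```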
+
+# Ranges
+
+In the next step we want to define peak ranges. It would be nice to visualize them as a line, e.g. for HK and LOCO:
+
+```
+Chr   1              2             3 ...
+HK    ---X-------------------X-----
+LOCO  ---X----X--------------X-----
+```
+
+That way we can see that a peak appeared on Chr 1. Down the line we can use the same info to compare traits A and B:
+
+```
+Chr   1              2             3 ...
+A     ---X-------------------X-----
+B     ---X-------------------------
+```
+
+where we see some chromosome area is shared. Rob sent me this nice 2008 paper:
+
+=> https://pubmed.ncbi.nlm.nih.gov/19008955/
+
+which states that a remarkably diverse set of traits maps to a region on mouse distal chromosome 1 (Chr 1) that corresponds to human Chr 1q21-q23. This region is highly enriched in quantitative trait loci (QTLs) that control neural and behavioral phenotypes, including motor behavior, escape latency, emotionality, seizure susceptibility (Szs1), and responses to ethanol, caffeine, pentobarbital, and haloperidol.
+
+And we are still doing this research today.
+
+Anyway, for our purposes, each trait has a range of SNPs. If they are close to each other they form a 'peak'. What I am going to do is combine the SNPs we are comparing into one set first, use that to define the ranges (say within 10K BPs), and then go back to the computed SNPs and figure out what fits each range. We will pick out those ranges that are unique to a trait. But first we'll just visualize.
+
+As this involves some logic we will have to do it in real code (again). First we show how many SNPs we have combined for HK+LOCO and how many differ, e.g.
+
+```
+["10001",  78,  0]
+["10002", 208, 92]
+["10003",  96,  0]
+["10004",  35, 13]
+["10005",  76,  0]
+```
+
+So, for 10001 we have 78 SNPs and the LOCO ones all overlap with HK. We showed before that for every set we have the SNP IDs.
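+
+These per-trait counts are plain set operations; a minimal sketch with made-up SNP IDs:
+
+```
+# For each trait: total SNPs across both runs, and how many differ.
+hk   = {"10004": {"rs1", "rs2", "rs3"}}         # SNPs from the HK run
+loco = {"10004": {"rs2", "rs3", "rs4", "rs5"}}  # SNPs from the LOCO run
+for trait in hk.keys() & loco.keys():
+    combined = hk[trait] | loco[trait]          # union
+    differ   = hk[trait] ^ loco[trait]          # symmetric difference
+    print([trait, len(combined), len(differ)])  # ['10004', 5, 3]
+```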
+
+For the first time in this exercise I have to write some real new code (before, I was just tying together existing work and fixing bugs on the fly). The reason is that we have to track QTL peak ranges by inserting SNP positions. Not only that, we also need to make sure that these ranges do not overlap and are built faithfully. For example, the order of adding SNPs matters - we grow a range by adding SNPs on the same chromosome. If a SNP falls out of range (e.g. 10K BPs away) we create a new range. But when another SNP falls in the middle we need to merge ranges into one (or one peak). This requires some logic and I am creating a new module for it.
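+
+A minimal sketch of that insert-and-merge logic (the real module is lib/qtlrange.rb in gemma-wrapper; distances and positions here are illustrative):
+
+```
+# Insert a SNP position (bp) into a list of (start, stop) ranges on one
+# chromosome; any existing range within `dist` bp gets merged into one peak.
+def insert_snp(ranges, pos, dist=500_000):
+    lo, hi = pos, pos
+    out = []
+    for start, stop in sorted(ranges):
+        if stop + dist < lo or start - dist > hi:
+            out.append((start, stop))               # too far away: keep as-is
+        else:
+            lo, hi = min(lo, start), max(hi, stop)  # absorb into growing peak
+    out.append((lo, hi))
+    return sorted(out)
+
+peaks = []
+for pos in [72_255_100, 72_700_000, 73_100_000, 171_172_000]:
+    peaks = insert_snp(peaks, pos)
+print(peaks)  # [(72255100, 73100000), (171172000, 171172000)]
+```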
+
+The current code creates the following peaks on chr1:
+
+```
+@chromosome={"1"=>[#<QRange 𝚺14 173.339..173.679>, #<QRange 𝚺9 175.615..176.205>, #<QRange 𝚺2 174.541..174.679>, #<QRange 𝚺7 175.437..176.032>, #<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺10 179.862..180.284>, #<QRange 𝚺22 181.476..183.154>, #<QRange 𝚺9 179.916..180.412>, #<QRange 𝚺4 177.555..177.901>, #<QRange 𝚺29 171.749..173.532>, #<QRange 𝚺8 171.172..172.175>]
+```
+
+The sigma tells you how many SNPs are in each range. There is some overlap, so I need to fix that. When I set the distance at 50,000 BPs we get too many peaks. We need some other heuristic to decide what is a peak and what is not. Probably look at the direction the significance is going, i.e. when it drops and rises again we may have a local peak. It would be nice to track those as separate ranges.
+
+Rob suggested a bin size of 500,000 BPs for the BXD. Let's try that first. This results in orderly combined LOCO+HK results for trait 10002:
+
+```
+#<QTL::QRanges:0x00007f99f277c840 @chromosome={"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]}>
+```
+
+Next we do this for LOCO and HK separately:
+
+```
+[10002,combined] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]}
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺32 94.4792..97.3382>]}
+["10003", 96, 0]
+["10004", 35, 13]
+[10004,combined] =>{"8"=>[#<QRange 𝚺35 68.7992..97.3516>]}
+[10004,HK]       =>{"8"=>[#<QRange 𝚺22 68.7992..74.9652>]}
+[10004,LOCO]     =>{"8"=>[#<QRange 𝚺13 95.6926..97.3516>]}
+```
+
+Resulting in a new QTL for 10002,LOCO. And with 10004 we see the QTL shift to the right. Nice!
+
+We'll want to track the LOD score too, so let's load that using the RDF file we parse anyway.
+
+```
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+[10004,HK]       =>{"8"=>[#<QRange 𝚺22 68.7992..74.9652 LOD=3.14..3.23>]}
+[10004,LOCO]     =>{"8"=>[#<QRange 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]}
+```
+
+Speaks for itself.
+
+# Analyzing peaks
+
+Now we have the peaks for different runs (HK and LOCO). We would like to see how many of the traits are affected - gaining or losing or moving peaks. Also, before we introduce the GEMMA values to GN, we would like to assess how many of the peaks are really different.
+
+With the above example we can see that 10002 gained a peak on chr1. With 10004 we see that the peak on chr8 shifted position. These are the things we want to capture. We also want to bring back some metadata to show what each trait is about. Finally, we want to point to the full vector lmdb file, which I forgot to include in the original parsing, though I did include the hash, e.g.
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10001 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10001;
+      gnt:loco true;
+      gnt:time "2025/08/24 08:22";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10001";
+```
+
+I should add
+
+```
+      gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10001-gemma-GWA.tar.xz"
+      gnt:hostname "balg01"
+```
+
+so we can find it again easily.
+
+Next step is to say something about the peaks. Let's enrich our RDF store to show these results. Basically for 10002 we can add RDF statements for
+
+```
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+```
+
+e.g.
+
+```
+gn:qtl00001_LOCO
+    gnt:qtlChr      "1";
+    gnt:qtlStart    72.2551 ;
+    gnt:qtlStop     73.3771 ;
+    gnt:qtlLOD      5.1 ;
+    gnt:SNPs        15 .
+gn:qtl00002_LOCO
+    gnt:qtlChr      "1";
+    gnt:qtlStart    171.172 ;
+    gnt:qtlStop     183.154 ;
+    gnt:qtlLOD      5.3 ;
+    gnt:SNPs        91 ;
+    gnt:qtlOverlaps gn:qtl00001_HK.
+```
+
+This way, in SPARQL, we can query all QTL that are not in HK. For the QTL that are in HK we can also see if they shifted. Actually for SPARQL we don't really need the last statement - it is just a convenience. We will also add the actual SNP identifiers so the SNP counter is not really necessary either (let SPARQL count):
+
+```
+gn:QTL_CHR1_722551_GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d
+    gnt:mappedQTL gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d;
+    rdfs:label     "GEMMA BXDPublish LOCO QTL on 1:722551 trait 10002";
+    gnt:qtlChr     "1";
+    gnt:qtlStart   72.2551 ;
+    gnt:qtlStop    73.3771 ;
+    gnt:qtlLOD     5.1 ;
+    gnt:qtlSNP     gn:Rs13475920_BXDPublish_10002_gemma_GWA_7c00f36d ;
+    gnt:qtlSNP     gn:Rs31428112_BXDPublish_10002_gemma_GWA_7c00f36d ;
+    (...)
+```
+
+I have two things to solve now. First we need to check whether QTLs between the two runs overlap. And then there is a bug in the QTL computation from SNP positions. I am seeing some inconsistencies wrt binning.
+
+The problem I was referring to yesterday turns out to be all right. I thought that when I used the combined SNPs from HK and LOCO there was only one peak. But there are two:
+
+```
+[10002,combined] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=..>,       #<QRange 𝚺91 171.172..183.154 LOD=..>]},
+[10002,HK]       =>{"1"=>                                              #<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>],
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>]
+```
+
+It is interesting to see that HK misses one peak completely, while the second peak completely overlaps with LOCO (including all SNPs). All good so far. OK, let's add some logic to see which peaks match and which don't:
+
+```
+[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+["10002: NO HK match for LOCO Chr 1 QTL!", #<QRange Chr1 𝚺15 72.2551..73.3771 LOD=4.0..5.1>]
+[10004,HK] =>{"8"=>[#<QRange Chr8 𝚺22 68.7992..74.9652 LOD=3.14..3.23>]}
+[10004,LOCO] =>{"8"=>[#<QRange Chr8 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]}
+["10004: NO HK match for LOCO Chr 8 QTL!", #<QRange Chr8 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]
+```
+
+So 10002 correctly says there is a new QTL on chr1, and for 10004 a new QTL on chr8. Now, for 10004 it appears the HK version is in a different location, but I think it suffices to point out 'apparently' new QTL.
+
+Alright, so we can now annotate new/moved QTL! We are going to feed this back into virtuoso by writing RDF as I showed yesterday.
+
+So, coming back to enriching the RDF store with these results: for 10002 we add RDF statements for
+
+```
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+```
+
+E.g.
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97
+    gnt:mappedQTL   gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d;
+    rdfs:label      "GEMMA BXDPublish QTL";
+    gnt:qtlChr      "8";
+    gnt:qtlStart    94.4792 ;
+    gnt:qtlStop     97.3382 ;
+    gnt:qtlLOD      4.8 .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rsm10000005689_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rs232396986_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rsm10000005690_BXDPublish_10002_gemma_GWA_7c00f36d .
+(...)
+```
+
+and if it is a new QTL compared to HK we annotate a newly discovered QTL:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_1_72_73 a gnt:newlyDiscoveredQTL .
+gn:GEMMAMapped_LOCO_BXDPublish_10004_gemma_GWA_7c00f36d_8_96_97 a gnt:newlyDiscoveredQTL .
+```
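+
+Once this QTL RDF exists, pulling out the newly discovered QTL takes only a few lines. A sketch using Python's rdflib (an assumption - the tooling here is Ruby - with an illustrative file name):
+
+```
+# List QTL marked as newly discovered compared to HK, with trait and LOD.
+from rdflib import Graph, Namespace, RDF
+
+GNT = Namespace("http://genenetwork.org/term/")
+g = Graph()
+g.parse("QTL.ttl", format="turtle")
+for qtl in g.subjects(RDF.type, GNT.newlyDiscoveredQTL):
+    print(qtl, g.value(qtl, GNT.mappedQTL), g.value(qtl, GNT.qtlLOD))
+```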
+
+Note that we skipped the results that show no SNP changes - I should add them later to give full QTL coverage.
+
+Code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rdf-analyse-gemma-hits.rb
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/lib/qtlrange.rb
+
+Now we have all the RDF we need to figure out which traits have new QTL compared to reaper!
+I'll upload it into virtuoso for further analysis.
+
+I want to do a run that shows what traits have changed QTLs.
+Basically the command is
+
+```
+./bin/rdf-analyse-gemma-hits.rb test-hk-2000.ttl test-2000.ttl -o RDF
+```
+
+Let's try to run with the full TTL files. Actually, I converted them to N3 because of some error:
+
+```
+rapper --input turtle gemma-GWA.ttl > gemma-GWA.n3
+rapper --input turtle gemma-GWA-hk.ttl > gemma-GWA-hk.n3
+time ./bin/rdf-analyse-gemma-hits.rb gemma-GWA-hk.n3 gemma-GWA.n3 > test.out
+real    3m21.979s
+user    3m21.076s
+sys     0m0.716s
+```
+
+3.5 minutes is fine for testing stuff (if already a little tedious). The first run failed because I had renamed GEMMA_HK to GemmaHK. Another bug I hit was:
+
+```
+[10009,HK] =>{"15"=>[#<QRange Chr15 𝚺30 25.6987..74.5398 LOD=3.01..3.27>]}
+[10009,LOCO] =>{"10"=>[#<QRange Chr10 𝚺1 76.2484..76.2484 LOD=3.5..3.5>]}
+/export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/lib/qtlrange.rb:126:in `block (2 levels) in qtl_diff': undefined method `each' for nil (NoMethodError)
+```
+
+There are a few more bugs to fix - mostly around empty results, e.g. when a trait has no SNPs. Also, HK would render an infinite lodScore (`gnt:lodScore Infinity`) and that reduced the result set. I set an infinite LOD to 99.0, so at least it'll stand out. After the fixes the run took 12 minutes, a lot slower than 3.5 minutes! Still OK, for now.
+
+The first run shows 7943 new QTL. It turns out that a bunch of them are non-significant, so we need to filter those. Remember, we kept the highest hit even if significance was low. A quick filter shows that with LMM 2802 traits show new QTLs (out of 13K). Of those, 1984 traits did not compute a QTL at all with HK. That looks exciting, but we need to validate. Let's take a look at
+
+```
+[10727,HK] =>{}
+[10727,LOCO] =>{"15"=>[#<QRange Chr15 𝚺9 62.3894..63.6584 LOD=4.4..4.4>]}
+["10727: NO HK match for LOCO Chr 15 QTL!", [#<QRange Chr15 𝚺9 62.3894..63.6584 LOD=4.4..4.4>]]
+```
+
+=> https://genenetwork.org/show_trait?trait_id=10727&dataset=BXDPublish
+
+That looks correct to me. Rob, you may want to check. And another:
+
+```
+[51064,HK] =>{"10"=>[#<QRange Chr10 𝚺12 92.3035..108.525 LOD=3.08..4.15>], "19"=>[#<QRange Chr19 𝚺34 8.93047..34.2017 LOD=3.06..3.41>], "3"=>[#<QRange Chr3 𝚺5 138.273..138.581 LOD=3.06..3.06>], "X"=>[#<QRange ChrX 𝚺5 160.766..163.016 LOD=3.48..3.48>]}
+[51064,LOCO] =>{"19"=>[#<QRange Chr19 𝚺37 29.9654..34.2017 LOD=4.3..5.5>]}
+```
+
+=> https://genenetwork.org/show_trait?trait_id=51064&dataset=BXDPublish
+
+Looks correct. With HK we see QTL on Chr 3, 10, 19 and X. On GN, LMM shows a whopper on Chr 19, as well as X. I need to see why GEMMA is not finding that X hit in the precompute! Made a note of that too.
+
+# Updating RDF
+
+Now that we have QTL output we can upload it to RDF.
+
+To make the traits accessible we need to add some metadata: the trait description, publication and authors. All this information can also be used to build a UI.
+
+For this I am going to regenerate the RDF, without running gemma again, to make sure it is complete and to mark the new QTL. One change is that if a LOD is infinite we set it to 99.1. That number will stand out. The idea is that when a P-value ends up rounded to zero we can pick it up easily as a conversion artifact. This turns out to be relevant, for example:
+
+```
+gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedTrait;
+        rdfs:label "GEMMA_BXDPublish ./tmp/trait-BXDPublish-13032-gemma-GWA-hk.assoc.txt trait HK mapped";
+        gnt:GEMMA_HK true;
+        gnt:belongsToGroup gn:setBxd;
+        gnt:trait gn:publishXRef_13032;
+        gnt:time "2025-08-27 06:44:45 +0000";
+        gnt:name "BXDPublish";
+        gnt:traitId "13032";
+        skos:altLabel "BXD_13032".
+
+gn:rsm10000005888_HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005888 ;
+       gnt:lodScore Infinity .
+
+gn:rsm10000005889_HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005889 ;
+       gnt:lodScore Infinity .
+```
+
+The trait has +1 and -1 values:
+
+=> https://genenetwork.org/show_trait?trait_id=13032&dataset=BXDPublish
+
+HK on GN shows a map, but no result table. Hmmm. The SNPs listed here as Infinity don't really show in GN - and GEMMA finds no hits there. On consideration, since we don't use HK for anything but comparison, I should just drop these results. It looks dodgy. Aha, in the GEMMA run these actually show up as not-a-number (NaN), so I should indeed drop them!
+
+```
+chr rs  ps  n_mis n_obs allele1 allele0 af  p_lrt
+9 rsm10000005888  31848339  0 23  X Y 0.348 -nan
+9 rsm10000005864  27578739  0 23  X Y 0.391 1.770379e-10
+```
+
+Funnily enough, they are on the same chromosome as the highest ranking hits.
+
+Let's generate RDF and look at the differences:
+
+```
+export RDF=gemma-GWA-hk2.ttl
+wrk@balg01 ~/services/gemma-wrapper [env]$ ./bin/gemma2rdf.rb --header > $RDF
+wrk@balg01 ~/services/gemma-wrapper [env]$ for id in $(cat ids.txt) ; do traitfn=trait-BXDPublish-$id-gemma-GWA-hk ; ./bin/gemma2rdf.rb $TMPDIR/$traitfn.assoc.txt >> $RDF ; done
+```
+
+It took 43 min. The diff with the original looks good. Note that I don't track origin files for this. Maybe I should, but I don't think we'll really use them. Next, generate the GEMMA LOCO RDF again:
+
+```
+RDF=gemma-GWA.ttl
+wrk@balg01 ~/services/gemma-wrapper [env]$ ./bin/gemma-mdb-to-rdf.rb --header > $RDF
+time for x in tmp/*.xz ; do
+    ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> $RDF
+done
+```
+
+It runs in 50 min for 13K traits.
+
+The output now points to the lmdb vector files:
+
+```
++      gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10080-gemma-GWA.tar.xz";
++      gnt:hostname "balg01";
+```
+
+## Digest QTL to RDF
+
+In the next step we want to show the QTL in RDF. First I created a small subset for testing that I can run with
+
+```
+time ./bin/rdf-analyse-gemma-hits.rb test-hk-2000.n3 test-2000.n3
+```
+
+It shows, for example,
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25
+    gnt:mappedQTL   gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d;
+    rdfs:label      "GEMMA BXDPublish QTL";
+    gnt:qtlChr      "4";
+    gnt:qtlStart    24.7356 ;
+    gnt:qtlStop     24.7356 ;
+    gnt:qtlLOD      3.6 .
+gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25 gnt:mappedSnp gn:Rsm10000001919_BXDPublish_10012_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25 a gnt:newQTL .
+```
+
+In other words, a QTL with LOD 3.6 and a single SNP that is new compared to the HK output. We want to annotate a bit more, because I want to show the maximum allele frequency among the QTL's SNPs. That is not too hard, as it is contained in the mapped SNP info:
+
+```
+gn:Rsm10000005700_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d;
+      gnt:locus gn:Rsm10000005700;
+      gnt:lodScore 6.2;
+      gnt:af 0.382;
+      gnt:effect 1.626.
+```
+
+
+With precompute I added allele frequencies to the QTL. So for trait 10002 we get:
+
+```
+[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
+```
+
+and with RDF:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73
+    gnt:mappedQTL   gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d;
+    rdfs:label      "GEMMA BXDPublish QTL";
+    gnt:qtlChr      "1";
+    gnt:qtlStart    72.2551 ;
+    gnt:qtlStop     73.3771 ;
+    gnt:qtlAF       0.574 ;
+    gnt:qtlLOD      5.1 .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rsm10000000582_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rsm10000000583_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rs37034472_BXDPublish_10002_gemma_GWA_7c00f36d .
+...etc...
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 a gnt:newQTL .
+```
+
+Important: we only store LOCO QTL (which we reckon are 'truth'), not the HK QTL. We also marked QTL that are *not* in HK with the gnt:newQTL annotation.
+
+For AF filtering we track this information on the trait:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10002 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10002;
+      gnt:loco true;
+      gnt:time "2025/08/24 08:22";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10002";
+      gnt:nind 34;
+      gnt:mean 52.2206;
+      gnt:std 2.9685;
+      gnt:skew -0.1315;
+      gnt:kurtosis 0.0314;
+      skos:altLabel "BXD_10002";
+      gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10002-gemma-GWA.tar.xz";
+      gnt:hostname "balg01";
+      gnt:user "wrk".
+```
+
+So, for the first QTL, an AF of 0.574 is based on (1-0.574)*34 = 14 out of 34 individuals, which is great. When we get down to 1 or 2 individuals it may be kind of dodgy. For a dataset this size the AF threshold should be 0.06 (and 0.94). If we have 15 individuals we should be closer to 0.1 (and 0.9). Anyway, we can compute these on the fly in SPARQL. I'd rather show too many false positives.
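+
+A back-of-the-envelope sketch of that threshold logic (the function name is made up):
+
+```
+# How many individuals carry the minor allele at a QTL?
+def minor_allele_count(af, nind):
+    return round(min(af, 1 - af) * nind)
+
+print(minor_allele_count(0.574, 34))  # 14 individuals: solid
+print(minor_allele_count(0.97, 34))   # 1 individual: dodgy
+```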
+
+Also note that AF is not really a problem with our current BXD genotypes. We are going to use pangenome genotypes next, however, and there it will be important.
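+
+A minimal sketch of such an on-the-fly filter, assuming the gnt:qtlAF and gnt:nind properties shown above (the 2-individual cutoff is my choice):
+
+```
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+SELECT ?trait, ?chr, ?af, ?nind WHERE {
+   ?qtl gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlAF ?af .
+   ?traitid gnt:traitId ?trait ;
+               gnt:nind ?nind .
+   # keep QTL where the minor allele is carried by at least 2 individuals,
+   # i.e. roughly 2/nind <= AF <= 1 - 2/nind
+   FILTER(?af * ?nind >= 2 && (1 - ?af) * ?nind >= 2)
+}
+```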
+
+Let's do a full QTL compute with
+
+```
+time ./bin/rdf-analyse-gemma-hits.rb gemma-GWA-hk2.n3 gemma-GWA.n3 -o RDF > QTL.rdf
+```
+
+And we should have the queryable mapped QTL we wished for! But some inspection shows:
+
+```
+[10015,HK] =>{"12"=>[#<QRange Chr12 𝚺2 3.2..9.74252 LOD=3.74..3.74>], "2"=>[#<QRange Chr2 𝚺259 4.03246..52.4268 LOD=3.11..16.01>]}
+[10015,LOCO] =>{"2"=>[#<QRange Chr2 𝚺256 4.03246..57.8635 AF=0.542 LOD=4.0..15.2>]}
+["10015: NO HK match, QTL LOCO Chr 2!", #<QRange Chr2 𝚺256 4.03246..57.8635 AF=0.542 LOD=4.0..15.2>]
+```
+
+which is strange because those QTL on Chr2 clearly overlap! They are obviously the same. A subtle bug: instead of
+
+```
+-      return true if qtl.min > @min and qtl.max < @max
+-      return true if qtl.min < @min and qtl.max > @min
+-      return true if qtl.min < @max and qtl.max > @max
+```
+
+I now have:
+
+```
++      return true if qtl.min >= @min and qtl.max <= @max # qtl falls within boundaries
++      return true if qtl.min <= @min and qtl.max >= @min # qtl over left boundary
++      return true if qtl.min <= @max and qtl.max >= @max # qtl over right boundary
+
+```
+
+I had to include the boundaries themselves. (Equivalently, two closed intervals overlap exactly when qtl.min <= @max and qtl.max >= @min.)
+
+We also still log false positives, for example:
+
+```
+[10009,HK] =>{"15"=>[#<QRange Chr15 𝚺30 25.6987..74.5398 LOD=3.01..3.27>]}
+[10009,LOCO] =>{"10"=>[#<QRange Chr10 𝚺1 76.2484..76.2484 AF=0.5 LOD=3.5..3.5>]}
+["10009: NO HK results, new QTL(s) LOCO Chr 10!", [#<QRange Chr10 𝚺1 76.2484..76.2484 AF=0.5 LOD=3.5..3.5>]]
+```
+
+Note the LOD score: I should not mark new QTL that are below 4.0. With that cutoff we count 2351 new QTL, which is in line with my earlier quick counts.
+
+Note that the current script eats RAM because it holds all LOD scores and SNPs in memory. That is fine for our 13K classical traits, but it will probably not work for millions of traits. It runs in 8 minutes, which is cool too.
+
+# Updating RDF in virtuoso
+
+Similar to what we did before, we are going to update Virtuoso on the sparql-test server using the CLI isql commands discussed above.
+
+In August I uploaded:
+
+```
+SELECT * FROM DB.DBA.load_list;
+/export/data/virtuoso/ttl/gemma-GWA-hk.ttl                                     http://hk.genenetwork.org                                                         2           2025.8.27 8:31.57 122123000  2025.8.27 8:32.6 104530000  0           NULL        NULL
+/export/data/virtuoso/ttl/test.n3                                                 http://lmm2.genenetwork.org                                                       2           2025.8.27 6:47.44 947047000  2025.8.27 6:47.49 73865000  0           NULL        NULL
+```
+
+Also, to list all available graphs you can do
+
+```
+SELECT  DISTINCT ?g
+   WHERE  { GRAPH ?g {?s ?p ?o} }
+ORDER BY  ?g
+http://genenetwork.org
+http://hk.genenetwork.org
+http://lmm2.genenetwork.org
+```
+
+The first graph holds all of Bonz' RDF. I can now safely delete the other two, to start with a clean slate.
+The full store has 36584993 triples; after deleting the HK graph 31322646 remain, and after also deleting LMM2 we are down to 29746544 triples.
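+
+Graphs can be dropped from isql with a SPARQL CLEAR; a sketch (note the SPARQL keyword prefix that isql requires):
+
+```
+SPARQL CLEAR GRAPH <http://hk.genenetwork.org>;
+SPARQL CLEAR GRAPH <http://lmm2.genenetwork.org>;
+CHECKPOINT;
+```
+
+Next we load the new QTL RDF: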
+
+```
+ld_dir('/export/data/virtuoso/ttl','QTL.rdf','http://qtl.genenetwork.org');
+```
+
+Ouch, we got an error. With the proper prefix declarations, and the file renamed to QTL.ttl (Virtuoso appears to pick its parser based on the file extension), it worked: 183562 new triples!
+Next we loaded the updated TTL files. HK imported 3196834 triples and LMM 1616383, for a total of 34743323 triples. That is less than the previous set, because we cleaned out the SNPs that had an infinite LOD.
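+
+A quick sanity check on the freshly loaded graph (a sketch; the count should match the number of loaded triples):
+
+```
+SELECT COUNT(*) FROM <http://qtl.genenetwork.org> WHERE { ?s ?p ?o }
+```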
+
+After a checkpoint, time to SPARQL! This query lists all new QTL with their traits:
+
+```
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+SELECT ?trait, ?chr, ?start, ?stop, ?lod  WHERE {
+   ?qtl gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlStart ?start ;
+          gnt:qtlStop ?stop ;
+          a gnt:newQTL ;
+          gnt:qtlLOD ?lod .
+   ?traitid gnt:traitId ?trait .
+} LIMIT 20
+
+"trait" "chr"   "start" "stop"  "lod"
+"26116" "7"     36.9408 36.9408 4
+"26118" "2"     3.19074 4.29272 4.3
+"26118" "9"     60.6863 64.4059 4.3
+"26126" "17"    71.754  72.1374 4.7
+"26135" "15"    93.3404 94.2523 5.5
+(...)
+```
+
+So we list all traits that have a *NEW* QTL using GEMMA compared to HK. We have a few thousand trait updates with new QTL. Let's add the number of samples/genometypes, so we can ignore the smaller sets. Or better, count them first. We simplify the query:
+
+```
+SELECT count(DISTINCT ?trait)  WHERE {
+   ?qtl a gnt:newQTL ;
+          gnt:mappedQTL ?traitid .
+   ?traitid gnt:traitId ?trait ;
+               gnt:nind ?nind.
+} LIMIT 20
+```
+
+This counts 2040 traits with at least one new QTL. When we add FILTER (?nind > 16) we get 2019 traits, so only a tiny minority has fewer individuals and we can skip that filter.
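+
+For the record, a sketch of the counting query with that filter in place:
+
+```
+SELECT COUNT(DISTINCT ?trait) WHERE {
+   ?qtl a gnt:newQTL ;
+          gnt:mappedQTL ?traitid .
+   ?traitid gnt:traitId ?trait ;
+               gnt:nind ?nind .
+   FILTER (?nind > 16)
+}
+```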
+
+Of course we visited several traits before to see if the QTL were correct. Now I'll make a list for Rob to check, expanding each trait to a clickable URL:
+
+```
+SELECT ?trait, ?chr, ?start, ?stop, ?lod  WHERE {
+   ?qtl gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlStart ?start ;
+          gnt:qtlStop ?stop ;
+          a gnt:newQTL ;
+          gnt:qtlLOD ?lod .
+   ?traitid gnt:traitId ?trait .
+   BIND(REPLACE(?trait, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url)
+} LIMIT 20
+
+"trait" "chr"   "start" "stop"  "lod"   "url"
+"26116" "7"     36.9408 36.9408 4       "https://genenetwork.org/show_trait?trait_id=26116&dataset=BXDPublish"
+"26118" "2"     3.19074 4.29272 4.3     "https://genenetwork.org/show_trait?trait_id=26118&dataset=BXDPublish"
+"26118" "9"     60.6863 64.4059 4.3     "https://genenetwork.org/show_trait?trait_id=26118&dataset=BXDPublish"
+"26126" "17"    71.754  72.1374 4.7     "https://genenetwork.org/show_trait?trait_id=26126&dataset=BXDPublish"
+"26135" "15"    93.3404 94.2523 5.5     "https://genenetwork.org/show_trait?trait_id=26135&dataset=BXDPublish"
+```
+
+Now when I click the link for 26118 I can run HK and GEMMA, and I can confirm we have a new result on Chr2 and Chr9.
+Very cool. Next we want to show the trait info and authors, so we can see whom to approach with this new information.
+
+Now in the phenotype RDF we have
+
+```
+gn:traitBxd_10001 rdf:type gnc:Phenotype .
+gn:traitBxd_10001 gnt:belongsToGroup gn:setBxd .
+gn:traitBxd_10001 gnt:traitId "10001" .
+gn:traitBxd_10001 dct:description "Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]" .
+gn:traitBxd_10001 gnt:submitter "robwilliams" .
+gn:traitBxd_10001 dct:isReferencedBy pubmed:11438585 .
+```
+
+The submitter is mostly one of the GN team. The pubmed ID may help find the authors. Bonz RDF'd it as:
+
+```
+pubmed:11438585 rdf:type fabio:ResearchPaper .
+pubmed:11438585 fabio:hasPubMedId pubmed:11438585 .
+pubmed:11438585 dct:title "Genetic control of the mouse cerebellum: identification of quantitative trait loci modulating size and architecture" .
+pubmed:11438585 fabio:Journal "J Neuroscience" .
+pubmed:11438585 prism:volume "21" .
+pubmed:11438585 fabio:page "5099-5109" .
+pubmed:11438585 fabio:hasPublicationYear "2001"^^xsd:gYear .
+pubmed:11438585 dct:creator "Airey DC" .
+pubmed:11438585 dct:creator "Lu L" .
+pubmed:11438585 dct:creator "Williams RW" .
+```
+
+So we can fetch that when it is available. You can run the query here:
+
+=> http://sparql-test.genenetwork.org/sparql/
+
+Just copy paste:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX fabio: <http://purl.org/spar/fabio/>
+
+SELECT ?trait, ?chr, ?start, ?stop, ?lod, ?year, ?submitter, SAMPLE(?author as ?one_author), ?url, ?descr  WHERE {
+   ?qtl gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlStart ?start ;
+          gnt:qtlStop ?stop ;
+          a gnt:newQTL ;
+          gnt:qtlLOD ?lod .
+   ?traitid gnt:traitId ?trait .
+   OPTIONAL { ?phenoid gnt:traitId ?trait ;
+          a gnc:Phenotype ;
+          gnt:belongsToGroup gn:setBxd ;
+          gnt:submitter ?submitter ;
+          dct:description ?descr ;
+          dct:isReferencedBy ?pubid . } .
+         ?pubid dct:creator ?author ;
+                     fabio:hasPublicationYear ?pubyear .
+   BIND(concat(str(?pubyear)) as ?year)
+   BIND(REPLACE(?trait, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url)
+} ORDER by ?trait
+LIMIT 100
+"10002" "1" 72.2551 73.3771 5.1 "2001"  "robwilliams" "Lu L"  "https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish"  "Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]"
+"10004" "8" 95.6926 97.3516 4.6 "2001"  "robwilliams" "Lu L"  "https://genenetwork.org/show_trait?trait_id=10004&dataset=BXDPublish"  "Central nervous system, morphology: Cerebellum volume [mm3]"
+"10013" "2" 160.117 160.304 4.8 "1996"  "robwilliams" "Alexander RC"  "https://genenetwork.org/show_trait?trait_id=10013&dataset=BXDPublish"  "Central nervous system, behavior: Saline control response 0.9% ip, locomotor activity from 0-60 min after injection just prior to injection of 5 mg/kg amphetamine [cm]"
+(...)
+```
+
+Currently authors are not 'ranked' in RDF, so I pick a random one. I can add ranking later, so we get the first author. We also have the option to fetch all traits that involve, for example, Dave Ashbrook.
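+
+A sketch of that author lookup, using the prefixes from the query above (the exact creator string is a guess - check how the name is spelled in the RDF):
+
+```
+SELECT DISTINCT ?trait, ?descr WHERE {
+   ?phenoid a gnc:Phenotype ;
+          gnt:traitId ?trait ;
+          dct:description ?descr ;
+          dct:isReferencedBy ?pubid .
+   ?pubid dct:creator ?author .
+   FILTER(CONTAINS(STR(?author), "Ashbrook"))
+}
+```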
+
+We can also look for details like skewness by adding
+
+```
+     ?traitid gnt:traitId ?trait  ;
+            gnt:skew ?skew .
+```
+
+# Testing pangenome derived genotypes
+
+We continue testing new genotypes in this document:
+
+=> ../genetics/test-pangenome-derived-genotypes
+
+# Introducing epochs
+
+See:
+
+=> topics/data/epochs
diff --git a/topics/systems/migrate-p2.gmi b/topics/systems/migrate-p2.gmi
deleted file mode 100644
index c7fcb90..0000000
--- a/topics/systems/migrate-p2.gmi
+++ /dev/null
@@ -1,12 +0,0 @@
-* Penguin2 crash
-
-This week the boot partition of P2 crashed. We have a few lessons here, not least having a fallback for all services ;)
-
-* Tasks
-
-- [ ] setup space.uthsc.edu for GN2 development
-- [ ] update DNS to tux02 128.169.4.52 and space 128.169.5.175
-- [ ] move CI/CD to tux02
-
-
-* Notes
diff --git a/topics/systems/restore-backups.gmi b/topics/systems/restore-backups.gmi
index 518c56d..b97af2b 100644
--- a/topics/systems/restore-backups.gmi
+++ b/topics/systems/restore-backups.gmi
@@ -26,7 +26,7 @@ The last backup on 'tux02' is from October 2022 - after I did a reinstall. That
 
 According to sheepdog the drops are happening to 'space' and 'epysode', but 'tux02' is missing:
 
-=> https://rabbit.genenetwork.org/sheepdog/index.html
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
 
 ## Mariadb
 
diff --git a/topics/systems/screenshot-github-webhook.png b/topics/systems/screenshot-github-webhook.png
new file mode 100644
index 0000000..08feed3
--- /dev/null
+++ b/topics/systems/screenshot-github-webhook.png
Binary files differ
diff --git a/topics/systems/security.gmi b/topics/systems/security.gmi
new file mode 100644
index 0000000..a7192d4
--- /dev/null
+++ b/topics/systems/security.gmi
@@ -0,0 +1,61 @@
+# Security
+
+We secure our systems by running recent stable versions of Linux distributions. We are also minimalistic about what we install and run, and web services typically run in Guix system containers (a kind of lightweight Docker setup).
+
+## ssh
+
+Secure shell is very important. First we disable password logins: we use keys only. We also limit logins with AllowUsers, since that makes it easy to add and remove users.
+
+```
+--- a/ssh/sshd_config
++++ b/ssh/sshd_config
+@@ -54,7 +54,7 @@ Include /etc/ssh/sshd_config.d/*.conf
+ #IgnoreRhosts yes
+
+ # To disable tunneled clear text passwords, change to no here!
+-#PasswordAuthentication yes
++PasswordAuthentication no
+ #PermitEmptyPasswords no
+
++AllowUsers marco daniel ...
+```
+
+Note that the keys themselves should be protected with a passphrase.
+
+## Firewalling
+
+We typically use the monitored Cisco firewall UTHSC provides. In addition we use nftables, e.g. in /etc/nftables.conf:
+
+```
+table inet filter {
+        set udp_accepted {
+                type inet_service
+                flags interval
+                elements = { 60000-61000 } # for mosh
+        }
+        chain input {
+                type filter hook input priority filter; policy drop;
+                ct state { established, related } accept
+                iifname "lo" accept
+                iifname "lo" ip saddr != 127.0.0.0/8 drop
+                tcp dport ssh limit rate 5/minute accept
+                tcp dport { http, https } accept
+                tcp dport mysql ip saddr { list of ips } accept
+                udp dport @udp_accepted accept
+                reject with icmp port-unreachable
+        }
+        chain forward {
+                type filter hook forward priority filter; policy accept;
+        }
+        chain output {
+                type filter hook output priority filter; policy accept;
+        }
+}
+```
+
+Enable this with
+
+```
+systemctl enable --now nftables  # load the ruleset now and at every boot
+nft list ruleset                 # verify the active rules
+```
diff --git a/topics/systems/synchronising-the-different-environments.gmi b/topics/systems/synchronising-the-different-environments.gmi
new file mode 100644
index 0000000..207b234
--- /dev/null
+++ b/topics/systems/synchronising-the-different-environments.gmi
@@ -0,0 +1,68 @@
+# Synchronising the Different Environments
+
+## Tags
+
+* status: open
+* priority:
+* type: documentation
+* assigned: fredm
+* keywords: doc, docs, documentation
+
+## Introduction
+
+We have different environments we run for various reasons, e.g.
+
+* Production: This is the user-facing environment. This is what GeneNetwork is about.
+* gn2-fred: production-adjacent. It is meant to test out changes before they get to production. It is **NOT** meant for users.
+* CI/CD: Used for development. The latest commits get auto-deployed here. It's the first place (outside of developer machines) where errors and breakages are caught and/or revealed. This will break a lot. Do not expose to users!
+* staging: Uploader environment. This is where Felix, Fred and Arthur flesh out the upload process, and tasks, and also test out the uploader.
+
+These different environments demand synchronisation, in order to have mostly similar results and failure modes.
+
+## Synchronisation of the Environments
+
+### Main Database: MariaDB
+
+* [ ] TODO: Describe process
+
+=> https://issues.genenetwork.org/topics/systems/restore-backups Extract borg archive
+* Automate? Will probably need some checks for data sanity.
+
+### Authorisation Database
+
+* [ ] TODO: Describe process
+
+* Copy backup from production
+* Update/replace GN2 client configs in database
+* What other things?
+
+### Virtuoso/RDF
+
+* [ ] TODO: Describe process
+
+* Copy TTL (Turtle) files from (where?). Production might not always be latest source of TTL files.
+=> https://issues.genenetwork.org/issues/set-up-virtuoso-on-production Run setup to "activate" database entries
+* Can we automate this? What checks are necessary?
+
+### Genotype Files
+
+* [ ] TODO: Describe process
+
+* Copy from source-of-truth (currently Zach's tux01 and/or production).
+* Rsync?
+
+### gn-docs
+
+* [ ] TODO: Describe process
+
+* Not sure changes from other environments should ever take
+
+### AI Summaries (aka. gnqna)
+
+* [ ] TODO: Describe process
+
+* Update configs (should be once, during container setup)
+
+### Others?
+
+* [ ] TODO: Describe process
diff --git a/topics/systems/update-production-checklist.gmi b/topics/systems/update-production-checklist.gmi
new file mode 100644
index 0000000..2cb0761
--- /dev/null
+++ b/topics/systems/update-production-checklist.gmi
@@ -0,0 +1,197 @@
+# Update production checklist
+
+The last migration round was the move to tux03 (Sept 2025)!
+
+# Tasks
+
+* [X] Install underlying Debian
+* [X] Get guix going
+* [X] Check database settings
+* [X] Check gemma working
+* [X] Check global search
+* [X] Check authentication
+* [X] Check sending E-mails
+* [X] Check SPARQL
+* [X] Make sure info.genenetwork.org and 'space' can reach the DB
+* [ ] Backups
+
+The following are at the system level
+
+* [X] Firewalling and other security measures (sshd)
+* [X] Check tmpdirs (cleanup?)
+* [X] Make sure journalctl persistent (check for reboots)
+* [X] Update certificates in CRON (no longer if not part of Guix)
+* [X] Run trim in CRON
+* [ ] Monitors (sheepdog)
+
+# Install underlying Debian
+
+For our production systems we use Debian as a base install. Once installed:
+
+* [X] set up git in /etc and limit permissions to root user
+* [X] add ttyS0 support for grub and kernel - so out-of-band works
+* [X] start ssh server and configure not to use with passwords
+* [X] start nginx and check external networking
+* [X] mount old root
+* [X] Clean up /etc/profile (remove global profile.d loading)
+* [X] set up E-mail routing
+
+It may help to mount the old root if you have it. Currently that is:
+
+```
+mount /dev/sdd2 /mnt/old-root/
+```
+
+# Get Guix going
+
+* [X] Mount bind /gnu on a large partition
+* [X] Move /gnu/store to larger partition
+* [X] Install Guix daemon
+* [X] Update Guix daemon and setup in systemd (if necessary)
+* [X] Make available in /usr/local/guix-profiles
+
+Next move the /gnu store to a large partition and hard mount it in /etc/fstab with:
+
+```
+/export2/gnu /gnu none defaults,bind 0 0
+```
+
+We can bootstrap with the Debian guix package (though I prefer the guix-install.sh script these days, mostly because it is more modern).
+
+=> https://guix.gnu.org/manual/en/html_node/Binary-Installation.html
+
+
+Run guix pull
+
+```
+guix pull --url=https://codeberg.org/guix/guix  -p ~/opt/guix-pull
+```
+
+Use that also to install guix in /usr/local/guix-profiles
+
+```
+guix package -i guix -p /usr/local/guix-profiles/guix
+```
+
+and update the daemon in systemd accordingly. After that I tend to remove /usr/bin/guix.
+
+The Debian installer configures guix. I tend to remove the profiles from /etc/profile so people have a minimal profile.
+
+# Check database
+
+* [X] Install mariadb
+* [X] Recover database
+* [X] Test permissions
+* [X] Mariadb update my.cnf
+
+Recovering the database from a backup and setting permissions is the best start. We usually take the default mariadb unless production is already on a newer version - in that case we move to a Guix deployment.
+
+On tux02 mariadb-10.5.8 is running. On Debian it is now 10.11.11-0+deb12u1, so we should be good. On Guix it is 10.10 at this point.
+
+```
+apt-get install mariadb-server
+```
+
+Next unpack the database files and set their ownership to the mysql user. And (don't forget) update the /etc/mysql config files.
+
+Restart mysql until you see:
+
+```
+mysql -u webqtlout -p -e "show databases"
++---------------------------+
+| Database                  |
++---------------------------+
+| 20081110_uthsc_dbdownload |
+| db_GeneOntology           |
+| db_webqtl                 |
+| db_webqtl_s               |
+| go                        |
+| information_schema        |
+| kegg                      |
+| mysql                     |
+| performance_schema        |
+| sys                       |
++---------------------------+
+```
+
+=> topics/systems/mariadb/mariadb.gmi
+
+## Recover database
+
+We use borg for backups. First restore the backup onto the PCIe drive - which doubles as a test for overheating!
+
+
+# Check sending E-mails
+
+The swaks package is quite useful to test for a valid receive host:
+
+```
+swaks --to testing-my-server@gmail.com --server smtp.network
+=== Trying smtp.network:25...
+=== Connected to smtp.network.
+<-  220 mailrouter8.network ESMTP NO UCE
+ -> EHLO tux04.network
+<-  250-mailrouter8.network
+<-  250-PIPELINING
+<-  250-SIZE 26214400
+<-  250-VRFY
+<-  250-ETRN
+<-  250-STARTTLS
+<-  250-ENHANCEDSTATUSCODES
+<-  250-8BITMIME
+<-  250-DSN
+<-  250 SMTPUTF8
+ -> MAIL FROM:<root@tux04.network>
+<-  250 2.1.0 Ok
+ -> RCPT TO:<pjotr2020@thebird.nl>
+<-  250 2.1.5 Ok
+ -> DATA
+<-  354 End data with <CR><LF>.<CR><LF>
+ -> Date: Thu, 06 Mar 2025 08:34:24 +0000
+ -> To: pjotr2020@thebird.nl
+ -> From: root@tux04.network
+ -> Subject: test Thu, 06 Mar 2025 08:34:24 +0000
+ -> Message-Id: <20250306083424.624509@tux04.network>
+ -> X-Mailer: swaks v20201014.0 jetmore.org/john/code/swaks/
+ ->
+ -> This is a test mailing
+ ->
+ ->
+ -> .
+<-  250 2.0.0 Ok: queued as 4157929DD
+ -> QUIT
+<-  221 2.0.0 Bye
+=== Connection closed with remote host
+```
+
+An exim configuration can be
+
+```
+dc_eximconfig_configtype='smarthost'
+dc_other_hostnames='genenetwork.org'
+dc_local_interfaces='127.0.0.1 ; ::1'
+dc_readhost=''
+dc_relay_domains=''
+dc_minimaldns='false'
+dc_relay_nets=''
+dc_smarthost='smtp.network'
+CFILEMODE='644'
+dc_use_split_config='false'
+dc_hide_mailname='false'
+dc_mailname_in_oh='true'
+dc_localdelivery='maildir_home'
+```
+
+And this should work:
+
+```
+swaks --to myemailaddress --from john@network --server localhost
+```
+
+# Backups
+
+* [ ] Create an ibackup user.
+* [ ] Install borg (usually guix version)
+* [ ] Create a borg passphrase
+
+=> topics/systems/backups-with-borg.gmi
+=> topics/systems/backup-drops.gmi
diff --git a/topics/systems/virtuoso.gmi b/topics/systems/virtuoso.gmi
index e911a8b..bd7424a 100644
--- a/topics/systems/virtuoso.gmi
+++ b/topics/systems/virtuoso.gmi
@@ -8,6 +8,10 @@ We run instances of virtuoso for our graph databases. Virtuoso is remarkable sof
 ## Running virtuoso
 ### Running virtuoso in a guix system container
 
+See also
+
+=> ../deploy/our-virtuoso-instances
+
 We have a Guix virtuoso service in the guix-bioinformatics channel. The easiest way to run virtuoso is to use the virtuoso service to run it in a guix system container. The only downside of this method is that, since guix system containers require root privileges to start up, you will need root priviliges on the machine you are running this on.
 
 Here is a basic guix system configuration that runs virtuoso listening on port 8891, and with its HTTP server listening on port 8892. Among other things, the HTTP server provides a SPARQL endpoint to interact with.
@@ -104,11 +108,16 @@ After running virtuoso, you will want to change the default password of the `dba
 
 In a typical production virtuoso installation, you will want to change the password of the dba user and disable the dav user. Here are the commands to do so. Pay attention to the single versus double quoting.
 ```
-SQL> set password "dba" "rFw,OntlJ@Sz";
+SQL> set password "dba" "dba";
 SQL> UPDATE ws.ws.sys_dav_user SET u_account_disabled=1 WHERE u_name='dav';
 SQL> CHECKPOINT;
 ```
 
+We now store the passwords in secrets:
+
+*  CI/CD: /export2/guix-containers/genenetwork-development/etc/genenetwork/conf/gn3/secrets.py
+*  Production: /export/guix-containers/genenetwork/etc/genenetwork/genenetwork3/gn3-secrets.py
+
 ## Loading data into virtuoso
 
 Virtuoso supports at least three different ways to load RDF.
@@ -151,6 +160,19 @@ Start isql with something like
 guix shell --expose=verified-data=/var/lib/data virtuoso-ose -- isql -U dba -P password 8981
 ```
 
+The password is in the container secrets file.
+Inside a container, you can also do:
+
+```
+root@tux04 ~# /gnu/store/9d81kdw2frn6b3fwqphsmkssc9zblir1-virtuoso-ose-7.2.11/bin/isql -u dba -P password -S 8981
+OpenLink Virtuoso Interactive SQL (Virtuoso)
+Version 07.20.3238 as of Jan  1 1970
+Type HELP; for help and EXIT; to exit.
+
+*** Error 28000: [Virtuoso Driver]CL034: Bad login
+
+```
+
 To delete a graph:
 
 ```
@@ -166,6 +188,18 @@ rdf_loader_run();
 checkpoint;
 ```
 
+You may not have permissions for the load directory. Check:
+
+```
+select virtuoso_ini_path();
+```
+
+The ini file it points to should allow the relevant directory:
+
+```
+DirsAllowed=/dir
+```
+
 => http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksGuideDeleteLargeGraphs How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store?
 
 When virtuoso has just been started up with a clean state (that is, the virtuoso state directory was empty before virtuoso started), uploading large amounts of data using the SPARQL 1.1 Graph Store HTTP Protocol fails the first time. It succeeds only the second time. It is not clear why. I can only recommend retrying as in this commit:
@@ -274,3 +308,7 @@ To dump data into a ttl file, first make sure that you are in the guix environme
 => https://github.com/genenetwork/dump-genenetwork-database/ Dump Genenetwork Database
 
 See the README for instructions.
+
+For the public GN endpoint visit
+
+=> https://sparql.genenetwork.org/sparql/
diff --git a/topics/testing/mechanical-rob.gmi b/topics/testing/mechanical-rob.gmi
index 9413b47..baf111a 100644
--- a/topics/testing/mechanical-rob.gmi
+++ b/topics/testing/mechanical-rob.gmi
@@ -1,9 +1,74 @@
 # Mechanical Rob
 
-We need to run Mechanical Rob tests as part of our continuous integration tests.
+## Tags
 
-The Mechanical Rob CI tests are functioning again now. To see how to run Mechanical Rob, see the CI job definition in the genenetwork-machines repo.
+* type: documentation, docs
+* assigned: bonfacem, rookie101, fredm
+* priority: medium
+* status: open
+* keywords: tests, testing, mechanical-rob
 
-=> genenetwork-machines/src/branch/main/genenetwork-development.scm
+## What is Mechanical Rob?
 
-The invocation procedure is bound to change as the many environment variables in genenetwork2 are cleared up.
+Mechanical Rob is our name for what could be considered our integration tests.
+
+The idea is that we observe how Prof. Robert Williams (Rob) (and other scientists) use(s) GeneNetwork and create a "mechanical" facsimile of that. The purpose is to ensure that the system works correctly with each and every commit in any of our various repositories.
+
+If any commit causes any part of the Mechanical Rob system to raise an error, then we know, immediately, that something is broken, and the culprit can get onto fixing that with haste.
+
+## Show Me Some Code!!!
+
+Nice! I like your enthusiasm.
+
+You can find the
+=> https://github.com/genenetwork/genenetwork2/tree/testing/test/requests Mechanical Rob code here
+within the genenetwork2 repository.
+
+You can also see how it is triggered in the gn-machines repository in
+=> https://git.genenetwork.org/gn-machines/tree/genenetwork-development.scm this module.
+Search for "genenetwork2-mechanical-rob" within that module and you should find how the system is triggered.
+
+## How About Running it Locally
+
+All the above is nice and all, but sometimes you just want to run the checks locally.
+
+In that case, you can run Mechanical Rob locally by following the steps below:
+(note that these steps are mostly the same ones to run GN2 locally).
+
+
+1. Get a guix shell for GN2 development:
+```
+$ cd genenetwork2/
+$ guix shell --container --network \
+        --expose=</path/to/directory/with/genotypes> \
+        --expose=</path/to/local/genenetwork3> \
+        --expose=</path/to/setting/file> \
+        --expose=</path/to/secrets/file> \
+        --file=guix.scm bash
+```
+The last `bash` ensures we install the Bourne-Again Shell, which we use to launch the application. The `</path/to/local/genenetwork3>` can be omitted if you do not need the latest code in GN3 to be included in your running GN2.
+
+2. Set up the appropriate environment variables:
+```
+[env]$ export HOME=</path/to/home/directory>
+[env]$ export GN2_SETTINGS=</path/to/settings/file>
+[env]$ export SERVER_PORT=5003
+[env]$ export GN2_PROFILE="${GUIX_ENVIRONMENT}"
+[env]$ export GN3_PYTHONPATH=</path/to/local/genenetwork3> # Only needed if you need to test GN3 updates
+```
+
+3. Run the mechanical-rob tests
+```
+[env]$ bash bin/genenetwork2 gn2/default_settings.py -c \
+        test/requests/test-website.py \
+        --all "http://localhost:${SERVER_PORT}"
+```
+Of course, here we are assuming that `SERVER_PORT` has the value of the port on which GN2 is running.
+
+
+## Possible Improvements
+
+Look into using geckodriver to help with the mechanical-rob tests.
+`geckodriver` comes with the
+=> https://icecatbrowser.org/index.html GNU IceCat browser
+which is present as a package in GNU Guix.
diff --git a/topics/xapian/xapian-indexing.gmi b/topics/xapian/xapian-indexing.gmi
index 1c82018..68ab7a6 100644
--- a/topics/xapian/xapian-indexing.gmi
+++ b/topics/xapian/xapian-indexing.gmi
@@ -2,18 +2,48 @@
 
 Due to the enormous size of the GeneNetwork database, indexing it in a reasonable amount of time is a tricky process that calls for careful identification and optimization of the performance bottlenecks. This document is a description of how we achieve it.
 
-Indexing happens in the following three phases.
+Indexing happens in these phases.
 
 * Phase 1: retrieve data from SQL
-* Phase 2: index text
-* Phase 3: write Xapian index to disk
+* Phase 2: retrieve metadata from RDF
+* Phase 3: index text
+* Phase 4: write Xapian index to disk
 
-Phases 1 and 3 (that is, the retrieval of data from SQL and writing of the Xapian index to disk) are I/O bound processes. Phase 2 (the actual indexing of text) is CPU bound. So, we parallelize phase 2 while keeping phases 1 and 3 sequential.
+Phases 1, 2 and 4 are I/O bound. Phase 3 (the actual indexing of text) is CPU bound. So, we parallelize phase 3 while keeping phases 1, 2 and 4 sequential.
 
-There is a long delay in retrieving data from SQL and loading it into memory. In this time, the CPU is waiting on I/O and idling away. In order to avoid this, we retrieve SQL data chunk by chunk and spawn off phase 2 worker processes. Thus, we interleave phase 1 and 2 so that they don't block each other. Despite this, on tux02, the indexing script is only able to keep around 10 of the 128 CPUs busy. As phase 1 is dishing out jobs to phase 2 worker processes, before it can finish dishing out jobs to all 128 CPUs, the earliest worker processes finish and exit. The only way to avoid this and improve CPU utilization would be to further optimize the I/O of phase 1.
+There is a long delay in retrieving data from SQL and loading it into memory. In this time, the CPU is waiting on I/O and idling away. In order to avoid this, we retrieve SQL data chunk by chunk and spawn off phase 3 worker processes. We get RDF data in one large call before any processing is done. Thus, we interleave phases 1 and 3 so that they don't block each other. Despite this, on tux02, the indexing script is only able to keep around 10 of the 128 CPUs busy. As phase 1 is dishing out jobs to phase 3 worker processes, the earliest workers finish and exit before it can hand jobs to all 128 CPUs. The only way to avoid this and improve CPU utilization would be to further optimize the I/O of phase 1.
 
 Building a single large Xapian index is not scalable. See detailed report on Xapian scalability.
 => xapian-scalability
 So, we let each process of phase 3 build its own separate Xapian index. Finally, we compact and combine them into one large index. When writing smaller indexes in parallel, we take care to lock access to the disk so that only one process is writing to the disk at any given time. If many processes try to simultaneously write to the disk, the write speed is slowed down, often considerably, due to I/O contention.
 
-It is important to note that the performance bottlenecks identified in this document are machine-specific. For example, on my laptop with only 2 cores, CPU performance in phase 2 is the bottleneck. Phase 1 I/O waits on the CPU to finish instead of the other way around.
+It is important to note that the performance bottlenecks identified in this document are machine-specific. For example, on my laptop with only 2 cores, CPU performance in phase 3 is the bottleneck. Phase 1 I/O waits on the CPU to finish instead of the other way around.
+
+## Local Development
+
+For local development, see:
+
+=> https://issues.genenetwork.org/topics/engineering/working-with-virtuoso-locally Working with Virtuoso for Local Development
+
+Ping @bmunyoki for the ttl folder backups.
+
+Set up mysql with instructions from
+
+=> https://issues.genenetwork.org/topics/database/setting-up-local-development-database
+
+and load up the backup file using:
+
+> mariadb gn2 < /path/to/backup/file.sql
+
+A backup file can be generated using:
+
+> mysqldump -u mysqluser -pmysqlpasswd --opt --where="1 limit 100000" db_webqtl > out.sql                                  
+> xz out.sql
+
+And run the index script using:
+
+> python3 scripts/index-genenetwork create-xapian-index /tmp/xapian "mysql://gn2:password@localhost/gn2" "http://localhost:8890/sparql"
+
+Verify the index with:
+
+> xapian-delve /tmp/xapian