101 files changed, 10664 insertions, 240 deletions
diff --git a/issues/CI-CD/cd-is-slow.gmi b/issues/CI-CD/cd-is-slow.gmi
new file mode 100644
index 0000000..9b0e1ee
--- /dev/null
+++ b/issues/CI-CD/cd-is-slow.gmi
@@ -0,0 +1,276 @@
+# CD is slow
+
+The pages are slow and some are broken.
+
+We found out that quite a few network calls rely on DNS, and DNS was slow. The configured DNS server was not responding. Using Google's DNS made things go fast again. We will probably introduce dnsmasq in the container to make things even faster.
+
+# Tags
+
+* type: bug
+* status: in progress
+* priority: high
+* assigned: pjotrp
+* interested: pjotrp, bonfacem
+* keywords: deployment, server
+
+# Tasks
+
+* [ ] Use dnsmasq caching - it is a guix system service
+* [ ] Run fewer gunicorn processes on CD (2 should do)
+* [ ] Increase debugging output for GN2
+* [ ] Fix GN3 hook for github (it is not working)
+* [X] gn-guile lacks certificates it can use for sparql
+
+# Measuring
+
+bonfacekilz:
+I'm currently instrumenting the requests to see what hogs time. Loading the landing page takes 32 seconds!
+
+Something's off. From outside the container:
+
+```
+bonfacem@tux02 ~ $ guix shell python-wrapper python-requests -- python time.py
+Status: 200
+Time taken: 32.989222288131714 seconds
+```
+
+From inside the container:
+
+```
+2025-07-18 14:46:36 INFO:gn2.wqflask:Landing page rendered in 8.12 seconds
+```
+
+And I see:
+
+## CD
+
+```
+> curl -w @- -o /dev/null -s https://cd.genenetwork.org <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  8.117543s
+Connect time: 8.117757s
+TLS handshake: 8.197767s
+Pre-transfer: 8.197861s
+Start transfer: 33.096467s
+Total time:   33.096601s
+```
+
+## Production
+```
+> curl -w @- -o /dev/null -s https://genenetwork.org <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  8.075794s
+Connect time: 8.076402s
+TLS handshake: 8.147322s
+Pre-transfer: 8.147370s
+Start transfer: 8.797107s
+Total time:   8.797299s
+```
+
+## On tux02 (outside CD container)
+
+```
+> curl -w @- -o /dev/null -s http://localhost:9092 <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  0.000068s
+Connect time: 0.000543s
+TLS handshake: 0.000000s
+Pre-transfer: 0.000606s
+Start transfer: 24.851069s
+Total time:   24.851166s
+```
+
+This does not look like an nginx problem (at least on tux02 itself). Also the nginx configuration was not really changed.
+The mysql configuration ditto. I can still test both, but it looks like the problem is inside the system container.
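+
+The curl timers above are cumulative (each is measured from the start of the request), so the time spent in each phase is the difference between consecutive values. A minimal sketch of that arithmetic follows; the function name and the phase labels are illustrative and not part of any GN code:
+
+```python
+def phase_durations(namelookup, connect, appconnect, pretransfer,
+                    starttransfer, total):
+    """Split cumulative curl timers (in seconds) into per-phase durations."""
+    # Plain HTTP requests report time_appconnect as 0, so there is no TLS phase.
+    tls_done = appconnect if appconnect > 0 else connect
+    return {
+        "dns": namelookup,
+        "tcp_connect": connect - namelookup,
+        "tls": tls_done - connect,
+        "server_wait": starttransfer - pretransfer,
+        "download": total - starttransfer,
+    }
+
+
+# Values copied from the CD measurement above.
+cd = phase_durations(8.117543, 8.117757, 8.197767, 8.197861,
+                     33.096467, 33.096601)
+for phase, seconds in cd.items():
+    print(f"{phase:12s} {seconds:9.4f}s")
+```
+
+For the CD numbers this attributes about 8.12s to the DNS lookup and about 24.90s to waiting for the first byte from the server; connect, TLS and download are negligible.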
+
+The container logs are at
+
+```
+root@tux02:/export2/guix-containers/genenetwork-development/var/log/cd# tail -100 genenetwork2.log
+```
+
+Some interesting errors there that need resolving, such as
+
+## gn-guile error
+
+```
+tail gn-guile.log
+2025-07-20 04:49:49 X.509 certificate of 'sparql.genenetwork.org' could not be verified:
+2025-07-20 04:49:49   signer-not-found invalid
+```
+
+Guile is not finding the certificates for our virtuoso server. It does work with curl, try
+
+```
+curl -G https://query.wikidata.org/sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="SELECT DISTINCT * where {
+  wd:Q158695 wdt:P225 ?o .
+} limit 5"
+{
+  "head" : {
+    "vars" : [ "o" ]
+  },
+  "results" : {
+    "bindings" : [ {
+      "o" : {
+        "type" : "literal",
+        "value" : "Arabidopsis thaliana"
+      }
+    } ]
+  }
+```
+
+Also inside the container:
+
+```
+curl http://localhost:8091/gene/aliases/Shh
+```
+
+returns the same error: X.509 certificate of 'query.wikidata.org' could not be verified. So it is a gn-guile issue.
+
+## GN2 error reporting
+
+Also there are too many gunicorn processes - and strikingly - no debug output. Also I see a missing robots.txt file (even though LLMs hardly honour them).
+
+Let's try to get inside the container with nsenter:
+
+```
+ps xau|grep genenetwork-development-container
+root     115940  0.0  0.0 163692 26296 ?        Ssl  Jul18   0:00 /gnu/store/ylwk2vn18dkzkj0nxq2h4vjzhz17bm7c-guile-3.0.9/bin/guile --no-auto-compile /usr/local/bin/genenetwork-development-container
+pgrep -P 115940
+115961
+```
+
+Use this child PID and a recent nsenter:
+
+```
+/gnu/store/w7a3frdmffpw3hvxpvvxwxgzfhyqdm6n-profile/bin/nsenter -m -p -t 115961 /run/current-system/profile/bin/bash -login
+```
+
+System tools are in '/run/current-system/profile/bin/'
+
+Make it a one-liner with
+
+```
+/gnu/store/w7a3frdmffpw3hvxpvvxwxgzfhyqdm6n-profile/bin/nsenter -m -p -t $(pgrep -P `ps xau|grep genenetwork-development-container|awk  '{print $2}'|sort -r|head -1`) /run/current-system/profile/bin/bash -login
+```
+
+Once inside we can pick up curl (I note the system container has full access to the /gnu/store on the host):
+
+```
+root@tux02 /# /gnu/store/vdaspmq10c3zmqhp38lfqy812w6r4xg3-curl-8.6.0/bin/curl  -w @- -o /dev/null -s http://localhost:9092 <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  0.000064s
+Connect time: 0.000478s
+TLS handshake: 0.000000s
+Pre-transfer: 0.000551s
+Start transfer: 24.792926s
+Total time:   24.793015s
+```
+
+That rules out container and nginx streaming issues.
+
+So the problem is with GN and its DBs. The gn-machines repository is used from /home/aruni and its checkout dates from March. Has CD been slow since then? I don't think so. Also the changes to the actual scripts are even older. Also the guix-bioinformatics repo shows no changes. Remaining culprits I suspect are:
+
+* [*] MySQL
+* [ ] Interaction gn-auth with gn2
+* [ ] Interaction gnqa with gn2
+
+Running a standard test on mysql shows it is fine:
+
+```
+time mysql -u webqtlout -pwebqtlout db_webqtl < $rundir/../shared/sql/test02.sql
+Name    FullName        Name    Symbol  CAST(ProbeSet."description" AS BINARY)  CAST(ProbeSet."Probe_Target_Description" AS BINARY)    Chr     Mb      Mean    LRS     Locus   pValue  additive        geno_chr        geno_mb
+HC_M2_0606_P    Hippocampus Consortium M430v2 (Jun06) PDNN      1457545_at      9530036O11Rik   long non-coding RNA, expressed sequence tag (EST) AK035474 with high bladder expression        antisense EST 14 Kb upstream of Shh     5     28.480441        6.7419292929293 15.2845189682605        rsm10000001525  0.055   0.0434848484848485      3       9.671673
+HC_M2_0606_P    Hippocampus Consortium M430v2 (Jun06) PDNN      1427571_at      Shh     sonic hedgehog (hedgehog)     last exon        5       28.457886       6.50113131313131        9.58158655605723        rs8253327       0.697   0.0494097096188748     1       191.908118
+HC_M2_0606_P    Hippocampus Consortium M430v2 (Jun06) PDNN      1436869_at      Shh     sonic hedgehog (hedgehog)     mid distal 3' UTR        5       28.457155       9.279090909090911       12.7711275309832        rs8253327       0.306 -0.214087568058076       1       191.908118
+
+real    0m0.010s
+user    0m0.004s
+sys     0m0.000s
+```
+
+# Profiling CD
+
+Ran a profiler against a traits page.  See the following:
+
+=> /issues/CI-CD/profiling-flask
+
+## Results/Interpretation
+
+* By fixing gn-guile and gene-alias resolution, times dropped by ~10s.  However, the page takes 37.9s to run.
+
+* A DNS lookup takes around 4.585s and we make 7 of them, totalling 32.09s. Typically, a traits page should take 8.79s. Without the lookups this page would take (- 37.9 32.09) = 5.81s, so DNS resolution explains the slowness:
+
+```
+ ncall     tottime    percall  cumtime  percall  filename:lineno(function)
+----------------------------------------------------------------------------
+     7  0.00002618  3.741e-05    32.09    4.585  socket.py:938(getaddrinfo)
+```
+
+* The above is consistent with all the analysis I've done across all the profile dumps.
+
+* Testing my theory out:
+
+```
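+import time
+from urllib.parse import urljoin
+
+import requests
+from flask import current_app
+
+
+# `app` is the GN2 Flask application object, defined elsewhere.
+@app.route("/test-network")
+def test_network():
+    # Time a request that goes through the public hostname (needs a DNS lookup).
+    start = time.time()
+    http_url = urljoin(
+            current_app.config["GN_SERVER_URL"],
+            "version"
+        )
+    result = requests.get(http_url)
+    duration = time.time() - start
+    app.logger.error(f"{http_url}: {duration:.4f}s")
+
+    # Time a request that goes straight to localhost for comparison.
+    start = time.time()
+    local_url = "http://localhost:9093/api/version"
+    result = requests.get(local_url)
+    duration = time.time() - start
+    app.logger.error(f"{local_url}: {duration:.4f}s")
+    return result.json()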
+```
+
+* Results:
+
+```
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] https://cd.genenetwork.org/api3/version: 8.1647s
+2025-07-24 10:20:43 ERROR:gn2.wqflask:https://cd.genenetwork.org/api3/version: 8.1647s
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] result: 1.0
+2025-07-24 10:20:43 ERROR:gn2.wqflask:result: 1.0
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] http://localhost:9093/api/version: 0.0088s
+2025-07-24 10:20:43 ERROR:gn2.wqflask:http://localhost:9093/api/version: 0.0088s
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] result: 1.0
+```
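+
+The name resolution step can also be timed on its own, without Flask or requests in the way. This is a minimal sketch, assuming it is run with the container's Python; `time_lookup` is a hypothetical helper name, and the `resolver` parameter exists only so that the lookup can be swapped out when testing:
+
+```python
+import socket
+import time
+
+
+def time_lookup(host, port=443, resolver=socket.getaddrinfo):
+    """Return the seconds spent resolving `host` with `resolver`."""
+    start = time.perf_counter()
+    resolver(host, port)
+    return time.perf_counter() - start
+```
+
+Calling time_lookup("cd.genenetwork.org") and time_lookup("localhost") from inside the container should show whether the delay seen in getaddrinfo comes from the configured DNS server.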
+
+## Possible Mitigations
+
+* Switch over gn-auth.genenetwork.org to localhost.
diff --git a/issues/CI-CD/development-container-checklist.gmi b/issues/CI-CD/development-container-checklist.gmi
new file mode 100644
index 0000000..7cf4687
--- /dev/null
+++ b/issues/CI-CD/development-container-checklist.gmi
@@ -0,0 +1,101 @@
+# Deploying GeneNetwork CD
+
+## Prerequisites
+
+Ensure you have `fzf' installed and Guix is set up with your preferred channel configuration. 
+
+
+## Step 1: Pull the Latest Profiles
+
+```
+guix pull -C channels.scm -p ~/.guix-extra-profiles/gn-machines --allow-downgrades
+guix pull -C channels.scm -p ~/.guix-extra-profiles/gn-machines-shepherd-upgrade --allow-downgrades
+```
+
+
+## Step 2: Source the Correct Profile
+
+```
+. ,choose-profile
+```
+
+
+### Contents of `,choose-profile'
+
+This script lets you interactively select a profile using `fzf': 
+
+```
+#!/bin/env sh
+
+export GUIX_PROFILE="$(guix package --list-profiles | fzf --multi)"
+. "$GUIX_PROFILE/etc/profile"
+
+hash guix
+
+echo "Currently using: $GUIX_PROFILE"
+```
+
+
+## Step 3: Verify the Profile
+
+```
+guix describe
+```
+
+
+## Step 4: Pull the Latest Code
+
+```
+cd gn-machines
+git pull
+```
+
+
+## Step 5: Run the Deployment Script
+
+```
+./genenetwork-development-deploy.sh
+```
+
+
+## Step 6: Restart the Development Container
+
+```
+sudo systemctl restart genenetwork-development-container
+```
+
+
+## Step 7: Verify Changes
+
+Manually confirm that the intended changes were applied correctly. 
+
+
+# Accessing the Development Container on tux02
+
+To enter the running container shell, ensure you're using the *parent* PID of the `shepherd' process. 
+
+
+## Step 1: Identify the Correct PID
+
+Use this command to locate the correct container parent process: 
+
+```
+ps -u root -f --forest | grep -A4 '/usr/local/bin/genenetwork-development-container' | grep shepherd
+```
+
+
+## Step 2: Enter the Container
+
+Replace `46804' with your actual parent PID: 
+
+```
+sudo /home/bonfacem/.config/guix/current/bin/guix container exec 46804 \
+  /gnu/store/m6c5hgqg569mbcjjbp8l8m7q82ascpdl-bash-5.1.16/bin/bash \
+  --init-file /home/bonfacem/.guix-profile/etc/profile --login
+```
+
+
+## Notes
+
+* Ensure the PID is the container’s *shepherd parent*, not a child process. 
+* Always double-check your environment and profiles before deploying. 
diff --git a/issues/CI-CD/failing-services-startup.gmi b/issues/CI-CD/failing-services-startup.gmi
new file mode 100644
index 0000000..751e61c
--- /dev/null
+++ b/issues/CI-CD/failing-services-startup.gmi
@@ -0,0 +1,236 @@
+# Failing Services' Startup
+
+## Tags
+
+* type: bug
+* status: closed, completed
+* priority: high
+* assigned: fredm, bonfacem
+* interested: pjotrp, bonfacem, aruni
+* keywords: deployment, CI, CD
+
+## Description
+
+Upgrading guix to `34453b97005ff86355399df89c8827c57839d9c7` for CI/CD fails with:
+
+```
+2025-08-20 16:05:20 Backtrace:
+2025-08-20 16:05:20            6 (primitive-load "/gnu/store/xbxd2zihw9dssrhips925gri0yn?")
+2025-08-20 16:05:20 In ice-9/eval.scm:
+2025-08-20 16:05:20    191:35  5 (_ _)
+2025-08-20 16:05:20 In gnu/build/linux-container.scm:
+2025-08-20 16:05:20     368:8  4 (call-with-temporary-directory #<procedure 7f014aa3a3f0?>)
+2025-08-20 16:05:20    476:16  3 (_ "/tmp/guix-directory.VWRNbv")
+2025-08-20 16:05:20      62:6  2 (call-with-clean-exit #<procedure 7f014aa1de80 at gnu/b?>)
+2025-08-20 16:05:20    321:20  1 (_)
+2025-08-20 16:05:20 In guix/build/syscalls.scm:
+2025-08-20 16:05:20   1231:10  0 (_ 268566528)
+2025-08-20 16:05:20 
+2025-08-20 16:05:20 guix/build/syscalls.scm:1231:10: In procedure unshare: 268566528: Invalid argument
+2025-08-20 16:05:20 Backtrace:
+2025-08-20 16:05:20            4 (primitive-load "/gnu/store/xbxd2zihw9dssrhips925gri0yn?")
+2025-08-20 16:05:20 In ice-9/eval.scm:
+2025-08-20 16:05:20    191:35  3 (_ #f)
+2025-08-20 16:05:20 In gnu/build/linux-container.scm:
+2025-08-20 16:05:20     368:8  2 (call-with-temporary-directory #<procedure 7f014aa3a3f0?>)
+2025-08-20 16:05:20     485:7  1 (_ "/tmp/guix-directory.VWRNbv")
+2025-08-20 16:05:20 In unknown file:
+2025-08-20 16:05:20            0 (waitpid #f #<undefined>)
+2025-08-20 16:05:20 
+2025-08-20 16:05:20 ERROR: In procedure waitpid:
+2025-08-20 16:05:20 Wrong type (expecting exact integer): #f
+```
+
+Failing services:
+
+* genenetwork3: consistently
+* genenetwork2: consistently
+* gn-auth: intermittently
+
+## Troubleshooting Notes
+
+### Unable to run genenetwork2 in a shell container with the "-C" flag
+
+With the following channels:
+
+```
+$ guix describe
+Generation 3    Aug 28 2025 03:56:44    (current)
+  gn-bioinformatics cffafde
+    repository URL: file:///home/bonfacem/guix-bioinformatics/
+    branch: master
+    commit: cffafde125f3e711418d3ebb62eacd48a3efa8cf
+  guix-forge 3c8dc85
+    repository URL: https://git.genenetwork.org/guix-forge/
+    branch: main
+    commit: 3c8dc85a584c98bc90088ec1c85933d4d10e7383
+  guix-past b14d7f9
+    repository URL: https://codeberg.org/guix-science/guix-past
+    branch: master
+    commit: b14d7f997ae8eec788a7c16a7252460cba3aaef8
+  guix 34453b9
+    repository URL: https://codeberg.org/guix/guix
+    branch: master
+    commit: 34453b97005ff86355399df89c8827c57839d9c7
+```
+
+Running:
+
+```
+$ guix shell -C genenetwork2
+```
+
+Produces:
+
+```
+guix shell: error: unshare: 268566528: Invalid argument
+Backtrace:
+          16 (primitive-load "/export3/local/home/bonfacem/.guix-ext…")
+In guix/ui.scm:
+   2399:7 15 (run-guix . _)
+  2362:10 14 (run-guix-command _ . _)
+In ice-9/boot-9.scm:
+  1752:10 13 (with-exception-handler _ _ #:unwind? _ # _)
+In guix/status.scm:
+    842:4 12 (call-with-status-report _ _)
+In guix/store.scm:
+    703:3 11 (_)
+In ice-9/boot-9.scm:
+  1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
+In guix/store.scm:
+   690:37  9 (thunk)
+   1331:8  8 (call-with-build-handler _ _)
+   1331:8  7 (call-with-build-handler #<procedure 7fc86bb50de0 at g…> …)
+In guix/scripts/environment.scm:
+  1205:11  6 (proc _)
+In guix/store.scm:
+  2212:25  5 (run-with-store #<store-connection 256.100 7fc87a46d820> …)
+In guix/scripts/environment.scm:
+    911:8  4 (_ _)
+In gnu/build/linux-container.scm:
+    485:7  3 (call-with-container _ _ #:namespaces _ #:host-uids _ # …)
+In unknown file:
+           2 (waitpid #f #<undefined>)
+In ice-9/boot-9.scm:
+  1685:16  1 (raise-exception _ #:continuable? _)
+  1685:16  0 (raise-exception _ #:continuable? _)
+
+ice-9/boot-9.scm:1685:16: In procedure raise-exception:
+Wrong type (expecting exact integer): #f
+```
+
+This is fixed by increasing the value of respawn-delay (default is 0.5s) to 5s.
+
+
+### Unable to write to a temporary directory and issues with running git inside the g-exp
+
+Stack trace:
+```
+2025-09-03 12:23:32 In ice-9/eval.scm:
+2025-09-03 12:23:32    191:35  3 (_ #f)
+2025-09-03 12:23:32 In gnu/build/linux-container.scm:
+2025-09-03 12:23:32     368:8  2 (call-with-temporary-directory #<procedure 7f012241d3f0?>)
+2025-09-03 12:23:32     485:7  1 (_ "/tmp/guix-directory.Bl6jtx")
+2025-09-03 12:23:32 In unknown file:
+2025-09-03 12:23:32            0 (waitpid #f #<undefined>)
+2025-09-03 12:23:32
+
+```
+
+Cryptic message. Running the g-exps as a program shows:
+
+```
+Receiving objects: 100% (698/698), 16.18 MiB | 30.29 MiB/s, done.
+Resolving deltas: 100% (49/49), done.
+==================================================
+error: cannot run less: No such file or directory
+fatal: unable to execute pager 'less'
+Backtrace:
+           5 (primitive-load "/gnu/store/c9bvy90s5mglp6xdfkc1s4qkzj8?")
+In ice-9/eval.scm:
+    619:8  4 (_ #f)
+In ice-9/boot-9.scm:
+    142:2  3 (dynamic-wind #<procedure 7fa954b25880 at ice-9/eval.s?> ?)
+    142:2  2 (dynamic-wind #<procedure 7fa94b7970c0 at ice-9/eval.s?> ?)
+In ice-9/eval.scm:
+    619:8  1 (_ #(#(#<directory (guile-user) 7fa954b03c80>)))
+In guix/build/utils.scm:
+    822:6  0 (invoke "git" "log" "--max-count" "1")
+
+guix/build/utils.scm:822:6: In procedure invoke:
+ERROR:
+  1. &invoke-error:
+      program: "git"
+      arguments: ("log" "--max-count" "1")
+      exit-status: 128
+      term-signal: #f
+      stop-signal: #f
+```
+
+Fixed by adding "less" to the with-packages form and setting:
+
+```
+(setenv "TERM" "xterm-256color")
+
+```
+
+### gn-auth: sqlite3.OperationalError: unable to open database file
+
+Despite having all file perms correctly set with 0644, we see:
+
+```
+Traceback (most recent call last):
+  File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/flask/app.py", line 917, in full_dispatch_request
+    rv = self.dispatch_request()
+         ^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/flask/app.py", line 902, in dispatch_request
+    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/views.py", line 102, in authorise
+    return with_db_connection(__authorise__)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/db/sqlite3.py", line 63, in with_db_connection
+    return func(conn)
+           ^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/views.py", line 90, in __authorise__
+    return server.create_authorization_response(request=request, grant_user=user)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/authlib/oauth2/rfc6749/authorization_server.py", line 297, in create_authorization_response
+    args = grant.create_authorization_response(redirect_uri, grant_user)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 31, in create_authorization_response
+    response = super().create_authorization_response(
+               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/authlib/oauth2/rfc6749/grants/authorization_code.py", line 158, in create_authorization_response
+    self.save_authorization_code(code, self.request)
+  File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 45, in save_authorization_code
+    return __save_authorization_code__(
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 106, in __save_authorization_code__
+    return with_db_connection(lambda conn: save_authorisation_code(conn, code))
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/db/sqlite3.py", line 63, in with_db_connection
+    return func(conn)
+           ^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 106, in <lambda>
+    return with_db_connection(lambda conn: save_authorisation_code(conn, code))
+                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/models/authorization_code.py", line 92, in save_authorisation_code
+    cursor.execute(
+sqlite3.OperationalError: unable to open database file
+```
+
+Fixed the above by correctly mapping:
+
+```
+-                                                (source auth-db-path)
++                                                (source (dirname auth-db-path))
+```
+
+in the relevant g-exp, and making sure that the parent directory is set to #o775 (rwx for both user/group).
+
+## Also See
+
+=> https://issues.guix.gnu.org/78356 Broken system and home containers
+=> https://codeberg.org/guix/guix/src/commit/34453b97005ff86355399df89c8827c57839d9c7/guix/build/syscalls.scm#L1218-L1233 How "unshare" is defined
+=> https://codeberg.org/guix/guix/src/commit/34453b97005ff86355399df89c8827c57839d9c7/gnu/build/linux-container.scm#L321 Where `unshare` is called
diff --git a/issues/CI-CD/profiling-flask.gmi b/issues/CI-CD/profiling-flask.gmi
new file mode 100644
index 0000000..2d0c539
--- /dev/null
+++ b/issues/CI-CD/profiling-flask.gmi
@@ -0,0 +1,33 @@
+# Profiling GN
+
+Use this simple structure:
+
+```
+from werkzeug.middleware.profiler import ProfilerMiddleware
+
+
+app = Flask(__name__)
+app.config["PROFILE"] = True
+app.wsgi_app = ProfilerMiddleware(
+    app.wsgi_app,
+    restrictions=[40, "main"],
+    profile_dir="profiler_dump",
+    filename_format="{time:.0f}-{method}-{path}-{elapsed:.0f}ms.prof",
+)
+```
+
+
+You can use gprof2dot to visualise the profile
+
+```
+guix shell gprof2dot -- gprof2dot -f pstats 1753202013-GET-show_trait-37931ms.prof > 1753202013-GET-show_trait-37931ms.prof.dot
+guix shell xdot -- xdot 1753202013-GET-show_trait-37931ms.prof.dot
+```
+
+Or snakeviz to visualize it:
+
+
+```
+scp genenetwork:/home/bonfacem/profiling/1753202013-GET-show_trait-37931ms.prof /tmp/test
+snakeviz 1753202013-GET-show_trait-37931ms.prof
+```
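+
+A dump can also be inspected without a GUI using the standard library's pstats module. This is a minimal sketch, assuming Python 3.9 or later (for `get_stats_profile`); `top_cumulative` is an illustrative helper name, not part of GN:
+
+```python
+import pstats
+
+
+def top_cumulative(prof_path, limit=10):
+    """Return (function name, cumulative seconds) pairs, slowest first."""
+    profile = pstats.Stats(prof_path).get_stats_profile()
+    # func_profiles is keyed by function name only, so functions that share
+    # a name across modules collapse into a single entry.
+    rows = [(name, fp.cumtime) for name, fp in profile.func_profiles.items()]
+    rows.sort(key=lambda row: row[1], reverse=True)
+    return rows[:limit]
+```
+
+Pass it the path of one of the .prof files written to "profiler_dump" and print the returned rows to see which calls dominate the request.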
diff --git a/issues/CI-CD/troubleshooting-within-the-development-container.gmi b/issues/CI-CD/troubleshooting-within-the-development-container.gmi
new file mode 100644
index 0000000..3aa8c3b
--- /dev/null
+++ b/issues/CI-CD/troubleshooting-within-the-development-container.gmi
@@ -0,0 +1,46 @@
+# Troubleshooting inside the GN dev container
+* type: systems, debugging, container
+* keywords: container, troubleshooting, logs, webhooks
+
+You need to find the development container so that you can begin troubleshooting:
+
+```
+ps -u root -f --forest | grep -A4 '/usr/local/bin/genenetwork-development-container' | grep shepherd
+```
+
+Example output:
+
+```
+root      16182  16162  0 03:57 ?        00:00:04  \_ /gnu/store/n87px1cazqkav83npg80ccp1n777j08s-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/b4n5ax7l1ccia7sr123fqcjqi4vy03pv-shepherd-1.0.2/bin/shepherd --config /gnu/store/5ahb3745wlpa5mjsbk8j6frn78khvzzw-shepherd.conf
+```
+
+Get into the container:
+
+```
+# Use the correct pid and guix/bash path.
+
+sudo /home/bonfacem/.config/guix/current/bin/guix container exec 16182 /gnu/store/m6c5hgqg569mbcjjbp8l8m7q82ascpdl-bash-5.1.16/bin/bash --init-file /home/bonfacem/.guix-profile/etc/profile --login
+```
+
+All the gn related logs can be found in "/var/log/cd":
+
+```
+genenetwork2.log
+genenetwork3.log
+gn-auth.log
+gn-guile.log
+```
+
+All the nginx logs are in "/var/log/nginx".
+
+Sometimes, it's useful to trigger webhooks while troubleshooting.  Here are all the relevant webhooks:
+
+```
+/gn-guile
+/genenetwork2
+/genenetwork3
+/gn-libs
+/gn-auth
+```
+
+Inside the container, we have "coreutils-minimal", and "curl" that you can use to troubleshoot.
diff --git a/issues/acme-error.gmi b/issues/acme-error.gmi
new file mode 100644
index 0000000..b31d04b
--- /dev/null
+++ b/issues/acme-error.gmi
@@ -0,0 +1,106 @@
+# uACME Error: "urn:ietf:params:acme:error:unauthorized"
+
+## Tags
+
+* status: closed, completed
+* priority: high
+* type: bug
+* assigned: fredm
+* keywords: uacme, certificates, "urn:ietf:params:acme:error:unauthorized"
+
+## Description
+
+Sometimes, when we attempt to request TLS certificates from Let's Encrypt using uacme, we run into an error of the following form:
+
+```
+uacme: polling challenge status at https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599167439271/jFB2Pg
+uacme: challenge https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599167439271/jFB2Pg failed with status invalid
+uacme: the server reported the following error:
+{
+    "type": "urn:ietf:params:acme:error:unauthorized",
+    "detail": "128.xxx.xxx.xxx: Invalid response from http://sparql.genenetwork.org/.well-known/acme-challenge/N-P-mhiK04c-Iophbem4iFYsaByeaxeSyXHSijx3e6k: 404",
+    "status": 403
+}
+uacme: running /gnu/store/zwqavgjqyk0f0krv8ndwhv3767f6cnx1-uacme-hook failed http-01 sparql.genenetwork.org N-P-mhiK04c-Iophbem4iFYsaByeaxeSyXHSijx3e6k N-P-mhiK04c-Iophbem4iFYsaByeaxeSyXHSijx3e6k.9dRdXFhCbqeDGWYndRd_hTh920rplmy-ef-_aLgjJJE
+uacme: failed to authorize order at https://acme-v02.api.letsencrypt.org/acme/order/2399017717/438986245271
+
+```
+
+From the above error, we note that the request for the "/.well-known/..." path fails with a 404 code. Why?
+
+Let's try figuring it out; connect to the running container:
+
+```
+$ sudo guix container exec 89086 /run/current-system/profile/bin/bash --login
+root@sparql /# cd /var/run/acme/acme-challenge/
+root@sparql /var/run/acme/acme-challenge# while true; do ls; sleep 0.5; clear; done
+```
+
+In a separate terminal, connect to the same container and run `/usr/bin/acme renew`.
+
+The loop we created to list what files are created in the challenge directory outputs the file
+
+```
+root@sparql /var/run/acme/acme-challenge# while true; do ls; sleep 0.5; clear; done
+Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE
+Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE
+⋮
+```
+
+but we are still getting the same error:
+
+```
+uacme: challenge https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599184604221/7mTNdA failed with status invalid
+uacme: the server reported the following error:
+{   
+    "type": "urn:ietf:params:acme:error:unauthorized",
+    "detail": "128.169.5.101: Invalid response from http://sparql.genenetwork.org/.well-known/acme-challenge/Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE: 404",
+    "status": 403
+}
+uacme: running /gnu/store/zwqavgjqyk0f0krv8ndwhv3767f6cnx1-uacme-hook failed http-01 sparql.genenetwork.org Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE.9dRdXFhCbqeDGWYndRd_hTh920rplmy-ef-_aLgjJJE
+uacme: failed to authorize order at https://acme-v02.api.letsencrypt.org/acme/order/2399017717/438997397751
+```
+
+meaning that somehow, nginx is not able to serve up this file.
+
+## Discovered Cause: 2025-10-20
+
+There are 2 layers of nginx, the host nginx, and the internal/container nginx.
+
+The host nginx was proxying directly to the virtuoso http server rather than proxying to the internal/container nginx. This led to the failure because the internal/container nginx handles the TLS/SSL certificates for the site. The host nginx should have offloaded the handling of the TLS/SSL certificates to the internal/container nginx, but since requests were not going through the internal nginx, the challenge failed.
+
+A sketch of the error condition and of the solution is in the sections below:
+
+### Error Condition: Wrong proxying
+
+In host's "nginx.conf":
+```
+⋮
+ proxy_pass http://localhost:<virtuoso-http-server-port>;
+⋮
+```
+
+In internal/container "nginx.conf":
+```
+⋮
+ proxy_pass http://localhost:<virtuoso-http-server-port>;
+⋮
+```
+
+### Solution/Fix
+
+In host's "nginx.conf":
+```
+⋮
+ proxy_pass http://localhost:<container-nginx-http-port>;
+⋮
+```
+
+In internal/container "nginx.conf":
+```
+⋮
+ proxy_pass http://localhost:<virtuoso-http-server-port>;
+⋮
+```
diff --git a/issues/auth/masquarade-as-bug.gmi b/issues/auth/masquarade-as-bug.gmi
index 12c2c5f..36fe34a 100644
--- a/issues/auth/masquarade-as-bug.gmi
+++ b/issues/auth/masquarade-as-bug.gmi
@@ -2,6 +2,7 @@
 
 * assigned: fredm
 * tags: critical
+* status: closed, completed
 
 Right now you can't masquared as another user.  Here's the trace:
 
diff --git a/issues/correlation-timing-out.gmi b/issues/correlation-timing-out.gmi
index 419524d..bed8692 100644
--- a/issues/correlation-timing-out.gmi
+++ b/issues/correlation-timing-out.gmi
@@ -5,7 +5,7 @@
 * assigned: fredm, zsloan, alexm
 * type: bug
 * priority: high
-* status: ongoing
+* status: closed, completed
 * keywords: correlations
 
 ## Description
@@ -17,3 +17,7 @@ Do correlations against the same dataset
 This might be the same issue as the one in
 => /issues/correlation-missing-file correlation-missing-file.gmi
 but I'm not sure.
+
+## Close as completed
+
+This is fixed.
diff --git a/issues/fix-spam-entries-in-gn-auth-production.gmi b/issues/fix-spam-entries-in-gn-auth-production.gmi
index db88eec..5ef7a42 100644
--- a/issues/fix-spam-entries-in-gn-auth-production.gmi
+++ b/issues/fix-spam-entries-in-gn-auth-production.gmi
@@ -2,6 +2,7 @@
 
 # Tags
 
+* status: closed, completed
 * assigned: fredm
 * keywords: auth
 
@@ -13,4 +14,8 @@ We have spam entries in gn-auth in production in the groups table:
 b59229de-2fce-4a3d-82f1-d9eeee9b7009|Business For Sale Adelaide|{"group_description": "Welcome to Business2Sell, the ultimate online platform for those seeking affordable business opportunities in Adelaide. As a trusted first-party provider, we offer the ideal marketplace for buying or selling businesses across the country. Whether you're an aspiring entrepreneur looking for your next venture or a business owner ready to sell, Business2Sell provides the perfect platform for you. Our user-friendly interface and extensive listings make it effortless to discover a wide range of businesses, all within your budget. Join our vibrant community of buyers and sellers today, and let us help you achieve your business goals in Adelaide with ease and confidence.\r\nhttps://www.business2sell.com.au/businesses/sa/adelaide"}
 ```
 
+## Close as completed
 
+We added email verification when registering, which should help reduce the success of these automated bots.
+
+We also added tooling to help with users and groups management, which is helping clean up this spam data.
diff --git a/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi b/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi
new file mode 100644
index 0000000..b804e10
--- /dev/null
+++ b/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi
@@ -0,0 +1,23 @@
+# guix-bioinformatics: Remove `guix-rust-past-crates` channel
+
+## Tags
+
+* assigned: alexm, bonfacem
+* interested: fredm
+* priority: normal
+* status: open
+* type: bug
+* keywords: guix-bioinformatics, guix-rust-past-crates, guix, rust, crates
+
+## Description
+
+GNU Guix recently changed[1] the way it handles packaging of rust packages.
+
+The old rust packages got moved to the "guix-rust-past-crates" channel to help avoid huge breakages for systems depending on the older packaging system. "guix-bioinformatics" used a number of rust packages, defined in the old form, and we needed a quick fix, thus the introduction of the "guix-rust-past-crates" channel as a dependency.
+
+We need to move away from depending on this channel, by updating all the rust crates we use to the new packaging model.
+
+
+## Footnotes
+
+=> https://guix.gnu.org/en/blog/2025/a-new-rust-packaging-model/ [1]
diff --git a/issues/genenetwork/markdown-editing-service-not-deployed.gmi b/issues/genenetwork/markdown-editing-service-not-deployed.gmi
index e7a1717..9d72e4e 100644
--- a/issues/genenetwork/markdown-editing-service-not-deployed.gmi
+++ b/issues/genenetwork/markdown-editing-service-not-deployed.gmi
@@ -3,7 +3,7 @@
 ## Tags
 
 * type: bug
-* status: open
+* status: closed, completed, fixed
 * assigned: fredm
 * priority: critical
 * keywords: production, container, tux04
@@ -32,3 +32,8 @@ If you do an edit and refresh the page, it will show up in the system, but it wi
 Set `CGIT_REPO_PATH="https://git.genenetwork.org/gn-guile"` which seems to allow the commit to work, but we do not actually get the changes pushed to the remote in any useful sense.
 
 It seems to me, that we need to configure the environment in such a way that it will be able to push the changes to remote.
+
+
+## Close as Completed
+
+The markdown editing service is deployed and configured correctly.
diff --git a/issues/genenetwork/genenetwork2_configurations.gmi b/issues/genenetwork2/genenetwork2_configurations.gmi
index 7d08db0..4ba0a89 100644
--- a/issues/genenetwork/genenetwork2_configurations.gmi
+++ b/issues/genenetwork2/genenetwork2_configurations.gmi
@@ -4,7 +4,7 @@
 
 * assigned: fredm
 * priority: normal
-* status: open
+* status: closed, obsoleted
 * keywords: configuration, config, gn2, genenetwork, genenetwork2
 * type: bug
 
@@ -72,3 +72,10 @@ For `wqflask/run_gunicorn.py`, the route can remain as is, since this is an entr
 ### Non-Executable Configuration Files
 
 Eschew executable formats (*.py) for configuration files and prefer non-executable formats e.g. *.cfg, *.json, *.conf etc
+
+
+## Closed as obsoleted
+
+I am closing this issue as obsoleted, since a lot of things have changed since this issue was set up. The `bin/genenetwork2` script no longer exists and most of the paths mentioned have changed.
+
+The configuration issue(s) mentioned above still abound, but the changes will have to be incremental to avoid breaking the system.
diff --git a/issues/genenetwork2/handle-oauth-errors-better.gmi b/issues/genenetwork2/handle-oauth-errors-better.gmi
index 462ded5..77ad7ad 100644
--- a/issues/genenetwork2/handle-oauth-errors-better.gmi
+++ b/issues/genenetwork2/handle-oauth-errors-better.gmi
@@ -3,7 +3,7 @@
 ## Tags
 
 * type: bug
-* status: open
+* status: closed, completed
 * priority: high
 * assigned: fredm
 * interested: zachs, robw
@@ -15,3 +15,7 @@ When a session expires, for whatever reason, a notification is displayed to the
 => ./session_expiry_oauth_error.png
 
 The message is a little jarring to the end user. Make it gentler, and probably more informative, so the user is not as surprised.
+
+## Close as complete
+
+This should be fixed at this point. Closing this as complete.
diff --git a/issues/genenetwork2/mapping-error.gmi b/issues/genenetwork2/mapping-error.gmi
index 2e28491..7e7d0a7 100644
--- a/issues/genenetwork2/mapping-error.gmi
+++ b/issues/genenetwork2/mapping-error.gmi
@@ -3,7 +3,7 @@
 ## Tags
 
 * type: bug
-* status: open
+* status: closed
 * priority: medium
 * assigned: zachs, fredm, flisso
 * keywords: gn2, genenetwork2, genenetwork 2, mapping
@@ -49,3 +49,18 @@ TypeError: 'NoneType' object is not iterable
 ### Updates
 
 This is likely just because the genotype file doesn't exist in the necessary format (BIMBAM). We probably need to convert the R/qtl2 genotypes to BIMBAM.
+
+## Stalled
+
+This is currently stalled, until we can upload genotypes via the uploader.
+
+
+## Notes
+
+### 2025-12-31
+
+I am closing this issue as WONTFIX for the following reasons:
+
+- Better fix is to prevent mapping in the first place, if no genotypes exist for the given trait(s)
+- Issue relies on non-implemented feature (Genotypes upload) to fix it
+- Issue does not exist on production
diff --git a/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi b/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi
new file mode 100644
index 0000000..e38f653
--- /dev/null
+++ b/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi
@@ -0,0 +1,22 @@
+# mechanical-rob: Add Partial Correlations Tests
+
+## Tags
+
+* assigned: fredm
+* priority: medium
+* status: open
+* keywords: genenetwork2, gn2, mechanical-rob, partial correlations, tests, regression
+* type: enhancement
+
+## Description
+
+Add regression tests to verify that the partial correlations feature still works
+as expected.
+
+### TODOS
+
+- [-] Tests for "entry-point" page
+- [x] Tests for partial correlation using Pearson's R against select traits
+- [ ] Tests for partial correlation using Spearman's Rho against select traits
+- [ ] Tests for partial correlation using Pearson's R against an entire dataset
+- [ ] Tests for partial correlation using Spearman's Rho against an entire dataset
diff --git a/issues/genenetwork2/refresh-token-failure.gmi b/issues/genenetwork2/refresh-token-failure.gmi
index dd33341..c488820 100644
--- a/issues/genenetwork2/refresh-token-failure.gmi
+++ b/issues/genenetwork2/refresh-token-failure.gmi
@@ -2,7 +2,7 @@
 
 ## Tags
 
-* status: open
+* status: closed, obsoleted
 * priority: high
 * type: bug
 * assigned: fredm, zsloan, zachs
@@ -106,3 +106,6 @@ The following commits were done as part of the troubleshooting:
 => https://git.genenetwork.org/guix-bioinformatics/commit/?id=955e4ce9370be9811262d7c73fa5398385cc04d8
 
 
+# Closed as Obsolete
+
+We no longer rely on refresh tokens. This issue is no longer present.
diff --git a/issues/genenetwork2/remove-bin-genenetwork2-script.gmi b/issues/genenetwork2/remove-bin-genenetwork2-script.gmi
new file mode 100644
index 0000000..da11be7
--- /dev/null
+++ b/issues/genenetwork2/remove-bin-genenetwork2-script.gmi
@@ -0,0 +1,114 @@
+# Remove `bin/genenetwork2` Script
+
+## Tags
+
+* type: improvement
+* status: closed, completed
+* priority: medium
+* assigned: fredm, bonfacem, alexm, zachs
+* interested: pjotrp, aruni
+* keywords: gn2, bin/genenetwork2, startup script
+
+## Description
+
+The `bin/genenetwork2` script was used for a really long time to launch Genenetwork2, and has served that purpose with honour and dedication. We applaud that.
+
+It is, however, time to retire the script, since at this point in time, it serves more to obfuscate the startup than as a helpful tool.
+
+On production, we have all but abandoned the use of the script, and we need to do the same for CI/CD, and eventually, development.
+
+This issue tracks the process, and problems that come up during the move to retire the script.
+
+### Process
+
+* [x] Identify how to run unit tests without the script
+* [x] Document how to run unit tests without the script
+* [x] Identify how to run mechanical-rob tests without the script
+* [x] Document how to run mechanical-rob tests without the script
+* [x] Update CI/CD definitions to get rid of the references to the script
+* [x] Delete the script from the repository
+
+### Setup
+
+First, we need to set up the following mandatory environment variables:
+
+* GN2_PROFILE
+* GN2_SETTINGS
+* JS_GUIX_PATH
+* GEMMA_COMMAND
+* PLINK_COMMAND
+* GEMMA_WRAPPER_COMMAND
+* REQUESTS_CA_BUNDLE
+
+Within a guix shell, you could do that with something like:
+
+```
+export GN2_PROFILE="${GUIX_ENVIRONMENT}"
+export GN2_SETTINGS="/home/frederick/genenetwork/gn2_settings.conf"
+export JS_GUIX_PATH="${GN2_PROFILE}/share/genenetwork2/javascript"
+export GEMMA_COMMAND="${GN2_PROFILE}/bin/gemma"
+export PLINK_COMMAND="${GN2_PROFILE}/bin/plink2"
+export GEMMA_WRAPPER_COMMAND="${GN2_PROFILE}/bin/gemma-wrapper"
+export REQUESTS_CA_BUNDLE="${GUIX_ENVIRONMENT}/etc/ssl/certs/ca-certificates.crt"
+```
+
+Note that you can define all the variables derived from "GN2_PROFILE" in your settings file, if such a settings file is computed.
+
+### Running Unit Tests
+
+To run unit tests, run pytest at the root of the repository.
+
+```
+$ cd /path/to/genenetwork2
+$ pytest
+```
+
+### Running "mechanical-rob" Tests
+
+At the root of the repository, run something like:
+
+```
+python test/requests/test-website.py --all http://localhost:5033
+```
+
+Change the port, as appropriate.
+
+
+### Launching Application
+
+In addition to the minimum set of envvars defined in the "Setup" section above, we need the following variables defined to get the application to launch:
+
+* FLASK_APP
+
+In a guix shell, you could do:
+
+```
+export FLASK_APP="gn2.wsgi"
+```
+
+Now you can launch the application with flask with something like:
+
+```
+flask run --port=5033 --with-threads
+```
+
+or with green unicorn with something like:
+
+```
+gunicorn --reload \
+         --workers 3 \
+         --timeout 1200 \
+         --log-level="debug" \
+         --keep-alive 6000 \
+         --max-requests 10 \
+         --bind="127.0.0.1:5033" \
+         --max-requests-jitter 5 \
+         gn2.wsgi:application
+```
+
+You can change the gunicorn settings to fit your scenario.
+
+
+## Close as completed
+
+The script has been deleted.
diff --git a/issues/genenetwork3/broken-aliases.gmi b/issues/genenetwork3/broken-aliases.gmi
index 5735a1c..2bfbdae 100644
--- a/issues/genenetwork3/broken-aliases.gmi
+++ b/issues/genenetwork3/broken-aliases.gmi
@@ -5,23 +5,184 @@
 * type: bug
 * status: open
 * priority: high
-* assigned: fredm
+* assigned: pjotrp
 * interested: pjotrp
 * keywords: aliases, aliases server
 
+## Tasks
+
+* [X] Rewrite server in gn-guile
+* [X] Fix menu search
+* [X] Fix global search aliases
+* [ ] Deploy and test aliases in GN2
 
 ## Repository
 
 => https://github.com/genenetwork/gn3
 
+moved to
+
+gn-guile repo.
+
 ## Bug Report
 
 ### Actual
 
 * Go to https://genenetwork.org/gn3/gene/aliases2/Shh,Brca2
-* Not that an exception is raised, with a "404 Not Found" message
+* Note that an exception is raised, with a "404 Not Found" message
 
 ### Expected
 
 * We expected a list of aliases to be returned for the given symbols as is done in https://fallback.genenetwork.org/gn3/gene/aliases2/Shh,Brca2
 
+## Resolution
+
+Actually the server is up, but it is not part of the main deployment because it is written in Racket - and we don't have much support in Guix. I wrote the code in the days after my bike accident:
+
+=> https://github.com/genenetwork/gn3/blob/master/gn3/web/wikidata.rkt
+
+and it is probably easiest to move it to gn-guile. Guile is another Scheme after all ;). Only fitting I spent days in hospital only recently (for a different reason). gn-guile already has its own web server and provides a REST API for our markdown editor, for example. On tux04 it responds with
+
+```
+curl http://127.0.0.1:8091/version
+"4.0.0"
+```
+
+What we want is to add the aliases server that should respond to
+
+```
+curl http://localhost:8000/gene/aliases/Shh # direct on tux01
+["9530036O11Rik","Dsh","Hhg1","Hx","Hxl3","M100081","ShhNC","ShhNC"]
+curl https://genenetwork.org/gn3/gene/aliases2/Shh,Brca2
+[["Shh",["9530036O11Rik","Dsh","Hhg1","Hx","Hxl3","M100081","ShhNC","ShhNC"]],["Brca2",["Fancd1","RAB163"]]]
+```
+
+Note this is used by search functionality in GN, as well as the gene aliases list on the mapping page. In principle we cache it for the duration of the running server so as not to overload wikidata. No one uses aliases2, that I can tell, so we only implement the first 'aliases'.
+
+Note the wikidata interface has been stable all this time. That is good.
+
+Turns out we already use wikidata in the gn-guile implementation for fetching the wikidata id for a species (as part of metadata retrieval). I wrote that about two years ago as part of the REST API expansion.
+
+Unfortunately
+
+```
+(sparql-scm (wd-sparql-endpoint-url)  (wikidata-gene-alias "Q24420953"))
+```
+
+throws a 403 forbidden error.
+
+This however works:
+
+```
+scheme@(gn db sparql) [15]> (sparql-wd-species-info "Q83310")
+;;; ("https://query.wikidata.org/sparql?query=%0ASELECT%20DISTINCT%20%3Ftaxon%20%3Fncbi%20%3Fdescr%20where%20%7B%0A%20%20%20%20wd%3AQ83310%20wdt%3AP225%20%3Ftaxon%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP685%20%3Fncbi%20%3B%0A%20%20%20%20%20%20schema%3Adescription%20%3Fdescr%20.%0A%20%20%20%20%3Fspecies%20wdt%3AP685%20%3Fncbi%20.%0A%20%20%20%20FILTER%20%28lang%28%3Fdescr%29%3D%27en%27%29%0A%7D%20limit%205%0A%0A")
+$11 = "?taxon\t?ncbi\t?descr\n\"Mus musculus\"\t\"10090\"\t\"species of mammal\"@en\n"
+```
+
+(if you can see the mouse ;).
+
+Ah, this works
+
+```
+scheme@(gn db sparql) [17]> (sparql-tsv (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" ))
+;;; ("https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fwikidata_id%0A%20%20%20%20%20%20%20%20%20%20%20%20WHERE%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20wdt%3AP31%20wd%3AQ7187%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP703%20%3Fspecies%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20VALUES%20%28%3Fspecies%29%20%7B%20%28wd%3AQ15978631%20%29%20%28%20wd%3AQ83310%20%29%20%28%20wd%3AQ184224%20%29%20%7D%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20rdfs%3Alabel%20%22Shh%22%40en%20.%0A%20%20%20%20%20%20%20%20%7D%0A")
+$12 = "?wikidata_id\n<http://www.wikidata.org/entity/Q14860079>\n<http://www.wikidata.org/entity/Q24420953>\n"
+```
+
+But this does not
+
+```
+scheme@(gn db sparql) [17]> (sparql-scm (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" ))
+ice-9/boot-9.scm:1685:16: In procedure raise-exception:
+In procedure utf8->string: Wrong type argument in position 1 (expecting bytevector): "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx/1.18.0</center>\r\n</body>\r\n</html>\r\n"
+```
+
+Going via tsv does work
+
+```
+scheme@(gn db sparql) [18]> (tsv->scm (sparql-tsv (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" )))
+
+;;; ("https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fwikidata_id%0A%20%20%20%20%20%20%20%20%20%20%20%20WHERE%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20wdt%3AP31%20wd%3AQ7187%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP703%20%3Fspecies%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20VALUES%20%28%3Fspecies%29%20%7B%20%28wd%3AQ15978631%20%29%20%28%20wd%3AQ83310%20%29%20%28%20wd%3AQ184224%20%29%20%7D%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20rdfs%3Alabel%20%22Shh%22%40en%20.%0A%20%20%20%20%20%20%20%20%7D%0A")
+$13 = ("?wikidata_id")
+$14 = (("<http://www.wikidata.org/entity/Q14860079>") ("<http://www.wikidata.org/entity/Q24420953>"))
+```
+
+that is nice enough.
+
+We now have a working alias server that is part of gn-guile. E.g.
+
+```
+curl http://127.0.0.1:8091/gene/aliases/Brca2
+["breast cancer 2","breast cancer 2, early onset","Fancd1","RAB163","BRCA2, DNA repair associated"]
+```
+
+gn-guile also has the 'commit/' handler by Alex, documented as
+'curl -X POST http://127.0.0.1:8091/commit' in git-markdown-editor.md. Let's see how that is wired up. The web interface is at, for example,
+https://genenetwork.org/editor/edit?file-path=general/help/facilities.md. Part of gn2's
+
+```
+gn2/wqflask/views.py
+398:@app.route("/editor/edit", methods=["GET"])
+408:@app.route("/editor/settings", methods=["GET"])
+414:@app.route("/editor/commit", methods=["GET", "POST"])
+```
+
+which has the code
+
+```
+@app.route("/editor/edit", methods=["GET"])
+@require_oauth2
+def edit_gn_doc_file():
+    file_path = urllib.parse.urlencode(
+        {"file_path": request.args.get("file-path", "")})
+    response = requests.get(f"http://localhost:8091/edit?{file_path}")
+    response.raise_for_status()
+    return render_template("gn_editor.html", **response.json())
+```
+
+Running over localhost. This is unfortunately hard coded, and we should change that! In guix system
+configuration it is already a variable as 'genenetwork-configuration-gn-guile-port 8091'. gn-guile should also be visible from outside, so that is a separate configuration.
+
+Also I note that the mapping page does three requests to wikidata (for mouse, rat and human). That could really be one.
+
+# Search
+
+Aliases are also used in search. You can tell when GN search renders too few results that aliases are not used. When aliases work we expect to list '2310010I16Rik' with
+
+=> https://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=sh*&search_terms_and=&FormID=searchResult
+
+Sheepdog tests for that and it has been failing for a while.
+
+Global search finds way more results, but also lacks that alias! Meanwhile GN1 does find that alias for record  1431728_at. GN2 finds it with hippocampus mRNA
+
+=> https://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=1431728_at%0D%0A&search_terms_and=&accession_id=None&FormID=searchResult
+
+in standard search.
+But neither 1431728_at nor '2310010I16Rik' has a hit in *global* search and the result for Ssh should include the record in both search systems.
+
+# Deploy
+
+We introduced a new environment variable that does not show up on CD, part of the mapping page:
+
+=>
+
+In the logs on /export2:
+
+```
+root@tux02:/export2/guix-containers/genenetwork-development/var/log/cd# tail -100 genenetwork2.log
+2025-07-20 04:19:43   File "/genenetwork2/gn2/base/trait.py", line 157, in wikidata_alias_fmt
+2025-07-20 04:19:43     GN_GUILE_SERVER_URL + "gene/aliases/" + self.symbol.upper())
+2025-07-20 04:19:43 NameError: name 'GN_GUILE_SERVER_URL' is not defined
+```
+
+One thing I ran into is http://genenetwork.org/gn3-proxy/ - what is that for?
+
+## Deploy Updates: 2025-08-15
+=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=269f99f1e1f0c253ecdd99f04bc7c6697012b0aa Update commit of gn-guile used on production
+
+This does not fix the issue on https://gn2-fred.genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P, instead we get
+
+```
+fredm@tux04:~$ curl http://localhost:8091/gene/aliases/Brca2
+Resource not found: /gene/aliases/Brca2
+```
diff --git a/issues/genenetwork/genenetwork3_configuration.gmi b/issues/genenetwork3/genenetwork3_configuration.gmi
index fcab572..cdd7c15 100644
--- a/issues/genenetwork/genenetwork3_configuration.gmi
+++ b/issues/genenetwork3/genenetwork3_configuration.gmi
@@ -1,10 +1,10 @@
-# Genenetwork2 Configurations
+# Genenetwork3 Configurations
 
 ## Tags
 
 * assigned: fredm
 * priority: normal
-* status: open
+* status: closed, completed
 * keywords: configuration, config, gn2, genenetwork, genenetwork2
 * type: bug
 
@@ -13,3 +13,7 @@
 The configuration file should only ever contain settings, and no code. Remove all code from the default settings file.
 
 Eschew executable formats (*.py) for configuration files and prefer non-executable formats e.g. *.cfg, *.json, *.conf etc
+
+## Closed as Completed
+
+See commit https://github.com/genenetwork/genenetwork3/commit/977efbb54da284fb3e8476f200206d00cb8e64cd
diff --git a/issues/genenetwork3/rqtl2-mapping-error.gmi b/issues/genenetwork3/rqtl2-mapping-error.gmi
index 480c7c6..b43d66f 100644
--- a/issues/genenetwork3/rqtl2-mapping-error.gmi
+++ b/issues/genenetwork3/rqtl2-mapping-error.gmi
@@ -3,7 +3,7 @@
 ## Tags
 
 * type: bug
-* status: open
+* status: closed, completed
 * priority: high
 * assigned: alexm, zachs, fredm
 * keywords: R/qtl2, R/qtl2 Maps, gn3, genetwork3, genenetwork 3
@@ -40,3 +40,7 @@ This might imply a code issue: Perhaps
 * the wrong path value is passed
 
 The same error occurs on https://cd.genenetwork.org but does not seem to prevent CD from running the mapping to completion. Maybe something is missing on production — what, though?
+
+## Closed as Completed
+
+This seems fixed now.
diff --git a/issues/genetics/speeding-up-gemma.gmi b/issues/genetics/speeding-up-gemma.gmi
new file mode 100644
index 0000000..91bab17
--- /dev/null
+++ b/issues/genetics/speeding-up-gemma.gmi
@@ -0,0 +1,492 @@
+# Speeding up GEMMA
+
+GEMMA is slow, but usually fast enough. Earlier I wrote gemma-wrapper to speed things up. In genenetwork.org, by using gemma-wrapper with LOCO, most traits are mapped in a few seconds on a large server (30 individuals x 200K markers). By expanding markers to over 1 million, however, runtimes degrade to 6 minutes. Increasing the number of individuals to 1000 may slow mapping down to hour(s). As we are running 'precompute' on 13K traits - and soon maybe millions - it would be beneficial to reduce runtimes again.
+
+One thing to look at is Sen's bulklmm. It can do phenotypes in parallel, provided there is no missing data. This is perfect for permutations which we'll also do. For multiple phenotypes it is a bit tricky however, because you'll have to mix and match experiments to show the same individuals (read samples).
+
+So the approach is to first analyze steps in GEMMA and see where it is particularly inefficient. Maybe we can do something about that. I note I started the pangemma effort (and mgamma effort before). The idea is to use a propagator network for incremental improvements and also to introduce a new build system and testing framework. In parallel we'll try to scale out on HPC using Arun's ravanan software.
+
+There is no such thing as a free lunch. So, let's dive in.
+
+# Description
+
+# Tags
+
+* assigned: pjotrp
+* type: feature
+* priority: high
+
+# Tasks
+
+* [X] Try gzipped version
+* [X] Run without debug
+* [ ] Use lmdb for genotypes
+* -   [X] convert genotypes to lmdb
+* -   [X] replace GEMMA ReadGenotypes
+* -   [X] replace reading genotypes in AnalyzeBimbam
+* -   [+] Apply similar SNP filtering as the original
+* -   [X] Add SNP info to Geno file
+* -   [X] Try different geno encodings
+* -   [+] Fix support for NAs - also in compute
+* [X] Use lmdb for SNPs (probably part of Geno file)
+* [X] Match output
+* [ ] Write lmdb for output with filter
+* [X] Optimize openblas for target architecture
+* [ ] Use profiler
+* [ ] Hash genotypes? Try buf.hash or xxhash
+* [ ] Skip highly correlated markers with backtracking
+* [ ] Perhaps try a faster malloc library for GEMMA
+* [ ] Fix sqrt(NaN) when running big file example with -debug
+* [ ] Fix/check assumption that geno is between 0 and 2
+* [ ] Try 64-bit integer index for lmdb
+* [ ] Other improvements...
+
+# Summary
+
+Convert a geno file to mdb with
+
+```
+./bin/anno2mdb.rb mouse_hs1940.anno.txt
+./bin/geno2mdb.rb mouse_hs1940.geno.txt --anno mouse_hs1940.anno.txt.mdb --eval Gf # convert to floating point
+real    0m14.042s
+user    0m12.639s
+sys     0m0.402s
+```
+
+```
+../bin/anno2mdb.rb snps-matched.txt
+../bin/geno2mdb.rb pangenome-13M-genotypes.txt --geno-json bxd_inds.list.json --anno snps-matched.txt.mdb --eval Gf
+../bin/geno2mdb.rb pangenome-13M-genotypes.txt --geno-json bxd_inds.list.json --anno snps-matched.txt.mdb --eval Gb
+```
+
+even with floats a 30G pangenome genotype file got reduced to 12G. A quick full run of the mdb version takes 6 minutes. That is a massive 3x speedup. It also used less RAM (because it is one process instead of 20) and had a 40x core usage, much of it in the Linux kernel:
+
+```
+/bin/time -v ./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -debug
+LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib /bin/time -v ./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -no-check
+real    5m47.587s
+user    39m33.796s
+sys     211m1.143s
+
+Command being timed: "./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -no-check"
+        User time (seconds): 2169.77
+        System time (seconds): 11919.04
+        Percent of CPU this job got: 3919%
+        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:59.48
+        Maximum resident set size (kbytes): 13377040
+```
+
+as we only read the genotype file once it shows how much is IO bound! Moving to lmdb was the right choice to speed up pangemma.
+
+Old gemma does:
+
+```
+        Command being timed: "/bin/gemma -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt.gz -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -no-check"
+        User time (seconds): 2017.25
+        System time (seconds): 62.21
+        Percent of CPU this job got: 240%
+        Elapsed (wall clock) time (h:mm:ss or m:ss): 14:24.17
+        Maximum resident set size (kbytes): 9736884
+```
+
+So we are at 3x speed.
+
+With Gb byte encoding the file got further reduced from 13Gb to 4Gb.
+
+What is more exciting is that LOCO now runs in 30s - compared to gemma's earlier 6 minutes, so that is at 10x speed, using about 1/3 of RAM. Note the CPU usage:
+
+```
+        Command being timed: "./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt-Gb.mdb -loco 2 -lmm 9 -maf 0.1 -n 2 -no-check"                     User time (seconds): 177.81
+        System time (seconds): 934.92
+        Percent of CPU this job got: 3391%
+        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.80
+        Maximum resident set size (kbytes): 4326308
+```
+
+it looks like disk IO is no longer the bottleneck. The Gb version is much smaller than Gf, but runtime is only slightly better. So it is time for the profiler to find how we can make use of the other cores! But, for now, I am going to focus on getting the pipeline set up with ravanan.
+
+# Analysis
+
+As a test case we'll take one of the runs:
+
+```
+time -v /bin/gemma -loco 11 -k /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -o 680029457111fdd460990f95853131c87ea20c57.11.assoc.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -outdir /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/d20251111-588798-f81icw
+```
+
+which I simplify to
+
+```
+/bin/time -v /bin/gemma -loco 11 -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -debug
+Reading Files ...
+number of total individuals = 143
+number of analyzed individuals = 20
+number of total SNPs/var        = 13209385
+number of SNPS for K            = 12376792
+number of SNPS for GWAS         =   832593
+number of analyzed SNPs         = 13111938
+```
+
+The timer says:
+
+```
+User time (seconds): 365.33
+System time (seconds): 16.59
+Percent of CPU this job got: 128%
+Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.01
+Average shared text size (kbytes): 0
+Average unshared data size (kbytes): 0
+Average stack size (kbytes): 0
+Average total size (kbytes): 0
+Maximum resident set size (kbytes): 11073412
+Average resident set size (kbytes): 0
+Major (requiring I/O) page faults: 0
+Minor (reclaiming a frame) page faults: 5756557
+Voluntary context switches: 1365
+Involuntary context switches: 478
+Swaps: 0
+File system inputs: 0
+File system outputs: 143704
+Socket messages sent: 0
+Socket messages received: 0
+Signals delivered: 0
+Page size (bytes): 4096
+Exit status: 0
+```
+
+The genotype file is unzipped at 30G. Let's try running the gzipped version (which will be beneficial on a compute cluster anyhow) which comes in at 9.2G. We know that Gemma is not the most efficient when it comes to IO. So testing is crucial.
+Critically the run gets slower:
+
+```
+Percent of CPU this job got: 118%
+Elapsed (wall clock) time (h:mm:ss or m:ss): 7:43.56
+```
+
+The problem is that unzip runs on a single thread in GEMMA, so it is actually slower than the gigantic raw text file.
+
+## Running without debug
+
+Without the debug switch gemma runs at the same speed with 128% CPU. That won't help much.
+
+## Optimizing GEMMA+OpenBLAS+GSL
+
+Compiling with optimization can be low hanging fruit - despite the fact that we seem to be IO bound at 128% CPU. Still, aggressive compiler optimizations may make a difference. The current build reads:
+
+```
+GEMMA Version    = 0.98.6 (2022-08-05)
+Build profile    = /gnu/store/8rvid272yb53bgascf5c468z0jhsyflj-profile
+GCC version      = 14.3.0
+GSL Version      = 2.8
+OpenBlas         = OpenBLAS 0.3.30  - OpenBLAS 0.3.30 DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=128
+arch           = Cooperlake
+threads        = 96
+parallel type  = threaded
+```
+
+this uses the gemma-gn2 package in
+
+=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/gemma.scm#n27
+
+which is currently not built with arch optimizations (even though Cooperlake suggests differently). Another potential optimization is to use a fast malloc library. We do, however, already compile with a recent gcc, thanks to Guix. No need to improve on that.
+
+## Introduce lmdb for genotypes
+
+Rather than focussing on gzip, another potential improvement is to use lmdb with mmap. We are not going to upgrade the original gemma code (which is in maintenance mode). We are going to upgrade the new pangemma project instead:
+
+=> https://git.genenetwork.org/pangemma/
+
+Reason being that this is our experimental project.
+
+So I just managed to build pangemma/gemma in Guix. Next step is to introduce lmdb genotypes. Genotypes come essentially as a matrix of markers x individuals. In the case of GN geno files and BIMBAM files they are simply stored as tab delimited values and/or probabilities. This happens in
+
+```
+src/param.cpp
+1261:void PARAM::ReadGenotypes(gsl_matrix *UtX, gsl_matrix *K, const bool calc_K) {
+1280:void PARAM::ReadGenotypes(vector<vector<unsigned char>> &Xt, gsl_matrix *K,
+```
+
+calling into
+
+```
+gemma_io.cpp
+644:bool ReadFile_geno(const string &file_geno, const set<string> &setSnps,
+1752:bool ReadFile_geno(const string file_geno, vector<int> &indicator_idv,
+1857:bool ReadFile_geno(const string &file_geno, vector<int> &indicator_idv,
+```
+
+which are called from gemma.cpp. Also lmm.cpp reads the geno file in the AnalyzeBimbam function (see file_geno):
+
+```
+src/lmm.cpp
+61:  file_geno = cPar.file_geno;
+1664:  debug_msg(file_geno);
+1665:  auto infilen = file_geno.c_str();
+2291:    cout << "error reading genotype file:" << file_geno << endl;
+```
+
+Note that also SNPs are read from a file (see file_snps). We already have an lmdb version for that!
+
+So, reading genotypes happens in multiple places. In fact, it is read 1x for computing K and 2x for GWA. And it is worse than this because LOCO runs GWA 20x rereading the same files. Reading it once using lmdb should speed things up.
+
+We'll start with the 30G 143samples.percentile.bimbam.bimbam-reduced2 file. We only need to convert this file into lmdb once. We want to track both column and row names in the same lmdb and we will use a meta JSON record for that. On the command line we'll state whether the genotypes are stored as char or int. Floats will be packed into either of those. We'll experiment a bit to see what the default should be. A genotype is usually a number/character or a probability. In the latter case we don't have to have high precision and can choose to store an index into a range of values. We can also opt for Float16 or something more ad hoc because we don't have to store the exponent.
+
+But let's start with a standard float here, to keep things simple. To write the first version of code I'll use a byte conversion:
+
+```
+./bin/geno2mdb.rb BXD.geno.bimbam --eval '{"0"=>0,"1"=>1,"2"=>2,"NA"=>-1}' --pack 'C*' --geno-json BXD.geno.json
+```
+
+The lmdb file contains a metadata record that looks like:
+
+```
+{
+  "type": "gemma-geno",
+  "version": 1,
+  "eval": "G0-2",
+  "key-format": "string",
+  "rec-format": "C*",
+  "geno": {
+    "type": "gn-geno-to-gemma",
+    "genofile": "BXD.geno",
+    "samples": [
+      "BXD1",
+      "BXD2",
+      "BXD5",
+etc.
+```
+
+i.e. it is a self-contained, efficient genotype format. There is another trick: we can use Plink-style compression with
+
+```
+./bin/geno2mdb.rb BXD.geno.bimbam --eval '{"0"=>0,"1"=>1,"2"=>2,"NA"=>4}' --geno-json BXD.geno.json --gpack 'l.each_slice(4).map { |slice| slice.map.with_index.sum {|val,i| val << (i*2) } }.pack("C*")'
+```
+
+reducing the original uncompressed BIMBAM from 9.9Mb to 2.7Mb. This is still a lot larger than the gzip compressed BIMBAM, but as I pointed out earlier the uncompressed version is faster by a wide margin. Compressing the lmdb file gets it in range of the compressed BIMBAM btw. So that is always an option.
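+
+A minimal Ruby sketch of the two-bit packing above, together with the inverse step a reader would need. It assumes missing values are mapped to 3 rather than 4, so that every genotype fits in two bits; the helper names pack2bit and unpack2bit are made up for illustration and are not part of geno2mdb.rb:
+
+```ruby
+# Pack genotypes (each in 0..3) four to a byte, two bits per genotype.
+def pack2bit(genos)
+  genos.each_slice(4).map { |slice|
+    slice.each_with_index.sum { |val, i| val << (i * 2) }
+  }.pack("C*")
+end
+
+# Inverse: recover the first n genotypes from the packed string.
+def unpack2bit(packed, n)
+  packed.unpack("C*").flat_map { |byte|
+    (0...4).map { |i| (byte >> (i * 2)) & 0b11 }
+  }.first(n)
+end
+
+genos = [0, 1, 2, 3, 2, 2, 0]   # 3 stands in for NA (assumption)
+packed = pack2bit(genos)
+restored = unpack2bit(packed, genos.size)
+```
+
+Four genotypes share one byte, so the seven genotypes here take two bytes. The number of genotypes has to be known when unpacking, because the last byte may be only partly filled.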
+
+Next we create a floating point version. That reduces the file to 30% with
+
+```
+geno2mdb.rb fp.bimbam --geval 'g.to_f' --pack 'F*' --geno-json bxd_inds.list.json
+```
+
+and compressing the probabilities into a byte reduces the file to 10%:
+
+```
+geno2mdb.rb fp.bimbam --geval '(g.to_f*255.0).to_i' --pack 'C*' --geno-json bxd_inds.list.json
+```
+
+And now the compressed version is also 4x smaller. We'll have to run gemma at scale to see what the impact is, but an uncompressed 10x reduction should have an impact on the IO bottleneck. Note how easy it is to try these things with my little Ruby script.
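+
+The byte packing trades precision for size. A small Ruby sketch, using the same expression as the --geval option above, shows the round trip; encode_prob and decode_prob are hypothetical helper names and the sample probabilities are invented:
+
+```ruby
+# Encode a probability in [0,1] as one unsigned byte (truncating).
+def encode_prob(g)
+  (g.to_f * 255.0).to_i
+end
+
+# Decode the byte back into an approximate probability.
+def decode_prob(byte)
+  byte / 255.0
+end
+
+probs = ["0", "0.25", "0.5", "1.0"]          # as read from a text file
+packed = probs.map { |g| encode_prob(g) }.pack("C*")
+decoded = packed.unpack("C*").map { |b| decode_prob(b) }
+
+# Largest absolute difference between original and decoded values
+max_err = probs.zip(decoded).map { |g, d| (g.to_f - d).abs }.max
+```
+
+Because to_i truncates, a decoded value lies less than 1/255 below the original, which is probably precise enough for genotype probabilities but should be checked against real GEMMA output.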
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/geno2mdb.rb
+
+## Use lmdb genotypes from pangemma
+
+Rather than writing new code in C++ I proceeded to embed Guile in pangemma. If it turns out to be a performance problem we can always fall back to C. Here we show a simple test written in Guile that gets called from main.cpp:
+
+=> https://git.genenetwork.org/pangemma/commit/?id=5b6b5e2ad97b4733125c0845cfae007e8094a687
+
+## Some analysis of GEMMA
+
+GEMMA::BatchRun reads files and executes (b gemma.cpp:1657)
+cPar.ReadFiles()
+  ReadFile_anno
+  ReadFile_pheno
+  ReadFile_geno (gemma_io.cpp:652) - first read to fetch SNPs info, num (ns_test) and total SNPs (ns_total).
+                                   - it also does some checks
+    Note: These can all be handled by the lmdb files. So it saves one run.
+
+Summary of Mutated Outputs:
+* indicator_snp: Binary indicators for which SNPs passed filtering
+* snpInfo: Complete metadata for all SNPs in the file
+* ns_test: Count of SNPs passing filters
+checkpoint("read-geno-file",file_geno);
+
+Next start LMM9 gemma.cpp:2571
+  ReadFile_kin
+  EigenDecomp_Zeroed
+  2713          CalcUtX(U, W, UtW);
+  2714          CalcUtX(U, Y, UtY);
+  CalcLambda
+  CalcLmmVgVeBeta
+  CalcPve
+  cPar.PrintSummary()
+  debug_msg("fit LMM (one phenotype)");
+  cLmm.AnalyzeBimbam lmm.cpp:1665 and
+    LMM::Analyze lmm.cpp:1704
+
+
+Based on LLM code analysis, here's what gets mutated in the 'LMM' and 'PARAM' classes:
+
+### By 'ReadFile_geno':
+This is a **standalone function** (not a member of LMM), but it mutates LMM members when passed as parameters:
+
+1. **'indicator_snp'** - cleared and populated with 0/1 filter flags
+2. **'snpInfo'** - cleared and populated with SNP metadata
+3. **'ns_test'** - set to count of SNPs that passed all filters
+
+### By 'LMM::AnalyzeBimbam':
+(which calls 'LMM::Analyze')
+
+**Directly mutated in 'LMM::Analyze':**
+
+1. **'sumStat'** - PRIMARY OUTPUT
+   - Cleared at start (implied)
+   - Populated with one SUMSTAT entry per analyzed SNP
+   - Contains: beta, se, lambda_remle, lambda_mle, p_wald, p_lrt, p_score, logl_H1
+
+2. **'time_UtX'** - timing accumulator
+   - '+= time_spent_on_matrix_multiplication'
+
+3. **'time_opt'** - timing accumulator
+   - '+= time_spent_on_optimization'
+
+**Read but NOT mutated:**
+- 'indicator_snp' - read to determine which SNPs to process
+- 'indicator_idv' - read to determine which individuals to include
+- 'ni_total', 'ni_test' - used for loop bounds and assertions
+- 'n_cvt' - number of covariates, used in calculations
+- 'l_mle_null', 'l_min', 'l_max', 'n_region', 'logl_mle_H0' - analysis parameters
+- 'a_mode' - determines which statistical tests to run
+- 'd_pace' - controls progress bar display
+
+### Summary Table:
+
+| Member Variable | Mutated By | Purpose |
+|----------------|------------|---------|
+| 'indicator_snp' | 'ReadFile_geno' | Which SNPs passed filters |
+| 'snpInfo' | 'ReadFile_geno' | SNP metadata (chr, pos, alleles, etc.) |
+| 'ns_test' | 'ReadFile_geno' | Count of SNPs to analyze |
+| 'sumStat' | 'Analyze' | **Main output**: Statistical results per SNP |
+| 'time_UtX' | 'Analyze' | Performance profiling |
+| 'time_opt' | 'Analyze' | Performance profiling |
+
+The key output is **'sumStat'** which contains all the association test results.
+
+PARAM variables directly mutated by these functions:
+
+    indicator_snp (by ReadFile_geno)
+    snpInfo (by ReadFile_geno)
+    ns_test (by ReadFile_geno)
+
+LMM variables mutated:
+
+    indicator_snp (by ReadFile_geno if passed LMM's copy)
+    snpInfo (by ReadFile_geno if passed LMM's copy)
+    ns_test (by ReadFile_geno if passed LMM's copy)
+    sumStat (by Analyze - this is LMM-only, not in PARAM)
+    time_UtX, time_opt (by Analyze)
+
+The actual analysis results (sumStat) exist only in LMM, not in PARAM.
+
+## Coding for lmdb support
+
+From above it should be clear that, if we have the genotypes and snp annotations in lmdb, we can skip reading the genotype file the first time. We can also rewrite the 'analyze' functions to fetch this information on the fly.
+
+Note that OpenBLAS will have to run single threaded when introducing SNP-based threads.
+
+## Fine grained multithreading
+
+From above it can be concluded that we can batch process SNPs in parallel. The only output is sumStat and that is written at once at the end. So, if we can collect the sumStat data without collision it should just work.
+
+Interestingly both Guile and C++ have recently introduced fibers. Boost.Fiber looks pretty clean:
+
+```
+#include <boost/fiber/all.hpp>
+#include <vector>
+#include <iostream>
+
+namespace fibers = boost::fibers;
+
+// Worker fiber
+void compute_worker(int start, int end,
+                   fibers::buffered_channel<int>& channel) {
+    for (int i = start; i < end; ++i) {
+        channel.push(i * i);
+    }
+}
+
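+void parallel_compute_fibers() {
+    // buffered_channel requires a power-of-two capacity
+    fibers::buffered_channel<int> channel(128);
+
+    // Spawn worker fibers
+    fibers::fiber f1([&]() { compute_worker(0, 100, channel); });
+    fibers::fiber f2([&]() { compute_worker(100, 200, channel); });
+
+    // Close the channel only after both workers have finished,
+    // otherwise the pushes of the slower worker would be rejected
+    fibers::fiber closer([&]() {
+        f1.join();
+        f2.join();
+        channel.close();  // Signal completion
+    });
+
+    // Collect results until the channel is closed and drained
+    std::vector<int> results;
+    int value;
+    while (fibers::channel_op_status::success == channel.pop(value)) {
+        results.push_back(value);
+    }
+
+    closer.join();
+
+    std::cout << "Total results: " << results.size() << std::endl;
+}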
+```
+
+Compare that with guile:
+
+```
+(use-modules (fibers)
+             (fibers channels))
+
+;; Worker that streams individual results
+(define (compute-worker-streaming start end result-channel)
+  (let loop ((i start))
+    (when (< i end)
+      (put-message result-channel (* i i))
+      (loop (+ i 1))))
+  ;; Send completion signal
+  (put-message result-channel 'done))
+
+;; Collector fiber
+(define (result-collector result-channel num-workers)
+  (let loop ((results '())
+             (done-count 0))
+    (if (= done-count num-workers)
+        (reverse results)
+        (let ((msg (get-message result-channel)))
+          (if (eq? msg 'done)
+              (loop results (+ done-count 1))
+              (loop (cons msg results) done-count))))))
+
+(define (parallel-compute-streaming)
+  (run-fibers
+   (lambda ()
+     (let ((result-channel (make-channel)))
+
+       ;; Spawn workers
+       (spawn-fiber
+        (lambda () (compute-worker-streaming 0 100 result-channel)))
+       (spawn-fiber
+        (lambda () (compute-worker-streaming 100 200 result-channel)))
+
+       ;; Collect results
+       (result-collector result-channel 2)))))
+```
+
+Boost.Fiber is a relatively mature library now, with 8+ years of development and real-world usage.
+Interestingly, Boost.Fiber has work stealing built in. We can look at that later:
+
+=> https://www.boost.org/doc/libs/1_66_0/libs/fiber/doc/html/fiber/worker.html
+
+What about LOCO? Actually we can use the same fiber strategy for each chromosome as a per CHR process. We can set the number of threads differently based on chromosome SNP num, so all chromosomes take (about) the same time. Later, we can bring LOCO into one process with the advantage that the genotype data is only read once. In both cases the kinship matrices are in RAM anyway.
+
+# Reducing the size of the genotype file
+
+The first version of lmdb genotypes used simple floats. That reduced the pangenome text version from 30Gb to 12Gb with about a 3x speedup of gemma. Next I tried byte representation of the genotypes.
+
+# Optimizing SNP handling
+
+GEMMA originally used a separate SNP annotation file, which proves inefficient. Now that we transform the geno information to lmdb, we might as well include chr+pos. We'll make the key out of that and add a table with marker annotation.
+
+# Optimizing the index
+
+I opted for using a CHR+POS index (byte+long value). There are a few things to consider. There may be duplicates and there may be missing values. Also, LMDB likes an integer index. The built-in dupsort does not work, so we need to create a unique pos for every variant. I'll do that by adding the line number.
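+
+One possible key layout, sketched in Ruby. It assumes keys are compared byte by byte (LMDB's default), so the fields are packed big-endian to make byte order match numeric order. The field widths, the choice to append the line number as a third field, and the names make_key and parse_key are all assumptions for illustration, not the layout used in geno2mdb.rb:
+
+```ruby
+# Hypothetical key: chromosome (1 byte), position (4 bytes),
+# line number (4 bytes), all unsigned and big-endian.
+def make_key(chr, pos, line)
+  [chr, pos, line].pack("CL>L>")
+end
+
+def parse_key(key)
+  key.unpack("CL>L>")
+end
+
+variants = [[2, 100, 7], [1, 3_000_000, 2], [1, 3_000_000, 1], [1, 500, 3]]
+keys = variants.map { |chr, pos, line| make_key(chr, pos, line) }
+
+# Byte-wise string sorting now follows chr, then pos, then line number
+sorted = keys.sort.map { |k| parse_key(k) }
+```
+
+The two variants that share chromosome 1 and position 3000000 get distinct keys through the line number, so no duplicate-key support is needed.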
diff --git a/issues/gn-auth/email_verification.gmi b/issues/gn-auth/email_verification.gmi
index fff3d54..07e2b04 100644
--- a/issues/gn-auth/email_verification.gmi
+++ b/issues/gn-auth/email_verification.gmi
@@ -12,7 +12,7 @@
 
 When setting up e-mail verification, the following configurations should be set for gn-auth:
 
-SMTP_HOST = "smtp.uthsc.edu"
+SMTP_HOST = "smtp.uthsc"
 SMTP_PORT = 25 (not 587, which is what we first tried)
 SMTP_TIMEOUT = 200 # seconds
 
diff --git a/issues/gn-auth/fix-refresh-token.gmi b/issues/gn-auth/fix-refresh-token.gmi
index 1a6a825..222b731 100644
--- a/issues/gn-auth/fix-refresh-token.gmi
+++ b/issues/gn-auth/fix-refresh-token.gmi
@@ -2,7 +2,7 @@
 
 ## Tags
 
-* status: open
+* status: closed, obsolete
 * priority: high
 * assigned: fredm
 * type: feature-request, bug
@@ -51,3 +51,8 @@ This actually kills 2 birds with the one stone:
 
 * Get the refresh token from the cookies rather than from the body
 * Maybe: make refreshing the access token unaware of threads/workers
+
+
+## Close as Obsolete
+
+We no longer do refresh tokens at all, they were a pain to look into, so I simply removed them from the system.
diff --git a/issues/gn-auth/pass-on-unknown-get-parameters.gmi b/issues/gn-auth/pass-on-unknown-get-parameters.gmi
new file mode 100644
index 0000000..a349800
--- /dev/null
+++ b/issues/gn-auth/pass-on-unknown-get-parameters.gmi
@@ -0,0 +1,17 @@
+# Pass on Unknown GET Parameters
+
+## Tags
+
+* status: open
+* priority: medium
+* type: feature-request, enhancement
+* assigned: fredm, zsloan
+* keywords: gn-auth, authorisation
+
+## Description
+
+A developer or user could be needing to access some feature hidden behind some flag (so called, "feature flags"). Some of these flags are set using known (to the application and developer/user) GET parameters.
+
+If the user provides these GET parameters before login and then goes through the login process, the unknown GET parameters are dropped silently, and the user has to manually set them up again. This, while not a big deal, is annoying and wastes a few seconds each time.
+
+This feature request proposes to pass any unknown GET parameters untouched through the authentication/authorisation server and back to the authenticating client during the login process, to mitigate this small annoyance.
diff --git a/issues/gn-auth/rework-view-resource-page.gmi b/issues/gn-auth/rework-view-resource-page.gmi
new file mode 100644
index 0000000..2d6e145
--- /dev/null
+++ b/issues/gn-auth/rework-view-resource-page.gmi
@@ -0,0 +1,22 @@
+# Rework "View-Resource" Page
+
+## Tags
+
+* status: closed, completed
+* priority: medium
+* type: enhancement
+* assigned: fredm, zsloan
+* keywords: gn-auth, resource, resources, view resource
+
+## Description
+
+The view resource page ('/oauth2/resource/<uuid>/view') was built with only Genotype, Phenotype, and mRNA resources in mind.
+
+We have since moved on, and added more types of resources (group, system, inbredset-group, etc). This leads to the page breaking for these other types of resources.
+
+We need to update the UI and route to ensure the page renders correctly for each type, or at the very least, redirects to the correct page (e.g. in the case of groups, which have a separate "view group" page).
+
+
+## Close as complete
+
+This is fixed now.
diff --git a/issues/gn-guile/activations-on-production-not-running-as-expected.gmi b/issues/gn-guile/activations-on-production-not-running-as-expected.gmi
new file mode 100644
index 0000000..be9cc00
--- /dev/null
+++ b/issues/gn-guile/activations-on-production-not-running-as-expected.gmi
@@ -0,0 +1,57 @@
+# gn-guile: Activations on Production not Running as Expected
+
+## Tags
+
+* status: closed, completed, fixed
+* priority: high
+* type: bug
+* assigned: bonfacem, fredm, aruni
+* keywords: gn-guile, deployment, activation-service-type
+
+## Description
+
+With the recent changes to guix's `least-authority-wrapper` we can no longer write to the root filesystem ("/"). That is not much of a problem.
+
+So I tried adding `#:directory (dirname gn-doc-git-checkout)` to the `make-forkexec-constructor` for the `gn-guile-shepherd-service` and that actually changes the working directory of the process, as I would expect.
+
+In `genenetwork-activation` I add:
+
+```
+          ;; setup correct ownership for gn-docs
+          (for-each (lambda (file)
+                      (chown file
+                             (passwd:uid (getpw "genenetwork"))
+                             (passwd:gid (getpw "genenetwork"))))
+                    (find-files #$(dirname gn-doc-git-checkout)
+                                                   #:directories? #t))
+```
+
+which, ideally, should change ownership of the parent directory of the bare git checkout for "gn-docs" when we build/start the container. This does not happen — the directory is still owned by root.
+
+My thinking goes, the "genenetwork" user[1] is not yet created at the point when the activation[2] is run, leading to the service failing to start.
+
+The reason I think this, is because, when I do:
+
+```
+fredm@tux04:/...$ sudo guix container exec <container-pid> /run/current-system/profile/bin/bash --login
+root@genenetwork-gn2-fred /# chown -R genenetwork:genenetwork /var/lib/genenetwork/
+root@genenetwork-gn2-fred /# chown -R genenetwork:genenetwork /var/lib/genenetwork/
+```
+
+The bound directory's permissions change, and we can now enable and start the service:
+
+```
+root@genenetwork-gn2-fred /# herd enable gn-guile
+root@genenetwork-gn2-fred /# herd start gn-guile
+```
+
+which starts the service as expected. We can also simply restart the entire container at this point, and it works too.
+
+## Footnotes
+
+=> https://git.genenetwork.org/gn-machines/tree/genenetwork/services/genenetwork.scm?id=e425671e69a321a032134fafee974442e8c1ce6f#n167 [1] "genenetwork" user declaration
+=> https://git.genenetwork.org/gn-machines/tree/genenetwork/services/genenetwork.scm?id=e425671e69a321a032134fafee974442e8c1ce6f#n680 [2] Activation of services (see also the account-service-type being extended with the "genenetwork" user).
+
+## Close as Fixed
+
+This issue is fixed, with newer Guix and changes that @bonz did to the gn-machines repo.
diff --git a/issues/gn-libs/jobs-allow-job-cascades.gmi b/issues/gn-libs/jobs-allow-job-cascades.gmi
new file mode 100644
index 0000000..f659f32
--- /dev/null
+++ b/issues/gn-libs/jobs-allow-job-cascades.gmi
@@ -0,0 +1,26 @@
+# Jobs: Allow Job Cascades
+
+## Tags
+
+* status: open
+* priority: medium
+* type: enhancement
+* assigned: fredm, zsloan
+* keywords: gn-libs, genenetwork, async jobs, asynchronous jobs, background jobs
+
+## Description
+
+Some jobs could require more than a single command/script to be run to complete.
+
+Rather than refactoring/rewriting the entire "async jobs" feature, I propose adding a way to note who started a job, i.e.
+* the user, OR
+* another job
+
+This could be tracked in an extra field in the database, say "started_by" which can have values of the form
+* "user:<user-id>"
+* "job:<job-id>"
+where the parts in the angle bracket (i.e. "<user-id>" and "<job-id>") are replaced by actual ids.
+
+## Related Issues
+
+=> /issues/gn-libs/jobs-track-who-jobs-belong-to
diff --git a/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi b/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi
new file mode 100644
index 0000000..00eaf21
--- /dev/null
+++ b/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi
@@ -0,0 +1,23 @@
+# Jobs: Track Who Jobs Belong To
+
+## Tags
+
+* status: open
+* priority: medium
+* type: enhancement
+* assigned: fredm, zsloan
+* keywords: gn-libs, genenetwork, async jobs, asynchronous jobs, background jobs
+
+## Description
+
+Some features in Genenetwork require long-running processes to be triggered and run in the background. We have a way to trigger such background processes, but there is no way of tracking who started what job, and therefore, no real way for a user to list only their jobs.
+
+This issue will track the introduction of such tracking. This will enable building new job-related functionality such as a user being able to:
+* list their past, unexpired jobs
+* delete past jobs
+* possibly rerun jobs that failed but are recoverable
+* see currently running jobs, and their status
+
+## Related Issues
+
+=> /issues/gn-libs/jobs-allow-job-cascades
diff --git a/issues/gn-uploader/AuthorisationError-gn-uploader.gmi b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi
index 50a236d..262ad19 100644
--- a/issues/gn-uploader/AuthorisationError-gn-uploader.gmi
+++ b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi
@@ -2,7 +2,7 @@
 
 ## Tags 
 * assigned: fredm 
-* status: open 
+* status: closed, obsoleted
 * priority: critical 
 * type: error 
 * key words: authorisation, permission 
@@ -64,3 +64,7 @@ Genetic type: intercross
 
 And when pressed the `Create Population` icon, it led to the error above.  
 
+## Closed as Obsolete
+
+* The service this was happening on (https://staging-uploader.genenenetwork.org) is no longer running
+* Most of the authorisation issues are resolved in newer code
diff --git a/issues/export-uploaded-data-to-RDF-store.gmi b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi
index c39edec..3ef05cd 100644
--- a/issues/export-uploaded-data-to-RDF-store.gmi
+++ b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi
@@ -6,7 +6,7 @@
 * priority: medium
 * type: feature-request
 * status: open
-* keywords: API, data upload
+* keywords: API, data upload, gn-uploader
 
 ## Description
 
@@ -73,10 +73,16 @@ The metadata is useful for searching for the data. The "metadata->rdf" project[4
 * [ ] How do we handle this?
 
 
+## Related Issues and Topics
+
+=> https://issues.genenetwork.org/topics/next-gen-databases/design-doc
+=> https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter
+=> https://issues.genenetwork.org/issues/dump-sample-data-to-lmdb
+=> https://issues.genenetwork.org/topics/database/genotype-database
 
 ## Footnotes
 
-=> https://gitlab.com/fredmanglis/gnqc_py 1: QC/Data upload project repository
+=> https://git.genenetwork.org/gn-uploader/ 1: QC/Data upload project (gn-uploader) repository
 => https://github.com/genenetwork/genenetwork3/pull/130 2: Munyoki's Pull request
 => https://github.com/BonfaceKilz/gn-dataset-dump 3: Dataset -> LMDB export repository
-=> https://github.com/genenetwork/dump-genenetwork-database 4: Metadata -> RDF export repository
+=> https://git.genenetwork.org/gn-transform-databases/ 4: Metadata -> RDF export repository
diff --git a/issues/gn-uploader/guix-build-gn-uploader-error.gmi b/issues/gn-uploader/guix-build-gn-uploader-error.gmi
index 44a5c4b..aeb6308 100644
--- a/issues/gn-uploader/guix-build-gn-uploader-error.gmi
+++ b/issues/gn-uploader/guix-build-gn-uploader-error.gmi
@@ -86,7 +86,7 @@ Filesystem      Size  Used Avail Use% Mounted on
 
 so we know that's not a problem.
 
-A similar thing had shown up on space.uthsc.edu.
+A similar thing had shown up on our space server.
 
 ### More Troubleshooting Efforts
 
diff --git a/issues/gn-uploader/handling-tissues-in-uploader.gmi b/issues/gn-uploader/handling-tissues-in-uploader.gmi
index 826af15..0c43040 100644
--- a/issues/gn-uploader/handling-tissues-in-uploader.gmi
+++ b/issues/gn-uploader/handling-tissues-in-uploader.gmi
@@ -2,11 +2,11 @@
 
 ## Tags
 
-* status: open
+* status: closed, wontfix
 * priority: high
 * assigned: fredm
 * type: feature-request
-* keywords: gn-uploader, tissues
+* keywords: gn-uploader, tissues, archived
 
 ## Description
 
@@ -112,3 +112,9 @@ ALTER TABLE Tissue MODIFY Id INT(5) UNIQUE NOT NULL;
 
 * [1] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#ProbeFreeze
 * [2] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#Tissue
+
+## Closed as WONTFIX
+
+I am closing this issue because it was created (2024-03-28) while I had a fundamental misunderstanding of the way data is laid out in the database.
+
+The information on the schema/layout of the tables is still useful, but chances are, we'll look at the tables themselves anyway should we need to figure out the schema.
diff --git a/issues/gn-uploader/link-authentication-authorisation.gmi b/issues/gn-uploader/link-authentication-authorisation.gmi
index 90b8e5e..b64f887 100644
--- a/issues/gn-uploader/link-authentication-authorisation.gmi
+++ b/issues/gn-uploader/link-authentication-authorisation.gmi
@@ -2,7 +2,7 @@
 
 ## Tags
 
-* status: open
+* status: closed, completed
 * assigned: fredm
 * priority: critical
 * type: feature request, feature-request
@@ -13,3 +13,9 @@
 The last chain in the link to the uploads is the authentication/authorisation. Once the user uploads their data, they need access to it. The auth system, by default, will deny anyone/everyone access to any data that is not linked to a resource and which no user has any roles allowing them access to the data.
 
 We, currently, assign such data to the user manually, but that is not a sustainable way of working, especially as the uploader is exposed to more and more users.
+
+### Close as Completed
+
+The current iteration of the uploader does actually take into account the user that is uploading the data, granting them ownership of the uploaded data. By default, the data is not public, and is only accessible to the user who uploaded it.
+
+The user who uploads the data (and therefore owns it) can later grant access to other users of the system.
diff --git a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi
index 1841d36..af3b274 100644
--- a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi
+++ b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi
@@ -4,7 +4,7 @@
 
 * type: bug
 * assigned: fredm
-* status: open
+* status: closed
 * priority: high
 * keywords: gn-uploader, uploader, ProbeSet
 
@@ -20,3 +20,10 @@ applicable to our data, I don't think.
 ```
 
 It seems like some of the data does not require a ProbeSet, and in that case, it should be possible to add it without one.
+
+
+## Notes
+
+This "bug" is obsoleted by the fact that the implementation leading to it was entirely wrong.
+
+The feature that was leading to this bug no longer exists, and will have to be re-implemented from scratch with the involvement of @acenteno.
diff --git a/issues/gn-uploader/provide-page-for-uploaded-data.gmi b/issues/gn-uploader/provide-page-for-uploaded-data.gmi
index 60b154b..5ab7f80 100644
--- a/issues/gn-uploader/provide-page-for-uploaded-data.gmi
+++ b/issues/gn-uploader/provide-page-for-uploaded-data.gmi
@@ -2,7 +2,7 @@
 
 ## Tags
 
-* status: open
+* status: closed, completed
 * assigned: fredm
 * priority: medium
 * type: feature, feature request, feature-request
@@ -20,3 +20,8 @@ Once a user has uploaded their data, provide them with a landing page/dashboard
 Depends on
 
 => /issues/gn-uploader/link-authentication-authorisation
+
+
+## Close as complete
+
+The current uploader directs the user to a view of the data they uploaded on GN2. This is complete.
diff --git a/issues/gn-uploader/replace-redis-with-sqlite3.gmi b/issues/gn-uploader/replace-redis-with-sqlite3.gmi
index 3e5020a..d3f94f0 100644
--- a/issues/gn-uploader/replace-redis-with-sqlite3.gmi
+++ b/issues/gn-uploader/replace-redis-with-sqlite3.gmi
@@ -15,3 +15,15 @@ We currently (as of 2024-06-27) use Redis for tracking any asynchronous jobs (e.
 A lot of what we use redis for, we can do in one of the many SQL databases (we'll probably use SQLite3 anyway), which are more standardised, and easier to migrate data from and to. It has the added advantage that we can open multiple connections to the database, enabling the different processes to update the status and metadata of the same job consistently.
 
 Changes done here can then be migrated to the other systems, i.e. GN2, GN3, and gn-auth, as necessary.
+
+### 2025-12-31: Progress Update
+
+Initial basic implementation can be found in:
+
+=> https://git.genenetwork.org/gn-libs/tree/gn_libs/jobs
+=> https://git.genenetwork.org/gn-uploader/commit/?id=774a0af9db439f50421a47249c57e5a0a6932301
+=> https://git.genenetwork.org/gn-uploader/commit/?id=589ab74731aed62b1e1b3901d25a95fc73614f57
+
+and others.
+
+More work needs to be done to clean-up some minor annoyances.
diff --git a/issues/gn-volt-genofiles-parsing-integration.gmi b/issues/gn-volt-genofiles-parsing-integration.gmi
index 8d3d149..e1b0162 100644
--- a/issues/gn-volt-genofiles-parsing-integration.gmi
+++ b/issues/gn-volt-genofiles-parsing-integration.gmi
@@ -5,7 +5,7 @@
 * assigned: alexm,
 * type: improvement
 * priority: high
-* status: in progress
+* status: stalled, closed.
 
 ## Notes
 
diff --git a/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi b/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi
index 9dcef53..5b0a1ff 100644
--- a/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi
+++ b/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi
@@ -3,7 +3,7 @@
 ## Tags
 
 * type: feature
-* status: progress
+* status: completed, closed
 * priority: medium
 * assigned: alexm,
 * keywords: gnqa, user experience, authentication, login, llm
@@ -15,6 +15,6 @@ This feature will allow usage of LLM/GNQA features without requiring user authen
 ## Tasks
 
 * [x] If logged in: perform AI search with zero penalty
-* [ ] Add caching lifetime to save on token usage
-* [ ] Routes: check for referrer headers — if the previous search was not from the homepage, perform AI search
-* [ ] If global search returns more than *n* results (*n = number*), perform an AI search
+* [x] Add caching lifetime to save on token usage
+* [x] Routes: check for referrer headers — if the previous search was not from the homepage, perform AI search
+* [x] If global search returns more than *n* results (*n = number*), perform an AI search
diff --git a/issues/gnqa/merge-gnqa-to-production.gmi b/issues/gnqa/merge-gnqa-to-production.gmi
index 3d34bb1..6e5f119 100644
--- a/issues/gnqa/merge-gnqa-to-production.gmi
+++ b/issues/gnqa/merge-gnqa-to-production.gmi
@@ -4,6 +4,7 @@
 
 * assigned: alexm,
 * keywords: production, GNQA, integration
+* status: closed, completed
 
 ## Description
 
@@ -12,5 +13,5 @@ be pushed to production. We need to allow only logged-in users to access the ser
 
 ## Tasks
 
-* [] Integrate GN-auth for the service
-* [] Push production to the current commit
\ No newline at end of file
+* [x] Integrate GN-auth for the service
+* [x] Push production to the current commit
\ No newline at end of file
diff --git a/issues/gnqna/query-bug-DatabaseError.gmi b/issues/gnqna/query-bug-DatabaseError.gmi
new file mode 100644
index 0000000..b8c1cfc
--- /dev/null
+++ b/issues/gnqna/query-bug-DatabaseError.gmi
@@ -0,0 +1,37 @@
+# Query Bug: DatabaseError
+
+## Tags
+
+* assigned: fredm, bonfacem
+* priority: high
+* status: open
+* type: bug
+* keywords: gnqna
+
+## Descriptions
+
+* Go to https://genenetwork.org/gnqna
+* Type in a query
+* Press "Enter"
+* Observe the error "DatabaseError" with a status code of 500.
+
+Expected: Query returns a result.
+
+
+## Troubleshooting: 2025-10-27
+
+* GNQNA's deployment is not part of the gn-machine's definitions!
+
+## Troubleshooting: 2025-12-31
+
+If a user **IS NOT** logged in, the system responds with:
+
+```
+Search_Query:
+Status_Code: 500
+Error/Reason: Login/Verification required to make this request
+```
+
+On the other hand, if a user is logged in, a query returns a result.
+
+We, therefore, probably need to notify the user that they need to be logged in to use this service.
diff --git a/issues/guix-bioinformatics/guix-updates.gmi b/issues/guix-bioinformatics/guix-updates.gmi
new file mode 100644
index 0000000..9c65fb9
--- /dev/null
+++ b/issues/guix-bioinformatics/guix-updates.gmi
@@ -0,0 +1,18 @@
+# Planned Guix Updates
+
+## Tags
+
+* status: open
+* priority: medium
+* type: enhancement
+* assigned: fredm, bonfacem
+* keywords: guix-bioinformatics, guix
+* interested: pjotrp, aruni
+
+## Description
+
+The following outlines issues around the next upgrade:
+
+* Update pinned guix commit to the latest and see whether inferior profiles for the laminar user are properly created.
+* Rust packages (new package build system) we need to think about.
+
diff --git a/issues/guix-bioinformatics/pin-channels-commits.gmi b/issues/guix-bioinformatics/pin-channels-commits.gmi
new file mode 100644
index 0000000..216dd24
--- /dev/null
+++ b/issues/guix-bioinformatics/pin-channels-commits.gmi
@@ -0,0 +1,39 @@
+# Pin Channel Commits; Decouple from Guix
+
+## Tags
+
+* status: closed
+* priority: medium
+* type: enhancement
+* assigned: fredm, bonfacem, aruni
+* keywords: guix-bioinformatics, guix
+* interested: pjotrp, aruni
+
+## Description
+
+Changes in upstream Guix often lead to deployment issues, due to breakages caused by changes in how GNU Guix does things. This interrupts our day-to-day operations, leading us to scramble to fix the breakages and make the builds sane again.
+
+In order to avoid these breakages in the future, we'll need to actually pin the commit(s) for all the channels we depend on, to avoid surprises down the line.
+
+### Channel Dependencies
+
+We depend on the following channels in guix-bioinformatics:
+
+* guix: Mainline Guix channel
+* guix-past: Channel for old packages, no longer maintained on guix mainline
+* guix-rust-past-crates: Channel for rust packages using the old packaging form
+* guix-forge: Manages building containers and whatnot. The dependence is implicit here, but it is one of the main causes of breakages
+
+### Tasks
+
+* [x] Pin guix channel
+* [x] Pin guix-past
+* [x] Pin guix-rust-past-crates channel
+* [x] Pin guix-forge channel
+* [ ] Move packages from (gn packages bioinformatics) to upstream (gnu packages bioinformatics)
+
+### Solution
+
+To allow guix-bioinformatics to continue improving while preventing random breakages, we stopped depending on guix-bioinformatics directly. Instead, we changed our main channel to gn-machines, and there we pinned the version of guix-bioinformatics we depend on.
+
+This allows us to continue updating our packages while keeping the channel dependencies relatively stable.
diff --git a/issues/guix-ci-tests.gmi b/issues/guix-ci-tests.gmi
new file mode 100644
index 0000000..ce56705
--- /dev/null
+++ b/issues/guix-ci-tests.gmi
@@ -0,0 +1,47 @@
+# Guix CI failure: guix-past build breaks due to missing (libchop)
+
+# Tags
+
+* assigned: bonfacem
+* type: bug, infrastructure
+* priority: high
+
+# Notes
+
+After fixing a permissions issue in the Laminar CI environment (/var/guix/profiles/per-user/laminar):
+
+```
+[laminar] Executing cfg/jobs/gn-libs.run Backtrace: 9 (primitive-load "/var/lib/laminar/cfg/jobs/gn-libs.run") In ice-9/boot-9.scm: 152:2 8 (with-fluid* _ _ _) In ice-9/eval.scm: 202:51 7 (_ #(#(#<directory (guile-user) 7fce0bc71c80> #<pro?> ?))) 293:34 6 (_ #(#(#<directory (guile-user) 7fce0bc71c80> #<pro?> ?))) In guix/inferior.scm: 1006:4 5 (inferior-for-channels _ #:cache-directory _ #:ttl _) In ice-9/boot-9.scm: 1752:10 4 (with-exception-handler _ _ #:unwind? _ # _) In guix/store.scm: 690:37 3 (thunk) 1331:8 2 (call-with-build-handler #<procedure 7fce00e9f0c0 at g?> ?) In guix/inferior.scm: 951:2 1 (cached-channel-instance #<store-connection 256.100 7f?> ?) In ice-9/boot-9.scm: 1685:16 0 (raise-exception _ #:continuable? _) ice-9/boot-9.scm:1685:16: In procedure raise-exception: In procedure mkdir: Permission denied: "/var/guix/profiles/per-user/laminar"
+```
+
+...  by (inside the container) running:
+
+```
+mkdir -p /var/guix/profiles/per-user/laminar
+chown -R laminar:laminar /var/guix/profiles/per-user/laminar
+```
+
+...  the CI progressed further but now fails when attempting to build guix-past.  The failure is caused by an unbound variable error for the module (libchop), indicating a mismatch or missing dependency in the pinned Guix channels.
+
+Error Log:
+
+```
+(exception unbound-variable (value #f)
+ (value "Unbound variable: ~S")
+ (value (libchop)) (value #f))
+
+builder for /gnu/store/gx57wj08yv0x0g1r8rbnwcp2fc58lqvx-guix-past.drv
+failed to produce output path
+/gnu/store/n3q0sgqwm9mwvna5215npwmdfigfyr9f-guix-past
+
+cannot build derivation
+/gnu/store/3fwagz1p9vv3h020lwb2ab52f6wj6z1g-profile.drv:
+1 dependencies couldn't be built
+```
+
+# Resolution
+
+* Inside genenetwork-development.scm, manually create `/var/guix/profiles/per-user/laminar` if it doesn't exist.
+* Update the relevant .guix-channel file to match channels in guix-bioinformatics.
+
+* closed
diff --git a/issues/implement-gn-markdown-editor.gmi b/issues/implement-gn-markdown-editor.gmi
index 7d7d08f..a0d386b 100644
--- a/issues/implement-gn-markdown-editor.gmi
+++ b/issues/implement-gn-markdown-editor.gmi
@@ -13,7 +13,7 @@ Example of similar implementation
 
 * assigned: alexm
 * type: enhancement
-* status: IN PROGRESS
+* status: done, completed.
 
 * keywords: markdown,editor
 
@@ -23,7 +23,7 @@ Example of similar implementation
 
 * [x]  add live preview for page markdown on edit
 
-* [] authentication(WIP)
+* [x] authentication
 
 * [x] commit changes to github repo
 
diff --git a/issues/implement_xapian_to_text_transformer.gmi b/issues/implement_xapian_to_text_transformer.gmi
index a3c3dc8..192491a 100644
--- a/issues/implement_xapian_to_text_transformer.gmi
+++ b/issues/implement_xapian_to_text_transformer.gmi
@@ -4,7 +4,7 @@
 * assigned: alexm, jnduli
 * keywords: llm, genenetwork2, xapian, transform
 * type: feature
-* status: in-progress
+* status: closed,  completed 
 
 ## Description:
 
diff --git a/issues/prevent-weak-passwords.gmi b/issues/prevent-weak-passwords.gmi
index 8e8ca2f..957a170 100644
--- a/issues/prevent-weak-passwords.gmi
+++ b/issues/prevent-weak-passwords.gmi
@@ -19,3 +19,11 @@
 There was a request made to prevent weak passwords.
 
 Use existing libraries to check and prevent weak passwords.
+
+## Notes
+
+### 2025-12-31: Look Into Libraries
+
+=> https://pypi.org/project/password-strength/ password-strength
+
+The library above seems promising. Unfortunately, we'd have to write a guix definition for it.
diff --git a/issues/provide-link-to-register-user-in-sign-in-page.gmi b/issues/provide-link-to-register-user-in-sign-in-page.gmi
index 24d7c21..b9e6a4d 100644
--- a/issues/provide-link-to-register-user-in-sign-in-page.gmi
+++ b/issues/provide-link-to-register-user-in-sign-in-page.gmi
@@ -3,7 +3,7 @@
 ## Tags
 
 * type: bug
-* status: open
+* status: closed
 * assigned: fredm
 * priority: medium
 * keywords: register user, gn-auth, genenetwork
@@ -16,3 +16,8 @@ Provide a link allowing a user to register with the system on the sign-in page.
 We are now using OAuth2 to enable sign-in, which means that the user is redirected from the service they were in to the authorisation service to sign-in.
 
 The service should retain a note of the service which the user came from, and redirect back to it on successful registration.
+
+
+### Close as Completed
+
+@zachs seems to have fixed this.
diff --git a/issues/quality-control/r-qtl2-features.gmi b/issues/quality-control/r-qtl2-features.gmi
index eac53c4..bcc5d71 100644
--- a/issues/quality-control/r-qtl2-features.gmi
+++ b/issues/quality-control/r-qtl2-features.gmi
@@ -3,7 +3,7 @@
 ## Tags
 
 * type: listing
-* status: open
+* status: closed, completed
 * assigned: fredm
 * priority: high
 * keywords: listing, bug, feature
@@ -12,5 +12,9 @@
 
 This is a listing of non-critical features and bugs that do not currently have a dedicated issue, and need to be handled some time in the future.
 
-* [feature] "Undo Transpose": Files marked as '*_transposed: true' will have the transposition undone to ease processing down the line.
+* Closed, completed: [feature] "Undo Transpose": Files marked as '*_transposed: true' will have the transposition undone to ease processing down the line.
 * …
+
+### Close as completed
+
+Actually open dedicated issues for bugs and features rather than collecting them here.
diff --git a/issues/systems/apps.gmi b/issues/systems/apps.gmi
index b9d4155..e374250 100644
--- a/issues/systems/apps.gmi
+++ b/issues/systems/apps.gmi
@@ -194,14 +194,32 @@ Package definition is at
 
 Container is at
 
-=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/services/bxd-power-container.scm
+=> https://git.genenetwork.org/gn-machines/tree/gn/services/mouse-longevity.scm
+
+gaeta:~/iwrk/deploy/gn-machines$ guix system container -L . -L ~/guix-bioinformatics --verbosity=3 test-r-container.scm -L ~/iwrk/deploy/guix-forge/guix
+forge/nginx.scm:145:40: error: acme-service-type: unbound variable
+hint: Did you forget `(use-modules (forge acme))'?
+
 
 ## jumpshiny
 
 Jumpshiny is hosted on balg01. Scripts are in tux02 git.
 
+=> git.genenetwork.org:/home/git/shared/source/jumpshiny
+
 ```
 root@balg01:/home/j*/gn-machines# . /usr/local/guix-profiles/guix-pull/etc/profile
 guix system container --network -L . -L ../guix-forge/guix/ -L ../guix-bioinformatics/ -L ../guix-past/modules/ --substitute-urls='https://ci.guix.gnu.org https://bordeaux.guix.gnu.org https://cuirass.genenetwork.org' test-r-container.scm -L ../guix-forge/guix/gnu/store/xyks73sf6pk78rvrwf45ik181v0zw8rx-run-container
 /gnu/store/6y65x5jk3lxy4yckssnl32yayjx9nwl5-run-container
 ```
+
+Currently:
+
+Jumpshiny: as aijun, cd services/jumpshiny and ./.guix-run
+
+
+## JUMPsem_web
+
+Another shiny app to run on balg01.
+
+JUMPsem_web: as aijun, cd services/jumpsem and ./.guix-run
diff --git a/issues/systems/octopus.gmi b/issues/systems/octopus.gmi
index c510fd9..3a6d317 100644
--- a/issues/systems/octopus.gmi
+++ b/issues/systems/octopus.gmi
@@ -1,6 +1,9 @@
 # Octopus sysmaintenance
 
-Reopened tasks because of new sheepdog layout and add new machines to Octopus and get fiber optic network going with @andreag. See also
+Reopened tasks because of the new sheepdog layout, and to add new machines to Octopus and get the fiber optic network going with @andreag.
+IT recently upgraded the network switch, so we should have great interconnect between all nodes. We also need to work on user management and network storage.
+
+See also
 
 => ../../topics/systemtopics/systems/hpcs/hpc/octopus-maintenance
 
@@ -14,7 +17,7 @@ Reopened tasks because of new sheepdog layout and add new machines to Octopus an
 
 # Tasks
 
-* [ ] add lizardfs to nodes
+* [X] add lizardfs to nodes
 * [ ] add PBS to nodes
 * [ ] use fiber optic network
 * [ ] install sheepdog
@@ -36,6 +39,17 @@ default via 172.23.16.1 dev ens1f0np0
 
 # Current topology
 
+vim /etc/ssh/sshd_config
+systemctl reload ssh
+
+The routing should be as on octopus01
+
+```
+default via 172.23.16.1 dev eno1
+172.23.16.0/21 dev ens1f0np0 proto kernel scope link src 172.23.18.221
+172.23.16.0/21 dev eno1 proto kernel scope link src 172.23.18.188
+```
+
 ```
 ip a
 ip route
@@ -44,3 +58,9 @@ ip route
 - Octopus01 uses eno1 172.23.18.188/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps)
 - Octopus02 uses eno1 172.23.17.63/21  gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps)
                       172.23.x.x
+
+# Work
+
+* After the switch upgrade penguin2 NFS is not visible for octopus01. I disabled the mount in fstab
+* On octopus01 disabled unattended upgrade script - we don't want kernel updates on this machine(!)
+* Updated IP addresses in sshd_config
diff --git a/issues/systems/t02-crash.gmi b/issues/systems/t02-crash.gmi
new file mode 100644
index 0000000..bf0c5d5
--- /dev/null
+++ b/issues/systems/t02-crash.gmi
@@ -0,0 +1,47 @@
+## Postmortem tux02 crash
+
+I'll take a look at tux02 - it rebooted last night and I need to start some services. It rebooted at CDT Aug 07 19:29:14 tux02 kernel: Linux version ... We have two out of memory messages before that:
+
+```
+Aug  7 18:45:27 tux02 kernel: [13521994.665636] Out of memory: Kill process 30165 (guix) score 759 or sacrifice child
+Aug  7 18:45:27 tux02 kernel: [13521994.758974] Killed process 30165 (guix) total-vm:498873224kB, anon-rss:223599272kB, file-rss:4kB, shmem-rss:0kB
+```
+
+My mosh clapped out before that
+
+```
+wrk      pts/96       mosh [128868]    Thu Aug  7 18:53 - down   (00:00)
+```
+
+Someone killed the development container before that
+
+```
+Aug  7 18:06:32 tux02 systemd[1]: genenetwork-development-container.service: Killing process 86832 (20qjyhd7n9n62fa) with signal SIGKILL.
+```
+
+and
+
+```
+Aug  7 13:28:26 tux02 kernel: [13502972.611421] oom_reaper: reaped process 25224 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
+Aug  7 18:16:00 tux02 kernel: [13520227.160945] oom_reaper: reaped process 128091 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
+```
+
+Guix builds running out of RAM... My conclusion is that someone has been doing some heavy lifting. Probably Fred. I'll ask him to use a different machine that is not shared by many people. First I need to bring up some processes. The shepherd had not started, so:
+
+```
+systemctl status user-shepherd.service
+```
+
+Most services have started now. I need to check again in half an hour.
+
+BNW is the one that does not start up automatically.
+
+```
+su shepherd
+herd status
+herd stop bnw
+herd status bnw
+tail -f /home/shepherd/logs/bnw.log
+```
+
+Shows a process is blocking the port. Kill as root, after making sure herd status shows it as stopped.
diff --git a/issues/systems/tux02-production.gmi b/issues/systems/tux02-production.gmi
index 7de911f..d811c5e 100644
--- a/issues/systems/tux02-production.gmi
+++ b/issues/systems/tux02-production.gmi
@@ -14,9 +14,9 @@ We are going to move production to tux02 - tux01 will be the staging machine. Th
 
 * [X] update guix guix-1.3.0-9.f743f20
 * [X] set up nginx (Debian)
-* [X] test ipmi console (172.23.30.40)
+* [X] test ipmi console
 * [X] test ports (nginx)
-* [?] set up network for external tux02e.uthsc.edu (128.169.4.52)
+* [?] set up network for external tux02
 * [X] set up deployment evironment
 * [X] sheepdog copy database backup from tux01 on a daily basis using ibackup user
 * [X] same for GN2 production environment
diff --git a/issues/systems/tux04-disk-issues.gmi b/issues/systems/tux04-disk-issues.gmi
index bc6e1db..3df0a03 100644
--- a/issues/systems/tux04-disk-issues.gmi
+++ b/issues/systems/tux04-disk-issues.gmi
@@ -378,3 +378,46 @@ The code where it segfaulted is online at:
 => https://github.com/tianocore/edk2/blame/master/MdePkg/Library/BasePciSegmentLibPci/PciSegmentLib.c
 
 and has to do with PCI registers and that can actually be caused by the new PCIe card we hosted.
+
+# Sept 2025
+
+We moved production away from tux04, so now we should be able to work on this machine.
+
+
+## System crash on tux04
+
+And tux04 is down *again*. Wow, glad we moved off! I want to fix that machine and we had to move production off! I left the terminal open and the last message is:
+
+```
+tux04:~$ [SMM] APIC 0x00 S00:C00:T00 > ASSERT [AmdPlatformRasRsSmm] u:\EDK2\MdePkg\Library\BasePciSegmentLibPci\PciSegmentLib.c(766): ((Address) & (0xfffffffff0000000ULL | (3))) == 0
+!!!! X64 Exception Type - 03(#BP - Breakpoint)  CPU Apic ID - 00000000 !!!!
+RIP  - 0000000076DA4343, CS  - 0000000000000038, RFLAGS - 0000000000000002
+RAX  - 0000000000000010, RCX - 00000000770D5B58, RDX - 00000000000002F8
+RBX  - 0000000000000000, RSP - 0000000077773278, RBP - 0000000000000000
+RSI  - 0000000000000000, RDI - 00000000777733E0
+R8   - 00000000777731F8, R9  - 0000000000000000, R10 - 0000000000000000
+R11  - 00000000000000A0, R12 - 0000000000000000, R13 - 0000000000000000
+R14  - FFFFFFFFAC41A118, R15 - 000000000005B000
+DS   - 0000000000000020, ES  - 0000000000000020, FS  - 0000000000000020
+GS   - 0000000000000020, SS  - 0000000000000020
+CR0  - 0000000080010033, CR2 - 00007F67F5268030, CR3 - 0000000077749000
+CR4  - 0000000000001668, CR8 - 0000000000000001
+DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
+DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
+GDTR - 000000007773C000 000000000000004F, LDTR - 0000000000000000
+IDTR - 0000000077761000 00000000000001FF,   TR - 0000000000000040
+FXSAVE_STATE - 0000000077772ED0
+!!!! Find image based on IP(0x76DA4343) u:\Build_Genoa\DellBrazosPkg\DEBUG_MYTOOLS\X64\DellPkgs\DellChipsetPkgs\AmdGenoaModulePkg\Override\AmdCpmPkg\Features\PlatformRas\Rs\Smm\AmdPlatformRasRsSmm\DEBUG\AmdPlatformRasRsSmm.pdb (ImageBase=0000000076D3E000, EntryPoint=0000000076D3E6C0) !!!!
+```
+
+and the racadm system log says
+
+```
+Record:      362
+Date/Time:   09/11/2025 21:47:02
+Source:      system
+Severity:    Critical
+Description: A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning.
+```
+
+I have seen that before and it is definitely a hardware/driver issue on the Dell itself. I'll work on that later. Luckily it always reboots.
diff --git a/tasks/alexm.gmi b/tasks/alexm.gmi
index 7ec8e87..e410046 100644
--- a/tasks/alexm.gmi
+++ b/tasks/alexm.gmi
@@ -75,6 +75,95 @@ You can refine the search by constraining the checks some more, e.g. to get high
 
 * masters ; submit documents 
 
+## 9/06/2025
+
+* [x] no login for gnqna
+* [-] hsrat rqtl2 integration: follow up on the dumping genotypes files to lmdb 
+
+
+## 16/06/2025
+
+* [x] implementation for  no login  for gnqa users.
+ see issue here: https://issues.genenetwork.org/issues/gnqa/implement-no-login-requirement-for-gnqa
+* [-] hsrat rqtl2 integration: follow up on the dumping genotypes files to lmdb with bons 
+* [] create rqtl2 adapter for reading cross from  lmdb 
+
+
+## 23/06/2025
+
+*  [-]  focus on documentation/source code  for read cross  ;;add option for reading lmdb
+*  [-]  looking at thor, an lmdb interface
+*  [x] implementation of no login for gnqna users ;; TODO push this code to cd.
+* [x]  for no-login llm, provide the correct feedback to the user if the AI search does not meet criteria;; currently only a static response `Please login to view AI generated summary`
+
+
+## 30/06/2025
+
+* [] rqtl2 lmdb interface
+
+
+## 7/7/2025
+
+* [x] guix packaging for thor package
+* [x] reading metadata from lmdb file using thor 
+* [-] working on reconstructing the genotype files to geno, geno_map and pheno_map file
+* [x] rate limiting for gnqna users.
+* [x]  check on tokens for no logged in users
+* [x]  Look at issue about signing tokens for non logged in users
+
+
+## 21/7/2025
+* [x] working on reconstructing/reading  the genotype files based on bons dumping script 
+* [x]  generate and validate cross objects
+
+## 29/7/2025
+* [x] adding founder_geno and pheno covariates pheno
+* [-] minor fixes for lmdb matrix script (missing metadata)
+* [-] check phenotype work  for lmdb
+
+## 4/8/2025
+* [-] refactoring lmdb matrix script
+* [-] integrating fetching rqtl2 from lmdb using bxd as a test pilot
+
+
+##  11/8/2025
+
+* [x] integrating lmdb genotypes  for rqtl2 computation for BXD
+
+
+##  18/8/2025
+
+* [x] integrating lmdb genotypes for rqtl2 computation for BXD
+
+## 25/8/2025
+* [x] script to dump phenotypes and cross metadata to lmdb
+
+## 31/8/2025
+
+* [x] generic script that can  parse json cross files and dump the metadata to lmdb
+* [x] follow up on dumped phenotypes in LMDB for GeneNetwork cc @bons
+* [x] integrating lmdb rqtl2 adapter to CD (test this on cd for bxd.)
+
+## 23/9/2025
+* [x] dumping  phenotypes to lmdb (BXDPublish)
+* [x] dumping cross metadata to lmdb
+
+## 30/9/2025
+* [x]  profiling and benchmarking  read_lmdb_cross against read_cross
+* [-] Run this on tux02 and integrate to CD
+
+
+## 7/9/2025
+
+* [-] integrate rqtl2-lmdb adapter to CD cc @bons with directory setup.
+* [x] improvement on rqtl2-lmdb adapter script;  add covariates supports.
+
+
+## 15/9/2025
+
+* [-] clone qtl2 repo and  build rqtl2-lmdb package locally.
+* [] package rqtl2-lmdb package to guix-bioinformatics.
+
 ## Next week(s)
 
 * [ ] Accelerate Xapian functionality - needs Aider key from Pjotr
diff --git a/tasks/bonfacem.gmi b/tasks/bonfacem.gmi
index 03848f1..2d56e72 100644
--- a/tasks/bonfacem.gmi
+++ b/tasks/bonfacem.gmi
@@ -4,41 +4,46 @@
 
 * kanban: bonfacem
 * assigned: bonfacem
-* status: in progress
 
 ## Tasks
 
 ### Note
-- GN-auth dashboard fixes.  Follow up with Fred.
-- Case-attributes used in co-variates.
-- Encourage FahamuAI to be open.
+* Don't lose metadata.   Have an array of disabled snips.
+* Store by snip (rows).  Storage by marker.  2 different files.
+* gn-auth:
+  have wrappers around gn-auth (draw-back: folk may forget).
+  use Nginx as a MTM (re-routing calls).  How to add handler in Nginx and to work with tokens.
+* GN-auth dashboard fixes.  Follow up with Fred.
+* Case-attributes used in co-variates.
+
+### PhD Work
+
+* Concept note/ideas: Add all metadata in GN to an LLM that enriches GnQA.
+* Use mapping output as full vectors for gpt/transformers.  Integrate this work into GN.
+* Share PhD concept note with PJ for polishing
 
 ### This week
-* [+] Case Attributes (Do a diagnostic and delegate)
-*     - Git blame.  Add tests.
-*     - Error when checking the history.
-*     - Reach out to Zach.
-*     - Disable diff in the UI.
-* [ ] Distinct admin and dev user.
-* [ ] Adapter to LMDB into a cross object.
+* [] Look at deep-seek/anthropic (also really doc deployment in balg01).  Run in debian machine.
+* [~] Adapter to LMDB into a cross object.
 *     - Try computations with R/qtl2.
 *     - Look at R LMDB libraries.
 *     - Look at functions that read the files.
 *     - PJ: LMDB adapter in R and cross-type files.
-* [ ] Send Arun an e-mail on how to go about upgrading shepherd.
-* [ ] Dump all genotypes from production to LMDB.
+* [~] gn-guile webhook.
+* [~] Dump all genotypes from tux02 to LMDB.
 *     - PJ sync tux01 genotypes with tux02/04.
+*     - Yet to set-up 2FA on new device
+
+### Later
+* [ ] Generate RDF docs using AI.
+* [ ] Editing genotype metadata
+* [ ] Look at XAPIAN search for gene alias.
+* [ ] Add GeneNetwork abstracts to XAPIAN search.
 * [+] Correlations hash.
 *     - Add dataset count to RDF.
 * [ ] Spam + LLMs
 *     - RateLimiting for Rif Editing.
-*     - Honep Pot approach.
-* [+] Help Alex with SSL certification container error.
-*     - Put the changes in the actual scm files.
-* [X] Python Fahamu.
-* [X] Memvid - brief look.
-
-### Later
+*     - Honeypot approach.
 * [ ] Dockerise GN container.   For Harm.
 * [ ] Send emails when job fail.
 * [ ] Look at updating gn-auth/gn-libs to PYTHONPATH for gn2/3.
@@ -92,7 +97,8 @@ Currently closed issues are:
 * [X] lmdb publishdata output and share with Pjotr and Johannes
 
 ## Done
-
+* [X] Updated Penguin2 with cuda drivers.  Cuda no longer supports K80s
+* [X] Provided Johannes anthropic tokens
 * [X] Add lmdb output hashes with index and export LMDB_DATA_DIRECTORY
 * [X] Share small database with @pjotrp and @felixl
 * [X] With Alex get rqtl2 demo going in CD (for BXD)
@@ -131,3 +137,31 @@ Currently closed issues are:
 *     - Simplify (focus on small files).  Don't over-rely on Numpy.
 * [X] Assess adding GeneRIF to LLM.
 * [X] Referrer headers -- a way of preventing bots beyond rate-limiting. 
+* [X] Python Fahamu.
+* [X] Memvid - brief look.
+* [X] Encourage FahamuAI to be open.
+*     Another paper with his group should be out this month
+* [X] Help Alex with SSL certification container error.
+*     - Fix SSL issues in local container.
+* [X] Send Arun an e-mail on how to go about upgrading shepherd.
+* [X] Case Attributes.
+*     - Git blame.  Add tests.  Fred.
+*     - NOTE: Fixed the diffs.  But there's an edge-case with BXD longevity (I haven't checked.  Shared scripts)
+*     - NOTE: Elpy broke.  Eglot/lspemacs doesn't work.
+*     - NOTE: Moved away from storing diffs in files to LMDB.
+*     - Error when checking the history.  Fixed by fixing the diffs.
+*     - Reach out to Zach.  NOTE: Timing differences.
+*     - Disable diff in the UI - unnecessary.
+* [X] Added LMDB_PATH to dev container.   Updated old commits.
+* [X] Merged no-login AI work that Alex did.
+* [X] Talk to Fred and hand over case-attributes.
+* [X] Distinct admin and dev user. [w/ Fred]
+*     - Extra fluff to grant dev user access to everything.
+* [X] Merged rate-limiter.
+* [X] Look at slow running CD (look at issue tracker and be systematic).
+* [X] Fix CD.  Build guix against a recent pinned profile.
+* [X] Fix CD tests.
+* [X] Look at different provider(s) for LLMs.
+* [X] Install OpenCL in Penguin2 and try LLM script.  Check differences between OpenCL and CUDA.  PJ installed CUDA in balg01.
+* [X] Look at container work.  Look at permissions issue.
+* [X] Set-up wolfshead: Resolving dependency conflicts in Python.  Using DSPY
diff --git a/tasks/felixl.gmi b/tasks/felixl.gmi
index 347f387..7a472a1 100644
--- a/tasks/felixl.gmi
+++ b/tasks/felixl.gmi
@@ -41,7 +41,7 @@
 
 * PhD tasks 
   * [ X ] Complete and share concept note and timeline to supervisors, have a meeting for progress 
-  * [ ] Make a milestone on chapter one manuscript (deep dive into the selected papers){THE BIG PICTURE; a complete draft by early May} 
+  * [+] Make a milestone on chapter one manuscript (deep dive into the selected papers){THE BIG PICTURE; a complete draft by early May} 
 
 * Programming 
   * [ ] Make a milestone with the uploader (really push and learn!) 
@@ -116,9 +116,106 @@
 * [X] - document the new findings about smoothing using bcftools and plink
 
 * ## this week (09-06-onwards) 
-* [ ] - identify start and end points for haplotypes in hs genotype files 
-* [ ] - upload the final updates to gn2, test and see the results 
-* [ ] - gn-uploader/uploader folder, explore 
+* [+] - identify start and end points for haplotypes in hs genotype files 
+* [+] - upload the final updates to gn2, test and see the results 
+* [-] - gn-uploader/uploader folder, explore 
+
+* ## this week (16-06-onwards)
+* [X] - hs rats proximal and distal haplotype edges
+* [+] - uploading kilifish using the backend route
+
+* ## this week (23-06-onwards)
+* [X] - hs rats recombination counts
+* [+] - kilifish to gn2 via backend
+
+* ## this week (30-06-onwards)
+* [ ] - mapping offsprings to founders (hs rats)
+* [ ] - upload kilifish to genenetwork
+* [ ] - revise celegans smoothing (genotypes)
+
+* ## this week (07-07-onwards) 
+* [X] - generate haplotypes for offsprings and founders combined; interpretation next..,
+* [+] - keep improving the uploader via data uploading and error solving 
+* [-] - close smoothing revision for celegans, as left before 
+* [X] - why should people read my paper on improving genotyping methods? 
+*        - on smoothing (low density genotypes for mapping, high density genotypes for fine mapping.,)  
+*        - liftovers due to reference versions (currently, a challenge to be looked upon) 
+*        - founders and their offsprings in genotyping 
+*        - pangenomics and machine learning for improved genotyping 
+
+** keys (+; in progress, X; done, -; not yet) 
+* ## this week (14-07-onwards) 
+* [+] - map founders to offspring, work with only pure recombinations
+     [+] - tools available? (plink, rqtl2, beagle, etc)  
+     [+] - custom pipeline, to reflect gaps in the existing tools? (dealing with multiparent species)  
+     [+] - documentation for the paper write up
+
+* ## this week (21-07-onwards) 
+* [ ] - HS rats smoothing continues 
+*    [ ] - documenting the milestones 
+*    [ ] - see the possibility to write a tool from it 
+* [ ] - Pushing kilifish to genenetwork2/learn the source code build up 
+* [ ] - resmoothen celegans genotypes with the new knowledge  
+
+* ## this week (28-07-onwards)
+* [-] - predict genotype probabilities with rqtl2 functions
+       - problems with control setup to load in the needed files for the functions
+* [+] - comparison models for @individual rat vs 8 founders (similarities and percentage composition)
+     [+] - ongoing discussion with alex, there's progress
+
+* ## this week (04-08-onwards)
+* [+] - Testing the logic to infer Hs outbred genotypes with the founders 
+           - Managed to identify parents of origin for each snp on each rat per position, corresponding to the 8 founders
+           - Still need to filter in the distinctive snps, then generate haplo blocks
+
+* ## this week (11 - 08 - onwards) 
+* [X] - generate final haplo file and document 
+* [+] - testing on local gemma and in gn2 
+
+* ## this week (18-08-onwards) 
+* [+] - push for the file to be in gn2, and feedback from the team 
+* [X] - complete the local gemma run, interpret the results 
+* [+] - process the rest of the Xsomes for a ready file to go to gn2 
+*       - issues: over filtering snps, neglecting the one parent of origin, takes long to run.  
+* [+] - prepare an abstract for CTC conference in Barcelona  
+
+* ## this week (01-09-onWards) 
+* [ ] - finetune abstract 
+*       - include more of what i achieved: main focus; genotype smoothing on models with complex traits 
+*       - thought map: generate plots, compare before and after smoothing, check for overlaps, and whether or not the peaks in traits are same before and after smoothing 
+* [ ] - troubleshoot inferring scripts for all Xsomes 
+*       - request bonz/alex's help on this (to save time) 
+
+
+* ## this week (30-06-onwards)  
+* [X] - mapping offsprings to founders (hs rats) 
+* [+] - upload kilifish to genenetwork 
+* [-] - revise celegans smoothing (genotypes) 
+
+* ## this week (07-07-onwards) 
+* [X] - generate haplotypes for offsprings and founders combined; interpretation next..,
+* [+] - keep improving the uploader via data uploading and error solving 
+* [-] - close smoothing revision for celegans, as left before 
+* [X] - why should people read my paper on improving genotyping methods? 
+*        - on smoothing (low density genotypes for mapping, high density genotypes for fine mapping.,)  
+*        - liftovers due to reference versions (currently, a challenge to be looked upon) 
+*        - founders and their offsprings in genotyping 
+*        - pangenomics and machine learning for improved genotyping 
+
+** keys (+; in progress, X; done, -; not yet) 
+* ## this week (14-07-onwards) 
+* [+] - map founders to offspring, work with only pure recombinations
+     [+] - tools available? (plink, rqtl2, beagle, etc)  
+     [+] - custom pipeline, to reflect gaps in the existing tools? (dealing with multiparent species)  
+     [+] - documentation for the paper write up
+
+* ## this week (21-07-onwards) 
+* [ ] - HS rats smoothing continues 
+*    [ ] - documenting the milestones 
+*    [ ] - see the possibility to write a tool from it 
+* [ ] - Pushing kilifish to genenetwork2/learn the source code build up 
+* [ ] - resmoothen celegans genotypes with the new knowledge  
+
 
 ### Later weeks (non-programming tasks)
 
diff --git a/tasks/hasitha.gmi b/tasks/hasitha.gmi
new file mode 100644
index 0000000..fcef29b
--- /dev/null
+++ b/tasks/hasitha.gmi
@@ -0,0 +1,37 @@
+# Tasks for Hasitha
+
+## Tags
+
+* kanban: hasitha
+* assigned: hasitha
+* status: in progress
+
+## Tasks
+
+### Notes
+* 
+
+### This week
+* [ ] Implementing CRAM encoding methods in GBAM
+*       - [ ] ReadName tokenization pipeline - could save ~6% space using this technique. Need to work on decoding.
+* [ ] Starting off with Cigar compression sub project with Andrea
+* [ ] Discuss AGC and population compression with Andrea
+
+### Later
+* [ ] 
+
+### Even later
+
+* [ ]
+
+### On Hold
+* [ ] GBAM reader using noodles
+
+## Done
+
+* [X] Moving GBAM python and rust stuff to C
+* [X] Fixing memory issues in C
+* [X] SAM input to GBAM in Rust
+* [X] agc-rs setup testing on M1 mac
+
+
diff --git a/tasks/johannesm.gmi b/tasks/johannesm.gmi
new file mode 100644
index 0000000..840bd3e
--- /dev/null
+++ b/tasks/johannesm.gmi
@@ -0,0 +1,90 @@
+# Tasks for Johannes
+
+## Tags
+
+* kanban: johannesm
+* assigned: johannesm
+* status: in progress
+
+## Tasks
+
+### Ongoing
+
+* [] Get system into use in GN
+
+* [] Draft outline paper
+
+* [] Talk with Hao about publication on agent system and gain over GNQA
+
+* [] Read literature for paper
+
+
+
+### Later
+
+* [~] Catch up on SPARQL
+
+* [~] Catch up on LMDB
+
+
+### Past
+
+* [X] Make RAG script available for reuse and clean it with Bonface
+
+* [X] Discuss with Bonface on how to get metadata for RAG
+
+* [X] Pickle RAG -> not successful :)
+
+* [X] Try out SPARQLWrapper
+
+* [X] Optimize RAG
+
+* [X] Integrate with RAG
+
+* [X] Get actual metadata with SPARQL endpoint
+
+* [X] Make RAG agentic -> AI system
+
+* [X] Visit precompute issue
+
+* [X] Find more affordable options for GNQA
+
+* [X] Replace GNQA backend with Bonz
+
+* [X] Test GPU
+
+* [X] Test new AI system
+
+* [X] Use GPU to make naturalization faster
+
+* [X] Fix bugs and optimize AI system
+
+* [X] Test and validate AI system working
+
+* [X] Read up on performance evaluation for AI systems
+
+* [X] Refetch all data from SPARQL
+
+* [X] Preprocess RDF for improved naturalization
+
+* [X] Make asynchronous requests to server for naturalization
+
+* [X] Document work on issue tracker
+
+* [X] Package code AI system
+
+* [X] Draft API endpoint
+
+* [X] Test package and share with Bonz
+
+* [X] Test AI system with descriptions, qtl and real biology questions
+
+* [X] Compare performance of Claude and open model on GN data before finetuning
+
+* [X] Look into system finetuning
+
+* [X] Test performance gain
+
+* [X] Think about how to show responses of agent system and GNQA in one UI
+
+* [X] Get API working
diff --git a/tasks/machine-room.gmi b/tasks/machine-room.gmi
index 77f7b8e..d656f2f 100644
--- a/tasks/machine-room.gmi
+++ b/tasks/machine-room.gmi
@@ -63,7 +63,7 @@ Security:
 * [X] describe machines with Rick Stripes
 * [X] get bacchus back on line
 * [X] fix www.genenetwork.org and gn2.genenetwork.org https
-* [-] get data from summer211.uthsc.edu (access machine room)
+* [-] get data from summer211 (access machine room)
 * [X] VPN access and FoUT
 * [X] lambda: get fiber working
 * [X] lambda: add to Octopus HPC
diff --git a/tissue.scm b/tissue.scm
index a0d6345..b3fd1fd 100644
--- a/tissue.scm
+++ b/tissue.scm
@@ -28,6 +28,7 @@
 (tissue-configuration
  #:aliases '(("Alexander Kabui" "Alexander" "alex" "alexk")
              ("Arun Isaac" "arun" "aruni")
+	     ("Hasitha")
              ("BonfaceKilz" "Bonface Kilz" "bonfacem")
 	     ("Felix Lisso" "fetche" "flisso")
              ("Efraim Flashner" "efraimf")
diff --git a/topics/ai/aider.gmi b/topics/ai/aider.gmi
index aa88e71..00845c8 100644
--- a/topics/ai/aider.gmi
+++ b/topics/ai/aider.gmi
@@ -3,8 +3,10 @@
 => https://aider.chat/
 
 ```
+apt-get install python3-venv # or use guix
 python3 -m venv ~/opt/python-aider
 ~/opt/python-aider/bin/python3 -m pip install aider-install
+export PATH="/home/wrk/.local/bin:$PATH"
 ~/opt/python-aider/bin/aider-install
 ```
 
@@ -12,5 +14,6 @@ Installed 1 executable: aider
 Executable directory /home/wrk/.local/bin is already in PATH
 
 ```
+aider --model sonnet --api-key anthropic=sk-ant...
 aider --model gpt-4o --openai-api-key aa...
 ```
diff --git a/topics/ai/gn_agent.gmi b/topics/ai/gn_agent.gmi
new file mode 100644
index 0000000..2b789c9
--- /dev/null
+++ b/topics/ai/gn_agent.gmi
@@ -0,0 +1,790 @@
+# Build an AI system for GN
+
+## Tags
+* type: feature
+* assigned: johannesm
+* priority: medium
+* status: in progress
+* keywords: llm, rag, ai, agent
+
+## Description
+
+The aim is to build an AI system/agent/RAG able to digest mapping results and metadata in GN so that analysis can scale. This is not quite possible at the moment, given that one still needs to dig through and compare that type of information manually. And the data in GN is rather big for such an approach :)
+
+I made an attempt at using deep learning for my Masters project. It could work but required further processing of the results for interpretation. Not quite handy! Instead, we want a system which takes care of all the work (at least most of it) and that we can understand. This is how transformers and LLMs came into the picture.
+
+This work is an extension of the GNQA system initiated by Shelby and Pjotr. 
+
+## Tasks
+* [X] Look for transformer model ready for use and try
+* [X] Build a RAG system and test with small corpus of mapping results
+* [X] Experiment with actual mapping results and metadata
+* [X] Move from RAG to agent
+* [X] Optimize AI system
+* [ ] Scale analysis to more data
+* [ ] Compare performance of open LLMs with Claude in the system
+
+
+
+
+### Look for transformer model ready for use and try
+
+Given the success of transformers, I was first encouraged by Pjotr to look for a model that can support different types of data, i.e. numerical (mapping results) vs textual (metadata).
+
+I found TAPAS which:
+* takes data of different types in tabular format
+* takes a query or question in form of text
+* performs operations on rows of the data table
+* retrieves relevant information
+* returns an answer to the original query
+
+Experiments were ongoing when Rob found, with the help of Claude, that this architecture would not go far. I know, we used an AI to assist our work on AI (at least we did not ask an AI to do the job from the get-go :))
+But it was a good point. TAPAS is relatively old and a lot of progress has been made with LLMs and agents since!
+
+To take advantage of all the progress made with LLMs, I needed to find a way to have only text data. LLMs are trained to understand and work with text. Metadata, being RDF, is already in text format. I only needed to convert the mapping results to text. It is a detour worth a shot if it gives more flexibility and saves development time!
+
+### Build a RAG system and test with a small corpus of mapping results
+
+I have read a number of books and found that RAG systems are pretty easy to design with LangChain. The RAG is made of 2 components:
+* search and retrieval -> need a keyword search algorithm or embedding model
+* response generation -> need an LLM
+
+The system takes a query and returns a response based on the information retrieved. Very efficient in increasing factuality of LLM responses!
+
+I wrote a first draft with:
+
+1. BM25 keyword search algorithm and Qwen3-Embedding-0.6B model
+This is a hybrid search subsystem. It searches both by keyword and meaning. It should be more scalable. The subsystem:
+* takes the query
+* processes it
+* extracts the most similar documents or data chunks
+* passes the results to the LLM for further processing
+
+2. Qwen2.5-72B-Instruct LLM
+This is the response subsystem. It:
+* takes the original query
+* takes retrieved information
+* reasons using a tree-of-thought mechanism for validation
+* generates an answer to the query using retrieved information
+
+Small bonus
+I suspected that more than one query might be required to address some questions. I needed to find a way to make the system conversational so that one could ask many questions without information being lost in between. This came down to adding memory to the system. So, I added a small model for memory propagation. The model was:
+3. Phi-3-mini-4k-instruct
+The system uses this model and the ConversationSummary class from LangChain to summarize key information from an interaction so that it stays accessible between sessions.
+
+Time to test!
+
+* Query 1
+
+```
+Question: What is the lod for trait leptin receptor EPFLMouseLiverCDEx0413 at position 100?
+
+Response:
+**Expert 1:**
+The question asks for the LOD score at position 100. From the relevant information, I see that there is a LOD score of 1 at position 100.
+
+**Expert 2:**
+I agree with Expert 1. The relevant information clearly states that the LOD score at position 100 is 1.
+
+**Expert 3:**
+I concur with both Experts 1 and 2. The LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is indeed 1.
+
+**Conclusion:**
+All experts agree that the LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is 1. <|end|>
+```
+
+Since the corpus used for testing is small, I could inspect it to confirm the accuracy of the response generated by the RAG. A lookup revealed that it was correct, yah!
+
+* Query 2
+
+```
+Question: Which position of leptin receptor EPFLMouseLiverCDEx0413 was examined in the previous prompt?
+
+Response:
+**Expert 1:**
+The previous prompt asked for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100.
+
+**Expert 2:**
+I agree with Expert 1. The previous prompt specifically asked about the LOD score at position 100. Thus, the position examined is 100.
+
+**Expert 3:**
+I concur with both Experts 1 and 2. The previous prompt clearly specified that it was asking for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100.
+
+**Conclusion:**
+All experts agree that the position of leptin receptor EPFLMouseLiverCDEx0413 examined in the previous prompt is 100. <|end|>
+```
+
+This was also correct. I was very excited, Rob too!
+
+I received feedback that made me clarify the meaning of position. I rebuilt the small corpus using a BXD dataset of traits I ran GEMMA on for my Masters project, making sure to use actual marker positions this time and also including marker names.
+
+Let's experiment again!
+
+I got results similar to the ones above, with the exception that marker positions are now real and marker names are supported.
+
+I faced a challenge though :(
+
+For queries that require combinations of different data chunks or documents (non-atomic queries), the system does not perform well. For example, with the queries below:
+
+* How many traits hepatic nuclear factor 4 are in the datasets?
+The system was confused. Even after prompt engineering, the answer generated was not accurate.
+
+* Identify 2 traits that have similar lod values on chromosome 1 position 3010274
+The system sometimes missed or caught only 1 trait having a lod value at the position.
+
+This is probably because the system cannot execute more than one retrieval run. To get there, I need to make the RAG more autonomous: this is how the concept of agent came up.
+
+
+### Experiment with actual mapping results and metadata
+
+Building an agent called for more reading. In the meantime, I decided to get actual mapping results and metadata for experimentation. It would be sad to proceed if the system turned out not to be compatible with the data to be used in production :)
+
+I waited for Pjotr to precompute GEMMA association results and export them with metadata to an endpoint. The RDF schema was very interesting to learn and Bonz did some work about that in the past :)
+
+You can check out recent developments of Pjotr's work here:
+=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-publishdata
+
+For Bonz's work, see:
+=> https://github.com/genenetwork/gn-docs/blob/master/rdf-documentation/phenotype-metadata.md
+
+Anyway, it took some time but I finally got a glimpse of the data.
+
+This started with the metadata from an old endpoint created by Bonz. I also had to learn SPARQL - I was quite new to it!
+
+We thought LLMs could directly make sense of RDF data (it is still text) but it turns out they cannot. They can recognize that it is RDF but, in between all the URIs, they start making mistakes quite quickly. Instead of using RDF natively, we decided to use LLMs to first convert RDF data - metadata or mapping results - to natural text before using it with the RAG system. The system should do better that way, and we confirmed that!
+
+Pjotr made available the first version of the global endpoint. Nothing should stop me now :) I wrote a script to fetch metadata from the endpoint. I have not shared my code so far. Let me fix that right now. You can follow this link for the script I was referring to above:
+=> https://github.com/johanmed/gn-rag/blob/main/fetch_metadata.py
+
+Pjotr also made available the ttl files in my home directory on balg01 - full flexibility!
+
+I naturalized some RDF triples. The corpus now looked like this:
+
+```
+The phenotype identified as BXD_19926, or abbreviated as TAG(48:3)_HFD, is part of the EPFL LISP3 Study. It's a measurement of Liver Lipidomics: TAG48:3_HFD, which has side chain FA composition [TG12:0_18:1_18:2,TG14:0_16:0_18:3,TG14:0_16:1_18:2,TG14:0_16:2_18:1,TG14:1_16:0_18:2,TG14:1_16:1_18:1,TG16:0_16:1_16:2,TG16:1_16:1_16:1] belonging to the lipid class triacylglycerol TAG. This phenotype was measured by MS [R.U.AUC] in overnight fasted male mice at 29 weeks of age, who were on a high fat diet (60% kCal/fat HarlanTD.06414).
+The phenotype was submitted by Jha P and contributed to by Auwerx J. It belongs to the group setBxd and has a sequence number of 1. The mean value for this phenotype is 0.00410489, with an additive effect of 0.00066854. The LOD score for this phenotype is 2.91779.
+The locus associated with this phenotype is Rsm10000001653.
+```
+
+It was time to proceed to testing. The results I show below are not exactly for the RAG system I explained above. I was improving the RAG in parallel. So watch out. I will explain everything in the next task :)
+
+For now, you can have a look at the first implementation at:
+=> https://github.com/johanmed/gn-rag/commit/2cf0b74442e8f7e3a67d563b882f3ab25a4ceb6d
+
+The goal was to try a complex query the previous system failed on. You can see that the question was indeed not atomic. 
+
+```
+Question: Identify traits having a lod score > 4.0
+
+Response:
+In addressing the query to identify traits with a LOD score exceeding 4.0, it's pertinent to highlight three mapped loci that meet these criteria:
+Rsm10000011038_BXDPublish_17086_gemma_GWA_19ca1aeb (LOD: 4.5), Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548 (LOD: 4.14) and Rsm10000000559 (LOD: 4.61). These findings underscore the significance of these loci in genetic studies, given their high LOD scores indicative of strong linkage evidence.
+```
+
+It is already better. At least, the system combined information from different data or document chunks. Retrieved chunks were:
+
+```
+[[Document(metadata={}, page_content='\nThe mapped locus associated with Rsm10000001536 has a LOD score of 2.07, and the mapped SNP is identified as GEMMAMapped_LOCO_BXDPublish_24451_gemma_GWA_aecf628e. The effect size for this locus is 0.01, with an allele frequency (AF) of 0.333.\n                '),
+ Document(metadata={}, page_content='\nThe mapped locus Rsm10000011536 is associated with a LOD score of 5.69, an effect size of 0.385 and an allele frequency of 0.526. This locus has been mapped to the SNP GEMMAMapped_LOCO_BXDPublish_20320_gemma_GWA_6832c0e4.\n                '),
+ Document(metadata={}, page_content='\nThe mapped locus, Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548, has an effect size of -3.137 and a LOD score of 4.14. This locus is associated with the mapped SNP GEMMAMapped_LOCO_BXDPublish_12641_gemma_GWA_8c8d0548, and it has an allele frequency of 0.556.\n                '),
+ Document(metadata={}, page_content='\nIn plain English, this data refers to a mapped locus associated with the Rsm10000011038_BXDPublish_17086_gemma_GWA_19ca1aeb identifier. This locus is linked to the Rsm10000011038 identifier, has an effect size of -0.048, a LOD score of 4.5, and an allele frequency (AF) of 0.167. The mapped SNP associated with this data can be found under the GEMMAMapped_LOCO_BXDPublish_17086_gemma_GWA_19ca1aeb identifier.\n                '),
+ Document(metadata={}, page_content='\n                In plain English, the data describes a genetic locus identified as Rsm10000000559. This locus was mapped through an effect size of -34.191, with an allele frequency of 0.438. The mapping achieved a LOD score of 4.61, indicating the statistical significance of this genetic association. The mapped locus is associated with a specific SNP (Single Nucleotide Polymorphism) identified as GEMMAMapped_LOCO_BXDPublish_12016_gemma_GWA_bc6adcae.\n                ')]]
+```
+
+### Move from RAG to agent
+
+This is where I made the system more autonomous, i.e. agentic. I am now going to explain how I did it. I read a couple of sources and found that a RAG system built with LangChain could be made agentic by using LangGraph. This creates a graph structure which splits the task among different nodes or agents. Each agent achieves a specific subtask and a final node manages the integration.
+
+Check out this commit to see the results:
+=> https://github.com/johanmed/gn-rag/commit/ecde30a31588605358007cc39df25976b9c2e295
+
+You can clearly see differences between *rag_langchain.py* and *rag_langgraph.py*
+
+Basically,
+
+```
+def ask_question(self, question: str):
+        start=time.time()
+        memory_var=self.memory.load_memory_variables({})
+        chat_history=memory_var.get('chat_history', '')
+        result=self.retrieval_chain.invoke(
+            {'question': question,
+             'input': question,
+             'chat_history': chat_history})
+        answer=result.get("answer")
+        citations=result.get("context")
+        self.memory.save_context(
+            {'input': question},
+            {'answer': answer})
+        # Close LLMs
+        GENERATIVE_MODEL.client.close()
+        SUMMARY_MODEL.client.close()
+        end=time.time()
+        print(f'ask_question: {end-start}')
+        return {
+            "question": question,
+            "answer": answer,
+            "citations": citations,
+        }
+```
+
+became:
+
+```
+def retrieve(self, state: State) -> dict:
+        # Define graph node for retrieval
+        prompt = f"""
+        You are powerful data retriever and you strictly return
+        what is asked for.
+        Retrieve relevant documents for the query below,
+        excluding these documents: {state.get('seen_documents', [])}
+        Query: {state['input']}"""
+        retrieved_docs = self.ensemble_retriever.invoke(prompt)
+        return {"input": state["input"],
+                "context": retrieved_docs,
+                "digested_context": state.get("digested_context", []),
+                "result_count": state.get("result_count", 0),
+                "target": state.get("target", 3),
+                "max_iterations": state.get("max_iterations", 5),
+                "should_continue": "naturalize",
+                "iterations": state.get("iterations", 0) + 1, # Add one per run
+                "chat_history": state.get("chat_history", []),
+                "answer": state.get("answer", ""),
+                "seen_documents": state.get("seen_documents", [])}
+
+    def manage(self, state:State) -> dict:
+        # Define graph node for task orchestration
+        context = state.get("context", [])
+        digested_context = state.get("digested_context", [])
+        answer = state.get("answer", "")
+        iterations = state.get("iterations", 0)
+        chat_history = state.get("chat_history", [])
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        max_iterations = state.get("max_iterations", 5)
+        should_continue = state.get("should_continue", "retrieve")
+        # Orchestration logic
+        if iterations >= max_iterations or result_count >= target:
+            should_continue = "summarize"
+        elif should_continue == "retrieve":
+            # Reset fields
+            context = []
+            digested_context = []
+            answer = ""
+        elif should_continue == "naturalize" and not context:
+            should_continue = "retrieve"  # Can't naturalize without context
+            context = []
+            digested_context = []
+            answer = ""
+        elif should_continue == "analyze" and \
+             (not context or not digested_context):
+            should_continue = "retrieve"  # Can't analyze without context
+            context = []
+            digested_context = []
+            answer = ""
+        elif should_continue == "check_relevance" and not answer:
+            should_continue = "analyze"  # Can't check relevance without answer
+        elif should_continue not in ["retrieve", \
+                "naturalize", "check_relevance", "analyze", "summarize"]:
+            should_continue = "summarize"  # Fallback
+        return {"input": state["input"],
+                "should_continue": should_continue,
+                "result_count": result_count,
+                "target": target,
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "context": context,
+                "digested_context": digested_context,
+                "chat_history": chat_history,
+                "answer": answer,
+                "seen_documents": state.get("seen_documents", [])}
+
+    def analyze(self, state:State) -> dict:
+        # Define graph node for analysis and text generation
+        context = "\n".join(state.get("digested_context", []))
+        existing_history="\n".join(state.get("chat_history", [])) \
+            if state.get("chat_history") else ""
+        iterations = state.get("iterations", 0)
+        max_iterations = state.get("max_iterations", 5)
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        if not context: # Cannot proceed without context
+            should_continue = "summarize" if iterations >= max_iterations \
+                or result_count >= target else "retrieve"
+            response = ""
+        else:
+            prompt = f"""
+             <|im_start|>system
+             You are an experienced analyst that can use available information
+             to provide accurate and concise feedback.
+             <|im_end|>
+             <|im_start|>user
+             Answer the question below using following information.
+             Context: {context}
+             History: {existing_history}
+             Question: {state["input"]}
+             Answer:
+             <|im_end|>
+             <|im_start|>assistant"""
+            response = GENERATIVE_MODEL.invoke(prompt)
+            if not response or not isinstance(response, str) or \
+                    response.strip() == "": # Need valid generation
+                should_continue = "summarize" if iterations >= max_iterations \
+                    or result_count >= target else "retrieve"
+                response = ""  # Ensure a clean state
+            else:
+                should_continue = "check_relevance"
+        return {"input": state["input"],
+                "answer": response,
+                "should_continue": should_continue,
+                "context": state.get("context", []),
+                "digested_context": state.get("digested_context", []),
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "result_count": result_count,
+                "target": target,
+                "chat_history": state.get("chat_history", []),
+                "seen_documents": state.get("seen_documents", [])}
+
+    
+    def summarize(self, state:State) -> dict:
+        # Define node for summarization
+        existing_history = state.get("chat_history", [])
+        current_interaction=f"""
+            User: {state["input"]}\nAssistant: {state["answer"]}"""
+        full_context = "\n".join(existing_history) + "\n" + \
+            current_interaction if existing_history else current_interaction
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        iterations = state.get("iterations", 0)
+        max_iterations = state.get("max_iterations", 5)
+        prompt = f"""
+            <|system|>
+            You are an excellent and concise summary maker.
+            <|end|>
+            <|user|>
+            Summarize in bullet points the conversation below.
+            Follow this format: input - answer
+            Conversation: {full_context}
+            <|end|>
+            <|assistant|>"""
+        summary = GENERATIVE_MODEL.invoke(prompt).strip() # central task
+        if not summary or not isinstance(summary, str) or summary.strip() == "":
+            summary = f"- {state['input']} - No valid answer generated"
+        should_continue="end" if result_count >= target or \
+            iterations >= max_iterations else "retrieve"
+        updated_history = existing_history + [summary] # update chat_history
+        print(f"\nChat history in summarize: {updated_history}")
+        return {"input": state["input"],
+                "answer": summary,
+                "should_continue": should_continue,
+                "context": state.get("context", []),
+                "digested_context": state.get("digested_context", []),
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "result_count": result_count,
+                "target": target,
+                "chat_history": updated_history,
+                "seen_documents": state.get("seen_documents", [])}
+
+    def check_relevance(self, state:State) -> dict:
+        # Define node to check relevance of retrieved data
+        context = "\n".join(state.get("digested_context", []))
+        result_count = state.get("result_count", 0)
+        target = state.get("target", 3)
+        iterations = state.get("iterations", 0)
+        max_iterations = state.get("max_iterations", 5)
+        seen_documents = state.get("seen_documents", [])
+        prompt = f"""
+            <|system|>
+            You are an expert in evaluating data relevance. You do it seriously.
+            <|end|>
+            <|user|>
+            Assess if the provided answer is relevant to the query.
+            Return only yes or no. Nothing else.
+            Answer: {state["answer"]}
+            Query: {state["input"]}
+            Context: {context}
+            <|end|>
+            <|assistant|>"""
+        assessment = GENERATIVE_MODEL.invoke(prompt).strip()
+        if assessment=="yes":
+            result_count = result_count + 1
+            should_continue = "summarize"
+        elif result_count >= target or iterations >= max_iterations:
+            should_continue = "summarize"
+        else:
+            should_continue = "retrieve"
+            seen_documents.extend([doc.page_content for doc in \
+                state.get("context", [])])
+        return {"input": state["input"],
+                "context": state.get("context", []),
+                "digested_context": state.get("digested_context", []),
+                "iterations": iterations,
+                "max_iterations": max_iterations,
+                "answer": state["answer"],
+                "result_count": result_count,
+                "target": target,
+                "seen_documents": seen_documents,
+                "chat_history": state.get("chat_history", []),
+                "should_continue": should_continue}
+        
+    def route_manage(self, state: State) -> str:
+            should_continue = state.get("should_continue", "retrieve")
+            iterations = state.get("iterations", 0)
+            max_iterations = state.get("max_iterations", 5)
+            result_count = state.get("result_count", 0)
+            target = state.get("target", 3)
+            context = state.get("context", [])
+            digested_context = state.get("digested_context", [])
+            answer = state.get("answer", "")
+            # Validate state and enforce termination
+            if iterations >= max_iterations or result_count >= target:
+                return "summarize"
+            if should_continue not in ["retrieve", "naturalize", \
+                    "check_relevance", "analyze", "summarize"]:
+                return "summarize"  # Fallback to summarize
+            return should_continue
+
+    def initialize_langgraph_chain(self) -> Any:
+        graph_builder = StateGraph(State)
+        graph_builder.add_node("manage", self.manage)
+        graph_builder.add_node("retrieve", self.retrieve)
+        graph_builder.add_node("naturalize", self.naturalize)
+        graph_builder.add_node("check_relevance", self.check_relevance)
+        graph_builder.add_node("analyze", self.analyze)
+        graph_builder.add_node("summarize", self.summarize)
+        graph_builder.add_edge(START, "manage")
+        graph_builder.add_edge("retrieve", "naturalize")
+        graph_builder.add_edge("naturalize", "analyze")
+        graph_builder.add_edge("analyze", "check_relevance")
+        graph_builder.add_edge("check_relevance", "manage")
+        graph_builder.add_edge("summarize", END)
+        graph_builder.add_conditional_edges(
+            "manage",
+            self.route_manage,
+            {"retrieve": "retrieve",
+             "naturalize": "naturalize",
+             "check_relevance": "check_relevance",
+             "analyze": "analyze",
+             "summarize": "summarize"})
+        graph=graph_builder.compile()
+        return graph
+
+    async def invoke_langgraph(self, question: str) -> Any:
+        graph = self.initialize_langgraph_chain()
+        initial_state = {
+            "input": question,
+            "chat_history": [],
+            "context": [],
+            "digested_context": [],
+            "seen_documents": [],
+            "answer": "",
+            "iterations": 0,
+            "result_count": 0,
+            "should_continue": "retrieve",
+            "target": 3,  # Explain magic number 3
+            "max_iterations": 5 # Explain magic number 5
+        }
+        result = await graph.ainvoke(initial_state) # Run graph asynchronously
+        return result
+
+    
+    def answer_question(self, question: str) -> Any:
+        start = time.time()
+        result = asyncio.run(self.invoke_langgraph(question))
+        end = time.time()
+        print(f'answer_question: {end-start}')
+        return {"result": result["chat_history"],
+                "state": result}
+```
+
+As mentioned above, we quickly spotted the need for the naturalization of RDF triples. This explains the addition of a naturalization node to the graph:
+
+```
+def naturalize(self, state: State) -> dict:
+        # Define graph node for RDF naturalization
+        prompt = f"""
+        <|im_start|>system
+        You are extremely good at naturalizing RDF and inferring meaning.
+        <|im_end|>
+        <|im_start|>user
+        Take element in the list of RDF triples one by one and
+        make it sounds like Plain English. Repeat for each the subject
+        which is at the start. You should return a list. Nothing else.
+        List: ["Entity http://genenetwork.org/id/traitBxd_20537 \
+        \nhas http://purl.org/dc/terms/isReferencedBy of \
+        http://genenetwork.org/id/unpublished22893", "has \
+        http://genenetwork.org/term/locus of \
+        http://genenetwork.org/id/Rsm10000002554"]
+        <|im_end|>
+        <|im_start|>assistant
+        New list: ["traitBxd_20537 isReferencedBy unpublished22893", \
+        "traitBxd_20537 has a locus Rsm10000002554"]
+        <|im_end|>
+        <|im_start|>user
+        Take element in the list of RDF triples one by one and
+        make it sounds like Plain English. Repeat for each the subject
+        which is at the start. You should return a list. Nothing else.
+        List: {state.get("context", [])}
+        <|im_end|>
+        <|im_start|>assistant"""
+        response = GENERATIVE_MODEL.invoke(prompt)
+        print(f"Response in naturalize: {response}")
+        if isinstance(response, str):
+            start=response.find("[")
+            end=response.rfind("]") + 1 # offset by 1 to make slicing
+            response=json.loads(response[start:end])
+        else:
+            response=[]
+        return {"input": state["input"],
+                "context": state.get("context", []),
+                "digested_context": response,
+                "result_count": state.get("result_count", 0),
+                "target": state.get("target", 3),
+                "max_iterations": state.get("max_iterations", 5),
+                "should_continue": "analyze",
+                "iterations": state.get("iterations", 0),
+                "chat_history": state.get("chat_history", []),
+                "answer": state.get("answer", ""),
+                "seen_documents": state.get("seen_documents", [])}
+```
+
+Next step is to compare the RAG system (rag_langchain.py) to the agent system (rag_langgraph.py) on some queries.
+
+* Query 1
+
+```
+Question: What is the lod score of BXDPublish_10187_gemma_GWA at D12mit280?
+
+RAG response: 3.01245
+
+Agent response: 5.21
+```
+
+Hmm, the result of the RAG system is dubious. I did not expect such a long floating-point number for a LOD score. Fortunately, Pjotr's endpoint is useful for making quick queries against the RDF data. By the way, one could see how complementary SPARQL queries and LLM calls are. It was very thoughtful to go through RDF :)
+
+Using the endpoint, I fetched all LOD scores for BXDPublish_10187_gemma_GWA at D12mit280, and only the result of the agent system turned out to be correct.
+
+* Query 2
+
+```
+Question: I want you look for D12mit280. You are allowed to initiate many rounds of retrieval until you get 20 relevant results. Next, extract the lod score and trait for each result. List for me traits that have a lod score > 4.0. Join to the list the corresponding lod scores so I can confirm. Show results using the following format: trait - lod score
+
+RAG response: It seems there's a misunderstanding. The provided data doesn't include any results related to "D12mit280". I can only provide information or perform tasks based on the given context and data. If you have another query or need clarification on something else, feel free to ask.
+
+Agent response: It appears there are no results for D12mit280 with a lod score > 4.0 in the provided entities. If you have more data or different criteria, please let me know so I can assist further.
+
+```
+
+This was surprising. My hypothesis was that the node in charge of retrieval was not fetching relevant documents for the LLM to produce a sound answer. There is only one way to confirm: check the documents retrieved.
+
+Printing the documents after retrieval showed that I was actually right. I also noticed a couple of other problems in the execution of the program. Nodes were mostly not executed in the order I expected. Consequently, I decided to get into a long quest of incremental improvements :)
+
+### Optimize AI system
+
+A couple of things needed to be optimized. This included retrieval, node orchestration and GPU acceleration.
+
+* Retrieval
+
+Let's start with retrieval. I played with different parameters of the retriever. It was an EnsembleRetriever using both keyword and semantic search as illustrated below:
+```
+ensemble_retriever = EnsembleRetriever(
+            retrievers=[
+                self.chroma_db.as_retriever(search_kwargs={"k": 10}),
+                bm25_retriever,
+            ],
+            weights=[0.4, 0.6],
+        )
+```
+I arrived at this selection by trying different combinations of weights. More rigorous work is needed to systematically identify the best hyperparameters for retrieval.
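+
+To build some intuition for what the weights do, here is a minimal sketch of weighted reciprocal rank fusion, which is, to my understanding, roughly how an ensemble retriever merges the ranked lists of its members. It is an illustration in plain Python, not LangChain's actual code: the function name weighted_rrf, the constant c=60 and the document ids are assumptions made for the example.
+
+```python
+from collections import defaultdict
+
+
+def weighted_rrf(rankings, weights, c=60):
+    """Merge ranked lists of document ids with weighted reciprocal rank fusion."""
+    scores = defaultdict(float)
+    for ranking, weight in zip(rankings, weights):
+        for rank, doc in enumerate(ranking, start=1):
+            # A document scores higher when it sits near the top of a heavily weighted list
+            scores[doc] += weight / (rank + c)
+    return sorted(scores, key=lambda doc: scores[doc], reverse=True)
+
+
+# Hypothetical rankings returned by the semantic and the keyword retriever
+semantic = ["doc_a", "doc_b", "doc_c"]
+keyword = ["doc_c", "doc_a", "doc_d"]
+
+ranked = weighted_rrf([semantic, keyword], weights=[0.4, 0.6])
+print(ranked)
+```
+
+With these toy rankings the keyword list (weight 0.6) pulls doc_d above doc_b, while doc_a, ranked high by both retrievers, stays on top. Sweeping the weights over a grid, using a set of queries with known answers, would be one way to tune them systematically.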
+
+* Node orchestration
+
+Moving to node orchestration. It took me some time and reflection to realize that the nodes I had at the time only make sense when executed sequentially. Analysis (analyze node) should always be followed by relevance checking (check_relevance node) and finally summarization (summarize node), in that order. Any other sequence of execution is not useful. I had to modify the code to enforce this and to avoid unnecessary loops :)
+
+But this also highlighted other limitations of the system: lack of flexibility and lack of autonomy.
+
+To address the lack of flexibility, I introduced a new node to split a query into multiple queries that can be solved independently and asynchronously. The node split_query works as follows:
+```
+def split_query(self, query: str) -> list[str]:
+
+        prompt = f"""
+            <|im_start|>system
+            You are a very powerful task generator.
+        
+            Split the query into task and context based on tags.
+            Based on the context, ask relevant questions that help achieve the task. Make sure the subquestions are atomic and do not rely on each other.
+            Return only the subquestions.
+            Return strictly a JSON list of strings, nothing else.
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            Task: Identify traits with a lod score > 3.0 for the marker Rsm10000011643. Tell me what marker Rsm10000011643 is involved in biology.
+            Context: A trait has a long name and contain generally strings like GWA or GEMMA. The goal is to know the biological processes which might be related to the marker previously mentioned.
+        
+            Result:
+            <|im_end|>
+            <|im_start|>assistant
+            ["What traits (containing GWA or GEMMA) have a lod score > 3.0 at Rsm10000011643?", "Which biological processes are related to Rsm10000011643?"]
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            {query}
+            Result:
+            <|im_end|>
+            <|im_start|>assistant"""
+
+        with self.generative_lock:
+            response = GENERATIVE_MODEL.invoke(prompt)
+        print(f"Subqueries in split_query: {response}")
+
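+        # Parse the JSON list embedded in the response.
+        # Fall back to the original query if no valid list is found.
+        subqueries = [query]
+        if isinstance(response, str):
+            start = response.find("[")
+            end = response.rfind("]") + 1
+            try:
+                subqueries = json.loads(response[start:end])
+            except json.JSONDecodeError:
+                pass
+
+        return subqueries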
+
+```
+
+Another node is needed to reconcile the answers generated for each subquery. This motivated the addition of the node finalize:
+```
+def finalize(self, query: str, subqueries: list[str], answers: list[str]) -> str:
+
+        prompt = f"""
+            <|im_start|>system
+            You are an experienced biology scientist. Given the subqueries and corresponding answers, generate a comprehensive explanation to address the query using all information provided.
+            Ensure the response is insightful, concise, and draws logical inferences where possible.
+            Do not modify entities names such as trait and marker.            
+            Make sure to link based on what is common in the answers.
+            Provide only the story, nothing else.
+            Do not repeat answers. Use only 200 words max.
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            Identify two traits related to diabetes.
+            Compare their lod scores at Rsm149505.
+            Subqueries:
+            ["Identify two traits related to diabetes",
+            "Compare lod scores of same traits at Rsm149505"]
+            Answers:
+            ["Traits A and B are related to diabetes", \
+            "The lod score at Rsm149505 is 2.3 and 3.4 for trait A and B"]
+            Conclusion:
+            <|im_end|>
+            <|im_start|>assistant
+            Traits A and B are related to diabetes and have a lod score of\
+            2.3 and 3.4 at Rsm149505. The two traits could interact via a\
+            gene close to the marker Rsm149505.
+            <|im_end|>
+            <|im_start|>user
+            Query:
+            {query}
+            Subqueries:
+            {subqueries}
+            Answers:
+            {answers}
+            Conclusion:
+            <|im_end|>
+            <|im_start|>assistant"""
+        with self.generative_lock:
+            response = GENERATIVE_MODEL.invoke(prompt)
+        print(f"Response in finalize: {response}")
+
+        # Fall back to an apology when the model returns nothing
+        final_answer = (
+            response
+            if response
+            else "Sorry, we are unable to provide overall feedback "
+            "due to lack of relevant data."
+        )
+
+        return final_answer
+```
+
+The system can now take a multi-faceted query, split it into multiple subqueries, and address each of them asynchronously by running retrieval, analysis, check_relevance and summarize in sequence. The results are combined at the end before giving feedback to the user.
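+
+The flow above can be sketched in a few lines of plain Python. This is only a sketch under assumptions: run_subqueries, split, solve and combine are hypothetical names standing in for the orchestration, split_query, the per-subquery pipeline and finalize, and a thread pool is just one way to run the subqueries concurrently.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+
+def run_subqueries(query, split, solve, combine, max_workers=4):
+    """Split a query, solve the subqueries concurrently, then combine the answers."""
+    subqueries = split(query)
+    with ThreadPoolExecutor(max_workers=max_workers) as pool:
+        # pool.map returns the answers in the same order as the subqueries
+        answers = list(pool.map(solve, subqueries))
+    return combine(query, subqueries, answers)
+
+
+# Toy stand-ins so that the sketch runs without an LLM
+def toy_split(query):
+    return [part.strip() for part in query.split(".") if part.strip()]
+
+
+def toy_solve(subquery):
+    return f"answer({subquery})"
+
+
+def toy_combine(query, subqueries, answers):
+    return " | ".join(answers)
+
+
+result = run_subqueries(
+    "Find traits. Compare lod scores.", toy_split, toy_solve, toy_combine
+)
+print(result)
+```
+
+Because the subqueries are atomic and do not rely on each other, the order in which they finish does not matter; only the order of the collected answers has to match the order of the subqueries.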
+
+Time to make the system really agentic. So far it is not truly agentic because of its lack of autonomy! An agentic system requires access to many tools and a core LLM that can reason on its own about the sequence of tools to call in order to solve a problem. This sounds scary, but it is not if well designed :) I was also planning to add some safeguards to prevent infinite looping that could consume a lot of tokens very quickly.
+
+What I did was to register the graph I had so far as a subgraph of a bigger graph (the real AI system). This arm of the AI system is called researcher and has the following definition:
+```
+def researcher(self, state: AgentState) -> Any:
+        if len(state.messages) < 3:
+            input = state.messages[0]
+        else:
+            input = state.messages[-1]
+        input = input.content
+        logging.info(f"Input in researcher: {input}")
+        start = time.time()
+        result = self.manage_subtasks(input)
+        elapsed = time.time() - start
+        logging.info(f"Result in researcher (after {elapsed:.2f} s): {result}")
+
+        return {
+            "messages": [result],
+        }
+```
+
+I also designed a planner, reflector and supervisor that the system can use. As the name indicates, the planner helps with planning the steps to take to solve the problem. The reflector provides feedback and helps improve the output of the researcher. The supervisor is the core handler. It manages interactions between the planner, researcher and reflector.
+
+You can inspect the code for the planner, reflector and supervisor below:
+```
+def planner(self, state: AgentState) -> Any:
+    input = [self.plan_system_prompt] + state.messages
+    result = plan(background=input)
+    answer = result.get("answer")
+    return {
+            "messages": [answer],
+        }
+
+def reflector(self, state: AgentState) -> Any:
+    trans_map = {AIMessage: HumanMessage, HumanMessage: AIMessage}
+    translated_messages = [self.refl_system_prompt, state.messages[0]] + [
+    trans_map[msg.__class__](content=msg.content) for msg in state.messages[1:]
+        ]
+    result = tune(background=translated_messages)
+    answer = result.get("answer")
+    answer = (
+            "Progress has been made. Use now all the resources to address this new suggestion: "
+            + answer
+        )
+    return {
+            "messages": [HumanMessage(answer)],
+        }
+
+def supervisor(self, state: AgentState) -> Any:
+    messages = [
+            ("system", self.sup_system_prompt1),
+            *state.messages,
+            ("system", self.sup_system_prompt2),
+        ]
+
+    if len(messages) > self.max_global_visits:
+       return {"next": "end"}
+
+    result = supervise(background=messages)
+    next = result.get("next")
+
+    return {
+            "next": next,
+        }
+```
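+
+The safeguard in the supervisor can be illustrated outside the graph framework with a plain loop. This is a simplified sketch with assumed names: run_agents and the toy agents are hypothetical, and the message list holds plain strings rather than message objects. It only shows why the visit limit guarantees termination even when the supervisor never answers "end".
+
+```python
+from itertools import cycle
+
+
+def run_agents(task, supervise, agents, max_global_visits=10):
+    """Let a supervisor route between agents until it says "end" or the limit is hit."""
+    messages = [task]
+    while len(messages) <= max_global_visits:
+        choice = supervise(messages)
+        if choice == "end":
+            break
+        # The chosen agent sees the whole history and appends its contribution
+        messages.append(agents[choice](messages))
+    return messages
+
+
+# Toy agents and a supervisor that would loop forever without the limit
+agents = {
+    "planner": lambda msgs: "plan",
+    "researcher": lambda msgs: "findings",
+    "reflector": lambda msgs: "critique",
+}
+order = cycle(["planner", "researcher", "reflector"])
+
+history = run_agents("question", lambda msgs: next(order), agents, max_global_visits=5)
+print(len(history), history)
+```
+
+The limit is expressed in number of messages, as in the supervisor above, so it bounds the total work of all agents together rather than the visits to any single one.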
+
+* GPU acceleration
+
+The last point is GPU acceleration. Pjotr installed a GPU on balg01 to allow for acceleration. You can check out the details here:
+=> https://issues.genenetwork.org/topics/systems/linux/GPU-on-balg01
+
+The GPU is automatically used for LLM-related work, so at first I simply started using it. Later, I learnt about SGLang, which allows for deployment of an LLM server and even faster inference. Code for deployment of the server is here:
+=> https://github.com/johanmed/gn-rag/blob/543a7835f5620a541cdb679b852c91e62bca2698/src/agent_system/config.sh
+
+With DSPy, I could switch freely between models, closed or open. Consequently, I added support for DSPy. For details, check out the following commit:
+=> https://github.com/johanmed/gn-rag/commit/ec0d8ffc174cca0ccf32cb98d82ebdc7106b4ac2
+
+Small gotcha: for locally served models using SGLang, not all open models can be run given the VRAM (the GPU's RAM) constraint. It took me some experiments to find workable models that are finetuned for instruction following and have decent performance. At the time of writing, I am working with Qwen/Qwen2.5-7B-Instruct accessed via HuggingFace. This is the LLM. There is also an embedding model, but I have not added GPU acceleration support for it, so as to keep memory usage manageable. We have limited resources for now :)
+
+I also performed a series of refactoring and formatting passes to improve the readability of the source code. Find it here:
+=> https://github.com/johanmed/gn-rag/tree/main/src
+
+### Scale analysis to more data
diff --git a/topics/data/epochs.gmi b/topics/data/epochs.gmi
new file mode 100644
index 0000000..3e8b676
--- /dev/null
+++ b/topics/data/epochs.gmi
@@ -0,0 +1,153 @@
+# Epochs
+
+In the 2019 BXD paper epochs are brought up. Basically, even though the BXD are 'immortal' with identical children, mutations do creep in. An epoch is a period of mice and we track the years a mouse was used. So BXD1 breeding started in 1971 and production in 2001. In GN we don't make a distinction (per se), but obviously these are (slightly) different mice today. Ashbrook et al. find some interesting results that differ between epochs.
+
+In GN epochs are currently handled as a trait. This can help with covariate mapping. For a different epoch, however, the genotypes should also be adapted. The effect on the kinship matrix will be minor, but genotypes can be used for fine mapping. With pangenome derived genotypes it should get even more interesting.
+
+# Fetching data
+
+The epochs are tracked in a spreadsheet. According to track changes, only one item was changed in two years: BXD10 was marked as extinct.
+
+In the GN SQL database Epoch with its RRID is stored as a CaseAttribute:
+
+```
+MariaDB [db_webqtl]> select * from CaseAttribute LIMIT 3;
++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| InbredSetId | CaseAttributeId | Name   | Description                                                                                                                                                 |
++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|           1 |               1 | Status | Live= Available at JAX, Cryo=Cryopreserved only, Extinct                                                                                                    |
+|           1 |              36 | RRID   | Research resource identifier given by SciCrunch.org                                                                                                         |
+|           1 |              37 | Epoch  | BXD family subgroups. Each number with common parents. Epoch1(BXD1-32), Epoch2-6 (BXD33-220). See Ashbrook et al. https://pubmed.ncbi.nlm.nih.gov/33472028/ |
++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+```
+
+And
+
+```
+MariaDB [db_webqtl]> select * from CaseAttributeXRefNew LIMIT 40;
++-------------+----------+-----------------+------------+
+| InbredSetId | StrainId | CaseAttributeId | Value      |
++-------------+----------+-----------------+------------+
+|           1 |        1 |               1 | Live       |
+|           1 |        1 |              36 | JAX:100006 |
+|           1 |        1 |              37 | 0          |
+|           1 |        1 |              40 |            |
+|           1 |        2 |               1 | Live       |
+|           1 |        2 |              36 | JAX:000664 |
+|           1 |        2 |              37 | 0          |
+|           1 |        2 |              40 | 69         |
+|           1 |        3 |               1 | Live       |
+|           1 |        3 |              36 | JAX:000671 |
+|           1 |        3 |              37 | 0          |
+|           1 |        3 |              40 | 108        |
+|           1 |        4 |               1 | Live
+```
+
+I am not going to comment on this table architecture, other than that RDF is a much better fit.
+
+For extracting this data, the SQL table is probably the best source of 'truth' as it is seen by users on a regular basis. But, at this point, we'll just use the spreadsheet. Generating something like:
+
+```
+gn:Bxd14
+                                dct:description "BXD014/TyJ" ;
+                                gnt:epoch 1 ;
+                                gnt:availability "Cryorecovery" ;
+                                gnt:method "B6 female to D2 male F2 intercross" ;
+                                gnt:M_origin "B6" ;
+                                gnt:Y_origin "D2" ;
+                                gnt:JAX "000329" ;
+                                gnt:start_year 1971 ;
+                                gnt:age_seq_ind 271 ;
+                                gnt:birth_seq_ind "2/18/2016" ;
+                                gnt:availability_2023 "Cryorecovery" ;
+                                gnt:has_genotypes true ;
+                                rdfs:label "BXD14" .
+gn:Bxd65
+                                dct:description "BXD065/RwwJ" ;
+                                gnt:epoch 3 ;
+                                gnt:availability "Available" ;
+                                gnt:method "Advanced intercross progeny of B6 female to D2 male" ;
+                                gnt:M_origin "B6" ;
+                                gnt:Y_origin "D2" ;
+                                gnt:JAX "007110" ;
+                                gnt:start_year 1999 ;
+                                gnt:age_seq_ind 46 ;
+                                gnt:birth_seq_ind "9/18/2016" ;
+                                gnt:availability_2023 "Available" ;
+                                gnt:has_genotypes true ;
+                                rdfs:label "BXD65" .
+etc.
+```
+
+# Approach
+
+## Fetching data
+
+To get at the epochs we'll need to fetch the sample/ind names (such as BXD73b) from GN.
+
+For every dataset we can fetch samples+values with
+
+```
+curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+{"BXD40":-1.631969,"BXD68":-2.721761,"BXD43":-2.290135,"BXD44":-2.512057,"BXD48":-3.128819 ...
+```
+
+These are also stored in the pangemma output lmdb files. We don't want to store all values in RDF as these are only used for compute and can be easily fetched on demand from GN. We do want to access the sample names, but that is a list that is not necessarily unique to a single trait. In fact a trait should be referencing an experiment/dataset that has the samples/inds. Usually they will use the same animals. To not complicate things we'll just point to the samples with something like
+
+```
+traitid gn:sample gn:BXD40 .
+```
+
+Currently RDF contains
+
+```
+gn:Bxd12 rdfs:label "BXD12" .
+gn:Bxd12 rdf:type gnc:strain .
+gn:Bxd12 gnt:belongsToSpecies gn:Mus_musculus .
+```
+
+and traits have
+
+```
+gn:traitBxd_10002 rdf:type gnc:Phenotype .
+gn:traitBxd_10002 gnt:belongsToGroup gn:setBxd .
+gn:traitBxd_10002 gnt:traitId "10002" .
+gn:traitBxd_10002 skos:altLabel "BXD_10002" .
+gn:traitBxd_10002 dct:description "Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]" .
+gn:traitBxd_10002 gnt:abbreviation "ADJCBLWT" .
+gn:traitBxd_10002 gnt:submitter "robwilliams" .
+gn:traitBxd_10002 gnt:mean "52.22058767430923"^^xsd:double .
+gn:traitBxd_10002 gnt:locus gn:Rsm10000005699 .
+gn:traitBxd_10002 gnt:lodScore "4.779380894726979"^^xsd:double .
+gn:traitBxd_10002 gnt:additive "2.0817857571428617"^^xsd:double .
+gn:traitBxd_10002 gnt:sequence "1"^^xsd:integer .
+gn:traitBxd_10002 dct:isReferencedBy pubmed:11438585 .
+```
+
+Ignore the capitalization and some naming: gnc:strain should be gnc:sample, and we'll fix that. But for now we can find some trait info and we can link the individuals up with a trait.
+
+The query we want to write is something like
+
+```
+SELECT * WHERE {
+  ?traitid a gnc:Phenotype;
+  gnt:traitId "10002" ;
+  gnt:belongsToGroup gn:setBxd ;
+  gnt:traitId ?trait ;
+  dct:isReferencedBy ?pubmed .
+  OPTIONAL {
+    ?traitid dct:description ?descr ;
+    gnt:sample_id ?sampleid .
+    ?sampleid rdfs:label ?sample .
+    }
+} LIMIT 10
+```
+
+So, for every trait/sample combination we need to add
+
+```
+gn:traitBxd_10002 gnt:sample_id gn:Bxd12 .
+```
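+
+A minimal sketch of how these triples could be generated from the values JSON shown earlier. It assumes that the sample URI is the sample name with only the first letter capitalized (BXD40 becomes gn:Bxd40), which matches the examples above but has not been checked for all strains; the function name sample_triples is made up for illustration.
+
+```python
+import json
+
+
+def sample_triples(trait: str, values_json: str) -> list[str]:
+    """Return one gnt:sample_id triple per sample found in the values JSON."""
+    values = json.loads(values_json)
+    # Only the sample names (the keys) are needed; the values stay in GN
+    return [
+        f"{trait} gnt:sample_id gn:{name.capitalize()} ."
+        for name in sorted(values)
+    ]
+
+
+# Excerpt of the output of the values endpoint for one trait
+payload = '{"BXD40": -1.631969, "BXD68": -2.721761, "BXD43": -2.290135}'
+
+for triple in sample_triples("gn:traitBxd_10002", payload):
+    print(triple)
+```
+
+The phenotype values themselves are deliberately dropped, in line with the decision above to keep them out of RDF and fetch them on demand.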
diff --git a/topics/data/precompute/steps.gmi b/topics/data/precompute/steps.gmi
index d22778a..ac03d1a 100644
--- a/topics/data/precompute/steps.gmi
+++ b/topics/data/precompute/steps.gmi
@@ -2,7 +2,8 @@
 
 At this stage precompute fetches a trait from the DB and runs GEMMA. Next it tar balls up the vector for later use. It also updates the database with the latest info.
 
-To actually kick off compute on machines that do not access the DB I realize now we need a step-wise approach. Basically you want to shift files around without connecting to a DB. And then update the DB whenever it is convenient. So we are going to make it a multi-step procedure. I don't have to write all code because we have a working runner. I just need to chunk the work.
+To actually kick off compute on machines that do not access the DB I realize now we need a step-wise approach. Basically you want to shift files around without connecting to a DB. And then update the DB whenever it is convenient. So we are going to make it a multi-step procedure.
+We need to chunk the work.
 
 We will track precompute steps here. We will have:
 
diff --git a/topics/deploy/our-virtuoso-instances.gmi b/topics/deploy/our-virtuoso-instances.gmi
index 0336018..3ac56ae 100644
--- a/topics/deploy/our-virtuoso-instances.gmi
+++ b/topics/deploy/our-virtuoso-instances.gmi
@@ -9,6 +9,8 @@ We run three instances of virtuoso.
 The public SPARQL endpoint is accessible at
 => https://sparql.genenetwork.org/sparql
 
+These are now generally run as part of genenetwork2 containers(!)
+
 ## Configuration
 
 All our virtuoso instances are deployed in Guix system containers. The configuration for these containers is at
diff --git a/topics/deploy/setting-up-or-migrating-production-across-machines.gmi b/topics/deploy/setting-up-or-migrating-production-across-machines.gmi
index 1f35dae..631a000 100644
--- a/topics/deploy/setting-up-or-migrating-production-across-machines.gmi
+++ b/topics/deploy/setting-up-or-migrating-production-across-machines.gmi
@@ -15,24 +15,168 @@ Recent events (Late 2024 and early 2025) have led to us needing to move the prod
 
 In this respect, a number of tasks rise to the front as necessary to accomplish for a successful migration. Each of the following sections will detail a task that's necessary for a successful migration.
 
+## Copy Over Auth Database
+
+We need to synchronise the authorisation database. We can copy this over from the production system, or from the backups.
+
+* TODO: Indicate where the backups for the auth database are here!
+
+Steps (flesh out better):
+
+* Extract backup (or copy from existing production system)
+* Stop the (new) container (if it's running)
+* Backup the (new) container's auth-db file (
+* Place the auth db file in the correct place in the container's filesystem:
+* Backup existing secrets
+* Login to the `/auth/admin/dashboard` of the auth server (e.g. https://cd.genenetwork.org/auth/admin/dashboard)
+* If a client with the CLIENT_ID in the secrets file exists:
+* 1. Update the URIs for that client.
+* 2. Click on the "Change Secret" button and generate a new secret. Replace the secret in the secrets file with the newly generated secret.
+* If a client with the CLIENT_ID in the secrets file DOES NOT exist, register a new client, setting up the appropriate URIs and endpoints, and then add/replace both the CLIENT_ID and CLIENT_SECRET in the secrets file.
+* Restart (new) container
+
 ## Set Up the Database
 
-* Extract: detail this — link to existing document in this repo. Also, probably note that we symlink the extraction back to `/var/lib/mysql`?
-* Configure: detail this — link to existing document in this repo
+=> /topics/systems/restore-backups Extract the latest database from the backups.
+=> /topics/deploy/installation Configure MariaDB according to this document.
 
 ## Set Up the File System
 
 * TODO: List the necessary directories and describe what purpose each serves. This will be from the perspective of the container — actual paths on the host system are left to the builders choice, and can vary wildly.
 * TODO: Prefer explicit binding rather than implicit — makes the shell scripts longer, but no assumptions have to be made, everything is explicitly spelled out.
 
+The container(s) need access to various files and directories from the host system in order to work correctly.
+
+Filesystem bindings could be linked to wildly different paths on different physical host machines. Therefore, we shall examine the bindings from the point of view of the paths within the container, rather than forcing a particular file system layout on the host systems themselves.
+
+Each of the sections below details a specific binding:
+
+### /var/genenetwork
+
+This binding must be READWRITE within the container.
+
+The purpose is to hold varying files that are specific to the genenetwork system(s). Examples of the files are:
+
+* "gn-meta" and "synteny" files for GN3
+* genotype files
+* session files for various systems (GN2, gn-uploader, etc.)
+
+### /var/lib/acme
+
+This binding must be READWRITE within the container.
+
+This is used to store TLS certificates for the various services within the container by the ACME (Automatic Certificate Management Environment) script.
+
+### /var/lib/redis
+
+This binding must be READWRITE within the container.
+
+This is used by the redis daemon to persist its state(s).
+
+### /var/lib/virtuoso
+
+This binding must be READWRITE within the container.
+
+Used by the virtuoso daemon to save its state, and maybe some log files.
+
+### /export/data/virtuoso/
+
+This binding must be READONLY within the container. (Really?)
+
+This is used for importing data into virtuoso, say by sharing Turtle (TTL) files within the binding.
+
+---
+At this point the binding is READONLY because any TTL files to load are imported from outside the container. If the transformation of data from MariaDB to TTL form is built into the production containers down the line, then this might change to READWRITE to allow the transformation tool to write to it.
+
+### /var/log
+
+This binding must be READWRITE within the container.
+
+Allows logs from the various services running in the container to be accessible on the host system. This is useful for debugging issues with the running systems.
+
+### /etc/genenetwork
+
+This binding must be READWRITE within the container.
+
+Useful for storing various configuration files/data for the service(s) running inside the running container.
+
+### /var/lib/xapian
+
+This binding must be READWRITE within the container.
+
+Stores the processed search indexes for the xapian search system.
+
+### /var/lib/genenetwork/sqlite/gn-auth
+
+This binding must be READWRITE within the container.
+
+The authorisation database is stored here. The directory needs to be writable to avoid permissions issues within the container when attempting to write data into the database.
+
+### /var/lib/genenetwork/sqlite/genenetwork3
+
+This binding must be READWRITE within the container.
+
+This stores various SQLite databases in use with GN3. These are:
+
+* Database for the GNQA system
+* ...
+
+### /run/mysqld
+
+This binding must be READWRITE within the container.
+
+This binding is the link to the host directory that holds the socket file for the running MariaDB instance.
+
+### /opt/gn/tmp
+
+This binding must be READWRITE within the container.
+
+Holds temporary files for the various services that run within the container. Some of the generated files from various services are also stored here.
+
+**PROPOSAL**: Move all generated files here, or have a dedicated directory for holding generated files?
+
+
+### /var/genenetwork/sessions
+
+This binding must be READWRITE within the container.
+
+Holds session files for various services within the container. See also the /var/genenetwork binding.
+
+### /var/lib/genenetwork/uploader
+
+This binding must be READWRITE within the container.
+
+**gn-uploader** specific data files. Types of data files that could go here are:
+
+* File uploads
+* (Reusable) Cache files and generated files
+* ... others?
+
+### /var/lib/genenetwork/sqlite/gn-uploader
+
+This binding must be READWRITE within the container.
+
+Holds various SQLite databases used with the **gn-uploader** service, e.g.:
+
+* Background jobs database
+* ...
+
+### /var/lib/genenetwork/gn-guile
+
+This binding must be READWRITE within the container.
+
+Various data files for the **gn-guile** service, such as:
+
+* The bare **gn-docs** repository (Previously bound at `/export/data/gn-docs`: now deprecated).
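+
+Before starting a freshly migrated container, the bindings listed above could be sanity-checked with a small script along these lines. This is a hedged sketch, not part of the deployment tooling: check_bindings and the BINDINGS table are hypothetical names, and the check only looks at whether each directory exists and whether the READWRITE ones are writable by the current user.
+
+```python
+import os
+
+# Container paths from the sections above: "rw" is READWRITE, "ro" is READONLY
+BINDINGS = {
+    "/var/genenetwork": "rw",
+    "/var/lib/acme": "rw",
+    "/var/lib/redis": "rw",
+    "/var/lib/virtuoso": "rw",
+    "/export/data/virtuoso/": "ro",
+    "/var/log": "rw",
+    "/etc/genenetwork": "rw",
+    "/var/lib/xapian": "rw",
+    "/var/lib/genenetwork/sqlite/gn-auth": "rw",
+    "/var/lib/genenetwork/sqlite/genenetwork3": "rw",
+    "/run/mysqld": "rw",
+    "/opt/gn/tmp": "rw",
+    "/var/genenetwork/sessions": "rw",
+    "/var/lib/genenetwork/uploader": "rw",
+    "/var/lib/genenetwork/sqlite/gn-uploader": "rw",
+    "/var/lib/genenetwork/gn-guile": "rw",
+}
+
+
+def check_bindings(bindings, root="/"):
+    """Return a list of problems found for the given bindings under root."""
+    problems = []
+    for path, mode in bindings.items():
+        full = os.path.join(root, path.lstrip("/"))
+        if not os.path.isdir(full):
+            problems.append(f"{path}: missing")
+        elif mode == "rw" and not os.access(full, os.W_OK):
+            problems.append(f"{path}: not writable")
+    return problems
+
+
+if __name__ == "__main__":
+    for problem in check_bindings(BINDINGS):
+        print(problem)
+```
+
+Run inside the container, an empty output means that every binding is present and the READWRITE ones accept writes. The root parameter is only there to make the check testable against a scratch directory.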
+
 ## Redis
 
 We currently (2025-06-11) use Redis for:
 
-- Tracking user collection (this will be moved to SQLite database)
-- Tracking background jobs (this is being moved out to SQLite databases)
-- Tracking running-time (not sure what this is about)
-- Others?
+* Tracking user collection (this will be moved to SQLite database)
+* Tracking background jobs (this is being moved out to SQLite databases)
+* Tracking running-time (not sure what this is about)
+* Others?
 
 We do need to copy over the redis save file whenever we do a migration, at least until the user collections and background jobs features have been moved completely out of Redis.
 
diff --git a/topics/deploy/uthsc-email.gmi b/topics/deploy/uthsc-email.gmi
new file mode 100644
index 0000000..05f2ba5
--- /dev/null
+++ b/topics/deploy/uthsc-email.gmi
@@ -0,0 +1,64 @@
+# UTHSC E-mail
+
+Like many organizations, UT uses Outlook and Exchange for its E-mail. Thanks to mobile support, it is possible to work with E-mail using other tools from outside the UT network.
+
+## Prospect E-mail client
+
+People have success using Prospect as an E-mail client. You can follow the instructions on the UT website that are similar to Android support.
+
+## Davmail IMAP bridge
+
+An interesting solution is to create an IMAP bridge. It is a little slower, but can be set up! That way you can use your favorite E-mail tool (and filters!).
+
+I have had success setting up davmail with the following settings and testing with thunderbird first:
+
+```
+apt install davmail openjfx thunderbird
+```
+
+Start davmail with
+
+```
+davmail -d
+```
+
+Stop davmail and edit the ~/.davmail.properties file with the following.
+
+```
+davmail.mode=O365Interactive
+davmail.url=https://outlook.office365.com/EWS/Exchange.asmx
+davmail.oauth.clientId=d3590ed6-52b3-4102-aeff-aad2292ab01c
+davmail.enableOauth2=true
+davmail.oauth.deviceCode=true
+davmail.oauth.enableOauth2=true
+davmail.oauth.redirectUri=urn:ietf:wg:oauth:2.0:oob
+davmail.oauth.tenantId=common
+davmail.imapPort=1143
+davmail.smtpPort=1025
+davmail.logFilePath=/home/yours/.davmail/davmail.log
+log4j.logger.httpclient.wire=DEBUG
+log4j.rootLogger=DEBUG
+log4j.logger.org.apache.http.wire=DEBUG
+```
+
+Restart davmail and point thunderbird to
+
+```
+IMAP Server: localhost:1143
+SMTP Server: localhost:1025
+Username: your-email@uthsc.edu
+```
+
+Note that you should set the UT password in the 2FA browser when it pops up. Do *not* set it in Thunderbird, even when it asks for it in order to send mail over SMTP.
+
+When something fails make sure to track the log in ~/.davmail/davmail.log
+
+# Using Mutt
+
+Some useful links:
+
+=> https://jonathanh.co.uk/blog/exchange-mutt/
+=> https://movementarian.org/blog/posts/mutt-and-office365/
+=> https://www.vanormondt.net/~peter/blog/2021-03-16-mutt-office365-mfa.html
+
+If someone can get the last one to work we won't even need davmail any more!
diff --git a/topics/deploy/uthsc-vpn-with-free-software.gmi b/topics/deploy/uthsc-vpn-with-free-software.gmi
index 95fd1cd..aeba322 100644
--- a/topics/deploy/uthsc-vpn-with-free-software.gmi
+++ b/topics/deploy/uthsc-vpn-with-free-software.gmi
@@ -6,14 +6,23 @@ It is possible to connect to the UTHSC VPN using only free software. For this, y
 
 To connect, run openconnect-sso as follows. A browser window will pop up for you to complete the Duo authentication. Once done, you will be connected to the VPN.
 ```
-$ openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC
+$ openconnect-sso --server vpn-server --authgroup UTHSC
 ```
 Note that openconnect-sso should be run as a regular user, not as root. After passing Duo authentication, openconnect-sso will try to gain root priviliges to set up the network routes. At that point, it will prompt you for your password using sudo.
 
 ## Recommended way
 
-The recommended way is to use Arun's g-expression setup using guix. See below. It should just work, provided you have the
-chained certificate that you can get from the browser or one of us.
+The recommended way is to use Arun's g-expression setup using guix. See below. It should just work, provided you have the chained certificate (which you can get from the browser or from one of us) and point to the right server. Simply run
+
+```
+$(guix build -f uthsc-vpn.scm)
+```
+
+See
+
+=> ./uthsc-vpn.scm
+
+Get the final details from us. UT does not like it when we put them online even though there is no real risk.
 
 ## Avoid tunneling all your network traffic through the VPN (aka Split Tunneling)
 
@@ -22,7 +31,7 @@ openconnect, by default, tunnels all your traffic through the VPN. This is not g
 
 For example, to connect to the UTHSC VPN but only access the hosts tux01 and tux02e through the VPN, run the following command.
 ```
-$ openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC -- --script 'vpn-slice tux01 tux02e'
+$ openconnect-sso --server vpn-server --authgroup UTHSC -- --script 'vpn-slice tux01 tux02e'
 ```
 The vpn-slice script looks up the hostnames tux01 and tux02e on the VPN DNS and adds /etc/hosts entries and routes to your system. vpn-slice can also set up more complicated routes. To learn more, read the vpn-slice documentation.
 
@@ -52,7 +61,7 @@ Then, run the openconnect-sso client as usual.
 ## Misconfigured UTHSC TLS certificate
 
 The UTHSC TLS certificate does not validate on some systems. You can work around this by downloading the certificate chain and adding it to your system:
-* Navigate with browser to https://uthscvpn1.uthsc.edu/. Inspect the certificate in the browser (lock icon next to search bar) and export .pem file
+* Navigate with browser to https://vpn-server/. Inspect the certificate in the browser (lock icon next to search bar) and export .pem file
 * Move it to /usr/local/share/ca-certificates (with .crt extension) or equivalent
 * On Debian/Ubuntu update the certificate store with update-ca-certificates
 You should see
@@ -65,7 +74,7 @@ Thanks Niklas. See also
 
 However, adding certificates to your system manually is not good security practice. It is better to limit the added certificate to the openconnect process. You can do this using the REQUESTS_CA_BUNDLE environment variable like so:
 ```
-REQUESTS_CA_BUNDLE=/path/to/uthsc/certificate.pem openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC
+REQUESTS_CA_BUNDLE=/path/to/uthsc/certificate.pem openconnect-sso --server vpn-server --authgroup UTHSC
 ```
 
 ## Putting it all together using Guix G-expressions
@@ -93,6 +102,6 @@ export QTWEBENGINE_CHROMIUM_FLAGS=--disable-seccomp-filter-sandbox
 
 ## Acknowledgement
 
-Many thanks to Pjotr Prins and Erik Garrison without whose earlier work this guide would not be possible.
+Many thanks to Arun, Pjotr and Erik without whose earlier work this guide would not be possible.
 => https://github.com/pjotrp/linux-at-university-of-tennessee
 => https://github.com/ekg/openconnect-sso-docker
diff --git a/topics/deploy/uthsc-vpn.scm b/topics/deploy/uthsc-vpn.scm
index 82f67f5..f204cdf 100644
--- a/topics/deploy/uthsc-vpn.scm
+++ b/topics/deploy/uthsc-vpn.scm
@@ -1,15 +1,30 @@
-(use-modules ((gnu packages python-web) #:select (python-requests python-urllib3))
+(use-modules ((gnu packages check)
+              #:select (python-pytest python-pytest-asyncio python-pytest-httpserver))
+             ((gnu packages freedesktop) #:select (python-pyxdg))
+             ((gnu packages python-build) #:select (python-poetry-core python-toml))
+             ((gnu packages python-crypto)
+              #:select (python-keyring python-pyotp))
+             ((gnu packages python-web) #:select (python-requests python-urllib3))
+             ((gnu packages python-xyz)
+              #:select (python-attrs python-charset-normalizer
+                                     python-colorama python-prompt-toolkit python-pysocks
+                                     python-structlog))
              ((gnu packages guile-xyz) #:select (guile-ini guile-lib guile-smc))
-             ((gnu packages vpn) #:select (openconnect-sso vpn-slice))
+             ((gnu packages qt) #:select (python-pyqt-6 python-pyqtwebengine-6))
+             ((gnu packages vpn) #:select (openconnect vpn-slice))
+             ((gnu packages xml) #:select (python-lxml-4.9))
+             (guix build-system pyproject)
              (guix build-system python)
              (guix download)
              (guix gexp)
+             (guix git-download)
+             ((guix licenses) #:prefix license:)
              (guix packages))
 
 ;; Put in the hosts you are interested in here.
 (define %hosts
   (list "octopus01"
-        "spacex.uthsc.edu"))
+        "spacex"))
 
 (define (ini-file name scm)
   "Return a file-like object representing INI file with @var{name} and
@@ -36,6 +51,22 @@
          "01dkqv0rsjqyw4wrp6yj8h3bcnl7c678qkj845596vs7p4bqff4a"))))
     (build-system python-build-system)))
 
+(define python-charset-normalizer-2.10
+  (package
+    (inherit python-charset-normalizer)
+    (version "2.1.0")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (pypi-uri "charset-normalizer" version))
+       (sha256
+        (base32 "04zlajr77f6c7ai59l46as1idi0jjgbvj72lh4v5wfpz2s070pjp"))))
+    (build-system python-build-system)
+    (arguments (list))
+    (native-inputs
+     (modify-inputs (package-native-inputs python-charset-normalizer)
+       (delete "python-setuptools")))))
+
 (define python-requests-2.28
   (package
     (inherit python-requests)
@@ -52,13 +83,78 @@
     (native-inputs (list))
     (propagated-inputs
      (modify-inputs (package-propagated-inputs python-requests)
+       (replace "python-charset-normalizer" python-charset-normalizer-2.10)
        (replace "python-urllib3" python-urllib3-1.26)))))
 
+(define-public openconnect-sso
+  (package
+    (name "openconnect-sso")
+    ;; 0.8.0 was released in 2021, the latest update on master HEAD is from
+    ;; 2023.
+    (properties '((commit . "94128073ef49acb3bad84a2ae19fdef926ab7bdf")
+                  (revision . "0")))
+    (version (git-version "0.8.0"
+                          (assoc-ref properties 'revision)
+                          (assoc-ref properties 'commit)))
+    (source
+      (origin
+        (method git-fetch)
+        (uri (git-reference
+               (url "https://github.com/vlaci/openconnect-sso")
+               (commit (assoc-ref properties 'commit))))
+        (file-name (git-file-name name version))
+        (sha256
+         (base32 "08cqd40p9vld1liyl6qrsdrilzc709scyfghfzmmja3m1m7nym94"))))
+    (build-system pyproject-build-system)
+    (arguments
+     `(#:phases
+       (modify-phases %standard-phases
+         (add-after 'unpack 'use-poetry-core
+           (lambda _
+             ;; Patch to use the core poetry API.
+             (substitute* "pyproject.toml"
+               (("poetry.masonry.api")
+                "poetry.core.masonry.api"))))
+         (add-after 'unpack 'patch-openconnect
+           (lambda* (#:key inputs #:allow-other-keys)
+             (substitute* "openconnect_sso/app.py"
+               (("\"openconnect\"")
+                (string-append "\""
+                               (search-input-file inputs "/sbin/openconnect")
+                               "\""))))))))
+    (inputs
+     (list openconnect
+           python-attrs
+           python-colorama
+           python-keyring
+           python-lxml-4.9
+           python-prompt-toolkit
+           python-pyotp
+           python-pyqt-6
+           python-pyqtwebengine-6
+           python-pysocks
+           python-pyxdg
+           python-requests
+           python-structlog
+           python-toml))
+    (native-inputs
+     (list python-poetry-core
+           python-pytest
+           python-pytest-asyncio
+           python-pytest-httpserver))
+    (home-page "https://github.com/vlaci/openconnect-sso")
+    (synopsis "OpenConnect wrapper script supporting Azure AD (SAMLv2)")
+    (description
+     "This package provides a wrapper script for OpenConnect supporting Azure AD
+(SAMLv2) authentication to Cisco SSL-VPNs.")
+    (license license:gpl3)))
+
 ;; Login to the UTHSC VPN fails with an SSLV3_ALERT_HANDSHAKE_FAILURE
 ;; on newer python-requests.
 (define openconnect-sso-uthsc
   (package
     (inherit openconnect-sso)
+    (name "openconnect-sso-uthsc")
     (inputs
      (modify-inputs (package-inputs openconnect-sso)
        (replace "python-requests" python-requests-2.28)))))
@@ -81,7 +177,7 @@
         (setenv "REQUESTS_CA_BUNDLE"
                 #$(local-file "uthsc-certificate.pem"))
         (invoke #$(file-append openconnect-sso-uthsc "/bin/openconnect-sso")
-                "--server" "uthscvpn1.uthsc.edu"
+                "--server" "$vpn-server" ; ask us for end-point or see UT docs
                 "--authgroup" "UTHSC"
                 "--"
                 "--script" (string-join (cons #$(file-append vpn-slice "/bin/vpn-slice")
diff --git a/topics/genenetwork/publications-on-genenetwork.gmi b/topics/genenetwork/publications-on-genenetwork.gmi
new file mode 100644
index 0000000..aea1f63
--- /dev/null
+++ b/topics/genenetwork/publications-on-genenetwork.gmi
@@ -0,0 +1,14 @@
+# Publications on Genenetwork
+
+## Tags
+
+* type: documentation
+* keywords: documentation, docs, doc, publications
+
+## Important points
+
+A publication can relate to more than one dataset (or family), e.g. you can have a publication with phenotypes from both the BXD and CXB populations. From @robw:
+
+```
+Yes. A single publication can make use of several different families of strains. Our hippocampus paper with Rupert included both BXD, CXB, and Mouse Diversity panel. We (awkwardly) put them all into BXDs with "Other" and also in CXB and also in Mouse Diversity. Definitely not optimal, but the code would have been way more work than just entering in three ways.
+```
diff --git a/topics/genetics/pangenotypes.gmi b/topics/genetics/pangenotypes.gmi
new file mode 100644
index 0000000..9b3d534
--- /dev/null
+++ b/topics/genetics/pangenotypes.gmi
@@ -0,0 +1,52 @@
+# Pangenotypes
+
+Here we discuss different storage solutions for pangenotypes.
+
+## GRG format
+
+
+While looking into graph genotyping I ran into Genotype Representation Graphs (GRG):
+
+=> https://pmc.ncbi.nlm.nih.gov/articles/PMC11071416/
+
+It has a binary storage format that represents something like:
+
+```
+# GRG file example: genotype graph
+# Nodes section: NODE <id> <label> allele=<genotype>
+NODE 1 GeneA allele=AA
+NODE 2 GeneB allele=AG
+NODE 3 GeneC allele=GG
+NODE 4 GeneD allele=AA
+NODE 5 GeneE allele=AG
+
+# Edges section: EDGE <from_id> <to_id>
+EDGE 1 2
+EDGE 1 3
+EDGE 2 4
+EDGE 3 4
+EDGE 4 5
+EDGE 5 1
+```
+
+The tooling
+
+=> https://github.com/aprilweilab/grgl.git
+
+builds with
+
+```
+guix shell -C -N coreutils gcc-toolchain make cmake openssl nss-certs git pkg-config zlib
+```
+
+I did some tests and read the source code. The nice thing is that they have very similar ideas. Unfortunately the implementation is not what we want. I wonder why people always reinvent data structures :/. To get an idea:
+
+=> https://github.com/aprilweilab/grgl/blob/main/src/serialize.cpp
+
+I would like to take similar ideas and turn them into an efficient in-memory graph structure that is easily extensible. RDF is key for extensions (and queries). A fast RDF implementation we are going to try is
+
+=> https://pyoxigraph.readthedocs.io/en/stable/index.html
+
+Toshiaki pointed out we should look at qlever instead:
+
+=> https://github.com/ad-freiburg/qlever
diff --git a/topics/genetics/standards/gemma-genotype-format.gmi b/topics/genetics/standards/gemma-genotype-format.gmi
new file mode 100644
index 0000000..6ca5998
--- /dev/null
+++ b/topics/genetics/standards/gemma-genotype-format.gmi
@@ -0,0 +1,99 @@
+# PanGEMMA Genotype Format
+
+Here we describe the genotype DB format that is used by GN and pangemma. Essentially it contains the genotypes as markers x samples (rows x cols). Unlike some earlier formats it also carries metadata and allows for tracking changes to the genotypes.
+
+The current reference implementation for creating the file lives at
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/geno2mdb.rb
+
+Note that we'll likely create new versions in python, guile and/or rust.
+
+# Storage
+
+We use the LMDB b-tree format to store and retrieve records based on an index. LMDB is very fast as it uses the memory map facilities of the underlying operating system.
+
+=> https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database
+
+LMDB supports multiple 'tables' in one file. We also use a metadata table named 'info'. Another table named 'track-changes' keeps track of modifications to the genotypes. This allows the genotypes to change over time - still giving people access to the original information if they need it.
+
+# Genotypes in the 'geno' table
+
+Genotypes are stored as fixed size rows of genotypes. Genotypes can be represented as 4-byte floats 'f*' or a list of bytes 'C*' (note these format specifiers come from ruby pack - python has similar but slightly different specifiers). The idea being that storing floats gives enough precision for probabilities and single bytes can represent all other cases. In the future we may add 2-byte integers, but that is probably not necessary.
+
+For the float version we use NaN to designate a missing value (NA).
+
+For the byte version we use the value 255 or 0xFF to designate a missing value (NA). The other 255 values (0 through 254) are used either as an index - so A,B,H could be 0,1,2 - or to project a range of values. In many cases 255 values are enough to represent genotype variation in a population. Otherwise opt for the float option.
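+
+As a minimal sketch of the two row encodings, assuming Ruby's pack specifiers as mentioned above: the helper names below are hypothetical and not taken from geno2mdb.rb, and nil stands in for a missing genotype.
+
+```ruby
+# Sketch only: helper names are hypothetical, not part of geno2mdb.rb.
+NA_BYTE = 0xFF # 255 marks a missing genotype in the byte encoding
+
+# Pack a row of probabilities as 4-byte floats; nil (missing) becomes NaN.
+def pack_float_row(row)
+  row.map { |g| g.nil? ? Float::NAN : g }.pack('f*')
+end
+
+# Pack a row of small integer codes (0..254) as single bytes; nil becomes 255.
+def pack_byte_row(row)
+  row.map { |g| g.nil? ? NA_BYTE : g }.pack('C*')
+end
+
+# Unpack either encoding again, turning the NA markers back into nil.
+def unpack_row(bytes, format)
+  values = bytes.unpack(format)
+  if format == 'f*'
+    values.map { |v| v.nan? ? nil : v }
+  else
+    values.map { |v| v == NA_BYTE ? nil : v }
+  end
+end
+
+floats = pack_float_row([0.0, 0.5, nil, 2.0])
+codes  = pack_byte_row([0, 1, nil, 2]) # e.g. A, B, NA, H as 0, 1, NA, 2
+puts floats.bytesize # 4 bytes per sample
+puts codes.bytesize  # 1 byte per sample
+p unpack_row(floats, 'f*')
+p unpack_row(codes, 'C*')
+```
+
+The byte version is four times smaller per sample, which is the main reason to prefer it whenever the genotypes fit in 255 distinct values.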
+
+The index to the rows is currently built out of keys. These keys hold the chromosome number as a single byte 'C', the position as a 4-byte unsigned integer 'L>' and the row number in the original file as a 4-byte unsigned integer 'L>'. The '>' modifier stores these numbers big-endian, so that LMDB's byte-wise key comparison always sorts the index correctly(!).
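+
+To illustrate why the byte order of the key matters, here is a small sketch. In Ruby's pack, 'L>' is a big-endian 32-bit unsigned integer, so a plain byte-wise sort of the packed keys follows (chromosome, position, row) order. The make_key helper and the example records are made up for illustration; the little-endian variant is only shown as a counter-example.
+
+```ruby
+CHRPOS_PACK = 'CL>L>' # chromosome byte, big-endian position, big-endian row number
+
+# Build a 9-byte key from chromosome, position and row number.
+def make_key(chr, pos, line, format = CHRPOS_PACK)
+  [chr, pos, line].pack(format)
+end
+
+records = [[1, 256, 2], [1, 1, 1], [2, 1, 3]]
+
+# Byte-wise sort of big-endian keys matches numeric order.
+big = records.map { |r| make_key(*r) }.sort.map { |k| k.unpack(CHRPOS_PACK) }
+
+# Counter-example: little-endian keys sort position 256 before position 1.
+little = records.map { |r| make_key(*r, 'CL<L<') }.sort.map { |k| k.unpack('CL<L<') }
+
+p make_key(1, 256, 2).bytesize
+p big
+p little
+```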
+
+# Metadata in the 'info' table
+
+The default metadata is stored in the info table as
+
+```
+meta = {
+  "type" => "gemma-geno",
+  "format" => options[:format],
+  "version" => 1.0,
+  "eval" => EVAL.to_s,
+  "key-format" => CHRPOS_PACK,
+  "rec-format" => PACK,
+  "geno" => json
+}
+```
+
+where CHRPOS_PACK gives the key layout 'CL>L>' and PACK the genotype list, e.g. 'f*'. The format line gives the 'standard' storage type, e.g. 'Gf' for the floats and eval is the command used to transform values. The only field we really have to use for unpacking the data is format or rec-format because key-format does not change. The info table has some extra records that may be used:
+
+```
+  info['numsamples'] = [numsamples].pack("Q") # uint64
+  info['nummarkers'] = [geno.size].pack("Q")
+  info['meta'] = meta.to_json.to_s
+  info['format'] = options[:format].to_s
+  info['options'] = options.to_s
+```
+
+where 'numsamples' and 'nummarkers' are counts. 'meta' reflects the JSON record above. 'format' mirrors format in the meta record and 'options' shows the options as they were fed to the program that generated the file.
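+
+A reader could decode these records roughly as follows. This is a sketch under assumptions: a plain Ruby Hash stands in for the LMDB 'info' table (no LMDB binding is used here) and the example values are invented.
+
+```ruby
+require 'json'
+
+# A plain Hash stands in for the LMDB 'info' table in this sketch.
+info = {
+  'numsamples' => [143].pack('Q'), # uint64, as written by the generator
+  'nummarkers' => [3].pack('Q'),
+  'meta' => { 'type' => 'gemma-geno', 'version' => 1.0,
+              'key-format' => 'CL>L>', 'rec-format' => 'f*' }.to_json
+}
+
+numsamples = info['numsamples'].unpack1('Q')
+nummarkers = info['nummarkers'].unpack1('Q')
+meta = JSON.parse(info['meta'])
+rec_format = meta['rec-format']
+
+# The record size follows from the format: 4 bytes per float, 1 per byte code.
+bytes_per_value = rec_format == 'f*' ? 4 : 1
+row_bytes = numsamples * bytes_per_value
+puts "#{nummarkers} markers x #{numsamples} samples, #{row_bytes} bytes per row (#{rec_format})"
+```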
+
+# Tracking changes
+
+Note: this is a proposal and has not yet been implemented. But the idea is to store records by time stamp. Each record will describe the change so the last genotypes can be rolled forward at the user's wish. In case of a replacement it could be:
+
+```
+timestamp =>
+{
+  "marker" => name,
+  "chr" => chr,
+  "pos" => pos,
+  "line" => line,
+  "action" => "update",
+  "author" => author,
+  "genotypes" => list
+}
+```
+
+Where list contains the *updated* genotypes.
+Likewise for a marker insertion or deletion.
+
+The track changes can also specify that a change only applies to a trait, a list of traits, a specific set of samples, or a group. E.g.
+
+```
+timestamp =>
+{
+  "marker" => name,
+  "chr" => chr,
+  "pos" => pos,
+  "line" => line,
+  "action" => "update",
+  "author" => author,
+  "genotypes" => list,
+  "for-traits" => list,
+  "for-samples" => list,
+  "for-group" => name
+}
+```
+
+The 'geno' database will therefore always hold the *first* version. These records make it possible to roll forward on changes and present an updated genotype matrix. Used genotypes are retained. This, naturally, can be handled in a cache. So any rewritten genotype files will be available in cache for a period of time.
+In the future a tool, such as GEMMA, could support dynamic application of these edits. That way we only have to cache the latest version.
+
+This way users may be able to select changes (i.e. pick and choose), use all (latest) or use original (init).
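+
+Since this is only a proposal, the following is a hypothetical sketch of what rolling forward could look like. The roll_forward function, the integer timestamps and the in-memory Hash layout are all assumptions; only 'update' actions are handled.
+
+```ruby
+# geno:    Hash of marker name => list of genotypes (the original 'geno' rows)
+# changes: Hash of timestamp => change record, as in the proposal above
+# upto:    optional timestamp; later changes are ignored (pick and choose)
+def roll_forward(geno, changes, upto: nil)
+  current = geno.transform_values(&:dup) # never touch the first version
+  changes.sort_by { |timestamp, _| timestamp }.each do |timestamp, rec|
+    break if upto && timestamp > upto
+    next unless rec['action'] == 'update'
+    current[rec['marker']] = rec['genotypes'].dup
+  end
+  current
+end
+
+geno = { 'A1-0' => [0, 1, 2], 'A2-0' => [2, 2, 0] }
+changes = {
+  1_700_000_200 => { 'marker' => 'A1-0', 'action' => 'update',
+                     'author' => 'b', 'genotypes' => [0, 1, 1] },
+  1_700_000_100 => { 'marker' => 'A1-0', 'action' => 'update',
+                     'author' => 'a', 'genotypes' => [0, 0, 2] }
+}
+
+latest = roll_forward(geno, changes)
+first_edit = roll_forward(geno, changes, upto: 1_700_000_100)
+p latest
+p first_edit
+p geno # original stays as it was
+```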
+
+For the editing we should provide an API.
diff --git a/topics/genetics/test-pangenome-derived-genotypes.gmi b/topics/genetics/test-pangenome-derived-genotypes.gmi
new file mode 100644
index 0000000..3a8473a
--- /dev/null
+++ b/topics/genetics/test-pangenome-derived-genotypes.gmi
@@ -0,0 +1,1005 @@
+# Test pangenome derived genotypes
+
+Here we follow up on the work we did on precompute PublishData:
+
+=> ../systems/mariadb/precompute-publishdata
+
+But now run against pangenome derived genotypes.
+For the BXD we have 23M markers(!) of which 8M are *not* on the reference genome.
+
+# Tasks
+
+* [ ] Document lmdb geno and marker information
+* [ ] Extract epoch information
+* [ ] Add BED file and link SNPS
+* [ ] Check MAF filter - it may be too stringent
+* [ ] Use ravanan/CWL to push to Octopus
+* [ ] Reintroduce nodes that were not annotated for position (Flavia)
+* [ ] GWA plotter
+* [ ] Speed up IO for GEMMA by using lmdb for genotypes and marker file
+* [ ] Use 1.5LOD score to compute QTLs instead of using 50M distance
+* [ ] Reduce GEMMA GRM RAM requirements (not urgent)
+* [ ] Fix -lmm 4 ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt
+
+# Summary
+
+To get the mapping and generate the assoc output in mdb format we run a variant of gemma-wrapper.
+
+The workflow essentially is:
+
+* Capture the significant markers from GEMMA's mdb output (as created by gemma-wrapper)
+* Transform these into RDF using the 'gemma-mdb-to-rdf.rb' script
+* Upload that RDF into virtuoso
+* From there, download a table of start-stop data using SPARQL
+* Compute QTL locations using 'sparql-qtl-detect.rb'
+* Upload that RDF into virtuoso as well
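+
+The QTL step can be pictured with a small sketch. Note this is not the code of 'sparql-qtl-detect.rb'; it is a guess at the kind of grouping implied by the 50M distance mentioned in the tasks: significant hits on the same chromosome are merged into one QTL as long as the gap between neighbours stays within a maximum distance. All names and example values are hypothetical.
+
+```ruby
+MAX_GAP = 50_000_000 # assumed from the '50M distance' in the task list
+
+# hits: list of { chr:, pos:, lod: } for significant markers of one trait
+def detect_qtl(hits, max_gap: MAX_GAP)
+  qtls = []
+  hits.sort_by { |h| [h[:chr], h[:pos]] }.each do |hit|
+    last = qtls.last
+    if last && last[:chr] == hit[:chr] && hit[:pos] - last[:stop] <= max_gap
+      # Extend the current QTL and keep track of its top LOD score
+      last[:stop] = hit[:pos]
+      last[:lod] = [last[:lod], hit[:lod]].max
+      last[:snps] += 1
+    else
+      qtls << { chr: hit[:chr], start: hit[:pos], stop: hit[:pos],
+                lod: hit[:lod], snps: 1 }
+    end
+  end
+  qtls
+end
+
+hits = [
+  { chr: '1', pos: 30_000_000, lod: 6.1 },
+  { chr: '1', pos: 10_000_000, lod: 5.2 },
+  { chr: '1', pos: 150_000_000, lod: 4.5 },
+  { chr: '2', pos: 5_000_000, lod: 4.8 }
+]
+qtls = detect_qtl(hits)
+qtls.each { |q| p q }
+```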
+
+For mapping virtuoso contains four important ttl files:
+
+* marker positions in pangenome-marker graph
+* mapped markers in pangenome-mapped graph
+* computed QTL positions in pangenome-qtl graph
+* trait values in traits graph (nyi)
+
+The batch mapping itself is started with
+
+```
+gemma-batch-run.sh
+```
+
+Next we convert that output to RDF with
+
+```
+../bin/gemma-mdb-to-rdf.rb --header > output.ttl
+time ../bin/gemma-mdb-to-rdf.rb --anno snps-matched.txt.mdb tmp/panlmm/*-gemma-GWA.tar.xz >> output.ttl # two hours for 7000 traits
+time serdi -i turtle -o ntriples output.ttl > output.n3
+```
+
+Note that N-Triples files (here with the .n3 extension) are less error prone, and that serdi does better than rapper with huge files. Copy the file to the virtuoso instance and load it with isql (it may be worth search-replacing the gnt:run tag with something descriptive).
+
+```
+cd /export/guix-containers/virtuoso/data/virtuoso/ttl/
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+SQL> ld_dir('/export/data/virtuoso/ttl','test-run-3000.n3','http://pan-test.genenetwork.org');
+Done. -- 3 msec.
+# for testing the validity and optionally deleting problematic ones:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+SQL> DELETE from DB.DBA.LOAD_LIST where LL_STATE = 1;
+# commit changes
+SQL> rdf_loader_run (); // about 1 min per GB n3
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
+34200686
+```
+
+Note it may be a good idea to drop graphs first. That is why we have separate subgraph spaces for every large TTL file:
+
+```
+log_enable(3,1);
+SQL> SPARQL CLEAR GRAPH  <http://pan-test.genenetwork.org>;
+SQL> SPARQL CLEAR GRAPH  <http://pan-mapped.genenetwork.org>; // 10 min
+SQL> SPARQL CLEAR GRAPH  <http://pangenome-marker.genenetwork.org>;
+SQL> ld_dir('/export/data/virtuoso/ttl','pangenome-markers.n3','http://pangenome-marker.genenetwork.org');
+SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
+```
+
+For pangenomes we have a marker file, a QTL file
+
+As a test, fetch a table of the traits with their SNPs
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT * FROM <http://pangenome-mapped.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lod ;
+        gnt:af ?af .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+FILTER (contains(?nodeid,"Marker") && ?pos < 1000)
+} LIMIT 100
+```
+
+OK, we are ready to run a little workflow. First create a sorted list of IDs.
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT DISTINCT ?trait FROM <http://pangenome-mapped.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId ?trait.
+}
+```
+
+See also
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/list-traits.sparql
+
+Sort that list and save as 'pan-ids-sorted.txt'. Next run
+
+```
+../../bin/workflow/qtl-detect-batch-run.sh
+```
+
+and load the results into virtuoso. To list the new QTL:
+
+```
+SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
+  ?traitid a gnt:mappedTrait ;
+  gnt:traitId ?t .
+  MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
+  # ?traitid gnt:run gn:test . # use if you want the new QTL
+  ?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlLOD ?lod ;
+  gnt:qtlStart ?s ;
+  gnt:qtlStop ?e .
+  ?qtl gnt:mappedSnp ?snp .
+  FILTER (?t = "10002" && ?lod >= 5.0 ) .
+} LIMIT 100
+```
+
+# Prior work
+
+For the first traits (presented at CTC'25) gemma was run as
+
+```
+echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..."
+gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
+        -p ${PHENO_FILE} \
+              -gk \
+              -o percentile_result > percentile.kinship.143.txt
+
+echo "[$(date)] Kinship matrix calculation completed for PERCENTILE."
+echo "[$(date)] Starting association analysis for PERCENTILE..."
+gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
+        -p ${PHENO_FILE} \
+              -k ./output/percentile_result.cXX.txt \
+              -lmm 4 \
+              -maf 0.05 \
+              -o percentile_association > percentile.assoc.143.txt
+```
+
+Note no LOCO.
+
+The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything in RAM, it is a bit large for my workstation. I opted to use tux04 since no one is using it. Had to reboot the machine because it is unreliable and had crashed.
+
+There I rebuilt gemma and set up a first run:
+
+```
+tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$
+/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk
+```
+
+Without LOCO this took about 18 minutes (186% CPU), 110Gb of RAM. We ought to work on this ;) Next
+
+```
+/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05
+```
+
+To run gemma on the current 23M BXD pangenome derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). 13K traits would be 43 months on a single machine. We'll need something better. As Rob writes:
+
+> The huge majority of variants will have r2 of 1 with hundreds or thousands of neighbors. This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions.
+
+A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows:
+
+```
+chr     rs      ps      n_miss  allele1 allele0 af      beta    se      logl_H1 l_remle l_mle   p_wald  p_lrt   p_score
+-9      A1-0    -9      0       A       T       0.171   -nan    -nan    -nan    1.000000e+05    1.000000e+05    -nan  -nan     -nan
+-9      A2-0    -9      0       A       T       0.170   -nan    -nan    -nan    1.000000e+05    1.000000e+05    -nan  -nan     -nan
+```
+
+Turns out I was using the wrong pheno file. Let's try again.
+
+```
+/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05
+```
+
+As a check I can diff against the original output. So, I replicated the original run! It also ran faster at 400% CPU in 35 minutes.
+
+(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things).
+
+## Moving to gemma-wrapper
+
+gemma-wrapper has extra facilities, such as LOCO and caching and lmdb output. Last time we used it in
+
+=> ../systems/mariadb/precompute-publishdata
+
+in a guix container it looked like
+
+```
+#! /bin/env sh
+
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
+    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+    ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
+    ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+  fi
+done
+```
+
+Let's try running the big stuff instead:
+
+```
+./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz  -gk
+```
+
+## Individuals
+
+gemma does not really track individuals. The order of genotype columns should just be the same as in the pheno file.
+In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following
+
+```
+{
+  "type": "gn-geno-to-gemma",
+  "genofile": "BXD.geno",
+  "samples": [
+    "BXD1",
+    "BXD2",
+    "BXD5",
+...
+  ],
+  "numsamples": 237,
+  "header": [
+    "# File name: BXD_experimental_DGA_7_Dec_2021",
+...
+```
+
+To get this
+
+```
+cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e "s,$,\"\\,," > bxd_inds.list.txt
+"BXD100",
+"BXD101",
+"BXD102",
+```
+
+Next I turned it into a JSON file by hand as 'bxd_inds.list.json'.
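+
+The manual step could also be scripted. Below is a sketch that builds the fields shown in the example above from the first column of the sample list. The function name, the sample lines and the genofile value are placeholders, and the real file may need a 'header' field as well.
+
+```ruby
+require 'json'
+
+# Take the first tab-separated column of each line and strip everything
+# from the first underscore, mirroring the cut/sed pipeline above.
+def geno_json(tsv_lines, genofile)
+  samples = tsv_lines.map { |line| line.chomp.split("\t")[0].sub(/_.*/, '') }
+  { 'type' => 'gn-geno-to-gemma',
+    'genofile' => genofile,
+    'samples' => samples,
+    'numsamples' => samples.size }
+end
+
+# Invented example lines; the real input is 143samples.pc-list.tsv
+lines = ["BXD100_a\tx\n", "BXD101_b\ty\n", "BXD102_c\tz\n"]
+record = geno_json(lines, 'example-genotypes.txt')
+puts JSON.pretty_generate(record)
+```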
+
+## Markers
+
+With GEMMA marker names are listed in the geno file. GEMMA also can use a SNP file that gives the chromosome and location.
+Without the SNP file gemma-wrapper complains that it needs the SNP/marker annotation file. This is logical because for LOCO it needs to know what chromosome a marker is on.
+
+The next step is to take the nodes file and extract all rows from the genotype file that match nodes with chromosomes defined. Andrea is going to deliver all positions for all nodes, but for now we can use what we have. Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt:
+
+```
+mm10#1#chr3     23209565        93886997
+mm10#1#chr3     23209564        93886999
+mm10#1#chr3     23209563        93887016
+...
+```
+
+In the genotype file we find, for example
+
+```
+A23209564-0, A, T, 1.919141867395325,  0.9306930597711228,  1.8201319833577734,  0.7607260422339468,  1.427392726736106,  1.2310230984252724,  1.6633662444541875,  0.6105610229068721, ...
+```
+
+A bit funny, but you get the idea. So we can take the mm10 file and write out the genotype file again for all matching nodes, together with a matching SNP file that should contain for this node:
+
+```
+A23209564-0        93886999        3
+```
+
+To rewrite the above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like
+
+```
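+#!/usr/bin/env ruby
+# Convert "mm10#1#chr3 <node> <pos>" lines into GEMMA SNP annotation lines
+
+ARGF.each_line do |line|
+  tag, name, pos = line.strip.split(/\t/)
+  chrom = tag[/chr(.*)$/, 1] # chromosome name follows 'chr' in the tag
+  print "A#{name}-0\t#{pos}\t#{chrom}\n"
+end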
+
+Now, another problem is that not all SNPs have a position in the genotype file (yet). As we can't display them I can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. That throwaway script looks like
+
+```
+bimbamfn = ARGV.shift
+snpfn = ARGV.shift
+# Collect the names of all annotated markers
+snps = {}
+File.foreach(snpfn) do |snpl|
+  name = snpl.split(/\t/)[0]
+  snps[name] = true
+end
+# Only print genotype rows that have an annotation
+File.foreach(bimbamfn) do |line|
+  marker = line.split(/[,\s]/)[0]
+  print line if snps[marker]
+end
+
+It takes a while to run, but as this is a one-off that does not matter. Reducing the file leads to 13667900 markers with genotypes. The original SNP file has 14927024 lines. Hmmm. The overlap is therefore not perfect (we have more annotations than genotypes now). To check this I'll run a diff.
+
+```
+cut -f 1 -d "," 143samples.percentile.bimbam.bimbam-reduced > 143samples.percentile.bimbam.bimbam-reduced-markers
+sort 143samples.percentile.bimbam.bimbam-reduced-markers > markers-sorted.txt
+diff --speed-large-files  143samples.percentile.bimbam.bimbam-reduced-markers markers-sorted.txt
+< A80951-0
+< A80952-0
+< A80953-0
+...
+cut -f 1 snps.txt |sort > snps-col1-sorted.txt
+diff --speed-large-files snps-col1-sorted.txt markers-sorted.txt
+241773d228996
+< A10314686-0
+241777d228999
+< A10314689-0
+241781d229002
+< A10314692-0
+grep A10314686 snps-col1-sorted.txt markers-sorted.txt
+snps-col1-sorted.txt:A10314686-0
+snps-col1-sorted.txt:A10314686-0
+markers-sorted.txt:A10314686-0
+```
+
+Ah, we have duplicate annotation lines in the SNP file.
+
+```
+grep A10314686-0 snps.txt
+A10314686-0     20257882        8
+A10314686-0     20384895        8
+grep A10314692-0 snps.txt
+A10314692-0     20257575        8
+A10314692-0     20384588        8
+```
+
+So the same node is considered to be two SNPs. This is due to the node covering multiple inds (paths). Turns out a chunk of them map on different chromosomes too. I think we ought to drop them until we have a better understanding of what they represent (they may be mismapping artifacts).
+
+I updated the script. Now I see it skips A280000 because there is no marker annotation for that node. Good. Also the number of genotype markers got further reduced to 13209385.
+I checked the gemma code and the SNP annotation file should match the genotype file line for line. Unsurprising, perhaps, but now I need to rewrite both. After adapting the script we now have two files with the same number of lines.
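+
+The adapted script is not shown here, so the following is only a sketch of the idea under stated assumptions: drop every node that has more than one annotation line, skip genotype rows without annotation, and write the annotation in genotype order so both outputs line up. Function names and example lines are hypothetical.
+
+```ruby
+# Keep only markers that appear exactly once in the SNP annotation.
+def unique_annotations(snp_lines)
+  by_name = Hash.new { |h, k| h[k] = [] }
+  snp_lines.each { |l| by_name[l.split("\t")[0]] << l }
+  by_name.select { |_, ls| ls.size == 1 }.transform_values(&:first)
+end
+
+# Return genotype and annotation lines that match line for line.
+def match_files(geno_lines, snp_lines)
+  anno = unique_annotations(snp_lines)
+  geno_out = []
+  snp_out = []
+  geno_lines.each do |line|
+    marker = line.split(/[,\s]/)[0]
+    next unless anno.key?(marker) # no annotation, or an ambiguous one
+    geno_out << line
+    snp_out << anno[marker]
+  end
+  [geno_out, snp_out]
+end
+
+snp_lines = ["A1-0\t100\t1\n", "A2-0\t200\t8\n", "A2-0\t900\t8\n", "A3-0\t300\t1\n"]
+geno_lines = ["A1-0, A, T, 0.1, 1.9\n", "A2-0, A, T, 0.2, 1.8\n",
+              "A4-0, A, T, 0.3, 1.7\n", "A3-0, A, T, 0.4, 1.6\n"]
+geno_out, snp_out = match_files(geno_lines, snp_lines)
+print geno_out.join
+print snp_out.join
+```
+
+Here A2-0 is dropped because it has two annotation lines and A4-0 because it has none, leaving two files with the same number of lines in the same order.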
+
+Rerunning with the new files:
+
+```
+gemma -g new-genotypes.txt -p pheno_filtered_143.txt -gk
+gemma -g new-genotypes.txt -p pheno_filtered_143.txt -k output/result.cXX.txt -maf 0.05 -lmm 4 -a snps-matched.txt
+```
+
+And, even though the results differ somewhat in size -- due to the different number of markers -- the results look very similar to what was produced before. Good!
+
+Now that we have confirmation and all the pieces, we can run the same set with gemma-wrapper and LOCO.
+
+## gemma-wrapper
+
+The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix. I replaced that by using our pfff hashing for larger files.
+
+```
+/bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json
+```
+
+For this computation each gemma maxed out at 80Gb RAM (total 640Gb). We are really hitting limits here. In the near future we need to check why so much data is retained. As we only have 150 individuals it is a marker thing.
+
+```
+/bin/time -v ../bin/gemma-wrapper -v --json --lmdb --loco --input K.json -- -g new-genotypes.txt -p pheno_filtered_143.txt -a snps-matched.txt -debug -maf 0.05 -lmm 9 > GWA.json
+```
+
+This time gemma requires only 25Gb per chromosome, so we can run it in one go in RAM on this large server. Much of the time is spent in IO, so I think that when we start using mmap (lmdb) we can speed it up significantly.
+gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores.
+
+Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one.
+
+After some fixes we now have loco in an lmdb output. The mdb file comes in at 693Mb. That will make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (4.0) made the size 17Mb. That makes it 215Gb total. Which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now down to 7Mb and 3.2Mb compressed. That'll total less than 100Gb for 13K traits. Good.
+
+## Final hookup
+
+Now gemma-wrapper works (and test results are confirmed) we have to wire it up to fetch traits from the DB. We also have to make sure the trait values align with the individuals in the genotype file. Earlier I was running the script gemma-batch-run.sh:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-batch-run.sh
+
+which looks like:
+
+```
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+# ---- Compute GRM
+./bin/gemma-wrapper --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+# ---- For all entries run LMM
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
+    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+    ./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+  fi
+done
+```
+
+We already have ids.txt and the GRM. What is required is the trait values from the DB. What we need to do is run gn-guile somewhere with access to the DB. Also I need to make sure the current gemma-wrapper tar-balls up the result.
+
+OK, we are running. Looks like the smaller datasets only use 11Gb RES RAM per chromosome. Which means we can run two computes in parallel on this machine.
+
+The first run came through! I forgot the --reduce flag, so it came as 190Mb. I'll fix that. 34 individuals ran in 7 minutes.
+We are currently running at one trait per 6 minutes. We can double that on this machine.
+
+The following puzzles me a bit
+
+```
+## number of analyzed individuals = 31
+## number of covariates = 1
+## number of phenotypes = 1
+## leave one chromosome out (LOCO) =       14
+## number of total SNPs/var        = 13209385
+## number of SNPS for K            = 12322657
+## number of SNPS for GWAS         =   886728
+## number of analyzed SNPs         = 13122153
+```
+
+why is the number of SNPs for GWAS low? Perhaps a threshold of 10% for maf is a bit stringent. See below.
+
+Anyway, we are running traits and the first 500 we'll use for analysis.
+
+Meanwhile I'll look at deploying on octopus and maybe speeding up GEMMA. See
+
+=> issues/genetics/speeding-up-gemma
+
+# MAF
+
+GEMMA has a MAF filter. For every SNP a maf is computed by adding the geno value:
+
+```
+maf += geno
+```
+
+when all genotype values are added up MAF is divided by 2x the number of individuals (minus missing).
+
+```
+maf /= 2.0 * (double)(ni_test - n_miss);
+```
+
+and this is held against the maf passed on the command line. The 2.0 therefore assumes all values are between 0 and 2.
+
+Actually I now realise we are using LOCO. So the number of SNPs are the ones on one chromosome. That makes sense!
+Still we have to be careful about the MAF range. In our genotype file the values are between 0 and 2. So that is fine in itself.
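+
+As a rough illustration of the computation described above, here is a minimal Ruby sketch. It is not GEMMA's code: the function names are made up, missing values are represented as nil, and the symmetric filter (dropping markers below the threshold or above 1 minus the threshold) is an assumption on my side.
+
+```ruby
+# Sketch of the allele-frequency sum described above; not GEMMA's actual code.
+# Genotypes are assumed to be dosages in 0..2, with nil for a missing value.
+def allele_freq(genos)
+  present = genos.compact
+  return nil if present.empty?
+  present.sum / (2.0 * present.size)
+end
+
+# Assumption: the filter is symmetric, i.e. it drops markers whose
+# frequency is below the threshold or above 1 - threshold.
+def passes_maf?(genos, threshold)
+  af = allele_freq(genos)
+  !af.nil? && af >= threshold && af <= 1.0 - threshold
+end
+
+p allele_freq([0, 2, 2, nil, 1])     # (0+2+2+1) / (2*4)
+p passes_maf?([0, 0, 0, 0, 2], 0.1)  # frequency 0.2
+p passes_maf?([0, 0, 0, 0, 0], 0.1)  # monomorphic marker
+```
+
+If the genotype values were scaled to 0..1 instead of 0..2, the same division would halve every frequency, which is why the range matters.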
+
+# RDF
+
+Next step is to generate RDF. The SNP annotation was slow, so I moved that to lmdb. Parsing 400 traits now takes 3 minutes. The RDF file is under 1Gb and the SNP annotation RDF is 330Mb. Not too bad!
+
+```
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-snps-400.n3','http://pan-test.genenetwork.org');
+Done. -- 3 msec.
+# for testing the validity and optional delete problematic ones:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+# commit changes
+SQL> rdf_loader_run ();
+SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-400.n3','http://pan-test.genenetwork.org');
+SQL> rdf_loader_run ();
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
+34200686
+```
+
+Or in the web interface:
+
+```
+SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o }
+```
+
+## Query
+
+The RDF is formed as:
+
+```
+gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10383 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10383;
+      gnt:loco true;
+      gnt:run gn:test;
+      gnt:time "2025/11/10 08:12";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10383";
+      gnt:nind 14;
+      gnt:mean 18.0;
+      gnt:std 10.9479;
+      gnt:skew 0.3926;
+      gnt:kurtosis -1.1801;
+      skos:altLabel "BXD_10383";
+      gnt:filename "0233fa0cf277ee7d749de08b32f97c8be6478639-BXDPublish-10383-gemma-GWA.tar.xz";
+      gnt:hostname "napoli";
+      gnt:user "wrk".
+gn:A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639;
+      gnt:locus gn:A8828461_0;
+      gnt:lodScore 4.8;
+      gnt:af 0.536;
+      gnt:effect -32.859.
+```
+
+and SNPs are annotated as
+
+```
+gn:A8828461_0 a gnt:marker;
+                 rdfs:label "A8828461-0";
+                 gnt:chr  "1";
+                 gnt:pos  3304440.
+gn:A8828464_0 a gnt:marker;
+                 rdfs:label "A8828464-0";
+                 gnt:chr  "1";
+                 gnt:pos  3304500.
+```
+
+To get all tested traits you can list:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
+?trait a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId ?traitid ;
+         gnt:kurtosis ?kurtosis .
+} limit 100
+```
+
+To get all SNPs for one trait, e.g. "10381":
+
+```
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId "10381" .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+}
+```
+
+Lists:
+
+```
+| http://genenetwork.org/id/A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 | "A8828461-0" | "1" | 3304440 |
+```
+
+## Scoring/annotating QTL
+
+Next step is annotating the QTL in RDF. Earlier I wrote a script rdf-analyse-gemma-hits. It uses rapper to read two RDF files (two runs) and annotates the QTL and differences between the files. The code is not pretty:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/bin/rdf-analyse-gemma-hits.rb#L1
+
+The supporting library is a bit better:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/lib/qtlrange.rb#L1
+
+Basically we have a QTL locus (QLocus) that tracks chr,pos,af and lod for each hit.
+QRange is a set of QLocus which also tracks some stats chr,min,max,snps,max_af,lod.
+It can compute whether two QTL (QRange) overlap.
+Next we have a container that tracks the QTL (QRanges) on a chromosome.
+
+Finally there is a diff function that can show the differences on a chromosome (QRanges) for two mapped traits.
+
+Maybe the naming could be a bit better, but the code is clear as it stands. One thing to note is that we use a fixed distance MAX_SNP_DISTANCE_BPS of 50M that decides whether a SNP falls in the same QTL. It would be worth trying to base it on dropping LOD scores (1.5 from the top), as Rob and Flavia pointed out.
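+
+To make the fixed-distance idea concrete, here is a hypothetical sketch. It is not the qtlrange.rb implementation: bin_positions and MAX_GAP_BPS are illustrative names, and measuring the gap against the end of the current range is an assumption.
+
+```ruby
+# Hypothetical sketch, not the qtlrange.rb implementation: group hit positions
+# (in bp, on one chromosome) into ranges using a fixed maximum gap.
+MAX_GAP_BPS = 50_000_000 # mirrors the 50M distance mentioned above
+
+def bin_positions(positions, max_gap = MAX_GAP_BPS)
+  positions.sort.each_with_object([]) do |pos, ranges|
+    if ranges.any? && pos - ranges.last[1] <= max_gap
+      ranges.last[1] = pos   # extend the current range
+    else
+      ranges << [pos, pos]   # start a new range
+    end
+  end
+end
+
+p bin_positions([172_265_488, 3_099_543, 65_908_328, 81_604_902])
+```
+
+A LOD-drop rule would replace the gap test with a check on how far the score has fallen from the top of the range.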
+
+So, the library is fine, but the calling program is not great. The reason is that I parse RDF directly, teasing apart the logic we do in above SPARQL. I track state in dictionaries (hashes of hashes) and the result ends up convoluted. Also a lot of state in RAM. I chose RDF direct parsing because it makes for easier development. The downside is that I need to parse the whole file to make sure I have everything related to a trait. To fetch SNP results from SPARQL directly is slow too. I am in a bind.
+
+Using curl:
+
+```
+time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId ?trait ; gnt:kurtosis ?k . }"
+```
+
+
+```
+time curl -G http:///sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId \"10001\" ; gnt:kurtosis ?k . ?snp gnt:mappedSnp ?traitid ; gnt:locus ?locus . }
+"  > test.out
+real    0m1.612s
+user    0m0.020s
+sys     0m0.000s
+```
+
+To get the trait info for 400 traits takes a second. So, that is no big deal. To get the 6K SNPs for one trait also takes a second. Hmmm. For 13K traits that adds up to hours, compared to the minutes for direct RDF parsing. Before lmdb comes to the rescue we should try running it on the virtuoso server itself. With curl we get 0.5s, which makes it two hours for 13K traits. But when we run the query using isql it runs in 70ms, which totals 15 minutes. That is perfectly fine for running the whole set!
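+
+The totals are simple multiplication. A small sketch to check them, assuming one query per trait and a round 13,000 traits (TRAITS and total_seconds are illustrative names):
+
+```ruby
+# Back-of-envelope totals for 13K traits, using the per-query timings quoted above.
+TRAITS = 13_000
+
+def total_seconds(per_trait_seconds, traits = TRAITS)
+  per_trait_seconds * traits
+end
+
+{ "1 s" => 1.0, "0.5 s" => 0.5, "70 ms" => 0.07 }.each do |label, secs|
+  total = total_seconds(secs)
+  puts format("%-6s per trait: %6.0f s = %6.1f min = %4.1f h", label, total, total / 60.0, total / 3600.0)
+end
+```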
+
+One way is to simply script isql from the command line. Meanwhile, it also turns out the ODBC interface can be used from python or ruby. Here is an example in R:
+
+=> https://cran.stat.auckland.ac.nz/web/packages/virtuoso/index.html
+
+Not sure if that is fast enough, but perhaps worth trying.
+
+So, now we have a way to query the data around a trait in seconds. This means I can rewrite the QTL generator to go by trait. This also allows for a quick turnaround during development (good!). Also I want two scripts: one for computing the QTL and one for annotating the differences.
+
+Alright. The first script should simply fetch a trait with its markers from SPARQL and score the QTL (as RDF output). The new script is at
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/sparql-qtl-detect.rb
+
+First, the query for one trait looks like:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT ?lod ?af ?nodeid ?chr ?pos FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId "10002" .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lod ;
+        gnt:af ?af .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+} ORDER BY DESC(?lod)
+```
+
+rendering some 22K markers for trait 10002 as a TSV:
+
+```
+"lod"   "af"    "nodeid"        "chr"   "pos"
+7.5     0.547   "A13459298-0"   "8"     98658490
+7.1     0.154   "A13402313-0"   "8"     96798487
+7       0.432   "A13446492-0"   "8"     97355019
+7       0.263   "A13387873-0"   "8"     94934820
+7       0.585   "A4794343-0"    "1"     172265488
+...
+```
+
+Earlier with precompute for trait 10002 we got:
+
+```
+[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
+```
+
+so the hits are in range, but the LOD may be inflated because of the number of markers. Anyway, at this point we are merely concerned with scoring QTL. The first script is simply:
+
+```
+qtls = QTL::QRanges.new("10002","test")
+CSV.foreach(fn,headers: true, col_sep: "\t") do |hit|
+    qlocus = QTL::QLocus.new(hit["nodeid"],hit["chr"],hit["pos"].to_i,hit["af"].to_f,hit["lod"].to_f)
+    qtls.add_locus(qlocus)
+end
+print qtls
+```
+
+and prints a long list of QTL containing a single hit.
+
+```
+[10002,test] =>{"1"=>[#<QRange Chr1 𝚺1 3099543..3099543 AF=0.583 LOD=5.8..5.8>, #<QRange Chr1 𝚺1 65908328..65908328 AF=0.627 LOD=5.7..5.7>, #<QRange Chr1 𝚺1 81604902..81604902 AF=0.451 LOD=5.5..5.5>, #<QRange Chr1 𝚺2 85087169..85087177 AF=0.781 LOD=5.5..5.6>, #<QRange Chr1 𝚺1 93740525..93740525 AF=0.762 LOD=6.5..6.5>, #<QRange Chr1 𝚺1 114086053..114086053 AF=0.568 LOD=5.7..5.7>,...
+```
+
+For trait 10002 tweaking thresholds and rebinning we get
+
+```
+#<QRange Chr8 𝚺2 34.303454..35.675301 AF=0.571 LOD=5.7..5.8>
+#<QRange Chr8 𝚺621 91.752748..102.722635 AF=0.663 LOD=5.6..7.5>
+#<QRange Chr1 𝚺16 65.908328..175.232335 AF=0.781 LOD=5.6..7.0>
+#<QRange Chr4 𝚺5 56.498971..126.135422 AF=0.657 LOD=5.6..6.4>
+#<QRange Chr12 𝚺3 23.037869..58.306731 AF=0.643 LOD=5.8..6.2>
+#<QRange Chr10 𝚺2 13.442071..13.442088 AF=0.641 LOD=5.8..6.0>
+#<QRange Chr10 𝚺3 94.246536..103.438796 AF=0.608 LOD=5.9..6.2>
+#<QRange Chr3 𝚺2 47.644513..82.451061 AF=0.548 LOD=5.7..6.2>
+#<QRange Chr9 𝚺2 97.445077..120.263403 AF=0.717 LOD=5.8..5.8>
+#<QRange Chr11 𝚺2 27.4058..56.30011 AF=0.559 LOD=5.7..5.7>
+```
+
+with a LOD>5.5 cut-off. That seems justified because LOD scores are inflated. Compare this with the earlier mapping using 'traditional' genotypes:
+
+```
+[10002,LOCO] =>{
+"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>,
+      #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>],
+"8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
+```
+
+we can see the significance of chr8 has gone up with pangenome mapping (relative to chr1) and we find 2 QTL now on chr8, a new one to the left.  Chr1 looks similar. We have some other candidates that may or may not be relevant (all narrow!).
+
+Note this *is* a random trait(!) and suggests the landscape of QTLs will change pretty dramatically. Note also that Andrea will give new genotypes and smoothing to follow. But it is encouraging.
+
+I played a bit with the QTL output, and for now settled on tracking nodes that have a LOD>5.0. We drop QTL based on the following:
+
+```
+qtl.lod.max < 6.0 or (qtl.lod.max < 7.5 - qtl.snps.size/2)
+```
+
+I.e. a single SNP QTL has to have a LOD of 7.0. A 2-SNP QTL has to have a LOD of 6.5. This begets
+
+```
+[10002,test] =>{
+"1"=>[#<QRange Chr1 𝚺69 3.099543..192.718161 AF=0.781 LOD=5.1..7.0>],
+"4"=>[#<QRange Chr4 𝚺12 56.498971..147.86044 AF=0.676 LOD=5.1..6.4>],
+"8"=>[#<QRange Chr8 𝚺2774 34.303454..116.023702 AF=0.899 LOD=5.1..7.5>],
+"10"=>[#<QRange Chr10 𝚺7 82.334108..105.062097 AF=0.623 LOD=5.1..6.2>],
+"12"=>[#<QRange Chr12 𝚺9 21.707644..72.57041 AF=0.77 LOD=5.1..6.2>]}
+```
+
+which are all worth considering (I think). Obviously we could annotate all QTL in RDF triples and filter on that using SPARQL. But this makes processing a bit faster without having to deal with too much noise. We can fine tune later.
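+
+A small sketch of the drop rule as a standalone function, to check the thresholds. Here max_lod and n_snps stand in for qtl.lod.max and qtl.snps.size, and drop_qtl? is an illustrative name. One caveat, assuming snps.size is an Integer: Ruby's / truncates integers, so 1/2 is 0 and a single-SNP QTL would need 7.5. The 7.0 and 6.5 quoted above correspond to float division, which is what the sketch uses.
+
+```ruby
+# Sketch of the drop rule with explicit float division.
+# max_lod and n_snps stand in for qtl.lod.max and qtl.snps.size.
+def drop_qtl?(max_lod, n_snps)
+  max_lod < 6.0 || max_lod < 7.5 - n_snps / 2.0
+end
+
+p drop_qtl?(6.9, 1)  # single SNP below 7.0
+p drop_qtl?(7.0, 1)  # single SNP at 7.0
+p drop_qtl?(6.5, 2)  # two SNPs at 6.5
+p drop_qtl?(6.2, 9)  # wide QTL only needs the 6.0 floor
+```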
+
+Now two more steps to go:
+
+* [X] Fetch all mapped traits using SPARQL and write RDF
+* [X] Compare QTL between datasets and annotate new hits
+
+## Fetch all mapped traits
+
+```
+SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
+?traitid a gnt:mappedTrait;
+         gnt:run gn:test ;
+         gnt:traitId "10002" .
+?snp gnt:mappedSnp ?traitid ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lod ;
+        gnt:af ?af .
+?locus rdfs:label ?nodeid ;
+         gnt:chr ?chr ;
+         gnt:pos ?pos .
+} ORDER BY DESC(?lod)
+```
+
+The first step is to fetch this data. Let's try SPARQL over the web first.
+
+## Compare QTL sets
+
+The previous code I wrote to compare QTLs essentially walks the QTLs and annotates a new QTL if there is no overlap between the two sets. Again, this code is too convoluted:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/18e7a3ac8a11becba84325499116621ad095f28e/lib/qtlrange.rb#L190
+
+The principle is straightforward, however. The code for reading the SPARQL output for a trait is
+
+```
+  CSV.foreach(fn,headers: true, col_sep: "\t") do |hit|
+    trait_id = hit["traitid"] if not trait_id
+    lod = hit["lod"].to_f
+    if lod > 5.0 # set for pangenome input
+      qlocus = QTL::QLocus.new(hit["snp"],hit["chr"],hit["pos"].to_f/10**6,hit["af"].to_f,lod)
+      qtls.add_locus(qlocus)
+    end
+  end
+```
+
+So we can use SPARQL to build two sets on the fly and then run the diff.
+
+Actually, when thinking about this I realised it should not be too hard to do in SPARQL to find the 'new' QTL.
+
+```
+
+SELECT * WHERE {
+?traitid a gnt:mappedTrait ;
+            gnt:traitId "10002" .
+}
+http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d
+http://genenetwork.org/id/HK_trait_BXDPublish_10002_gemma_GWA_hk_assoc_txt
+http://genenetwork.org/id/GEMMAMapped_test_LOCO_BXDPublish_10002_gemma_GWA_82087f23
+```
+
+lists the three versions of compute for traits. To fetch all QTL for first mapping:
+
+```
+SELECT ?qtl ?lod ?chr ?start ?stop (count(?snp) as ?snps) WHERE {
+?traitid a gnt:mappedTrait ;
+  gnt:traitId "10002" .
+?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlStart ?start ;
+  gnt:qtlStop ?stop ;
+  gnt:qtlLOD ?lod .
+?qtl gnt:mappedSnp ?snp .
+}
+```
+
+gets 3 QTL. Now I did not store HK in RDF, but to show the filtering principle we can fetch two traits and compare QTL.
+The following gets two QTL from trait "10002" on CHR1 and holds that against that of trait "10079":
+
+```
+SELECT ?t ?s1 ?e1 ?t2 ?s2 ?e2 WHERE {
+  ?traitid a gnt:mappedTrait ;
+  gnt:traitId ?t .
+  ?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlStart ?s1 ;
+  gnt:qtlStop ?e1 .
+  {
+    SELECT * WHERE {
+      ?tid a gnt:mappedTrait ;
+      gnt:traitId "10079" ;
+      gnt:traitId ?t2 .
+      ?qtl2 gnt:mappedQTL ?tid ;
+      gnt:qtlChr ?chr ;
+      gnt:qtlStart ?s2 ;
+      gnt:qtlStop ?e2 .
+    }
+  }
+  FILTER (?t = "10002") .
+} LIMIT 10
+
+"10002",171.172,183.154,"10079",172.235,172.235
+"10002",72.2551,73.3771,"10079",172.235,172.235
+```
+
+Note we pivot on two traits and one chromosome, so we find all pairs.
+To say if a QTL is *new* or different we can add another FILTER
+
+```
+FILTER ((?s2 > ?s1 && ?e2 > ?e1) || (?s2 < ?s1 && ?e2 < ?e1)) .
+"t","s1","e1","t2","s2","e2"
+"10002",72.2551,73.3771,"10079",172.235,172.235
+```
+
+that says that this ?qtl2 does not overlap with ?qtl. I.e. here it is a new QTL!
+
+This new insight means we should store *all* QTL in RDF, including the single SNP ones, because it is easy to filter on them. Note that there may be a more elegant way to query traits pairwise. This is just the first thing that worked. It may need more tuning if there are more than two QTL on a chromosome. E.g. the comparison between 10002 and 10413 finds:
+
+```
+"t","s1","e1","t2","s2","e2"
+"10002",72.2551,73.3771,"10413",32.3113,42.4624
+"10002",171.172,183.154,"10413",171.04,171.041
+"10002",171.172,183.154,"10413",32.3113,42.4624
+"10002",72.2551,73.3771,"10413",171.04,171.041
+```
+
+I.e. it does find new QTL here and you still need to do a little set analysis. In words you should be able to "remove all overlapping QTL from a chromosome". Maybe we can filter the other way - select overlapping QTL and remove those from the result set.
+
+```
+BIND ((?s2 >= ?s1 && ?e2 <= ?e1) || (?s1 >= ?s2 && ?e1 <= ?e2) as ?overlap) .
+"10002",171.172,183.154,"10079",172.235,172.235,1
+"10002",72.2551,73.3771,"10079",172.235,172.235,0
+```
+
+now drop all ?t's that are overlapping. It appears to work with:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/show-qtls-two-traits.sparql
+
+I'll need to test it on the pangenome set.
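+
+For cross-checking the SPARQL results, the same set analysis can be sketched in Ruby with the standard interval test: two ranges overlap when each one starts before the other ends. This is an illustration, not the code in qtlrange.rb. overlap? and new_qtl are made-up names, and the start/stop values are the chr1 QTL listed above.
+
+```ruby
+# Two closed intervals overlap when each one starts before the other ends.
+def overlap?(s1, e1, s2, e2)
+  s1 <= e2 && s2 <= e1
+end
+
+# QTL of `ours` that overlap no QTL of `theirs` (same chromosome assumed).
+def new_qtl(ours, theirs)
+  ours.reject { |s1, e1| theirs.any? { |s2, e2| overlap?(s1, e1, s2, e2) } }
+end
+
+t10002 = [[72.2551, 73.3771], [171.172, 183.154]]
+t10079 = [[172.235, 172.235]]
+t10413 = [[32.3113, 42.4624], [171.04, 171.041]]
+
+p new_qtl(t10002, t10079)
+p new_qtl(t10002, t10413)
+```
+
+In SPARQL the equivalent test would be something like (?s1 <= ?e2 && ?s2 <= ?e1), which also catches partial overlaps, not just one QTL sitting inside the other.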
+
+# Listing QTL
+
+To get all QTL from a run you can use something like
+
+```
+SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
+  ?traitid a gnt:mappedTrait ;
+  gnt:traitId ?t .
+  MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
+  # ?traitid gnt:run gn:test . # use if you want the new QTL
+  ?qtl gnt:mappedQTL ?traitid ;
+  gnt:qtlChr ?chr ;
+  gnt:qtlLOD ?lod ;
+  gnt:qtlStart ?s ;
+  gnt:qtlStop ?e .
+  ?qtl gnt:mappedSnp ?snp .
+  FILTER (?t = "10002" && ?lod >= 5.0 ) .
+} LIMIT 100
+```
+
+Note we filter on a trait name and LOD score.
+
+For panQTL (gnt:run == gn:test) this results in
+
+```
+"t"	"lod"	"snps"	"chr"	"start"	"end"
+"10002"	6.4	3	"15"	87.671663	98.028911
+"10002"	6.4	12	"4"	56.498971	147.86044
+"10002"	7	69	"1"	3.099543	192.718161
+"10002"	7.5	2774	"8"	34.303454	116.023702
+"10002"	6.2	7	"10"	82.334108	105.062097
+"10002"	6.2	2	"3"	47.644513	82.451061
+"10002"	6.2	1	"3"	130.145235	130.145235
+"10002"	6	2	"10"	13.442071	13.442088
+"10002"	6.2	9	"12"	21.707644	72.57041
+```
+
+For the traditional genotypes (gnt:run != gn:test)
+
+```
+"t"	"lod"	"snps"	"chr"	"start"	"end"
+"10002"	5.3	91	"1"	171.172	183.154
+"10002"	5.1	15	"1"	72.2551	73.3771
+```
+
+
+# Listing SNPs
+
+Now we have all QTLs in the DB, as well as underlying SNPs, one interesting question to ask is what SNPs are repeated across our traits. This, if you remember, is the key idea of reversed genetics.
+Of course, with our pangenome-derived genotypes, we now have thousands of SNPs per trait. Let's see if we can rank them by number of traits.
+
+For our 1000 traits we map about 7.7M snps with a LOD>5
+
+
+# Using sparql from emacs
+
+Note: if you are doing SPARQL quite a bit, I recommend using sparql-mode in emacs! It is easy, faster and you can use git :)
+
+=> https://github.com/ljos/sparql-mode
+
+```
+M-x sparql-query-region [ENTER] http://sparql-test.genenetwork.org/sparql/ [ENTER]
+```
diff --git a/topics/genome-browser/hoot-genome-browser.gmi b/topics/genome-browser/hoot-genome-browser.gmi
new file mode 100644
index 0000000..219fda5
--- /dev/null
+++ b/topics/genome-browser/hoot-genome-browser.gmi
@@ -0,0 +1,21 @@
+# Hoot Genome Browser
+
+Together with Andrew we have created a genome browser that runs in WASM. Safari recently (202509) added critical hoot support, so we should have it in all important browsers now!
+
+With this task tracker we want to embed the existing browser in GN and add tracks for mapped QTL.
+
+# Tags
+
+* assigned: andrewt, pjotrp
+* priority: high
+* status: open, in progress
+* keywords: mapping
+
+# Tasks
+
+* [ ] Embed hoot browser in GN2 as a pilot
+*   + [ ] Guix package for JS and minimal JBrowse2?
+*   + [ ] Embedding code in GN2
+* [ ] Create two tracks for QTL comparisons - vector data available
+* [ ] Create BED file for matched QTL - use SPARQL live?
+* [ ] Annotated SNPs
diff --git a/issues/gn-uploader/samplelist-details.gmi b/topics/gn-uploader/samplelist-details.gmi
index 2e64d8a..2e64d8a 100644
--- a/issues/gn-uploader/samplelist-details.gmi
+++ b/topics/gn-uploader/samplelist-details.gmi
diff --git a/topics/guix/genenetwork-fixating-guix.gmi b/topics/guix/genenetwork-fixating-guix.gmi
new file mode 100644
index 0000000..844b0fd
--- /dev/null
+++ b/topics/guix/genenetwork-fixating-guix.gmi
@@ -0,0 +1,34 @@
+# Fixating Guix for GN
+
+The GeneNetwork services depend on a rather complicated Guix deployment. The problem is not guix, but GN itself :)
+But we were getting bitten by updates on upstream, as well as updates on our different targets/services.
+
+# Using channels that affect GN production
+
+To avoid duplication of work and unknown rabbit holes we decided to fixate guix trunk and other dependencies by using Guix channels. This means all GN development happens on a single version of Guix! That version is defined here:
+
+=> https://git.genenetwork.org/gn-machines/tree/.guix-channel
+
+Note that guix-forge and guix-bioinformatics are *also* fixated. The idea is that we only upgrade GN packages in gn-machines itself by inheriting definitions. E.g.
+
+=> https://git.genenetwork.org/gn-machines/tree/guix/gn-machines/genenetwork.scm
+
+We will probably get rid of the guix-past and guix-rust-past-crates sub-channels soon by removing those packages that depend on those (genenetwork1 will get its own tree, and @alexm will upgrade the rust packages).
+
+If someone wants to update guix channel or guix-bioinformatics channel they should not update this file. The one in charge is @fredm. Fred has to be in control because we don't want to break production. It is forbidden to touch this channel file.
+
+People can patch the packages and gn-machines, but if it involves CI/CD and/or production in any way, Fred will have to know about it.
+
+# Service level channels
+
+For individual services, such as genenetwork2, genenetwork3, gn-auth, etc., we have local channel files. These should mirror above gn-machines channel file to make sure we can migrate your code easily. E.g.
+
+=> https://github.com/genenetwork/genenetwork3/blob/main/.guix-channel
+
+Should match
+
+=> https://git.genenetwork.org/gn-machines/tree/.guix-channel
+
+If that is not the case we have a major problem! So before sending patches to Fred make sure the channels match.
+
+To be honest, I think we should fetch these channels automagically from gn-machines as a first step.
diff --git a/topics/guix/guix-profiles.gmi b/topics/guix/guix-profiles.gmi
index 8cf41d8..15f7683 100644
--- a/topics/guix/guix-profiles.gmi
+++ b/topics/guix/guix-profiles.gmi
@@ -15,9 +15,9 @@ Note that a recently tested channel can be fetched from cd.genenetwork.org. That
 Alternatively put the following into a channels.scm file.
 ```
 (list (channel
-       (name 'gn-bioinformatics)
-       (url "https://git.genenetwork.org/guix-bioinformatics")
-       (branch "master")))
+       (name 'gn-machines)
+       (url "https://git.genenetwork.org/gn-machines")
+       (branch "main")))
 ```
 Build a profile using
 ```
@@ -55,23 +55,6 @@ And everything should be in the PATH, PYTHONPATH etc.
 
 From time to time, updates to upstream Guix break our guix-bioinformatics channel. As a result, `guix pull` breaks and our work comes to a standstill for a few days until this can be fixed. While it is important to gradually move towards the latest and greatest, we would like to avoid being at the mercy of externalities and would prefer to update in a more controlled way without interrupting everyone's work.
 
-To this end, we hard-code the guix-bioinformatics channel to depend on a *specific* commit of upstream Guix that is tied to guix-bioinformatics, for example:
+To this end, we hard-code the guix-bioinformatics channel to depend on a specific commit of upstream Guix that is tied to guix-bioinformatics. This is why the recommended channels.scm file above does not include a %default-guix-channel. However, this comes with the drawback that your entire system will be stuck at that specific commit of upstream Guix (but not if you use another profile as described above). We highly recommend using a separate `guix pull` profile specifically for GeneNetwork work, as described above.
 
-```
-(list (channel
-        (name 'gn-bioinformatics)
-        (url "https://gitlab.com/genenetwork/guix-bioinformatics")
-        (branch "master")
-        (commit
-          "feff05b47c305d8c944499fbc00fd2126f2b881d")))
-```
-
-This is why the recommended channels.scm file above does not include a %default-guix-channel. However, this comes with the drawback that your entire system will be stuck at that specific commit of upstream Guix (but not if you use another profile as described above). We highly recommend using a separate `guix pull` profile specifically for GeneNetwork work, as described above.
-
-This scheme also comes with the added bonus that all members on the team and the continuous integration system will be using exactly the same Guix. Above channels.scm file is only exposed on a *succesful* build.
-
-## Notes
-
-We recently had to switch to gitlab because our git server went down on Penguin2. We may move to a cgit solution soon, see
-
-=> ../issues/cant-use-guix-bioinformatics-with-guix-pull.gmi
+This scheme also comes with the added bonus that all members on the team and the continuous integration system will be using exactly the same Guix.
\ No newline at end of file
diff --git a/topics/guix/packages.gmi b/topics/guix/packages.gmi
index a52f49b..b4a393c 100644
--- a/topics/guix/packages.gmi
+++ b/topics/guix/packages.gmi
@@ -2,15 +2,27 @@
 
 To deploy GN we have packages in Guix itself (that comes with a distribution), in guix-bioinformatics and in guix-past (for older packages).
 
+When you develop a new package it is best to run against a recent version of guix. Note that with GeneNetwork this is different as we 'fixate' guix at an older version. See
+
+=> genenetwork-fixating-guix
+
 Typically run a guix pull to get the latest guix:
 
 ```
 mkdir -p ~/opt
-guix package -i guix -p ~/opt/guix
+guix pull -p ~/opt/guix-pull
+unset GUIX_PROFILE # for Debian
+source ~/opt/guix-pull/etc/profile
 ```
 
 and checkout guix-past and guix-bioinformatics using git.
 
+Note that a pull from Codeberg may be faster (Guix recently moved main development to Codeberg):
+
+```
+guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix
+```
+
 Now Guix should be happy with
 
 ```
@@ -20,7 +32,7 @@ genenetwork2            3.11-2.1328932  out     /home/wrk/guix-bioinformatics/gn
 genenetwork3            0.1.0-2.e781996 out     /home/wrk/guix-bioinformatics/gn/packages/genenetwork.scm:107:4
 ```
 
-and we can try building
+Note that using the -L switch is a bit of a hack; normally we use channels. We can try building
 
 ```
 ~/opt/guix-pull/bin/guix build -L ~/guix-bioinformatics/ -L ~/guix-past/modules/ genenetwork2
diff --git a/topics/hpc/octopus/slurm-user-guide.gmi b/topics/hpc/octopus/slurm-user-guide.gmi
index f7ea6d4..d0a3cc4 100644
--- a/topics/hpc/octopus/slurm-user-guide.gmi
+++ b/topics/hpc/octopus/slurm-user-guide.gmi
@@ -37,7 +37,6 @@ To get a shell prompt on one of the nodes (useful for testing your environment)
 srun -N 1 --mem=32G --pty /bin/bash
 ```
 
-
 # Differences
 
 ## Guix (look ma, no modules)
diff --git a/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi b/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi
new file mode 100644
index 0000000..8e5332a
--- /dev/null
+++ b/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi
@@ -0,0 +1,84 @@
+# R/qtl2 LMDB Adapter
+## Tags
+
+* assigned: alexm
+* priority: medium
+* type: feature, documentation
+* status: WIP
+* keywords: rqtl2, lmdb, adapter, cross
+
+## Description
+We want to add support for reading crosses from LMDB.
+Currently, R/qtl2 (https://kbroman.org/qtl2/) only supports reading from CSV files.
+
+## Tasks
+
+* [x] Dump genotypes to LMDB
+* [x] Dump cross metadata to LMDB
+* [-] Create a `read_lmdb_cross` adapter
+* [ ] Dump phenotypes to LMDB
+
+## Using the Adapter
+
+### Dumping the Genotypes
+You can find the `lmdb_matrix.py` script here:
+
+=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/lmdb_matrix.py
+
+```sh
+guix shell python-click python-lmdb python-wrapper python-numpy -- \
+     python lmdb_matrix.py import-genotype \
+     <path-to-genotype-file> <path-to-lmdb-store>
+```
+
+### Dumping the Cross Metadata
+
+The script can be found here:
+=> https://github.com/genenetwork/genenetwork3/pull/235/files   # lmdb_cross_metadata.py
+
+You need to provide a cross file path. The currently supported formats are JSON and YAML.
+
+Example:
+
+```sh
+guix shell python-click python-lmdb python-wrapper python-pyyaml -- \
+     python dump_metadata.py dump-cross [LMDB_PATH] [CROSS_FILE_PATH] --file-format yaml/json
+
+# Example
+
+python dump_metadata.py dump-cross "./test_lmdb_data" "./cross_file.json"
+```
+
+### Running the R/qtl2 LMDB Adapter Script
+
+The script `rqtl_lmdb_adapter.r` can be found here:
+=> https://github.com/genenetwork/genenetwork3/pull/235/files   # rqtl_lmdb_adapter.r
+
+```sh
+guix shell r r-thor r-rjson r-qtl2 -- \
+     Rscript [PATH_TO_ADAPTER_SCRIPT] [LMDB_PATH]
+
+# Example, assuming rqtl_lmdb_adapter.r has been saved to the current directory
+Rscript ./rqtl_lmdb_adapter.r ./lmdb_path
+```
+### Example: using the adapter with R/qtl2
+
+```r
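+library(qtl2)
+# read_lmdb_cross() comes from rqtl_lmdb_adapter.r; LMDB_DB_PATH is the path to the LMDB store
+cross <- read_lmdb_cross(LMDB_DB_PATH)
+summary(cross)
+cat("Is this cross okay?", check_cross2(cross), "\n")
+warnings() # print any warnings raised so far (for debugging only)
+pr <- calc_genoprob(cross) # genotype probabilities
+out <- scan1(pr, cross$pheno, cores=4) # genome scan
+par(mar=c(5.1, 4.1, 1.1, 1.1))
+ymx <- maxlod(out) # maximum LOD score across the scan
+plot(out, cross$gmap, lodcolumn=1, col="slateblue") # test generating QTL plots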
+```
+
+
+### References
+=> https://kbroman.org/qtl2/assets/vignettes/developer_guide.html
+
diff --git a/topics/meetings/gn-kilifi-2025-standup.gmi b/topics/meetings/gn-kilifi-2025-standup.gmi
new file mode 100644
index 0000000..c20e847
--- /dev/null
+++ b/topics/meetings/gn-kilifi-2025-standup.gmi
@@ -0,0 +1,177 @@
+# Stand-up Notes
+
+## 25-8-2025
+### Felix
+* Completed first draft of the abstract
+* HS genotyping; polishing founders vcf file:
+* - sorting chromosome order
+* - remove duplicate markers
+
+### Johannes
+* Extending the rag system, make it more of an agent:
+* - fixing bugs
+* - reduce runtime
+
+### Bonface
+* CD went down!
+* revert it to last commit
+* set it up locally (not very easy)
+* coordinate with Fred and Aarun for help
+
+## 27-8-2025
+### Felix
+* updating the abstract: meeting with Kauthar, for more tips and insight
+* HS rats genotyping
+* still no significant output
+
+### Bonface
+* fixed GN3, GN guile, and GN auth
+* reproduce GN2 error, not yet fixed though
+
+### Johannes
+* LLM transformer is taking too long to run
+* change input documentations to observe results
+
+## 29-8-2025
+### Felix
+* Still on the abstract
+* Errors with hs genotyping script
+
+### Johannes
+*  Still on the rag system: experimenting with documents as input
+
+### Bonface
+* CD down again, tux02 is the culprit
+* Running gn2 outside container in meantime
+
+## 1-9-2025
+### Felix
+* improving the smoothing scripts for hs: building a snakemake pipeline; redrafting the abstract introductory part
+
+### Bonface
+* looked on deployment to fix...,
+* install gpu drivers for Penguin2 with Alex
+* it has gpu drivers tesla k8, but hardware installation is important for the meantime
+
+### Alex
+* rqtl2/lmdb runs and rebooted
+* dumping bxd phenotype dataset for testing
+
+### Johannes
+* no big updates
+* literature review on rag systems and how they work
+
+## 3-9-2025
+### Felix
+* Improved the inferring script
+* Abstract on hold first
+
+### Johannes
+* fix documents using agents: the issue is parallelization
+* reviewing fahamu AI (the python part)
+
+### Bonface
+* Gn2 finally runs
+* Documenting the progress and steps
+* Gn-auth is broken; Fred to solve it
+* Updated tesla k8 gpu drivers
+
+
+### Alex
+* Rqtl-wrapper for BXD is done
+* focusing on cross-information;lmdb
+
+## 5-9-2025
+### Felix
+* managed to generate haplotype blocks for hs rats
+
+### Johannes
+* RAG experimentation script: balg01, running slow; prepare for msc defense
+
+### Alex
+* dumping phenotypes to lmdb
+
+### Bonface
+* install drivers on tesla k8
+* review Johannes scripts
+* fixing gn2
+* review Alex's patches (pull requests via email)
+
+* AOB;
+* API tokens for RAG and coding
+
+## 10-9-2025
+### Felix
+* Managed to run gemma and generate plots for the original hs genotype data and hs smoothed genotype data
+* Still working on fine-tuning the statistical metrics to suit the plots
+
+### Bonface
+* All the GN machines are up and running, except GN Auth
+* Sent documentation to Pjotr on the progress with troubleshooting GN machines
+
+### Alex
+* writing queries to move phenotypes matrices from gn2 db to lmdb
+
+### Johannes
+* defense
+
+## 17-9-2025
+### Felix
+* Completed abstract writing
+* Working on generating genotype files for hs in gn2 format
+
+### Bonface
+* Gn machines are up and running
+* Still some issues with manually restarting the container
+* Worked with Alex to set up an external bootable ssd for Johannes
+
+### Alex
+* Succeeded dumping phenotypes to lmdb
+* Been reading literature around phenotypes in gn2
+* Assisted Bonz in setting up external bootable ssd for Johannes
+
+### Johannes
+* Managed to get access to bux01 server for the llms work
+
+## 19-09-2025
+### Felix
+* Finalized HS genotypes/haplotypes
+
+### Bonface
+* GN machines are running on CD
+* troubleshooting assisted by Aaron
+
+### Johannes
+* Experimenting with RAG systems
+
+### Alex
+* Writing LMDB scripts to dump phenotypes to LMDB from the common gn DB
+
+## [Review] MONTH OCTOBER, 2025
+*** Metric wise, plans and goals to achieve
+
+### Bonface
+Thoughts: Good start after relocation.
+
+* CD/CI: Make sure that tests/infra is super stable
+* Look at suggested forge/guix-bioinformatics upgrades that affect prod
+* Support work for GNQA: sane gn-qna(fahamu)/AI deploys; Review Johannes' work; Help with tuning models; play catch up (cover up knowledge gap); figure out how to compare the different models.
+* Follow up on my phd
+
+### Alex
+Thoughts: very productive
+
+* Rqtl2 lmdb adapter running on production; depends on traction with Rqtl2 upstream by Karl Broman to gn-bioinformatics
+* Collaborate with Felix on his scripts
+
+### Johannes
+
+Thoughts: Barely productive, looking forward to more improvement
+
+* Experimenting on the qtl data with the AI system being currently built; also in need of optimization
+* Comparing LLMs/GNQA agents; getting assistance from Shelby's work
+
+### Felix
+* HS genotypes to be fully supported in GN2
+* Poster presentation
+* PhD: ML objective (ML fundamentals); Complete 1st manuscript draft; Process my student pass
diff --git a/topics/octopus/lizardfs/README.gmi b/topics/octopus/lizardfs/lizard-maintenance.gmi
index 7c91136..a34ef3e 100644
--- a/topics/octopus/lizardfs/README.gmi
+++ b/topics/octopus/lizardfs/lizard-maintenance.gmi
@@ -1,4 +1,4 @@
-# Information about lizardfs, and some usage suggestions
+# Lizard maintenance
 
 On the octopus cluster the lizardfs head node is on octopus01, with disks being added mainly from the other nodes. SSDs are added to the lizardfs-chunkserver.service systemd service and SDDs added to the lizardfs-chunkserver-hdd.service. The storage pool is available on all nodes at /lizardfs, with the default storage option of "slow", which corresponds to two copies of the data, both on SDDs.
 
@@ -73,6 +73,17 @@ Chunks deletion state:
         2ssd    7984    -       -       -       -       -       -       -       -       -       -
 ```
 
+This table essentially says that slow and fast are replicating data (if they are in column 0 it is OK!). This looks good for fast:
+
+```
+Chunks replication state:
+        Goal    0       1       2       3       4       5       6       7       8       9       10+
+        slow    -       137461  448977  -       -       -       -       -       -       -       -
+        fast    6133152 -       5       -       -       -       -       -       -       -       -
+```
+
 To query how the individual disks are filling up and if there are any errors:
 
 List all disks
@@ -83,6 +94,42 @@ lizardfs-admin list-disks octopus01 9421 | less
 
 Other commands can be found with `man lizardfs-admin`.
 
+## Info
+
+```
+lizardfs-admin info octopus01 9421
+LizardFS v3.12.0
+Memory usage:   2.5GiB
+
+Total space:    250TiB
+Available space:        10TiB
+Trash space:    510GiB
+Trash files:    188
+Reserved space: 21GiB
+Reserved files: 18
+FS objects:     7369883
+Directories:    378782
+Files:  6858803
+Chunks: 9100088
+Chunk copies:   20017964
+Regular copies (deprecated):    20017964
+```
+
+```
+lizardfs-admin chunks-health  octopus01 9421
+Chunks availability state:
+        Goal    Safe    Unsafe  Lost
+        slow    1323220 1       -
+        fast    6398524 -       5
+
+Chunks replication state:
+        Goal    0       1       2       3       4       5       6       7       8       9       10+
+        slow    -       218663  1104558 -       -       -       -       -       -       -       -
+        fast    6398524 -       5       -       -       -       -       -       -       -       -
+
+Chunks deletion state:
+        Goal    0       1       2       3       4       5       6       7       8       9       10+
+        slow    -       104855  554911  203583  76228   39425   19348   8659    3276    20077   292859
+        fast    6380439 18060   30      -       -       -       -       -       -       -       -
+```
 
 ## Deleted files
 
@@ -188,3 +235,54 @@ KeyringMode=inherit
 [Install]
 WantedBy=multi-user.target
 ```
+
+# To deplete and remove a drive in LizardFS
+
+**1. Mark the chunkserver (or specific disk) for removal**
+
+Edit the chunkserver's disk configuration file (typically `/etc/lizardfs/mfshdd.cfg`) and prefix the drive path with an asterisk:
+
+```
+*/mnt/disk_to_remove
+```
+
+**2. Restart the chunkserver process on the node**
+
+```bash
+systemctl stop lizardfs-chunkserver
+systemctl start lizardfs-chunkserver
+```
+
+**3. Monitor the evacuation progress**
+
+The master will begin migrating chunks off the marked drive. You can monitor progress with:
+
+```bash
+lizardfs-admin list-disks octopus01 9421
+lizardfs-admin list-disks octopus01 9421|grep 172.23.19.59 -A 7
+172.23.19.59:9422:/mnt/sdc/lizardfs_vol/
+        to delete: yes
+        damaged: no
+        scanning: no
+        last error: no errors
+        total space: 3.6TiB
+        used space: 3.4TiB
+        chunks: 277k
+```
+
+Look for the disk marked "to delete: yes". Its "chunks" count should decrease over time as data is replicated elsewhere.
+
+You can also check the CGI web interface if you have it running—it shows disk status and chunk counts.
+
+**4. Remove the drive once empty**
+
+Once all chunks have been evacuated (the disk shows 0 chunks or is marked as empty), you can safely:
+
+1. Remove the line from `mfshdd.cfg` entirely
+2. Reload the configuration again
+3. Physically remove or repurpose the drive
+
+**Important notes:**
+- Ensure you have enough free space on other disks to absorb the migrating chunks
+- The evacuation time depends on the amount of data and network/disk speed
+- Don't forcibly remove a drive before evacuation completes, or you risk data loss if replication goals aren't met
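
The monitoring step above can also be scripted. The following is a minimal Python sketch, not part of the patch, that parses text in the format of the `lizardfs-admin list-disks` output quoted above and lists disks that are marked for deletion but still hold chunks. The function names `parse_list_disks` and `pending_evacuation` are hypothetical. The sketch assumes that every disk entry starts with an unindented `host:port:/path` line followed by indented `key: value` lines, and that an emptied disk reports `chunks: 0`.

```python
# Sample text copied from the list-disks output quoted in the document.
SAMPLE = """\
172.23.19.59:9422:/mnt/sdc/lizardfs_vol/
        to delete: yes
        damaged: no
        scanning: no
        last error: no errors
        total space: 3.6TiB
        used space: 3.4TiB
        chunks: 277k
"""


def parse_list_disks(text):
    """Parse list-disks style output into a list of dicts (assumed format)."""
    disks = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new disk entry: host:port:/path
            disks.append({"disk": line.strip()})
        elif disks and ":" in line:
            # Indented lines are "key: value" attributes of the current disk
            key, _, value = line.strip().partition(":")
            disks[-1][key.strip()] = value.strip()
    return disks


def pending_evacuation(disks):
    """Return disks marked for deletion that still report chunks."""
    return [d for d in disks
            if d.get("to delete") == "yes" and d.get("chunks") not in (None, "0")]


if __name__ == "__main__":
    for d in pending_evacuation(parse_list_disks(SAMPLE)):
        print(d["disk"], "still holds", d["chunks"], "chunks")
```

In practice the text could be read from `lizardfs-admin list-disks octopus01 9421` on standard input instead of the embedded sample.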
diff --git a/topics/octopus/maintenance.gmi b/topics/octopus/maintenance.gmi
index 65ea52e..00cc575 100644
--- a/topics/octopus/maintenance.gmi
+++ b/topics/octopus/maintenance.gmi
@@ -11,7 +11,7 @@ octopus02
 - Devices: 2 3.7T SSDs + 2 894.3G SSDs + 2 4.6T HDDs
 - **Status: Slurm not OK, LizardFS not OK**
 - Notes:
-  - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`, 
+  - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`,
   - **I don't see 2 drives that are physically mounted**
 
 octopus03
@@ -21,7 +21,7 @@ octopus03
 
 octopus04
 - Devices: 4 7.3 T SSDs (Neil) + 1 4.6T HDD + 1 3.7T SSD + 2 894.3G SSDs
-- Status: Slurm NO, LizardFS OK (we don't share the HDD) 
+- Status: Slurm NO, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus05
@@ -31,7 +31,7 @@ octopus05
 
 octopus06
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus07
@@ -41,17 +41,17 @@ octopus07
 
 octopus08
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus09
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus10
 - Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: **I don't see 1 device that is physically mounted**
 
 octopus11
diff --git a/topics/octopus/moosefs/moosefs-maintenance.gmi b/topics/octopus/moosefs/moosefs-maintenance.gmi
new file mode 100644
index 0000000..1032cde
--- /dev/null
+++ b/topics/octopus/moosefs/moosefs-maintenance.gmi
@@ -0,0 +1,252 @@
+# Moosefs
+
+We use moosefs as a network distributed storage system with redundancy. The setup is to use SSDs for fast access and spinning storage for redundancy/backups (in turn these are in RAID5 configuration). In addition we'll experiment with a non-redundant fast storage access using the fastest drives and network connections.
+
+# Configuration
+
+## Ports
+
+We should use different ports than lizard. Lizard uses 9419-9424 by default, so let's use ports 9519-9524.
+
+* 9519 for moose meta logger
+* 9520 for chunk server connections
+* 9521 for mount connections
+* 9522 for slow HDD chunks (HDD)
+* 9523 for replicating SSD chunks (SSD)
+* 9524 for fast non-redundant SSD chunks (FAST)
+
+## Topology
+
+Moosefs uses topology to decide where to fetch data. We can host the slow spinning HDD drives in a 'distant' location, so that data is fetched last.
+
+## Disks
+
+Some disks are slower than others. To test we can do:
+
+```
+root@octopus03:/export# dd if=/dev/zero of=test1.img bs=1G count=1
+1+0 records in
+1+0 records out
+1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.20529 s, 487 MB/s
+/sbin/sysctl -w vm.drop_caches=3
+root@octopus03:/export#  dd if=test1.img of=/dev/null bs=1G count=1
+1+0 records in
+1+0 records out
+1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.649035 s, 1.7 GB/s
+rm test1.img
+```
+
+Above is on a RAID5 setup. Typical values are:
+
+```
+                       Write         Read
+Octopus Dell NVME      1.2 GB/s      2.0 GB/s
+Octopus03 RAID5        487 MB/s      1.7 GB/s
+Octopus01 RAID5        127 MB/s      163 MB/s
+Samsung SSD 870        408 MB/s      565 MB/s
+```
+
+```
+mfs#octopus03:9521   3.7T  4.0G  3.7T   1% /moosefs-fast
+```
+
+## Command line
+
+```
+. /usr/local/guix-profiles/moosefs/etc/profile
+mfscli -H octopus03 -P 9521 -SCS
+```
+
+## Config
+
+```
+root@octopus03:/etc/mfs# diff example/mfsexports.cfg.sample mfsexports.cfg
+2c2,4
+< *                     /       rw,alldirs,admin,maproot=0:0
+---
+> 172.23.21.0/24                       /       rw,alldirs,maproot=0,ignoregid
+> 172.23.22.0/24                       /       rw,alldirs,maproot=0,ignoregid
+> 172.23.17.0/24                       /       rw,alldirs,maproot=0,ignoregid
+```
+
+```
+root@octopus03:/etc/mfs# diff example/mfsmaster.cfg.sample mfsmaster.cfg
+4a5,10
+> ## Only one metadata server in LizardFS shall have 'master' personality.
+> PERSONALITY = master
+>
+> ## Password for administrative connections and commands.
+> ADMIN_PASSWORD = nolizard
+>
+6c12
+< # WORKING_USER = nobody
+---
+> WORKING_USER = mfs
+9c15
+< # WORKING_GROUP =
+---
+> WORKING_GROUP = mfs
+27c33
+< # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs
+---
+> DATA_PATH = /export/var/lib/mfs
+34c40
+< # EXPORTS_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfsexports.cfg
+---
+> EXPORTS_FILENAME = /etc/mfs/mfsexports.cfg
+87c93
+< # MATOML_LISTEN_PORT = 9419
+---
+> MATOML_LISTEN_PORT = 9519
+103c109
+< # MATOCS_LISTEN_PORT = 9420
+---
+> MATOCS_LISTEN_PORT = 9520
+219c225
+< # MATOCL_LISTEN_PORT = 9421
+---
+> MATOCL_LISTEN_PORT = 9521
+```
+
+```
+root@octopus03:/etc/mfs# cat mfsgoals.cfg
+# safe - 2 copies, 1 on slow disk, 1 on fast disk
+11 slow: HDD SSD
+
+# Fast storage - 1 copy on fast disks, no redundancy
+12 fast: FAST
+```
+
+```
++++ b/mfs/mfschunkserver-fast.cfg
+ # user to run daemon as (default is nobody)
+-# WORKING_USER = nobody
++WORKING_USER = mfs
+
+ # group to run daemon as (optional - if empty then default user group will be used)
+-# WORKING_GROUP =
++WORKING_GROUP = mfs
+
+ # name of process to place in syslog messages (default is mfschunkserver)
+ # SYSLOG_IDENT = mfschunkserver
+@@ -28,6 +28,7 @@
+
+ # where to store daemon lock file (default is /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs)
+ # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs
++DATA_PATH=/var/lib/mfs
+
+ # when set to one chunkserver will not abort start even when incorrect entries are found in 'mfshdd.cfg' file
+ # ALLOW_STARTING_WITH_INVALID_DISKS = 0
+@@ -41,6 +42,7 @@
+
+ # alternate location/name of mfshdd.cfg file (default is /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg); this
+file will be re-read on each process reload, regardless if the path was changed
+ # HDD_CONF_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg
++HDD_CONF_FILENAME = /etc/mfs/mfsdisk-fast.cfg
+
+ # speed of background chunk tests in MB/s per disk (formally entry defined in mfshdd.cfg). Value can be given as a decimal number (default is
+1.0)
+ # deprecates: HDD_TEST_FREQ (if HDD_TEST_SPEED is not defined, but there is redefined HDD_TEST_FREQ, then HDD_TEST_SPEED = 10 / HDD_TEST_FREQ)
+@@ -109,10 +111,10 @@
+ # BIND_HOST = *
+
+ # MooseFS master host, IP is allowed only in single-master installations (default is mfsmaster)
+-# MASTER_HOST = mfsmaster
++MASTER_HOST = octopus03
+
+ # MooseFS master command port (default is 9420)
+-# MASTER_PORT = 9420
++MASTER_PORT = 9520
+
+ # timeout in seconds for master connections. Value >0 forces given timeout, but when value is 0 then CS asks master for timeout (default is 0
+- ask master)
+ # MASTER_TIMEOUT = 0
+@@ -134,5 +136,5 @@
+ # CSSERV_LISTEN_HOST = *
+
+ # port to listen for client (mount) connections (default is 9422)
+-# CSSERV_LISTEN_PORT = 9422
++CSSERV_LISTEN_PORT = 9524
+```
+
+```
++++ b/mfs/mfsmount.cfg
+mfsmaster=octopus03,nosuid,nodev,noatime,nosuid,mfscachemode=AUTO,mfstimeout=30,mfswritecachesize=2048,mfsreadaheadsize=2048,mfsport=9521
+/moosefs-fast
+```
+
+## systemd
+
+
+```
+root@octopus03:/etc# cat systemd/system/moosefs-master.service
+[Unit]
+Description=MooseFS master server daemon
+Documentation=man:mfsmaster
+After=network.target
+Wants=network-online.target
+
+[Service]
+Type=forking
+TimeoutSec=0
+ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -d start -c /etc/mfs/mfsmaster.cfg -x
+ExecStop=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg stop
+ExecStop=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg reload
+ExecReload=/bin/kill -HUP $MAINPID
+User=mfs
+Group=mfs
+Restart=on-failure
+RestartSec=60
+OOMScoreAdjust=-999
+
+[Install]
+WantedBy=multi-user.target
+```
+
+```
+ cat systemd/system/moosefs-mount.service
+[Unit]
+Description=Moosefs mounts
+After=syslog.target network.target
+
+[Service]
+Type=forking
+TimeoutSec=600
+ExecStart=/usr/local/guix-profiles/moosefs/bin/mfsmount -c /etc/mfs/mfsmount.cfg
+ExecStop=/usr/bin/umount /moosefs-fast
+
+[Install]
+WantedBy=multi-user.target
+```
+
+```
+root@octopus04:/etc# cat systemd/system/moosefs-chunkserver-fast.service
+[Unit]
+Description=MooseFS Chunkserver (Fast)
+After=network.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfschunkserver -f -c /etc/mfs/mfschunkserver-fast.cfg
+User=mfs
+Group=mfs
+Restart=on-failure
+RestartSec=5
+LimitNOFILE=65535
+
+[Install]
+WantedBy=multi-user.target
+```
+
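
A quick way to confirm that the master is listening on the ports chosen in the Ports section is a TCP connect test. The sketch below is an illustration in Python under stated assumptions, not part of the patch: the helper names `port_open` and `check_master` are hypothetical, only the three master ports (9519 to 9521) are probed, and a successful connect only shows that something is listening there, not that it is MooseFS.

```python
import socket
import sys

# Master ports taken from the Ports section of the document.
MOOSEFS_PORTS = {
    9519: "meta logger",
    9520: "chunk server connections",
    9521: "mount connections",
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


def check_master(host, ports=MOOSEFS_PORTS):
    """Map each port to True/False depending on whether it accepts connections."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    # Host defaults to localhost; pass e.g. octopus03 as the first argument
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for port, ok in check_master(host).items():
        print(port, MOOSEFS_PORTS[port], "open" if ok else "CLOSED")
```

Run on any node as `python check_ports.py octopus03` (the script file name is an assumption); a port reported as CLOSED points at a daemon that is not running or at a firewall rule.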
diff --git a/topics/octopus/octopussy-needs-love.gmi b/topics/octopus/octopussy-needs-love.gmi
new file mode 100644
index 0000000..8c6315d
--- /dev/null
+++ b/topics/octopus/octopussy-needs-love.gmi
@@ -0,0 +1,266 @@
+# Octopussy needs love
+
+At UTHSC, Memphis, TN, around October 2020 Efraim and I installed Octopus on Debian+Guix with lizard as a distributed network storage system and slurm for job control. Around October 2023 we added 5 genoa tux05-09 machines, doubling the cluster in size. See
+
+=> https://genenetwork.org/gn-docs/facilities
+
+Octopus made a lot of work possible we can't really do on larger HPCs and led to a bunch of high impact studies and publications, particularly on pangenomics.
+
+In the coming period we want to replace lizard with moosefs. Lizard is no longer maintained and as it was a fork of Moose, it is only logical to go forward on that one. We also looked at Ceph, but apparently Ceph is not great for systems that carry no redundancy. So far, lizard has been using redundancy, but we figure we can do without if the occasional (cheap) SSD goes bad.
+
+We also need to look at upgrading some of the Dell BIOS - particularly tux05-09 - as they can be occasionally problematic with non-OEM SSDs.
+
+On the worker nodes it may be wise to upgrade Debian. Followed by an upgrade to the head nodes and other supporting machines. Even though we rely on Guix for latest and greatest, there may be good upgrades in the underlying Linux kernel and drivers.
+
+With our Slurm PBS we are up-to-date because we run it completely on Guix and Arun supports the latest and greatest.
+
+Another thing we ought to fix is to introduce centralized user management. So far we have had few users and just got by. But sometimes it bites us that users have different UIDs on the nodes.
+
+## Architecture overview
+
+* O1 is the old head node hosting lizardfs - will move to compute
+* O2 is the old backup hosting the lizardfs shadow - will move to compute
+* O3 is the new head node hosting moosefs
+* O4 is the backup head node hosting moosefs shadow - will act as a compute node too
+
+All the other nodes are for compute. O1 and O4 will be the last nodes to remain on older Debian. They will handle the last bits of lizard.
+
+# Tasks
+
+* [X] Create moosefs package
+* [X] Install moosefs
+* [X] Upgrade bios (all tuxes)
+* [ ] Migrate lizardfs nodes to moosefs (one at a time)
+* [ ] Add server monitoring with sheepdog
+* [ ] Upgrade Debian
+* - [ ] Maybe, just maybe, boot the nodes from a central server
+* [ ] Introduce centralized user management
+
+# Progress
+
+## Lizardfs and Moosefs
+
+Our Lizard documentation lives at
+
+=> lizardfs/README
+
+Efraim wrote a lizardfs package for Guix at the time in guix-bioinformatics, but we ended up deploying with Debian. Going back now, the package does not look too taxing (I think we dropped it because the Guix system configuration did not play well).
+
+=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/file-systems.scm
+
+Looking at the Debian package
+
+=> https://salsa.debian.org/debian/moosefs
+
+It carries no special patches, but a few nice hints in *.README.debian. I think it is worth trying to write a Guix package so we can easily upgrade (even on an aging Debian). Future proofing is key.
+
+The following built moosefs in a guix shell:
+
+```
+guix shell -C -D -F coreutils make autoconf automake fuse libpcap zlib pkg-config python libtool gcc-toolchain
+autoreconf -f -i
+make
+```
+
+Next I created a guix package that installs with:
+
+```
+guix build -L ~/guix-bioinformatics -L ~/guix-past/modules moosefs
+```
+
+See
+
+=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=236903baaab0f84f012a55700c1917265a2b701c
+
+Next stop: testing and deploying!
+
+## Choosing a head node
+
+Currently octopus01 is the head node. It probably is a good idea to change that, so we can safely upgrade the new server. The first choice would be octopus02 (o2). We can mirror the moose daemons on octopus01 (o1) later. Let's see what that looks like.
+
+A quick assessment of o1 shows that we have 14T storage on o1 that takes care of /home and /gnu. But only 1.2T is used.
+
+o2 also has quite a few disks (up 1417 days!), but a bunch of SSDs appear to error out. E.g.
+
+```
+Sep 04 07:44:56 octopus02 mfschunkserver[22766]: can't create lock file /mnt/sdd1/lizardfs_vol/.lock, marking hdd as damaged: Input/output error
+UUID=277c05de-64f5-48a8-8614-8027a53be212 /mnt/sdd1 xfs rw,exec,nodev,noatime,nodiratime,largeio,inode64 0 1
+```
+
+Lizard also complains 4 SSDs have been wiped out.
+We'll need to reboot the server to see what storage still may work. The slurm connection appears to be misconfigured:
+
+```
+[2025-12-20T09:36:27.846] error: service_connection: slurm_receive_msg: Insane message length
+[2025-12-20T09:36:28.415] error: unpackstr_xmalloc: Buffer to be unpacked is too large (1700881509 > 1073741824)
+[2025-12-20T09:36:28.415] error: unpacking header
+[2025-12-20T09:36:28.415] error: destroy_forward: no init
+[2025-12-20T09:36:28.415] error: slurm_receive_msg_and_forward: [[nessus6.uthsc.edu]:35553] failed: Message receive failure
+```
+
+It looks like Andrea is the only one using the machine right now, though some others are logged in. Before rebooting I'll block users, ask Andrea to move off, and deplete slurm and lizard. But o2 is a large RAM machine, so we should not use that as a head node.
+
+Let's take a look at o3. This one has less RAM. Flavia is running some tools, but I don't think the machine is really used right now. Slurm is running, but shows similar configuration issues as o2. Let's take a look at slurm
+
+=> ../systems/hpc/octopus-maintenance
+=> ../hpc/octopus/slurm-user-guide
+
+Alright, I depleted and removed slurm from o3. I think it would be wise to also deplete the lizard drives on that machine.
+
+The big users on lizard are:
+
+```
+1.6T    dashbrook
+1.8T    pangenomes
+2.1T    erikg
+3.4T    aruni
+3.4T    junh
+8.4T    hchen
+9.2T    salehi
+13T     guarracino
+16T     flaviav
+```
+
+It seems we can clean some of that up! We have some backup storage that we can use. Alternatively move to ISAAC.
+
+We'll slowly start depleting the lizard. See also
+
+=> lizardfs/README
+
+O3 has 4 lizard drives. We'll start by depleting one.
+
+
+# O2
+
+```
+172.23.22.159:9422:/mnt/sde1/lizardfs_vol/
+        to delete: no
+        damaged: yes
+        scanning: no
+        last error: no errors
+        total space: 0B
+        used space: 0B
+        chunks: 0
+172.23.22.159:9422:/mnt/sdd1/lizardfs_vol/
+        to delete: no
+        damaged: yes
+        scanning: no
+        last error: no errors
+        total space: 0B
+        used space: 0B
+        chunks: 0
+172.23.22.159:9422:/mnt/sdc1/lizardfs_vol/
+        to delete: no
+        damaged: yes
+        scanning: no
+        last error: no errors
+        total space: 0B
+        used space: 0B
+        chunks: 0
+```
+
+Stopped the chunk server.
+sde remounted after xfs_repair. The others were not visible, so rebooted. The following storage should add to the total again:
+
+```
+/dev/sdc1            4.6T  3.9T  725G  85% /mnt/sdc1
+/dev/sdd1            4.6T  4.2T  428G  91% /mnt/sdd1
+/dev/sdf1            4.6T  4.2T  358G  93% /mnt/sdf1
+/dev/sde             3.7T  3.7T  4.0G 100% /mnt/sde
+/dev/sdg1            3.7T  3.7T  3.9G 100% /mnt/sdg1
+```
+
+After adding this storage and people removing material it starts to look better:
+
+```
+mfs#octopus01:9421   171T   83T   89T  49% /lizardfs
+```
+
+# O3
+
+I have marked the disks (4x4T) on o3 for deletion - that will subtract 7T. This is in preparation for upgrading Linux and migrating those disks to moosefs. Continue below.
+
+# T5
+
+T5 requires a new bios - it has the same one as the unreliable T4. I also need to see if there are any disks in the bios we don't see right now. T5 has two small fast SSDs and one larger one (3.5T).
+
+I managed to install the new bios, but I had trouble getting into linux because of some network/driver issues. ipmi was suspect. Finally managed rescue mode by adding 'systemd.unit=emergency.target' in the grub line. 'single' is no longer enough (grrr). One to keep in mind.
+
+Had to disable ipmi modules. See my idrac.org.
+
+# T6
+
+Tux06 (T6) contains two unused drives that appear to have contained XFS. xfs_repair did not really help...
+The BIOS on T6 is newer than on T4+T5. That probably explains why the higher T numbers have no disk issues, while T4+T5 had problems with non-OEM! Anyway, as I was at it, I updated the BIOS for all.
+
+T6 has 4 SSDs, 2x 3.5T. Both unused. The lizard chunk server is failing, so might as well disable it.
+
+I am using T6 to test network boots because it is not serving lizard.
+
+# T7
+
+On T7 root was full(!?). Culprit was Andrea with /tmp/sweepga_genomes_111850/.
+T7 has 3x3.5T with one unused.
+
+# T8
+
+T8 has 3x3.5T, all used. After the BIOS upgrade the efi partition did not boot. After a few reboots it did get into grub and I made a copy of the efi partition on sdd (just in case).
+
+# T9
+
+T9 has 1x3.5T. Used. I had to reduce HDD_LEAVE_SPACE_DEFAULT to give the chunkserver some air.
+
+# O3 + O4
+
+Back to O3, our future head node. lizard has mostly been depleted. Though every drive has a few chunks left. I just pulled down the chunkserver and lizard appears to be fine (no errors). Good!
+
+Next install Linux. I have two routes, one is using debootstrap, the other is via PXE. I want to try the latter.
+
+So far, I managed to boot into ipxe on Octopus.
+The linux kernel loads over http, but it does not show output. Likely I need to:
+
+* [X] Build ipxe with serial support
+* [X] Test the installer with serial support
+* [X] Add NFS support
+* [X] debootstrap install of new Debian on /export/nfs/nodes/debian14
+* [X] Make available through NFS and boot through IPXE
+
+I managed to boot T6 over the network.
+Essentially we have the latest Debian stable running on T6, completely over NFS!
+In the next steps I need to figure out:
+
+* [X] Mount NFS with root access
+* [ ] Every PXE node needs its own hard disk configuration
+* [ ] Mount NFS from octopus01
+* [ ] Start slurm
+
+We can have this as a test node pretty soon.
+But first we have to start moosefs and migrate data.
+
+I am doing some small tests and will put (old) T6 back on slurm again.
+
+To get every node booted with its own version of fstab and state logging on a local disk we need to pull some trick with initrd.
+
+Basically the NFS boot initrd needs to contain a script that applies per-node changes. The node hostname and primary partition can be passed on from ipxe as kernel parameters, e.g. myhost=client01 localdisk=/dev/sda1. So that is the differentiator. The script in /etc/nodes/initramfs-tools/update-node-etc will remount /tmp and /var onto $localdisk and copy /etc there too. Next it will symlink a few files, such as /etc/hostname and /etc/fstab, to adjust for local settings.
+
+This way we will deploy all nodes centrally. One aspect is that we don't need dynamic user management as it is centrally orchestrated! The user files can be copied from the head node when they change.
+
+O4 is going to be the backup head node. It will act as a compute node too, until we need it as the head node. O4 is currently not on the slurm queue.
+
+* [X] Update guix on O1
+* [X] Install guix moosefs
+* [X] Start moosefs master on O3
+* [X] Start moosefs metalogger on O4
+* [ ] Check moosefs logging facilities
+* [ ] See if we can mark drives so it is easier to track them
+* [ ] Test broken (?) /dev/sdf on octopus03
+
+We can start the moose master on O3. We should use different ports than lizard. Lizard uses 9419-24 by default.
+So let's use ports starting at 9519. See
+
+=> moosefs/moosefs-maintenance.gmi
+
+# P2
+
+Penguin2 has 80T of spinning disk storage. We are going to use that for redundancy. Basically these disks get a moosefs goal of HDD 'slow' and we'll configure them on a remote rack - so chunks get fetched from local chunk servers (first). This will gain us 40T of immediate storage. Adding more spinning disks will free up SSDs further.
+
+* [X] P2 Update Guix
+* [X] Install moosefs
+* [ ] Create HDD chunk server
diff --git a/topics/pangenome/impg/impg-agc-bindings.gmi b/topics/pangenome/impg/impg-agc-bindings.gmi
new file mode 100644
index 0000000..2451c0a
--- /dev/null
+++ b/topics/pangenome/impg/impg-agc-bindings.gmi
@@ -0,0 +1,246 @@
+# IMPG AGC bindings
+
+In this document we will create a build setup that allows us to use AGC (a C++ library) from a recent Rust compiler. The original binding proves tricky. So we break it down into parts. Also we try out the new Rust cargo support in Guix.
+
+Fortunately the AGC include file contains a limited list of functions that have C ABI bindings:
+
+```c
+EXTERNC agc_t* agc_open(char* fn, int prefetching);
+EXTERNC int agc_close(agc_t* agc);
+EXTERNC int agc_get_ctg_len(const agc_t *agc, const char *sample, const char *name);
+EXTERNC int agc_get_ctg_seq(const agc_t *agc, const char *sample, const char *name, int start, int end, char *buf);
+EXTERNC int agc_n_sample(const agc_t* agc);
+EXTERNC int agc_n_ctg(const agc_t *agc, const char *sample);
+EXTERNC char* agc_reference_sample(const agc_t* agc);
+EXTERNC char **agc_list_sample(const agc_t *agc, int *n_sample);
+EXTERNC char **agc_list_ctg(const agc_t *agc, const char *sample, int *n_ctg);
+EXTERNC int agc_list_destroy(char **list);
+EXTERNC int agc_string_destroy(char *sample);
+```
+
+Even for a C++ library it is very thoughtful to provide a C ABI! Both the current Rust binding and the Python example in AGC actually use the C++ class - which means they need to build against a matching C++ source tree.
+It should be straightforward to create a Rust module that calls into the shared library directly using the C ABI instead of importing and building all the source code.
+
+One early choice is a separation of concerns. We will try to build the library independently of the Rust package. This follows a standard model. For example cargo should not build zlib - it is provided by the environment. The bindings, meanwhile, are defined and built in cargo.
+
+# Tasks
+
+* [X] Fix AGC passing exceptions through C ABI
+* [X] Get guix to compile impg (here testlibagc) with AGC
+* [ ] Add optimization
+* [ ] Make sure spoa build in spoa-rs is optimized
+* [ ] Create static binary for distribution
+* [ ] Create singularity example
+
+# Steps
+
+## Setting up Guix with rust
+
+Guix provides a reproducible build environment. If you get over the fact that it is Lisp, it proves a remarkably nice way to handle dependencies. The first step is to set up guix so you get a recent set of dependencies. For this run guix pull and set it up in a profile:
+
+```sh
+guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix
+```
+
+it takes a few minutes. Next set the environment
+
+```sh
+unset GUIX_PROFILE
+. ~/opt/guix-pull/etc/profile
+```
+
+and list the packages
+
+```sh
+guix package -A rust
+rust                    1.85.1                  rust-src,tools,out,cargo        gnu/packages/rust.scm:1454:4
+```
+
+should show a recent edition of rust (typically about half a year old, the rust-team in guix is now working on 1.89). Note you can also pull an older version of guix (and rust) by passing in the git hash value of the codeberg repo. This allows you to go back to the dependency tree of, say, three months ago. It allows for a level of sanity not seen in other software deployment systems.
+
+Note that we tend not to be too recent with packages as Guix is used to deploy *stable* systems. If you want a more recent version of rust you can write your own guix package - it is not that hard. We may attempt it later for this exercise.
+
+Note also that newbies run guix-pull too often. I typically do it every three months, or so. So the slowness of guix-pull should not really count.
+
+One thing that is a bit funny is that we currently can't list most cargo packages in guix because the crates are now 'local' to a package. We have to check the source tree:
+
+=> https://codeberg.org/guix/guix/src/branch/master/gnu/packages/rust-crates.scm
+
+## Building AGC in guix
+
+AGC is a C++ program with a C ABI. The README suggests there are no dependencies, but that is misleading. It sources other dependencies and builds them (a bit like git submodules). I managed to build AGC using a guix shell with:
+
+```sh
+guix shell -C guix gcc-toolchain make libdeflate pkg-config xz mimalloc coreutils sed minizip-ng lzlib zlib:static zstd:static zstd:lib zstd zlib
+make PLATFORM=avx2 libagc
+```
+
+Note it pulls in too much. To make it compile the patch I applied is
+
+```diff
+--- a/agc/makefile
++++ b/agc/makefile
+@@ -14,14 +14,14 @@ $(call SET_SRC_OBJ_BIN,src,obj,bin)
+
+ # *** Project configuration
+ $(call CHECK_NASM)
+-$(call ADD_MIMALLOC, $(3RD_PARTY_DIR)/mimalloc)
++# $(call ADD_MIMALLOC, $(3RD_PARTY_DIR)/mimalloc)
+ $(call PROPOSE_ISAL, $(3RD_PARTY_DIR)/isa-l)
+-$(call PROPOSE_ZLIB_NG, $(3RD_PARTY_DIR)/zlib-ng)
+-$(call CHOOSE_GZIP_DECOMPRESSION)
+-$(call ADD_LIBDEFLATE, $(3RD_PARTY_DIR)/libdeflate)
+-$(call ADD_LIBZSTD, $(3RD_PARTY_DIR)/zstd)
++# $(call PROPOSE_ZLIB_NG, $(3RD_PARTY_DIR)/zlib-ng)
++# $(call CHOOSE_GZIP_DECOMPRESSION)
++# $(call ADD_LIBDEFLATE, $(3RD_PARTY_DIR)/libdeflate)
++# $(call ADD_LIBZSTD, $(3RD_PARTY_DIR)/zstd)
+ $(call ADD_RADULS_INPLACE,$(3RD_PARTY_DIR)/raduls-inplace)
+-$(call ADD_PYBIND11,$(3RD_PARTY_DIR)/pybind11/include)
++# $(call ADD_PYBIND11,$(3RD_PARTY_DIR)/pybind11/include)
+ $(call SET_STATIC, $(STATIC_LINK))
+
+ $(call SET_C_CPP_STANDARDS, c11, c++20)
+@@ -57,7 +57,7 @@ $(OUT_BIN_DIR)/agc: \
+        $(CXX) -o $@  \
+        $(MIMALLOC_OBJ) \
+        $(OBJ_APP) $(OBJ_CORE) $(OBJ_COMMON) \
+-       $(LIBRARY_FILES) $(LINKER_FLAGS) $(LINKER_DIRS)
++       $(LIBRARY_FILES) -lzstd -lz -ldeflate $(LINKER_FLAGS) $(LINKER_DIRS)^M
+
+ libagc: $(OUT_BIN_DIR)/libagc
+ $(OUT_BIN_DIR)/libagc:
+```
+
+It essentially disables the 3rd-party dependency builds in favour of using the Guix ones.
+
+Note that Bioconda installs AGC as a binary:
+
+=> https://github.com/bioconda/bioconda-recipes/blob/master/recipes/agc/meta.yaml
+
+So it circumvents building AGC by downloading the provided static binaries. It only downloads the binary, not the library.
+
+## The current cargo package
+
+The current cargo bindings package, named agc-rs, in turn vendors in the AGC github repository, similar to git submodules. It is kinda ironic that we left git submodules for something that is not better (maybe even worse because it does not pin hash values, but a versioned branch/tag -- who is to say what happened upstream).
+
+## Changes
+
+So we propose to take a different approach when it comes to distributing software. The first premise is that we will prepare pre-built *binaries* for external use that can be handled by conda and singularity. Both these deployers can handle external dependencies, so we can just use a standard AGC build/distribution. That is key to keeping sane - so we do not have cargo build AGC itself, as it is just a library with a decent C ABI.
+
+To make it work with Rust we can create a cargo module that binds to the C ABI using FFI (and not care where the AGC library comes from). One great feature is we can use the C ABI without having to generate bindings using clang and all that. A C ABI can be written and maintained by hand in Rust.
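+
+As a rough illustration of what such a hand-written binding can look like, here is a minimal sketch covering three of the functions from the header listed at the top. It is not the code in libagc-sys. The names AgcT, c_path and count_samples are made up for this example, and it assumes that agc_open returns NULL on failure and does not modify the path buffer. Linking against libagc is assumed to be configured elsewhere (e.g. in build.rs).
+
+```rust
+use std::ffi::{c_char, c_int, CString};
+
+/// Opaque handle: Rust never looks inside, it only passes the pointer around.
+#[repr(C)]
+pub struct AgcT {
+    _private: [u8; 0],
+}
+
+// Declarations copied by hand from the C header. `unsafe extern` needs
+// Rust 1.82 or later; on older compilers drop the `unsafe` keyword.
+unsafe extern "C" {
+    fn agc_open(path: *mut c_char, prefetching: c_int) -> *mut AgcT;
+    fn agc_close(agc: *mut AgcT) -> c_int;
+    fn agc_n_sample(agc: *const AgcT) -> c_int;
+}
+
+/// Convert a Rust path into a NUL-terminated C string (None on interior NUL).
+pub fn c_path(path: &str) -> Option<CString> {
+    CString::new(path).ok()
+}
+
+/// Hypothetical safe wrapper: open an archive, count its samples, close it.
+pub fn count_samples(path: &str) -> Option<i32> {
+    let c = c_path(path)?;
+    unsafe {
+        // Assumption: agc_open does not write to the path buffer.
+        let agc = agc_open(c.as_ptr() as *mut c_char, 0);
+        // Assumption: a NULL handle signals failure.
+        if agc.is_null() {
+            return None;
+        }
+        let n = agc_n_sample(agc);
+        agc_close(agc);
+        Some(n)
+    }
+}
+```
+
+The unsafe part stays confined to one small function, so callers only ever see an Option. Nothing here needs the AGC source tree or a bindings generator.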
+
+For C++ only libraries, the narrative gets a bit harder. If the C++ interface is rich it may be best to use a bindings generator. In general, however, it should be possible to provide a C ABI that calls into the C++ code. This means we can take the same deployment approach (in general) for pure C++ libraries, provided we can write a short C ABI. I have done this for vcflib, for example, to write the Zig version of vcflib:
+
+=> https://github.com/vcflib/vcflib/blob/master/src/vcf-c-api.cpp
+
+To support AGC in Rust we need to:
+
+* [X] Create a Rust binding that uses the AGC C ABI instead of the C++ one, so we can use a statically built AGC lib and don't need the source tree for cargo
+
+We will also write a
+
+* [ ] Guix build to create the optimized AGC static lib
+* [ ] Guix build that creates an optimized impg
+
+And that last one allows us to distribute prebuilt binaries in CONDA and apptainer/singularity/docker.
+
+Note that this is the same approach as taken by
+
+=> https://github.com/rust-lang/libz-sys/blob/main/build.rs
+
+which binds against libz. It *optionally* builds the source tree of zlib which is included as a submodule
+
+=> https://github.com/rust-lang/libz-sys/tree/main/src
+
+In our case, a rebuild can be useful when AGC lib can not be found. Note that the cargo edition of libz-sys does not invoke make or cmake. It builds it by 'hand'!
+
+There is also libz-rs, but that is a somewhat typical Rust rewrite of libz:
+
+=> https://github.com/trifectatechfoundation/zlib-rs
+
+I also took a quick look at the rust spoa crate. Here a build is always forced, but I don't think it actually optimizes the build. Added a note to my tasks.
+
+## First guix package by Fred
+
+Fred drafted a first guix package which can build impg with
+
+```
+guix build -L .guix/modules -f guix.scm
+
+/gnu/store/cdjiq6aalpc849hl8irmbn8xax9mq2b6-impg-0.3.1/bin/impg
+Command-line tool for querying overlaps in PAF files
+
+Usage: impg <COMMAND>
+
+Commands:
+  index       Create an IMPG index
+  lace        Lace files together (graphs or VCFs)
+  partition   Partition the alignment
+  query       Query overlaps in the alignment
+  similarity  Compute pairwise similarity between sequences in a region
+  stats       Print alignment statistics
+
+Options:
+  -h, --help     Print help
+  -V, --version  Print version
+```
+
+It builds against rust 1.85 and uses the new cargo support in Guix. It does not have to rebuild the cargo packages already in guix. Nice and a good start!
+
+=> https://github.com/pangenome/impg/blob/f5ebaf8b511ee06bdeb193ef509836c26cd4793a/.guix/modules/impg/impg.scm#L4
+
+we'll still need to add AGC, static output and optimizations.
+
+## Adding a guix package for AGC
+
+As a first step we build a package for AGC that compiles libagc.a using AVX2:
+
+=> https://github.com/pjotrp/impg/commit/ed16948cc4145ff933a19ba54c3bc1fe4cec709f
+
+we used the vendored in source for raduls-inplace and isa-l. Not sure they are really required, but I think it is harmless here.
+
+## Make sure libagc.a is linked to impg
+
+To create a rust package for binding libagc it is worth reading:
+
+=> https://doc.rust-lang.org/cargo/reference/build-scripts.html#a-sys-packages
+
+* The library crate should link to the native library libfoo. This will often probe the current system for libfoo before resorting to building from source.
+* The library crate should provide declarations for types and functions in libfoo, but not higher-level abstractions.
+
+So we should create an agc-rs crate that provides a high-level interface to the upcoming libagc-sys crate. No wonder these crates proliferate.
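+
+The "probe the system first" step usually lives in the build.rs of the -sys crate. Below is a minimal sketch, not the actual libagc-sys build script. The environment variable names LIBAGC_LIB_DIR and LIBAGC_STATIC are hypothetical and only chosen for this example; the cargo directives themselves are standard.
+
+```rust
+// build.rs (sketch)
+use std::env;
+
+/// Build the cargo directives for linking against an existing libagc.
+/// `lib_dir` is an optional directory that contains libagc.so or libagc.a.
+fn link_directives(lib_dir: Option<String>, want_static: bool) -> Vec<String> {
+    let mut out = Vec::new();
+    if let Some(dir) = lib_dir {
+        // Extra library search path, e.g. a Guix profile lib directory.
+        out.push(format!("cargo:rustc-link-search=native={dir}"));
+    }
+    let kind = if want_static { "static" } else { "dylib" };
+    out.push(format!("cargo:rustc-link-lib={kind}=agc"));
+    // Re-run the build script when the (hypothetical) variable changes.
+    out.push("cargo:rerun-if-env-changed=LIBAGC_LIB_DIR".to_string());
+    out
+}
+
+fn main() {
+    let dir = env::var("LIBAGC_LIB_DIR").ok();
+    let want_static = env::var("LIBAGC_STATIC").is_ok();
+    for line in link_directives(dir, want_static) {
+        println!("{line}");
+    }
+}
+```
+
+Without LIBAGC_LIB_DIR the linker falls back to its default search path, which is what we want when the environment (such as guix shell) already provides libagc. Building from source would only be a fallback when nothing is found.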
+
+# Using a linked libagc.so
+
+I managed to create a crate that binds libagc.so against Rust:
+
+=> https://github.com/pjotrp/libagc-sys
+
+See also the included test in lib.rs. It binds against the updated agc:
+
+=> https://github.com/refresh-bio/agc/compare/main...pjotrp:agc:main
+
+which contains the fixes that stop C++ exceptions from passing through the C ABI.
+Also I fixed one function and added a shared lib as output.
+
+Finally, rather than messing with the impg code tree (which keeps changing), I created a test crate that mirrors impg:
+
+=> https://github.com/pjotrp/testlibagc
+
+which can be built and run with
+
+```
+cargo build --release
+target/release/testagc-sys
+Number of samples: 4
+```
+
+At least we have a reference implementation for binding successfully against a shared C library with a very *light* and standardised interface. It obviously also works in Guix. We can use it to benchmark against the new (impressive) Rust implementation by Erik. It also acts as a template for future bindings.
+
+Note that we should discourage C++ bindings. Mostly because there is no standard C++ ABI (in contrast to the C one), so avoid the use of the cxx crates - unless you really know what you are doing.
+
+Potential future work is:
+
+* [ ] Optimized runtime
+* [ ] Static binary for distribution
diff --git a/topics/programming/better-logging.gmi b/topics/programming/better-logging.gmi
index dca8c0d..d80bb0d 100644
--- a/topics/programming/better-logging.gmi
+++ b/topics/programming/better-logging.gmi
@@ -1,14 +1,17 @@
-# Improving Logging in GN2
+# Improving Alerting/Logging in GN2
 
-## What Are We Trying To Solve?
+## Problem Statement
 
-We prioritise maintaining user functionality over speed in GN [with time this speed will be improved].  As such we should be pay more attention at not breaking any currently working GN2 functionality.  And when/if we do, trouble-shooting should be easy.  On this front, one way is to stream-line logging in both GN2/GN3 and make it more script friendly - only report when something fails, not to instrument variables - and in so doing make the process of monitoring easier.
+Current logging in the genenetwork ecosystem is noisy and difficult to parse programmatically, which makes it hard to:
+
+* Integrate logs into an observability pipeline (e.g. sheepdog).
+* Troubleshoot issues as they occur.  We always learn of bugs from users.
 
 ## Goals
 
-* Have script-friendly error/info logs.
-* Remove noise from GN2.
-* Separate logging into different files: error logs, info logs.  Add this somewhere with Flask itself instead of re-directing STDOUT to a file.
+* Standardize logging format and config across GN2 flask apps and gn-guile.
+* Adopt structured logging.
+* Extend sheepdog to be able to parse gn logs and send alerts by e-mail or matrix.
 
 ### Non-goals
 
@@ -27,3 +30,5 @@ We prioritise maintaining user functionality over speed in GN [with time this sp
 ## Resources
 
 => https://realpython.com/python-logging/ Logging in Python
+=> https://signoz.io/guides/python-logging-best-practices/ Python Logging Best Practices - Obvious and Not-So-Obvious
+=> https://signoz.io/blog/what-is-opentelemetry/ What is OpenTelemetry
diff --git a/topics/rust/guix-rust-bootstrap.gmi b/topics/rust/guix-rust-bootstrap.gmi
new file mode 100644
index 0000000..cd3c322
--- /dev/null
+++ b/topics/rust/guix-rust-bootstrap.gmi
@@ -0,0 +1,173 @@
+# Guix Rust Bootstrap
+
+To develop code against rust you often need a recent edition of rust. With Guix this is possible because you don't depend on the underlying linux distribution to provide recent versions of glibc and other libraries. Here we have a recipe that should work anywhere on Linux.
+
+I succeeded in running the latest Rust on Octopus and building packages with guix.
+
+To make it work the following steps are required:
+
+* Update guix with guix-pull if your guix is older than 3 months
+* Unset GUIX_PROFILE on some systems
+* Set your updated guix profile vars
+* Create a container that has all dependencies for rust itself
+* Run rustup
+* Run cargo with LD_LIBRARY_PATH set to $GUIX_ENVIRONMENT/lib
+
+# Get Guix updated
+
+It is important to have a recent version of Guix. This is achieved with 'guix pull' and making sure it works.
+
+
+```sh
+mkdir -p ~/opt
+guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix
+```
+
+it takes a few minutes. Next set the environment
+
+```sh
+unset GUIX_PROFILE
+. ~/opt/guix-pull/etc/profile
+```
+
+This will point the path to a recent guix. You can make sure with
+
+```
+guix describe
+  guix 772c456
+    repository URL: https://codeberg.org/guix/guix
+    branch: master
+    commit: 772c456717e755829397a6ff6dba4c1e135426d8
+```
+
+which can be validated against the Guix tree. Running
+
+
+```sh
+guix package -A rust
+rust                    1.85.1                  rust-src,tools,out,cargo       gnu/packages/rust.scm:1454:4
+```
+
+shows the current *stable* version in Guix. Now, of course, we need something more to get the latest rust.
+
+# Update Rust and Cargo to latest (stable)
+
+The trick is to set up a container with Rust in your git working directory:
+
+```
+mkdir -p ~/.cargo ~/.rustup # to prevent rebuilds
+guix shell --share=$HOME/.cargo  --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+. ~/.cargo/env
+rustup default stable
+```
+
+Now rustc shows it is recent:
+
+```
+rustc --version
+rustc 1.90.0 (1159e78c4 2025-09-14)
+```
+
+Next run 'cargo build' with:
+
+```
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib cargo build
+  Compiling libagc-sys v0.1.0 (/home/wrk/iwrk/opensource/code/pangenome/libagc-sys)
+    Finished 'dev' profile [unoptimized + debuginfo] target(s) in 0.06s
+$ ./target/debug/libagc-sys
+./target/debug/libagc-sys: error while loading shared libraries: libgcc_s.so.1: cannot open shared object file: No such file or directory
+$ env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ./target/debug/libagc-sys
+Hello, world!
+```
+
+and your source should build and run. Note the libgcc_s.so.1 error.
+
+## What if you get a libgcc or librt error?
+
+The problem is that cargo picks up the wrong libgcc:
+
+```
+$ ls /gnu/store/*/lib/libgcc_s.so.1
+/gnu/store/m2vhzr0dy352cn59sgcklcaykprrr4j6-gcc-14.3.0-lib/lib/libgcc_s.so.1
+/gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/lib/libgcc_s.so.1
+/gnu/store/v3bq3shn333kh7m6gj3r58l0v7mkn4in-profile/lib/libgcc_s.so.1
+/gnu/store/xm7i1gvi0i9pyndlkv627r08rsw1ny96-gcc-15.2.0-lib/lib/libgcc_s.so.1
+```
+
+This is because Guix itself builds on an older libgcc and librt. You need to tell it explicitly which library to load, namely the one your cargo was built against:
+
+```
+ldd ~/.cargo/bin/cargo
+        linux-vdso.so.1 (0x00007ffd409b2000)
+        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd2cf433000)
+        librt.so.1 => /lib/librt.so.1 (0x00007fd2cf42e000)
+```
+
+in the container:
+
+```
+ls -l /lib/libgcc_s.so.1
+lrwxrwxrwx 1 65534 overflow 82 Jan  1  1970 /lib/libgcc_s.so.1 -> /gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/lib/libgcc_s.so.1
+```
+
+which happens to be the one in $GUIX_ENVIRONMENT/lib! So setting the library path solves it.
+
+The reason that we don't get the automatically resolving libraries that you normally have in guix is that we have updated rust by *hand* using rustup. Guix has no control over this process.
+
+# spoa-rs on octopus01
+
+I just did the above to build spoa-rs. Only had to add cmake to the shell packages.
+
+# sweepga on octopus01
+
+I just built sweepga. Only had to add clang to the shell:
+
+```sh
+guix shell --share=$HOME/.cargo  --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl zlib cmake clang
+. ~/.cargo/env
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib cargo build
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ./target/debug/sweepga
+[sweepga::start::0.000*1.00] 2025-10-11 15:27:28 | ./target/debug/sweepga
+[sweepga::detect::0.000*1.00] Using .1aln workflow (FastGA native format)
+[sweepga] ERROR: No valid input provided
+```
+
+To run on the cluster you likely don't want to use the container. Make a note of GUIX_ENVIRONMENT:
+
+```
+echo $GUIX_ENVIRONMENT/
+/gnu/store/6khi7iv7l75595hwlfc1nwmdcv72m24s-profile/
+```
+
+It has your libs! So, outside the container you can run
+
+```
+export GUIX_ENVIRONMENT=/gnu/store/6khi7iv7l75595hwlfc1nwmdcv72m24s-profile
+env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib /home/wrk/tmp/sweepga/target/debug/sweepga
+```
+
+# Updating the container
+
+Now your build may fail because you miss a crucial library or tool. This is a feature of guix containers as it makes dependencies explicit.
+
+Just add them to the guix shell command. Let's say we add zlib
+
+```
+guix shell --share=$HOME/.cargo  --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl zlib
+```
+
+# Troubleshooting
+
+## Collisions
+
+Guix may complain about collisions. These are mostly naming issues:
+
+```
+warning: collision encountered:
+  /gnu/store/nym6kiinrg2mb8z4lwnvfx5my8df9vrs-glibc-for-fhs-2.41/bin/ldd
+  /gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/bin/ldd
+warning: choosing /gnu/store/nym6kiinrg2mb8z4lwnvfx5my8df9vrs-glibc-for-fhs-2.41/bin/ldd
+```
+
+it will link one into your environment. You can still use both tools by using the full path and can normally ignore the warning.
diff --git a/topics/systems/backup-drops.gmi b/topics/systems/backup-drops.gmi
index 3f81c5a..a29e605 100644
--- a/topics/systems/backup-drops.gmi
+++ b/topics/systems/backup-drops.gmi
@@ -117,7 +117,7 @@ So, as root
 ```
 cd /export
 mkdir -p backup/bacchus/drop
-chown bacchus.bacchus backup/bacchus/drop/
+chown bacchus:bacchus backup/bacchus/drop/
 chmod 0700 backup/bacchus/drop/
 ```
 
@@ -197,3 +197,54 @@ The recent scripts can be found at
 => https://github.com/genenetwork/gn-deploy-servers/blob/master/scripts/tux01/backup_drop.sh
 
 # borg-borg
+
+
+Backups work for production according to sheepdog. They run at 5am CST. Which (I guess) is OK. On the remote server we are going to forward the backup to a server on a different continent at 4pm GMT. I have been running that by hand lately, so time to sheepdog it!
+
+The manual command is
+
+```
+rsync -e "ssh -i ~/.ssh/id_ecdsa_borgborg" -vaP tux03 $HOST:/export/backup/bacchus/drop/
+```
+
+With sheepdog we can make it:
+
+```
+sheepdog_run.rb -v --tag "drop-mount-$name" -c "sshfs -o $SFTP_SETTING,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@$host:/ ~/mnt/$name"
+sheepdog_run.rb --always -v --tag "drop-rsync-$name" -c "rsync -vrltDP borg/* ~/mnt/$name/drop/$HOST/ --delete"
+sheepdog_run.rb -v --tag "drop-unmount-$name" -c "fusermount -u ~/mnt/$name"
+```
+
+For some reason this took a while to figure out. Part of it is that the machine on the other end has a rather slow CPU!
+An Intel(R) Celeron(R) CPU J1900 @ 1.99GHz, launched over 10 years ago. We still use it because of its low energy consumption. Once it starts pumping a file it is up to speed:
+
+```
+tux03/tux03-containers/data/0/239
+    154,501,120  29%   11.20MB/s    0:00:32
+```
+
+So one backup of a backup has started running and I made it a CRON job. Next stop is borgborg on the receiving HOST. The CRON job looks like
+
+```
+0 3 * * * env BORG_PASSPHRASE=none /home/wrk/iwrk/deploy/deploy/bin/sheepdog_borg.rb -t borgborg --always -v -b /export/backup/bacchus/borgborg/drop /export/backup/bacchus/drop --args '--stats' >> ~/cron.log 2>&1
+```
+
+note the backups are already password protected. No need to do that again. Now this backup is going to go onto optical media twice a year with the password printed on the backup. That should keep it for 100 years.
+
+You can track this backup progress daily on the sheepdog status
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+i.e. in reverse order the flow is:
+
+```
+2025-09-18 08:35:00 +0200	FAIL	host	borgborg-backup
+2025-09-18 16:19:45 -0500	SUCCESS	balg01	drop-rsync-zero
+2025-09-18 05:59:46 +0000	SUCCESS	tux03	mariadb-check
+2025-09-18 05:26:01 +0000	SUCCESS	tux03	drop-rsync-balg01
+2025-09-18 05:25:48 +0000	SUCCESS	tux03	borg-tux03-sql-backup
+2025-09-18 04:44:38 +0000	SUCCESS	tux03	mariabackup-make-consistent
+2025-09-18 04:44:25 +0000	SUCCESS	tux03	mariabackup-dump
+```
+
+The borgborg should be fixed now. I am missing the container backups. What is going on there? These were last backed up on 'Sun, 2025-09-14 00:00:52'. Ah, I set the CRON job to run once a week. That should be fixed now and it should show up.
diff --git a/topics/systems/backups-with-borg.gmi b/topics/systems/backups-with-borg.gmi
index 1ad0112..dbd9192 100644
--- a/topics/systems/backups-with-borg.gmi
+++ b/topics/systems/backups-with-borg.gmi
@@ -20,15 +20,15 @@ which I run on one of my machines.
 Usually I use a version of borg from guix. This should really be done as the borg user (ibackup).
 
 ```
-mkdir ~/opt
-guix package -i borg ~/opt/borg
-tux04:~$ ~/opt/borg/bin/borg --version
+ibackup@tux03:~$ mkdir ~/opt
+ibackup@tux03:~$ guix package -i borg -p ~/opt/borg
+~/opt/borg/bin/borg --version
   1.2.2
 ```
 
 # Create a new backup dir and user
 
-The backup should live on a different disk from the things we backup, so when that disk fails we have another.
+The backup should live on a *different* disk from the things we back up, so when that disk fails we have another. In fact in 2025 we had a corruption of the backups(!) We could recover from the original data + older backups. Not great. But if it had been the same disk it would have been worse.
 
 The SQL database lives on /export and the containers live on /export2. /export3 is a largish slow drive, so perfect.
 
@@ -51,6 +51,8 @@ cd /export/backup/borg
 ~/opt/borg/bin/borg init --encryption=repokey-blake2 genenetwork
 ```
 
+Note that we typically start from an existing backup. These go back a long time.
+
 Now we can run our first backup. Note that ibackup should be a member of the mysql and gn groups
 
 ```
@@ -218,3 +220,230 @@ it has a --repair switch which we needed to remove some faults in the backup its
 ```
 borg check --repair repo
 ```
+
+# Production backups
+
+Now backups were supposed to run, but they don't show up yet. Ah, it is not yet 3am CST. Meanwhile we drop the backups on another server. Just in case we lose *both* drives on the production server and/or the server itself. To achieve this we have set up a user 'bacchus' with limited permissions on the remote. All bacchus can do is copy the files across. So, we add an ssh key and invoke the commands:
+
+```
+sheepdog_run.rb -v --tag "drop-mount-$name" -c "sshfs -o $SFTP_SETTING,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@$host:/ ~/mnt/$name"
+sheepdog_run.rb --always -v --tag "drop-rsync-$name" -c "rsync -vrltDP borg/* ~/mnt/$name/drop/$HOST/ --delete"
+sheepdog_run.rb -v --tag "drop-unmount-$name" -c "fusermount -u ~/mnt/$name"
+```
+
+essentially mounting the remote dir, rsync files across, and unmount. All monitored by sheepdog. Copying files over sshfs is not the fastest route, but it is very secure because of the limited permissions. On the remote we have space and for now we'll use the old backups as a starting point. When it works I'll disable and remove the old tux04 backups. Actually I'll disable the cron job now and make sure mariadb did not start (so no one can use that by mistake). All checked!
+
+Meanwhile the system log at point of failure shows no information. This means it is a hard crash the Linux kernel is not even aware of and it points out it is not a kernel/driver/software issue on our end. It really sucks. We'll work on it:
+
+=> tux04-disk-issues
+
+OK, so I prepared the old production backups on the remote and we run an update by hand. And after some fiddling with permissions it worked:
+
+```
+ibackup@tux03:/export/backup/scripts/tux03$ ./backup_drop_balg01.sh
+fusermount: entry for /home/ibackup/mnt/balg01 not found in /etc/mtab
+{:cmd=>"sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@balg01.genenetwork.org:/ ~/mnt/balg01", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :verbose=>true, :tag=>"drop-mount-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"}
+sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@balg01.genenetwork.org:/ ~/mnt/balg01
+No event to report <sheepdog_run>                                                                                      {:cmd=>"rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :always=>true, :verbose=>true, :tag=>"drop-rsync-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"}
+rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete                                                                 sending incremental file list
+deleting genenetwork/integrity.1148
+(...)
+sent 22,153,007 bytes  received 352 bytes  3,408,209.08 bytes/sec
+total size is 413,991,028,933  speedup is 18,687.51
+{:time=>"2025-09-12 07:51:52 +0000", :elapsed=>5, :user=>"ibackup", :host=>"tux03", :command=>"rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete", :tag=>"drop-rsync-balg01", :stdout=>nil, :stderr=>nil, :status=>0, :err=>"SUCCESS"}
+Pushing out event <sheepdog_run> to <localhost:6377>
+{:cmd=>"fusermount -u ~/mnt/balg01", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :verbose=>true, :tag=>"drop-unmount-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"}
+fusermount -u ~/mnt/balg01
+No event to report <sheepdog_run>
+```
+
+And on the remote I can see the added backup:
+
+```
+tux03-new Wed, 2025-09-10 04:33:21 [dd4bbdc30898327b62d8ccdc63c5285f916d5643bffe942b73561fe297540eae]
+```
+
+All good. Now we add this to CRON and track sheepdog to see if there are problems popping up. It now confirms: 'SUCCESS	tux03	drop-rsync-balg01'.
+
+The backup drop setup is documented here:
+
+=> https://issues.genenetwork.org/topics/systems/backup-drops
+
+I am looking into setting up the backups again. Tux04 crashed a few days ago, yet again, so we were saved from that debacle! I rebooted to get at the old backups (they are elsewhere, but that is the latest). Setting up backups is slightly laborious, described here:
+
+=> https://issues.genenetwork.org/topics/systems/backups-with-borg
+
+we use sheepdog for monitoring
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+Code:
+
+=> https://github.com/pjotrp/deploy
+
+a tool that does a lot of checks in the background every day! Compressed backup sizes:
+
+```
+283G    genenetwork
+103G    tux04-containers
+```
+
+the local network speed between tux04 and tux03 is 100 Mbs. Not bad, but it takes more than an hour to move across.
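+
+As a rough sanity check on that estimate (a sketch, assuming 'Mbs' above means megabytes per second and that both compressed repositories move at the full rate, which was not measured here):
+
+```shell
+# Compressed sizes in GB, taken from the listing above
+genenetwork_gb=283
+containers_gb=103
+# Assumed sustained rate in MB/s (an assumption, not a measurement)
+rate_mb_s=100
+
+total_mb=$(( (genenetwork_gb + containers_gb) * 1000 ))
+seconds=$(( total_mb / rate_mb_s ))
+minutes=$(( seconds / 60 ))
+echo "about $minutes minutes"
+```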
+
+First manual backup worked:
+
+```
+ibackup@tux03:/export/backup/borg$ borg create genenetwork::tux03-new /export/mariadb/export/backup/mariadb/latest --stats --progress
+Archive name: tux03-new
+Archive fingerprint: dd4bbdc30898327b62d8ccdc63c5285f916d5643bffe942b73561fe297540eae
+Time (start): Wed, 2025-09-10 09:33:21
+Time (end):   Wed, 2025-09-10 10:02:52
+Duration: 29 minutes 31.00 seconds
+Number of files: 907
+Utilization of max. archive size: 0%
+------------------------------------------------------------------------------
+                       Original size      Compressed size    Deduplicated size
+This archive:              536.84 GB            238.56 GB              3.68 MB
+All archives:               65.60 TB             29.15 TB            303.71 GB
+
+                       Unique chunks         Total chunks
+Chunk index:                  253613             24717056
+------------------------------------------------------------------------------
+```
+
+Next we set up sheepdog for monitoring automated backups. Next to the
+code repos we have a script repo at
+'tux02.genenetwork.org:/home/git/pjotrp/gn-deploy-servers' which
+currently handles monitoring for our servers, including: bacchus epysode
+octopus01 penguin2 rabbit shared thebird tux01 tux02 tux04. Now tux03. The main backup script looks like
+
+```
+rm -rf $backupdir/latest
+tag="mariabackup-dump"
+sheepdog_run.rb --always -v --tag $tag -c "mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=$backupdir/latest/ --user=webqtlout --password=webqtlout"
+tag="mariabackup-make-consistent"
+sheepdog_run.rb --always -v --tag $tag -c "mariabackup --prepare --target-dir=$backupdir/latest/"
+sheepdog_borg.rb -t borg-tux04-sql --always --group ibackup -v -b /export/backup/borg/genenetwork $backupdir --args '--stats'
+```
+
+What it does is make a full copy of the mariadb databases and make sure it is consistent. Next we use borg to make a backup. The reason we need a consistent copy is that the running DB may change during the backup. And that is no good! We use sheepdog to monitor these commands - i.e. on failure we get notified. First we run it by hand to make sure it works. At first we hit errors, for example
+
+```
+ibackup@tux03:/export/backup/scripts/tux03$ ./backup.sh
+{:cmd=>"mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=/export/backup/mariadb/latest/ --user=webqtlout --password=webqtlout", :channel=>"run", :host=>"localhost", :port=>6379, :always=>true, :verbose=>true, :tag=>"mariabackup-dump", :config=>"/home/ibackup/.redis.conf"}
+mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=/export/backup/mariadb/latest/ --user=webqtlout --password=webqtlout
+[00] 2025-09-10 10:31:19 Connecting to MariaDB server host: localhost, user: webqtlout, password: set, port: not set, socket: not set
+[00] 2025-09-10 10:31:19 Using server version 10.11.11-MariaDB-0+deb12u1-log
+(...)
+[00] 2025-09-10 10:31:19 InnoDB: Using liburing
+[00] 2025-09-10 10:31:19 mariabackup: The option "innodb_force_recovery" should only be used with "--prepare".
+[00] 2025-09-10 10:31:19 mariabackup: innodb_init_param(): Error occurred.
+```
+
+The good thing is that the actual command is listed, so we can fix things a step at a time.
+
+```
+mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=/export/backup/mariadb/latest/ --user=webqtlout --password=*
+```
+
+I had to disable 'innodb_force_recovery=1' to make it work. Also permissions have to allow the backup user with 'chmod u+rX -R /var/lib/mysql/*'.
+
+Now that works I need to make sure sheepdog can send its updates to the remote machine (in NL). It is a bit complicated because we set up an ssh tunnel that can only run redis commands. It looks like
+
+```
+3 * * * * /usr/bin/ssh -i ~/.ssh/id_ecdsa_sheepdog -f -NT -o ServerAliveInterval=60 -L 6377:127.0.0.1:6379 redis-tun@sheepdog.genenetwork.org >> tunnel.log 2>&1
+```
+
+Now when I run sheepdog_status it reports
+
+```
+2025-09-10 06:01:02 -0500 (@tux04) FAIL 1 <00m00s> mariadb-test02
+2025-09-10 06:01:02 -0500 (@tux04) FAIL 1 <00m00s> mariadb-test01
+```
+
+which is correct because I switched mariadb off on tux04!
+
+Now MariaDB on tux03 is showing errors. The problem is that it actually is in an inconsistent state (sigh). Basically I am getting endless errors like:
+
+```
+Retrying read of log at LSN=1537842295040
+Retrying read of log at LSN=1537842295040
+Retrying read of log at LSN=1537842295040
+```
+
+There is a way to fix the replay log - probably harmless in our case.
+
+But what we *should* do is move this database out of the way - I may need it for Arthur - and do a proper backup recovery. I fired off an E-mail to Arthur and started recovery. It also takes about an hour to extract a borg backup of this size. I keep GN running in parallel (meanwhile) using the old DB. Bit of extra work, but less work than trying to recover from a broken DB. The good thing is we get to test backups. Btw this is exactly why it is *not* easy to migrate/update/copy/sync databases by 'just copying files'. They are too easily in an inconsistent state. There was some E-mail thread about that this year. Maybe it is a flaw of mysql/mariadb because the replay log is inconsistent when it is left open.
+
+```
+ibackup@tux03:/export/mariadb/restore$ borg extract /export/backup/borg/genenetwork::borg-tux04-sql-20250906-04:16-Sat --progress
+ 71.1% Extracting: export/backup/mariadb/latest/db_webqtl/ProbeSetData.MYI
+```
+
+So we rolled back the DB until further complaints. And made a new backup... This is how we keep ourselves busy.
+
+Turns out the new backup is problematic too! It completes, but still has redo issues. It ends with:
+
+```
+Redo log (from LSN 1537842295024 to 1537842295040) was copied.
+```
+
+The error was
+
+```
+Retrying read of log at LSN=1537842295040
+```
+
+so it is the last record (or all of them!). Kranky. I used
+
+```
+RESET MASTER
+```
+
+to clear out the redo log. It says 'Log flushed up to   1537842295040'. Good. Try another backup. Still not working. The mysql log says '[Warning] Could not read packet: fd: 24  state: 1  read_length: 4  errno: 11  vio_errno: 1158  length: 0'. But this does not appear to be related.
+
+```
+perror 11
+OS error code  11:  Resource temporarily unavailable
+```
+
+hmmm. Still not related. The error relates to the file:
+
+```
+ls -l /proc/574984/fd|grep '24 '
+lrwx------ 1 mysql mysql 64 Sep 11 07:46 124 -> /export/mariadb/export/backup/mariadb/latest/db_webqtl/IndelAll.ibd
+```
+
+Probably a good idea to check all tables! OK, let's test this table first.
+
+```
+mysqlcheck -c db_webqtl -u webqtlout -pwebqtlout IndelAll
+db_webqtl.IndelAll                                 OK
+```
+
+looks OK. Try all
+
+```
+time mysqlcheck -c -u webqtlout -pwebqtlout db_webqtl
+real    33m39.642s
+```
+
+all tables are good. Alright, I think we can make backups and the warning may go away with a future mariadb version. My assessment is that this Warning is harmless. Let's move forward by setting up sheepdog and borg backup. First backup run should show up soon as 'SUCCESS	tux03	borg-tux03-sql-backup' in
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+Now it works I add it as a CRON job to run daily. Sheepdog will tell me whether we are healthy or not.
+
+
+
+## Backups (part 3)
+
+As an aside. Last night, according to sheepdog, tux03 made a perfect backup run and dropped the data on a server in a different location.
+
+=> http://sheepdog.genenetwork.org/sheepdog/status.html
+
+There is more to do, however. First of all we don't back up everything. We should also back up the containers and the state of the machine. Finally we need to make sure the backups are backed up(!) The reason is that if a backup is corrupted it will just propagate - it has happened to us. A backup of a backup will have sane versions from before the corruption. These days, you also have to anticipate bad actors injecting stuff. That is something you won't find if they penetrated the backup system. We are quite keen on having offline backups for that reason alone.
+
+For backup of the containers we need to run as root (unfortunately). I see now we did not have a proper backup on tux04. The last one was from 2025-03-04. Now we generate these containers, but still a bad idea not to backup the small databases. Anyway, first add the containers to the backup and more state. I set it up and added the CRON job. See if it pops up on sheepdog.
diff --git a/topics/systems/ci-cd.gmi b/topics/systems/ci-cd.gmi
index a1ff2e3..e20a37a 100644
--- a/topics/systems/ci-cd.gmi
+++ b/topics/systems/ci-cd.gmi
@@ -1,4 +1,6 @@
-# CI/ CD for genetwork projects
+# CI/CD for genetwork projects
+
+Continuous integration (CI) and continuous deployment (CD) are critical parts of making sure software development does not mess up our deployment(s).
 
 We need various levels of tests to be run, from unit tests to the more complicated ones like integration, performance, regression, etc tests, and of course, they cannot all be run for each and every commit, and will thus need to be staggered across the entire deployment cycle to help with quick iteration of the code.
 
@@ -22,16 +24,86 @@ As part of the CI/CD effort, it is necessary that there is
 GeneNetwork is interested in doing two things on every commit (or
 periodically, say, once an hour/day):
 
-* CI: run unit tests
+* CI: run unit tests on git push
 * CD: rebuild and redeploy a container running GN3
 
-Arun has figured out the CI part. It runs a suitably configured laminar CI service in a Guix container created with `guix system container'. A cron job periodically triggers the laminar CI job (note: this is no longer true).
+Arun has figured out the CI part. It runs a suitably configured laminar CI service in a Guix container created with `guix system container'.
 
 => https://git.systemreboot.net/guix-forge/about/
 
-CD hasn't been figured out. Normally, Guix VMs and containers created by `guix system` can only access the store read-only. Since containers don't have write access to the store, you cannot `guix build' from within a container or deploy new containers from within a container. This is a problem for CD. How do you make Guix containers have write access to the store?
+We have the quick running tests, e.g. unit tests, run on each commit to branch "main". Once those are successful, the CI/CD system we choose should automatically pick the latest commit that passed the quick running tests for further testing and deployment.
+Once the next battery of tests is passed, the CI/CD system will create a build/artifact to be deployed to staging and have the next battery of tests run against it. If that passes, then that artifact could be deployed to production with details of the commit and deployment dependencies.
+
+## Adding a web-hook
+
+### Github hooks
+
+IIRC actions run artifacts inside github's infrastructure.  We use webhooks: e.g.
+
+Update the hook at
+
+=> https://github.com/genenetwork/genenetwork3/settings/hooks
+
+A web hook basically calls an endpoint on a git push event. The webhook for genenetwork3 has not been called recently (ever? It says: 'This hook has never been triggered.'). The webhook for genenetwork2, however, has been called.
+
+=> ./screenshot-github-webhook.png
+
+To trigger CI manually, run this with the project name:
+
+```
+curl https://ci.genenetwork.org/hooks/example-gn3
+```
+
+I just tested and it appeared this triggered a redeploy of gn2:
+
+```
+curl -XGET "https://ci.genenetwork.org/hooks/genenetwork2"
+```
+
+For gemtext we have a github hook that adds a forge-project and looks like
+
+```lisp
+(define gn-gemtext-threads-project
+  (forge-project
+   (name "gn-gemtext-threads")
+   (repository "https://github.com/genenetwork/gn-gemtext-threads/")
+   (ci-jobs (list (forge-laminar-job
+                   (name "gn-gemtext-threads")
+                   (run (with-packages (list nss-certs openssl)
+                          (with-imported-modules '((guix build utils))
+                            #~(begin
+                                (use-modules (guix build utils))
+
+                                (setenv "LC_ALL" "en_US.UTF-8")
+                                (invoke #$(file-append tissue "/bin/tissue")
+                                        "pull" "issues.genenetwork.org"))))))))
+   (ci-jobs-trigger 'webhook)))
+```
+
+The normal trigger is automatic: you push code to either of the two repos (three? I'll verify), GN2 and GN3, and laminar runs the jobs, updates the code in the container and restarts services, as appropriate.
+
+If you want to trigger the CI manually, there are webhooks available that can be called with something like:
+
+```
+curl -XGET "https://ci.genenetwork.org/hooks/genenetwork2"
+```
+
+for GN2. Change the part after /hooks/ for each of the different repos as follows:
+
+```
+GN2: /genenetwork2
+GN3: /genenetwork3
+gn-auth: /gn-auth (I need to verify this)
+gn-uploader: Does not exist right now
+```
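+
+A small sketch of how the list above maps to URLs. The 'hook_url' helper is made up for illustration, and only the GN2 and GN3 paths are used because the gn-auth path is still unverified. It prints the curl commands instead of running them:
+
+```shell
+# Base URL of the CI webhooks, as used in the examples above
+base="https://ci.genenetwork.org/hooks"
+
+# hook_url REPO: print the webhook URL for a repo (hypothetical helper)
+hook_url() {
+    printf '%s/%s\n' "$base" "$1"
+}
+
+# Dry run: show the command for each repo instead of calling the hook
+for repo in genenetwork2 genenetwork3; do
+    echo "curl -XGET \"$(hook_url "$repo")\""
+done
+```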
+
+Guix forge can be found at
+
+=> https://git.systemreboot.net/guix-forge/
 
-Another alternative for CI/ CD were to have the quick running tests, e.g unit tests, run on each commit to branch "main". Once those are successful, the CI/CD system we choose should automatically pick the latest commit that passed the quick running tests for for further testing and deployment, maybe once an hour or so. Once the next battery of tests is passed, the CI/CD system will create a build/artifact to be deployed to staging and have the next battery of tests runs against it. If that passes, then that artifact could be deployed to production, and details on the commit and
+### git.genenetwork.org hooks
+
+TBD
 
 #### Possible Steps
 
@@ -81,7 +153,7 @@ Below are some possible steps (and tasks) to undertake for automated deployment
 * Generate guix declaration for re-generating the release
 * Archive container image, documentation and guix declaration for possible rollback
 
-#### Some Work Done
+#### See also
 
 => /topics/systems/gn-services GN Services
 
@@ -90,49 +162,3 @@ This contains a check-list of things that need to be done:
 => /topics/systems/orchestration Orchestration
 
 => /issues/broken-cd  Broken-cd (Resolved)
-
-## Adding a web-hook
-
-### Github hooks
-
-IIRC actions run artifacts inside github's infrastracture.  We use webhooks: e.g.
-
-Update the hook at
-
-=> https://github.com/genenetwork/genenetwork3/settings/hooks
-
-=> ./screenshot-github-webhook.png
-
-To trigger CI manually, run this with the project name:
-
-```
-curl https://ci.genenetwork.org/hooks/example-gn3
-```
-
-For gemtext we have a github hook that adds a forge-project and looks like
-
-```lisp
-(define gn-gemtext-threads-project
-  (forge-project
-   (name "gn-gemtext-threads")
-   (repository "https://github.com/genenetwork/gn-gemtext-threads/")
-   (ci-jobs (list (forge-laminar-job
-                   (name "gn-gemtext-threads")
-                   (run (with-packages (list nss-certs openssl)
-                          (with-imported-modules '((guix build utils))
-                            #~(begin
-                                (use-modules (guix build utils))
-
-                                (setenv "LC_ALL" "en_US.UTF-8")
-                                (invoke #$(file-append tissue "/bin/tissue")
-                                        "pull" "issues.genenetwork.org"))))))))
-   (ci-jobs-trigger 'webhook)))
-```
-
-Guix forge can be found at
-
-=> https://git.systemreboot.net/guix-forge/
-
-### git.genenetwork.org hooks
-
-TBD
diff --git a/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi b/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi
index 131474c..f3cbbd6 100644
--- a/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi
+++ b/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi
@@ -1,12 +1,59 @@
 # Debugging and developing code
 
-Once we get to the stage of having a working system container it would be nice to develop code against it. The idea is to take an existing running system container and start modifying code *inside* the container by brining in an external path.
+Once we get to the stage of having a working system container it would be nice to develop code against it. The idea is to take an existing running system container and start modifying code *inside* the container by bringing in an external path.
 
-First build and start a guix system container as described in
+In principle we'll build guix system containers as described in
 
-=> /topics/guix/guix-system-containers-and-how-we-use-them.gmi
+=> /topics/guix/guix-system-containers-and-how-we-use-them
 
-The idea is to do less `guix pull' and system container builds, so as to speed up development. The advantage of using an existing system container is that the full deployment is the same on our other running systems! No more path hacks, in other words.
+The idea is to minimise `guix pull' and system container builds, so as to speed up development. The advantage of using an existing system container is that the full deployment is the same on our other running systems! No more path hacks, in other words.
+
+## Philosophy
+
+For development containers we will:
+
+* Use sane default values - for URLs, paths etc.
+* Add services incrementally (i.e., not one big blob)
+* Run tests inside the container (not during build time)
+* Build indexes etc. outside the container - or make it optional
+
+Also:
+
+* We should be able to run gn3 and gn-guile (aka gn4) as a guix shell without anything else
+* We should be able to run gn2 with only gn3 and/or gn-guile as a guix shell with external DBs.
+* We should be able to run gn2+gn3+gn-guile as a system container with external DBs.
+* We should be able to run gn-auth with gn2 as a system container
+* We should be able to run the uploader as a system container
+
+I.e. no https, no authentication by default (as long as we run on localhost). The localhost URLs and file paths can be defaults because there will only be one development container running on a single machine.
+
+System containers are a bit overkill for development. Still, in some cases we'll need a system container. For example when testing integration of gn-auth, uploader, gn2 etc. We have the CD deployment that gets updated when git repos change. We also have a development container written by @bonz that needs to be improved and documented.
+
+=> https://git.genenetwork.org/gn-machines/tree/genenetwork-local-container.scm?h=gn-local-development-container
+
+Note it is on a special branch for now.
+
+Databases and files will simply be shared on default paths - /export/guix-containers/gndev/...
+And if you need different combinations it should be relatively easy to compose a new shell or system container.
+
+# Tags
+
+* type: bug
+* status: open
+* priority: high
+* assigned: pjotrp
+* interested: pjotrp,bonfacem,fredm
+* keywords: development, deployment, server
+
+# Tasks
+
+Create a dev environment for:
+
+* [ ] GN3
+* [ ] gn-guile
+* [ ] GN2
+* [ ] gn-auth
+* [ ] gn-uploader
 
 # GN3 in system container
 
@@ -258,6 +305,12 @@ guix-vm-run:
   $cmd
 ```
 
+## Virtuoso in a system container
+
+See
+
+=> ./virtuoso
+
 # Troubleshooting
 
 ## Updating the VM does not show latest fixes
diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi
index a0a2f16..d034575 100644
--- a/topics/systems/hpc/octopus-maintenance.gmi
+++ b/topics/systems/hpc/octopus-maintenance.gmi
@@ -2,10 +2,23 @@
 
 ## Slurm
 
-Status of slurm
+Status of slurm (as of 202512)
 
 ```
 sinfo
+workers*     up   infinite      8   idle octopus[03,05-11]
+allnodes     up   infinite      3  alloc tux[06,08-09]
+allnodes     up   infinite     11   idle octopus[02-03,05-11],tux[05,07]
+tux          up   infinite      3  alloc tux[06,08-09]
+tux          up   infinite      2   idle tux[05,07]
+1tbmem       up   infinite      1   idle octopus02
+headnode     up   infinite      1   idle octopus01
+highmem      up   infinite      2   idle octopus[02,11]
+386mem       up   infinite      6   idle octopus[03,06-10]
+lowmem       up   infinite      7   idle octopus[03,05-10]
+```
+
+```
 sinfo -R
 squeue
 ```
@@ -29,7 +42,7 @@ UnkillableStepProgram   = (null)
 UnkillableStepTimeout   = 60 sec
 ```
 
-check valid configuration with `slurmd -C` and update nodes with
+check valid configuration with 'slurmd -C' and update nodes with
 
 ```
 scontrol reconfigure
@@ -45,13 +58,13 @@ Basically the root user can copy across.
 
 ## Execute binaries on mounted devices
 
-To avoid `./scratch/script.sh: Permission denied` on `device_file`:
+To avoid './scratch/script.sh: Permission denied' on 'device_file':
 
-- `sudo bash`
-- `ls /scratch -l` to check where `/scratch` is
-- `vim /etc/fstab`
-- replace `noexec` with `exec` for `device_file`
-- `mount -o remount [device_file]` to remount the partition with its new configuration.
+- 'sudo bash'
+- 'ls /scratch -l' to check where '/scratch' is
+- 'vim /etc/fstab'
+- replace 'noexec' with 'exec' for 'device_file'
+- 'mount -o remount [device_file]' to remount the partition with its new configuration.
 
 Some notes:
 
@@ -67,7 +80,7 @@ x-systemd.device-timeout=
 10.0.0.110:/export/3T  /mnt/3T  nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
 
 
-## Installation of `munge` and `slurm` on a new node
+## Installation of 'munge' and 'slurm' on a new node
 
 Current nodes in the pool have:
 
@@ -78,7 +91,7 @@ sbatch --version
     slurm-wlm 18.08.5-2
 ```
 
-To install `munge`, go to `octopus01` and run:
+To install 'munge', go to 'octopus01' and run:
 
 ```shell
 guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
@@ -86,7 +99,7 @@ guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
 systemctl status munge # to check if the service is running and where its service file is
 ```
 
-We need to setup the rights for `munge`:
+We need to setup the rights for 'munge':
 
 ```shell
 sudo bash
@@ -100,7 +113,7 @@ mkdir -p /var/lib/munge
 chown munge:munge /var/lib/munge/
 
 mkdir -p /etc/munge
-# copy `munge.key` (from a working node) to `/etc/munge/munge.key`
+# copy 'munge.key' (from a working node) to '/etc/munge/munge.key'
 chown -R munge:munge /etc/munge
 
 mkdir -p /run/munge
@@ -112,7 +125,7 @@ chown munge:munge /var/log/munge
 mkdir -p /var/run/munge # todo: not sure why it needs such a folder
 chown munge:munge /var/run/munge
 
-# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service`
+# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service'
 
 systemctl daemon-reload
 systemctl enable munge
@@ -120,25 +133,25 @@ systemctl start munge
 systemctl status munge
 ```
 
-To test the new installation, go to `octopus01` and then:
+To test the new installation, go to 'octopus01' and then:
 
 ```shell
 munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
 ```
 
-If you get `STATUS: Rewound credential (16)`, it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
+If you get 'STATUS: Rewound credential (16)', it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
 
 ```shell
 sudo date MMDDhhmmYYYY.ss
 ```
 
-To install `slurm`, go to `octopus01` and run:
+To install 'slurm', go to 'octopus01' and run:
 
 ```shell
 guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
 ```
 
-We need to setup the rights for `slurm`:
+We need to setup the rights for 'slurm':
 
 ```shell
 sudo bash
@@ -152,8 +165,8 @@ mkdir -p /var/lib/slurm
 chown munge:munge /var/lib/slurm/
 
 mkdir -p /etc/slurm
-# copy `slurm.conf` to `/etc/slurm/slurm.conf`
-# copy `cgroup.conf` to `/etc/slurm/cgroup.conf`
+# copy 'slurm.conf' to '/etc/slurm/slurm.conf'
+# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf'
 
 chown -R slurm:slurm /etc/slurm
 
@@ -163,7 +176,7 @@ chown slurm:slurm /run/slurm
 mkdir -p /var/log/slurm
 chown slurm:slurm /var/log/slurm
 
-# copy `slurm.service` to `/etc/systemd/system/slurm.service`
+# copy 'slurm.service' to '/etc/systemd/system/slurm.service'
 
 /export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information
 
@@ -173,12 +186,24 @@ systemctl start slurm
 systemctl status slurm
 ```
 
-On `octopus01` (the master):
+On 'octopus01' (the master):
 
 ```shell
 sudo bash
 
-# add the new node to `/etc/slurm/slurm.conf`
+# add the new node to '/etc/slurm/slurm.conf'
 
 systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
 ```
+
+
+# Removing a node
+
+We are removing o3 so it can become the new head node:
+
+```
+scontrol update nodename=octopus03 state=drain reason="removing"
+scontrol show node octopus03 | grep State
+scontrol update nodename=octopus03 state=down reason="removed"
+  State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+```
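+
+To check the state from a script, the 'State=' field can be pulled out of the 'scontrol show node' output. A minimal sketch (the 'node_state' helper is hypothetical, and the sample line is the one shown above):
+
+```shell
+# node_state: read 'scontrol show node' output on stdin, print the State value
+node_state() {
+    sed -n 's/.*State=\([^ ]*\).*/\1/p'
+}
+
+# Example with the line from above; on the cluster one would pipe
+# 'scontrol show node octopus03' into node_state instead.
+sample='  State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A'
+state=$(printf '%s\n' "$sample" | node_state)
+echo "$state"
+```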
diff --git a/topics/systems/hpc/performance.gmi b/topics/systems/hpc/performance.gmi
index ac5e861..ee604b5 100644
--- a/topics/systems/hpc/performance.gmi
+++ b/topics/systems/hpc/performance.gmi
@@ -14,13 +14,13 @@ hdparm -Ttv /dev/sdc1
 
 Cheap and cheerful:
 
-Read test:
+Write test:
 
 ```
 dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct
 ```
 
-Write test:
+Read test:
 
 ```
 /sbin/sysctl -w vm.drop_caches=3
diff --git a/topics/systems/linux/GPU-on-balg01.gmi b/topics/systems/linux/GPU-on-balg01.gmi
new file mode 100644
index 0000000..d0cb3fc
--- /dev/null
+++ b/topics/systems/linux/GPU-on-balg01.gmi
@@ -0,0 +1,201 @@
+# Installing GPU on Balg01 server
+
+lspci shows the card, an L4
+
+=> https://www.techpowerup.com/gpu-specs/l4.c4091
+
+```
+lspci|grep NVIDIA
+NVIDIA Corporation AD104GL
+```
+
+The machine had raspi and Tesla support installed (?!), so I removed that:
+
+```
+apt-get remove firmware-nvidia-tesla-gsp
+```
+
+Disabled nouveau drivers
+
+```/etc/modprobe.d/blacklist-nouveau.conf
+blacklist nouveau
+options nouveau modeset=0
+```
+
+```
+dpkg --purge raspi-firmware
+update-initramfs -u
+reboot (can skip for a bit)
+```
+
+## Create fallback boot partition
+
+Well, before rebooting I should have created another fallback boot partition with a more recent debian.
+Unfortunately I had not prepared space on one of the disks (something I normally do). Turned out /dev/sdc on /export3 was not really used lately, so I could move that data and reuse that partition.
+
+```
+/dev/sdc1       1.8T  552G  1.2T  33% /export3
+```
+
+it is a very slow drive (btw), not sure why. I ran badblocks but it does not make a difference. The logs show:
+
+```
+Oct 04 09:34:37 balg01 kernel: I/O error, dev sdc, sector 23392285 op 0x9:(WRITE_ZEROES) flags 0x8000000 >
+```
+
+but it looks more like a driver problem than an actual disk error. Well, maybe on the new debian install it will be fine.
+At this point it is just to install a fallback boot partition, so no real worries.
+
+On using debootstrap, grub etc. the old partition came back fine and I tested I can also boot into the new Debian install. Especially with remote servers this is a great comfort.
+
+## CUDA continued
+
+Now we have a fallback boot partition it is a bit easier to mess with CUDA drivers.
+
+To install the CUDA drivers you may need to disable 'secure boot' in the bios.
+
+```
+apt install build-essential gcc make cmake dkms
+apt install linux-headers-$(uname -r)
+```
+
+In the driver selector for Debian, choose data center and L series: Driver Version:580.95.05 CUDA Toolkit:13.0 Release Date:Wed Oct 01, 2025 File Size:844.44 MB
+
+Note I installed the nvidia-open drivers. If things are not working we should look at the proprietary stuff. I used the 'local repository installation' instructions of
+
+=> https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#debian-installation
+
+
+```
+apt-get install nvidia-libopencl1 nvidia-open nvidia-driver-cuda
+```
+
+The first one is to prevent
+
+```
+libnppc11 : Conflicts: nvidia-libopencl1
+```
+
+now this should run
+
+```
+balg01:~# nvidia-smi
+Sat Oct  4 11:56:19 2025
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
++-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  NVIDIA L4                      Off |   00000000:81:00.0 Off |                    0 |
+| N/A   57C    P0             29W /   72W |       0MiB /  23034MiB |      2%      Default |
+|                                         |                        |                  N/A |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|  No running processes found                                                             |
++-----------------------------------------------------------------------------------------+
+```
+
+## Testing GPU
+
+
+Using Guix python I ran:
+
+```
+pip install "gpu-benchmark-tool[nvidia]"
+```
+
+Of course it downloads a ridiculous amount of binaries. But then we can run
+
+```
+export PATH=/home/wrk/.local/bin:$PATH
+gpu-benchmark benchmark --duration=30
+```
+
+That did not work. The CUDA samples are packaged in Debian and need to be built first:
+
+```
+apt-get install nvidia-cuda-samples nvidia-cuda-toolkit-gcc
+cd /usr/share/doc/nvidia-cuda-toolkit/examples/Samples/6_Performance/transpose
+export CUDA_PATH=/usr
+make
+./transpose
+> [NVIDIA L4] has 58 MP(s) x 128 (Cores/MP) = 7424 (Cores)
+> Compute performance scaling factor = 1.00
+...
+Test passed
+```
+
+Note that this removed nvidia-smi. Let's look at versions:
+
+```
+pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb
+pool/contrib/n/nvidia-cuda-samples/nvidia-cuda-samples_11.8~dfsg-2_all.deb
+pool/non-free/n/nvidia-cuda-toolkit/nvidia-cuda-toolkit-gcc_11.8.0-5~deb12u1_amd64.deb
+pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb
+```
+
+while
+
+```
+Filename: ./nvidia-open_580.95.05-1_amd64.deb
+Package: nvidia-driver-cuda
+Version: 580.95.05-1
+Section: NVIDIA
+Source: nvidia-graphics-drivers
+Provides: nvidia-cuda-mps, nvidia-smi
+```
+
+and it turns out to be a mixture. I have to take real care not to mix in Debian packages! For example this package is a Debian original:
+
+```
+ii  nvidia-cuda-gdb                             11.8.86~11.8.0-5~deb12u1                amd64        NVIDIA CUDA Debugger (GDB)
+```
+
+```
+apt remove --purge nvidia-* cuda-* libnvidia-*
+```
+
+says
+
+```
+Note, selecting 'libnvidia-gpucomp' instead of 'libnvidia-gpucomp-580.95.05'
+```
+
+To view installed packages belonging to Debian itself:
+
+```
+dpkg -l|grep nvid|grep deb12
+dpkg -l|grep cuda|grep deb12
+```
+
+Let's reinstall and make sure only NVIDIA packages are used:
+
+```
+wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
+dpkg -i cuda-keyring_1.1-1_all.deb
+apt-get update
+apt-get install cuda-toolkit  cuda-compiler-12-2
+```
+
+Now we have:
+
+```
+/usr/local/cuda-12.3/bin/nvcc --version
+nvcc: NVIDIA (R) Cuda compiler driver
+Copyright (c) 2005-2023 NVIDIA Corporation
+Built on Wed_Nov_22_10:17:15_PST_2023
+```
+
+# Pytorch
+
+The CUDA environment variables for PyTorch are probably useful:
+
+=> https://docs.pytorch.org/docs/stable/cuda_environment_variables.html
diff --git a/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi
new file mode 100644
index 0000000..81e721f
--- /dev/null
+++ b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi
@@ -0,0 +1,74 @@
+# GPU Graphics Driver Set-Up
+
+Tux02 has the Tesla K80 (GK210GL) GPU.  For machine learning, we want the official proprietary NVIDIA drivers.
+
+## Installation
+
+* Debian 12 moved NVIDIA driver into the non-free-firmware repo.  Add the following to "/etc/apt/sources.list" and run "sudo apt update":
+
+```
+deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
+```
+
+* Make sure the correct kernel headers are installed:
+
+```
+sudo apt install linux-headers-$(uname -r)
+```
+
+* Install "nvidia-tesla-470-driver"⁰ (The NVIDIA line-up of programmable "Tesla" devices, used primarily for simulations and large-scale calculations, also require separate driver packages to function correctly compared to the consumer-grade GeForce GPUs that are instead targeted for desktop and gaming usage)¹:
+
+```
+sudo apt purge 'nvidia-*'
+sudo apt install nvidia-tesla-470-driver
+```
+
+* Blacklist nouveau since it conflicts with NVIDIA's driver, and regenerate the initramfs with "sudo update-initramfs -u":
+
+```
+echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
+echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
+```
+
+* Reboot and test the nvidia drivers:
+
+```
+sudo reboot
+nvidia-smi
+
+# optional if you want to use nvidia-cuda-toolkit
+sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit
+```
+
+## Issues
+
+Holding off on the reboot until I check in with the rest of the team regarding a raspi-firmware initramfs hook:
+
+```
+update-initramfs: Generating /boot/initrd.img-6.1.0-9-amd64
+raspi-firmware: missing /boot/firmware, did you forget to mount it?
+run-parts: /etc/initramfs/post-update.d//z50-raspi-firmware exited with return code 1
+dpkg: error processing package initramfs-tools (--configure):
+ installed initramfs-tools package post-installation script subprocess returned error exit status 1
+Processing triggers for libgdk-pixbuf-2.0-0:amd64 (2.42.10+dfsg-1+deb12u1) ...
+Errors were encountered while processing:
+ initramfs-tools
+```
+
+Removed the firmware by running:
+
+```
+sudo apt purge raspi-firmware
+
+# Configure all packages that are installed but not yet fully configured
+sudo dpkg --configure -a
+
+# Update initramfs since we updated our drivers
+sudo update-initramfs -u
+```
+
+## References
+
+=> https://us.download.nvidia.com/XFree86/Linux-x86_64/470.129.06/README/supportedchips.html ⁰ Nvidia 470.129.06 Supported Chipsets.
+=> https://wiki.debian.org/NvidiaGraphicsDrivers#Tesla_Drivers ¹ Debian Tesla Drivers.
+=> https://wiki.debian.org/NvidiaGraphicsDrivers/Configuration ² NVIDIA Proprietary Driver: Configuration.
diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi
index 977120d..3442d4e 100644
--- a/topics/systems/mariadb/precompute-mapping-input-data.gmi
+++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi
@@ -2,7 +2,7 @@
 
 GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.
 
-At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that track the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See
+At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that tracks the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See
 
 => /topics/data/precompute/steps
 
@@ -43,7 +43,13 @@ And after:
 
 # Info
 
-## Original qtlreaper version
+## Original qtlreaper version for PublishData
+
+See the writeup at
+
+=> ./precompute-publishdata
+
+## Original qtlreaper version for ProbeSetData
 
 The original reaper precompute lives in
 
diff --git a/topics/systems/mariadb/precompute-publishdata.gmi b/topics/systems/mariadb/precompute-publishdata.gmi
new file mode 100644
index 0000000..74c278f
--- /dev/null
+++ b/topics/systems/mariadb/precompute-publishdata.gmi
@@ -0,0 +1,3370 @@
+# Precompute PublishData
+
+Based on QTL_Reaper_cal_lrs.py, aka QTL_Reaper_v8_PublishXRef.py. This script simply updates the PublishXRef table with the highest hit as computed by qtlreaper.
+
+In a first attempt to update the database we are going to do just that using GEMMA.
+
+For the new script we will pass in the genotype file as well as the phenotype file, so gemma-wrapper can process it. I wrote quite a few scripts already
+
+=> https://github.com/genetics-statistics/gemma-wrapper/tree/master/bin
+
+So we can convert a .geno file to BIMBAM. I need to extract GN traits to an R/qtl2 or lmdb trait format file and use that as input.
+
+* [X] Visit use of PublishXRef
+* [X] geno -> BIMBAM (BXD first)
+* [X] Get PublishData trait(s) and convert to gemma, R/qtl2 or lmdb
+* - [X] see scripts/lmdb-publishdata-export.scm
+* - [X] see scripts for ProbeSetData
+* - [X] Make sure the BXDs are mappable
+* [X] Run gemma-wrapper
+* [X] We should map by trait-id, data id is not intuitive: curl http://127.0.0.1:8091/dataset/bxd-publish/values/8967044.json > 10002-pheno.json
+* [X] Check why Zach/GN JSON file lists different mappable BXDs
+* [X] Update DB on run-server
+* [X] Add batch run and some metadata so we can link back from results
+* [X] Create a DB/table containing hits and old reaper values
+* [X] Convert this info to RDF
+* [X] Run virtuoso server
+* [X] When loading traits compute mean, se, skew, kurtosis and store them as metadata in lmdb
+* [ ] Why is X not showing in LMM precompute for trait 51064
+* [X] Correctly handle Infinite LOD
+* [X] Ask interesting questions about the overlap between reaper and gemma
+* [ ] Update PublishXRef and store old reaper value(?)
+* [ ] Correctly Handle gn-guile escalating errors
+* [X] RDF point back to original data file
+* [ ] Fix Infinity also in LMM run (156 SNPs only)
+* [ ] Make time stamp, host, user a compute 'origin' block in RDF
+* [X] RDF mark QTL
+* [ ] Make sure the trait fetcher handles authorization or runs localhost only
+* [ ] gemma-wrapper --force does not work for GRM and re-check GRM does not change on phenotype
+* [ ] Use SNP URIs when possible (instead of inventing our own) - and BED information so we can locate them
+* [ ] Check lmdb duplicate key warning
+* [ ] run gemma with pangenome-derived genotypes
+* [ ] run gemma with qnorm
+* [ ] run gemma with sex covariate
+* [ ] run gemma again with the hit as a covariate
+* [ ] Check invalid data sets/traits and feed them to Rob/Arthur
+* [ ] Add metadata for bimodality indicator in addition to kurtosis (see below)
+* [ ] Provide SPARQL to find QTL and return metadata about traits
+* [ ] Provide PheWAS examples
+* [ ] Add BED information on Genes
+* [ ] Update Xapian search - also to handle gene aliases
+* [ ] Create GN UI with Zach
+
+For the last we should probably add a few columns. Initially we'll only store the maximum hit.
+
+After
+
+* [ ] provide distributed storage of files using https
+
+# Visit use of PublishXRef
+
+In GN2 this table is used in search, auth, and router. For search it is used to look for trait hits (logically). For the router it is used to fetch trait info as well as dataset info.
+
+In GN3 this table is used for partial correlations. Also to fetch API trait info and to build the search index.
+
+In GN1 usage is similar.
+
+# geno -> BIMBAM
+
+We can use the script in gemma-wrapper
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py
+
+there is probably something similar in GN2. And I have another version somewhere.
+
+To identify the geno file the reaper script uses
+
+```python
+cursor.execute('select Id, Name from InbredSet')
+results = cursor.fetchall()
+InbredSets = {}
+for item in results:
+	InbredSets[item[0]] = genotypeDir+str(item[1])+'.geno'
+```
+
+which assumes one single geno file for the BXD that is indexed by the InbredSetID (a number). Note it ignores the many genotype files we have per inbredset (today). Also there is a funny hardcoded
+
+```python
+	if InbredSetId==3:
+		InbredSetId=1
+```
+
+(no comment).
+
+Later we'll output to lmdb when GEMMA supports it.
+
+There are about 100 InbredSets. Genotype files can be found on production in
+/export/guix-containers/genenetwork/var/genenetwork/genotype-files/genotype. For the BXD alone there are
+
+```
+BXD.2.geno               BXD-Heart-Metals_old.geno   BXD-Micturition.6.geno
+BXD.4.geno               BXD-JAX-AD.4.geno           BXD-Micturition.8.geno
+BXD.5.geno               BXD-JAX-AD.8.geno           BXD-Micturition.geno
+BXD.6.geno               BXD-JAX-AD.geno             BXD-Micturition_old.4.geno
+BXD.7.geno               BXD-JAX-AD_old.geno         BXD-Micturition_old.6.geno
+BXD.8.geno               BXD-JAX-OFS.geno            BXD-Micturition_old.geno
+BXD-AE.4.geno            BXD-Longevity.4.geno        BXD_mm8.geno
+BXD-AE.8.geno            BXD-Longevity.8.geno        BXD-NIA-AD.4.geno
+BXD-AE.geno              BXD-Longevity.9.geno        BXD-NIA-AD.8.geno
+BXD-AE_old.geno          BXD-Longevity.array.geno    BXD-NIA-AD.geno
+BXD-Bone.geno            BXD-Longevity.classic.geno  BXD-NIA-AD_old2.geno
+BXD-Bone_orig.geno       BXD-Longevity.geno          BXD-NIA-AD_old.geno
+BXD.geno                 BXD-Longevity_old.4.geno    BXD_Nov_23_2010_before_polish_101_102_103.geno
+BXD-Harvested.geno       BXD-Longevity_old.8.geno    BXD_Nov_24_2010_before_polish_55_81.geno
+BXD-Heart-Metals.4.geno  BXD-Longevity_old.geno      BXD_old.geno
+BXD-Heart-Metals.8.geno  BXD-MBD-UTHSC.geno          BXD_unsure.geno
+BXD-Heart-Metals.geno    BXD-Micturition.4.geno      BXD_UT-SJ.geno
+```
+
+Not really reflected in the DB:
+
+```
+MariaDB [db_webqtl]> select Id, Name from InbredSet where name like '%BXD%';
++----+------------------+
+| Id | Name             |
++----+------------------+
+|  1 | BXD              |
+| 58 | BXD-Bone         |
+| 64 | BXD-Longevity    |
+| 68 | BXD_Dev          |
+| 76 | DOD-BXD-GWI      |
+| 84 | BXD-Heart-Metals |
+| 86 | BXD-AE           |
+| 91 | BXD-Micturition  |
+| 92 | BXD-JAX-AD       |
+| 93 | BXD-NIA-AD       |
+| 94 | CCBXD-TM         |
+| 96 | BXD-JAX-OFS      |
+| 97 | BXD-MBD-UTHSC    |
++----+------------------+
+```
+
+Bit of a mess. Looks like some files are discarded. Let's see what the reaper script does.
+
+We should also look into distributed storage. One option is webdav.
+
+# Get PublishData trait(s) and convert to R/qtl2 or lmdb
+
+Let's see how the scripts do it. Note that we already did that for the probeset script in
+
+=> precompute-mapping-input-data
+
+The code is reflected in
+
+=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute/list-traits-to-compute.scm
+
+Now I need to do the exact same thing, but for PublishData.
+
+Let's connect to a remote GN DB:
+
+```
+ssh -L 3306:127.0.0.1:3306 -f -N tux02.genenetwork.org
+```
+
+and follow
+
+=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v8_PublishXRef.py
+
+the script takes a number of values 'PublishFreezeIds'. Alternatively it picks it up by SpeciesId (hard effing coded, of course).
+
+=> https://github.com/genenetwork/genenetwork2/blob/fcde38b0f37f12508a01b16b7820029aa951bded/scripts/maintenance/QTL_Reaper_v8_PublishXRef.py#L62
+
+Next it picks the geno file from the InbredSetID with
+
+```
+select InbredSetId  from PublishFreeze  where PublishFreeze.Id = 1;
+```
+
+Here we are initially going to focus on BXD=1 datasets only.
+
+```
+MariaDB [db_webqtl]> select Id,InbredSetId  from PublishFreeze  where InbredSetId = 1;
++----+-------------+
+| Id | InbredSetId |
++----+-------------+
+|  1 |           1 |
++----+-------------+
+```
+
+(we are halfway through the script now). Next we capture some metadata
+
+```
+MariaDB [db_webqtl]> select PhenotypeId, Locus, DataId, Phenotype.Post_publication_description from PublishXRef, Phenotype where PublishXRef.PhenotypeId = Phenotype.Id and InbredSetId=1 limit 5;
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+| PhenotypeId | Locus          | DataId  | Post_publication_description                                                                                               |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+|           4 | rs48756159     | 8967043 | Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]                       |
+|          10 | rsm10000005699 | 8967044 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]                 |
+|          15 | rsm10000013713 | 8967045 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] |
+|          20 | rs48756159     | 8967046 | Central nervous system, morphology: Cerebellum volume [mm3]                                                                |
+|          25 | rsm10000005699 | 8967047 | Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3]                       |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+```
+
+it captures LRS
+
+```
+MariaDB [db_webqtl]> select LRS from PublishXRef where PhenotypeId=4 and InbredSetId=1;
++--------------------+
+| LRS                |
++--------------------+
+| 13.497491147108706 |
++--------------------+
+```
+
+and finally the trait values that are used for mapping
+
+```
+select Strain.Name, PublishData.value from Strain, PublishData where Strain.Id = PublishData.StrainId and PublishData.Id = 8967043;
++-------+-----------+
+| Name  | value     |
++-------+-----------+
+| BXD1  | 61.400002 |
+| BXD2  | 49.000000 |
+| BXD5  | 62.500000 |
+| BXD6  | 53.099998 |
+| BXD8  | 59.099998 |
+| BXD9  | 53.900002 |
+| BXD11 | 53.099998 |
+| BXD12 | 45.900002 |
+| BXD13 | 48.400002 |
+| BXD14 | 49.400002 |
+| BXD15 | 47.400002 |
+| BXD16 | 56.299999 |
+| BXD18 | 53.599998 |
+| BXD19 | 50.099998 |
+| BXD20 | 48.200001 |
+| BXD21 | 50.599998 |
+| BXD22 | 53.799999 |
+| BXD23 | 48.599998 |
+| BXD24 | 54.900002 |
+| BXD25 | 49.599998 |
+| BXD27 | 47.400002 |
+| BXD28 | 51.500000 |
+| BXD29 | 50.200001 |
+| BXD30 | 53.599998 |
+| BXD31 | 49.700001 |
+| BXD32 | 56.000000 |
+| BXD33 | 52.099998 |
+| BXD34 | 53.700001 |
+| BXD35 | 49.700001 |
+| BXD36 | 44.500000 |
+| BXD38 | 51.099998 |
+| BXD39 | 54.900002 |
+| BXD40 | 49.900002 |
+| BXD42 | 59.400002 |
++-------+-----------+
+```
+
+Note that we need to filter out the parents - the original reaper script does not do that! My gn-guile code does handle that:
+
+```
+SELECT StrainId,Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId =1 AND Used_for_mapping<>'Y' limit 5;
++----------+----------+
+| StrainId | Name     |
++----------+----------+
+|        1 | B6D2F1   |
+|        2 | C57BL/6J |
+|        3 | DBA/2J   |
+|      150 | A/J      |
+|      151 | AXB1     |
++----------+----------+
+etc.
+```
+
+Also Bonz' script
+
+=> https://git.genenetwork.org/gn-guile/tree/scripts/lmdb-publishdata-export.scm
+
+has an interesting query:
+
+```
+MariaDB [db_webqtl]>
+SELECT DISTINCT PublishFreeze.Name, PublishXRef.Id FROM PublishData
+  INNER JOIN Strain ON PublishData.StrainId = Strain.Id
+  INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId
+  INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId
+  LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId
+  LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId
+  WHERE PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1
+  ORDER BY PublishFreeze.Id, PublishXRef.Id limit 5;
++------------+-------+
+| Name       | Id    |
++------------+-------+
+| BXDPublish | 10001 |
+| BXDPublish | 10002 |
+| BXDPublish | 10003 |
+| BXDPublish | 10004 |
+| BXDPublish | 10005 |
++------------+-------+
+5 rows in set (0.239 sec)
+```
+
+that shows we have 13689 BXDPublish datasets. It also has
+
+```
+SELECT
+JSON_ARRAYAGG(JSON_ARRAY(Strain.Name, PublishData.Value)) AS data,
+ MD5(JSON_ARRAY(Strain.Name, PublishData.Value)) as md5hash
+FROM
+    PublishData
+    INNER JOIN Strain ON PublishData.StrainId = Strain.Id
+    INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId
+    INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId
+LEFT JOIN PublishSE ON
+    PublishSE.DataId = PublishData.Id AND
+    PublishSE.StrainId = PublishData.StrainId
+LEFT JOIN NStrain ON
+    NStrain.DataId = PublishData.Id AND
+    NStrain.StrainId = PublishData.StrainId
+WHERE
+    PublishFreeze.Name = "BXDPublish" AND
+    PublishFreeze.public > 0 AND
+    PublishData.value IS NOT NULL AND
+    PublishFreeze.confidentiality < 1
+ORDER BY
+    LENGTH(Strain.Name), Strain.Name LIMIT 5;
+```
+
+best to pipe that to a file. It outputs JSON and an MD5SUM straight from mariadb. Interesting.
+
+Finally, let's have a look at the existing GN API
+
+```
+SELECT
+                            Strain.Name, Strain.Name2, PublishData.value, PublishData.Id, PublishSE.error, NStrain.count
+                        FROM
+                            (PublishData, Strain, PublishXRef, PublishFreeze)
+                        LEFT JOIN PublishSE ON
+                            (PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId)
+                        LEFT JOIN NStrain ON
+                            (NStrain.DataId = PublishData.Id AND
+                            NStrain.StrainId = PublishData.StrainId)
+                        WHERE
+                            PublishXRef.InbredSetId = 1 AND
+                            PublishXRef.PhenotypeId = 4 AND
+                            PublishData.Id = PublishXRef.DataId AND
+                            PublishData.StrainId = Strain.Id AND
+                            PublishXRef.InbredSetId = PublishFreeze.InbredSetId AND
+                            PublishFreeze.public > 0 AND
+                            PublishFreeze.confidentiality < 1
+                        ORDER BY
+                            Strain.Name;
++-------+-------+-----------+---------+-------+-------+
+| Name  | Name2 | value     | Id      | error | count |
++-------+-------+-----------+---------+-------+-------+
+| BXD1  | BXD1  | 61.400002 | 8967043 |  2.38 | NULL  |
+| BXD11 | BXD11 | 53.099998 | 8967043 |   1.1 | NULL  |
+| BXD12 | BXD12 | 45.900002 | 8967043 |  1.09 | NULL  |
+| BXD13 | BXD13 | 48.400002 | 8967043 |  1.63 | NULL  |
+...
+```
+
+which actually blocks non-public sets and shows std err, as well as counts when available(?) It does not exclude the parents for mapping (btw). That probably happens on the mapping page itself.
+
+Probably the most elegant query is in GN3 API:
+
+```
+SELECT st.Name, ifnull(pd.value, 'x'), ifnull(ps.error, 'x'), ifnull(ns.count, 'x')
+    FROM PublishFreeze pf JOIN PublishXRef px ON px.InbredSetId = pf.InbredSetId
+        JOIN PublishData pd ON pd.Id = px.DataId JOIN Strain st ON pd.StrainId = st.Id
+        LEFT JOIN PublishSE ps ON ps.DataId = pd.Id AND ps.StrainId = pd.StrainId
+        LEFT JOIN NStrain ns ON ns.DataId = pd.Id AND ns.StrainId = pd.StrainId
+    WHERE px.PhenotypeId = 4 limit 5;
++------+-----------------------+-----------------------+-----------------------+
+| Name | ifnull(pd.value, 'x') | ifnull(ps.error, 'x') | ifnull(ns.count, 'x') |
++------+-----------------------+-----------------------+-----------------------+
+| BXD1 | 61.400002             | 2.38                  | x                     |
+| BXD2 | 49.000000             | 1.25                  | x                     |
+| BXD5 | 62.500000             | 2.32                  | x                     |
+| BXD6 | 53.099998             | 1.22                  | x                     |
+| BXD8 | 59.099998             | 2.07                  | x                     |
++------+-----------------------+-----------------------+-----------------------+
+```
+
+written by Zach and Bonface. See
+
+=> https://github.com/genenetwork/genenetwork3/blame/main/gn3/db/sample_data.py
+
+
+
+We can get a list of the 13689 BXD datasets we can use. Note that we start with public data because we'll feed it to AI and all privacy will be gone after that. We'll design a second API that makes use of Fred's authentication/authorization later.
+Let's start with the SQL statement listed on:
+
+
+We can run mysql through an ssh tunnel with
+
+```
+ssh -L 3306:127.0.0.1:3306 -f -N tux02.genenetwork.org
+mysql -A -h 127.0.0.1 -uwebqtlout -pwebqtlout db_webqtl
+```
+
+and test the query, i.e.
+
+```
+MariaDB [db_webqtl]> SELECT DISTINCT PublishFreeze.Name, PublishXRef.Id FROM PublishData
+    ->   INNER JOIN Strain ON PublishData.StrainId = Strain.Id
+    ->   INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId
+    ->   INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId
+    ->   LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId
+    ->   LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId
+    ->   WHERE PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1
+    ->   ORDER BY PublishFreeze.Id, PublishXRef.Id limit 5;
++------------+-------+
+| Name       | Id    |
++------------+-------+
+| BXDPublish | 10001 |
+| BXDPublish | 10002 |
+| BXDPublish | 10003 |
+| BXDPublish | 10004 |
+| BXDPublish | 10005 |
+```
+
+Let's take this apart a little. First of all PublishFreeze has only one record for BXDPublish where ID=1. PublishData may be used to check valid fields, but the real information is in PublishXRef. A simple
+
+```
+ select count(*) from PublishXRef WHERE InbredSetId=1;
++----------+
+| count(*) |
++----------+
+|    13711 |
++----------+
+```
+
+counts a few extra datasets (it was 13689). It may mean that PublishXRef contains some records that are still not public? Anyway,
+let's go for the full dataset for precompute right now. We'll add an API endpoint to gn-guile so it can be used later.
+
+Note GN2 on the menu search
+
+=> https://genenetwork.org/search?species=mouse&group=BXD&type=Phenotypes&dataset=BXDPublish&search_terms_or=*&search_terms_and=&accession_id=None&FormID=searchResult
+
+gives 13,729 entries, including the recent BXD_51094. That is because the production database is newer. If we look at our highest records:
+
+```
+select * from PublishXRef WHERE InbredSetId=1 ORDER BY ID DESC limit 3;
++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+
+| Id    | InbredSetId | PhenotypeId | PublicationId | DataId   | mean              | Locus          | LRS                | additive           | Sequence | comments |
++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+
+| 51060 |           1 |       45821 |         39794 | 41022015 |              NULL | rsm10000000968 | 13.263934206457122 | 2.1741201177177185 |        1 |          |
+| 51049 |           1 |       45810 |         39783 | 41022004 | 8.092333210508029 | rsm10000014174 |   16.8291804498215 | 18.143229769230775 |        1 |          |
+| 51048 |           1 |       45809 |         39782 | 41022003 | 6.082199917286634 | rsm10000009222 | 14.462661474938166 |  4.582111488461538 |        1 |          |
++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+
+```
+
+You can see they match that list (51060 got updated on production). The ID matches record BXD_51060 on the production search table.
+We can look at the DataId with
+
+```
+select Id,PhenotypeId,DataId from PublishXRef WHERE InbredSetId=1 ORDER BY ID DESC limit 3;
++-------+-------------+----------+
+| Id    | PhenotypeId | DataId   |
++-------+-------------+----------+
+| 51060 |       45821 | 41022015 |
+| 51049 |       45810 | 41022004 |
+| 51048 |       45809 | 41022003 |
++-------+-------------+----------+
+```
+
+And get the actual values with
+
+```
+select * from PublishData WHERE Id=41022003;
++----------+----------+-----------+
+| Id       | StrainId | value     |
++----------+----------+-----------+
+| 41022003 |        2 |  9.136000 |
+| 41022003 |        3 |  4.401000 |
+| 41022003 |        9 |  4.360000 |
+| 41022003 |       29 | 15.745000 |
+| 41022003 |       98 |  4.073000 |
+| 41022003 |       99 | -0.580000 |
+```
+
+which match the values on
+
+=> https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish
+
+The PhenotypeId is useful for some metadata:
+
+
+```
+select * from Phenotype WHERE ID=45809;
+| 45809 | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight]  | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight]  | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight]  | [ug/mg wet weight] | Fe300-120CTXMale             | Fe300-120CTXMale              | NULL     | acenteno  | Jones B | joneslab         |
+```
+
+Since I am going for the simpler query I'll add an API endpoint named
+datasets/bxd-publish/list (so others can use that too).  We'll return
+tuples for each entry so we can extend it later. First we need the
+DataID so we can point into PublishData. We expect the endpoint to
+return something like
+
+```
++-------+-------------+----------+
+| Id    | PhenotypeId | DataId   |
++-------+-------------+----------+
+| 51060 |       45821 | 41022015 |
+| 51049 |       45810 | 41022004 |
+| 51048 |       45809 | 41022003 |
+...
+```
+
+Alright, let's write some code. The following patch returns on the endpoint:
+
+```
+[
+  {
+    "Id": 10001,
+    "PhenotypeId": 4,
+    "DataId": 8967043
+  },
+  {
+    "Id": 10002,
+    "PhenotypeId": 10,
+    "DataId": 8967044
+  },
+  {
+    "Id": 10003,
+    "PhenotypeId": 15,
+    "DataId": 8967045
+  },
+...
+```
+
+in about 3 seconds. It will run a lot faster on a local network. But for our purpose it is fine. The code I wrote is here:
+
+=> https://git.genenetwork.org/gn-guile/commit/?id=1590be15f85e30d7db879c19d2d3b4bed201556a
+
+Note the simple SQL query (compared to the first one).
+Next step is to fetch the trait values we can feed to GEMMA. The full query using the PhenotypeId and DataId in GN is:
+
+```
+SELECT Strain.Name, Strain.Name2, PublishData.value, PublishData.Id, PublishSE.error, NStrain.count
+  FROM
+      (PublishData, Strain, PublishXRef, PublishFreeze)
+  LEFT JOIN PublishSE ON
+      (PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId)
+  LEFT JOIN NStrain ON
+      (NStrain.DataId = PublishData.Id AND
+      NStrain.StrainId = PublishData.StrainId)
+  WHERE
+      PublishXRef.InbredSetId = 1 AND
+      PublishXRef.PhenotypeId = 4 AND
+      PublishData.Id = PublishXRef.DataId AND
+      PublishData.StrainId = Strain.Id AND
+      PublishXRef.InbredSetId = PublishFreeze.InbredSetId AND
+      PublishFreeze.public > 0 AND
+      PublishFreeze.confidentiality < 1;
++-------+-------+-----------+---------+-------+-------+
+| Name  | Name2 | value     | Id      | error | count |
++-------+-------+-----------+---------+-------+-------+
+| BXD1  | BXD1  | 61.400002 | 8967043 |  2.38 | NULL  |
+| BXD2  | BXD2  | 49.000000 | 8967043 |  1.25 | NULL  |
+| BXD5  | BXD5  | 62.500000 | 8967043 |  2.32 | NULL  |
+| BXD6  | BXD6  | 53.099998 | 8967043 |  1.22 | NULL  |
+...
+```
+
+(result includes parents). We can simplify this for GEMMA because it only wants the name and (mean) value.
+
+The short version when you have the data ID is:
+
+```
+SELECT Strain.Name, PublishData.value FROM Strain, PublishData WHERE PublishData.Id=41022003 and Strain.Id=StrainID;
++----------+-----------+
+| Name     | value     |
++----------+-----------+
+| C57BL/6J |  9.136000 |
+| DBA/2J   |  4.401000 |
+| BXD9     |  4.360000 |
+| BXD32    | 15.745000 |
+| BXD43    |  4.073000 |
+| BXD44    | -0.580000 |
+| BXD48    | -1.810000 |
+| BXD51    |  4.294000 |
+| BXD60    | -0.208000 |
+| BXD62    | -0.013000 |
+| BXD63    |  3.221000 |
+| BXD66    |  2.472000 |
+| BXD69    | 12.886000 |
+| BXD70    | -1.973000 |
+| BXD78    | 19.511999 |
+| BXD79    |  7.845000 |
+| BXD73a   |  3.201000 |
+| BXD87    | -3.054000 |
+| BXD48a   | 11.585000 |
+| BXD100   |  7.088000 |
+| BXD102   |  8.485000 |
+| BXD124   | 13.442000 |
+| BXD170   | -1.274000 |
+| BXD172   | 18.587000 |
+| BXD186   | 10.634000 |
++----------+-----------+
+```
+
+which matches GN perfectly (some individuals were added). Alright, let's add an endpoint for this named
+'dataset/bxd-publish/values/dataid/41022003'. Note we only deal with public data (so far). Later we may come up with more generic
+end points and authorization. At this point the API is either on the local network (this one is) or public.
+
+The first version returns this data from the endpoint:
+
+```
+time curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003
+[{"Name":"C57BL/6J","value":9.136},{"Name":"DBA/2J","value":4.401},{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745},{"Name":"BXD43","value":4.073},{"Name":"BXD44","value":-0.58},{"Name":"BXD48","value":-1.81},{"Name":"BXD51","value":4.294},{"Name":"BXD60","value":-0.208},{"Name":"BXD62","value":-0.013},{"Name":"BXD63","value":3.221},{"Name":"BXD66","value":2.472},{"Name":"BXD69","value":12.886},{"Name":"BXD70","value":-1.973},{"Name":"BXD78","value":19.511999},{"Name":"BXD79","value":7.845},{"Name":"BXD73a","value":3.201},{"Name":"BXD87","value":-3.054},{"Name":"BXD48a","value":11.585},{"Name":"BXD100","value":7.088},{"Name":"BXD102","value":8.485},{"Name":"BXD124","value":13.442},{"Name":"BXD170","value":-1.274},{"Name":"BXD172","value":18.587},{"Name":"BXD186","value":10.634}]
+real    0m0.537s
+user    0m0.002s
+sys     0m0.005s
+```
+
+Note it includes the parents. We should drop them. In this case we can simply check for (string-contains name "BXD"). The database records allow for a filter, so we get
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/mapping/values/41022003
+[{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745},{"Name":"BXD43","value":4.073},{"Name":"BXD44","value":-0.58},{"Name":"BXD48","value":-1.81},{"Name":"BXD51","value":4.294},{"Name":"BXD60","value":-0.208},{"Name":"BXD62","value":-0.013},{"Name":"BXD63","value":3.221},{"Name":"BXD66","value":2.472},{"Name":"BXD69","value":12.886},{"Name":"BXD70","value":-1.973},{"Name":"BXD78","value":19.511999},{"Name":"BXD79","value":7.845},{"Name":"BXD73a","value":3.201},{"Name":"BXD87","value":-3.054},{"Name":"BXD48a","value":11.585},{"Name":"BXD100","value":7.088},{"Name":"BXD102","value":8.485},{"Name":"BXD124","value":13.442},{"Name":"BXD170","value":-1.274},{"Name":"BXD172","value":18.587},{"Name":"BXD186","value":10.634}]
+```
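+
+The parent filter can be sketched in Python as follows. This is only an illustration of the idea, assuming records shaped like the endpoint output above; the name `drop_parents` is made up here and the real filter lives in gn-guile. Note that a plain substring check also keeps F1 crosses such as BXD065xBXD077F1.
+
+```python
+def drop_parents(records, prefix="BXD"):
+    """Keep only records whose strain name contains the prefix (assumed rule)."""
+    return [r for r in records if prefix in r["Name"]]
+
+
+# A few records copied from the endpoint output above
+records = [
+    {"Name": "C57BL/6J", "value": 9.136},
+    {"Name": "DBA/2J", "value": 4.401},
+    {"Name": "BXD9", "value": 4.36},
+    {"Name": "BXD32", "value": 15.745},
+]
+
+mappable = drop_parents(records)
+print([r["Name"] for r in mappable])
+```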
+
+That code went in as
+
+=> https://git.genenetwork.org/gn-guile/commit/?id=9ad0793eb477611c700f4a5b02f60ac793bfae96
+
+It took a bit longer than I wanted because I made a mistake converting the results to a hash table. It broke the JSON conversion and the error was not so helpful.
+
+To write a CSV it turns out I have written
+
+=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm?id=9ad0793eb477611c700f4a5b02f60ac793bfae96#n18
+
+which takes the GN BXD.json file and our trait file. BXD.json captures the genotype information GN has:
+
+```
+{
+        "mat": "C57BL/6J",
+        "pat": "DBA/2J",
+        "f1s": ["B6D2F1", "D2B6F1"],
+        "genofile" : [{
+                "title" : "WGS-based (Mar2022)",
+                "location" : "BXD.8.geno",
+                "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44", "BXD45", "BXD48", "BXD48a", "BXD49", "BXD50", "BXD51", "BXD52", "BXD53", "BXD54", "BXD55", "BXD56", "BXD59", "BXD60", "BXD61",
+(...)
+"BXD065xBXD077F1", "BXD069xBXD090F1", "BXD071xBXD061F1", "BXD073bxBXD065F1", "BXD073bxBXD077F1", "BXD073xBXD034F1", "BXD073xBXD065F1", "BXD073xBXD077F1", "BXD074xBXD055F1", "BXD077xBXD062F1", "BXD083xBXD045F1", "BXD087xBXD100F1", "BXD065bxBXD055F1", "BXD102xBXD077F1", "BXD102xBXD73bF1", "BXD170xBXD172F1", "BXD172xBXD197F1", "BXD197xBXD009F1", "BXD197xBXD170F1"]
+```
+
+The code maps the trait values I generated against these columns to see which individuals overlap, which corrects for unmappable individuals (anyway).
+
+The function 'write-pheno-file', listed above, does not work however because of the format of the endpoint. Remember it generates
+
+```
+[{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745}...]
+```
+
+While this function expects the shorter
+
+```
+{"BXD9":4.36,"BXD32":15.745...}
+```
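+
+Converting between the two layouts is a one-liner. A minimal Python sketch, assuming the "Name" and "value" keys shown above; `to_short_form` is a hypothetical helper, not part of gn-guile or gemma-wrapper:
+
+```python
+def to_short_form(records):
+    """Turn [{"Name": n, "value": v}, ...] into {n: v, ...}."""
+    return {r["Name"]: r["value"] for r in records}
+
+
+long_form = [{"Name": "BXD9", "value": 4.36}, {"Name": "BXD32", "value": 15.745}]
+short_form = to_short_form(long_form)
+print(short_form)
+```
+
+If a name occurs twice, the last value wins, so duplicates should be checked before converting.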
+
+Now, for endpoints there is no real standard. We have written ideas up here:
+
+=> https://git.genenetwork.org/gn-docs/tree/api
+
+and, most recently
+
+=> https://git.genenetwork.org/gn-docs/tree/api/GN-REST-API-v2.md
+
+Where I make a case for having the metadata as a separate endpoint that can be reasoned on by people and machines (and AI).
+That means I should default to the short version of the data and describe that layout using metadata. This we can do later.
+
+I modified the endpoint to return the shorter hash:
+
+```
+time curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003
+{"BXD9":4.36,"BXD32":15.745...}
+```
+
+Next, to align with
+
+=> https://github.com/genenetwork/gn-docs/blob/master/api/GN-REST-API-v2.md
+
+I gave the API the json extension, so we have http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.json
+
+This allows writing a special handler for GEMMA output (.gemma extension) downloading the pheno file with
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.gemma
+NA
+NA
+NA
+NA
+NA
+4.36NA
+NA
+NA
+NA
+(...)
+```
+
+which GEMMA can use directly: it matches the order of the individuals in the BXD.8.geno file, and the founders/parents are not included. Note that all of this now only works for the BXD (on PublishData) and I am using BXD.json as described in
+
+=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-mapping-input-data
+
+I.e., it is Zach's listed stopgap solution. Code is here:
+
+=> https://git.genenetwork.org/gn-guile/log/
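+
+The .gemma handler boils down to walking the sample list in genotype order and printing a value or NA. A minimal Python sketch of that idea follows; the real handler is written in Guile, and the function name and the shortened sample list are made up for illustration.
+
+```python
+def gemma_pheno_lines(values, samples):
+    """One line per sample, in genotype order; NA where no value exists."""
+    return ["NA" if values.get(s) is None else str(values[s]) for s in samples]
+
+
+# First six entries of the sample_list in BXD.json
+samples = ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9"]
+# Parents are simply ignored because they are not in the sample list
+values = {"C57BL/6J": 9.136, "DBA/2J": 4.401, "BXD9": 4.36}
+
+lines = gemma_pheno_lines(values, samples)
+print("\n".join(lines))
+```
+
+Because GEMMA never looks at names, the order and length of the sample list are the only things that tie the phenotype file to the genotype file.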
+
+Next step is to run gemma, as we are on par with my earlier work on ProbeSetData. I wrote a gemma runner for that too at
+
+=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm#n79
+
+Now here I use guile to essentially script running GEMMA. There is no real advantage for that, so I will simply tell gemma-wrapper to use the output of above .gemma endpoint to fetch the trait values. Basically gemma-wrapper can specify the standard gemma -p switch, or pass in --phenotypes, that are used for permutations.
+
+Now the new method we want to introduce is that the trait values are read from a REST API, instead of a file. The dirty way is to provide that functionality directly to gemma-wrapper, but we plan to get rid of that code (useful as it is -- it duplicates what Arun's ravanan does and ravanan has the advantage that it can be run on a cluster).
+
+So we simply download the data and write it to a file with a small script. To run:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.gemma > 41022003-pheno.txt
+```
+
+Next we create a container for gemma-wrapper (which includes the gemma that GN uses):
+
+```
+. .guix-deploy
+env TMPDIR=tmp ruby ./bin/gemma-wrapper --force --json \
+        --loco -- \
+        -g BXD.8_geno.txt.gz \
+        -p 41022003-pheno.txt \
+        -a BXD.8_snps.txt \
+        -gk > K.json
+```
+
+this bailed out with
+
+Executing: parallel --results /tmp/test --joblog /tmp/test/5f3849a9e61b70e3d562b20c5eade5a699923c68-parallel.log < /tmp/test/parallel-commands.txt
+Command exited with non-zero status 20
+
+When running an individual chromosome (from the parallel log) we get two warnings and an error:
+
+```
+**** WARNING: The maximum genotype value is not 2.0 - this is not the BIMBAM standard and will skew l_lme and effect sizes
+**** WARNING: Columns in geno file do not match # individuals in phenotypes
+ERROR: Enforce failed for not enough genotype fields for marker in src/gemma_io.cpp at line 1470 in BimbamKin
+```
+
+Looks familiar!
+The first warning we'll ignore for now, as we just want the hits initially. The second warning relates to the error that there is a mismatch in number of inds.
+
+This topic I have covered in the past, particularly trying to debug Dave's conflicting results:
+
+=> https://issues.genenetwork.org/topics/lmms/gemma/permutations
+
+It makes for somewhat depressing reading, though we have a solution.
+
+Note that we only have to do the correct conversion once (basically the code I wrote earlier
+to fetch BXD traits needs to work with the latest BXD genotypes).
+The real problem is that gemma itself does not compare individual names (at all), so any corrections need to be done beforehand. In this case our pheno file contains 212 inds from the earlier BXD.json file.
+
+```
+wc -l 41022003-pheno.txt
+212 41022003-pheno.txt
+```
+
+And that is off. Let's try the tool I wrote during that exercise. It can create a different json file after parsing BXD.geno
+that has in the header:
+
+> # Date Modified: April 23, 2024 by Arthur Centeno, Suheeta Roy. March 22, 2022 by Rob Williams, David Ashbrook, and Danny Arends to remove excessive cross-over events in strains BXD42 (Chr9), BXD81 (Chrs1, 5, 10), BXD99 (Chr1), and BXD100 (Chrs2 and 6); and to add Taar1 maker on Chr 10 for T. Phillips-Richards.   Jan 19, 2017: Danny Arends computed BXD cM values and recombinations between markers. Rob W. Williams fixed errors on most chromosomes and added Affy eQTL markers. BXD223 now has been added based on David Ashbrook's spreadsheet genotype information.
+
+```
+md5sum BXD.geno:
+  a78aa312b51ac15dd8ece911409c5b98  BXD.geno
+gemma-wrapper$ ./bin/gn-geno-to-gemma.py BXD.geno > BXD.geno.txt
+```
+
+creates a .json file (that is different from Zach/GN's) and a bimbam file GEMMA can use. Now in the next step I need to adapt above code to use this format. What I *should* have done, instead of writing gemma phenotypes directly, is write the R/qtl2 format that includes the ind names (so we can compare and validate those) and *then* parse that data against our new JSON file created by gn-geno-to-gemma.py using the rqtl2-pheno-to-gemma.py script. Both Python scripts are already part of gemma-wrapper:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rqtl2-pheno-to-gemma.py
+
+The idea was to create the rqtl2 API endpoint, or I'll adapt the 2nd script to take the endpoint as input and then correct for GEMMA's requirements.
+
+OK, updated the endpoints and the code for rqtl2-pheno-to-gemma.py so it accepts a URL instead of a file. So the idea is
+to run
+
+```
+./bin/rqtl2-pheno-to-gemma.py BXD_pheno_Dave.csv --json BXD.geno.json > BXD_pheno_matched.txt
+```
+
+A line in BXD_pheno_Dave.csv is:
+
+```
+BXD113,24.52,205.429001,3.643,2203.312012,3685.907959,1.199,2.019,29.347143,0.642857,205.428574,24.520409,3.642857,2203
+.312012,3685.908203,1.198643,2.018643,0.642857,33.785709,1.625,2,1.625,1,22.75
+```
+
+Now if I read the Rqtl2 docs it says:
+
+> We split the numeric phenotypes from the mixed-mode covariates, as two separate CSV files. Each file forms a matrix of individuals × phenotypes (or covariates), with the first column being individual IDs and the first row being phenotype or covariate names. Sex and line IDs (if needed) can be columns in the covariate data.
+
+This differs from the BXD Dave layout (it is transposed). Karl added in the docs:
+
+> All of these CSV files may be transposed relative to the form described below. You just need to include, in the control file, a line like: "geno_transposed: true".
+
+So, OK, we can use the transposed form. First we make it possible to parse json:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.json > 41022003-pheno.json
+jq < 41022003-pheno.json
+{
+  "C57BL/6J": 9.136,
+  "DBA/2J": 4.401,
+  "BXD9": 4.36,
+  "BXD32": 15.745,
+(...)
+```
+
+note it includes the parents. Feed it to
+
+```
+./bin/rqtl2-pheno-to-gemma.py 41022003-pheno.json --json BXD.geno.json
+```
+
+where BXD.geno.json is not the Zach/GN json file, but the actual BXDs in GEMMA's bimbam file.
+
+One question is why Zach's JSON file gives a different number of mappable BXDs. I made a note of that to check.
+
+I wrote a new script and we had our first GEMMA run with lmdb output:
+
+```
+wrk@napoli /export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper [env]$ tar tvf /tmp/3fddda2374509c7b346>
+-rw-r--r-- wrk/users    294912 2025-08-06 05:49 3fddda2374509c7b346b7819ae358ed23be9cb46-gemma-GWA.mdb
+```
+
+The script is just 10 lines of code (after the command line handler)
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-pheno-to-gemma.rb
+
+Excellent, now we can run gemma and the next step is to look at the largest hit.
+
+So the trait we try to run is 41022003 = https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish. The inputs match up. When we run GEMMA in GN it has a 4.0 score on chr 12 and 3.9 on chr 19.
+
+Running gemma-wrapper we get
+
+```
+LOCO K computation with caching and JSON output
+
+gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json
+
+LMM's using the K's captured in K.json using the --input switch
+
+gemma-wrapper --json --force --lmdb --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+```
+
+We can view the lmdb file with something like:
+
+```
+./bin/view-gemma-mdb --sort /tmp/66b8c19be87e9566358ce904682a56250eb05748-gemma-GWA.tar.xz --anno BXD.8_snps.txt > test.out
+/tmp/3fddda2374509c7b346b7819ae358ed23be9cb46-gemma-GWA.tar.xz
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+7,67950073,rsm10000004928,0.543,1.5226,1.3331,100000.0,0.0002,3.79
+7,68061665,rs32453663,0.543,1.5226,1.3331,100000.0,0.0002,3.79
+7,68111284,rs32227186,0.543,1.5226,1.3331,100000.0,0.0002,3.79
+19,30665443,rsm10000014129,0.522,2.2128,1.0486,100000.0,0.0002,3.77
+19,30671753,rs31207057,0.522,2.2128,1.0486,100000.0,0.0002,3.77
+12,40785621,rsm10000009222,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40786657,rs29124638,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40842857,rs13481410,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40887762,rsm10000009223,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40887894,rsm10000009224,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,40900825,rs50979658,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+12,41054766,rs46705481,0.565,2.8541,1.3576,100000.0,0.0002,3.75
+```
+
+Interestingly the hits are very similar to what is on production now, though not the same! That points out that I am not using the production database on this recent dataset. Let's try an older one. BXD_10002 has data id 8967044
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/8967044.json > 10002-pheno.json
+./bin/gn-pheno-to-gemma.rb -p 10002-pheno.json --geno-json BXD.geno.json > 10002-pheno.txt
+gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p 10002-pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json
+gemma-wrapper --json --force --lmdb --loco --input K.json -- -g BXD.geno.txt -p 10002-pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+./bin/view-gemma-mdb --sort /tmp/c4ffedf358698814c6e29a54a2a51cb6c66328d0-gemma-GWA.tar.xz --anno BXD.8_snps.txt > test.out
+```
+
+Luckily this is a perfect match:
+
+```
+1,179861787,rsm10000000444,0.559,0.8837,0.3555,100000.0,0.0,4.99
+1,179862838,rs30712622,0.559,0.8837,0.3555,100000.0,0.0,4.99
+1,179915631,rsm10000000787,0.559,0.8837,0.3555,100000.0,0.0,4.99
+1,179919811,rsm10000000788,0.559,0.8837,0.3555,100000.0,0.0,4.99
+(...)
+8,94479237,rs32095272,0.441,1.0456,0.4362,100000.0,0.0,4.75
+8,94765445,rsm10000005684,0.441,1.0456,0.4362,100000.0,0.0,4.75
+8,94785223,rsm10000005685,0.441,1.0456,0.4362,100000.0,0.0,4.75
+8,94840921,rsm10000005686,0.441,1.0456,0.4362,100000.0,0.0,4.75
+```
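+
+To compare against GN it helps to reduce such output to the best hit per chromosome. A rough Python sketch, assuming the CSV layout printed by view-gemma-mdb above (the header line is copied from the earlier run; `best_hit_per_chr` is a made-up helper):
+
+```python
+import csv
+import io
+
+# Header from the first run, rows from the BXD_10002 run above
+SAMPLE = """chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+1,179861787,rsm10000000444,0.559,0.8837,0.3555,100000.0,0.0,4.99
+1,179862838,rs30712622,0.559,0.8837,0.3555,100000.0,0.0,4.99
+8,94479237,rs32095272,0.441,1.0456,0.4362,100000.0,0.0,4.75
+8,94765445,rsm10000005684,0.441,1.0456,0.4362,100000.0,0.0,4.75
+"""
+
+
+def best_hit_per_chr(text):
+    """Return {chr: row} keeping the first row with the highest -logP."""
+    best = {}
+    for row in csv.DictReader(io.StringIO(text)):
+        score = float(row["-logP"])
+        current = best.get(row["chr"])
+        if current is None or score > float(current["-logP"]):
+            best[row["chr"]] = row
+    return best
+
+
+hits = best_hit_per_chr(SAMPLE)
+for chrom, row in hits.items():
+    print(chrom, row["pos"], row["marker"], row["-logP"])
+```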
+
+The lmdb file contains the full vector and compresses to 100K. For 13K traits that equals about 1Gb.
+
+First I wanted to check how Zach's list of mappable inds compares to mine. A simple REPL exercise shows:
+
+```
+zach = JSON.parse(File.read('BXD.json'))
+pj = JSON.parse(File.read('BXD.geno.json'))
+s1 = zach["genofile"][0]["sample_list"]
+=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",...
+s2 = pj["samples"]
+=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",...
+s1.size()
+=> 235
+s2.size()
+=> 237
+ s2-s1
+=> ["BXD077xBXD065F1", "BXD065xBXD102F1"]
+```
+
+So it turns out the newer geno file contains these two new inds that are *also* in the .geno file and confuse the hell out of my scripts ;). The GN2 webserver probably uses the header of the geno file to fetch the correct number. The trait page also lists these inds, so (I guess) the BXD.json file ought to be updated.
+
+Now that is explained and we are good.
+
+## Running at scale
+
+In the next step we need to batch run GEMMA. Initially we'll run on one server. gemma-wrapper takes care of running only once, so we can restart the pipeline at any point (we'll move to ravanan after to run on the cluster). At this point the API uses the dataid to return the trait values. I think that is not so intuitive, so I modified the endpoint to give the same results for:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json > 10002-pheno.json
+curl http://127.0.0.1:8091/dataset/bxd-publish/dataid/values/8967044.json > 10002-pheno.json
+```
+
+Now that works we can get a list of all BXDPublish datasets that I wrote earlier:
+
+```
+curl http://127.0.0.1:8091/dataset/bxd-publish/list > bxd-publish.json
+[
+  {
+    "Id": 10001,
+    "PhenotypeId": 4,
+    "DataId": 8967043
+  },
+  {
+    "Id": 10002,
+    "PhenotypeId": 10,
+    "DataId": 8967044
+  },
+  {
+    "Id": 10003,
+    "PhenotypeId": 15,
+    "DataId": 8967045
+  },
+```
+
+so we can use this to create our batch list. There are 13711 datasets listed on this DB. We can use jq to extract all Ids
+
+```
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+```
+
+All set to run our first batch! Now we replicate our guix-wrapper environment, start the gn-guile server and fire up a batch script that pulls the data from the database and runs gemma for every step.
+
+
+To get precompute going we need a server set up with a recent database. I don't want to use the production server. The fastest other server we have is balg01, and it is not busy right now, so let's use that. First we recover a DB from our backup, as described in
+
+=> topics/systems/mariadb/precompute-mapping-input-data
+
+(btw that example shows we started on precompute in November 2023 - 1.5 years ago). On that server mariadb is running as
+/usr/local/guix-profiles/gn-latest/bin/mariadbd --datadir=/export/mariadb/tux01. We can simply overwrite that database as it
+is an installation of Feb 18 2024. We extract:
+
+```
+borg extract --progress /export/backup/bacchus/drop/tux04/genenetwork::borg-tux04-sql-20250807-04:16-Thu
+```
+
+After extracting the backup we need to update permissions and point mariadb to the new dir: balg01:/export/mariadb/tux04/latest/.
+After restarting the DB it all appears to work.
+
+Before I move the code across we need to make sure metadata on the traits get added to the lmdb mapping data. I actually wrote the code for that here. This adds the metadata to lmdb:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/a0eb8ed829072cb539b32affe135a7930989ca30/bin/gemma2lmdb.py#L99
+
+gemma-wrapper writes data like this:
+
+```
+  "meta": {
+    "type": "gemma-wrapper",
+    "version": "0.99.7-pre1",
+    "population": "BXD",
+    "name": "HC_U_0304_R",
+    "trait": "101500_at",
+    "url": "https://genenetwork.org/show_trait?trait_id=101500_at&dataset=HC_U_0304_R",
+    "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
+    "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-gemma-GWA.tar.xz",
+    "dataid": 75629,
+    "probesetid": 1097,
+    "probesetfreezeid": 7
+    }
+```
+
+This was done for probesetdata and needs to be adapted for our BXD PublishData exercise. Also I want the archive_GWA file name to include the trait name/ID so we can find it quickly on the storage (without having to parse/query all lmdb files).
+
+From the gemma-wrapper invocation you can see I added a few switches to pass in this information:
+
+=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm#n97
+
+```
+        --meta NAME                  Pass in metadata as JSON file
+        --population NAME            Add population identifier to metadata
+        --name NAME                  Add dataset identifier to metadata
+        --id ID                      Add identifier to metadata
+        --trait TRAIT                Add trait identifier to metadata
+```
+
+We can add BXD as population and BXDPublish as a dataset identifier. Set id with dataid, and trait id with PublishXRefID and point it back to GN, so we can click
+
+=>  https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish
+
+Another thing I want to add are the existing qtlreaper hit values.  That way we can assess where the biggest impact was of using gemma over qtlreaper. To achieve this we will create a new API endpoint that can serve that data. Remember we get the trait values with:
+
+=> http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json
+
+so we can add an endpoint that lists the mapping results
+
+=> http://127.0.0.1:8091/dataset/bxd-publish/trait-hits/10002.json
+
+we also will have
+
+=> http://127.0.0.1:8091/dataset/bxd-publish/trait-info/10002.json
+
+That will return more metadata and point into our RDF store. Note that this is now all very specific to bxd-publish. Later we'll have to think how to generalise these endpoints. We are just moving forward to do the BXD precompute run.
+
+Interestingly GN2 shows this information (well, only the highest hit) on the search page, but not on the trait page. As we can get hits from multiple sources we should (eventually) account for that with something like:
+
+```
+=> http://127.0.0.1:8091/dataset/bxd-publish/trait-hits/10002.json
+{ "qtlreaper-hk":
+    [
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":... }
+    ],
+  "gemma-loco":
+    [
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":... },
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":... },
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":... }
+    ]
+}
+```
+
+Eventually we may list gemma, Rqtl2 hits with and without LOCO and with and without covariates. Once we build this support we can adapt our search tools.
+
+Obviously this won't fit the current PublishXRef format, so -- for now -- we will just mirror its contents:
+
+```
+{ "qtlreaper-hk":
+    [
+      { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":... }
+    ]
+}
+```
+
+To get compute going I am going to skip above because we can update the lmdb files later.
+The first fix is to add the trait name to the file names and the following record to lmdb:
+
+  "meta": {
+    "type": "gemma-wrapper",
+    "version": "0.99.7-pre1",
+    "population": "BXD",
+    "name": "BXDPublish",
+    "table": "PublishData",
+    "traitid": 10002, // aka PublishXrefId
+    "url": "https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish",
+    "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
+    "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-BXDPublish-10002-gemma-GWA.tar.xz",
+    "dataid": 8967044,
+    }
+
+This required modifications to gemma-wrapper.
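+
+With the dataset and trait in the file name we can find a result on storage without opening any lmdb file. A small Python sketch of building and parsing such names, assuming the `<hash>-<name>-<trait>-gemma-GWA.tar.xz` layout shown above and trait identifiers without a dash; both helpers are hypothetical and not part of gemma-wrapper:
+
+```python
+import re
+
+# 40 hex digits, dataset name, trait id, fixed suffix (assumed layout)
+ARCHIVE_RE = re.compile(
+    r"^(?P<hash>[0-9a-f]{40})-(?P<name>.+)-(?P<trait>[^-]+)-gemma-GWA\.tar\.xz$"
+)
+
+
+def archive_name(digest, name, trait):
+    """Build the archive file name from its three parts."""
+    return f"{digest}-{name}-{trait}-gemma-GWA.tar.xz"
+
+
+def parse_archive_name(filename):
+    """Return hash, name and trait as a dict, or None for other file names."""
+    m = ARCHIVE_RE.match(filename)
+    return m.groupdict() if m else None
+
+
+fn = archive_name("779a54a59e4cd03608178db4068791db4ca44ab3", "BXDPublish", 10002)
+print(fn)
+print(parse_archive_name(fn))
+```
+
+Old-style names without dataset and trait do not match and return None, so both kinds of file can live in the same directory.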
+
+Running:
+
+```
+gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json
+gemma-wrapper --json --force --lmdb --population BXD --name BXDPublish --trait 10002 --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+```
+
+begets '66b8c19be87e9566358ce904682a56250eb05748-BXDPublish-10002-gemma-GWA.tar.xz'. When I check the meta data in the lmdb file it is set to
+
+```
+"meta": {"type": "gemma-wrapper", "version": "1.00-pre1", "population": "BXD", "name": "BXDPublish", "trait": "10002", "geno_filename": "BXD.geno.txt", "geno_hash": "3b65ed252fa47270a3ea867409b0bdc5700ad6f6", "loco": true, "url": "https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish", "archive_GRM": "185eb08dc3897c7db5d7ea987170898035768f93-gemma-cXX.tar.xz", "archive_GWA":"66b8c19be87e9566358ce904682a56250eb05748-BXDPublish-10002-gemma-GWA.tar.xz", "table": "PublishData", "traitid": 10002, "dataid": 0}
+```
+
+which is good enough (for now). I may still add the dataid, but it requires a SQL call. Code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/49587523fc93bdcf0265da9da97f8d6d2a9e1008
+
+I should note that up to this point I would have had no advantage from AI programming. I know there are topics that I'll work on where I may benefit, but AI does not really help with this type of architecting, which involves very little code writing. I certainly have the intention of using AI! For the next steps, unfortunately, there is still little to be gained. Where we'll probably gain is:
+
+- Using the RDF data store and documenting the endpoint(s)
+- Refactoring some of GN2's code to introduce lmdb
+- Deduplicating GN2/GN3 SQL code
+- Improving the REST API and writing documentation and tests
+- Analysing existing code bases, such as GEMMA itself
+
+Next step is getting the data churn going! After that we'll list all the hits which requires processing the lmdb output.
+
+Precompute of 13K traits has its first test run on balg01.
+
+It is going at 30 gemma runs per minute, so perhaps 8 hours for the full run if it keeps going. But I am hitting errors.
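+
+A quick back-of-the-envelope check of that estimate in Python, using the 13711 datasets counted earlier and assuming the rate stays constant:
+
+```python
+traits = 13711        # datasets listed by the bxd-publish endpoint
+runs_per_minute = 30  # rate observed on balg01
+
+hours = traits / runs_per_minute / 60
+print(f"{hours:.1f} hours")
+```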
+
+After that the next step will be to digest hits from the precomputed vectors in lmdb.
+
+## Yesterday's tux02 crash
+
+All servers work on tux02 except for BNW.
+
+I tried to restart BNW, but it is giving an error, including the mystifying shepherd error (that I have as a sticker on my laptop):
+
+> 2025-08-11 01:13:41 error in finalization thread: Success
+
+It is on our end, so no need to ping Yan. I'll fix it when I have time (I did below).
+
+## Precompute
+
+To get precompute up and running I need to create the environment on balg01. The DB I updated a few days ago, so that should be fine.
+
+First we check out the guile webserver:
+
+```
+git clone tux02.genenetwork.org:/home/git/public/gn-guile gn-guile-8092
+```
+
+Now gn-guile is already running serving aliases, so we want to run this as an internal endpoint right now with something like
+
+```
+unset GUIX_PROFILE
+. /usr/local/guix-profiles/guix-pull/etc/profile
+guix shell -L ~/guix-bioinformatics --container --network --file=guix.scm -- guile -L . --fresh-auto-compile -e main web/webserver.scm 8092
+```
+
+so, this renders
+
+```
+curl http://127.0.0.1:8092/dataset/bxd-publish/values/10002.json
+{"BXD1":54.099998,"BXD2":50.099998,"BXD5":53.299999,"BXD6":55.099998
+```
+
+Next step is to set up gemma-wrapper. Now this failed because guix was not happy. We have been updating things these last weeks. Rather than trying to align with recent changes I could have rolled back to the version I am using on my desktop. But I decided not to let those bits rot and updated guix from
+
+guix describe Thu Mar 14 21:33:55 2024
+
+to
+
+guix describe Sun Aug 10 18:18:20 2025
+
+Should use a newer version first! Let's try
+
+```
+guix pull --url=https://codeberg.org/guix/guix  -p ~/opt/guix-pull
+```
+
+(that took a while, so I took the opportunity to fix BNW -- turns out someone disabled BNW in shepherd by creating a systemd version that did not start properly).
+
+After the pull there were quite a few problems with gemma dependencies that needed fixing. First problem
+
+```
+guix package: warning: failed to load '(gn packages gemma)':
+In procedure abi-check: #<record-type <git-reference>>: record ABI mismatch; recompilation needed
+```
+
+required
+
+```
+find ~/.cache/guile -name "*.go" -delete
+```
+
+I also had to point guix-past to the new codeberg record! And now, magically, things started working.
+
+So, now I have an identical setup on my desktop and on the balg server. Next is to write a script that will batch run gemma-wrapper for every BXD PublishData ID. We created that list with jq earlier.
+
+```
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+```
+
+For every ID in that list we are going to fetch the trait values with
+
+```
+#! /bin/env sh
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+  ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
+  ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+done
+```
+
+I hard copied the following files
+
+```
+BXD.geno.json
+BXD.geno.txt
+BXD.8_snps.txt
+```
+
+One thing I need to check is that the GRM is actually a constant. I forgot what GEMMA does.
+
+We hit an error
+
+```
+/gnu/store/vvl1g1l0j19w39kry2xcsawvlhbyb87j-ruby-3.4.4/lib/ruby/3.4.0/json/common.rb:221:in 'JSON::Ext::Parser.parse':
+unexpected token at '' (JSON::ParserError)
+FATAL ERROR: gemma-wrapper bailed out with pid 340588 exit 20
+./bin/gemma-wrapper:494:in 'block (2 levels) in <main>'
+./bin/gemma-wrapper:479:in 'IO.open'
+./bin/gemma-wrapper:479:in 'block in <main>'
+./bin/gemma-wrapper:832:in '<main>'Precomputing 10137
+```
+
+The JSON file for 10136 is empty. Hmmm.
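+
+The batch loop could guard against this by checking the response before handing it to gemma-wrapper. A minimal Python sketch of such a check, assuming the short JSON layout returned by the values endpoint; `parse_trait_values` is a made-up name and this is not what the shell script above does:
+
+```python
+import json
+
+
+def parse_trait_values(body):
+    """Return the trait dict, or None when the body is empty or unusable."""
+    if not body.strip():
+        return None
+    try:
+        data = json.loads(body)
+    except json.JSONDecodeError:
+        return None
+    # An empty object or a non-object carries no trait values
+    return data if isinstance(data, dict) and data else None
+
+
+print(parse_trait_values(""))
+print(parse_trait_values('{"BXD1": 54.099998, "BXD2": 50.099998}'))
+```
+
+A None result would then mean: log the ID, skip it, and retry later instead of letting gemma-wrapper bail out.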
+
+I also see
+
+```
+WARNING: failed to update lmdb record with key b'\r\x02n\x7f\x10' -- probably a duplicate 13:40795920 (b'\r':40795920)
+```
+
+For the first error, the webserver actually stopped on `In procedure accept: Too many open files`. The problem looks similar to
+
+=> https://issues.guix.gnu.org/60226
+
+and Arun's patch
+
+=> https://cgit.git.savannah.gnu.org/cgit/guix/mumi.git/commit/?id=897967a84d3f51da2b1cc8c3ee942fd14f4c669b
+
+I raised ulimit, but may need to restart the webserver several times. We are computing though:
+
+```
+-rw-r--r-- 1 wrk wrk   82968 Aug 11 05:16 ab51d69f79601cfa7399feebca619ea1a71c1270-BXDPublish-10146-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   82772 Aug 11 05:16 e6739ace8ca4931fc51baa1844b3b5ceac592104-BXDPublish-10147-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   81848 Aug 11 05:16 60880fc7e8c86dffb17f28664e478204ea26f827-BXDPublish-10148-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   79336 Aug 11 05:16 c914d6221b004dec98d60e08c0fdf8791c09cb41-BXDPublish-10149-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   83536 Aug 11 05:16 3d72b19730edab29bdc593cb6a1a86dd789d351f-BXDPublish-10150-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   69060 Aug 11 05:16 0e965f1778425071a5497d0fe69f2dc2e534ef60-BXDPublish-10151-gemma-GWA.tar.xz
+-rw-r--r-- 1 wrk wrk   69072 Aug 11 05:16 4de26e62a75727bc7edd6b266dfcd7753d185f1a-BXDPublish-10152-gemma-GWA.tar.xz
+(...)
+```
+
+There are some scarily small datasets:
+
+```
+GET /dataset/bxd-publish/values/10198.json
+;;; ("8967240")
+
+;;; ((("C57BL/6J" . 1.62) ("BXD1" . 2.37) ("BXD5" . 2.73) ("BXD9" . 3.52) ("BXD11" . 0.18) ("BXD12" . 3.69) ("BXD16" . 0.29) ("BXD21" . 2.34) ("BXD27" . 3.38) ("BXD32" . 0.24)))
+```
+
+i.e. https://genenetwork.org/show_trait?trait_id=10198&dataset=BXDPublish
+
+Not sure we should be running GEMMA on those!
+
+
+The computation initially stopped at 70% (we are now at 98%).
+
+To get past 70% I ran the webserver without fibers, as suggested by Arun's patch:
+
+=> https://cgit.git.savannah.gnu.org/cgit/guix/mumi.git/commit/?id=897967a84d3f51da2b1cc8c3ee942fd14f4c669b
+
+Because we were getting errors like: In procedure accept: Too many open files with GET /dataset/bxd-publish/values/23486.json
+
+After removing fibers, precompute just continued where it left off, as it should. The fix is:
+
+=> https://git.genenetwork.org/gn-guile/commit/?id=289da2e13e07928cdb8a1d165483a3a3cd9ae1c6
+
+Now that is running I want to make sure I can point back to metadata and perhaps fetch some information to enrich our lmdb files for further processing. Earlier we captured some metadata with:
+
+```
+MariaDB [db_webqtl]> select PhenotypeId, Locus, DataId, Phenotype.Post_publication_description from PublishXRef, Phenotype where PublishXRef.PhenotypeId = Phenotype.Id and InbredSetId=1 limit 5;
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+| PhenotypeId | Locus          | DataId  | Post_publication_description                                                                                               |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+|           4 | rs48756159     | 8967043 | Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]                       |
+|          10 | rsm10000005699 | 8967044 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]                 |
+|          15 | rsm10000013713 | 8967045 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] |
+|          20 | rs48756159     | 8967046 | Central nervous system, morphology: Cerebellum volume [mm3]                                                                |
+|          25 | rsm10000005699 | 8967047 | Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3]                       |
++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+
+```
+
+The qtlreaper hits are also of interest. Note Bonz has brilliantly captured this in RDF, see
+
+=> https://github.com/genenetwork/gn-docs/blob/master/rdf-documentation/phenotype-metadata.md
+
+which is parseable by machines(!). Let's try to use RDF first. The query:
+
+```
+SELECT * WHERE {
+    <http://genenetwork.org/id/traitBxd_10002> ?p ?o .
+}
+```
+
+renders
+
+```
+"http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://genenetwork.org/category/Phenotype"
+"http://genenetwork.org/term/belongsToGroup","http://genenetwork.org/id/setBxd"
+"http://www.w3.org/2004/02/skos/core#altLabel","BXD_10002"
+"http://purl.org/dc/terms/description","Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]"
+"http://genenetwork.org/term/abbreviation","ADJCBLWT"
+"http://genenetwork.org/term/additive",2.08179
+"http://genenetwork.org/term/locus","http://genenetwork.org/id/Rsm10000005699"
+"http://genenetwork.org/term/lodScore",4.77938
+"http://genenetwork.org/term/mean",52.2206
+"http://genenetwork.org/term/sequence",1
+"http://genenetwork.org/term/submitter","robwilliams"
+"http://genenetwork.org/term/traitId","10002"
+"http://purl.org/dc/terms/isReferencedBy","http://rdf.ncbi.nlm.nih.gov/pubmed/11438585"
+```
+
+which covers pretty much what we need. Note that this is coming from our public endpoint and can be used to instruct AI agents(!)
+
+Now we want to fetch these values for all these traitBxd (yes, we need to fix some naming) with a single query:
+
+```
+SELECT count(*) WHERE {
+    ?s gnt:belongsToGroup gn:setBxd.
+} limit 5
+```
+
+returns 14039 traits. Good! Let's get all properties
+
+```
+
+SELECT * WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs;
+         dct:description ?descr.
+} limit 50
+```
+
+=> https://sparql.genenetwork.org/sparql?default-graph-uri=&query=%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E+%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E+%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E+%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E+%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E+%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E+%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E+%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E+%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E+%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E+%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E+%0D%0A%0D%0A%0D%0A%0D%0ASELECT+*+WHERE+%7B%0D%0A++++%3Fs+gnt%3AbelongsToGroup+gn%3AsetBxd%3B%0D%0A+++++++++gnt%3AtraitId+%3Fid%3B%0D%0A+++++++++gnt%3Alocus+%3Flocus%3B%0D%0A+++++++++%23+gnt%3Achr+%3Fchr%3B%0D%0A+++++++++%23+gnt%3Apos+%3Fpos%3B%0D%0A+++++++++gnt%3AlodScore+%3Flrs%3B%0D%0A+++++++++dct%3Adescription+%3Fdescr.%0D%0A%7D+limit+50&format=text%2Fhtml&timeout=0&signal_void=on Try
+
+If we want to get the chr+location we can query one:
+
+```
+SELECT * WHERE {
+gn:Rs47436964 ?p ?o.
+}
+```
+
+renders
+
+```
+http://www.w3.org/2000/01/rdf-schema#label 	"rs47436964"
+chr "12"
+mb 	65.0498
+```
+
+Now the label is not so interesting, so in one query we can do:
+
+```
+SELECT ?id ?lod ?chr ?mb ?descr WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lod;
+         dct:description ?descr.
+    ?locus gnt:chr ?chr;
+         gnt:mb ?mb.
+} order by desc(?lod) limit 50
+```
+
+which gets, for example, a massive reaper HK QTL at
+
+```
+"21588" 34.558 "12" 116.67 "Cofactor, genetics, genomics: Structural variants SVs on chromosome 12, raw uncorrected sum of calls using LongRanger on linked-read sequencing data [n]"
+```
+
+The description of the phenotype is unfortunate. I think it is a synthetic QTL. The title is "SVs_Chr12". Luckily most traits give more of an idea of what they are about.
+
+=> https://sparql.genenetwork.org/sparql?default-graph-uri=&query=%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E+%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E+%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E+%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E+%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E+%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E+%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E+%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E+%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E+%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E+%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E+%0D%0A%0D%0A%0D%0A%0D%0ASELECT+%3Fid+%3Flrs+%3Fchr+%3Fmb+%3Fdescr+WHERE+%7B%0D%0A++++%3Fs+gnt%3AbelongsToGroup+gn%3AsetBxd%3B%0D%0A+++++++++gnt%3AtraitId+%3Fid%3B%0D%0A+++++++++gnt%3Alocus+%3Flocus%3B%0D%0A+++++++++gnt%3AlodScore+%3Flrs%3B%0D%0A+++++++++dct%3Adescription+%3Fdescr.%0D%0A++++%3Flocus+gnt%3Achr+%3Fchr%3B%0D%0A+++++++++++++++gnt%3Amb+%3Fmb.%0D%0A%7D+order+by+desc%28%3Flrs%29+limit+50&format=text%2Fhtml&timeout=0&signal_void=on SPARQL
+
+To run this query on all 13K traits takes just a second! The resulting 3 MB TSV I'll share. Note that there is no code necessary to get to this point! Just SPARQL queries on a public endpoint.
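+
+To pull the full result as TSV from a script, the same query can be sent as a plain GET request. A minimal sketch in Python that only builds the URL: the endpoint and the 'format' parameter are taken from the links above, while the text/tab-separated-values MIME type, the shortened prefix list and the function name are assumptions.
+
+```python
+from urllib.parse import urlencode
+
+# Endpoint as used in the links above
+ENDPOINT = "https://sparql.genenetwork.org/sparql"
+
+# Only the prefixes this query needs
+PREFIXES = """\
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+"""
+
+QUERY = """\
+SELECT ?id ?lod ?chr ?mb ?descr WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lod;
+         dct:description ?descr.
+    ?locus gnt:chr ?chr;
+         gnt:mb ?mb.
+} order by desc(?lod)
+"""
+
+def sparql_url(query, fmt="text/tab-separated-values", endpoint=ENDPOINT):
+    """Return a GET URL that runs query on the endpoint.
+
+    'query' is the standard SPARQL protocol parameter. The value of
+    'format' is an assumption about what the server accepts.
+    """
+    return endpoint + "?" + urlencode({"query": PREFIXES + query, "format": fmt})
+
+url = sparql_url(QUERY)
+print(url)
+```
+
+The printed URL can be fetched with curl or pasted into a browser.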
+
+Now, what we want to do is take these results and combine them with the full vector data stored in lmdb.
+The first thing we can do is list the top hit from every trait and combine that with the above data. That way we can quickly assess what trait hits will change using GEMMA instead of HK reaper. One thing to note is the formula LRS/4.6=LOD. The GN2 interface shows LRS.
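+
+The factor 4.6 is 2 x ln(10): LRS is a likelihood ratio statistic on a natural log scale and LOD is on a log10 scale. A small sketch to convert between the two (the function names are mine):
+
+```python
+import math
+
+# LRS = 2 * ln(L1/L0) and LOD = log10(L1/L0), so LRS = 2 * ln(10) * LOD
+LRS_PER_LOD = 2 * math.log(10)
+
+def lrs_to_lod(lrs):
+    """Convert a likelihood ratio statistic to a LOD score."""
+    return lrs / LRS_PER_LOD
+
+def lod_to_lrs(lod):
+    """Convert a LOD score to a likelihood ratio statistic."""
+    return lod * LRS_PER_LOD
+
+print(round(LRS_PER_LOD, 3))       # 4.605
+print(round(lod_to_lrs(4.0), 2))   # 18.42
+```
+
+So a LOD threshold of 4.0 corresponds to an LRS of about 18.4 in the GN2 interface.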
+
+Meanwhile I am waiting for precompute. Most of it is done, but some interesting errors:
+
+```
+Precomputing 20484
+;;; ("41012208")
+SQL Connection ERROR! file not found
+```
+
+especially since it appears this is a cache hit. OK, I'll check tomorrow. For now we have 12837 completed vectors!
+After some reruns we have 13491 vectors, i.e. 98% of BXD PublishData.
+
+Some remaining problems:
+
+```
+Executing: parallel --results /tmp/test --joblog /tmp/test/79d6dbd2fbd55b159c35d903ba10d9cab14f7816-parallel.log < /tmp/test/parallel-commands.txt
+Command exited with non-zero status 20
+```
+
+Here the trait values are all 1.0:
+
+```
+BXD1    1.0
+BXD2    1.0
+BXD5    1.0
+BXD6    1.0
+BXD8    1.0
+BXD9    1.0
+BXD11   1.0
+BXD12   1.0
+BXD13   1.0
+BXD14   1.0
+BXD15   1.0
+BXD16   1.0
+BXD18   1.0
+BXD19   1.0
+```
+
+We'll look into those later.
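+
+A check like the following could run before GEMMA, so that constant or very small traits get flagged up front. This is a sketch only: the function names and the minimum of 12 samples are made up for illustration, not something we decided.
+
+```python
+import statistics
+
+MIN_SAMPLES = 12   # hypothetical cut-off, not a project decision
+
+def parse_pheno(text):
+    """Parse 'strain<whitespace>value' lines into a dict."""
+    values = {}
+    for line in text.splitlines():
+        fields = line.split()
+        if len(fields) == 2:
+            values[fields[0]] = float(fields[1])
+    return values
+
+def skip_reason(values, min_samples=MIN_SAMPLES):
+    """Return why a trait should be skipped, or None when it looks mappable."""
+    if len(values) < min_samples:
+        return f"only {len(values)} samples"
+    if statistics.pvariance(values.values()) == 0.0:
+        return "all values identical"
+    return None
+
+constant = parse_pheno("BXD1    1.0\nBXD2    1.0\nBXD5    1.0\nBXD6    1.0")
+print(skip_reason(constant, min_samples=3))   # all values identical
+print(skip_reason(constant))                  # only 4 samples
+```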
+
+Next step is to collect all the highest hits and we can do that with
+
+```
+./bin/view-gemma-mdb --sort tmp/tmp/9179b...923f181-gemma-GWA.mdb --anno BXD.8_snps.txt |head -2
+Reading tmp/tmp/9179b192fc1c19142d97607b64c04bf5a923f181-gemma-GWA.mdb...
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+10,125580028,rsm10000007478,0.655,0.014,0.0134,100000.0,0.0005,3.34
+```
+
+That is great, but now we need to put the data in a place where we can analyse it and compare it with qtlreaper. We could do a one-off using some tabular format, but that would mean redoing things later to get it into SQL and/or present it some other way. So, basically, we need a flexible storage format that allows us to query things -- without predicting how people want to use that data -- and, importantly, lets machines do it. Here comes RDF as the solution. As Mark Wilkinson has it: in my lab we only do RDF. No hacks (please).
+
+So, let's adapt the output of view-gemma-mdb and convert that to RDF. Bonz has done many such exercises in
+
+=> https://git.genenetwork.org/gn-transform-databases/tree/
+
+e.g. for the earlier phenotypes RDF+SPARQL we used to get the reaper values
+
+=> https://git.genenetwork.org/gn-transform-databases/tree/examples/phenotype.scm
+
+In this code SQL queries are embedded. I would argue these need to be replaced with REST API calls. But hey.
+
+First step is to include the ID with ./bin/view-gemma-mdb and some other metadata as fields, that we so thoughtfully included in the mdb metadata. This results in:
+
+```
+Reading /tmp/tmphvi6grqm/2b8e7c7cfe98f7e44bb2f07f057cc1adedf29c38-gemma-GWA.mdb...
+name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+BXDPublish,22200,1,4858261,rsm10000000111,0.5,0.0246,0.0537,100000.0,0.0192,1.72
+BXDPublish,22200,1,182581091,rsm10000000451,0.548,-0.009,0.0537,100000.0,0.139,0.86
+BXDPublish,22200,1,182635325,rsm10000000452,0.548,-0.009,0.0537,100000.0,0.139,0.86
+```
+
+Now remember the HK reaper data is already in RDF. If we push this data in we should be able to query the combined datasets. Let's convert this to RDF that looks like:
+
+```
+gn:GEMMAMappedLOCO_22200 a gnt:mappedTrait;
+                         rdfs:label "GEMMA trait 22200 mapped with LOCO (defaults)";
+                         gnt:LOCO true;
+                         gnt:belongsToGroup gn:setBxd;
+                         gnt:traitId "22200";
+                         skos:altLabel "BXD_22200";
+                         gnt:locus gn:rsm10000000111;
+                         gnt:lodScore 1.72;
+                         gnt:af 0.5;
+                         gnt:effect 0.0246.
+```
+
+If the marker is not yet defined we can add:
+
+```
+gn:rsm10000000111        a gnt:marker;
+                         rdfs:label "rsm10000000111";
+                         gnt:chr  "1";
+                         gnt:mb   4.858261;
+                         gnt:pos  4858261.
+```
+
+This means we can pivot on the trait id between reaper and gemma results. It will also be easy to store multiple GEMMA hits.
+I note that GEMMA does not store the mean value. We can fetch that from trait values.
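+
+A sketch of that conversion in Python, assuming the CSV columns shown above. This is not the actual implementation: the function name is hypothetical and the group is hard-coded to BXD.
+
+```python
+import csv
+import io
+
+# One record as printed by view-gemma-mdb above
+CSV_TEXT = """\
+name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+BXDPublish,22200,1,4858261,rsm10000000111,0.5,0.0246,0.0537,100000.0,0.0192,1.72
+"""
+
+def to_turtle(row):
+    """Render one CSV record as Turtle, following the layout sketched above."""
+    tid = row["id"]
+    lines = [
+        f"gn:GEMMAMappedLOCO_{tid} a gnt:mappedTrait;",
+        f'    rdfs:label "GEMMA trait {tid} mapped with LOCO (defaults)";',
+        "    gnt:LOCO true;",
+        "    gnt:belongsToGroup gn:setBxd;",   # assumes a BXD run
+        f'    gnt:traitId "{tid}";',
+        f'    skos:altLabel "BXD_{tid}";',
+        f"    gnt:locus gn:{row['marker']};",
+        f"    gnt:lodScore {row['-logP']};",
+        f"    gnt:af {row['af']};",
+        f"    gnt:effect {row['beta']}.",
+    ]
+    return "\n".join(lines)
+
+for row in csv.DictReader(io.StringIO(CSV_TEXT)):
+    print(to_turtle(row))
+```
+
+Values are passed through as the strings found in the CSV, so no rounding happens on the way to RDF.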
+
+Rob wrote:
+
+> We will want to harvest the sample size for each trait. That will be a critical parameter for filtering. Knowing the skew and kurtosis also highly valuable in filtering and diagnostics. Many users forget to log their data and this introduces serious problems since you have a tail of outliers. Obviously a dumb mistake to have traits with all values of 1. Perhaps you can assign the task of fixing/deleting that traits to Arthur and me. Just send a list.
+
+I'll make a list to send to Arthur and you - it is on my tasks. With regard to trait info we should compute that as metadata when doing the precompute (as we have the trait values at that point!). I have added that to the task list.
+
+=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-publishdata
+
+We'll do a rerun with this data soon, as it only took a day.
+
+Alright, I am keen to move forward on our precompute, because this is the fun phase. Getting the metadata in place should be easy, now we are on RDF. First we are going to simply mirror PublishXRef information for HK reaper and GEMMA runs. Reaper is already in RDF (mostly), so let's add some functionality to gemma-wrapper.
+
+The viewer for 1e59d19a679359516ecd97cf20375c80e987ee3e-BXDPublish-22282-gemma-GWA.tar.xz  gives
+
+```
+name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
+BXDPublish,22282,5,110385941,rs29780222,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+BXDPublish,22282,5,110421808,rsm10000002804,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+BXDPublish,22282,5,110479038,rsm10000002805,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+BXDPublish,22282,5,110515858,rs33083878,0.484,-0.0802,0.0356,2.0341,0.0,4.51
+```
+
+Note that the sorting is arbitrary because -logP is identical! My take is that we should include all hits (read SNP names) for comparison with HK reaper. We will be able to parse range locations - so we can check 50K base pairs up and downstream too.
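+
+The range check itself is simple. A sketch, with hits as (chr, pos) pairs and a hypothetical function name:
+
+```python
+WINDOW = 50_000  # base pairs up and downstream, as suggested above
+
+def same_region(hit_a, hit_b, window=WINDOW):
+    """True when two (chr, pos) hits are on the same chromosome within window bp."""
+    chr_a, pos_a = hit_a
+    chr_b, pos_b = hit_b
+    return chr_a == chr_b and abs(pos_a - pos_b) <= window
+
+# Positions taken from the trait 22282 output above
+print(same_region(("5", 110385941), ("5", 110421808)))   # True, 35867 bp apart
+print(same_region(("5", 110385941), ("5", 110515858)))   # False, 129917 bp apart
+```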
+
+Looking at SNPs we should look at using existing URIs instead of inventing new ones. I'll make a note of that too (to move forward). Looking at the first hit rs29780222 some googling finds https://www.informatics.jax.org/marker/MGI:1925270. I need to check with the GN database what is known there. Adding a BED file to RDF makes sense. Yet another task to add.
+
+OK, back to focussing on generating RDF with what we have now. A first attempt is
+
+```
+gn:GEMMAMapped_LOCO_e987ee3e_BXDPublish_22282_gemma_GWA a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 22282 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_22282;
+      gnt:loco true;
+      gnt:time "2025/08/11 10:15";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "22282";
+      skos:altLabel "BXD_22282";
+      gnt:locus gn:rs29780222;
+      gnt:lodScore 4.51;
+      gnt:af 0.484;
+      gnt:effect -0.08.
+```
+
+which looks nice already. We want to support more SNPs, however, so we split those up and now this dataset shows 84 SNPs at a -logP cut-off of 4.0. We'll improve on that later (and will use precompute to estimate levels for the BXD). We always show the single highest score, no matter what. The cool thing is that we have *all* peaks now in RDF and we can query that:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 22282 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_22282;
+      gnt:loco true;
+      gnt:time "2025/08/11 10:15";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "22282";
+      skos:altLabel "BXD_22282".
+gn:rs29780222_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rs29780222;
+      gnt:lodScore 4.51;
+      gnt:af 0.484;
+      gnt:effect -0.08.
+gn:rsm10000002804_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rsm10000002804;
+      gnt:lodScore 4.51;
+      gnt:af 0.484;
+      gnt:effect -0.08.
+(...)
+gn:rs33400361_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rs33400361;
+      gnt:lodScore 4.07;
+      gnt:af 0.452;
+      gnt:effect -0.078.
+gn:rsm10000002851_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e;
+      gnt:locus gn:rsm10000002851;
+      gnt:lodScore 4.07;
+      gnt:af 0.452;
+      gnt:effect -0.078.
+```
+
+Next step is to use rapper to see if this is valid RDF.
+
+```
+rapper --input turtle test.ttl
+```
+
+For this one trait: rapper: Parsing returned 513 triples. It may look like a lot of data, but RDF stores are pretty good at creating small enough representations. All identifiers are stored once as a string and referenced by 64-bit pointers.
+
+For the locus I notice Bonz capitalized the SNP identifiers. We don't want that. But I'll stick it in for now. The code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-mdb-to-rdf.rb
+
+Basically we run
+
+```
+rm test.rdf
+for x in tmp/*.xz ; do
+    env GEM_PATH=tmp/ruby GEM_HOME=tmp/ruby ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.rdf
+done
+```
+
+for the 98% of BXD PublishData, which rendered 1512885 triples. It needs some minor fixes, such as a LOD of infinity and the use of ? for an unknown locus.
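+
+One possible way to handle both cases is to drop such records before writing triples. A sketch only: whether dropping (rather than, say, capping the score) is the right fix is still open, the function name is hypothetical, and the example hits with ? and infinity are made up.
+
+```python
+import math
+
+def is_loadable(marker, lod):
+    """Keep only hits with a named marker and a finite LOD score."""
+    return marker != "?" and math.isfinite(lod)
+
+hits = [("rs29780222", 4.51), ("?", 5.2), ("rsm10000002804", float("inf"))]
+kept = [(marker, lod) for marker, lod in hits if is_loadable(marker, lod)]
+print(kept)   # [('rs29780222', 4.51)]
+```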
+
+To load the file on production:
+
+```
+guix shell -C -N virtuoso-ose -- isql
+# or
+/gnu/store/9d81kdw2frn6b3fwqphsmkssc9zblir1-virtuoso-ose-7.2.11/bin/isql -u dba -P "*" -S 8981
+OpenLink Virtuoso Interactive SQL (Virtuoso)
+Version 07.20.3238 as of Jan  1 1970
+Type HELP; for help and EXIT; to exit.
+Connected to OpenLink Virtuoso
+Driver: 07.20.3238 OpenLink Virtuoso ODBC Driver
+ld_dir("/home/wrk/","test.ttl","http://pjotr.genenetwork.org")
+SQL> rdf_loader_run();
+Done. -- 13 msec.
+SQL> checkpoint;
+Done. -- 243 msec.
+SQL>
+```
+
+But it doesn't show up. Same for:
+
+```
+root@tux04:/export/guix-containers/genenetwork/data/virtuoso/ttl# curl --digest -v --user 'dba:*' --url "http://localhost:8982/sparql-graph-crud-auth?graph=http://pjotr.genenetwork.org" -T test.ttl
+```
+
+
+I tried to upload to production, but this crashed the virtuoso server :/.
+So I built a new virtuoso instance using gn-machines:
+
+=> https://git.genenetwork.org/gn-machines/commit/?id=90fa4fdacffe26c57649cb0515d0679ca19c27cc
+
+Now we can run isql locally as
+
+```
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+
+SQL> ld_dir('/export/data/virtuoso/ttl','test.n3','http://pjotr.genenetwork.org');
+Done. -- 3 msec.
+# check the load list and optionally delete problematic entries:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+# commit changes
+SQL> rdf_loader_run ();
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o };
+15
+```
+
+If an error exists all uploads will be blocked unless DB.DBA.LOAD_LIST is emptied (DELETE).
+An error may look like:
+
+```
+ERROR  : Character data are not allowed here by XML structure rules
+at line 2 column 3 of source text
+@prefix dct: <http://purl.org/dc/terms/> .
+```
+
+I don't know why, but only n3 triples appeared to work. The full manual is here:
+
+=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Virtuoso bulk uploader
+
+## Fixing hanging virtuoso on production
+
+Going back to production I cleaned up the DB.DBA.LOAD_LIST as described above. Running isql can be done outside the container:
+
+```
+guix shell virtuoso-ose -- isql 8981
+SQL> DELETE from DB.DBA.LOAD_LIST;
+SQL> checkpoint;
+```
+
+SPARQL queries inside isql are fast:
+
+```
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o };
+1206882
+SQL> SPARQL SELECT count(*) FROM <http://genenetwork.org> WHERE { ?s ?p ?o };
+46982542
+```
+
+The web socket is not connected. This does not respond:
+
+```
+curl http://localhost:8982/sparql/
+```
+
+Restarting virtuoso with herd stop/start made no difference. Nor did restarting nginx or nscd. Hmm. After restarting the full container it starts up at
+
+```
+root@tux04:/export/guix-containers/genenetwork/var/log# tail virtuoso.log
+  2025-08-17 07:47:07 07:47:07 HTTP server online at localhost:9893
+  2025-08-17 07:47:07 07:47:07 Server online at localhost:9892 (pid 43)
+curl localhost:9893/sparql
+```
+
+Aha, the domain is pointing to the wrong virtuoso server... I modified nginx on tux04 and, at least, we have SPARQL running on http. For https nginx is pointing to https://127.0.0.1:8993. Hmmm. That is not the same as what the logs tell me. Looks like there is still some problem with the production container. Well, we can solve that later.
+
+I'll first run virtuoso on a server. Starting from a guix from half a year ago:
+
+```
+. /usr/local/guix-profiles/guix-pull-3-link/etc/profile
+cd ~/gn-machines
+./virtuoso-deploy.sh
+curl localhost:8892/sparql/
+```
+
+Configure nginx to listen
+
+```
+server {
+  server_name sparql-test.genenetwork.org;
+  listen 80;
+  access_log /var/log/nginx/sparql-test-access.log;
+  error_log /var/log/nginx/sparql-test-error.log;
+  location / {
+    proxy_pass http://localhost:8892;
+    proxy_set_header Host $host;
+  }
+}
+```
+
+Added DNS-entry and we should be able to see
+
+=> http://sparql-test.genenetwork.org/sparql/
+
+Now I need to load the important data into this SPARQL server. On tux02 I find a recent set:
+
+```
+     4096 Dec  5  2024 wip
+   260886 Jul 21 19:57 schema.ttl
+443454617 Jul 21 19:57 generif-old.ttl
+    44902 Jul 21 19:57 classification.ttl
+339900838 Jul 21 19:58 genelist.ttl
+ 42509383 Jul 21 19:58 genbank.ttl
+152936953 Jul 21 19:58 genotype.ttl
+  1460511 Jul 21 19:58 dataset-metadata.ttl
+700627810 Jul 21 19:58 generif.ttl
+ 10491221 Jul 21 19:58 strains.ttl
+     1388 Jul 21 19:58 species.ttl
+ 23495986 Jul 21 19:58 publication.ttl
+    16879 Jul 21 19:58 tissue.ttl
+ 18537935 Jul 21 19:58 phenotype.ttl
+root@tux02:/export/data/genenetwork-virtuoso# du -sh .
+1.7G    .
+```
+
+Which is about 2 GB uncompressed. Not bad. To load the ttl files I have to move them into /export/guix-containers/virtuoso/data/virtuoso/ttl.
+
+```
+guix shell virtuoso-ose -- isql 8891 exec="ld_dir('/export/data/virtuoso/ttl','*.ttl','http://genenetwork.org');"
+guix shell virtuoso-ose -- isql 8891 exec="rdf_loader_run();"
+```
+
+That takes a few minutes for 29746544 triples. Not bad at all!
+
+```
+guix shell virtuoso-ose -- isql 8891 exec="SELECT * FROM DB.DBA.load_list;"
+guix shell virtuoso-ose -- isql 8891 exec="checkpoint;"
+```
+
+Let's list all the tissues we have with
+
+```
+SELECT * WHERE {
+  ?s rdf:type gnc:tissue .
+  ?s rdfs:label ?o .
+}
+"http://genenetwork.org/id/tissueA1c" "Primary Auditory (A1) Cortex mRNA"
+"http://genenetwork.org/id/tissueAcc" "Anterior Cingulate Cortex mRNA"
+"http://genenetwork.org/id/tissueAdr" "Adrenal Gland mRNA"
+"http://genenetwork.org/id/tissueAmg" "Amygdala mRNA"
+"http://genenetwork.org/id/tissueBebv"  "Lymphoblast B-cell mRNA"
+"http://genenetwork.org/id/tissueBla" "Bladder mRNA"
+(...)
+```
+
+=> http://sparql-test.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E%0D%0A%0D%0ASELECT+*+WHERE+%7B%0D%0A%3Fs+rdf%3Atype+gnc%3Atissue+.%0D%0A%3Fs+rdfs%3Alabel+%3Fo+.%0D%0A%7D%0D%0A&format=text%2Fhtml&timeout=0&signal_void=on Try it!
+
+## Getting to our first PublishData queries
+
+Next we need to upload our fresh PublishData RDF. We generated that with:
+
+```
+rm -f test.ttl ; for x in tmp/*.xz ; do ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.ttl; done
+```
+
+Takes 10 minutes. rapper still returns an error for 'gnt:lodScore Infinity;'. I'll fix that down the line.
+
+Put test.ttl in /export/guix-containers/virtuoso/data/virtuoso/ttl and use the isql commands to update virtuoso. I use a separate graph named 'http://pjotr.genenetwork.org' so we can easily delete the triples.
+
+```
+guix shell virtuoso-ose -- isql 8891 exec="ld_dir('/export/data/virtuoso/ttl','test.ttl','http://pjotr.genenetwork.org'); rdf_loader_run();"
+```
+
+OK, we have the data together. Time for our first queries. Interesting questions are:
+
+* How many hits do we have for qtlreaper and how many for gemma in total
+* How many hits do we have for qtlreaper and how many for gemma that have a hit of 4.0 or higher
+* How many of these hits for qtlreaper differ from those of gemma
+* What datasets have been mapped in qtlreaper, but not in gemma
+
+### How many hits do we have for qtlreaper and how many for gemma in total
+
+Remember we had this query for reaper:
+
+```
+SELECT * WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs;
+         dct:description ?descr.
+} limit 5
+"http://genenetwork.org/id/traitBxd_10001","10001","http://genenetwork.org/id/Rs48756159",2.93169,"Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]"
+"http://genenetwork.org/id/traitBxd_10002","10002","http://genenetwork.org/id/Rsm10000005699",4.77938,"Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]"
+"http://genenetwork.org/id/traitBxd_10003","10003","http://genenetwork.org/id/Rsm10000013713",3.38682,"Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg]"
+"http://genenetwork.org/id/traitBxd_10004","10004","http://genenetwork.org/id/Rs48756159",2.56076,"Central nervous system, morphology: Cerebellum volume [mm3]"
+"http://genenetwork.org/id/traitBxd_10005","10005","http://genenetwork.org/id/Rsm10000005699",5.02907,"Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3]"
+```
+
+we can run a similar query for GEMMA results with trait id "10001" and locus names.
+
+```
+SELECT * WHERE {
+    ?s gnt:mappedSnp ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs.
+    filter(?lrs > 4.0).
+} limit 5
+```
+
+to find distinct datasets for GEMMA:
+
+```
+SELECT count(*) WHERE {
+  ?id gnt:name "BXDPublish" .
+} limit 5
+```
+
+Counting the total number of hits, we have 13576 reaper hits and 231911 GEMMA hits. For GEMMA we have 13491 uniquely mapped datasets.
+
+### Count hits that are significant
+
+For GEMMA 223232 hits are 4.0 or higher. For reaper we count 1098. Almost all reaper values are between 2.0 and 4.0. When we count GEMMA hits below 4.0 we get 8679 datasets, and that makes sense: for GEMMA we list all SNPs that score 4.0 or higher, and for datasets that stay below that we list only the highest SNP. In both cases the majority of traits are below our threshold.
+
+### Start looking at the difference
+
+For every reaper SNP 'locus' we want to find the GEMMA sets that contain that particular SNP. In other words, those are the hits that GEMMA found that compare with qtlreaper. We pivot on SNP ?locus and ?traitid.
+
+```
+SELECT count(*) WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?gemma gnt:mappedSnp ?id2;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs2.
+    ?id2 gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    filter(?lrs2 >= 4.0).
+} limit 5
+```
+
+We now find 4222 overlapping traits, of which 2924 have a GEMMA LOD score >= 4.0. For reaper, 892 are > 4.0 (out of 1098). That implies that for some 200 significant reaper scores GEMMA finds (completely) different SNPs.
+
+The next step is to list these differences. That is a reverse query. In plain English it should be something like:
+
+> List all sets where reaper has a SNP (r_snp) that does not appear in its GEMMA computation (g_snps).
+
+This is rather hard to do in SPARQL. We can make a list, however, of the overlapping traits with a lod score>4.0 with
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT ?traitid WHERE {
+   # --- get the reaper SNPs
+    ?r_trait gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?snp.
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:locus ?snp;
+         gnt:lodScore ?g_lrs.
+    filter(?g_lrs >= 4.0).
+} limit 5
+```
+
+Resulting in 2925 overlapping results. For example, it lists trait
+
+=> https://genenetwork.org/show_trait?trait_id=12014&dataset=BXDPublish
+
+where both reaper and gemma show a top hit for rs13478947.
+
+```
+SELECT count(distinct ?traitid) WHERE {
+   # --- get the reaper SNPs
+    ?r_trait gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?snp.
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:lodScore ?g_lrs.
+    MINUS { ?g_snp gnt:locus ?snp . }
+    filter(?g_lrs >= 4.0).
+}
+```
+
+Now we can make a second list for all gemma results where g_lrs > 4.0. The difference is our set.
+
+```
+SELECT DISTINCT ?traitid WHERE {
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:locus ?snp;
+         gnt:lodScore ?g_lrs.
+    filter(?g_lrs >= 4.0).
+}
+```
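+
+A more direct way to express that difference in one query is FILTER NOT EXISTS. The following is a sketch only and has not been run against the live graph: it keeps traits with at least one GEMMA hit at LOD >= 4.0 and drops those where a significant GEMMA hit sits on the reaper SNP. It reuses the predicates from the queries above; the variable names ?g_hit and ?hit_lrs are made up for illustration.
+
+```
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+
+SELECT DISTINCT ?traitid WHERE {
+    # --- get the reaper SNP
+    ?r_trait gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?snp.
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- require at least one significant gemma hit for this trait
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:lodScore ?g_lrs.
+    FILTER(?g_lrs >= 4.0)
+    # --- and no significant gemma hit on the reaper SNP itself
+    FILTER NOT EXISTS {
+        ?g_hit gnt:mappedSnp ?g_trait;
+             gnt:locus ?snp;
+             gnt:lodScore ?hit_lrs.
+        FILTER(?hit_lrs >= 4.0)
+    }
+}
+```
+
+Unlike MINUS on a single ?g_snp row, the NOT EXISTS test looks at all GEMMA hits of the trait at once, so a trait is only reported when none of its significant hits match the reaper SNP.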
+
+One example is trait 23777, where reaper has rsm10000008413. GEMMA ranks that SNP too, but with an LRS of 3.44 it falls below the threshold. That does not make for a strong case because both results are on Chr11 and not too far from each other (58 vs 73 Mb). Still, it may be a difference of interest. GEMMA's main hit rs13480386 is also ranked by reaper (in GN2).
+I think we need to refine our method. Peaks on Chr9 and 15 are also of interest.
+
+See
+
+=> https://genenetwork.org/show_trait?trait_id=23777&dataset=BXDPublish
+
+Another trait, 14905, shows a whopper on Chr4 with gemma and one on Chr8 with reaper.
+This is rather a good example. To improve the power of our search I think I should extend the GEMMA results with all hits above 3.0. That greatly increases the chance that a reaper marker is seen. To do an even better job we should run reaper precompute and also store the highest ranked markers (rather than one single hit). That way we get a true picture of the overlap and differences. While we are at it, we should store the trait values with the sample size etc.
+
+But first let's try finding those that differ on chromosome hits:
+
+Hmmm. The following is not working quite right because it shows all the differences, with 200K results. I tried
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT DISTINCT ?traitid ?chr1 ?chr2 ?url ?descr WHERE {
+   # --- get the reaper SNPs
+    ?r_trait gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?snp ;
+         dct:description ?descr.
+    # --- get gemma trait that matches reaper traitid (pivot on traitid)
+    ?g_trait gnt:name "BXDPublish" ;
+        gnt:traitId ?traitid.
+    # --- g_snp is the SNP scored within a gemma trait run
+    ?g_snp gnt:mappedSnp ?g_trait;
+         gnt:lodScore ?g_lrs ;
+         gnt:locus ?snp2 .
+    # --- get Chr positions of both snps
+    ?snp gnt:chr ?chr1 .
+    ?snp2 gnt:chr ?chr2 .
+    MINUS { ?g_snp gnt:locus ?snp . }
+    filter(?g_lrs >= 4.0).
+    filter(?chr2 != ?chr1) .
+    BIND(REPLACE(?traitid, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url)
+} LIMIT 15
+```
+
+What I am attempting is set analysis. SPARQL is so powerful that you are tempted to try it there, but it is far simpler to do in a general-purpose programming language. I tooted about this rediscovery:
+
+=> https://genomic.social/@pjotrprins@mastodon.social/115059451578588805
+
+I created a list for Rob using some simple shell commands, so he can see what the challenge is. I wrote
+
+> Attached a list of traits that show a reaper SNP that is not significant (LOD 4.0) for GEMMA and still show a significant hit for GEMMA. You can test run them on GN2 and see that the story is ambiguous. To do a proper job we should store more hits for GEMMA (say from LOD 3.0) and do a precompute exercise with reaper storing all top hits. That way we can probably do better and even get a list for Claude.
+
+So, rerunning GEMMA and reaper are on the books. While we are at it we can adapt reruns for
+
+* qnormalized data*
+* auto winsorizing
+* sex covariate
+* run gemma without LOCO
+* cis covariate, using the current hit and recompute with that as a covariate*
+* epistatic covariates
+
+and that should all be reasonably easy for the 13K traits.
+
+## More metadata
+
+But first we set up a new run with more metadata. In the lmdb files we should add the trait values, the mean, SE, skew, kurtosis, any DOIs.
+
+gemma-wrapper can take trait values as produced by our gn-guile endpoint (in .json). The first step is to add these values to the metadata. The existing permutate switch takes a pheno file and outputs that during a run. We can use that to pass in the pheno file.
+
+
+Now we should write out the gemma phenotypes to make sure they align. Now we essentially moved the functionality from gn-pheno-to-gemma.rb into gemma-wrapper, so we need to pass in the geno information too.
+
+The command becomes
+
+```
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --lmdb --phenotypes 10002-pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+```
+
+We now store the trait values into the metadata and they go into lmdb!
+
+```
+  "meta": {
+    "type": "gemma-wrapper",
+    "version": "1.00-pre1",
+    "population": "BXD",
+    "name": "BXDPublish",
+    "trait": "1",
+    "geno_filename": "BXD.geno.txt",
+    "geno_hash": "3b65ed252fa47270a3ea867409b0bdc5700ad6f6",
+    "loco": true,
+    "url": "https://genenetwork.org/show_trait?trait_id=1&dataset=BXDPublish",
+    "archive_GRM": "185eb08dc3897c7db5d7ea987170898035768f93-gemma-cXX.tar.xz",
+    "archive_GWA": "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-1-gemma-GWA.tar.xz",
+    "trait_values": {
+      "BXD1": 54.099998,
+      "BXD2": 50.099998,
+      "BXD5": 53.299999,
+...
+```
+
+Commit is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/9ad5f762823031da08fc51c2a6adae983e6e8314
+
+Now gemma2lmdb is actually written in python, so we can make use of scipy functions using the trait values.
+
+So, for example, we can compute:
+
+```
+mean= 52.22058749999999  std= 2.968538937833582  kurtosis= 0.03143766680654192  skew= -0.1315270039489698
+for
+[54.099998, 50.099998, 53.299999, 55.099998, 57.299999, 51.200001, 53.599998, 46.799999, 50.599998, 49.299999, 45.700001, 52.5, 52.0, 51.099998, 52.400002, 49.0, 51.599998, 50.700001, 55.5, 52.599998, 53.099998, 53.5, 53.200001, 58.700001, 50.799999, 53.299999, 51.900002, 54.099998, 52.299999, 46.099998, 51.799999, 57.0, 48.599998, 56.599998]
+```
+
+Using
+
+=> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html
+=> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html
+
+Code in gemma-wrapper repo.
+
+I'll set up a new run and export to RDF. Some additions first.
+
+Even though we store trait values, I should add the number of individuals too. We store that as nind.
+
+Now we have these metrics, no metadata is complete without its publication. PublishXRef contains a PublicationID. It points into the Publication table that contains, for example:
+
+```
+| Id  | PubMed_ID | Abstract | Authors | Title | Journal | Volume | Pages | Month | Year |
+| 116 |  11438585 | To discover genes influencing cerebellum development, we conducted a complex trait analysis of variation in the size of the adult mouse cerebellum. We analyzed two sets of recombinant inbred BXD strains and an F2 intercross of the common inbred strains, C57BL/6J and DBA/2J. We measured cerebellar size as the weight or volume of fixed or histologically processed tissue. Among BXD recombinant inbred strains, the cerebellum averages 52 mg (12.4% of the brain) and ranges 18 mg in size. In F2 mice, the cerebellum averages 62 mg (12.9% of the brain) and ranges approximately 20 mg in size. Five quantitative trait loci (QTLs) that significantly control variation in cerebellar size were mapped to chromosomes 1 (Cbs1a), 8 (Cbs8a), 14 (Cbs14a), and 19 (Cbs19a, Cbs19b). In combination, these QTLs can shift cerebellar size to an appreciable 35% of the observed range. To assess regional genetic control of the cerebellum, we also measured the volume of the cell-rich, internal granule layer (IGL) in a set of BXD strains. The IGL ranges from 34 to 43% of total cerebellar volume. The QTL Cbs8a is significantly linked to variation in IGL volume and is suggestively linked to variation in the number of cerebellar folia. The QTLs we have discovered are among the first loci shown to modulate the size and architecture of the adult mouse cerebellum. | Airey DC, Lu L, Williams RW | Genetic control of the mouse cerebellum: identification of quantitative trait loci modulating size and architecture | J Neuroscience | 21     | 5099-5109 | NULL  | 2001 |
+```
+
+That is a nice example.
+But we also find many publications without abstracts, e.g. | 7276 |     15792 | NULL | Williams EG, Andreux P, Houtkooper R, Auwerx J | Recombinant Inbred BXD Mice as a Model for the Metabolic Syndrome.
+
+In fact, 22K entries out of 29K miss the abstract. Also I can't find this last paper by Evan Williams. The closest is "Systems Genetics of Metabolism: The Use of the BXD Murine Reference Panel for Multiscalar Integration of Traits" which is probably worth reading.
+
+=> https://www.cell.com/cell/pdfExtended/S0092-8674(12)01007-0
+
+I have no idea where the number 15792 comes from. It is not a pubmed ID. Some quick checks:
+
+```
+MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 limit 3;
++----------+
+|      427 |
++----------+
+MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 and Pubmed_ID<99999 limit 3;
++----------+
+|        2 |
++----------+
+MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 and Pubmed_ID<999999 limit 3;
++----------+
+|       10 |
++----------+
+MariaDB [db_webqtl]> select count(*) from Publication WHERE NOT Abstract is NULL limit 3;
++----------+
+|     6750 |
++----------+
+```
+
+so, out of 29K entries, we have a very limited number of useful PMIDs, but we have some 6750 abstracts - mostly related to the BXD. Meanwhile some 16572 entries (about half) appear to have valid titles. Almost all records have authors, however.
+
+It really is a bit of a mess. What we need to do is harvest what we have and then collect pubmed ids for the missing BXD PublishData records and use that to fetch up-to-date abstracts and author lists. We can even adapt my Pubmed script that I use for bibtex. A search for just the combination of these authors
+
+```
+pubmed2bib.sh 'Williams EG, Andreux P, Houtkooper R, Auwerx J  [au]'
+```
+
+renders
+
+```
+@article{Andreux:2012,
+  keywords     = { },
+  pmid         = {22939713},
+  pmcid        = {3604687},
+  note         = {{PMC3604687}},
+  IDS          = {PMC3604687, PMID:22939713},
+  author       = {Andreux, P. A. and Williams, E. G. and Koutnikova, H. and Houtkooper, R. H. and Champy, M. F. and Henry, H. and Schoonjans, K. and Williams, R. W. and Auwerx, J.},
+  title        = {{Systems genetics of metabolism: the use of the BXD murine reference panel for multiscalar integration of traits}},
+  journal      = {Cell},
+  year         = {2012},
+  volume       = {150},
+  number       = {6},
+  pages        = {1287-1299},
+  doi          = {10.1016/j.cell.2012.08.012},
+  url          = {http://www.ncbi.nlm.nih.gov/pubmed/22939713},
+  abstract     = {Metabolic homeostasis is achieved by complex molecular and cellular networks that differ significantly among individuals and are difficult to model with genetically engineered lines of mice optimized to study single gene function. Here, we systematically acquired metabolic phenotypes by using the EUMODIC EMPReSS protocols across a large panel of isogenic but diverse strains of mice (BXD type) to study the genetic control of metabolism. We generated and analyzed 140 classical phenotypes and deposited these in an open-access web service for systems genetics (www.genenetwork.org). Heritability, influence of sex, and genetic modifiers of traits were examined singly and jointly by using quantitative-trait locus (QTL) and expression QTL-mapping methods. Traits and networks were linked to loci encompassing both known variants and novel candidate genes, including alkaline phosphatase (ALPL), here linked to hypophosphatasia. The assembled and curated phenotypes provide key resources and exemplars that can be used to dissect complex metabolic traits and disorders.},
+}
+```
+
+So, yes, it is the likely candidate. We can use this information to suggest updates. It just proves again how useful manual curation is.
+
+Note that this information is collected at the experimental level (rather than the trait level), so it really does not belong in the GEMMA lmdb data. Every trait has an entry in PublishXRef that points back to the Publication ID. So we can take it later (and fix it!).
+
+# Rerun GEMMA precompute
+
+Let's set up a full rerun for the 13K BXD PublishData entries with this new information. That should allow us to see how skew and kurtosis and experimental size affect the outcome. Remember we have the batch run script:
+
+```
+#! /bin/env sh
+
+export TMPDIR=./tmp
+curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
+jq ".[] | .Id" < bxd-publish.json > ids.txt
+./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
+
+for id in $(cat ids.txt) ; do
+  echo Precomputing $id
+  if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
+    curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
+    ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
+    ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
+  fi
+done
+```
+
+that can be simplified because gemma-wrapper now replaces gn-pheno-to-gemma.rb. First Guix had to install scipy, which pulls in inkscape and Jupyter among other things. It is really too much! But at least Guix makes it easy to reproduce the environment I use on my desktop on the server. Now we get a beautiful record in every lmdb GEMMA run:
+
+```
+"archive_GWA": "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10001-gemma-GWA.tar.xz",
+"trait_values": {"BXD1": 61.400002, "BXD2": 49.0, "BXD5": 62.5, "BXD6": 53.099998, "BXD8": 59.099998,
+ "BXD9": 53.900002, "BXD11": 53.099998, "BXD12": 45.900002, "BXD13": 48.400002, "BXD14": 49.400002,
+ "BXD15": 47.400002, "BXD16": 56.299999, "BXD18": 53.599998, "BXD19": 50.099998, "BXD20": 48.200001,
+ "BXD21": 50.599998, "BXD22": 53.799999, "BXD23": 48.599998, "BXD24": 54.900002, "BXD25": 49.599998,
+ "BXD27": 47.400002, "BXD28": 51.5, "BXD29": 50.200001, "BXD30": 53.599998, "BXD31": 49.700001,
+ "BXD32": 56.0, "BXD33": 52.099998, "BXD34": 53.700001, "BXD35": 49.700001, "BXD36": 44.5,
+ "BXD38": 51.099998, "BXD39": 54.900002, "BXD40": 49.900002, "BXD42": 59.400002},
+ "table": "PublishData", "traitid": 10001, "dataid": 0}}, "nind": 34,
+ "mean": 52.1353, "std": 4.1758, "skew": 0.6619, "kurtosis": 0.0523,
+```
+
+and the job is running....
+
+Next stop is to rerun reaper and variations on gemma. Last night it halted at 9K. The webserver gave an SQL error and just stopped/waited. As it is not using threads it will block. It says: SQL Connection ERROR! file not found
+
+# HK
+
+We want to rerun reaper to get more top ranked hits (and peaks). Now I also realize GEMMA can also do LR and it would be interesting to see how that differs from reaper. The '-lm' switch says:
+
+```
+ -lm       [num]          specify analysis options (default 1).
+          options: 1: Wald test
+                   2: Likelihood ratio test
+                   3: Score test
+                   4: 1-3
+```
+
+the documentation points out that we don't need a GRM. Exactly. Now we could try and embed this in gemma-wrapper, but that is overkill. Part of the complexity of gemma-wrapper is related to handling the GRM with LOCO. Here we have a simple command that needs to be iterated. We don't need to record trait values, kurtosis etc. because that is already part of the previous exercise (and is constant). So the main complications are to create the trait vector, run gemma, and write an lmdb file. For now this will be a one-off, so we are not going to bother with caching and all that.
+
+```
+gemma -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lm 2 -o trait-BXDPublish-$id-gemma-GWA-hk
+```
+
+This produces a file
+
+```
+chr rs  ps  n_mis n_obs allele1 allele0 af  p_lrt
+1 rsm10000000001  3001490 0 237 X Y 0.527 -nan
+1 rs31443144  3010274 0 237 X Y 0.525 -nan
+1 rs6269442 3492195 0 237 X Y 0.525 -nan
+1 rs32285189  3511204 0 237 X Y 0.525 -nan
+```
+
+Hmm. All p_lrt are NaN. Oh, I need to make sure the second column is used:
+
+```
+gemma -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -lm 2 -o tmp/trait-BXDPublish-$id-gemma-GWA-hk
+chr rs  ps  n_mis n_obs allele1 allele0 af  p_lrt
+1 rsm10000000001  3001490 0 23  X Y 0.739 8.331149e-01
+1 rs31443144  3010274 0 23  X Y 0.739 8.331149e-01
+1 rs6269442 3492195 0 23  X Y 0.739 8.331149e-01
+1 rs32285189  3511204 0 23  X Y 0.739 8.331149e-01
+1 rs258367496 3659804 0 23  X Y 0.739 8.331149e-01
+```
+
+much better! Now we need to turn this into an lmdb file. We can adapt gemma2lmdb.py to do that. But I am not going to do that. The attraction of repurposing code is always there, but it will mean diluting the meaning of the code - basically adding if-then blocks - and making the code less readable. This is one reason the Linux kernel does not share code between device drivers. Even for these simple tools I prefer to split out at the risk of not being DRY. I hope you can see what I mean with:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma2lmdb.py
+
+which is now pretty straightforward for parsing LMM output of GEMMA into lmdb. We are going to do the same thing for a simpler output. But when writing it suddenly struck me we don't need lmdb here in the first place! lmdb is for the full vector output and there is no reason to retain it. All we want is the top hits. Great, that simplifies matters even more. Which btw points out how baffling it is to me that people think they can replace programmers with AI. Well, maybe for the obvious code... You just see how much code will be garbage.
+
+Now we have the same idea in gemma-mdb-to-rdf.rb - and for the same reason as before I am not going to adapt that code.
+
+Fun fact, HK returns the same hits for the GEMMA and reaper versions. Good. GEMMA's p_lrt has a value of 2.720446e-06, for which -log10 (the LOD) is 5.56, and multiplying by 4.61 (2 ln 10) renders about 25.6, where GN2 shows an LRS of 22. Oh well, we are not too concerned, as long as the ranking is correct.
+
+So for GN trait
+
+=> https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish
+
+we now get for GEMMA HK:
+
+```
+gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedTrait;
+        rdfs:label "GEMMA_BXDPublish output/trait-BXDPublish-1-gemma-GWA-hk.assoc.txt trait HK mapped";
+        gnt:GEMMA_HK true;
+        gnt:belongsToGroup gn:setBxd;
+        gnt:trait gn:publishXRef_1;
+        gnt:time "2025-08-25 10:14:23 +0000";
+        gnt:belongsToGroup gn:setBxd;
+        gnt:name "BXDPublish";
+        gnt:traitId "1";
+        skos:altLabel "BXD_1".
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005699 ;
+       gnt:lodScore 5.6 .
+gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs47899232 ;
+       gnt:lodScore 5.6 .
+gn:rs3661882_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs3661882_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs3661882 ;
+       gnt:lodScore 5.3 .
+gn:rs33490412_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs33490412_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs33490412 ;
+       gnt:lodScore 5.3 .
+gn:rsm10000005703_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005703_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005703 ;
+       gnt:lodScore 5.3 .
+(...)
+```
+
+Code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/a17901d927d21a1686c0ac0d1552695f0096b84b
+
+Generate RDF incl. skew, kurtosis etc
+
+```
+./bin/gemma-mdb-to-rdf.rb --header > test.ttl
+time for x in tmp/*.xz ; do
+    ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.ttl
+done
+```
+
+Renders
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10001 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10001;
+      gnt:loco true;
+      gnt:time "2025/08/24 08:22";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10001";
+      gnt:nind 34;
+      gnt:mean 52.1353;
+      gnt:std 4.1758;
+      gnt:skew 0.6619;
+      gnt:kurtosis 0.0523;
+      skos:altLabel "BXD_10001".
+gn:Rsm10000005700_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d;
+      gnt:locus gn:Rsm10000005700;
+      gnt:lodScore 6.2;
+      gnt:af 0.382;
+      gnt:effect 1.626.
+gn:Rs32133186_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d;
+      gnt:locus gn:Rs32133186;
+      gnt:lodScore 6.2;
+      gnt:af 0.382;
+      gnt:effect 1.626.
+...
+```
+
+Funny thing is that the hash values are now all the same because gemma-wrapper no longer includes the trait values. That is a harmless bug that I'll fix for the next run.
+
+The GEMMA run ended up generating 1,576,110 triples. The gemma-mdb-to-rdf script took 42 minutes.
+
+After GEMMA LMM completed its run we set up the HK run which should reflect reaper.
+
+# On bimodality (of trait values)
+
+Kurtosis is not a great predictor of bimodality.
+
+=> https://aldenbradford.com/bimodality.html
+
+Rob says that for the BXD bimodality works best. Maybe annotate with
+
+=> https://skeptric.com/dip-statistic/
+
+We'll skip it for now - I added a task above.
+
+# Combine results
+
+First we upload the data into virtuoso after dropping the old graph. We can do that again, now introducing new subgraphs
+
+```
+rapper -i turtle test.ttl > test.n3
+guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
+SQL> log_enable(3,1);
+SQL> DELETE FROM rdf_quad WHERE g = iri_to_id ('http://pjotr.genenetwork.org');
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o };
+  0
+SQL> ld_dir('/export/data/virtuoso/ttl','test.n3','http://lmm2.genenetwork.org');
+  Done. -- 3 msec.
+# for testing the validity and optional delete problematic ones:
+SQL> SELECT * FROM DB.DBA.load_list;
+SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
+# commit changes
+SQL> rdf_loader_run ();
+SQL> checkpoint;
+Done. -- 16 msec.
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o };
+  1576102
+```
+
+and after HK we are at 6838444 triples for this exercise. Note that you can clean up the load list with
+
+```
+DELETE from DB.DBA.LOAD_LIST;
+```
+
+
+Let's list all the tissues we have with
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
+PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX qb: <http://purl.org/linked-data/cube#>
+PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+
+SELECT * WHERE { ?s rdf:type gnc:tissue . ?s rdfs:label ?o . }
+
+"http://genenetwork.org/id/tissueA1c" "Primary Auditory (A1) Cortex mRNA"
+"http://genenetwork.org/id/tissueAcc" "Anterior Cingulate Cortex mRNA"
+"http://genenetwork.org/id/tissueAdr" "Adrenal Gland mRNA"
+"http://genenetwork.org/id/tissueAmg" "Amygdala mRNA"
+"http://genenetwork.org/id/tissueBebv"  "Lymphoblast B-cell mRNA"
+"http://genenetwork.org/id/tissueBla" "Bladder mRNA"
+(...)
+```
+
+Two other quick queries confirm that our data is loaded correctly. One quick test we would want to do is to see if all reaper hits overlap with GEMMA_HK. That would be a comfort.
+
+The reaper hits are found with
+
+```
+SELECT * WHERE {
+    ?s gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?id;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs;
+         dct:description ?descr.
+} limit 50
+```
+
+The HK hits are defined as
+
+```
+gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedTrait;
+        rdfs:label "GEMMA_BXDPublish output/trait-BXDPublish-1-gemma-GWA-hk.assoc.txt trait HK mapped";
+        gnt:GEMMA_HK true;
+        gnt:belongsToGroup gn:setBxd;
+        gnt:trait gn:publishXRef_1;
+        gnt:time "2025-08-25 10:14:23 +0000";
+        gnt:belongsToGroup gn:setBxd;
+        gnt:name "BXDPublish";
+        gnt:traitId "1";
+        skos:altLabel "BXD_1".
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005699 ;
+       gnt:lodScore 5.6 .
+gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rs47899232 ;
+       gnt:lodScore 5.6 .
+```
+
+So the hits can be listed as
+
+```
+SELECT count(*) WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?gemma gnt:mappedSnp ?id2;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs2.
+    ?id2 gnt:name "BXDPublish" ;
+        gnt:GEMMA_HK true;
+        gnt:traitId ?traitid.
+} limit 5
+```
+
+Unfortunately I made a mistake mapping the SNPs. This should have linked back. So instead of:
+
+```
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+```
+
+I should have generated
+
+```
+gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ;
+
+```
+
+Doh! These SNPs are dangling now. Bit hard to see sometimes with these identifiers. OK, set up another rdf generation run.
+Now I see it shows an error for a few traits, e.g.
+
+```
+./bin/gemma2rdf.rb:74:in "initialize": No such file or directory @ rb_sysopen - ./tmp/trait-BXDPublish-18078-gemma-GWA-hk.assoc.txt (Errno::ENOENT)
+```
+
+For later (again) as the majority is coming through.
+
+```
+SQL> ld_dir('/export/data/virtuoso/ttl','gemma-GWA-hk.ttl','http://hk.genenetwork.org');
+SQL> rdf_loader_run ();
+SQL> SPARQL SELECT count(*) FROM <http://hk.genenetwork.org> WHERE { ?s ?p ?o };
+  5262347
+```
+
+Try again
+
+```
+SELECT count(*) WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?trait gnt:GEMMA_HK true;
+        gnt:traitId ?traitid.
+    # filter(?lrs2 >= 4.0).
+    ?snp gnt:mappedSnp ?trait ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lrs2 .
+}
+"traitid","locus","lrs","lrs2"
+"21188","http://genenetwork.org/id/Rs31400538",2.73982,3.42
+"21194","http://genenetwork.org/id/Rs29514307",3.94845,4.7
+"21199","http://genenetwork.org/id/Rs50530980",2.60066,3.27
+"21203","http://genenetwork.org/id/Rs13483656",2.57406,3.24
+"21205","http://genenetwork.org/id/Rsm10000000057",2.90985,3.6
+"21210","http://genenetwork.org/id/Rsm10000000182",2.67097,3.34
+"21217","http://genenetwork.org/id/Rs29525970",3.80402,4.54
+"21220","http://genenetwork.org/id/Rs46586055",2.50946,3.17
+"21221","http://genenetwork.org/id/Rs47967883",2.54473,3.21
+"21223","http://genenetwork.org/id/Rs29327089",3.94623,4.69
+"21230","http://genenetwork.org/id/Rs30026335",2.78151,3.46
+"21238","http://genenetwork.org/id/Rs32170136",2.83393,3.52
+"21267","http://genenetwork.org/id/Rsm10000000063",2.54818,3.21
+```
+
+counts 9261 overlapping SNPs. So, about 4000 traits are not mapping exactly. Also interesting is that GEMMA HK LRS/LOD is consistently higher than reaper.
+
+Among the non-overlapping traits we find, for example, 10023, which has no significant HK hit. GEMMA_HK simply ignores it, whereas for reaper Bonz included the lodScore of 1.77. If we count the significant hits for reaper (LOD >= 3.0) we find 4541 hits. Out of these, 4506 overlap with GEMMA_HK. That is perfect!
+
+```
+SELECT ?traitid WHERE {
+    ?reaper gnt:belongsToGroup gn:setBxd;
+         gnt:traitId ?traitid;
+         gnt:locus ?locus;
+         gnt:lodScore ?lrs .
+    ?trait gnt:GEMMA_HK true;
+        gnt:traitId ?traitid.
+    filter(?lrs >= 3.0).
+    ?snp gnt:mappedSnp ?trait ;
+        gnt:locus ?locus ;
+        gnt:lodScore ?lrs2 .
+}
+```
+
+Essentially every reaper result is replicated in GEMMA_HK and now we have all SNPs that can be compared against the LMM results.
+
+# On Normality
+
+But first we want to take a look at normality for the datasets, now that we have stored ninds, mean, std, skew and kurtosis. At this stage let's just count datasets. So, out of 13427 GEMMA LMM traits 12416 have more than 16 individuals. When looking at abs(skew)<0.8 we have 7691 fairly normal traits. Adding an abs(kurtosis)<1.0 we have 6289 traits. So about half of them are fairly normal. So if we quantile normalize these vectors it may have some impact. Let that be another task I add above (run gemma with qnorm).
+
+The query was
+
+```
+SELECT count(*) WHERE {
+    ?trait gnt:loco true;
+        gnt:traitId ?traitid;
+        gnt:nind ?nind;
+        gnt:skew ?skew;
+        gnt:kurtosis ?kurtosis.
+    filter(?nind > 16 and abs(?skew) < 0.8 and abs(?kurtosis) < 1.0).
+} LIMIT 40
+```
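+
+The same filter can be applied outside SPARQL when working from the raw trait values. A minimal sketch in Ruby, assuming population (biased) estimators for skewness and excess kurtosis; the notes do not say which estimator produced the stored gnt:skew and gnt:kurtosis values, and `moments` and `fairly_normal?` are hypothetical helper names:
+
```ruby
# Sketch only: moment-based statistics for one trait vector.
# Assumption: population (biased) estimators; the stored RDF values
# may have been computed with a different estimator.
def moments(values)
  n = values.size.to_f
  mean = values.sum / n
  m2 = values.sum { |v| (v - mean)**2 } / n
  m3 = values.sum { |v| (v - mean)**3 } / n
  m4 = values.sum { |v| (v - mean)**4 } / n
  { nind: values.size,
    mean: mean,
    std: Math.sqrt(m2),
    skew: m3 / m2**1.5,
    kurtosis: m4 / m2**2 - 3.0 } # excess kurtosis, 0 for a normal distribution
end

# Thresholds taken from the query above:
# nind > 16, abs(skew) < 0.8, abs(kurtosis) < 1.0
def fairly_normal?(m)
  m[:nind] > 16 && m[:skew].abs < 0.8 && m[:kurtosis].abs < 1.0
end
```
+
+A constant vector gives NaN for skew and kurtosis, so it is reported as not normal rather than raising.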
+
+# Pubmed
+
+As an aside, I made an interesting discovery. Some of the pubmed IDs that I thought were wrong may actually be OK. Maybe Bonz did some screening because his RDF differs from what is in MySQL.
+
+# Preparing for comparison
+
+OK, we are finally at the point where we can compare LMM results with HK (read reaper). This is a 'set analysis' because we want to see what SNPs differ between the two results for every trait and highlight those where peaks are different. We have captured in RDF all the SNPs that are considered (fairly) significant for both LMM and HK.
+
+The easiest way is to capture all SNPs and write the analysis in code. There may be a way to do this in SPARQL but it will take me more time and we'll end with less flexibility. Now there are two main ways to go about it. I can dump a table with all SNPs using SPARQL itself and process the tabular data (this, btw, may be a good input for AI). Another option is to use an RDF library and parse the RDF triples directly (without Virtuoso in the middle). That should allow for quicker processing and also a shorter turnaround if I need to modify RDF (the process of updating, uploading, checking and writing SPARQL queries is quite long). There is one thing in writing software that is very important: you want a quick turnaround, otherwise you are just staring at a prompt ;). So it pays to learn these short cuts. It also allows accessing lmdb files and even SQL if useful.
+Note that we can still use SPARQL to output RDF triples as well. So if we want more powerful filtering and/or to add metadata it will all work.
+
+## Reading RDF
+
+So, I wrote a first script to digest our RDF from GEMMA. The RDF library in Guix is a bit old, so we have to upgrade that in Guix.
+
+For testing I created a small TTL file and converted it to N3 with rapper.
+
+```
+rapper -i turtle test-2000.ttl > test-2000.n3
+```
+
+What we want to do is walk the dataset and harvest SNPs that belong to a run. As a start.
+
+First I needed to add the relevant RDF packages to Guix.
+
+=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=fcbe2919a1e4b168e8ec9ac995a6512360d56ac8
+
+The following code fetches all traits with all SNPs:
+
+```
+  graph = RDF::Graph.load(fn)
+  datasets = graph.query(RDF::Query.new {
+                           pattern [:dataset, RDF.type, GNT.mappedTrait]
+                         })
+  datasets.each { |trait|
+    p "-------"
+    p trait.dataset
+    snps = graph.query(RDF::Query.new {
+                         pattern [ :snp, GNT.mappedSnp, trait.dataset ]
+                       })
+    p snps
+  }
+```
+
+Resulting in
+
+```
+"-------"
+#<RDF::URI:0x9ec0 URI:http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10007_gemma_GWA_7c00f36d>
+[#<RDF::Query::Solution:0x9ed4({:snp=>#<RDF::URI:0x9ee8 URI:http://genenetwork.org/id/Rsm10000005697_BXDPublish_10007_gemma_GWA_7c00f36d>})>]
+```
+
+In the next step we want to run somewhat more sophisticated queries. This thing has SPARQL support with the graph in RAM, but I want to try the native interface first.
+
+The first hurdle was that loading RDF triples is extremely slow. So I wanted to try the RDF Raptor C extension, but that sent me down a temporary Guix rabbit hole because nss-certs moved. Also the raptor gem was ancient, and was showing errors, so I updated to the latest github code.
+
+Anyway guix-bioinformatics was updated to support that. Next I tried loading with raptor and that made the difference. At least the triples are read in minutes rather than hours, but the next step building the large graph takes a lot of time too. This sucks.
+
+Creating and inspecting each statement is fast enough. Statements look like:
+
+```
+#<RDF::Statement:0x7a8(<http://genenetwork.org/id/HK_trait_BXDPublish_10001_gemma_GWA_hk_assoc_txt> <http://genenetwork.org/term/trait> <http://genenetwork.org/id/publishXRef_10001> .)>
+```
+
+So, rather than including all triples, we first filter out the ones we are not interested in, and that speeds things up. That worked until I included all SNPs. Are we delivered here? These libraries may be too slow. Analysing 200K triples took forever. Constructing the graph through an enumerator is a really slow step. The graph query is also slow. But adding the raptor-read triples to an array only took 7s. That makes it pretty clear we should process the 'raw' data directly.
+
+The current script collects all SNPs by GEMMA trait:
+
+```
+time ./bin/rdf-analyse-gemma-hits.rb test.nt
+Parsing test.nt...
+
+real    0m12.314s
+user    0m12.117s
+sys     0m0.196s
+```
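+
+To illustrate what processing the 'raw' data can look like, here is a minimal sketch that skips the graph entirely and collects SNPs per trait straight from N-Triples lines. This is not the code in rdf-analyse-gemma-hits.rb; `snps_by_trait` and the simplified regex are assumptions for illustration, and the predicate URIs assume the gnt: prefix expands to http://genenetwork.org/term/ as in the statement shown earlier:
+
```ruby
require 'set'

GNT = 'http://genenetwork.org/term/'

# Matches only the simple "<s> <p> <o> ." and "<s> <p> literal ." shapes.
# A real N-Triples parser also handles escapes, blank nodes and comments.
TRIPLE = /\A<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*\z/

# Returns { trait_uri => Set[snp_uri, ...] } from an enumerable of lines.
def snps_by_trait(lines)
  trait_of = {} # mapped-locus node => trait node
  locus_of = {} # mapped-locus node => SNP URI
  lines.each do |line|
    m = TRIPLE.match(line.strip)
    next unless m
    subject, predicate, object = m.captures
    uri = object[/\A<([^>]+)>\z/, 1]
    next unless uri # literals such as lodScore are skipped here
    case predicate
    when "#{GNT}mappedSnp" then trait_of[subject] = uri
    when "#{GNT}locus"     then locus_of[subject] = uri
    end
  end
  result = Hash.new { |h, k| h[k] = Set.new }
  trait_of.each do |node, trait|
    result[trait] << locus_of[node] if locus_of.key?(node)
  end
  result
end
```
+
+It can be fed with `snps_by_trait(File.foreach("test.nt"))`, which streams the file line by line instead of building a graph first.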
+
+Next stop: we make it a set and do the same for HK. Then we can do set analysis. The first round is pretty impressive: it looks like trait 10001 has exactly the same SNPs for HK and GEMMA. That is a nice confirmation. Actually 10001 is an interesting test case because in GN you can see HK and GEMMA find different secondary peaks:
+
+=> https://genenetwork.org/show_trait?trait_id=10001&dataset=BXDPublish
+
+At the GEMMA threshold we set (LOD>4.0) all hits are on chr8 and they overlap with HK. Down the line we could look at lower values, but let's stick with this for now.
+
+For 10004 we find some different SNPs. The mapping looks similar in GN:
+
+=> https://genenetwork.org/show_trait?trait_id=10004&dataset=BXDPublish
+
+The difference is:
+
+```
+["10004", #<Set: {#<RDF::URI:0x1a18 URI:http://genenetwork.org/id/Rs47899232>, #<RDF::URI:0x1a54 URI:http://genenetwork.org/id/Rsm10000005699>, #<RDF::URI:0xf78 URI:http://genenetwork.org/id/Rsm10000005700>, #<RDF::URI:0xf3c URI:http://genenetwork.org/id/Rs32133186>, #<RDF::URI:0xf00 URI:http://genenetwork.org/id/Rs32818171>, #<RDF::URI:0xec4 URI:http://genenetwork.org/id/Rsm10000005701>, #<RDF::URI:0xe88 URI:http://genenetwork.org/id/Rsm10000005702>, #<RDF::URI:0xdd4 URI:http://genenetwork.org/id/Rsm10000005703>, #<RDF::URI:0xfb4 URI:http://genenetwork.org/id/Rs33490412>, #<RDF::URI:0xff0 URI:http://genenetwork.org/id/Rs3661882>, #<RDF::URI:0x102c URI:http://genenetwork.org/id/Rsm10000005704>, #<RDF::URI:0x1068 URI:http://genenetwork.org/id/Rs32579649>, #<RDF::URI:0x10a4 URI:http://genenetwork.org/id/Rsm10000005705>}>]
+```
+
+This locus Rs47899232 is not in my test set, so it looks like it is under the threshold. If you look at Chr8 you can see the GEMMA hit shifted somewhat to the right from HK Chr8: 68.799000 to LOCO Chr8: 95.704608. The LOCO hit is also visible in HK, but dropped below significance.
+
+So we can do this analysis now! But just looking at SNPs is going to be laborious. At this stage we are mostly interested in the highest peak and whether it changed. What we need to do is capture regions, i.e. the chromosome positions, and map out if they moved.
+
+In the next phase I am going to take all SNP positions and map their region (+- 10,000 bps). For every trait we'll have a list of *regions* linked to significant hits. If these regions differ then the peaks differ, and we can highlight them.
+
+# Getting SNPs and their positions
+
+To get SNPs and their positions a simple SPARQL query will do. Bonz has created a TTL, e.g.
+
+```
+gn:Rs47899232 rdf:type gnc:Genotype .
+gn:Rs47899232 rdfs:label "rs47899232" .
+gn:Rs47899232 gnt:chr "8" .
+gn:Rs47899232 gnt:mb "95.704608"^^xsd:double .
+gn:Rs47899232 gnt:belongsToSpecies gn:Mus_musculus .
+gn:Rs47899232 gnt:chrNum "0"^^xsd:int .
+gn:Rsm10000005700 rdf:type gnc:Genotype .
+gn:Rsm10000005700 rdfs:label "rsm10000005700" .
+gn:Rsm10000005700 gnt:chr "8" .
+gn:Rsm10000005700 gnt:mb "95.712996"^^xsd:double .
+gn:Rsm10000005700 gnt:belongsToSpecies gn:Mus_musculus .
+gn:Rsm10000005700 gnt:chrNum "0"^^xsd:int .
+```
+
+A few things are a bit puzzling, but what we mostly care about at this stage are the identifier, label, chr and mb. GN, for some reason, tracks mb as a floating point. I don't like that, but it will work for tracking positions. To get a table we use the following query:
+
+```
+SELECT * WHERE {
+    ?snp a gnc:Genotype;
+             gnt:belongsToSpecies gn:Mus_musculus ;
+             rdfs:label ?name ;
+             gnt:chr ?chr ;
+             gnt:mb ?mb .
+
+}
+```
+
+we save that as a TSV and have 120K SNPs formatted like:
+
+```
+"http://genenetwork.org/id/Rs47899232"   "rs47899232"    "8"     95.7046
+```
+
+# Ranges
+
+In the next step we want to define peak ranges. It would be nice to visualize them as a line, e.g. for HK and LOCO:
+
+```
+Chr   1              2             3 ...
+HK    ---X-------------------X-----
+LOCO  ---X----X--------------X-----
+```
+
+That way we can see that a peak appeared on Chr 1. Down the line we can use the same info to compare traits A and B:
+
+```
+Chr   1              2             3 ...
+A     ---X-------------------X-----
+B     ---X-------------------------
+```
+
+where we see some chromosome area is shared. Rob sent me this nice 2008 paper:
+
+=> https://pubmed.ncbi.nlm.nih.gov/19008955/
+
+which states that a remarkably diverse set of traits maps to a region on mouse distal chromosome 1 (Chr 1) that corresponds to human Chr 1q21-q23. This region is highly enriched in quantitative trait loci (QTLs) that control neural and behavioral phenotypes, including motor behavior, escape latency, emotionality, seizure susceptibility (Szs1), and responses to ethanol, caffeine, pentobarbital, and haloperidol.
+
+And we are still doing this research today.
+
+Anyway, for our purposes, for each trait we have a range of SNPs. If they are close to each other they form a 'peak'. What I am going to do is combine the SNPs we are comparing into one set first. Use that to define the ranges (say within 10K BPs). Next we go back to the computed SNPs and figure out what fits a range. We will pick out those ranges that are unique to a trait. But first we'll just visualize.
+
+As this involves some logic we will have to do it in real code (again). First we show how many SNPs we have combined for HK+LOCO and how many differ, e.g.
+
+```
+["10001",  78,  0]
+["10002", 208, 92]
+["10003",  96,  0]
+["10004",  35, 13]
+["10005",  76,  0]
+```
+
+so, for 10001 we have 78 SNPs and the LOCO ones overlap with HK. We showed before that for every set we have the SNP ids.
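+
+These two numbers are plain set operations. A minimal sketch, assuming the second number counts LOCO SNPs that are absent from HK (my reading of the counts, which matches the range sizes shown further down: 22 HK + 13 LOCO = 35 for 10004); `snp_summary` is a hypothetical name:
+
```ruby
require 'set'

# hk and loco are Sets of SNP identifiers for one trait.
# Returns [number of SNPs in the combined set, number of LOCO SNPs not in HK].
def snp_summary(hk, loco)
  [(hk | loco).size, (loco - hk).size]
end
```
+
+With that reading, `snp_summary(hk, loco)[1] == 0` means every LOCO SNP is also an HK SNP.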
+
+For the first time in this exercise I have to write some real new code (before I was just tying together existing work and fixing bugs on the fly). The reason is that we have to track QTL peak ranges by inserting SNP positions. Not only that, we also need to make sure that these ranges do not overlap and build faithfully. For example, the order of adding SNPs matters - we grow a range by adding SNPs on the same chromosome. If a SNP falls out of range (e.g. 10K BPs away) we create a new range. But when another SNP falls in the middle we need to merge them into one range (or peak). This requires some logic and I am creating a new module for it.
+
+The current code creates the following peaks on chr1:
+
+```
+@chromosome={"1"=>[#<QRange 𝚺14 173.339..173.679>, #<QRange 𝚺9 175.615..176.205>, #<QRange 𝚺2 174.541..174.679>, #<QRange 𝚺7 175.437..176.032>, #<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺10 179.862..180.284>, #<QRange 𝚺22 181.476..183.154>, #<QRange 𝚺9 179.916..180.412>, #<QRange 𝚺4 177.555..177.901>, #<QRange 𝚺29 171.749..173.532>, #<QRange 𝚺8 171.172..172.175>]
+```
+
+The sigma tells you how many SNPs are in there. There is some overlap, so I need to fix that. When I set the distance at 50,000 BPS we get too many peaks. We need some other heuristic to decide what is a peak and what not. Probably look at the direction the significance is going. I.e. when it drops and rises again we may have a local peak. Would be nice to track those as separate ranges.
+
+Rob suggested a bin size of 500,000 BPs for the BXD. Let's try that first. This results in an orderly combined LOCO+HK result for trait 10002:
+
+```
+#<QTL::QRanges:0x00007f99f277c840 @chromosome={"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]}>
+```
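+
+One way to sidestep the insertion-order and merge problems is to sort the positions first and sweep once per chromosome. The sketch below is that simpler alternative, not the incremental QRange code in lib/qtlrange.rb. It assumes a SNP joins a peak when it lies within `max_gap` Mb of the peak's current right edge; `Peak` and `build_peaks` are hypothetical names:
+
```ruby
# Hypothetical stand-in for the QRange objects shown above.
Peak = Struct.new(:chr, :min, :max, :count)

# snps: array of [chr, mb] pairs, chr as a String and mb as a Float.
# max_gap is in Mb, so 0.5 corresponds to the 500,000 bp bin size.
# Returns { chr => [Peak, ...] } with peaks in ascending position order.
def build_peaks(snps, max_gap: 0.5)
  peaks = Hash.new { |h, k| h[k] = [] }
  # Sorting by [chr, mb] makes the result independent of input order
  snps.sort.each do |chr, mb|
    last = peaks[chr].last
    if last && mb - last.max <= max_gap
      last.max = mb      # extend the current peak to the right
      last.count += 1
    else
      peaks[chr] << Peak.new(chr, mb, mb, 1) # start a new peak
    end
  end
  peaks
end
```
+
+Because the sweep only ever extends the right-most peak, two peaks on one chromosome can never overlap, so no merge step is needed.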
+
+Next we do this for LOCO and HK separately:
+
+```
+[10002,combined] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]}
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺32 94.4792..97.3382>]}
+["10003", 96, 0]
+["10004", 35, 13]
+[10004,combined] =>{"8"=>[#<QRange 𝚺35 68.7992..97.3516>]}
+[10004,HK]       =>{"8"=>[#<QRange 𝚺22 68.7992..74.9652>]}
+[10004,LOCO]     =>{"8"=>[#<QRange 𝚺13 95.6926..97.3516>]}
+```
+
+Resulting in a new QTL for 10002,LOCO. And with 10004 we see the QTL shift to the right. Nice!
+
+We'll want to track the LOD score too, so let's load that using the RDF file we parse anyway.
+
+```
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+[10004,HK]       =>{"8"=>[#<QRange 𝚺22 68.7992..74.9652 LOD=3.14..3.23>]}
+[10004,LOCO]     =>{"8"=>[#<QRange 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]}
+```
+
+Speaks for itself.
+
+# Analyzing peaks
+
+
+
+Now we have the peaks for different runs (HK and LOCO). We would like to see how many of the traits are affected - gaining or losing or moving peaks. Also, before we introduce the GEMMA values to GN, we would like to assess how many of the peaks are really different.
+
+With the above example we can see that 10002 gained a peak on chr1. With 10004 we see that the peak on chr8 shifted position. These are the things we want to capture. Also we want to bring back some metadata to show what the trait is about. Finally we want to point to the full vector lmdb file, which I forgot to include in the original parsing, though I did include the hash, e.g.
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10001 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10001;
+      gnt:loco true;
+      gnt:time "2025/08/24 08:22";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10001";
+```
+
+I should add
+
+```
+      gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10001-gemma-GWA.tar.xz"
+      gnt:hostname "balg01"
+```
+
+so we can find it back easily.
+
+Next step is to say something about the peaks. Let's enrich our RDF store to show these results. Basically for 10002 we can add RDF statements for
+
+```
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+```
+
+e.g.
+
+```
+gn:qtl00001_LOCO
+    gnt:qtlChr      "1";
+    gnt:qtlStart    72.2551 ;
+    gnt:qtlStop     73.3771 ;
+    gnt:qtlLOD      5.1 ;
+    gnt:SNPs        15 .
+gn:qtl00002_LOCO
+    gnt:qtlChr      "1";
+    gnt:qtlStart    171.172 ;
+    gnt:qtlStop     183.154 ;
+    gnt:qtlLOD      5.3 ;
+    gnt:SNPs        91 ;
+    gnt:qtlOverlaps gn:qtl00001_HK.
+```
+
+This way, in SPARQL, we can query all QTL that are not in HK. For the QTL that are in HK we can also see if they shifted. Actually for SPARQL we don't really need the last statement - it is just a convenience. We will also add the actual SNP identifiers so the SNP counter is not really necessary either (let SPARQL count):
+
+```
+gn:QTL_CHR1_722551_GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d
+    gnt:mappedQTL gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d;
+    rdfs:label     "GEMMA BXDPublish LOCO QTL on 1:722551 trait 10002";
+    gnt:qtlChr     "1";
+    gnt:qtlStart   72.2551 ;
+    gnt:qtlStop    73.3771 ;
+    gnt:qtlLOD     5.1 ;
+    gnt:qtlSNP     gn:Rs13475920_BXDPublish_10002_gemma_GWA_7c00f36d ;
+    gnt:qtlSNP     gn:Rs31428112_BXDPublish_10002_gemma_GWA_7c00f36d ;
+    (...)
+```
+
+I have two things to solve now. First we need to check whether QTLs between the two runs overlap. And then there is a bug in the QTL computation from SNP positions. I am seeing some inconsistencies wrt binning.
+
+The problem I was referring to yesterday turns out to be alright. I thought that when I was using the combined SNPs from HK and LOCO that there was only one peak. But there are two:
+
+```
+[10002,combined] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=..>,       #<QRange 𝚺91 171.172..183.154 LOD=..>]},
+[10002,HK]       =>{"1"=>[                                             #<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>],
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>]
+```
+
+It is interesting to see that HK misses out on one peak completely and the second peak completely overlaps with LOCO (including all SNPs). All good, so far. OK. Let's add some logic to see what peaks match or don't match:
+
+```
+[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+["10002: NO HK match for LOCO Chr 1 QTL!", #<QRange Chr1 𝚺15 72.2551..73.3771 LOD=4.0..5.1>]
+[10004,HK] =>{"8"=>[#<QRange Chr8 𝚺22 68.7992..74.9652 LOD=3.14..3.23>]}
+[10004,LOCO] =>{"8"=>[#<QRange Chr8 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]}
+["10004: NO HK match for LOCO Chr 8 QTL!", #<QRange Chr8 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]
+```
+
+So 10002 correctly says there is a new QTL on chr1 and for 10004 a new QTL on chr8. Now, for 10004 it appears the HK version is in a different location, but I think it suffices to point out 'apparently' new QTL.
+
+Alright, so we can now annotate new/moved QTL! We are going to feed this back into virtuoso by writing RDF as I showed yesterday.
+
+Next step is to say something about the peaks. Let's enrich our RDF store to show these results. Basically for 10002 we add RDF statements for
+
+```
+[10002,HK]       =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO]     =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]}
+```
+
+E.g.
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97
+    gnt:mappedQTL   gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d;
+    rdfs:label      "GEMMA BXDPublish QTL";
+    gnt:qtlChr      "8";
+    gnt:qtlStart    94.4792 ;
+    gnt:qtlStop     97.3382 ;
+    gnt:qtlLOD      4.8 .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rsm10000005689_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rs232396986_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rsm10000005690_BXDPublish_10002_gemma_GWA_7c00f36d .
+(...)
+```
+
+and if it is a new QTL compared to HK we annotate a newly discovered QTL:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_1_72_73 a gnt:newlyDiscoveredQTL .
+gn:GEMMAMapped_LOCO_BXDPublish_10004_gemma_GWA_7c00f36d_8_96_97 a gnt:newlyDiscoveredQTL .
+```
+
+Note we skipped the results that show no SNP changes - I should add them later to give full QTL cover.
+
+Code is here:
+
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rdf-analyse-gemma-hits.rb
+=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/lib/qtlrange.rb
+
+Now we have all the RDF to figure out what traits have new QTL compared to reaper!
+I'll upload them in virtuoso for further analysis.
+
+I want to do a run that shows what traits have changed QTLs.
+Basically the command is
+
+```
+./bin/rdf-analyse-gemma-hits.rb test-hk-2000.ttl test-2000.ttl -o RDF
+```
+
+Let's try to run with the full ttl files. Actually I converted them to n3 because of some error:
+
+```
+rapper --input turtle gemma-GWA.ttl > gemma-GWA.n3
+rapper --input turtle gemma-GWA-hk.ttl > gemma-GWA-hk.n3
+time ./bin/rdf-analyse-gemma-hits.rb gemma-GWA-hk.n3 gemma-GWA.n3 > test.out
+real    3m21.979s
+user    3m21.076s
+sys     0m0.716s
+```
+
+3.5 minutes is fine for testing stuff (if already a little tedious). The first run failed because I have renamed GEMMA_HK to GemmaHK. Another bug I hit was with:
+
+```
+[10009,HK] =>{"15"=>[#<QRange Chr15 𝚺30 25.6987..74.5398 LOD=3.01..3.27>]}
+[10009,LOCO] =>{"10"=>[#<QRange Chr10 𝚺1 76.2484..76.2484 LOD=3.5..3.5>]}
+/export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/lib/qtlrange.rb:126:in `block (2 levels) in qtl_diff': undefined method `each' for nil (NoMethodError)
+```
+
+There are a few more bugs to fix - mostly around empty results, e.g. if a trait had no SNPs. Also HK would render an infinite lodScore (`gnt:lodScore Infinity`) and that reduced the result set. I set a LOD of infinity to 99.0, so at least it'll stand out. Fixing it made the run a lot slower: 12 minutes instead of 3.5! Still OK, for now.
+
+The first run shows 7943 new QTL. Turns out that a bunch of them are non-significant, so we need to filter those. Remember we kept the highest hit, even if significance was low. A quick filter shows that with LMM 2802 traits show new QTLs (out of 13K). Out of those, 1984 traits did not compute a QTL at all with HK. That looks exciting, but we need to validate. Let's take a look at
+
+```
+[10727,HK] =>{}
+[10727,LOCO] =>{"15"=>[#<QRange Chr15 𝚺9 62.3894..63.6584 LOD=4.4..4.4>]}
+["10727: NO HK match for LOCO Chr 15 QTL!", [#<QRange Chr15 𝚺9 62.3894..63.6584 LOD=4.4..4.4>]]
+```
+
+=> https://genenetwork.org/show_trait?trait_id=10727&dataset=BXDPublish
+
+That looks correct to me. Rob, you may want to check. And another:
+
+```
+[51064,HK] =>{"10"=>[#<QRange Chr10 𝚺12 92.3035..108.525 LOD=3.08..4.15>], "19"=>[#<QRange Chr19 𝚺34 8.93047..34.2017 LOD=3.06..3.41>], "3"=>[#<QRange Chr3 𝚺5 138.273..138.581 LOD=3.06..3.06>], "X"=>[#<QRange ChrX 𝚺5 160.766..163.016 LOD=3.48..3.48>]}
+[51064,LOCO] =>{"19"=>[#<QRange Chr19 𝚺37 29.9654..34.2017 LOD=4.3..5.5>]}
+```
+
+=> https://genenetwork.org/show_trait?trait_id=51064&dataset=BXDPublish
+
+Looks correct. With HK we see QTL on Chr 3,10,19 and X. On GN LMM we see a whopper on chr 19, as well as X. I need to see why GEMMA is not finding that X in precompute! Made a note of that too.
+
+# Updating RDF
+
+Now we have QTL output we can upload that to RDF.
+
+To make the traits accessible we need to add some metadata: the description of the trait, publication and authors. All this information can also be used to build a UI.
+
+For this I am going to regenerate the RDF without running gemma again, to make sure it is complete and to mark the new QTL. One change is that if a LOD is infinite we set it to 99.1. The number will stand out. The idea is that when a P-value ends up rounded to zero we can pick it up easily as a conversion. This turns out to be relevant, for example:
+
+```
+gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedTrait;
+        rdfs:label "GEMMA_BXDPublish ./tmp/trait-BXDPublish-13032-gemma-GWA-hk.assoc.txt trait HK mapped";
+        gnt:GEMMA_HK true;
+        gnt:belongsToGroup gn:setBxd;
+        gnt:trait gn:publishXRef_13032;
+        gnt:time "2025-08-27 06:44:45 +0000";
+        gnt:name "BXDPublish";
+        gnt:traitId "13032";
+        skos:altLabel "BXD_13032".
+
+gn:rsm10000005888_HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005888 ;
+       gnt:lodScore Infinity .
+
+gn:rsm10000005889_HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedLocus;
+       gnt:mappedSnp gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt ;
+       gnt:locus gn:Rsm10000005889 ;
+       gnt:lodScore Infinity .
+```
+
+The trait has +1 and -1 values:
+
+=> https://genenetwork.org/show_trait?trait_id=13032&dataset=BXDPublish
+
+HK on GN shows a map, but no result table. Hmmm. The SNPs listed here as Infinity don't really show in GN - and GEMMA finds no hits there. I think, on consideration, since we don't use HK other than for comparison I should just drop these results. It looks dodgy. Aha, in the GEMMA run these actually show up as not a number (NaN), so I should drop them!
+
+```
+chr rs  ps  n_mis n_obs allele1 allele0 af  p_lrt
+9 rsm10000005888  31848339  0 23  X Y 0.348 -nan
+9 rsm10000005864  27578739  0 23  X Y 0.391 1.770379e-10
+```
+
+Funnily enough they are on the same chromosome as the highest ranking hits.
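+
+Dropping the NaN rows and capping infinite LODs can be sketched as follows. Assumptions: the LOD is taken as -log10(p), which the notes do not spell out, and `lod_from_p` is a hypothetical helper. A strict parse is used because Ruby's String#to_f quietly returns zero for text it cannot parse, such as `-nan`, and a zero p-value would then turn into an infinite LOD:
+
```ruby
# Converts the p_lrt column text into a LOD-like score.
# Assumption: LOD = -log10(p); the cap mirrors the 99.1 marker above.
# Returns nil for rows that should be dropped (NaN or unparsable text).
def lod_from_p(p_text, cap: 99.1)
  p_value = Float(p_text, exception: false) # nil for "-nan" and other junk
  return nil if p_value.nil? || p_value.nan? || p_value < 0.0
  return cap if p_value.zero?               # p rounded to zero: mark it
  lod = -Math.log10(p_value)
  lod.infinite? ? cap : lod
end
```
+
+Rows for which this returns nil are skipped, while capped rows stay in the output with a value that stands out.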
+
+Let's generate RDF and look at the differences:
+
+```
+export RDF=gemma-GWA-hk2.ttl
+wrk@balg01 ~/services/gemma-wrapper [env]$ ./bin/gemma2rdf.rb --header > $RDF
+wrk@balg01 ~/services/gemma-wrapper [env]$ for id in $(cat ids.txt) ; do traitfn=trait-BXDPublish-$id-gemma-GWA-hk ; ./bin/gemma2rdf.rb $TMPDIR/$traitfn.assoc.txt >> $RDF ; done
+```
+
+Took 43 min. The diff with the original looks good. Note I don't track origin files for this. Maybe I should, but I don't think we'll really use those. Next, generate the GEMMA LOCO RDF again:
+
+```
+RDF=gemma-GWA.ttl
+wrk@balg01 ~/services/gemma-wrapper [env]$ ./bin/gemma-mdb-to-rdf.rb --header > $RDF
+time for x in tmp/*.xz ; do
+    ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> $RDF
+done
+```
+
+Runs in 50min for 13K traits.
+
+The output now points to the lmdb vector files:
+
+```
++      gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10080-gemma-GWA.tar.xz";
++      gnt:hostname "balg01";
+```
+
+## Digest QTL to RDF
+
+In the next step we want to show the QTL in RDF. First I created a small subset for testing that I can run with
+
+```
+time ./bin/rdf-analyse-gemma-hits.rb test-hk-2000.n3 test-2000.n3
+```
+
+It shows, for example,
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25
+    gnt:mappedQTL   gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d;
+    rdfs:label      "GEMMA BXDPublish QTL";
+    gnt:qtlChr      "4";
+    gnt:qtlStart    24.7356 ;
+    gnt:qtlStop     24.7356 ;
+    gnt:qtlLOD      3.6 .
+gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25 gnt:mappedSnp gn:Rsm10000001919_BXDPublish_10012_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25 a gnt:newQTL .
+```
+
+in other words a QTL with LOD 3.6 and a single SNP that is new compared to the HK output. We want to annotate a bit more, because I want to show the maximum allele frequency contained by the SNPs. That is not too hard as it is contained in the mapped SNP info:
+
+```
+gn:Rsm10000005700_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus;
+      gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d;
+      gnt:locus gn:Rsm10000005700;
+      gnt:lodScore 6.2;
+      gnt:af 0.382;
+      gnt:effect 1.626.
+```
+
+
+With precompute I added allele frequencies to the QTL. So for trait 10002 we get:
+
+```
+[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
+[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
+```
+
+and with RDF:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73
+    gnt:mappedQTL   gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d;
+    rdfs:label      "GEMMA BXDPublish QTL";
+    gnt:qtlChr      "1";
+    gnt:qtlStart    72.2551 ;
+    gnt:qtlStop     73.3771 ;
+    gnt:qtlAF       0.574 ;
+    gnt:qtlLOD      5.1 .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rsm10000000582_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rsm10000000583_BXDPublish_10002_gemma_GWA_7c00f36d .
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rs37034472_BXDPublish_10002_gemma_GWA_7c00f36d .
+...etc...
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 a gnt:newQTL .
+```
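+
+Emitting this Turtle from a computed range is mostly string formatting. A minimal sketch, not the code in rdf-analyse-gemma-hits.rb: `qtl_to_turtle` is a hypothetical name, the label is hard-coded as in the example, and the rounding of start and stop in the identifier is inferred from the examples above (72.2551..73.3771 gives Chr1_72_73, 24.7356 gives 25):
+
```ruby
# trait: node name of the mapped trait without the gn: prefix.
# snps:  node names of the mapped SNPs that fall inside the QTL.
def qtl_to_turtle(trait, chr:, start:, stop:, lod:, af: nil, snps: [], new_qtl: false)
  id = "gn:#{trait}_QTL_Chr#{chr}_#{start.round}_#{stop.round}"
  lines = [id,
           "    gnt:mappedQTL   gn:#{trait};",
           "    rdfs:label      \"GEMMA BXDPublish QTL\";",
           "    gnt:qtlChr      \"#{chr}\";",
           "    gnt:qtlStart    #{start} ;",
           "    gnt:qtlStop     #{stop} ;"]
  lines << "    gnt:qtlAF       #{af} ;" if af
  lines << "    gnt:qtlLOD      #{lod} ."
  snps.each { |snp| lines << "#{id} gnt:mappedSnp gn:#{snp} ." }
  lines << "#{id} a gnt:newQTL ." if new_qtl
  lines.join("\n")
end
```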
+
+Important: we only store LOCO QTL (which we reckon are 'truth'), not the HK QTL. We also marked QTL that are *not* in HK with the gnt:newQTL annotation.
+
+For AF filtering we track this information on the trait:
+
+```
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d a gnt:mappedTrait;
+      rdfs:label "GEMMA BXDPublish trait 10002 mapped with LOCO (defaults)";
+      gnt:trait gn:publishXRef_10002;
+      gnt:loco true;
+      gnt:time "2025/08/24 08:22";
+      gnt:belongsToGroup gn:setBxd;
+      gnt:name "BXDPublish";
+      gnt:traitId "10002";
+      gnt:nind 34;
+      gnt:mean 52.2206;
+      gnt:std 2.9685;
+      gnt:skew -0.1315;
+      gnt:kurtosis 0.0314;
+      skos:altLabel "BXD_10002";
+      gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10002-gemma-GWA.tar.xz";
+      gnt:hostname "balg01";
+      gnt:user "wrk".
+```
+
+So, for the first QTL, an AF of 0.574 corresponds to (1-0.574)*34 = about 14 out of 34 individuals, which is great. When we get to 1 or 2 individuals it may be kinda dodgy. For a dataset this size the AF threshold should be 0.06 (and 0.94). If we have 15 individuals we should be closer to 0.1 (0.9). Anyway, we can compute these on the fly in SPARQL. I would rather show too many false positives.
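+
+The arithmetic can be written down as a small helper. This is a sketch under two assumptions of mine: each individual carries one allele (inbred strains), and a minimum of 2 individuals on the minor side is the cut-off, which gives roughly the 0.06 quoted for 34 individuals. `minor_allele_count` and `af_threshold` are hypothetical names:
+
```ruby
# Approximate number of individuals carrying the less common allele.
def minor_allele_count(af, nind)
  ([af, 1.0 - af].min * nind).round
end

# Lowest AF that still leaves min_count individuals on the minor side.
# The upper bound is simply 1 - af_threshold.
def af_threshold(nind, min_count: 2)
  min_count.to_f / nind
end
```
+
+With 15 individuals the same rule gives 2/15, about 0.13, so `min_count` is a knob to tune rather than a fixed truth.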
+
+Also note that AF is not a problem with our BXD genotyping. Even so, we are going to use pangenome genotypes next and it will be important for that.
+
+Let's do a full QTL compute with
+
+```
+time ./bin/rdf-analyse-gemma-hits.rb gemma-GWA-hk2.n3 gemma-GWA.n3 -o RDF > QTL.rdf
+```
+
+And we should have the queryable mapped QTL we wished for! But some inspection shows:
+
+```
+[10015,HK] =>{"12"=>[#<QRange Chr12 𝚺2 3.2..9.74252 LOD=3.74..3.74>], "2"=>[#<QRange Chr2 𝚺259 4.03246..52.4268 LOD=3.11..16.01>]}
+[10015,LOCO] =>{"2"=>[#<QRange Chr2 𝚺256 4.03246..57.8635 AF=0.542 LOD=4.0..15.2>]}
+["10015: NO HK match, QTL LOCO Chr 2!", #<QRange Chr2 𝚺256 4.03246..57.8635 AF=0.542 LOD=4.0..15.2>]
+```
+
+which is strange because there is overlap on that particular QTL on Chr2! They are obviously the same. A subtle bug. Instead of
+
+```
+-      return true if qtl.min > @min and qtl.max < @max
+-      return true if qtl.min < @min and qtl.max > @min
+-      return true if qtl.min < @max and qtl.max > @max
+```
+
+I now have:
+
+```
++      return true if qtl.min >= @min and qtl.max <= @max # qtl falls within boundaries
++      return true if qtl.min <= @min and qtl.max >= @min # qtl over left boundary
++      return true if qtl.min <= @max and qtl.max >= @max # qtl over right boundary
+
+```
+
+I had to include the boundaries themselves.
+
+Now we also still log false positives with
+
+```
+[10009,HK] =>{"15"=>[#<QRange Chr15 𝚺30 25.6987..74.5398 LOD=3.01..3.27>]}
+[10009,LOCO] =>{"10"=>[#<QRange Chr10 𝚺1 76.2484..76.2484 AF=0.5 LOD=3.5..3.5>]}
+["10009: NO HK results, new QTL(s) LOCO Chr 10!", [#<QRange Chr10 𝚺1 76.2484..76.2484 AF=0.5 LOD=3.5..3.5>]]
+```
+
+Note the LOD score. I should not mark new QTL that are below 4.0. Now we count 2351 new QTL and that is in line with my earlier quick counts.
+
+Note the current script eats RAM because it holds all LOD scores and SNPs in memory. That is fine for our 13K classical traits but will probably not work for millions of traits. It runs in 8 minutes. That is cool too.
+
+# Updating RDF in virtuoso
+
+Similar to what we did before we are going to update Virtuoso on the sparql-test server using the CLI isql commands discussed above.
+
+In August I uploaded:
+
+```
+SELECT * FROM DB.DBA.load_list;
+/export/data/virtuoso/ttl/gemma-GWA-hk.ttl                                     http://hk.genenetwork.org                                                         2           2025.8.27 8:31.57 122123000  2025.8.27 8:32.6 104530000  0           NULL        NULL
+/export/data/virtuoso/ttl/test.n3                                                 http://lmm2.genenetwork.org                                                       2           2025.8.27 6:47.44 947047000  2025.8.27 6:47.49 73865000  0           NULL        NULL
+```
+
+Also, to list all available graphs you can do
+
+```
+SELECT  DISTINCT ?g
+   WHERE  { GRAPH ?g {?s ?p ?o} }
+ORDER BY  ?g
+http://genenetwork.org
+http://hk.genenetwork.org
+http://lmm2.genenetwork.org
+```
+
+The first graph is for all Bonz' RDF. I can now safely delete the other two, to start with a fresh slate.
+Before deleting we count 36584993 triples. Deleting HK leaves 31322646 and then deleting LMM2 leaves 29746544 triples.
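+
+As a sketch, counting and deleting can be done from isql by prefixing SPARQL statements with the SPARQL keyword. These are standard Virtuoso statements, but the exact commands used here were not recorded, so treat them as an assumption:
+
+```
+SPARQL SELECT COUNT(*) WHERE { ?s ?p ?o };
+SPARQL CLEAR GRAPH <http://hk.genenetwork.org>;
+SPARQL CLEAR GRAPH <http://lmm2.genenetwork.org>;
+checkpoint;
+```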
+
+```
+ld_dir('/export/data/virtuoso/ttl','QTL.rdf','http://qtl.genenetwork.org');
+```
+
+Ouch, we got an error. With the proper prefix values and after renaming the file to QTL.ttl it worked, adding 183562 new triples!
+Next we loaded the updated TTL files. HK imported 3196834 triples and LMM imported 1616383, for a total of 34743323 triples. That is less than the previous set because we cleaned out the SNPs that had an infinite LOD.
+
+After a checkpoint, time to SPARQL! This query lists all new QTL with their traits:
+
+```
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+SELECT ?trait, ?chr, ?start, ?stop, ?lod  WHERE {
+   ?qtl gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlStart ?start ;
+          gnt:qtlStop ?stop ;
+          a gnt:newQTL ;
+          gnt:qtlLOD ?lod .
+   ?traitid gnt:traitId ?trait .
+} LIMIT 20
+
+"trait" "chr"   "start" "stop"  "lod"
+"26116" "7"     36.9408 36.9408 4
+"26118" "2"     3.19074 4.29272 4.3
+"26118" "9"     60.6863 64.4059 4.3
+"26126" "17"    71.754  72.1374 4.7
+"26135" "15"    93.3404 94.2523 5.5
+(...)
+```
+
+So we list all traits that have a *NEW* QTL using GEMMA compared to HK. We have a few thousand trait updates that have new QTL. Let's add the number of samples/genometypes, so we can ignore the smaller sets. Or better, count them first. We simplify the query first:
+
+```
+SELECT count(DISTINCT ?trait)  WHERE {
+   ?qtl a gnt:newQTL ;
+          gnt:mappedQTL ?traitid .
+   ?traitid gnt:traitId ?trait ;
+               gnt:nind ?nind.
+} LIMIT 20
+```
+
+This counts 2040 traits with at least one new QTL. When we add FILTER (?nind > 16) we get 2019 traits. So only a tiny minority has fewer individuals and we can skip filtering them.
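+
+For reference, a sketch of that filtered count. It is the query above with one FILTER line added and the LIMIT dropped, since a count returns a single row anyway:
+
+```
+PREFIX gnt: <http://genenetwork.org/term/>
+SELECT count(DISTINCT ?trait)  WHERE {
+   ?qtl a gnt:newQTL ;
+          gnt:mappedQTL ?traitid .
+   ?traitid gnt:traitId ?trait ;
+               gnt:nind ?nind .
+   FILTER (?nind > 16)
+}
+```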
+
+Of course we visited several traits before to see if the QTL were correct. I'll make a list for Rob to check, expanding the trait to a clickable URL:
+
+Let's look for the new QTL.
+
+```
+SELECT ?trait, ?chr, ?start, ?stop, ?lod, ?url  WHERE {
+   ?qtl gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlStart ?start ;
+          gnt:qtlStop ?stop ;
+          a gnt:newQTL ;
+          gnt:qtlLOD ?lod .
+   ?traitid gnt:traitId ?trait .
+   BIND(REPLACE(?trait, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url)
+} LIMIT 20
+
+"trait" "chr"   "start" "stop"  "lod"   "url"
+"26116" "7"     36.9408 36.9408 4       "https://genenetwork.org/show_trait?trait_id=26116&dataset=BXDPublish"
+"26118" "2"     3.19074 4.29272 4.3     "https://genenetwork.org/show_trait?trait_id=26118&dataset=BXDPublish"
+"26118" "9"     60.6863 64.4059 4.3     "https://genenetwork.org/show_trait?trait_id=26118&dataset=BXDPublish"
+"26126" "17"    71.754  72.1374 4.7     "https://genenetwork.org/show_trait?trait_id=26126&dataset=BXDPublish"
+"26135" "15"    93.3404 94.2523 5.5     "https://genenetwork.org/show_trait?trait_id=26135&dataset=BXDPublish"
+```
+
+Now when I click the link for 26118 I can run HK and GEMMA and I can confirm we have a new result on CHR2 and CHR9.
+Very cool. Now we want to show the trait info and authors, so we can see who we want to approach with this new information.
+
+Now in the phenotype RDF we have
+
+```
+gn:traitBxd_10001 rdf:type gnc:Phenotype .
+gn:traitBxd_10001 gnt:belongsToGroup gn:setBxd .
+gn:traitBxd_10001 gnt:traitId "10001" .
+gn:traitBxd_10001 dct:description "Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]" .
+gn:traitBxd_10001 gnt:submitter "robwilliams" .
+gn:traitBxd_10001 dct:isReferencedBy pubmed:11438585 .
+```
+
+The submitter is mostly one of the GN team. The pubmed id may help find the authors. Bonz RDF'd it as
+
+```
+pubmed:11438585 rdf:type fabio:ResearchPaper .
+pubmed:11438585 fabio:hasPubMedId pubmed:11438585 .
+pubmed:11438585 dct:title "Genetic control of the mouse cerebellum: identification of quantitative trait loci modulating size and architecture" .
+pubmed:11438585 fabio:Journal "J Neuroscience" .
+pubmed:11438585 prism:volume "21" .
+pubmed:11438585 fabio:page "5099-5109" .
+pubmed:11438585 fabio:hasPublicationYear "2001"^^xsd:gYear .
+pubmed:11438585 dct:creator "Airey DC" .
+pubmed:11438585 dct:creator "Lu L" .
+pubmed:11438585 dct:creator "Williams RW" .
+```
+
+So we can fetch that when it is available. You can run the query here:
+
+=> http://sparql-test.genenetwork.org/sparql/
+
+Just copy paste:
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX gnc: <http://genenetwork.org/category/>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX fabio: <http://purl.org/spar/fabio/>
+
+SELECT ?trait, ?chr, ?start, ?stop, ?lod, ?year, ?submitter, SAMPLE(?author) as ?one_author, ?url, ?descr  WHERE {
+   ?qtl gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlStart ?start ;
+          gnt:qtlStop ?stop ;
+          a gnt:newQTL ;
+          gnt:qtlLOD ?lod .
+   ?traitid gnt:traitId ?trait .
+   OPTIONAL { ?phenoid gnt:traitId ?trait ;
+          a gnc:Phenotype ;
+          gnt:belongsToGroup gn:setBxd ;
+          gnt:submitter ?submitter ;
+          dct:description ?descr ;
+          dct:isReferencedBy ?pubid . } .
+         ?pubid dct:creator ?author ;
+                     fabio:hasPublicationYear ?pubyear .
+   BIND(concat(str(?pubyear)) as ?year)
+   BIND(REPLACE(?trait, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url)
+} ORDER by ?trait
+LIMIT 100
+"10002" "1" 72.2551 73.3771 5.1 "2001"  "robwilliams" "Lu L"  "https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish"  "Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]"
+"10004" "8" 95.6926 97.3516 4.6 "2001"  "robwilliams" "Lu L"  "https://genenetwork.org/show_trait?trait_id=10004&dataset=BXDPublish"  "Central nervous system, morphology: Cerebellum volume [mm3]"
+"10013" "2" 160.117 160.304 4.8 "1996"  "robwilliams" "Alexander RC"  "https://genenetwork.org/show_trait?trait_id=10013&dataset=BXDPublish"  "Central nervous system, behavior: Saline control response 0.9% ip, locomotor activity from 0-60 min after injection just prior to injection of 5 mg/kg amphetamine [cm]"
+(...)
+```
+
+
+Currently authors are not 'ranked' in RDF, so I pick a random one. I can add ranking later, so we get the first author. We also have the option to fetch all traits that, for example, involve Dave Ashbrook.
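+
+To sketch that last option: the query below lists traits with a new QTL where one of the creators matches a name. It assumes authors are stored as plain strings in "Surname Initials" form, as in the pubmed example above, so it matches on the surname only.
+
+```
+PREFIX dct: <http://purl.org/dc/terms/>
+PREFIX gn: <http://genenetwork.org/id/>
+PREFIX gnt: <http://genenetwork.org/term/>
+PREFIX gnc: <http://genenetwork.org/category/>
+
+SELECT DISTINCT ?trait, ?author, ?descr  WHERE {
+   ?qtl a gnt:newQTL ;
+          gnt:mappedQTL ?traitid .
+   ?traitid gnt:traitId ?trait .
+   ?phenoid gnt:traitId ?trait ;
+          a gnc:Phenotype ;
+          gnt:belongsToGroup gn:setBxd ;
+          dct:description ?descr ;
+          dct:isReferencedBy ?pubid .
+   ?pubid dct:creator ?author .
+   # surname match only; author strings look like "Lu L"
+   FILTER (CONTAINS(STR(?author), "Ashbrook"))
+} ORDER BY ?trait
+```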
+
+We can also look for details like skewness by adding
+
+```
+     ?traitid gnt:traitId ?trait  ;
+            gnt:skew ?skew .
+```
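+
+Put together, a sketch of the new-QTL listing with skewness included. The cut-off of 1.0 in the FILTER is an arbitrary value chosen for illustration, not something we settled on.
+
+```
+PREFIX gnt: <http://genenetwork.org/term/>
+SELECT ?trait, ?chr, ?lod, ?skew  WHERE {
+   ?qtl a gnt:newQTL ;
+          gnt:mappedQTL ?traitid ;
+          gnt:qtlChr ?chr ;
+          gnt:qtlLOD ?lod .
+   ?traitid gnt:traitId ?trait ;
+          gnt:skew ?skew .
+   FILTER (ABS(?skew) > 1.0)   # hypothetical cut-off, for illustration only
+} ORDER BY DESC(ABS(?skew))
+LIMIT 20
+```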
+
+# Testing pangenome derived genotypes
+
+We continue testing new genotypes in this document:
+
+=> ../genetics/test-pangenome-derived-genotypes
+
+# Introducing epochs
+
+See
+
+=> topics/data/epochs
diff --git a/topics/systems/security.gmi b/topics/systems/security.gmi
new file mode 100644
index 0000000..a7192d4
--- /dev/null
+++ b/topics/systems/security.gmi
@@ -0,0 +1,61 @@
+# Security
+
+We secure our systems by running recent stable versions of Linux distributions. We are also minimalistic in what we install and run, and web services typically run in Guix system containers (a kind of lightweight Docker setup).
+
+# ssh
+
+Secure shell is very important. First we disable password logins. We use keys only. We also restrict logins with AllowUsers because that makes it easy to add and remove users.
+
+```
+--- a/ssh/sshd_config
++++ b/ssh/sshd_config
+@@ -54,7 +54,7 @@ Include /etc/ssh/sshd_config.d/*.conf
+ #IgnoreRhosts yes
+
+ # To disable tunneled clear text passwords, change to no here!
+-#PasswordAuthentication yes
++PasswordAuthentication no
+ #PermitEmptyPasswords no
+
++AllowUsers marco daniel ...
+```
+
+Note that keys should be protected with a passphrase.
+
+# Firewalling
+
+We typically use the monitored CISCO firewalling UTHSC provides. In addition we use nftables, e.g. in /etc/nftables.conf
+
+```
+table inet filter {
+        set udp_accepted {
+                type inet_service
+                flags interval
+                elements = { 60000-61000 } # for mosh
+        }
+        chain input {
+                type filter hook input priority filter; policy drop;
+                ct state { established, related } accept
+                iifname "lo" ip saddr != 127.0.0.0/8 drop # must come before the accept below, otherwise it is never reached
+                iifname "lo" accept
+                tcp dport ssh limit rate 5/minute accept
+                tcp dport { http, https } accept
+                tcp dport mysql ip saddr { list of ips } accept
+                udp dport @udp_accepted accept
+                reject with icmp port-unreachable
+        }
+        chain forward {
+                type filter hook forward priority filter; policy accept;
+        }
+        chain output {
+                type filter hook output priority filter; policy accept;
+        }
+}
+```
+
+Enable this with
+
+```
+systemctl enable nftables
+nft list ruleset
+```
diff --git a/topics/systems/update-production-checklist.gmi b/topics/systems/update-production-checklist.gmi
index b17077b..2cb0761 100644
--- a/topics/systems/update-production-checklist.gmi
+++ b/topics/systems/update-production-checklist.gmi
@@ -1,23 +1,28 @@
 # Update production checklist
 
+The last migration round is the move to tux03 (Sept 2025)!
 
 # Tasks
 
 * [X] Install underlying Debian
 * [X] Get guix going
-* [ ] Check database
-* [ ] Check gemma working
-* [ ] Check global search
-* [ ] Check authentication
-* [ ] Check sending E-mails
-* [ ] Make sure info.genenetwork.org can reach the DB
+* [X] Check database settings
+* [X] Check gemma working
+* [X] Check global search
+* [X] Check authentication
+* [X] Check sending E-mails
+* [X] Check SPARQL
+* [X] Make sure info.genenetwork.org and 'space' can reach the DB
 * [ ] Backups
 
 The following are at the system level
 
-* [ ] Make journalctl presistent
-* [ ] Update certificates in CRON
-* [ ] Run trim in CRON
+* [X] Firewalling and other security measures (sshd)
+* [X] Check tmpdirs (cleanup?)
+* [X] Make sure journalctl is persistent (check for reboots)
+* [X] Update certificates in CRON (no longer if not part of Guix)
+* [X] Run trim in CRON
+* [ ] Monitors (sheepdog)
 
 # Install underlying Debian
 
@@ -27,7 +32,9 @@ For our production systems we use Debian as a base install. Once installed:
 * [X] add ttyS0 support for grub and kernel - so out-of-band works
 * [X] start ssh server and configure not to use with passwords
 * [X] start nginx and check external networking
-* [ ] set up E-mail routing
+* [X] mount old root
+* [X] Clean up /etc/profile (remove global profile.d loading)
+* [X] set up E-mail routing
 
 It may help to mount the old root if you have it. Now it is on
 
@@ -37,25 +44,30 @@ mount /dev/sdd2 /mnt/old-root/
 
 # Get Guix going
 
-* [X] Install Guix daemon
+* [X] Mount bind /gnu on a large partition
 * [X] Move /gnu/store to larger partition
-* [X] Update Guix daemon and setup in systemd
+* [X] Install Guix daemon
+* [X] Update Guix daemon and setup in systemd (if necessary)
 * [X] Make available in /usr/local/guix-profiles
-* [X] Clean up /etc/profile
 
-We can bootstrap with the Debian guix package. Next move the store to a large partion and hard mount it in /etc/fstab with
+Next move the /gnu store to a large partition and hard mount it in /etc/fstab with
 
 ```
 /export2/gnu /gnu none defaults,bind 0 0
 ```
 
+We can bootstrap with the Debian guix package (though I prefer the guix-install.sh script these days, mostly because it is more modern).
+
+=> https://guix.gnu.org/manual/en/html_node/Binary-Installation.html
+
+
 Run guix pull
 
 ```
-wrk@tux04:~$ guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix-mirror.git
+guix pull --url=https://codeberg.org/guix/guix  -p ~/opt/guix-pull
 ```
 
-Use that to install guix in /usr/local/guix-profiles
+Use that also to install guix in /usr/local/guix-profiles
 
 ```
 guix package -i guix -p /usr/local/guix-profiles/guix
@@ -68,9 +80,9 @@ The Debian installer configures guix. I tend to remove the profiles from /etc/pr
 # Check database
 
 * [X] Install mariadb
-* [ ] Recover database
-* [ ] Test permissions
-* [ ] Mariadb update my.cnf
+* [X] Recover database
+* [X] Test permissions
+* [X] Mariadb update my.cnf
 
 Basically recover the database from a backup is the best start and set permissions. We usually take the default mariadb unless production is already on a newer version - so we move to guix deployment.
 
@@ -114,12 +126,12 @@ We use borg for backups. First restore the backup on the PCIe. Also a test for o
 The swaks package is quite useful to test for a valid receive host:
 
 ```
-swaks --to testing-my-server@gmail.com --server smtp.uthsc.edu
-=== Trying smtp.uthsc.edu:25...
-=== Connected to smtp.uthsc.edu.
-<-  220 mailrouter8.uthsc.edu ESMTP NO UCE
- -> EHLO tux04.uthsc.edu
-<-  250-mailrouter8.uthsc.edu
+swaks --to testing-my-server@gmail.com --server smtp.network
+=== Trying smtp.network:25...
+=== Connected to smtp.network.
+<-  220 mailrouter8.network ESMTP NO UCE
+ -> EHLO tux04.network
+<-  250-mailrouter8.network
 <-  250-PIPELINING
 <-  250-SIZE 26214400
 <-  250-VRFY
@@ -129,7 +141,7 @@ swaks --to testing-my-server@gmail.com --server smtp.uthsc.edu
 <-  250-8BITMIME
 <-  250-DSN
 <-  250 SMTPUTF8
- -> MAIL FROM:<root@tux04.uthsc.edu>
+ -> MAIL FROM:<root@tux04.network>
 <-  250 2.1.0 Ok
  -> RCPT TO:<pjotr2020@thebird.nl>
 <-  250 2.1.5 Ok
@@ -137,9 +149,9 @@ swaks --to testing-my-server@gmail.com --server smtp.uthsc.edu
 <-  354 End data with <CR><LF>.<CR><LF>
  -> Date: Thu, 06 Mar 2025 08:34:24 +0000
  -> To: pjotr2020@thebird.nl
- -> From: root@tux04.uthsc.edu
+ -> From: root@tux04.network
  -> Subject: test Thu, 06 Mar 2025 08:34:24 +0000
- -> Message-Id: <20250306083424.624509@tux04.uthsc.edu>
+ -> Message-Id: <20250306083424.624509@tux04.network>
  -> X-Mailer: swaks v20201014.0 jetmore.org/john/code/swaks/
  ->
  -> This is a test mailing
@@ -161,7 +173,7 @@ dc_readhost=''
 dc_relay_domains=''
 dc_minimaldns='false'
 dc_relay_nets=''
-dc_smarthost='smtp.uthsc.edu'
+dc_smarthost='smtp.network'
 CFILEMODE='644'
 dc_use_split_config='false'
 dc_hide_mailname='false'
@@ -172,7 +184,7 @@ dc_localdelivery='maildir_home'
 And this should work:
 
 ```
-swaks --to myemailaddress --from john@uthsc.edu --server localhost
+swaks --to myemailaddress --from john@network --server localhost
 ```
 
 # Backups
@@ -180,3 +192,6 @@ swaks --to myemailaddress --from john@uthsc.edu --server localhost
 * [ ] Create an ibackup user.
 * [ ] Install borg (usually guix version)
 * [ ] Create a borg passphrase
+
+=> topics/systems/backups-with-borg.gmi
+=> topics/systems/backup-drops.gmi
diff --git a/topics/systems/virtuoso.gmi b/topics/systems/virtuoso.gmi
index 94a15f0..bd7424a 100644
--- a/topics/systems/virtuoso.gmi
+++ b/topics/systems/virtuoso.gmi
@@ -8,6 +8,10 @@ We run instances of virtuoso for our graph databases. Virtuoso is remarkable sof
 ## Running virtuoso
 ### Running virtuoso in a guix system container
 
+See also
+
+=> ../deploy/our-virtuoso-instances
+
 We have a Guix virtuoso service in the guix-bioinformatics channel. The easiest way to run virtuoso is to use the virtuoso service to run it in a guix system container. The only downside of this method is that, since guix system containers require root privileges to start up, you will need root priviliges on the machine you are running this on.
 
 Here is a basic guix system configuration that runs virtuoso listening on port 8891, and with its HTTP server listening on port 8892. Among other things, the HTTP server provides a SPARQL endpoint to interact with.
@@ -109,6 +113,11 @@ SQL> UPDATE ws.ws.sys_dav_user SET u_account_disabled=1 WHERE u_name='dav';
 SQL> CHECKPOINT;
 ```
 
+We now store the passwords in secrets:
+
+*  CI/CD: /export2/guix-containers/genenetwork-development/etc/genenetwork/conf/gn3/secrets.py
+*  Production: /export/guix-containers/genenetwork/etc/genenetwork/genenetwork3/gn3-secrets.py
+
 ## Loading data into virtuoso
 
 Virtuoso supports at least three different ways to load RDF.
@@ -151,6 +160,19 @@ Start isql with something like
 guix shell --expose=verified-data=/var/lib/data virtuoso-ose -- isql -U dba -P password 8981
 ```
 
+The password is in the container secrets file.
+Inside a container, you can also do
+
+```
+root@tux04 ~# /gnu/store/9d81kdw2frn6b3fwqphsmkssc9zblir1-virtuoso-ose-7.2.11/bin/isql -u dba -P password -S 8981
+OpenLink Virtuoso Interactive SQL (Virtuoso)
+Version 07.20.3238 as of Jan  1 1970
+Type HELP; for help and EXIT; to exit.
+
+*** Error 28000: [Virtuoso Driver]CL034: Bad login
+
+```
+
 To delete a graph:
 
 ```
@@ -166,6 +188,18 @@ rdf_loader_run();
 checkpoint;
 ```
 
+You may not have permission to load from the directory. Check
+
+```
+select virtuoso_ini_path();
+```
+
+The ini file should list the relevant directory under DirsAllowed:
+
+```
+DirsAllowed=/dir
+```
+
 => http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksGuideDeleteLargeGraphs How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store?
 
 When virtuoso has just been started up with a clean state (that is, the virtuoso state directory was empty before virtuoso started), uploading large amounts of data using the SPARQL 1.1 Graph Store HTTP Protocol fails the first time. It succeeds only the second time. It is not clear why. I can only recommend retrying as in this commit:
@@ -274,3 +308,7 @@ To dump data into a ttl file, first make sure that you are in the guix environme
 => https://github.com/genenetwork/dump-genenetwork-database/ Dump Genenetwork Database
 
 See the README for instructions.
+
+For the public GN endpoint visit
+
+=> https://sparql.genenetwork.org/sparql/