260 files changed, 23808 insertions, 569 deletions
diff --git a/.gitignore b/.gitignore index 13d6d76..8a5b167 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,5 @@ /**/*~ index.gmi tracker.gmi +.aider* +.tissue/**/* diff --git a/issues/CI-CD/cd-is-slow.gmi b/issues/CI-CD/cd-is-slow.gmi new file mode 100644 index 0000000..9b0e1ee --- /dev/null +++ b/issues/CI-CD/cd-is-slow.gmi @@ -0,0 +1,276 @@ +# CD is slow + +The pages are slow and some are broken. + +We found out that there are quite a few network calls using DNS - and DNS was slow. The configured DNS server was not responding. Using Google's DNS made things go fast again. We will probably introduce dnsmasq in the container to make things even faster. + +# Tags + +* type: bug +* status: in progress +* priority: high +* assigned: pjotrp +* interested: pjotrp, bonfacem +* keywords: deployment, server + +# Tasks + +* [ ] Use dnsmasq caching - it is a guix system service +* [ ] Run fewer gunicorn processes on CD (2 should do) +* [ ] Increase debugging output for GN2 +* [ ] Fix GN3 hook for github (it is not working) +* [X] gn-guile lacks certificates it can use for sparql + +# Measuring + +bonfacekilz: +I'm currently instrumenting the requests. See what hogs up time. Loading the landing page takes up 32 seconds! + +Something's off. 
From outside the container: + +``` +bonfacem@tux02 ~ $ guix shell python-wrapper python-requests -- python time.py +Status: 200 +Time taken: 32.989222288131714 seconds +``` + +From inside the container: + +``` +2025-07-18 14:46:36 INFO:gn2.wqflask:Landing page rendered in 8.12 seconds +``` + +And I see: + +## CD + +``` +> curl -w @- -o /dev/null -s https://cd.genenetwork.org <<EOF +\n +DNS lookup: %{time_namelookup}s\n +Connect time: %{time_connect}s\n +TLS handshake: %{time_appconnect}s\n +Pre-transfer: %{time_pretransfer}s\n +Start transfer: %{time_starttransfer}s\n +Total time: %{time_total}s\n +EOF + +DNS lookup: 8.117543s +Connect time: 8.117757s +TLS handshake: 8.197767s +Pre-transfer: 8.197861s +Start transfer: 33.096467s +Total time: 33.096601s +``` + +## Production +``` +> curl -w @- -o /dev/null -s https://genenetwork.org <<EOF +\n +DNS lookup: %{time_namelookup}s\n +Connect time: %{time_connect}s\n +TLS handshake: %{time_appconnect}s\n +Pre-transfer: %{time_pretransfer}s\n +Start transfer: %{time_starttransfer}s\n +Total time: %{time_total}s\n +EOF + +DNS lookup: 8.075794s +Connect time: 8.076402s +TLS handshake: 8.147322s +Pre-transfer: 8.147370s +Start transfer: 8.797107s +Total time: 8.797299s +``` + +## On tux02 (outside CD container) + +``` +> curl -w @- -o /dev/null -s http://localhost:9092 <<EOF +\n +DNS lookup: %{time_namelookup}s\n +Connect time: %{time_connect}s\n +TLS handshake: %{time_appconnect}s\n +Pre-transfer: %{time_pretransfer}s\n +Start transfer: %{time_starttransfer}s\n +Total time: %{time_total}s\n +EOF + +DNS lookup: 0.000068s +Connect time: 0.000543s +TLS handshake: 0.000000s +Pre-transfer: 0.000606s +Start transfer: 24.851069s +Total time: 24.851166s +``` + +This does not look like an nginx problem (at least on tux02 itself). Also the nginx configuration was not really changed. +The mysql configuration ditto. I can still test both, but it looks like the problem is inside the system container. 
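The three curl measurements above are cumulative timestamps, which makes them hard to compare at a glance. The following is a minimal sketch, not part of the original investigation, that turns them into per-phase durations. The function name `phase_breakdown` and the dictionary keys are assumptions chosen for illustration; the numbers are copied from the curl output above.

```python
def phase_breakdown(t):
    """Turn cumulative curl timings (in seconds) into per-phase durations."""
    return {
        "dns": t["namelookup"],
        "tcp_connect": t["connect"] - t["namelookup"],
        # time_appconnect is reported as 0 for plain HTTP, so clamp at zero.
        "tls": max(t["appconnect"] - t["connect"], 0.0),
        # Time between the request being ready to send and the first byte back.
        "server_wait": t["starttransfer"] - t["pretransfer"],
        "download": t["total"] - t["starttransfer"],
    }


# Values copied from the three curl runs above.
MEASUREMENTS = {
    "cd": {
        "namelookup": 8.117543, "connect": 8.117757, "appconnect": 8.197767,
        "pretransfer": 8.197861, "starttransfer": 33.096467, "total": 33.096601,
    },
    "production": {
        "namelookup": 8.075794, "connect": 8.076402, "appconnect": 8.147322,
        "pretransfer": 8.147370, "starttransfer": 8.797107, "total": 8.797299,
    },
    "tux02-localhost": {
        "namelookup": 0.000068, "connect": 0.000543, "appconnect": 0.0,
        "pretransfer": 0.000606, "starttransfer": 24.851069, "total": 24.851166,
    },
}

for name, timings in MEASUREMENTS.items():
    phases = phase_breakdown(timings)
    print(name, {phase: round(value, 3) for phase, value in phases.items()})
```

Read this way, the DNS lookup costs roughly 8s on both public hosts, while the wait for the first byte is roughly 24.9s on CD against roughly 0.65s on production. The localhost run shows the same long wait without any DNS cost, which is what points away from nginx.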
+ +The container logs are at + +``` +root@tux02:/export2/guix-containers/genenetwork-development/var/log/cd# tail -100 genenetwork2.log +``` + +Some interesting errors there that need resolving, such as + +## gn-guile error + +``` +tail gn-guile.log +2025-07-20 04:49:49 X.509 certificate of 'sparql.genenetwork.org' could not be verified: +2025-07-20 04:49:49 signer-not-found invalid +``` + +Guile is not finding the certificates for our virtuoso server. It does work with curl, try + +``` +curl -G https://query.wikidata.org/sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="SELECT DISTINCT * where { + wd:Q158695 wdt:P225 ?o . +} limit 5" +{ + "head" : { + "vars" : [ "o" ] }, "results" : { "bindings" : [ { "o" : { + "type" : "literal", + "value" : "Arabidopsis thaliana" + } + } ] + } +``` + +Also inside the container: + +``` +curl http://localhost:8091/gene/aliases/Shh +``` + +renders the same error! X.509 certificate of 'query.wikidata.org' could not be verified. so it is a gn-guile issue. + +## GN2 error reporting + +Also there are too many gunicorn processes - and strikingly - no debug output. Also I see a missing robots.txt file (even though LLMs hardly honour them). + +Let's try to get inside the container with nsenter: + +``` +ps xau|grep genenetwork-development-container +root 115940 0.0 0.0 163692 26296 ? 
Ssl Jul18 0:00 /gnu/store/ylwk2vn18dkzkj0nxq2h4vjzhz17bm7c-guile-3.0.9/bin/guile --no-auto-compile /usr/local/bin/genenetwork-development-container +pgrep -P 115940 +115961 +``` + +Use this child PID and a recent nsenter: + +``` +/gnu/store/w7a3frdmffpw3hvxpvvxwxgzfhyqdm6n-profile/bin/nsenter -m -p -t 115961 /run/current-system/profile/bin/bash -login +``` + +System tools are in '/run/current-system/profile/bin/' + +Make it a one-liner with + +``` +/gnu/store/w7a3frdmffpw3hvxpvvxwxgzfhyqdm6n-profile/bin/nsenter -m -p -t $(pgrep -P `ps xau|grep genenetwork-development-container|awk '{print $2}'|sort -r|head -1`) /run/current-system/profile/bin/bash -login +``` + +Once inside we can pick up curl (I note the system container has full access to the /gnu/store on the host): + +``` +root@tux02 /# /gnu/store/vdaspmq10c3zmqhp38lfqy812w6r4xg3-curl-8.6.0/bin/curl -w @- -o /dev/null -s http://localhost:9092 <<EOF +\n +DNS lookup: %{time_namelookup}s\n +Connect time: %{time_connect}s\n +TLS handshake: %{time_appconnect}s\n +Pre-transfer: %{time_pretransfer}s\n +Start transfer: %{time_starttransfer}s\n +Total time: %{time_total}s\n +EOF + +DNS lookup: 0.000064s +Connect time: 0.000478s +TLS handshake: 0.000000s +Pre-transfer: 0.000551s +Start transfer: 24.792926s +Total time: 24.793015s +``` + +That rules out container and nginx streaming issues. + +So the problem is with GN and its DBs. The gn-machines repo is used from /home/aruni and its checkout is from March. Has CD been slow since then? I don't think so. Also the changes to the actual scripts are even older. Also the guix-bioinformatics repo shows no changes. 
Remaining culprits I suspect are: + +* [*] MySQL +* [ ] Interaction gn-auth with gn2 +* [ ] Interaction gnqa with gn2 + +Running a standard test on mysql shows it is fine: + +``` +time mysql -u webqtlout -pwebqtlout db_webqtl < $rundir/../shared/sql/test02.sql +Name FullName Name Symbol CAST(ProbeSet."description" AS BINARY) CAST(ProbeSet."Probe_Target_Description" AS BINARY) Chr Mb Mean LRS Locus pValue additive geno_chr geno_mb +HC_M2_0606_P Hippocampus Consortium M430v2 (Jun06) PDNN 1457545_at 9530036O11Rik long non-coding RNA, expressed sequence tag (EST) AK035474 with high bladder expression antisense EST 14 Kb upstream of Shh 5 28.480441 6.7419292929293 15.2845189682605 rsm10000001525 0.055 0.0434848484848485 3 9.671673 +HC_M2_0606_P Hippocampus Consortium M430v2 (Jun06) PDNN 1427571_at Shh sonic hedgehog (hedgehog) last exon 5 28.457886 6.50113131313131 9.58158655605723 rs8253327 0.697 0.0494097096188748 1 191.908118 +HC_M2_0606_P Hippocampus Consortium M430v2 (Jun06) PDNN 1436869_at Shh sonic hedgehog (hedgehog) mid distal 3' UTR 5 28.457155 9.279090909090911 12.7711275309832 rs8253327 0.306 -0.214087568058076 1 191.908118 + +real 0m0.010s +user 0m0.004s +sys 0m0.000s +``` + +# Profiling CD + +Ran a profiler against a traits page. See the following: + +=> /issues/CI-CD/profiling-flask + +## Results/Interpretation + +* By fixing gn-guile and gene-alias resolution, times dropped by ~10s. However, the page takes 37.9s to run. + +* Resolving a DNS name takes around 4.585s per call. We make 7 requests, totalling 32.09s. Typically, a traits page should take 8.79s. The difference: (- 37.9 32.09) = 5.8s; so the DNS lookups explain the slowness: + +``` + ncall tottime percall cumtime percall filename:lineno(function) +---------------------------------------------------------------------------- + 7 0.00002618 3.741e-05 32.09 4.585 socket.py:938(getaddrinfo) +``` + +* The above is consistent with all the analysis I've done across all the profile dumps. 
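The `getaddrinfo` cost in the profile above can also be checked in isolation, without going through Flask. This is a minimal sketch under stated assumptions: the helper name `time_lookups` and its `resolver` parameter are made up for illustration (the parameter exists only so the function can be exercised without a network), and it only uses the standard `socket.getaddrinfo` call that the profiler reported.

```python
import socket
import time


def time_lookups(host, port=443, repeats=3, resolver=socket.getaddrinfo):
    """Return the wall-clock duration, in seconds, of each lookup of `host`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            resolver(host, port)
        except OSError:
            # socket.gaierror is a subclass of OSError; a failed lookup
            # still costs time, so its duration is recorded as well.
            pass
        durations.append(time.perf_counter() - start)
    return durations
```

Running something like `time_lookups("cd.genenetwork.org")` inside the container and comparing it with `time_lookups("localhost")` should show whether name resolution alone accounts for seconds per call.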
+ +* Testing my theory out: + +``` +@app.route("/test-network") +def test_network(): + start = time.time() + http_url = urljoin( + current_app.config["GN_SERVER_URL"], + "version" + ) + result = requests.get(http_url) + duration = time.time() - start + app.logger.error(f"{http_url}: {duration:.4f}s") + + start = time.time() + local_url = "http://localhost:9093/api/version" + result = requests.get(local_url) + duration = time.time() - start + app.logger.error(f"{local_url}: {duration:.4f}s") + return result.json() +``` + +* Results: + +``` +2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] https://cd.genenetwork.org/api3/version: 8.1647s +2025-07-24 10:20:43 ERROR:gn2.wqflask:https://cd.genenetwork.org/api3/version: 8.1647s +2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] result: 1.0 +2025-07-24 10:20:43 ERROR:gn2.wqflask:result: 1.0 +2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] http://localhost:9093/api/version: 0.0088s +2025-07-24 10:20:43 ERROR:gn2.wqflask:http://localhost:9093/api/version: 0.0088s +2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] result: 1.0 +``` + +## Possible Mitigations + +* Switch over gn-auth.genenetwork.org to localhost. diff --git a/issues/CI-CD/configurations.gmi b/issues/CI-CD/configurations.gmi index 54cea47..acd2512 100644 --- a/issues/CI-CD/configurations.gmi +++ b/issues/CI-CD/configurations.gmi @@ -4,7 +4,7 @@ * assigned: aruni, fredm * priority: normal -* status: open +* status: closed, completed * keywords: CI, CD, configuration, config * type: bug @@ -38,3 +38,7 @@ and at least one of the values other than "localhost" is used to determine the c The secrets (e.g. SECRET_KEY, OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, etc) can be encrypted and stored in some secrets management system (e.g. Pass [https://www.passwordstore.org/] etc.) 
setup in each relevant host: better yet, have all configurations (secret or otherwise) encrypted and stored in such a secrets management system and fetch them from there. This reduces the mental overhead of dealing with multiple places to fetch the configs. From these, the CI/CD system can them build and intern the configurations into the store with guix functions like "plain-file", "local-file", etc. + +## Notes + +This idea was mostly rejected — it seems — in favour of using external settings files that are shared with the running container and separate build scripts for the different environments. This mostly covers all the bases necessary to get the settings correct. diff --git a/issues/CI-CD/development-container-checklist.gmi b/issues/CI-CD/development-container-checklist.gmi new file mode 100644 index 0000000..7cf4687 --- /dev/null +++ b/issues/CI-CD/development-container-checklist.gmi @@ -0,0 +1,101 @@ +# Deploying GeneNetwork CD + +## Prerequisites + +Ensure you have `fzf' installed and Guix is set up with your preferred channel configuration. + + +## Step 1: Pull the Latest Profiles + +``` +guix pull -C channels.scm -p ~/.guix-extra-profiles/gn-machines --allow-downgrades +guix pull -C channels.scm -p ~/.guix-extra-profiles/gn-machines-shepherd-upgrade --allow-downgrades +``` + + +## Step 2: Source the Correct Profile + +``` +. ,choose-profile +``` + + +### Contents of `,choose-profile' + +This script lets you interactively select a profile using `fzf': + +``` +#!/bin/env sh + +export GUIX_PROFILE="$(guix package --list-profiles | fzf --multi)" +. 
"$GUIX_PROFILE/etc/profile" + +hash guix + +echo "Currently using: $GUIX_PROFILE" +``` + + +## Step 3: Verify the Profile + +``` +guix describe +``` + + +## Step 4: Pull the Latest Code + +``` +cd gn-machines +git pull +``` + + +## Step 5: Run the Deployment Script + +``` +./genenetwork-development-deploy.sh +``` + + +## Step 6: Restart the Development Container + +``` +sudo systemctl restart genenetwork-development-container +``` + + +## Step 7: Verify Changes + +Manually confirm that the intended changes were applied correctly. + + +# Accessing the Development Container on tux02 + +To enter the running container shell, ensure you're using the *parent* PID of the `shepherd' process. + + +## Step 1: Identify the Correct PID + +Use this command to locate the correct container parent process: + +``` +ps -u root -f --forest | grep -A4 '/usr/local/bin/genenetwork-development-container' | grep shepherd +``` + + +## Step 2: Enter the Container + +Replace `46804' with your actual parent PID: + +``` +sudo /home/bonfacem/.config/guix/current/bin/guix container exec 46804 \ + /gnu/store/m6c5hgqg569mbcjjbp8l8m7q82ascpdl-bash-5.1.16/bin/bash \ + --init-file /home/bonfacem/.guix-profile/etc/profile --login +``` + + +## Notes + +* Ensure the PID is the container’s *shepherd parent*, not a child process. +* Always double-check your environment and profiles before deploying. 
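The PID selection from "Step 1: Identify the Correct PID" can also be sketched in code, which may help when the `grep` pipeline returns more than one line. This is a hypothetical helper, not part of gn-machines: the name `find_shepherd_pid`, the expected input format (`pid ppid args` rows, for example from `ps -eo pid,ppid,args`) and the sample rows are all assumptions. It reads "the correct PID" as the shepherd process closest to the container launcher, which matches the example PIDs used in these notes.

```python
LAUNCHER = "/usr/local/bin/genenetwork-development-container"


def find_shepherd_pid(ps_lines, launcher=LAUNCHER):
    """Return the PID of the shepherd process nearest to the launcher, or None."""
    procs = {}
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip the header row and malformed lines
        procs[int(parts[0])] = (int(parts[1]), parts[2])

    # Map every parent PID to the PIDs of its direct children.
    children = {}
    for pid, (ppid, _args) in procs.items():
        children.setdefault(ppid, []).append(pid)

    # Breadth-first walk down from the launcher: the first shepherd found
    # is the container's top-level shepherd, not one of its forked children.
    frontier = [pid for pid, (_ppid, args) in procs.items() if launcher in args]
    while frontier:
        next_frontier = []
        for pid in frontier:
            for child in sorted(children.get(pid, [])):
                if "shepherd" in procs[child][1]:
                    return child
                next_frontier.append(child)
        frontier = next_frontier
    return None


# Made-up sample rows; real store paths and PIDs will differ.
SAMPLE = [
    "  PID  PPID COMMAND",
    "16162     1 guile --no-auto-compile " + LAUNCHER,
    "16182 16162 guile --no-auto-compile /gnu/store/HASH-shepherd/bin/shepherd --config shepherd.conf",
    "16300 16182 guile --no-auto-compile /gnu/store/HASH-shepherd/bin/shepherd --config shepherd.conf",
    "16400 16182 gunicorn gn2.wsgi",
]

print(find_shepherd_pid(SAMPLE))
```

The breadth-first order is the design choice that matters here: a plain text match on "shepherd" could just as easily return a forked child, which is the mistake the notes above warn about.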
diff --git a/issues/CI-CD/failing-services-startup.gmi b/issues/CI-CD/failing-services-startup.gmi new file mode 100644 index 0000000..751e61c --- /dev/null +++ b/issues/CI-CD/failing-services-startup.gmi @@ -0,0 +1,236 @@ +# Failing Services' Startup + +## Tags + +* type: bug +* status: closed, completed +* priority: high +* assigned: fredm, bonfacem +* interested: pjotrp, bonfacem, aruni +* keywords: deployment, CI, CD + +## Description + +Upgrading guix to `34453b97005ff86355399df89c8827c57839d9c7` for CI/CD fails with: + +``` +2025-08-20 16:05:20 Backtrace: +2025-08-20 16:05:20 6 (primitive-load "/gnu/store/xbxd2zihw9dssrhips925gri0yn?") +2025-08-20 16:05:20 In ice-9/eval.scm: +2025-08-20 16:05:20 191:35 5 (_ _) +2025-08-20 16:05:20 In gnu/build/linux-container.scm: +2025-08-20 16:05:20 368:8 4 (call-with-temporary-directory #<procedure 7f014aa3a3f0?>) +2025-08-20 16:05:20 476:16 3 (_ "/tmp/guix-directory.VWRNbv") +2025-08-20 16:05:20 62:6 2 (call-with-clean-exit #<procedure 7f014aa1de80 at gnu/b?>) +2025-08-20 16:05:20 321:20 1 (_) +2025-08-20 16:05:20 In guix/build/syscalls.scm: +2025-08-20 16:05:20 1231:10 0 (_ 268566528) +2025-08-20 16:05:20 +2025-08-20 16:05:20 guix/build/syscalls.scm:1231:10: In procedure unshare: 268566528: Invalid argument +2025-08-20 16:05:20 Backtrace: +2025-08-20 16:05:20 4 (primitive-load "/gnu/store/xbxd2zihw9dssrhips925gri0yn?") +2025-08-20 16:05:20 In ice-9/eval.scm: +2025-08-20 16:05:20 191:35 3 (_ #f) +2025-08-20 16:05:20 In gnu/build/linux-container.scm: +2025-08-20 16:05:20 368:8 2 (call-with-temporary-directory #<procedure 7f014aa3a3f0?>) +2025-08-20 16:05:20 485:7 1 (_ "/tmp/guix-directory.VWRNbv") +2025-08-20 16:05:20 In unknown file: +2025-08-20 16:05:20 0 (waitpid #f #<undefined>) +2025-08-20 16:05:20 +2025-08-20 16:05:20 ERROR: In procedure waitpid: +2025-08-20 16:05:20 Wrong type (expecting exact integer): #f +``` + +Failing services: + +* genenetwork3: consistently +* genenetwork2: consistently +* gn-auth: 
intermittently + +## Troubleshooting Notes + +### Unable to run genenetwork2 in a shell container with the "-C" flag + +With the following channels: + +``` +$ guix describe +Generation 3 Aug 28 2025 03:56:44 (current) + gn-bioinformatics cffafde + repository URL: file:///home/bonfacem/guix-bioinformatics/ + branch: master + commit: cffafde125f3e711418d3ebb62eacd48a3efa8cf + guix-forge 3c8dc85 + repository URL: https://git.genenetwork.org/guix-forge/ + branch: main + commit: 3c8dc85a584c98bc90088ec1c85933d4d10e7383 + guix-past b14d7f9 + repository URL: https://codeberg.org/guix-science/guix-past + branch: master + commit: b14d7f997ae8eec788a7c16a7252460cba3aaef8 + guix 34453b9 + repository URL: https://codeberg.org/guix/guix + branch: master + commit: 34453b97005ff86355399df89c8827c57839d9c7 +``` + +Running: + +``` +$ guix shell -C genenetwork2 +``` + +Produces: + +``` +guix shell: error: unshare: 268566528: Invalid argument +Backtrace: + 16 (primitive-load "/export3/local/home/bonfacem/.guix-ext…") +In guix/ui.scm: + 2399:7 15 (run-guix . _) + 2362:10 14 (run-guix-command _ . _) +In ice-9/boot-9.scm: + 1752:10 13 (with-exception-handler _ _ #:unwind? _ # _) +In guix/status.scm: + 842:4 12 (call-with-status-report _ _) +In guix/store.scm: + 703:3 11 (_) +In ice-9/boot-9.scm: + 1752:10 10 (with-exception-handler _ _ #:unwind? _ # _) +In guix/store.scm: + 690:37 9 (thunk) + 1331:8 8 (call-with-build-handler _ _) + 1331:8 7 (call-with-build-handler #<procedure 7fc86bb50de0 at g…> …) +In guix/scripts/environment.scm: + 1205:11 6 (proc _) +In guix/store.scm: + 2212:25 5 (run-with-store #<store-connection 256.100 7fc87a46d820> …) +In guix/scripts/environment.scm: + 911:8 4 (_ _) +In gnu/build/linux-container.scm: + 485:7 3 (call-with-container _ _ #:namespaces _ #:host-uids _ # …) +In unknown file: + 2 (waitpid #f #<undefined>) +In ice-9/boot-9.scm: + 1685:16 1 (raise-exception _ #:continuable? _) + 1685:16 0 (raise-exception _ #:continuable? 
_) + +ice-9/boot-9.scm:1685:16: In procedure raise-exception: +Wrong type (expecting exact integer): #f +``` + +This is fixed by increasing the value of respawn-delay (default is 0.5s) to 5s. + + +### Unable to write to a temporary directory and issues with running git inside the g-exp + +Stack trace: +``` +2025-09-03 12:23:32 In ice-9/eval.scm: +2025-09-03 12:23:32 191:35 3 (_ #f) +2025-09-03 12:23:32 In gnu/build/linux-container.scm: +2025-09-03 12:23:32 368:8 2 (call-with-temporary-directory #<procedure 7f012241d3f0?>) +2025-09-03 12:23:32 485:7 1 (_ "/tmp/guix-directory.Bl6jtx") +2025-09-03 12:23:32 In unknown file: +2025-09-03 12:23:32 0 (waitpid #f #<undefined>) +2025-09-03 12:23:32 + +``` + +Cryptic message. Running the g-exps as a program shows: + +``` +Receiving objects: 100% (698/698), 16.18 MiB | 30.29 MiB/s, done. +Resolving deltas: 100% (49/49), done. +================================================== +error: cannot run less: No such file or directory +fatal: unable to execute pager 'less' +Backtrace: + 5 (primitive-load "/gnu/store/c9bvy90s5mglp6xdfkc1s4qkzj8?") +In ice-9/eval.scm: + 619:8 4 (_ #f) +In ice-9/boot-9.scm: + 142:2 3 (dynamic-wind #<procedure 7fa954b25880 at ice-9/eval.s?> ?) + 142:2 2 (dynamic-wind #<procedure 7fa94b7970c0 at ice-9/eval.s?> ?) +In ice-9/eval.scm: + 619:8 1 (_ #(#(#<directory (guile-user) 7fa954b03c80>))) +In guix/build/utils.scm: + 822:6 0 (invoke "git" "log" "--max-count" "1") + +guix/build/utils.scm:822:6: In procedure invoke: +ERROR: + 1. 
&invoke-error: + program: "git" + arguments: ("log" "--max-count" "1") + exit-status: 128 + term-signal: #f + stop-signal: #f +``` + +Fixed by adding "less" to the with-packages form and setting: + +``` +(setenv "TERM" "xterm-256color") + +``` + +### gn-auth: sqlite3.OperationalError: unable to open database file + +Despite having all file perms correctly set with 0644, we see: + +``` +Traceback (most recent call last): + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/flask/app.py", line 917, in full_dispatch_request + rv = self.dispatch_request() + ^^^^^^^^^^^^^^^^^^^^^^^ + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/flask/app.py", line 902, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/views.py", line 102, in authorise + return with_db_connection(__authorise__) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/db/sqlite3.py", line 63, in with_db_connection + return func(conn) + ^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/views.py", line 90, in __authorise__ + return server.create_authorization_response(request=request, grant_user=user) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/authlib/oauth2/rfc6749/authorization_server.py", line 297, in create_authorization_response + args = grant.create_authorization_response(redirect_uri, grant_user) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 31, in create_authorization_response + 
response = super().create_authorization_response( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/authlib/oauth2/rfc6749/grants/authorization_code.py", line 158, in create_authorization_response + self.save_authorization_code(code, self.request) + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 45, in save_authorization_code + return __save_authorization_code__( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 106, in __save_authorization_code__ + return with_db_connection(lambda conn: save_authorisation_code(conn, code)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/db/sqlite3.py", line 63, in with_db_connection + return func(conn) + ^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 106, in <lambda> + return with_db_connection(lambda conn: save_authorisation_code(conn, code)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/models/authorization_code.py", line 92, in save_authorisation_code + cursor.execute( +sqlite3.OperationalError: unable to open database file +``` + +Fixed above by correctly mapping: + +``` +- (source auth-db-path) ++ (source (dirname auth-db-path)) +``` + +in the relevant g-exp, and making sure that the parent directory is set to #o775 (rwx for both user/group). 
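A plausible reading of this fix is that SQLite needs more than the database file itself: when writing, it creates journal files next to the database, so the parent directory has to be visible and writable inside the container too. The following is a minimal sketch of a pre-flight check along those lines. The function name `sqlite_write_problems` is hypothetical and is not part of gn-auth.

```python
import os


def sqlite_write_problems(db_path):
    """List reasons why SQLite may fail to open `db_path` for writing."""
    problems = []
    parent = os.path.dirname(os.path.abspath(db_path))
    if not os.path.isdir(parent):
        problems.append(f"parent directory is missing: {parent}")
    elif not os.access(parent, os.W_OK | os.X_OK):
        # SQLite creates its journal files in this directory.
        problems.append(f"parent directory is not writable/searchable: {parent}")
    if os.path.exists(db_path) and not os.access(db_path, os.R_OK | os.W_OK):
        problems.append(f"database file is not readable and writable: {db_path}")
    return problems
```

Calling this with the configured database path from inside the container, as the user the service runs as, should return an empty list once the directory mapping and permissions are right.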
+ +## Also See + +=> https://issues.guix.gnu.org/78356 Broken system and home containers +=> https://codeberg.org/guix/guix/src/commit/34453b97005ff86355399df89c8827c57839d9c7/guix/build/syscalls.scm#L1218-L1233 How "unshare" is defined +=> https://codeberg.org/guix/guix/src/commit/34453b97005ff86355399df89c8827c57839d9c7/gnu/build/linux-container.scm#L321 Where `unshare` is called diff --git a/issues/CI-CD/profiling-flask.gmi b/issues/CI-CD/profiling-flask.gmi new file mode 100644 index 0000000..2d0c539 --- /dev/null +++ b/issues/CI-CD/profiling-flask.gmi @@ -0,0 +1,33 @@ +# Profiling GN + +Use this simple structure: + +``` +from werkzeug.middleware.profiler import ProfilerMiddleware + + +app = Flask(__name__) +app.config["PROFILE"] = True +app.wsgi_app = ProfilerMiddleware( + app.wsgi_app, + restrictions=[40, "main"], + profile_dir="profiler_dump", + filename_format="{time:.0f}-{method}-{path}-{elapsed:.0f}ms.prof", +) +``` + + +You can use gprof2dot to visualise the profile + +``` +guix shell gprof2dot -- gprof2dot -f pstats 1753202013-GET-show_trait-37931ms.prof > 1753202013-GET-show_trait-37931ms.prof.dot +guix shell xdot -- xdot 1753202013-GET-show_trait-37931ms.prof.dot +``` + +Or snakeviz to visualize it: + + +``` +scp genenetwork:/home/bonfacem/profiling/1753202013-GET-show_trait-37931ms.prof /tmp/test +snakeviz 1753202013-GET-show_trait-37931ms.prof +``` diff --git a/issues/CI-CD/troubleshooting-within-the-development-container.gmi b/issues/CI-CD/troubleshooting-within-the-development-container.gmi new file mode 100644 index 0000000..3aa8c3b --- /dev/null +++ b/issues/CI-CD/troubleshooting-within-the-development-container.gmi @@ -0,0 +1,46 @@ +# Troubleshooting inside the GN dev container +* type: systems, debugging, container +* keywords: container, troubleshooting, logs, webhooks + +You need to find the development container so that you can begin troubleshooting: + +``` +ps -u root -f --forest | grep -A4 
'/usr/local/bin/genenetwork-development-container' | grep shepherd +``` + +Example output: + +``` +root 16182 16162 0 03:57 ? 00:00:04 \_ /gnu/store/n87px1cazqkav83npg80ccp1n777j08s-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/b4n5ax7l1ccia7sr123fqcjqi4vy03pv-shepherd-1.0.2/bin/shepherd --config /gnu/store/5ahb3745wlpa5mjsbk8j6frn78khvzzw-shepherd.conf +``` + +Get into the container: + +``` +# Use the correct pid and guix/bash path. + +sudo /home/bonfacem/.config/guix/current/bin/guix container exec 16182 /gnu/store/m6c5hgqg569mbcjjbp8l8m7q82ascpdl-bash-5.1.16/bin/bash --init-file /home/bonfacem/.guix-profile/etc/profile --login +``` + +All the gn related logs can be found in "/var/log/cd": + +``` +genenetwork2.log +genenetwork3.log +gn-auth.log +gn-guile.log +``` + +All the nginx logs are in "/var/log/nginx" + +Sometimes, it's useful to trigger webhooks while troubleshooting. Here are all the relevant webhooks: + +``` +/gn-guile +/genenetwork2 +/genenetwork3 +/gn-libs +/gn-auth +``` + +Inside the container, we have "coreutils-minimal", and "curl" that you can use to troubleshoot. 
diff --git a/issues/acme-error.gmi b/issues/acme-error.gmi new file mode 100644 index 0000000..b31d04b --- /dev/null +++ b/issues/acme-error.gmi @@ -0,0 +1,106 @@ +# uACME Error: "urn:ietf:params:acme:error:unauthorized" + +## Tags + +* status: closed, completed +* priority: high +* type: bug +* assigned: fredm +* keywords: uacme, certificates, "urn:ietf:params:acme:error:unauthorized" + +## Description + +Sometimes, when we attempt to request TLS certificates from Let's Encrypt using uacme, we run into an error of the following form: + +``` +uacme: polling challenge status at https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599167439271/jFB2Pg +uacme: challenge https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599167439271/jFB2Pg failed with status invalid +uacme: the server reported the following error: +{ + "type": "urn:ietf:params:acme:error:unauthorized", + "detail": "128.xxx.xxx.xxx: Invalid response from http://sparql.genenetwork.org/.well-known/acme-challenge/N-P-mhiK04c-Iophbem4iFYsaB +yeaxeSyXHSijx3e6k: 404", + "status": 403 +} +uacme: running /gnu/store/zwqavgjqyk0f0krv8ndwhv3767f6cnx1-uacme-hook failed http-01 sparql.genenetwork.org N-P-mhiK04c-Iophbem4iFYsaBy +eaxeSyXHSijx3e6k N-P-mhiK04c-Iophbem4iFYsaByeaxeSyXHSijx3e6k.9dRdXFhCbqeDGWYndRd_hTh920rplmy-ef-_aLgjJJE +uacme: failed to authorize order at https://acme-v02.api.letsencrypt.org/acme/order/2399017717/438986245271 + +``` + +From the above error, we note that the request for the "/.well-known/..." path fails with a 404 code: Why. + +Let's try figuring it out; connect to the running container: + +``` +$ sudo guix container exec 89086 /run/current-system/profile/bin/bash --login +root@sparql /# cd /var/run/acme/acme-challenge/ +root@sparql /var/run/acme/acme-challenge# while true; do ls; sleep 0.5; clear; done +``` + +In a separate terminal, connect to the same container and run `/usr/bin/acme renew`. 
+ +The loop we created to list what files are created in the challenge directory outputs the file + +``` +root@sparql /var/run/acme/acme-challenge# while true; do ls; sleep 0.5; clear; done +Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE +Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE +⋮ +``` + +but we are still getting the same error: + +``` +uacme: challenge https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599184604221/7mTNdA failed with status invalid +uacme: the server reported the following error: +{ + "type": "urn:ietf:params:acme:error:unauthorized", + "detail": "128.169.5.101: Invalid response from http://sparql.genenetwork.org/.well-known/acme-challenge/Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE: 404", + "status": 403 +} +uacme: running /gnu/store/zwqavgjqyk0f0krv8ndwhv3767f6cnx1-uacme-hook failed http-01 sparql.genenetwork.org Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE.9dRdXFhCbqeDGWYndRd_hTh920rplmy-ef-_aLgjJJE +uacme: failed to authorize order at https://acme-v02.api.letsencrypt.org/acme/order/2399017717/438997397751 +``` + +meaning that somehow, nginx is not able to serve up this file. + +## Discovered Cause: 2025-10-20 + +There are 2 layers of nginx, the host nginx, and the internal/container nginx. + +The host nginx was proxying directly to the virtuoso http server rather than proxying to the internal/container nginx. This led to the failure because the internal/container nginx handles the TLS/SSL certificates for the site. The host nginx should have offloaded the handling of the TLS/SSL certificates to the internal/container nginx, but since it was not going through the internal nginx, that led to the failure. 
+ +A simile of the error condition and the solution are in the sections below: + +### Error Condition: Wrong proxying + +In host's "nginx.conf": +``` +⋮ + proxy_pass http://localhost:<virtuoso-http-server-port>; +⋮ +``` + +In internal/container "nginx.conf": +``` +⋮ + proxy_pass http://localhost:<virtuoso-http-server-port>; +⋮ +``` + +### Solution/Fix + +In host's "nginx.conf": +``` +⋮ + proxy_pass http://localhost:<container-nginx-http-port>; +⋮ +``` + +In internal/container "nginx.conf": +``` +⋮ + proxy_pass http://localhost:<virtuoso-http-server-port>; +⋮ +``` diff --git a/issues/add-documentation-and-data-retrieval-for-AI-repo.gmi b/issues/add-documentation-and-data-retrieval-for-AI-repo.gmi index 11f8f30..a96c18d 100644 --- a/issues/add-documentation-and-data-retrieval-for-AI-repo.gmi +++ b/issues/add-documentation-and-data-retrieval-for-AI-repo.gmi @@ -6,7 +6,6 @@ * priority: high * type: ui * keywords: phenotypes -* status: stalled ## Description @@ -15,3 +14,4 @@ * Share alternate way of getting sparql json-ld data from public endpoint outside isql. * Share json-ld gotchas. +* closed diff --git a/issues/add-genotype-files-to-rdf.gmi b/issues/add-genotype-files-to-rdf.gmi index 85ac39c..856c070 100644 --- a/issues/add-genotype-files-to-rdf.gmi +++ b/issues/add-genotype-files-to-rdf.gmi @@ -3,7 +3,7 @@ ## Tags * assigned: bonfacem * type: bug -* status: open, in progress +* status: stalled In Penguin2, genotype files are located in: /export/data/genenetwork/genotype_files/genotype. 
Each genotype files has an identifier to a dataset it refers to: diff --git a/issues/add-unique-identifiers-for-case-attributes.gmi b/issues/add-unique-identifiers-for-case-attributes.gmi new file mode 100644 index 0000000..0c3123d --- /dev/null +++ b/issues/add-unique-identifiers-for-case-attributes.gmi @@ -0,0 +1,11 @@ +# Add Case Attributes to RDF + +## Tags + +* assigned: bonfacem +* priority: high +* status: open + +## Description + +Add case attributes and their metadata into RDF. diff --git a/issues/assorted-ui-issues.gmi b/issues/assorted-ui-issues.gmi new file mode 100644 index 0000000..5fbacea --- /dev/null +++ b/issues/assorted-ui-issues.gmi @@ -0,0 +1,36 @@ +# Various UI issues raised by Rob (8/19/2024) + +# Tags + +* assigned: zsloan +* keywords: user-interface +* priority: medium +* open + +## Tasks + +* [X] Fix collection encoding issue + +* [X] Don't import empty collections (like the Default Collection) + +* [X] Update/Creation dates aren't listed for collections + +* [X] Remove in-between ticks for Effect Size Plot (from mapping page) so it's just -1/0/1 + +* [X] Also make Effect Size Plot more narrow + +* [X] Prevent X/Y-aix summary text from extending beyond the graph width + +* [X] Longer tick markers as well + +* [X] Remove triangle for phenotype mapping + +* [X] Remove ProbeSetPosition from mapping for traits with no position + +* [X] Make Haplotype legend image thicker + change text to Haplotypes (Mat, Pat, Het, Unknown) + +* [X] Change "Sequence Site" in legend to "Gene Location" + +* [X] When adding genotype marker as covariate (for scatter-plot, maybe also mapping), change description to Position instead of "undefined" + +* [ ] Check Add Covariation colorbox popup on Apple laptop (it shows up weird for Rob, but normal for me) diff --git a/issues/auth/masquarade-as-bug.gmi b/issues/auth/masquarade-as-bug.gmi index 12c2c5f..36fe34a 100644 --- a/issues/auth/masquarade-as-bug.gmi +++ b/issues/auth/masquarade-as-bug.gmi @@ -2,6 +2,7 @@ * 
assigned: fredm * tags: critical +* status: closed, completed Right now you can't masquared as another user. Here's the trace: diff --git a/issues/auth/reset-password-feature.gmi b/issues/auth/reset-password-feature.gmi index 8eaaa6a..299f915 100644 --- a/issues/auth/reset-password-feature.gmi +++ b/issues/auth/reset-password-feature.gmi @@ -1,6 +1,16 @@ # Reset/Forgot Password Feature for GN2 +# Tags + * assigned: fredm -* tags: critical +* priority: critical +* status: closed +* keywords: gn-auth, auth, reset password +* type: feature-request + +## Description Should a user forget his/her password, there's no clear way to reset the password. + +This issue is +=> https://git.genenetwork.org/gn-auth/tree/gn_auth/auth/authorisation/users/views.py?id=e829074e99fd5bec033765d18d5efa55e1edce44#n454 implemented with the latest code. diff --git a/issues/cleanup-base-file-gn2.gmi b/issues/cleanup-base-file-gn2.gmi new file mode 100644 index 0000000..8a05323 --- /dev/null +++ b/issues/cleanup-base-file-gn2.gmi @@ -0,0 +1,30 @@ +# Cleanup GN2 Base HTML File + +## Tags + +* Assigned: alexm +* Keywords: base, HTML, JavaScript, cleanup +* type: Refactoring +* Status: closed, completed, done + +## Description + +The base file should contain no custom JavaScript since it is inherited in almost all files in GN2. It should only include what is necessary. As a result, we need to move the global search from the base file to the index page, which renders the GN2 home. + +## Tasks + +* [x] Remove global search code from the base file and move it to the index page +* [x] Fix formatting and linting issues in the base file (e.g., tags) +* [x] Inherit from index page for all gn2 templates + + +## Notes + +See the PR that seeks to fix this: +=> https://github.com/genenetwork/genenetwork2/pull/877 + +## Notes 26/09/2024 + +It was agreed that global search should be a feature for all pages, +As such all files need to inherit from the global search which +defines the global search. 
\ No newline at end of file diff --git a/issues/correlation-timing-out.gmi b/issues/correlation-timing-out.gmi index 419524d..bed8692 100644 --- a/issues/correlation-timing-out.gmi +++ b/issues/correlation-timing-out.gmi @@ -5,7 +5,7 @@ * assigned: fredm, zsloan, alexm * type: bug * priority: high -* status: ongoing +* status: closed, completed * keywords: correlations ## Description @@ -17,3 +17,7 @@ Do correlations against the same dataset This might be the same issue as the one in => /issues/correlation-missing-file correlation-missing-file.gmi but I'm not sure. + +## Close as completed + +This is fixed. diff --git a/issues/create-custom-rif-xapian-index.gmi b/issues/create-custom-rif-xapian-index.gmi new file mode 100644 index 0000000..a0b9039 --- /dev/null +++ b/issues/create-custom-rif-xapian-index.gmi @@ -0,0 +1,16 @@ +# Create Custom RIF XAPIAN Index + +## Tags + +* assigned: bonfacem +* priority: medium +* status: in-progress +* deadline: 2024-10-23 Wed + +## Description + +Given the GN Wiki search page: + +=> https://cd.genenetwork.org/genewiki GeneWiki Entries Search + +We only search by symbol. Add custom XAPIAN index to perform more powerful search. diff --git a/issues/edit-rif-metadata.gmi b/issues/edit-rif-metadata.gmi new file mode 100644 index 0000000..546dc80 --- /dev/null +++ b/issues/edit-rif-metadata.gmi @@ -0,0 +1,121 @@ +# Edit RIF Metadata in GN2 + +## Tags + +* assigned: bonfacem, jnduli +* priority: high +* status: closed + +## Tasks + +### Viewing +* [X] API: Get WIKI/RIF by symbol from rdf. + +> GET /wiki/<symbol> + +``` +[{ + "symbol": "XXXX", + "reason": "XXXX", + "species": "XXXX", + "pubmed_ids": ["XXXX", "XXXX"], // empty array when non-existent + "web_url": "XXXX" // Optional + "comment": "XXXX", + "email": "XXXX", + "categories": ["XXXX", "XXXX"], // Enumeration + "version": "XXXX", + "initial": "XXXX", // Optional user or project code or your initials. 
+}] +``` + +* [X] UI: Modify traits page to have "GN2 (GeneWiki)" +* [X] UI: Integrate with API + +### Editing + +* [X] API: Edit comment by id in mysql/rdf: modifies GeneRIF and GeneRIFXRef tables. +* [X] API: Modify edit comments by id to include RDF changes. + +> POST /wiki/<comment-id>/edit + +``` +{ + "symbol": "XXXX", + "reason": "XXXX", + "species": "XXXX", + "pubmed_ids": ["XXXX", "XXXX"], // Optional + "web_url": "XXXX" // Optional + "comment": "XXXX", + "email": "XXXX", + "categories": ["XXXX", "XXXX"], // Enumeration + "initial": "XXXX", // Optional user or project code or your initials. +} +``` +* [X] UI: Add buttons that edit various relevant sections. +* [X] UI: Edit page needs to fetch categories from GeneCategory table. When comment write fails, alert with error. When comment write success, update the comment on the page, and alert with success. +* [X] API: Modify edit comments by id to include RDF changes. +* [X] GN auth integration + +### History + +* [X] API: End-point to fetch all the historical data +* [X] UI: Page that contains history for how comments changes. + +> GET /wiki/<comment-id>/history + +``` +[{ + "symbol": "XXXX", + "reason": "XXXX", + "species": "XXXX", + "pubmed_ids": ["XXXX", "XXXX"], // Optional + "web_url": "XXXX" // Optional + "comment": "XXXX", + "email": "XXXX", + "categories": ["XXXX", "XXXX"], // Enumeration + "version": "XXXX", + "initial": "XXXX", // Optional user or project code or your initials. +}] +``` + +### Misc ToDos: + +* [X] Review performance of query used in 72d9a24e8e65 [Genenetwork3] + +### Ops + +* [X] RDF synchronization with SQL (gn-machines). +* [X] Update RDF in tux02. +* [X] UI: Add "edit" button after testing. + +### Resolution + +Genenetwork2: +=> https://github.com/genenetwork/genenetwork2/pull/858 UI/fetch rif using recent apis #858 +=> https://github.com/genenetwork/genenetwork2/pull/864 Add comment history page. 
#864 +=> https://github.com/genenetwork/genenetwork2/pull/865 Add support for auth in Rif Edit #865 +=> https://github.com/genenetwork/genenetwork2/pull/866 Add a page for searching GeneWiki by symbol. #866 +=> https://github.com/genenetwork/genenetwork2/pull/881 Add display page for NCBI RIF metadata. #881 +=> https://github.com/genenetwork/genenetwork2/pull/881 Add display page for NCBI RIF metadata. #881 +=> https://github.com/genenetwork/genenetwork2/pull/882 GN editting UI improvements #882 + + +GeneNetwork3: +=> https://github.com/genenetwork/genenetwork3/pull/180 Update script that updates Generif_BASIC table #180 +=> https://github.com/genenetwork/genenetwork3/pull/181 Add case insensitive prefixes for rif wiki #181 +=> https://github.com/genenetwork/genenetwork3/pull/184 Api/get wiki from rdf #184 +=> https://github.com/genenetwork/genenetwork3/pull/185 feat: add api calls to get categories and last comment #185 +=> https://github.com/genenetwork/genenetwork3/pull/186 Api/fetch the latest wiki by versionid #186 +=> https://github.com/genenetwork/genenetwork3/pull/187 Api/get end point to fetch all historical data #187 +=> https://github.com/genenetwork/genenetwork3/pull/189 Add auth to edit RIF api call #189 +=> https://github.com/genenetwork/genenetwork3/pull/190 Api/update rif queries #190 +=> https://github.com/genenetwork/genenetwork3/pull/193 Api/edit rif endpoint #193 +=> https://github.com/genenetwork/genenetwork3/pull/194 Fix C0411/C0412 pylint errors in gn3.api.metadata.api.wiki. #194 +=> https://github.com/genenetwork/genenetwork3/pull/195 Add rif tests #195 +=> https://github.com/genenetwork/genenetwork3/pull/196 Handle missing GN3_SECRETS for CI testing. #196 +=> https://github.com/genenetwork/genenetwork3/pull/197 Rif edit atomicity #197 +=> https://github.com/genenetwork/genenetwork3/pull/198 Run tests against Virtuoso that is spun locally. #198 +=> https://github.com/genenetwork/genenetwork3/pull/199 Add rdf-tests after the check phase. 
#199 +=> https://github.com/genenetwork/genenetwork3/pull/200 Api/ncbi metadata #200 + +* closed diff --git a/issues/editing-dataset-metadata.gmi b/issues/editing-dataset-metadata.gmi index 17d1693..70876e0 100644 --- a/issues/editing-dataset-metadata.gmi +++ b/issues/editing-dataset-metadata.gmi @@ -5,7 +5,7 @@ * assigned: bonfacem * priority: high * type: editing -* status: in-progress +* status: stalled * keywords: metadata editing ## Description diff --git a/issues/error-handling-external-errors.gmi b/issues/error-handling-external-errors.gmi index d1707de..640e1d1 100644 --- a/issues/error-handling-external-errors.gmi +++ b/issues/error-handling-external-errors.gmi @@ -3,7 +3,7 @@ ## Tags * assigned: fredm -* status: open +* status: closed * type: bug * priority: high * keywords: error handling diff --git a/issues/fix-global-search-ui.gmi b/issues/fix-global-search-ui.gmi new file mode 100644 index 0000000..2979d99 --- /dev/null +++ b/issues/fix-global-search-ui.gmi @@ -0,0 +1,24 @@ +# Fix Broken Global Search UI + +## Tags + +* Assigned: alexm, zsloan +* Priority: high +* status: in progress +* Keyword : search, UI, bug, Refactor +* Type: UI, bug + +## Description + +The Global search UI layout is broken on certain browser versions. +This issue was reported to occur for **Firefox Version 128.3.1** ESR Version. +The root cause of the problem is unclear, +but after reviewing the global search UI code, +the following changes need to be implemented (see tasks below): + + + +## Tasks + +* [ ] Remove custom layout CSS and replace it with the Bootstrap layout for better uniformity and easier debugging. +* [ ] Modify the navbar to extend across the full width of the page on medium and small devices. 
diff --git a/issues/fix-pairscan-mapping.gmi b/issues/fix-pairscan-mapping.gmi new file mode 100644 index 0000000..1b48fee --- /dev/null +++ b/issues/fix-pairscan-mapping.gmi @@ -0,0 +1,28 @@ +# Fix Pairscan Mapping + +## Tags + +* assigned: alexm, +* priority: medium, +* type: bug +* keywords: pairscan, debug, fix, mapping + +## Description +Pairscan mapping is currently not working: + +Error: + +``` +GeneNetwork 3.12-rc1 https://genenetwork.org/run_mapping ( 1:01PM UTC Jan 13, 2025) +Traceback (most recent call last): + File "/gnu/store/cxawl32jm0fgavc9ahcr3g0j66zdan30-profile/lib/python3.10/site-packages/flask/app.py", line 1523, in full_dispatch_request + rv = self.dispatch_request() + File "/gnu/store/cxawl32jm0fgavc9ahcr3g0j66zdan30-profile/lib/python3.10/site-packages/flask/app.py", line 1509, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args) + File "/gnu/store/cxawl32jm0fgavc9ahcr3g0j66zdan30-profile/lib/python3.10/site-packages/gn2/wqflask/views.py", line 1035, in mapping_results_page + template_vars = run_mapping.RunMapping(start_vars, + File "/gnu/store/cxawl32jm0fgavc9ahcr3g0j66zdan30-profile/lib/python3.10/site-packages/gn2/wqflask/marker_regression/run_mapping.py", line 312, in __init__ + self.geno_db_exists = geno_db_exists(self.dataset, results[0]['name']) + KeyError: 'name' + +``` \ No newline at end of file diff --git a/issues/fix-rqtl-rm-bug.gmi b/issues/fix-rqtl-rm-bug.gmi new file mode 100644 index 0000000..de71487 --- /dev/null +++ b/issues/fix-rqtl-rm-bug.gmi @@ -0,0 +1,95 @@ +# Investigate and Fix `rm` Command in `rqtl` Logs + +## Tags + +* assigned: alex, bonfacem +* type: Bug +* status: in progress +* keywords: external, qtl, rqtl, bug, logs + +## Description + +For QTL analysis, we invoke the `rqtl` script as an external process through Python's `subprocess` module. 
+For reference, see the `rqtl_wrapper.R` script: +=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/rqtl_wrapper.R + +The issue is that, upon analyzing the logs for `rqtl`, we see that an `rm` command is unexpectedly invoked: + +``` +sh: line 1: rm: command not found +``` + +This command cannot be traced to its origin, and it does not appear to be part of the expected behavior. + +The issue is currently observed only in the CD environment. The only way I have attempted to reproduce this locally is by invoking the command in a shell environment with string injection, which is not the case for GeneNetwork3, where all strings are parsed and passed as a list argument. + +Here’s an example of the above attempt: + +```python +def run_process(cmd, output_file, run_id): + """Function to execute an external process and capture the stdout in a file. + + Args: + cmd: The command to execute, provided as a list of arguments. + output_file: Absolute file path to write the stdout. + run_id: Unique ID to identify the process. + + Returns: + A dictionary with the results, indicating success or failure. + """ + cmd.append(" && rm") # Injecting potentially problematic command + cmd = " ".join(cmd) # The command is passed as a string + + try: + # Phase: Execute the command in a shell environment + with subprocess.Popen( + cmd, + shell=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + ) as process: + # Process output handling goes here +``` + +The error generated at the end of the `rqtl` run, if `rm` does not exist inside the container, is: + +``` +sh: line 1: rm: command not found +``` + +The actual code for GeneNetwork3 is: + +```python +def run_process(cmd, output_file, run_id): + """Function to execute an external process and capture the stdout in a file. + + Args: + cmd: The command to execute, provided as a list of arguments. + output_file: Absolute file path to write the stdout. + run_id: Unique ID to identify the process.
+ + Returns: + A dictionary with the results, indicating success or failure. + """ + try: + # Phase: Execute the command in a shell environment + with subprocess.Popen( + cmd, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + ) as process: + # Process output handling goes here +``` + +## Investigated and Excluded Possibilities + +* [x] The `rm` command is not explicitly invoked within the `rqtl` script. +* [x] The `rqtl` command is passed as a list of parsed arguments (i.e., no direct string injection). +* [x] The subprocess is not invoked within a shell environment, which would otherwise result in string injection. +* [x] We simulated invoking a system command within the `rqtl` script, but the error does not match the observed issue. + +## TODO + +* [ ] Test in a similar environment to the CD environment to replicate the issue. + +* [ ] Investigate the internals of the QTL library for any unintended `rm` invocation. diff --git a/issues/fix-spam-entries-in-gn-auth-production.gmi b/issues/fix-spam-entries-in-gn-auth-production.gmi index db88eec..5ef7a42 100644 --- a/issues/fix-spam-entries-in-gn-auth-production.gmi +++ b/issues/fix-spam-entries-in-gn-auth-production.gmi @@ -2,6 +2,7 @@ # Tags +* status: closed, completed * assigned: fredm * keywords: auth @@ -13,4 +14,8 @@ We have spam entries in gn-auth in production in the groups table: b59229de-2fce-4a3d-82f1-d9eeee9b7009|Business For Sale Adelaide|{"group_description": "Welcome to Business2Sell, the ultimate online platform for those seeking affordable business opportunities in Adelaide. As a trusted first-party provider, we offer the ideal marketplace for buying or selling businesses across the country. Whether you're an aspiring entrepreneur looking for your next venture or a business owner ready to sell, Business2Sell provides the perfect platform for you. Our user-friendly interface and extensive listings make it effortless to discover a wide range of businesses, all within your budget. 
Join our vibrant community of buyers and sellers today, and let us help you achieve your business goals in Adelaide with ease and confidence.\r\nhttps://www.business2sell.com.au/businesses/sa/adelaide"} ``` +## Close as completed +We added email verification when registering, which should help reduce the success of these automated bots. + +We also added tooling to help with users and groups management, which is helping clean up these spam data. diff --git a/issues/gemma/gemma2-has-different-output-from-rqtl2.gmi b/issues/gemma/gemma2-has-different-output-from-rqtl2.gmi new file mode 100644 index 0000000..a0b2c5c --- /dev/null +++ b/issues/gemma/gemma2-has-different-output-from-rqtl2.gmi @@ -0,0 +1,80 @@ +# GEMMA output differs from R/qtl2 + +# Tags + +* assigned: pjotrp, davea +* priority: high +* type: bug, enhancement +* status: closed +* keywords: database, gemma, reaper, rqtl2 + +# Description + +When running trait BXD_21526 results differ significantly. + +=> https://genenetwork.org/show_trait?trait_id=21526&dataset=BXDPublish +=> https://genenetwork.org/show_trait?trait_id=21529&dataset=BXDPublish + +So I confirm I am getting the same results as Dave in GN for GEMMA (see Conclusion below). + +# Tasks + +## GeneNetwork + +I run GEMMA for precompute on the command line and that I confirmed to +be the same as what we see in the browser. This suggests either data +or method is different with Dave's approach. + +I confirmed that gemma in GN matches Dave's results. It is interesting +to see that running without LOCO has some impact, but not as bad as +the R/qtl2 difference. First we should check the genotype files to see +if they match. I checked that the phenotypes match. + +Our inputs are different if I count genotypes (first yours, the other +on production): + +``` + 1 2184941 B + 2 2132744 D + 3 628980 H + 1 2195662 B + 2 2142959 D + 3 650168 H +``` + +The number of rows/markers is the same. 
So we probably added some +genotypes, but if we miss one that would matter. Dave, you can find +the file in /home/wrk/BXD.geno on tux02 if you want to look. + +I notice that we don't use H in the R/qtl2 control file. That +might make a difference though it probably won't explain what we see +now. BTW I also correlated the LOD scores from GEMMA and R/qtl2 in +the spreadsheet and at 0.7 that is too low. So it is probably not +just a magnitude problem. The results differ a lot in your +spreadsheet. + +Next step is that I need to run R/qtl2 using the script in your +dropbox and see what Karl's code does. The exercise does not hurt +because it will help us bring R/qtl2 to GN. + +## R/qtl2 + +R/qtl2 is packaged in guix and can be run in a shell with + +``` +guix shell -C r r-qtl2 +> library(qtl2) +> bxd <- read_cross2(file = "bxd_cancer_new_GN_July_2024.json") +Warning messages: +1: In recode_geno(sheet, genotypes) : + 630519 genotypes treated as missing: "H", "U" +2: In matrix(as.numeric(unlist(pheno)), ncol = nc) : + NAs introduced by coercion +3: In check_cross2(output) : Physical map out of order on chr 1, 2, 11, 19 +``` + +The first warning matches above. If data is missing it may be filtered out. We'll have to check for that. The third warning I am not sure about. Probably a ranking of markers. + +# Conclusion + +It turned out that R/qtl was running Haley-Knott (HK) regression, so it was a plain QTL mapping rather than an LMM.
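The genotype tallies compared in the GeneNetwork section above (counts of B, D and H calls) can be reproduced with a few lines of Python. This is only a sketch under assumptions that the issue does not confirm: that the `.geno` file is tab-separated, that lines starting with `#` or `@` are comments or metadata, that the first remaining line is a header, and that four descriptive columns precede the per-strain calls. The function name `count_genotype_calls` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_genotype_calls(lines: Iterable[str], meta_columns: int = 4) -> Counter:
    """Count genotype calls (e.g. B, D, H, U) in .geno-style rows.

    Assumed layout (hypothetical, see the text above): '#'/'@' lines are
    skipped, the first remaining line is a header, and the first
    `meta_columns` tab-separated columns are marker descriptions.
    """
    counts: Counter = Counter()
    header_seen = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith(("#", "@")):
            continue  # comment, metadata, or blank line
        if not header_seen:
            header_seen = True  # skip the column-name row
            continue
        counts.update(line.split("\t")[meta_columns:])
    return counts


# Made-up rows for illustration only; not taken from any real genotype file.
sample = [
    "# hypothetical example data\n",
    "@name:BXD\n",
    "Chr\tLocus\tcM\tMb\tBXD1\tBXD2\tBXD5\n",
    "1\trs1\t0.0\t3.0\tB\tD\tH\n",
    "1\trs2\t0.5\t3.5\tB\tB\tU\n",
]
print(count_genotype_calls(sample))
```

Passing an open file handle instead of `sample` would tally a whole file, which makes it easy to compare two genotype files call by call.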
diff --git a/issues/genenetwork/cannot-connect-to-mariadb.gmi b/issues/genenetwork/cannot-connect-to-mariadb.gmi new file mode 100644 index 0000000..3dfe1bc --- /dev/null +++ b/issues/genenetwork/cannot-connect-to-mariadb.gmi @@ -0,0 +1,121 @@ +# Cannot Connect to MariaDB + + +## Description + +GeneNetwork3 is failing to connect to mariadb with the error: + +``` +⋮ +2024-11-05 14:49:00 Traceback (most recent call last): +2024-11-05 14:49:00 File "/gnu/store/83v79izrqn36nbn0l1msbcxa126v21nz-profile/lib/python3.10/site-packages/flask/app.py", line 1523, in full_dispatch_request +2024-11-05 14:49:00 rv = self.dispatch_request() +2024-11-05 14:49:00 File "/gnu/store/83v79izrqn36nbn0l1msbcxa126v21nz-profile/lib/python3.10/site-packages/flask/app.py", line 1509, in dispatch_request +2024-11-05 14:49:00 return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args) +2024-11-05 14:49:00 File "/gnu/store/83v79izrqn36nbn0l1msbcxa126v21nz-profile/lib/python3.10/site-packages/gn3/api/menu.py", line 13, in generate_json +2024-11-05 14:49:00 with database_connection(current_app.config["SQL_URI"], logger=current_app.logger) as conn: +2024-11-05 14:49:00 File "/gnu/store/lzw93sik90d780n09svjx5la1bb8g3df-python-3.10.7/lib/python3.10/contextlib.py", line 135, in __enter__ +2024-11-05 14:49:00 return next(self.gen) +2024-11-05 14:49:00 File "/gnu/store/83v79izrqn36nbn0l1msbcxa126v21nz-profile/lib/python3.10/site-packages/gn3/db_utils.py", line 34, in database_connection +2024-11-05 14:49:00 connection = mdb.connect(db=db_name, +2024-11-05 14:49:00 File "/gnu/store/83v79izrqn36nbn0l1msbcxa126v21nz-profile/lib/python3.10/site-packages/MySQLdb/__init__.py", line 121, in Connect +2024-11-05 14:49:00 return Connection(*args, **kwargs) +2024-11-05 14:49:00 File "/gnu/store/83v79izrqn36nbn0l1msbcxa126v21nz-profile/lib/python3.10/site-packages/MySQLdb/connections.py", line 195, in __init__ +2024-11-05 14:49:00 super().__init__(*args, **kwargs2) +2024-11-05 14:49:00 
MySQLdb.OperationalError: (2002, "Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)") +``` + +We have previously defined the default socket file[^1][^2] as "/run/mysqld/mysqld.sock". + +## Troubleshooting Logs + +### 2024-11-05 + +I attempted to just bind `/run/mysqld/mysqld.sock` to `/tmp/mysql.sock` by adding the following mapping in GN3's `gunicorn-app` definition: + +``` +(file-system-mapping + (source "/run/mysqld/mysqld.sock") + (target "/tmp/mysql.sock") + (writable? #t)) +``` + +but that does not fix things. + +I had tried to change the mysql URI to use IP addresses, i.e. + +``` +SQL_URI="mysql://webqtlout:webqtlout@128.169.5.119:3306/db_webqtl" +``` + +but that simply changes the error from the above to the one below: + +``` +2024-11-05 15:27:12 MySQLdb.OperationalError: (2002, "Can't connect to MySQL server on '128.169.5.119' (115)") +``` + +I tried with both `127.0.0.1` and `128.169.5.119`. + +My hail-mary was to attempt to expose the `my.cnf` file generated by the `mysql-service-type` definition to the "pola-wrapper", but that is proving tricky, seeing as the file is generated elsewhere[^4] and we do not have a way of figuring out the actual final path of the file. + +I tried: + +``` +(file-system-mapping + (source (mixed-text-file "my.cnf" + (string-append "[client]\n" + "socket=/run/mysqld/mysqld.sock"))) + (target "/etc/mysql/my.cnf")) +``` + +but that did not work either. + +### 2024-11-07 + +Start digging into how GNU Guix services are defined[^5] to try and understand why the file mapping attempt did not work. + +=> http://git.savannah.gnu.org/cgit/guix.git/tree/gnu/system/file-systems.scm?id=2394a7f5fbf60dd6adc0a870366adb57166b6d8b#n575 +Looking at the code linked above specifically at lines 575 to 588, and 166, it seems, to me, that the mappings attempt should have worked. 
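The socket mismatch reported at the top of this issue (the client looks for `/tmp/mysql.sock` while the server is configured for `/run/mysqld/mysqld.sock`) could also be handled on the client side by passing the socket path explicitly. The following is a minimal sketch, not the way `gn3.db_utils` is known to work: the helper `connection_kwargs` and the `unix_socket` query parameter on the URI are assumptions made for illustration, and the credentials are placeholders. The resulting dictionary uses keyword names (`db`, `user`, `passwd`, `host`, `port`, `unix_socket`) that `MySQLdb.connect` accepts.

```python
from urllib.parse import parse_qs, urlparse

# Server-side socket path named in this issue.
DEFAULT_SOCKET = "/run/mysqld/mysqld.sock"


def connection_kwargs(sql_uri: str) -> dict:
    """Turn a mysql:// URI into keyword arguments for MySQLdb.connect().

    Hypothetical helper: an optional `unix_socket` query parameter
    overrides the default socket path for local connections.
    """
    parsed = urlparse(sql_uri)
    query = parse_qs(parsed.query)
    kwargs = {
        "db": parsed.path.lstrip("/"),
        "user": parsed.username,
        "passwd": parsed.password,
        "host": parsed.hostname or "localhost",
    }
    if parsed.port:
        kwargs["port"] = parsed.port
    if kwargs["host"] == "localhost":
        # The MySQL client library uses a Unix socket for "localhost",
        # so tell it exactly which socket file to use.
        kwargs["unix_socket"] = query.get("unix_socket", [DEFAULT_SOCKET])[0]
    return kwargs


print(connection_kwargs("mysql://user:secret@localhost/db_webqtl"))
```

With something like this, the socket location lives in one configuration value rather than depending on the client library's compiled-in default.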
+ +Try it again, taking care to verify that the paths are correct, with: + +``` +(file-system-mapping + (source (mixed-text-file "my.cnf" + (string-append "[client-server]\n" + "socket=/run/mysqld/mysqld.sock"))) + (target "/etc/my.cnf")) +``` + +Try rebuilding on tux04: started getting `Segmentation fault` errors out of the blue for many guix commands 🤦🏿. +Try building container on local dev machine: this took a long time - quit and continue later. + +### 2024-11-08 + +After guix broke, causing the `Segmentation fault` errors above, I did some troubleshooting and was able to finally fix that by pinning guix to version b0b988c41c9e0e591274495a1b2d6f27fcdae15a as shown in the troubleshooting transcript[^6]. + +Now the fixes I did to make python requests work with the newer guix (defined in guix-bioinformatics[^7]) seem to be leading to failures in the older guix version. + +Let me attempt rebasing to reorder the commits, to make the python requests commit come last, to more easily do a `git reset` before rebuilding the container — not successful. +=> https://git.genenetwork.org/gn-machines/commit/?h=production-container&id=610049b2bfa32cae5d3f992b95aac711290efa2a Manually "undo" the changes in a new commit, + +then rebuild the container. This exposes a bug in gn-auth. + +=> https://git.genenetwork.org/gn-auth/commit/?id=4c21d0e43cf0de1084d0e0a243e441c6e72236eb Fix that. + +and update the `public-jwks-uri` value for the client in the admin dashboard, and voila!!! Now the system works. + +Attempt pulling guix "2394a7f5fbf60dd6adc0a870366adb57166b6d8b" into a profile locally: went through without a hitch + +Upgrade guix daemon, and restart it. Delete profile and run `guix gc`, then try pulling guix "2394a7f5fbf60dd6adc0a870366adb57166b6d8b" again. It also went through without a problem. 
This eliminates the daemon being the culprit: Running `sudo -i guix pull --list-generations` on both tux04 and my local dev machine gives both daemon commits as `2a6d96425eea57dc6dd48a2bec16743046e32e06`. + + +### Footnotes + +=> https://git.genenetwork.org/gn-machines/tree/production.scm?id=46a1c4c8d01198799e6ac3b99998dca40d2c7094#n47 [^1] Lines 47 to 49 of production.scm +=> https://guix.gnu.org/manual/en/html_node/Database-Services.html#index-mysql_002dconfiguration [^2] Guix's mysql-service-type configurations +=> https://mariadb.com/kb/en/server-system-variables/#socket [^3] MariaDB configuration variables: socket +=> https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/services/databases.scm?id=4c56d0cccdc44e12484b26332715f54768738c5f#n576 [^4] Guix: mysql-service-type configuration code +=> https://guix.gnu.org/manual/en/html_node/Defining-Services.html [^5] Guix documentation: Defining Services +=> https://github.com/genenetwork/gn-gemtext-threads/blob/d785b06643b5e5a2470fd0da075dcf77bda82d16/miscellaneous/broken-guix-on-tux04-20241108.org [^6] Broken guix on tux04: Troubleshooting transcript +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=eb7beb340a9731775e8ad177e47b70dba2f2a84f [^7] guix-bioinformatics: Upgrade guix channel to 2394a7f diff --git a/issues/genenetwork/containerising-production-issues.gmi b/issues/genenetwork/containerising-production-issues.gmi new file mode 100644 index 0000000..ed5702a --- /dev/null +++ b/issues/genenetwork/containerising-production-issues.gmi @@ -0,0 +1,33 @@ +# Containerising Production: Issues + +## Tags + +* type: bug +* assigned: fredm +* priority: critical +* status: closed, completed +* keywords: production, container, tux04 +* interested: alexk, aruni, bonfacem, fredm, pjotrp, soloshelby, zsloan, jnduli + +## Description + +We have recently got production into a container and deployed it: It has come up, however, that there are services that are useful to get a full-featured GeneNetwork system 
running that are not part of the container. + +This is, therefore, a meta-issue, tracking all issues that relate to the deployment of the disparate services that make up GeneNetwork. + +## Documentation + +=> https://issues.genenetwork.org/topics/genenetwork/genenetwork-services + +The link above documents the various services that make up the GeneNetwork service. + +## Issues + +* [x] Move user directories to a large partition +=> ./handle-tmp-dirs-in-container [x] Link TMPDIR in container to a directory on a large partition +=> ./markdown-editing-service-not-deployed [ ] Define and deploy Markdown Editing service +=> ./umhet3-samples-timing-slow [ ] Figure out and fix UM-HET3 Samples mappings on Tux04 +=> ./setup-mailing-on-tux04 [x] Setting up email service on Tux04 +=> ./virtuoso-shutdown-clears-data [x] Virtuoso seems to lose data on restart +=> ./python-requests-error-in-container [x] Fix python's requests library certificates error +=> ./cannot-connect-to-mariadb [ ] GN3 cannot connect to mariadb server diff --git a/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi b/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi new file mode 100644 index 0000000..b804e10 --- /dev/null +++ b/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi @@ -0,0 +1,23 @@ +# guix-bioinformatics: Remove `guix-rust-past-crates` channel + +## Tags + +* assigned: alexm, bonfacem +* interested: fredm +* priority: normal +* status: open +* type: bug +* keywords: guix-bioinformatics, guix-rust-past-crates, guix, rust, crates + +## Description + +GNU Guix recently changed[1] the way it handles packaging of rust packages. + +The old rust packages got moved to the "guix-rust-past-crates" to help avoid huge breakages for systems depending on the older packaging system. 
"guix-bioinformatics" used a number of rust packages, defined in the old form, and we needed a quick fix, thus the introduction of the "guix-rust-past-crates" channel as a dependency. + +We need to move away from depending on this channel, by updating all the rust crates we use to the new packaging model. + + +## Footnotes + +=> https://guix.gnu.org/en/blog/2025/a-new-rust-packaging-model/ [1] diff --git a/issues/genenetwork/handle-tmp-dirs-in-container.gmi b/issues/genenetwork/handle-tmp-dirs-in-container.gmi new file mode 100644 index 0000000..5f6eb92 --- /dev/null +++ b/issues/genenetwork/handle-tmp-dirs-in-container.gmi @@ -0,0 +1,22 @@ +# Handle Temporary Directories in the Container + +## Tags + +* type: feature +* assigned: fredm +* priority: critical +* status: closed, completed +* keywords: production, container, tux04 +* interested: alexk, aruni, bonfacem, pjotrp, zsloan + +## Description + +The container's temporary directories should be in a large partition on the host to avoid a scenario where the writes fill up one of the smaller drives. + +Currently, we use the `/tmp` directory by default, but we should look into transitioning away from that — `/tmp` is world readable and world writable and therefore needs careful consideration to keep safe. + +Thankfully, we are running our systems within a container, and can bind the container's `/tmp` directory to a non-world-accessible directory, keeping things at least contained. 
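On the application side, Python's `tempfile` module consults the `TMPDIR` environment variable, so code that goes through `tempfile` follows wherever that variable points. The sketch below illustrates this; the scratch directory is created on the fly as a stand-in for a directory on the large partition, and its name is hypothetical.

```python
import os
import tempfile

# Stand-in for a directory on the large partition (hypothetical).
scratch_root = tempfile.mkdtemp(prefix="gn-scratch-")

os.environ["TMPDIR"] = scratch_root
tempfile.tempdir = None  # drop the cached default so TMPDIR is read again

with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
    created_in = os.path.dirname(handle.name)
    print("temporary file created in:", created_in)

os.rmdir(scratch_root)  # the temporary file is already gone at this point
```

Code that hard-codes `/tmp` instead of using `tempfile` would not follow `TMPDIR`, which is one reason binding the container's `/tmp` itself remains useful.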
+ +### Fixes + +=> https://git.genenetwork.org/gn-machines/commit/?id=7306f1127df9d4193adfbfa51295615f13d32b55 diff --git a/issues/genenetwork/markdown-editing-service-not-deployed.gmi b/issues/genenetwork/markdown-editing-service-not-deployed.gmi new file mode 100644 index 0000000..9d72e4e --- /dev/null +++ b/issues/genenetwork/markdown-editing-service-not-deployed.gmi @@ -0,0 +1,39 @@ +# Markdown Editing Service: Not Deployed + +## Tags + +* type: bug +* status: closed, completed, fixed +* assigned: fredm +* priority: critical +* keywords: production, container, tux04 +* interested: alexk, aruni, bonfacem, fredm, pjotrp, zsloan + +## Description + +The Markdown Editing service is not working on production. + +* Link: https://genenetwork.org/facilities/ +* Repository: https://git.genenetwork.org/gn-guile + +Currently, the code is being run directly on the host, rather than inside the container. + +Some important things to note: + +* The service requires access to a checkout of https://github.com/genenetwork/gn-docs +* Currently, the service is hard-coded to use a specific port: we should probably fix that. + +## Reopened: 2024-11-01 + +While the service was deployed, the edit functionality is not working right, specifically, pushing the edits upstream to the remote seems to fail. + +If you do an edit and refresh the page, it will show up in the system, but it will not proceed to be pushed up to the remote. + +Set `CGIT_REPO_PATH="https://git.genenetwork.org/gn-guile"` which seems to allow the commit to work, but we do not actually get the changes pushed to the remote in any useful sense. + +It seems to me, that we need to configure the environment in such a way that it will be able to push the changes to remote. + + +## Close as Completed + +The markdown editing service is deployed and configured correctly. 
diff --git a/issues/genenetwork/python-requests-error-in-container.gmi b/issues/genenetwork/python-requests-error-in-container.gmi new file mode 100644 index 0000000..0289762 --- /dev/null +++ b/issues/genenetwork/python-requests-error-in-container.gmi @@ -0,0 +1,174 @@ +# Python Requests Error in Container + +## Tags + +* type: bug +* assigned: fredm +* priority: critical +* status: closed, completed, fixed +* interested: alexk, aruni, bonfacem, pjotrp, zsloan +* keywords: production, container, tux04, python, requests + +## Description + +Building the container with the +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=eb7beb340a9731775e8ad177e47b70dba2f2a84f upgraded guix definition +leads to python's requests library failing. + +``` +2024-10-30 16:04:13 OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /etc/ssl/certs/ca-certificates.crt +``` + +If you login to the container itself, however, you find that the file `/etc/ssl/certs/ca-certificates.crt` actually exists and has content. + +Possible fixes suggested are to set up correct envvars for the requests library, such as `REQUESTS_CA_BUNDLE` + +See +=> https://requests.readthedocs.io/en/latest/user/advanced/#ssl-cert-verification + +### Troubleshooting Logs + +Try reproducing the issue locally: + +``` +$ guix --version +hint: Consider installing the `glibc-locales' package and defining `GUIX_LOCPATH', along these lines: + + guix install glibc-locales + export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale" + +See the "Application Setup" section in the manual, for more info. + +guix (GNU Guix) 2394a7f5fbf60dd6adc0a870366adb57166b6d8b +Copyright (C) 2024 the Guix authors +License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> +This is free software: you are free to change and redistribute it. +There is NO WARRANTY, to the extent permitted by law. 
+$ +$ guix shell --container --network python python-requests coreutils +[env]$ ls "${GUIX_ENVIRONMENT}/etc" +ld.so.cache profile +``` + +We see from the above that there are no certificates in the environment with just python and python-requests. + +Okay. Now let's write a simple python script to test things out with: + +``` +import requests + +resp = requests.get("https://github.com") +print(resp) +``` + +and run it! + +``` +$ guix shell --container --network python python-requests coreutils -- python3 test.py +Traceback (most recent call last): + File "/tmp/test.py", line 1, in <module> + import requests + File "/gnu/store/b6ny4p29f32rrnnvgx7zz1nhsms2zmqk-profile/lib/python3.10/site-packages/requests/__init__.py", line 164, in <module> + from .api import delete, get, head, options, patch, post, put, request + File "/gnu/store/b6ny4p29f32rrnnvgx7zz1nhsms2zmqk-profile/lib/python3.10/site-packages/requests/api.py", line 11, in <module> + from . import sessions + File "/gnu/store/b6ny4p29f32rrnnvgx7zz1nhsms2zmqk-profile/lib/python3.10/site-packages/requests/sessions.py", line 15, in <module> + from .adapters import HTTPAdapter + File "/gnu/store/b6ny4p29f32rrnnvgx7zz1nhsms2zmqk-profile/lib/python3.10/site-packages/requests/adapters.py", line 81, in <module> + _preloaded_ssl_context.load_verify_locations( +FileNotFoundError: [Errno 2] No such file or directory +``` + +Uhmm, what is this new error? + +Add `nss-certs` and try again. 
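Before retrying, it may help to confirm which CA bundle paths are in play and whether a file actually exists at any of them. The following is only a diagnostic sketch: `candidate_ca_bundles` and `first_existing_bundle` are hypothetical helper names, and the lookup order used here (the `REQUESTS_CA_BUNDLE` variable, then a bundle under `GUIX_ENVIRONMENT`, then the path from the error message) is an assumption made for illustration, not a description of how requests itself resolves the bundle.

```python
import os


def candidate_ca_bundles(environ):
    """Return CA bundle paths worth checking, in an assumed priority order."""
    candidates = []
    # Explicit override, as suggested in the issue description.
    if environ.get("REQUESTS_CA_BUNDLE"):
        candidates.append(environ["REQUESTS_CA_BUNDLE"])
    # Assumed location of a bundle inside a `guix shell` profile.
    if environ.get("GUIX_ENVIRONMENT"):
        candidates.append(os.path.join(
            environ["GUIX_ENVIRONMENT"], "etc/ssl/certs/ca-certificates.crt"))
    # The path named in the OSError message.
    candidates.append("/etc/ssl/certs/ca-certificates.crt")
    return candidates


def first_existing_bundle(environ, exists=os.path.isfile):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidate_ca_bundles(environ):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    print("candidates:", candidate_ca_bundles(os.environ))
    print("usable bundle:", first_existing_bundle(os.environ))
```

Running this inside the container shell shows at a glance whether any of the paths points at a real file, independently of whether `import requests` succeeds.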
+ +``` +$ guix shell --container --network python python-requests nss-certs coreutils +[env]$ ls ${GUIX_ENVIRONMENT}/etc/ssl/ +certs +[env]$ python3 test.py +Traceback (most recent call last): + File "/tmp/test.py", line 1, in <module> + import requests + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/__init__.py", line 164, in <module> + from .api import delete, get, head, options, patch, post, put, request + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/api.py", line 11, in <module> + from . import sessions + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/sessions.py", line 15, in <module> + from .adapters import HTTPAdapter + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/adapters.py", line 81, in <module> + _preloaded_ssl_context.load_verify_locations( +FileNotFoundError: [Errno 2] No such file or directory +[env]$ +[env]$ export REQUESTS_CA_BUNDLE="${GUIX_ENVIRONMENT}/etc/ssl/certs/ca-certificates.crt" +[env]$ $ python3 test.py +Traceback (most recent call last): + File "/tmp/test.py", line 1, in <module> + import requests + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/__init__.py", line 164, in <module> + from .api import delete, get, head, options, patch, post, put, request + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/api.py", line 11, in <module> + from . 
import sessions + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/sessions.py", line 15, in <module> + from .adapters import HTTPAdapter + File "/gnu/store/17dw8qczqqz9fmj2kxzsbfqn730frqd7-profile/lib/python3.10/site-packages/requests/adapters.py", line 81, in <module> + _preloaded_ssl_context.load_verify_locations( +FileNotFoundError: [Errno 2] No such file or directory +``` + +Welp! Looks like this error is a whole different thing. + +Let us try with the genenetwork2 package. + +``` +$ guix shell --container --network genenetwork2 coreutils +[env]$ ls "${GUIX_ENVIRONMENT}/etc" +bash_completion.d jupyter ld.so.cache profile +``` + +This does not seem to have the certificates in place either, so let's add nss-certs + +``` +$ guix shell --container --network genenetwork2 coreutils nss-certs +[env]$ ls "${GUIX_ENVIRONMENT}/etc" +bash_completion.d jupyter ld.so.cache profile ssl +[env]$ python3 test.py +Traceback (most recent call last): + File "/tmp/test.py", line 3, in <module> + resp = requests.get("https://github.com") + File "/gnu/store/qigjz4i0dckbsjbd2has0md2dxwsa7ry-profile/lib/python3.10/site-packages/requests/api.py", line 73, in get + return request("get", url, params=params, **kwargs) + File "/gnu/store/qigjz4i0dckbsjbd2has0md2dxwsa7ry-profile/lib/python3.10/site-packages/requests/api.py", line 59, in request + return session.request(method=method, url=url, **kwargs) + File "/gnu/store/qigjz4i0dckbsjbd2has0md2dxwsa7ry-profile/lib/python3.10/site-packages/requests/sessions.py", line 587, in request + resp = self.send(prep, **send_kwargs) + File "/gnu/store/qigjz4i0dckbsjbd2has0md2dxwsa7ry-profile/lib/python3.10/site-packages/requests/sessions.py", line 701, in send + r = adapter.send(request, **kwargs) + File "/gnu/store/qigjz4i0dckbsjbd2has0md2dxwsa7ry-profile/lib/python3.10/site-packages/requests/adapters.py", line 460, in send + self.cert_verify(conn, request.url, verify, cert) + File 
"/gnu/store/qigjz4i0dckbsjbd2has0md2dxwsa7ry-profile/lib/python3.10/site-packages/requests/adapters.py", line 263, in cert_verify + raise OSError( +OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /etc/ssl/certs/ca-certificates.crt +``` + +We get the expected certificates error! This is good. Now define the envvar and try again. + +``` +[env]$ export REQUESTS_CA_BUNDLE="${GUIX_ENVIRONMENT}/etc/ssl/certs/ca-certificates.crt" +[env]$ python3 test.py +<Response [200]> +``` + +Success!!! + +Adding nss-certs and setting the `REQUESTS_CA_BUNDLE` fixes things. We'll need to do the same for the container, for both the genenetwork2 and genenetwork3 packages (and any other packages that use requests library). + +### Fixes + +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=fec68c4ca87eeca4eb9e69e71fc27e0eae4dd728 +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=c3bb784c8c70857904ef97ecd7d36ec98772413d +The two commits above add nss-certs package to all the flask apps, which make use of the python-requests library, which requires a valid CA certificates bundle in each application's environment. + +=> https://git.genenetwork.org/gn-machines/commit/?h=production-container&id=04506c4496e5ca8b3bc38e28ed70945a145fb036 +The commit above defines the "REQUESTS_CA_BUNDLE" environment variable for all the flask applications that make use of python's requests library. diff --git a/issues/genenetwork/setup-mailing-on-tux04.gmi b/issues/genenetwork/setup-mailing-on-tux04.gmi new file mode 100644 index 0000000..45605d9 --- /dev/null +++ b/issues/genenetwork/setup-mailing-on-tux04.gmi @@ -0,0 +1,16 @@ +# Setup Mailing on Tux04 + +## Tags + +* type: bug +* status: closed +* assigned: fredm +* priority: critical +* interested: pjotrp, zsloan +* keywords: production, container, tux04 + +## Description + +We use emails to verify user accounts and allow changing of user passwords. 
We therefore need to setup a way to send emails from the system. + +I updated the configurations to use UTHSC's mail server diff --git a/issues/genenetwork/umhet3-samples-timing-slow.gmi b/issues/genenetwork/umhet3-samples-timing-slow.gmi new file mode 100644 index 0000000..a3a33a7 --- /dev/null +++ b/issues/genenetwork/umhet3-samples-timing-slow.gmi @@ -0,0 +1,72 @@ +# UM-HET3 Timing: Slow + +## Tags + +* type: bug +* status: open +* assigned: fredm +* priority: critical +* interested: fredm, pjotrp, zsloan +* keywords: production, container, tux04, UM-HET3 + +## Description + +In email from @robw: + +``` +> > Not sure why. Am I testing the wrong way? +> > Are we using memory and RAM in the same way on the two machines? +> > Here are data on the loading time improvement for Tux2: +> > I tested this using a "worst case" trait that we know when—the 25,000 +> > UM-HET3 samples: +> > [1]https://genenetwork.org/show_trait?trait_id=10004&dataset=HET3-ITPPu +> > blish +> > Tux02: 15.6, 15.6, 15.3 sec +> > Fallback: 37.8, 38.7, 38.5 sec +> > Here are data on Gemma speed/latency performance: +> > Also tested "worst case" performance using three large BXD data sets +> > tested in this order: +> > [2]https://genenetwork.org/show_trait?trait_id=10004&dataset=BXD-Longev +> > ityPublish +> > [3]https://genenetwork.org/show_trait?trait_id=10003&dataset=BXD-Longev +> > ityPublish +> > [4]https://genenetwork.org/show_trait?trait_id=10002&dataset=BXD-Longev +> > ityPublish +> > Tux02: 107.2, 329.9 (ouch), 360.0 sec (double ouch) for 1004, 1003, and +> > 1002 respectively. On recompute (from cache) 19.9, 19.9 and 20.0—still +> > too slow. +> > Fallback: 154.1, 115.9 for the first two traits (trait 10002 already in +> > the cache) +> > On recompute (from cache) 59.6, 59.0 and 59.7. Too slow from cache. +> > PROBLEM 2: Tux02 is unable to map UM-HET3. I still get an nginx 413 +> > error: Entity Too Large. +> +> Yeah, Fred should fix that one. 
It is an nginx setting - we run 2x +> nginx. It was reported earlier. +> +> > I need this to work asap. Now mapping our amazing UM-HET3 data. I can +> > use Fallback, but it is painfully slow and takes about 214 sec. I hope +> > Tux02 gets that down to a still intolerable slow 86 sec. +> > Can we please fix and confirm by testing. The Trait is above for your +> > testing pleasure. +> > Even 86 secs is really too slow and should motivate us (or users like +> > me) to think about how we are using all of those 24 ultra-fast cores on +> > the AMD 9274F. Why not put them all to use for us and users. It is not +> > good enough just to have "it work". It has to work in about 5–10 +> > seconds. +> > Here are my questions for you guys: Are we able to use all 24 cores +> > for any one user? How does each user interact with the CPU? Can we +> > handle a class of 24 students with 24 cores, or is it "complicated"? +> > PROBLEM 3: Zach, Fred. Are we computing render time or transport +> > latency correctly? Ideally the printout at the bottom of mapping pages +> > would be true latency as experienced by the user. As far as I can tell +> > with a stop watch our estimates of time are incorrect by as much as 3 +> > secs. And note that the link +> > to [5]http://joss.theoj.org/papers/10.21105/joss.00025 is not working +> > correctly in the footer (see image below). Oddly enough it works fine +> > on Tux02 +> +> Fred, take a note. +``` + +Figure out what this is about and fix it. 
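One way to make timing comparisons like those in the email repeatable is a small harness that runs the same request several times and reports each duration. This is a sketch only: `time_calls` and `summarise` are hypothetical helpers, and the request is passed in as a callable so that any HTTP client can be used. It measures wall-clock time around the call, so it does not include browser rendering time.

```python
import time
from statistics import mean


def time_calls(func, repeats=3, clock=time.perf_counter):
    """Call `func` `repeats` times; return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = clock()
        func()
        durations.append(clock() - start)
    return durations


def summarise(durations):
    """Return (min, mean, max) of the measured durations."""
    return min(durations), mean(durations), max(durations)


if __name__ == "__main__":
    # Stand-in workload; replace with a function that fetches the trait page.
    runs = time_calls(lambda: time.sleep(0.01), repeats=3)
    print("runs: " + ", ".join(f"{d:.3f}" for d in runs))
    print("min/mean/max: %.3f / %.3f / %.3f" % summarise(runs))
```

Reporting every run, rather than a single number, makes cold-cache and warm-cache timings easier to tell apart.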
diff --git a/issues/genenetwork/virtuoso-shutdown-clears-data.gmi b/issues/genenetwork/virtuoso-shutdown-clears-data.gmi new file mode 100644 index 0000000..2e01238 --- /dev/null +++ b/issues/genenetwork/virtuoso-shutdown-clears-data.gmi @@ -0,0 +1,98 @@ +# Virtuoso: Shutdown Clears Data + +## Tags + +* type: bug +* assigned: fredm +* priority: critical +* status: closed, completed +* interested: bonfacem, pjotrp, zsloan +* keywords: production, container, tux04, virtuoso + +## Description + +It seems that virtuoso has the bad habit of clearing data whenever it is stopped/restarted. + +This issue will track the work necessary to get the service behaving correctly. + +According to the documentation on +=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader the bulk loading process + +``` +The bulk loader also disables checkpointing and the scheduler, which also need to be re-enabled post bulk load +``` + +That needs to be handled. + +### Notes + +After having a look at +=> https://docs.openlinksw.com/virtuoso/ch-server/#databaseadmsrv the configuration documentation +it occurs to me that the reason virtuoso supposedly clears the data is that the `DatabaseFile` value is not set, so it defaults to a new database file every time the server is restarted (See also the `Striping` setting). + +### Troubleshooting + +Reproduce locally: + +We begin by getting a look at the settings for the remote virtuoso +``` +$ ssh tux04 +fredm@tux04:~$ cat /gnu/store/bg6i4x96nm32gjp4qhphqmxqc5vggk3h-virtuoso.ini +[Parameters] +ServerPort = localhost:8981 +DirsAllowed = /var/lib/data +NumberOfBuffers = 4000000 +MaxDirtyBuffers = 3000000 +[HTTPServer] +ServerPort = localhost:8982 +``` + +Copy these into a file locally, and adjust the `NumberOfBuffers` and `MaxDirtyBuffers` for smaller local dev environment. Also update `DirsAllowed`. 
+ +We end up with our local configuration in `~/tmp/virtuoso/etc/virtuoso.ini` with the content: + +``` +[Parameters] +ServerPort = localhost:8981 +DirsAllowed = /var/lib/data +NumberOfBuffers = 10000 +MaxDirtyBuffers = 6000 +[HTTPServer] +ServerPort = localhost:8982 +``` + +Run virtuoso! +``` +$ cd ~/tmp/virtuoso/var/lib/virtuoso/ +$ ls +$ ~/opt/virtuoso/bin/virtuoso-t +foreground +configfile ~/tmp/virtuoso/etc/virtuoso.ini +``` + +Here we start by changing into the `~/tmp/virtuoso/var/lib/virtuoso/` directory, which is where virtuoso will put its state. Now, in a different terminal, list the files created in the state directory: + +``` +$ ls ~/tmp/virtuoso/var/lib/virtuoso +virtuoso.db virtuoso.lck virtuoso.log virtuoso.pxa virtuoso.tdb virtuoso.trx +``` + +That creates the database file (and other files) with the documented default values, i.e. `virtuoso.*`. + +We cannot quite reproduce the issue locally, since every reboot will have exactly the same value for the files locally. + +Checking the state directory for virtuoso on tux04, however: + +``` +fredm@tux04:~$ sudo ls -al /export2/guix-containers/genenetwork/var/lib/virtuoso/ | grep '\.db$' +-rw-r--r-- 1 986 980 3787456512 Oct 28 14:16 js1b7qjpimdhfj870kg5b2dml640hryx-virtuoso.db +-rw-r--r-- 1 986 980 4152360960 Oct 28 17:11 rf8v0c6m6kn5yhf00zlrklhp5lmgpr4x-virtuoso.db +``` + +We see that there are multiple db files, each created when virtuoso was restarted. There is an extra (possibly) random string prepended to the `virtuoso.db` part. This happens for our service if we do not actually provide the `DatabaseFile` configuration. 
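A quick way to catch this class of misconfiguration is to check whether a generated `virtuoso.ini` names its database file explicitly. The sketch below uses Python's standard `configparser`. It assumes that `DatabaseFile` belongs in a `[Database]` section of the ini file, and `has_explicit_database_file` is a hypothetical helper name; the sample text is the tux04 configuration quoted earlier in this issue.

```python
import configparser


def has_explicit_database_file(ini_text, section="Database", option="DatabaseFile"):
    """Return True if the ini text sets `option` in `section` to a non-empty value."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written in the file
    parser.read_string(ini_text)
    return parser.has_option(section, option) and bool(
        parser.get(section, option).strip())


# The tux04 configuration quoted earlier: there is no [Database] section at all.
CURRENT_INI = """\
[Parameters]
ServerPort = localhost:8981
DirsAllowed = /var/lib/data
NumberOfBuffers = 4000000
MaxDirtyBuffers = 3000000
[HTTPServer]
ServerPort = localhost:8982
"""

if __name__ == "__main__":
    print(has_explicit_database_file(CURRENT_INI))
```

A check like this could run against the rendered configuration before the service starts, so that a missing setting is noticed before a restart creates a fresh database file.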
+ + +## Fixes + +=> https://github.com/genenetwork/gn-gemtext-threads/commit/8211c1e49498ba2f3b578ed5b11b15c52299aa08 Document how to restart checkpointing and the scheduler after bulk loading +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=2dc335ca84ea7f26c6977e6b432f3420b113f0aa Add configs for scheduler and checkpointing +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=7d793603189f9d41c8ee87f8bb4c876440a1fce2 Set up virtuoso database configurations +=> https://git.genenetwork.org/gn-machines/commit/?id=46a1c4c8d01198799e6ac3b99998dca40d2c7094 Explicitly name virtuoso database files. diff --git a/issues/genenetwork2-account-registration-error.gmi b/issues/genenetwork2-account-registration-error.gmi index d617f93..14b6322 100644 --- a/issues/genenetwork2-account-registration-error.gmi +++ b/issues/genenetwork2-account-registration-error.gmi @@ -5,7 +5,7 @@ * type: bug * priority: critical * assigned: zachs, zsloan, fredm -* status: open +* status: closed, completed * keywords: genenetwork2, account management, user, registration ## Description diff --git a/issues/genenetwork2-cd-sometimes-fails-to-restart.gmi b/issues/genenetwork2-cd-sometimes-fails-to-restart.gmi index d2d2013..603de59 100644 --- a/issues/genenetwork2-cd-sometimes-fails-to-restart.gmi +++ b/issues/genenetwork2-cd-sometimes-fails-to-restart.gmi @@ -10,4 +10,7 @@ A reminder that CD logs are publicly accessible on tux02. => /topics/cd-logs ## Resolution + This issue has been re-opened. Originally, we believed that the restart failures were due to occasional breakage in GN code, and were not a problem with the CI/CD system itself. This will need further investigation to figure out what the root cause is. 
+ +* closed diff --git a/issues/genenetwork2/broken-collections-features.gmi b/issues/genenetwork2/broken-collections-features.gmi new file mode 100644 index 0000000..4239929 --- /dev/null +++ b/issues/genenetwork2/broken-collections-features.gmi @@ -0,0 +1,44 @@ +# Broken Collections Features + +## Tags + +* type: bug +* status: open +* priority: high +* assigned: zachs, fredm +* keywords: gn2, genenetwork2, genenetwork 2, collections + +## Descriptions + +There are some features in the search results page and/or the collections page that are broken. These are: + +* "CTL" feature +* "MultiMap" feature +* "Partial Correlations" feature +* "Generate Heatmap" feature + +### Reproduce Issue + +* Go to https://genenetwork.org +* Select "Mouse (Mus musculus, mm10)" for "Species" +* Select "BXD Family" for "Group" +* Select "Traits and Cofactors" for "Type" +* Select "BXD Published Phenotypes" for "Dataset" +* Type "locomotion" in the "Get Any" field (without the quotes) +* Click "Search" +* In the results page, select the traits with the following "Record" values: "BXD_10050", "BXD_10051", "BXD_10088", "BXD_10091", "BXD_10092", "BXD_10455", "BXD_10569", "BXD_10570", "BXD_11316", "BXD_11317" +* Click the "Add" button and add them to a new collection +* In the resulting collections page, click the button for any of the listed failing features above + +### Failure modes + +* The "CTL" and "WGCNA" features have a failure mode that might have been caused by recent changes making use of AJAX calls, rather than submitting the form manually. +* The "MultiMap" and "Generate Heatmap" features raise exceptions that need to be investigated and resolved +* The "Partial Correlations" feature seems to run forever + +## Break-out Issues + +We break out the issues above into separate pages to track the progress of the fixes for each feature separately. 
+ +=> /issues/genenetwork3/ctl-maps-error +=> /issues/genenetwork3/generate-heatmaps-failing diff --git a/issues/genenetwork2/fix-display-for-time-consumed-for-correlations.gmi b/issues/genenetwork2/fix-display-for-time-consumed-for-correlations.gmi new file mode 100644 index 0000000..0c8e9c8 --- /dev/null +++ b/issues/genenetwork2/fix-display-for-time-consumed-for-correlations.gmi @@ -0,0 +1,15 @@ +# Fix Display for the Time Consumed for Correlations + +## Tags + +* type: bug +* status: closed, completed +* priority: low +* assigned: @alexm, @bonz +* keywords: gn2, genenetwork2, genenetwork 2, gn3, genenetwork3 genenetwork 3, correlations, time display + +## Description + +The breakdown of the time consumed for the correlations computations, displayed at the bottom of the page, is not representative of reality. The time that GeneNetwork3 (or background process) takes for the computations is not actually represented in the breakdown, leading to wildly inaccurate displays of total time. + +This will need to be fixed. diff --git a/issues/genenetwork/genenetwork2_configurations.gmi b/issues/genenetwork2/genenetwork2_configurations.gmi index 7d08db0..4ba0a89 100644 --- a/issues/genenetwork/genenetwork2_configurations.gmi +++ b/issues/genenetwork2/genenetwork2_configurations.gmi @@ -4,7 +4,7 @@ * assigned: fredm * priority: normal -* status: open +* status: closed, obsoleted * keywords: configuration, config, gn2, genenetwork, genenetwork2 * type: bug @@ -72,3 +72,10 @@ For `wqflask/run_gunicorn.py`, the route can remain as is, since this is an entr ### Non-Executable Configuration Files Eschew executable formats (*.py) for configuration files and prefer non-executable formats e.g. *.cfg, *.json, *.conf etc + + +## Closed as obsoleted + +I am closing this issue as obsoleted, since a lot of things have changed since this issue was set up. The `bin/genenetwork2` script no longer exists and most of the paths mentioned have changed. 
+ +The configuration issue(s) mentioned above still abound, but the changes will have to be incremental to avoid breaking the system. diff --git a/issues/genenetwork2/haley-knott-regression-mapping-error.gmi b/issues/genenetwork2/haley-knott-regression-mapping-error.gmi new file mode 100644 index 0000000..25bb221 --- /dev/null +++ b/issues/genenetwork2/haley-knott-regression-mapping-error.gmi @@ -0,0 +1,80 @@ +# Haley-Knott Regression Mapping Error + +## Tags + +* type: bug +* status: closed, completed +* priority: high +* assigned: fredm +* keywords: gn2, genenetwork2, genenetwork 2, mapping, haley-knott + +## Description + +To run the mapping: + +* Do a search +* Click on any trait in the results +* On the trait page, expand the "Mapping Tools" section +* Select the "Haley-Knott Regression" option under "Mapping Tools" +* Click "Compute" + +On running the mapping as above, we got the following error: + +``` + GeneNetwork 2.11-rc2 https://gn2-fred.genenetwork.org/run_mapping ( 6:14AM UTC Sep 11, 2024) +Traceback (most recent call last): + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/flask/app.py", line 1523, in full_dispatch_request + rv = self.dispatch_request() + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/flask/app.py", line 1509, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args) + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/gn2/wqflask/views.py", line 1004, in mapping_results_page + gn1_template_vars = display_mapping_results.DisplayMappingResults( + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/gn2/wqflask/marker_regression/display_mapping_results.py", line 651, in __init__ + self.perm_filename = self.drawPermutationHistogram() + File 
"/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/gn2/wqflask/marker_regression/display_mapping_results.py", line 3056, in drawPermutationHistogram + Plot.plotBar(myCanvas, perm_output, XLabel=self.LRS_LOD, + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/gn2/utility/Plot.py", line 184, in plotBar + scaleFont = ImageFont.truetype(font=COUR_FILE, size=11) + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/PIL/ImageFont.py", line 959, in truetype + return freetype(font) + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/PIL/ImageFont.py", line 956, in freetype + return FreeTypeFont(font, size, index, encoding, layout_engine) + File "/gnu/store/hgcvlkn4bjl0f9wqiakpk5w66brbfxk6-profile/lib/python3.10/site-packages/PIL/ImageFont.py", line 247, in __init__ + self.font = core.getfont( +OSError: cannot open resource +``` + +### Hypothesis + +My hypothesis is that the use of relative paths[fn:1] is the cause of the failure. + +When running the application with the working directory being the root of the GeneNetwork2 repository, use of the relative paths works well. Unfortunately, that assumption breaks quickly if the application is ever run outside of the root of the GN2 repo. + +Verification: + +*Question*: Does the application run on root of GN2 repository/package? + +* Log out the path of the font file and use the results to answer the question +* https://github.com/genenetwork/genenetwork2/commit/ca8018a61f2e014b4aee4da2cbd00d7b591b2f6a +* https://github.com/genenetwork/genenetwork2/commit/01d56903ba01a91841d199fe393f9b307a7596a2 + +*Answer*: No! 
The application does not run with the working directory on the root of the GN2 repository/package, as evidenced by this snippet from the logs: + +``` +2024-09-11 07:41:13 [2024-09-11 07:41:13 +0000] [494] [DEBUG] POST /run_mapping +2024-09-11 07:41:18 [2024-09-11 07:41:18 +0000] [494] [DEBUG] Font file path: /gn2/wqflask/static/fonts/courbd.ttf +2024-09-11 07:41:18 DEBUG:gn2.wqflask:Font file path: /gn2/wqflask/static/fonts/courbd.ttf +2024-09-11 07:41:18 [2024-09-11 07:41:18 +0000] [494] [ERROR] https://gn2-fred.genenetwork.org/run_mapping ( 7:41AM UTC Sep 11, 2024) +2024-09-11 07:41:18 Traceback (most recent call last): +``` + +We see from this that the application seems to be running with the working directory being "/" rather than the root for the application's package files. + +### Fixes + +* https://github.com/genenetwork/genenetwork2/commit/d001c1e7cae8f69435545b8715038b1d0fc1ee62 +* https://git.genenetwork.org/guix-bioinformatics/commit/?id=7a1bf5bc1c3de67f01eabd23e1ddc0150f81b22b + +# Footnotes + +[fn:1] https://github.com/genenetwork/genenetwork2/blob/50fc0b4bc4106164745afc7e1099bb150f6e635f/gn2/utility/Plot.py#L44-L46 diff --git a/issues/genenetwork2/handle-oauth-errors-better.gmi b/issues/genenetwork2/handle-oauth-errors-better.gmi new file mode 100644 index 0000000..77ad7ad --- /dev/null +++ b/issues/genenetwork2/handle-oauth-errors-better.gmi @@ -0,0 +1,21 @@ +# Handle OAuth Errors Better + +## Tags + +* type: bug +* status: closed, completed +* priority: high +* assigned: fredm +* interested: zachs, robw +* keywords: gn2, genenetwork2, ui, user interface, oauth, oauth errors + +## Description + +When a session expires, for whatever reason, a notification is displayed to the user as shown in the image below: +=> ./session_expiry_oauth_error.png + +The message is a little jarring to the end user. Make it gentler, and probably more informative, so the user is not as surprised. + +## Close as complete + +This should be fixed at this point. 
Closing this as complete. diff --git a/issues/genenetwork2/mapping-error.gmi b/issues/genenetwork2/mapping-error.gmi new file mode 100644 index 0000000..7e7d0a7 --- /dev/null +++ b/issues/genenetwork2/mapping-error.gmi @@ -0,0 +1,66 @@ +# Mapping Error + +## Tags + +* type: bug +* status: closed +* priority: medium +* assigned: zachs, fredm, flisso +* keywords: gn2, genenetwork2, genenetwork 2, mapping + +## Reproduction + +* Go to https://staging.genenetwork.org/ +* For 'Species' select "Arabidopsis (Arabidopsis thaliana, araTha1)" +* For 'Group' select "BayXSha(RIL by sib-mating)" +* For 'Type' select "arabidopsis seeds" +* For 'Dataset' select "Arabidopsis BayXShaXRIL_expr_reg _ATH1" +* Leave 'Get Any' blank +* Enter "*" for "Combined" +* Click "Search" +* On the search results page, click on "AT1G01010" +* Expand the "Mapping Tools" section +* For 'Chromosome' select "All" +* For 'Minor Allele ≥' enter "0.05" +* For 'Use LOCO' select "Yes" +* Ignore covariates +* Click "Compute" + +### Expected + +The system would compute the maps and display the mapping diagram(s) and data. + +### Actual + +The computation fails with: + +``` + GeneNetwork 2.11-rc2 https://staging.genenetwork.org/loading ( 6:50PM UTC Jul 03, 2024) +Traceback (most recent call last): + File "/gnu/store/jsvqai0gz6fn40k7kx3r12yq4hzfini6-profile/lib/python3.10/site-packages/flask/app.py", line 1523, in full_dispatch_request + rv = self.dispatch_request() + File "/gnu/store/jsvqai0gz6fn40k7kx3r12yq4hzfini6-profile/lib/python3.10/site-packages/flask/app.py", line 1509, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args) + File "/gnu/store/jsvqai0gz6fn40k7kx3r12yq4hzfini6-profile/lib/python3.10/site-packages/gn2/wqflask/views.py", line 812, in loading_page + for sample in samples: +TypeError: 'NoneType' object is not iterable +``` + +### Updates + +This is likely just because the genotype file doesn't exist in the necessary format (BIMBAM). 
We probably need to convert the R/qtl2 genotypes to BIMBAM. + +## Stalled + +This is currently stalled, until we can upload genotypes via the uploader. + + +## Notes + +### 2025-12-31 + +I am closing this issue as WONTFIX because of the following reasons: + +- Better fix is to prevent mapping in the first place, if no genotypes exist for the given trait(s) +- Issue relies on non-implemented feature (Genotypes upload) to fix it +- Issue does not exist on production diff --git a/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi b/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi new file mode 100644 index 0000000..e38f653 --- /dev/null +++ b/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi @@ -0,0 +1,22 @@ +# mechanical-rob: Add Partial Correlations Tests + +## Tags + +* assigned: fredm +* priority: medium +* status: open +* keywords: genenetwork2, gn2, mechanical-rob, partial correlations, tests, regression +* type: enhancement + +## Description + +Add regression tests to verify that the partial correlations feature still works +as expected. 
+ +### TODOS + +- [-] Tests for "entry-point" page +- [x] Tests for partial correlation using Pearson's R against select traits +- [ ] Tests for partial correlation using Spearman's Rho against select traits +- [ ] Tests for partial correlation using Pearson's R against an entire dataset +- [ ] Tests for partial correlation using Spearman's Rho against an entire dataset diff --git a/issues/genenetwork2/refresh-token-failure.gmi b/issues/genenetwork2/refresh-token-failure.gmi new file mode 100644 index 0000000..c488820 --- /dev/null +++ b/issues/genenetwork2/refresh-token-failure.gmi @@ -0,0 +1,111 @@ +# Refresh Token Failure + +## Tags + +* status: closed, obsoleted +* priority: high +* type: bug +* assigned: fredm, zsloan, zachs +* keywords: gn2, genenetwork2 + +## Description + +* Go to https://genenetwork.org +* Click "Sign in" and sign in to the application +* Wait 15 minutes +* Close the entire browser +* Open the browser and go to https://genenetwork.org +* Observe the "ERROR" message at the "Collections" link's badge + +The expectation is that the Collections badge would show the number of collections the user has, rather than the error message. 
+ +The logs fail with an 'invalid_client' error: + +``` +2025-01-08 20:48:56 raise self.oauth_error_class( +2025-01-08 20:48:56 authlib.integrations.base_client.errors.OAuthError: invalid_client: +2025-01-08 20:48:56 ERROR:gn2.wqflask:Error loading number of collections +2025-01-08 20:48:56 Traceback (most recent call last): +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/gn2/wqflask/__init__.py", +line 55, in numcoll +2025-01-08 20:48:56 return num_collections() +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/gn2/wqflask/oauth2/collect +ions.py", line 13, in num_collections +2025-01-08 20:48:56 all_collections = all_collections + oauth2_get( +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/gn2/wqflask/oauth2/client. +py", line 168, in oauth2_get +2025-01-08 20:48:56 resp = oauth2_client().get( +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/requests/sessions.py", lin +e 600, in get +2025-01-08 20:48:56 return self.request("GET", url, **kwargs) +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/authlib/integrations/reque +sts_client/oauth2_session.py", line 109, in request +2025-01-08 20:48:56 return super(OAuth2Session, self).request( +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/requests/sessions.py", lin +e 573, in request +2025-01-08 20:48:56 prep = self.prepare_request(req) +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/requests/sessions.py", lin +e 484, in prepare_request +2025-01-08 20:48:56 p.prepare( +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/requests/models.py", line +372, in 
prepare +2025-01-08 20:48:56 self.prepare_auth(auth, url) +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/requests/models.py", line +603, in prepare_auth +2025-01-08 20:48:56 r = auth(self) +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/authlib/integrations/reque +sts_client/oauth2_session.py", line 24, in __call__ +2025-01-08 20:48:56 self.ensure_active_token() +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/authlib/integrations/reque +sts_client/oauth2_session.py", line 20, in ensure_active_token +2025-01-08 20:48:56 if self.client and not self.client.ensure_active_token(self.token): +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/authlib/oauth2/client.py", + line 262, in ensure_active_token +2025-01-08 20:48:56 self.refresh_token(url, refresh_token=refresh_token) +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/authlib/oauth2/client.py", + line 252, in refresh_token +2025-01-08 20:48:56 return self._refresh_token( +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/authlib/oauth2/client.py", + line 373, in _refresh_token +2025-01-08 20:48:56 token = self.parse_response_token(resp) +2025-01-08 20:48:56 File "/gnu/store/3n1cl5cxal3qk7p9q363qgm2ag45a177-profile/lib/python3.10/site-packages/authlib/oauth2/client.py", + line 340, in parse_response_token +2025-01-08 20:48:56 raise self.oauth_error_class( +2025-01-08 20:48:56 authlib.integrations.base_client.errors.OAuthError: invalid_client: +``` + + +### Troubleshooting + +The following commits were done as part of the troubleshooting: + +=> https://github.com/genenetwork/genenetwork2/commit/55da5809d851a3c8bfa13637947b019a2c02cc93 +=> 
https://git.genenetwork.org/guix-bioinformatics/commit/?id=d1cada0f0933732eb68b7786fb04ea541d8c51c9 +=> https://github.com/genenetwork/genenetwork2/commit/93dd7f7583af4e0bdd3c7b9c88d375fdc4b40039 +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=5fe04ca1545f740cbb91474576891c7fd1dff13a +=> https://github.com/genenetwork/genenetwork2/commit/2031da216f3b62c23dca64eb6d1c533c07dc81f1 +=> https://github.com/genenetwork/genenetwork2/commit/125c436f5310b194c10385ce9d81135518ac0adf +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=758e6f0fbf6af4af5b94b9aa5a9264c31f050153 +=> https://github.com/genenetwork/genenetwork2/commit/8bf483a3ab23ebf25d73380e78271c368ff06b2d +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=f1ee97a17e670b12112d48bea8969e2ee162f808 +=> https://github.com/genenetwork/genenetwork2/commit/de01f83090184fc56dce2f9887d2dc910edc60fe +=> https://github.com/genenetwork/genenetwork2/commit/91017b97ee346e73bed9b77e3f3f72daa4acbacd +=> https://github.com/genenetwork/genenetwork2/commit/7e6bfe48167c70d26e27b043eb567608bc1fda84 +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=1f71a1e78af87266e7a4170ace8860111a1569d6 +=> https://github.com/genenetwork/genenetwork2/commit/9bdc8ca0b17739c1df9dc504f8cd978296b987dd +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=02a9a99e7e3c308157f7d740a244876ab4196337 +=> https://github.com/genenetwork/genenetwork2/commit/236a48835dc6557ba0ece6aef6014f496ddb163e +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=f928be361d2e331d72448416300c331e47341807 +=> https://github.com/genenetwork/genenetwork2/commit/5fb56c51ad4eaff13a7e24b6022dffb7d82aa41d +=> https://github.com/genenetwork/genenetwork2/commit/c6c9ef71718d650f9c19ae459d6d4e25e72de00a +=> https://github.com/genenetwork/genenetwork2/commit/dc606f39fb4aad74004959a6a15e481fa74d52ff +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=4ab597b734968916af5bae6332756af8168783b3 +=> 
https://github.com/genenetwork/genenetwork2/commit/854639bd46293b6791c629591fd934d1f34038ac +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=7e0083555150d151e566cebed4bd82d69e347eb6 +=> https://github.com/genenetwork/genenetwork2/commit/c4508901027a2d3ea98e1e9b3f8767a455cad02f +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=955e4ce9370be9811262d7c73fa5398385cc04d8 + + +# Closed as Obsolete + +We no longer rely on refresh tokens. This issue is no longer present. diff --git a/issues/genenetwork2/remove-bin-genenetwork2-script.gmi b/issues/genenetwork2/remove-bin-genenetwork2-script.gmi new file mode 100644 index 0000000..da11be7 --- /dev/null +++ b/issues/genenetwork2/remove-bin-genenetwork2-script.gmi @@ -0,0 +1,114 @@ +# Remove `bin/genenetwork2` Script + +## Tags + +* type: improvement +* status: closed, completed +* priority: medium +* assigned: fredm, bonfacem, alexm, zachs +* interested: pjotrp, aruni +* keywords: gn2, bin/genenetwork2, startup script + +## Description + +The `bin/genenetwork2` script was used for a really long time to launch Genenetwork2, and has served that purpose with honour and dedication. We applaud that. + +It is, however, time to retire the script, since at this point in time, it serves more to obfuscate the startup than as a helpful tool. + +On production, we have all but abandoned the use of the script, and we need to do the same for CI/CD, and eventually, development. + +This issue tracks the process, and problems that come up during the move to retire the script.
+ +### Process + +* [x] Identify how to run unit tests without the script +* [x] Document how to run unit tests without the script +* [x] Identify how to run mechanical-rob tests without the script +* [x] Document how to run mechanical-rob tests without the script +* [x] Update CI/CD definitions to get rid of the references to the script +* [x] Delete the script from the repository + +### Setup + +First, we need to set up the following mandatory environment variables: + +* GN2_PROFILE +* GN2_SETTINGS +* JS_GUIX_PATH +* GEMMA_COMMAND +* PLINK_COMMAND +* GEMMA_WRAPPER_COMMAND +* REQUESTS_CA_BUNDLE + +Within a guix shell, you could do that with something like: + +``` +export GN2_PROFILE="${GUIX_ENVIRONMENT}" +export GN2_SETTINGS="/home/frederick/genenetwork/gn2_settings.conf" +export JS_GUIX_PATH="${GN2_PROFILE}/share/genenetwork2/javascript" +export GEMMA_COMMAND="${GN2_PROFILE}/bin/gemma" +export PLINK_COMMAND="${GN2_PROFILE}/bin/plink2" +export GEMMA_WRAPPER_COMMAND="${GN2_PROFILE}/bin/gemma-wrapper" +export REQUESTS_CA_BUNDLE="${GUIX_ENVIRONMENT}/etc/ssl/certs/ca-certificates.crt" +``` + +Note that you can define all the variables derived from "GN2_PROFILE" in your settings file, if such a settings file is computed. + +### Running Unit Tests + +To run unit tests, run pytest at the root of the repository. + +``` +$ cd /path/to/genenetwork2 +$ pytest +``` + +### Running "mechanical-rob" Tests + +At the root of the repository, run something like: + +``` +python test/requests/test-website.py --all http://localhost:5033 +``` + +Change the port, as appropriate.
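Before running either test suite, it may help to confirm that the mandatory variables from the "Setup" section are actually present. The following is a minimal sketch and not part of the repository: the helper name `missing_envvars` is hypothetical, and only the variable names are taken from the list above.

```python
import os

# Mandatory variables, as listed in the "Setup" section.
MANDATORY_ENVVARS = (
    "GN2_PROFILE",
    "GN2_SETTINGS",
    "JS_GUIX_PATH",
    "GEMMA_COMMAND",
    "PLINK_COMMAND",
    "GEMMA_WRAPPER_COMMAND",
    "REQUESTS_CA_BUNDLE",
)


def missing_envvars(environ=None, names=MANDATORY_ENVVARS):
    """Return the names in `names` that are unset or empty in `environ`.

    `missing_envvars` is a hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]
```

Calling `missing_envvars()` inside the guix shell should return an empty list once all the exports shown earlier are in place.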
+ + +### Launching Application + +In addition to the minimum set of envvars defined in the "Setup" section above, we need the following variables defined to get the application to launch: + +* FLASK_APP + +In a guix shell, you could do: + +``` +export FLASK_APP="gn2.wsgi" +``` + +Now you can launch the application with flask with something like: + +``` +flask run --port=5033 --with-threads +``` + +or with green unicorn with something like: + +``` +gunicorn --reload \ + --workers 3 \ + --timeout 1200 \ + --log-level="debug" \ + --keep-alive 6000 \ + --max-requests 10 \ + --bind="127.0.0.1:5033" \ + --max-requests-jitter 5 \ + gn2.wsgi:application +``` + +You can change the gunicorn setting to fit your scenario. + + +## Close as completed + +The script has been deleted. diff --git a/issues/genenetwork2/session_expiry_oauth_error.png b/issues/genenetwork2/session_expiry_oauth_error.png new file mode 100644 index 0000000..34e2dda --- /dev/null +++ b/issues/genenetwork2/session_expiry_oauth_error.png Binary files differ diff --git a/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip b/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip new file mode 100644 index 0000000..7681b88 --- /dev/null +++ b/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip Binary files differ diff --git a/issues/genenetwork3/broken-aliases.gmi b/issues/genenetwork3/broken-aliases.gmi new file mode 100644 index 0000000..2bfbdae --- /dev/null +++ b/issues/genenetwork3/broken-aliases.gmi @@ -0,0 +1,188 @@ +# Broken Aliases + +## Tags + +* type: bug +* status: open +* priority: high +* assigned: pjotrp +* interested: pjotrp +* keywords: aliases, aliases server + +## Tasks + +* [X] Rewrite server in gn-guile +* [X] Fix menu search +* [X] Fix global search aliases +* [ ] Deploy and test aliases in GN2 + +## Repository + +=> https://github.com/genenetwork/gn3 + +moved to + +gn-guile repo.
+ +## Bug Report + +### Actual + +* Go to https://genenetwork.org/gn3/gene/aliases2/Shh,Brca2 +* Note that an exception is raised, with a "404 Not Found" message + +### Expected + +* We expected a list of aliases to be returned for the given symbols as is done in https://fallback.genenetwork.org/gn3/gene/aliases2/Shh,Brca2 + +## Resolution + +Actually the server is up, but it is not part of the main deployment because it is written in Racket - and we don't have much support in Guix. I wrote the code the days after my bike accident: + +=> https://github.com/genenetwork/gn3/blob/master/gn3/web/wikidata.rkt + +and it is probably easiest to move it to gn-guile. Guile is another Scheme after all ;). Only fitting I spent days in hospital only recently (for a different reason). gn-guile already has its own web server and provides a REST API for our markdown editor, for example. On tux04 it responds with + +``` +curl http://127.0.0.1:8091/version +"4.0.0" +``` + +What we want is to add the aliases server that should respond to + +``` +curl http://localhost:8000/gene/aliases/Shh # direct on tux01 +["9530036O11Rik","Dsh","Hhg1","Hx","Hxl3","M100081","ShhNC","ShhNC"] +curl https://genenetwork.org/gn3/gene/aliases2/Shh,Brca2 +[["Shh",["9530036O11Rik","Dsh","Hhg1","Hx","Hxl3","M100081","ShhNC","ShhNC"]],["Brca2",["Fancd1","RAB163"]]] +``` + +Note this is used by search functionality in GN, as well as the gene aliases list on the mapping page. In principle we cache it for the duration of the running server so as not to overload wikidata. No one uses aliases2, that I can tell, so we only implement the first 'aliases'. + +Note the wikidata interface has been stable all this time. That is good. + +Turns out we already use wikidata in the gn-guile implementation for fetching the wikidata id for a species (as part of metadata retrieval). I wrote that about two years ago as part of the REST API expansion. 
+ +Unfortunately + +``` +(sparql-scm (wd-sparql-endpoint-url) (wikidata-gene-alias "Q24420953")) +``` + +throws a 403 forbidden error. + +This however works: + +``` +scheme@(gn db sparql) [15]> (sparql-wd-species-info "Q83310") +;;; ("https://query.wikidata.org/sparql?query=%0ASELECT%20DISTINCT%20%3Ftaxon%20%3Fncbi%20%3Fdescr%20where%20%7B%0A%20%20%20%20wd%3AQ83310%20wdt%3AP225%20%3Ftaxon%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP685%20%3Fncbi%20%3B%0A%20%20%20%20%20%20schema%3Adescription%20%3Fdescr%20.%0A%20%20%20%20%3Fspecies%20wdt%3AP685%20%3Fncbi%20.%0A%20%20%20%20FILTER%20%28lang%28%3Fdescr%29%3D%27en%27%29%0A%7D%20limit%205%0A%0A") +$11 = "?taxon\t?ncbi\t?descr\n\"Mus musculus\"\t\"10090\"\t\"species of mammal\"@en\n" +``` + +(if you can see the mouse ;). + +Ah, this works + +``` +scheme@(gn db sparql) [17]> (sparql-tsv (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" )) +;;; ("https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fwikidata_id%0A%20%20%20%20%20%20%20%20%20%20%20%20WHERE%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20wdt%3AP31%20wd%3AQ7187%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP703%20%3Fspecies%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20VALUES%20%28%3Fspecies%29%20%7B%20%28wd%3AQ15978631%20%29%20%28%20wd%3AQ83310%20%29%20%28%20wd%3AQ184224%20%29%20%7D%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20rdfs%3Alabel%20%22Shh%22%40en%20.%0A%20%20%20%20%20%20%20%20%7D%0A") +$12 = "?wikidata_id\n<http://www.wikidata.org/entity/Q14860079>\n<http://www.wikidata.org/entity/Q24420953>\n" +``` + +But this does not + +``` +scheme@(gn db sparql) [17]> (sparql-scm (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" )) +ice-9/boot-9.scm:1685:16: In procedure raise-exception: +In procedure utf8->string: Wrong type argument in position 1 (expecting bytevector): "<html>\r\n<head><title>403 
Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx/1.18.0</center>\r\n</body>\r\n</html>\r\n" +``` + +Going via tsv does work + +``` +scheme@(gn db sparql) [18]> (tsv->scm (sparql-tsv (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" ))) + +;;; ("https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fwikidata_id%0A%20%20%20%20%20%20%20%20%20%20%20%20WHERE%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20wdt%3AP31%20wd%3AQ7187%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP703%20%3Fspecies%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20VALUES%20%28%3Fspecies%29%20%7B%20%28wd%3AQ15978631%20%29%20%28%20wd%3AQ83310%20%29%20%28%20wd%3AQ184224%20%29%20%7D%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20rdfs%3Alabel%20%22Shh%22%40en%20.%0A%20%20%20%20%20%20%20%20%7D%0A") +$13 = ("?wikidata_id") +$14 = (("<http://www.wikidata.org/entity/Q14860079>") ("<http://www.wikidata.org/entity/Q24420953>")) +``` + +that is nice enough. + +We now got a working alias server that is part of gn-guile. E.g. + +``` +curl http://127.0.0.1:8091/gene/aliases/Brca2 +["breast cancer 2","breast cancer 2, early onset","Fancd1","RAB163","BRCA2, DNA repair associated"] +``` + +it is part of gn-guile. gn-guile also has the 'commit/' handler by Alex, documented as +'curl -X POST http://127.0.0.1:8091/commit' in git-markdown-editor.md. Let's see how that is wired up. The web interface is at, for example, +https://genenetwork.org/editor/edit?file-path=general/help/facilities.md. 
Part of gn2's + +``` +gn2/wqflask/views.py +398:@app.route("/editor/edit", methods=["GET"]) +408:@app.route("/editor/settings", methods=["GET"]) +414:@app.route("/editor/commit", methods=["GET", "POST"]) +``` + +which has the code + +``` +@app.route("/editor/edit", methods=["GET"]) +@require_oauth2 +def edit_gn_doc_file(): + file_path = urllib.parse.urlencode( + {"file_path": request.args.get("file-path", "")}) + response = requests.get(f"http://localhost:8091/edit?{file_path}") + response.raise_for_status() + return render_template("gn_editor.html", **response.json()) +``` + +Running over localhost. This is unfortunately hard coded, and we should change that! In guix system +configuration it is already a variable as 'genenetwork-configuration-gn-guile-port 8091'. gn-guile should also be visible from outside, so that is a separate configuration. + +Also I note that the mapping page does three requests to wikidata (for mouse, rat and human). That could really be one. + +# Search + +Aliases are also used in search. You can tell when GN search renders too few results that aliases are not used. When aliases work we expect to list '2310010I16Rik' with + +=> https://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=sh*&search_terms_and=&FormID=searchResult + +Sheepdog tests for that and it has been failing for a while. + +Global search finds way more results, but also lacks that alias! Meanwhile GN1 does find that alias for record 1431728_at. GN2 finds it with hippocampus mRNA + +=> https://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=1431728_at%0D%0A&search_terms_and=&accession_id=None&FormID=searchResult + +in standard search. +But neither 1431728_at nor '2310010I16Rik' has a hit in *global* search and the result for Ssh should include the record in both search systems.
+ +# Deploy + +We introduced a new environment variable that does not show up on CD, part of the mapping page: + +=> + +In the logs on /export2: + +``` +root@tux02:/export2/guix-containers/genenetwork-development/var/log/cd# tail -100 genenetwork2.log +2025-07-20 04:19:43 File "/genenetwork2/gn2/base/trait.py", line 157, in wikidata_alias_fmt +2025-07-20 04:19:43 GN_GUILE_SERVER_URL + "gene/aliases/" + self.symbol.upper()) +2025-07-20 04:19:43 NameError: name 'GN_GUILE_SERVER_URL' is not defined +``` + +One thing I ran into is http://genenetwork.org/gn3-proxy/ - what is that for? + +## Deploy Updates: 2025-08-15 +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=269f99f1e1f0c253ecdd99f04bc7c6697012b0aa Update commit of gn-guile used on production + +This does not fix the issue on https://gn2-fred.genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P, instead we get + +``` +fredm@tux04:~$ curl http://localhost:8091/gene/aliases/Brca2 +Resource not found: /gene/aliases/Brca2 +``` diff --git a/issues/genenetwork3/check-for-mandatory-settings.gmi b/issues/genenetwork3/check-for-mandatory-settings.gmi new file mode 100644 index 0000000..16a2f8a --- /dev/null +++ b/issues/genenetwork3/check-for-mandatory-settings.gmi @@ -0,0 +1,40 @@ +# Check for Mandatory Settings + +## Tags + +* status: open +* priority: high +* type: bug, improvement +* interested: fredm, bonz +* assigned: jnduli, rookie101 +* keywords: GN3, gn3, genenetwork3, settings, config, configs, configurations + +## Explanation + +Giving defaults to some important settings leads to situations where the correct configuration is not set up correctly leading at best to failure, and at worst, to subtle failures that can be difficult to debug: e.g. When a default URI to a server points to an active domain, just not the correct one. + +We want to make such (arguably, sensitive) configurations explicit, and avoid giving them defaults. 
We want to check that they are set up before allowing the application to run, and fail loudly and obnoxiously if they are not provided. + +Examples of configuration variables that should be checked for: + +* All external URIs (external to app/repo under consideration) +* All secrets (secret keys, salts, tokens, etc) + +We should also eliminate from the defaults: + +* Computed values +* Calls to get values from ENVVARs (`os.environ.get(…)` calls) + +### Note on ENVVARs + +The environment variables should be used for overriding values under specific conditions, therefore, it should both be explicit and the last thing loaded to ensure they actually override settings. + +=> https://git.genenetwork.org/gn-auth/tree/gn_auth/__init__.py?id=3a276642bea934f0a7ef8f581d8639e617357a2a#n70 See this example for a possible way of allowing ENVVARs to override settings. + +The example above could be improved by maybe checking for environment variables starting with a specific value, e.g. the envvar `GNAUTH_SECRET_KEY` would override the `SECRET_KEY` configuration. This allows us to override settings without having to change the code. 
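The two ideas above (fail loudly on missing mandatory settings, and apply prefixed environment variables last) can be sketched roughly as follows. This is an illustration under stated assumptions, not the gn-auth or GN3 implementation: the setting names in `MANDATORY_SETTINGS`, the `GN3_` prefix, and all function names are hypothetical.

```python
import os

# Hypothetical examples of settings that must never get a default:
# one secret and one external URI.
MANDATORY_SETTINGS = ("SECRET_KEY", "AUTH_SERVER_URL")


class MissingSettingsError(RuntimeError):
    """Raised when one or more mandatory settings are not provided."""


def override_from_envvars(settings, environ, prefix="GN3_"):
    """Return a copy of `settings` with prefixed envvars applied last.

    With the (assumed) prefix "GN3_", the envvar GN3_SECRET_KEY
    overrides the SECRET_KEY setting.
    """
    merged = dict(settings)
    for name, value in environ.items():
        if name.startswith(prefix):
            merged[name[len(prefix):]] = value
    return merged


def check_mandatory(settings, mandatory=MANDATORY_SETTINGS):
    """Fail loudly if any mandatory setting is missing or empty."""
    missing = [key for key in mandatory if not settings.get(key)]
    if missing:
        raise MissingSettingsError(
            "Missing mandatory settings: " + ", ".join(missing))
    return settings


def load_settings(file_settings, environ=None, prefix="GN3_"):
    """Apply envvar overrides after the file settings, then validate."""
    environ = os.environ if environ is None else environ
    return check_mandatory(
        override_from_envvars(file_settings, environ, prefix))
```

Because the check runs after the overrides, a setting supplied only through the environment still counts as provided, while a setting supplied nowhere stops the application before it starts.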
+ +## Tasks + +* [ ] Explicitly check configs for ALL external URIs +* [ ] Explicitly check configs for ALL secrets +* [ ] Explicitly load ENVVARs last to override settings diff --git a/issues/genenetwork3/ctl-maps-error.gmi b/issues/genenetwork3/ctl-maps-error.gmi new file mode 100644 index 0000000..6726357 --- /dev/null +++ b/issues/genenetwork3/ctl-maps-error.gmi @@ -0,0 +1,46 @@ +# CTL Maps Error + +## Tags + +* type: bug +* status: open +* priority: high +* assigned: alexm, zachs, fredm +* keywords: CTL, CTL Maps, gn3, genetwork3, genenetwork 3 + +## Description + +Trying to run the CTL Maps feature in the collections page as described in +=> /issues/genenetwork2/broken-collections-feature + +We get an error in the results page of the form: + +``` +{'error': '{\'code\': 1, \'output\': \'Loading required package: MASS\\nLoading required package: parallel\\nLoading required package: qtl\\nThere were 13 warnings (use warnings() to see them)\\nError in xspline(x, y, shape = 0, lwd = lwd, border = col, lty = lty, : \\n invalid value specified for graphical parameter "lwd"\\nCalls: ctl.lineplot -> draw.spline -> xspline\\nExecution halted\\n\'}'} +``` + +on the CLI the same error is rendered: +``` +Loading required package: MASS +Loading required package: parallel +Loading required package: qtl +There were 13 warnings (use warnings() to see them) +Error in xspline(x, y, shape = 0, lwd = lwd, border = col, lty = lty, : + invalid value specified for graphical parameter "lwd" +Calls: ctl.lineplot -> draw.spline -> xspline +Execution halted +``` + +On my local development machine, the command run was +``` +Rscript /home/frederick/genenetwork/genenetwork3/scripts/ctl_analysis.R /tmp/01828928-26e6-4cad-bbc8-59fd7a7977de.json +``` + +Here is a zipped version of the json file (follow the link and click download): +=> https://github.com/genenetwork/gn-gemtext-threads/blob/main/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip + +Troubleshooting a while, I 
suspect +=> https://github.com/genenetwork/genenetwork3/blob/27d9c9d6ef7f37066fc63af3d6585bf18aeec925/scripts/ctl_analysis.R#L79-L80 this is the offending code. + +=> https://cran.r-project.org/web/packages/ctl/ctl.pdf The manual for the ctl library +indicates that our call above might be okay, which might mean something changed in the dependencies that the ctl library used. diff --git a/issues/genenetwork/genenetwork3_configuration.gmi b/issues/genenetwork3/genenetwork3_configuration.gmi index fcab572..cdd7c15 100644 --- a/issues/genenetwork/genenetwork3_configuration.gmi +++ b/issues/genenetwork3/genenetwork3_configuration.gmi @@ -1,10 +1,10 @@ -# Genenetwork2 Configurations +# Genenetwork3 Configurations ## Tags * assigned: fredm * priority: normal -* status: open +* status: closed, completed * keywords: configuration, config, gn2, genenetwork, genenetwork2 * type: bug @@ -13,3 +13,7 @@ The configuration file should only ever contain settings, and no code. Remove all code from the default settings file. Eschew executable formats (*.py) for configuration files and prefer non-executable formats e.g. 
*.cfg, *.json, *.conf etc + +## Closed as Completed + +See commit https://github.com/genenetwork/genenetwork3/commit/977efbb54da284fb3e8476f200206d00cb8e64cd diff --git a/issues/genenetwork3/generate-heatmaps-failing.gmi b/issues/genenetwork3/generate-heatmaps-failing.gmi new file mode 100644 index 0000000..522dc27 --- /dev/null +++ b/issues/genenetwork3/generate-heatmaps-failing.gmi @@ -0,0 +1,64 @@ +# Generate Heatmaps Failing + +## Tags + +* type: bug +* status: open +* priority: medium +* assigned: fredm, zachs, zsloan +* keywords: genenetwork3, gn3, GN3, heatmaps + +## Reproduce + +* Go to https://genenetwork.org/ +* Under "Select and Search" menu, enter "synap*" for the "Get Any" field +* Click "Search" +* In search results page, select first 10 traits +* Click "Add" +* Under "Create a new collection" enter the name "newcoll" and click "Create collection" +* In the collections page that shows up, click "Select All" once +* Ensure all the traits are selected +* Click "Generate Heatmap" and wait +* Note how system fails silently with no heatmap presented + +### Notes + +On https://gn2-fred.genenetwork.org the heatmaps fails with a note ("ERROR: undefined"). In the logs, I see "Module 'scipy' has no attribute 'array'" which seems to be due to a change in numpy. +=> https://github.com/MaartenGr/BERTopic/issues/1791 +=> https://github.com/scipy/scipy/issues/19972 + +This issue should not be present with python-plotly@5.20.0 but since guix-bioinformatics pins the guix version to `b0b988c41c9e0e591274495a1b2d6f27fcdae15a`, we are not able to pull in newer versions of packages from guix. 
+ + +### Update 2025-04-08T10:59CDT + +Got the following error when I ran the background command manually: + +``` +$ export RUST_BACKTRACE=full +$ /gnu/store/dp4zq4xiap6rp7h6vslwl1n52bd8gnwm-profile/bin/qtlreaper --geno /home/frederick/genotype_files/genotype/genotype/BXD.geno --n_permutations 1000 --traits /tmp/traits_test_file_n2E7V06Cx7.txt --main_output /tmp/qtlreaper/main_output_NGVW4sfYha.txt --permu_output /tmp/qtlreaper/permu_output_MJnzLbrsrC.txt +thread 'main' panicked at src/regression.rs:216:25: +index out of bounds: the len is 20 but the index is 20 +stack backtrace: + 0: 0x61399d77d46d - <unknown> + 1: 0x61399d7b5e13 - <unknown> + 2: 0x61399d78b649 - <unknown> + 3: 0x61399d78f26f - <unknown> + 4: 0x61399d78ee98 - <unknown> + 5: 0x61399d78f815 - <unknown> + 6: 0x61399d77d859 - <unknown> + 7: 0x61399d77d679 - <unknown> + 8: 0x61399d78f3f4 - <unknown> + 9: 0x61399d6f4063 - <unknown> + 10: 0x61399d6f41f7 - <unknown> + 11: 0x61399d708f18 - <unknown> + 12: 0x61399d6f6e4e - <unknown> + 13: 0x61399d6f9e93 - <unknown> + 14: 0x61399d6f9e89 - <unknown> + 15: 0x61399d78e505 - <unknown> + 16: 0x61399d6f8d55 - <unknown> + 17: 0x75ee2b945bf7 - __libc_start_call_main + 18: 0x75ee2b945cac - __libc_start_main@GLIBC_2.2.5 + 19: 0x61399d6f4861 - <unknown> + 20: 0x0 - <unknown> +``` diff --git a/issues/genenetwork3/rqtl2-mapping-error.gmi b/issues/genenetwork3/rqtl2-mapping-error.gmi new file mode 100644 index 0000000..b43d66f --- /dev/null +++ b/issues/genenetwork3/rqtl2-mapping-error.gmi @@ -0,0 +1,46 @@ +# R/qtl2 Maps Error + +## Tags + +* type: bug +* status: closed, completed +* priority: high +* assigned: alexm, zachs, fredm +* keywords: R/qtl2, R/qtl2 Maps, gn3, genetwork3, genenetwork 3 + +## Reproduce + +* Go to https://genenetwork.org/ +* In the "Get Any" field, enter "synap*" and press the "Enter" key +* In the search results, click on the "1435464_at" trait +* Expand the "Mapping Tools" accordion section +* Select the "R/qtl2" option +* Click "Compute" +* In 
the "Computing the Maps" page that results, click on "Display System Log" + +### Observed + +A traceback is observed, with an error of the following form: + +``` +⋮ +FileNotFoundError: [Errno 2] No such file or directory: '/opt/gn/tmp/gn3-tmpdir/JL9PvKm3OyKk.txt' +``` + +### Expected + +The mapping runs successfully and the results are presented in the form of a mapping chart/graph and a table of values. + +### Debug Notes + +The directory "/opt/gn/tmp/gn3-tmpdir/" exists, and is actually used by other mappings (i.e. the "R/qtl" and "Pair Scan" mappings) successfully. + +This might imply a code issue: Perhaps +* a path is hardcoded, or +* the wrong path value is passed + +The same error occurs on https://cd.genenetwork.org but does not seem to prevent CD from running the mapping to completion. Maybe something is missing on production - what, though? + +## Closed as Completed + +This seems fixed now. diff --git a/issues/genetics/speeding-up-gemma.gmi b/issues/genetics/speeding-up-gemma.gmi new file mode 100644 index 0000000..91bab17 --- /dev/null +++ b/issues/genetics/speeding-up-gemma.gmi @@ -0,0 +1,492 @@ +# Speeding up GEMMA + +GEMMA is slow, but usually fast enough. Earlier I wrote gemma-wrapper to speed things up. In genenetwork.org, by using gemma-wrapper with LOCO, most traits are mapped in a few seconds on a large server (30 individuals x 200K markers). By expanding markers to over 1 million, however, runtimes degrade to 6 minutes. Increasing the number of individuals to 1000 may slow mapping down to hour(s). As we are running 'precompute' on 13K traits - and soon maybe millions - it would be beneficial to reduce runtimes again. + +One thing to look at is Sen's bulklmm. It can do phenotypes in parallel, provided there is no missing data. This is perfect for permutations which we'll also do. For multiple phenotypes it is a bit tricky however, because you'll have to mix and match experiments to show the same individuals (read samples).
+ +So the approach is to first analyze steps in GEMMA and see where it is particularly inefficient. Maybe we can do something about that. I note I started the pangemma effort (and mgamma effort before). The idea is to use a propagator network for incremental improvements and also to introduce a new build system and testing framework. In parallel we'll try to scale out on HPC using Arun's ravanan software. + +There is no such thing as a free lunch. So, let's dive in. + +# Description + +# Tags + +* assigned: pjotrp +* type: feature +* priority: high + +# Tasks + +* [X] Try gzipped version +* [X] Run without debug +* [ ] Use lmdb for genotypes +* - [X] convert genotypes to lmdb +* - [X] replace GEMMA ReadGenotypes +* - [X] replace reading genotypes in AnalyzeBimbam +* - [+] Apply similar SNP filtering as the original +* - [X] Add SNP info to Geno file +* - [X] Try different geno encodings +* - [+] Fix support for NAs - also in compute +* [X] Use lmdb for SNPs (probably part of Geno file) +* [X] Match output +* [ ] Write lmdb for output with filter +* [X] Optimize openblas for target architecture +* [ ] Use profiler +* [ ] Hash genotypes? Try buf.hash or xxhash +* [ ] Skip highly correlated markers with backtracking +* [ ] Perhaps try a faster malloc library for GEMMA +* [ ] Fix sqrt(NaN) when running big file example with -debug +* [ ] Fix/check assumption that geno is between 0 and 2 +* [ ] Try 64-bit integer index for lmdb +* [ ] Other improvements...
+ +# Summary + +Convert a geno file to mdb with + +``` +./bin/anno2mdb.rb mouse_hs1940.anno.txt +./bin/geno2mdb.rb mouse_hs1940.geno.txt --anno mouse_hs1940.anno.txt.mdb --eval Gf # convert to floating point +real 0m14.042s +user 0m12.639s +sys 0m0.402s +``` + +``` +../bin/anno2mdb.rb snps-matched.txt +../bin/geno2mdb.rb pangenome-13M-genotypes.txt --geno-json bxd_inds.list.json --anno snps-matched.txt.mdb --eval Gf +../bin/geno2mdb.rb pangenome-13M-genotypes.txt --geno-json bxd_inds.list.json --anno snps-matched.txt.mdb --eval Gb +``` + +even with floats a 30G pangenome genotype file got reduced to 12G. A quick full run of the mdb version takes 6 minutes. That is a massive 3x speedup. It also used less RAM (because it is one process instead of 20) and had a 40x core usage, much of it in the Linux kernel: + +``` +/bin/time -v ./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -debug +LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib /bin/time -v ./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -no-check +real 5m47.587s +user 39m33.796s +sys 211m1.143s + +Command being timed: "./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -no-check" + User time (seconds): 2169.77 + System time (seconds): 11919.04 + Percent of CPU this job got: 3919% + Elapsed (wall clock) time (h:mm:ss or m:ss): 5:59.48 + Maximum resident set size (kbytes): 13377040 +``` + +as we only read the genotype file once it shows how much is IO bound! Moving to lmdb was the right choice to speed up pangemma. 
+ +Old gemma does: + +``` + Command being timed: "/bin/gemma -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt.gz -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -no-check" + User time (seconds): 2017.25 + System time (seconds): 62.21 + Percent of CPU this job got: 240% + Elapsed (wall clock) time (h:mm:ss or m:ss): 14:24.17 + Maximum resident set size (kbytes): 9736884 +``` + +So we are at 3x speed. + +With Gb byte encoding the file got further reduced from 13Gb to 4Gb. + +What is more exciting is that LOCO now runs in 30s - compared to gemma's earlier 6 minutes, so that is at 10x speed, using about 1/3 of RAM. Note the CPU usage: + +``` + Command being timed: "./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt-Gb.mdb -loco 2 -lmm 9 -maf 0.1 -n 2 -no-check" User time (seconds): 177.81 + System time (seconds): 934.92 + Percent of CPU this job got: 3391% + Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.80 + Maximum resident set size (kbytes): 4326308 +``` + +it looks like disk IO is no longer the bottleneck. The Gb version is much smaller than Gf, but runtime is only slightly better. So it is time for the profiler to find how we can make use of the other cores! But, for now, I am going to focus on getting the pipeline set up with ravanan. 
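The ratios quoted in this summary can be rechecked from the `/bin/time` figures reported above. The sketch below is only a convenience for that arithmetic; the helper names are made up for illustration, and the numbers are copied from the two full (non-LOCO) runs: old gemma on the gzipped file and the mdb version.

```python
def to_seconds(elapsed):
    """Convert a /bin/time elapsed string (h:mm:ss or m:ss.ff) to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def speedup(old_elapsed, new_elapsed):
    """Wall-clock speedup of the new run relative to the old run."""
    return to_seconds(old_elapsed) / to_seconds(new_elapsed)


def busy_cores(user_seconds, system_seconds, elapsed):
    """Average number of cores kept busy: (user + sys) / wall clock."""
    return (user_seconds + system_seconds) / to_seconds(elapsed)


# Old gemma (14:24.17) versus the mdb version (5:59.48).
print(round(speedup("14:24.17", "5:59.48"), 1))            # 2.4
# Cores kept busy by the mdb run, matching its reported 3919% CPU.
print(round(busy_cores(2169.77, 11919.04, "5:59.48"), 1))  # 39.2
```

For these two particular runs the wall-clock ratio comes out at roughly 2.4, and most of the busy core time of the mdb run is system time rather than user time.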
+ +# Analysis + +As a test case we'll take one of the runs: + +``` +time -v /bin/gemma -loco 11 -k /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -o 680029457111fdd460990f95853131c87ea20c57.11.assoc.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -outdir /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/d20251111-588798-f81icw +``` + +which I simplify to + +``` +/bin/time -v /bin/gemma -loco 11 -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -debug +Reading Files ... +number of total individuals = 143 +number of analyzed individuals = 20 +number of total SNPs/var = 13209385 +number of SNPS for K = 12376792 +number of SNPS for GWAS = 832593 +number of analyzed SNPs = 13111938 +``` + +The timer says: + +``` +User time (seconds): 365.33 +System time (seconds): 16.59 +Percent of CPU this job got: 128% +Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.01 +Average shared text size (kbytes): 0 +Average unshared data size (kbytes): 0 +Average stack size (kbytes): 0 +Average total size (kbytes): 0 +Maximum resident set size (kbytes): 11073412 +Average resident set size (kbytes): 0 +Major (requiring I/O) page faults: 0 +Minor (reclaiming a frame) page faults: 5756557 +Voluntary context switches: 1365 +Involuntary context switches: 478 +Swaps: 0 +File system inputs: 0 +File system outputs: 143704 +Socket messages sent: 0 +Socket messages received: 0 +Signals delivered: 0 +Page size (bytes): 4096 +Exit status: 0 +``` + +The genotype file is 30G unzipped. Let's try running the gzipped version (which will be beneficial on a compute cluster anyhow) which comes in at 9.2G. We know that Gemma is not the most efficient when it comes to IO. So testing is crucial.
+Critically the run gets slower: + +``` +Percent of CPU this job got: 118% +Elapsed (wall clock) time (h:mm:ss or m:ss): 7:43.56 +``` + +The problem is that unzip runs on a single thread in GEMMA, so it is actually slower than the gigantic raw text file. + +## Running without debug + +Without the debug switch gemma runs at the same speed with 128% CPU. That won't help much. + +## Optimizing GEMMA+OpenBLAS+GSL + +Compiling with optimization can be low hanging fruit - despite the fact that we seem to be IO bound at 128% CPU. Still, aggressive compiler optimizations may make a difference. The current build reads: + +``` +GEMMA Version = 0.98.6 (2022-08-05) +Build profile = /gnu/store/8rvid272yb53bgascf5c468z0jhsyflj-profile +GCC version = 14.3.0 +GSL Version = 2.8 +OpenBlas = OpenBLAS 0.3.30 - OpenBLAS 0.3.30 DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=128 +arch = Cooperlake +threads = 96 +parallel type = threaded +``` + +this uses the gemma-gn2 package in + +=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/gemma.scm#n27 + +which is currently not built with arch optimizations (even though Cooperlake suggests differently). Another potential optimization is to use a fast malloc library. We do, however, already compile with a recent gcc, thanks to Guix. No need to improve on that. + +## Introduce lmdb for genotypes + +Rather than focussing on gzip, another potential improvement is to use lmdb with mmap. We are not going to upgrade the original gemma code (which is in maintenance mode). We are going to upgrade the new pangemma project instead: + +=> https://git.genenetwork.org/pangemma/ + +The reason is that this is our experimental project. + +So I just managed to build pangemma/gemma in Guix. Next step is to introduce lmdb genotypes. Genotypes come essentially as a matrix of markers x individuals. In the case of GN geno files and BIMBAM files they are simply stored as tab delimited values and/or probabilities.
This happens in + +``` +src/param.cpp +1261:void PARAM::ReadGenotypes(gsl_matrix *UtX, gsl_matrix *K, const bool calc_K) { +1280:void PARAM::ReadGenotypes(vector<vector<unsigned char>> &Xt, gsl_matrix *K, +``` + +calling into + +``` +gemma_io.cpp +644:bool ReadFile_geno(const string &file_geno, const set<string> &setSnps, +1752:bool ReadFile_geno(const string file_geno, vector<int> &indicator_idv, +1857:bool ReadFile_geno(const string &file_geno, vector<int> &indicator_idv, +``` + +which are called from gemma.cpp. Also lmm.cpp reads the geno file in the AnalyzeBimbam function (see file_geno): + +``` +src/lmm.cpp +61: file_geno = cPar.file_geno; +1664: debug_msg(file_geno); +1665: auto infilen = file_geno.c_str(); +2291: cout << "error reading genotype file:" << file_geno << endl; +``` + +Note that SNPs are also read from a file (see file_snps). We already have an lmdb version for that! + +So, reading genotypes happens in multiple places. In fact, the file is read 1x for computing K and 2x for GWA. And it is worse than this because LOCO runs GWA 20x, rereading the same files. Reading it once using lmdb should speed things up. + +We'll start with the 30G 143samples.percentile.bimbam.bimbam-reduced2 file. We only need to convert this file into lmdb once. We want to track both column and row names in the same lmdb and we will use a meta JSON record for that. On the command line we'll state whether the genotypes are stored as char or int. Floats will be packed into either of those. We'll experiment a bit to see what the default should be. A genotype is usually a number/character or a probability. In the latter case we don't need high precision and can choose to store an index into a range of values. We can also opt for Float16 or something more ad hoc because we don't have to store the exponent. + +But let's start with a standard float here, to keep things simple.
To write the first version of code I'll use a byte conversion: + +``` +./bin/geno2mdb.rb BXD.geno.bimbam --eval '{"0"=>0,"1"=>1,"2"=>2,"NA"=>-1}' --pack 'C*' --geno-json BXD.geno.json +``` + +The lmdb file contains a metadata record that looks like: + +``` +{ + "type": "gemma-geno", + "version": 1, + "eval": "G0-2", + "key-format": "string", + "rec-format": "C*", + "geno": { + "type": "gn-geno-to-gemma", + "genofile": "BXD.geno", + "samples": [ + "BXD1", + "BXD2", + "BXD5", +etc. +``` + +i.e. it is a self-contained, efficient, genotype format. There is also another trick: we can use Plink-style compression with + +``` +./bin/geno2mdb.rb BXD.geno.bimbam --eval '{"0"=>0,"1"=>1,"2"=>2,"NA"=>4}' --geno-json BXD.geno.json --gpack 'l.each_slice(4).map { |slice| slice.map.with_index.sum {|val,i| val << (i*2) } }.pack("C*")' +``` + +reducing the original uncompressed BIMBAM from 9.9Mb to 2.7Mb. This is still a lot larger than the gzip compressed BIMBAM, but as I pointed out earlier the uncompressed version is faster by a wide margin. Compressing the lmdb file gets it in range of the compressed BIMBAM btw. So that is always an option. + +Next we create a floating point version. That reduces the file to 30% with + +``` +geno2mdb.rb fp.bimbam --geval 'g.to_f' --pack 'F*' --geno-json bxd_inds.list.json +``` + +and compressing the probabilities into a byte reduces the file to 10%: + +``` +geno2mdb.rb fp.bimbam --geval '(g.to_f*255.0).to_i' --pack 'C*' --geno-json bxd_inds.list.json +``` + +And now the compressed version is also 4x smaller. We'll have to run gemma at scale to see what the impact is, but an uncompressed 10x reduction should have an impact on the IO bottleneck. Note how easy it is to try these things with my little Ruby script. + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/geno2mdb.rb + +## Use lmdb genotypes from pangemma + +Rather than writing new code in C++ I proceeded embedding guile in pangemma.
If it turns out to be a performance problem we can always fall back to C. Here we show a simple test written in guile that gets called from main.cpp: + +=> https://git.genenetwork.org/pangemma/commit/?id=5b6b5e2ad97b4733125c0845cfae007e8094a687 + +## Some analysis of GEMMA + +GEMMA::BatchRun reads files and executes (b gemma.cpp:1657) +cPar.ReadFiles() + ReadFile_anno + ReadFile_pheno + ReadFile_geno (gemma_io.cpp:652) - first read to fetch SNPs info, num (ns_test) and total SNPs (ns_total). + - it also does some checks + Note: These can all be handled by the lmdb files. So it saves one run. + +Summary of Mutated Outputs: +* indicator_snp: Binary indicators for which SNPs passed filtering +* snpInfo: Complete metadata for all SNPs in the file +* ns_test: Count of SNPs passing filters +checkpoint("read-geno-file",file_geno); + +Next start LMM9 gemma.cpp:2571 + ReadFile_kin + EigenDecomp_Zeroed + 2713 CalcUtX(U, W, UtW); + 2714 CalcUtX(U, Y, UtY); + CalcLambda + CalcLmmVgVeBeta + CalcPve + cPar.PrintSummary() + debug_msg("fit LMM (one phenotype)"); + cLmm.AnalyzeBimbam lmm.cpp:1665 and + LMM::Analyze lmm.cpp:1704 + + +Based on LLM code analysis, here's what gets mutated in the 'LMM' and 'PARAM' classes: + +### By 'ReadFile_geno': +This is a **standalone function** (not a member of LMM), but it mutates LMM members when passed as parameters: + +1. **'indicator_snp'** - cleared and populated with 0/1 filter flags +2. **'snpInfo'** - cleared and populated with SNP metadata +3. **'ns_test'** - set to count of SNPs that passed all filters + +### By 'LMM::AnalyzeBimbam': +(which calls 'LMM::Analyze') + +**Directly mutated in 'LMM::Analyze':** + +1. **'sumStat'** - PRIMARY OUTPUT + - Cleared at start (implied) + - Populated with one SUMSTAT entry per analyzed SNP + - Contains: beta, se, lambda_remle, lambda_mle, p_wald, p_lrt, p_score, logl_H1 + +2. **'time_UtX'** - timing accumulator + - '+= time_spent_on_matrix_multiplication' + +3.
**'time_opt'** - timing accumulator + - '+= time_spent_on_optimization' + +**Read but NOT mutated:** +- 'indicator_snp' - read to determine which SNPs to process +- 'indicator_idv' - read to determine which individuals to include +- 'ni_total', 'ni_test' - used for loop bounds and assertions +- 'n_cvt' - number of covariates, used in calculations +- 'l_mle_null', 'l_min', 'l_max', 'n_region', 'logl_mle_H0' - analysis parameters +- 'a_mode' - determines which statistical tests to run +- 'd_pace' - controls progress bar display + +### Summary Table: + +| Member Variable | Mutated By | Purpose | +|----------------|------------|---------| +| 'indicator_snp' | 'ReadFile_geno' | Which SNPs passed filters | +| 'snpInfo' | 'ReadFile_geno' | SNP metadata (chr, pos, alleles, etc.) | +| 'ns_test' | 'ReadFile_geno' | Count of SNPs to analyze | +| 'sumStat' | 'Analyze' | **Main output**: Statistical results per SNP | +| 'time_UtX' | 'Analyze' | Performance profiling | +| 'time_opt' | 'Analyze' | Performance profiling | + +The key output is **'sumStat'** which contains all the association test results. + +PARAM variables directly mutated by these functions: + + indicator_snp (by ReadFile_geno) + snpInfo (by ReadFile_geno) + ns_test (by ReadFile_geno) + +LMM variables mutated: + + indicator_snp (by ReadFile_geno if passed LMM's copy) + snpInfo (by ReadFile_geno if passed LMM's copy) + ns_test (by ReadFile_geno if passed LMM's copy) + sumStat (by Analyze - this is LMM-only, not in PARAM) + time_UtX, time_opt (by Analyze) + +The actual analysis results (sumStat) exist only in LMM, not in PARAM. + +## Coding for lmdb support + +From above it should be clear that, if we have the genotypes and snp annotations in lmdb, we can skip reading the genotype file the first time. We can also rewrite the 'analyze' functions to fetch this information on the fly. + +Note that OpenBLAS will have to run single threaded when introducing SNP-based threads. 
+ +## Fine grained multithreading + +From above it can be concluded that we can batch process SNPs in parallel. The only output is sumStat and that is written at once at the end. So, if we can collect the sumStat data without collision it should just work. + +Interestingly both Guile and C++ have recently introduced fibers. Boost.Fiber looks pretty clean: + +``` +#include <boost/fiber/all.hpp> +#include <vector> +#include <iostream> + +namespace fibers = boost::fibers; + +// Worker fiber +void compute_worker(int start, int end, + fibers::buffered_channel<int>& channel) { + for (int i = start; i < end; ++i) { + channel.push(i * i); + } +} + +void parallel_compute_fibers() { + // Capacity must be a power of two + fibers::buffered_channel<int> channel(128); + + // Spawn fibers + fibers::fiber f1([&]() { + compute_worker(0, 100, channel); + }); + + fibers::fiber f2([&]() { + compute_worker(100, 200, channel); + }); + + // Close the channel only after both workers have finished + fibers::fiber closer([&]() { + f1.join(); + f2.join(); + channel.close(); // Signal completion + }); + + // Collect results + std::vector<int> results; + int value; + while (fibers::channel_op_status::success == channel.pop(value)) { + results.push_back(value); + } + + closer.join(); + + std::cout << "Total results: " << results.size() << std::endl; +} +``` + +Compare that with guile: + +``` +(use-modules (fibers) + (fibers channels)) + +;; Worker that streams individual results +(define (compute-worker-streaming start end result-channel) + (let loop ((i start)) + (when (< i end) + (put-message result-channel (* i i)) + (loop (+ i 1)))) + ;; Send completion signal + (put-message result-channel 'done)) + +;; Collector fiber +(define (result-collector result-channel num-workers) + (let loop ((results '()) + (done-count 0)) + (if (= done-count num-workers) + (reverse results) + (let ((msg (get-message result-channel))) + (if (eq?
msg 'done) + (loop results (+ done-count 1)) + (loop (cons msg results) done-count)))))) + +(define (parallel-compute-streaming) + (run-fibers + (lambda () + (let ((result-channel (make-channel))) + + ;; Spawn workers + (spawn-fiber + (lambda () (compute-worker-streaming 0 100 result-channel))) + (spawn-fiber + (lambda () (compute-worker-streaming 100 200 result-channel))) + + ;; Collect results + (result-collector result-channel 2))))) +``` + +Boost.Fiber is a relatively mature library now, with about 8+ years of development and real-world usage. +Interestingly Boost.Fiber has work stealing built in. We can look at that later: + +=> https://www.boost.org/doc/libs/1_66_0/libs/fiber/doc/html/fiber/worker.html + +What about LOCO? Actually we can use the same fiber strategy for each chromosome as a per CHR process. We can set the number of threads differently based on the number of SNPs per chromosome, so all chromosomes take (about) the same time. Later, we can bring LOCO into one process with the advantage that the genotype data is only read once. In both cases the kinship matrices are in RAM anyway. + +# Reducing the size of the genotype file + +The first version of lmdb genotypes used simple floats. That reduced the pangenome text version from 30Gb to 12Gb with about a 3x speedup of gemma. Next I tried byte representation of the genotypes. + +# Optimizing SNP handling + +GEMMA originally used a separate SNP annotation file, which proves inefficient. Now that we transform the geno information to lmdb, we might as well include chr+pos. We'll make the key out of that and add a table with marker annotation. + +# Optimizing the index + +I opted for using a CHR+POS index (byte+long value). There are a few things to consider. There may be duplicates and there may be missing values. Also LMDB likes an integer index. The built-in dupsort does not work, so we need to create a unique pos for every variant. I'll do that by adding the line number.
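To make the index idea a bit more concrete, here is a minimal sketch of one way such a key could be packed. The exact layout is an assumption for illustration, not what geno2mdb.rb or pangemma actually does: one byte for CHR, an 8-byte POS and a 4-byte line number appended as a tie-breaker. Big-endian packing is used because LMDB compares keys byte by byte by default, so the byte order then matches the numeric order.

```python
import struct

# Hypothetical key layout (an assumption, not the real on-disk format):
#   B = chromosome (1 byte), Q = position (8 bytes), I = line number (4 bytes)
# ">" means big-endian without padding, so bytewise order equals numeric order.
KEY_FORMAT = ">BQI"


def make_key(chrom, pos, line_no):
    """Pack CHR+POS plus the line number into a unique, sortable key."""
    return struct.pack(KEY_FORMAT, chrom, pos, line_no)


def parse_key(key):
    """Unpack a key back into (chrom, pos, line_no)."""
    return struct.unpack(KEY_FORMAT, key)


# Two variants share chr 1, pos 1000; the line number keeps their keys distinct.
variants = [(2, 500, 3), (1, 1000, 1), (1, 1000, 2), (1, 999, 4)]
keys = sorted(make_key(*v) for v in variants)
decoded = [parse_key(k) for k in keys]
print(decoded)
```

Sorting the raw bytes gives the variants back in CHR, POS, line order, which is the iteration order an lmdb cursor would yield with such keys.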
diff --git a/issues/genotype_search_bug.gmi b/issues/genotype_search_bug.gmi new file mode 100644 index 0000000..0f05f4e --- /dev/null +++ b/issues/genotype_search_bug.gmi @@ -0,0 +1,13 @@ +# The * Search for Genotypes Not Working + +## Tags + +* type: bug +* priority: medium +* status: closed +* assigned: zsloan +* keywords: bug, search + +## Description + +Currently * searches for genotypes return no results, even when data exists. diff --git a/issues/global-search-results.gmi b/issues/global-search-results.gmi deleted file mode 100644 index 9cd773a..0000000 --- a/issues/global-search-results.gmi +++ /dev/null @@ -1,32 +0,0 @@ -# Global search does not return results - -## Tags - -* priority: critical -* type: bug -* assigned: zsloan, pjotrp -* status: unclear -* keywords: global search, from github - -## Description - -=> https://github.com/genenetwork/genenetwork2/issues/629 From GitHub - -> Try a search for Brca2 -> -> I am trying to add an example to this storyboard: -> -> => https://github.com/genenetwork/gn-docs/blob/master/story-boards/starting-from-known-gene/starting-from-known-gene.md#use-the-search-page -> -> -> Interestingly luna does no better: -> -> => http://luna.genenetwork.org/gsearch?type=gene&terms=brca2 - -@pjotr @zsloan, it seems to me this might be fixed, but please have a look and fix it in case it is not - -## Resolution - -With the new xapian search, this issue is no more. - -* closed diff --git a/issues/global-search-unhandled-error.gmi b/issues/global-search-unhandled-error.gmi index b2f6ba8..7626280 100644 --- a/issues/global-search-unhandled-error.gmi +++ b/issues/global-search-unhandled-error.gmi @@ -5,7 +5,7 @@ * assigned: aruni, fredm * priority: high * type: bug -* status: open +* status: closed * keywords: global search, gn2, genenetwork2 ## Description @@ -15,3 +15,7 @@ assume the request will always be successful. 
This is not always the case, as ca => https://test3.genenetwork.org/gsearch?type=gene&terms=Priscilla here (as of 2024-03-04T11:25+03:00UTC). Possible errors should be checked for and handled before attempting to read and/or process expected data. + +## Closing Comments + +This issue is closed as obsoleted. The issue is really old (>=7 months). Closing it for now. To be reopened if the issue happens again. diff --git a/issues/gn-auth/email_verification.gmi b/issues/gn-auth/email_verification.gmi index 8147bb5..07e2b04 100644 --- a/issues/gn-auth/email_verification.gmi +++ b/issues/gn-auth/email_verification.gmi @@ -2,7 +2,7 @@ ## Tags -* status: open +* status: closed, completed * priority: medium * type: enhancement * assigned: fredm, zsloan @@ -12,8 +12,10 @@ When setting up e-mail verification, the following configurations should be set for gn-auth: -SMTP_HOST = "smtp.uthsc.edu" +SMTP_HOST = "smtp.uthsc" SMTP_PORT = 25 (not 587, which is what we first tried) SMTP_TIMEOUT = 200 # seconds Not sure about username/password yet. We tried UNKNOWN/UNKNOWN and my own (Zach's) username/password + +Note that this host is only visible on the internal network of UTHSC. It won't work for tux02. 
diff --git a/issues/gn-auth/example-privileges-script.gmi b/issues/gn-auth/example-privileges-script.gmi new file mode 100644 index 0000000..afda1a1 --- /dev/null +++ b/issues/gn-auth/example-privileges-script.gmi @@ -0,0 +1,36 @@ +# Example Python script for setting privileges for user/group + +## Description + +This is just an example of a python script for setting user/group privileges, for potential future reference + +Before running this script, stop the crontab job that automatically sets unlinked resource privileges + +```python +import uuid +import sqlite3 + +group_id = '0510dc91-0eb6-4d9d-97e5-405acc84ba2b' +resource_id = 'e5cc773d-ca28-44e2-b2a7-1c2901794238' + +publishxrefs = ('10955','10957','10960','10961','10964','10966','10969','10970','10973','10975','10978','10979','10982','10984','10987','10988','12486','12487','12489','12490','12491','12492','12493','12494','12495','12496','12497','12498','12499','12500','12501','12502','12503','12504','12505','12506','12507','12508','12509','12510','12511','12512','12513','12514','12515','12516','12517','12518','12519','12520','12521','12522','12523','12524','12525','12526','12527','12528','12529','12530','12531','12532','12533','12534','12535','12536','12537','12538','12539','12540','12541','12542','12543','12544','12545','12546','12547','12548','12549','12550','12551','12566','12567','12568','12569','12574','12575','12576','12577','12578','12579','12580','12621','12735','12737','12741','12742','12743','12744','12745','12780','12781','12782','12783','12784','12785','12786','12787','12788','12789','12790','12791','12792','12793','12794','12795','12796','12797','12798','12799','12800','12801','12803','12804','12805','12806','12807','12808','12809','12810','12812','12813','12816','12817','12961','12962','12963','12964','12965','12966','12967','12970','13029','14803','14804','14805','14806','15572','15573','16197','16375','17329','17330','17331','17332','17333','17334','17335','17336','17337','17338','17339','17340'
,'17341','17342') + +# I generated these separatedly with uuid.uuid4(); I probably could have just done this in the script itself, but wanted to make sure they stayed the same +data_link_ids = ('3041366d-1ffd-45fb-9617-043772b285c8', 'da41fc30-3cd6-4b41-83b5-8fedc4ccd65f', '364a4010-e3fe-470f-a8c9-2a9fd359a4e3', '4e878c0a-cc92-4b21-8152-310266291967', 'ab50a999-e9bb-4bb6-91c0-9828b804156e', 'd50d30e9-15f9-4578-8b48-2bcb0d7a8afb', 'd42d2ef5-278f-4b5e-ae57-10f49f48c2e9', '78c022d7-390b-4688-96c6-c1afadd45877', '17fca9ae-8e71-4c55-b035-15d04f96d936', '4f9893de-fccf-4d6a-845d-df2f83e4d06c', '8a660b03-786a-4143-9fb3-9d00e888f3a2', '3965417a-e47a-47c8-81f6-991eef8c4152', 'e27707f7-5832-4e3f-9391-849e964bbaf6', 'bf9f6ff0-a131-46ef-8a2e-c37d8b66f992', '1ee744c4-95e1-4a66-958c-e785dc937563', '0fa79294-bbdc-4701-861d-9bb91ea72588', '38665214-7cdd-4b01-81dc-d1b78e63a0b0', '82a237df-96ce-404e-b052-8dbe45e793ee', 'ec4c1848-d326-462b-9c0d-f5e5c76e92f6', '46bee64b-8ce7-4910-80ec-211063725b1a', '7f489875-38b6-4cff-a05e-f11a7957b9b8', 'f39744a1-d673-406f-a2f1-c45082bb1975', '5f53a9e9-e40c-4a01-bf9d-430d7c2fd5ef', '1f0a4f2d-cd1c-41e5-a185-2ea2b2b05cd3', 'e282651c-7dc3-40e9-bb52-14e73c3a4ef7', '3c492e6d-e807-427b-acca-44afa4862894', '38e0df6c-3f44-4acb-9965-f0d3f0278150', '35e5ae63-3a32-49ac-93ed-b39d02ab5f5c', '0e6bfa4a-4fee-4b54-80c6-209f9b0ecd00', 'eb85e71a-8b4b-4f3f-9168-59b4ebc090a1', '3eb0325c-4dce-481e-bce7-46c37031da76', '7bc5ce49-4150-4d87-bfbf-d3a1cd20ad67', '03c0cba7-8712-4a27-9b79-e38818805b1f', '07d787ec-e0f9-4b7c-b368-d1f56ce030dc', '51d9e601-31c7-4643-b896-79d90bdc4105', '3cee3754-2822-4f0a-87ad-96bdfe2f0232', 'a7e9eb54-63bd-4ca9-a1f8-1aeac02a76db', '3ff132e5-7fb6-4763-943e-1efbe5f8000e', 'c685f0c9-084d-44d2-882e-ce66cdccef6d', 'ea062e07-1f59-4312-bfd9-6560e652c878', '75d33621-b5a4-447d-a094-7480d1d57a47', 'bb3dbd16-0c73-47d8-8e21-f095d3398b61', '0211177b-a92c-4215-a622-0cba5e8e2866', 'e2139b64-e74a-4263-9785-314e73b102df', '0426f12b-c223-487b-8ab7-baea5995c480', 
'4a467a72-174c-4ec7-9557-859656ad2c71', '38ab978e-e78f-4c0a-8af3-449b636fe5e6', 'a45c8d42-14d3-464d-8395-8a574148da78', 'e4171cc1-4a03-4311-a287-cee1b8084227', '75d70308-6f1a-49e4-9199-97ec8f60778e', 'efb5c834-b88a-4ee9-b09d-91913fddb546', '23866a00-a729-4ba9-af22-ee83ec164d34', '3feb1154-0613-464b-b758-aad308550a74', '7019d0f1-a590-46ce-a30e-4c21541b6ea8', '6e803182-71d2-4427-a5df-ad84651e5d11', 'fe1bf3f6-818b-4fae-9880-8ae2c1bdcff6', '66d480f7-da41-49ed-a222-8724b493313a', 'c908d2a3-8378-4574-83be-3bf8bdeff5fb', '96b36360-7258-43ab-bdda-23e93f15b0ac', 'daf90aca-6ee6-4c3c-9a60-1e7ae2e29cd2', '43800347-1fe1-40f7-9013-408f0b0740e9', 'e9350a78-a62f-4a08-8881-e6e51450d120', 'bda9a217-d605-4a18-9c3f-5139679ae413', 'cbd8f79a-4992-43c9-8391-994e221b73e1', 'c6b64d90-63ff-482d-b205-f58f3cf656df', '3ecbf267-3655-42a6-a8f9-2751439efb27', '808ae753-a255-43a6-96d4-0ed02b14aefe', '1a5424df-49b3-4274-8281-a1eed838ffda', '89e6d278-e643-43a2-8a61-746cbf446109', 'b4940ece-80a0-4382-ba57-eaad1d35e83e', 'f46cd643-fccb-4037-b642-9a4a329e84e2', '497a235c-4253-4e94-a69c-4b2f200976dd', '02aa8e3a-f9ac-459b-8e35-7081f2849f48', 'da5018e2-38af-415a-ad43-8caf8d82290d', '574ee482-f534-475e-9e7a-0a14e05f4495', 'b90b3a02-fa8d-4393-9dbb-087224a80b40', 'd68370ec-f569-42f3-9c07-a3118aa73ad5', '4b6b099b-3a7c-46c2-a2fc-92c01463b698', 'c9f5608f-3301-4835-b6dc-b1891fe81c36', 'eead972c-0fc4-4c5e-b1ad-63db4d1e9409', 'd8b295eb-6d07-4abe-8b8a-8cfef066a32e', 'a89f3944-be64-42d0-aa66-d2501021760d', '02f42124-bc38-4a14-9400-bbc8e8bf41b7', 'abbcb901-da42-4ef1-bc2c-55b95d584461', 'e28b0cef-eddb-41f2-9479-722365c0b2e0', '9135c304-1dd3-4eb5-82d4-91a86e39068a', '0bbd5f1d-eef3-4c35-84ab-484165a4240d', '08ad9a25-b20d-4ad8-a5e0-a886edc4a7aa', '7e05bdf8-51f5-49dc-9ff6-fbbc6aa20c9f', 'c82d4943-dc6f-4ec8-b76f-1309290183fe', '6a8d76bc-156b-4925-823c-b4585a847efc', '2604e9a8-a4ee-49be-a754-126b1705516e', '8c32b69b-e796-418d-b254-104a179a84ba', '532dca31-c38e-4b77-a84c-563407e9ae00', '954cacda-179e-42a9-8c1f-987e6fae1079', 
'bcfced8a-bd50-48e6-9edb-4776a1e95bf5', '66308324-1747-46df-8ddf-41e5bff1cd1a', 'f797e23c-7cb6-4869-97f5-3a79b685c6a3', '0869bb57-0133-4e57-9655-2b6eb1906f5e', 'fc0dddfa-e683-4a8d-9f57-82fb368f8a84', '35b7ffc1-6782-4c85-9bf8-d51629cab2d0', '232850b6-5a53-45e0-8668-7773b9cb39c2', 'af20291c-2be6-40e1-9576-b78df5d56774', 'f52f5c1a-1f8a-4b8a-8e00-fc2bdc6edc5b', '90819230-f372-4e48-96fc-6fb97199fa07', 'b31aefbf-fb67-49dc-b357-f8f0cd76cea9', '5d695f24-674a-4dc5-9e02-7817b77ab06b', '064d5972-f636-4771-95fe-3f6260fd550f', 'c2254f71-98dc-4303-bc26-9b9640582be1', '6eac9495-a366-4e65-90d2-d63472937925', '119398e3-b8cc-4ae5-addb-ec13db9834fa', '6cce7b35-fe2a-4348-9e42-5179ea9f42f1', '65940929-c9fc-47e9-b1cf-c9c9688f7871', '73ffdb1a-f70d-4e8e-88b7-0e22cfd1916e', 'c1b25581-7d28-4535-bcdc-44dc3bc7e438', '6e03a5f7-f200-439a-a465-97056d3c9f71', '4d270b71-2e06-4cfb-a60d-258ccbc7860a', '8b82e29f-a901-454f-a9ad-2f96be9d6c44', '7d699b76-f554-44db-9c68-6ff985cd6388', '3417b2dc-a88a-4cb6-a446-9e90063731f9', '18760f59-4b50-48d5-9814-8117490ab972', '4aaebf37-9529-4365-bdb8-dd53b0ac2499', '95ecdf43-12a5-4b3c-993a-ff03b58cee93', '2b5dd4e6-2310-417e-82bb-b16e96c7346b', '92ee883a-646d-44dd-b2c6-1bffb7b0d2cb', '979038e4-9392-4836-ad04-f125cf19eafa', '1220629d-000f-4508-8a41-3706eebeb812', '42abca44-8eb3-4aa7-adae-16afc211dff4', '82fe9559-718e-4424-9465-033204e1ec03', '8353fe08-e6c8-4f87-b0d8-412ab4a41d19', '1c6bebcf-c125-42a3-9d5b-4fae3113b62b', 'ba54b2ba-fee3-4f1d-a903-18edc7c694bd', '0ea0d40d-3204-4b9b-bae2-54355dce2b5c', '5ee4857c-00b4-46d6-880c-44dbae021b45', '2caa4c03-78ce-456d-8e20-edb531bdd45a', 'e2536a5e-357d-4f6d-a764-ac85a40a2f3f', 'e6341996-80bb-42f9-8842-92062680e957', '3612e03e-430d-4da3-ac87-93a310a3d780', '88c600d2-cefd-4a99-a904-bf2260554ac6', 'f1a6af16-2525-4650-b729-cbec60ad276c', '4b854252-9e87-4d7c-99d9-84ae9297d26e', 'be580989-3ccd-48bd-8c85-a750a800afbd', '5fd675fe-e765-4bf0-8e0f-8f81107a0bb8', 'cf852032-6399-4bf8-a8e7-474c84030430', 'eef27f8a-32d2-4add-a018-ff2d34208a11', 
'3aca3b1d-4589-4b4c-90de-588fd43fe835', 'd6187213-5a39-4089-ac50-eb144be2a3a5', '5bf60cda-b6b9-4992-91ac-c022e523202a', '4c4395ca-2f2e-4a85-93df-37d2c7f3d1d6', 'b8f9d837-2bd6-447c-9ad8-f581f84f36c1', '029a88bb-3850-4e85-87ab-8ecb3ad59538', '39ead890-0e1a-43df-9bbc-459a3ea0a016', '4b559ad2-c4d8-4763-bc08-90cb63fc79d0', '8361884a-248b-4dac-a9f9-d56f31ab477e', 'd79e2e00-9ea6-4d43-addc-3b1955bc7e5f', '4c0a35ac-c549-4c1a-9fc8-a2e93ba1c632', '50f558d0-c7b1-4204-8ebb-5855e7588998', 'be061746-1b34-4c04-a752-ab5c8d78fdef', 'f8edfb50-c572-4025-87c6-b34e88d8fb90', '0a799ff1-df2c-4c85-9b7e-4fe4885ab5cd', 'db373aa1-8ab9-4257-8d48-11dc92448344', '1e2b9de8-74a4-446a-970e-b47c662760b2', 'ac09ffdf-9cb5-49be-8f52-b681598453f6', 'ae4a55af-a1bb-4698-b2e7-ffbed8760635', '7989ff1f-a9da-439a-bb8b-14482b15dd2e') + +# delete_query deletes from the AutoAdminGroup +delete_query = 'delete from linked_phenotype_data where group_id="5ea09f67-5426-4b66-9ea2-12bdd78350e8" and SpeciesId="1" and InbredSetId="1" and PublishFreezeId="1" and PublishXRefId=?' 
+resource_query = "insert into phenotype_resources values ('e5cc773d-ca28-44e2-b2a7-1c2901794238', ?)" +link_query = 'insert into linked_phenotype_data (data_link_id, group_id, SpeciesId, InbredSetId, PublishFreezeId, dataset_name, dataset_fullname, dataset_shortname, PublishXRefId) values (?,?,?,?,?,?,?,?,?)' + +db_path = '/home/gn2/auth.db' +conn = sqlite3.connect(db_path) +cursor = conn.cursor() + +the_data = tuple((dlid, group_id, 1, 1, 1, 'BXDPublish', 'BXD Phenotypes', 'BXD Publish', pxrid) for (dlid, pxrid) in zip(data_link_ids, publishxrefs)) + +cursor.executemany(delete_query, tuple((item,) for item in publishxrefs)) +cursor.executemany(link_query, the_data) +cursor.executemany(resource_query, tuple((item,) for item in data_link_ids)) +conn.commit() +``` diff --git a/issues/gn-auth/feature-request-create-test-accounts.gmi b/issues/gn-auth/feature-request-create-test-accounts.gmi new file mode 100644 index 0000000..9e8aa45 --- /dev/null +++ b/issues/gn-auth/feature-request-create-test-accounts.gmi @@ -0,0 +1,51 @@ +# Feature Request: Create Test Accounts + +## Tags + +* assigned: fredm, alex +* status: open +* type: feature request, feature-request +* priority: medium +* keywords: gn-auth, auth, test accounts + +## Description + +From the requests on Matrix: + +@alexm +``` +fredmanglis +: Can we create a generic, verified email for CD to make it easier for people to test our services that requires login? +``` + +and from @pjotrp + +``` +yes, please. Let it expire after a few weeks, or something, if possible. So we can hand out test accounts. 
+``` + +We, thus, want to have a feature that allows the system administrator, or some other user with the appropriate privileges, to create a bunch of test accounts that have the following properties: + +* The accounts are pre-verified +* The accounts are temporary and are deleted after a set amount of time + +This feature will need a corresponding UI, say on GN2, to enable users with the appropriate privileges to create the accounts easily. + +### Implementation Considerations + +Only system-admin level users will be able to create the test accounts. + +We'll probably need to track the plain-text passwords for these accounts. + +Information to collect might include: +* Start of test period (automatic on test account creation: mandatory) +* End of test period (Entered at creation time: mandatory) +* A pattern of sorts to follow when creating the accounts. This brings up the question: is there a specific domain (e.g. …@uthsc.edu, …@genenetwork.org etc.) that these test accounts should use? +* Extra details on event/conference necessitating creation of the test account(s) (optional) + + +Interactions with the rest of the system that we need to consider and handle are: +* Assign public-read for all public data: mostly easy. +* Forgot Password: If such users request a password change, what happens? Password changes require emails to be sent out with a time-sensitive token. The emails in the test accounts are not meant to be actual existing emails and thus cannot reliably receive such emails. This needs to be considered. Probably just prevent users from changing their passwords. +* What group to assign to these test accounts? I'm thinking probably a new group that is also temporary - deleted when users are deleted. +* What happens to any data uploaded by these accounts? They should probably not upload data meant to be permanent. All their data might need to be deleted along with the temporary accounts.
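A rough sketch of the two properties above (pre-verified, expiring) might look like the following. Everything here is hypothetical: the `TemporaryAccount` record, the `create_test_accounts` and `expired_accounts` helpers, the e-mail pattern and the `test.example.org` domain are placeholders for discussion, not part of gn-auth.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TemporaryAccount:
    """Hypothetical record for a pre-verified, expiring test account."""
    email: str
    password: str       # plain text, so it can be handed out to testers
    verified: bool
    created: datetime   # start of test period (set automatically)
    expires: datetime   # end of test period (chosen at creation time)


def create_test_accounts(count, valid_for, domain="test.example.org", now=None):
    """Create `count` pre-verified accounts that expire after `valid_for`."""
    now = now or datetime.now(timezone.utc)
    return [
        TemporaryAccount(
            email=f"test-user-{idx:03d}@{domain}",
            password=secrets.token_urlsafe(12),
            verified=True,
            created=now,
            expires=now + valid_for,
        )
        for idx in range(1, count + 1)
    ]


def expired_accounts(accounts, now):
    """Accounts whose test period is over and that can be deleted."""
    return [account for account in accounts if account.expires <= now]


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
accounts = create_test_accounts(3, timedelta(weeks=2), now=start)
print([account.email for account in accounts])
```

A periodic job could then call something like `expired_accounts` and remove those users together with their temporary group and any uploaded data.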
diff --git a/issues/gn-auth/fix-refresh-token.gmi b/issues/gn-auth/fix-refresh-token.gmi new file mode 100644 index 0000000..222b731 --- /dev/null +++ b/issues/gn-auth/fix-refresh-token.gmi @@ -0,0 +1,58 @@ +# Fix Refresh Token + +## Tags + +* status: closed, obsolete +* priority: high +* assigned: fredm +* type: feature-request, bug +* keywords: gn-auth, token, refresh token, jwt + +## Description + +The way we currently provide the refresh token is wrong, and complicated, and +leads to subtle bugs in the clients. + +The refresh tokens should be sent back together with the access token in the +same response with the following important considerations: + +* The access token is sent back as the body of the response +* The refresh token is sent back as a httpOnly cookie +* The refresh token should be opaque to the client — if it is a JWT, encrypt it + +### Server-Side Changes + +The following changes will be necessary at the generation of the access token: + +* Generate the refresh token (possibly in the `create_token_response()` function in `gn_auth.auth.authentication.oauth2.grants.JWTBearerGrant`). Put the user ID, and expiration in the refresh token. Expiration can be provided as part of initial request. +* Encrypt the refresh token (maybe use the auth-server's public key for this) +* Save refresh token to DB with link to access token ID perhaps? +* Attach the token to the response as a httpOnly cookie + +at the refreshing of the access token, we'll need to: + +* Fetch the refresh token from the cookies +* Decrypt it +* Compare the user ID in the refresh token with that in the access token provided +* Verify refresh token has not expired +* Check that the refresh token is not revoked (revocation will happen when user logs out, on manual sys-admin revocation) +* Generate new access token +* Do we attach the same refresh token or generate a new one? 
+ +#### Gotchas + +Since there are multiple workers, you could get a flurry of refresh requests using the same refresh token. We might need to handle that — maybe save the refresh request to DB with the ID of the access token used and the new access token, and simply return the same new access token generated by the first successful refresh worker. + +This actually kills 2 birds with the one stone: +* The refresh completes successfully if the refresh token is not expired and the access token is valid +* In case the access token and refresh token are somehow compromised, the system returns the same, possibly expired access token, rendering the compromise moot. + +### Client-Side Changes + +* Get the refresh token from the cookies rather than from the body +* Maybe: make refreshing the access token unaware of threads/workers + + +## Close as Obsolete + +We no longer do refresh tokens at all, they were a pain to look into, so I simply removed them from the system. diff --git a/issues/gn-auth/implement-redirect-on-login.gmi b/issues/gn-auth/implement-redirect-on-login.gmi new file mode 100644 index 0000000..342b2e6 --- /dev/null +++ b/issues/gn-auth/implement-redirect-on-login.gmi @@ -0,0 +1,22 @@ +# Redirect Users to the Correct URL on Login for GN2 + +## Tags + +* assigned: alexm +* priority: medium +* status: in progress +* keywords: gn-auth, auth, redirect, login, completed, closed, done +* type: feature-request + +## Description + +The goal is to redirect users to the login page for services that require authentication, and then return them to the page they were trying to access before logging in, rather than sending them to the homepage. Additionally, display the message "You are required to log in" on the current page instead of on the homepage. + +## Tasks + +* [x] Redirect users to the login page if they are not logged in. +* [x] Implement a redirect to the correct resource after users log in. 
+ +## Notes +See this PR for commits that fixes this: +=> https://github.com/genenetwork/genenetwork2/pull/875 diff --git a/issues/gn-auth/implement-refresh-token.gmi b/issues/gn-auth/implement-refresh-token.gmi index 6b697eb..0dc63f3 100644 --- a/issues/gn-auth/implement-refresh-token.gmi +++ b/issues/gn-auth/implement-refresh-token.gmi @@ -2,7 +2,7 @@ ## Tags -* status: open +* status: closed, completed, fixed * priority: high * assigned: fredm, bonfacem * type: feature-request, bug diff --git a/issues/gn-auth/new-privilegs-samples-ordering.gmi b/issues/gn-auth/new-privilegs-samples-ordering.gmi new file mode 100644 index 0000000..be9cfe9 --- /dev/null +++ b/issues/gn-auth/new-privilegs-samples-ordering.gmi @@ -0,0 +1,32 @@ +# New Privileges: Samples Ordering + +## Tags + +* status: open +* assigned: fredm +* interested: @zachs, @jnduli, @flisso +* priority: medium +* type: feature-request, feature request +* keywords: gn-auth, auth, privileges, samples, ordering + +## Description + +From the email thread: + +``` +Regarding the order of samples, it can basically be whatever we decide it is. It just needs to stay consistent (like if there are multiple genotype files). It only really affects how it's displayed, and any other genotype files we use for mapping needs to share the same order. +``` + +Since this has nothing to do with the data analysis, this could be considered a system-level privilege. I propose + +``` +system:species:samples:ordering +``` + +or something similar. + +This can be added into some sort of generic GN2 curator role (as opposed to a data curator role). + +This allows us to have users that are "data curators" that we can offload some of the data curation work to (e.g. @flisso, @suheeta etc.). + +We would then, restrict the UI and display "curation" to users like @acenteno, @robw and @zachs. This second set of users would thus have both the "data curation" roles, and still have the "UI curation" roles. 
diff --git a/issues/gn-auth/pass-on-unknown-get-parameters.gmi b/issues/gn-auth/pass-on-unknown-get-parameters.gmi new file mode 100644 index 0000000..a349800 --- /dev/null +++ b/issues/gn-auth/pass-on-unknown-get-parameters.gmi @@ -0,0 +1,17 @@ +# Pass on Unknown GET Parameters + +## Tags + +* status: open +* priority: medium +* type: feature-request, enhancement +* assigned: fredm, zsloan +* keywords: gn-auth, authorisation + +## Description + +A developer or user may need to access some feature hidden behind a flag (so-called "feature flags"). Some of these flags are set using known (to the application and developer/user) GET parameters. + +If the user provides these GET parameters before login, then goes through the login process, the unknown GET parameters are dropped silently, and the user has to set them up again manually. This, while not a big deal, is annoying and wastes a few seconds each time. + +This feature request proposes to pass any unknown GET parameters untouched through the authentication/authorisation server and back to the authenticating client during the login process, to mitigate this small annoyance. 
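One possible shape for this, sketched with the standard library only. The set of "known" parameters and the function names are hypothetical; the real list would come from whatever the auth server already handles.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: parameters the auth server itself consumes during login.
KNOWN_PARAMS = frozenset({"response_type", "client_id", "redirect_uri"})


def unknown_params(query_string, known=KNOWN_PARAMS):
    """Return the (name, value) pairs the auth server does not recognise."""
    return [(name, value)
            for name, value in parse_qsl(query_string, keep_blank_values=True)
            if name not in known]


def append_params(url, extra):
    """Re-attach the given pairs to a URL, keeping its existing query."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True) + list(extra)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The server would call `unknown_params` on the incoming login request, carry the result through the login flow, and use `append_params` on the redirect back to the client.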
diff --git a/issues/gn-auth/problems-with-roles.gmi b/issues/gn-auth/problems-with-roles.gmi index 46f3c52..2778b61 100644 --- a/issues/gn-auth/problems-with-roles.gmi +++ b/issues/gn-auth/problems-with-roles.gmi @@ -3,9 +3,9 @@ ## Tags * type: bug -* status: open * priority: critical * assigned: fredm, zachs +* status: closed, completed, fixed * keywords: gn-auth, authorisation, authorization, roles, privileges ## Description @@ -29,8 +29,8 @@ The implementation should instead, tie the roles to the specific resource, rathe * [x] migration: Add `resource:role:[create|delete|edit]-role` privileges to `resource-owner` role * [x] migration: Create new `resource_roles` db table linking each resource to roles that can act on it, and the user that created the role * [x] migration: Drop table `group_roles` deleting all data in the table: data here could already have privilege escalation in place -* [ ] Create a new "Roles" section on the "Resource-View" page, or a separate "Resource-Roles" page to handle the management of that resource's roles -* [ ] Ensure user can only assign roles they have created - maybe? +* [x] Create a new "Roles" section on the "Resource-View" page, or a separate "Resource-Roles" page to handle the management of that resource's roles +* [x] Ensure user can only assign roles they have created - maybe? 
### Fixes @@ -39,3 +39,4 @@ The implementation should instead, tie the roles to the specific resource, rathe => https://git.genenetwork.org/gn-auth/commit/?h=handle-role-privilege-escalation&id=5d34332f356164ce539044f538ed74b983fcc706 => https://git.genenetwork.org/gn-auth/commit/?h=handle-role-privilege-escalation&id=f691603a8e7a1700783b2be6f855f30d30f645f1 => https://git.genenetwork.org/gn-auth/commit/?h=handle-role-privilege-escalation&id=2363842cc81132a2592d5cda98e6ebf1305e8482 +=> https://github.com/genenetwork/genenetwork2/commit/a7a8754a57594e5705fea8e5bbea391a09e8f64c diff --git a/issues/gn-auth/registration.gmi b/issues/gn-auth/registration.gmi index 6558a6d..61ea94a 100644 --- a/issues/gn-auth/registration.gmi +++ b/issues/gn-auth/registration.gmi @@ -2,8 +2,11 @@ # Tags +* type: bug * assigned: fredm * priority: critical +* status: closed, completed, fixed +* keywords: gn-auth, auth, authorisation, authentication, registration # Issues diff --git a/issues/gn-auth/resources-duplicates-in-resources-list.gmi b/issues/gn-auth/resources-duplicates-in-resources-list.gmi new file mode 100644 index 0000000..379c1eb --- /dev/null +++ b/issues/gn-auth/resources-duplicates-in-resources-list.gmi @@ -0,0 +1,29 @@ +# Resources: Duplicates in Resources List + +## Tags + +* type: bug +* status: closed +* priority: medium +* assigned: fredm, zachs, zsloan +* keywords: gn-auth, auth, authorisation, resources + +## Reproduce + +* Go to https://genenetwork.org/ +* Sign in to the system +* Click on "Profile" at the top to go to your profile page +* Click on "Resources" on your profile page to see the resources you have access to + +## Expected + +Each resource appears on the list only one time + +## Actual + +Some resources appear more than once on the list + + +## Fix + +=> https://git.genenetwork.org/gn-auth/commit/?id=00f863b3dcb76f5fdca8e139e903e2f7edb861fc diff --git a/issues/gn-auth/rework-view-resource-page.gmi b/issues/gn-auth/rework-view-resource-page.gmi new file 
mode 100644 index 0000000..2d6e145 --- /dev/null +++ b/issues/gn-auth/rework-view-resource-page.gmi @@ -0,0 +1,22 @@ +# Rework "View-Resource" Page + +## Tags + +* status: closed, completed +* priority: medium +* type: enhancement +* assigned: fredm, zsloan +* keywords: gn-auth, resource, resources, view resource + +## Description + +The view resource page ('/oauth2/resource/<uuid>/view') was built with only Genotype, Phenotype, and mRNA resources in mind. + +We have since moved on, and added more types of resources (group, system, inbredset-group, etc). This leads to the page breaking for these other types of resources. + +We need to update the UI and route to ensure the page renders correctly for each type, or at the very least, redirects to the correct page (e.g. in the case of groups, which have a separate "view group" page). + + +## Close as complete + +This is fixed now. diff --git a/issues/send-out-confirmation-emails-on-registration.gmi b/issues/gn-auth/send-out-confirmation-emails-on-registration.gmi index c85e26b..e32c7c0 100644 --- a/issues/send-out-confirmation-emails-on-registration.gmi +++ b/issues/gn-auth/send-out-confirmation-emails-on-registration.gmi @@ -2,11 +2,11 @@ ## Tags -* status: open +* status: closed, completed * assigned: fredm * priority: medium -* keywords: email, user registration * type: feature request, feature-request +* keywords: gn-auth, email, user registration, email confirmation ## Description diff --git a/issues/gn-auth/test1-deployment-cant-find-templates.gmi b/issues/gn-auth/test1-deployment-cant-find-templates.gmi index bd2f57e..ca3bfad 100644 --- a/issues/gn-auth/test1-deployment-cant-find-templates.gmi +++ b/issues/gn-auth/test1-deployment-cant-find-templates.gmi @@ -4,7 +4,7 @@ * assigned: fredm, aruni * priority: critical -* status: open +* status: closed, completed, fixed * type: bug * keywords: gn-auth, deployment, test1 diff --git a/issues/gn-guile/Configurations.gmi b/issues/gn-guile/Configurations.gmi new file 
mode 100644 index 0000000..f1ae06e --- /dev/null +++ b/issues/gn-guile/Configurations.gmi @@ -0,0 +1,60 @@ +# gn-guile Configurations + +## Tags + +* type: bug +* assigned: +* priority: high +* status: open +* keywords: gn-guile, markdown editing +* interested: alexk, bonfacem, fredm, pjotrp + +## Description + +=> https://git.genenetwork.org/gn-guile/ The gn-guile service +is used to enable markdown editing in GeneNetwork. + +There are configuration that are needed to get the system to work as expected: + +* CURRENT_REPO_PATH: The local path to the cloned repository +* CGIT_REPO_PATH: path to the bare repo (according to docs [gn-guile-docs]) + +With these settings, we should be able to make changes to make edits. These edits, however, do not get pushed upstream. + +Looking at the code +=> https://git.genenetwork.org/gn-guile/tree/web/webserver.scm?id=4623225b0adb0846a4c2e879a33b31884d2e5f05#n212 +we see both the settings above being used, and we can further have a look at +=> https://git.genenetwork.org/gn-guile/tree/web/view/markdown.scm?id=4623225b0adb0846a4c2e879a33b31884d2e5f05#n78 the definition of git-invoke. + +With the above, we could, hypothetically, do a command like: + +``` +git -C ${CURRENT_REPO_PATH} push ${REMOTE_REPO_URI} master +``` + +where REMOTE_REPO_URI can be something like "appuser@git.genenetwork.org:/home/git/public/gn-guile" + +That means we change the (git-invoke …) call seen previously to something like: + +``` +(git-invoke +current-repo-path+ "push" +remote-repo-url+ "master") +``` + +and make sure that the "+remote-repo-url+" value is something along the URI above. + +### Gotchas + +We need to fetch and rebase with every push, to avoid conflicts. 
That means we'll need a sequence such as the following: + +``` +(git-invoke +current-repo-path+ "fetch" +remote-repo-url+ "master") +(git-invoke +current-repo-path+ "rebase" "origin/master") +(git-invoke +current-repo-path+ "push" +remote-repo-url+ "master") +``` + +The tests above work with a normal user. We'll be running this code within a container, so we do need to expose a specific private ssh key for the user to use to push to remote. This also means that the corresponding public key should be registered with the repository server. + +## References + +* [gn-guile-docs] https://git.genenetwork.org/gn-guile/tree/doc/git-markdown-editor.md?id=4623225b0adb0846a4c2e879a33b31884d2e5f05 + diff --git a/issues/gn-guile/activations-on-production-not-running-as-expected.gmi b/issues/gn-guile/activations-on-production-not-running-as-expected.gmi new file mode 100644 index 0000000..be9cc00 --- /dev/null +++ b/issues/gn-guile/activations-on-production-not-running-as-expected.gmi @@ -0,0 +1,57 @@ +# gn-guile: Activations on Production not Running as Expected + +## Tags + +* status: closed, completed, fixed +* priority: high +* type: bug +* assigned: bonfacem, fredm, aruni +* keywords: gn-guile, deployment, activation-service-type + +## Description + +With the recent changes to guix's `least-authority-wrapper` we can no longer write to the root filesystem ("/"). That is not much of a problem. + +So I tried adding `#:directory (dirname gn-doc-git-checkout)` to the `make-forkexec-constructor` for the `gn-guile-shepherd-service` and that actually changes the working directory of the process, as I would expect. + +In `genenetwork-activation` I add: + +``` + ;; setup correct ownership for gn-docs + (for-each (lambda (file) + (chown file + (passwd:uid (getpw "genenetwork")) + (passwd:gid (getpw "genenetwork")))) + (find-files #$(dirname gn-doc-git-checkout) + #:directories? 
#t)) +``` + +which, ideally, should change ownership of the parent directory of the bare git checkout for "gn-docs" when we build/start the container. This does not happen — the directory is still owned by root. + +My thinking goes, the "genenetwork" user[1] is not yet created at the point when the activation[2] is run, leading to the service failing to start. + +The reason I think this, is because, when I do: + +``` +fredm@tux04:/...$ sudo guix container exec <container-pid> /run/current-system/profile/bin/bash --login +root@genenetwork-gn2-fred /# chown -R genenetwork:genenetwork /var/lib/genenetwork/ +root@genenetwork-gn2-fred /# chown -R genenetwork:genenetwork /var/lib/genenetwork/ +``` + +The bound directory's permissions change, and we can now enable and start the service: + +``` +root@genenetwork-gn2-fred /# herd enable gn-guile +root@genenetwork-gn2-fred /# herd start gn-guile +``` + +which starts the service as expected. We can also simply restart the entire container at this point, and it works too. + +## Footnotes + +=> https://git.genenetwork.org/gn-machines/tree/genenetwork/services/genenetwork.scm?id=e425671e69a321a032134fafee974442e8c1ce6f#n167 [1] "genenetwork" user declaration +=> https://git.genenetwork.org/gn-machines/tree/genenetwork/services/genenetwork.scm?id=e425671e69a321a032134fafee974442e8c1ce6f#n680 [2] Activation of services (see also the account-service-type being extended with the "genenetwork" user). + +## Close as Fixed + +This issue is fixed, with newer Guix and changes that @bonz did to the gn-machines repo. 
diff --git a/issues/gn-guile/rendering-images-within-markdown-documents.gmi b/issues/gn-guile/rendering-images-within-markdown-documents.gmi new file mode 100644 index 0000000..fe3ed39 --- /dev/null +++ b/issues/gn-guile/rendering-images-within-markdown-documents.gmi @@ -0,0 +1,22 @@ +# Rendering Images Linked in Markdown Documents + +## Tags + +* status: open +* priority: high +* type: bug +* assigned: alexm, bonfacem, fredm +* keywords: gn-guile, images, markdown + +## Description + +Rendering images linked within markdown documents does not work as expected: images with a relative path are not rendered. +As an example see the commit below: +=> https://github.com/genenetwork/gn-docs/commit/783e7d20368e370fb497974f843f985b51606d00 + +In that commit, we are forced to use the full github uri to get the images to load correctly when rendered via gn-guile. This has two unfortunate consequences: + +* It makes editing more difficult, since the user has to remember to find and use the full github URL for their images. +* It ties the data and code to github + +This needs to be fixed, such that any and all paths relative to the markdown file are resolved at render time automatically. 
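The resolution step itself is small. The sketch below is in Python purely for illustration (gn-guile is written in Guile, and `resolve_image_path` is a hypothetical name); it assumes paths are repository-relative and use forward slashes.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_image_path(markdown_file, image_link):
    """Resolve an image link relative to the markdown file that contains it.

    Full URLs and absolute paths are returned unchanged.
    """
    parts = urlsplit(image_link)
    if parts.scheme or parts.netloc or image_link.startswith("/"):
        return image_link
    base = posixpath.dirname(markdown_file)
    return posixpath.normpath(posixpath.join(base, image_link))
```

A real implementation would also need to reject results that climb out of the repository (for example a link with enough `../` segments), which this sketch does not do.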
diff --git a/issues/gn-guile/rework-hard-dependence-on-github.gmi b/issues/gn-guile/rework-hard-dependence-on-github.gmi new file mode 100644 index 0000000..751e9fe --- /dev/null +++ b/issues/gn-guile/rework-hard-dependence-on-github.gmi @@ -0,0 +1,21 @@ +# Rework Hard Dependence on Github + +## Tags + +* status: open +* priority: medium +* type: bug +* assigned: alexm +* assigned: bonfacem +* assigned: fredm +* keywords: gn-guile, github + +## Description + +Currently, we have a hard-dependence on Github for our source repository — you can see this in lines 31, 41, 55 and 59 of the code linked below: + +=> https://git.genenetwork.org/gn-guile/tree/web/view/markdown.scm?id=0ebf6926db0c69e4c444a6f95907e0971ae9bf40 + +The most likely reason is that the "edit online" functionality might not exist in a lot of other popular source forges. + +This is rendered moot, however, since we do provide a means to edit the data on Genenetwork itself. We might as well get rid of this option, and only allow the "edit online" feature on Genenetwork and stop relying on its presence in the forges we use. diff --git a/issues/gn-libs/jobs-allow-job-cascades.gmi b/issues/gn-libs/jobs-allow-job-cascades.gmi new file mode 100644 index 0000000..f659f32 --- /dev/null +++ b/issues/gn-libs/jobs-allow-job-cascades.gmi @@ -0,0 +1,26 @@ +# Jobs: Allow Job Cascades + +## Tags + +* status: open +* priority: medium +* type: enhancement +* assigned: fredm, zsloan +* keywords: gn-libs, genenetwork, async jobs, asynchronous jobs, background jobs + +## Description + +Some jobs could require more than a single command/script to be run to complete. + +Rather than refactoring/rewriting the entire "async jobs" feature, I propose adding a way to note who started a job, i.e. +* the user, OR +* another job + +This could be tracked in an extra field in the database, say "started_by" which can have values of the form +* "user:<user-id>" +* "job:<job-id>" +where the parts in the angle bracket (i.e. 
"<user-id>" and "<job-id>") are replaced by actual ids. + +## Related Issues + +=> /issues/gn-libs/jobs-track-who-jobs-belong-to diff --git a/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi b/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi new file mode 100644 index 0000000..00eaf21 --- /dev/null +++ b/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi @@ -0,0 +1,23 @@ +# Jobs: Track Who Jobs Belong To + +## Tags + +* status: open +* priority: medium +* type: enhancement +* assigned: fredm, zsloan +* keywords: gn-libs, genenetwork, async jobs, asynchronous jobs, background jobs + +## Description + +Some features in Genenetwork require long-running processes to be triggered and run in the background. We have a way to trigger such background processes, but there is no way of tracking who started what job, and therefore, no real way for a user to list only their jobs. + +This issue will track the introduction of such tracking. This will enable the building new job-related functionality such as a user being able to: +* list their past, unexpired jobs +* delete past jobs +* possibly rerun jobs that failed but are recoverable +* see currently running jobs, and their status + +## Related Issues + +=> /issues/gn-libs/jobs-allow-job-cascades diff --git a/issues/gn-uploader/AuthorisationError-gn-uploader.gmi b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi new file mode 100644 index 0000000..262ad19 --- /dev/null +++ b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi @@ -0,0 +1,70 @@ +# AuthorisationError in gn uploader + +## Tags +* assigned: fredm +* status: closed, obsoleted +* priority: critical +* type: error +* key words: authorisation, permission + +## Description + +Trying to create population for Kilifish dataset in the gn-uploader webpage, +then encountered the following error: +```sh +Traceback (most recent call last): + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/flask/app.py", line 917, in 
full_dispatch_request + rv = self.dispatch_request() + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/flask/app.py", line 902, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return] + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/authorisation.py", line 23, in __is_session_valid__ + return session.user_token().either( + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/pymonad/either.py", line 89, in either + return right_function(self.value) + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/authorisation.py", line 25, in <lambda> + lambda token: function(*args, **kwargs)) + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/population/views.py", line 185, in create_population + ).either( + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/pymonad/either.py", line 91, in either + return left_function(self.monoid[0]) + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/monadic_requests.py", line 99, in __fail__ + raise Exception(_data) +Exception: {'error': 'AuthorisationError', 'error-trace': 'Traceback (most recent call last): + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/flask/app.py", line 917, in full_dispatch_request + rv = self.dispatch_request() + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/flask/app.py", line 902, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return] + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/authlib/integrations/flask_oauth2/resource_protector.py", line 110, in decorated + 
return f(*args, **kwargs) + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/resources/inbredset/views.py", line 95, in create_population_resource + ).then( + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/pymonad/monad.py", line 152, in then + result = self.map(function) + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/pymonad/either.py", line 106, in map + return self.__class__(function(self.value), (None, True)) + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/resources/inbredset/views.py", line 98, in <lambda> + "resource": create_resource( + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/resources/inbredset/models.py", line 25, in create_resource + return _create_resource(cursor, + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/checks.py", line 56, in __authoriser__ + raise AuthorisationError(error_description) +gn_auth.auth.errors.AuthorisationError: Insufficient privileges to create a resource +', 'error_description': 'Insufficient privileges to create a resource'} + +``` +The error above resulted from the attempt to upload the following information on the gn-uploader-`create population section` +Input details are as follows: +Full Name: Kilifish F2 Intercross Lines +Name: KF2_Lines +Population code: KF2 +Description: Kilifish second generation population +Family: Crosses, AIL, HS +Mapping Methods: GEMMA, QTLReaper, R/qtl +Genetic type: intercross + +And when pressed the `Create Population` icon, it led to the error above. 
+ +## Closed as Obsolete + +* The service this was happening on (https://staging-uploader.genenenetwork.org) is no longer running +* Most of the authorisation issues are resolved in newer code diff --git a/issues/gn-uploader/check-genotypes-in-database-too.gmi b/issues/gn-uploader/check-genotypes-in-database-too.gmi new file mode 100644 index 0000000..4e034b7 --- /dev/null +++ b/issues/gn-uploader/check-genotypes-in-database-too.gmi @@ -0,0 +1,22 @@ +# Check Genotypes in the Database for R/qtl2 Uploads + +## Tags + +* type: bug +* assigned: fredm +* priority: high +* status: closed, completed, fixed +* keywords: gn-uploader, uploader, upload, genotypes, geno + +## Description + +Currently, the uploader expects that a R/qtl2 bundle be self-contained, i.e. it contains all the genotypes and other data that fully describe the data in that bundle. + +This is unnecessary, in a lot of situations, seeing as Genenetwork might already have the appropriate genotypes already in its database. + +This issue tracks the implementation for the check of the genotypes against both the genotypes provided in the bundle, and those already in the database. + +### Updates + +Fixed in +=> https://git.genenetwork.org/gn-uploader/commit/?id=0e74a1589db9f367cdbc3dce232b1b6168e3aca1 this commit diff --git a/issues/export-uploaded-data-to-RDF-store.gmi b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi index c39edec..3ef05cd 100644 --- a/issues/export-uploaded-data-to-RDF-store.gmi +++ b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi @@ -6,7 +6,7 @@ * priority: medium * type: feature-request * status: open -* keywords: API, data upload +* keywords: API, data upload, gn-uploader ## Description @@ -73,10 +73,16 @@ The metadata is useful for searching for the data. The "metadata->rdf" project[4 * [ ] How do we handle this? 
+## Related Issues and Topics + +=> https://issues.genenetwork.org/topics/next-gen-databases/design-doc +=> https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter +=> https://issues.genenetwork.org/issues/dump-sample-data-to-lmdb +=> https://issues.genenetwork.org/topics/database/genotype-database ## Footnotes -=> https://gitlab.com/fredmanglis/gnqc_py 1: QC/Data upload project repository +=> https://git.genenetwork.org/gn-uploader/ 1: QC/Data upload project (gn-uploader) repository => https://github.com/genenetwork/genenetwork3/pull/130 2: Munyoki's Pull request => https://github.com/BonfaceKilz/gn-dataset-dump 3: Dataset -> LMDB export repository -=> https://github.com/genenetwork/dump-genenetwork-database 4: Metadata -> RDF export repository +=> https://git.genenetwork.org/gn-transform-databases/ 4: Metadata -> RDF export repository diff --git a/issues/gn-uploader/gn-uploader-container-running-wrong-gn2.gmi b/issues/gn-uploader/gn-uploader-container-running-wrong-gn2.gmi index d2c33e8..5a5cdfa 100644 --- a/issues/gn-uploader/gn-uploader-container-running-wrong-gn2.gmi +++ b/issues/gn-uploader/gn-uploader-container-running-wrong-gn2.gmi @@ -3,7 +3,7 @@ ## Tags * assigned: fredm, aruni -* status: open +* status: closed, completed * priority: high * type: bug * keywords: guix, gn-uploader diff --git a/issues/gn-uploader/guix-build-gn-uploader-error.gmi b/issues/gn-uploader/guix-build-gn-uploader-error.gmi index 44a5c4b..aeb6308 100644 --- a/issues/gn-uploader/guix-build-gn-uploader-error.gmi +++ b/issues/gn-uploader/guix-build-gn-uploader-error.gmi @@ -86,7 +86,7 @@ Filesystem Size Used Avail Use% Mounted on so we know that's not a problem. -A similar thing had shown up on space.uthsc.edu. +A similar thing had shown up on our space server. 
### More Troubleshooting Efforts diff --git a/issues/gn-uploader/handling-tissues-in-uploader.gmi b/issues/gn-uploader/handling-tissues-in-uploader.gmi index 826af15..0c43040 100644 --- a/issues/gn-uploader/handling-tissues-in-uploader.gmi +++ b/issues/gn-uploader/handling-tissues-in-uploader.gmi @@ -2,11 +2,11 @@ ## Tags -* status: open +* status: closed, wontfix * priority: high * assigned: fredm * type: feature-request -* keywords: gn-uploader, tissues +* keywords: gn-uploader, tissues, archived ## Description @@ -112,3 +112,9 @@ ALTER TABLE Tissue MODIFY Id INT(5) UNIQUE NOT NULL; * [1] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#ProbeFreeze * [2] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#Tissue + +## Closed as WONTFIX + +I am closing this issue because it was created (2024-03-28) while I had a fundamental misunderstanding of the way data is laid out in the database. + +The information on the schema/layout of the tables is still useful, but chances are, we'll look at the tables themselves anyway should we need to figure out the schema. diff --git a/issues/gn-uploader/link-authentication-authorisation.gmi b/issues/gn-uploader/link-authentication-authorisation.gmi new file mode 100644 index 0000000..b64f887 --- /dev/null +++ b/issues/gn-uploader/link-authentication-authorisation.gmi @@ -0,0 +1,21 @@ +# Link Authentication/Authorisation + +## Tags + +* status: closed, completed +* assigned: fredm +* priority: critical +* type: feature request, feature-request +* keywords: gn-uploader, gn-auth, authorisation, authentication, uploader, upload + +## Description + +The last chain in the link to the uploads is the authentication/authorisation. Once the user uploads their data, they need access to it. The auth system, by default, will deny anyone/everyone access to any data that is not linked to a resource and which no user has any roles allowing them access to the data. 
+ +We, currently, assign such data to the user manually, but that is not a sustainable way of working, especially as the uploader is exposed to more and more users. + +### Close as Completed + +The current iteration of the uploader does actually take into account the user that is uploading the data, granting them ownership of the uploaded data. By default, the data is not public, and is only accessible to the user who uploaded it. + +The user who uploads the data (and therefore own it) can later grant access to other users of the system. diff --git a/issues/quality-control/move-uploader-to-tux02.gmi b/issues/gn-uploader/move-uploader-to-tux02.gmi index 4459433..20c5b24 100644 --- a/issues/quality-control/move-uploader-to-tux02.gmi +++ b/issues/gn-uploader/move-uploader-to-tux02.gmi @@ -5,7 +5,7 @@ * type: migration * assigned: fredm * priority: high -* status: open +* status: closed, completed, fixed * keywords: gn-uploader, guix, container, deploy ## Databases @@ -17,13 +17,13 @@ This implies separate configurations, and separate startup. Some of the things to do to enable this, then, are: -- [x] Provide separate configs and run db server on separate port +* [x] Provide separate configs and run db server on separate port - Configs put in /etc/mysql3307 - Selected port 3307 - datadir in /var/lib/mysql3307 -> /export5 -- [x] Provide separate data directory for the content +* [x] Provide separate data directory for the content - extract backup -- [x] Maybe suffix the files with the port number, e.g. +* [x] Maybe suffix the files with the port number, e.g. 
``` datadir = /var/lib/mysql3307 socket = /var/run/mysqld/mysqld3307.sock diff --git a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi index 1841d36..af3b274 100644 --- a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi +++ b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi @@ -4,7 +4,7 @@ * type: bug * assigned: fredm -* status: open +* status: closed * priority: high * keywords: gn-uploader, uploader, ProbeSet @@ -20,3 +20,10 @@ applicable to our data, I don't think. ``` It seems like some of the data does not require a ProbeSet, and in that case, it should be possible to add it without one. + + +## Notes + +This "bug" is obsoleted by the fact that the implementation leading to it was entirely wrong. + +The feature that was leading to this bug no longer exists, and will have to be re-implemented from scratch with the involvement of @acenteno. diff --git a/issues/gn-uploader/provide-page-for-uploaded-data.gmi b/issues/gn-uploader/provide-page-for-uploaded-data.gmi new file mode 100644 index 0000000..5ab7f80 --- /dev/null +++ b/issues/gn-uploader/provide-page-for-uploaded-data.gmi @@ -0,0 +1,27 @@ +# Provide Page/Link for/to Uploaded Data + +## Tags + +* status: closed, completed +* assigned: fredm +* priority: medium +* type: feature, feature request, feature-request +* keywords: gn-uploader, uploader, data dashboard + +## Description + +Once a user has uploaded their data, provide them with a landing page/dashboard for the data they have uploaded, with details on what that data is. + +* Should we provide a means to edit the data here (mostly to add metadata and the like)? +* Maybe the page should actually be shown on GN2? + +## Blockers + +Depends on + +=> /issues/gn-uploader/link-authentication-authorisation + + +## Close as complete + +Current uploader directs user to a view of the data they uploader on GN2. This is complete. 
diff --git a/issues/gn-uploader/replace-redis-with-sqlite3.gmi b/issues/gn-uploader/replace-redis-with-sqlite3.gmi new file mode 100644 index 0000000..d3f94f0 --- /dev/null +++ b/issues/gn-uploader/replace-redis-with-sqlite3.gmi @@ -0,0 +1,29 @@ +# Replace Redis with SQL + +## Tags + +* status: open +* priority: low +* assigned: fredm +* type: feature, feature-request, feature request +* keywords: gn-uploader, uploader, redis, sqlite, sqlite3 + +## Description + +We currently (as of 2024-06-27) use Redis for tracking any asynchronous jobs (e.g. QC on uploaded files). + +A lot of what we use redis for, we can do in one of the many SQL databases (we'll probably use SQLite3 anyway), which are more standardised, and easier to migrate data from and to. It has the added advantage that we can open multiple connections to the database, enabling the different processes to update the status and metadata of the same job consistently. + +Changes done here can then be migrated to the other systems, i.e. GN2, GN3, and gn-auth, as necessary. + +### 2025-12-31: Progress Update + +Initial basic implementation can be found in: + +=> https://git.genenetwork.org/gn-libs/tree/gn_libs/jobs +=> https://git.genenetwork.org/gn-uploader/commit/?id=774a0af9db439f50421a47249c57e5a0a6932301 +=> https://git.genenetwork.org/gn-uploader/commit/?id=589ab74731aed62b1e1b3901d25a95fc73614f57 + +and others. + +More work needs to be done to clean-up some minor annoyances. 
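As a rough illustration of tracking job status in SQLite3 rather than Redis (this is not the gn-libs implementation linked above; the table layout and function names are hypothetical):

```python
import sqlite3
import uuid


def init_db(conn):
    """Create the (hypothetical) jobs table if it does not exist yet."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs("
            "job_id TEXT PRIMARY KEY, command TEXT NOT NULL, status TEXT NOT NULL)")


def create_job(conn, command):
    """Record a new job in the 'pending' state and return its id."""
    job_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO jobs(job_id, command, status) VALUES (?, ?, ?)",
            (job_id, command, "pending"))
    return job_id


def set_status(conn, job_id, status):
    """Update the status of a job; any process with a connection can do this."""
    with conn:
        conn.execute(
            "UPDATE jobs SET status = ? WHERE job_id = ?", (status, job_id))


def get_status(conn, job_id):
    """Return the job's status, or None if the job is unknown."""
    row = conn.execute(
        "SELECT status FROM jobs WHERE job_id = ?", (job_id,)).fetchone()
    return row[0] if row else None
```

The web process and the background worker would each open their own connection to the same database file, which is the multi-connection advantage mentioned in the description.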
diff --git a/issues/gn-uploader/resume-upload.gmi b/issues/gn-uploader/resume-upload.gmi new file mode 100644 index 0000000..0f9ba30 --- /dev/null +++ b/issues/gn-uploader/resume-upload.gmi @@ -0,0 +1,41 @@ +# gn-uploader: Resume Upload + +## Tags + +* status: closed, completed, fixed +* priority: medium +* assigned: fredm, flisso +* type: feature request, feature-request +* keywords: gn-uploader, uploader, upload, resume upload + +## Description + +If a user is uploading a particularly large file, we might need to provide a way for the user to resume their upload of the file. + +Maybe this can wait until we have +=> /issues/gn-uploader/link-authentication-authorisation linked authentication/authorisation to gn-uploader. +In this way, each upload can be linked to a specific user. + +### TODOs + +* [x] Build UI to allow uploads +* [x] Build back-end to handle uploads +* [x] Handle upload failures/errors +* [x] Deploy to staging + +### Updates + +=> https://git.genenetwork.org/gn-uploader/commit/?id=9a8dddab072748a70d43416ac8e6db69ad6fb0cb +=> https://git.genenetwork.org/gn-uploader/commit/?id=df9da3d5b5e4382976ede1b54eb1aeb04c4c45e5 +=> https://git.genenetwork.org/gn-uploader/commit/?id=47c2ea64682064d7cb609e5459d7bd2e49efa17e +=> https://git.genenetwork.org/gn-uploader/commit/?id=a68fe177ae41f2e58a64b3f8dcf3f825d004eeca + +### Possible Resources + +=> https://javascript.info/resume-upload +=> https://github.com/23/resumable.js/ +=> https://www.dropzone.dev/ +=> https://stackoverflow.com/questions/69339582/what-hash-python-3-hashlib-yields-a-portable-hash-of-file-contents + + +This is mostly fixed. Any arising bugs can be tracked in separate issues.
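For readers unfamiliar with the general technique, a resumable upload usually combines two pieces: sending only the part of the file the server has not yet received, and checking a hash of the whole file at the end. The sketch below illustrates both with Python's standard library. It is not the gn-uploader code; `CHUNK_SIZE`, `file_digest` and `remaining_chunks` are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size of 1 MiB


def file_digest(path, chunk_size=CHUNK_SIZE):
    """Hash a file chunk by chunk, so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as infile:
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remaining_chunks(path, bytes_received, chunk_size=CHUNK_SIZE):
    """Yield (offset, data) pairs for the part of the file not yet received."""
    with open(path, "rb") as infile:
        infile.seek(bytes_received)  # skip what the server already has
        offset = bytes_received
        while True:
            data = infile.read(chunk_size)
            if not data:
                break
            yield offset, data
            offset += len(data)
```

A client would ask the server how many bytes it already holds, send the pairs produced by `remaining_chunks`, and finally compare `file_digest` of the local file with the digest computed on the server.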
diff --git a/issues/gn-uploader/speed-up-rqtl2-qc.gmi b/issues/gn-uploader/speed-up-rqtl2-qc.gmi new file mode 100644 index 0000000..43e6d49 --- /dev/null +++ b/issues/gn-uploader/speed-up-rqtl2-qc.gmi @@ -0,0 +1,30 @@ +# Speed Up QC on R/qtl2 Bundles + +## Tags + +## Description + +The default format for the CSV files in a R/qtl2 bundle is: + +``` +matrix of individuals × (markers/phenotypes/covariates/phenotype covariates/etc.) +``` + +(A) (f/F)ile(s) in the R/qtl2 bundle could however +=> https://kbroman.org/qtl2/assets/vignettes/input_files.html#csv-files be transposed, +which means the system needs to "un-transpose" the file(s) before processing. + +Currently, the system does this by reading all the files of a particular type, and then "un-transposing" the entire thing. This leads to a very slow system. + +This issue proposes to do the quality control/assurance processing on each file in isolation, where possible - this will allow parallelisation/multiprocessing of the QC checks. + +The main considerations that need to be handled are as follows: + +* Do QC on (founder) genotype files (when present) before any of the other files +* Genetic and physical maps (if present) can have QC run on them after the genotype files +* Do QC on phenotype files (when present) after genotype files but before any other files +* Covariate and phenotype covariate files come after the phenotype files +* Cross information files … ? +* Sex information files … ? 
+ +We should probably detail the type of QC checks done for each type of file diff --git a/issues/gn-uploader/uploading-samples.gmi b/issues/gn-uploader/uploading-samples.gmi new file mode 100644 index 0000000..11842b9 --- /dev/null +++ b/issues/gn-uploader/uploading-samples.gmi @@ -0,0 +1,51 @@ +# Uploading Samples + +## Tags + +* status: open +* assigned: fredm +* interested: acenteno, zachs, flisso +* priority: high +* type: feature-request +* keywords: gn-uploader, uploader, samples, strains + +## Description + +This will track the various notes regarding the upload of samples onto GeneNetwork. + +### Sample Lists + +From the email thread(s) with @zachs, @flisso and @acenteno + +``` +When there's a new set of individuals, it generally needs to be added as a new group. In the absence of genotype data, a "dummy" .geno file currently needs to be generated* in order to define the sample list (if you look at the list of .geno files in genotype_files/genotype you'll find some really small files that just have either a single marker or a bunch of fake markers calls "Marker1, Marker2, etc" - these are solely just used to get the samplelist from the columns). So in theory such a file could be generated as a part of the upload process in the absence of genotypes +``` + +We note, however, that the as @zachs mentions + +``` +This is really goofy and should probably change. I've brought up the idea of just replacing these with JSON files containing group metadata (including samplelist), but we've never actually gone through with making any change to this. I already did something sorta similar to this with the existing JSON files (in genotype_files/genotype), but those are currently only used in situations where there are either multiple genotype files, or a genotype file only contains a subset of samples/strains from a group (so the JSON file tells mapping to only use those samples/strains). 
+``` + +We need to explore whether such a change might need updates to the GN2/GN3 code to ensure code that depends on these dummy files can also use the new format JSON files too. + +Regarding the order of the samples, from the email thread: + +``` +Regarding the order of samples, it can basically be whatever we decide it is. It just needs to stay consistent (like if there are multiple genotype files). It only really affects how it's displayed, and any other genotype files we use for mapping needs to share the same order. +``` + +The ordering of the samples has no bearing on the analysis of the data, i.e. it does not affect the results of computations. + + +### Curation + +``` +But any time new samples are involved, there probably needs to be some explicit confirmation by a curator like Rob (since we want to avoid a situation where a sample/strain just has a typo or somethin and we treat it like a new sample/strain). +``` + +also + +``` +When there's a mix of existing individuals, I think it's usually the case that it's the same group (that is being expanded with new individuals), but anything that involves adding new samples should probably involve some sort of direct/explicit confirmation from a curator like Rob or something. +``` diff --git a/issues/gn-volt-genofiles-parsing-integration.gmi b/issues/gn-volt-genofiles-parsing-integration.gmi index 8d3d149..e1b0162 100644 --- a/issues/gn-volt-genofiles-parsing-integration.gmi +++ b/issues/gn-volt-genofiles-parsing-integration.gmi @@ -5,7 +5,7 @@ * assigned: alexm, * type: improvement * priority: high -* status: in progress +* status: stalled, closed. 
## Notes diff --git a/issues/gnqa/GNQA-for-evaluation.gmi b/issues/gnqa/GNQA-for-evaluation.gmi index 9f4a861..0b2e352 100644 --- a/issues/gnqa/GNQA-for-evaluation.gmi +++ b/issues/gnqa/GNQA-for-evaluation.gmi @@ -5,7 +5,7 @@ * Assigned: alexm, shelbys * Keywords: UI, GNQA, evaluation * Type: immediate -* Status: In Progress +* Status: completed ## Description @@ -13,5 +13,5 @@ We need to publish a paper on GeneNetwork Question & Answering system. To that e ## Tasks -* [ ] Add a thumbs up and down for rating the answer to a question -* [ ] Ensure to log the questions, respones, and ratings of each questions +* [X] Add a thumbs up and down for rating the answer to a question +* [X] Ensure to log the questions, respones, and ratings of each questions diff --git a/issues/gnqna/rating-system-has-no-indication-for-login-requirement.gmi b/issues/gnqa/Login_no-indicator-for-req.gmi index 7ed713a..7ed713a 100644 --- a/issues/gnqna/rating-system-has-no-indication-for-login-requirement.gmi +++ b/issues/gnqa/Login_no-indicator-for-req.gmi diff --git a/issues/fetch-pubmed-references-to-gnqa.gmi b/issues/gnqa/fetch-pubmed-references-to-gnqa.gmi index 63351d1..43c45cf 100644 --- a/issues/fetch-pubmed-references-to-gnqa.gmi +++ b/issues/gnqa/fetch-pubmed-references-to-gnqa.gmi @@ -5,7 +5,7 @@ * assigned: alexm * keywords: llm, pubmed, api, references * type: enhancements -* status: in progress +* status: completed, closed ## Description @@ -18,13 +18,13 @@ The task is to integrate PubMed references into the GNQA system by querying the * [x] Query the API with the publication titles. -* [] Display the PubMed information as reference information on the GN2 user interface. +* [x] Display the PubMed information as reference information on the GN2 user interface. -* [] dump the results to a DB e.g sqlite,lmdb +* [x] dump the results to a DB e.g sqlite,lmdb * [x] If references are not found, perform a lossy search or list the closest three papers. 
-* [] reimplement the reference ui to render the references as modal objects +* [x] reimplement the reference ui to render the references as modal objects For lossy search, see: diff --git a/issues/gn_llm_db_cache_integration.gmi b/issues/gnqa/gn_llm_db_cache_integration.gmi index 86f7c80..86f7c80 100644 --- a/issues/gn_llm_db_cache_integration.gmi +++ b/issues/gnqa/gn_llm_db_cache_integration.gmi diff --git a/issues/gnqa/gn_llm_integration_using_cached_searches.gmi b/issues/gnqa/gn_llm_integration_using_cached_searches.gmi new file mode 100644 index 0000000..e20b5a3 --- /dev/null +++ b/issues/gnqa/gn_llm_integration_using_cached_searches.gmi @@ -0,0 +1,43 @@ +# GN2 Integration with LLM search using cached results + +## Tags + +* assigned: jnduli, alexm, bmunyoki +* keywords: llm, genenetwork2 +* type: enhancement +* status: open + +## Description + +We'd like to include LLM searches integrated into our GN searches, when someone attempts a Xapian search e.g. when I search for `wiki:rif group:mouse nicotine`, we'd do a corresponding search for `rif mouse nicotine` on LLMs, and show the results on the main page. + +Another example: + +xapian search: rif:glioma species:human group:gtex_v8 +llm search: glioma human gtex_v8 + + +This can be phased out into + +* [ ] 1. UI integration, where we modify the search page to include a dummy content box +* [ ] 2. LLM search integration, where we perform a search and modify UI to show the results. This can either be async (i.e. the search results page waits for the LLM search results) or sync (i.e. we load the search results page after we've got the LLM results) +* [x] 2.1 create a copy branch for the gnqa-api branch +* [x] 2.2 create a PR containing all the branches +* [ ] 2.3 how much would it take to get the qa_*** branch merged into main?? +* [ ] 3. Cache design and integration: we already have some + +cache using redis (gn search history), so we may use this for the moment. 
+ + +Let's use flag: `LLM_SEARCH_ENABLED` to enable/disable this feature during development to make sure we don't release this before it's ready. + + +## Notes + +The branch for merging to gn2: + +https://github.com/genenetwork/genenetwork2/pull/863 + +The branch for merging to gn3: + +https://github.com/genenetwork/genenetwork3/pull/188 \ No newline at end of file diff --git a/issues/gnqa/gnqa_integration_to_global_search_Design.gmi b/issues/gnqa/gnqa_integration_to_global_search_Design.gmi new file mode 100644 index 0000000..0d5afd0 --- /dev/null +++ b/issues/gnqa/gnqa_integration_to_global_search_Design.gmi @@ -0,0 +1,74 @@ +# GNQA Integration to Global Search Design Proposal + +## Tags +* assigned: jnduli, alexm +* keywords: llm, genenetwork2 +* type: feature +* status: complete, closed, done + +## Description +This document outlines the design proposal for integrating GNQA into the Global Search feature. + +## High-Level Design + +### UI Design +When the GN2 Global Search page loads: +1. A request is initiated via HTMX to the GNQA search page with the search query. +2. Based on the results, a page or subsection is rendered, displaying the query and the answer, and providing links to references. + +For more details on the UI design, refer to the pull request: +=> https://github.com/genenetwork/genenetwork2/pull/862 + +### Backend Design +The API handles requests to the Fahamu API and manages result caching. Once a request to the Fahamu API is successful, the results are cached using SQLite for future queries. Additionally, a separate API is provided to query cached results. + +## Deep Dive + +### Caching Implementation +For caching, we will use SQLite3 since it is already implemented for search history. Based on our study, this approach will require minimal space: + +*Statistical Estimation:* +We calculated that this caching solution would require approximately 79MB annually for an estimated 20 users, each querying the system 5 times a day. 
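Working backwards from the figures above gives a rough sense of the per-request size they imply. The snippet below is plain arithmetic on the stated numbers and nothing more; it assumes a 365-day year and decimal megabytes (1 MB = 1,000,000 bytes), neither of which is stated in the proposal, and the resulting per-request figure is derived here rather than taken from the original study.

```python
USERS = 20
REQUESTS_PER_USER_PER_DAY = 5
DAYS_PER_YEAR = 365                  # assumption
ANNUAL_CACHE_BYTES = 79 * 1_000_000  # assumption: 1 MB = 1,000,000 bytes

# Total number of cached requests in a year under the stated usage.
requests_per_year = USERS * REQUESTS_PER_USER_PER_DAY * DAYS_PER_YEAR

# Average stored size per request implied by the 79MB annual estimate.
implied_bytes_per_request = ANNUAL_CACHE_BYTES / requests_per_year

print(requests_per_year)                 # 36500
print(round(implied_bytes_per_request))  # 2164
```

Under these assumptions the estimate corresponds to roughly 2 KB stored per cached request.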
+ +Why average request size per user and how we determined this? +The average request size was an upper bound calculation for documents returned from the Fahamu API. + +why we're assuming 20 users making 5 requests per day? + +We’re assuming 20 users making 5 requests per day to estimate typical usage of GN2 services +### Error Handling +* Handle cases where users are not logged in, as GNQA requires authentication. +* Handle scenarios where there is no response from Fahamu. +* Handle general errors. + +### Passing Questions to Fahamu +We can choose to either pass the entire query from the user to Fahamu or parse the query to search for keywords. + +### Generating Possible Questions +It is possible to generate potential questions based on the user's search and render those to Fahamu. Fahamu would then return possible related queries. + +## Related Issues +=> https://issues.genenetwork.org/issues/gn_llm_integration_using_cached_searches + +## Tasks + +* [x] Initiate a background task from HTMX to Fahamu once the search page loads. +* [x] Query Fahamu for data. +* [x] Cache results from Fahamu. +* [x] Render the UI page with the query and answer. +* [x] For "See more," render the entire GNQA page with the query, answer, references, and PubMed data. +* [x] Implement parsing for Xapian queries to normal queries. +* [x] Implement error handling. +* [x] reimplement how gnqa uses GN-AUTH in gn3. +* [x] Query Fahamu to generate possible questions based on certain keywords. + + +## Notes +From the latest Fahamu API docs, they have implemented a way to include subquestions by setting `amplify=True` for the POST request. We also have our own implementation for parsing text to extract questions. 
+ +## PRs Merged Related to This + +=> https://github.com/genenetwork/genenetwork2/pull/868 +=> https://github.com/genenetwork/genenetwork2/pull/862 +=> https://github.com/genenetwork/genenetwork2/pull/867 +=> https://github.com/genenetwork/genenetwork3/pull/191 \ No newline at end of file diff --git a/issues/implement-auth-to-gn-llm.gmi b/issues/gnqa/implement-auth-to-gn-llm.gmi index 496a7cb..2a5456b 100644 --- a/issues/implement-auth-to-gn-llm.gmi +++ b/issues/gnqa/implement-auth-to-gn-llm.gmi @@ -6,7 +6,7 @@ * keywords: llm, auth * type: feature * priority: high -* status: done, completed +* status: done, completed, closed ## Description diff --git a/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi b/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi new file mode 100644 index 0000000..5b0a1ff --- /dev/null +++ b/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi @@ -0,0 +1,20 @@ +# Implement No-Login Requirement for GNQA + +## Tags + +* type: feature +* status: completed, closed +* priority: medium +* assigned: alexm, +* keywords: gnqa, user experience, authentication, login, llm + +## Description +This feature will allow usage of LLM/GNQA features without requiring user authentication, while implementing measures to filter out bots + + +## Tasks + +* [x] If logged in: perform AI search with zero penalty +* [x] Add caching lifetime to save on token usage +* [x] Routes: check for referrer headers — if the previous search was not from the homepage, perform AI search +* [x] If global search returns more than *n* results (*n = number*), perform an AI search diff --git a/issues/implement-reference-rating-gn-llm.gmi b/issues/gnqa/implement-reference-rating-gn-llm.gmi index f646a6f..f646a6f 100644 --- a/issues/implement-reference-rating-gn-llm.gmi +++ b/issues/gnqa/implement-reference-rating-gn-llm.gmi diff --git a/issues/integrate_gn_llm_search.gmi b/issues/gnqa/integrate_gn_llm_search.gmi index 5dfd9da..5dfd9da 100644 --- 
a/issues/integrate_gn_llm_search.gmi +++ b/issues/gnqa/integrate_gn_llm_search.gmi diff --git a/issues/merge-gnqa-to-production.gmi b/issues/gnqa/merge-gnqa-to-production.gmi index 3d34bb1..6e5f119 100644 --- a/issues/merge-gnqa-to-production.gmi +++ b/issues/gnqa/merge-gnqa-to-production.gmi @@ -4,6 +4,7 @@ * assigned: alexm, * keywords: production, GNQA, integration +* status: closed, completed ## Description @@ -12,5 +13,5 @@ be pushed to production. We need to allow only logged-in users to access the ser ## Tasks -* [] Integrate GN-auth for the service -* [] Push production to the current commit \ No newline at end of file +* [x] Integrate GN-auth for the service +* [x] Push production to the current commit \ No newline at end of file diff --git a/issues/refactor-gn-llm-code.gmi b/issues/gnqa/refactor-gn-llm-code.gmi index 6e33737..64c43c4 100644 --- a/issues/refactor-gn-llm-code.gmi +++ b/issues/gnqa/refactor-gn-llm-code.gmi @@ -5,7 +5,7 @@ * assigned:alexm,shelby * keywords:refactoring,llm,tests * type: enchancements -* status: in progress +* status: completed, closed ## Description diff --git a/issues/gnqna/query-bug-DatabaseError.gmi b/issues/gnqna/query-bug-DatabaseError.gmi new file mode 100644 index 0000000..b8c1cfc --- /dev/null +++ b/issues/gnqna/query-bug-DatabaseError.gmi @@ -0,0 +1,37 @@ +# Query Bug: DatabaseError + +## Tags + +* assigned: fredm, bonfacem +* priority: high +* status: open +* type: bug +* keywords: gnqna + +## Descriptions + +* Go to https://genenetwork.org/gnqna +* Type in a query +* Press "Enter" +* Observe the error "DatabaseError" with a status code of 500. + +Expected: Query returns a result. + + +## Troubleshooting: 2025-10-27 + +* GNQNA's deployment is not part of the gn-machine's definitions! 
+ +## Troubleshooting: 2025-12-31 + +If a user **IS NOT** logged in, the system responds with: + +``` +Search_Query: +Status_Code: 500 +Error/Reason: Login/Verification required to make this request +``` + +On the other hand, if a user is logged in, a query returns a result. + +We, therefore, probably need to notify the user that they need to be logged in to use this service. diff --git a/issues/guix-bioinformatics/guix-updates.gmi b/issues/guix-bioinformatics/guix-updates.gmi new file mode 100644 index 0000000..9c65fb9 --- /dev/null +++ b/issues/guix-bioinformatics/guix-updates.gmi @@ -0,0 +1,18 @@ +# Planned Guix Updates + +## Tags + +* status: open +* priority: medium +* type: enhancement +* assigned: fredm, bonfacem +* keywords: guix-bioinformatics, guix +* interested: pjotrp, aruni + +## Description + +The following outlines issues around the next upgrade: + +* Update pinned guix commit to the latest and see whether inferior profiles for the laminar user are properly created. +* Rust packages (new package build system) we need to think about. + diff --git a/issues/guix-bioinformatics/pin-channels-commits.gmi b/issues/guix-bioinformatics/pin-channels-commits.gmi new file mode 100644 index 0000000..216dd24 --- /dev/null +++ b/issues/guix-bioinformatics/pin-channels-commits.gmi @@ -0,0 +1,39 @@ +# Pin Channel Commits; Decouple from Guix + +## Tags + +* status: closed +* priority: medium +* type: enhancement +* assigned: fredm, bonfacem, aruni +* keywords: guix-bioinformatics, guix +* interested: pjotrp, aruni + +## Description + +Changes in upstream Guix often lead to deployment issues, due to breakages caused by changes in how GNU Guix does things. This interrupts our day-to-day operations, leading us to scramble to fix the breakages and make the builds sane again. + +In order to avoid these breakages in the future, we'll need to actually pin the commit(s) for all the channels we depend on, to avoid surprises down the line. 
+ +### Channel Dependencies + +We depend on the following channels in guix-bioinformatics: + +* guix: Mainline Guix channel +* guix-past: Channel for old packages, no longer maintained on guix mainline +* guix-rust-past-crates: Channel for rust packages using the old packaging form +* guix-forge: Manages building containers and whatnot. The dependence is implicit here, but it is one of the main causes of breakages + +### Tasks + +* [x] Pin guix channel +* [x] Pin guix-past +* [x] Pin guix-rust-past-crates channel +* [x] Pin guix-forge channel +* [ ] Move packages from (gn packages bioinformatics) to upstream (gnu packages bioinformatics) + +### Solution + +To allow guix-bioinformatics to continue improving, while preventing random breakages, we stopped depending on guix-bioinformatics directly, rather, we changed our main channel to gn-machines, and there, we pinned the version of guix-bioinformatics we depend on. + +This allows us to continue updating our packages while keeping the channel dependencies relatively stable. diff --git a/issues/guix-ci-tests.gmi b/issues/guix-ci-tests.gmi new file mode 100644 index 0000000..ce56705 --- /dev/null +++ b/issues/guix-ci-tests.gmi @@ -0,0 +1,47 @@ +# Guix CI failure: guix-past build breaks due to missing (libchop) + +# Tags + +* assigned: bonfacem +* type: bug, infrastructure +* priority: high + +# Notes + +After fixing a permissions issue in the Laminar CI environment (/var/guix/profiles/per-user/laminar): + +``` +[laminar] Executing cfg/jobs/gn-libs.run Backtrace: 9 (primitive-load "/var/lib/laminar/cfg/jobs/gn-libs.run") In ice-9/boot-9.scm: 152:2 8 (with-fluid* _ _ _) In ice-9/eval.scm: 202:51 7 (_ #(#(#<directory (guile-user) 7fce0bc71c80> #<pro?> ?))) 293:34 6 (_ #(#(#<directory (guile-user) 7fce0bc71c80> #<pro?> ?))) In guix/inferior.scm: 1006:4 5 (inferior-for-channels _ #:cache-directory _ #:ttl _) In ice-9/boot-9.scm: 1752:10 4 (with-exception-handler _ _ #:unwind? 
_ # _) In guix/store.scm: 690:37 3 (thunk) 1331:8 2 (call-with-build-handler #<procedure 7fce00e9f0c0 at g?> ?) In guix/inferior.scm: 951:2 1 (cached-channel-instance #<store-connection 256.100 7f?> ?) In ice-9/boot-9.scm: 1685:16 0 (raise-exception _ #:continuable? _) ice-9/boot-9.scm:1685:16: In procedure raise-exception: In procedure mkdir: Permission denied: "/var/guix/profiles/per-user/laminar" +``` + +... by (inside the container) running: + +``` +mkdir -p /var/guix/profiles/per-user/laminar +chown -R laminar:laminar /var/guix/profiles/per-user/laminar +``` + +... the CI progressed further but now fails when attempting to build guix-past. The failure is caused by an unbound variable error for the module (libchop), indicating a mismatch or missing dependency in the pinned Guix channels. + +Error Log: + +``` +(exception unbound-variable (value #f) + (value "Unbound variable: ~S") + (value (libchop)) (value #f)) + +builder for /gnu/store/gx57wj08yv0x0g1r8rbnwcp2fc58lqvx-guix-past.drv +failed to produce output path +/gnu/store/n3q0sgqwm9mwvna5215npwmdfigfyr9f-guix-past + +cannot build derivation +/gnu/store/3fwagz1p9vv3h020lwb2ab52f6wj6z1g-profile.drv: +1 dependencies couldn't be built +``` + +# Resolution + +* Inside genenetwork-development.scm, manually create `/var/guix/profiles/per-user/laminar` if it doesn't exist. +* Update the relevant .guix-channel file to match channels in guix-bioinformatics. + +* closed diff --git a/issues/implement-gn-markdown-editor.gmi b/issues/implement-gn-markdown-editor.gmi index 7d7d08f..a0d386b 100644 --- a/issues/implement-gn-markdown-editor.gmi +++ b/issues/implement-gn-markdown-editor.gmi @@ -13,7 +13,7 @@ Example of similar implementation * assigned: alexm * type: enhancement -* status: IN PROGRESS +* status: done, completed. 
* keywords: markdown,editor @@ -23,7 +23,7 @@ Example of similar implementation * [x] add live preview for page markdown on edit -* [] authentication(WIP) +* [x] authentication * [x] commit changes to github repo diff --git a/issues/implement_xapian_to_text_transformer.gmi b/issues/implement_xapian_to_text_transformer.gmi new file mode 100644 index 0000000..192491a --- /dev/null +++ b/issues/implement_xapian_to_text_transformer.gmi @@ -0,0 +1,15 @@ +# Xapian to Text Transformer + +## Tags +* assigned: alexm, jnduli +* keywords: llm, genenetwork2, xapian, transform +* type: feature +* status: closed, completed + +## Description: + +Given a Xapian search query, e.g., "CYTOCHROME AND P450" or "CYTOCHROME NEAR P450," we need to convert the text to a format with no Xapian keywords. In this case, the transformed text would be "CYTOCHROME P450." + + +This issue is a part of the main issue below. +=> https://issues.genenetwork.org/issues/gn_llm_integration_using_cached_searches diff --git a/issues/inspect-discrepancies-between-xapian-and-sql-search.gmi b/issues/inspect-discrepancies-between-xapian-and-sql-search.gmi new file mode 100644 index 0000000..98b46b6 --- /dev/null +++ b/issues/inspect-discrepancies-between-xapian-and-sql-search.gmi @@ -0,0 +1,135 @@ +# Inspect Discrepancies Between Xapian and SQL Search. + +* assigned: bonfacem, rookie101 + +## Description + +When doing a Xapian search, we miss some data that is available from the SQL Search. 
The searches we tested: + +=> https://cd.genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=WIKI%3Dglioma&search_terms_and=&accession_id=None&FormID=searchResulto SQL search for dataset=HC_M2_0606_P species=mouse group=BXD WIKI=glioma (31 results) + +=> https://cd.genenetwork.org/gsearch?type=gene&terms=species%3Amouse+group%3Abxd+dataset%3Ahc_m2_0606_p+wiki%3Aglioma species:mouse group:bxd dataset:hc_m2_0606_p wiki:glioma (26 results) + +We miss the following entries from the Xapian search: + +``` +15 1423803_s_at Gltscr2 glioma tumor suppressor candidate region gene 2 +16 1451121_a_at Gltscr2 glioma tumor suppressor candidate region 2; exons 8 and 9 +17 1452409_at Gltscr2 glioma tumor suppressor candidate region gene 2 +25 1416556_at Sas sarcoma amplified sequence +26 1430029_a_at Sas sarcoma amplified sequence +``` + +We want to figure out why there is a discrepancy between the 2 searches above. + +## Resolution + +Use "quest" to search for one of the symbols that don't appear in the Xapian search to get the exact document id: + +``` +quest --msize=2 -s en --boolean-prefix="iden:Qgene:" "iden:"1423803_s_at:hc_m2_0606_p"" \ +--db=/export/data/genenetwork-xapian/ + +Parsed Query: Query(0 * Qgene:1423803_s_at:hc_m2_0606_p) +Exactly 1 matches +MSet: +9665867: [0] +{ + "name": "1423803_s_at", + "symbol": "Gltscr2", + "description": "glioma tumor suppressor candidate region gene 2", + "chr": "1", + "mb": 4.687986, + "dataset": "HC_M2_0606_P", + "dataset_fullname": "Hippocampus Consortium M430v2 (Jun06) PDNN", + "species": "mouse", + "group": "BXD", + "tissue": "Hippocampus mRNA", + "mean": 11.749030303030299, + "lrs": 11.3847971289981, + "additive": -0.0650828877005346, + "geno_chr": "5", + "geno_mb": 137.010795 +} +``` + +From the retrieved document-id, use "xapian-delve" to inspect the terms inside the index: + +``` +xapian-delve -r 9665867 -d /export/data/genenetwork-xapian/ + +Data for record #9665867: +{ + 
"name": "1423803_s_at", + "symbol": "Gltscr2", + "description": "glioma tumor suppressor candidate region gene 2", + "chr": "1", + "mb": 4.687986, + "dataset": "HC_M2_0606_P", + "dataset_fullname": "Hippocampus Consortium M430v2 (Jun06) PDNN", + "species": "mouse", + "group": "BXD", + "tissue": "Hippocampus mRNA", + "mean": 11.749030303030299, + "lrs": 11.3847971289981, + "additive": -0.0650828877005346, + "geno_chr": "5", + "geno_mb": 137.010795 +} +Term List for record #9665867: 1423803_s_at 2 5330430h08rik +9430097c02rik Qgene:1423803_s_at:hc_m2_0606_p +XC1 XDShc_m2_0606_p XGbxd XIhippocampus XImrna XPC5 +XSmouse XTgene XYgltscr2 ZXDShc_m2_0606_p ZXGbxd +ZXIhippocampus ZXImrna ZXSmous ZXYgltscr2 Zbc017637 +Zbxd Zcandid Zgene Zglioma Zgltscr2 Zhc_m2_0606_p +Zhippocampus Zmous Zmrna Zregion Zsuppressor Ztumor +bc017637 bxd candidate gene glioma gltscr2 +hc_m2_0606_p hippocampus mouse mrna +region suppressor tumor +``` + +We have no wiki (XWK) entries from the above. When transforming to TTL files from SQL, we have symbols that exist in the GeneRIF table that do not exist in the GeneRIF_BASIC table: + +``` +SELECT COUNT(symbol) FROM GeneRIF WHERE +symbol NOT IN (SELECT symbol FROM GeneRIF_BASIC) +GROUP BY BINARY symbol; +``` + +Consequently, this means that after transforming to TTL files, we have some missing RDF entries that map a symbol (subject) to it's real name (object). When building the RDF cache, we thereby have some missing RIF/WIKI entries, and some entries are not indexed. This patch fixes the aforementioned error with missing symbols: + +=> https://git.genenetwork.org/gn-transform-databases/commit/?id=d95501bd2bd41ef8cf3584118382e83cbbbe0c87 [gn-transform-databases] Add missing RIF symbols. 
+ +Now these 2 queries return the same exact results: + +=> https://cd.genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=WIKI%3Dglioma&search_terms_and=&accession_id=None&FormID=searchResulto SQL search for dataset=HC_M2_0606_P species=mouse group=BXD WIKI=glioma (31 results) + +=> https://cd.genenetwork.org/gsearch?type=gene&terms=species%3Amouse+group%3Abxd+dataset%3Ahc_m2_0606_p+wiki%3Aglioma species:mouse group:bxd dataset:hc_m2_0606_p wiki:glioma (31 results) + +However, Xapian search is case insensitive while the SQL search is case sensitive: + +=> https://cd.genenetwork.org/gsearch?type=gene&terms=species%3Amouse+group%3Abxd+dataset%3Ahc_m2_0606_p+wiki%3Acancer species:mouse group:bxd dataset:hc_m2_0606_p wiki:cancer (72 results) + +=> https://cd.genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=WIKI%3Dcancer&search_terms_and=&accession_id=None&FormID=searchResulto SQL search for dataset=HC_M2_0606_P species=mouse group=BXD WIKI=cancer (70 results) + +=> https://cd.genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=WIKI%3DCancer&search_terms_and=&accession_id=None&FormID=searchResulto SQL search for dataset=HC_M2_0606_P species=mouse group=BXD WIKI=Cancer (Note the change in the case "Cancer": 13 results) + +Another reason for discrepancies between search results, E.g. + +=> https://cd.genenetwork.org/gsearch?type=gene&terms=species%3Amouse+group%3Abxd+dataset%3Ahc_m2_0606_p+wiki%3Adiabetes species:mouse group:bxd dataset:hc_m2_0606_p wiki:diabetes (59 results) + +=> https://cd.genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=WIKI%3Ddiabetes&search_terms_and=&accession_id=None&FormID=searchResulto SQL search for dataset=HC_M2_0606_P species=mouse group=BXD WIKI=diabetes (52 results) + +is that Xapian performs stemming on the search terms. 
For example, in the above wiki search for "diabetes", Xapian will stem "diabetes" to "diabet" thereby matching "diabetic", "diabetes", or any other word variation of "diabetes." + +## Ordering of Results + +The ordering in the Xapian search and SQL search is different. By default, SQL orders by Symbol where we have: + +``` +[...] ORDER BY ProbeSet.symbol ASC +``` + +However, Xapian orders search results by decreasing relevance score. This is configurable. + +* closed diff --git a/issues/inspect-discrepancies-between-xapian-and-sql-search2.gmi b/issues/inspect-discrepancies-between-xapian-and-sql-search2.gmi new file mode 100644 index 0000000..451d5c3 --- /dev/null +++ b/issues/inspect-discrepancies-between-xapian-and-sql-search2.gmi @@ -0,0 +1,11 @@ +# Inspect Discrepancies Between Xapian and SQL Search. + +* assigned: bonfacem, rookie101 + +## Description + +When we type BXD_21526 in xapian search we should find + +=> https://genenetwork.org/search?species=mouse&group=BXD&type=Phenotypes&dataset=BXDPublish&search_terms_or=BXD_21526&search_terms_and=&accession_id=None&FormID=searchResult + +This is not the case right now. diff --git a/issues/integrate-markdown-editor-to-gn2.gmi b/issues/integrate-markdown-editor-to-gn2.gmi index 98c170b..5904eac 100644 --- a/issues/integrate-markdown-editor-to-gn2.gmi +++ b/issues/integrate-markdown-editor-to-gn2.gmi @@ -1,3 +1,4 @@ + # GN Markdown Editor Integration ## Tags @@ -5,26 +6,168 @@ * assigned: alexm * status: in progress * priority: high +* tags: markdown, integration, guile ## Notes -This is a to-do list to integrate the GN Markdown editor into GN2. + +This is a to-do list to integrate the GN Markdown editor into GN2. 
To see the implementation, see: -=> https://github.com/Alexanderlacuna/geditor +=> https://git.genenetwork.org/gn-guile/ ## Tasks -* [ ] Implement APIs to fetch file for edit -* [ ] Add verification for the repository -* [ ] Implement API to edit and commit changes -* [ ] Replace JS with HTMX -* [ ] Support external links and image rendering -* [ ] Package dependencies -* [ ] Handle errors +* [x] Implement APIs to fetch files for editing +* [x] Add verification for the repository +* [x] Implement API to edit and commit changes +* [x] Replace JS with HTMX +* [x] Support external links and image rendering +* [x] Package dependencies +* [x] Show diff for files +* [x] Handle errors * [ ] Review by users -* [ ] Integrate auth to the system. +* [x] Integrate authentication into the system + + +## API Documentation + +These API endpoints are implemented in Guile. See the repo: + +=> https://git.genenetwork.org/gn-guile/ + +The main endpoints are: `/edit` and `/commit` + +### Edit (GET) + +This is a `GET` request to retrieve file content. Make sure you pass a valid `file_path` as a query parameter (the path should be relative to the repository). 
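For callers that prefer Python over curl, the `GET` request described above could be sketched as follows. This is a minimal sketch, not the GN2 implementation: the helper names (`build_edit_url`, `parse_edit_response`, `fetch_file`) are hypothetical, the base URL is whatever host and port the gn-guile service listens on, and the response keys are the `path`, `content`, `hash` and `error`/`msg` fields documented for this endpoint. It also assumes that error bodies arrive with a status code that `urlopen` does not raise on; otherwise the caller would need to catch `urllib.error.HTTPError`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_edit_url(base_url: str, file_path: str) -> str:
    """Build the /edit URL with file_path passed as a query parameter."""
    query = urlencode({"file_path": file_path})
    return f"{base_url.rstrip('/')}/edit?{query}"


def parse_edit_response(body: str) -> dict:
    """Decode the JSON body; raise if it carries the documented error shape."""
    payload = json.loads(body)
    if "error" in payload:
        raise RuntimeError(f"{payload['error']}: {payload.get('msg', '')}")
    return payload


def fetch_file(base_url: str, file_path: str, timeout: float = 10.0) -> dict:
    """Hypothetical helper: fetch a file for editing and return the decoded payload."""
    with urlopen(build_edit_url(base_url, file_path), timeout=timeout) as response:
        return parse_edit_response(response.read().decode("utf-8"))
```

The `hash` value in a successful payload is the one a later `/commit` request would send back as `prev_commit`.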
+ +**Edit Request Example:** + +```bash + +curl -G -d "file_path=test.md" localhost:8091/edit +``` + +In case of a successful response, the expected result is: + + +```json +{ +"path": "<file_path>", +"content": "Test for new user\n test 2 for line\n test 3 for new line\n ## real markdown two test\n", +"hash": "<commit_sha>" +} +``` + +In case of an error, the expected response is: + +```json +{ +"error": "<error_type>", +"msg": "<error_reason>" +} +``` + +### Commit (POST) + +**Endpoint:** + +``` +localhost:8091/commit +``` + + +```bash + +curl -X POST http://127.0.0.1:8091/commit \ +-H 'Content-Type: application/json' \ +-d '{ +"content": "make test commit", +"filename": "test.md", +"email": "test@gmail.com", +"username": "test", +"commit_message": "init commit", +"prev_commit": "7cbfc40d98b49a64e98e7cd562f373053d0325bd" +}' + +``` -Related issues: +It expects the following data in JSON format: + +* `content` (the data you want to commit to the file, *valid markdown*) +* `prev_commit` (required for integrity) +* `filename` (file path to the file you are modifying) +* `username` (identifier for the user, in our case from auth) +* `email` (identifier email from the user, in our case from auth) +* `commit_message` + +If the request succeeds, the response should be: + +```json +{ +"status": "201", +"message": "Committed file successfully", +"content": "Test for new user\n test 2 for line\n test 3 for new line\n ## real markdown two test\n", +"commit_sha": "47df3b7f13a935d50cc8b40e98ca9e513cba104c", +"commit_message": "commit by genetics" +} +``` + +If there are no changes to the file: + +```json +{ +"status": "200", +"message": "Nothing to commit, working tree clean", +"commit_sha": "ecd96f27c45301279150fbda411544687db1aa45" +} +``` + +If the request fails, the expected results are: + +```json +{ +"error": "<error_type>", +"msg": "Commits do not match. 
Please pull in the latest changes for the current commit *ecd96f27c45301279150fbda411544687db1aa45* and previous commits." +} +``` + +## Related Issues => https://issues.genenetwork.org/issues/implement-gn-markdown-editor-in-guile -=> https://issues.genenetwork.org/issues/implement-gn-markdown-editor \ No newline at end of file +=> https://issues.genenetwork.org/issues/implement-gn-markdown-editor + +## Notes on Gn-Editor UI + +Here is the link to the PR for integrating the GN-Editor, including screenshots: + +=> https://github.com/genenetwork/genenetwork2/pull/854 + +Genenetwork2 consumes the endpoint for the GN-Editor. Authentication is required to prevent access by malicious users and bots. + +The main endpoint to fetch and edit a file is: + +``` +genenetwork.org/editor/edit?file-path=<relative file path> +``` + +This loads the editor with the content for editing. + +### Modifying Editor Settings + +You can modify editor settings, such as font size and keyboard bindings. To do this, navigate to: + +``` +genenetwork.org/editor/settings +``` + +Be sure to save your changes for them to take effect. + +### Showing Diff for Editor + +The editor also provides a diff functionality to show you the changes made to the file. Use the "Diff" button in the navigation to view these changes. + +### Committing Changes + +To commit your changes, use the "Commit" button. A commit message is required in the text area for the commit to be processed. + diff --git a/issues/mgamma/mgamma-design.gmi b/issues/mgamma/mgamma-design.gmi index 23e02d5..ed4c061 100644 --- a/issues/mgamma/mgamma-design.gmi +++ b/issues/mgamma/mgamma-design.gmi @@ -7,3 +7,31 @@ We have a lot of experience running and hacking the GEMMA tool in GeneNetwork.or GEMMA proves to give great GWA results and has a decent speed for a single threaded implementation - even though the matrix calls to openblas use multiple threads. The source code base of GEMMA, however, proves hard to build on. 
This is why we are creating a next generation tool that has a focus on *performance and hackability*. After several attempts using R, D, Julia, python, Ruby we have in 2023 settled on Guile+C+Zig. Guile provides a REPL and great hackability. C+Zig we'll use for performance. The other languages are all great, but we think we can work faster in this setup. + +Well, it is the end of 2024 and we have ditched that effort. Who said life was easy! The guile interface proved problematic - and Zig went out of favour because of its bootstrap story which prevents it becoming part of Guix, Debian etc. Also I discovered new tensor MPUs support f64 - so we may want to support vector and matrix computations on these cores. + +To write a GEMMA replacement I now favour chunking up the existing GEMMA and making sure its components can talk with alternative implementations. We may use a propagator network approach. Critical is to keep the data in RAM, so it may need some message passing interface with memory that can be shared. The chunking into CELLs (read propagator network PN) is a requirement because we kept tripping over state in GEMMA. So a PN should make sure we can run two implementations of the same CELL and compare outcomes for testing. Also it will allow us to test AVX, tensor and (say) MKL or CUDA implementations down the line. Also it should allow us to start using new functionality on GN faster. It would also be fun to have an implementation run on the RISC-V manycore. + +So, what do we want out of our languages: + +* Nice matrix interface (Julia) +* Support for AVX (Julia) +* Possibility to drop to low level C programming (Julia+prescheme+C?) +* High level -- PN -- glue (Julia+Guile?) + +Julia looks like a great candidate, even though it has notable downsides including the big 'server' blob deployment and the garbage collector (the latter also being a strength, mind). 
Alternatives could be Rust and Prescheme which have no such concerns, but lack the nice matrix notation. + +The approach will be to start with Julia and reimplement GEMMA functions so they can be called from Julia and/or guile. + +Oh, I just found out that Julia, like zig, is no longer up-to-date on Debian. And the Guix version is 2 years old. That is really bad. If these languages don't get supported on major distros it is a dead end! + +=> https://mastodon.social/@pjotrprins/113379842047170785 + +What to do now? + +* Nice matrix interface (?) +* Support for AVX (?) +* Possibility to drop to low level C programming (?+prescheme+C?) +* High level -- PN -- glue (?+Guile?) + +Current candidates for ? are Nim and Rust. Neither has a really nice matrix interface - though Nim's is probably what I prefer and it is close to python. Chicken may work too when I get fed up with the two languages mentioned. diff --git a/issues/mgamma/mgamma-lmm.gmi b/issues/mgamma/mgamma-lmm.gmi new file mode 100644 index 0000000..61481c2 --- /dev/null +++ b/issues/mgamma/mgamma-lmm.gmi @@ -0,0 +1,17 @@ +# MGAMMA LMM + +MGamma does GWAS, which means it has to do Linear Mixed Models—both univariate and multivariate. + +# Tags + +* assigned: pjotrp, artyom +* type: feature +* priority: high + +# Tasks + +* [X] Kinship matrix computation. +* [X] Univariate LMM. +* [ ] Multivariate LMM. +* [X] Export data from GEMMA. +* [ ] Compare and ensure data match between MGamma and GEMMA. 
\ No newline at end of file diff --git a/issues/move-racket-gn-rest-api-to-guile.gmi b/issues/move-racket-gn-rest-api-to-guile.gmi index 185e7de..659c586 100644 --- a/issues/move-racket-gn-rest-api-to-guile.gmi +++ b/issues/move-racket-gn-rest-api-to-guile.gmi @@ -6,7 +6,7 @@ * priority: medium * type: API, metadata * keywords: API -* status: open +* status: stalled ## Description diff --git a/issues/move-search-to-xapian.gmi b/issues/move-search-to-xapian.gmi index 57612e7..d98be9b 100644 --- a/issues/move-search-to-xapian.gmi +++ b/issues/move-search-to-xapian.gmi @@ -18,3 +18,5 @@ As a work around---to make search work with Python3.10, an inefficient hack was => https://github.com/genenetwork/genenetwork2/pull/805/commits/9a6ddf9f1560b3bc1611f50bf2b94f0dc44652a2 Replace escape with conn.escape_string To get rid of this inheritance, I propose rewriting the search functionality in a more straightforward and functional manner. In doing so, we can also transition to Xapian search, a faster and more efficient search system. + +* closed diff --git a/issues/old_session_bug.gmi b/issues/old_session_bug.gmi index 649ea46..925b9f6 100644 --- a/issues/old_session_bug.gmi +++ b/issues/old_session_bug.gmi @@ -2,7 +2,7 @@ ## Tags -* status: open +* status: closed * priority: medium * type: bug * assigned: zsloan, fredm diff --git a/issues/prevent-weak-passwords.gmi b/issues/prevent-weak-passwords.gmi index 8e8ca2f..957a170 100644 --- a/issues/prevent-weak-passwords.gmi +++ b/issues/prevent-weak-passwords.gmi @@ -19,3 +19,11 @@ There was a request made to prevent weak passwords. Use existing libraries to check and prevent weak passwords. + +## Notes + +### 2025-12-31: Look Into Libraries + +=> https://pypi.org/project/password-strength/ password-strength + +The library above seems promising. Unfortunately, we'd have to write a guix definition for it. 
diff --git a/issues/production-container-mechanical-rob-failure.gmi b/issues/production-container-mechanical-rob-failure.gmi new file mode 100644 index 0000000..ae6bae8 --- /dev/null +++ b/issues/production-container-mechanical-rob-failure.gmi @@ -0,0 +1,224 @@ +# Production Container: `mechanical-rob` Failure + +## Tags + +* status: closed, completed, fixed +* priority: high +* type: bug +* assigned: fredm +* keywords: genenetwork, production, mechanical-rob + +## Description + +After deploying the latest commits to https://gn2-fred.genenetwork.org on 2025-02-19UTC-0600, with the following commits: + +* genenetwork2: 2a3df8cfba6b29dddbe40910c69283a1afbc8e51 +* genenetwork3: 99fd5070a84f37f91993f329f9cc8dd82a4b9339 +* gn-auth: 073395ff331042a5c686a46fa124f9cc6e10dd2f +* gn-libs: 72a95f8ffa5401649f70978e863dd3f21900a611 + +I had the (not so) bright idea to run the `mechanical-rob` tests against it before pushing it to production, proper. Here's where I ran into problems: some of the `mechanical-rob` tests failed, specifically, the correlation tests. + +Meanwhile, a run of the same tests against https://cd.genenetwork.org with the same commits was successful: + +=> https://ci.genenetwork.org/jobs/genenetwork2-mechanical-rob/1531 See this. + +This points to a possible problem with the setup of the production container, that leads to failures where none should be. This needs investigation and fixing. + +### Update 2025-02-20 + +The MariaDB server is crashing. 
To reproduce: + +* Go to https://gn2-fred.genenetwork.org/show_trait?trait_id=1435464_at&dataset=HC_M2_0606_P +* Click on "Calculate Correlations" to expand +* Click "Compute" + +Observe that after a little while, the system fails with the following errors: + +* `MySQLdb.OperationalError: (2013, 'Lost connection to MySQL server during query')` +* `MySQLdb.OperationalError: (2006, 'MySQL server has gone away')` + +I attempted updating the configuration for MariaDB, setting the `max_allowed_packet` to 16M and then 64M, but that did not resolve the problem. + +The log files indicate the following: + +``` +2025-02-20 7:46:07 0 [Note] Recovering after a crash using /var/lib/mysql/gn0-binary-log +2025-02-20 7:46:07 0 [Note] Starting crash recovery... +2025-02-20 7:46:07 0 [Note] Crash recovery finished. +2025-02-20 7:46:07 0 [Note] Server socket created on IP: '0.0.0.0'. +2025-02-20 7:46:07 0 [Warning] 'user' entry 'webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:46:07 0 [Warning] 'db' entry 'db_webqtl webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:46:07 0 [Note] Reading of all Master_info entries succeeded +2025-02-20 7:46:07 0 [Note] Added new Master_info '' to hash table +2025-02-20 7:46:07 0 [Note] /usr/sbin/mariadbd: ready for connections. +Version: '10.5.23-MariaDB-0+deb11u1-log' socket: '/run/mysqld/mysqld.sock' port: 3306 Debian 11 +2025-02-20 7:46:07 4 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:46:07 5 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:46:07 0 [Note] InnoDB: Buffer pool(s) load completed at 250220 7:46:07 +250220 7:50:12 [ERROR] mysqld got signal 11 ; +Sorry, we probably made a mistake, and this is a bug. + +Your assistance in bug reporting will enable us to fix this for the next release. 
+To report this bug, see https://mariadb.com/kb/en/reporting-bugs + +We will try our best to scrape up some info that will hopefully help +diagnose the problem, but since we have already crashed, +something is definitely wrong and this may fail. + +Server version: 10.5.23-MariaDB-0+deb11u1-log source revision: 6cfd2ba397b0ca689d8ff1bdb9fc4a4dc516a5eb +key_buffer_size=10485760 +read_buffer_size=131072 +max_used_connections=1 +max_threads=2050 +thread_count=1 +It is possible that mysqld could use up to +key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4523497 K bytes of memory +Hope that's ok; if not, decrease some variables in the equation. + +Thread pointer: 0x7f599c000c58 +Attempting backtrace. You can use the following information to find out +where mysqld died. If you see no messages after this, something went +terribly wrong... +stack_bottom = 0x7f6150282d78 thread_stack 0x49000 +/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x55f43330c14e] +/usr/sbin/mariadbd(handle_fatal_signal+0x475)[0x55f432e013b5] +sigaction.c:0(__restore_rt)[0x7f615a1cb140] +/usr/sbin/mariadbd(+0xcbffbe)[0x55f43314efbe] +/usr/sbin/mariadbd(+0xd730ec)[0x55f4332020ec] +/usr/sbin/mariadbd(+0xd1b36b)[0x55f4331aa36b] +/usr/sbin/mariadbd(+0xd1cd8e)[0x55f4331abd8e] +/usr/sbin/mariadbd(+0xc596f3)[0x55f4330e86f3] +/usr/sbin/mariadbd(_ZN7handler18ha_index_next_sameEPhPKhj+0x2a5)[0x55f432e092b5] +/usr/sbin/mariadbd(+0x7b54d1)[0x55f432c444d1] +/usr/sbin/mariadbd(_Z10sub_selectP4JOINP13st_join_tableb+0x1f8)[0x55f432c37da8] +/usr/sbin/mariadbd(_ZN10JOIN_CACHE24generate_full_extensionsEPh+0x134)[0x55f432d24224] +/usr/sbin/mariadbd(_ZN10JOIN_CACHE21join_matching_recordsEb+0x206)[0x55f432d245d6] +/usr/sbin/mariadbd(_ZN10JOIN_CACHE12join_recordsEb+0x1cf)[0x55f432d23eff] +/usr/sbin/mariadbd(_Z16sub_select_cacheP4JOINP13st_join_tableb+0x8a)[0x55f432c382fa] +/usr/sbin/mariadbd(_ZN4JOIN10exec_innerEv+0xd16)[0x55f432c63826] +/usr/sbin/mariadbd(_ZN4JOIN4execEv+0x35)[0x55f432c63cc5] 
+/usr/sbin/mariadbd(_Z12mysql_selectP3THDP10TABLE_LISTR4ListI4ItemEPS4_jP8st_orderS9_S7_S9_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x106)[0x55f432c61c26] +/usr/sbin/mariadbd(_Z13handle_selectP3THDP3LEXP13select_resultm+0x138)[0x55f432c62698] +/usr/sbin/mariadbd(+0x762121)[0x55f432bf1121] +/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THD+0x3d6c)[0x55f432bfdd1c] +/usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x20b)[0x55f432bff17b] +/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0xdb5)[0x55f432c00f55] +/usr/sbin/mariadbd(_Z10do_commandP3THD+0x120)[0x55f432c02da0] +/usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x2f2)[0x55f432cf8b32] +/usr/sbin/mariadbd(handle_one_connection+0x5d)[0x55f432cf8dad] +/usr/sbin/mariadbd(+0xbb4ceb)[0x55f433043ceb] +nptl/pthread_create.c:478(start_thread)[0x7f615a1bfea7] +x86_64/clone.S:97(__GI___clone)[0x7f6159dc6acf] + +Trying to get some variables. +Some pointers may be invalid and cause the dump to abort. 
+Query (0x7f599c012c50): SELECT ProbeSet.Name,ProbeSet.Chr,ProbeSet.Mb, + ProbeSet.Symbol,ProbeSetXRef.mean, + CONCAT_WS('; ', ProbeSet.description, ProbeSet.Probe_Target_Description) AS description, + ProbeSetXRef.additive,ProbeSetXRef.LRS,Geno.Chr, Geno.Mb + FROM ProbeSet INNER JOIN ProbeSetXRef + ON ProbeSet.Id=ProbeSetXRef.ProbeSetId + INNER JOIN Geno + ON ProbeSetXRef.Locus = Geno.Name + INNER JOIN Species + ON Geno.SpeciesId = Species.Id + WHERE ProbeSet.Name in ('1447591_x_at', '1422809_at', '1428917_at', '1438096_a_at', '1416474_at', '1453271_at', '1441725_at', '1452952_at', '1456774_at', '1438413_at', '1431110_at', '1453723_x_at', '1424124_at', '1448706_at', '1448762_at', '1428332_at', '1438389_x_at', '1455508_at', '1455805_x_at', '1433276_at', '1454989_at', '1427467_a_at', '1447448_s_at', '1438695_at', '1456795_at', '1454874_at', '1455189_at', '1448631_a_at', '1422697_s_at', '1423717_at', '1439484_at', '1419123_a_at', '1435286_at', '1439886_at', '1436348_at', '1437475_at', '1447667_x_at', '1421046_a_at', '1448296_x_at', '1460577_at', 'AFFX-GapdhMur/M32599_M_at', '1424393_s_at', '1426190_at', '1434749_at', '1455706_at', '1448584_at', '1434093_at', '1434461_at', '1419401_at', '1433957_at', '1419453_at', '1416500_at', '1439436_x_at', '1451413_at', '1455696_a_at', '1457190_at', '1455521_at', '1434842_s_at', '1442525_at', '1452331_s_at', '1428862_at', '1436463_at', '1438535_at', 'AFFX-GapdhMur/M32599_3_at', '1424012_at', '1440027_at', '1435846_x_at', '1443282_at', '1435567_at', '1450112_a_at', '1428251_at', '1429063_s_at', '1433781_a_at', '1436698_x_at', '1436175_at', '1435668_at', '1424683_at', '1442743_at', '1416944_a_at', '1437511_x_at', '1451254_at', '1423083_at', '1440158_x_at', '1424324_at', '1426382_at', '1420142_s_at', '1434553_at', '1428772_at', '1424094_at', '1435900_at', '1455322_at', '1453283_at', '1428551_at', '1453078_at', '1444602_at', '1443836_x_at', '1435590_at', '1434283_at', '1435240_at', '1434659_at', '1427032_at', '1455278_at', 
'1448104_at', '1421247_at', 'AFFX-MURINE_b1_at', '1460216_at', '1433969_at', '1419171_at', '1456699_s_at', '1456901_at', '1442139_at', '1421849_at', '1419824_a_at', '1460588_at', '1420131_s_at', '1446138_at', '1435829_at', '1434462_at', '1435059_at', '1415949_at', '1460624_at', '1426707_at', '1417250_at', '1434956_at', '1438018_at', '1454846_at', '1435298_at', '1442077_at', '1424074_at', '1428883_at', '1454149_a_at', '1423925_at', '1457060_at', '1433821_at', '1447923_at', '1460670_at', '1434468_at', '1454980_at', '1426913_at', '1456741_s_at', '1449278_at', '1443534_at', '1417941_at', '1433167_at', '1434401_at', '1456516_x_at', '1451360_at', 'AFFX-GapdhMur/M32599_5_at', '1417827_at', '1434161_at', '1448979_at', '1435797_at', '1419807_at', '1418330_at', '1426304_x_at', '1425492_at', '1437873_at', '1435734_x_at', '1420622_a_at', '1456019_at', '1449200_at', '1455314_at', '1428419_at', '1426349_s_at', '1426743_at', '1436073_at', '1452306_at', '1436735_at', '1439529_at', '1459347_at', '1429642_at', '1438930_s_at', '1437380_x_at', '1459861_s_at', '1424243_at', '1430503_at', '1434474_at', '1417962_s_at', '1440187_at', '1446809_at', '1436234_at', '1415906_at', 'AFFX-MURINE_B2_at', '1434836_at', '1426002_a_at', '1448111_at', '1452882_at', '1436597_at', '1455915_at', '1421846_at', '1428693_at', '1422624_at', '1423755_at', '1460367_at', '1433746_at', '1454872_at', '1429194_at', '1424652_at', '1440795_x_at', '1458690_at', '1434355_at', '1456324_at', '1457867_at', '1429698_at', '1423104_at', '1437585_x_at', '1437739_a_at', '1445605_s_at', '1436313_at', '1449738_s_at', '1437525_a_at', '1454937_at', '1429043_at', '1440091_at', '1422820_at', '1437456_x_at', '1427322_at', '1446649_at', '1433568_at', '1441114_at', '1456541_x_at', '1426985_s_at', '1454764_s_at', '1424071_s_at', '1429251_at', '1429155_at', '1433946_at', '1448771_a_at', '1458664_at', '1438320_s_at', '1449616_s_at', '1435445_at', '1433872_at', '1429273_at', '1420880_a_at', '1448645_at', '1449646_s_at', '1428341_at', 
'1431299_a_at', '1433427_at', '1418530_at', '1436247_at', '1454350_at', '1455860_at', '1417145_at', '1454952_s_at', '1435977_at', '1434807_s_at', '1428715_at', '1418117_at', '1447947_at', '1431781_at', '1428915_at', '1427197_at', '1427208_at', '1455460_at', '1423899_at', '1441944_s_at', '1455429_at', '1452266_at', '1454409_at', '1426384_a_at', '1428725_at', '1419181_at', '1454862_at', '1452907_at', '1433794_at', '1435492_at', '1424839_a_at', '1416214_at', '1449312_at', '1436678_at', '1426253_at', '1438859_x_at', '1448189_a_at', '1442557_at', '1446174_at', '1459718_x_at', '1437613_s_at', '1456509_at', '1455267_at', '1440480_at', '1417296_at', '1460050_x_at', '1433585_at', '1436771_x_at', '1424294_at', '1448648_at', '1417753_at', '1436139_at', '1425642_at', '1418553_at', '1415747_s_at', '1445984_at', '1440024_at', '1448720_at', '1429459_at', '1451459_at', '1428853_at', '1433856_at', '1426248_at', '1417765_a_at', '1439459_x_at', '1447023_at', '1426088_at', '1440825_s_at', '1417390_at', '1444744_at', '1435618_at', '1424635_at', '1443727_x_at', '1421096_at', '1427410_at', '1416860_s_at', '1442773_at', '1442030_at', '1452281_at', '1434774_at', '1416891_at', '1447915_x_at', '1429129_at', '1418850_at', '1416308_at', '1422858_at', '1447679_s_at', '1440903_at', '1417321_at', '1452342_at', '1453510_s_at', '1454923_at', '1454611_a_at', '1457532_at', '1438440_at', '1434232_a_at', '1455878_at', '1455571_x_at', '1436401_at', '1453289_at', '1457365_at', '1436708_x_at', '1434494_at', '1419588_at', '1433679_at', '1455159_at', '1428982_at', '1446510_at', '1434131_at', '1418066_at', '1435346_at', '1449415_at', '1455384_x_at', '1418817_at', '1442073_at', '1457265_at', '1447361_at', '1418039_at', '1428467_at', '1452224_at', '1417538_at', '1434529_x_at', '1442149_at', '1437379_x_at', '1416473_a_at', '1432750_at', '1428389_s_at', '1433823_at', '1451889_at', '1438178_x_at', '1441807_s_at', '1416799_at', '1420623_x_at', '1453245_at', '1434037_s_at', '1443012_at', '1443172_at', '1455321_at', 
'1438396_at', '1440823_x_at', '1436278_at', '1457543_at', '1452908_at', '1417483_at', '1418397_at', '1446589_at', '1450966_at', '1447877_x_at', '1446524_at', '1438592_at', '1455589_at', '1428629_at', '1429585_s_at', '1440020_at', '1417365_a_at', '1426442_at', '1427151_at', '1437377_a_at', '1433995_s_at', '1435464_at', '1417007_a_at', '1429690_at', '1427999_at', '1426819_at', '1454905_at', '1439516_at', '1434509_at', '1428707_at', '1416793_at', '1440822_x_at', '1437327_x_at', '1428682_at', '1435004_at', '1434238_at', '1417581_at', '1434699_at', '1455597_at', '1458613_at', '1456485_at', '1435122_x_at', '1452864_at', '1453122_at', '1435254_at', '1451221_at', '1460168_at', '1455336_at', '1427965_at', '1432576_at', '1455425_at', '1428762_at', '1455459_at', '1419317_x_at', '1434691_at', '1437950_at', '1426401_at', '1457261_at', '1433824_x_at', '1435235_at', '1437343_x_at', '1439964_at', '1444280_at', '1455434_a_at', '1424431_at', '1421519_a_at', '1428412_at', '1434010_at', '1419976_s_at', '1418887_a_at', '1428498_at', '1446883_at', '1435675_at', '1422599_s_at', '1457410_at', '1444437_at', '1421050_at', '1437885_at', '1459754_x_at', '1423807_a_at', '1435490_at', '1426760_at', '1449459_s_at', '1432098_a_at', '1437067_at', '1435574_at', '1433999_at', '1431289_at', '1428919_at', '1425678_a_at', '1434924_at', '1421640_a_at', '1440191_s_at', '1460082_at', '1449913_at', '1439830_at', '1425020_at', '1443790_x_at', '1436931_at', '1454214_a_at', '1455854_a_at', '1437061_at', '1436125_at', '1426385_x_at', '1431893_a_at', '1417140_a_at', '1435333_at', '1427907_at', '1434446_at', '1417594_at', '1426518_at', '1437345_a_at', '1420091_s_at', '1450058_at', '1435161_at', '1430348_at', '1455778_at', '1422653_at', '1447942_x_at', '1434843_at', '1454956_at', '1454998_at', '1427384_at', '1439828_at') AND + Species.Name = 'mouse' AND + ProbeSetXRef.ProbeSetFreezeId IN ( + SELECT ProbeSetFreeze.Id + FROM ProbeSetFreeze WHERE ProbeSetFreeze.Name = 'HC_M2_0606_P') + +Connection ID (thread ID): 41 
+Status: NOT_KILLED + +Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off + +The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains +information that should help you find out what is causing the crash. +Writing a core file... 
+Working directory at /export/mysql/var/lib/mysql +Resource Limits: +Limit Soft Limit Hard Limit Units +Max cpu time unlimited unlimited seconds +Max file size unlimited unlimited bytes +Max data size unlimited unlimited bytes +Max stack size 8388608 unlimited bytes +Max core file size 0 unlimited bytes +Max resident set unlimited unlimited bytes +Max processes 3094157 3094157 processes +Max open files 64000 64000 files +Max locked memory 65536 65536 bytes +Max address space unlimited unlimited bytes +Max file locks unlimited unlimited locks +Max pending signals 3094157 3094157 signals +Max msgqueue size 819200 819200 bytes +Max nice priority 0 0 +Max realtime priority 0 0 +Max realtime timeout unlimited unlimited us +Core pattern: core + +Kernel version: Linux version 5.10.0-22-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.178-3 (2023-04-22) + +2025-02-20 7:50:17 0 [Note] Starting MariaDB 10.5.23-MariaDB-0+deb11u1-log source revision 6cfd2ba397b0ca689d8ff1bdb9fc4a4dc516a5eb as process 3086167 +2025-02-20 7:50:17 0 [Note] InnoDB: !!! innodb_force_recovery is set to 1 !!! +2025-02-20 7:50:17 0 [Note] InnoDB: Uses event mutexes +2025-02-20 7:50:17 0 [Note] InnoDB: Compressed tables use zlib 1.2.11 +2025-02-20 7:50:17 0 [Note] InnoDB: Number of pools: 1 +2025-02-20 7:50:17 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions +2025-02-20 7:50:17 0 [Note] InnoDB: Using Linux native AIO +2025-02-20 7:50:17 0 [Note] InnoDB: Initializing buffer pool, total size = 17179869184, chunk size = 134217728 +2025-02-20 7:50:17 0 [Note] InnoDB: Completed initialization of buffer pool +2025-02-20 7:50:17 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1537379110991,1537379110991 +2025-02-20 7:50:17 0 [Note] InnoDB: Last binlog file '/var/lib/mysql/gn0-binary-log.000134', position 82843148 +2025-02-20 7:50:17 0 [Note] InnoDB: 128 rollback segments are active. 
+2025-02-20 7:50:17 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" +2025-02-20 7:50:17 0 [Note] InnoDB: Creating shared tablespace for temporary tables +2025-02-20 7:50:17 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ... +2025-02-20 7:50:17 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB. +2025-02-20 7:50:17 0 [Note] InnoDB: 10.5.23 started; log sequence number 1537379111003; transaction id 3459549902 +2025-02-20 7:50:17 0 [Note] Plugin 'FEEDBACK' is disabled. +2025-02-20 7:50:17 0 [Note] InnoDB: Loading buffer pool(s) from /export/mysql/var/lib/mysql/ib_buffer_pool +2025-02-20 7:50:17 0 [Note] Loaded 'locales.so' with offset 0x7f9551bc0000 +2025-02-20 7:50:17 0 [Note] Recovering after a crash using /var/lib/mysql/gn0-binary-log +2025-02-20 7:50:17 0 [Note] Starting crash recovery... +2025-02-20 7:50:17 0 [Note] Crash recovery finished. +2025-02-20 7:50:17 0 [Note] Server socket created on IP: '0.0.0.0'. +2025-02-20 7:50:17 0 [Warning] 'user' entry 'webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:50:17 0 [Warning] 'db' entry 'db_webqtl webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:50:17 0 [Note] Reading of all Master_info entries succeeded +2025-02-20 7:50:17 0 [Note] Added new Master_info '' to hash table +2025-02-20 7:50:17 0 [Note] /usr/sbin/mariadbd: ready for connections. 
+Version: '10.5.23-MariaDB-0+deb11u1-log' socket: '/run/mysqld/mysqld.sock' port: 3306 Debian 11 +2025-02-20 7:50:17 4 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:50:17 5 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:50:17 0 [Note] InnoDB: Buffer pool(s) load completed at 250220 7:50:17 +``` + +A possible issue is the use of the environment variable SQL_URI at this point: + +=> https://github.com/genenetwork/genenetwork2/blob/testing/gn2/wqflask/correlation/rust_correlation.py#L34 + +which is requested + +=> https://github.com/genenetwork/genenetwork2/blob/testing/gn2/wqflask/correlation/rust_correlation.py#L7 from here. + +I tried setting an environment variable "SQL_URI" with the same value as the config and rebuilt the container. That did not fix the problem. + +Running the query directly in the default mysql client also fails with: + +``` +ERROR 2013 (HY000): Lost connection to MySQL server during query +``` + +Huh, so this was not a code problem. + +Configured database to allow upgrade of tables if necessary and restarted mariadbd. + +The problem still persists. + +Note Pjotr: this is likely a mariadb bug with 10.5.23, the most recent mariadbd we use (both tux01 and tux02 are older). The dump shows it balks on creating a new thread: pthread_create.c:478. Looks similar to https://jira.mariadb.org/browse/MDEV-32262 + +10.5, 10.6, 10.11 are affected. so running correlations on production crashes mysqld? I am not trying for obvious reasons ;) the threading issues of mariadb look scary - I wonder how deep it goes. + +We'll test for a different version of mariadb combining a Debian update because Debian on tux04 is broken. 
diff --git a/issues/provide-link-to-register-user-in-sign-in-page.gmi b/issues/provide-link-to-register-user-in-sign-in-page.gmi index 24d7c21..b9e6a4d 100644 --- a/issues/provide-link-to-register-user-in-sign-in-page.gmi +++ b/issues/provide-link-to-register-user-in-sign-in-page.gmi @@ -3,7 +3,7 @@ ## Tags * type: bug -* status: open +* status: closed * assigned: fredm * priority: medium * keywords: register user, gn-auth, genenetwork @@ -16,3 +16,8 @@ Provide a link allowing a user to register with the system on the sign-in page. We are now using OAuth2 to enable sign-in, which means that the user is redirected from the service they were in to the authorisation service to sign-in. The service should retain a note of the service which the user came from, and redirect back to it on successful registration. + + +### Close as Completed + +@zachs seems to have fixed this. diff --git a/issues/quality-control/fix-flash-messages.gmi b/issues/quality-control/fix-flash-messages.gmi index da54c52..e65c0f6 100644 --- a/issues/quality-control/fix-flash-messages.gmi +++ b/issues/quality-control/fix-flash-messages.gmi @@ -5,7 +5,7 @@ * assigned: fredm * priority: low * type: bug -* status: open +* status: closed, completed, fixed * keywords: flask, flash ## Description diff --git a/issues/quality-control/qc-r-qtl2-bundles.gmi b/issues/quality-control/qc-r-qtl2-bundles.gmi index 9cc1452..6560594 100644 --- a/issues/quality-control/qc-r-qtl2-bundles.gmi +++ b/issues/quality-control/qc-r-qtl2-bundles.gmi @@ -3,7 +3,7 @@ ## Tags * assigned: fredm, acenteno -* status: open +* status: closed, completed * type: feature request * priority: medium * keywords: quality control, QC, R/qtl2 bundle diff --git a/issues/quality-control/r-qtl2-features.gmi b/issues/quality-control/r-qtl2-features.gmi index eac53c4..bcc5d71 100644 --- a/issues/quality-control/r-qtl2-features.gmi +++ b/issues/quality-control/r-qtl2-features.gmi @@ -3,7 +3,7 @@ ## Tags * type: listing -* status: open +* status: 
closed, completed * assigned: fredm * priority: high * keywords: listing, bug, feature @@ -12,5 +12,9 @@ This is a listing of non-critical features and bugs that do not currently have a dedicated issue, and need to be handled some time in the future. -* [feature] "Undo Transpose": Files marked as '*_transposed: true' will have the transposition undone to ease processing down the line. +* Closed, completed: [feature] "Undo Transpose": Files marked as '*_transposed: true' will have the transposition undone to ease processing down the line. * … + +### Close as completed + +Actually open dedicated issues for bugs and features rather than collecting them here. diff --git a/issues/rdf/automate-rdf-generation-and-ingress.gmi b/issues/rdf/automate-rdf-generation-and-ingress.gmi new file mode 100644 index 0000000..ef4ba9f --- /dev/null +++ b/issues/rdf/automate-rdf-generation-and-ingress.gmi @@ -0,0 +1,37 @@ +# Update RDF Generation and Ingress to Virtuoso + +## Tags + +* assigned: bonfacem +* priority: high +* tags: in-progress +* deadline: 2024-10-23 Wed + +We need to update Virtuoso in production. At the moment this is done manually. 
For the current set-up, we need to update the recent modified RIF+WIKI models: + + +``` +# Generate the RDF triples +time guix shell guile-dbi guile-hashing -m manifest.scm -- ./pre-inst-env ./examples/generif.scm --settings conf.scm --output /home/bonfacem/ttl-files/generif-metadata-new.ttl --documentation ./docs/generif-metadata.md + +# Make sure they are valid +guix shell -m manifest.scm -- rapper --input turtle --count /home/bonfacem/ttl-files/generif-metadata-new.ttl + +# Copy the files over to the exposed virtuoso path +cp /home/bonfacem/ttl-files/generif-metadata-new.ttl </some/dir/> + +# Get into Virtuoso (with a password) +guix shell virtuoso-ose -- isql <port-number> + +# Load the files to be loaded +# Assuming that '/var/lib/data' is where the files are +ld_dir('/var/lib/data', 'generif-metadata-new.ttl', 'http://genenetwork.org'); + +# Load the files +rdf_loader_run(); +CHECKPOINT; +``` + +Above steps should be automated and tested in CD before roll-out in production. Key considerations: + +- Pick latest important changes from git, so that we can pick what files to run instead of generating all the ttl files all the time. diff --git a/issues/rdf/hash-rdf-graph.gmi b/issues/rdf/hash-rdf-graph.gmi index c896218..2863108 100644 --- a/issues/rdf/hash-rdf-graph.gmi +++ b/issues/rdf/hash-rdf-graph.gmi @@ -5,3 +5,12 @@ ## Description Building the index is an expesive operation. Hash the graph and store the metadata in xapian, and similarly in the RDF store. The mcron-job should check whether this has changed, and if there's any difference, go ahead and re-build the index. + +Resolution: + +=> https://github.com/genenetwork/genenetwork3/pull/171 Improve Sharing Memory Across Processes. +=> https://github.com/genenetwork/genenetwork3/pull/172 Check whether table names were stored in xapian. +=> https://github.com/genenetwork/genenetwork3/pull/174 Wikidata index. 
+=> https://github.com/genenetwork/genenetwork3/pull/175 Refactor how the generif md5 sum is calculated and stored in XAPIAN. + +* closed diff --git a/issues/redesign-global-search-design.gmi b/issues/redesign-global-search-design.gmi new file mode 100644 index 0000000..df63791 --- /dev/null +++ b/issues/redesign-global-search-design.gmi @@ -0,0 +1,23 @@ +# Redesign Global Search Design + +## Tags +* assigned: alexm, zac +* keywords: global search, design, HTML +* type: enhancement +* status: closed, completed, done + +## Description +Rob suggested we model the global search on the NCBI PubMed interface. We should remove the `?` button, which seems to be confusing for users, and have a better user guide. + +## Tasks + +* [x] Redesign the global search to fit the NCBI PubMed model. +* [x] Replace the "?" button that acts as a user guide + +## Related issues: + +=> https://issues.genenetwork.org/issues/cleanup-base-file-gn2 + +## Notes +PR that seeks to address this issue: +=> https://github.com/genenetwork/genenetwork2/pull/880 \ No newline at end of file diff --git a/issues/remove-custom-bootstrap-css.gmi b/issues/remove-custom-bootstrap-css.gmi index 7fa6f24..14c1c35 100644 --- a/issues/remove-custom-bootstrap-css.gmi +++ b/issues/remove-custom-bootstrap-css.gmi @@ -1,7 +1,7 @@ # Remove overrides to bootstrap classes in bootstrap-custom.css * assigned: zachs, bonfacem, alexm - +* status: stalled We have a "bootstrap-custom.css" in GeneNetwork. 
Consider this snippet: diff --git a/issues/remove-references-to-old-gn-auth-code.gmi b/issues/remove-references-to-old-gn-auth-code.gmi index 1a03c25..8c110aa 100644 --- a/issues/remove-references-to-old-gn-auth-code.gmi +++ b/issues/remove-references-to-old-gn-auth-code.gmi @@ -4,7 +4,7 @@ * assigned: bonfacem * keywords: auth -* status: open +* status: stalled ## Description diff --git a/issues/replace-neo4j-with-virtuoso.gmi b/issues/replace-neo4j-with-virtuoso.gmi new file mode 100644 index 0000000..450fb70 --- /dev/null +++ b/issues/replace-neo4j-with-virtuoso.gmi @@ -0,0 +1,8 @@ +# Replace Neo4J with Virtuoso + +## Tags + +* assigned: bonfacem, soloshelby +* deadline: 2024-10-25 Fri + +Currently, the RAG ingests TTL files into Neo4J. Replace this with Virtuoso. diff --git a/issues/reset-password-on-container-rebuild.gmi b/issues/reset-password-on-container-rebuild.gmi index b0e4dbb..6c0ad1e 100644 --- a/issues/reset-password-on-container-rebuild.gmi +++ b/issues/reset-password-on-container-rebuild.gmi @@ -2,5 +2,6 @@ ## Tags * assigned: bonfacem +* status: stalled Whenever the virtuoso container is rebuilt, we manually have to reset the password. We should fix this by modifying the virtuoso service so that things are set automatically. diff --git a/issues/search-for-brca.gmi b/issues/search-for-brca.gmi index c42c745..05c6fd0 100644 --- a/issues/search-for-brca.gmi +++ b/issues/search-for-brca.gmi @@ -1,10 +1,31 @@ -# Search for brca +# Search Improvements: capital insensitive search for RIF+WIKI; Examples -* assigned: arun +## Tags -Search for brca does not return results for brca1 and brca2. It should. -=> https://cd.genenetwork.org/gsearch?type=gene&terms=brca +* assigned: bonfacem, rookie101 +* priority: high +* type: ops +* keywords: virtuoso -The xapian stemmer does not stem brca1 to brca. That's why when one searches for brca, results for brca1 are not returned. 
+## Description + +RIF search is finally working on production: + +> rif:Brca2 and group:BXD + +and capital insentive search too for the BXD. See: + +=> https://github.com/genenetwork/genenetwork3/commit/4b2e9f3fb3383421d7a55df5399aab71e0cc3b4f Stem group field regardless of case. +=> https://github.com/genenetwork/genenetwork3/commit/a37622b466f9f045db06a6f07e88fcf81b176f91 Stem all the time. + +## Questions: + +* How do we search genewiki data? + +* rif:Brca2 should also be RIF:Brca2 (prefer the latter if we have to +choose as that is what people will try) + +* Can we continue giving examples at + +=> https://genenetwork.org/search-syntax search syntax -Perhaps we should write a custom stemmer that stems brca1 to brca. But, at the same time, we should be wary of stemming terms like p450 to p. Pjotr suggests the heuristic that we look for at least 2 or 3 alphabetic characters at the beginning. Another approach is to hard-code a list of candidates to look for. diff --git a/issues/set-up-gn-guile-in-tux02.gmi b/issues/set-up-gn-guile-in-tux02.gmi new file mode 100644 index 0000000..29eca68 --- /dev/null +++ b/issues/set-up-gn-guile-in-tux02.gmi @@ -0,0 +1,15 @@ +# Set Up gn-guile in tux02 + +## Tags + +* assigned: bonfacem +* priority: high +* status: in-progress +* deadline: 2024-10-23 Wed + +## Tasks + +* [-] Create gn-guile container. +* [X] Merge gn2 UI PR. +=> https://github.com/genenetwork/genenetwork2/pull/854 Feature/gn editor UI +* [-] Test out auth editing in CD. diff --git a/issues/set-up-virtuoso-on-production.gmi b/issues/set-up-virtuoso-on-production.gmi index 88c04f7..614565a 100644 --- a/issues/set-up-virtuoso-on-production.gmi +++ b/issues/set-up-virtuoso-on-production.gmi @@ -1,8 +1,8 @@ -# Set-up Virtuoso on Production +# Set-up Virtuoso+Xapian on Production ## Tags -* assigned: bonfacem +* assigned: bonfacem, zachs, fredm * priority: high * type: ops * keywords: virtuoso @@ -11,5 +11,121 @@ We already have virtuoso set-up in tux02. 
Right now, to be able to interact with RDF, we need to have virtuoso set-up. This issue will unblock: +* Global Search in Production + => https://github.com/genenetwork/genenetwork3/pull/137 Update RDF endpoints + => https://github.com/genenetwork/genenetwork2/pull/808 UI/RDF frontend + + +## HOWTO: Updating Virtuoso in Production (Tux01) + + +Note where the virtuoso data directory is mapped from the "production.sh" script as you will use this in the consequent steps: + +> --share=/export2/guix-containers/genenetwork/var/lib/virtuoso=/var/lib/virtuoso + +### Generating the TTL Files + +=> https://git.genenetwork.org/gn-transform-databases/tree/generate-ttl-files.scm Run "generate-ttl-files" to generate the TTL files: + +``` +time guix shell guile-dbi -m manifest.scm -- \ +./generate-ttl-files.scm --settings conn-dev.scm --output \ +/export2/guix-containers/genenetwork-development/var/lib/virtuoso \ +--documentation /tmp/doc-directory +``` + +* [Recommended] Alternatively, copy over the TTL files (in Tux01) to the correct shared directory in the container: + +``` +cp /home/bonfacem/ttl-files/*ttl /export2/guix-containers/genenetwork/var/lib/virtuoso/ +``` + +### Loading the TTL Files + +* Make sure that the virtuoso service type has the "dirs-allowed" variable set correctly: + +``` +(service virtuoso-service-type + (virtuoso-configuration + (server-port 7892) + (http-server-port 7893) + (dirs-allowed "/var/lib/virtuoso"))) +``` + +* Get into isql: + +``` +guix shell virtuoso-ose -- isql 7892 +``` +* Make sure that no pre-existing TTL files exist in "DB.DBA.LOAD_LIST": + +``` +SQL> select * from DB.DBA.LOAD_LIST; +SQL> delete from DB.DBA.load_list; +``` +* Delete the genenetwork graph: + +``` +SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://genenetwork.org'); +``` + +* Load all the TTL files (This takes some time): + +``` +SQL> ld_dir('/var/lib/virtuoso', '*.ttl', 'http://genenetwork.org'); +SQL> rdf_loader_run(); +SQL> CHECKPOINT; +SQL> 
checkpoint_interval(60); +SQL> scheduler_interval(10); +``` +* Verify you have some RDF data by running: + +``` +SQL> SPARQL +PREFIX gn: <http://genenetwork.org/id/> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX taxon: <http://purl.uniprot.org/taxonomy/> + +SELECT * WHERE { + ?s skos:member gn:Mus_musculus . + ?s ?p ?o . +}; +``` + +* Update GN3 Configurations to point to the correct Virtuoso instance: + +> SPARQL_ENDPOINT="http://localhost:7893/sparql" + +## HOWTO: Generating the Xapian Index + +* Make sure you are using the correct guix profile or that you have the "PYTHONPATH" pointing to the GN3 repository. + +* Generate the Xapian Index using "genenetwork3/scripts/create-xapian-index" against the correct output directory (The build takes around 71 minutes on an SSD Drive): + +``` +time python index-genenetwork create-xapian-index \ +/export/data/genenetwork-xapian/ \ +mysql://<user>:<password>@localhost/db_webqtl \ +http://localhost:7893/sparql +``` +* After the build, you can verify that the index works by: + +``` +guix shell xapian -- xapian-delve /export/data/genenetwork-xapian/ +``` +* Update GN3 configuration files to point to the right Xapian path: + +> XAPIAN_DB_PATH="/export/data/genenetwork-xapian/" + +## Resolution + +@fredm updated virtuoso; and @zachs updated the xapian index in production. + +* closed diff --git a/issues/systems/apps.gmi b/issues/systems/apps.gmi new file mode 100644 index 0000000..e374250 --- /dev/null +++ b/issues/systems/apps.gmi @@ -0,0 +1,225 @@ +# Apps + +GeneNetwork.org retains a number of apps. Currently they are managed by shepherd as `guix shell` services, but we should really move them to system containers. 
+ +# Tags + +* assigned: pjotrp +* type: enhancement +* status: in progress +* priority: medium +* keywords: system, sheepdog, shepherd + +# Tasks + +* [ ] Get services running +* [ ] Move guix shell into containers +* [ ] Make sure the container starts up on reboot and/or migrate to a new host + +# List of apps + +Current apps managed by shepherd/systemd on tux02/balg01 are + +=> https://genecup.org/ +* [+] genecup [shell] (hao) +* [X] - fire up service +* [X] - add sheepdog monitor +* [ ] - add link in GN2 +* [X] - add banner for GeneNetwork +* [ ] - create system container +* [X] - create guix root +* [ ] - make sure it works on reboot (systemd) +=> https://bnw.genenetwork.org/ +* [+] bnw [container] (yan cui and rob) +* [X] - fire up service +* [X] - add sheepdog monitor +* [X] - add link in GN2 +* [ ] - add banner for GeneNetwork +* [ ] - update system container +* [X] - create guix root +* [ ] - make sure it works on reboot (systemd) +=> http://hrdp.genenetwork.org +* [+] hrdp-project (hao?) +* [X] - fire up service +* [X] - add sheepdog monitor +* [ ] - https +* [ ] - add link in GN2 +* [ ] - add banner for GeneNetwork +* [ ] - create system container +* [ ] - create guix root +* [ ] - make sure it works on reboot (systemd) +=> https://pluto.genenetwork.org/ +* [+] pluto (saunak) +* [X] - fire up service +* [X] - add sheepdog monitor +* [ ] - add link in GN2 +* [ ] - add banner for GeneNetwork +* [ ] - create system container +* [ ] - create guix root +* [ ] - make sure it works on reboot (systemd) +=> https://power.genenetwork.org/ +* [+] power app (dave) +* [X] - fire up service +* [X] - add sheepdog monitor +* [ ] - add link in GN2 +* [ ] - add banner for GeneNetwork +* [ ] - create system container +* [X] - create guix root +* [ ] - make sure it works on reboot (systemd) +* [ ] root? 
+=> http://longevity-explorer.genenetwork.org/ +* [+] Longevity explorer [container balg01] (dave) +* [X] - fire up service +* [X] - add sheepdog monitor +* [ ] - https +* [ ] - add link in GN2 +* [ ] - add banner for GeneNetwork +* [ ] - create system container +* [ ] - create guix root +* [ ] - make sure it works on reboot (systemd) +=> http://jumpshiny.genenetwork.org/ +* [+] jumpshiny app (xusheng) +* [+] - fire up service (still some dependencies) +* [X] - add sheepdog monitor +* [ ] - https +* [ ] - add link in GN2 +* [ ] - add banner for GeneNetwork +* [ ] - create system container +* [ ] - create guix root +* [ ] - make sure it works on reboot (systemd) +=> https://hegp.genenetwork.org/ +* [+] hegp (pjotr) +* [X] - fire up service +* [X] - add sheepdog monitor +* [ ] - add link in GN2 +* [ ] - add banner for GeneNetwork +* [ ] - create system container +* [ ] - create guix root +* [X] - make sure it works on reboot (systemd) + +* [-] singlecell (siamak) +* [-] rn6app (hao - remove) +* [-] genome-browser (defunct) + +To fix them we need to validate the sheepdog monitor and make sure they are working in either shepherd (+), or as a system container (X). + +Sheepdog monitor is at + +=> http://sheepdog.genenetwork.org/sheepdog/status.html + +# Info + +## BNW + +The app is already a Guix system container! 
To make it part of the startup I had to move it away from shepherd (which runs in userland) and: + +``` +/home/shepherd/guix-profiles/bnw/bin/guix system container /home/shepherd/guix-bioinformatics/gn/services/bnw-container.scm --share=/home/shepherd/logs/bnw-server=/var/log --network +ln -s /gnu/store/0hnfb9ynnxsig3yyprwxmg5h6c9g8mry-run-container /usr/local/bin/bnw-app-container +``` + +systemd service: + +``` +root@tux02:/etc/systemd/system# cat bnw-app-container.service +[Unit] +Description = Run genenetwork BNW app container +[Service] +ExecStart = /usr/local/bin/bnw-app-container +[Install] +WantedBy = multi-user.target +``` + +We need to make sure the garbace collector does not destroy the container, add the --root switch + +``` +/home/shepherd/guix-profiles/bnw/bin/guix system container /home/shepherd/guix-bioinformatics/gn/services/bnw-container.scm --share=/home/shepherd/logs/bnw-server=/var/log --network --root=/usr/local/bin/bnw-app-container +``` + +Check with + +``` +root@tux02:/home/shepherd# /home/shepherd/guix-profiles/bnw/bin/guix gc --list-roots |grep bnw + /usr/local/bin/bnw-app-container +``` + +## R/shiny apps + +The R/shiny apps were showing a tarball mismatch: + +``` +building /gnu/store/rjnw7k56z955v4bl07flm9pjwxx5vs0r-r-minimal-4.0.2.drv... +downloading from http://cran.r-project.org/src/contrib/Archive/KernSmooth/KernSmooth_2.23-17.tar.gz ... +- 'configure' phasesha256 hash mismatch for /gnu/store/n05zjfhxl0iqx1jbw8i6vv1174zkj7ja-KernSmooth_2.23-17.tar.gz: + expected hash: 11g6b0q67vasxag6v9m4px33qqxpmnx47c73yv1dninv2pz76g9b + actual hash: 1ciaycyp79l5aj78gpmwsyx164zi5jc60mh84vxxzq4j7vlcdb5p + hash mismatch for store item '/gnu/store/n05zjfhxl0iqx1jbw8i6vv1174zkj7ja-KernSmooth_2.23-17.tar.gz' +``` + +Guix checks and it is not great CRAN allows for changing tarballs with the same version number!! Luckily building with a more recent version of Guix just worked (TM). 
Now we create a root too: + +``` +/home/wrk/opt/guix-pull/bin/guix pull -p ~/guix-profiles/guix-for-r-shiny +``` + +Note I did not have to pull in guix-bioinformatics channel + +## Singlecell + +Singlecell is an R/shiny app. It starts with an error after above upgrade: + +``` +no slot of name "counts" for this object of class +``` + +and the code needs to be updated: + +=> https://github.com/satijalab/seurat/issues/8804 + +The 4 year old code lives at + +=> https://github.com/genenetwork/singleCellRshiny + +and it looks like lines like these need to be updated: + +=> https://github.com/genenetwork/singleCellRshiny/blob/6b2a344dd0d02f65228ad8c350bac0ced5850d05/app.R#L167 + +Let me ask the author Siamak Yousefi. I think we'll drop it. + +## longevity + +Package definition is at + +=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/mouse-longevity.scm + +Container is at + +=> https://git.genenetwork.org/gn-machines/tree/gn/services/mouse-longevity.scm + +gaeta:~/iwrk/deploy/gn-machines$ guix system container -L . -L ~/guix-bioinformatics --verbosity=3 test-r-container.scm -L ~/iwrk/deploy/guix-forge/guix +forge/nginx.scm:145:40: error: acme-service-type: unbound variable +hint: Did you forget `(use-modules (forge acme))'? + + +## jumpshiny + +Jumpshiny is hosted on balg01. Scripts are in tux02 git. + +=> git.genenetwork.org:/home/git/shared/source/jumpshiny + +``` +root@balg01:/home/j*/gn-machines# . /usr/local/guix-profiles/guix-pull/etc/profile +guix system container --network -L . 
-L ../guix-forge/guix/ -L ../guix-bioinformatics/ -L ../guix-past/modules/ --substitute-urls='https://ci.guix.gnu.org https://bordeaux.guix.gnu.org https://cuirass.genenetwork.org' test-r-container.scm -L ../guix-forge/guix/gnu/store/xyks73sf6pk78rvrwf45ik181v0zw8rx-run-container +/gnu/store/6y65x5jk3lxy4yckssnl32yayjx9nwl5-run-container +``` + +Currently: + +Jumpshiny: as aijun, cd services/jumpshiny and ./.guix-run + + +## JUMPsem_web + +Another shiny app to run on balg01. + +Jumpshiny: as aijun, cd services/jumpsem and ./.guix-run diff --git a/issues/systems/fallbacks-and-backups.gmi b/issues/systems/fallbacks-and-backups.gmi index 9b890c7..53bd8fa 100644 --- a/issues/systems/fallbacks-and-backups.gmi +++ b/issues/systems/fallbacks-and-backups.gmi @@ -1,6 +1,12 @@ # Fallbacks and backups -As a hurricane is barreling towards our machine room in Memphis we are checking our fallbacks and backups for GeneNetwork. For years we have been making backups on Amazon - both S3 and a running virtual machine. The latter was expensive, so I replaced it with a bare metal server which earns itself (if it hadn't been down for months, but that is a different story). +A revisit to previous work on backups etc. The sheepdog hosts are no longer responding and we should really run sheepdog on a machine that is not physically with the other machines. In time sheepdog should also move away from redis and run in a system container, but that is for later. I did most of the work late 2021 when I wrote: + +> As a hurricane is barreling towards our machine room in Memphis we are checking our fallbacks and backups for GeneNetwork. For years we have been making backups on Amazon - both S3 and a running virtual machine. The latter was expensive, so I replaced it with a bare metal server which earns itself (if it hadn't been down for months, but that is a different story). + +As we are introducing an external sheepdog server we may give it a DNS entry as sheepdog.genenetwork.org. 
+ +=> http://sheepdog.genenetwork.org/sheepdog/index.html See also @@ -16,13 +22,15 @@ See also ## Tasks -* [.] backup ratspub, r/shiny, bnw, covid19, hegp, pluto services -* [X] /etc /home/shepherd backups for Octopus -* [X] /etc /home/shepherd backups for P2 -* [X] Get backups running again on fallback -* [ ] fix redis queue for P2 - needs to be on rabbit +* [X] fix redis queue and sheepdog server +* [X] check backups on tux01 +* [ ] drop tux02 backups off-site +* [ ] backup ratspub, r/shiny, bnw, covid19, hegp, pluto services +* [ ] /etc /home/shepherd backups for Octopus +* [ ] /etc /home/shepherd /home/git CI-CD GN-QA backups on Tux02 +* [ ] Get backups running again on fallback * [ ] fix bacchus large backups -* [ ] backup octopus01:/lizardfs/backup-pangenome on bacchus +* [ ] mount bacchus on HPC ## Backup and restore @@ -52,22 +60,21 @@ Recently epysode was reinstated after hardware failure. I took the opportunity t As epysode was one of the main sheepdog messaging servers I need to reinstate: * [X] scripts for sheepdog -* [X] enable trim -* [X] reinstate monitoring web services -* [X] reinstate daily backup from penguin2 -* [X] CRON -* [X] make sure messaging works through redis -* [X] fix and propagate GN1 backup -* [X] fix and propagate IPFS and gitea backups -* [X] add GN1 backup -* [X] add IPFS backup -* [X] other backups +* [ ] Check tunnel on tux01 is reinstated +* [ ] enable trim +* [ ] reinstate monitoring web services +* [ ] reinstate daily backups +* [ ] CRON +* [ ] make sure messaging works through redis +* [ ] fix and propagate GN1 backup +* [ ] fix and propagate fileserver and git backups +* [ ] add GN1 backup +* [ ] other backups * [ ] email on fail Tux01 is backed up now. 
Need to make sure it propagates to -* [X] P2 -* [X] epysode -* [X] rabbit -* [X] Tux02 +* [ ] rabbit +* [ ] Tux02 +* [ ] balg01 * [ ] bacchus diff --git a/issues/systems/machine-room.gmi b/issues/systems/machine-room.gmi deleted file mode 100644 index 28d9921..0000000 --- a/issues/systems/machine-room.gmi +++ /dev/null @@ -1,19 +0,0 @@ -# Machine room - -## Tags - -* assign: pjotrp, dana -* type: system administration -* priority: high -* keywords: systems -* status: unclear - -## Tasks - -* [X] Make tux02e visible from outside -* [ ] Network switch 10Gbs - add hosts -* [ ] Add disks to tux01 and tux02 - need to reboot -* [ ] Set up E-mail relay for tux01 and tux02 smtp.uthsc.edu, port 25 - -=> tux02-production.gmi setup new production machine -=> decommission-machines.gmi Decommission machines diff --git a/issues/systems/octopus.gmi b/issues/systems/octopus.gmi index c510fd9..3a6d317 100644 --- a/issues/systems/octopus.gmi +++ b/issues/systems/octopus.gmi @@ -1,6 +1,9 @@ # Octopus sysmaintenance -Reopened tasks because of new sheepdog layout and add new machines to Octopus and get fiber optic network going with @andreag. See also +Reopened tasks because of new sheepdog layout and add new machines to Octopus and get fiber optic network going with @andreag. +IT recently upgraded the network switch, so we should have great interconnect between all nodes. We also need to work on user management and network storage. 
+ +See also => ../../topics/systemtopics/systems/hpcs/hpc/octopus-maintenance @@ -14,7 +17,7 @@ Reopened tasks because of new sheepdog layout and add new machines to Octopus an # Tasks -* [ ] add lizardfs to nodes +* [X] add lizardfs to nodes * [ ] add PBS to nodes * [ ] use fiber optic network * [ ] install sheepdog @@ -36,6 +39,17 @@ default via 172.23.16.1 dev ens1f0np0 # Current topology +vim /etc/ssh/sshd_config +systemctl reload ssh + +The routing should be as on octopus01 + +``` +default via 172.23.16.1 dev eno1 +172.23.16.0/21 dev ens1f0np0 proto kernel scope link src 172.23.18.221 +172.23.16.0/21 dev eno1 proto kernel scope link src 172.23.18.188 +``` + ``` ip a ip route @@ -44,3 +58,9 @@ ip route - Octopus01 uses eno1 172.23.18.188/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps) - Octopus02 uses eno1 172.23.17.63/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps) 172.23.x.x + +# Work + +* After the switch upgrade penguin2 NFS is not visible for octopus01. I disabled the mount in fstab +* On octopus01 disabled unattended upgrade script - we don't want kernel updates on this machine(!) +* Updated IP addresses in sshd_config diff --git a/issues/systems/octoraid-storage.gmi b/issues/systems/octoraid-storage.gmi new file mode 100644 index 0000000..97e0e55 --- /dev/null +++ b/issues/systems/octoraid-storage.gmi @@ -0,0 +1,18 @@ +# OctoRAID + +We are building machines that can handle cheap drives. + +# octoraid01 + +This is a jetson with 4 22TB seagate-ironwolf-pro-st22000nt001-22tb-enterprise-nas-hard-drives-7200-rpm. + +Unfortunately the stock kernel has no RAID support, so we simple mount the 4 drives (hosted on a USB-SATA bridge). + +Stress testing: + +``` +cd /export/nfs/lair01 +stress -v -d 1 +``` + +Running on multiple disks the jetson is holding up well! 
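
As a complement to the `stress` run above, a write-and-read-back check per mount point could look roughly like this. It is a sketch under assumptions, not what was run on octoraid01: the helper names are hypothetical, the mount points are supplied by the caller, and the read-back may be served from the page cache unless caches are dropped first, so it mostly catches write-path and filesystem errors.

```python
import hashlib
import os
import tempfile

MIB = 1024 * 1024


def write_and_verify(directory, size_mib=4):
    """Write random data into `directory`, read it back and compare SHA-256 digests."""
    chunk = os.urandom(MIB)
    expected = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, prefix="disk-check-")
    try:
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mib):
                handle.write(chunk)
                expected.update(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to the device
        actual = hashlib.sha256()
        with open(path, "rb") as handle:
            for block in iter(lambda: handle.read(MIB), b""):
                actual.update(block)
        return actual.hexdigest() == expected.hexdigest()
    finally:
        os.remove(path)  # always clean up the scratch file


def check_mounts(mount_points, size_mib=4):
    """Run write_and_verify on every mount point; return {mount_point: ok}."""
    return {mount: write_and_verify(mount, size_mib) for mount in mount_points}
```

It would be called with the directories of the four drives, for example `check_mounts(["/export/nfs/lair01"])` for the one path named above.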
diff --git a/issues/systems/penguin2-raid5.gmi b/issues/systems/penguin2-raid5.gmi new file mode 100644 index 0000000..f03075d --- /dev/null +++ b/issues/systems/penguin2-raid5.gmi @@ -0,0 +1,61 @@ +# Penguin2 RAID 5 + +# Tags + +* assigned: @fredm, @pjotrp +* status: in progress + +# Description + +The current RAID contains 3 disks: + +``` +root@penguin2:~# cat /proc/mdstat +md0 : active raid5 sdb1[1] sda1[0] sdg1[4] +/dev/md0 33T 27T 4.2T 87% /export +``` + +using /dev/sda,sdb,sdg + +The current root and swap is on + +``` +# root +/dev/sdd1 393G 121G 252G 33% / +# swap +/dev/sdd5 partition 976M 76.5M -2 +``` + +We can therefore add four new disks in slots /dev/sdc,sde,sdf,sdh + +penguin2 has no out-of-band and no serial connector right now. That means any work needs to be done on the terminal. + +Boot loader menu: + +``` +menuentry 'Debian GNU/Linux' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-7ff268df-cb90-4cbc-9d76-7fd6677b4964' { + load_video + insmod gzio + if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi + insmod part_msdos + insmod ext2 + set root='hd2,msdos1' + if [ x$feature_platform_search_hint = xy ]; then + search --no-floppy --fs-uuid --set=root --hint-bios=hd2,msdos1 --hint-efi=hd2,msdos1 --hint-baremetal=ahci2,msdos1 7ff268df-cb90-4cbc-9d76-7fd6677b4964 + else + search --no-floppy --fs-uuid --set=root 7ff268df-cb90-4cbc-9d76-7fd6677b4964 + fi + echo 'Loading Linux 5.10.0-18-amd64 ...' + linux /boot/vmlinuz-5.10.0-18-amd64 root=UUID=7ff268df-cb90-4cbc-9d76-7fd6677b4964 ro quiet + echo 'Loading initial ramdisk ...' + initrd /boot/initrd.img-5.10.0-18-amd64 +} +``` + +Added to sdd MBR + +``` +root@penguin2:~# grub-install /dev/sdd +Installing for i386-pc platform. +Installation finished. No error reported. 
+``` diff --git a/issues/systems/t02-crash.gmi b/issues/systems/t02-crash.gmi new file mode 100644 index 0000000..bf0c5d5 --- /dev/null +++ b/issues/systems/t02-crash.gmi @@ -0,0 +1,47 @@ +## Postmortem tux02 crash + +I'll take a look at tux02 - it rebooted last night and I need to start some services. It rebooted at CDT Aug 07 19:29:14 tux02 kernel: Linux version ... We have two out of memory messages before that: + +``` +Aug 7 18:45:27 tux02 kernel: [13521994.665636] Out of memory: Kill process 30165 (guix) score 759 or sacrifice child +Aug 7 18:45:27 tux02 kernel: [13521994.758974] Killed process 30165 (guix) total-vm:498873224kB, anon-rss:223599272kB, file-rss:4kB, shmem-rss:0kB +``` + +My mosh clapped out before that + +``` +wrk pts/96 mosh [128868] Thu Aug 7 18:53 - down (00:00) +``` + +Someone killed the development container before that + +``` +Aug 7 18:06:32 tux02 systemd[1]: genenetwork-development-container.service: Killing process 86832 (20qjyhd7n9n62fa) with signal SIGKILL. +``` + +and + +``` +Aug 7 13:28:26 tux02 kernel: [13502972.611421] oom_reaper: reaped process 25224 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB +Aug 7 18:16:00 tux02 kernel: [13520227.160945] oom_reaper: reaped process 128091 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB +``` + +Guix builds running out of RAM... My conclusion is that someone has been doing some heavy lifting. Probably Fred. I'll ask him to use a different machine that is not shared by many people. First I need to bring up some processes. The shepherd had not started, so: + +``` +systemctl status user-shepherd.service +``` + +most services started now. I need to check in half an hour. + +BNW is the one that does not start up automatically. + +``` +su shepherd +herd status +herd stop bnw +herd status bnw +tail -f /home/shepherd/logs/bnw.log +``` + +Shows a process is blocking the port. Kill as root, after making sure herd status shows it as stopped. 
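
The OOM entries quoted in this postmortem can be summarised with a small script. This is a minimal sketch, assuming kernel log lines in exactly the format shown above; `parse_oom_kills` is a hypothetical helper and not part of sheepdog or any existing tooling.

```python
import re

# Matches kernel lines such as:
#   "... Killed process 30165 (guix) total-vm:498873224kB, anon-rss:223599272kB, ..."
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KIB_PER_GIB = 1024 ** 2


def parse_oom_kills(lines):
    """Return one dict per 'Killed process' line, with memory sizes in GiB."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match is None:
            continue  # not an OOM kill line
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            "total_vm_gib": int(match.group("total_vm")) / KIB_PER_GIB,
            "anon_rss_gib": int(match.group("anon_rss")) / KIB_PER_GIB,
        })
    return kills
```

Fed with the kernel log, this lists which processes were killed and how much resident memory each held, which makes it easier to see whether a single build was responsible.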
diff --git a/issues/systems/tux02-production.gmi b/issues/systems/tux02-production.gmi index 7de911f..d811c5e 100644 --- a/issues/systems/tux02-production.gmi +++ b/issues/systems/tux02-production.gmi @@ -14,9 +14,9 @@ We are going to move production to tux02 - tux01 will be the staging machine. Th * [X] update guix guix-1.3.0-9.f743f20 * [X] set up nginx (Debian) -* [X] test ipmi console (172.23.30.40) +* [X] test ipmi console * [X] test ports (nginx) -* [?] set up network for external tux02e.uthsc.edu (128.169.4.52) +* [?] set up network for external tux02 * [X] set up deployment evironment * [X] sheepdog copy database backup from tux01 on a daily basis using ibackup user * [X] same for GN2 production environment diff --git a/issues/systems/tux04-disk-issues.gmi b/issues/systems/tux04-disk-issues.gmi index cea5a59..3df0a03 100644 --- a/issues/systems/tux04-disk-issues.gmi +++ b/issues/systems/tux04-disk-issues.gmi @@ -1,4 +1,4 @@ -# Tux04 disk issues +# Tux04/Tux05 disk issues We are facing some disk issues with Tux04: @@ -6,6 +6,10 @@ We are facing some disk issues with Tux04: May 02 20:57:42 tux04 kernel: Buffer I/O error on device sdf1, logical block 859240457 ``` +and the same happened to tux05 (same batch). Basically the controllers report no issues. Just to be sure we added +a copy of the boot partition. + +=> topics/system/linux/add-boot-partition # Tags @@ -52,6 +56,8 @@ Download megacli from => https://hwraid.le-vert.net/wiki/DebianPackages ``` +apt-get update +apt-get install megacli megacli -LDInfo -L5 -a0 ``` @@ -95,3 +101,323 @@ and nothing ;). Megacli is actually the tool to use ``` megacli -AdpAllInfo -aAll ``` + +# Database + +During a backup the DB shows this error: + +``` +2025-03-02 06:28:33 Database page corruption detected at page 1079428, retrying...\n[01] 2025-03-02 06:29:33 Database page corruption detected at page 1103108, retrying... +``` + + +Interestingly the DB recovered on a second backup. 
+ +The database is hosted on a solid /dev/sde Dell Ent NVMe FI. The log says + +``` +kernel: I/O error, dev sde, sector 2136655448 op 0x0:(READ) flags 0x80700 phys_seg 40 prio class 2 +``` + +Suggests: + +=> https://stackoverflow.com/questions/50312219/blk-update-request-i-o-error-dev-sda-sector-xxxxxxxxxxx + +> The errors that you see are interface errors, they are not coming from the disk itself but rather from the connection to it. It can be the cable or any of the ports in the connection. +> Since the CRC errors on the drive do not increase I can only assume that the problem is on the receive side of the machine you use. You should check the cable and try a different SATA port on the server. + +and someone wrote + +> analyzed that most of the reasons are caused by intensive reading and writing. This is a CDN cache node. Type reading NVME temperature is relatively high, if it continues, it will start to throttle and then slowly collapse. + +and temperature on that drive has been 70 C. + +Mariabd log is showing errors: + +``` +2025-03-02 6:54:47 0 [ERROR] InnoDB: Failed to read page 449925 from file './db_webqtl/SnpAll.ibd': Page read from tablespace is corrupted. +2025-03-02 7:01:43 489015 [ERROR] Got error 180 when reading table './db_webqtl/ProbeSetXRef' +2025-03-02 8:10:32 489143 [ERROR] Got error 180 when reading table './db_webqtl/ProbeSetXRef' +``` + +Let's try and dump those tables when the backup is done. 
+ +``` +mariadb-dump -uwebqtlout db_webqtl SnpAll +mariadb-dump: Error 1030: Got error 1877 "Unknown error 1877" from storage engine InnoDB when dumping table `SnpAll` at row: 0 +mariadb-dump -uwebqtlout db_webqtl ProbeSetXRef > ProbeSetXRef.sql +``` + +Eeep: + +``` +tux04:/etc$ mariadb-check -uwebqtlout -c db_webqtl ProbeSetXRef +db_webqtl.ProbeSetXRef +Warning : InnoDB: Index ProbeSetFreezeId is marked as corrupted +Warning : InnoDB: Index ProbeSetId is marked as corrupted +error : Corrupt +tux04:/etc$ mariadb-check -uwebqtlout -c db_webqtl SnpAll +db_webqtl.SnpAll +Warning : InnoDB: Index PRIMARY is marked as corrupted +Warning : InnoDB: Index SnpName is marked as corrupted +Warning : InnoDB: Index Rs is marked as corrupted +Warning : InnoDB: Index Position is marked as corrupted +Warning : InnoDB: Index Source is marked as corrupted +error : Corrupt +``` + +On tux01 we have a working database, we can test with + +``` +mysqldump --no-data --all-databases > table_schema.sql +mysqldump -uwebqtlout db_webqtl SnpAll > SnpAll.sql +``` + +Running the backup with rate limiting from: + +``` +Mar 02 17:09:59 tux04 sudo[548058]: pam_unix(sudo:session): session opened for user root(uid=0) by wrk(uid=1000) +Mar 02 17:09:59 tux04 sudo[548058]: wrk : TTY=pts/3 ; PWD=/export3/local/home/wrk/iwrk/deploy/gn-deploy-servers/scripts/tux04 ; USER=roo> +Mar 02 17:09:55 tux04 sudo[548058]: pam_unix(sudo:auth): authentication failure; logname=wrk uid=1000 euid=0 tty=/dev/pts/3 ruser=wrk rhost= > +Mar 02 17:04:26 tux04 su[548006]: pam_unix(su:session): session opened for user ibackup(uid=1003) by wrk(uid=0) +``` + +Oh oh + +Tux04 is showing errors on all disks. We have to bail out. I am copying the potentially corrupted files to tux01 right now. We have backups, so nothing serious I hope. 
I am only worried about the myisam files we have because they have no strong internal validation: + +``` +2025-03-04 8:32:45 502 [ERROR] db_webqtl.ProbeSetData: Record-count is not ok; is 5264578601 Should be: 5264580806 +2025-03-04 8:32:45 502 [Warning] db_webqtl.ProbeSetData: Found 28665 deleted space. Should be 0 +2025-03-04 8:32:45 502 [Warning] db_webqtl.ProbeSetData: Found 2205 deleted blocks Should be: 0 +2025-03-04 8:32:45 502 [ERROR] Got an error from thread_id=502, ./storage/myisam/ha_myisam.cc:1120 +2025-03-04 8:32:45 502 [ERROR] MariaDB thread id 502, OS thread handle 139625162532544, query id 837999 localhost webqtlout Checking table +CHECK TABLE ProbeSetData +2025-03-04 8:34:02 79695 [ERROR] mariadbd: Table './db_webqtl/ProbeSetData' is marked as crashed and should be repaired +``` + +See also + +=> https://dev.mysql.com/doc/refman/8.4/en/myisam-check.html + +Tux04 will require open heart 'disk controller' surgery and some severe testing before we move back. We'll also look at tux05-8 to see if they have similar problems. + +## Recovery + +According to the logs tux04 started showing serious errors on March 2nd - when I introduced sanitizing the mariadb backup: + +``` +Mar 02 05:00:42 tux04 kernel: I/O error, dev sde, sector 2071078320 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2 +Mar 02 05:00:58 tux04 kernel: I/O error, dev sde, sector 2083650928 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 2 +... +``` + +The log started on Feb 23 when we had our last reboot. It probably is a good idea to turn on persistent logging! Anyway, it is likely files were fine until March 2nd. Similarly the mariadb logs also show + +``` +2025-03-02 6:53:52 489007 [ERROR] mariadbd: Index for table './db_webqtl/ProbeSetData.MYI' is corrupt; try to repair it +2025-03-02 6:53:52 489007 [ERROR] db_webqtl.ProbeSetData: Can't read key from filepos: 2269659136 +``` + +So, if we can restore a backup from March 1st we should be reasonably confident it is sane. 
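On the persistent logging point above: a minimal sketch, assuming a systemd host where journald reads drop-ins from /etc/systemd/journald.conf.d. The helper name and the drop-in file name are illustrative and not part of the tux04 setup; `Storage=persistent` is the documented journald.conf option.

```shell
#!/bin/sh
# Illustrative helper: print a journald drop-in that keeps logs across reboots.
# The function name and the target path in the usage note are assumptions.
journald_persistent_conf() {
  printf '%s\n' '[Journal]' 'Storage=persistent'
}

# Usage (as root), not run automatically here:
#   mkdir -p /etc/systemd/journald.conf.d
#   journald_persistent_conf > /etc/systemd/journald.conf.d/persistent.conf
#   systemctl restart systemd-journald
#   journalctl --list-boots
```

With this in place the journal should survive the next reboot, so errors from before a crash remain available.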
+
+First, back up the existing database(!). Next, restore the new DB by changing the DB location (update the symlink in /var/lib/mysql and check /etc/mysql/mariadb.cnf).
+
+When upgrading it is a good idea to switch these on in mariadb.cnf
+
+```
+# forcing recovery with these two lines:
+innodb_force_recovery=3
+innodb_purge_threads=0
+```
+
+Make sure to disable (and restart) once it is up and running!
+
+So the steps are:
+
+* [X] install updated guix version of mariadb in /usr/local/guix-profiles (don't use Debian!!)
+* [X] repair borg backup
+* [X] Stop old mariadb (on new host tux02)
+* [X] backup old mariadb database
+* [X] restore 'sane' version of DB from borg March 1st
+* [X] point to new DB in /var/lib/mysql and cnf file
+* [X] update systemd settings
+* [X] start mariadb new version with recovery setting in cnf
+* [X] check logs
+* [X] once running, revert the recovery setting in cnf and restart
+
+OK, looks like we are in business again. In the next phase we need to validate files.
Normal files can be checked with + +``` +find -type f \( -not -name "md5sum.txt" \) -exec md5sum '{}' \; > md5sum.txt +``` + +and compared with another set on a different server with + +``` +md5sum -c md5sum.txt +``` + +* [X] check genotype file directory - some MAGIC files missing on tux01 + +gn-docs is a git repo, so that is easily checked + +* [X] check gn-docs and sync with master repo + + +## Other servers + +``` +journalctl -r|grep -i "I/O error"|less +# tux05 +Nov 18 02:19:55 tux05 kernel: XFS (sdc2): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x78 len 8 error 74 +Nov 05 14:36:32 tux05 kernel: blk_update_request: I/O error, dev sdb, sector 1993616 op 0x1:(WRITE) flags +0x0 phys_seg 35 prio class 0 +Jul 27 11:56:22 tux05 kernel: blk_update_request: I/O error, dev sdc, sector 55676616 op 0x0:(READ) flags +0x80700 phys_seg 26 prio class 0 +Jul 27 11:56:22 tux05 kernel: blk_update_request: I/O error, dev sdc, sector 55676616 op 0x0:(READ) flags +0x80700 phys_seg 26 prio class 0 +# tux06 +Apr 15 08:10:57 tux06 kernel: I/O error, dev sda, sector 21740352 op 0x1:(WRITE) flags 0x1000 phys_seg 4 prio class 2 +Dec 13 12:56:14 tux06 kernel: I/O error, dev sdb, sector 3910157327 op 0x9:(WRITE_ZEROES) flags 0x8000000 phys_seg 0 prio class 2 +# tux07 +Mar 27 08:00:11 tux07 mfschunkserver[1927469]: replication error: failed to create chunk (No space left) +# tux08 +Mar 27 08:12:11 tux08 mfschunkserver[464794]: replication error: failed to create chunk (No space left) +``` + +Tux04, 05 and 06 show disk errors. Tux07 and Tux08 are overloaded with a full disk, but no other errors. We need to babysit Lizard more! 
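The per-host grep above can be condensed into a per-device count, which makes it easier to see whether one disk or a whole controller is affected. A minimal sketch, assuming kernel lines of the form shown in this log (`I/O error, dev sdX, ...`); the function name is illustrative.

```shell
#!/bin/sh
# Illustrative helper: read journal text on stdin and print "<device> <count>"
# for every kernel line of the form "I/O error, dev <device>, sector ...".
# Filesystem-level lines (such as the XFS metadata error) are not counted.
io_errors_by_device() {
  grep 'I/O error, dev ' \
    | sed -E 's#.*I/O error, dev ([A-Za-z0-9]+),.*#\1#' \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on a host (needs permission to read the journal):
#   journalctl --no-pager | io_errors_by_device
```

A device that dominates the count points at that disk or its slot; errors spread evenly over all devices point more towards the controller or backplane.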
+ +``` +stress -v -d 1 +``` + +Write test: + +``` +dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct +``` + +Read test: + +``` +/sbin/sysctl -w vm.drop_caches=3 +dd if=./test of=/dev/zero bs=512k count=2048 +``` + + +smartctl -a /dev/sdd -d megaraid,0 + +RAID Controller in SL 3: Dell PERC H755N Front + +# The story continues + +I don't know what happened but the server gave a hard +error in the logs: + +``` +racadm getsel # get system log +Record: 340 +Date/Time: 05/31/2025 09:25:17 +Source: system +Severity: Critical +Description: A high-severity issue has occurred at the Power-On +Self-Test (POST) phase which has resulted in the system BIOS to +abruptly stop functioning. +``` + +Woops! I fixed it by resetting idrac and rebooting remotely. Nasty. + +Looking around I found this link + +=> +https://tomaskalabis.com/wordpress/a-high-severity-issue-has-occurred-at-the-power-on-self-te +st-post-phase-which-has-resulted-in-the-system-bios-to-abruptly-stop-functioning/ + +suggesting we should upgrade idrac firmware. I am not going to do that +without backups and a fully up-to-date fallback online. It may fix the +other hardware issues we have been seeing (who knows?). + +Fred, the boot sequence is not perfect yet. Turned out the network +interfaces do not come up in the right order and nginx failed because +of a missing /var/run/nginx. The container would not restart because - +missing above - it could not check the certificates. + +## A week later + +``` +[SMM] APIC 0x00 S00:C00:T00 > ASSERT [AmdPlatformRasRsSmm] u:\EDK2\MdePkg\Library\BasePciSegmentLibPci\PciSegmentLib.c(766): ((Address) & (0xfffffffff0000000ULL | (3))) == 0 !!!! X64 Exception Type - 03(#BP - Breakpoint) CPU Apic ID - 00000000 !!!! 
+RIP - 0000000076DA4343, CS - 0000000000000038, RFLAGS - 0000000000000002 +RAX - 0000000000000010, RCX - 00000000770D5B58, RDX - 00000000000002F8 +RBX - 0000000000000000, RSP - 0000000077773278, RBP - 0000000000000000 +RSI - 0000000000000087, RDI - 00000000777733E0 R8 - 00000000777731F8, R9 - 0000000000000000, R10 - 0000000000000000 +R11 - 00000000000000A0, R12 - 0000000000000000, R13 - 0000000000000000 +R14 - FFFFFFFFA0C1A118, R15 - 000000000005B000 +DS - 0000000000000020, ES - 0000000000000020, FS - 0000000000000020 +GS - 0000000000000020, SS - 0000000000000020 +CR0 - 0000000080010033, CR2 - 0000000015502000, CR3 - 0000000077749000 +CR4 - 0000000000001668, CR8 - 0000000000000001 +DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 +GDTR - 000000007773C000 000000000000004F, LDTR - 0000000000000000 IDTR - 0000000077761000 00000000000001FF, TR - 0000000000000040 +FXSAVE_STATE - 0000000077772ED0 +!!!! Find image based on IP(0x76DA4343) u:\Build_Genoa\DellBrazosPkg\DEBUG_MYTOOLS\X64\DellPkgs\DellChipsetPkgs\AmdGenoaModulePkg\Override\AmdCpmPkg\Features\PlatformRas\Rs\Smm\AmdPlatformRasRsSmm\DEBUG\AmdPlatformRasRsSmm.pdb (ImageBase=0000000076D3E000, EntryPoint=0000000076D3E6C0) !!!! +``` + +New error in system log: + +``` +Record: 341 Date/Time: 06/04/2025 19:47:08 +Source: system +Severity: Critical Description: A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning. +``` + +The error appears to relate to AMD Brazos which is probably part of the on board APU/GPU. + +The code where it segfaulted is online at: + +=> https://github.com/tianocore/edk2/blame/master/MdePkg/Library/BasePciSegmentLibPci/PciSegmentLib.c + +and has to do with PCI registers and that can actually be caused by the new PCIe card we hosted. 
+ +# Sept 2025 + +We moved production away from tux04, so now we should be able to work on this machine. + + +## System crash on tux04 + +And tux04 is down *again*. Wow, glad we moved off! I want to fix that machine and we had to move production off! I left the terminal open and the last message is: + +``` +tux04:~$ [SMM] APIC 0x00 S00:C00:T00 > ASSERT [AmdPlatformRasRsSmm] u:\EDK2\MdePkg\Library\BasePciSegmentLibPci\PciSegmentLib.c(766): ((Address) & (0xfffffffff0000000ULL | (3))) == 0 +!!!! X64 Exception Type - 03(#BP - Breakpoint) CPU Apic ID - 00000000 !!!! +RIP - 0000000076DA4343, CS - 0000000000000038, RFLAGS - 0000000000000002 +RAX - 0000000000000010, RCX - 00000000770D5B58, RDX - 00000000000002F8 +RBX - 0000000000000000, RSP - 0000000077773278, RBP - 0000000000000000 +RSI - 0000000000000000, RDI - 00000000777733E0 +R8 - 00000000777731F8, R9 - 0000000000000000, R10 - 0000000000000000 +R11 - 00000000000000A0, R12 - 0000000000000000, R13 - 0000000000000000 +R14 - FFFFFFFFAC41A118, R15 - 000000000005B000 +DS - 0000000000000020, ES - 0000000000000020, FS - 0000000000000020 +GS - 0000000000000020, SS - 0000000000000020 +CR0 - 0000000080010033, CR2 - 00007F67F5268030, CR3 - 0000000077749000 +CR4 - 0000000000001668, CR8 - 0000000000000001 +DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 +DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 +GDTR - 000000007773C000 000000000000004F, LDTR - 0000000000000000 +IDTR - 0000000077761000 00000000000001FF, TR - 0000000000000040 +FXSAVE_STATE - 0000000077772ED0 +!!!! Find image based on IP(0x76DA4343) u:\Build_Genoa\DellBrazosPkg\DEBUG_MYTOOLS\X64\DellPkgs\DellChipsetPkgs\AmdGenoaModulePkg\Override\AmdCpmPkg\Features\PlatformRas\Rs\Smm\AmdPlatformRasRsSmm\DEBUG\AmdPlatformRasRsSmm.pdb (ImageBase=0000000076D3E000, EntryPoint=0000000076D3E6C0) !!!! 
+```
+
+and the racadm system log says
+
+```
+Record: 362
+Date/Time: 09/11/2025 21:47:02
+Source: system
+Severity: Critical
+Description: A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning.
+```
+
+I have seen that before and it is definitely a hardware/driver issue on the Dell itself. I'll work on that later. Luckily it always reboots.
diff --git a/issues/systems/tux04-production.gmi b/issues/systems/tux04-production.gmi
new file mode 100644
index 0000000..58ff8c1
--- /dev/null
+++ b/issues/systems/tux04-production.gmi
@@ -0,0 +1,279 @@
+# Production on tux04
+
+Lately we have been running production on tux04. Unfortunately Debian got broken and I don't see a way to fix it (something with python versions that break apt!). Also mariadb is giving problems:
+
+=> issues/production-container-mechanical-rob-failure.gmi
+
+and that is alarming. We might as well try an upgrade. I created a new partition on /dev/sda4 using debootstrap.
+
+The hardware RAID has proven unreliable on this machine (and perhaps others).
+
+We added a drive on a PCIe riser outside the RAID. Use this for bulk data copying. We still bootstrap from the RAID.
+
+Luckily not too much is running on this machine and if we mount things again, most should work.
+
+# Tasks
+
+* [X] cleanly shut down mariadb
+* [X] reboot into new partition /dev/sda4
+* [X] git in /etc
+* [X] make sure serial boot works (/etc/default/grub)
+* [X] fix groups and users
+* [X] get guix going
+* [X] get mariadb going
+* [X] fire up GN2 service
+* [X] fire up SPARQL service
+* [X] sheepdog
+* [ ] fix CRON jobs and backups
+* [ ] test full reboots
+
+
+# Boot in new partition
+
+```
+blkid /dev/sda4
+/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811"
+```
+
+After debootstrap there are two things to take care of: the /dev directory and grub.
For good measure +I also capture some state + +``` +cd ~ +ps xau > cron.log +systemctl > systemctl.txt +cp /etc/network/interfaces . +cp /boot/grub/grub.cfg . +``` + +we should still have access to the old root partition, so I don't need to capture everything. + +## /dev + +I ran MAKEDEV and that may not be needed with udev. + +## grub + +We need to tell grub to boot into the new partition. The old root is on +UUID=8e874576-a167-4fa1-948f-2031e8c3809f /dev/sda2. + +Next I ran + +``` +tux04:~$ update-grub2 /dev/sda +Generating grub configuration file ... +Found linux image: /boot/vmlinuz-5.10.0-32-amd64 +Found initrd image: /boot/initrd.img-5.10.0-32-amd64 +Found linux image: /boot/vmlinuz-5.10.0-22-amd64 +Found initrd image: /boot/initrd.img-5.10.0-22-amd64 +Warning: os-prober will be executed to detect other bootable partitions. +Its output will be used to detect bootable binaries on them and create new boot entries. +Found Debian GNU/Linux 12 (bookworm) on /dev/sda4 +Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi +Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2 +``` + +Very good. Do a diff on grub.cfg and you see it even picked up the serial configuration. It only shows it added menu entries for the new boot. Very nice. + +At this point I feel safe to boot as we should be able to get back into the old partition. 
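Before rebooting, a quick sanity check is to confirm that the regenerated grub.cfg actually references the new root partition. A minimal sketch, assuming the UUID reported by blkid for /dev/sda4 and the Debian default path /boot/grub/grub.cfg; the function name is illustrative. A match only shows the partition is referenced, not that the entry boots.

```shell
#!/bin/sh
# Illustrative helper: succeed if the given grub.cfg mentions the given
# filesystem UUID anywhere (for example on a "search --fs-uuid" line).
grub_cfg_mentions_uuid() {
  cfg=$1
  uuid=$2
  grep -qF -- "$uuid" "$cfg"
}

# Usage on tux04 (UUID taken from the blkid output for /dev/sda4):
#   grub_cfg_mentions_uuid /boot/grub/grub.cfg 4aca24fe-3ece-485c-b04b-e2451e226bf7 \
#     && echo "new root is referenced" || echo "new root is MISSING from grub.cfg"
```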
+
+# /etc/fstab
+
+The old fstab looked like
+
+```
+UUID=8e874576-a167-4fa1-948f-2031e8c3809f / ext4 errors=remount-ro 0 1
+# /boot/efi was on /dev/sdc1 during installation
+UUID=998E-68AF /boot/efi vfat umask=0077 0 1
+# swap was on /dev/sdc3 during installation
+UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none swap sw 0 0
+UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8" /export2 ext4 auto 0 2
+UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f" /export ext4 auto 0 2
+UUID="f006dd4a-2365-454d-a3a2-9a42518d6286" /export3 auto auto 0 2
+/export2/gnu /gnu none defaults,bind 0 0
+# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e"
+# /dev/sdb4 /export ext4 auto 0 2
+# /dev/sdd1 /export2 ext4 auto 0 2
+```
+
+# reboot
+
+Next we are going to reboot, and we need a serial connector to the Dell out-of-band using racadm:
+
+```
+ssh IP
+console com2
+racadm getsel
+racadm serveraction powercycle
+racadm serveraction powerstatus
+
+```
+
+The main trick is to hit ESC, wait 2 sec and then 2 when you want the BIOS boot menu. Ctrl-\ to escape console. Otherwise ESC (wait) ! to get to the boot menu.
+
+# First boot
+
+It still boots by default into the old root. That gave an error:
+
+[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286
+
+This is /export3. We can fix that later.
+
+When I booted into the proper partition the console clapped out. Also the racadm password did not work on tmux -- I had to switch to a standard console to log in again. Not sure why that is, but next I got:
+
+```
+Give root password for maintenance
+(or press Control-D to continue):
+```
+
+and giving the root password I was in maintenance mode on the correct partition!
+
+To rerun grub I had to add `GRUB_DISABLE_OS_PROBER=false`.
+
+Once booted, it is a matter of mounting partitions and ticking the check boxes above.
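The failed /export3 unit above is the kind of problem that can be caught before a reboot by checking every UUID in fstab against the devices currently visible. A minimal sketch, assuming util-linux `blkid` is installed; both function names are illustrative.

```shell
#!/bin/sh
# Illustrative helper: print the UUIDs referenced in an fstab-style file,
# one per line, stripping optional double quotes. Commented-out entries are
# skipped because their first field starts with '#', not with 'UUID='.
fstab_uuids() {
  awk '$1 ~ /^UUID=/ { print $1 }' "$1" \
    | sed -e 's/^UUID=//' -e 's/"//g'
}

# Illustrative helper: report fstab UUIDs that blkid cannot resolve to a device.
# `blkid -U <uuid>` prints the device name and fails when nothing matches.
missing_fstab_uuids() {
  for uuid in $(fstab_uuids "$1"); do
    blkid -U "$uuid" >/dev/null 2>&1 || echo "missing: $uuid"
  done
}

# Usage (as root, so blkid can probe all devices):
#   missing_fstab_uuids /etc/fstab
```

Any UUID reported as missing will most likely produce a failed mount unit like the one seen on first boot.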
+ +The following contained errors: + +``` +/dev/sdd1 3.6T 1.8T 1.7T 52% /export2 +``` + +# Guix + +Getting guix going is a bit tricky because we want to keep the store! + +``` +cp -vau /mnt/old-root/var/guix/ /var/ +cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/ +cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/ +cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/ +cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/ +``` + +Also had to add guixbuild users and group by hand. + +# nginx + +We use the streaming facility. Check that + +``` +nginx -V +``` + +lists --with-stream=static, see + +=> https://serverfault.com/questions/858067/unknown-directive-stream-in-etc-nginx-nginx-conf86/858074#858074 + +and load at the start of nginx.conf: + +``` +load_module /usr/lib/nginx/modules/ngx_stream_module.so; +``` + +and + +``` +nginx -t +``` + +passes + +Now the container responds to the browser with `Internal Server Error`. + +# container web server + +Visit the container with something like + +``` +nsenter -at 2838 /run/current-system/profile/bin/bash --login +``` + +The nginx log in the container has many + +``` +2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org" +``` + +that is interesting. Acme/https is working because GN2 is working: + +``` +curl https://genenetwork.org/api3/version +"1.0" +``` + +Looking at the logs it appears it is a redis problem first for GN2. + +Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. 
Machines are defined in + +``` +fredm@tux04:/export3/local/home/fredm/gn-machines +``` + +The shared dir for redis is at + +--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis + +with + +``` +root@genenetwork-production /var# ls lib/redis/ -l +-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb +``` + +In production.scm it is defined as + +``` +(service redis-service-type + (redis-configuration + (bind "127.0.0.1") + (port 6379) + (working-directory "/var/lib/redis"))) +``` + +The defaults are the same as the definition of redis-service-type (in guix). Not sure why we are duplicating. + +After starting redis by hand I get another error `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is it created +a DB in the wrong place. Alright, the logs in the container say: + +``` +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024). +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379. 
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect. +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file +Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting. +Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1. +``` + +This is caused by a newer version of redis. This is odd because we are using the same version from the container?! + +Actually it turned out the redis DB was corrupted on the SSD! Same for some other databases (ugh). + +Fred copied all data to an enterprise level storage, and we rolled back to some older DBs, so hopefully we'll be OK for now. + +# Reinstating backups + +In the next step we need to restore backups as described in + +=> /topics/systems/backups-with-borg + +I already created an ibackup user. Next we test the backup script for mariadb. 
+
+One important step is to check the database:
+
+```
+/usr/bin/mariadb-check -c -u user -p* db_webqtl
+```
+
+A successful mariadb backup consists of multiple steps
+
+```
+2025-02-27 11:48:28 +0000 (ibackup@tux04) SUCCESS 0 <32m43s> mariabackup-dump
+2025-02-27 11:48:29 +0000 (ibackup@tux04) SUCCESS 0 <00m00s> mariabackup-make-consistent
+2025-02-27 12:16:37 +0000 (ibackup@tux04) SUCCESS 0 <28m08s> borg-tux04-sql-backup
+2025-02-27 12:16:46 +0000 (ibackup@tux04) SUCCESS 0 <00m07s> drop-rsync-balg01
+```
diff --git a/issues/xapian_bug.gmi b/issues/xapian_bug.gmi
index f11b604..068d8eb 100644
--- a/issues/xapian_bug.gmi
+++ b/issues/xapian_bug.gmi
@@ -5,6 +5,7 @@
 * assigned: zsloan
 * priority: high
 * type: search
+* status: closed
 * keywords: xapian, gn2, gn3

 ## Description
diff --git a/miscellaneous/broken-guix-on-tux04-20241108.org b/miscellaneous/broken-guix-on-tux04-20241108.org
new file mode 100644
index 0000000..a3eb550
--- /dev/null
+++ b/miscellaneous/broken-guix-on-tux04-20241108.org
@@ -0,0 +1,162 @@
+* Broken Guix on Tux04 Troubleshooting Log
+
+Noticed at around 15:30 (UTC-0500) that guix broke for no apparent reason and I
+could no longer do ~guix pull~ without a "Segmentation fault" error.
+
+I could not even do a ~guix search~ without the error.
+
+Began today (2024-11-08UTC-0500) by attempting a pull, and still got the Segmentation fault.
+ +Try checking out an older generation + +#+begin_src sh + $ /home/fredm/opt/guix/bin/guix -p ~/opt/guix --list-generations + ⋮ + $ /home/fredm/opt/guix/bin/guix -p ~/opt/guix --delete-generations=46 + ⋮ + Segmentation fault + $ /home/fredm/opt/guix/bin/guix -p ~/opt/guix --delete-generations=40..47 + ⋮ + Segmentation fault +#+end_src + +Try pulling with someone else's guix: + +#+begin_src bash + /var/guix/profiles/per-user/aruni/current-guix/bin/guix --profile=/export3/local/home/fredm/opt/guix --channels=channels.scm +#+end_src + +#+begin_example +(repl-version 0 1 1) +Backtrace: + 1 (primitive-load "/gnu/store/2lpf5rl0dhq1v8wjfnfmglvr0p2?") +In ice-9/boot-9.scm: + 2007:7 0 (error _ . _) + +ice-9/boot-9.scm:2007:7: In procedure error: +inferior failed "/gnu/store/5rac7n9ya6qrpbz42gpn96sa47dkz7rj-guix-2394a7f5f" +#+end_example + +Fine, maybe the issue is my profile: delete all links and do a ~guix gc~ to clear any issues. + +Upgrade guix daemon + +#+begin_src bash + $ sudo -i /var/guix/profiles/per-user/aruni/current-guix/bin/guix pull + ⋮ + $ sudo systemctl stop guix-daemon.service + $ sudo systemctl start guix-daemon.service + $ sudo systemctl status guix-daemon.service +#+end_src + +Try again: No success. + +Try: +#+begin_src shell + $ /var/guix/profiles/per-user/aruni/current-guix/bin/guix describe --format=channels > channels-aruni.scm + $ /var/guix/profiles/per-user/aruni/current-guix/bin/guix pull -C channels-aruni.scm -p /home/fredm/opt/guix + $ /home/fredm/opt/guix/bin/guix search python-requests +#+end_src + +Okay, now that gives me an output! 
+ +Try pulling latest channels: + +#+begin_src shell + $ /home/fredm/opt/guix/bin/guix pull -C channels.scm -p ~/opt/guix + ⋮ + Building from these channels: + guix-bioinformaticshttps://git.genenetwork.org/guix-bioinformatics 3659ecd + guix-forgehttps://git.systemreboot.net/guix-forge/ 9864821 + guix-past https://gitlab.inria.fr/guix-hpc/guix-past 5fb77cc + guix https://git.savannah.gnu.org/git/guix.git 2394a7f + ⋮ + building /gnu/store/gzm7jb7hj2bmfhdjzsprar8lacz34ncr-guix-past.drv... + -builder for `/gnu/store/gzm7jb7hj2bmfhdjzsprar8lacz34ncr-guix-past.drv' failed with exit code 1 + build of /gnu/store/gzm7jb7hj2bmfhdjzsprar8lacz34ncr-guix-past.drv failed + View build log at '/var/log/guix/drvs/gz/m7jb7hj2bmfhdjzsprar8lacz34ncr-guix-past.drv.gz'. + cannot build derivation `/gnu/store/rwarhjfc9kymw5k5wvl0pz658bmmmqvh-profile.drv': 1 dependencies couldn't be built + guix pull: error: build of `/gnu/store/rwarhjfc9kymw5k5wvl0pz658bmmmqvh-profile.drv' failed +#+end_src + +Okay, view the logs: +#+begin_src sh + $ gunzip -kc /var/log/guix/drvs/gz/m7jb7hj2bmfhdjzsprar8lacz34ncr-guix-past.drv.gz + (repl-version 0 1 1) + Backtrace: + 1 (primitive-load "/gnu/store/2lpf5rl0dhq1v8wjfnfmglvr0p2?") + In ice-9/boot-9.scm: + 2007:7 0 (error _ . _) + + ice-9/boot-9.scm:2007:7: In procedure error: + inferior failed "/gnu/store/5rac7n9ya6qrpbz42gpn96sa47dkz7rj-guix-2394a7f5f" +#+end_src + +Oh boy! 
+ +How about we start with @aruni's channels and pin everything except +=guix-bioinformatics= in the code below: +#+begin_src scheme + (list (channel + (name 'guix-bioinformatics) + (url "https://git.genenetwork.org/guix-bioinformatics") + (branch "master")) + (channel + (name 'guix-forge) + (url "https://git.systemreboot.net/guix-forge/") + (branch "main") + (commit + "9864821f976b3de355138c12f287a19fdced67df") + (introduction + (make-channel-introduction + "0432e37b20dd678a02efee21adf0b9525a670310" + (openpgp-fingerprint + "7F73 0343 F2F0 9F3C 77BF 79D3 2E25 EE8B 6180 2BB3")))) + (channel + (name 'guix-past) + (url "https://gitlab.inria.fr/guix-hpc/guix-past") + (branch "master") + (commit + "5fb77cce01f21a03b8f5a9c873067691cf09d057") + (introduction + (make-channel-introduction + "0c119db2ea86a389769f4d2b9c6f5c41c027e336" + (openpgp-fingerprint + "3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5")))) + (channel + (name 'guix) + (url "https://git.savannah.gnu.org/git/guix.git") + (branch "master") + (commit + "b0b988c41c9e0e591274495a1b2d6f27fcdae15a") + (introduction + (make-channel-introduction + "9edb3f66fd807b096b48283debdcddccfea34bad" + (openpgp-fingerprint + "BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA"))))) +#+end_src +Put the code above in =channels.scm= and retry: + +#+begin_src sh + $ /home/fredm/opt/guix/bin/guix pull -C channels.scm -p ~/opt/guix + ⋮ + Building from these channels: + guix-bioinformaticshttps://git.genenetwork.org/guix-bioinformatics 3659ecd + guix-forgehttps://git.systemreboot.net/guix-forge/ 9864821 + guix-past https://gitlab.inria.fr/guix-hpc/guix-past 5fb77cc + guix https://git.savannah.gnu.org/git/guix.git b0b988c + ⋮ +#+end_src + +Success! + +Okay, now, try rebuilding the container: +#+begin_src sh + $ env PATH="/home/fredm/opt/guix/bin:${PATH}" ./production-deploy.sh +#+end_src + +Success! + +Check that https://gn2-fred.genenetwork.org/ is up and running: Yes! + +Okay. We are back! 
diff --git a/tasks/alexm.gmi b/tasks/alexm.gmi index 88d3927..e410046 100644 --- a/tasks/alexm.gmi +++ b/tasks/alexm.gmi @@ -1,4 +1,4 @@ -# Tasks for Fred +# Tasks for Alex ## Description @@ -16,11 +16,172 @@ You can refine the search by constraining the checks some more, e.g. to get high # Tasks -* [ ] Make GNQA reliable (with @fahamu) -* [ ] Improve UX for GNQA (with @shelbys) -* [ ] GNQA add abstracts pubmed (with @shelbys) +## This week + +* [ ] Start application - Pwani +* - [X] Got all transcripts +* [+] Correlations - Fred is having issues - Rust updated on Guix +* - also take a look at long running SQL statement and large LIMIT value (check prod!) +* [ ] Friend of UTHSC - Pjotr needs to send forms +* [+] Disable spinner on production (check prod!) +* [+] Rqtl2 - BXD output work on CD +* - [ ] should go to production w. fredm + Disable for Production +* - [X] DO mice family file - children are heterozygous - family file contains parents->child +* - [X] DO GN2 compatible by generating .geno files + Test on CD +* [+ ] Minor refactorings - Rqtl2 is hacky +* [ ] Work in development system container and document +=> https://git.genenetwork.org/gn-machines/commit/?h=gn-local-development-container&id=589dcf32be90f5ec827cb6976d3cb5838d500ac0 +* [+] Create terminal output for external processes on *PRODUCTION* (Rqtl1, Rqtl2, GEMMA, pair-scan are done --- WGCNA as a pilot, with @bonfacem and @pjotrp) + + +## (14/4/25) + +* [x] Debug DO results for for genenetwork2 + * [x] inspect results from gn3 and display mapping results + * [x] Debug db tunneling connection + * [x] Debug rendering huge datatables + +## (21/4/25) +* [x] QTL computation for the DO dataset + * [x] Debug rendering large datasets using datatables + * [x] fix issue with qtl2 plot for DO dataset + * [x] Caching for qtl2 computations + +* [] Pwani Campus Application + +## 28/4/25 + +* [x] Push changes to CD/Production +* [x] Enable RQTL2 only for DO/bxd dataset +* [] look at integrating QTL for HS dataset 
+* [x] setup local container with bons + +## 5/05/25 + +* [] Integrate hsrat dataset for rqtl2 mapping. +* [] Pwani campus application. +* [] Look at caching for genotype probabilities (rqtl2). +* [] Add full logs on the mapping results page. +* [x] Add test feature flag for rqtl2. + +## 2/06/2025 + +* work onsubset for hs dataset;; define founder genotype files?? +* script to dump genotypes to db with bons +* experiment with caching for Genotypic probabilities rds objects +* work on genenetwork llms how to make search without login + +* masters ; submit documents + +## 9/06/2025 + +* [x] no login for gnqna +* [-] hsrat rqtl2 integration: follow up on the dumping genotypes files to lmdb + + +## 16/06/2025 + +* [x] implementation for no login for gnqa users. + see issue here: https://issues.genenetwork.org/issues/gnqa/implement-no-login-requirement-for-gnqa +* [-] hsrat rqtl2 integration: follow up on the dumping genotypes files to lmdb with bons +* [] create rqtl2 adapter for reading cross from lmdb + + +## 23/06/2025 + +* [-] focus on documentation/source code for read cross ;;add option for reading lmdb +* [-] looking at thor an lmdb interface +* [x] implementation of no login for gnqna users ;; TODO push this code to cd. +* [x] for nologin llm provide the correct feedbacks to user if ai search does not meet criteria;; currently only static response `Please login to view AI generated summary` + + +## 30/06/2025 + +* [] rqtl2 lmdb interfac +e + +## 7/7/2025 + +* [x] guix packaging for thor package +* [x] reading metadata from lmdb file using thor +* [-] working on reconstructing the genotype files to geno, geno_map and pheno_map file +* [x] rate limiting for gnqna users. 
+* [x] check on tokens for no logged in users +* [x] Look at issue about signing tokens for non logged in users + + +## 21/7/2025 +* [x] working on reconstructing/reading the genotype files based on bons dumping script +* [x] generate and validate cross objects + +## 29/7/2025 +* [x] adding founder_geno and pheno covariates pheno +* [-] minor fixes for lmdb matrix script (missing metadata) +* [-] check phenotype work for lmdb + +## 4/8/2025 +* [-] refactoring lmdb matrix script +* [-] integrating fetching rqtl2 from lmdb using bxd as a test pilot + + +## 11/8/2025 + +* [x] integrating lmdb genotypes for rqtl2 computation for BXD + + +## 18/8/2025 + +* [x] integrating lmdb genotypes for rqtl2 computation for BXD + +## 25/8/2025 +* [x] script to dump phenotypes and cross metadata to lmdb + +## 31/8/2025 + +* [x] generic script that can parse json cross files and dump the metadata to lmdb +* [x] follow up on dumped phenotypes in LMDB for GeneNetwork cc @bons +* [x] integrating lmdb rqtl2 adapter to CD (test this on cd for bxd.) + +## 23/9/2025 +* [x] dumping phenotypes to lmdb (BXDPublish) +* [x] dumping cross metadata to lmdb + +## 30/9/2025 +* [x] profiling and benchmarking read_lmdb_cross against read_cross +* [-] Run this on tux02 and integrate to CD + + +## 7/9/2025 + +* [-] integrate rqtl2-lmdb adapter to CD cc @bons with directory setup. +* [x] improvement on rqtl2-lmdb adapter script; add covariate support. + + +## 15/9/2025 + +* [-] clone qtl2 repo and build rqtl2-lmdb package locally. +* [] package rqtl2-lmdb package to guix-bioinformatics. + +## Next week(s) + +* [ ] Accelerate Xapian functionality - needs Aider key from Pjotr +* Check and fix CTL?
+* [+] Create terminal output for external processes (Rqtl1, Rqtl2, pair-scan are done --- WGCNA as a pilot, with @bonfacem and @pjotrp) +* [X] GNQA says there are no results, but has them +* [X] Correlations are slow + +## Done + +* [X] Rqtl1 - ITP output - 3K individuals - family file +* [X] When bonz is ready wire up GNQA +* + balg-qa.genenetwork.org +* [X] Don't support new PIL - stick to the old one in guix-bioninformatics +* [X] Make GNQA reliable (with @fahamu) +* [X] Improve UX for GNQA (with @shelbys) -- Adrian wants to use our AI UX for their setup +* [X] GNQA add abstracts pubmed (with @shelbys) => ../issues/fetch-pubmed-references-to-gnqa +* [X] Edit markdown/gemtext pages through web UI (with @bonfacem) + -* [ ] Edit markdown/gemtext pages through web UI (with @bonfacem) -* [ ] GNQA add GN metadata with @bonfacem -* [ ] Create terminal output for external processes (WGCNA as a pilot, with @bonfacem and @pjotrp) diff --git a/tasks/bonfacem.gmi b/tasks/bonfacem.gmi index 52f4027..2d56e72 100644 --- a/tasks/bonfacem.gmi +++ b/tasks/bonfacem.gmi @@ -4,13 +4,71 @@ * kanban: bonfacem * assigned: bonfacem -* status: in progress ## Tasks -* [X] Indexing generif data / Improve Local Search -* [ ] Add hashes to RDF metadata -* [-] Brain Data (To be spec'ed further) +### Note +* Don't lose metadata. Have an array of disabled snips. +* Store by snip (rows). Storage by marker. 2 different files. +* gn-auth: + have wrappers around gn-auth (draw-back: folk may forget). + use Nginx as a MTM (re-routing calls). How to add handler in Nginx and to work with tokens. +* GN-auth dashboard fixes. Follow up with Fred. +* Case-attributes used in co-variates. + +### PhD Work + +* Concept note/ideas: Add all metadata in GN to an LLM that enriches GnQA. +* Use mapping output as full vectors for gpt/transformers. Integrate this work into GN. +* Share PhD concept note with PJ for polishing + +### This week +* [] Look at deep-seek/anthropic (also really doc deployment in balg01). 
Run in debian machine. +* [~] Adapter to LMDB into a cross object. +* - Try computations with R/qtl2. +* - Look at R LMDB libraries. +* - Look at functions that read the files. +* - PJ: LMDB adapter in R and cross-type files. +* [~] gn-guile webhook. +* [~] Dump all genotypes from tux02 to LMDB. +* - PJ sync tux01 genotypes with tux02/04. +* - Yet to set-up 2FA on new device + +### Later +* [ ] Generate RDF docs using AI. +* [ ] Editing genotype metadata +* [ ] Look at XAPIAN search for gene alias. +* [ ] Add GeneNetwork abstracts to XAPIAN search. +* [+] Correlations hash. +* - Add dataset count to RDF. +* [ ] Spam + LLMs +* - RateLimiting for Rif Editing. +* - Honepot approach. +* [ ] Dockerise GN container. For Harm. +* [ ] Send emails when job fail. +* [ ] Look at updating gn-auth/gn-libs to PYTHONPATH for gn2/3. +* [ ] Sample/individual/strain/genometype counts for PublishData only - ProbeSetData? https://github.com/genenetwork/genenetwork2/blob/testing/scripts/sample_count.py - mirror in RDF and use global search +* - search for all traits that have more than X samples +* [ ] Add case attributes to RDF and share with Felix (depends on @felixl) +* [ ] xapian search, add dataset size keys, as well as GN accession id, trait id, and date/year +* - Improve xapian markdown docs to show all used fields/keys with examples +* - genewiki search (link in table? check with Rob) +* - base line with GN1 search - add tests +* - Fix missing search term for sh* - both menu search and global search +* - Use GN1 as a benchmark for search results (mechanical Rob?) +* - Xapian ranges for markers + +### Even later + +* [ ] Rest API for precompute output (mapping with GEMMA) +* [ ] GNQA add GN metadata (to RAG) +* - Focus on RIF +* - triple -> plain text +* - bob :fatherof nancy -> Bob is the father of Nancy. 
+ +## Later + +* [ ] AI improvements ### On going tasks @@ -34,3 +92,76 @@ Should something in one of these closed issues be amiss, we can always and shoul Currently closed issues are: => https://issues.genenetwork.org/search?type=closed-issue&query=assigned%3ABonfaceKilz%20AND%20type%3Aissue%20AND%20is%3Aclosed Closed Issues + +* [X] Indexing generif data / Improve Local Search +* [X] lmdb publishdata output and share with Pjotr and Johannes + +## Done +* [X] Updated Penguin2 with cuda drivers. Cuda no longer supports K80s +* [X] Provided Johannes anthropic tokens +* [X] Add lmdb output hashes with index and export LMDB_DATA_DIRECTORY +* [X] Share small database with @pjotrp and @felixl +* [X] With Alex get rqtl2 demo going in CD (for BXD) +* [X] Set up meeting with ILRI +* - Zasper https://news.ycombinator.com/item?id=42572057 - Alan +* [X] Migrate fahamuai RAG to VPS and switch tokens to GGI OpenAI account +* 1. Running AI server using (our) VPS and our tokens +* + Pjotr gives API key - OpenAI - model? +* 2. Read the code base - Elixir is plumbing incl. authentication, Python processing text etc. +* 3. Try ingestion and prompt (REST API) - check out postgres tables +* 4. Backup state from production Elixir +* 5. Assess porting it to Guix (don't do any work) - minimum version Elixir +* 6. Get docs from Shelby/Brian +* [X] Set-up grobit on balg01 +* - guix docker/native +* - recent breaking changes +* [X] GeneRIF +* - Merge recent changes first. Ping Rob. +* - Brainstorm ideas around log-in. +* - Unlimited tokens that don't expire. +* - Sync prod with CD -- sqlite. +* - Add deletion +* [X] Describe Generif/wikidata access for Rob in an email with test account on CD +* 1. Send email to Rob +* 2. Work on production w. Fred +* [X] Distinguish CD from production -- banners/buttons/colors. +* [X] Use aider - give a presentation in the coming weeks +* [X] gn-auth fixes +* [X] Assess Brian's repo for deployment. 
+* [X] Finish container work +* - View diffs in BXD: Edit case attributes throws an error. +* [X] Check small db from: https://files.genenetwork.org/database/ +* [X] Changes to Production + (Alex) +* [X] File issue with syslog +* [X] LMDB database. +* - Simplify (focus on small files). Don't over-rely on Numpy. +* [X] Assess adding GeneRIF to LLM. +* [X] Referrer headers -- a way of preventing bots beyond rate-limiting. +* [X] Python Fahamu. +* [X] Memvid - brief look. +* [X] Encourage FahamuAI to be open. +* Another paper with his group should be out this month +* [X] Help Alex with SSL certification container error. +* - Fix SSL issues in local container. +* [X] Send Arun an e-mail on how to go about upgrading shepherd. +* [X] Case Attributes. +* - Git blame. Add tests. Fred. +* - NOTE: Fixed the diffs. But there's an edge-case with BXD longevity (I haven't checked. Shared scripts) +* - NOTE: Elpy broke. Eglot/lspemacs doesn't work. +* - NOTE: Moved away from storing diffs in files to LMDB. +* - Error when checking the history. Fixed by fixing the diffs. +* - Reach out to Zach. NOTE: Timing differences. +* - Disable diff in the UI - unnecessary. +* [X] Added LMDB_PATH to dev container. Updated old commits. +* [X] Merged no-login AI work that Alex did. +* [X] Talk to Fred and hand over case-attributes. +* [X] Distinct admin and dev user. [w/ Fred] +* - Extra fluff to grant dev user access to everything. +* [X] Merged rate-limiter. +* [X] Look at slow running CD (look at issue tracker and be systematic). +* [X] Fix CD. Build guix against a recent pinned profile. +* [X] Fix CD tests. +* [X] Look at different provider(s) for LLMs. +* [X] Install OpenCL in Penguin2 and try LLM script. Check differences between OpenCL and CUDA. PJ installed CUDA in balg01. +* [X] Look at container work. Look at permissions issue. +* [X] Set-up wolfshead: Resolving dependency conflicts in Python. 
Using DSPY diff --git a/tasks/felixl.gmi b/tasks/felixl.gmi index 209e8c9..7a472a1 100644 --- a/tasks/felixl.gmi +++ b/tasks/felixl.gmi @@ -1,4 +1,4 @@ -# Tasks for Munyoki +# Tasks for Felix ## Tags @@ -6,12 +6,231 @@ * assigned: felixl * status: in progress -## October +## Tasks + +### Goals + +1. Write papers for PhD +2. Load data into GN - serve the communities +3. Get comfortable with programming + +#### Previous week(s) + +* [x] Restless Legs Syndrome (RLS) - 'Traditional Phewas' - AI aspect - Johannes +* [+] Finalize the slide deck - so it can be read on its own +* [.] Review paper: one-liners for @pjotrp - why is this important for GN and/or thesis +* - [ ] list of relevant papers with one-liners - the WHY +=> https://pmc.ncbi.nlm.nih.gov/articles/PMC3294237/ +* [+] Analyse and discuss BXD case attributes with Rob --- both group level and dataset level +* [ ] Sane representation of case attributes in RDF with @bonfacem +* [X] Present C.elegans protocol and example mappings with GEMMA/Rqtl +* [ ] Uploader - setting up code with @fredm +* - [ ] Concrete improvement to work on +* - [X] run small database mysql locally +* - [X] aider with Sonnet + code fixes +* - [ ] document - add to code base - merge with Fred's tree - share changes with Pjotr & team +* [ ] Sort @alexm application with Pwani = this week + +### This week (07-04-2025 onwards) + +* GN2 tasks + * [ X ] Progress on Kilifish + - meet with Dennis (send him an email with all the queries needed) + - progress to format and upload data to gn2 (to be ready by latest Friday!) + * [ X ] Make a milestone with genotype smoothing + +* PhD tasks + * [ X ] Complete and share concept note and timeline to supervisors, have a meeting for progress + * [+] Make a milestone on chapter one manuscript (deep dive into the selected papers){THE BIG PICTURE; a complete draft by early May} + +* Programming + * [ ] Make a milestone with the uploader (really push and learn!) 
+ - documentation (use ai); add to the code base of the uploader + - utilise the hurdles to learn programming principles in action + +### This week (14-04-2025 onwards) + +* gn-uploader programming + * [X] - Resolve the config file issue with your local uploader + * [ ] - Run the uploader locally, then break the system, see how components connect to each other + * [ ] - document your findings + +* genotype smoothing + * [ ] - resolve errors with plotting, document your findings + +### This week (21-04-Onwards) + +* genotype smoothing + * [ ] - haplotyping tools for smoothing (plink, etc.) + - see what it can offer with smoothing. See what others say about this. + +* gn-uploader programming + * [ ] - Run the uploader locally, then break the system, see how components connect to each other (ask help from Bonz) + * [ ] - document your findings + +### This week (28-04-Onwards) +* gn-uploader programming + * [X] - Run the uploader locally, then break the system, see how components connect to each other (ask help from Bonz) + * [X] - document your findings + {Get help from your teammates/AI to jump start this!, swallow your pride! :(} + +* genotype smoothing + * [X] Keep refining the following: + * [X] filtering power adapted from plink + * [X] the xsomes mix up in the plot (probably the phenotype data?)
+ * [X] Update findings and push to github + +### This week (05-05-Onwards) +* programming (gn-uploader) + * [ ] - pick one file each day, review it, understand it + * [ ] - pair programming with Alex on test runs + +* HS rats scripts + * [ ] - prepare/refine scripts to quickly process HS rats file + * [ ] - assist alex with hs rats cross info + +* AOBs + * [ X ] Weekly meetings + * [ X ] follow up with Paul on his progress + * [ X ] follow up on the MSc bioinformatics project + * [ X ] follow up on Alex's application with Pwani + +### (12-05-onwards) + * [X] - HS genotypes scripting + +### (19-05-onwards) + * [X] - HS genotypes debugging (memory issue) + * [X] - pair programming with Bonz to improve the script + +### this week (26-05-onwards) + * [X] - process the genotype file for hs rats + * [X] - approach by tissues categories + * [X] - adipose and liver + - test by Xsomes for memory capture + - run the working commands + * [X] - the rest 10 other tissues (in progress) + * [X] - *.bed file vs the updated vcf files from the website? 
+ +### this week (02-06-onwards) +* [X] - process the genotypes for the rest of the 10 tissues for HS rats +* [X] - document the new findings about smoothing using bcftools and plink +* ## this week (09-06-onwards) +* [+] - identify start and end points for haplotypes in hs genotype files +* [+] - upload the final updates to gn2, test and see the results +* [-] - gn-uploader/uploader folder, explore + +* ## this week (16-06-onwards) +* [X] - hs rats proximal and distal haplotype edges +* [+] - uploading kilifish using the backend route + +* ## this week (23-06-onwards) +* [X] - hs rats recombination counts +* [+] - kilifish to gn2 via backend + +* ## this week (30-06-onwards) +* [ ] - mapping offspring to founders (hs rats) +* [ ] - upload kilifish to genenetwork +* [ ] - revise celegans smoothing (genotypes) + +* ## this week (07-07-onwards) +* [X] - generate haplotypes for offspring and founders combined; interpretation next +* [+] - keep improving the uploader via data uploading and error solving +* [-] - close smoothing revision for celegans, as left before +* [X] - why should people read my paper on improving genotyping methods? +* - on smoothing (low density genotypes for mapping, high density genotypes for fine mapping) +* - liftovers due to reference versions (currently, a challenge to be looked upon) +* - founders and their offspring in genotyping +* - pangenomics and machine learning for improved genotyping + +** keys (+; in progress, X; done, -; not yet) +* ## this week (14-07-onwards) +* [+] - map founders to offspring, work with only pure recombinations + [+] - tools available? (plink, rqtl2, beagle, etc) + [+] - custom pipeline, to reflect gaps in the existing tools?
(dealing with multiparent species) + [+] - documentation for the paper write up + +* ## this week (21-07-onwards) +* [ ] - HS rats smoothing continues +* [ ] - documenting the milestones +* [ ] - see the possibility to write a tool from it +* [ ] - Pushing kilifish to genenetwork2/learn the source code build up +* [ ] - resmoothen celegans genotypes with the new knowledge + +* ## this week (28-07-onwards) +* [-] - predict genotype probabilities with rqtl2 functions + - problems with control setup to load in the needed files for the functions +* [+] - comparison models for @individual rat vs 8 founders (similarities and percentage composition) + [+] - ongoing discussion with alex, there's progress + +* ## this week (04-08-onwards) +* [+] - Testing the logic to infer Hs outbred genotypes with the founders + - Managed to identify parents of origin for each snp on each rat per position, corresponding to the 8 founders + - Still, need to filter in the distinctive snps, then generate haplo blocks + +* ## this week (11 - 08 - onwards) +* [X] - generate final haplo file and document +* [+] - testing on local gemma and in gn2 + +* ## this week (18-08-onwards) +* [+] - push for the file to be in gn2, and feedback from the team +* [X] - complete the local gemma run, interpret the results +* [+] - process the rest of the Xsomes for a ready file to go to gn2 +* - issues: over filtering snps, neglecting the one parent of origin, takes long to run.
+* [+] - prepare an abstract for CTC conference in Barcelona + +* ## this week (01-09-onwards) +* [ ] - finetune abstract +* - include more of what I achieved: main focus; genotype smoothing on models with complex traits +* - thought map: generate plots, compare before and after smoothing, check for overlaps, and whether or not the peaks in traits are same before and after smoothing +* [ ] - troubleshoot inferring scripts for all Xsomes +* - request bonz/alex's help on this (to save time) + + +* ## this week (30-06-onwards) +* [X] - mapping offspring to founders (hs rats) +* [+] - upload kilifish to genenetwork +* [-] - revise celegans smoothing (genotypes) + +* ## this week (07-07-onwards) +* [X] - generate haplotypes for offspring and founders combined; interpretation next +* [+] - keep improving the uploader via data uploading and error solving +* [-] - close smoothing revision for celegans, as left before +* [X] - why should people read my paper on improving genotyping methods? +* - on smoothing (low density genotypes for mapping, high density genotypes for fine mapping) +* - liftovers due to reference versions (currently, a challenge to be looked upon) +* - founders and their offspring in genotyping +* - pangenomics and machine learning for improved genotyping + +** keys (+; in progress, X; done, -; not yet) +* ## this week (14-07-onwards) +* [+] - map founders to offspring, work with only pure recombinations + [+] - tools available? (plink, rqtl2, beagle, etc) + [+] - custom pipeline, to reflect gaps in the existing tools?
(dealing with multiparent species) + [+] - documentation for the paper write up + +* ## this week (21-07-onwards) +* [ ] - HS rats smoothing continues +* [ ] - documenting the milestones +* [ ] - see the possibility to write a tool from it +* [ ] - Pushing kilifish to genenetwork2/learn the source code build up +* [ ] - resmoothen celegans genotypes with the new knowledge + + +### Later weeks (non-programming tasks) + +* [ ] Kilifish into GN +* [ ] Review paper on genotyping +* [ ] HS Rat +* [ ] Prepare others for C.elegans * [ ] Upload Arabidopsis dataset * [ ] Upload Medaka dataset +* [ ] Work on improved DO and Ce genotyping + +### Done + + -## Tasks ### On going tasks => https://issues.genenetwork.org/search?query=assigned%3Afelixl+AND+is%3Aopen&type=open-issue All in-progress tasks diff --git a/tasks/fredm.gmi b/tasks/fredm.gmi index 743007a..1cd3125 100644 --- a/tasks/fredm.gmi +++ b/tasks/fredm.gmi @@ -1,11 +1,27 @@ # Tasks for Fred +# Tags + +* kanban: fredm +* assigned: @fredm +* status: in progress + +# Tasks + +* [ ] Add drives to Penguin2, see issues/systems/penguin2-raid5 +* [X] Move production files from sdc to sde +* [ ] Fix password weakness +* [ ] Fix gn-docs and editing, e.g. facilities page by gn-guile in container +* [ ] Unify container dirs +* [ ] Fix wikidata gene aliases (see mapping page) with @pjotrp +* [ ] Public SPARQL container? + ## Description These are the tasks and issues to be handled by Fred. Here are all open issues -=> https://issues.genenetwork.org/search?query=is%3Aopen+AND+assigned%3AFrederick&type=all assigned to Frederick +=> https://issues.genenetwork.org/search?query=is%3Aopen+AND+assigned%3Afrederick&type=all assigned to Frederick The search used for the list above is > is:open AND assigned:Frederick @@ -17,8 +33,21 @@ You can refine the search by constraining the checks some more, e.g.
to get high ## Priority -… +### gn-uploader -## Extra +=> https://issues.genenetwork.org/search?query=is%3Aopen+AND+assigned%3Afrederick+AND+keywords%3Agn-uploader&type=all + +### gn-auth + +=> https://issues.genenetwork.org/search?query=is%3Aopen+AND+assigned%3Afrederick+AND+keywords%3Agn-auth&type=all -* Work on analysis of aging mice +### GeneNetwork2 + +=> https://issues.genenetwork.org/search?query=is%3Aopen+AND+assigned%3Afrederick+AND+keywords%3Agn2&type=all + +### GeneNetwork3 + +=> https://issues.genenetwork.org/search?query=is%3Aopen+AND+assigned%3Afrederick+AND+keywords%3Agn3&type=all + + +## Extra diff --git a/tasks/hasitha.gmi b/tasks/hasitha.gmi new file mode 100644 index 0000000..fcef29b --- /dev/null +++ b/tasks/hasitha.gmi @@ -0,0 +1,37 @@ +# Tasks for Hasitha + +## Tags + +* kanban: hasitha +* assigned: hasitha +* status: in progress + +## Tasks + +### Notes +* + +### This week +* [ ] Implementing CRAM encoding methods in GBAM +* - [ ] ReadName tokenization pipeline - could save ~6% space using this technique. Need to work on decoding.
+* [ ] Starting off with Cigar compression sub project with Andrea +* [ ] Discuss AGC and population compression with Andrea + +### Later +* [ ] + +### Even later + +* [ ] + +### On Hold +* [ ] GBAM reader using noodles + +## Done + +* [X] Moving GBAM python and rust stuff to C +* [X] Fixing memory issues in C +* [X] SAM input to GBAM in Rust +* [X] agc-rs setup testing on M1 mac + + diff --git a/tasks/johannesm.gmi b/tasks/johannesm.gmi new file mode 100644 index 0000000..840bd3e --- /dev/null +++ b/tasks/johannesm.gmi @@ -0,0 +1,90 @@ +# Tasks for Johannes + +## Tags + +* kanban: johannesm +* assigned: johannesm +* status: in progress + +## Tasks + +### Ongoing + +* [] Get system into use in GN + +* [] Draft outline paper + +* [] Talk with Hao about publication on agent system and gain over GNQA + +* [] Read literature for paper + + + +### Later + +* [~] Catch up on SPARQL + +* [~] Catch up on LMDB + + +### Past + +* [X] Make RAG script available for reuse and clean it with Bonface + +* [X] Discuss with Bonface on how to get metadata for RAG + +* [X] Pickle RAG -> not successful :) + +* [X] Try out SPARQLWrapper + +* [X] Optimize RAG + +* [X] Integrate with RAG + +* [X] Get actual metadata with SPARQL endpoint + +* [X] Make RAG agentic -> AI system + +* [X] Visit precompute issue + +* [X] Find more affordable options for GNQA + +* [X] Replace GNQA backend with Bonz + +* [X] Test GPU + +* [X] Test new AI system + +* [X] Use GPU to make naturalization faster + +* [X] Fix bugs and optimize AI system + +* [X] Test and validate AI system working + +* [X] Read up on performance evaluation for AI systems + +* [X] Refetch all data from SPARQL + +* [X] Preprocess RDF for improved naturalization + +* [X] Make asynchronous requests to server for naturalization + +* [X] Document work on issue tracker + +* [X] Package code AI system + +* [X] Draft API endpoint + +* [X] Test package and share with Bonz + +* [X] Test AI system with descriptions, qtl and real biology questions +
+* [X] Compare performance of Claude and open model on GN data before finetuning + +* [X] Look into system finetuning + +* [X] Test performance gain + +* [X] Think about how to show responses of agent system and GNQA in one UI + +* [X] Get API working diff --git a/tasks/machine-room.gmi b/tasks/machine-room.gmi index f6c7737..d656f2f 100644 --- a/tasks/machine-room.gmi +++ b/tasks/machine-room.gmi @@ -1,36 +1,45 @@ # Machine room tasks -## Tags +# Tags * assigned: pjotrp * priority: medium * type: system administration * keywords: system administration, octopus, gateway, tux02, tux01, tux03 -## Tasks - -### UTHSC - -* [ ] describe machines with Rick Stripes -* [ ] get bacchus back on line -* [ ] fix www.genenetwork.org and gn2.genenetwork.org https +# Tasks + +## GN + +* [ ] penguin2 has 90TB of space we can use on NFS/backups +* [ ] Script to replace reaper with GEMMA +* [ ] Transfer nervenet.org to dnsimple +* [+] Trait vectors for Johannes +* [X] grub on tux04 +* [ ] nft on tux04 +* [ ] !!Organize pluto, update Julia and add apps to GN menu Jupyter notebooks +* [+] !!Xusheng jumpshiny services +* [ ] Fix apps and create system containers for herd services - see issues/systems/apps +* [ ] Slurm+ravanan on production for GEMMA speedup +* [ ] Embed R/qtl2 (Alex) +* [ ] Hoot in GN2 (Andrew) * [ ] tux02 certbot failing (manual now) -* [ ] get data from summer211.uthsc.edu (access machine room) -* [ ] VPN access and FoUT -* [ ] penguin2 has 32TB of space we can use on NFS/backups - -Network: -* [ ] Octopus: wire up machines so they talk with each other over fiber +## Octopus: -Lambda: - -* [ ] remote access? 
(with Erik) - * [X] get BMC password +* [X] Fix Tux05 badblocks on /dev/sdb2 1050624 47925247 46874624 22.4G Linux filesystem + - see add-boot-partition +* [+] Copy linux partition on tux04, tux05, tux02 and test reboot +* [ ] !!Ceph on Tuxes +* [ ] Centralized user management system +* [ ] Monitor nodes +* [ ] Check machines so they talk with each other over fiber -Backups & storage: +## Backups & storage: -* [_] data warehousing +* [ ] Create and check backups of tux04 etc etc. +* [ ] set up zero to backup tux02 and report to redis +* [ ] reintroduce borg-borg on zero * [+] run sheepdog as root: redis password error; introduce SHEEPDOG_CONF * [ ] tux01 has unused 4TB spinning disk * [ ] tux02 has unused 2x4TB spinning disks and 2TB nvme /dev/nvme0n1 on adapter @@ -39,22 +48,23 @@ Backups & storage: fwupdmgr get-devices fwupdmgr update The previously problematic Samsung 980 Pro was basically using the 3B2QGXA7, and now Samsung has introduced a new 5B2QGXA7 firmware to fix the problem. The problem mainly affects the 2TB version of the 980 Pro -* [ ] Check backups of etc etc. Security: * [ ] Limit idrac access -* [X] space server out-of-band access -### Spice +## Spice -* [ ] Run GN off balg01 +* [ ] Add 2nd boot partition on balg01 * [ ] Add firewall test to sheepdog -* [ ] Convert balg02 to Guix server -* [ ] VM for student team -### Done +## Done +* [X] describe machines with Rick Stripes +* [X] get bacchus back on line +* [X] fix www.genenetwork.org and gn2.genenetwork.org https +* [-] get data from summer211 (access machine room) +* [X] VPN access and FoUT * [X] lambda: get fiber working * [X] lambda: add to Octopus HPC * [X] lambda: racked up and runs @@ -82,3 +92,10 @@ Security: * [X] tux07 has no fiber * [X] tux08 has no fiber * [X] tux09 has no fiber +### Lambda +* [X] remote access? 
(with Erik) + * [X] get BMC password +* [X] space server out-of-band access +### Spice +* [X] Run GN off balg01 +* [X] Convert balg02 to Guix server diff --git a/tasks/octopus.gmi b/tasks/octopus.gmi index 27232ec..61955ec 100644 --- a/tasks/octopus.gmi +++ b/tasks/octopus.gmi @@ -2,6 +2,9 @@ In this file we track tasks that need to be done. +Tuxes still have some 30x 2.5" slots. +Lambda has 18x 2.5" slots. + # Tasks * [X] get lizardfs and NFS going on tuxes tux06-09 diff --git a/tasks/pjotrp.gmi b/tasks/pjotrp.gmi index 278323b..57620aa 100644 --- a/tasks/pjotrp.gmi +++ b/tasks/pjotrp.gmi @@ -6,28 +6,69 @@ * assigned: pjotrp * status: in progress -# Notes - -The tasks here should probably be broken out into appropriately tagged issues, where they have not - they can be found and filtered out with tissue (formerly gnbug). - -=> https://issues.genenetwork.org - # Current -## Mempang24 - -* [X] Invites -* [+] Payments -* [+] Rooms -* [+] Catering +## 1U01HG013760 + +* Prefix-Free Parsing Compressed Suffix Tree (PFP) for tokenization +* Mempang + +* [+] create backup server with @fredm +* [+] RAG with Shelby and Bonz +* [+] Moni builds 1U01HG013760 +* [+] test framework wfmash - vertebrate tree and HPC compute? +* - wfmash - wgatools -> PAF + FASTA to VCF +* - wfmash arch=native build +* [ ] gbam - data compression with Nick and Hasithak +* [X] accelerate wfmash with @santiago and team +* [+] package wfmash and Rust wfa2-lib +* [ ] add Ceph for distributed network storage 1U01HG013760 +* [ ] Work on pangenome genotyping 1U01HG013760 +* [ ] update freebayes into Debian (version #) +* - [ ] static build and prepare for conda +* [ ] update vcflib into Debian (version #) +* - [ ] static build and prepare for conda +* [ ] pangenome as a 1st class input for GEMMA +* kilifish pangenome with Paul and Dario + +## Systems + +* [+] jumpshiny +* [ ] pluto +* [ ] Backup production databases on Tux04 +* - [+] Dump containers w. 
databases +* - [X] Dump mariadb +* - [ ] backup remote +* - [ ] borg-borg +* - [ ] fix root scripts +* [ ] make sure production is up to scratch (see stable below) +* [ ] synchronize git repos for public, CD, fallback and production using sheepdog and document +* [ ] drop tux02 backups on balg01 +* [X] Small database public ## Ongoing tasks (current/urgent) -* [+] Set up stable GeneNetwork server instance with new hardware (see below) -=> /topics/systems/fire-up-genenetwork-system-container.gmi -* [+] Julia as part of GN3 deployment +* [ ] ~Felix, Alex, Rahul as friends of UTHSC +* [ ] Precompute with GEMMA + + [ ] Store N + + [ ] Store significance levels + + [ ] Check genotype input data + + [ ] Imputation + + [ ] Do same with bulkLMM + + [ ] Generate lmdb output + + [ ] Hook into Xapian + + [ ] Hook into correlations + +* [ ] Check email setup tux04 +* [ ] jbrowse plugin code - https://genenetwork.trop.in/mm10 +* [+] bulklmm Julia as part of GN3 deployment + - precompute & Julia +=> https://github.com/GregFa/TestSysimage + Here the repo with BulkLMMSysimage: +=> https://github.com/GregFa/BulkLMMSysimage => /topics/deploy/julia.gmi -* [+] Shelby's paper+application +* [X] Set up stable GeneNetwork server instance with new hardware (see below) +=> /topics/systems/fire-up-genenetwork-system-container.gmi # Tasks @@ -43,18 +84,16 @@ Now (X=done +=WIP _=kickoff ?=?) * [+] Build leadership team * [+] gBAM -* [ ] Audit DB security * [ ] p-value global search -* [ ] Xapian search add tags, notmuch style (with @zachs) +* [+] Xapian search add tags, notmuch style (with @bonfacem and @zachs) => ../issues/systems/octopus -* [ ] Add R/qtl2 and multi-parent support with Karl (DO and Magic populations) -* [ ] Fix slow search on Mariadb? +* [+] Add R/qtl2 and multi-parent support with Karl (DO and Magic populations) +* [+] Fix slow search on Mariadb? Moving to xapian * [.] 
GeneNetwork paper * + [ ] add FAIR statement * [ ] RISC-V trials -* [ ] Guix project server Longer term @@ -64,22 +103,24 @@ Longer term Later -* [ ] Remove ttf files GN2 +* [ ] Mempang25 1U01HG013760 + + [X] Invites + + [ ] Payments + + [ ] Rooms + + [ ] Catering + +* [ ] Remove ttf files GN2 git repo * [+] DOI support GN -* [+] Tree of life w. Andrea * [ ] Hao's idea for counting cis-qtl * [ ] Improve search for significant and suggestive hits * [ ] Include dominance matrix results in gemma (gene expression and transcript isophorms) -* [ ] UCSC browser and big wig files -* [ ] Jbrowse2 browser +* [-] UCSC browser and big wig files (no longer applicable) +* [+] Jbrowse2 browser ### Set up stable server instance with new hardware -* [ ] R/qtl2 with Karl -* [ ] Set up opensmtpd as a service - + [ ] Add package dependency - + [ ] Test on open port 25 - + [ ] Add public-inbox (Arun) +* [ ] ssh-shell access for git markdown +* [+] R/qtl2 with Karl and Alex, see [alex.gmi] => ./machine-room.gmi machine room @@ -92,6 +133,9 @@ Later ### Done +* [X] add drives to Octopus 1U01HG013760 +* [X] mount NFS bacchus on octopus 1U01HG013760 +* [X] Guix project server * [X] Build junior team * [X] Complete Octopus setup with Erik, Andrea and Efraim - adding all machines * [X] Aging grant application w. Rob (fall) @@ -104,3 +148,12 @@ Later * [X] Fix mariadb index search - need to upgrade mariadb to convert final utf8mb4, see => ../issues/slow-sql-query-for-xapian-indexing.gmi * [X] Debian/free software issues incl. vcflib work in Zig and release +* [X] Set up opensmtpd as a service + +# Notes + +The tasks here should probably be broken out into appropriately tagged issues, where they have not - they can be found and filtered out with tissue (formerly gnbug). + +=> https://issues.genenetwork.org + +Generally work applies to NIH/R073237482 and other grants. 
diff --git a/tasks/programmer-team/meetings.gmi b/tasks/programmer-team/meetings.gmi new file mode 100644 index 0000000..d972b3b --- /dev/null +++ b/tasks/programmer-team/meetings.gmi @@ -0,0 +1,82 @@ +# Weekly meetings + +In this document we will track tasks based of our weekly meetings. This list sets the agenda +on progress for the next week's meeting. + +## 02-10-2024 +## @felixm +* [ ] Use Aider to contribute and cover to Fred's coding. Share useful prompts. +* [ ] Feed relevant papers to GPT and find similar summary for other datasets. Start with C-Elegans. + + +## @bonfacem +* [ ] Share values with PJ. +* [ ] Assume LMDB files are transient. When hash doesn't exist, generate the hash for that dataset. Use LMDB to store key value pairs of hashes. +* [ ] Add dump script to gn-guile. +* [ ] Add Case Attributes in Virtuoso. + +## @alex +* [ ] Push R/QTL2 to production +* [ ] Have R/QTL2 work for ITP + +Nice to have: +* Think about editing publish data and consequent updates to LMDB. + +## @pjotr +* Kickstart UTHSC VPN access for Felix and Alex. + +## 01-20-2024 +### @bonfacem + +* [ ] Report: OpenAI on Aider - use AI for programming - discuss with @alexm + +=> https://issues.genenetwork.org/topics/ai/aider + +* [-] Metadata: Provide list of case attributes for BXD to @flisso +* [-] Code UI: GeneRIF and GenWiki should work from the mapping page - encourage people to use + - anyone logged in can edit + - If RIF does not exist point to GeneWiki + - If GeneWiki does not exist provide edit page +* [ ] Code export: Exporting traits to lmdb PublishData - @alexm helps with SQL + - missing data should not be an X + - run lmdb design (first code) by @pjotrp + - start exporting traits for Johannes (he will need to write a python reader) +* Later: Improve the work/dev container for @alexm + +### @flisso + +* [ ] Write: Uploader protocol. NOTES: Finished with C-elegans. Yet to test with other datasets. 
+* [ ] Script: Run Reaper +* [ ] Data: Case attributes - with @bonfacem +* [ ] Write: Create protocol to upload case attributes + +### @alexm + +* [ ] Code: Rqtl2 match Rqtl1: match scan changes. Notes: PR out and added tests. +* [ ] Bug: Fix pair scan. NOTES: Fixed it. But can't test it now since CD is down. +* Later: AI changes + +### @Pjotr + +* [ ] Code: Work on precompute with GEMMA (w. Jameson) +* [ ] Code: Take Bonface's trait files when they become available + + +## 01-27-2024 + +Last week's error with CD and production downtime: +* [ level 1] Container: Error messages when data not loaded in Virtuoso, Indexing. +* [ level 2] Sheepdog: Check services --- sheepdog. Health checkpoints. +* [ level 3] User feedback. Escalate errors correctly to the users, so they can report to coders + +### @bonfacem +* [ ] Troubleshoot CD. +* [ ] Export files in lmdb. Yohannes read file in Python example +* [ ] Metadata: Provide list of case attributes for BXD to @flisso +* [ ] Aider: See if it can generate some guile and python. Give an example. + +### @alexm +* [ ] UI for R/Qtl2. + +### @flisso +* [ ] Look at Fred Python code for the uploader and report on this. diff --git a/tasks/roadmap.gmi b/tasks/roadmap.gmi new file mode 100644 index 0000000..9bed63d --- /dev/null +++ b/tasks/roadmap.gmi @@ -0,0 +1,65 @@ +# GN Road map + +GN is a web service for complex traits. The main version is currently deployed in Memphis TN, mostly targetting mouse and rat. +Here we define a road map to bring GN to more communities by providing federated services. +The aim is to have plant.genenetwork.org, nematode.genenetwork.org, big.genenetwork.org running in the coming years. + +# Getting an instance up (step 1) + +## Deploy a new instance + +To test things we can use an existing database or a new one. We can deploy that as a (new) Guix service container. 
+ +We'll need to run a few services including: + +* GN3 +* GN2 +* Auth (if required) +* Uploader (if required) + +## Get database ready + +In the first step we have to upload data for the target community. This can be done by updating the databases with some example datasets. Care has to be taken that search etc. works and that we can do the mapping. + +* Add traits +* Add genotype files +* Add metadata + +# Branding and hosting (Step 2) + +Once we have a working database with a number of example use cases we can start rebranding the service and, ideally, host it on location. + +# Synchronization (Step 3) + +## Move traits into lmdb + +This is WIP. We need to adapt the GN3 code to work with lmdb when available. + +## Move genotypes into lmdb + +This is WIP. We need to adapt the GN3 code to work with lmdb when available. + +# Federated metadata (Step 4) + +## Move all metadata into RDF + +This is WIP and happening. We will need to document. + +# LLM Integration (Step 5) + +Provide an LLM that integrates well with the gn eco-system. Goals for the LLM: + +* Flexible data ingestion +* Plug and play LLMS (local, OpenAI, Claude etc.) + +This is still a WIP. + +# Community (Step 6) + +## Uploading data examples + +## GN3 examples + +## UI examples + +## Provide programming examples diff --git a/tasks/zachs.gmi b/tasks/zachs.gmi new file mode 100644 index 0000000..6ae3df1 --- /dev/null +++ b/tasks/zachs.gmi @@ -0,0 +1,7 @@ +# Tasks for Zach + +# Tasks + +* [ ] Move non-ephemeral data out of redis into sqlite DB - see JSON dump +* - [ ] Collections +* - [ ] permanent URIs(?) 
diff --git a/tissue.scm b/tissue.scm index a0d6345..b3fd1fd 100644 --- a/tissue.scm +++ b/tissue.scm @@ -28,6 +28,7 @@ (tissue-configuration #:aliases '(("Alexander Kabui" "Alexander" "alex" "alexk") ("Arun Isaac" "arun" "aruni") + ("Hasitha") ("BonfaceKilz" "Bonface Kilz" "bonfacem") ("Felix Lisso" "fetche" "flisso") ("Efraim Flashner" "efraimf") diff --git a/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi b/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi new file mode 100644 index 0000000..05b2b6a --- /dev/null +++ b/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi @@ -0,0 +1,18 @@ +# [gn-guile/ADR-000] Extend Markdown Editor to push to Git Bare Repo + +* author: bonfacem +* status: accepted +* reviewed-by: alexm, jnduli + +## Context + +The gn-guile markdown editor currently reads from normal git repositories. However, for GN's self-hosted git repository, we use bare repositories. Bare repositories only store the git objects, therefore we can't edit files directly. + +## Decision + +gn-guile and the cgit instance run in the same server. We will have one normal repository, and the bare repository, configurable by: "CURRENT_REPO_PATH", which has the normal raw files; and "CGIT_REPO_PATH" which is the bare repository. We will make edits to the normal repository, and once that is done, push locally to the cgit instance. + +## Consequences + +* When creating the gn-guile container, this introduces extra complexity in that will have to make sure that the container has the correct write access to the bare repository in the container. +* With this, we are coupled to our GN git set-up. 
diff --git a/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi new file mode 100644 index 0000000..1e3ee6a --- /dev/null +++ b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi @@ -0,0 +1,74 @@ +# [gn-transform-databases/ADR-000] Remodel GeneRIF Metadata Using predicateObject Lists + +* author: bonfacem +* status: rejected +* reviewed-by: pjotr, jnduli + +## Context + +In RDF 1.1 Turtle, you have to use a Qname as the subject. As such, you cannot have a string literal forming the string. In simpler terms, this is not possible: + +``` +"Unique expression signature of a system that includes the subiculum, layer 6 in cortex ventral and lateral to dorsal striatum, and the endopiriform nucleus. Expression in cerebellum is apparently limited to Bergemann glia ABA" dct:created "2007-08-31T13:00:47"^^xsd:datetime . +``` + +As of commit "397745b554e0", a work-around was to manually create a unique identifier for each comment for the GeneRIF table. This identifier was created by combining GeneRIF.Id with GeneRIF.VersionId. One challenge with this is that we create some coupling with MySQL's unique generation of the GeneRIF.Id column. Here's an example of snipped turtle entries: + +``` +gn:wiki-352-0 rdfs:comment "Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia." . +gn:wiki-352-0 rdf:type gnc:GNWikiEntry . +gn:wiki-352-0 gnt:symbol gn:symbolPitpna . +gn:wiki-352-0 dct:created "2006-03-10T15:39:29"^^xsd:datetime . +gn:wiki-352-0 gnt:belongsToSpecies gn:Mus_musculus . +gn:wiki-352-0 dct:hasVersion "0"^^xsd:int . +gn:wiki-352-0 dct:identifier "352"^^xsd:int . +gn:wiki-352-0 gnt:initial "BAH" . +gn:wiki-352-0 foaf:mbox "XXX@XXX.XXX" . 
+gn:wiki-352-0 dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) . +gn:wiki-352-0 gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) . +``` + +## Decision + +We want to avoid manually generating a unique identifier for each WIKI comment. We should instead have that UID be a blank node reference that we don't care about and use predicateObjectLists as an idiom for representing string literals that can't be subjects. + +=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList Predicate Object Lists + +The above transform (gn:wiki-352-0) would now be represented as: + +``` +[ rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en] rdf:type gnc:GNWikiEntry ; + gnt:belongsToSpecies gn:Mus_musculus ; + dct:created "2006-03-10 12:39:29"^^xsd:datetime ; + dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) ; + foaf:mbox <XXX@XXX.XXX> ; + dct:identifier "352"^^xsd:integer ; + dct:hasVersion "0"^^xsd:integer ; + gnt:initial "BAH" ; + gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) ; + gnt:symbol gn:symbolPitpna . +``` + +The above can be loosely translated as: + +``` +_:comment rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en . +_:comment rdf:type gnc:GNWikiEntry . +_:comment dct:created "2006-03-10 12:39:29"^^xsd:datetime . 
+_:comment dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) . +_:comment foaf:mbox <XXX@XXX.XXX> . +_:comment dct:identifier "352"^^xsd:integer . +_:comment dct:hasVersion "0"^^xsd:integer . +_:comment gnt:initial "BAH" . +_:comment gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) . +_:comment gnt:symbol gn:symbolPitpna . +``` + +## Consequences + +* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index. +* Reduction in size of the final output, and faster transform time because predicateObjectLists produce terser RDF. + +## Rejection Rationale + +This proposal was rejected because relying on blank nodes as identifiers is opaque and not human-readable. We want to use human-readable identifiers where possible. diff --git a/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi new file mode 100644 index 0000000..073525a --- /dev/null +++ b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi @@ -0,0 +1,102 @@ +# [gn-transform-databases/ADR-001] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata Using predicateObject Lists + +* author: bonfacem +* status: rejected +* reviewed-by: pjotr, jnduli + +## Context + +We can model RIF comments using predicateObject lists as described in: + +=> https://issues.genenetwork.org/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists [ADR/gn-transform-databases] Remodel GeneRIF Metadata Using predicateObject Lists + +However, currently for NCBI RIFs we represent comments as blank nodes: + +``` +gn:symbolsspA rdfs:comment [ + rdf:type gnc:NCBIWikiEntry ; + rdfs:comment "N-terminus verified by Edman 
degradation on mature peptide"^^xsd:string ; + gnt:belongsToSpecies gn:Mus_musculus ; + skos:notation taxon:511145 ; + gnt:hasGeneId generif:944744 ; + dct:hasVersion '1'^^xsd:int ; + dct:references pubmed:97295 ; + ... + dct:references pubmed:15361618 ; + dct:created "2007-11-06T00:38:00"^^xsd:datetime ; +] . +gn:symbolaraC rdfs:comment [ + rdf:type gnc:NCBIWikiEntry ; + rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ; + gnt:belongsToSpecies gn:Mus_musculus ; + skos:notation taxon:511145 ; + gnt:hasGeneId generif:944780 ; + dct:hasVersion '1'^^xsd:int ; + dct:references pubmed:320034 ; + ... + dct:references pubmed:16369539 ; + dct:created "2007-11-06T00:39:00"^^xsd:datetime ; +] . + +``` + +Here we see a lot of duplicated entries for the same symbols. For the above 2 entries, everything is exactly the same except for the "gnt:hasGeneId" and "dct:references" predicates. + +## Decision + +We use predicateObjectLists with blankNodePropertyLists as an idiom to represent the generif comments. + +=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList predicateObjectList +=> https://www.w3.org/TR/turtle/#grammar-production-blankNodePropertyList blankNodePropertyList + +In so doing, we can de-duplicate the entries demonstrated above. A representation of the above RDF Turtle triples would be: + +``` +[ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ] +rdf:type gnc:NCBIWikiEntry ; +dct:created "2007-11-06T00:39:00"^^xsd:datetime ; +gnt:belongsToSpecies gn:Mus_musculus ; +skos:notation taxon:511145 ; +dct:hasVersion '1'^^xsd:int ; +rdfs:seeAlso [ + gnt:hasGeneId generif:944744 ; + gnt:symbol gn:symbolsspA ; + dct:references ( pubmed:97295 ... pubmed:15361618 ) ; +] ; +rdfs:seeAlso [ + gnt:hasGeneId generif:944780 ; + gnt:symbol gn:symbolaraC ; + dct:references ( pubmed:320034 ... pubmed:16369539 ) ; +] . 
+``` + +The above would translate to the following triples: + +``` +_:comment rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string . +_:comment rdf:type gnc:NCBIWikiEntry . +_:comment dct:created "2007-11-06T00:39:00"^^xsd:datetime . +_:comment gnt:belongsToSpecies gn:Mus_musculus . +_:comment skos:notation taxon:511145 . +_:comment dct:hasVersion '1'^^xsd:int . +_:comment rdfs:seeAlso _:metadata1 . +_:comment rdfs:seeAlso _:metadata2 . +_:metadata1 gnt:hasGeneId generif:944744 . +_:metadata1 gnt:symbol gn:symbolsspA . +_:metadata1 dct:references ( pubmed:97295 ... pubmed:15361618 ) . +_:metadata2 gnt:hasGeneId generif:944780 . +_:metadata2 gnt:symbol gn:symbolaraC . +_:metadata2 dct:references ( pubmed:320034 ... pubmed:16369539 ) . +``` + +Beyond that, we intentionally use a sequence to store a list of pubmed references. + +## Consequences + +* De-duplication of comments during the transform while retaining the integrity of the RIF metadata. +* Because of the terseness, less work during the I/O heavy operation. +* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index. + +## Rejection Rationale + +This proposal was rejected because relying on blank nodes as identifiers is opaque and not human-readable. We want to use human-readable identifiers where possible. 
diff --git a/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi new file mode 100644 index 0000000..ac06fc1 --- /dev/null +++ b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi @@ -0,0 +1,127 @@ +# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact + +* author: bonfacem +* status: proposal +* reviewed-by: pjotr, jnduli + +## Context + +Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol: + +``` +gn:symbolsspA rdfs:comment [ + rdf:type gnc:NCBIWikiEntry ; + rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ; + gnt:belongsToSpecies gn:Mus_musculus ; + skos:notation taxon:511145 ; + gnt:hasGeneId generif:944744 ; + dct:hasVersion '1'^^xsd:int ; + dct:references pubmed:97295 ; + ... + dct:references pubmed:15361618 ; + dct:created "2007-11-06T00:38:00"^^xsd:datetime ; +] . +gn:symbolaraC rdfs:comment [ + rdf:type gnc:NCBIWikiEntry ; + rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ; + gnt:belongsToSpecies gn:Mus_musculus ; + skos:notation taxon:511145 ; + gnt:hasGeneId generif:944780 ; + dct:hasVersion '1'^^xsd:int ; + dct:references pubmed:320034 ; + ... + dct:references pubmed:16369539 ; + dct:created "2007-11-06T00:39:00"^^xsd:datetime ; +] . +``` + +Moreover, we also store all the different versions of a comment: + +``` +mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G +*************************** 1. 
row *************************** + SpeciesId: 1 + TaxID: 7955 + GeneId: 323473 + symbol: prdm1a + PubMed_ID: 15680355 +createtime: 2010-01-21 00:00:00 + comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells + VersionId: 1 +*************************** 2. row *************************** + SpeciesId: 1 + TaxID: 7955 + GeneId: 323473 + symbol: prdm1a + PubMed_ID: 15680355 +createtime: 2010-01-21 00:00:00 + comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons + VersionId: 2 +``` + +## Decision + +First, we should only store the latest version of a given RIF entry and ignore all other versions. RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns: SpeciesId, GeneId, PubMed_ID, createtime, and VersionId. Since we are storing the latest version of a given RIF entry, we drop the version identifier during the RDF transform. + +We use a unique identifier for a given comment, and use that as a triple's QName: + +> gn:rif-<speciesId>-<GeneId> + +Finally, instead of: + +``` +<symbol> predicate <comment metadata> +``` + +We use: + +``` +<comment-uid> predicate object ; + ... (more metadata) . +``` + +An example triple would take the form: + +``` +gn:rif-1-511145 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en . +gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry . +gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus . +gn:rif-1-511145 skos:notation taxon:511145 . +gn:rif-1-511145 rdfs:seeAlso [ + gnt:hasGeneId generif:944744 ; + gnt:symbol "spA" ; + dct:references ( pubmed:97295 ... pubmed:15361618 ) ; +] . +gn:rif-1-511145 rdfs:seeAlso [ + gnt:hasGeneId generif:944780 ; + gnt:symbol "araC" ; + dct:references ( pubmed:320034 ... pubmed:16369539 ) ; +] . +``` + +To efficiently store GeneIds, symbols and references, we use blank nodes. 
This reduces redundancy and simplifies the triples compared to including these details within the subject: + +``` +gn:rif-1-511145-944744 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en . +gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry . +gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus . +gn:rif-1-511145-944744 skos:notation taxon:511145 . +gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 . +gn:rif-1-511145-944744 gnt:symbol "spA" . +gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) . + +gn:rif-1-511145-944780 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en . +gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry . +gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus . +gn:rif-1-511145-944780 skos:notation taxon:511145 . +gn:rif-1-511145-944780 gnt:hasGeneId generif:944780 . +gn:rif-1-511145-944780 gnt:symbol "araC" . +gn:rif-1-511145-944780 dct:references ( pubmed:320034 ... pubmed:16369539 ) . +``` + +## Consequences + +* More complex SQL query required for the transform. +* De-duplication of RIF entries during the transform. +* Because of the terseness, less work during the I/O heavy operation. +* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index. diff --git a/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi b/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi new file mode 100644 index 0000000..43ac2ba --- /dev/null +++ b/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi @@ -0,0 +1,21 @@ +# [gn3/ADR-000] Add RDF Test Cases + +* author: bonfacem +* status: proposed +* reviewed-by: jnduli + +## Context + +We have no way of ensuring the integrity of our SPARQL queries in GN3. As such, GN3 is fragile to breaking changes when the TTL files are updated. + +## Decision + +In Virtuoso, we load all our data to a default named graph: <http://genenetwork.org>. 
For SPARQL/RDF tests, we should upload test ttl files to a test named graph: <http://cd-test.genenetwork.org>, and run our RDF unit tests against that named graph. + +## Consequences + +* Extra bootstrapping to load ttl files when running the test. +* Extra documentation to GN developers on how to run virtuoso locally to get the tests running. +* Testing against gn-machines to make sure that all things run accordingly. +* Extra maintenance costs to keep the TTL files in lockstep with the latest RDF changes during re-modeling. +* Improvement in GN3 reliability. diff --git a/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi b/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi new file mode 100644 index 0000000..0910415 --- /dev/null +++ b/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi @@ -0,0 +1,49 @@ +# [gn3/ADR-001] Remove Stack Traces in GN3 + +* author: bonfacem +* status: rejected +* reviewed-by: jnduli, zach, pjotr, fredm + +## Context + +Currently, GN3 error responses include stack traces: + +``` +def add_trace(exc: Exception, jsonmsg: dict) -> dict: + """Add the traceback to the error handling object.""" + return { + **jsonmsg, + "error-trace": "".join(traceback.format_exception(exc)) + } + + +def page_not_found(pnf): + """Generic 404 handler.""" + current_app.logger.error("Handling 404 errors", exc_info=True) + return jsonify(add_trace(pnf, { + "error": pnf.name, + "error_description": pnf.description + })), 404 + + +def internal_server_error(pnf): + """Generic 404 handler.""" + current_app.logger.error("Handling internal server errors", exc_info=True) + return jsonify(add_trace(pnf, { + "error": pnf.name, + "error_description": pnf.description + })), 500 +``` + + +## Decision + +Stack traces have the potential to allow malicious actors compromise our system by providing more context. 
As such, we should send a useful description of what went wrong; and log our stack traces in our logs, and send an appropriate error status code. We can use the logs to troubleshoot our system. + +## Consequences + +* Lockstep update in GN2 UI on how we handle GN3 errors. + +## Rejection Rationale + +The proposal to remove stack traces from error responses was rejected because they are essential for troubleshooting, especially when issues are difficult to reproduce or production logs are inaccessible. Stack traces provide immediate error context, and removing them would complicate debugging by requiring additional effort to link logs with specific requests; a trade-off we are not willing to make at the moment. diff --git a/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi b/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi new file mode 100644 index 0000000..a8026ce --- /dev/null +++ b/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi @@ -0,0 +1,32 @@ +# [gn3/ADR-002] Move RDF Test Cases to Build Container + +* author: bonfacem +* status: accepted +* reviewed-by: jnduli + +## Context + +GN3 RDF tests are run against the CD's virtuoso instance. As such, we need to set special parameters when running tests: + +``` +SPARQL_USER = "dba" +SPARQL_PASSWORD = "dba" +SPARQL_AUTH_URI="http://localhost:8890/sparql-auth/" +SPARQL_CRUD_AUTH_URI="http://localhost:8890/sparql-graph-crud-auth" +FAHAMU_AUTH_TOKEN="XXXXXX" +``` + +This extra bootstrapping when running tests needs care, and locks tests to CD or special configuration when running locally. This leads to fragile tests that cause CD to break. Moreover, to add tests to CD, we would have to add extra g-exp to gn-machines. + +This ADR is related to: + +=> /topics/ADR/gn3/000-add-test-cases-for-rdf.gmi gn3/ADR-000. + +## Decision + +Move tests to the test build phase of building the genenetwork3 package. 
These tests are added in the ".guix/genenetwork3-all-tests.scm" file instead of the main "genenetwork3" package definition in guix-bioinformatics. This way, we have all our "light" tests, i.e. unit tests, running in guix-bioinformatics, while having all our heavier tests, in this case RDF tests, running in CD. + +## Consequences + +* Extra bootstrapping to gn3's .guix/genenetwork3-package.scm to get tests working. +* GN3 RDF tests refactoring to use a virtuoso instance running in the background while tests are running. diff --git a/topics/ai/aider.gmi b/topics/ai/aider.gmi new file mode 100644 index 0000000..00845c8 --- /dev/null +++ b/topics/ai/aider.gmi @@ -0,0 +1,19 @@ +# Aider + +=> https://aider.chat/ + +``` +apt-get install python3-venv # or use guix +python3 -m venv ~/opt/python-aider +~/opt/python-aider/bin/python3 -m pip install aider-install +export PATH="/home/wrk/.local/bin:$PATH" +~/opt/python-aider/bin/aider-install +``` + +Installed 1 executable: aider +Executable directory /home/wrk/.local/bin is already in PATH + +``` +aider --model sonnet --api-key anthropic=sk-ant... +aider --model gpt-4o --openai-api-key aa... +``` diff --git a/topics/ai/gn_agent.gmi b/topics/ai/gn_agent.gmi new file mode 100644 index 0000000..2b789c9 --- /dev/null +++ b/topics/ai/gn_agent.gmi @@ -0,0 +1,790 @@ +# Build an AI system for GN + +## Tags +* type: feature +* assigned: johannesm +* priority: medium +* status: in progress +* keywords: llm, rag, ai, agent + +## Description + +The aim is to build an AI system/agent/RAG able to digest mapping results and metadata in GN in order to scale analysis. This is not quite possible at the moment, given that one still needs to dig through and compare that type of information manually. And the data in GN is somewhat big for such an approach :) + +I made an attempt at using deep learning for my Masters project. It could work but required further processing of results for interpretation. Not quite handy! 
Instead, we want a system which takes care of all the work (at least most of it) and that we can understand. This is how transformers and LLMs came into the picture. + +This work is an extension of the GNQA system initiated by Shelby and Pjotr. + +## Tasks +* [X] Look for transformer model ready for use and try +* [X] Build a RAG system and test with small corpus of mapping results +* [X] Experiment with actual mapping results and metadata +* [X] Move from RAG to agent +* [X] Optimize AI system +* [ ] Scale analysis to more data +* [ ] Compare performance of open LLMs with Claude in the system + + + + +### Look for transformer model ready for use and try + +Given the success of transformers, I was first encouraged by Pjotr to look for a model that can support different types of data, i.e. numerical (mapping results) vs textual (metadata). + +I found TAPAS which: +* takes data of different types in tabular format +* takes a query or question in form of text +* performs operations on rows of the data table +* retrieves relevant information +* returns an answer to the original query + +Experiments were ongoing when Rob found with the help of Claude that this architecture would not go far. I know we used an AI to assist our work on AI (at least we did not ask an AI to do the job from the go :)) +But it was a good point. TAPAS is relatively old and a lot of progress has been made with LLMs and agents since! + +To take advantage of all the progress made with LLMs, I needed to find a way to have only text data. LLMs are trained to understand and work with text. Metadata, being RDF, is already in text format. I only needed to convert the mapping results to text. It is a detour worth the shot if this can give more flexibility and save development time! + +### Build a RAG system and test with a small corpus of mapping results + +I have read a number of books and found that RAG systems are pretty easy to design with LangChain. 
The RAG is made of 2 components: +* search and retrieval -> need a keyword search algorithm or embedding model +* response generation -> need an LLM + +The system takes a query and returns a response based on the information retrieved. Very efficient in increasing factuality of LLM responses! + +I wrote a first draft with: + +1. BM25 keyword search algorithm and Qwen3-Embedding-0.6B model +This is a hybrid search subsystem. It searches both by keyword and meaning. It should be more scalable. The subsystem: +* takes the query +* processes it +* extracts the most similar documents or data chunks +* passes the results to the LLM for further processing + +2. Qwen2.5-72B-Instruct LLM +This is the response subsystem. It: +* takes the original query +* takes retrieved information +* reasons using a tree-of-thought mechanism for validation +* generates an answer to the query using retrieved information + +Small bonus +I suspected that more than one query might be required to address some questions. I needed to find a way to make the system conversational so that one could ask many questions without information being lost in between. This came down to adding memory to the system. So, I added a small model for memory propagation. The model was: +3. Phi-3-mini-4k-instruct +It uses the model and the ConversationSummary class from LangChain to summarize key information in the interaction so that it remains accessible between sessions. + +Time to test! + +* Query 1 + +``` +Question: What is the lod for trait leptin receptor EPFLMouseLiverCDEx0413 at position 100? + +Response: +**Expert 1:** +The question asks for the LOD score at position 100. From the relevant information, I see that there is a LOD score of 1 at position 100. + +**Expert 2:** +I agree with Expert 1. The relevant information clearly states that the LOD score at position 100 is 1. + +**Expert 3:** +I concur with both Experts 1 and 2. The LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is indeed 1. 
+ +**Conclusion:** +All experts agree that the LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is 1. <|end|> +``` + +Since the corpus used for testing is small, I could inspect it to confirm the accuracy of the response generated by the RAG. Lookup revealed that it was correct, yah! + +* Query 2 + +``` +Question: Which position of leptin receptor EPFLMouseLiverCDEx0413 was examined in the previous prompt? + +Response: +**Expert 1:** +The previous prompt asked for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100. + +**Expert 2:** +I agree with Expert 1. The previous prompt specifically asked about the LOD score at position 100. Thus, the position examined is 100. + +**Expert 3:** +I concur with both Experts 1 and 2. The previous prompt clearly specified that it was asking for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100. + +**Conclusion:** +All experts agree that the position of leptin receptor EPFLMouseLiverCDEx0413 examined in the previous prompt is 100. <|end|> +``` + +This was also correct. I was very excited, Rob too! + +I received feedback that made me clarify the meaning of position. I rebuilt the small corpus using a BXD dataset of traits I ran GEMMA on for my Masters project, making sure to use actual marker positions this time and also including marker names. + +Let's experiment again! + +I got results similar to the ones above, with the exception that marker positions are real and marker names are supported. + +I faced a challenge though :( + +For queries that require combinations of different data chunks or documents (non-atomic queries), the system does not perform well. For example, given the query + +* How many traits hepatic nuclear factor 4 are in the datasets? +The system was confused. 
Even after prompt engineering, the answer generated was not accurate. + +* Identify 2 traits that have similar lod values on chromosome 1 position 3010274 +The system sometimes missed the traits or caught only 1 trait having a lod value at the position. + +This is probably because the system cannot execute more than one retrieval run. To get there, I need to make the RAG more autonomous: this is how the concept of agent came up. + + +### Experiment with actual mapping results and metadata + +Building an agent required more reading. In the meantime, I decided to get actual mapping results and metadata for experimentation. It would be sad to proceed if the system is actually not compatible with the data to be used in production :) + +I waited for Pjotr to precompute GEMMA association results and export them with metadata to an endpoint. The RDF schema was very interesting to learn and Bonz did some work on that in the past :) + +You can check out recent developments of Pjotr's work here: +=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-publishdata + +For Bonz's work, see: +=> https://github.com/genenetwork/gn-docs/blob/master/rdf-documentation/phenotype-metadata.md + +Anyway, it took some time but I finally got a glimpse of the data. + +This started with the metadata from an old endpoint created by Bonz. I also had to learn SPARQL - I was quite new to it! + +We thought LLMs could directly make sense of RDF data (still in text format), but it turns out they cannot. They can recognize that it is RDF but in between all the URIs, they start making mistakes quite quickly. Instead of using RDF natively, we decided to use LLMs to first convert RDF data - whether metadata or mapping results - to natural text before using it with the RAG system. The system should perform better this way, and we confirmed that! + +Pjotr made available the first version of the global endpoint. Nothing should stop me now :) I wrote a script to fetch metadata from the endpoint. I have not shared my code so far.
Let me fix that right now. You can follow this link for the script I was referring to above: +=> https://github.com/johanmed/gn-rag/blob/main/fetch_metadata.py + +Pjotr also made available the ttl files in my home directory on balg01 - full flexibility! + +I naturalized some RDF triples. The corpus now looked like below + +``` +The phenotype identified as BXD_19926, or abbreviated as TAG(48:3)_HFD, is part of the EPFL LISP3 Study. It's a measurement of Liver Lipidomics: TAG48:3_HFD, which has side chain FA composition [TG12:0_18:1_18:2,TG14:0_16:0_18:3,TG14:0_16:1_18:2,TG14:0_16:2_18:1,TG14:1_16:0_18:2,TG14:1_16:1_18:1,TG16:0_16:1_16:2,TG16:1_16:1_16:1] belonging to the lipid class triacylglycerol TAG. This phenotype was measured by MS [R.U.AUC] in overnight fasted male mice at 29 weeks of age, who were on a high fat diet (60% kCal/fat HarlanTD.06414). +The phenotype was submitted by Jha P and contributed to by Auwerx J. It belongs to the group setBxd and has a sequence number of 1. The mean value for this phenotype is 0.00410489, with an additive effect of 0.00066854. The LOD score for this phenotype is 2.91779. +The locus associated with this phenotype is Rsm10000001653. +``` + +It was time to proceed to testing. The results I show below are not exactly for the RAG system I explained above. I was improving the RAG in parallel. So watch out. I will explain everything in the next task :) + +For now, you can have a look at the first implementation at: +=> https://github.com/johanmed/gn-rag/commit/2cf0b74442e8f7e3a67d563b882f3ab25a4ceb6d + +The goal was to try a complex query the previous system failed on. You can see that the question was indeed not atomic. 
+ +``` +Question: Identify traits having a lod score > 4.0 + +Response: +In addressing the query to identify traits with a LOD score exceeding 4.0, it's pertinent to highlight three mapped loci that meet these criteria: Rsm10000011038_BXDPublish_170 +86_gemma_GWA_19ca1aeb (LOD: 4.5), Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548 (LOD: 4.14) and Rsm10000000559 (LOD: 4.61). These findings underscore the significance of these loci in genetic studies, given their high LOD scores indicative of strong linkage evidence. +``` + +It is already better. At least, the system combined information from different data or document chunks. Retrieved chunks were: + +``` +[[Document(metadata={}, page_content='\nThe mapped locus associated with Rsm10000001536 has a LOD score of 2.07, and the mapped SNP is identified as GEMMAMapped_LOCO_BXDPublish_24451_gemma_GWA_aecf628e. The effect size for this locus is 0.01, with an allele frequency (AF) of 0.333.\n '), Document(metadata={}, page_content='\nThe mapped + locus Rsm10000011536 is associated with a LOD score of 5.69, an effect size of 0.385 and an allele frequency of 0.526. This locus has been mapped to the SNP GEMMAMapped_LOCO_BXDPublish_2032 +0_gemma_GWA_6832c0e4.\n '), Document(metadata={}, page_content='\nThe mapped locus, Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548, has an effect size of -3.137 and a LOD +score of 4.14. This locus is associated with the mapped SNP GEMMAMapped_LOCO_BXDPublish_12641_gemma_GWA_8c8d0548, and it has an allele frequency of 0.556.\n '), Document(metad +ata={}, page_content='\nIn plain English, this data refers to a mapped locus associated with the Rsm10000011038_BXDPublish_17086_gemma_GWA_19ca1aeb identifier. This locus is linked to the Rsm10000011038 identifier, has an effect size of -0.048, a LOD score of 4.5, and an allele frequency (AF) of 0.167. 
The mapped SNP associated with this data can be found under the GEMMAMapped_LOCO_BXDPublish_17086_gemma_GWA_19ca1aeb identifier.\n '), Document(metadata={}, page_content='\n In plain English, the data describes a genetic locus identified as Rsm10000000559. This locus was mapped through an effect size of -34.191, with an allele frequency of 0.438. The mapping achieved a LOD score of 4.61, indicating the statistical significance of this genetic association. The mapped locus is associated with a specific SNP (Single Nucleotide Polymorphism) identified as GEMMAMapped_LOCO_BXDPublish_12016_gemma_GWA_bc6adcae.\n ')]] +``` + +### Move from RAG to agent + +This is where I made the system more autonomous, i.e. agentic. I am now going to explain how I did it. I read a couple of sources and found that a RAG system built with LangChain could be made agentic by using LangGraph. This creates a graph structure which splits the task among different nodes or agents. Each agent handles a specific subtask and a final node manages the integration.
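
The idea of nodes passing a shared state along a graph can be sketched in plain Python. This is only an illustration and does not use the LangGraph API: the node names mirror the ones used in this document, but the stub bodies, the `next` key and the `run_graph` helper are assumptions made for the example.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Node = Callable[[State], State]


def retrieve(state: State) -> State:
    # Stub retrieval: pretend one document matches the query.
    docs = [f"document about {state['input']}"]
    return {**state, "context": docs, "next": "analyze"}


def analyze(state: State) -> State:
    # Stub analysis: build an answer from the retrieved context.
    answer = f"answer based on {len(state['context'])} document(s)"
    return {**state, "answer": answer, "next": "summarize"}


def summarize(state: State) -> State:
    # Final node: integrate the work of the previous nodes.
    summary = f"{state['input']} - {state['answer']}"
    return {**state, "summary": summary, "next": "end"}


# Hypothetical registry mapping node names to node functions.
NODES: Dict[str, Node] = {
    "retrieve": retrieve,
    "analyze": analyze,
    "summarize": summarize,
}


def run_graph(question: str, max_steps: int = 10) -> State:
    # Each node returns the updated state plus the name of the next node.
    state: State = {"input": question, "context": [], "answer": "",
                    "next": "retrieve"}
    steps = 0
    while state["next"] != "end" and steps < max_steps:
        state = NODES[state["next"]](state)
        steps += 1  # hard cap so a routing mistake cannot loop forever
    return state
```

Calling `run_graph("lod score")` walks retrieve, analyze and summarize once each. The step cap plays the same role as an iteration limit in a real agent graph.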
+ +Checkout this commit to see the results: +=> https://github.com/johanmed/gn-rag/commit/ecde30a31588605358007cc39df25976b9c2e295 + +You can clearly see differences between *rag_langchain.py* and *rag_langgraph.py* + +Basically, + +``` +def ask_question(self, question: str): + start=time.time() + memory_var=self.memory.load_memory_variables({}) + chat_history=memory_var.get('chat_history', '') + result=self.retrieval_chain.invoke( + {'question': question, + 'input': question, + 'chat_history': chat_history}) + answer=result.get("answer") + citations=result.get("context") + self.memory.save_context( + {'input': question}, + {'answer': answer}) + # Close LLMs + GENERATIVE_MODEL.client.close() + SUMMARY_MODEL.client.close() + end=time.time() + print(f'ask_question: {end-start}') + return { + "question": question, + "answer": answer, + "citations": citations, + } +``` + +became: + +``` +def retrieve(self, state: State) -> dict: + # Define graph node for retrieval + prompt = f""" + You are powerful data retriever and you strictly return + what is asked for. 
+ Retrieve relevant documents for the query below, + excluding these documents: {state.get('seen_documents', [])} + Query: {state['input']}""" + retrieved_docs = self.ensemble_retriever.invoke(prompt) + return {"input": state["input"], + "context": retrieved_docs, + "digested_context": state.get("digested_context", []), + "result_count": state.get("result_count", 0), + "target": state.get("target", 3), + "max_iterations": state.get("max_iterations", 5), + "should_continue": "naturalize", + "iterations": state.get("iterations", 0) + 1, # Add one per run + "chat_history": state.get("chat_history", []), + "answer": state.get("answer", ""), + "seen_documents": state.get("seen_documents", [])} + + def manage(self, state:State) -> dict: + # Define graph node for task orchestration + context = state.get("context", []) + digested_context = state.get("digested_context", []) + answer = state.get("answer", "") + iterations = state.get("iterations", 0) + chat_history = state.get("chat_history", []) + result_count = state.get("result_count", 0) + target = state.get("target", 3) + max_iterations = state.get("max_iterations", 5) + should_continue = state.get("should_continue", "retrieve") + # Orchestration logic + if iterations >= max_iterations or result_count >= target: + should_continue = "summarize" + elif should_continue == "retrieve": + # Reset fields + context = [] + digested_context = [] + answer = "" + elif should_continue == "naturalize" and not context: + should_continue = "retrieve" # Can't naturalize without context + context = [] + digested_context = [] + answer = "" + elif should_continue == "analyze" and \ + (not context or not digested_context): + should_continue = "retrieve" # Can't analyze without context + context = [] + digested_context = [] + answer = "" + elif should_continue == "check_relevance" and not answer: + should_continue = "analyze" # Can't check relevance without answer + elif should_continue not in ["retrieve", \ + "naturalize", 
"check_relevance", "analyze", "summarize"]: + should_continue = "summarize" # Fallback + return {"input": state["input"], + "should_continue": should_continue, + "result_count": result_count, + "target": target, + "iterations": iterations, + "max_iterations": max_iterations, + "context": context, + "digested_context": digested_context, + "chat_history": chat_history, + "answer": answer, + "seen_documents": state.get("seen_documents", [])} + + def analyze(self, state:State) -> dict: + # Define graph node for analysis and text generation + context = "\n".join(state.get("digested_context", [])) + existing_history="\n".join(state.get("chat_history", [])) \ + if state.get("chat_history") else "" + iterations = state.get("iterations", 0) + max_iterations = state.get("max_iterations", 5) + result_count = state.get("result_count", 0) + target = state.get("target", 3) + if not context: # Cannot proceed without context + should_continue = "summarize" if iterations >= max_iterations \ + or result_count >= target else "retrieve" + response = "" + else: + prompt = f""" + <|im_start|>system + You are an experienced analyst that can use available information + to provide accurate and concise feedback. + <|im_end|> + <|im_start|>user + Answer the question below using following information. 
+ Context: {context} + History: {existing_history} + Question: {state["input"]} + Answer: + <|im_end|> + <|im_start|>assistant""" + response = GENERATIVE_MODEL.invoke(prompt) + if not response or not isinstance(response, str) or \ + response.strip() == "": # Need valid generation + should_continue = "summarize" if iterations >= max_iterations \ + or result_count >= target else "retrieve" + response = "" # Ensure a clean state + else: + should_continue = "check_relevance" + return {"input": state["input"], + "answer": response, + "should_continue": should_continue, + "context": state.get("context", []), + "digested_context": state.get("digested_context", []), + "iterations": iterations, + "max_iterations": max_iterations, + "result_count": result_count, + "target": target, + "chat_history": state.get("chat_history", []), + "seen_documents": state.get("seen_documents", [])} + + + def summarize(self, state:State) -> dict: + # Define node for summarization + existing_history = state.get("chat_history", []) + current_interaction=f""" + User: {state["input"]}\nAssistant: {state["answer"]}""" + full_context = "\n".join(existing_history) + "\n" + \ + current_interaction if existing_history else current_interaction + result_count = state.get("result_count", 0) + target = state.get("target", 3) + iterations = state.get("iterations", 0) + max_iterations = state.get("max_iterations", 5) + prompt = f""" + <|system|> + You are an excellent and concise summary maker. + <|end|> + <|user|> + Summarize in bullet points the conversation below. 
+ Follow this format: input - answer + Conversation: {full_context} + <|end|> + <|assistant|>""" + summary = GENERATIVE_MODEL.invoke(prompt).strip() # central task + if not summary or not isinstance(summary, str) or summary.strip() == "": + summary = f"- {state['input']} - No valid answer generated" + should_continue="end" if result_count >= target or \ + iterations >= max_iterations else "retrieve" + updated_history = existing_history + [summary] # update chat_history + print(f"\nChat history in summarize: {updated_history}") + return {"input": state["input"], + "answer": summary, + "should_continue": should_continue, + "context": state.get("context", []), + "digested_context": state.get("digested_context", []), + "iterations": iterations, + "max_iterations": max_iterations, + "result_count": result_count, + "target": target, + "chat_history": updated_history, + "seen_documents": state.get("seen_documents", [])} + + def check_relevance(self, state:State) -> dict: + # Define node to check relevance of retrieved data + context = "\n".join(state.get("digested_context", [])) + result_count = state.get("result_count", 0) + target = state.get("target", 3) + iterations = state.get("iterations", 0) + max_iterations = state.get("max_iterations", 5) + seen_documents = state.get("seen_documents", []) + prompt = f""" + <|system|> + You are an expert in evaluating data relevance. You do it seriously. + <|end|> + <|user|> + Assess if the provided answer is relevant to the query. + Return only yes or no. Nothing else. 
+ Answer: {state["answer"]} + Query: {state["input"]} + Context: {context} + <|end|> + <|assistant|>""" + assessment = GENERATIVE_MODEL.invoke(prompt).strip() + if assessment=="yes": + result_count = result_count + 1 + should_continue = "summarize" + elif result_count >= target or iterations >= max_iterations: + should_continue = "summarize" + else: + should_continue = "retrieve" + seen_documents.extend([doc.page_content for doc in \ + state.get("context", [])]) + return {"input": state["input"], + "context": state.get("context", []), + "digested_context": state.get("digested_context", []), + "iterations": iterations, + "max_iterations": max_iterations, + "answer": state["answer"], + "result_count": result_count, + "target": target, + "seen_documents": seen_documents, + "chat_history": state.get("chat_history", []), + "should_continue": should_continue} + + def route_manage(self, state: State) -> str: + should_continue = state.get("should_continue", "retrieve") + iterations = state.get("iterations", 0) + max_iterations = state.get("max_iterations", 5) + result_count = state.get("result_count", 0) + target = state.get("target", 3) + context = state.get("context", []) + digested_context = state.get("digested_context", []) + answer = state.get("answer", "") + # Validate state and enforce termination + if iterations >= max_iterations or result_count >= target: + return "summarize" + if should_continue not in ["retrieve", "naturalize", \ + "check_relevance", "analyze", "summarize"]: + return "summarize" # Fallback to summarize + return should_continue + + def initialize_langgraph_chain(self) -> Any: + graph_builder = StateGraph(State) + graph_builder.add_node("manage", self.manage) + graph_builder.add_node("retrieve", self.retrieve) + graph_builder.add_node("naturalize", self.naturalize) + graph_builder.add_node("check_relevance", self.check_relevance) + graph_builder.add_node("analyze", self.analyze) + graph_builder.add_node("summarize", self.summarize) + 
graph_builder.add_edge(START, "manage") + graph_builder.add_edge("retrieve", "naturalize") + graph_builder.add_edge("naturalize", "analyze") + graph_builder.add_edge("analyze", "check_relevance") + graph_builder.add_edge("check_relevance", "manage") + graph_builder.add_edge("summarize", END) + graph_builder.add_conditional_edges( + "manage", + self.route_manage, + {"retrieve": "retrieve", + "naturalize": "naturalize", + "check_relevance": "check_relevance", + "analyze": "analyze", + "summarize": "summarize"}) + graph=graph_builder.compile() + return graph + + async def invoke_langgraph(self, question: str) -> Any: + graph = self.initialize_langgraph_chain() + initial_state = { + "input": question, + "chat_history": [], + "context": [], + "digested_context": [], + "seen_documents": [], + "answer": "", + "iterations": 0, + "result_count": 0, + "should_continue": "retrieve", + "target": 3, # Explain magic number 3 + "max_iterations": 5 # Explain magic number 5 + } + result = await graph.ainvoke(initial_state) # Run graph asynchronously + return result + + + def answer_question(self, question: str) -> Any: + start = time.time() + result = asyncio.run(self.invoke_langgraph(question)) + end = time.time() + print(f'answer_question: {end-start}') + return {"result": result["chat_history"], + "state": result} +``` + +As mentioned above, we quickly spotted the need for the naturalization of RDF triples. This explains the addition of a naturalization node to the graph: + +``` +def naturalize(self, state: State) -> dict: + # Define graph node for RDF naturalization + prompt = f""" + <|im_start|>system + You are extremely good at naturalizing RDF and inferring meaning. + <|im_end|> + <|im_start|>user + Take element in the list of RDF triples one by one and + make it sounds like Plain English. Repeat for each the subject + which is at the start. You should return a list. Nothing else. 
+ List: ["Entity http://genenetwork.org/id/traitBxd_20537 \ + \nhas http://purl.org/dc/terms/isReferencedBy of \ + http://genenetwork.org/id/unpublished22893", "has \ + http://genenetwork.org/term/locus of \ + http://genenetwork.org/id/Rsm10000002554"] + <|im_end|> + <|im_start|>assistant + New list: ["traitBxd_20537 isReferencedBy unpublished22893", \ + "traitBxd_20537 has a locus Rsm10000002554"] + <|im_end|> + <|im_start|>user + Take element in the list of RDF triples one by one and + make it sounds like Plain English. Repeat for each the subject + which is at the start. You should return a list. Nothing else. + List: {state.get("context", [])} + <|im_end|> + <|im_start|>assistant""" + response = GENERATIVE_MODEL.invoke(prompt) + print(f"Response in naturalize: {response}") + if isinstance(response, str): + start=response.find("[") + end=response.rfind("]") + 1 # offset by 1 to make slicing + response=json.loads(response[start:end]) + else: + response=[] + return {"input": state["input"], + "context": state.get("context", []), + "digested_context": response, + "result_count": state.get("result_count", 0), + "target": state.get("target", 3), + "max_iterations": state.get("max_iterations", 5), + "should_continue": "analyze", + "iterations": state.get("iterations", 0), + "chat_history": state.get("chat_history", []), + "answer": state.get("answer", ""), + "seen_documents": state.get("seen_documents", [])} +``` + +The next step is to compare the RAG system (rag_langchain.py) to the agent system (rag_langgraph.py) on some queries. + +* Query 1 + +``` +Question: What is the lod score of BXDPublish_10187_gemma_GWA at D12mit280? + +RAG response: 3.01245 + +Agent response: 5.21 +``` + +Hmm, the result of the RAG system is dubious. I did not expect such a long floating-point number for a LOD score. Fortunately, Pjotr's endpoint is useful to make quick queries against the RDF data. By the way, one could see how complementary SPARQL queries and LLM calls are.
It was very thoughtful to pass through RDF :) + +Using the endpoint, I fetched all LOD scores for BXDPublish_10187_gemma_GWA at D12mit280 and only the result of the agent system turned out to be correct. + +* Query 2 + +``` +Question: I want you look for D12mit280. You are allowed to initiate many rounds of retrieval until you get 20 relevant results. Next, extract the lod score and trait for each result. List for me traits that have a lod score > 4.0. Join to the list the corresponding lod scores so I can confirm. Show results using the following format: trait - lod score + +RAG response: It seems there's a misunderstanding. The provided data doesn't include any results related to "D12mit280". I can only provide information or perform tasks based on the given context and data. If you have another query or need clarification on something else, feel free to ask. + +Agent response: It appears there are no results for D12mit280 with a lod score > 4.0 in the provided entities. If you have more data or different criteria, please let me know so I can assist further. + +``` + +This was surprising. My hypothesis was that the node in charge of retrieval was not fetching relevant documents for the LLM to produce a sound answer. There is only one way to confirm: check the documents retrieved. + +Printing the documents after retrieval showed that I was actually right. I also noticed a couple of other problems in the execution of the program. Nodes were mostly not executed in the order I expected. Consequently, I decided to embark on a long quest of incremental improvements :) + +### Optimize AI system + +A couple of things needed to be optimized. This included retrieval, node orchestration and GPU acceleration. + +* Retrieval + +Let's start with retrieval. I played with different parameters of the retriever.
It was an EnsembleRetriever using both keyword and semantic search as illustrated below: +``` +ensemble_retriever = EnsembleRetriever( + retrievers=[ + self.chroma_db.as_retriever(search_kwargs={"k": 10}), + bm25_retriever, + ], + weights=[0.4, 0.6], + ) +``` +I tried different combinations of weights to arrive at this selection. But more rigorous work needs to be done to systematically identify the best hyperparameters for retrieval. + +* Node orchestration + +Moving to node orchestration. It took me some time and reflection to realize that the nodes I had at the moment only make sense when executed sequentially. Analysis (analyze node) should always be followed by relevance checking (check_relevance node), then summarization (summarize node), in that order. Any other sequence of execution is not useful. I had to modify the code to comply with this and prevent getting into some unnecessary loops :) + +But this also highlighted other limitations of the system: lack of flexibility and lack of autonomy. + +To address the lack of flexibility, I introduced a new node to split a query into multiple queries that can be solved independently and asynchronously. The node split_query works as follows: +``` +def split_query(self, query: str) -> list[str]: + + prompt = f""" + <|im_start|>system + You are a very powerful task generator. + + Split the query into task and context based on tags. + Based on the context, ask relevant questions that help achieve the task. Make sure the subquestions are atomic and do not rely on each other. + Return only the subquestions. + Return strictly a JSON list of strings, nothing else. + <|im_end|> + <|im_start|>user + Query: + Task: Identify traits with a lod score > 3.0 for the marker Rsm10000011643. Tell me what marker Rsm10000011643 is involved in biology. + Context: A trait has a long name and contain generally strings like GWA or GEMMA.
The goal is to know the biological processes which might be related to the marker previously mentioned. + + Result: + <|im_end|> + <|im_start|>assistant + ["What traits (containing GWA or GEMMA) have a lod score > 3.0 at Rsm10000011643?", "Which biological processes are related to Rsm10000011643?"] + <|im_end|> + <|im_start|>user + Query: + {query} + Result: + <|im_end|> + <|im_start|>assistant""" + + with self.generative_lock: + response = GENERATIVE_MODEL.invoke(prompt) + print(f"Subqueries in split_query: {response}") + + if isinstance(response, str): + start = response.find("[") + end = response.rfind("]") + 1 + subqueries = json.loads(response[start:end]) + else: + subqueries = [query] + + return subqueries + +``` + +There is need for another node to reconcile answers generated for each subquery. This motivated the addition of the node finalize: +``` +def finalize(self, query: str, subqueries: list[str], answers: list[str]) -> dict: + + prompt = f""" + <|im_start|>system + You are an experienced biology scientist. Given the subqueries and corresponding answers, generate a comprehensive explanation to address the query using all information provided. + Ensure the response is insightful, concise, and draws logical inferences where possible. + Do not modify entities names such as trait and marker. + Make sure to link based on what is common in the answers. + Provide only the story, nothing else. + Do not repeat answers. Use only 200 words max. + <|im_end|> + <|im_start|>user + Query: + Identify two traits related to diabetes. + Compare their lod scores at Rsm149505. + Subqueries: + ["Identify two traits related to diabetes", + "Compare lod scores of same traits at Rsm149505"] + Answers: + ["Traits A and B are related to diabetes", \ + "The lod score at Rsm149505 is 2.3 and 3.4 for trait A and B"] + Conclusion: + <|im_end|> + <|im_start|>assistant + Traits A and B are related to diabetes and have a lod score of\ + 2.3 and 3.4 at Rsm149505. 
The two traits could interact via a\ + gene close to the marker Rsm149505. + <|im_end|> + <|im_start|>user + Query: + {query} + Subqueries: + {subqueries} + Answers: + {answers} + Conclusion: + <|im_end|> + <|im_start|>assistant""" + with self.generative_lock: + response = GENERATIVE_MODEL.invoke(prompt) + print(f"Response in finalize: {response}") + + final_answer = ( + response + if response + else "Sorry, we are unable to \ + provide an overall feedback due to lack of relevant data." + ) + + return final_answer +``` + +The system could now take a multi-faceted query, split it into multiple subqueries, and address each one of them asynchronously by running retrieval, analysis, relevance checking and summarization in sequence. Results are combined at the end before giving feedback to the user. + +Time to make the system really agentic - so far it is not truly agentic because of the lack of autonomy! An agentic system requires access to many tools and a core LLM that can reason on its own about the sequence of tools to call in order to solve a problem. This sounds scary, but it is not if well designed :) I was also planning to add some safeguards to prevent infinite looping that could consume a lot of tokens very quickly. + +What I did was to register the graph I had so far as a subgraph of a bigger graph (the real AI system). This arm of the AI system is called researcher and has the following definition: +``` +def researcher(self, state: AgentState) -> Any: + if len(state.messages) < 3: + input = state.messages[0] + else: + input = state.messages[-1] + input = input.content + logging.info(f"Input in researcher: {input}") + result = self.manage_subtasks(input) + end = time.time() + logging.info(f"Result in researcher: {result}") + + return { + "messages": [result], + } +``` + +I also designed a planner, reflector and supervisor that the system can use. As the names indicate, the planner helps with planning the steps to take to solve the problem.
The reflector provides feedback and helps improve the output of the researcher. The supervisor is the core handler. It manages interactions between planner, researcher and reflector. + +You can inspect the design code for planner, reflector and supervisor below: +``` +def planner(self, state: AgentState) -> Any: + input = [self.plan_system_prompt] + state.messages + result = plan(background=input) + answer = result.get("answer") + return { + "messages": [answer], + } + +def reflector(self, state: AgentState) -> Any: + trans_map = {AIMessage: HumanMessage, HumanMessage: AIMessage} + translated_messages = [self.refl_system_prompt, state.messages[0]] + [ + trans_map[msg.__class__](content=msg.content) for msg in state.messages[1:] + ] + result = tune(background=translated_messages) + answer = result.get("answer") + answer = ( + "Progress has been made. Use now all the resources to address this new suggestion: " + + answer + ) + return { + "messages": [HumanMessage(answer)], + } + +def supervisor(self, state: AgentState) -> Any: + messages = [ + ("system", self.sup_system_prompt1), + *state.messages, + ("system", self.sup_system_prompt2), + ] + + if len(messages) > self.max_global_visits: + return {"next": "end"} + + result = supervise(background=messages) + next = result.get("next") + + return { + "next": next, + } +``` + +* GPU acceleration + +The last point is GPU acceleration. Pjotr installed a GPU on balg01 to allow for acceleration. You can check out the details here: +=> https://issues.genenetwork.org/topics/systems/linux/GPU-on-balg01 + +The GPU is automatically used for LLM-related work. At first, I simply started using it. Later, I learnt about SGLang, which allows for deployment of an LLM server and even faster inference. Code for deployment of the server is here: +=> https://github.com/johanmed/gn-rag/blob/543a7835f5620a541cdb679b852c91e62bca2698/src/agent_system/config.sh + +With DSPy, I could literally switch between any model, closed or open.
Consequently, I added support for DSPy. For details, check out the following commit: +=> https://github.com/johanmed/gn-rag/commit/ec0d8ffc174cca0ccf32cb98d82ebdc7106b4ac2 + +Small gotcha, for locally served model using SGLang, not all open models could be run given VRAM (GPU's RAM) constraint. Took me some experiments to find workable models that are finetuned for instruction following and have decent performance. At the time of writing, I am working with Qwen/Qwen2.5-7B-Instruct accessed via HuggingFace. This is the LLM. There is also an embedding model but I have not added GPU acceleration support for it to improve memory management. We have limited resources for now :) + +I also performed a series of refactoring and formatting to improve readability of the source code. Find it here: +=> https://github.com/johanmed/gn-rag/tree/main/src + +### Scale analysis to more data diff --git a/topics/ai/ontogpt.gmi b/topics/ai/ontogpt.gmi new file mode 100644 index 0000000..94bd165 --- /dev/null +++ b/topics/ai/ontogpt.gmi @@ -0,0 +1,7 @@ +# OntoGPT + +python3 -m venv ~/opt/ontogpt +~/opt/ontogpt/bin/python3 -m pip install ontogpt + + +runoak set-apikey -e openai diff --git a/topics/authentication/architecture.gmi b/topics/authentication/architecture.gmi index 931f9cb..2200745 100644 --- a/topics/authentication/architecture.gmi +++ b/topics/authentication/architecture.gmi @@ -54,13 +54,14 @@ Users are granted privileges (see "Privileges" section) to act upon resources, t Examples of "types" of resources on the system: -- system: The system itself -- group: Collection of users considered a group -- genotype: A resource representing a genotype trait -- phenotype: A resource representing a phenotype trait -- mrna: A resource representing a collection of mRNA Assay traits -- inbredset-group: A resource representing an InbredSet group - +* system: The system itself +* group: Collection of users considered a group +* genotype: A resource representing a genotype trait +* 
phenotype: A resource representing a phenotype trait +* mrna: A resource representing a collection of mRNA Assay traits +* inbredset-group: A resource representing an InbredSet group + +---- * TODO: Figure out a better name/description for "InbredSet group" -- so far, I have "a classification/grouping of traits/datasets within a species". Another is to use the term "population". ## Users diff --git a/topics/authentication/development-guide.gmi b/topics/authentication/development-guide.gmi new file mode 100644 index 0000000..840c26b --- /dev/null +++ b/topics/authentication/development-guide.gmi @@ -0,0 +1,60 @@ +# GN-AUTH FAQ + +## Tags + +* type: docs, documentation +* status: ongoing, open +* keywords: authentication, authorisation, docs, documentation +* author: @jnduli + +## Quick configuration for local development + +Save a `local_settings.conf` file that has the contents: + +``` +SQL_URI = "mysql://user:password@localhost/db_name" # mysql uri +AUTH_DB = "/absolute/path/to/auth.db/" # path to sqlite db file +GN_AUTH_SECRETS = "/absolute/path/to/secrets/secrets.conf" +``` + +The `GN_AUTH_SECRETS` path has two functions: + +* It contains the `SECRET_KEY` we use in our application +* The folder containing this file is used to store our jwks. + +An example is: + +``` +SECRET_KEY = "qQIrgiK29kXZU6v8D09y4uw_sk8I4cqgNZniYUrRoUk" +``` + +## Quick set up cli commands + +``` +export FLASK_DEBUG=1 AUTHLIB_INSECURE_TRANSPORT=1 OAUTHLIB_INSECURE_TRANSPORT=1 FLASK_APP=gn_auth/wsgi +export GN_AUTH_CONF=/absolute/path/to/local_settings.conf +flask init-dev-clients --client-uri "http://localhost:port" +flask init-dev-users +flask assign-system-admin 0ad1917c-57da-46dc-b79e-c81c91e5b928 +``` + +## Handling verification for users in local development + +* Run `flask init-dev-users`, which will create a verified local user. +* Run `UPDATE users set verified=1` on the sqlite3 auth database.
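
The second option can also be scripted. Below is a minimal sketch using Python's standard `sqlite3` module. It assumes only what the statement above implies (a `users` table with a `verified` column); the function name is hypothetical and not part of gn-auth.

```python
import sqlite3


def mark_all_users_verified(conn: sqlite3.Connection) -> int:
    """Set verified=1 for every row in `users`; return the number of rows updated."""
    with conn:  # commits on success, rolls back if the statement fails
        cursor = conn.execute("UPDATE users SET verified = 1")
    return cursor.rowcount
```

To use it, open the connection with `sqlite3.connect()` on the file configured as `AUTH_DB`. Note that this marks every user as verified, so it is only suitable for local development.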
+ +## Errors related to unsupported clients/redirect URIs for client + +Rerun + +``` +FLASK_DEBUG=1 AUTHLIB_INSECURE_TRANSPORT=1 OAUTHLIB_INSECURE_TRANSPORT=1 \ + GN_AUTH_CONF=/absolute/path/to/local_settings.conf FLASK_APP=gn_auth/wsgi \ + flask init-dev-clients --client-uri "http://localhost:port_you_use_for_gn2" +``` + +This will update your clients list to have all the related urls we want. + +## 500 Server Error: INTERNAL SERVER ERROR + +When you see the error: `500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:8081/auth/token`, restart the gn2 server. diff --git a/topics/authentication/permission_hooks.gmi b/topics/authentication/permission_hooks.gmi new file mode 100644 index 0000000..dd475b6 --- /dev/null +++ b/topics/authentication/permission_hooks.gmi @@ -0,0 +1,62 @@ +# Permission Hooks System Design +## Status: Draft + +## Objective + +We want to achieve: + +- Default permissions for users that come from `.edu` domains. +- Support for visitors to the website. + +This should be dynamic and easily maintainable. + +## Design + +### Events + +* Use middleware to plug into the various aspects of a request's life cycle. We'll plug into `after_request` for providing default permissions. +* Create a hook which contains: the event to handle, what part of the life cycle the hook plugs into and the actual functions to call. +* Events can be identified using their `request.base_url` parameter. +* Each hook registers itself to the global set of hooks (TODO: Figure out how to automatically handle the registration). + + +``` +@app.after_request +def handle_hooks(response): + # Flask passes the response to after_request handlers and expects it back + for hook in hooks: + if hook.lifecycle() == "after_request" and hook.can_handle(): + hook.run() + return response + + +class RegistrationHook: + + def can_handle(self): + # base_url is the full URL, so match on its trailing path + return request.base_url.endswith("/register") + + def lifecycle(self): + return "after_request" + + def run(self): + ... + + +# Defined after the class so the name exists; further hooks get added here +hooks = [RegistrationHook()] 
+``` + +### Privilege Hooks + +* After login/registration, use the email to get extra privileges assigned to a user. We use `login` too to ensure that all users have the most up-to-date roles and privileges. +* This means that any user gets assigned these privileges and normal workflows can happen. + +### Storage + +* Create a new role that contains the default privileges we want to assign to users depending on their domain. +* This role will link up with the privileges to be assigned to said user. +* Example privileges we may want to add to users in the `.edu` domain: + * group:resource:edit-resource + * system:inbreadset:apply-case-attribute-edit + * system:inbreadset:edit-case-attribute + * system:inbreadset:view-case-attribute +* Create an extra table that provides a link between some `email identifier` and the role we'd like to pre-assign. We can use python regex for the email identifier e.g. `*.edu$` or `*.utsch.edu`. +* This will be the table used by the Registration Hook. +* This also allows us to edit roles/privileges without code releases. diff --git a/topics/biohackathon/biohackrxiv2024.gmi b/topics/biohackathon/biohackrxiv2024.gmi new file mode 100644 index 0000000..a159ec4 --- /dev/null +++ b/topics/biohackathon/biohackrxiv2024.gmi @@ -0,0 +1,7 @@ +# BioHackRxiv + +We have a hacking week in Barcelona to work on BioHackRXiv. + +# Tasks + +* [ ] ORCIDs for authors in PDF diff --git a/topics/R-qtl2-format-notes.gmi b/topics/data/R-qtl2-format-notes.gmi index e0109b1..3397b5e 100644 --- a/topics/R-qtl2-format-notes.gmi +++ b/topics/data/R-qtl2-format-notes.gmi @@ -1,4 +1,4 @@ -# R/qtl2 Format Notes +# R/qtl2 and GEMMA Format Notes This document is mostly to help other non-biologists figure out their way around the format(s) of the R/qtl2 files. It mostly deals with the meaning/significance of the various fields. 
@@ -12,6 +12,39 @@ and We are going to consider the "non-transposed" form here, for ease of documentation: simply flip the meanings as appropriate for the transposed files. +To convert between formats we should probably use python as that is what can use as 'esperanto'. + +## Control files + +Both GN and R/qtl2 have control files. For GN it basically describes the individuals (genometypes) and looks like: + +```js +{ + "mat": "C57BL/6J", + "pat": "DBA/2J", + "f1s": ["B6D2F1", "D2B6F1"], + "genofile" : [{ + "title" : "WGS-based (Mar2022)", + "location" : "BXD.8.geno", + "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44", + ...]}]} +``` + +In gn-guile this gets parsed in gn/data/genotype.scm to fetch the individuals that match the genotype and phenotype layouts. + +## pheno files and phenotypes + +The standard GEMMA input files are not very good for trouble shooting. R/qtl2 has at least the individual or genometype ID for every line: + +``` +id,bolting_days,seed_weight,seed_area,ttl_seedspfruit,branches,height,pc_seeds_aborted,fruit_length +MAGIC.1,15.33,17.15,0.64,45.11,10.5,NA,0,14.95 +MAGIC.2,22,22.71,0.75,49.11,4.33,42.33,1.09,13.27 +MAGIC.3,23,21.03,0.68,57,4.67,50,0,13.9 +``` + +This is a good standard and can match with the control files. + ## geno files > The genotype data file is a matrix of individuals × markers. The first column is the individual IDs; the first row is the marker names. @@ -22,10 +55,6 @@ For GeneNetwork, this means that the first column contains the Sample names (pre The first column of the gmap/pmap file contains genetic marker values. There are no Individuals/samples (or strains) here. 
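Since Python is suggested above as the conversion language, here is a minimal sketch of reading the R/qtl2 pheno layout described earlier (individual ID in the first column, phenotype names in the first row). The function name is hypothetical, the sample is an abbreviated version of the MAGIC rows above, and treating `NA` as the missing-value marker is an assumption: in real R/qtl2 data that marker is declared in the control file.

```python
import csv
import io


def read_rqtl2_pheno(handle, na_strings=("NA", "")):
    """Return {individual_id: {phenotype_name: float or None}}."""
    # Skip blank lines and '#' comment lines before parsing the CSV
    lines = (line for line in handle
             if line.strip() and not line.startswith("#"))
    reader = csv.reader(lines)
    header = next(reader)
    names = header[1:]  # first header cell is the ID column label
    table = {}
    for row in reader:
        values = [None if cell in na_strings else float(cell)
                  for cell in row[1:]]
        table[row[0]] = dict(zip(names, values))
    return table


sample = io.StringIO(
    "id,bolting_days,seed_weight,height\n"
    "MAGIC.1,15.33,17.15,NA\n"
    "MAGIC.2,22,22.71,42.33\n"
)
pheno = read_rqtl2_pheno(sample)
print(pheno["MAGIC.1"]["height"])        # None (NA is treated as missing)
print(pheno["MAGIC.2"]["bolting_days"])  # 22.0
```

Keying on the individual ID is what makes it possible to match rows against the sample list in a control file.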
-## pheno files - -The first column is the list of individuals (samples/strains) whereas the first column is the list of phenotypes. - ## phenocovar files These seem to contain extra metadata for the phenotypes. diff --git a/topics/data/epochs.gmi b/topics/data/epochs.gmi new file mode 100644 index 0000000..3e8b676 --- /dev/null +++ b/topics/data/epochs.gmi @@ -0,0 +1,153 @@ +# Epochs + +In the 2019 BXD paper epochs are brought up. Basically, even though the BXD are 'immortal' with identical children, mutations do creep in. An epoch is a period of mice and we track the years a mouse was used. For example, BXD1 breeding started in 1971 and production in 2001. In GN we don't make a distinction (per se), but obviously these are (slightly) different mice today. Ashbrook et al. find some interesting results that differ between epochs. + +In GN epochs are currently handled as a trait. This can help with covariate mapping. For a different epoch, however, the genotypes should also be adapted. The effect on the kinship matrix will be minor, but genotypes can be used for fine mapping. With pangenome derived genotypes it should get even more interesting. + +# Fetching data + +The epochs are tracked in a spreadsheet. According to its change tracking only one item was changed in two years - BXD10 was marked as extinct. 
+ +In the GN SQL database Epoch with its RRID is stored as a CaseAttribute: + +``` +MariaDB [db_webqtl]> select * from CaseAttribute LIMIT 3; ++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| InbredSetId | CaseAttributeId | Name | Description + | ++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| 1 | 1 | Status | Live= Available at JAX, Cryo=Cryopreserved only, Extinct + | +| 1 | 36 | RRID | Research resource identifier given by SciCrunch.org + | +| 1 | 37 | Epoch | BXD family subgroups. Each number with common parents. Epoch1(BXD1-32), Epoch2-6 (BXD33-220). See Ashbrook et al. https://pubmed.ncbi.nlm.nih.gov/33472028/ | ++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+ +``` + +And + +``` +MariaDB [db_webqtl]> select * from CaseAttributeXRefNew LIMIT 40; ++-------------+----------+-----------------+------------+ +| InbredSetId | StrainId | CaseAttributeId | Value | ++-------------+----------+-----------------+------------+ +| 1 | 1 | 1 | Live | +| 1 | 1 | 36 | JAX:100006 | +| 1 | 1 | 37 | 0 | +| 1 | 1 | 40 | | +| 1 | 2 | 1 | Live | +| 1 | 2 | 36 | JAX:000664 | +| 1 | 2 | 37 | 0 | +| 1 | 2 | 40 | 69 | +| 1 | 3 | 1 | Live | +| 1 | 3 | 36 | JAX:000671 | +| 1 | 3 | 37 | 0 | +| 1 | 3 | 40 | 108 | +| 1 | 4 | 1 | Live +``` + +I am not going to comment on this table architecture, other than that RDF is a much better fit. + +For extracting this data, the SQL table is probably the best source of 'truth' as it is seen by users on a regular basis. But, at this point, we'll just use the spreadsheet. 
Generating something like: + +``` +gn:Bxd14 + dct:description "BXD014/TyJ" ; + gnt:epoch 1 ; + gnt:availability "Cryorecovery" ; + gnt:method "B6 female to D2 male F2 intercross" ; + gnt:M_origin "B6" ; + gnt:Y_origin "D2" ; + gnt:JAX "000329" ; + gnt:start_year 1971 ; + gnt:age_seq_ind 271 ; + gnt:birth_seq_ind "2/18/2016" ; + gnt:availability_2023 "Cryorecovery" ; + gnt:has_genotypes true ; + rdfs:label "BXD14" . +gn:Bxd65 + dct:description "BXD065/RwwJ" ; + gnt:epoch 3 ; + gnt:availability "Available" ; + gnt:method "Advanced intercross progeny of B6 female to D2 male" ; + gnt:M_origin "B6" ; + gnt:Y_origin "D2" ; + gnt:JAX "007110" ; + gnt:start_year 1999 ; + gnt:age_seq_ind 46 ; + gnt:birth_seq_ind "9/18/2016" ; + gnt:availability_2023 "Available" ; + gnt:has_genotypes true ; + rdfs:label "BXD65" . +etc. +``` + +# Approach + +## Fetching data + +To get at the epochs we'll need to fetch the sample/ind names (such as BXD73b) from GN. + +For every dataset we can fetch samples+values with + +``` +curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json +{"BXD40":-1.631969,"BXD68":-2.721761,"BXD43":-2.290135,"BXD44":-2.512057,"BXD48":-3.128819 ... +``` + +These are also stored in the pangemma output lmdb files. We don't want to store all values in RDF as these are only used for compute and can be easily fetched on demand from GN. We do want to access the sample names, but that is a list that is not necessarily unique to a single trait. In fact a trait should be referencing an experiment/dataset that has the samples/inds. Usually they will use the same animals. To not complicate things we'll just point to the samples with something like + +``` +traitid gn:sample gn:BXD40 . +``` + +Currently RDF contains + +``` +gn:Bxd12 rdfs:label "BXD12" . +gn:Bxd12 rdf:type gnc:strain . +gn:Bxd12 gnt:belongsToSpecies gn:Mus_musculus . +``` + +and traits have + +``` +gn:traitBxd_10002 rdf:type gnc:Phenotype . +gn:traitBxd_10002 gnt:belongsToGroup gn:setBxd . 
+gn:traitBxd_10002 gnt:traitId "10002" . +gn:traitBxd_10002 skos:altLabel "BXD_10002" . +gn:traitBxd_10002 dct:description "Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]" . +gn:traitBxd_10002 gnt:abbreviation "ADJCBLWT" . +gn:traitBxd_10002 gnt:submitter "robwilliams" . +gn:traitBxd_10002 gnt:mean "52.22058767430923"^^xsd:double . +gn:traitBxd_10002 gnt:locus gn:Rsm10000005699 . +gn:traitBxd_10002 gnt:lodScore "4.779380894726979"^^xsd:double . +gn:traitBxd_10002 gnt:additive "2.0817857571428617"^^xsd:double . +gn:traitBxd_10002 gnt:sequence "1"^^xsd:integer . +gn:traitBxd_10002 dct:isReferencedBy pubmed:11438585 . +``` + +ignore the capitalization and some naming - gnc:strain should be gnc:sample - we'll fix that. But for now we can find some trait info and we can link the individuals up with a trait. + +The query we want to write is something like + +``` +SELECT * WHERE { + ?traitid a gnc:Phenotype; + gnt:traitId "10002" ; + gnt:belongsToGroup gn:setBxd ; + gnt:traitId ?trait ; + dct:isReferencedBy ?pubmed . + OPTIONAL { + ?traitid dct:description ?descr ; + gnt:sample_id ?sampleid . + ?sampleid rdfs:label ?sample . + } +} LIMIT 10 +``` + +So, for every trait/sample combination we need to add + +``` +gn:traitBxd_10002 gnt:sample_id gn:Bxd12 . +``` diff --git a/topics/data/precompute/steps.gmi b/topics/data/precompute/steps.gmi index 75e3bfd..ac03d1a 100644 --- a/topics/data/precompute/steps.gmi +++ b/topics/data/precompute/steps.gmi @@ -2,7 +2,8 @@ At this stage precompute fetches a trait from the DB and runs GEMMA. Next it tar balls up the vector for later use. It also updates the database with the latest info. -To actually kick off compute on machines that do not access the DB I realize now we need a step-wise approach. Basically you want to shift files around without connecting to a DB. And then update the DB whenever it is convenient. So we are going to make it a multi-step procedure. 
I don't have to write all code because we have a working runner. I just need to chunk the work. +To actually kick off compute on machines that do not access the DB I realize now we need a step-wise approach. Basically you want to shift files around without connecting to a DB. And then update the DB whenever it is convenient. So we are going to make it a multi-step procedure. +We need to chunk the work. We will track precompute steps here. We will have: @@ -13,8 +14,18 @@ We will track precompute steps here. We will have: Trait archives will have steps for * [X] step p1: list-traits-to-compute -* [+] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper -* [ ] step p3: gemma-to-lmdb: create a clean vector +* [X] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper +* [X] step p3: gemma-to-lmdb: create a clean vector + +Start precompute + +* [ ] Fetch traits on tux04 +* [ ] Set up runner on tux04 and others +* [ ] Run on Octopus + +Work on published data + +* [ ] Fetch traits The DB itself can be updated from these @@ -22,8 +33,11 @@ The DB itself can be updated from these Later +* [ ] Rqtl2: Compute Rqtl2 vector * [ ] bulklmm: Compute bulklmm vector +Interestingly this work coincides with Arun's work on CWL. Rather than trying to write a workflow in bash, we'll use ccwl and accompanying tools to scale up the effort. + # Tags * assigned: pjotrp @@ -36,10 +50,10 @@ Later * [ ] Check Artyoms LMDB version for kinship and maybe add LOCO * [+] Create JSON metadata controller for every compute incl. 
type of content -* [+] Create genotype archive -* [+] Create kinship archive +* [X] Create genotype archive +* [X] Create kinship archive * [+] Create trait archives -* [+] Kick off lmm9 step +* [X] Kick off lmm9 step * [ ] Update DB step v1 # Step p1: list traits to compute @@ -62,7 +76,7 @@ At this point we can write {"2":9.40338,"3":10.196,"4":10.1093,"5":9.42362,"6":9.8285,"7":10.0808,"8":9.17844,"9":10.1527,"10":10.1167,"11":9.88551,"13":9.58127,"15":9.82312,"17":9.88005,"19":10.0761,"20":10.2739,"21":9.54171,"22":10.1056,"23":10.5702,"25":10.1433,"26":9.68685,"28":9.98464,"29":10.132,"30":9.96049,"31":10.2055,"35":10.1406,"36":9.94794,"37":9.96864,"39":9.31048} ``` -Note that it (potentially) includes the parents. Also the strain-id is a string and we may want to plug in the strain name. To allow for easy comparison downstream. Finally we may want to store a checksum of sorts. In Guile this can be achieved with: +Note that it (potentially) includes the parents and that is corrected when generating the phenotype file for GEMMA. Also the strain-id is a string and we may want to plug in the strain name, to allow for easy comparison downstream. Finally we may want to store a checksum of sorts. In Guile this can be achieved with: ```scheme (use-modules (rnrs bytevectors) diff --git a/topics/database/mariadb-database-architecture.gmi b/topics/database/mariadb-database-architecture.gmi new file mode 100644 index 0000000..0454d71 --- /dev/null +++ b/topics/database/mariadb-database-architecture.gmi @@ -0,0 +1,830 @@ +# MariaDB Database Architecture + +The GeneNetwork database is running on MariaDB and the layout is almost carved in stone because so much code depends on it. +We are increasingly moving material out into lmdb (genotypes and phenotypes) and virtuoso (all types of metadata), but this proves a lengthy and rather tedious process. We also run redis for caching, sqlite for authentication, and xapian for search! 
+ +In this document we'll discuss where things are, where they ought to go, and how the nomenclature should change. + +An SVG of the SQL layout can be found here + +=> https://raw.githubusercontent.com/genenetwork/gn-gemtext-threads/main/topics/database/sql.svg + +# Nomenclature + +These are the terms we use + +* Genotypes +* Case or genometype: individual, strain, sample +* ProbeData: Now almost obsolete. [Comment by RWW perhaps for a footnote: In GeneNetwork 1 we had built and maintained a table for individual "Probe level" data simply because the Affymetrix data sets were so large. For example, the BXD Family: "UMUTAffy Hippocampus Exon 9Feb09)RMA" array data consists of 1.236 million "probesets" each of which is a summary of many individual probe assays (ProbeData), a total of 4.5 million probes (see https://www.thermofisher.com/order/catalog/product/900817). In GN1 we built a special interface to interrogate these 4.5 million individual probes--extremely useful for studying the fine-structure of mRNA expression. We thought it best to split these very large "probe-level" data sets from the much smaller and more widely used "ProbeSetData". The term "Probe" in this particular context (Affymetrix Exon arrays) refers to short nucleotide probes used by Affymetrix and other microarray vendors. Affymetrix "Exon"-type arrays consist of 25 nt hybridization probes that target relatively specific parts of RNAs--mainly exons but also many intronic sequences.] +* ProbeSetData: trait/sample values almost exclusively used for molecular data types (mRNA, protein, methylation assays, metabolomics, etc). [Comment by RWW perhaps for a footnote: The term "ProbeSetData" should ideally be changed to "High_Content_Data_Assays". In 2003 the only high content data assays we had were Affymetrix microarrays that measured mRNA level, and the vendor called their assays "ProbeSets". We used this now obsolete term. 
Most ProbeSetData in GN1 and GN2 as of 2024 are measurements of molecular traits that can be tagged to a single genome location: the location of the gene from which the mRNA and its derivative protein are transcribed and translated, or in the case of epigenomic studies, the site at which the genome is methylated. When these three types of molecular traits are mapped, we typically add a mark to all graphic output maps that highlights the location of the "parent" gene. For example, the sonic hedgehog gene in mice is located on chromosome 5 at about 28.457 Mb on the mm10 assembly (aka GRCm38). When we measure the expression of Shh mRNA, we place a purple triangle at the coordinate of the Shh gene. Two notes: 1. There are at least three ProbeSetData types that do NOT have parent genes--metabolomic data, metagenomic data, and new high-content brain connectome data. When we do NOT know the location of a parent gene, we should NOT place any mark along the X-axis. 2. Ideally GN databases would define the TYPE of high-content data, so that the code could fork to the correct GUI for that particular data type. Connectome data for the brain is an example of a data type that is very large (40,000 measurements per brain), that is truly high-content data, but that is NOT molecular. Time series data may also fall into this category.] +* ProbeSetFreeze: points to datasets + +## More on naming + +Naming convention-wise there is a confusing use of id and data-id in particular. We should stick to the table-id naming. + +# The small test database (2GB) + +The default install comes with a smaller database which includes a +number of the BXDs and the Human liver dataset (GSE9588). + +It can be downloaded from: + +=> https://files.genenetwork.org/database/ + +Try the latest one first. 
+ +# GeneNetwork database + +Estimated table sizes with metadata comment for the important tables + +select table_name,round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` from information_schema.TABLES where table_schema = "db_webqtl" order by data_length; + +``` ++-------------------------+------------+ +| table_name | Size in MB | Should be named: ++-------------------------+------------+ +| PublishData | 22.54 | ClassicTraitValues <- data-id, strain-id, value (3M traits) +| PublishSE | 4.71 | ClassicTraitValueError (300K traits) <- data-id, strain-id, value +| PublishXRef | 2.18 | List of publications <- id, data-id, inbred-id, pheno-id, pub-id +| ProbeSetData | 59358.80 | BulkTraitValues <- id, strain, value +| ProbeSetSE | 14551.02 | BulkTraitValueError <- SE values aligns with ProbeSetData +| ProbeSetXRef | 4532.89 | PrecomputedLRS <- precomputed LRS values, pointing to dataset+trait +| ProbeSet | 2880.21 | ProbeSetInfo <- over utilized mRNA probeset description, e.g. 100001_at comes with sequence info +| ProbeSetFreeze | 0.22 | DatasetInfo <- dataset description, e.g. "Hippocampus_BXD_Jun06" - probesetfreezeid points to dataset, shortname, public? +| Probe | 2150.30 | ProbeInfo <- Probe trait info incl sequence, id, probeset-id +| ProbeFreeze | 0.06 | Dataset names <- Similar to ProbesetFreeze, id, chip-id, inbredset-id, tissue-id +| Phenotype | 6.50 | PhenotypeMeta <- "Hippocampus weight", id, prepublish short-name, postpublish short-name +| ProbeXRef | 743.38 | ProbeFreezeDataIDs <- link ProbeFreeze-Id,Probe-Id with Data-Id +| Datasets | 2.31 | DatasetMeta <- "Data generated by...", investigator-id, publication title +| NStrain | 4.80 | StrainCountDataId <- Strains used in dataset, count, strain-id, data-id +| Strain | 1.07 | StrainNames <- with species ID and alias, id, species-id, name +| TissueProbeSetData | 74.42 | <- link Id,TissueID with value +| TissueProbeSetXRef | 14.73 | TissueGeneTable? 
<- data-id, gene-id, mean, symbol, TissueProbeSetFreezeId | ProbesetId | DataId +| TissueProbeSetFreeze | 0.01 | tissueprobefreeze-id +| InbredSet | 0.01 | InbredSetMeta -> Id,SpeciesId,FullName +| ProbeData | 22405.44 | (OLD?) mRNAStrainValues used for partial correlations <- id, strain, value = individual probe data (mRNA) [GN1,GN3] +| ProbeSE | 6263.83 | (OLD?) Trait Error <- trait SE aligns with ProbeData? [GN3] ++-------------------------+------------+ +``` +Less commonly used tables: + +``` ++-------------------------+------------+ +| table_name | Size in MB | ++-------------------------+------------+ +| LCorrRamin3 | 18506.53 | +| SnpAll | 15484.67 | +| SnpPattern | 9177.05 | +| QuickSearch | 5972.86 | +| GenoData | 3291.91 | Strain by genotype - only used in GN1 +| CeleraINFO_mm6 | 989.80 | +| pubmedsearch | 1032.50 | +| GeneRIF_BASIC | 448.54 | +| BXDSnpPosition | 224.44 | +| EnsemblProbe | 133.66 | +| EnsemblProbeLocation | 105.49 | +| Genbank | 37.71 | +| AccessLog | 42.38 | +| GeneList | 34.11 | +| Geno | 33.90 | Marker probe info (incl. 
sequence) +| MachineAccessLog | 28.34 | +| IndelAll | 22.42 | +| ProbeH2 | 13.26 | +| GenoXRef | 22.83 | +| TempData | 8.35 | +| GeneList_rn3 | 5.54 | +| GORef | 4.97 | +| temporary | 3.59 | +| InfoFiles | 3.32 | +| Publication | 3.42 | +| Homologene | 5.69 | +| GeneList_rn33 | 2.61 | +| GeneRIF | 2.18 | +| Vlookup | 1.87 | +| H2 | 2.18 | +| IndelXRef | 2.91 | +| GeneMap_cuiyan | 0.51 | +| user_collection | 0.30 | +| CaseAttributeXRef | 0.44 | +| StrainXRef | 0.56 | +| GeneIDXRef | 0.77 | +| Docs | 0.17 | +| News | 0.17 | +| GeneRIFXRef | 0.24 | +| Sample | 0.06 | +| login | 0.06 | +| user | 0.04 | +| TableFieldAnnotation | 0.05 | +| DatasetMapInvestigator | 0.05 | +| User | 0.04 | +| TableComments | 0.02 | +| Investigators | 0.02 | +| DBList | 0.03 | +| Tissue | 0.02 | +| GeneChip | 0.01 | +| GeneCategory | 0.01 | +| SampleXRef | 0.01 | +| SnpAllele_to_be_deleted | 0.00 | +| Organizations | 0.01 | +| PublishFreeze | 0.00 | +| GenoFreeze | 0.00 | Used for public/private +| Chr_Length | 0.01 | +| SnpSource | 0.00 | +| AvgMethod | 0.00 | +| Species | 0.00 | +| Dataset_mbat | 0.00 | +| TissueProbeFreeze | 0.00 | +| EnsemblChip | 0.00 | +| UserPrivilege | 0.00 | +| CaseAttribute | 0.00 | +| MappingMethod | 0.00 | +| DBType | 0.00 | +| InfoFilesUser_md5 | 0.00 | +| GenoCode | 0.00 | +| DatasetStatus | 0.00 | +| GeneChipEnsemblXRef | 0.00 | +| GenoSE | 0.00 | +| user_openids | 0.00 | +| roles_users | 0.00 | +| role | 0.00 | +| Temp | NULL | ++-------------------------+------------+ +97 rows in set, 1 warning (0.01 sec) +``` + +All *Data tables are large + +## Tables containing trait values + +A trait on GN is defined by a trait-id with a dataset-id. + +=> https://genenetwork.org/show_trait?trait_id=10031&dataset=BXDPublish + +The trait-id can also be a probe name + +=> https://genenetwork.org/show_trait?trait_id=1441566_at&dataset=HC_M2_0606_P + +One of the more problematic aspects of GN is that there are two tables containing trait values (actually there are three!). 
ProbeSetData mostly contains expression data. PublishData contains 'classical' phenotypes. ProbeData is considered defunct. + +So, a set of trait values gets described by the dataset+probe (trait_id) OR by BXDPublish --- which is its own table --- and an identifier, here 10031. + +OK, let's look at the ProbeSetData (expression) traits: + +``` +MariaDB [db_webqtl]> select * from ProbeSetData limit 5; ++----+----------+-------+ +| Id | StrainId | value | ++----+----------+-------+ +| 1 | 1 | 5.742 | +| 1 | 2 | 5.006 | +| 1 | 3 | 6.079 | +| 1 | 4 | 6.414 | +| 1 | 5 | 4.885 | ++----+----------+-------+ +5 rows in set (0.193 sec) +MariaDB [db_webqtl]> select * from ProbeData limit 5; ++--------+----------+---------+ +| Id | StrainId | value | ++--------+----------+---------+ +| 503636 | 42 | 11.6906 | +| 503636 | 43 | 11.4205 | +| 503636 | 44 | 11.2491 | +| 503636 | 45 | 11.2373 | +| 503636 | 46 | 12.0471 | ++--------+----------+---------+ +5 rows in set (0.183 sec) +``` + +ProbeSet describes ProbeSetData. I.e., every probe ID comes with a sequence (microarray) etc. + +As for duplicated data: duplicated or "detached"* data happens sometimes, though that's not related to the PublishData/ProbeSetData distinction (unless this is done deliberately for some reason). I believe that whether data is entered as one or the other primarily comes down to the desire/need to divide it into datasets (or by tissue) within a group (with mRNA expression data just being the most common reason for this). I've encountered a situation before with Arthur where there was data in ProbeSetData that wasn't also in ProbeSetXRef. + +Can you give an example of exactly what you mean? PublishData would be stuff like sex, weight, etc (is this what you mean?) while ProbeSetData is used for mRNA expression data (except for a few situations where it isn't lol). 
+ +That being said, *functionally*, I think the only real distinction (aside from what metadata is displayed) is that "ProbeSet" data has extra levels of "granularity" where it's also organized by tissue type and can be split into "datasets" (while "PublishData" traits are only associated with a Group (InbredSet in DB). That's why some non-mRNA expression data is still classified as "ProbeSet" - I think it's basically just a way to separate it into datasets within a group, often for specific tissues. + +So the organization is something like this: + +``` +Group -> PublishData +Group -> Tissue -> Dataset -> ProbeSetData +``` + +## ProbeData + +[OBSOLETE] ProbeData meanwhile is a table with fine-grained probe level Affymetrix data only. Contains 1 billion rows March 2016. This table may be *deleted* later since it is only used by the Probe Table display in GN1. Not used in GN2 +"ProbeData" should probably be "AssayData" or something more neutral. + +In comparison the "ProbeSetData" table contains more molecular assay data, including probe set data, RNA-seq data, proteomic data, and metabolomic data. 2.5 billion rows March 2016. +ProbeData contains data only for Affymetrix probe level data (e.g. Exon array probes and M430 probes). + +"StrainId" should be "CaseId" or "SampleId" or "GenometypeId", see nomenclature above. + +``` +select * from ProbeData limit 2; ++--------+----------+---------+ +| Id | StrainId | value | ++--------+----------+---------+ +| 503636 | 42 | 11.6906 | +| 503636 | 43 | 11.4205 | ++--------+----------+---------+ +2 rows in set (0.00 sec) + +select count(*) from ProbeData limit 2; ++-----------+ +| count(*) | ++-----------+ +| 976753435 | ++-----------+ +1 row in set (0.00 sec) +``` + +## PublishData + +These are the classic phenotypes under BXDPublish. 
+ +``` +MariaDB [db_webqtl]> select * from PublishData where StrainId=5 limit 5; ++---------+----------+------------+ +| Id | StrainId | value | ++---------+----------+------------+ +| 8967043 | 5 | 49.000000 | +| 8967044 | 5 | 50.099998 | +| 8967045 | 5 | 403.000000 | +| 8967046 | 5 | 45.500000 | +| 8967047 | 5 | 44.900002 | ++---------+----------+------------+ +5 rows in set (0.265 sec) +MariaDB [db_webqtl]> select * from PublishSE where StrainId=5 limit 5; ++---------+----------+-------+ +| DataId | StrainId | error | ++---------+----------+-------+ +| 8967043 | 5 | 1.25 | +| 8967044 | 5 | 0.71 | +| 8967045 | 5 | 8.6 | +| 8967046 | 5 | 1.23 | +| 8967047 | 5 | 1.42 | ++---------+----------+-------+ +5 rows in set (0.203 sec) +MariaDB [db_webqtl]> select * from PublishXRef limit 2; ++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Id | InbredSetId | PhenotypeId | PublicationId | DataId | mean | Locus | LRS | additive | Sequence | comments | ++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| 10001 | 1 | 4 | 116 | 8967043 | 52.13529418496525 | rs48756159 | 13.4974911471087 | 2.39444435069444 | 1 | robwilliams modified post_publication_description at Mon Jul 30 14:58:10 2012 +robwilliams modified post_publication_description at Sat Jan 30 13:48:49 2016 + | +| 10002 | 1 | 10 | 116 | 8967044 | 52.22058767430923 | rsm10000005699 | 22.004269639323 | 2.08178575714286 | 1 | robwilliams modified phenotype at Thu Oct 28 21:43:28 2010 + | 
++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ +2 rows in set (0.328 sec) +``` + +## ProbeSet + + +Comment: PLEASE CHANGE TABLE NAME and rework fields carefully. This is a terrible table but it works well (RWW March 2016). It is used in combination with the crucial TRAIT DATA and ANALYSIS pages in GN1 and GN2. It is also used by annotators using the UPDATE INFO AND DATA web form to correct and update annotation. It is used by Arthur to enter new annotation files and metadata for arrays, genes, proteins, metabolites. The main problem with this table is that it is doing too much work. And it is not doing enough because it is huge, but does not track changes. The plan is to migrate to lmdb for that. + +Initially (2003) this table contained only Affymetrix ProbeSet data for mouse (U74aV2 initially). Many other array platforms for different species were added. At least four other major categories of molecular assays have been added since about 2010: + +1. RNA-seq annotation and sequence data for transcripts using ENSEMBL identifiers or NCBI NM_XXXXX and NR_XXXXX type identifiers + +2. Protein and peptide annotation and sequence data (see BXD Liver Proteome data, SRM and SWATH type data) with identifiers such as "abcb10_q9ji39_t311" for SRM data and "LLGNMIVIVLGHHLGKDFTPAAQAA" for SWATH data where the latter is just the peptide fragment that has been quantified. Data first entered in 2015 for work by Rudi Aebersold and colleagues. + +3. Metabolite annotation and metadata (see BXD Liver Metabolome data) with identifiers that are usually Mass charge ratios such as "149.0970810_MZ" + +4. Epigenomic and methylome data (e.g. 
Human CANDLE Methylation data with identifiers such as "cg24523000") + +It would make good sense to break this table into four or more types of molecular assay metadata or annotation tables (AssayRNA_Anno, AssayProtein_Anno, AssayMetabolite_Anno, AssayEpigenome_Anno, AssayMetagenome_Anno), since these assays will have many differences in annotation content compared to RNAs (RWW). + +Some complex logic is used to update the contents of this table when annotators modify and correct the information (for example, updating gene symbols). These features were requested by Rob so that annotating one gene symbol in one species would annotate all gene symbols in the same species based on a common NCBI GeneID number. For example, changing the gene alias for one ProbeSet.Id will change the list of aliases in all instances with the same gene symbol. + +If the ProbeSet.BlatSeq (or is this ProbeSet.TargetSeq?) is identical between different ProbeSet.Ids then annotation is forced to be the same even if the symbol or GeneId is different. This "feature" was implemented when we found many probe sets with identical sequence but different annotations and identifiers. 
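The propagation rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual GeneNetwork update code: it uses an in-memory SQLite database as a stand-in for MariaDB, a cut-down ProbeSet/GeneChip schema with only the columns needed, illustrative rows, and a hypothetical helper named propagate_alias. It also assumes the species of a probe set is reached through GeneChip.SpeciesId.

```python
import sqlite3


def make_probeset_db():
    """Build a tiny in-memory stand-in for the ProbeSet and GeneChip tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE GeneChip (Id INTEGER PRIMARY KEY, SpeciesId INTEGER);
        CREATE TABLE ProbeSet (
            Id INTEGER PRIMARY KEY,
            ChipId INTEGER,
            Name TEXT,
            Symbol TEXT,
            alias TEXT,
            GeneId TEXT
        );
        """
    )
    # Chips 1 and 2 belong to species 1, chip 3 to species 2 (illustrative).
    conn.executemany("INSERT INTO GeneChip VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])
    conn.executemany(
        "INSERT INTO ProbeSet VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 1, "probe_a", "Arpc2", "old", "76709"),
            (2, 2, "probe_b", "Arpc2", "old", "76709"),
            # Artificial row: same GeneId on another species' chip, only here
            # to show that the species filter leaves it untouched.
            (3, 3, "probe_c", "ARPC2", "old", "76709"),
            (4, 1, "probe_d", "Shh", "old", "20423"),
        ],
    )
    return conn


def propagate_alias(conn, probeset_id, new_alias):
    """Set alias on every ProbeSet row sharing GeneId and species with probeset_id.

    Returns the number of rows updated (0 if the probe set is unknown).
    """
    row = conn.execute(
        """
        SELECT ProbeSet.GeneId, GeneChip.SpeciesId
        FROM ProbeSet JOIN GeneChip ON GeneChip.Id = ProbeSet.ChipId
        WHERE ProbeSet.Id = ?
        """,
        (probeset_id,),
    ).fetchone()
    if row is None or row[0] is None:
        return 0
    gene_id, species_id = row
    cur = conn.execute(
        """
        UPDATE ProbeSet SET alias = ?
        WHERE GeneId = ?
          AND ChipId IN (SELECT Id FROM GeneChip WHERE SpeciesId = ?)
        """,
        (new_alias, gene_id, species_id),
    )
    conn.commit()
    return cur.rowcount
```

Calling propagate_alias(conn, 1, "AK008777") on this toy data updates probe sets 1 and 2 and leaves 3 and 4 alone. The real logic also forces identical annotation for identical sequences, which this sketch does not attempt.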
+ + +``` +select count(*) from ProbeSet limit 5; ++----------+ +| count(*) | ++----------+ +| 4351030 | ++----------+ +| Id | ChipId | Name | TargetId | Symbol | description | Chr | Mb | alias | GeneId | GenbankId | SNP | BlatSeq |TargetSeq | UniGeneId | Strand_Probe | Strand_Gene | OMIM | comments | Probe_set_target_region | Probe_set_specificity | Probe_set_BLAT_score | Probe_set_Blat_Mb_start | Probe_set_Blat_Mb_end | Probe_set_strand | Probe_set_Note_by_RW | flag | Symbol_H | description_H | chromosome_H | MB_H | alias_H | GeneId_H | chr_num | name_num | Probe_Target_Description | RefSeq_TranscriptId | Chr_mm8 | Mb_mm8 | Probe_set_Blat_Mb_start_mm8 | Probe_set_Blat_Mb_end_mm8 | HomoloGeneID | Biotype_ENS | ProteinID | ProteinName | Flybase_Id | HMDB_ID | Confidence | ChEBI_ID | ChEMBL_ID | CAS_number | PubChem_ID | ChemSpider_ID | UNII_ID | EC_number | KEGG_ID | Molecular_Weight | Nugowiki_ID | Type | Tissue | PrimaryName | SecondaryNames | PeptideSequence | ++------+--------+----------+----------+--------+----------------------------------------------+------+-----------+----------+--------+-----------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------+-------------+--------+----------+-----------------
--------+-----------------------+----------------------+-------------------------+-----------------------+------------------+----------------------+------+----------+---------------+--------------+------+---------+----------+---------+----------+--------------------------+---------------------+---------+-----------+-----------------------------+---------------------------+--------------+-------------+-----------+-------------+------------+---------+------------+----------+-----------+------------+------------+---------------+---------+-----------+---------+------------------+-------------+------+--------+-------------+----------------+-----------------+ +| 7282 | 1 | 93288_at | NULL | Arpc2 | actin related protein 2/3 complex, subunit 2 | 1 | 74.310961 | AK008777 | 76709 | AI835883 | 0 | CCGACTTCCTTAAGGTGCTCAACCGGACTGCTTGCTACTGGATAATCGTGAGGGATTCTCCATTTGGGTTCCATTTTGTACGAGTTTGGCAAATAACCTGCAGAAACGAGCTGTGCTTGCAAGGACTTGATAGTTCCTAATCCTTTTCCAAGCTGTTTGCTTTGCAATATGT | ccgacttccttaaggtgctcaaccgtnnnnnnccnannnnccnagaaaaaagaaatgaaaannnnnnnnnnnnnnnnnnnttcatcccgctaactcttgggaactgaggaggaagcgctgtcgaccgaagnntggactgcttgctactggataatcgtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnntgagggattctccatttgggttccattttgtacgagtttggcaaataacctgcagaaacgagctgtgcttgcaaggacttgatagttcctaagaattanaanaaaaaaaanaanttccacttgatcaanttaattcccttttatttttcctccctcantccccttccttttccaagctgtttgctttgcaatatgt | Mm.337038 | + | | 604224 | | NULL | 8.45 | 169 | 74.310961 | 74.31466 | NULL | NULL | 3 | NULL | NULL | NULL | NULL | NULL | NULL | 1 | 93288 | NULL | XM_129773 | 1 | 74.197594 | 74.197594 | 74.201293 | 4187 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 
++------+--------+----------+----------+--------+----------------------------------------------+------+-----------+----------+--------+-----------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------+-------------+--------+----------+-------------------------+-----------------------+----------------------+-------------------------+-----------------------+------------------+----------------------+------+----------+---------------+--------------+------+---------+----------+---------+----------+--------------------------+---------------------+---------+-----------+-----------------------------+---------------------------+--------------+-------------+-----------+-------------+------------+---------+------------+----------+-----------+------------+------------+---------------+---------+-----------+---------+------------------+-------------+------+--------+-------------+----------------+-----------------+ +2 rows in set (0.00 sec) +``` + +** ProbeSetXRef (phenotypes/dataset_name.json) + +For every probe set (read dataset measuring point): + +``` +select * from ProbeSetXRef; +| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | LRS_old | pValue_old | mean | se | Locus | LRS | pValue | additive | h2 | +| 112 | 
123528 | 23439389 | NULL | NULL | NULL | 6.7460707070707 | NULL | rs6239372 | 10.9675593568894 | 0.567 | 0.0448545966228878 | NULL | +| 112 | 123527 | 23439388 | NULL | NULL | NULL | 6.19416161616162 | NULL | rs13476936 | 10.9075670392762 | 0.567 | -0.0358456732993988 | NULL | +``` + +where ProbeSetFreezeId is the dataset (experiment). ProbesetId refers to the probe set information (measuring point). DataId points to the data point. The other values are used for search. It is used in search thus: + +``` +SELECT distinct ProbeSet.Name as TNAME, + ProbeSetXRef.Mean as TMEAN, ProbeSetXRef.LRS as TLRS, + ProbeSetXRef.PVALUE as TPVALUE, ProbeSet.Chr_num as TCHR_NUM, + ProbeSet.Mb as TMB, ProbeSet.Symbol as TSYMBOL, + ProbeSet.name_num as TNAME_NUM +FROM ProbeSetXRef, ProbeSet +WHERE ProbeSet.Id = ProbeSetXRef.ProbeSetId + and ProbeSetXRef.ProbeSetFreezeId = 112 + ORDER BY ProbeSet.symbol ASC limit 5; +| TNAME | TMEAN | TLRS | TPVALUE | TCHR_NUM | TMB | TSYMBOL | TNAME_NUM | +| 1445618_at | 7.05679797979798 | 13.5417452764616 | 0.17 | 8 | 75.077895 | NULL | 1445618 | +| 1452452_at | 7.232 | 30.4944361132252 | 0.0000609756097560421 | 12 | 12.6694 | NULL | 1452452 | +``` + +Probedata - main molecular data. Probesets, metabolome, + +Almost all important molecular assay data is in this table including probe set data, RNA-seq data, proteomic data, and metabolomic data. 2.5 billion rows March 2016. In comparison, ProbeData contains data only for Affymetrix probe level data (e.g. Exon array probes and M430 probes). 
+ +# Strain + +``` +select * from Strain limit 5; ++----+----------+----------+-----------+--------+-------+ +| Id | Name | Name2 | SpeciesId | Symbol | Alias | ++----+----------+----------+-----------+--------+-------+ +| 1 | B6D2F1 | B6D2F1 | 1 | NULL | NULL | +| 2 | C57BL/6J | C57BL/6J | 1 | B6J | NULL | +| 3 | DBA/2J | DBA/2J | 1 | D2J | NULL | +| 4 | BXD1 | BXD1 | 1 | NULL | NULL | +| 5 | BXD2 | BXD2 | 1 | NULL | NULL | ++----+----------+----------+-----------+--------+-------+ +``` + +``` +show indexes from Strain; ++--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | ++--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +| Strain | 0 | PRIMARY | 1 | Id | A | 14368 | NULL | NULL | | BTREE | | | +| Strain | 0 | Name | 1 | Name | A | 14368 | NULL | NULL | YES | BTREE | | | +| Strain | 0 | Name | 2 | SpeciesId | A | 14368 | NULL | NULL | | BTREE | | | +| Strain | 1 | Symbol | 1 | Symbol | A | 14368 | NULL | NULL | YES | BTREE | | | ++--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ + +A typical query may look like + +SELECT Strain.Name, ProbeSetData.value, ProbeSetSE.error, ProbeSetData.Id + FROM (ProbeSetData, ProbeSetFreeze, Strain, ProbeSet, ProbeSetXRef) + left join ProbeSetSE on + (ProbeSetSE.DataId = ProbeSetData.Id AND ProbeSetSE.StrainId = ProbeSetData.StrainId) + WHERE + ProbeSetFreeze.name = 'B139_K_1206_M' AND + ProbeSetXRef.ProbeSetId = ProbeSet.Id AND + ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id AND + ProbeSetXRef.DataId = ProbeSetData.Id AND + ProbeSetData.StrainId = Strain.Id + 
Order BY Strain.Name + ++-------+-------+-------+----------+ +| Name | value | error | Id | ++-------+-------+-------+----------+ +| SM001 | 38.3 | NULL | 25309550 | +| SM001 | 2.7 | NULL | 25309520 | +| SM001 | 20.3 | NULL | 25309507 | +| SM001 | 125.8 | NULL | 25309511 | +| SM001 | 8.2 | NULL | 25309534 | ++-------+-------+-------+----------+ +5 rows in set (22.28 sec) +``` + +# ProbeSetFreeze + +``` +select * from ProbeSetFreeze limit 5; ++----+---------------+-------+-------------+---------------------------------+---------------------------------------------+-------------------------+------------+-----------+--------+-----------------+-----------------+-----------+ +| Id | ProbeFreezeId | AvgID | Name | Name2 | FullName | ShortName | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale | ++----+---------------+-------+-------------+---------------------------------+---------------------------------------------+-------------------------+------------+-----------+--------+-----------------+-----------------+-----------+ +| 1 | 3 | 1 | Br_U_0803_M | BXDMicroArray_ProbeSet_August03 | UTHSC Brain mRNA U74Av2 (Aug03) MAS5 | Brain U74Av2 08/03 MAS5 | 2003-08-01 | NULL | 0 | 0 | NULL | log2 | +| 2 | 10 | 1 | Br_U_0603_M | BXDMicroArray_ProbeSet_June03 | UTHSC Brain mRNA U74Av2 (Jun03) MAS5 | Brain U74Av2 06/03 MAS5 | 2003-06-01 | NULL | 0 | 0 | NULL | log2 | +| 3 | 8 | 1 | Br_U_0303_M | BXDMicroArray_ProbeSet_March03 | UTHSC Brain mRNA U74Av2 (Mar03) MAS5 | Brain U74Av2 03/03 MAS5 | 2003-03-01 | NULL | 0 | 0 | NULL | log2 | +| 4 | 5 | 1 | Br_U_0503_M | BXDMicroArray_ProbeSet_May03 | UTHSC Brain mRNA U74Av2 (May03) MAS5 | Brain U74Av2 05/03 MAS5 | 2003-05-01 | NULL | 0 | 0 | NULL | log2 | +| 5 | 4 | 1 | HC_U_0303_M | GNFMicroArray_ProbeSet_March03 | GNF Hematopoietic Cells U74Av2 (Mar03) MAS5 | GNF U74Av2 03/03 MAS5 | 2003-03-01 | NULL | 0 | 0 | NULL | log2 | 
++----+---------------+-------+-------------+---------------------------------+---------------------------------------------+-------------------------+------------+-----------+--------+-----------------+-----------------+-----------+ +``` + +# ProbeSetXRef + +``` +select * from ProbeSetXRef limit 5; ++------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+-----------------+--------------------+--------+----------------------+------+ +| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | LRS_old | pValue_old | mean | se | Locus | LRS | pValue | additive | h2 | ++------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+-----------------+--------------------+--------+----------------------+------+ +| 1 | 1 | 1 | 10.095.400 | 13.3971627898894 | 0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 | 12.590069931048001 | 0.269 | -0.28515625 | NULL | +| 1 | 2 | 2 | D15Mit189 | 10.042057464356201 | 0.431 | 9.90165714285714 | 0.0374686634976217 | CEL-17_50896182 | 10.5970737900941 | 0.304 | -0.11678333333333299 | NULL | +| 1 | 3 | 3 | D5Mit139 | 5.43678531742749 | 0.993 | 7.83948571428571 | 0.0457583416912569 | rs13478499 | 6.0970532702754 | 0.988 | 0.112957489878542 | NULL | +| 1 | 4 | 4 | D1Mit511 | 9.87815279480766 | 0.483 | 8.315628571428569 | 0.0470396593931327 | rs6154379 | 11.774867551173099 | 0.286 | -0.157113725490196 | NULL | +| 1 | 5 | 5 | D16H21S16 | 10.191723834264499 | 0.528 | 9.19345714285714 | 0.0354801718293322 | rs4199265 | 10.923263374016202 | 0.468 | 0.11476470588235299 | NULL | ++------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+-----------------+--------------------+--------+----------------------+------+ +``` + + +Note that the following unlimited search is very slow: + +select max(value) from ProbeSetData; + +``` 
++------------+ +| max(value) | ++------------+ +| 26436006 | ++------------+ +1 row in set (2 min 16.31 sec) +``` + +which, in some form, is used in the search page, see [[https://github.com/genenetwork/genenetwork2_diet/blob/master/wqflask/wqflask/do_search.py#L811][the search code]]. + + +*** Comments + +I think the ProbeSetData table should be generalized to a 'phenotypes' table with a 'sample_id' column and a 'value' column. + +A new table 'samples' will link each sample against an 'experiment' and an 'individual', which in turn can link to a 'strain'. + +Experiment is meant here in a wide sense, GTEx can be one - I don't want to use dataset ;) + +This means a (slight) reordering: + +``` +phenotypes: (id), sample_id, value +samples: experiment_id, individual_id +experiments: name, version +individual: strain_id +strains: species_id +species: ... +``` + +ProbeData is also interesting, because it has the same structure as ProbeSetData, but only contains microarrays. These tables should be one (when we clear up the cross-referencing) as they both contain phenotype values. Both are large tables. + +PublishData is another phenotype table with values only, which can be merged into that same table. This data does not require the annotations of probesets(!) + +=> https://genenetwork.org/show_trait?trait_id=10031&dataset=BXDPublish + +So we have phenotype data in 3 tables with exactly the same +layout. There is also TissueProbeSet*, but we'll ignore those for +now. I think we should merge these into one and have the sample ref +refer to the type of data (probeset, probe, metabolomics, +whatever). These are all phenotype values and by having them split +into different tables they won't play well when looking for +correlations. + +ProbeSet contains the metadata on the probes and should (eventually) +move into NoSQL. There is plenty of redundancy in that table now. 
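The reordering proposed above could be sketched as follows. This is a minimal illustration, not a migration plan: it uses an in-memory SQLite database as a stand-in for MariaDB, and only the table and column names from the listing above come from the proposal. The column types, the foreign keys, the name columns, the data_type column on samples (one possible reading of "have the sample ref refer to the type of data"), the sample rows and the helper names build_example and strain_means are all assumptions.

```python
import sqlite3

# Table and column names follow the proposal; types, keys and data_type are assumed.
SCHEMA = """
CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE strains (id INTEGER PRIMARY KEY, name TEXT,
                      species_id INTEGER REFERENCES species(id));
CREATE TABLE individual (id INTEGER PRIMARY KEY,
                         strain_id INTEGER REFERENCES strains(id));
CREATE TABLE experiments (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
CREATE TABLE samples (id INTEGER PRIMARY KEY,
                      experiment_id INTEGER REFERENCES experiments(id),
                      individual_id INTEGER REFERENCES individual(id),
                      data_type TEXT);
CREATE TABLE phenotypes (id INTEGER PRIMARY KEY,
                         sample_id INTEGER REFERENCES samples(id),
                         value REAL);
"""


def build_example():
    """Create the sketch schema with two individuals of one strain (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO species VALUES (1, 'mouse')")
    conn.execute("INSERT INTO strains VALUES (5, 'BXD2', 1)")
    conn.executemany("INSERT INTO individual VALUES (?, ?)", [(1, 5), (2, 5)])
    conn.execute("INSERT INTO experiments VALUES (1, 'example_experiment', 'v1')")
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?)",
        [(1, 1, 1, "probeset"), (2, 1, 2, "probeset")],
    )
    conn.executemany(
        "INSERT INTO phenotypes (sample_id, value) VALUES (?, ?)",
        [(1, 49.0), (2, 51.0)],
    )
    return conn


def strain_means(conn, experiment_name):
    """Return (strain name, mean value, n) per strain for one experiment."""
    return conn.execute(
        """
        SELECT strains.name, AVG(phenotypes.value), COUNT(*)
        FROM phenotypes
        JOIN samples ON samples.id = phenotypes.sample_id
        JOIN experiments ON experiments.id = samples.experiment_id
        JOIN individual ON individual.id = samples.individual_id
        JOIN strains ON strains.id = individual.strain_id
        WHERE experiments.name = ?
        GROUP BY strains.name
        ORDER BY strains.name
        """,
        (experiment_name,),
    ).fetchall()
```

With all phenotype values in one table, strain means for any kind of data come from the same query; only the data_type filter (not used here) would differ.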
+ +I know it is going to be a pain to reorganize the database, but if we +want to use it in the long run we are going to have to simplify it. + +# ProbeSetFreeze and ProbeFreeze (/dataset/name.json) + +GN_SERVER: /dataset/HC_M2_0606_P.json + +ProbesetFreeze contains DataSet information, such as name, fullname of +datasets, as well as whether they are public and how the data is +scaled: + +``` +select * from ProbeSetFreeze; +| Id | ProbeFreezeId | AvgID | Name | Name2 | FullName | ShortName | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale | +| 112 | 30 | 2 | HC_M2_0606_P | Hippocampus_M430_V2_BXD_PDNN_Jun06 | Hippocampus Consortium M430v2 (Jun06) PDNN | Hippocampus M430v2 BXD 06/06 PDNN | 2006-06-23 | NULL | 2 | 0 | NULL | log2 | +``` + +Another table contains a tissue reference and a back reference to the cross +type: + +``` +select * from ProbeFreeze; +| Id | ProbeFreezeId | ChipId | TissueId | Name | FullName | ShortName | CreateTime | InbredSetId | +| 30 | 30 | 4 | 9 | Hippocampus Consortium M430v2 Probe (Jun06) | | | 2006-07-07 | 1 | +``` + +NOTE: these tables can probably be merged into one. 
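How the two tables relate could be sketched with a join. This is a hedged illustration: it uses an in-memory SQLite database as a stand-in for MariaDB, keeps only a few of the columns shown above, copies the single HC_M2_0606_P row from the listings, and assumes that ProbeSetFreeze.ProbeFreezeId refers to ProbeFreeze.Id. The helper names make_freeze_db and dataset_info are hypothetical.

```python
import sqlite3


def make_freeze_db():
    """Build cut-down ProbeFreeze and ProbeSetFreeze tables with one dataset."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ProbeFreeze (
            Id INTEGER PRIMARY KEY, ChipId INTEGER, TissueId INTEGER,
            Name TEXT, InbredSetId INTEGER
        );
        CREATE TABLE ProbeSetFreeze (
            Id INTEGER PRIMARY KEY, ProbeFreezeId INTEGER, Name TEXT,
            FullName TEXT, public INTEGER, DataScale TEXT
        );
        """
    )
    conn.execute(
        "INSERT INTO ProbeFreeze VALUES (?, ?, ?, ?, ?)",
        (30, 4, 9, "Hippocampus Consortium M430v2 Probe (Jun06)", 1),
    )
    conn.execute(
        "INSERT INTO ProbeSetFreeze VALUES (?, ?, ?, ?, ?, ?)",
        (112, 30, "HC_M2_0606_P",
         "Hippocampus Consortium M430v2 (Jun06) PDNN", 2, "log2"),
    )
    return conn


def dataset_info(conn, name):
    """Return dataset, tissue and cross information as one dict, or None."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(
        """
        SELECT psf.Id AS Id, psf.Name AS Name, psf.FullName AS FullName,
               psf.public AS public, psf.DataScale AS DataScale,
               pf.ChipId AS ChipId, pf.TissueId AS TissueId,
               pf.InbredSetId AS InbredSetId
        FROM ProbeSetFreeze AS psf
        JOIN ProbeFreeze AS pf ON pf.Id = psf.ProbeFreezeId
        WHERE psf.Name = ?
        """,
        (name,),
    ).fetchone()
    return dict(row) if row is not None else None
```

The joined row is roughly what a merged table would hold per dataset: name, scale and visibility together with the chip, tissue and cross references.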
+ +``` +show indexes from ProbeSetFreeze; ++----------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | ++----------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +| ProbeSetFreeze | 0 | PRIMARY | 1 | Id | A | 2 | NULL | NULL | | BTREE | | | +| ProbeSetFreeze | 0 | FullName | 1 | FullName | A | 2 | NULL | NULL | | BTREE | | | +| ProbeSetFreeze | 0 | Name | 1 | Name | A | 2 | NULL | NULL | YES | BTREE | | | +| ProbeSetFreeze | 1 | NameIndex | 1 | Name2 | A | 2 | NULL | NULL | | BTREE | | | ++----------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +``` + +# ProbeSetSE + +``` +select * from ProbeSetSE limit 5; ++--------+----------+----------+ +| DataId | StrainId | error | ++--------+----------+----------+ +| 1 | 1 | 0.681091 | +| 1 | 2 | 0.361151 | +| 1 | 3 | 0.364342 | +| 1 | 4 | 0.827588 | +| 1 | 5 | 0.303492 | ++--------+----------+----------+ +``` + +# More information + +For the other tables, you may check the GN2/doc/database.org document (the starting point for this document). + +# Contributions regarding data upload to the GeneNetwork webserver +* Ideas shared by the GeneNetwork team to facilitate the process of uploading data to production + +## Quality check and integrity of the data to be uploaded to gn2 + +* A note to add (from Arthur): Some datasets have the following identifiers: ProbeSet IDs {chr_3020701, chr_3020851, etc}. This is not an acceptable way to name the probeset IDs. 
So, the data provider needs to understand what format is needed for gn2 to accept the ProbeSet IDs in their dataset +* Also, for the annotation file, among other important columns, it is crucial that there are descriptions, aliases, and location columns. And the formatting should be exactly as found in the public repositories such as NCBI, Ensembl, etc. For instance, for description: `X-linked Kx blood group related 4`, and Aliases: ` XRG4; Gm210; mKIAA1889` as in +=> https://www.ncbi.nlm.nih.gov/gene/497097 + +## Valid ProbeSetIDs + +* The official ProbeSetIDs would be the one from the vendor. This would also constitute the platform used to generate data {Novogene-specific platform}, for instance; `NovaSeqPE150` for the MBD UTHSC mice seq dataset +* NB; in this case, if the vendor does not provide the official names as expected, we can use the platform + the numbering order of the file to generate probeset IDs. For instance; `NseqPE150_000001 to NseqPE150_432694` for samples 1 to 432694 +* Avoid IDs with meaning, e.g. =chr1_3020701= → Chromosome 1 at 3020701 base pairs. Prefer IDs with no meaning + +## The importance of having unique identifiers within a platform + +* Unique identifiers solve the hurdles that come with having duplicate genes. So, the QA tools in place should ensure the uploaded dataset adheres to the requirements mentioned +* However, newer RNA-seq data sets generated by sequencing do not usually have an official vendor identifier. The identifier is usually based on the NCBI mRNA model (NM_XXXXXX) that was used to evaluate an expression and on the sequence that is involved, usually the start and stop nucleotide positions based on a specific genome assembly or just a suffix to make sure it is unique. In this case, you are looking at mRNA assays for a single transcript, but different parts of the transcript that have different genome coordinates. We now typically use ENSEMBL identifiers. 
+* The mouse version of the sonic hedgehog gene as an example: `ENSMUST00000002708` or `ENSMUSG00000002633` sources should be fine. The important thing is to know the provenance of the ID—who is in charge of that ID type? +* When a mRNA assay is super precise (one exon only or a part of the 5' UTR), then we should use exon identifiers from ENSEMBL probably. +* Ideally, we should enter the sequence's first and last 100 nt in GeneNetwork for verification and alignment. We did this religiously for arrays, but have started to get lazy now. The sequence is the ultimate identifier +* For methylation arrays and CpG assays, we can use this format `cg14050475` as seen in MBD UTHSC Ben's data +* For metabolites like isoleucine—the ID we have been using is the mass-to-charge (MZ) ratio such as `130.0874220_MZ` +* For protein and peptide identifiers we have used the official Protein ID followed by an underscore character and then some or all of the sequence. This is then followed by another underscore and a number. Evan to confirm, but the suffix number is the charge state if I remember correctly +``` +Q9JHJ3_LLHTADVCQLEVALVGASPR_3 +A2A8E1_TIVEFECR_2 +A2A8E1_ATLENVTNLRPVGEDFR_3 +A2A8E1_ENSIDILSSTIK_2 +``` +* But in older protein expression databases Evan and the team used a different method +``` +abcb10_q9ji39_t311 +abcb10_q9ji39_t312 +``` +* The above is just the gene symbol then the protein ID and not so sure what t311 and t312 mean +* Ideally these IDs are explained to some extent when they embed some information + + + +## BXD individuals + +* Basically groups (represented by the InbredSet tables) are primarily defined by their list of samples/strains (represented by the Strain tables). When we create a new group, it's because we have data with a distinct set of samples/strains from any existing groups. 
+* So when we receive data for BXD individuals, as far as the database is concerned they are a completely separate group (since the list of samples is new/distinct from any other existing groups). We can choose to also enter it as part of the "generic" BXD group (by converting it to strain means/SEs using the strain of each individual, assuming it's provided like in the files Arthur was showing us). +* This same logic could apply to other groups as well - we could choose to make one group the "strain mean" group for another set of groups that contain sample data for individuals. But the database doesn't reflect the relationship between these groups* +* As far as the database is concerned, there is no distinction between strain means and individual sample data - they're all rows in the ProbeSetData/PublishData tables. The only difference is that strain mean data will probably also have an SE value in the ProbeSetSE/PublishSE tables and/or an N (number of individuals per strain) value in the NStrain table +* As for what this means for the uploader - I think it depends on whether Rob/Arthur/etc wants to give users the ability to simultaneously upload both strain mean and individual data. For example, if someone uploads some BXD individuals' data, do we want the uploader to both create a new group for this (or add to an existing BXD individuals group) and calculate the strain means/SE and enter it into the "main" BXD group? 
My personal feeling is that it's probably best to postpone that for later and only upload the data with the specific set of samples indicated in the file since it would insert some extra complexity to the uploading process that could always be added later (since the user would need to select "the group the strains are from" as a separate option) +* The relationship is sorta captured in the CaseAttribute and CaseAttributeXRefNew tables (which contain sample metadata), but only in the form of the metadata that is sometimes displayed as extra columns in the trait page table - this data isn't used in any queries/analyses currently (outside of some JS filters run on the table itself) and isn't that important as part of the uploading process (or at least can be postponed) + +## Individual Datasets and Derivatives datasets in gn2 +* Individual dataset reflects the actual data provided or submitted by the investigator (user). Derivative datasets include the processed information from the individual dataset, as in the case of the average datasets. +* An example of an individual dataset would look something like; (MBD dataset) +``` +#+begin_example +sample, strain, Sex, Age,… +FEB0001,BXD48a,M,63,… +FEB0002,BXD48a,M,15,… +FEB0003,BXD48a,F,22,… +FEB0004,BXD16,M,39,… +FEB0005,BXD16,F,14,… +⋮ +#+end_example +``` +* The strain column above has repetitive values. Each value has a one-to-many relationship with values on sample column. From this dataset, there can be several derivatives. 
For example; +- Sex-based categories +- Average data (3 sample values averaged to one strain value) +- Standard error table computed for the averages + +## Saving data to database +* Strain table schema +``` +#+begin_src sql + MariaDB [db_webqtl]> DESC Strain; + +-----------+----------------------+------+-----+---------+----------------+ + | Field | Type | Null | Key | Default | Extra | + +-----------+----------------------+------+-----+---------+----------------+ + | Id | int(20) | NO | PRI | NULL | auto_increment | + | Name | varchar(100) | YES | MUL | NULL | | + | Name2 | varchar(100) | YES | | NULL | | + | SpeciesId | smallint(5) unsigned | NO | | 0 | | + | Symbol | varchar(20) | YES | MUL | NULL | | + | Alias | varchar(255) | YES | | NULL | | + +-----------+----------------------+------+-----+---------+----------------+ + 6 rows in set (0.00 sec) +#+end_src +``` +* For the *individual data*, the =sample= field would be saved as =Name= and the =strain= would be saved as =Name2=. These records would then all be linked to an inbredset group (population?) in the =InbredSet= table via the =StrainXRef= table, whose schema is as follows: +``` +#+begin_src sql + MariaDB [db_webqtl]> DESC StrainXRef; + +------------------+----------------------+------+-----+---------+-------+ + | Field | Type | Null | Key | Default | Extra | + +------------------+----------------------+------+-----+---------+-------+ + | InbredSetId | smallint(5) unsigned | NO | PRI | 0 | | + | StrainId | int(20) | NO | PRI | NULL | | + | OrderId | int(20) | YES | | NULL | | + | Used_for_mapping | char(1) | YES | | N | | + | PedigreeStatus | varchar(255) | YES | | NULL | | + +------------------+----------------------+------+-----+---------+-------+ + 5 rows in set (0.00 sec) +#+end_src +``` +* Where the =InbredSetId= comes from the =InbredSet= table and the =StrainId= comes from the =Strain= table. 
The *individual data* would be linked to an inbredset group that is for individuals +* For the *average data*, the only value to save would be the =strain= field, which would be saved as =Name= in the =Strain= table and linked to an InbredSet group that is for averages +*Question 01*: How do we distinguish the inbredset groups? +*Answer*: The =Family= field is useful for this. + +*Question 02*: If you have more derived "datasets", e.g. males-only, females-only, under-10-years, 10-to-25-years, etc. How would the =Strains= table handle all those differences? + +## Metadata +* The data we looked at had =gene id= and =gene symbol= fields. These fields were used to fetch the *Ensembl ID* and *descriptions* from [[https://www.ncbi.nlm.nih.gov/][NCBI]] and the [[https://useast.ensembl.org/][Ensembl Genome Browser]] + +## Files for mapping +* Files used for mapping need to be in =bimbam= or =.geno= formats. We would need to do conversions to at least one of these formats where necessary + +## Annotation files +* Consider the following schema of DB tables +#+begin_src sql + MariaDB [db_webqtl]> DESC InbredSet; + +-----------------+----------------------+------+-----+---------+----------------+ + | Field | Type | Null | Key | Default | Extra | + +-----------------+----------------------+------+-----+---------+----------------+ + | Id | smallint(5) unsigned | NO | PRI | NULL | auto_increment | + | InbredSetId | int(5) unsigned | NO | | NULL | | + | InbredSetName | varchar(100) | YES | | NULL | | + | Name | char(30) | NO | | | | + | SpeciesId | smallint(5) unsigned | YES | | 1 | | + | FullName | varchar(100) | YES | | NULL | | + | public | tinyint(3) unsigned | YES | | 2 | | + | MappingMethodId | char(50) | YES | | 1 | | + | GeneticType | varchar(255) | YES | | NULL | | + | Family | varchar(100) | YES | | NULL | | + | FamilyOrder | int(5) | YES | | NULL | | + | MenuOrderId | double | NO | | NULL | | + | InbredSetCode | varchar(5) | YES | | NULL | | + | Description | longtext | 
YES | | NULL | | + +-----------------+----------------------+------+-----+---------+----------------+ + ⋮ + MariaDB [db_webqtl]> DESC Strain; + +-----------+----------------------+------+-----+---------+----------------+ + | Field | Type | Null | Key | Default | Extra | + +-----------+----------------------+------+-----+---------+----------------+ + | Id | int(20) | NO | PRI | NULL | auto_increment | + | Name | varchar(100) | YES | MUL | NULL | | + | Name2 | varchar(100) | YES | | NULL | | + | SpeciesId | smallint(5) unsigned | NO | | 0 | | + | Symbol | varchar(20) | YES | MUL | NULL | | + | Alias | varchar(255) | YES | | NULL | | + +-----------+----------------------+------+-----+---------+----------------+ + ⋮ + MariaDB [db_webqtl]> DESC StrainXRef; + +------------------+----------------------+------+-----+---------+-------+ + | Field | Type | Null | Key | Default | Extra | + +------------------+----------------------+------+-----+---------+-------+ + | InbredSetId | smallint(5) unsigned | NO | PRI | 0 | | + | StrainId | int(20) | NO | PRI | NULL | | + | OrderId | int(20) | YES | | NULL | | + | Used_for_mapping | char(1) | YES | | N | | + | PedigreeStatus | varchar(255) | YES | | NULL | | + +------------------+----------------------+------+-----+---------+-------+ +#+end_src + +* The =StrainXRef= table creates a link between the Samples/cases/individuals (stored in the =Strain= table) to the group (population?) they belong to in the =InbredSet= table +* Steps to prepare the TSV file for entering samples/cases into the database are: +- Clean up =Name= of the samples/cases/individuals in the file: + - Names should have no spaces + - Names should be the same length of characters: pad those that are shorter e.g. *SampleName12* → *SampleName012* to fit in with other names if, say, the samples range from 1 to 999. 
In a similar vein, you'd rename *SampleName1* to *SampleName001* +- Order samples by the names +- Create a new column, say, =orderId= in the TSV, and assign the order *1, 2, 3, …, n* for the rows, from the first to the "n^{th}" row. The order of the strains is very important and must be maintained +- retrieve the largest current =Id= value in the =Strain= table +- Increment by one (1) and assign that to the first row of your ordered data + - Assign subsequent rows, the subsequent values for the ID e.g. Assuming the largest =Id= value in the =Strain= table was *23*, the first row of the new data would have the id *24*. The second row would have *25*, the third, *26* and so on +- Get the =InbredSetId= for your samples' data. Add a new column in the data and copy this value for all rows +- Enter data into the =Strain= table +- Using the previously computed strain ID values, and the =InbredSetId= previously copied, enter data into the =StrainXRef= table + +* Some notes on the data: +- The =Symbol= field in the =Strain= table corresponds to the =Strain= field in the annotation file +- The =used_for_mapping= field should be set to ~Y~ unless otherwise informed +- The =PedigreeStatus= field is unknown to us for now: set to ~NULL~ + +* Annotation file format +The important fields are: +- =ChipId=: The platform that the data was collected from/with +Consider the following table; +#+begin_src sql + MariaDB [db_webqtl]> DESC GeneChip; + +---------------+----------------------+------+-----+---------+----------------+ + | Field | Type | Null | Key | Default | Extra | + +---------------+----------------------+------+-----+---------+----------------+ + | Id | smallint(5) unsigned | NO | PRI | NULL | auto_increment | + | GeneChipId | int(5) | YES | | NULL | | + | GeneChipName | varchar(200) | YES | | NULL | | + | Name | char(30) | NO | | | | + | GeoPlatform | char(15) | YES | | NULL | | + | Title | varchar(100) | YES | | NULL | | + | SpeciesId | int(5) | YES | | 1 | | + | 
GO_tree_value | varchar(50) | YES | | NULL | | + +---------------+----------------------+------+-----+---------+----------------+ + #+end_src + Some of the important fields that were highlighted were: + - =GeoPlatform=: Links the details of the platform in our database with NCBI's [[https://www.ncbi.nlm.nih.gov/geo/][Gene Expression Omnibus (GEO)]] system. This is not always possible, but where we can, it would be nice to have this field populated + - =GO_tree_value=: This is supposed to link the detail we have with some external system "GO". I have not figured this one out on my own and will need to follow up on it. + - =Name=: The name corresponds to the =ProbeSetId=, and we want this to be the same value as the identifier on the [[https://www.ensembl.org][Ensembl genome browser]], e.g. For a gene, say =Shh=, for *mouse*, we want the =Name= value to be a variation on [[https://useast.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000002633;r=5:28661813-28672254;t=ENSMUST00000002708][*ENSMUSG00000002633*]] + - =Probe_set_Blat_Mb_start=/=Probe_set_Blat_Mb_end=: In Byron's and Beni's data, these correspond to the =geneStart= and =geneEnd= fields respectively. These are the positions, in megabasepairs, that the gene begins and ends at, respectively. + - =Mb=: This is the =geneStart=/=Probe_set_Blat_Mb_start= value divided by *1000000*. (*Note to self*: Maybe the Probe_set_Blat_Mb_* fields above might not be in megabase pairs; please confirm) + - =Strand_Probe= and =Strand_Gene=: These fields' values are simply ~+~ or ~-~. 
If these values are missing, you can [[https://ftp.ncbi.nih.gov/gene/README][retrieve them from NCBI]], specifically from the =orientation= field of seemingly any text file with the field + - =Chr=: This is the chromosome on which the gene is found + +* The final annotation file will have (at minimum) the following fields (or their +analogs): +- =StrainName= +- =OrderId= +- =StrainId=: from the database +- =InbredSetId=: from the database +- =Symbol=: This could be named =Strain= +- =GeneChipId=: from the database +- =EnsemblId=: from the Ensembl genome browser +- =Probe_set_Blat_Mb_start=: possible analogue is =geneStart= +- =Probe_set_Blat_Mb_end=: possible analogue is =geneEnd= +- =Mb= +- =Strand_Probe= +- =Strand_Gene= +- =Chr= + +* =.geno= Files +- The =.geno= files have sample names, not the strain/symbol. The =Locus= field in the =.geno= file corresponds to the **marker**. =.geno= files are used with =QTLReaper= +- The sample names in the ~.geno~ files *MUST* be in the same order as the +strains/symbols for that species. For example; +Data format is as follows; +``` +#+begin_example +SampleName,Strain,… +⋮ +BJCWI0001,BXD40,… +BJCWI0002,BXD40,… +BJCWI0003,BXD33,… +BJCWI0004,BXD50,… +BJCWI0005,BXD50,… +⋮ +#+end_example +``` +and the order of strains is as follows; +``` +#+begin_example +…,BXD33,…,BXD40,…,BXD50,… +#+end_example +``` +then, the ~.geno~ file generated by this data should have a form such as shown +below; +``` +#+begin_example +…,BJCWI0003,…,BJCWI0001,BJCWI0002,…,BJCWI0004,BJCWI0005,… +#+end_example +``` +The order of samples that belong to the same strain is irrelevant - they share the same data, i.e. the order below is also valid; +``` +#+begin_example +…,BJCWI0003,…,BJCWI0002,BJCWI0001,…,BJCWI0004,BJCWI0005,… +#+end_example +``` +* =BimBam= Files +- Used with =GEMMA= +* Case Attributes +- These are metadata about every case/sample/individual in an InbredSet group. The metadata is any data that has nothing to do with phenotypes (e.g. 
height, weight, etc) that is useful for researchers to have in order to make sense of the data. +- Examples of case attributes: + - Treatment + - Sex (Really? Isn't sex an expression of genes?) + - batch + - Case ID, etc + +* Summary steps to load data to the database +- [x] Create *InbredSet* group (think population) +- [x] Load the strains/samples data +- [x] Load the sample cross-reference data to link the samples to their + InbredSet group +- Load the case-attributes data +- [x] Load the annotation (data into ProbeSet table) +- [x] Create the study for the data (At around this point, the InbredSet group + will show up in the UI). +- [x] Create the Dataset for the data +- [x] Load the *Log2* data (ProbeSetData and ProbeSetXRef tables) +- [x] Compute means (an SQL query was used — this could be pre-computed in code + and entered along with the data) +- [x] Run QTLReaper diff --git a/topics/database/setting-up-local-development-database.gmi b/topics/database/setting-up-local-development-database.gmi index 3b743b9..9ebb48b 100644 --- a/topics/database/setting-up-local-development-database.gmi +++ b/topics/database/setting-up-local-development-database.gmi @@ -41,7 +41,12 @@ Setting up mariadb in a Guix container is the preferred and easier method. But, ``` $ sudo $(./containers/db-container.sh) ``` -You should now be able to connect to the database using +By default, mariadb allows passwordless login for root only on the local machine. So, enter the container using guix container exec and set the root password to a blank. 
+``` +$ mysql -u root +MariaDB [(none)]> SET PASSWORD = PASSWORD(""); +``` +You should now be able to connect to the database from outside the container using ``` $ mysql --protocol tcp -u root ``` diff --git a/topics/database/sql.svg b/topics/database/sql.svg new file mode 100644 index 0000000..b7ab96e --- /dev/null +++ b/topics/database/sql.svg @@ -0,0 +1,2558 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" + "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> +<!-- Generated by graphviz version 2.49.0 (20210828.1703) + --> +<!-- Title: schema Pages: 1 --> +<svg width="13704pt" height="5921pt" + viewBox="0.00 0.00 13703.50 5921.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"> +<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 5917)"> +<title>schema</title> +<polygon fill="white" stroke="transparent" points="-4,4 -4,-5917 13699.5,-5917 13699.5,4 -4,4"/> +<!-- NStrain --> +<g id="node1" class="node"> +<title>NStrain</title> +<polygon fill="white" stroke="transparent" points="6648.5,-1918 6648.5,-2008 6775.5,-2008 6775.5,-1918 6648.5,-1918"/> +<polygon fill="#df65b0" stroke="transparent" points="6652,-1984 6652,-2005 6773,-2005 6773,-1984 6652,-1984"/> +<polygon fill="none" stroke="black" points="6652,-1984 6652,-2005 6773,-2005 6773,-1984 6652,-1984"/> +<text text-anchor="start" x="6655" y="-1990.8" font-family="Times,serif" font-size="14.00">NStrain (9 MiB)</text> +<text text-anchor="start" x="6692.5" y="-1968.8" font-family="Times,serif" font-size="14.00">count</text> +<text text-anchor="start" x="6688" y="-1947.8" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="6683" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<polygon fill="none" stroke="black" points="6648.5,-1918 6648.5,-2008 6775.5,-2008 6775.5,-1918 6648.5,-1918"/> +</g> +<!-- Strain --> +<g id="node40" class="node"> 
+<title>Strain</title> +<polygon fill="lightgrey" stroke="transparent" points="5728.5,-765.5 5728.5,-918.5 5843.5,-918.5 5843.5,-765.5 5728.5,-765.5"/> +<polygon fill="#df65b0" stroke="transparent" points="5732,-894 5732,-915 5841,-915 5841,-894 5732,-894"/> +<polygon fill="none" stroke="black" points="5732,-894 5732,-915 5841,-915 5841,-894 5732,-894"/> +<text text-anchor="start" x="5735" y="-900.8" font-family="Times,serif" font-size="14.00">Strain (2 MiB)</text> +<polygon fill="green" stroke="transparent" points="5732,-873 5732,-892 5841,-892 5841,-873 5732,-873"/> +<text text-anchor="start" x="5769" y="-878.8" font-family="Times,serif" font-size="14.00">Alias</text> +<polygon fill="green" stroke="transparent" points="5732,-852 5732,-871 5841,-871 5841,-852 5732,-852"/> +<text text-anchor="start" x="5765" y="-857.8" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="green" stroke="transparent" points="5732,-831 5732,-850 5841,-850 5841,-831 5732,-831"/> +<text text-anchor="start" x="5760.5" y="-836.8" font-family="Times,serif" font-size="14.00">Name2</text> +<polygon fill="green" stroke="transparent" points="5732,-810 5732,-829 5841,-829 5841,-810 5732,-810"/> +<text text-anchor="start" x="5759.5" y="-815.8" font-family="Times,serif" font-size="14.00">Symbol</text> +<text text-anchor="start" x="5779" y="-794.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="5751.5" y="-773.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<polygon fill="none" stroke="black" points="5728.5,-765.5 5728.5,-918.5 5843.5,-918.5 5843.5,-765.5 5728.5,-765.5"/> +</g> +<!-- NStrain->Strain --> +<g id="edge1" class="edge"> +<title>NStrain:StrainId->Strain</title> +<path fill="none" stroke="black" d="M6651,-1930C6610.43,-1930 6653.88,-1233.5 6631,-1200 6450.66,-935.96 6033.45,-866.5 5861.83,-848.81"/> +<polygon fill="black" stroke="black" points="5861.92,-845.3 5851.62,-847.79 5861.23,-852.27 5861.92,-845.3"/> +</g> 
+<!-- roles_users --> +<g id="node2" class="node"> +<title>roles_users</title> +<polygon fill="white" stroke="transparent" points="7071.5,-4853 7071.5,-4922 7204.5,-4922 7204.5,-4853 7071.5,-4853"/> +<polygon fill="#f1eef6" stroke="transparent" points="7075,-4897.5 7075,-4918.5 7202,-4918.5 7202,-4897.5 7075,-4897.5"/> +<polygon fill="none" stroke="black" points="7075,-4897.5 7075,-4918.5 7202,-4918.5 7202,-4897.5 7075,-4897.5"/> +<text text-anchor="start" x="7078" y="-4904.3" font-family="Times,serif" font-size="14.00">roles_users (0 B)</text> +<text text-anchor="start" x="7114" y="-4882.3" font-family="Times,serif" font-size="14.00">role_id</text> +<text text-anchor="start" x="7112.5" y="-4861.3" font-family="Times,serif" font-size="14.00">user_id</text> +<polygon fill="none" stroke="black" points="7071.5,-4853 7071.5,-4922 7204.5,-4922 7204.5,-4853 7071.5,-4853"/> +</g> +<!-- role --> +<g id="node58" class="node"> +<title>role</title> +<polygon fill="white" stroke="transparent" points="7093.5,-3249 7093.5,-3339 7184.5,-3339 7184.5,-3249 7093.5,-3249"/> +<polygon fill="#f1eef6" stroke="transparent" points="7097,-3315 7097,-3336 7182,-3336 7182,-3315 7097,-3315"/> +<polygon fill="none" stroke="black" points="7097,-3315 7097,-3336 7182,-3336 7182,-3315 7097,-3315"/> +<text text-anchor="start" x="7106" y="-3321.8" font-family="Times,serif" font-size="14.00">role (0 B)</text> +<text text-anchor="start" x="7099" y="-3299.8" font-family="Times,serif" font-size="14.00">description</text> +<text text-anchor="start" x="7119.5" y="-3278.8" font-family="Times,serif" font-size="14.00">name</text> +<text text-anchor="start" x="7117.5" y="-3257.8" font-family="Times,serif" font-size="14.00">the_id</text> +<polygon fill="none" stroke="black" points="7093.5,-3249 7093.5,-3339 7184.5,-3339 7184.5,-3249 7093.5,-3249"/> +</g> +<!-- roles_users->role --> +<g id="edge2" class="edge"> +<title>roles_users:role_id->role</title> +<path fill="none" stroke="black" 
d="M7203,-4885.5C7242.13,-4885.5 7161.86,-3639.62 7142.89,-3353.21"/> +<polygon fill="black" stroke="black" points="7146.37,-3352.78 7142.22,-3343.03 7139.39,-3353.24 7146.37,-3352.78"/> +</g> +<!-- User --> +<g id="node60" class="node"> +<title>User</title> +<polygon fill="white" stroke="transparent" points="7244,-3175.5 7244,-3412.5 7354,-3412.5 7354,-3175.5 7244,-3175.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="7247,-3388 7247,-3409 7351,-3409 7351,-3388 7247,-3388"/> +<polygon fill="none" stroke="black" points="7247,-3388 7247,-3409 7351,-3409 7351,-3388 7247,-3388"/> +<text text-anchor="start" x="7250" y="-3394.8" font-family="Times,serif" font-size="14.00">User (28 KiB)</text> +<text text-anchor="start" x="7260" y="-3372.8" font-family="Times,serif" font-size="14.00">createtime</text> +<text text-anchor="start" x="7273" y="-3351.8" font-family="Times,serif" font-size="14.00">disable</text> +<text text-anchor="start" x="7279" y="-3330.8" font-family="Times,serif" font-size="14.00">email</text> +<text text-anchor="start" x="7265.5" y="-3309.8" font-family="Times,serif" font-size="14.00">grpName</text> +<text text-anchor="start" x="7292" y="-3288.8" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="7268" y="-3267.8" font-family="Times,serif" font-size="14.00">lastlogin</text> +<text text-anchor="start" x="7279" y="-3246.8" font-family="Times,serif" font-size="14.00">name</text> +<text text-anchor="start" x="7264.5" y="-3225.8" font-family="Times,serif" font-size="14.00">password</text> +<text text-anchor="start" x="7267" y="-3204.8" font-family="Times,serif" font-size="14.00">privilege</text> +<text text-anchor="start" x="7273" y="-3183.8" font-family="Times,serif" font-size="14.00">user_ip</text> +<polygon fill="none" stroke="black" points="7244,-3175.5 7244,-3412.5 7354,-3412.5 7354,-3175.5 7244,-3175.5"/> +</g> +<!-- roles_users->User --> +<g id="edge3" class="edge"> +<title>roles_users:user_id->User</title> 
+<path fill="none" stroke="black" d="M7139,-4854.5C7139,-4323.12 7232.06,-3695.19 7276.24,-3427.05"/> +<polygon fill="black" stroke="black" points="7279.74,-3427.32 7277.92,-3416.88 7272.83,-3426.18 7279.74,-3427.32"/> +</g> +<!-- SnpAllRat --> +<g id="node3" class="node"> +<title>SnpAllRat</title> +<polygon fill="white" stroke="transparent" points="2716,-702.5 2716,-981.5 2876,-981.5 2876,-702.5 2716,-702.5"/> +<polygon fill="#df65b0" stroke="transparent" points="2719,-957 2719,-978 2873,-978 2873,-957 2719,-957"/> +<polygon fill="none" stroke="black" points="2719,-957 2719,-978 2873,-978 2873,-957 2719,-957"/> +<text text-anchor="start" x="2722" y="-963.8" font-family="Times,serif" font-size="14.00">SnpAllRat (908 MiB)</text> +<text text-anchor="start" x="2772" y="-941.8" font-family="Times,serif" font-size="14.00">Alleles</text> +<text text-anchor="start" x="2749" y="-920.8" font-family="Times,serif" font-size="14.00">Chromosome</text> +<text text-anchor="start" x="2728" y="-899.8" font-family="Times,serif" font-size="14.00">ConservationScore</text> +<text text-anchor="start" x="2768.5" y="-878.8" font-family="Times,serif" font-size="14.00">Domain</text> +<text text-anchor="start" x="2764" y="-857.8" font-family="Times,serif" font-size="14.00">Function</text> +<text text-anchor="start" x="2777.5" y="-836.8" font-family="Times,serif" font-size="14.00">Gene</text> +<text text-anchor="start" x="2788.5" y="-815.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2767" y="-794.8" font-family="Times,serif" font-size="14.00">Position</text> +<text text-anchor="start" x="2761" y="-773.8" font-family="Times,serif" font-size="14.00">SnpName</text> +<text text-anchor="start" x="2771" y="-752.8" font-family="Times,serif" font-size="14.00">Source</text> +<text text-anchor="start" x="2761" y="-731.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="2758.5" y="-710.8" font-family="Times,serif" 
font-size="14.00">Transcript</text> +<polygon fill="none" stroke="black" points="2716,-702.5 2716,-981.5 2876,-981.5 2876,-702.5 2716,-702.5"/> +</g> +<!-- Species --> +<g id="node33" class="node"> +<title>Species</title> +<polygon fill="lightgrey" stroke="transparent" points="2734,-201 2734,-396 2858,-396 2858,-201 2734,-201"/> +<polygon fill="#f1eef6" stroke="transparent" points="2737,-371.5 2737,-392.5 2855,-392.5 2855,-371.5 2737,-371.5"/> +<polygon fill="none" stroke="black" points="2737,-371.5 2737,-392.5 2855,-392.5 2855,-371.5 2737,-371.5"/> +<text text-anchor="start" x="2740" y="-378.3" font-family="Times,serif" font-size="14.00">Species (796 B)</text> +<polygon fill="green" stroke="transparent" points="2737,-350.5 2737,-369.5 2855,-369.5 2855,-350.5 2737,-350.5"/> +<text text-anchor="start" x="2761" y="-356.3" font-family="Times,serif" font-size="14.00">FullName</text> +<polygon fill="green" stroke="transparent" points="2737,-329.5 2737,-348.5 2855,-348.5 2855,-329.5 2737,-329.5"/> +<text text-anchor="start" x="2754.5" y="-335.3" font-family="Times,serif" font-size="14.00">MenuName</text> +<polygon fill="green" stroke="transparent" points="2737,-308.5 2737,-327.5 2855,-327.5 2855,-308.5 2737,-308.5"/> +<text text-anchor="start" x="2747.5" y="-314.3" font-family="Times,serif" font-size="14.00">SpeciesName</text> +<text text-anchor="start" x="2788.5" y="-293.3" font-family="Times,serif" font-size="14.00">Id</text> +<polygon fill="green" stroke="transparent" points="2737,-266.5 2737,-285.5 2855,-285.5 2855,-266.5 2737,-266.5"/> +<text text-anchor="start" x="2774.5" y="-272.3" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="2767.5" y="-251.3" font-family="Times,serif" font-size="14.00">OrderId</text> +<text text-anchor="start" x="2761" y="-230.3" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="2752.5" y="-209.3" font-family="Times,serif" font-size="14.00">TaxonomyId</text> 
+<polygon fill="none" stroke="black" points="2734,-201 2734,-396 2858,-396 2858,-201 2734,-201"/> +</g> +<!-- SnpAllRat->Species --> +<g id="edge4" class="edge"> +<title>SnpAllRat:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M2874,-735C2906.96,-735 2860.65,-539.2 2826.56,-410.18"/> +<polygon fill="black" stroke="black" points="2829.87,-409 2823.92,-400.23 2823.1,-410.8 2829.87,-409"/> +</g> +<!-- SampleXRef --> +<g id="node4" class="node"> +<title>SampleXRef</title> +<polygon fill="white" stroke="transparent" points="3272,-3259.5 3272,-3328.5 3426,-3328.5 3426,-3259.5 3272,-3259.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="3275,-3304 3275,-3325 3423,-3325 3423,-3304 3275,-3304"/> +<polygon fill="none" stroke="black" points="3275,-3304 3275,-3325 3423,-3325 3423,-3304 3275,-3304"/> +<text text-anchor="start" x="3278" y="-3310.8" font-family="Times,serif" font-size="14.00">SampleXRef (4 KiB)</text> +<text text-anchor="start" x="3296" y="-3288.8" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text> +<text text-anchor="start" x="3315" y="-3267.8" font-family="Times,serif" font-size="14.00">SampleId</text> +<polygon fill="none" stroke="black" points="3272,-3259.5 3272,-3328.5 3426,-3328.5 3426,-3259.5 3272,-3259.5"/> +</g> +<!-- ProbeFreeze --> +<g id="node42" class="node"> +<title>ProbeFreeze</title> +<polygon fill="white" stroke="transparent" points="2611,-1855 2611,-2071 2777,-2071 2777,-1855 2611,-1855"/> +<polygon fill="#d7b5d8" stroke="transparent" points="2614,-2047 2614,-2068 2774,-2068 2774,-2047 2614,-2047"/> +<polygon fill="none" stroke="black" points="2614,-2047 2614,-2068 2774,-2068 2774,-2047 2614,-2047"/> +<text text-anchor="start" x="2617" y="-2053.8" font-family="Times,serif" font-size="14.00">ProbeFreeze (30 KiB)</text> +<text text-anchor="start" x="2670" y="-2031.8" font-family="Times,serif" font-size="14.00">ChipId</text> +<text text-anchor="start" x="2652" y="-2010.8" font-family="Times,serif" 
font-size="14.00">CreateTime</text> +<text text-anchor="start" x="2659" y="-1989.8" font-family="Times,serif" font-size="14.00">FullName</text> +<text text-anchor="start" x="2686.5" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2651" y="-1947.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="2672.5" y="-1926.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="2641" y="-1905.8" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text> +<text text-anchor="start" x="2653" y="-1884.8" font-family="Times,serif" font-size="14.00">ShortName</text> +<text text-anchor="start" x="2663.5" y="-1863.8" font-family="Times,serif" font-size="14.00">TissueId</text> +<polygon fill="none" stroke="black" points="2611,-1855 2611,-2071 2777,-2071 2777,-1855 2611,-1855"/> +</g> +<!-- SampleXRef->ProbeFreeze --> +<g id="edge5" class="edge"> +<title>SampleXRef:ProbeFreezeId->ProbeFreeze</title> +<path fill="none" stroke="black" d="M3274,-3292C3032.87,-3292 3338.17,-2922.26 3158,-2762 3097.26,-2707.98 2852.39,-2782.55 2794,-2726 2622.74,-2560.12 2641.84,-2254.55 2669,-2085.12"/> +<polygon fill="black" stroke="black" points="2672.47,-2085.6 2670.63,-2075.16 2665.56,-2084.47 2672.47,-2085.6"/> +</g> +<!-- Sample --> +<g id="node95" class="node"> +<title>Sample</title> +<polygon fill="white" stroke="transparent" points="3653.5,-1792 3653.5,-2134 3782.5,-2134 3782.5,-1792 3653.5,-1792"/> +<polygon fill="#d7b5d8" stroke="transparent" points="3657,-2110 3657,-2131 3780,-2131 3780,-2110 3657,-2110"/> +<polygon fill="none" stroke="black" points="3657,-2110 3657,-2131 3780,-2131 3780,-2110 3657,-2110"/> +<text text-anchor="start" x="3660" y="-2116.8" font-family="Times,serif" font-size="14.00">Sample (53 KiB)</text> +<text text-anchor="start" x="3704.5" y="-2094.8" font-family="Times,serif" font-size="14.00">Age</text> +<text text-anchor="start" x="3688" 
y="-2073.8" font-family="Times,serif" font-size="14.00">CELURL</text> +<text text-anchor="start" x="3686.5" y="-2052.8" font-family="Times,serif" font-size="14.00">CHPURL</text> +<text text-anchor="start" x="3676.5" y="-2031.8" font-family="Times,serif" font-size="14.00">CreateTime</text> +<text text-anchor="start" x="3688" y="-2010.8" font-family="Times,serif" font-size="14.00">DATURL</text> +<text text-anchor="start" x="3688" y="-1989.8" font-family="Times,serif" font-size="14.00">EXPURL</text> +<text text-anchor="start" x="3687" y="-1968.8" font-family="Times,serif" font-size="14.00">FromSrc</text> +<text text-anchor="start" x="3711" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="3680.5" y="-1926.8" font-family="Times,serif" font-size="14.00">ImageURL</text> +<text text-anchor="start" x="3697" y="-1905.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="3688" y="-1884.8" font-family="Times,serif" font-size="14.00">RPTURL</text> +<text text-anchor="start" x="3705" y="-1863.8" font-family="Times,serif" font-size="14.00">Sex</text> +<text text-anchor="start" x="3689" y="-1842.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="3678" y="-1821.8" font-family="Times,serif" font-size="14.00">TissueType</text> +<text text-anchor="start" x="3688.5" y="-1800.8" font-family="Times,serif" font-size="14.00">TXTURL</text> +<polygon fill="none" stroke="black" points="3653.5,-1792 3653.5,-2134 3782.5,-2134 3782.5,-1792 3653.5,-1792"/> +</g> +<!-- SampleXRef->Sample --> +<g id="edge6" class="edge"> +<title>SampleXRef:SampleId->Sample</title> +<path fill="none" stroke="black" d="M3424,-3271C3878.8,-3271 3810.34,-2508.42 3752.65,-2148.25"/> +<polygon fill="black" stroke="black" points="3756.08,-2147.55 3751.03,-2138.24 3749.17,-2148.67 3756.08,-2147.55"/> +</g> +<!-- GeneIDXRef --> +<g id="node5" class="node"> +<title>GeneIDXRef</title> +<polygon 
fill="white" stroke="transparent" points="7441,-4842.5 7441,-4932.5 7613,-4932.5 7613,-4842.5 7441,-4842.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="7444,-4908.5 7444,-4929.5 7610,-4929.5 7610,-4908.5 7444,-4908.5"/> +<polygon fill="none" stroke="black" points="7444,-4908.5 7444,-4929.5 7610,-4929.5 7610,-4908.5 7444,-4908.5"/> +<text text-anchor="start" x="7447" y="-4915.3" font-family="Times,serif" font-size="14.00">GeneIDXRef (220 KiB)</text> +<text text-anchor="start" x="7502.5" y="-4893.3" font-family="Times,serif" font-size="14.00">human</text> +<text text-anchor="start" x="7503.5" y="-4872.3" font-family="Times,serif" font-size="14.00">mouse</text> +<text text-anchor="start" x="7516" y="-4851.3" font-family="Times,serif" font-size="14.00">rat</text> +<polygon fill="none" stroke="black" points="7441,-4842.5 7441,-4932.5 7613,-4932.5 7613,-4842.5 7441,-4842.5"/> +</g> +<!-- MachineAccessLog --> +<g id="node6" class="node"> +<title>MachineAccessLog</title> +<polygon fill="white" stroke="transparent" points="7647,-4811 7647,-4964 7861,-4964 7861,-4811 7647,-4811"/> +<polygon fill="#df65b0" stroke="transparent" points="7650,-4939.5 7650,-4960.5 7858,-4960.5 7858,-4939.5 7650,-4939.5"/> +<polygon fill="none" stroke="black" points="7650,-4939.5 7650,-4960.5 7858,-4960.5 7858,-4939.5 7650,-4939.5"/> +<text text-anchor="start" x="7653" y="-4946.3" font-family="Times,serif" font-size="14.00">MachineAccessLog (23 MiB)</text> +<text text-anchor="start" x="7714.5" y="-4924.3" font-family="Times,serif" font-size="14.00">accesstime</text> +<text text-anchor="start" x="7732" y="-4903.3" font-family="Times,serif" font-size="14.00">action</text> +<text text-anchor="start" x="7728" y="-4882.3" font-family="Times,serif" font-size="14.00">data_id</text> +<text text-anchor="start" x="7734.5" y="-4861.3" font-family="Times,serif" font-size="14.00">db_id</text> +<text text-anchor="start" x="7747" y="-4840.3" font-family="Times,serif" font-size="14.00">id</text> 
+<text text-anchor="start" x="7715.5" y="-4819.3" font-family="Times,serif" font-size="14.00">ip_address</text> +<polygon fill="none" stroke="black" points="7647,-4811 7647,-4964 7861,-4964 7861,-4811 7647,-4811"/> +</g> +<!-- metadata_audit --> +<g id="node7" class="node"> +<title>metadata_audit</title> +<polygon fill="white" stroke="transparent" points="292.5,-1897 292.5,-2029 479.5,-2029 479.5,-1897 292.5,-1897"/> +<polygon fill="#d7b5d8" stroke="transparent" points="296,-2005 296,-2026 477,-2026 477,-2005 296,-2005"/> +<polygon fill="none" stroke="black" points="296,-2005 296,-2026 477,-2026 477,-2005 296,-2005"/> +<text text-anchor="start" x="299" y="-2011.8" font-family="Times,serif" font-size="14.00">metadata_audit (16 KiB)</text> +<text text-anchor="start" x="349.5" y="-1989.8" font-family="Times,serif" font-size="14.00">dataset_id</text> +<text text-anchor="start" x="365" y="-1968.8" font-family="Times,serif" font-size="14.00">editor</text> +<text text-anchor="start" x="379.5" y="-1947.8" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="337.5" y="-1926.8" font-family="Times,serif" font-size="14.00">json_diff_data</text> +<text text-anchor="start" x="344.5" y="-1905.8" font-family="Times,serif" font-size="14.00">time_stamp</text> +<polygon fill="none" stroke="black" points="292.5,-1897 292.5,-2029 479.5,-2029 479.5,-1897 292.5,-1897"/> +</g> +<!-- Datasets --> +<g id="node16" class="node"> +<title>Datasets</title> +<polygon fill="lightgrey" stroke="transparent" points="305,-660.5 305,-1023.5 469,-1023.5 469,-660.5 305,-660.5"/> +<polygon fill="#df65b0" stroke="transparent" points="308,-999 308,-1020 466,-1020 466,-999 308,-999"/> +<polygon fill="none" stroke="black" points="308,-999 308,-1020 466,-1020 466,-999 308,-999"/> +<text text-anchor="start" x="326.5" y="-1005.8" font-family="Times,serif" font-size="14.00">Datasets (4 MiB)</text> +<polygon fill="green" stroke="transparent" points="308,-978 308,-997 466,-997 
466,-978 308,-978"/> +<text text-anchor="start" x="344.5" y="-983.8" font-family="Times,serif" font-size="14.00">AboutCases</text> +<polygon fill="green" stroke="transparent" points="308,-957 308,-976 466,-976 466,-957 308,-957"/> +<text text-anchor="start" x="310" y="-962.8" font-family="Times,serif" font-size="14.00">AboutDataProcessing</text> +<polygon fill="green" stroke="transparent" points="308,-936 308,-955 466,-955 466,-936 308,-936"/> +<text text-anchor="start" x="334.5" y="-941.8" font-family="Times,serif" font-size="14.00">AboutPlatform</text> +<polygon fill="green" stroke="transparent" points="308,-915 308,-934 466,-934 466,-915 308,-915"/> +<text text-anchor="start" x="343" y="-920.8" font-family="Times,serif" font-size="14.00">AboutTissue</text> +<polygon fill="green" stroke="transparent" points="308,-894 308,-913 466,-913 466,-894 308,-894"/> +<text text-anchor="start" x="325.5" y="-899.8" font-family="Times,serif" font-size="14.00">Acknowledgment</text> +<polygon fill="green" stroke="transparent" points="308,-873 308,-892 466,-892 466,-873 308,-873"/> +<text text-anchor="start" x="358" y="-878.8" font-family="Times,serif" font-size="14.00">Citation</text> +<polygon fill="green" stroke="transparent" points="308,-852 308,-871 466,-871 466,-852 308,-852"/> +<text text-anchor="start" x="341" y="-857.8" font-family="Times,serif" font-size="14.00">Contributors</text> +<text text-anchor="start" x="352" y="-836.8" font-family="Times,serif" font-size="14.00">DatasetId</text> +<polygon fill="green" stroke="transparent" points="308,-810 308,-829 466,-829 466,-810 308,-810"/> +<text text-anchor="start" x="338" y="-815.8" font-family="Times,serif" font-size="14.00">DatasetName</text> +<text text-anchor="start" x="328.5" y="-794.8" font-family="Times,serif" font-size="14.00">DatasetStatusId</text> +<polygon fill="green" stroke="transparent" points="308,-768 308,-787 466,-787 466,-768 308,-768"/> +<text text-anchor="start" x="320" y="-773.8" 
font-family="Times,serif" font-size="14.00">ExperimentDesign</text> +<polygon fill="green" stroke="transparent" points="308,-747 308,-766 466,-766 466,-747 308,-747"/> +<text text-anchor="start" x="350.5" y="-752.8" font-family="Times,serif" font-size="14.00">GeoSeries</text> +<text text-anchor="start" x="336" y="-731.8" font-family="Times,serif" font-size="14.00">InvestigatorId</text> +<polygon fill="green" stroke="transparent" points="308,-705 308,-724 466,-724 466,-705 308,-705"/> +<text text-anchor="start" x="365.5" y="-710.8" font-family="Times,serif" font-size="14.00">Notes</text> +<text text-anchor="start" x="330.5" y="-689.8" font-family="Times,serif" font-size="14.00">PublicationTitle</text> +<polygon fill="green" stroke="transparent" points="308,-663 308,-682 466,-682 466,-663 308,-663"/> +<text text-anchor="start" x="352" y="-668.8" font-family="Times,serif" font-size="14.00">Summary</text> +<polygon fill="none" stroke="black" points="305,-660.5 305,-1023.5 469,-1023.5 469,-660.5 305,-660.5"/> +</g> +<!-- metadata_audit->Datasets --> +<g id="edge7" class="edge"> +<title>metadata_audit:dataset_id->Datasets</title> +<path fill="none" stroke="black" d="M478,-1994C525.38,-1994 453.11,-1365.95 412.1,-1037.71"/> +<polygon fill="black" stroke="black" points="415.55,-1037.1 410.84,-1027.61 408.61,-1037.97 415.55,-1037.1"/> +</g> +<!-- GenoXRef --> +<g id="node8" class="node"> +<title>GenoXRef</title> +<polygon fill="white" stroke="transparent" points="4464,-3228 4464,-3360 4614,-3360 4614,-3228 4464,-3228"/> +<polygon fill="#df65b0" stroke="transparent" points="4467,-3336 4467,-3357 4611,-3357 4611,-3336 4467,-3336"/> +<polygon fill="none" stroke="black" points="4467,-3336 4467,-3357 4611,-3357 4611,-3336 4467,-3336"/> +<text text-anchor="start" x="4470" y="-3342.8" font-family="Times,serif" font-size="14.00">GenoXRef (14 MiB)</text> +<text text-anchor="start" x="4528" y="-3320.8" font-family="Times,serif" font-size="14.00">cM</text> +<text text-anchor="start" 
x="4514.5" y="-3299.8" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="4489" y="-3278.8" font-family="Times,serif" font-size="14.00">GenoFreezeId</text> +<text text-anchor="start" x="4513" y="-3257.8" font-family="Times,serif" font-size="14.00">GenoId</text> +<text text-anchor="start" x="4472.5" y="-3236.8" font-family="Times,serif" font-size="14.00">Used_for_mapping</text> +<polygon fill="none" stroke="black" points="4464,-3228 4464,-3360 4614,-3360 4614,-3228 4464,-3228"/> +</g> +<!-- Geno --> +<g id="node46" class="node"> +<title>Geno</title> +<polygon fill="white" stroke="transparent" points="4245,-671 4245,-1013 4383,-1013 4383,-671 4245,-671"/> +<polygon fill="#df65b0" stroke="transparent" points="4248,-989 4248,-1010 4380,-1010 4380,-989 4248,-989"/> +<polygon fill="none" stroke="black" points="4248,-989 4248,-1010 4380,-1010 4380,-989 4248,-989"/> +<text text-anchor="start" x="4262" y="-995.8" font-family="Times,serif" font-size="14.00">Geno (39 MiB)</text> +<text text-anchor="start" x="4300.5" y="-973.8" font-family="Times,serif" font-size="14.00">Chr</text> +<text text-anchor="start" x="4279" y="-952.8" font-family="Times,serif" font-size="14.00">Chr_mm8</text> +<text text-anchor="start" x="4283" y="-931.8" font-family="Times,serif" font-size="14.00">chr_num</text> +<text text-anchor="start" x="4275.5" y="-910.8" font-family="Times,serif" font-size="14.00">Comments</text> +<text text-anchor="start" x="4306.5" y="-889.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="4263" y="-868.8" font-family="Times,serif" font-size="14.00">Marker_Name</text> +<text text-anchor="start" x="4302" y="-847.8" font-family="Times,serif" font-size="14.00">Mb</text> +<text text-anchor="start" x="4280.5" y="-826.8" font-family="Times,serif" font-size="14.00">Mb_2016</text> +<text text-anchor="start" x="4280.5" y="-805.8" font-family="Times,serif" font-size="14.00">Mb_mm8</text> +<text 
text-anchor="start" x="4292.5" y="-784.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="4279" y="-763.8" font-family="Times,serif" font-size="14.00">Sequence</text> +<text text-anchor="start" x="4289" y="-742.8" font-family="Times,serif" font-size="14.00">Source</text> +<text text-anchor="start" x="4284.5" y="-721.8" font-family="Times,serif" font-size="14.00">Source2</text> +<text text-anchor="start" x="4279" y="-700.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="4250" y="-679.8" font-family="Times,serif" font-size="14.00">used_by_geno_file</text> +<polygon fill="none" stroke="black" points="4245,-671 4245,-1013 4383,-1013 4383,-671 4245,-671"/> +</g> +<!-- GenoXRef->Geno --> +<g id="edge9" class="edge"> +<title>GenoXRef:GenoId->Geno</title> +<path fill="none" stroke="black" d="M4612,-3261C4626.31,-3261 4580.57,-1213.56 4576,-1200 4540.22,-1093.91 4460.35,-992.99 4398.15,-925.69"/> +<polygon fill="black" stroke="black" points="4400.41,-922.99 4391.03,-918.06 4395.29,-927.76 4400.41,-922.99"/> +</g> +<!-- GenoFreeze --> +<g id="node82" class="node"> +<title>GenoFreeze</title> +<polygon fill="white" stroke="transparent" points="4407,-1855 4407,-2071 4559,-2071 4559,-1855 4407,-1855"/> +<polygon fill="#d7b5d8" stroke="transparent" points="4410,-2047 4410,-2068 4556,-2068 4556,-2047 4410,-2047"/> +<polygon fill="none" stroke="black" points="4410,-2047 4410,-2068 4556,-2068 4556,-2047 4410,-2047"/> +<text text-anchor="start" x="4413" y="-2053.8" font-family="Times,serif" font-size="14.00">GenoFreeze (2 KiB)</text> +<text text-anchor="start" x="4422.5" y="-2031.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text> +<text text-anchor="start" x="4431.5" y="-2010.8" font-family="Times,serif" font-size="14.00">confidentiality</text> +<text text-anchor="start" x="4441" y="-1989.8" font-family="Times,serif" font-size="14.00">CreateTime</text> +<text text-anchor="start" 
x="4448" y="-1968.8" font-family="Times,serif" font-size="14.00">FullName</text> +<text text-anchor="start" x="4475.5" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="4440" y="-1926.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="4461.5" y="-1905.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="4461" y="-1884.8" font-family="Times,serif" font-size="14.00">public</text> +<text text-anchor="start" x="4442" y="-1863.8" font-family="Times,serif" font-size="14.00">ShortName</text> +<polygon fill="none" stroke="black" points="4407,-1855 4407,-2071 4559,-2071 4559,-1855 4407,-1855"/> +</g> +<!-- GenoXRef->GenoFreeze --> +<g id="edge8" class="edge"> +<title>GenoXRef:GenoFreezeId->GenoFreeze</title> +<path fill="none" stroke="black" d="M4466,-3282C4346.95,-3282 4432.68,-2411.13 4468.93,-2085.19"/> +<polygon fill="black" stroke="black" points="4472.41,-2085.56 4470.04,-2075.24 4465.45,-2084.79 4472.41,-2085.56"/> +</g> +<!-- TissueProbeSetXRef --> +<g id="node9" class="node"> +<title>TissueProbeSetXRef</title> +<polygon fill="white" stroke="transparent" points="6347,-4748 6347,-5027 6563,-5027 6563,-4748 6347,-4748"/> +<polygon fill="#df65b0" stroke="transparent" points="6350,-5002.5 6350,-5023.5 6560,-5023.5 6560,-5002.5 6350,-5002.5"/> +<polygon fill="none" stroke="black" points="6350,-5002.5 6350,-5023.5 6560,-5023.5 6560,-5002.5 6350,-5002.5"/> +<text text-anchor="start" x="6353" y="-5009.3" font-family="Times,serif" font-size="14.00">TissueProbeSetXRef (9 MiB)</text> +<text text-anchor="start" x="6441.5" y="-4987.3" font-family="Times,serif" font-size="14.00">Chr</text> +<text text-anchor="start" x="6430.5" y="-4966.3" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="6414.5" y="-4945.3" font-family="Times,serif" font-size="14.00">description</text> +<text text-anchor="start" x="6429" y="-4924.3" 
font-family="Times,serif" font-size="14.00">GeneId</text> +<text text-anchor="start" x="6443" y="-4903.3" font-family="Times,serif" font-size="14.00">Mb</text> +<text text-anchor="start" x="6421.5" y="-4882.3" font-family="Times,serif" font-size="14.00">Mb_2016</text> +<text text-anchor="start" x="6435" y="-4861.3" font-family="Times,serif" font-size="14.00">Mean</text> +<text text-anchor="start" x="6362.5" y="-4840.3" font-family="Times,serif" font-size="14.00">Probe_Target_Description</text> +<text text-anchor="start" x="6415.5" y="-4819.3" font-family="Times,serif" font-size="14.00">ProbesetId</text> +<text text-anchor="start" x="6428" y="-4798.3" font-family="Times,serif" font-size="14.00">Symbol</text> +<text text-anchor="start" x="6367.5" y="-4777.3" font-family="Times,serif" font-size="14.00">TissueProbeSetFreezeId</text> +<text text-anchor="start" x="6419" y="-4756.3" font-family="Times,serif" font-size="14.00">useStatus</text> +<polygon fill="none" stroke="black" points="6347,-4748 6347,-5027 6563,-5027 6563,-4748 6347,-4748"/> +</g> +<!-- TissueProbeSetFreeze --> +<g id="node23" class="node"> +<title>TissueProbeSetFreeze</title> +<polygon fill="white" stroke="transparent" points="4747,-3165 4747,-3423 4977,-3423 4977,-3165 4747,-3165"/> +<polygon fill="#f1eef6" stroke="transparent" points="4750,-3399 4750,-3420 4974,-3420 4974,-3399 4750,-3399"/> +<polygon fill="none" stroke="black" points="4750,-3399 4750,-3420 4974,-3420 4974,-3399 4750,-3399"/> +<text text-anchor="start" x="4753" y="-3405.8" font-family="Times,serif" font-size="14.00">TissueProbeSetFreeze (228 B)</text> +<text text-anchor="start" x="4801.5" y="-3383.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text> +<text text-anchor="start" x="4840" y="-3362.8" font-family="Times,serif" font-size="14.00">AvgID</text> +<text text-anchor="start" x="4810.5" y="-3341.8" font-family="Times,serif" font-size="14.00">confidentiality</text> +<text text-anchor="start" x="4820" y="-3320.8" 
font-family="Times,serif" font-size="14.00">CreateTime</text> +<text text-anchor="start" x="4827" y="-3299.8" font-family="Times,serif" font-size="14.00">FullName</text> +<text text-anchor="start" x="4854.5" y="-3278.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="4840.5" y="-3257.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="4836" y="-3236.8" font-family="Times,serif" font-size="14.00">Name2</text> +<text text-anchor="start" x="4840" y="-3215.8" font-family="Times,serif" font-size="14.00">public</text> +<text text-anchor="start" x="4821" y="-3194.8" font-family="Times,serif" font-size="14.00">ShortName</text> +<text text-anchor="start" x="4786.5" y="-3173.8" font-family="Times,serif" font-size="14.00">TissueProbeFreezeId</text> +<polygon fill="none" stroke="black" points="4747,-3165 4747,-3423 4977,-3423 4977,-3165 4747,-3165"/> +</g> +<!-- TissueProbeSetXRef->TissueProbeSetFreeze --> +<g id="edge11" class="edge"> +<title>TissueProbeSetXRef:TissueProbeSetFreezeId->TissueProbeSetFreeze</title> +<path fill="none" stroke="black" d="M6349,-4780.5C5901.77,-4780.5 6243.92,-4188.23 5938,-3862 5667.77,-3573.83 5217.81,-3404.02 4995.17,-3333.49"/> +<polygon fill="black" stroke="black" points="4995.98,-3330.08 4985.39,-3330.41 4993.88,-3336.75 4995.98,-3330.08"/> +</g> +<!-- ProbeSE --> +<g id="node78" class="node"> +<title>ProbeSE</title> +<polygon fill="white" stroke="transparent" points="6992,-1918 6992,-2008 7122,-2008 7122,-1918 6992,-1918"/> +<polygon fill="#ce1256" stroke="transparent" points="6995,-1984 6995,-2005 7119,-2005 7119,-1984 6995,-1984"/> +<polygon fill="none" stroke="black" points="6995,-1984 6995,-2005 7119,-2005 7119,-1984 6995,-1984"/> +<text text-anchor="start" x="6998" y="-1990.8" font-family="Times,serif" font-size="14.00">ProbeSE (3 GiB)</text> +<text text-anchor="start" x="7032.5" y="-1968.8" font-family="Times,serif" font-size="14.00">DataId</text> +<text 
text-anchor="start" x="7038.5" y="-1947.8" font-family="Times,serif" font-size="14.00">error</text> +<text text-anchor="start" x="7027.5" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<polygon fill="none" stroke="black" points="6992,-1918 6992,-2008 7122,-2008 7122,-1918 6992,-1918"/> +</g> +<!-- TissueProbeSetXRef->ProbeSE --> +<g id="edge10" class="edge"> +<title>TissueProbeSetXRef:ProbesetId->ProbeSE</title> +<path fill="none" stroke="black" d="M6561,-4822.5C6998.45,-4822.5 6458.97,-4163.43 6776,-3862 6844.63,-3796.75 6923.59,-3897.22 6986,-3826 7107.35,-3687.52 7069.01,-2322.6 7059.04,-2022.25"/> +<polygon fill="black" stroke="black" points="7062.53,-2021.9 7058.7,-2012.02 7055.54,-2022.13 7062.53,-2021.9"/> +</g> +<!-- Homologene --> +<g id="node10" class="node"> +<title>Homologene</title> +<polygon fill="white" stroke="transparent" points="7895,-4842.5 7895,-4932.5 8055,-4932.5 8055,-4842.5 7895,-4842.5"/> +<polygon fill="#df65b0" stroke="transparent" points="7898,-4908.5 7898,-4929.5 8052,-4929.5 8052,-4908.5 7898,-4908.5"/> +<polygon fill="none" stroke="black" points="7898,-4908.5 7898,-4929.5 8052,-4929.5 8052,-4908.5 7898,-4908.5"/> +<text text-anchor="start" x="7901" y="-4915.3" font-family="Times,serif" font-size="14.00">Homologene (3 MiB)</text> +<text text-anchor="start" x="7949" y="-4893.3" font-family="Times,serif" font-size="14.00">GeneId</text> +<text text-anchor="start" x="7923" y="-4872.3" font-family="Times,serif" font-size="14.00">HomologeneId</text> +<text text-anchor="start" x="7931.5" y="-4851.3" font-family="Times,serif" font-size="14.00">TaxonomyId</text> +<polygon fill="none" stroke="black" points="7895,-4842.5 7895,-4932.5 8055,-4932.5 8055,-4842.5 7895,-4842.5"/> +</g> +<!-- PublishData --> +<g id="node11" class="node"> +<title>PublishData</title> +<polygon fill="white" stroke="transparent" points="5091,-1918 5091,-2008 5257,-2008 5257,-1918 5091,-1918"/> +<polygon fill="#df65b0" stroke="transparent" 
points="5094,-1984 5094,-2005 5254,-2005 5254,-1984 5094,-1984"/> +<polygon fill="none" stroke="black" points="5094,-1984 5094,-2005 5254,-2005 5254,-1984 5094,-1984"/> +<text text-anchor="start" x="5097" y="-1990.8" font-family="Times,serif" font-size="14.00">PublishData (34 MiB)</text> +<text text-anchor="start" x="5166.5" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="5144.5" y="-1947.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="5154.5" y="-1926.8" font-family="Times,serif" font-size="14.00">value</text> +<polygon fill="none" stroke="black" points="5091,-1918 5091,-2008 5257,-2008 5257,-1918 5091,-1918"/> +</g> +<!-- PublishData->Strain --> +<g id="edge12" class="edge"> +<title>PublishData:StrainId->Strain</title> +<path fill="none" stroke="black" d="M5255,-1951C5275.87,-1951 5264.11,-1218.38 5274,-1200 5368.85,-1023.7 5593.45,-915.93 5711.13,-869.6"/> +<polygon fill="black" stroke="black" points="5712.4,-872.86 5720.45,-865.97 5709.86,-866.34 5712.4,-872.86"/> +</g> +<!-- ProbeSetXRef --> +<g id="node12" class="node"> +<title>ProbeSetXRef</title> +<polygon fill="white" stroke="transparent" points="3033.5,-4737.5 3033.5,-5037.5 3200.5,-5037.5 3200.5,-4737.5 3033.5,-4737.5"/> +<polygon fill="#ce1256" stroke="transparent" points="3037,-5013.5 3037,-5034.5 3198,-5034.5 3198,-5013.5 3037,-5013.5"/> +<polygon fill="none" stroke="black" points="3037,-5013.5 3037,-5034.5 3198,-5034.5 3198,-5013.5 3037,-5013.5"/> +<text text-anchor="start" x="3040" y="-5020.3" font-family="Times,serif" font-size="14.00">ProbeSetXRef (2 GiB)</text> +<text text-anchor="start" x="3088.5" y="-4998.3" font-family="Times,serif" font-size="14.00">additive</text> +<text text-anchor="start" x="3093" y="-4977.3" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="3108" y="-4956.3" font-family="Times,serif" font-size="14.00">h2</text> +<text text-anchor="start" 
x="3096.5" y="-4935.3" font-family="Times,serif" font-size="14.00">Locus</text> +<text text-anchor="start" x="3082.5" y="-4914.3" font-family="Times,serif" font-size="14.00">Locus_old</text> +<text text-anchor="start" x="3102.5" y="-4893.3" font-family="Times,serif" font-size="14.00">LRS</text> +<text text-anchor="start" x="3088.5" y="-4872.3" font-family="Times,serif" font-size="14.00">LRS_old</text> +<text text-anchor="start" x="3097.5" y="-4851.3" font-family="Times,serif" font-size="14.00">mean</text> +<text text-anchor="start" x="3052.5" y="-4830.3" font-family="Times,serif" font-size="14.00">ProbeSetFreezeId</text> +<text text-anchor="start" x="3077" y="-4809.3" font-family="Times,serif" font-size="14.00">ProbeSetId</text> +<text text-anchor="start" x="3093" y="-4788.3" font-family="Times,serif" font-size="14.00">pValue</text> +<text text-anchor="start" x="3079" y="-4767.3" font-family="Times,serif" font-size="14.00">pValue_old</text> +<text text-anchor="start" x="3109.5" y="-4746.3" font-family="Times,serif" font-size="14.00">se</text> +<polygon fill="none" stroke="black" points="3033.5,-4737.5 3033.5,-5037.5 3200.5,-5037.5 3200.5,-4737.5 3033.5,-4737.5"/> +</g> +<!-- ProbeSetXRef->ProbeSE --> +<g id="edge14" class="edge"> +<title>ProbeSetXRef:ProbeSetId->ProbeSE</title> +<path fill="none" stroke="black" d="M3199,-4812.5C4021.93,-4812.5 3996.77,-4088.2 4788,-3862 4841.88,-3846.6 6765.02,-3865.27 6805,-3826 6889.39,-3743.1 6769.62,-2854.79 6843,-2762 6880.46,-2714.64 6934.85,-2771.97 6974,-2726 7149.11,-2520.43 7098.76,-2161.98 7070.36,-2022.18"/> +<polygon fill="black" stroke="black" points="7073.73,-2021.18 7068.27,-2012.1 7066.87,-2022.6 7073.73,-2021.18"/> +</g> +<!-- ProbeSetFreeze --> +<g id="node90" class="node"> +<title>ProbeSetFreeze</title> +<polygon fill="white" stroke="transparent" points="2639.5,-3144 2639.5,-3444 2838.5,-3444 2838.5,-3144 2639.5,-3144"/> +<polygon fill="#d7b5d8" stroke="transparent" points="2643,-3420 2643,-3441 2836,-3441 
2836,-3420 2643,-3420"/> +<polygon fill="none" stroke="black" points="2643,-3420 2643,-3441 2836,-3441 2836,-3420 2643,-3420"/> +<text text-anchor="start" x="2646" y="-3426.8" font-family="Times,serif" font-size="14.00">ProbeSetFreeze (171 KiB)</text> +<text text-anchor="start" x="2679" y="-3404.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text> +<text text-anchor="start" x="2717.5" y="-3383.8" font-family="Times,serif" font-size="14.00">AvgID</text> +<text text-anchor="start" x="2688" y="-3362.8" font-family="Times,serif" font-size="14.00">confidentiality</text> +<text text-anchor="start" x="2697.5" y="-3341.8" font-family="Times,serif" font-size="14.00">CreateTime</text> +<text text-anchor="start" x="2703" y="-3320.8" font-family="Times,serif" font-size="14.00">DataScale</text> +<text text-anchor="start" x="2704.5" y="-3299.8" font-family="Times,serif" font-size="14.00">FullName</text> +<text text-anchor="start" x="2732" y="-3278.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2718" y="-3257.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="2713.5" y="-3236.8" font-family="Times,serif" font-size="14.00">Name2</text> +<text text-anchor="start" x="2704.5" y="-3215.8" font-family="Times,serif" font-size="14.00">OrderList</text> +<text text-anchor="start" x="2686.5" y="-3194.8" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text> +<text text-anchor="start" x="2717.5" y="-3173.8" font-family="Times,serif" font-size="14.00">public</text> +<text text-anchor="start" x="2698.5" y="-3152.8" font-family="Times,serif" font-size="14.00">ShortName</text> +<polygon fill="none" stroke="black" points="2639.5,-3144 2639.5,-3444 2838.5,-3444 2838.5,-3144 2639.5,-3144"/> +</g> +<!-- ProbeSetXRef->ProbeSetFreeze --> +<g id="edge13" class="edge"> +<title>ProbeSetXRef:ProbeSetFreezeId->ProbeSetFreeze</title> +<path fill="none" stroke="black" d="M3036,-4833.5C2816.79,-4833.5 
2907.79,-4076.99 2865,-3862 2837.79,-3725.3 2803.24,-3570.92 2777.19,-3457.81"/> +<polygon fill="black" stroke="black" points="2780.6,-3456.98 2774.94,-3448.03 2773.77,-3458.56 2780.6,-3456.98"/> +</g> +<!-- TraitMetadata --> +<g id="node13" class="node"> +<title>TraitMetadata</title> +<polygon fill="white" stroke="transparent" points="8089,-4853 8089,-4922 8267,-4922 8267,-4853 8089,-4853"/> +<polygon fill="#d7b5d8" stroke="transparent" points="8092,-4897.5 8092,-4918.5 8264,-4918.5 8264,-4897.5 8092,-4897.5"/> +<polygon fill="none" stroke="black" points="8092,-4897.5 8092,-4918.5 8264,-4918.5 8264,-4897.5 8092,-4897.5"/> +<text text-anchor="start" x="8095" y="-4904.3" font-family="Times,serif" font-size="14.00">TraitMetadata (16 KiB)</text> +<text text-anchor="start" x="8162" y="-4882.3" font-family="Times,serif" font-size="14.00">type</text> +<text text-anchor="start" x="8158.5" y="-4861.3" font-family="Times,serif" font-size="14.00">value</text> +<polygon fill="none" stroke="black" points="8089,-4853 8089,-4922 8267,-4922 8267,-4853 8089,-4853"/> +</g> +<!-- TissueProbeSetData --> +<g id="node14" class="node"> +<title>TissueProbeSetData</title> +<polygon fill="white" stroke="transparent" points="2313.5,-1918 2313.5,-2008 2538.5,-2008 2538.5,-1918 2313.5,-1918"/> +<polygon fill="#df65b0" stroke="transparent" points="2317,-1984 2317,-2005 2536,-2005 2536,-1984 2317,-1984"/> +<polygon fill="none" stroke="black" points="2317,-1984 2317,-2005 2536,-2005 2536,-1984 2317,-1984"/> +<text text-anchor="start" x="2320" y="-1990.8" font-family="Times,serif" font-size="14.00">TissueProbeSetData (33 MiB)</text> +<text text-anchor="start" x="2419" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2395" y="-1947.8" font-family="Times,serif" font-size="14.00">TissueID</text> +<text text-anchor="start" x="2407" y="-1926.8" font-family="Times,serif" font-size="14.00">value</text> +<polygon fill="none" stroke="black" 
points="2313.5,-1918 2313.5,-2008 2538.5,-2008 2538.5,-1918 2313.5,-1918"/> +</g> +<!-- Tissue --> +<g id="node79" class="node"> +<title>Tissue</title> +<polygon fill="lightgrey" stroke="transparent" points="2372.5,-755 2372.5,-929 2497.5,-929 2497.5,-755 2372.5,-755"/> +<polygon fill="#d7b5d8" stroke="transparent" points="2376,-905 2376,-926 2495,-926 2495,-905 2376,-905"/> +<polygon fill="none" stroke="black" points="2376,-905 2376,-926 2495,-926 2495,-905 2376,-905"/> +<text text-anchor="start" x="2381" y="-911.8" font-family="Times,serif" font-size="14.00">Tissue (11 KiB)</text> +<text text-anchor="start" x="2390.5" y="-889.8" font-family="Times,serif" font-size="14.00">BIRN_lex_ID</text> +<text text-anchor="start" x="2378" y="-868.8" font-family="Times,serif" font-size="14.00">BIRN_lex_Name</text> +<text text-anchor="start" x="2428" y="-847.8" font-family="Times,serif" font-size="14.00">Id</text> +<polygon fill="green" stroke="transparent" points="2376,-821 2376,-840 2495,-840 2495,-821 2376,-821"/> +<text text-anchor="start" x="2414" y="-826.8" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="green" stroke="transparent" points="2376,-800 2376,-819 2495,-819 2495,-800 2376,-800"/> +<text text-anchor="start" x="2391" y="-805.8" font-family="Times,serif" font-size="14.00">Short_Name</text> +<text text-anchor="start" x="2405" y="-784.8" font-family="Times,serif" font-size="14.00">TissueId</text> +<text text-anchor="start" x="2391.5" y="-763.8" font-family="Times,serif" font-size="14.00">TissueName</text> +<polygon fill="none" stroke="black" points="2372.5,-755 2372.5,-929 2497.5,-929 2497.5,-755 2372.5,-755"/> +</g> +<!-- TissueProbeSetData->Tissue --> +<g id="edge15" class="edge"> +<title>TissueProbeSetData:TissueID->Tissue</title> +<path fill="none" stroke="black" d="M2537,-1951C2587.33,-1951 2488.08,-1216.42 2449.46,-943.5"/> +<polygon fill="black" stroke="black" points="2452.87,-942.61 2448,-933.2 2445.94,-943.59 2452.87,-942.61"/> 
+</g> +<!-- DBType --> +<g id="node15" class="node"> +<title>DBType</title> +<polygon fill="white" stroke="transparent" points="8304.5,-3259.5 8304.5,-3328.5 8421.5,-3328.5 8421.5,-3259.5 8304.5,-3259.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="8308,-3304 8308,-3325 8419,-3325 8419,-3304 8308,-3304"/> +<polygon fill="none" stroke="black" points="8308,-3304 8308,-3325 8419,-3325 8419,-3304 8308,-3304"/> +<text text-anchor="start" x="8311" y="-3310.8" font-family="Times,serif" font-size="14.00">DBType (99 B)</text> +<text text-anchor="start" x="8356" y="-3288.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="8342" y="-3267.8" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="none" stroke="black" points="8304.5,-3259.5 8304.5,-3328.5 8421.5,-3328.5 8421.5,-3259.5 8304.5,-3259.5"/> +</g> +<!-- DatasetStatus --> +<g id="node20" class="node"> +<title>DatasetStatus</title> +<polygon fill="lightgrey" stroke="transparent" points="305.5,-264 305.5,-333 468.5,-333 468.5,-264 305.5,-264"/> +<polygon fill="#f1eef6" stroke="transparent" points="309,-308.5 309,-329.5 466,-329.5 466,-308.5 309,-308.5"/> +<polygon fill="none" stroke="black" points="309,-308.5 309,-329.5 466,-329.5 466,-308.5 309,-308.5"/> +<text text-anchor="start" x="312" y="-315.3" font-family="Times,serif" font-size="14.00">DatasetStatus (40 B)</text> +<text text-anchor="start" x="329" y="-293.3" font-family="Times,serif" font-size="14.00">DatasetStatusId</text> +<polygon fill="green" stroke="transparent" points="309,-266.5 309,-285.5 466,-285.5 466,-266.5 309,-266.5"/> +<text text-anchor="start" x="315" y="-272.3" font-family="Times,serif" font-size="14.00">DatasetStatusName</text> +<polygon fill="none" stroke="black" points="305.5,-264 305.5,-333 468.5,-333 468.5,-264 305.5,-264"/> +</g> +<!-- Datasets->DatasetStatus --> +<g id="edge16" class="edge"> +<title>Datasets:DatasetStatusId->DatasetStatus</title> +<path fill="none" 
stroke="black" d="M467,-798C557.78,-798 449.28,-471.63 404.55,-347.04"/> +<polygon fill="black" stroke="black" points="407.75,-345.6 401.06,-337.38 401.16,-347.97 407.75,-345.6"/> +</g> +<!-- Investigators --> +<g id="node71" class="node"> +<title>Investigators</title> +<polygon fill="lightgrey" stroke="transparent" points="88,-117 88,-480 258,-480 258,-117 88,-117"/> +<polygon fill="#d7b5d8" stroke="transparent" points="91,-455.5 91,-476.5 255,-476.5 255,-455.5 91,-455.5"/> +<polygon fill="none" stroke="black" points="91,-455.5 91,-476.5 255,-476.5 255,-455.5 91,-455.5"/> +<text text-anchor="start" x="94" y="-462.3" font-family="Times,serif" font-size="14.00">Investigators (22 KiB)</text> +<polygon fill="green" stroke="transparent" points="91,-434.5 91,-453.5 255,-453.5 255,-434.5 91,-434.5"/> +<text text-anchor="start" x="144" y="-440.3" font-family="Times,serif" font-size="14.00">Address</text> +<polygon fill="green" stroke="transparent" points="91,-413.5 91,-432.5 255,-432.5 255,-413.5 91,-413.5"/> +<text text-anchor="start" x="158" y="-419.3" font-family="Times,serif" font-size="14.00">City</text> +<polygon fill="green" stroke="transparent" points="91,-392.5 91,-411.5 255,-411.5 255,-392.5 91,-392.5"/> +<text text-anchor="start" x="144" y="-398.3" font-family="Times,serif" font-size="14.00">Country</text> +<polygon fill="green" stroke="transparent" points="91,-371.5 91,-390.5 255,-390.5 255,-371.5 91,-371.5"/> +<text text-anchor="start" x="152" y="-377.3" font-family="Times,serif" font-size="14.00">Email</text> +<polygon fill="green" stroke="transparent" points="91,-350.5 91,-369.5 255,-369.5 255,-350.5 91,-350.5"/> +<text text-anchor="start" x="134.5" y="-356.3" font-family="Times,serif" font-size="14.00">FirstName</text> +<text text-anchor="start" x="122" y="-335.3" font-family="Times,serif" font-size="14.00">InvestigatorId</text> +<polygon fill="green" stroke="transparent" points="91,-308.5 91,-327.5 255,-327.5 255,-308.5 91,-308.5"/> +<text 
text-anchor="start" x="136.5" y="-314.3" font-family="Times,serif" font-size="14.00">LastName</text> +<text text-anchor="start" x="119.5" y="-293.3" font-family="Times,serif" font-size="14.00">OrganizationId</text> +<polygon fill="green" stroke="transparent" points="91,-266.5 91,-285.5 255,-285.5 255,-266.5 91,-266.5"/> +<text text-anchor="start" x="150.5" y="-272.3" font-family="Times,serif" font-size="14.00">Phone</text> +<polygon fill="green" stroke="transparent" points="91,-245.5 91,-264.5 255,-264.5 255,-245.5 91,-245.5"/> +<text text-anchor="start" x="153.5" y="-251.3" font-family="Times,serif" font-size="14.00">State</text> +<polygon fill="green" stroke="transparent" points="91,-224.5 91,-243.5 255,-243.5 255,-224.5 91,-224.5"/> +<text text-anchor="start" x="161" y="-230.3" font-family="Times,serif" font-size="14.00">Url</text> +<text text-anchor="start" x="138.5" y="-209.3" font-family="Times,serif" font-size="14.00">UserDate</text> +<text text-anchor="start" x="136.5" y="-188.3" font-family="Times,serif" font-size="14.00">UserLevel</text> +<text text-anchor="start" x="134.5" y="-167.3" font-family="Times,serif" font-size="14.00">UserName</text> +<text text-anchor="start" x="139.5" y="-146.3" font-family="Times,serif" font-size="14.00">UserPass</text> +<polygon fill="green" stroke="transparent" points="91,-119.5 91,-138.5 255,-138.5 255,-119.5 91,-119.5"/> +<text text-anchor="start" x="143" y="-125.3" font-family="Times,serif" font-size="14.00">ZipCode</text> +<polygon fill="none" stroke="black" points="88,-117 88,-480 258,-480 258,-117 88,-117"/> +</g> +<!-- Datasets->Investigators --> +<g id="edge17" class="edge"> +<title>Datasets:InvestigatorId->Investigators</title> +<path fill="none" stroke="black" d="M307,-735C252.81,-735 218.24,-610.26 197.82,-494.3"/> +<polygon fill="black" stroke="black" points="201.22,-493.45 196.07,-484.19 194.32,-494.64 201.22,-493.45"/> +</g> +<!-- IndelAll --> +<g id="node17" class="node"> +<title>IndelAll</title> +<polygon 
fill="white" stroke="transparent" points="3168,-692 3168,-992 3302,-992 3302,-692 3168,-692"/> +<polygon fill="#df65b0" stroke="transparent" points="3171,-968 3171,-989 3299,-989 3299,-968 3171,-968"/> +<polygon fill="none" stroke="black" points="3171,-968 3171,-989 3299,-989 3299,-968 3171,-968"/> +<text text-anchor="start" x="3174" y="-974.8" font-family="Times,serif" font-size="14.00">IndelAll (17 MiB)</text> +<text text-anchor="start" x="3188" y="-952.8" font-family="Times,serif" font-size="14.00">Chromosome</text> +<text text-anchor="start" x="3227.5" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="3181" y="-910.8" font-family="Times,serif" font-size="14.00">InDelSequence</text> +<text text-anchor="start" x="3206.5" y="-889.8" font-family="Times,serif" font-size="14.00">Mb_end</text> +<text text-anchor="start" x="3185" y="-868.8" font-family="Times,serif" font-size="14.00">Mb_end_2016</text> +<text text-anchor="start" x="3202.5" y="-847.8" font-family="Times,serif" font-size="14.00">Mb_start</text> +<text text-anchor="start" x="3181" y="-826.8" font-family="Times,serif" font-size="14.00">Mb_start_2016</text> +<text text-anchor="start" x="3213.5" y="-805.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="3219.5" y="-784.8" font-family="Times,serif" font-size="14.00">Size</text> +<text text-anchor="start" x="3203" y="-763.8" font-family="Times,serif" font-size="14.00">SourceId</text> +<text text-anchor="start" x="3200" y="-742.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="3210.5" y="-721.8" font-family="Times,serif" font-size="14.00">Strand</text> +<text text-anchor="start" x="3217.5" y="-700.8" font-family="Times,serif" font-size="14.00">Type</text> +<polygon fill="none" stroke="black" points="3168,-692 3168,-992 3302,-992 3302,-692 3168,-692"/> +</g> +<!-- IndelAll->Species --> +<g id="edge18" class="edge"> 
+<title>IndelAll:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M3170,-746C3144.8,-746 3164.16,-541.49 3151,-520 3088.71,-418.27 2960,-356.26 2875.88,-324.91"/> +<polygon fill="black" stroke="black" points="2876.95,-321.58 2866.36,-321.42 2874.55,-328.15 2876.95,-321.58"/> +</g> +<!-- GORef --> +<g id="node18" class="node"> +<title>GORef</title> +<polygon fill="white" stroke="transparent" points="8459.5,-4842.5 8459.5,-4932.5 8576.5,-4932.5 8576.5,-4842.5 8459.5,-4842.5"/> +<polygon fill="#df65b0" stroke="transparent" points="8463,-4908.5 8463,-4929.5 8574,-4929.5 8574,-4908.5 8463,-4908.5"/> +<polygon fill="none" stroke="black" points="8463,-4908.5 8463,-4929.5 8574,-4929.5 8574,-4908.5 8463,-4908.5"/> +<text text-anchor="start" x="8466" y="-4915.3" font-family="Times,serif" font-size="14.00">GORef (2 MiB)</text> +<text text-anchor="start" x="8497" y="-4893.3" font-family="Times,serif" font-size="14.00">genes</text> +<text text-anchor="start" x="8492.5" y="-4872.3" font-family="Times,serif" font-size="14.00">goterm</text> +<text text-anchor="start" x="8511.5" y="-4851.3" font-family="Times,serif" font-size="14.00">id</text> +<polygon fill="none" stroke="black" points="8459.5,-4842.5 8459.5,-4932.5 8576.5,-4932.5 8576.5,-4842.5 8459.5,-4842.5"/> +</g> +<!-- Publication --> +<g id="node19" class="node"> +<title>Publication</title> +<polygon fill="lightgrey" stroke="transparent" points="2531.5,-723.5 2531.5,-960.5 2682.5,-960.5 2682.5,-723.5 2531.5,-723.5"/> +<polygon fill="#df65b0" stroke="transparent" points="2535,-936 2535,-957 2680,-957 2680,-936 2535,-936"/> +<polygon fill="none" stroke="black" points="2535,-936 2535,-957 2680,-957 2680,-936 2535,-936"/> +<text text-anchor="start" x="2538" y="-942.8" font-family="Times,serif" font-size="14.00">Publication (7 MiB)</text> +<polygon fill="green" stroke="transparent" points="2535,-915 2535,-934 2680,-934 2680,-915 2535,-915"/> +<text text-anchor="start" x="2577" y="-920.8" font-family="Times,serif" 
font-size="14.00">Abstract</text> +<polygon fill="green" stroke="transparent" points="2535,-894 2535,-913 2680,-913 2680,-894 2535,-894"/> +<text text-anchor="start" x="2579" y="-899.8" font-family="Times,serif" font-size="14.00">Authors</text> +<polygon fill="green" stroke="transparent" points="2535,-873 2535,-892 2680,-892 2680,-873 2535,-873"/> +<text text-anchor="start" x="2581.5" y="-878.8" font-family="Times,serif" font-size="14.00">Journal</text> +<polygon fill="green" stroke="transparent" points="2535,-852 2535,-871 2680,-871 2680,-852 2535,-852"/> +<text text-anchor="start" x="2584" y="-857.8" font-family="Times,serif" font-size="14.00">Month</text> +<polygon fill="green" stroke="transparent" points="2535,-831 2535,-850 2680,-850 2680,-831 2535,-831"/> +<text text-anchor="start" x="2586" y="-836.8" font-family="Times,serif" font-size="14.00">Pages</text> +<polygon fill="green" stroke="transparent" points="2535,-810 2535,-829 2680,-829 2680,-810 2535,-810"/> +<text text-anchor="start" x="2566" y="-815.8" font-family="Times,serif" font-size="14.00">PubMed_ID</text> +<polygon fill="green" stroke="transparent" points="2535,-789 2535,-808 2680,-808 2680,-789 2535,-789"/> +<text text-anchor="start" x="2591" y="-794.8" font-family="Times,serif" font-size="14.00">Title</text> +<polygon fill="green" stroke="transparent" points="2535,-768 2535,-787 2680,-787 2680,-768 2535,-768"/> +<text text-anchor="start" x="2581" y="-773.8" font-family="Times,serif" font-size="14.00">Volume</text> +<polygon fill="green" stroke="transparent" points="2535,-747 2535,-766 2680,-766 2680,-747 2535,-747"/> +<text text-anchor="start" x="2591.5" y="-752.8" font-family="Times,serif" font-size="14.00">Year</text> +<text text-anchor="start" x="2600" y="-731.8" font-family="Times,serif" font-size="14.00">Id</text> +<polygon fill="none" stroke="black" points="2531.5,-723.5 2531.5,-960.5 2682.5,-960.5 2682.5,-723.5 2531.5,-723.5"/> +</g> +<!-- PublishFreeze --> +<g id="node21" class="node"> 
+<title>PublishFreeze</title> +<polygon fill="white" stroke="transparent" points="3246.5,-1855 3246.5,-2071 3415.5,-2071 3415.5,-1855 3246.5,-1855"/> +<polygon fill="#d7b5d8" stroke="transparent" points="3250,-2047 3250,-2068 3413,-2068 3413,-2047 3250,-2047"/> +<polygon fill="none" stroke="black" points="3250,-2047 3250,-2068 3413,-2068 3413,-2047 3250,-2047"/> +<text text-anchor="start" x="3253" y="-2053.8" font-family="Times,serif" font-size="14.00">PublishFreeze (6 KiB)</text> +<text text-anchor="start" x="3271" y="-2031.8" font-family="Times,serif" font-size="14.00">AuthorisedUsers</text> +<text text-anchor="start" x="3280" y="-2010.8" font-family="Times,serif" font-size="14.00">confidentiality</text> +<text text-anchor="start" x="3289.5" y="-1989.8" font-family="Times,serif" font-size="14.00">CreateTime</text> +<text text-anchor="start" x="3296.5" y="-1968.8" font-family="Times,serif" font-size="14.00">FullName</text> +<text text-anchor="start" x="3324" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="3288.5" y="-1926.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="3310" y="-1905.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="3309.5" y="-1884.8" font-family="Times,serif" font-size="14.00">public</text> +<text text-anchor="start" x="3290.5" y="-1863.8" font-family="Times,serif" font-size="14.00">ShortName</text> +<polygon fill="none" stroke="black" points="3246.5,-1855 3246.5,-2071 3415.5,-2071 3415.5,-1855 3246.5,-1855"/> +</g> +<!-- InbredSet --> +<g id="node28" class="node"> +<title>InbredSet</title> +<polygon fill="lightgrey" stroke="transparent" points="3781.5,-692 3781.5,-992 3928.5,-992 3928.5,-692 3781.5,-692"/> +<polygon fill="#d7b5d8" stroke="transparent" points="3785,-968 3785,-989 3926,-989 3926,-968 3785,-968"/> +<polygon fill="none" stroke="black" points="3785,-968 3785,-989 3926,-989 3926,-968 3785,-968"/> 
+<text text-anchor="start" x="3788" y="-974.8" font-family="Times,serif" font-size="14.00">InbredSet (10 KiB)</text> +<text text-anchor="start" x="3810" y="-952.8" font-family="Times,serif" font-size="14.00">FamilyOrder</text> +<text text-anchor="start" x="3848" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="3801.5" y="-910.8" font-family="Times,serif" font-size="14.00">InbredSetCode</text> +<text text-anchor="start" x="3812.5" y="-889.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="3798.5" y="-868.8" font-family="Times,serif" font-size="14.00">InbredSetName</text> +<text text-anchor="start" x="3789" y="-847.8" font-family="Times,serif" font-size="14.00">MappingMethodId</text> +<text text-anchor="start" x="3807" y="-826.8" font-family="Times,serif" font-size="14.00">MenuOrderId</text> +<polygon fill="green" stroke="transparent" points="3785,-800 3785,-819 3926,-819 3926,-800 3785,-800"/> +<text text-anchor="start" x="3834" y="-805.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="3833.5" y="-784.8" font-family="Times,serif" font-size="14.00">public</text> +<text text-anchor="start" x="3820.5" y="-763.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<polygon fill="green" stroke="transparent" points="3785,-737 3785,-756 3926,-756 3926,-737 3785,-737"/> +<text text-anchor="start" x="3831" y="-742.8" font-family="Times,serif" font-size="14.00">Family</text> +<polygon fill="green" stroke="transparent" points="3785,-716 3785,-735 3926,-735 3926,-716 3785,-716"/> +<text text-anchor="start" x="3820.5" y="-721.8" font-family="Times,serif" font-size="14.00">FullName</text> +<polygon fill="green" stroke="transparent" points="3785,-695 3785,-714 3926,-714 3926,-695 3785,-695"/> +<text text-anchor="start" x="3810.5" y="-700.8" font-family="Times,serif" font-size="14.00">GeneticType</text> +<polygon fill="none" stroke="black" 
points="3781.5,-692 3781.5,-992 3928.5,-992 3928.5,-692 3781.5,-692"/> +</g> +<!-- PublishFreeze->InbredSet --> +<g id="edge19" class="edge"> +<title>PublishFreeze:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M3414,-1930C3454.58,-1930 3409.48,-1229.81 3437,-1200 3485.84,-1147.1 3703.73,-1210.15 3759,-1164 3805.64,-1125.05 3830.2,-1064.45 3842.93,-1006.34"/> +<polygon fill="black" stroke="black" points="3846.42,-1006.79 3845.03,-996.28 3839.56,-1005.36 3846.42,-1006.79"/> +</g> +<!-- TissueProbeFreeze --> +<g id="node22" class="node"> +<title>TissueProbeFreeze</title> +<polygon fill="white" stroke="transparent" points="4631,-1865.5 4631,-2060.5 4837,-2060.5 4837,-1865.5 4631,-1865.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="4634,-2036 4634,-2057 4834,-2057 4834,-2036 4634,-2036"/> +<polygon fill="none" stroke="black" points="4634,-2036 4634,-2057 4834,-2057 4834,-2036 4634,-2036"/> +<text text-anchor="start" x="4637" y="-2042.8" font-family="Times,serif" font-size="14.00">TissueProbeFreeze (116 B)</text> +<text text-anchor="start" x="4710" y="-2020.8" font-family="Times,serif" font-size="14.00">ChipId</text> +<text text-anchor="start" x="4692" y="-1999.8" font-family="Times,serif" font-size="14.00">CreateTime</text> +<text text-anchor="start" x="4699" y="-1978.8" font-family="Times,serif" font-size="14.00">FullName</text> +<text text-anchor="start" x="4726.5" y="-1957.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="4691" y="-1936.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="4712.5" y="-1915.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="4693" y="-1894.8" font-family="Times,serif" font-size="14.00">ShortName</text> +<text text-anchor="start" x="4704.5" y="-1873.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<polygon fill="none" stroke="black" points="4631,-1865.5 4631,-2060.5 
4837,-2060.5 4837,-1865.5 4631,-1865.5"/> +</g> +<!-- TissueProbeFreeze->InbredSet --> +<g id="edge20" class="edge"> +<title>TissueProbeFreeze:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M4633,-1940C4550.53,-1940 4633.54,-1259.07 4576,-1200 4521.75,-1144.31 4299.4,-1194.77 4228,-1164 4116.11,-1115.79 4013.14,-1021.68 3943.86,-947.77"/> +<polygon fill="black" stroke="black" points="3946.22,-945.17 3936.85,-940.23 3941.1,-949.94 3946.22,-945.17"/> +</g> +<!-- TissueProbeSetFreeze->TissueProbeFreeze --> +<g id="edge21" class="edge"> +<title>TissueProbeSetFreeze:TissueProbeFreezeId->TissueProbeFreeze</title> +<path fill="none" stroke="black" d="M4862,-3167C4862,-2762.54 4789.57,-2285.87 4753.68,-2074.48"/> +<polygon fill="black" stroke="black" points="4757.13,-2073.88 4752,-2064.61 4750.23,-2075.06 4757.13,-2073.88"/> +</g> +<!-- ProbeXRef --> +<g id="node24" class="node"> +<title>ProbeXRef</title> +<polygon fill="white" stroke="transparent" points="4805,-4842.5 4805,-4932.5 4969,-4932.5 4969,-4842.5 4805,-4842.5"/> +<polygon fill="#df65b0" stroke="transparent" points="4808,-4908.5 4808,-4929.5 4966,-4929.5 4966,-4908.5 4808,-4908.5"/> +<polygon fill="none" stroke="black" points="4808,-4908.5 4808,-4929.5 4966,-4929.5 4966,-4908.5 4808,-4908.5"/> +<text text-anchor="start" x="4811" y="-4915.3" font-family="Times,serif" font-size="14.00">ProbeXRef (229 MiB)</text> +<text text-anchor="start" x="4862.5" y="-4893.3" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="4834" y="-4872.3" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text> +<text text-anchor="start" x="4858.5" y="-4851.3" font-family="Times,serif" font-size="14.00">ProbeId</text> +<polygon fill="none" stroke="black" points="4805,-4842.5 4805,-4932.5 4969,-4932.5 4969,-4842.5 4805,-4842.5"/> +</g> +<!-- Probe --> +<g id="node41" class="node"> +<title>Probe</title> +<polygon fill="white" stroke="transparent" points="6860.5,-3186 
6860.5,-3402 6969.5,-3402 6969.5,-3186 6860.5,-3186"/> +<polygon fill="#ce1256" stroke="transparent" points="6864,-3378 6864,-3399 6967,-3399 6967,-3378 6864,-3378"/> +<polygon fill="none" stroke="black" points="6864,-3378 6864,-3399 6967,-3399 6967,-3378 6864,-3378"/> +<text text-anchor="start" x="6867" y="-3384.8" font-family="Times,serif" font-size="14.00">Probe (2 GiB)</text> +<text text-anchor="start" x="6891" y="-3362.8" font-family="Times,serif" font-size="14.00">E_GSB</text> +<text text-anchor="start" x="6890.5" y="-3341.8" font-family="Times,serif" font-size="14.00">E_NSB</text> +<text text-anchor="start" x="6887" y="-3320.8" font-family="Times,serif" font-size="14.00">ExonNo</text> +<text text-anchor="start" x="6908" y="-3299.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="6894" y="-3278.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="6875" y="-3257.8" font-family="Times,serif" font-size="14.00">ProbeSetId</text> +<text text-anchor="start" x="6880.5" y="-3236.8" font-family="Times,serif" font-size="14.00">Sequence</text> +<text text-anchor="start" x="6873" y="-3215.8" font-family="Times,serif" font-size="14.00">SerialOrder</text> +<text text-anchor="start" x="6904" y="-3194.8" font-family="Times,serif" font-size="14.00">Tm</text> +<polygon fill="none" stroke="black" points="6860.5,-3186 6860.5,-3402 6969.5,-3402 6969.5,-3186 6860.5,-3186"/> +</g> +<!-- ProbeXRef->Probe --> +<g id="edge23" class="edge"> +<title>ProbeXRef:ProbeId->Probe</title> +<path fill="none" stroke="black" d="M4967,-4854.5C5534.68,-4854.5 5262.79,-4114.96 5771,-3862 5877.2,-3809.14 6749.63,-3905.13 6838,-3826 6950.47,-3725.29 6951.4,-3539.28 6936.93,-3416.33"/> +<polygon fill="black" stroke="black" points="6940.37,-3415.61 6935.68,-3406.11 6933.42,-3416.47 6940.37,-3415.61"/> +</g> +<!-- ProbeXRef->ProbeFreeze --> +<g id="edge22" class="edge"> +<title>ProbeXRef:ProbeFreezeId->ProbeFreeze</title> +<path 
fill="none" stroke="black" d="M4807,-4875.5C3968.98,-4875.5 3960.35,-4248.91 3217,-3862 3179.88,-3842.68 3157.46,-3857.58 3130,-3826 2809.52,-3457.41 3148.75,-3152.22 2855,-2762 2836.07,-2736.85 2811.36,-2752.26 2794,-2726 2665.13,-2531.04 2665.79,-2246.15 2679.06,-2085.66"/> +<polygon fill="black" stroke="black" points="2682.59,-2085.53 2679.95,-2075.27 2675.61,-2084.93 2682.59,-2085.53"/> +</g> +<!-- Publication_Test --> +<g id="node25" class="node"> +<title>Publication_Test</title> +<polygon fill="white" stroke="transparent" points="8610.5,-4769 8610.5,-5006 8797.5,-5006 8797.5,-4769 8610.5,-4769"/> +<polygon fill="#df65b0" stroke="transparent" points="8614,-4981.5 8614,-5002.5 8795,-5002.5 8795,-4981.5 8614,-4981.5"/> +<polygon fill="none" stroke="black" points="8614,-4981.5 8614,-5002.5 8795,-5002.5 8795,-4981.5 8614,-4981.5"/> +<text text-anchor="start" x="8617" y="-4988.3" font-family="Times,serif" font-size="14.00">Publication_Test (7 MiB)</text> +<text text-anchor="start" x="8674" y="-4966.3" font-family="Times,serif" font-size="14.00">Abstract</text> +<text text-anchor="start" x="8676" y="-4945.3" font-family="Times,serif" font-size="14.00">Authors</text> +<text text-anchor="start" x="8697" y="-4924.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="8678.5" y="-4903.3" font-family="Times,serif" font-size="14.00">Journal</text> +<text text-anchor="start" x="8681" y="-4882.3" font-family="Times,serif" font-size="14.00">Month</text> +<text text-anchor="start" x="8683" y="-4861.3" font-family="Times,serif" font-size="14.00">Pages</text> +<text text-anchor="start" x="8663" y="-4840.3" font-family="Times,serif" font-size="14.00">PubMed_ID</text> +<text text-anchor="start" x="8688" y="-4819.3" font-family="Times,serif" font-size="14.00">Title</text> +<text text-anchor="start" x="8678" y="-4798.3" font-family="Times,serif" font-size="14.00">Volume</text> +<text text-anchor="start" x="8688.5" y="-4777.3" 
font-family="Times,serif" font-size="14.00">Year</text> +<polygon fill="none" stroke="black" points="8610.5,-4769 8610.5,-5006 8797.5,-5006 8797.5,-4769 8610.5,-4769"/> +</g> +<!-- DBList --> +<g id="node26" class="node"> +<title>DBList</title> +<polygon fill="white" stroke="transparent" points="8301,-4821.5 8301,-4953.5 8425,-4953.5 8425,-4821.5 8301,-4821.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="8304,-4929.5 8304,-4950.5 8422,-4950.5 8422,-4929.5 8304,-4929.5"/> +<polygon fill="none" stroke="black" points="8304,-4929.5 8304,-4950.5 8422,-4950.5 8422,-4929.5 8304,-4929.5"/> +<text text-anchor="start" x="8307" y="-4936.3" font-family="Times,serif" font-size="14.00">DBList (99 KiB)</text> +<text text-anchor="start" x="8344.5" y="-4914.3" font-family="Times,serif" font-size="14.00">Code</text> +<text text-anchor="start" x="8327.5" y="-4893.3" font-family="Times,serif" font-size="14.00">DBTypeId</text> +<text text-anchor="start" x="8331" y="-4872.3" font-family="Times,serif" font-size="14.00">FreezeId</text> +<text text-anchor="start" x="8355.5" y="-4851.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="8341.5" y="-4830.3" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="none" stroke="black" points="8301,-4821.5 8301,-4953.5 8425,-4953.5 8425,-4821.5 8301,-4821.5"/> +</g> +<!-- DBList->DBType --> +<g id="edge24" class="edge"> +<title>DBList:DBTypeId->DBType</title> +<path fill="none" stroke="black" d="M8423,-4897.5C8462.94,-4897.5 8383.01,-3608.94 8366.07,-3342.76"/> +<polygon fill="black" stroke="black" points="8369.55,-3342.4 8365.42,-3332.64 8362.57,-3342.84 8369.55,-3342.4"/> +</g> +<!-- H2 --> +<g id="node27" class="node"> +<title>H2</title> +<polygon fill="white" stroke="transparent" points="8831.5,-4832 8831.5,-4943 8922.5,-4943 8922.5,-4832 8831.5,-4832"/> +<polygon fill="#df65b0" stroke="transparent" points="8835,-4918.5 8835,-4939.5 8920,-4939.5 8920,-4918.5 8835,-4918.5"/> 
+<polygon fill="none" stroke="black" points="8835,-4918.5 8835,-4939.5 8920,-4939.5 8920,-4918.5 8835,-4918.5"/> +<text text-anchor="start" x="8838" y="-4925.3" font-family="Times,serif" font-size="14.00">H2 (2 MiB)</text> +<text text-anchor="start" x="8853" y="-4903.3" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="8856.5" y="-4882.3" font-family="Times,serif" font-size="14.00">H2SE</text> +<text text-anchor="start" x="8856" y="-4861.3" font-family="Times,serif" font-size="14.00">HPH2</text> +<text text-anchor="start" x="8859" y="-4840.3" font-family="Times,serif" font-size="14.00">ICH2</text> +<polygon fill="none" stroke="black" points="8831.5,-4832 8831.5,-4943 8922.5,-4943 8922.5,-4832 8831.5,-4832"/> +</g> +<!-- InbredSet->Species --> +<g id="edge25" class="edge"> +<title>InbredSet:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M3784,-767C3728.83,-767 3795.51,-561.36 3759,-520 3641.66,-387.09 3085.79,-325.05 2876.21,-306.09"/> +<polygon fill="black" stroke="black" points="2876.47,-302.6 2866.2,-305.19 2875.85,-309.57 2876.47,-302.6"/> +</g> +<!-- DatasetMapInvestigator --> +<g id="node29" class="node"> +<title>DatasetMapInvestigator</title> +<polygon fill="white" stroke="transparent" points="8,-1918 8,-2008 258,-2008 258,-1918 8,-1918"/> +<polygon fill="#d7b5d8" stroke="transparent" points="11,-1984 11,-2005 255,-2005 255,-1984 11,-1984"/> +<polygon fill="none" stroke="black" points="11,-1984 11,-2005 255,-2005 255,-1984 11,-1984"/> +<text text-anchor="start" x="14" y="-1990.8" font-family="Times,serif" font-size="14.00">DatasetMapInvestigator (28 KiB)</text> +<text text-anchor="start" x="98" y="-1968.8" font-family="Times,serif" font-size="14.00">DatasetId</text> +<text text-anchor="start" x="125.5" y="-1947.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="82" y="-1926.8" font-family="Times,serif" font-size="14.00">InvestigatorId</text> +<polygon fill="none" 
stroke="black" points="8,-1918 8,-2008 258,-2008 258,-1918 8,-1918"/> +</g> +<!-- DatasetMapInvestigator->Datasets --> +<g id="edge26" class="edge"> +<title>DatasetMapInvestigator:DatasetId->Datasets</title> +<path fill="none" stroke="black" d="M256,-1973C277.48,-1973 271.49,-1221.19 275,-1200 283.9,-1146.31 298.97,-1089.52 315.22,-1037.42"/> +<polygon fill="black" stroke="black" points="318.6,-1038.33 318.27,-1027.74 311.93,-1036.23 318.6,-1038.33"/> +</g> +<!-- DatasetMapInvestigator->Investigators --> +<g id="edge27" class="edge"> +<title>DatasetMapInvestigator:InvestigatorId->Investigators</title> +<path fill="none" stroke="black" d="M133,-1920C133,-1405.22 153.42,-798.72 165.08,-494.41"/> +<polygon fill="black" stroke="black" points="168.59,-494.29 165.48,-484.16 161.59,-494.02 168.59,-494.29"/> +</g> +<!-- Docs --> +<g id="node30" class="node"> +<title>Docs</title> +<polygon fill="white" stroke="transparent" points="8956.5,-4832 8956.5,-4943 9075.5,-4943 9075.5,-4832 8956.5,-4832"/> +<polygon fill="#d7b5d8" stroke="transparent" points="8960,-4918.5 8960,-4939.5 9073,-4939.5 9073,-4918.5 8960,-4918.5"/> +<polygon fill="none" stroke="black" points="8960,-4918.5 8960,-4939.5 9073,-4939.5 9073,-4918.5 8960,-4918.5"/> +<text text-anchor="start" x="8963" y="-4925.3" font-family="Times,serif" font-size="14.00">Docs (148 KiB)</text> +<text text-anchor="start" x="8989" y="-4903.3" font-family="Times,serif" font-size="14.00">content</text> +<text text-anchor="start" x="8997" y="-4882.3" font-family="Times,serif" font-size="14.00">entry</text> +<text text-anchor="start" x="9009.5" y="-4861.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="9001.5" y="-4840.3" font-family="Times,serif" font-size="14.00">title</text> +<polygon fill="none" stroke="black" points="8956.5,-4832 8956.5,-4943 9075.5,-4943 9075.5,-4832 8956.5,-4832"/> +</g> +<!-- Phenotype --> +<g id="node31" class="node"> +<title>Phenotype</title> +<polygon fill="lightgrey" 
stroke="transparent" points="2910,-713 2910,-971 3134,-971 3134,-713 2910,-713"/> +<polygon fill="#df65b0" stroke="transparent" points="2913,-947 2913,-968 3131,-968 3131,-947 2913,-947"/> +<polygon fill="none" stroke="black" points="2913,-947 2913,-968 3131,-968 3131,-947 2913,-947"/> +<text text-anchor="start" x="2955" y="-953.8" font-family="Times,serif" font-size="14.00">Phenotype (9 MiB)</text> +<text text-anchor="start" x="3014.5" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2915" y="-910.8" font-family="Times,serif" font-size="14.00">Post_publication_abbreviation</text> +<text text-anchor="start" x="2918" y="-889.8" font-family="Times,serif" font-size="14.00">Pre_publication_abbreviation</text> +<polygon fill="green" stroke="transparent" points="2913,-863 2913,-882 3131,-882 3131,-863 2913,-863"/> +<text text-anchor="start" x="2958.5" y="-868.8" font-family="Times,serif" font-size="14.00">Authorized_Users</text> +<polygon fill="green" stroke="transparent" points="2913,-842 2913,-861 3131,-861 3131,-842 2913,-842"/> +<text text-anchor="start" x="2988.5" y="-847.8" font-family="Times,serif" font-size="14.00">Lab_code</text> +<polygon fill="green" stroke="transparent" points="2913,-821 2913,-840 3131,-840 3131,-821 2913,-821"/> +<text text-anchor="start" x="2949.5" y="-826.8" font-family="Times,serif" font-size="14.00">Original_description</text> +<polygon fill="green" stroke="transparent" points="2913,-800 2913,-819 3131,-819 3131,-800 2913,-800"/> +<text text-anchor="start" x="2998" y="-805.8" font-family="Times,serif" font-size="14.00">Owner</text> +<polygon fill="green" stroke="transparent" points="2913,-779 2913,-798 3131,-798 3131,-779 2913,-779"/> +<text text-anchor="start" x="2919.5" y="-784.8" font-family="Times,serif" font-size="14.00">Post_publication_description</text> +<polygon fill="green" stroke="transparent" points="2913,-758 2913,-777 3131,-777 3131,-758 2913,-758"/> +<text text-anchor="start" 
x="2922.5" y="-763.8" font-family="Times,serif" font-size="14.00">Pre_publication_description</text> +<polygon fill="green" stroke="transparent" points="2913,-737 2913,-756 3131,-756 3131,-737 2913,-737"/> +<text text-anchor="start" x="2985.5" y="-742.8" font-family="Times,serif" font-size="14.00">Submitter</text> +<polygon fill="green" stroke="transparent" points="2913,-716 2913,-735 3131,-735 3131,-716 2913,-716"/> +<text text-anchor="start" x="3002" y="-721.8" font-family="Times,serif" font-size="14.00">Units</text> +<polygon fill="none" stroke="black" points="2910,-713 2910,-971 3134,-971 3134,-713 2910,-713"/> +</g> +<!-- SnpPattern --> +<g id="node32" class="node"> +<title>SnpPattern</title> +<polygon fill="white" stroke="transparent" points="9110,-3866 9110,-5909 9294,-5909 9294,-3866 9110,-3866"/> +<polygon fill="#ce1256" stroke="transparent" points="9113,-5884.5 9113,-5905.5 9291,-5905.5 9291,-5884.5 9113,-5884.5"/> +<polygon fill="none" stroke="black" points="9113,-5884.5 9113,-5905.5 9291,-5905.5 9291,-5884.5 9113,-5884.5"/> +<text text-anchor="start" x="9134" y="-5891.3" font-family="Times,serif" font-size="14.00">SnpPattern (8 GiB)</text> +<text text-anchor="start" x="9150.5" y="-5869.3" font-family="Times,serif" font-size="14.00">129P2/OlaHsd</text> +<text text-anchor="start" x="9155.5" y="-5848.3" font-family="Times,serif" font-size="14.00">129S1/SvImJ</text> +<text text-anchor="start" x="9153.5" y="-5827.3" font-family="Times,serif" font-size="14.00">129S2/SvHsd</text> +<text text-anchor="start" x="9156.5" y="-5806.3" font-family="Times,serif" font-size="14.00">129S4/SvJae</text> +<text text-anchor="start" x="9145" y="-5785.3" font-family="Times,serif" font-size="14.00">129S5/SvEvBrd</text> +<text text-anchor="start" x="9158" y="-5764.3" font-family="Times,serif" font-size="14.00">129S6/SvEv</text> +<text text-anchor="start" x="9149.5" y="-5743.3" font-family="Times,serif" font-size="14.00">129T2/SvEmsJ</text> +<text text-anchor="start" x="9165" 
y="-5722.3" font-family="Times,serif" font-size="14.00">129X1/SvJ</text> +<text text-anchor="start" x="9192" y="-5701.3" font-family="Times,serif" font-size="14.00">A/J</text> +<text text-anchor="start" x="9181.5" y="-5680.3" font-family="Times,serif" font-size="14.00">AKR/J</text> +<text text-anchor="start" x="9115" y="-5659.3" font-family="Times,serif" font-size="14.00">B6A6_Esline_Regeneron</text> +<text text-anchor="start" x="9164" y="-5638.3" font-family="Times,serif" font-size="14.00">BALB/cByJ</text> +<text text-anchor="start" x="9173" y="-5617.3" font-family="Times,serif" font-size="14.00">BALB/cJ</text> +<text text-anchor="start" x="9176" y="-5596.3" font-family="Times,serif" font-size="14.00">BPH/2J</text> +<text text-anchor="start" x="9177.5" y="-5575.3" font-family="Times,serif" font-size="14.00">BPL/1J</text> +<text text-anchor="start" x="9176" y="-5554.3" font-family="Times,serif" font-size="14.00">BPN/3J</text> +<text text-anchor="start" x="9148.5" y="-5533.3" font-family="Times,serif" font-size="14.00">BTBRT&lt;+&gt;tf/J</text> +<text text-anchor="start" x="9170.5" y="-5512.3" font-family="Times,serif" font-size="14.00">BUB/BnJ</text> +<text text-anchor="start" x="9135.5" y="-5491.3" font-family="Times,serif" font-size="14.00">C2T1_Esline_Nagy</text> +<text text-anchor="start" x="9171" y="-5470.3" font-family="Times,serif" font-size="14.00">C3H/HeJ</text> +<text text-anchor="start" x="9163" y="-5449.3" font-family="Times,serif" font-size="14.00">C3HeB/FeJ</text> +<text text-anchor="start" x="9164" y="-5428.3" font-family="Times,serif" font-size="14.00">C57BL/10J</text> +<text text-anchor="start" x="9159" y="-5407.3" font-family="Times,serif" font-size="14.00">C57BL/6ByJ</text> +<text text-anchor="start" x="9168.5" y="-5386.3" font-family="Times,serif" font-size="14.00">C57BL/6J</text> +<text text-anchor="start" x="9140" y="-5365.3" font-family="Times,serif" font-size="14.00">C57BL/6JBomTac</text> +<text text-anchor="start" x="9157.5" y="-5344.3"
font-family="Times,serif" font-size="14.00">C57BL/6JCrl</text> +<text text-anchor="start" x="9142" y="-5323.3" font-family="Times,serif" font-size="14.00">C57BL/6JOlaHsd</text> +<text text-anchor="start" x="9154" y="-5302.3" font-family="Times,serif" font-size="14.00">C57BL/6NCrl</text> +<text text-anchor="start" x="9150.5" y="-5281.3" font-family="Times,serif" font-size="14.00">C57BL/6NHsd</text> +<text text-anchor="start" x="9162.5" y="-5260.3" font-family="Times,serif" font-size="14.00">C57BL/6NJ</text> +<text text-anchor="start" x="9150.5" y="-5239.3" font-family="Times,serif" font-size="14.00">C57BL/6NNIH</text> +<text text-anchor="start" x="9153" y="-5218.3" font-family="Times,serif" font-size="14.00">C57BL/6NTac</text> +<text text-anchor="start" x="9162.5" y="-5197.3" font-family="Times,serif" font-size="14.00">C57BLKS/J</text> +<text text-anchor="start" x="9164" y="-5176.3" font-family="Times,serif" font-size="14.00">C57BR/cdJ</text> +<text text-anchor="start" x="9178" y="-5155.3" font-family="Times,serif" font-size="14.00">C57L/J</text> +<text text-anchor="start" x="9182.5" y="-5134.3" font-family="Times,serif" font-size="14.00">C58/J</text> +<text text-anchor="start" x="9167.5" y="-5113.3" font-family="Times,serif" font-size="14.00">CALB/RkJ</text> +<text text-anchor="start" x="9170" y="-5092.3" font-family="Times,serif" font-size="14.00">CAST/EiJ</text> +<text text-anchor="start" x="9181.5" y="-5071.3" font-family="Times,serif" font-size="14.00">CBA/J</text> +<text text-anchor="start" x="9186.5" y="-5050.3" font-family="Times,serif" font-size="14.00">CE/J</text> +<text text-anchor="start" x="9157.5" y="-5029.3" font-family="Times,serif" font-size="14.00">CZECHII/EiJ</text> +<text text-anchor="start" x="9176.5" y="-5008.3" font-family="Times,serif" font-size="14.00">DBA/1J</text> +<text text-anchor="start" x="9176.5" y="-4987.3" font-family="Times,serif" font-size="14.00">DBA/2J</text> +<text text-anchor="start" x="9170.5" y="-4966.3" 
font-family="Times,serif" font-size="14.00">DDK/Pas</text> +<text text-anchor="start" x="9135.5" y="-4945.3" font-family="Times,serif" font-size="14.00">DDY/JclSidSeyFrkJ</text> +<text text-anchor="start" x="9148.5" y="-4924.3" font-family="Times,serif" font-size="14.00">EL/SuzSeyFrkJ</text> +<text text-anchor="start" x="9183.5" y="-4903.3" font-family="Times,serif" font-size="14.00">Fline</text> +<text text-anchor="start" x="9176" y="-4882.3" font-family="Times,serif" font-size="14.00">FVB/NJ</text> +<text text-anchor="start" x="9154" y="-4861.3" font-family="Times,serif" font-size="14.00">HTG/GoSfSnJ</text> +<text text-anchor="start" x="9185" y="-4840.3" font-family="Times,serif" font-size="14.00">I/LnJ</text> +<text text-anchor="start" x="9162.5" y="-4819.3" font-family="Times,serif" font-size="14.00">ILS/IbgTejJ</text> +<text text-anchor="start" x="9164" y="-4798.3" font-family="Times,serif" font-size="14.00">IS/CamRkJ</text> +<text text-anchor="start" x="9162.5" y="-4777.3" font-family="Times,serif" font-size="14.00">ISS/IbgTejJ</text> +<text text-anchor="start" x="9176.5" y="-4756.3" font-family="Times,serif" font-size="14.00">JF1/Ms</text> +<text text-anchor="start" x="9178" y="-4735.3" font-family="Times,serif" font-size="14.00">KK/HlJ</text> +<text text-anchor="start" x="9162.5" y="-4714.3" font-family="Times,serif" font-size="14.00">LEWES/EiJ</text> +<text text-anchor="start" x="9186.5" y="-4693.3" font-family="Times,serif" font-size="14.00">LG/J</text> +<text text-anchor="start" x="9184" y="-4672.3" font-family="Times,serif" font-size="14.00">Lline</text> +<text text-anchor="start" x="9187.5" y="-4651.3" font-family="Times,serif" font-size="14.00">LP/J</text> +<text text-anchor="start" x="9173.5" y="-4630.3" font-family="Times,serif" font-size="14.00">MA/MyJ</text> +<text text-anchor="start" x="9172.5" y="-4609.3" font-family="Times,serif" font-size="14.00">MAI/Pas</text> +<text text-anchor="start" x="9167" y="-4588.3" font-family="Times,serif" 
font-size="14.00">MOLF/EiJ</text> +<text text-anchor="start" x="9164" y="-4567.3" font-family="Times,serif" font-size="14.00">MOLG/DnJ</text> +<text text-anchor="start" x="9168.5" y="-4546.3" font-family="Times,serif" font-size="14.00">MRL/MpJ</text> +<text text-anchor="start" x="9169.5" y="-4525.3" font-family="Times,serif" font-size="14.00">MSM/Ms</text> +<text text-anchor="start" x="9160.5" y="-4504.3" font-family="Times,serif" font-size="14.00">NOD/ShiLtJ</text> +<text text-anchor="start" x="9171.5" y="-4483.3" font-family="Times,serif" font-size="14.00">NON/LtJ</text> +<text text-anchor="start" x="9172.5" y="-4462.3" font-family="Times,serif" font-size="14.00">NOR/LtJ</text> +<text text-anchor="start" x="9167" y="-4441.3" font-family="Times,serif" font-size="14.00">NZB/BlNJ</text> +<text text-anchor="start" x="9174" y="-4420.3" font-family="Times,serif" font-size="14.00">NZL/LtJ</text> +<text text-anchor="start" x="9164.5" y="-4399.3" font-family="Times,serif" font-size="14.00">NZO/HlLtJ</text> +<text text-anchor="start" x="9166.5" y="-4378.3" font-family="Times,serif" font-size="14.00">NZW/LacJ</text> +<text text-anchor="start" x="9187" y="-4357.3" font-family="Times,serif" font-size="14.00">O20</text> +<text text-anchor="start" x="9192" y="-4336.3" font-family="Times,serif" font-size="14.00">P/J</text> +<text text-anchor="start" x="9169" y="-4315.3" font-family="Times,serif" font-size="14.00">PERA/EiJ</text> +<text text-anchor="start" x="9168.5" y="-4294.3" font-family="Times,serif" font-size="14.00">PERC/EiJ</text> +<text text-anchor="start" x="9187.5" y="-4273.3" font-family="Times,serif" font-size="14.00">PL/J</text> +<text text-anchor="start" x="9170" y="-4252.3" font-family="Times,serif" font-size="14.00">PWD/PhJ</text> +<text text-anchor="start" x="9170" y="-4231.3" font-family="Times,serif" font-size="14.00">PWK/PhJ</text> +<text text-anchor="start" x="9185.5" y="-4210.3" font-family="Times,serif" font-size="14.00">Qsi5</text> +<text 
text-anchor="start" x="9171.5" y="-4189.3" font-family="Times,serif" font-size="14.00">RBA/DnJ</text> +<text text-anchor="start" x="9186.5" y="-4168.3" font-family="Times,serif" font-size="14.00">RF/J</text> +<text text-anchor="start" x="9179" y="-4147.3" font-family="Times,serif" font-size="14.00">RIIIS/J</text> +<text text-anchor="start" x="9171.5" y="-4126.3" font-family="Times,serif" font-size="14.00">SEA/GnJ</text> +<text text-anchor="start" x="9171.5" y="-4105.3" font-family="Times,serif" font-size="14.00">SEG/Pas</text> +<text text-anchor="start" x="9185" y="-4084.3" font-family="Times,serif" font-size="14.00">SJL/J</text> +<text text-anchor="start" x="9166.5" y="-4063.3" font-family="Times,serif" font-size="14.00">SKIVE/EiJ</text> +<text text-anchor="start" x="9185" y="-4042.3" font-family="Times,serif" font-size="14.00">SM/J</text> +<text text-anchor="start" x="9180.5" y="-4021.3" font-family="Times,serif" font-size="14.00">SnpId</text> +<text text-anchor="start" x="9168.5" y="-4000.3" font-family="Times,serif" font-size="14.00">SOD1/EiJ</text> +<text text-anchor="start" x="9164.5" y="-3979.3" font-family="Times,serif" font-size="14.00">SPRET/EiJ</text> +<text text-anchor="start" x="9183" y="-3958.3" font-family="Times,serif" font-size="14.00">ST/bJ</text> +<text text-anchor="start" x="9179.5" y="-3937.3" font-family="Times,serif" font-size="14.00">SWR/J</text> +<text text-anchor="start" x="9151.5" y="-3916.3" font-family="Times,serif" font-size="14.00">TALLYHO/JngJ</text> +<text text-anchor="start" x="9172" y="-3895.3" font-family="Times,serif" font-size="14.00">WSB/EiJ</text> +<text text-anchor="start" x="9153" y="-3874.3" font-family="Times,serif" font-size="14.00">ZALENDE/EiJ</text> +<polygon fill="none" stroke="black" points="9110,-3866 9110,-5909 9294,-5909 9294,-3866 9110,-3866"/> +</g> +<!-- AccessLog --> +<g id="node34" class="node"> +<title>AccessLog</title> +<polygon fill="white" stroke="transparent" points="9328,-4842.5 9328,-4932.5 
9482,-4932.5 9482,-4842.5 9328,-4842.5"/> +<polygon fill="#df65b0" stroke="transparent" points="9331,-4908.5 9331,-4929.5 9479,-4929.5 9479,-4908.5 9331,-4908.5"/> +<polygon fill="none" stroke="black" points="9331,-4908.5 9331,-4929.5 9479,-4929.5 9479,-4908.5 9331,-4908.5"/> +<text text-anchor="start" x="9334" y="-4915.3" font-family="Times,serif" font-size="14.00">AccessLog (46 MiB)</text> +<text text-anchor="start" x="9365.5" y="-4893.3" font-family="Times,serif" font-size="14.00">accesstime</text> +<text text-anchor="start" x="9398" y="-4872.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="9366.5" y="-4851.3" font-family="Times,serif" font-size="14.00">ip_address</text> +<polygon fill="none" stroke="black" points="9328,-4842.5 9328,-4932.5 9482,-4932.5 9482,-4842.5 9328,-4842.5"/> +</g> +<!-- GeneRIF --> +<g id="node35" class="node"> +<title>GeneRIF</title> +<polygon fill="white" stroke="transparent" points="3576.5,-692 3576.5,-992 3709.5,-992 3709.5,-692 3576.5,-692"/> +<polygon fill="#df65b0" stroke="transparent" points="3580,-968 3580,-989 3707,-989 3707,-968 3580,-968"/> +<polygon fill="none" stroke="black" points="3580,-968 3580,-989 3707,-989 3707,-968 3580,-968"/> +<text text-anchor="start" x="3583" y="-974.8" font-family="Times,serif" font-size="14.00">GeneRIF (2 MiB)</text> +<text text-anchor="start" x="3610" y="-952.8" font-family="Times,serif" font-size="14.00">comment</text> +<text text-anchor="start" x="3604.5" y="-931.8" font-family="Times,serif" font-size="14.00">createtime</text> +<text text-anchor="start" x="3617.5" y="-910.8" font-family="Times,serif" font-size="14.00">display</text> +<text text-anchor="start" x="3623.5" y="-889.8" font-family="Times,serif" font-size="14.00">email</text> +<text text-anchor="start" x="3636" y="-868.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="3622.5" y="-847.8" font-family="Times,serif" font-size="14.00">initial</text> +<text 
text-anchor="start" x="3602" y="-826.8" font-family="Times,serif" font-size="14.00">PubMed_ID</text> +<text text-anchor="start" x="3619" y="-805.8" font-family="Times,serif" font-size="14.00">reason</text> +<text text-anchor="start" x="3608.5" y="-784.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="3617.5" y="-763.8" font-family="Times,serif" font-size="14.00">symbol</text> +<text text-anchor="start" x="3617.5" y="-742.8" font-family="Times,serif" font-size="14.00">user_ip</text> +<text text-anchor="start" x="3610" y="-721.8" font-family="Times,serif" font-size="14.00">versionId</text> +<text text-anchor="start" x="3618.5" y="-700.8" font-family="Times,serif" font-size="14.00">weburl</text> +<polygon fill="none" stroke="black" points="3576.5,-692 3576.5,-992 3709.5,-992 3709.5,-692 3576.5,-692"/> +</g> +<!-- GeneRIF->Species --> +<g id="edge28" class="edge"> +<title>GeneRIF:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M3579,-788C3549.14,-788 3577.82,-543.18 3559,-520 3471.93,-412.76 3053.77,-338.32 2876.12,-311.02"/> +<polygon fill="black" stroke="black" points="2876.46,-307.54 2866.05,-309.49 2875.41,-314.46 2876.46,-307.54"/> +</g> +<!-- ProbeData --> +<g id="node36" class="node"> +<title>ProbeData</title> +<polygon fill="white" stroke="transparent" points="5291,-1918 5291,-2008 5443,-2008 5443,-1918 5291,-1918"/> +<polygon fill="#ce1256" stroke="transparent" points="5294,-1984 5294,-2005 5440,-2005 5440,-1984 5294,-1984"/> +<polygon fill="none" stroke="black" points="5294,-1984 5294,-2005 5440,-2005 5440,-1984 5294,-1984"/> +<text text-anchor="start" x="5297" y="-1990.8" font-family="Times,serif" font-size="14.00">ProbeData (10 GiB)</text> +<text text-anchor="start" x="5359.5" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="5337.5" y="-1947.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="5347.5" y="-1926.8" 
font-family="Times,serif" font-size="14.00">value</text> +<polygon fill="none" stroke="black" points="5291,-1918 5291,-2008 5443,-2008 5443,-1918 5291,-1918"/> +</g> +<!-- ProbeData->Strain --> +<g id="edge29" class="edge"> +<title>ProbeData:StrainId->Strain</title> +<path fill="none" stroke="black" d="M5441,-1951C5461.87,-1951 5451.21,-1219.36 5459,-1200 5511.05,-1070.73 5632.85,-959.15 5712.21,-896.58"/> +<polygon fill="black" stroke="black" points="5714.51,-899.22 5720.23,-890.3 5710.2,-893.71 5714.51,-899.22"/> +</g> +<!-- AvgMethod --> +<g id="node37" class="node"> +<title>AvgMethod</title> +<polygon fill="lightgrey" stroke="transparent" points="982.5,-786.5 982.5,-897.5 1133.5,-897.5 1133.5,-786.5 982.5,-786.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="986,-873 986,-894 1131,-894 1131,-873 986,-873"/> +<polygon fill="none" stroke="black" points="986,-873 986,-894 1131,-894 1131,-873 986,-873"/> +<text text-anchor="start" x="989" y="-879.8" font-family="Times,serif" font-size="14.00">AvgMethod (792 B)</text> +<text text-anchor="start" x="1010" y="-857.8" font-family="Times,serif" font-size="14.00">AvgMethodId</text> +<text text-anchor="start" x="1051" y="-836.8" font-family="Times,serif" font-size="14.00">Id</text> +<polygon fill="green" stroke="transparent" points="986,-810 986,-829 1131,-829 1131,-810 986,-810"/> +<text text-anchor="start" x="1037" y="-815.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="1007.5" y="-794.8" font-family="Times,serif" font-size="14.00">Normalization</text> +<polygon fill="none" stroke="black" points="982.5,-786.5 982.5,-897.5 1133.5,-897.5 1133.5,-786.5 982.5,-786.5"/> +</g> +<!-- GeneRIFXRef --> +<g id="node38" class="node"> +<title>GeneRIFXRef</title> +<polygon fill="white" stroke="transparent" points="3003,-1918 3003,-2008 3175,-2008 3175,-1918 3003,-1918"/> +<polygon fill="#d7b5d8" stroke="transparent" points="3006,-1984 3006,-2005 3172,-2005 3172,-1984 3006,-1984"/> 
+<polygon fill="none" stroke="black" points="3006,-1984 3006,-2005 3172,-2005 3172,-1984 3006,-1984"/> +<text text-anchor="start" x="3009" y="-1990.8" font-family="Times,serif" font-size="14.00">GeneRIFXRef (82 KiB)</text> +<text text-anchor="start" x="3030.5" y="-1968.8" font-family="Times,serif" font-size="14.00">GeneCategoryId</text> +<text text-anchor="start" x="3050.5" y="-1947.8" font-family="Times,serif" font-size="14.00">GeneRIFId</text> +<text text-anchor="start" x="3055.5" y="-1926.8" font-family="Times,serif" font-size="14.00">versionId</text> +<polygon fill="none" stroke="black" points="3003,-1918 3003,-2008 3175,-2008 3175,-1918 3003,-1918"/> +</g> +<!-- GeneRIFXRef->GeneRIF --> +<g id="edge31" class="edge"> +<title>GeneRIFXRef:GeneRIFId->GeneRIF</title> +<path fill="none" stroke="black" d="M3173,-1951C3214.74,-1951 3168.49,-1230.49 3197,-1200 3252.21,-1140.95 3497.53,-1216.51 3559,-1164 3604.75,-1124.91 3627.15,-1064.28 3637.64,-1006.19"/> +<polygon fill="black" stroke="black" points="3641.12,-1006.59 3639.34,-996.14 3634.22,-1005.42 3641.12,-1006.59"/> +</g> +<!-- GeneCategory --> +<g id="node73" class="node"> +<title>GeneCategory</title> +<polygon fill="white" stroke="transparent" points="3373.5,-807.5 3373.5,-876.5 3542.5,-876.5 3542.5,-807.5 3373.5,-807.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="3377,-852 3377,-873 3540,-873 3540,-852 3377,-852"/> +<polygon fill="none" stroke="black" points="3377,-852 3377,-873 3540,-873 3540,-852 3377,-852"/> +<text text-anchor="start" x="3380" y="-858.8" font-family="Times,serif" font-size="14.00">GeneCategory (5 KiB)</text> +<text text-anchor="start" x="3451" y="-836.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="3437" y="-815.8" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="none" stroke="black" points="3373.5,-807.5 3373.5,-876.5 3542.5,-876.5 3542.5,-807.5 3373.5,-807.5"/> +</g> +<!-- GeneRIFXRef->GeneCategory --> +<g 
id="edge30" class="edge"> +<title>GeneRIFXRef:GeneCategoryId->GeneCategory</title> +<path fill="none" stroke="black" d="M3173,-1973C3215.97,-1973 3169.76,-1233.22 3197,-1200 3241.84,-1145.31 3299.78,-1211.69 3352,-1164 3430.43,-1092.39 3450.94,-961.62 3456.23,-891.11"/> +<polygon fill="black" stroke="black" points="3459.75,-890.96 3456.93,-880.75 3452.77,-890.49 3459.75,-890.96"/> +</g> +<!-- CaseAttribute --> +<g id="node39" class="node"> +<title>CaseAttribute</title> +<polygon fill="lightgrey" stroke="transparent" points="1168,-797 1168,-887 1334,-887 1334,-797 1168,-797"/> +<polygon fill="#d7b5d8" stroke="transparent" points="1171,-863 1171,-884 1331,-884 1331,-863 1171,-863"/> +<polygon fill="none" stroke="black" points="1171,-863 1171,-884 1331,-884 1331,-863 1171,-863"/> +<text text-anchor="start" x="1174" y="-869.8" font-family="Times,serif" font-size="14.00">CaseAttribute (2 KiB)</text> +<polygon fill="green" stroke="transparent" points="1171,-842 1171,-861 1331,-861 1331,-842 1171,-842"/> +<text text-anchor="start" x="1209.5" y="-847.8" font-family="Times,serif" font-size="14.00">Description</text> +<polygon fill="green" stroke="transparent" points="1171,-821 1171,-840 1331,-840 1331,-821 1171,-821"/> +<text text-anchor="start" x="1243.5" y="-826.8" font-family="Times,serif" font-size="14.00">Id</text> +<polygon fill="green" stroke="transparent" points="1171,-800 1171,-819 1331,-819 1331,-800 1171,-800"/> +<text text-anchor="start" x="1229.5" y="-805.8" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="none" stroke="black" points="1168,-797 1168,-887 1334,-887 1334,-797 1168,-797"/> +</g> +<!-- Strain->Species --> +<g id="edge32" class="edge"> +<title>Strain:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M5731,-777C5128.52,-777 4994.43,-618.17 4400,-520 3817.59,-423.81 3111.33,-337.05 2876.33,-308.98"/> +<polygon fill="black" stroke="black" points="2876.51,-305.48 2866.17,-307.77 2875.68,-312.43 2876.51,-305.48"/> 
+</g> +<!-- Probe->ProbeSE --> +<g id="edge33" class="edge"> +<title>Probe:ProbeSetId->ProbeSE</title> +<path fill="none" stroke="black" d="M6968,-3261C6999.5,-3261 7043.75,-2274.36 7054.55,-2022.15"/> +<polygon fill="black" stroke="black" points="7058.05,-2022.23 7054.98,-2012.09 7051.06,-2021.93 7058.05,-2022.23"/> +</g> +<!-- ProbeFreeze->InbredSet --> +<g id="edge34" class="edge"> +<title>ProbeFreeze:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M2775,-1951C2816.74,-1951 2764.69,-1229.71 2794,-1200 2866.79,-1126.23 3641.27,-1223.68 3726,-1164 3778.21,-1127.22 3809.31,-1065.62 3827.82,-1006.16"/> +<polygon fill="black" stroke="black" points="3831.27,-1006.83 3830.79,-996.25 3824.56,-1004.82 3831.27,-1006.83"/> +</g> +<!-- ProbeFreeze->Tissue --> +<g id="edge35" class="edge"> +<title>ProbeFreeze:TissueId->Tissue</title> +<path fill="none" stroke="black" d="M2613,-1867C2575.92,-1867 2609.31,-1231.02 2589,-1200 2568.75,-1169.06 2537.32,-1192.7 2514,-1164 2463.47,-1101.8 2444.56,-1011.96 2437.81,-943.13"/> +<polygon fill="black" stroke="black" points="2441.29,-942.77 2436.9,-933.13 2434.32,-943.41 2441.29,-942.77"/> +</g> +<!-- BXDSnpPosition --> +<g id="node43" class="node"> +<title>BXDSnpPosition</title> +<polygon fill="white" stroke="transparent" points="5476.5,-1886.5 5476.5,-2039.5 5681.5,-2039.5 5681.5,-1886.5 5476.5,-1886.5"/> +<polygon fill="#df65b0" stroke="transparent" points="5480,-2015 5480,-2036 5679,-2036 5679,-2015 5480,-2015"/> +<polygon fill="none" stroke="black" points="5480,-2015 5480,-2036 5679,-2036 5679,-2015 5480,-2015"/> +<text text-anchor="start" x="5483" y="-2021.8" font-family="Times,serif" font-size="14.00">BXDSnpPosition (230 MiB)</text> +<text text-anchor="start" x="5566" y="-1999.8" font-family="Times,serif" font-size="14.00">Chr</text> +<text text-anchor="start" x="5572.5" y="-1978.8" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="5567.5" y="-1957.8" font-family="Times,serif" 
font-size="14.00">Mb</text> +<text text-anchor="start" x="5546" y="-1936.8" font-family="Times,serif" font-size="14.00">Mb_2016</text> +<text text-anchor="start" x="5545.5" y="-1915.8" font-family="Times,serif" font-size="14.00">StrainId1</text> +<text text-anchor="start" x="5545.5" y="-1894.8" font-family="Times,serif" font-size="14.00">StrainId2</text> +<polygon fill="none" stroke="black" points="5476.5,-1886.5 5476.5,-2039.5 5681.5,-2039.5 5681.5,-1886.5 5476.5,-1886.5"/> +</g> +<!-- BXDSnpPosition->Strain --> +<g id="edge36" class="edge"> +<title>BXDSnpPosition:StrainId1->Strain</title> +<path fill="none" stroke="black" d="M5680,-1919C5699.98,-1919 5696.36,-1219.8 5699,-1200 5711.36,-1107.45 5738.02,-1004.03 5758.6,-932.42"/> +<polygon fill="black" stroke="black" points="5762.04,-933.11 5761.46,-922.54 5755.32,-931.17 5762.04,-933.11"/> +</g> +<!-- BXDSnpPosition->Strain --> +<g id="edge37" class="edge"> +<title>BXDSnpPosition:StrainId2->Strain</title> +<path fill="none" stroke="black" d="M5680,-1898C5699.4,-1898 5696.43,-1219.22 5699,-1200 5711.39,-1107.46 5738.05,-1004.03 5758.62,-932.43"/> +<polygon fill="black" stroke="black" points="5762.06,-933.12 5761.48,-922.54 5755.34,-931.17 5762.06,-933.12"/> +</g> +<!-- GeneRIF_BASIC --> +<g id="node44" class="node"> +<title>GeneRIF_BASIC</title> +<polygon fill="white" stroke="transparent" points="531.5,-744.5 531.5,-939.5 734.5,-939.5 734.5,-744.5 531.5,-744.5"/> +<polygon fill="#df65b0" stroke="transparent" points="535,-915 535,-936 732,-936 732,-915 535,-915"/> +<polygon fill="none" stroke="black" points="535,-915 535,-936 732,-936 732,-915 535,-915"/> +<text text-anchor="start" x="538" y="-921.8" font-family="Times,serif" font-size="14.00">GeneRIF_BASIC (275 MiB)</text> +<text text-anchor="start" x="600" y="-899.8" font-family="Times,serif" font-size="14.00">comment</text> +<text text-anchor="start" x="594.5" y="-878.8" font-family="Times,serif" font-size="14.00">createtime</text> +<text text-anchor="start" 
x="607.5" y="-857.8" font-family="Times,serif" font-size="14.00">GeneId</text> +<text text-anchor="start" x="592" y="-836.8" font-family="Times,serif" font-size="14.00">PubMed_ID</text> +<text text-anchor="start" x="598.5" y="-815.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="607.5" y="-794.8" font-family="Times,serif" font-size="14.00">symbol</text> +<text text-anchor="start" x="612.5" y="-773.8" font-family="Times,serif" font-size="14.00">TaxID</text> +<text text-anchor="start" x="599.5" y="-752.8" font-family="Times,serif" font-size="14.00">VersionId</text> +<polygon fill="none" stroke="black" points="531.5,-744.5 531.5,-939.5 734.5,-939.5 734.5,-744.5 531.5,-744.5"/> +</g> +<!-- GeneRIF_BASIC->Species --> +<g id="edge38" class="edge"> +<title>GeneRIF_BASIC:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M733,-819C766.29,-819 728.98,-544.05 752,-520 890.33,-375.45 2354.35,-314.96 2715.71,-302.17"/> +<polygon fill="black" stroke="black" points="2715.96,-305.66 2725.83,-301.81 2715.71,-298.67 2715.96,-305.66"/> +</g> +<!-- GeneList_rn33 --> +<g id="node45" class="node"> +<title>GeneList_rn33</title> +<polygon fill="white" stroke="transparent" points="9516.5,-4737.5 9516.5,-5037.5 9691.5,-5037.5 9691.5,-4737.5 9516.5,-4737.5"/> +<polygon fill="#df65b0" stroke="transparent" points="9520,-5013.5 9520,-5034.5 9689,-5034.5 9689,-5013.5 9520,-5013.5"/> +<polygon fill="none" stroke="black" points="9520,-5013.5 9520,-5034.5 9689,-5034.5 9689,-5013.5 9520,-5013.5"/> +<text text-anchor="start" x="9523" y="-5020.3" font-family="Times,serif" font-size="14.00">GeneList_rn33 (2 MiB)</text> +<text text-anchor="start" x="9578" y="-4998.3" font-family="Times,serif" font-size="14.00">cdsEnd</text> +<text text-anchor="start" x="9574" y="-4977.3" font-family="Times,serif" font-size="14.00">cdsStart</text> +<text text-anchor="start" x="9559" y="-4956.3" font-family="Times,serif" font-size="14.00">chromosome</text> +<text 
text-anchor="start" x="9566" y="-4935.3" font-family="Times,serif" font-size="14.00">exonCount</text> +<text text-anchor="start" x="9569.5" y="-4914.3" font-family="Times,serif" font-size="14.00">exonEnds</text> +<text text-anchor="start" x="9565" y="-4893.3" font-family="Times,serif" font-size="14.00">exonStarts</text> +<text text-anchor="start" x="9560.5" y="-4872.3" font-family="Times,serif" font-size="14.00">geneSymbol</text> +<text text-anchor="start" x="9597.5" y="-4851.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="9587.5" y="-4830.3" font-family="Times,serif" font-size="14.00">kgID</text> +<text text-anchor="start" x="9579.5" y="-4809.3" font-family="Times,serif" font-size="14.00">NM_ID</text> +<text text-anchor="start" x="9581" y="-4788.3" font-family="Times,serif" font-size="14.00">strand</text> +<text text-anchor="start" x="9583" y="-4767.3" font-family="Times,serif" font-size="14.00">txEnd</text> +<text text-anchor="start" x="9578.5" y="-4746.3" font-family="Times,serif" font-size="14.00">txStart</text> +<polygon fill="none" stroke="black" points="9516.5,-4737.5 9516.5,-5037.5 9691.5,-5037.5 9691.5,-4737.5 9516.5,-4737.5"/> +</g> +<!-- Geno->Species --> +<g id="edge39" class="edge"> +<title>Geno:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M4247,-704C4089.83,-704 4091.63,-576.6 3945,-520 3561.93,-372.13 3067.37,-320.3 2876.27,-305.04"/> +<polygon fill="black" stroke="black" points="2876.28,-301.52 2866.03,-304.23 2875.73,-308.5 2876.28,-301.52"/> +</g> +<!-- Organizations --> +<g id="node47" class="node"> +<title>Organizations</title> +<polygon fill="white" stroke="transparent" points="90,-4 90,-73 256,-73 256,-4 90,-4"/> +<polygon fill="#d7b5d8" stroke="transparent" points="93,-48.5 93,-69.5 253,-69.5 253,-48.5 93,-48.5"/> +<polygon fill="none" stroke="black" points="93,-48.5 93,-69.5 253,-69.5 253,-48.5 93,-48.5"/> +<text text-anchor="start" x="96" y="-55.3" font-family="Times,serif" 
font-size="14.00">Organizations (3 KiB)</text> +<text text-anchor="start" x="119.5" y="-33.3" font-family="Times,serif" font-size="14.00">OrganizationId</text> +<text text-anchor="start" x="105.5" y="-12.3" font-family="Times,serif" font-size="14.00">OrganizationName</text> +<polygon fill="none" stroke="black" points="90,-4 90,-73 256,-73 256,-4 90,-4"/> +</g> +<!-- StrainXRef --> +<g id="node48" class="node"> +<title>StrainXRef</title> +<polygon fill="white" stroke="transparent" points="4871,-1897 4871,-2029 5019,-2029 5019,-1897 4871,-1897"/> +<polygon fill="#df65b0" stroke="transparent" points="4874,-2005 4874,-2026 5016,-2026 5016,-2005 4874,-2005"/> +<polygon fill="none" stroke="black" points="4874,-2005 4874,-2026 5016,-2026 5016,-2005 4874,-2005"/> +<text text-anchor="start" x="4877" y="-2011.8" font-family="Times,serif" font-size="14.00">StrainXRef (1 MiB)</text> +<text text-anchor="start" x="4902" y="-1989.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="4916.5" y="-1968.8" font-family="Times,serif" font-size="14.00">OrderId</text> +<text text-anchor="start" x="4890" y="-1947.8" font-family="Times,serif" font-size="14.00">PedigreeStatus</text> +<text text-anchor="start" x="4915.5" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="4878.5" y="-1905.8" font-family="Times,serif" font-size="14.00">Used_for_mapping</text> +<polygon fill="none" stroke="black" points="4871,-1897 4871,-2029 5019,-2029 5019,-1897 4871,-1897"/> +</g> +<!-- StrainXRef->InbredSet --> +<g id="edge40" class="edge"> +<title>StrainXRef:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M4873,-1994C4828.88,-1994 4884.67,-1231.72 4854,-1200 4805.57,-1149.92 4292.6,-1190.1 4228,-1164 4115.23,-1118.43 4012.54,-1024.28 3943.58,-949.66"/> +<polygon fill="black" stroke="black" points="3945.94,-947.05 3936.6,-942.05 3940.78,-951.79 3945.94,-947.05"/> +</g> +<!-- StrainXRef->Strain 
--> +<g id="edge41" class="edge"> +<title>StrainXRef:StrainId->Strain</title> +<path fill="none" stroke="black" d="M5017,-1930C5057.58,-1930 5018.82,-1233.98 5041,-1200 5195.5,-963.36 5553.55,-879.5 5710.26,-853.43"/> +<polygon fill="black" stroke="black" points="5710.98,-856.86 5720.28,-851.79 5709.85,-849.95 5710.98,-856.86"/> +</g> +<!-- SnpSource --> +<g id="node49" class="node"> +<title>SnpSource</title> +<polygon fill="white" stroke="transparent" points="9726,-4832 9726,-4943 9870,-4943 9870,-4832 9726,-4832"/> +<polygon fill="#d7b5d8" stroke="transparent" points="9729,-4918.5 9729,-4939.5 9867,-4939.5 9867,-4918.5 9729,-4918.5"/> +<polygon fill="none" stroke="black" points="9729,-4918.5 9729,-4939.5 9867,-4939.5 9867,-4918.5 9729,-4918.5"/> +<text text-anchor="start" x="9732" y="-4925.3" font-family="Times,serif" font-size="14.00">SnpSource (1 KiB)</text> +<text text-anchor="start" x="9758.5" y="-4903.3" font-family="Times,serif" font-size="14.00">DateAdded</text> +<text text-anchor="start" x="9752.5" y="-4882.3" font-family="Times,serif" font-size="14.00">DateCreated</text> +<text text-anchor="start" x="9790.5" y="-4861.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="9776.5" y="-4840.3" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="none" stroke="black" points="9726,-4832 9726,-4943 9870,-4943 9870,-4832 9726,-4832"/> +</g> +<!-- user_openids --> +<g id="node50" class="node"> +<title>user_openids</title> +<polygon fill="white" stroke="transparent" points="9904.5,-4853 9904.5,-4922 10049.5,-4922 10049.5,-4853 9904.5,-4853"/> +<polygon fill="#f1eef6" stroke="transparent" points="9908,-4897.5 9908,-4918.5 10047,-4918.5 10047,-4897.5 9908,-4897.5"/> +<polygon fill="none" stroke="black" points="9908,-4897.5 9908,-4918.5 10047,-4918.5 10047,-4897.5 9908,-4897.5"/> +<text text-anchor="start" x="9911" y="-4904.3" font-family="Times,serif" font-size="14.00">user_openids (0 B)</text> +<text 
text-anchor="start" x="9939.5" y="-4882.3" font-family="Times,serif" font-size="14.00">openid_url</text> +<text text-anchor="start" x="9951.5" y="-4861.3" font-family="Times,serif" font-size="14.00">user_id</text> +<polygon fill="none" stroke="black" points="9904.5,-4853 9904.5,-4922 10049.5,-4922 10049.5,-4853 9904.5,-4853"/> +</g> +<!-- GeneMap_cuiyan --> +<g id="node51" class="node"> +<title>GeneMap_cuiyan</title> +<polygon fill="white" stroke="transparent" points="10084,-4832 10084,-4943 10290,-4943 10290,-4832 10084,-4832"/> +<polygon fill="#d7b5d8" stroke="transparent" points="10087,-4918.5 10087,-4939.5 10287,-4939.5 10287,-4918.5 10087,-4918.5"/> +<polygon fill="none" stroke="black" points="10087,-4918.5 10087,-4939.5 10287,-4939.5 10287,-4918.5 10087,-4918.5"/> +<text text-anchor="start" x="10090" y="-4925.3" font-family="Times,serif" font-size="14.00">GeneMap_cuiyan (376 KiB)</text> +<text text-anchor="start" x="10160" y="-4903.3" font-family="Times,serif" font-size="14.00">GeneID</text> +<text text-anchor="start" x="10180" y="-4882.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="10160" y="-4861.3" font-family="Times,serif" font-size="14.00">Symbol</text> +<text text-anchor="start" x="10141.5" y="-4840.3" font-family="Times,serif" font-size="14.00">TranscriptID</text> +<polygon fill="none" stroke="black" points="10084,-4832 10084,-4943 10290,-4943 10290,-4832 10084,-4832"/> +</g> +<!-- InfoFilesUser_md5 --> +<g id="node52" class="node"> +<title>InfoFilesUser_md5</title> +<polygon fill="white" stroke="transparent" points="10324,-4853 10324,-4922 10520,-4922 10520,-4853 10324,-4853"/> +<polygon fill="#f1eef6" stroke="transparent" points="10327,-4897.5 10327,-4918.5 10517,-4918.5 10517,-4897.5 10327,-4897.5"/> +<polygon fill="none" stroke="black" points="10327,-4897.5 10327,-4918.5 10517,-4918.5 10517,-4897.5 10327,-4897.5"/> +<text text-anchor="start" x="10330" y="-4904.3" font-family="Times,serif" 
font-size="14.00">InfoFilesUser_md5 (96 B)</text> +<text text-anchor="start" x="10387.5" y="-4882.3" font-family="Times,serif" font-size="14.00">Password</text> +<text text-anchor="start" x="10385" y="-4861.3" font-family="Times,serif" font-size="14.00">Username</text> +<polygon fill="none" stroke="black" points="10324,-4853 10324,-4922 10520,-4922 10520,-4853 10324,-4853"/> +</g> +<!-- PublishXRef --> +<g id="node53" class="node"> +<title>PublishXRef</title> +<polygon fill="lightgrey" stroke="transparent" points="2811.5,-1834 2811.5,-2092 2968.5,-2092 2968.5,-1834 2811.5,-1834"/> +<polygon fill="#df65b0" stroke="transparent" points="2815,-2068 2815,-2089 2966,-2089 2966,-2068 2815,-2068"/> +<polygon fill="none" stroke="black" points="2815,-2068 2815,-2089 2966,-2089 2966,-2068 2815,-2068"/> +<text text-anchor="start" x="2818" y="-2074.8" font-family="Times,serif" font-size="14.00">PublishXRef (2 MiB)</text> +<text text-anchor="start" x="2861.5" y="-2052.8" font-family="Times,serif" font-size="14.00">additive</text> +<text text-anchor="start" x="2853.5" y="-2031.8" font-family="Times,serif" font-size="14.00">comments</text> +<text text-anchor="start" x="2866" y="-2010.8" font-family="Times,serif" font-size="14.00">DataId</text> +<polygon fill="green" stroke="transparent" points="2815,-1984 2815,-2003 2966,-2003 2966,-1984 2815,-1984"/> +<text text-anchor="start" x="2883" y="-1989.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2847.5" y="-1968.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="2869.5" y="-1947.8" font-family="Times,serif" font-size="14.00">Locus</text> +<text text-anchor="start" x="2875.5" y="-1926.8" font-family="Times,serif" font-size="14.00">LRS</text> +<text text-anchor="start" x="2870.5" y="-1905.8" font-family="Times,serif" font-size="14.00">mean</text> +<text text-anchor="start" x="2845" y="-1884.8" font-family="Times,serif" 
font-size="14.00">PhenotypeId</text> +<polygon fill="green" stroke="transparent" points="2815,-1858 2815,-1877 2966,-1877 2966,-1858 2815,-1858"/> +<text text-anchor="start" x="2843" y="-1863.8" font-family="Times,serif" font-size="14.00">PublicationId</text> +<text text-anchor="start" x="2855.5" y="-1842.8" font-family="Times,serif" font-size="14.00">Sequence</text> +<polygon fill="none" stroke="black" points="2811.5,-1834 2811.5,-2092 2968.5,-2092 2968.5,-1834 2811.5,-1834"/> +</g> +<!-- PublishXRef->Publication --> +<g id="edge44" class="edge"> +<title>PublishXRef:PublicationId->Publication</title> +<path fill="none" stroke="black" d="M2814,-1867C2776.93,-1867 2815.52,-1230.19 2794,-1200 2767.79,-1163.23 2729.57,-1197.23 2699,-1164 2651.77,-1112.67 2628.61,-1038.69 2617.34,-974.68"/> +<polygon fill="black" stroke="black" points="2620.73,-973.78 2615.62,-964.5 2613.83,-974.94 2620.73,-973.78"/> +</g> +<!-- PublishXRef->InbredSet --> +<g id="edge42" class="edge"> +<title>PublishXRef:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M2967,-1973C3009.96,-1973 2955.99,-1230.74 2986,-1200 3043.5,-1141.1 3658.94,-1211.74 3726,-1164 3777.91,-1127.05 3808.95,-1065.59 3827.5,-1006.29"/> +<polygon fill="black" stroke="black" points="3830.95,-1006.99 3830.49,-996.41 3824.25,-1004.97 3830.95,-1006.99"/> +</g> +<!-- PublishXRef->Phenotype --> +<g id="edge43" class="edge"> +<title>PublishXRef:PhenotypeId->Phenotype</title> +<path fill="none" stroke="black" d="M2967,-1888C2986.12,-1888 2984.78,-1219.08 2986,-1200 2990.55,-1129.04 2998.2,-1050.39 3005.28,-985.01"/> +<polygon fill="black" stroke="black" points="3008.76,-985.37 3006.37,-975.05 3001.8,-984.61 3008.76,-985.37"/> +</g> +<!-- RatSnpPattern --> +<g id="node54" class="node"> +<title>RatSnpPattern</title> +<polygon fill="white" stroke="transparent" points="10554,-4517 10554,-5258 10748,-5258 10748,-4517 10554,-4517"/> +<polygon fill="#df65b0" stroke="transparent" points="10557,-5233.5 10557,-5254.5 
10745,-5254.5 10745,-5233.5 10557,-5233.5"/> +<polygon fill="none" stroke="black" points="10557,-5233.5 10557,-5254.5 10745,-5254.5 10745,-5233.5 10557,-5233.5"/> +<text text-anchor="start" x="10560" y="-5240.3" font-family="Times,serif" font-size="14.00">RatSnpPattern (202 MiB)</text> +<text text-anchor="start" x="10638" y="-5218.3" font-family="Times,serif" font-size="14.00">ACI</text> +<text text-anchor="start" x="10628.5" y="-5197.3" font-family="Times,serif" font-size="14.00">ACI_N</text> +<text text-anchor="start" x="10629.5" y="-5176.3" font-family="Times,serif" font-size="14.00">BBDP</text> +<text text-anchor="start" x="10639.5" y="-5155.3" font-family="Times,serif" font-size="14.00">BN</text> +<text text-anchor="start" x="10630" y="-5134.3" font-family="Times,serif" font-size="14.00">BN_N</text> +<text text-anchor="start" x="10625" y="-5113.3" font-family="Times,serif" font-size="14.00">BUF_N</text> +<text text-anchor="start" x="10632.5" y="-5092.3" font-family="Times,serif" font-size="14.00">F344</text> +<text text-anchor="start" x="10623" y="-5071.3" font-family="Times,serif" font-size="14.00">F344_N</text> +<text text-anchor="start" x="10634" y="-5050.3" font-family="Times,serif" font-size="14.00">FHH</text> +<text text-anchor="start" x="10635.5" y="-5029.3" font-family="Times,serif" font-size="14.00">FHL</text> +<text text-anchor="start" x="10640" y="-5008.3" font-family="Times,serif" font-size="14.00">GK</text> +<text text-anchor="start" x="10643.5" y="-4987.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="10641" y="-4966.3" font-family="Times,serif" font-size="14.00">LE</text> +<text text-anchor="start" x="10634" y="-4945.3" font-family="Times,serif" font-size="14.00">LEW</text> +<text text-anchor="start" x="10640" y="-4924.3" font-family="Times,serif" font-size="14.00">LH</text> +<text text-anchor="start" x="10641.5" y="-4903.3" font-family="Times,serif" font-size="14.00">LL</text> +<text text-anchor="start" 
x="10640" y="-4882.3" font-family="Times,serif" font-size="14.00">LN</text> +<text text-anchor="start" x="10620.5" y="-4861.3" font-family="Times,serif" font-size="14.00">M520_N</text> +<text text-anchor="start" x="10632.5" y="-4840.3" font-family="Times,serif" font-size="14.00">MHS</text> +<text text-anchor="start" x="10632.5" y="-4819.3" font-family="Times,serif" font-size="14.00">MNS</text> +<text text-anchor="start" x="10629" y="-4798.3" font-family="Times,serif" font-size="14.00">MR_N</text> +<text text-anchor="start" x="10634.5" y="-4777.3" font-family="Times,serif" font-size="14.00">SBH</text> +<text text-anchor="start" x="10634.5" y="-4756.3" font-family="Times,serif" font-size="14.00">SBN</text> +<text text-anchor="start" x="10634.5" y="-4735.3" font-family="Times,serif" font-size="14.00">SHR</text> +<text text-anchor="start" x="10625" y="-4714.3" font-family="Times,serif" font-size="14.00">SHRSP</text> +<text text-anchor="start" x="10629.5" y="-4693.3" font-family="Times,serif" font-size="14.00">SnpId</text> +<text text-anchor="start" x="10640.5" y="-4672.3" font-family="Times,serif" font-size="14.00">SR</text> +<text text-anchor="start" x="10641.5" y="-4651.3" font-family="Times,serif" font-size="14.00">SS</text> +<text text-anchor="start" x="10633.5" y="-4630.3" font-family="Times,serif" font-size="14.00">WAG</text> +<text text-anchor="start" x="10634" y="-4609.3" font-family="Times,serif" font-size="14.00">WKY</text> +<text text-anchor="start" x="10625" y="-4588.3" font-family="Times,serif" font-size="14.00">WKY_N</text> +<text text-anchor="start" x="10636.5" y="-4567.3" font-family="Times,serif" font-size="14.00">WLI</text> +<text text-anchor="start" x="10634" y="-4546.3" font-family="Times,serif" font-size="14.00">WMI</text> +<text text-anchor="start" x="10628" y="-4525.3" font-family="Times,serif" font-size="14.00">WN_N</text> +<polygon fill="none" stroke="black" points="10554,-4517 10554,-5258 10748,-5258 10748,-4517 10554,-4517"/> +</g> +<!-- 
Genbank --> +<g id="node55" class="node"> +<title>Genbank</title> +<polygon fill="white" stroke="transparent" points="769,-797 769,-887 911,-887 911,-797 769,-797"/> +<polygon fill="#df65b0" stroke="transparent" points="772,-863 772,-884 908,-884 908,-863 772,-863"/> +<polygon fill="none" stroke="black" points="772,-863 772,-884 908,-884 908,-863 772,-863"/> +<text text-anchor="start" x="775" y="-869.8" font-family="Times,serif" font-size="14.00">Genbank (37 MiB)</text> +<text text-anchor="start" x="832.5" y="-847.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="805" y="-826.8" font-family="Times,serif" font-size="14.00">Sequence</text> +<text text-anchor="start" x="805" y="-805.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<polygon fill="none" stroke="black" points="769,-797 769,-887 911,-887 911,-797 769,-797"/> +</g> +<!-- Genbank->Species --> +<g id="edge45" class="edge"> +<title>Genbank:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M909,-809C941.22,-809 910.62,-543.18 933,-520 1058.95,-389.57 2375.45,-319.21 2715.96,-303.1"/> +<polygon fill="black" stroke="black" points="2716.17,-306.6 2725.99,-302.63 2715.84,-299.61 2716.17,-306.6"/> +</g> +<!-- EnsemblChip --> +<g id="node56" class="node"> +<title>EnsemblChip</title> +<polygon fill="white" stroke="transparent" points="1780.5,-786.5 1780.5,-897.5 1945.5,-897.5 1945.5,-786.5 1780.5,-786.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="1784,-873 1784,-894 1943,-894 1943,-873 1784,-873"/> +<polygon fill="none" stroke="black" points="1784,-873 1784,-894 1943,-894 1943,-873 1784,-873"/> +<text text-anchor="start" x="1787" y="-879.8" font-family="Times,serif" font-size="14.00">EnsemblChip (296 B)</text> +<text text-anchor="start" x="1856" y="-857.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="1842" y="-836.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" 
x="1815" y="-815.8" font-family="Times,serif" font-size="14.00">ProbeSetSize</text> +<text text-anchor="start" x="1846" y="-794.8" font-family="Times,serif" font-size="14.00">Type</text> +<polygon fill="none" stroke="black" points="1780.5,-786.5 1780.5,-897.5 1945.5,-897.5 1945.5,-786.5 1780.5,-786.5"/> +</g> +<!-- LCorrRamin3 --> +<g id="node57" class="node"> +<title>LCorrRamin3</title> +<polygon fill="white" stroke="transparent" points="10782.5,-4842.5 10782.5,-4932.5 10945.5,-4932.5 10945.5,-4842.5 10782.5,-4842.5"/> +<polygon fill="#ce1256" stroke="transparent" points="10786,-4908.5 10786,-4929.5 10943,-4929.5 10943,-4908.5 10786,-4908.5"/> +<polygon fill="none" stroke="black" points="10786,-4908.5 10786,-4929.5 10943,-4929.5 10943,-4908.5 10786,-4908.5"/> +<text text-anchor="start" x="10789" y="-4915.3" font-family="Times,serif" font-size="14.00">LCorrRamin3 (2 GiB)</text> +<text text-anchor="start" x="10834" y="-4893.3" font-family="Times,serif" font-size="14.00">GeneId1</text> +<text text-anchor="start" x="10834" y="-4872.3" font-family="Times,serif" font-size="14.00">GeneId2</text> +<text text-anchor="start" x="10845" y="-4851.3" font-family="Times,serif" font-size="14.00">value</text> +<polygon fill="none" stroke="black" points="10782.5,-4842.5 10782.5,-4932.5 10945.5,-4932.5 10945.5,-4842.5 10782.5,-4842.5"/> +</g> +<!-- UserPrivilege --> +<g id="node59" class="node"> +<title>UserPrivilege</title> +<polygon fill="white" stroke="transparent" points="7239,-4842.5 7239,-4932.5 7407,-4932.5 7407,-4842.5 7239,-4842.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="7242,-4908.5 7242,-4929.5 7404,-4929.5 7404,-4908.5 7242,-4908.5"/> +<polygon fill="none" stroke="black" points="7242,-4908.5 7242,-4929.5 7404,-4929.5 7404,-4908.5 7242,-4908.5"/> +<text text-anchor="start" x="7245" y="-4915.3" font-family="Times,serif" font-size="14.00">UserPrivilege (224 B)</text> +<text text-anchor="start" x="7246.5" y="-4893.3" font-family="Times,serif" 
font-size="14.00">download_result_priv</text> +<text text-anchor="start" x="7258" y="-4872.3" font-family="Times,serif" font-size="14.00">ProbeSetFreezeId</text> +<text text-anchor="start" x="7298.5" y="-4851.3" font-family="Times,serif" font-size="14.00">UserId</text> +<polygon fill="none" stroke="black" points="7239,-4842.5 7239,-4932.5 7407,-4932.5 7407,-4842.5 7239,-4842.5"/> +</g> +<!-- UserPrivilege->User --> +<g id="edge46" class="edge"> +<title>UserPrivilege:UserId->User</title> +<path fill="none" stroke="black" d="M7323,-4844.5C7323,-4319.22 7309.04,-3693.9 7302.41,-3426.66"/> +<polygon fill="black" stroke="black" points="7305.91,-3426.44 7302.16,-3416.53 7298.91,-3426.61 7305.91,-3426.44"/> +</g> +<!-- GeneChip --> +<g id="node61" class="node"> +<title>GeneChip</title> +<polygon fill="lightgrey" stroke="transparent" points="1980,-744.5 1980,-939.5 2116,-939.5 2116,-744.5 1980,-744.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="1983,-915 1983,-936 2113,-936 2113,-915 1983,-915"/> +<polygon fill="none" stroke="black" points="1983,-915 1983,-936 2113,-936 2113,-915 1983,-915"/> +<text text-anchor="start" x="1986" y="-921.8" font-family="Times,serif" font-size="14.00">GeneChip (9 KiB)</text> +<text text-anchor="start" x="2005.5" y="-899.8" font-family="Times,serif" font-size="14.00">GeneChipId</text> +<polygon fill="green" stroke="transparent" points="1983,-873 1983,-892 2113,-892 2113,-873 1983,-873"/> +<text text-anchor="start" x="1992" y="-878.8" font-family="Times,serif" font-size="14.00">GeneChipName</text> +<text text-anchor="start" x="2002.5" y="-857.8" font-family="Times,serif" font-size="14.00">GeoPlatform</text> +<text text-anchor="start" x="1996" y="-836.8" font-family="Times,serif" font-size="14.00">GO_tree_value</text> +<text text-anchor="start" x="2040.5" y="-815.8" font-family="Times,serif" font-size="14.00">Id</text> +<polygon fill="green" stroke="transparent" points="1983,-789 1983,-808 2113,-808 2113,-789 1983,-789"/> +<text 
text-anchor="start" x="2026.5" y="-794.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="2013" y="-773.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="2031.5" y="-752.8" font-family="Times,serif" font-size="14.00">Title</text> +<polygon fill="none" stroke="black" points="1980,-744.5 1980,-939.5 2116,-939.5 2116,-744.5 1980,-744.5"/> +</g> +<!-- GeneChip->Species --> +<g id="edge47" class="edge"> +<title>GeneChip:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M2114,-777C2142.63,-777 2115.4,-542.59 2133,-520 2274.95,-337.76 2572.58,-304.64 2715.73,-299.62"/> +<polygon fill="black" stroke="black" points="2715.88,-303.12 2725.77,-299.31 2715.66,-296.12 2715.88,-303.12"/> +</g> +<!-- IndelXRef --> +<g id="node62" class="node"> +<title>IndelXRef</title> +<polygon fill="white" stroke="transparent" points="5716,-1918 5716,-2008 5856,-2008 5856,-1918 5716,-1918"/> +<polygon fill="#df65b0" stroke="transparent" points="5719,-1984 5719,-2005 5853,-2005 5853,-1984 5719,-1984"/> +<polygon fill="none" stroke="black" points="5719,-1984 5719,-2005 5853,-2005 5853,-1984 5719,-1984"/> +<text text-anchor="start" x="5722" y="-1990.8" font-family="Times,serif" font-size="14.00">IndelXRef (1 MiB)</text> +<text text-anchor="start" x="5760.5" y="-1968.8" font-family="Times,serif" font-size="14.00">IndelId</text> +<text text-anchor="start" x="5752" y="-1947.8" font-family="Times,serif" font-size="14.00">StrainId1</text> +<text text-anchor="start" x="5752" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId2</text> +<polygon fill="none" stroke="black" points="5716,-1918 5716,-2008 5856,-2008 5856,-1918 5716,-1918"/> +</g> +<!-- IndelXRef->Strain --> +<g id="edge48" class="edge"> +<title>IndelXRef:StrainId1->Strain</title> +<path fill="none" stroke="black" d="M5854,-1951C5904.87,-1951 5825.54,-1197.02 5796.2,-933"/> +<polygon fill="black" stroke="black" points="5799.64,-932.24 
5795.05,-922.68 5792.68,-933.01 5799.64,-932.24"/> +</g> +<!-- IndelXRef->Strain --> +<g id="edge49" class="edge"> +<title>IndelXRef:StrainId2->Strain</title> +<path fill="none" stroke="black" d="M5786,-1920C5786,-1553.9 5786,-1117.79 5786,-932.93"/> +<polygon fill="black" stroke="black" points="5789.5,-932.72 5786,-922.72 5782.5,-932.72 5789.5,-932.72"/> +</g> +<!-- user --> +<g id="node63" class="node"> +<title>user</title> +<polygon fill="white" stroke="transparent" points="10979.5,-4779.5 10979.5,-4995.5 11108.5,-4995.5 11108.5,-4779.5 10979.5,-4779.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="10983,-4971.5 10983,-4992.5 11106,-4992.5 11106,-4971.5 10983,-4971.5"/> +<polygon fill="none" stroke="black" points="10983,-4971.5 10983,-4992.5 11106,-4992.5 11106,-4971.5 10983,-4971.5"/> +<text text-anchor="start" x="10997" y="-4978.3" font-family="Times,serif" font-size="14.00">user (64 KiB)</text> +<text text-anchor="start" x="11023" y="-4956.3" font-family="Times,serif" font-size="14.00">active</text> +<text text-anchor="start" x="11008.5" y="-4935.3" font-family="Times,serif" font-size="14.00">confirmed</text> +<text text-anchor="start" x="10993" y="-4914.3" font-family="Times,serif" font-size="14.00">email_address</text> +<text text-anchor="start" x="11009.5" y="-4893.3" font-family="Times,serif" font-size="14.00">full_name</text> +<text text-anchor="start" x="11037.5" y="-4872.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="10999.5" y="-4851.3" font-family="Times,serif" font-size="14.00">organization</text> +<text text-anchor="start" x="11010" y="-4830.3" font-family="Times,serif" font-size="14.00">password</text> +<text text-anchor="start" x="10985" y="-4809.3" font-family="Times,serif" font-size="14.00">registration_info</text> +<text text-anchor="start" x="11008" y="-4788.3" font-family="Times,serif" font-size="14.00">superuser</text> +<polygon fill="none" stroke="black" points="10979.5,-4779.5 
10979.5,-4995.5 11108.5,-4995.5 11108.5,-4779.5 10979.5,-4779.5"/> +</g> +<!-- PublishSE --> +<g id="node64" class="node"> +<title>PublishSE</title> +<polygon fill="white" stroke="transparent" points="5890,-1918 5890,-2008 6034,-2008 6034,-1918 5890,-1918"/> +<polygon fill="#df65b0" stroke="transparent" points="5893,-1984 5893,-2005 6031,-2005 6031,-1984 5893,-1984"/> +<polygon fill="none" stroke="black" points="5893,-1984 5893,-2005 6031,-2005 6031,-1984 5893,-1984"/> +<text text-anchor="start" x="5896" y="-1990.8" font-family="Times,serif" font-size="14.00">PublishSE (3 MiB)</text> +<text text-anchor="start" x="5937.5" y="-1968.8" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="5943.5" y="-1947.8" font-family="Times,serif" font-size="14.00">error</text> +<text text-anchor="start" x="5932.5" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<polygon fill="none" stroke="black" points="5890,-1918 5890,-2008 6034,-2008 6034,-1918 5890,-1918"/> +</g> +<!-- PublishSE->Strain --> +<g id="edge50" class="edge"> +<title>PublishSE:StrainId->Strain</title> +<path fill="none" stroke="black" d="M5962,-1920C5962,-1549.32 5859.2,-1116.17 5810.73,-932.54"/> +<polygon fill="black" stroke="black" points="5814.06,-931.43 5808.11,-922.66 5807.29,-933.22 5814.06,-931.43"/> +</g> +<!-- EnsemblProbe --> +<g id="node65" class="node"> +<title>EnsemblProbe</title> +<polygon fill="white" stroke="transparent" points="11143,-4821.5 11143,-4953.5 11327,-4953.5 11327,-4821.5 11143,-4821.5"/> +<polygon fill="#df65b0" stroke="transparent" points="11146,-4929.5 11146,-4950.5 11324,-4950.5 11324,-4929.5 11146,-4929.5"/> +<polygon fill="none" stroke="black" points="11146,-4929.5 11146,-4950.5 11324,-4950.5 11324,-4929.5 11146,-4929.5"/> +<text text-anchor="start" x="11149" y="-4936.3" font-family="Times,serif" font-size="14.00">EnsemblProbe (94 MiB)</text> +<text text-anchor="start" x="11211" y="-4914.3" font-family="Times,serif" 
font-size="14.00">ChipId</text> +<text text-anchor="start" x="11227.5" y="-4893.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="11212" y="-4872.3" font-family="Times,serif" font-size="14.00">length</text> +<text text-anchor="start" x="11213.5" y="-4851.3" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="11201.5" y="-4830.3" font-family="Times,serif" font-size="14.00">ProbeSet</text> +<polygon fill="none" stroke="black" points="11143,-4821.5 11143,-4953.5 11327,-4953.5 11327,-4821.5 11143,-4821.5"/> +</g> +<!-- InfoFiles --> +<g id="node66" class="node"> +<title>InfoFiles</title> +<polygon fill="lightgrey" stroke="transparent" points="2048.5,-1424.5 2048.5,-2501.5 2279.5,-2501.5 2279.5,-1424.5 2048.5,-1424.5"/> +<polygon fill="#df65b0" stroke="transparent" points="2052,-2477 2052,-2498 2277,-2498 2277,-2477 2052,-2477"/> +<polygon fill="none" stroke="black" points="2052,-2477 2052,-2498 2277,-2498 2277,-2477 2052,-2477"/> +<text text-anchor="start" x="2104" y="-2483.8" font-family="Times,serif" font-size="14.00">InfoFiles (4 MiB)</text> +<text text-anchor="start" x="2085.5" y="-2461.8" font-family="Times,serif" font-size="14.00">About_Array_Platform</text> +<text text-anchor="start" x="2119" y="-2440.8" font-family="Times,serif" font-size="14.00">About_Cases</text> +<text text-anchor="start" x="2054" y="-2419.8" font-family="Times,serif" font-size="14.00">About_Data_Values_Processing</text> +<text text-anchor="start" x="2104.5" y="-2398.8" font-family="Times,serif" font-size="14.00">About_Download</text> +<text text-anchor="start" x="2117" y="-2377.8" font-family="Times,serif" font-size="14.00">About_Tissue</text> +<text text-anchor="start" x="2104" y="-2356.8" font-family="Times,serif" font-size="14.00">AuthorizedUsers</text> +<text text-anchor="start" x="2116" y="-2335.8" font-family="Times,serif" font-size="14.00">AvgMethodId</text> +<text text-anchor="start" x="2135.5" y="-2314.8" 
font-family="Times,serif" font-size="14.00">Citation</text> +<text text-anchor="start" x="2149.5" y="-2293.8" font-family="Times,serif" font-size="14.00">City</text> +<text text-anchor="start" x="2112" y="-2272.8" font-family="Times,serif" font-size="14.00">Contact_Name</text> +<text text-anchor="start" x="2122" y="-2251.8" font-family="Times,serif" font-size="14.00">Contributor</text> +<text text-anchor="start" x="2135.5" y="-2230.8" font-family="Times,serif" font-size="14.00">Country</text> +<text text-anchor="start" x="2069" y="-2209.8" font-family="Times,serif" font-size="14.00">Data_Source_Acknowledge</text> +<text text-anchor="start" x="2129.5" y="-2188.8" font-family="Times,serif" font-size="14.00">DatasetId</text> +<text text-anchor="start" x="2129" y="-2167.8" font-family="Times,serif" font-size="14.00">DB_Name</text> +<text text-anchor="start" x="2121" y="-2146.8" font-family="Times,serif" font-size="14.00">Department</text> +<text text-anchor="start" x="2140" y="-2125.8" font-family="Times,serif" font-size="14.00">Emails</text> +<text text-anchor="start" x="2101.5" y="-2104.8" font-family="Times,serif" font-size="14.00">Experiment_Type</text> +<text text-anchor="start" x="2122" y="-2083.8" font-family="Times,serif" font-size="14.00">GeneChipId</text> +<polygon fill="green" stroke="transparent" points="2052,-2057 2052,-2076 2277,-2076 2277,-2057 2052,-2057"/> +<text text-anchor="start" x="2111" y="-2062.8" font-family="Times,serif" font-size="14.00">GN_AccesionId</text> +<text text-anchor="start" x="2128.5" y="-2041.8" font-family="Times,serif" font-size="14.00">InbredSet</text> +<text text-anchor="start" x="2121.5" y="-2020.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="2129.5" y="-1999.8" font-family="Times,serif" font-size="14.00">InfoFileId</text> +<polygon fill="green" stroke="transparent" points="2052,-1973 2052,-1992 2277,-1992 2277,-1973 2052,-1973"/> +<text text-anchor="start" x="2120.5" 
y="-1978.8" font-family="Times,serif" font-size="14.00">InfoFileTitle</text> +<text text-anchor="start" x="2112" y="-1957.8" font-family="Times,serif" font-size="14.00">InfoPageName</text> +<text text-anchor="start" x="2117" y="-1936.8" font-family="Times,serif" font-size="14.00">InfoPageTitle</text> +<text text-anchor="start" x="2125" y="-1915.8" font-family="Times,serif" font-size="14.00">Laboratory</text> +<text text-anchor="start" x="2113.5" y="-1894.8" font-family="Times,serif" font-size="14.00">Normalization</text> +<text text-anchor="start" x="2129.5" y="-1873.8" font-family="Times,serif" font-size="14.00">Organism</text> +<text text-anchor="start" x="2119" y="-1852.8" font-family="Times,serif" font-size="14.00">Organism_Id</text> +<text text-anchor="start" x="2093.5" y="-1831.8" font-family="Times,serif" font-size="14.00">Organization_Name</text> +<text text-anchor="start" x="2110" y="-1810.8" font-family="Times,serif" font-size="14.00">Overall_Design</text> +<text text-anchor="start" x="2142" y="-1789.8" font-family="Times,serif" font-size="14.00">Phone</text> +<text text-anchor="start" x="2129.5" y="-1768.8" font-family="Times,serif" font-size="14.00">Platforms</text> +<text text-anchor="start" x="2132" y="-1747.8" font-family="Times,serif" font-size="14.00">Progreso</text> +<text text-anchor="start" x="2088.5" y="-1726.8" font-family="Times,serif" font-size="14.00">QualityControlStatus</text> +<text text-anchor="start" x="2134" y="-1705.8" font-family="Times,serif" font-size="14.00">Samples</text> +<text text-anchor="start" x="2137" y="-1684.8" font-family="Times,serif" font-size="14.00">Species</text> +<text text-anchor="start" x="2129.5" y="-1663.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<polygon fill="green" stroke="transparent" points="2052,-1637 2052,-1656 2277,-1656 2277,-1637 2052,-1637"/> +<text text-anchor="start" x="2132.5" y="-1642.8" font-family="Times,serif" font-size="14.00">Specifics</text> +<text text-anchor="start" 
x="2145" y="-1621.8" font-family="Times,serif" font-size="14.00">State</text> +<text text-anchor="start" x="2141" y="-1600.8" font-family="Times,serif" font-size="14.00">Status</text> +<text text-anchor="start" x="2141.5" y="-1579.8" font-family="Times,serif" font-size="14.00">Street</text> +<text text-anchor="start" x="2102.5" y="-1558.8" font-family="Times,serif" font-size="14.00">Submission_Date</text> +<text text-anchor="start" x="2129.5" y="-1537.8" font-family="Times,serif" font-size="14.00">Summary</text> +<text text-anchor="start" x="2141.5" y="-1516.8" font-family="Times,serif" font-size="14.00">Tissue</text> +<text text-anchor="start" x="2134" y="-1495.8" font-family="Times,serif" font-size="14.00">TissueId</text> +<polygon fill="green" stroke="transparent" points="2052,-1469 2052,-1488 2277,-1488 2277,-1469 2052,-1469"/> +<text text-anchor="start" x="2148" y="-1474.8" font-family="Times,serif" font-size="14.00">Title</text> +<text text-anchor="start" x="2148.5" y="-1453.8" font-family="Times,serif" font-size="14.00">URL</text> +<text text-anchor="start" x="2152" y="-1432.8" font-family="Times,serif" font-size="14.00">ZIP</text> +<polygon fill="none" stroke="black" points="2048.5,-1424.5 2048.5,-2501.5 2279.5,-2501.5 2279.5,-1424.5 2048.5,-1424.5"/> +</g> +<!-- InfoFiles->Datasets --> +<g id="edge52" class="edge"> +<title>InfoFiles:DatasetId->Datasets</title> +<path fill="none" stroke="black" d="M2051,-2193C1940.48,-2193 2072.47,-1276.81 1993,-1200 1933.9,-1142.88 581.41,-1211.03 514,-1164 470.71,-1133.8 442.18,-1086.38 423.37,-1037.17"/> +<polygon fill="black" stroke="black" points="426.6,-1035.81 419.85,-1027.64 420.03,-1038.23 426.6,-1035.81"/> +</g> +<!-- InfoFiles->InbredSet --> +<g id="edge54" class="edge"> +<title>InfoFiles:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M2278,-2025C2323.84,-2025 2263.64,-1232.47 2296,-1200 2352.1,-1143.71 3660.72,-1209.33 3726,-1164 3778.57,-1127.49 3809.73,-1065.76 3828.19,-1006.12"/> 
+<polygon fill="black" stroke="black" points="3831.65,-1006.77 3831.16,-996.18 3824.94,-1004.76 3831.65,-1006.77"/> +</g> +<!-- InfoFiles->Species --> +<g id="edge55" class="edge"> +<title>InfoFiles:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M2278,-1667C2303.96,-1667 2277.61,-1218.33 2296,-1200 2376.56,-1119.71 3240,-1245.83 3319,-1164 3368.7,-1112.52 3358.57,-579.62 3319,-520 3219.73,-370.42 2996.86,-322.06 2876.6,-306.62"/> +<polygon fill="black" stroke="black" points="2876.71,-303.1 2866.35,-305.35 2875.85,-310.05 2876.71,-303.1"/> +</g> +<!-- InfoFiles->AvgMethod --> +<g id="edge51" class="edge"> +<title>InfoFiles:AvgMethodId->AvgMethod</title> +<path fill="none" stroke="black" d="M2051,-2340C1924.17,-2340 2083.05,-1289.32 1993,-1200 1926.52,-1134.05 1224.64,-1221.84 1151,-1164 1075.17,-1104.44 1058.6,-986.94 1056.31,-911.82"/> +<polygon fill="black" stroke="black" points="1059.8,-911.43 1056.07,-901.51 1052.8,-911.59 1059.8,-911.43"/> +</g> +<!-- InfoFiles->GeneChip --> +<g id="edge53" class="edge"> +<title>InfoFiles:GeneChipId->GeneChip</title> +<path fill="none" stroke="black" d="M2051,-2088C2022.77,-2088 2038.62,-1258.67 2045.41,-953.75"/> +<polygon fill="black" stroke="black" points="2048.91,-953.64 2045.63,-943.57 2041.91,-953.49 2048.91,-953.64"/> +</g> +<!-- InfoFiles->Tissue --> +<g id="edge56" class="edge"> +<title>InfoFiles:TissueId->Tissue</title> +<path fill="none" stroke="black" d="M2278,-1499C2311.28,-1499 2278.84,-1228.52 2296,-1200 2311.83,-1173.68 2336.81,-1188.76 2355,-1164 2402.06,-1099.94 2421.62,-1011.33 2429.66,-943.43"/> +<polygon fill="black" stroke="black" points="2433.17,-943.47 2430.81,-933.15 2426.22,-942.7 2433.17,-943.47"/> +</g> +<!-- Vlookup --> +<g id="node67" class="node"> +<title>Vlookup</title> +<polygon fill="white" stroke="transparent" points="2070,-2766 2070,-3822 2258,-3822 2258,-2766 2070,-2766"/> +<polygon fill="#d7b5d8" stroke="transparent" points="2073,-3798 2073,-3819 2255,-3819 2255,-3798 
2073,-3798"/> +<polygon fill="none" stroke="black" points="2073,-3798 2073,-3819 2255,-3819 2255,-3798 2073,-3798"/> +<text text-anchor="start" x="2099" y="-3804.8" font-family="Times,serif" font-size="14.00">Vlookup (120 KiB)</text> +<text text-anchor="start" x="2147" y="-3782.8" font-family="Times,serif" font-size="14.00">alias</text> +<text text-anchor="start" x="2137" y="-3761.8" font-family="Times,serif" font-size="14.00">AlignID</text> +<text text-anchor="start" x="2130.5" y="-3740.8" font-family="Times,serif" font-size="14.00">assembly</text> +<text text-anchor="start" x="2115.5" y="-3719.8" font-family="Times,serif" font-size="14.00">AvgMethodId</text> +<text text-anchor="start" x="2135.5" y="-3698.8" font-family="Times,serif" font-size="14.00">BlatSeq</text> +<text text-anchor="start" x="2117.5" y="-3677.8" font-family="Times,serif" font-size="14.00">CAS_number</text> +<text text-anchor="start" x="2137.5" y="-3656.8" font-family="Times,serif" font-size="14.00">cdsEnd</text> +<text text-anchor="start" x="2133.5" y="-3635.8" font-family="Times,serif" font-size="14.00">cdsStart</text> +<text text-anchor="start" x="2129" y="-3614.8" font-family="Times,serif" font-size="14.00">ChEBI_ID</text> +<text text-anchor="start" x="2120" y="-3593.8" font-family="Times,serif" font-size="14.00">ChEMBL_ID</text> +<text text-anchor="start" x="2108" y="-3572.8" font-family="Times,serif" font-size="14.00">ChemSpider_ID</text> +<text text-anchor="start" x="2150.5" y="-3551.8" font-family="Times,serif" font-size="14.00">Chr</text> +<text text-anchor="start" x="2129" y="-3530.8" font-family="Times,serif" font-size="14.00">DatasetId</text> +<text text-anchor="start" x="2123.5" y="-3509.8" font-family="Times,serif" font-size="14.00">description</text> +<text text-anchor="start" x="2122" y="-3488.8" font-family="Times,serif" font-size="14.00">EC_number</text> +<text text-anchor="start" x="2125.5" y="-3467.8" font-family="Times,serif" font-size="14.00">exonCount</text> +<text 
text-anchor="start" x="2129" y="-3446.8" font-family="Times,serif" font-size="14.00">exonEnds</text> +<text text-anchor="start" x="2124.5" y="-3425.8" font-family="Times,serif" font-size="14.00">exonStarts</text> +<text text-anchor="start" x="2105" y="-3404.8" font-family="Times,serif" font-size="14.00">Full_Description</text> +<text text-anchor="start" x="2121.5" y="-3383.8" font-family="Times,serif" font-size="14.00">GeneChipId</text> +<text text-anchor="start" x="2138" y="-3362.8" font-family="Times,serif" font-size="14.00">GeneId</text> +<text text-anchor="start" x="2110.5" y="-3341.8" font-family="Times,serif" font-size="14.00">GN_AccesionId</text> +<text text-anchor="start" x="2128" y="-3320.8" font-family="Times,serif" font-size="14.00">HMDB_ID</text> +<text text-anchor="start" x="2156.5" y="-3299.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2121" y="-3278.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="2129" y="-3257.8" font-family="Times,serif" font-size="14.00">InfoFileId</text> +<text text-anchor="start" x="2111.5" y="-3236.8" font-family="Times,serif" font-size="14.00">InfoPageName</text> +<text text-anchor="start" x="2130.5" y="-3215.8" font-family="Times,serif" font-size="14.00">KEGG_ID</text> +<text text-anchor="start" x="2147" y="-3194.8" font-family="Times,serif" font-size="14.00">kgID</text> +<text text-anchor="start" x="2152" y="-3173.8" font-family="Times,serif" font-size="14.00">Mb</text> +<text text-anchor="start" x="2099.5" y="-3152.8" font-family="Times,serif" font-size="14.00">Molecular_Weight</text> +<text text-anchor="start" x="2142.5" y="-3131.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="2139" y="-3110.8" font-family="Times,serif" font-size="14.00">NM_ID</text> +<text text-anchor="start" x="2118.5" y="-3089.8" font-family="Times,serif" font-size="14.00">Nugowiki_ID</text> +<text text-anchor="start" 
x="2135" y="-3068.8" font-family="Times,serif" font-size="14.00">Position</text> +<text text-anchor="start" x="2079" y="-3047.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_end</text> +<text text-anchor="start" x="2075" y="-3026.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_start</text> +<text text-anchor="start" x="2129" y="-3005.8" font-family="Times,serif" font-size="14.00">ProteinID</text> +<text text-anchor="start" x="2117.5" y="-2984.8" font-family="Times,serif" font-size="14.00">PubChem_ID</text> +<text text-anchor="start" x="2129" y="-2963.8" font-family="Times,serif" font-size="14.00">SnpName</text> +<text text-anchor="start" x="2129" y="-2942.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="2139.5" y="-2921.8" font-family="Times,serif" font-size="14.00">Strand</text> +<text text-anchor="start" x="2137" y="-2900.8" font-family="Times,serif" font-size="14.00">Symbol</text> +<text text-anchor="start" x="2133.5" y="-2879.8" font-family="Times,serif" font-size="14.00">TissueId</text> +<text text-anchor="start" x="2141" y="-2858.8" font-family="Times,serif" font-size="14.00">TxEnd</text> +<text text-anchor="start" x="2136.5" y="-2837.8" font-family="Times,serif" font-size="14.00">TxStart</text> +<text text-anchor="start" x="2135" y="-2816.8" font-family="Times,serif" font-size="14.00">UNII_ID</text> +<text text-anchor="start" x="2126" y="-2795.8" font-family="Times,serif" font-size="14.00">VLBlatSeq</text> +<text text-anchor="start" x="2114" y="-2774.8" font-family="Times,serif" font-size="14.00">VLProbeSetId</text> +<polygon fill="none" stroke="black" points="2070,-2766 2070,-3822 2258,-3822 2258,-2766 2070,-2766"/> +</g> +<!-- Vlookup->Datasets --> +<g id="edge58" class="edge"> +<title>Vlookup:DatasetId->Datasets</title> +<path fill="none" stroke="black" d="M2072,-3535C1300.04,-3535 942.38,-3381.71 535,-2726 490.25,-2653.97 509.59,-1283.71 496,-1200 487.3,-1146.41 
472.62,-1089.65 456.8,-1037.55"/> +<polygon fill="black" stroke="black" points="460.12,-1036.41 453.84,-1027.88 453.42,-1038.46 460.12,-1036.41"/> +</g> +<!-- Vlookup->InbredSet --> +<g id="edge60" class="edge"> +<title>Vlookup:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M2256,-3282C2538.62,-3282 2374.11,-2897.73 2622,-2762 2701.02,-2718.73 3368.94,-2790.34 3432,-2726 3491.36,-2665.43 3412.08,-1262.87 3469,-1200 3556.17,-1103.72 3659.82,-1247.85 3759,-1164 3805.29,-1124.86 3829.81,-1064.39 3842.6,-1006.44"/> +<polygon fill="black" stroke="black" points="3846.07,-1006.91 3844.7,-996.41 3839.22,-1005.48 3846.07,-1006.91"/> +</g> +<!-- Vlookup->Species --> +<g id="edge62" class="edge"> +<title>Vlookup:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M2256,-2946C2438.07,-2946 2446.26,-2809.59 2622,-2762 2683.25,-2745.41 3148.01,-2771.74 3192,-2726 3250.79,-2664.88 3170.49,-1261.39 3229,-1200 3305.4,-1119.84 3650.58,-1245.08 3726,-1164 3774.74,-1111.61 3770.38,-576.13 3726,-520 3619.99,-385.91 3082.28,-324.84 2876.38,-306.1"/> +<polygon fill="black" stroke="black" points="2876.51,-302.6 2866.23,-305.19 2875.88,-309.57 2876.51,-302.6"/> +</g> +<!-- Vlookup->AvgMethod --> +<g id="edge57" class="edge"> +<title>Vlookup:AvgMethodId->AvgMethod</title> +<path fill="none" stroke="black" d="M2072,-3724C882.38,-3724 1769.05,-2234.12 1181,-1200 1170.7,-1181.9 1160.77,-1182.39 1151,-1164 1107.82,-1082.73 1082.45,-978.95 1069.39,-911.74"/> +<polygon fill="black" stroke="black" points="1072.79,-910.86 1067.48,-901.69 1065.91,-912.17 1072.79,-910.86"/> +</g> +<!-- Vlookup->GeneChip --> +<g id="edge59" class="edge"> +<title>Vlookup:GeneChipId->GeneChip</title> +<path fill="none" stroke="black" d="M2072,-3388C1777.21,-3388 2040.12,-3020.64 2031,-2726 2010.03,-2048.1 2014.87,-1878.03 2031,-1200 2032.96,-1117.52 2037.47,-1024.42 2041.35,-953.97"/> +<polygon fill="black" stroke="black" points="2044.86,-953.92 2041.92,-943.75 2037.87,-953.54 
2044.86,-953.92"/> +</g> +<!-- Vlookup->InfoFiles --> +<g id="edge61" class="edge"> +<title>Vlookup:InfoFileId->InfoFiles</title> +<path fill="none" stroke="black" d="M2256,-3261C2335.46,-3261 2299.68,-2868.62 2251.39,-2515.5"/> +<polygon fill="black" stroke="black" points="2254.86,-2515.02 2250.03,-2505.59 2247.92,-2515.97 2254.86,-2515.02"/> +</g> +<!-- Vlookup->Tissue --> +<g id="edge63" class="edge"> +<title>Vlookup:TissueId->Tissue</title> +<path fill="none" stroke="black" d="M2256,-2883C2406.09,-2883 2477.36,-2854.46 2555,-2726 2598.85,-2653.44 2589.96,-1277.23 2555,-1200 2545,-1177.91 2526.59,-1184.73 2514,-1164 2473.2,-1096.81 2453.44,-1009.63 2443.89,-943.08"/> +<polygon fill="black" stroke="black" points="2447.33,-942.43 2442.5,-933 2440.4,-943.39 2447.33,-942.43"/> +</g> +<!-- user_collection --> +<g id="node68" class="node"> +<title>user_collection</title> +<polygon fill="white" stroke="transparent" points="11361,-4811 11361,-4964 11543,-4964 11543,-4811 11361,-4811"/> +<polygon fill="#d7b5d8" stroke="transparent" points="11364,-4939.5 11364,-4960.5 11540,-4960.5 11540,-4939.5 11364,-4939.5"/> +<polygon fill="none" stroke="black" points="11364,-4939.5 11364,-4960.5 11540,-4960.5 11540,-4939.5 11364,-4939.5"/> +<text text-anchor="start" x="11367" y="-4946.3" font-family="Times,serif" font-size="14.00">user_collection (60 KiB)</text> +<text text-anchor="start" x="11380" y="-4924.3" font-family="Times,serif" font-size="14.00">changed_timestamp</text> +<text text-anchor="start" x="11383" y="-4903.3" font-family="Times,serif" font-size="14.00">created_timestamp</text> +<text text-anchor="start" x="11445" y="-4882.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="11418.5" y="-4861.3" font-family="Times,serif" font-size="14.00">members</text> +<text text-anchor="start" x="11432" y="-4840.3" font-family="Times,serif" font-size="14.00">name</text> +<text text-anchor="start" x="11436" y="-4819.3" font-family="Times,serif" 
font-size="14.00">user</text> +<polygon fill="none" stroke="black" points="11361,-4811 11361,-4964 11543,-4964 11543,-4811 11361,-4811"/> +</g> +<!-- pubmedsearch --> +<g id="node69" class="node"> +<title>pubmedsearch</title> +<polygon fill="white" stroke="transparent" points="11577.5,-4800.5 11577.5,-4974.5 11770.5,-4974.5 11770.5,-4800.5 11577.5,-4800.5"/> +<polygon fill="#df65b0" stroke="transparent" points="11581,-4950.5 11581,-4971.5 11768,-4971.5 11768,-4950.5 11581,-4950.5"/> +<polygon fill="none" stroke="black" points="11581,-4950.5 11581,-4971.5 11768,-4971.5 11768,-4950.5 11581,-4950.5"/> +<text text-anchor="start" x="11584" y="-4957.3" font-family="Times,serif" font-size="14.00">pubmedsearch (586 MiB)</text> +<text text-anchor="start" x="11619.5" y="-4935.3" font-family="Times,serif" font-size="14.00">authorfullname</text> +<text text-anchor="start" x="11612.5" y="-4914.3" font-family="Times,serif" font-size="14.00">authorshortname</text> +<text text-anchor="start" x="11650" y="-4893.3" font-family="Times,serif" font-size="14.00">geneid</text> +<text text-anchor="start" x="11667.5" y="-4872.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="11644" y="-4851.3" font-family="Times,serif" font-size="14.00">institute</text> +<text text-anchor="start" x="11638.5" y="-4830.3" font-family="Times,serif" font-size="14.00">pubmedid</text> +<text text-anchor="start" x="11659.5" y="-4809.3" font-family="Times,serif" font-size="14.00">title</text> +<polygon fill="none" stroke="black" points="11577.5,-4800.5 11577.5,-4974.5 11770.5,-4974.5 11770.5,-4800.5 11577.5,-4800.5"/> +</g> +<!-- EnsemblProbeLocation --> +<g id="node70" class="node"> +<title>EnsemblProbeLocation</title> +<polygon fill="white" stroke="transparent" points="6793,-4790 6793,-4985 7037,-4985 7037,-4790 6793,-4790"/> +<polygon fill="#df65b0" stroke="transparent" points="6796,-4960.5 6796,-4981.5 7034,-4981.5 7034,-4960.5 6796,-4960.5"/> +<polygon fill="none" 
stroke="black" points="6796,-4960.5 6796,-4981.5 7034,-4981.5 7034,-4960.5 6796,-4960.5"/> +<text text-anchor="start" x="6799" y="-4967.3" font-family="Times,serif" font-size="14.00">EnsemblProbeLocation (99 MiB)</text> +<text text-anchor="start" x="6901.5" y="-4945.3" font-family="Times,serif" font-size="14.00">Chr</text> +<text text-anchor="start" x="6900.5" y="-4924.3" font-family="Times,serif" font-size="14.00">End</text> +<text text-anchor="start" x="6879" y="-4903.3" font-family="Times,serif" font-size="14.00">End_2016</text> +<text text-anchor="start" x="6867" y="-4882.3" font-family="Times,serif" font-size="14.00">MisMataches</text> +<text text-anchor="start" x="6886.5" y="-4861.3" font-family="Times,serif" font-size="14.00">ProbeId</text> +<text text-anchor="start" x="6896.5" y="-4840.3" font-family="Times,serif" font-size="14.00">Start</text> +<text text-anchor="start" x="6875" y="-4819.3" font-family="Times,serif" font-size="14.00">Start_2016</text> +<text text-anchor="start" x="6890.5" y="-4798.3" font-family="Times,serif" font-size="14.00">Strand</text> +<polygon fill="none" stroke="black" points="6793,-4790 6793,-4985 7037,-4985 7037,-4790 6793,-4790"/> +</g> +<!-- EnsemblProbeLocation->Probe --> +<g id="edge64" class="edge"> +<title>EnsemblProbeLocation:ProbeId->Probe</title> +<path fill="none" stroke="black" d="M7035,-4864.5C7071.26,-4864.5 6964.83,-3784.86 6927.45,-3416.46"/> +<polygon fill="black" stroke="black" points="6930.91,-3415.9 6926.42,-3406.3 6923.95,-3416.61 6930.91,-3415.9"/> +</g> +<!-- Investigators->Organizations --> +<g id="edge65" class="edge"> +<title>Investigators:OrganizationId->Organizations</title> +<path fill="none" stroke="black" d="M256,-296.5C296.78,-296.5 271.73,-150.19 255,-113 250.33,-102.62 243.39,-93.09 235.5,-84.57"/> +<polygon fill="black" stroke="black" points="237.88,-82 228.35,-77.36 232.9,-86.93 237.88,-82"/> +</g> +<!-- ProbeSetSE --> +<g id="node72" class="node"> +<title>ProbeSetSE</title> +<polygon 
fill="white" stroke="transparent" points="6068,-1918 6068,-2008 6222,-2008 6222,-1918 6068,-1918"/> +<polygon fill="#ce1256" stroke="transparent" points="6071,-1984 6071,-2005 6219,-2005 6219,-1984 6071,-1984"/> +<polygon fill="none" stroke="black" points="6071,-1984 6071,-2005 6219,-2005 6219,-1984 6071,-1984"/> +<text text-anchor="start" x="6074" y="-1990.8" font-family="Times,serif" font-size="14.00">ProbeSetSE (7 GiB)</text> +<text text-anchor="start" x="6120.5" y="-1968.8" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="6126.5" y="-1947.8" font-family="Times,serif" font-size="14.00">error</text> +<text text-anchor="start" x="6115.5" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<polygon fill="none" stroke="black" points="6068,-1918 6068,-2008 6222,-2008 6222,-1918 6068,-1918"/> +</g> +<!-- ProbeSetSE->Strain --> +<g id="edge66" class="edge"> +<title>ProbeSetSE:StrainId->Strain</title> +<path fill="none" stroke="black" d="M6070,-1930C6049.72,-1930 6057.62,-1219.18 6051,-1200 6011.97,-1086.88 5923.03,-979.85 5858.94,-913.01"/> +<polygon fill="black" stroke="black" points="5861.11,-910.22 5851.65,-905.47 5856.08,-915.09 5861.11,-910.22"/> +</g> +<!-- TableComments --> +<g id="node74" class="node"> +<title>TableComments</title> +<polygon fill="white" stroke="transparent" points="11805,-4853 11805,-4922 11995,-4922 11995,-4853 11805,-4853"/> +<polygon fill="#d7b5d8" stroke="transparent" points="11808,-4897.5 11808,-4918.5 11992,-4918.5 11992,-4897.5 11808,-4897.5"/> +<polygon fill="none" stroke="black" points="11808,-4897.5 11808,-4918.5 11992,-4918.5 11992,-4897.5 11808,-4897.5"/> +<text text-anchor="start" x="11811" y="-4904.3" font-family="Times,serif" font-size="14.00">TableComments (34 KiB)</text> +<text text-anchor="start" x="11865" y="-4882.3" font-family="Times,serif" font-size="14.00">Comment</text> +<text text-anchor="start" x="11859.5" y="-4861.3" font-family="Times,serif" 
font-size="14.00">TableName</text> +<polygon fill="none" stroke="black" points="11805,-4853 11805,-4922 11995,-4922 11995,-4853 11805,-4853"/> +</g> +<!-- Dataset_mbat --> +<g id="node75" class="node"> +<title>Dataset_mbat</title> +<polygon fill="white" stroke="transparent" points="12029.5,-4800.5 12029.5,-4974.5 12198.5,-4974.5 12198.5,-4800.5 12029.5,-4800.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="12033,-4950.5 12033,-4971.5 12196,-4971.5 12196,-4950.5 12033,-4950.5"/> +<polygon fill="none" stroke="black" points="12033,-4950.5 12033,-4971.5 12196,-4971.5 12196,-4950.5 12033,-4950.5"/> +<text text-anchor="start" x="12036" y="-4957.3" font-family="Times,serif" font-size="14.00">Dataset_mbat (764 B)</text> +<text text-anchor="start" x="12095.5" y="-4935.3" font-family="Times,serif" font-size="14.00">cross</text> +<text text-anchor="start" x="12082" y="-4914.3" font-family="Times,serif" font-size="14.00">database</text> +<text text-anchor="start" x="12040" y="-4893.3" font-family="Times,serif" font-size="14.00">database_LongName</text> +<text text-anchor="start" x="12107.5" y="-4872.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="12088" y="-4851.3" font-family="Times,serif" font-size="14.00">species</text> +<text text-anchor="start" x="12091" y="-4830.3" font-family="Times,serif" font-size="14.00">switch</text> +<text text-anchor="start" x="12093" y="-4809.3" font-family="Times,serif" font-size="14.00">tissue</text> +<polygon fill="none" stroke="black" points="12029.5,-4800.5 12029.5,-4974.5 12198.5,-4974.5 12198.5,-4800.5 12029.5,-4800.5"/> +</g> +<!-- CaseAttributeXRefNew --> +<g id="node76" class="node"> +<title>CaseAttributeXRefNew</title> +<polygon fill="white" stroke="transparent" points="3817,-1907.5 3817,-2018.5 4053,-2018.5 4053,-1907.5 3817,-1907.5"/> +<polygon fill="#df65b0" stroke="transparent" points="3820,-1994 3820,-2015 4050,-2015 4050,-1994 3820,-1994"/> +<polygon fill="none" stroke="black" 
points="3820,-1994 3820,-2015 4050,-2015 4050,-1994 3820,-1994"/> +<text text-anchor="start" x="3823" y="-2000.8" font-family="Times,serif" font-size="14.00">CaseAttributeXRefNew (5 MiB)</text> +<text text-anchor="start" x="3877.5" y="-1978.8" font-family="Times,serif" font-size="14.00">CaseAttributeId</text> +<text text-anchor="start" x="3892" y="-1957.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="3905.5" y="-1936.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="3915" y="-1915.8" font-family="Times,serif" font-size="14.00">Value</text> +<polygon fill="none" stroke="black" points="3817,-1907.5 3817,-2018.5 4053,-2018.5 4053,-1907.5 3817,-1907.5"/> +</g> +<!-- CaseAttributeXRefNew->InbredSet --> +<g id="edge68" class="edge"> +<title>CaseAttributeXRefNew:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M3819,-1961C3795.41,-1961 3828.4,-1316.38 3845.65,-1006.1"/> +<polygon fill="black" stroke="black" points="3849.14,-1006.29 3846.2,-996.11 3842.15,-1005.9 3849.14,-1006.29"/> +</g> +<!-- CaseAttributeXRefNew->CaseAttribute --> +<g id="edge67" class="edge"> +<title>CaseAttributeXRefNew:CaseAttributeId->CaseAttribute</title> +<path fill="none" stroke="black" d="M3819,-1983C3775.49,-1983 3829.94,-1230.6 3799,-1200 3702.3,-1104.35 1459.95,-1245.42 1351,-1164 1269.39,-1103.01 1252.58,-975.97 1250.14,-901.3"/> +<polygon fill="black" stroke="black" points="1253.64,-901.03 1249.89,-891.12 1246.64,-901.2 1253.64,-901.03"/> +</g> +<!-- CaseAttributeXRefNew->Strain --> +<g id="edge69" class="edge"> +<title>CaseAttributeXRefNew:StrainId->Strain</title> +<path fill="none" stroke="black" d="M4051,-1940C4092.12,-1940 4042.15,-1230.26 4070,-1200 4119.95,-1145.72 4327.27,-1176.34 4400,-1164 4905.53,-1078.2 5502.61,-920.46 5710.32,-863.88"/> +<polygon fill="black" stroke="black" points="5711.48,-867.19 5720.21,-861.18 5709.64,-860.44 5711.48,-867.19"/> +</g> +<!-- GenoCode 
--> +<g id="node77" class="node"> +<title>GenoCode</title> +<polygon fill="white" stroke="transparent" points="3486.5,-1907.5 3486.5,-2018.5 3619.5,-2018.5 3619.5,-1907.5 3486.5,-1907.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="3490,-1994 3490,-2015 3617,-2015 3617,-1994 3490,-1994"/> +<polygon fill="none" stroke="black" points="3490,-1994 3490,-2015 3617,-2015 3617,-1994 3490,-1994"/> +<text text-anchor="start" x="3493" y="-2000.8" font-family="Times,serif" font-size="14.00">GenoCode (40 B)</text> +<text text-anchor="start" x="3506.5" y="-1978.8" font-family="Times,serif" font-size="14.00">AlleleSymbol</text> +<text text-anchor="start" x="3516" y="-1957.8" font-family="Times,serif" font-size="14.00">AlleleType</text> +<text text-anchor="start" x="3500.5" y="-1936.8" font-family="Times,serif" font-size="14.00">DatabaseValue</text> +<text text-anchor="start" x="3510.5" y="-1915.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<polygon fill="none" stroke="black" points="3486.5,-1907.5 3486.5,-2018.5 3619.5,-2018.5 3619.5,-1907.5 3486.5,-1907.5"/> +</g> +<!-- GenoCode->InbredSet --> +<g id="edge70" class="edge"> +<title>GenoCode:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M3618,-1919C3657.96,-1919 3611.64,-1231.67 3636,-1200 3670.72,-1154.85 3718.61,-1204.16 3759,-1164 3801.12,-1122.13 3824.91,-1062.6 3838.29,-1006.16"/> +<polygon fill="black" stroke="black" points="3841.71,-1006.93 3840.51,-996.4 3834.88,-1005.38 3841.71,-1006.93"/> +</g> +<!-- ProbeSE->Strain --> +<g id="edge71" class="edge"> +<title>ProbeSE:StrainId->Strain</title> +<path fill="none" stroke="black" d="M6994,-1930C6953.43,-1930 6998.65,-1232.22 6974,-1200 6834.26,-1017.37 6100.93,-891 5861.61,-854.12"/> +<polygon fill="black" stroke="black" points="5862.02,-850.65 5851.61,-852.59 5860.96,-857.57 5862.02,-850.65"/> +</g> +<!-- Temp --> +<g id="node80" class="node"> +<title>Temp</title> +<polygon fill="white" stroke="transparent" 
points="4087.5,-1865.5 4087.5,-2060.5 4206.5,-2060.5 4206.5,-1865.5 4087.5,-1865.5"/> +<polygon fill="#df65b0" stroke="transparent" points="4091,-2036 4091,-2057 4204,-2057 4204,-2036 4091,-2036"/> +<polygon fill="none" stroke="black" points="4091,-2036 4091,-2057 4204,-2057 4204,-2036 4091,-2036"/> +<text text-anchor="start" x="4099" y="-2042.8" font-family="Times,serif" font-size="14.00">Temp (1 MiB)</text> +<text text-anchor="start" x="4108.5" y="-2020.8" font-family="Times,serif" font-size="14.00">createtime</text> +<text text-anchor="start" x="4123" y="-1999.8" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="4093" y="-1978.8" font-family="Times,serif" font-size="14.00">dbdisplayname</text> +<text text-anchor="start" x="4107" y="-1957.8" font-family="Times,serif" font-size="14.00">description</text> +<text text-anchor="start" x="4140" y="-1936.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="4104.5" y="-1915.8" font-family="Times,serif" font-size="14.00">InbredSetId</text> +<text text-anchor="start" x="4139.5" y="-1894.8" font-family="Times,serif" font-size="14.00">IP</text> +<text text-anchor="start" x="4126" y="-1873.8" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="none" stroke="black" points="4087.5,-1865.5 4087.5,-2060.5 4206.5,-2060.5 4206.5,-1865.5 4087.5,-1865.5"/> +</g> +<!-- Temp->InbredSet --> +<g id="edge72" class="edge"> +<title>Temp:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M4090,-1919C4070.02,-1919 4075.62,-1219.17 4070,-1200 4043.91,-1110.94 3990,-1021.51 3942.68,-954.43"/> +<polygon fill="black" stroke="black" points="3945.3,-952.07 3936.65,-945.95 3939.59,-956.12 3945.3,-952.07"/> +</g> +<!-- GenoData --> +<g id="node81" class="node"> +<title>GenoData</title> +<polygon fill="white" stroke="transparent" points="6256.5,-1918 6256.5,-2008 6403.5,-2008 6403.5,-1918 6256.5,-1918"/> +<polygon fill="#ce1256" 
stroke="transparent" points="6260,-1984 6260,-2005 6401,-2005 6401,-1984 6260,-1984"/> +<polygon fill="none" stroke="black" points="6260,-1984 6260,-2005 6401,-2005 6401,-1984 6260,-1984"/> +<text text-anchor="start" x="6263" y="-1990.8" font-family="Times,serif" font-size="14.00">GenoData (10 GiB)</text> +<text text-anchor="start" x="6323" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="6301" y="-1947.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="6311" y="-1926.8" font-family="Times,serif" font-size="14.00">value</text> +<polygon fill="none" stroke="black" points="6256.5,-1918 6256.5,-2008 6403.5,-2008 6403.5,-1918 6256.5,-1918"/> +</g> +<!-- GenoData->Strain --> +<g id="edge73" class="edge"> +<title>GenoData:StrainId->Strain</title> +<path fill="none" stroke="black" d="M6259,-1951C6217.26,-1951 6257.72,-1237.31 6239,-1200 6158.18,-1038.89 5967.05,-927.85 5860.69,-876.11"/> +<polygon fill="black" stroke="black" points="5862.14,-872.92 5851.61,-871.74 5859.11,-879.23 5862.14,-872.92"/> +</g> +<!-- GenoFreeze->InbredSet --> +<g id="edge74" class="edge"> +<title>GenoFreeze:InbredSetId->InbredSet</title> +<path fill="none" stroke="black" d="M4409,-1930C4368.43,-1930 4415.79,-1231.31 4390,-1200 4343.1,-1143.07 4293.94,-1197.05 4228,-1164 4118.16,-1108.94 4014.02,-1014.44 3943.83,-942.19"/> +<polygon fill="black" stroke="black" points="3946.19,-939.59 3936.73,-934.83 3941.15,-944.45 3946.19,-939.59"/> +</g> +<!-- ProbeSetData --> +<g id="node83" class="node"> +<title>ProbeSetData</title> +<polygon fill="white" stroke="transparent" points="6438,-1918 6438,-2008 6614,-2008 6614,-1918 6438,-1918"/> +<polygon fill="#ce1256" stroke="transparent" points="6441,-1984 6441,-2005 6611,-2005 6611,-1984 6441,-1984"/> +<polygon fill="none" stroke="black" points="6441,-1984 6441,-2005 6611,-2005 6611,-1984 6441,-1984"/> +<text text-anchor="start" x="6444" y="-1990.8" 
font-family="Times,serif" font-size="14.00">ProbeSetData (62 GiB)</text> +<text text-anchor="start" x="6518.5" y="-1968.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="6496.5" y="-1947.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="6506.5" y="-1926.8" font-family="Times,serif" font-size="14.00">value</text> +<polygon fill="none" stroke="black" points="6438,-1918 6438,-2008 6614,-2008 6614,-1918 6438,-1918"/> +</g> +<!-- ProbeSetData->Strain --> +<g id="edge75" class="edge"> +<title>ProbeSetData:StrainId->Strain</title> +<path fill="none" stroke="black" d="M6440,-1951C6398.26,-1951 6441.54,-1235.75 6420,-1200 6294.74,-992.11 6000.36,-895.18 5861.29,-859.75"/> +<polygon fill="black" stroke="black" points="5862.1,-856.35 5851.55,-857.31 5860.4,-863.14 5862.1,-856.35"/> +</g> +<!-- CeleraINFO_mm6 --> +<g id="node84" class="node"> +<title>CeleraINFO_mm6</title> +<polygon fill="white" stroke="transparent" points="12232,-4706 12232,-5069 12448,-5069 12448,-4706 12232,-4706"/> +<polygon fill="#df65b0" stroke="transparent" points="12235,-5044.5 12235,-5065.5 12445,-5065.5 12445,-5044.5 12235,-5044.5"/> +<polygon fill="none" stroke="black" points="12235,-5044.5 12235,-5065.5 12445,-5065.5 12445,-5044.5 12235,-5044.5"/> +<text text-anchor="start" x="12238" y="-5051.3" font-family="Times,serif" font-size="14.00">CeleraINFO_mm6 (780 MiB)</text> +<text text-anchor="start" x="12309.5" y="-5029.3" font-family="Times,serif" font-size="14.00">allele_AJ</text> +<text text-anchor="start" x="12307.5" y="-5008.3" font-family="Times,serif" font-size="14.00">allele_B6</text> +<text text-anchor="start" x="12307" y="-4987.3" font-family="Times,serif" font-size="14.00">allele_D2</text> +<text text-anchor="start" x="12308" y="-4966.3" font-family="Times,serif" font-size="14.00">allele_S1</text> +<text text-anchor="start" x="12308" y="-4945.3" font-family="Times,serif" font-size="14.00">allele_X1</text> +<text 
text-anchor="start" x="12319" y="-4924.3" font-family="Times,serif" font-size="14.00">B6_AJ</text> +<text text-anchor="start" x="12316.5" y="-4903.3" font-family="Times,serif" font-size="14.00">B6_D2</text> +<text text-anchor="start" x="12294.5" y="-4882.3" font-family="Times,serif" font-size="14.00">chromosome</text> +<text text-anchor="start" x="12318.5" y="-4861.3" font-family="Times,serif" font-size="14.00">D2_AJ</text> +<text text-anchor="start" x="12306.5" y="-4840.3" font-family="Times,serif" font-size="14.00">flanking3</text> +<text text-anchor="start" x="12306.5" y="-4819.3" font-family="Times,serif" font-size="14.00">flanking5</text> +<text text-anchor="start" x="12332.5" y="-4798.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="12302" y="-4777.3" font-family="Times,serif" font-size="14.00">MB_celera</text> +<text text-anchor="start" x="12302.5" y="-4756.3" font-family="Times,serif" font-size="14.00">MB_UCSC</text> +<text text-anchor="start" x="12283.5" y="-4735.3" font-family="Times,serif" font-size="14.00">MB_UCSC_OLD</text> +<text text-anchor="start" x="12315.5" y="-4714.3" font-family="Times,serif" font-size="14.00">SNPID</text> +<polygon fill="none" stroke="black" points="12232,-4706 12232,-5069 12448,-5069 12448,-4706 12232,-4706"/> +</g> +<!-- TableFieldAnnotation --> +<g id="node85" class="node"> +<title>TableFieldAnnotation</title> +<polygon fill="white" stroke="transparent" points="12482,-4842.5 12482,-4932.5 12710,-4932.5 12710,-4842.5 12482,-4842.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="12485,-4908.5 12485,-4929.5 12707,-4929.5 12707,-4908.5 12485,-4908.5"/> +<polygon fill="none" stroke="black" points="12485,-4908.5 12485,-4929.5 12707,-4929.5 12707,-4908.5 12485,-4908.5"/> +<text text-anchor="start" x="12488" y="-4915.3" font-family="Times,serif" font-size="14.00">TableFieldAnnotation (43 KiB)</text> +<text text-anchor="start" x="12556.5" y="-4893.3" font-family="Times,serif" 
font-size="14.00">Annotation</text> +<text text-anchor="start" x="12552" y="-4872.3" font-family="Times,serif" font-size="14.00">Foreign_Key</text> +<text text-anchor="start" x="12558.5" y="-4851.3" font-family="Times,serif" font-size="14.00">TableField</text> +<polygon fill="none" stroke="black" points="12482,-4842.5 12482,-4932.5 12710,-4932.5 12710,-4842.5 12482,-4842.5"/> +</g> +<!-- ProbeSet --> +<g id="node86" class="node"> +<title>ProbeSet</title> +<polygon fill="white" stroke="transparent" points="752.5,-1204 752.5,-2722 983.5,-2722 983.5,-1204 752.5,-1204"/> +<polygon fill="#ce1256" stroke="transparent" points="756,-2698 756,-2719 981,-2719 981,-2698 756,-2698"/> +<polygon fill="none" stroke="black" points="756,-2698 756,-2719 981,-2719 981,-2698 756,-2698"/> +<text text-anchor="start" x="808" y="-2704.8" font-family="Times,serif" font-size="14.00">ProbeSet (2 GiB)</text> +<text text-anchor="start" x="851.5" y="-2682.8" font-family="Times,serif" font-size="14.00">alias</text> +<text text-anchor="start" x="842.5" y="-2661.8" font-family="Times,serif" font-size="14.00">alias_H</text> +<text text-anchor="start" x="821.5" y="-2640.8" font-family="Times,serif" font-size="14.00">Biotype_ENS</text> +<text text-anchor="start" x="840" y="-2619.8" font-family="Times,serif" font-size="14.00">BlatSeq</text> +<text text-anchor="start" x="822" y="-2598.8" font-family="Times,serif" font-size="14.00">CAS_number</text> +<text text-anchor="start" x="833.5" y="-2577.8" font-family="Times,serif" font-size="14.00">ChEBI_ID</text> +<text text-anchor="start" x="824.5" y="-2556.8" font-family="Times,serif" font-size="14.00">ChEMBL_ID</text> +<text text-anchor="start" x="812.5" y="-2535.8" font-family="Times,serif" font-size="14.00">ChemSpider_ID</text> +<text text-anchor="start" x="844.5" y="-2514.8" font-family="Times,serif" font-size="14.00">ChipId</text> +<text text-anchor="start" x="855" y="-2493.8" font-family="Times,serif" font-size="14.00">Chr</text> +<text 
text-anchor="start" x="833.5" y="-2472.8" font-family="Times,serif" font-size="14.00">Chr_2016</text> +<text text-anchor="start" x="833.5" y="-2451.8" font-family="Times,serif" font-size="14.00">Chr_mm8</text> +<text text-anchor="start" x="837.5" y="-2430.8" font-family="Times,serif" font-size="14.00">chr_num</text> +<text text-anchor="start" x="813.5" y="-2409.8" font-family="Times,serif" font-size="14.00">chromosome_H</text> +<text text-anchor="start" x="831.5" y="-2388.8" font-family="Times,serif" font-size="14.00">comments</text> +<text text-anchor="start" x="829" y="-2367.8" font-family="Times,serif" font-size="14.00">Confidence</text> +<text text-anchor="start" x="828" y="-2346.8" font-family="Times,serif" font-size="14.00">description</text> +<text text-anchor="start" x="818.5" y="-2325.8" font-family="Times,serif" font-size="14.00">description_H</text> +<text text-anchor="start" x="826.5" y="-2304.8" font-family="Times,serif" font-size="14.00">EC_number</text> +<text text-anchor="start" x="804.5" y="-2283.8" font-family="Times,serif" font-size="14.00">ENSEMBLGeneId</text> +<text text-anchor="start" x="855" y="-2262.8" font-family="Times,serif" font-size="14.00">flag</text> +<text text-anchor="start" x="830" y="-2241.8" font-family="Times,serif" font-size="14.00">Flybase_Id</text> +<text text-anchor="start" x="829.5" y="-2220.8" font-family="Times,serif" font-size="14.00">GenbankId</text> +<text text-anchor="start" x="842.5" y="-2199.8" font-family="Times,serif" font-size="14.00">GeneId</text> +<text text-anchor="start" x="833.5" y="-2178.8" font-family="Times,serif" font-size="14.00">GeneId_H</text> +<text text-anchor="start" x="833.5" y="-2157.8" font-family="Times,serif" font-size="14.00">HGNC_ID</text> +<text text-anchor="start" x="832.5" y="-2136.8" font-family="Times,serif" font-size="14.00">HMDB_ID</text> +<text text-anchor="start" x="814" y="-2115.8" font-family="Times,serif" font-size="14.00">HomoloGeneID</text> +<text text-anchor="start" x="861" 
y="-2094.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="835" y="-2073.8" font-family="Times,serif" font-size="14.00">KEGG_ID</text> +<text text-anchor="start" x="856.5" y="-2052.8" font-family="Times,serif" font-size="14.00">Mb</text> +<text text-anchor="start" x="835" y="-2031.8" font-family="Times,serif" font-size="14.00">Mb_2016</text> +<text text-anchor="start" x="846.5" y="-2010.8" font-family="Times,serif" font-size="14.00">MB_H</text> +<text text-anchor="start" x="835" y="-1989.8" font-family="Times,serif" font-size="14.00">Mb_mm8</text> +<text text-anchor="start" x="804" y="-1968.8" font-family="Times,serif" font-size="14.00">Molecular_Weight</text> +<text text-anchor="start" x="847" y="-1947.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="829.5" y="-1926.8" font-family="Times,serif" font-size="14.00">name_num</text> +<text text-anchor="start" x="823" y="-1905.8" font-family="Times,serif" font-size="14.00">Nugowiki_ID</text> +<text text-anchor="start" x="845.5" y="-1884.8" font-family="Times,serif" font-size="14.00">OMIM</text> +<text text-anchor="start" x="806.5" y="-1863.8" font-family="Times,serif" font-size="14.00">PeptideSequence</text> +<text text-anchor="start" x="818.5" y="-1842.8" font-family="Times,serif" font-size="14.00">PrimaryName</text> +<text text-anchor="start" x="783.5" y="-1821.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_end</text> +<text text-anchor="start" x="762" y="-1800.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_end_2016</text> +<text text-anchor="start" x="762" y="-1779.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_end_mm8</text> +<text text-anchor="start" x="779.5" y="-1758.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_start</text> +<text text-anchor="start" x="758" y="-1737.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_start_2016</text> +<text 
text-anchor="start" x="758" y="-1716.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_start_mm8</text> +<text text-anchor="start" x="788.5" y="-1695.8" font-family="Times,serif" font-size="14.00">Probe_set_BLAT_score</text> +<text text-anchor="start" x="784.5" y="-1674.8" font-family="Times,serif" font-size="14.00">Probe_set_Note_by_RW</text> +<text text-anchor="start" x="793.5" y="-1653.8" font-family="Times,serif" font-size="14.00">Probe_set_specificity</text> +<text text-anchor="start" x="806.5" y="-1632.8" font-family="Times,serif" font-size="14.00">Probe_set_strand</text> +<text text-anchor="start" x="781" y="-1611.8" font-family="Times,serif" font-size="14.00">Probe_set_target_region</text> +<text text-anchor="start" x="776" y="-1590.8" font-family="Times,serif" font-size="14.00">Probe_Target_Description</text> +<text text-anchor="start" x="833.5" y="-1569.8" font-family="Times,serif" font-size="14.00">ProteinID</text> +<text text-anchor="start" x="821" y="-1548.8" font-family="Times,serif" font-size="14.00">ProteinName</text> +<text text-anchor="start" x="822" y="-1527.8" font-family="Times,serif" font-size="14.00">PubChem_ID</text> +<text text-anchor="start" x="795" y="-1506.8" font-family="Times,serif" font-size="14.00">RefSeq_TranscriptId</text> +<text text-anchor="start" x="840" y="-1485.8" font-family="Times,serif" font-size="14.00">RGD_ID</text> +<text text-anchor="start" x="806" y="-1464.8" font-family="Times,serif" font-size="14.00">SecondaryNames</text> +<text text-anchor="start" x="852.5" y="-1443.8" font-family="Times,serif" font-size="14.00">SNP</text> +<text text-anchor="start" x="822" y="-1422.8" font-family="Times,serif" font-size="14.00">Strand_Gene</text> +<text text-anchor="start" x="819.5" y="-1401.8" font-family="Times,serif" font-size="14.00">Strand_Probe</text> +<text text-anchor="start" x="841.5" y="-1380.8" font-family="Times,serif" font-size="14.00">Symbol</text> +<text text-anchor="start" x="832" y="-1359.8" 
font-family="Times,serif" font-size="14.00">Symbol_H</text> +<text text-anchor="start" x="838" y="-1338.8" font-family="Times,serif" font-size="14.00">TargetId</text> +<text text-anchor="start" x="831.5" y="-1317.8" font-family="Times,serif" font-size="14.00">TargetSeq</text> +<text text-anchor="start" x="845.5" y="-1296.8" font-family="Times,serif" font-size="14.00">Tissue</text> +<text text-anchor="start" x="851" y="-1275.8" font-family="Times,serif" font-size="14.00">Type</text> +<text text-anchor="start" x="830" y="-1254.8" font-family="Times,serif" font-size="14.00">UniGeneId</text> +<text text-anchor="start" x="839.5" y="-1233.8" font-family="Times,serif" font-size="14.00">UNII_ID</text> +<text text-anchor="start" x="832" y="-1212.8" font-family="Times,serif" font-size="14.00">UniProtID</text> +<polygon fill="none" stroke="black" points="752.5,-1204 752.5,-2722 983.5,-2722 983.5,-1204 752.5,-1204"/> +</g> +<!-- ProbeSet->Genbank --> +<g id="edge76" class="edge"> +<title>ProbeSet:GenbankId->Genbank</title> +<path fill="none" stroke="black" d="M755,-2225C726.53,-2225 752.7,-1228.28 756,-1200 768.49,-1092.85 801.24,-971.17 821.96,-901.12"/> +<polygon fill="black" stroke="black" points="825.42,-901.75 824.93,-891.16 818.72,-899.75 825.42,-901.75"/> +</g> +<!-- GenoFile --> +<g id="node87" class="node"> +<title>GenoFile</title> +<polygon fill="white" stroke="transparent" points="4240.5,-1886.5 4240.5,-2039.5 4373.5,-2039.5 4373.5,-1886.5 4240.5,-1886.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="4244,-2015 4244,-2036 4371,-2036 4371,-2015 4244,-2015"/> +<polygon fill="none" stroke="black" points="4244,-2015 4244,-2036 4371,-2036 4371,-2015 4244,-2015"/> +<text text-anchor="start" x="4247" y="-2021.8" font-family="Times,serif" font-size="14.00">GenoFile (332 B)</text> +<text text-anchor="start" x="4300.5" y="-1999.8" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="4263.5" y="-1978.8" font-family="Times,serif" 
font-size="14.00">InbredSetID</text> +<text text-anchor="start" x="4279" y="-1957.8" font-family="Times,serif" font-size="14.00">location</text> +<text text-anchor="start" x="4284.5" y="-1936.8" font-family="Times,serif" font-size="14.00">server</text> +<text text-anchor="start" x="4293" y="-1915.8" font-family="Times,serif" font-size="14.00">sort</text> +<text text-anchor="start" x="4292.5" y="-1894.8" font-family="Times,serif" font-size="14.00">title</text> +<polygon fill="none" stroke="black" points="4240.5,-1886.5 4240.5,-2039.5 4373.5,-2039.5 4373.5,-1886.5 4240.5,-1886.5"/> +</g> +<!-- GenoFile->InbredSet --> +<g id="edge77" class="edge"> +<title>GenoFile:InbredSetID->InbredSet</title> +<path fill="none" stroke="black" d="M4243,-1983C4221.24,-1983 4231.73,-1219.93 4223,-1200 4165.37,-1068.5 4034.27,-960.98 3945.16,-899.43"/> +<polygon fill="black" stroke="black" points="3946.9,-896.38 3936.67,-893.62 3942.95,-902.16 3946.9,-896.38"/> +</g> +<!-- TempData --> +<g id="node88" class="node"> +<title>TempData</title> +<polygon fill="white" stroke="transparent" points="6636,-3228 6636,-3360 6788,-3360 6788,-3228 6636,-3228"/> +<polygon fill="#df65b0" stroke="transparent" points="6639,-3336 6639,-3357 6785,-3357 6785,-3336 6639,-3336"/> +<polygon fill="none" stroke="black" points="6639,-3336 6639,-3357 6785,-3357 6785,-3336 6639,-3336"/> +<text text-anchor="start" x="6642" y="-3342.8" font-family="Times,serif" font-size="14.00">TempData (11 MiB)</text> +<text text-anchor="start" x="6704.5" y="-3320.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="6683.5" y="-3299.8" font-family="Times,serif" font-size="14.00">NStrain</text> +<text text-anchor="start" x="6701.5" y="-3278.8" font-family="Times,serif" font-size="14.00">SE</text> +<text text-anchor="start" x="6682.5" y="-3257.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="6692.5" y="-3236.8" font-family="Times,serif" 
font-size="14.00">value</text> +<polygon fill="none" stroke="black" points="6636,-3228 6636,-3360 6788,-3360 6788,-3228 6636,-3228"/> +</g> +<!-- TempData->NStrain --> +<g id="edge78" class="edge"> +<title>TempData:NStrain->NStrain</title> +<path fill="none" stroke="black" d="M6786,-3304C6851.17,-3304 6745.87,-2280.14 6718.32,-2022.36"/> +<polygon fill="black" stroke="black" points="6721.77,-2021.66 6717.22,-2012.09 6714.81,-2022.4 6721.77,-2021.66"/> +</g> +<!-- TempData->Strain --> +<g id="edge79" class="edge"> +<title>TempData:StrainId->Strain</title> +<path fill="none" stroke="black" d="M6786,-3261C6799.61,-3261 6829.44,-1253.01 6792,-1200 6572.1,-888.62 6056.1,-847.14 5861.8,-842.87"/> +<polygon fill="black" stroke="black" points="5861.75,-839.37 5851.68,-842.67 5861.61,-846.36 5861.75,-839.37"/> +</g> +<!-- CaseAttributeXRef --> +<g id="node89" class="node"> +<title>CaseAttributeXRef</title> +<polygon fill="white" stroke="transparent" points="2630,-4832 2630,-4943 2848,-4943 2848,-4832 2630,-4832"/> +<polygon fill="#d7b5d8" stroke="transparent" points="2633,-4918.5 2633,-4939.5 2845,-4939.5 2845,-4918.5 2633,-4918.5"/> +<polygon fill="none" stroke="black" points="2633,-4918.5 2633,-4939.5 2845,-4939.5 2845,-4918.5 2633,-4918.5"/> +<text text-anchor="start" x="2636" y="-4925.3" font-family="Times,serif" font-size="14.00">CaseAttributeXRef (753 KiB)</text> +<text text-anchor="start" x="2681.5" y="-4903.3" font-family="Times,serif" font-size="14.00">CaseAttributeId</text> +<text text-anchor="start" x="2674" y="-4882.3" font-family="Times,serif" font-size="14.00">ProbeSetFreezeId</text> +<text text-anchor="start" x="2709.5" y="-4861.3" font-family="Times,serif" font-size="14.00">StrainId</text> +<text text-anchor="start" x="2719" y="-4840.3" font-family="Times,serif" font-size="14.00">Value</text> +<polygon fill="none" stroke="black" points="2630,-4832 2630,-4943 2848,-4943 2848,-4832 2630,-4832"/> +</g> +<!-- CaseAttributeXRef->CaseAttribute --> +<g id="edge80" 
class="edge"> +<title>CaseAttributeXRef:CaseAttributeId->CaseAttribute</title> +<path fill="none" stroke="black" d="M2632,-4907.5C859.27,-4907.5 1188.58,-1398.42 1244.12,-901.29"/> +<polygon fill="black" stroke="black" points="1247.63,-901.45 1245.27,-891.12 1240.67,-900.66 1247.63,-901.45"/> +</g> +<!-- CaseAttributeXRef->Strain --> +<g id="edge82" class="edge"> +<title>CaseAttributeXRef:StrainId->Strain</title> +<path fill="none" stroke="black" d="M2846,-4864.5C3071.96,-4864.5 2844.72,-4009.37 3016,-3862 3099.31,-3790.32 4915.51,-3902.94 4994,-3826 5098.23,-3723.83 4995.8,-1323.24 5074,-1200 5218.94,-971.59 5558.15,-883.8 5710.07,-855.09"/> +<polygon fill="black" stroke="black" points="5711.05,-858.47 5720.24,-853.2 5709.77,-851.59 5711.05,-858.47"/> +</g> +<!-- CaseAttributeXRef->ProbeSetFreeze --> +<g id="edge81" class="edge"> +<title>CaseAttributeXRef:ProbeSetFreezeId->ProbeSetFreeze</title> +<path fill="none" stroke="black" d="M2846,-4885.5C3129.96,-4885.5 2889.92,-3863.52 2783.5,-3457.98"/> +<polygon fill="black" stroke="black" points="2786.86,-3457.01 2780.93,-3448.23 2780.09,-3458.79 2786.86,-3457.01"/> +</g> +<!-- ProbeSetFreeze->ProbeFreeze --> +<g id="edge83" class="edge"> +<title>ProbeSetFreeze:ProbeFreezeId->ProbeFreeze</title> +<path fill="none" stroke="black" d="M2642,-3198C2531.36,-3198 2632.91,-2395.98 2676.43,-2085.09"/> +<polygon fill="black" stroke="black" points="2679.9,-2085.53 2677.83,-2075.14 2672.97,-2084.56 2679.9,-2085.53"/> +</g> +<!-- temporary --> +<g id="node91" class="node"> +<title>temporary</title> +<polygon fill="white" stroke="transparent" points="12744.5,-4811 12744.5,-4964 12889.5,-4964 12889.5,-4811 12744.5,-4811"/> +<polygon fill="#df65b0" stroke="transparent" points="12748,-4939.5 12748,-4960.5 12887,-4960.5 12887,-4939.5 12748,-4939.5"/> +<polygon fill="none" stroke="black" points="12748,-4939.5 12748,-4960.5 12887,-4960.5 12887,-4939.5 12748,-4939.5"/> +<text text-anchor="start" x="12751" y="-4946.3" 
font-family="Times,serif" font-size="14.00">temporary (4 MiB)</text> +<text text-anchor="start" x="12790.5" y="-4924.3" font-family="Times,serif" font-size="14.00">GeneID</text> +<text text-anchor="start" x="12771.5" y="-4903.3" font-family="Times,serif" font-size="14.00">HomoloGene</text> +<text text-anchor="start" x="12794.5" y="-4882.3" font-family="Times,serif" font-size="14.00">OMIM</text> +<text text-anchor="start" x="12766.5" y="-4861.3" font-family="Times,serif" font-size="14.00">Other_GeneID</text> +<text text-anchor="start" x="12790.5" y="-4840.3" font-family="Times,serif" font-size="14.00">Symbol</text> +<text text-anchor="start" x="12796" y="-4819.3" font-family="Times,serif" font-size="14.00">tax_id</text> +<polygon fill="none" stroke="black" points="12744.5,-4811 12744.5,-4964 12889.5,-4964 12889.5,-4811 12744.5,-4811"/> +</g> +<!-- Chr_Length --> +<g id="node92" class="node"> +<title>Chr_Length</title> +<polygon fill="white" stroke="transparent" points="1368,-765.5 1368,-918.5 1518,-918.5 1518,-765.5 1368,-765.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="1371,-894 1371,-915 1515,-915 1515,-894 1371,-894"/> +<polygon fill="none" stroke="black" points="1371,-894 1371,-915 1515,-915 1515,-894 1371,-894"/> +<text text-anchor="start" x="1374" y="-900.8" font-family="Times,serif" font-size="14.00">Chr_Length (2 KiB)</text> +<text text-anchor="start" x="1417.5" y="-878.8" font-family="Times,serif" font-size="14.00">Length</text> +<text text-anchor="start" x="1396" y="-857.8" font-family="Times,serif" font-size="14.00">Length_2016</text> +<text text-anchor="start" x="1396" y="-836.8" font-family="Times,serif" font-size="14.00">Length_mm8</text> +<text text-anchor="start" x="1421.5" y="-815.8" font-family="Times,serif" font-size="14.00">Name</text> +<text text-anchor="start" x="1414.5" y="-794.8" font-family="Times,serif" font-size="14.00">OrderId</text> +<text text-anchor="start" x="1408" y="-773.8" font-family="Times,serif" 
font-size="14.00">SpeciesId</text> +<polygon fill="none" stroke="black" points="1368,-765.5 1368,-918.5 1518,-918.5 1518,-765.5 1368,-765.5"/> +</g> +<!-- Chr_Length->Species --> +<g id="edge84" class="edge"> +<title>Chr_Length:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M1516,-777C1544.63,-777 1515.78,-541.23 1535,-520 1694.07,-344.29 2463.44,-308.31 2715.71,-301.19"/> +<polygon fill="black" stroke="black" points="2716,-304.69 2725.9,-300.91 2715.81,-297.69 2716,-304.69"/> +</g> +<!-- GenoSE --> +<g id="node93" class="node"> +<title>GenoSE</title> +<polygon fill="white" stroke="transparent" points="6848.5,-1918 6848.5,-2008 6957.5,-2008 6957.5,-1918 6848.5,-1918"/> +<polygon fill="#f1eef6" stroke="transparent" points="6852,-1984 6852,-2005 6955,-2005 6955,-1984 6852,-1984"/> +<polygon fill="none" stroke="black" points="6852,-1984 6852,-2005 6955,-2005 6955,-1984 6852,-1984"/> +<text text-anchor="start" x="6855" y="-1990.8" font-family="Times,serif" font-size="14.00">GenoSE (0 B)</text> +<text text-anchor="start" x="6879" y="-1968.8" font-family="Times,serif" font-size="14.00">DataId</text> +<text text-anchor="start" x="6885" y="-1947.8" font-family="Times,serif" font-size="14.00">error</text> +<text text-anchor="start" x="6874" y="-1926.8" font-family="Times,serif" font-size="14.00">StrainId</text> +<polygon fill="none" stroke="black" points="6848.5,-1918 6848.5,-2008 6957.5,-2008 6957.5,-1918 6848.5,-1918"/> +</g> +<!-- GenoSE->Strain --> +<g id="edge85" class="edge"> +<title>GenoSE:StrainId->Strain</title> +<path fill="none" stroke="black" d="M6851,-1930C6810.42,-1930 6850.14,-1232.62 6826,-1200 6591.69,-883.44 6059.6,-845.25 5861.86,-842.35"/> +<polygon fill="black" stroke="black" points="5861.61,-838.85 5851.57,-842.23 5861.53,-845.85 5861.61,-838.85"/> +</g> +<!-- ProbeH2 --> +<g id="node94" class="node"> +<title>ProbeH2</title> +<polygon fill="white" stroke="transparent" points="5788.5,-4832 5788.5,-4943 5921.5,-4943 5921.5,-4832 
5788.5,-4832"/> +<polygon fill="#df65b0" stroke="transparent" points="5792,-4918.5 5792,-4939.5 5919,-4939.5 5919,-4918.5 5792,-4918.5"/> +<polygon fill="none" stroke="black" points="5792,-4918.5 5792,-4939.5 5919,-4939.5 5919,-4918.5 5792,-4918.5"/> +<text text-anchor="start" x="5795" y="-4925.3" font-family="Times,serif" font-size="14.00">ProbeH2 (9 MiB)</text> +<text text-anchor="start" x="5846" y="-4903.3" font-family="Times,serif" font-size="14.00">h2</text> +<text text-anchor="start" x="5802.5" y="-4882.3" font-family="Times,serif" font-size="14.00">ProbeFreezeId</text> +<text text-anchor="start" x="5827" y="-4861.3" font-family="Times,serif" font-size="14.00">ProbeId</text> +<text text-anchor="start" x="5831" y="-4840.3" font-family="Times,serif" font-size="14.00">weight</text> +<polygon fill="none" stroke="black" points="5788.5,-4832 5788.5,-4943 5921.5,-4943 5921.5,-4832 5788.5,-4832"/> +</g> +<!-- ProbeH2->Probe --> +<g id="edge87" class="edge"> +<title>ProbeH2:ProbeId->Probe</title> +<path fill="none" stroke="black" d="M5920,-4864.5C6401.38,-4864.5 5940.09,-4144.3 6330,-3862 6421.67,-3795.63 6755.1,-3903.04 6838,-3826 6948.34,-3723.46 6950.01,-3538.6 6936.27,-3416.32"/> +<polygon fill="black" stroke="black" points="6939.72,-3415.69 6935.07,-3406.16 6932.76,-3416.5 6939.72,-3415.69"/> +</g> +<!-- ProbeH2->ProbeFreeze --> +<g id="edge86" class="edge"> +<title>ProbeH2:ProbeFreezeId->ProbeFreeze</title> +<path fill="none" stroke="black" d="M5791,-4885.5C5212.27,-4885.5 5503.91,-4120.25 4986,-3862 4899.92,-3819.08 3329.71,-3886.58 3255,-3826 2877.83,-3520.19 3360.75,-3094.62 3007,-2762 2937.05,-2696.23 2860.62,-2795.13 2794,-2726 2629.09,-2554.88 2645.25,-2253.02 2670.34,-2085.17"/> +<polygon fill="black" stroke="black" points="2673.84,-2085.47 2671.89,-2075.05 2666.92,-2084.41 2673.84,-2085.47"/> +</g> +<!-- MappingMethod --> +<g id="node96" class="node"> +<title>MappingMethod</title> +<polygon fill="white" stroke="transparent" points="12923.5,-4853 
12923.5,-4922 13110.5,-4922 13110.5,-4853 12923.5,-4853"/> +<polygon fill="#f1eef6" stroke="transparent" points="12927,-4897.5 12927,-4918.5 13108,-4918.5 13108,-4897.5 12927,-4897.5"/> +<polygon fill="none" stroke="black" points="12927,-4897.5 12927,-4918.5 13108,-4918.5 13108,-4897.5 12927,-4897.5"/> +<text text-anchor="start" x="12930" y="-4904.3" font-family="Times,serif" font-size="14.00">MappingMethod (100 B)</text> +<text text-anchor="start" x="13010" y="-4882.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="12996" y="-4861.3" font-family="Times,serif" font-size="14.00">Name</text> +<polygon fill="none" stroke="black" points="12923.5,-4853 12923.5,-4922 13110.5,-4922 13110.5,-4853 12923.5,-4853"/> +</g> +<!-- SnpAll --> +<g id="node97" class="node"> +<title>SnpAll</title> +<polygon fill="white" stroke="transparent" points="1552,-524 1552,-1160 1746,-1160 1746,-524 1552,-524"/> +<polygon fill="#ce1256" stroke="transparent" points="1555,-1136 1555,-1157 1743,-1157 1743,-1136 1555,-1136"/> +<polygon fill="none" stroke="black" points="1555,-1136 1555,-1157 1743,-1157 1743,-1136 1555,-1136"/> +<text text-anchor="start" x="1593.5" y="-1142.8" font-family="Times,serif" font-size="14.00">SnpAll (11 GiB)</text> +<text text-anchor="start" x="1603.5" y="-1120.8" font-family="Times,serif" font-size="14.00">3Prime_UTR</text> +<text text-anchor="start" x="1603.5" y="-1099.8" font-family="Times,serif" font-size="14.00">5Prime_UTR</text> +<text text-anchor="start" x="1625" y="-1078.8" font-family="Times,serif" font-size="14.00">Alleles</text> +<text text-anchor="start" x="1602" y="-1057.8" font-family="Times,serif" font-size="14.00">Chromosome</text> +<text text-anchor="start" x="1581" y="-1036.8" font-family="Times,serif" font-size="14.00">ConservationScore</text> +<text text-anchor="start" x="1621.5" y="-1015.8" font-family="Times,serif" font-size="14.00">Domain</text> +<text text-anchor="start" x="1603.5" y="-994.8" 
font-family="Times,serif" font-size="14.00">Downstream</text> +<text text-anchor="start" x="1630.5" y="-973.8" font-family="Times,serif" font-size="14.00">Exon</text> +<text text-anchor="start" x="1630.5" y="-952.8" font-family="Times,serif" font-size="14.00">Gene</text> +<text text-anchor="start" x="1641.5" y="-931.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="1612" y="-910.8" font-family="Times,serif" font-size="14.00">Intergenic</text> +<text text-anchor="start" x="1626.5" y="-889.8" font-family="Times,serif" font-size="14.00">Intron</text> +<text text-anchor="start" x="1591.5" y="-868.8" font-family="Times,serif" font-size="14.00">Non_Splice_Site</text> +<text text-anchor="start" x="1557" y="-847.8" font-family="Times,serif" font-size="14.00">Non_Synonymous_Coding</text> +<text text-anchor="start" x="1620" y="-826.8" font-family="Times,serif" font-size="14.00">Position</text> +<text text-anchor="start" x="1599" y="-805.8" font-family="Times,serif" font-size="14.00">Position_2016</text> +<text text-anchor="start" x="1639.5" y="-784.8" font-family="Times,serif" font-size="14.00">Rs</text> +<text text-anchor="start" x="1614" y="-763.8" font-family="Times,serif" font-size="14.00">SnpName</text> +<text text-anchor="start" x="1624" y="-742.8" font-family="Times,serif" font-size="14.00">Source</text> +<text text-anchor="start" x="1614" y="-721.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="1609.5" y="-700.8" font-family="Times,serif" font-size="14.00">Splice_Site</text> +<text text-anchor="start" x="1602" y="-679.8" font-family="Times,serif" font-size="14.00">Start_Gained</text> +<text text-anchor="start" x="1611.5" y="-658.8" font-family="Times,serif" font-size="14.00">Start_Lost</text> +<text text-anchor="start" x="1603.5" y="-637.8" font-family="Times,serif" font-size="14.00">Stop_Gained</text> +<text text-anchor="start" x="1613.5" y="-616.8" font-family="Times,serif" 
font-size="14.00">Stop_Lost</text> +<text text-anchor="start" x="1575" y="-595.8" font-family="Times,serif" font-size="14.00">Synonymous_Coding</text> +<text text-anchor="start" x="1611.5" y="-574.8" font-family="Times,serif" font-size="14.00">Transcript</text> +<text text-anchor="start" x="1558.5" y="-553.8" font-family="Times,serif" font-size="14.00">Unknown_Effect_In_Exon</text> +<text text-anchor="start" x="1613" y="-532.8" font-family="Times,serif" font-size="14.00">Upstream</text> +<polygon fill="none" stroke="black" points="1552,-524 1552,-1160 1746,-1160 1746,-524 1552,-524"/> +</g> +<!-- SnpAll->Species --> +<g id="edge88" class="edge"> +<title>SnpAll:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M1744,-725C1789.75,-725 1732.61,-554.2 1763,-520 1889.95,-377.13 2495.01,-320.73 2715.44,-304.71"/> +<polygon fill="black" stroke="black" points="2715.91,-308.18 2725.64,-303.98 2715.41,-301.2 2715.91,-308.18"/> +</g> +<!-- GeneInfo --> +<g id="node98" class="node"> +<title>GeneInfo</title> +<polygon fill="white" stroke="transparent" points="2150,-671 2150,-1013 2338,-1013 2338,-671 2150,-671"/> +<polygon fill="#df65b0" stroke="transparent" points="2153,-989 2153,-1010 2335,-1010 2335,-989 2153,-989"/> +<polygon fill="none" stroke="black" points="2153,-989 2153,-1010 2335,-1010 2335,-989 2153,-989"/> +<text text-anchor="start" x="2178" y="-995.8" font-family="Times,serif" font-size="14.00">GeneInfo (23 MiB)</text> +<text text-anchor="start" x="2226.5" y="-973.8" font-family="Times,serif" font-size="14.00">Alias</text> +<text text-anchor="start" x="2215.5" y="-952.8" font-family="Times,serif" font-size="14.00">BlatSeq</text> +<text text-anchor="start" x="2230.5" y="-931.8" font-family="Times,serif" font-size="14.00">Chr</text> +<text text-anchor="start" x="2218" y="-910.8" font-family="Times,serif" font-size="14.00">GeneId</text> +<text text-anchor="start" x="2189.5" y="-889.8" font-family="Times,serif" font-size="14.00">HomoloGeneID</text> +<text 
text-anchor="start" x="2236.5" y="-868.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="2232" y="-847.8" font-family="Times,serif" font-size="14.00">Mb</text> +<text text-anchor="start" x="2221" y="-826.8" font-family="Times,serif" font-size="14.00">OMIM</text> +<text text-anchor="start" x="2159" y="-805.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_end</text> +<text text-anchor="start" x="2155" y="-784.8" font-family="Times,serif" font-size="14.00">Probe_set_Blat_Mb_start</text> +<text text-anchor="start" x="2209" y="-763.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="2197.5" y="-742.8" font-family="Times,serif" font-size="14.00">Strand_Gene</text> +<text text-anchor="start" x="2195" y="-721.8" font-family="Times,serif" font-size="14.00">Strand_Probe</text> +<text text-anchor="start" x="2217" y="-700.8" font-family="Times,serif" font-size="14.00">Symbol</text> +<text text-anchor="start" x="2224" y="-679.8" font-family="Times,serif" font-size="14.00">TaxId</text> +<polygon fill="none" stroke="black" points="2150,-671 2150,-1013 2338,-1013 2338,-671 2150,-671"/> +</g> +<!-- GeneInfo->Species --> +<g id="edge89" class="edge"> +<title>GeneInfo:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M2336,-767C2363.53,-767 2339.64,-542.84 2355,-520 2438.32,-396.09 2612.85,-338.66 2715.61,-314.65"/> +<polygon fill="black" stroke="black" points="2716.66,-318 2725.63,-312.36 2715.1,-311.18 2716.66,-318"/> +</g> +<!-- GeneList_rn3 --> +<g id="node99" class="node"> +<title>GeneList_rn3</title> +<polygon fill="white" stroke="transparent" points="552,-1718.5 552,-2207.5 718,-2207.5 718,-1718.5 552,-1718.5"/> +<polygon fill="#df65b0" stroke="transparent" points="555,-2183 555,-2204 715,-2204 715,-2183 555,-2183"/> +<polygon fill="none" stroke="black" points="555,-2183 555,-2204 715,-2204 715,-2183 555,-2183"/> +<text text-anchor="start" x="558" y="-2189.8" 
font-family="Times,serif" font-size="14.00">GeneList_rn3 (5 MiB)</text> +<text text-anchor="start" x="589.5" y="-2167.8" font-family="Times,serif" font-size="14.00">chromosome</text> +<text text-anchor="start" x="621.5" y="-2146.8" font-family="Times,serif" font-size="14.00">flag</text> +<text text-anchor="start" x="595.5" y="-2125.8" font-family="Times,serif" font-size="14.00">genBankID</text> +<text text-anchor="start" x="576" y="-2104.8" font-family="Times,serif" font-size="14.00">geneDescription</text> +<text text-anchor="start" x="609" y="-2083.8" font-family="Times,serif" font-size="14.00">geneID</text> +<text text-anchor="start" x="591" y="-2062.8" font-family="Times,serif" font-size="14.00">geneSymbol</text> +<text text-anchor="start" x="628" y="-2041.8" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="607" y="-2020.8" font-family="Times,serif" font-size="14.00">identity</text> +<text text-anchor="start" x="618" y="-1999.8" font-family="Times,serif" font-size="14.00">kgID</text> +<text text-anchor="start" x="601.5" y="-1978.8" font-family="Times,serif" font-size="14.00">ProbeSet</text> +<text text-anchor="start" x="616" y="-1957.8" font-family="Times,serif" font-size="14.00">qEnd</text> +<text text-anchor="start" x="615" y="-1936.8" font-family="Times,serif" font-size="14.00">qSize</text> +<text text-anchor="start" x="612" y="-1915.8" font-family="Times,serif" font-size="14.00">qStart</text> +<text text-anchor="start" x="615.5" y="-1894.8" font-family="Times,serif" font-size="14.00">score</text> +<text text-anchor="start" x="601.5" y="-1873.8" font-family="Times,serif" font-size="14.00">sequence</text> +<text text-anchor="start" x="618" y="-1852.8" font-family="Times,serif" font-size="14.00">span</text> +<text text-anchor="start" x="598.5" y="-1831.8" font-family="Times,serif" font-size="14.00">specificity</text> +<text text-anchor="start" x="611.5" y="-1810.8" font-family="Times,serif" font-size="14.00">strand</text> 
+<text text-anchor="start" x="613.5" y="-1789.8" font-family="Times,serif" font-size="14.00">txEnd</text> +<text text-anchor="start" x="612.5" y="-1768.8" font-family="Times,serif" font-size="14.00">txSize</text> +<text text-anchor="start" x="609" y="-1747.8" font-family="Times,serif" font-size="14.00">txStart</text> +<text text-anchor="start" x="602" y="-1726.8" font-family="Times,serif" font-size="14.00">unigenID</text> +<polygon fill="none" stroke="black" points="552,-1718.5 552,-2207.5 718,-2207.5 718,-1718.5 552,-1718.5"/> +</g> +<!-- GeneList_rn3->Genbank --> +<g id="edge90" class="edge"> +<title>GeneList_rn3:genBankID->Genbank</title> +<path fill="none" stroke="black" d="M716,-2130C741.84,-2130 729.38,-1225.22 735,-1200 738.81,-1182.91 745.09,-1180.48 751,-1164 783.34,-1073.83 811.09,-965.96 826.65,-901.05"/> +<polygon fill="black" stroke="black" points="830.13,-901.54 829.04,-891 823.32,-899.92 830.13,-901.54"/> +</g> +<!-- News --> +<g id="node100" class="node"> +<title>News</title> +<polygon fill="white" stroke="transparent" points="13145,-4842.5 13145,-4932.5 13269,-4932.5 13269,-4842.5 13145,-4842.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="13148,-4908.5 13148,-4929.5 13266,-4929.5 13266,-4908.5 13148,-4908.5"/> +<polygon fill="none" stroke="black" points="13148,-4908.5 13148,-4929.5 13266,-4929.5 13266,-4908.5 13148,-4908.5"/> +<text text-anchor="start" x="13151" y="-4915.3" font-family="Times,serif" font-size="14.00">News (167 KiB)</text> +<text text-anchor="start" x="13191" y="-4893.3" font-family="Times,serif" font-size="14.00">date</text> +<text text-anchor="start" x="13182.5" y="-4872.3" font-family="Times,serif" font-size="14.00">details</text> +<text text-anchor="start" x="13200" y="-4851.3" font-family="Times,serif" font-size="14.00">id</text> +<polygon fill="none" stroke="black" points="13145,-4842.5 13145,-4932.5 13269,-4932.5 13269,-4842.5 13145,-4842.5"/> +</g> +<!-- login --> +<g id="node101" class="node"> 
+<title>login</title> +<polygon fill="white" stroke="transparent" points="13303.5,-4800.5 13303.5,-4974.5 13414.5,-4974.5 13414.5,-4800.5 13303.5,-4800.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="13307,-4950.5 13307,-4971.5 13412,-4971.5 13412,-4950.5 13307,-4950.5"/> +<polygon fill="none" stroke="black" points="13307,-4950.5 13307,-4971.5 13412,-4971.5 13412,-4950.5 13307,-4950.5"/> +<text text-anchor="start" x="13310" y="-4957.3" font-family="Times,serif" font-size="14.00">login (52 KiB)</text> +<text text-anchor="start" x="13315.5" y="-4935.3" font-family="Times,serif" font-size="14.00">assumed_by</text> +<text text-anchor="start" x="13352.5" y="-4914.3" font-family="Times,serif" font-size="14.00">id</text> +<text text-anchor="start" x="13321" y="-4893.3" font-family="Times,serif" font-size="14.00">ip_address</text> +<text text-anchor="start" x="13323" y="-4872.3" font-family="Times,serif" font-size="14.00">session_id</text> +<text text-anchor="start" x="13322.5" y="-4851.3" font-family="Times,serif" font-size="14.00">successful</text> +<text text-anchor="start" x="13321" y="-4830.3" font-family="Times,serif" font-size="14.00">timestamp</text> +<text text-anchor="start" x="13343.5" y="-4809.3" font-family="Times,serif" font-size="14.00">user</text> +<polygon fill="none" stroke="black" points="13303.5,-4800.5 13303.5,-4974.5 13414.5,-4974.5 13414.5,-4800.5 13303.5,-4800.5"/> +</g> +<!-- GeneList --> +<g id="node102" class="node"> +<title>GeneList</title> +<polygon fill="white" stroke="transparent" points="1017.5,-1582 1017.5,-2344 1164.5,-2344 1164.5,-1582 1017.5,-1582"/> +<polygon fill="#df65b0" stroke="transparent" points="1021,-2320 1021,-2341 1162,-2341 1162,-2320 1021,-2320"/> +<polygon fill="none" stroke="black" points="1021,-2320 1021,-2341 1162,-2341 1162,-2320 1021,-2320"/> +<text text-anchor="start" x="1026" y="-2326.8" font-family="Times,serif" font-size="14.00">GeneList (37 MiB)</text> +<text text-anchor="start" x="1064.5" y="-2304.8" 
font-family="Times,serif" font-size="14.00">AlignID</text> +<text text-anchor="start" x="1065" y="-2283.8" font-family="Times,serif" font-size="14.00">cdsEnd</text> +<text text-anchor="start" x="1043.5" y="-2262.8" font-family="Times,serif" font-size="14.00">cdsEnd_2016</text> +<text text-anchor="start" x="1043.5" y="-2241.8" font-family="Times,serif" font-size="14.00">cdsEnd_mm8</text> +<text text-anchor="start" x="1061" y="-2220.8" font-family="Times,serif" font-size="14.00">cdsStart</text> +<text text-anchor="start" x="1039.5" y="-2199.8" font-family="Times,serif" font-size="14.00">cdsStart_2016</text> +<text text-anchor="start" x="1039.5" y="-2178.8" font-family="Times,serif" font-size="14.00">cdsStart_mm8</text> +<text text-anchor="start" x="1044.5" y="-2157.8" font-family="Times,serif" font-size="14.00">Chromosome</text> +<text text-anchor="start" x="1023" y="-2136.8" font-family="Times,serif" font-size="14.00">Chromosome_mm8</text> +<text text-anchor="start" x="1053" y="-2115.8" font-family="Times,serif" font-size="14.00">exonCount</text> +<text text-anchor="start" x="1031.5" y="-2094.8" font-family="Times,serif" font-size="14.00">exonCount_mm8</text> +<text text-anchor="start" x="1056.5" y="-2073.8" font-family="Times,serif" font-size="14.00">exonEnds</text> +<text text-anchor="start" x="1035" y="-2052.8" font-family="Times,serif" font-size="14.00">exonEnds_mm8</text> +<text text-anchor="start" x="1052" y="-2031.8" font-family="Times,serif" font-size="14.00">exonStarts</text> +<text text-anchor="start" x="1031" y="-2010.8" font-family="Times,serif" font-size="14.00">exonStarts_mm8</text> +<text text-anchor="start" x="1050.5" y="-1989.8" font-family="Times,serif" font-size="14.00">GenBankID</text> +<text text-anchor="start" x="1031.5" y="-1968.8" font-family="Times,serif" font-size="14.00">GeneDescription</text> +<text text-anchor="start" x="1064.5" y="-1947.8" font-family="Times,serif" font-size="14.00">GeneID</text> +<text text-anchor="start" x="1046" 
y="-1926.8" font-family="Times,serif" font-size="14.00">GeneSymbol</text> +<text text-anchor="start" x="1084" y="-1905.8" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="1056" y="-1884.8" font-family="Times,serif" font-size="14.00">Info_mm9</text> +<text text-anchor="start" x="1074.5" y="-1863.8" font-family="Times,serif" font-size="14.00">kgID</text> +<text text-anchor="start" x="1066.5" y="-1842.8" font-family="Times,serif" font-size="14.00">NM_ID</text> +<text text-anchor="start" x="1056.5" y="-1821.8" font-family="Times,serif" font-size="14.00">ProteinID</text> +<text text-anchor="start" x="1063" y="-1800.8" font-family="Times,serif" font-size="14.00">RGD_ID</text> +<text text-anchor="start" x="1056.5" y="-1779.8" font-family="Times,serif" font-size="14.00">SpeciesId</text> +<text text-anchor="start" x="1067" y="-1758.8" font-family="Times,serif" font-size="14.00">Strand</text> +<text text-anchor="start" x="1045.5" y="-1737.8" font-family="Times,serif" font-size="14.00">Strand_mm8</text> +<text text-anchor="start" x="1068.5" y="-1716.8" font-family="Times,serif" font-size="14.00">TxEnd</text> +<text text-anchor="start" x="1047" y="-1695.8" font-family="Times,serif" font-size="14.00">TxEnd_2016</text> +<text text-anchor="start" x="1047" y="-1674.8" font-family="Times,serif" font-size="14.00">TxEnd_mm8</text> +<text text-anchor="start" x="1064" y="-1653.8" font-family="Times,serif" font-size="14.00">TxStart</text> +<text text-anchor="start" x="1043" y="-1632.8" font-family="Times,serif" font-size="14.00">TxStart_2016</text> +<text text-anchor="start" x="1043" y="-1611.8" font-family="Times,serif" font-size="14.00">TxStart_mm8</text> +<text text-anchor="start" x="1057" y="-1590.8" font-family="Times,serif" font-size="14.00">UnigenID</text> +<polygon fill="none" stroke="black" points="1017.5,-1582 1017.5,-2344 1164.5,-2344 1164.5,-1582 1017.5,-1582"/> +</g> +<!-- GeneList->Species --> +<g id="edge92" class="edge"> 
+<title>GeneList:SpeciesId->Species</title> +<path fill="none" stroke="black" d="M1020,-1783C987.59,-1783 1012.7,-1229.81 1000,-1200 991.25,-1179.47 973.39,-1184.68 965,-1164 938.08,-1097.7 917.52,-573.54 965,-520 1083.11,-386.82 2377.72,-318.63 2715.68,-303.02"/> +<polygon fill="black" stroke="black" points="2716.05,-306.51 2725.88,-302.55 2715.73,-299.51 2716.05,-306.51"/> +</g> +<!-- GeneList->Genbank --> +<g id="edge91" class="edge"> +<title>GeneList:GenBankID->Genbank</title> +<path fill="none" stroke="black" d="M1020,-1994C975.87,-1994 1023.12,-1237.58 1000,-1200 982.29,-1171.21 954.25,-1190.29 933,-1164 870.98,-1087.29 850.32,-970.88 843.44,-901.34"/> +<polygon fill="black" stroke="black" points="846.89,-900.65 842.48,-891.02 839.92,-901.3 846.89,-900.65"/> +</g> +<!-- GeneChipEnsemblXRef --> +<g id="node103" class="node"> +<title>GeneChipEnsemblXRef</title> +<polygon fill="white" stroke="transparent" points="1750,-1928.5 1750,-1997.5 1976,-1997.5 1976,-1928.5 1750,-1928.5"/> +<polygon fill="#f1eef6" stroke="transparent" points="1753,-1973 1753,-1994 1973,-1994 1973,-1973 1753,-1973"/> +<polygon fill="none" stroke="black" points="1753,-1973 1753,-1994 1973,-1994 1973,-1973 1753,-1973"/> +<text text-anchor="start" x="1756" y="-1979.8" font-family="Times,serif" font-size="14.00">GeneChipEnsemblXRef (36 B)</text> +<text text-anchor="start" x="1808" y="-1957.8" font-family="Times,serif" font-size="14.00">EnsemblChipId</text> +<text text-anchor="start" x="1820.5" y="-1936.8" font-family="Times,serif" font-size="14.00">GeneChipId</text> +<polygon fill="none" stroke="black" points="1750,-1928.5 1750,-1997.5 1976,-1997.5 1976,-1928.5 1750,-1928.5"/> +</g> +<!-- GeneChipEnsemblXRef->EnsemblChip --> +<g id="edge93" class="edge"> +<title>GeneChipEnsemblXRef:EnsemblChipId->EnsemblChip</title> +<path fill="none" stroke="black" d="M1974,-1961C2027,-1961 1909.96,-1154.89 1873.44,-911.66"/> +<polygon fill="black" stroke="black" points="1876.86,-910.9 1871.91,-901.53 
1869.94,-911.94 1876.86,-910.9"/> +</g> +<!-- GeneChipEnsemblXRef->GeneChip --> +<g id="edge94" class="edge"> +<title>GeneChipEnsemblXRef:GeneChipId->GeneChip</title> +<path fill="none" stroke="black" d="M1974,-1940C1994.57,-1940 1996.24,-1220.49 1998,-1200 2005.12,-1117.24 2018.29,-1024.34 2029.33,-954.05"/> +<polygon fill="black" stroke="black" points="2032.84,-954.27 2030.95,-943.85 2025.93,-953.18 2032.84,-954.27"/> +</g> +<!-- SnpAllele_to_be_deleted --> +<g id="node104" class="node"> +<title>SnpAllele_to_be_deleted</title> +<polygon fill="white" stroke="transparent" points="13448.5,-4842.5 13448.5,-4932.5 13687.5,-4932.5 13687.5,-4842.5 13448.5,-4842.5"/> +<polygon fill="#d7b5d8" stroke="transparent" points="13452,-4908.5 13452,-4929.5 13685,-4929.5 13685,-4908.5 13452,-4908.5"/> +<polygon fill="none" stroke="black" points="13452,-4908.5 13452,-4929.5 13685,-4929.5 13685,-4908.5 13452,-4908.5"/> +<text text-anchor="start" x="13455" y="-4915.3" font-family="Times,serif" font-size="14.00">SnpAllele_to_be_deleted (3 KiB)</text> +<text text-anchor="start" x="13551" y="-4893.3" font-family="Times,serif" font-size="14.00">Base</text> +<text text-anchor="start" x="13561" y="-4872.3" font-family="Times,serif" font-size="14.00">Id</text> +<text text-anchor="start" x="13554.5" y="-4851.3" font-family="Times,serif" font-size="14.00">Info</text> +<polygon fill="none" stroke="black" points="13448.5,-4842.5 13448.5,-4932.5 13687.5,-4932.5 13687.5,-4842.5 13448.5,-4842.5"/> +</g> +</g> +</svg> diff --git a/topics/deploy/configuring-nginx-on-host.gmi b/topics/deploy/configuring-nginx-on-host.gmi new file mode 100644 index 0000000..cb1c497 --- /dev/null +++ b/topics/deploy/configuring-nginx-on-host.gmi @@ -0,0 +1,220 @@ +# Configuring Nginx on the Host System + +## Tags + +* type: doc, docs, documentation +* keywords: deploy, deployment, deploying, nginx, guix, guix container, guix system container +* status: in progress + +## Introduction + +We deploy the GeneNetwork system 
within GNU Guix system containers. All the configurations and HTTPS certificates are handled from within the container, thus all the host has to do is to pass the traffic on to the system container. + +This document shows you how to set up the host system to forward all the necessary traffic so that you do not run into all the problems that we did when figuring this stuff out :-). + +## Ports and Domains + +In your system container, there are certain ports that are defined for various traffic. The most important ones, and the ones we will deal with, are for HTTP and HTTPS. The ideas should translate for most other ports. + +For the examples in this document, we will assume the following ports are defined in the Guix system container: +* HTTP on port 9080 +* HTTPS on port 9081 + +## HTTPS Traffic + +### Nginx --with-stream_ssl_preread_module + +We handle all the necessary traffic details (e.g. SSL/TLS termination) within the container, and only need the host to forward the traffic. + +In order to achieve this, your Nginx will need to be compiled with the +=> https://nginx.org/en/docs/stream/ngx_stream_ssl_preread_module.html Nginx Stream SSL Preread Module. + +Now, because we are awesome, we include +=> https://git.genenetwork.org/gn-machines/tree/nginx-preread.scm a definition for nginx compiled with the module. +Simply install it on your host by doing something like: + +``` +$ git clone https://git.genenetwork.org/gn-machines +$ cd gn-machines +$ ./nginx-preread-deploy.sh +``` + +That will install the nginx under "/usr/local/sbin/nginx". + +Now, comment out or delete any lines that load nginx modules for a previously existing nginx. Comment out/delete the following line in your "/etc/nginx/nginx.conf" file if it exists: + +``` +include /etc/nginx/modules-enabled/*.conf; +``` + +This is necessary since the nginx we installed from guix comes with all the modules we need, and even if not, it would not successfully use the host's modules anyhow.
You'd need to modify the nginx config for yourself to add any missing modules for the nginx from guix; how to do that is outside the scope of this document, but should not be particularly difficult. + +Set up your init system to use the nginx from guix. Assuming systemd, you need to have something like the following in your "/etc/systemd/system/nginx.service" unit file: + +``` +[Unit] +Description=nginx web server (from Guix, not the host) +After=network.target + +[Service] +Type=forking +PIDFile=/run/nginx.pid +ExecStartPre=/usr/local/sbin/nginx -q -t -c /etc/nginx/nginx.conf -e /var/log/nginx/error.log +ExecStart=/usr/local/sbin/nginx -c /etc/nginx/nginx.conf -p /var/run/nginx -e /var/log/nginx/error.log +ExecReload=/usr/local/sbin/nginx -c /etc/nginx/nginx.conf -s reload -e /var/log/nginx/error.log +ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid +TimeoutStopSec=5 +KillMode=mixed + +[Install] +WantedBy=multi-user.target +``` + +Awesome. Now enable the unit file: + +``` +$ sudo systemctl enable nginx.service +``` + +### Forwarding the HTTPS Traffic + +Now that we have nginx in place, we can forward HTTPS traffic for all the domains we want. In "/etc/nginx/nginx.conf" we add: + +``` +# Forward some HTTPS connections into existing guix containers +stream { + upstream my-container { + # This is our Guix system container + server 127.0.0.1:9081; + } + + upstream host-https { + # Forward any https traffic for any previously existing domains on the + # host itself. + server 127.0.0.1:6443; + } + + map $ssl_preread_server_name $upstream { + yourdomain1.genenetwork.org my-container; + yourdomain2.genenetwork.org my-container; + default host-https; + } + + server { + listen 443; + proxy_pass $upstream; + ssl_preread on; + } +} +``` + +## HTTP Traffic + +You will need to pass the HTTP traffic on to the container so that HTTP-dependent traffic (e.g.
setting up the SSL certificates using the ACME protocol) is successfully handled. + +You have 2 options to do this: +* Add a separate server block in `/etc/nginx/sites-available/` (or other configured directory) +* Add the server block directly in `/etc/nginx/nginx.conf` (or your main nginx config file, if it's not the standard one mentioned here). + +The configuration to add is as follows: + +``` +server { + ## Forward HTTP traffic to container + ## Without this, the HTTP calls will fall through to the defaults in + ## /etc/nginx/sites-enabled/ leading to http-dependent traffic, like + ## that of the ACME client, failing. + server_name yourdomain1.genenetwork.org yourdomain2.genenetwork.org …; + listen 80; + location / { + proxy_pass http://127.0.0.1:9080; + proxy_set_header Host $host; + } +} +``` + +** Do please replace the "yourdomain*" parts in the example above as appropriate for your scenario. The ellipsis (…) indicates optional extra domains you might need to configure. + +Without this, the `Run ACME Client` step below will fail. + +## Run ACME Client + +Now that all traffic is set up, and you can reach your sites using both HTTP and HTTPS (you have tested your sites, right? right?) we can now request the SSL certificates from Let's Encrypt so that we no longer see the "Self-signed Certificate" warning. + +You need to get into your system container to do this. The steps are as follows: + +=> https://git.genenetwork.org/gn-machines/tree/README.org#n61 Figure out which process is your container +=> https://git.genenetwork.org/gn-machines/tree/README.org#n55 Get a shell into the container +=> https://guix-forge.systemreboot.net/manual/dev/en/#section-acme-service Run "/usr/bin/acme renew" to get your initial SSL certificates from Let's Encrypt. + +At this point, the traffic portion of the configuration is done.
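As a rough illustration of the routing decision made by the `map $ssl_preread_server_name $upstream` block in the HTTPS section, here is a minimal Python sketch. It is not part of gn-machines and nginx does not run anything like it; the names `UPSTREAMS`, `SNI_MAP`, `DEFAULT_UPSTREAM` and `pick_upstream` are hypothetical, and the domains and ports are simply the placeholder values from the example configuration. It assumes a plain exact-match lookup, which is a simplification of what nginx does.

```python
# Illustrative sketch only: mimics the lookup expressed by the nginx
# `map $ssl_preread_server_name $upstream` block. All names are hypothetical.

# Upstream names and addresses copied from the example configuration.
UPSTREAMS = {
    "my-container": ("127.0.0.1", 9081),  # the Guix system container (HTTPS)
    "host-https": ("127.0.0.1", 6443),    # pre-existing HTTPS sites on the host
}

# Server names that should be forwarded into the container.
SNI_MAP = {
    "yourdomain1.genenetwork.org": "my-container",
    "yourdomain2.genenetwork.org": "my-container",
}

# Corresponds to the `default host-https;` line of the map block.
DEFAULT_UPSTREAM = "host-https"


def pick_upstream(server_name):
    """Return (upstream name, (host, port)) for a TLS server name (SNI).

    A missing or unknown server name falls back to the default upstream.
    """
    name = SNI_MAP.get(server_name or "", DEFAULT_UPSTREAM)
    return name, UPSTREAMS[name]


if __name__ == "__main__":
    for sni in ("yourdomain1.genenetwork.org", "other.example.org", None):
        print(sni, "->", pick_upstream(sni))
```

The point of the sketch is that the host never decrypts anything: it only reads the requested server name from the TLS handshake and picks a backend, which is why the certificates can stay inside the container.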
+ +## Sample "/etc/nginx/nginx.conf" + +``` +user www-data; +worker_processes auto; +pid /run/nginx.pid; +# include /etc/nginx/modules-enabled/*.conf; + +access_log /var/log/nginx/access.log; +error_log /var/log/nginx/error.log error; + +events { + worker_connections 768; + # multi_accept on; +} + +stream { + upstream my-container { + # This is our Guix system container + server 127.0.0.1:9081; + } + + upstream host-https { + # Forward any https traffic for any previously existing domains on the + # host itself. + server 127.0.0.1:6443; + } + + map $ssl_preread_server_name $upstream { + yourdomain1.genenetwork.org my-container; + yourdomain2.genenetwork.org my-container; + default host-https; + } + + server { + listen 443; + proxy_pass $upstream; + ssl_preread on; + } +} + +http { + ## + # Basic Settings + ## + + ⋮ + + include /etc/nginx/conf.d/*.conf; + server { + ## Forward HTTP traffic to container + ## Without this, the HTTP calls will fall through to the defaults in + ## /etc/nginx/sites-enabled/ leading to http-dependent traffic, like + ## that of the ACME client, failing. + server_name yourdomain1.genenetwork.org yourdomain2.genenetwork.org …; + listen 80; + location / { + proxy_pass http://127.0.0.1:9080; + proxy_set_header Host $host; + } + } + include /etc/nginx/sites-enabled/*; + + ⋮ +} + +⋮ + +``` diff --git a/topics/deploy/deployment.gmi b/topics/deploy/deployment.gmi index b844821..74fd6f0 100644 --- a/topics/deploy/deployment.gmi +++ b/topics/deploy/deployment.gmi @@ -1,14 +1,21 @@ # Deploy GeneNetwork +## Tags + +* type: doc, docs, documentation +* keywords: deploy, deployment, deploying, guix, guix container, guix system container +* status: in progress + # Description This page attempts to document the deployment process we have for GeneNetwork. We use Guix system containers for deployment of CI/CD and the Guix configuration for the CI/CD container should be considered the authoritative reference. 
-=> https://github.com/genenetwork/genenetwork-machines/blob/main/genenetwork-development.scm +=> https://git.genenetwork.org/gn-machines/tree/genenetwork-development.scm See also => ./guix-system-containers-and-how-we-use-them +=> ./configuring-nginx-on-host ## genenetwork2 diff --git a/topics/deploy/genecup.gmi b/topics/deploy/genecup.gmi index c5aec17..fc93d07 100644 --- a/topics/deploy/genecup.gmi +++ b/topics/deploy/genecup.gmi @@ -53,3 +53,72 @@ and port forward: ssh -L 4200:127.0.0.1:4200 -f -N server curl localhost:4200 ``` + +# Troubleshooting + +## Moving the PubMed dir + +After moving the PubMed dir GeneCup stopped displaying part of the connections. This can be reproduced by running the standard example on the home page - the result should look like the image on the right of the home page. + +After fixing the paths and restarting the service there still was no result. + +GeneCup is currently managed by the shepherd as user shepherd. Stop the service as that user: + +``` +shepherd@tux02:~$ herd stop genecup +guile: warning: failed to install locale +Service genecup has been stopped. +``` + +Now the service looks stopped, but it is still running and you need to kill it by hand: + +``` +shepherd@tux02:~$ ps xau|grep genecup +shepherd 89524 0.0 0.0 12780 944 pts/42 S+ 00:32 0:00 grep genecup +shepherd 129334 0.0 0.7 42620944 2089640 ?
Sl Mar05 66:30 /gnu/store/1w5v338qk5m8khcazwclprs3znqp6f7f-python-3.10.7/bin/python3 /gnu/store/a6z0mmj6iq6grwynfvkzd0xbbr4zdm0l-genecup-latest-with-tensorflow-native-HEAD-of-master-branch/.server.py-real +shepherd@tux02:~$ kill -9 129334 +shepherd@tux02:~$ ps xau|grep genecup +shepherd 89747 0.0 0.0 12780 944 pts/42 S+ 00:32 0:00 grep genecup +shepherd@tux02:~$ +``` + +The log file lives in + +``` +shepherd@tux02:~/logs$ tail -f genecup.log +``` + +and we were getting errors on a reload and I had to fix + +``` +shepherd@tux02:~/shepherd-services$ grep export run_genecup.sh +export EDIRECT_PUBMED_MASTER=/export3/PubMed +export TMPDIR=/export/ratspub/tmp +export NLTK_DATA=/export3/PubMed/nltk_data +``` + +See + +=> https://git.genenetwork.org/gn-shepherd-services/commit/?id=cd4512634ce1407b14b0842b0ef6a9cd35e6d46c + +The symlink from /export2 is not honoured by the guix container. Now the service works. + +Note we have deprecation warnings that need to be addressed in the future: + +``` +2025-04-22 00:40:07 /home/shepherd/services/genecup/guix-past/modules/past/packages/python.scm:740:19: warning: 'texlive-union' is deprecated, + use 'texlive-updmap.cfg' instead +2025-04-22 00:40:07 guix build: warning: 'texlive-latex-base' is deprecated, use 'texlive-latex-bin' instead +2025-04-22 00:40:15 updating checkout of 'https://git.genenetwork.org/genecup'... +/gnu/store/9lbn1l04y0xciasv6zzigqrrk1bzz543-tensorflow-native-1.9.0/lib/python3.10/site-packages/tensorflow/python/framewo +rk/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. 
+2025-04-22 00:40:38 _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) +2025-04-22 00:40:38 /gnu/store/9lbn1l04y0xciasv6zzigqrrk1bzz543-tensorflow-native-1.9.0/lib/python3.10/site-packages/tensorflow/python/framewo +rk/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. +2025-04-22 00:40:38 _np_qint32 = np.dtype([("qint32", np.int32, 1)]) +2025-04-22 00:40:38 /gnu/store/9lbn1l04y0xciasv6zzigqrrk1bzz543-tensorflow-native-1.9.0/lib/python3.10/site-packages/tensorflow/python/framewo +rk/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. +2025-04-22 00:40:38 np_resource = np.dtype([("resource", np.ubyte, 1)]) +2025-04-22 00:40:39 /gnu/store/7sam0mr9kxrd4p7g1hlz9wrwag67a6x6-python-flask-sqlalchemy-2.5.1/lib/python3.10/site-packages/flask_sqlalchemy/__ +init__.py:872: FSADeprecationWarning: SQLALCHEMY_TRACK_MODIFICATIONS adds significant overhead and will be disabled by default in the future. Set it to True or False to suppress this warning. +``` diff --git a/topics/deploy/installation.gmi b/topics/deploy/installation.gmi index 757d848..d6baa79 100644 --- a/topics/deploy/installation.gmi +++ b/topics/deploy/installation.gmi @@ -319,7 +319,7 @@ Currently we have two databases for deployment, from BXD mice and 'db_webqtl_plant' which contains all plant related material. 
-Download one database from +Download a recent database from => https://files.genenetwork.org/database/ diff --git a/topics/deploy/machines.gmi b/topics/deploy/machines.gmi index d610c9f..a7c197c 100644 --- a/topics/deploy/machines.gmi +++ b/topics/deploy/machines.gmi @@ -2,17 +2,19 @@ ``` - [ ] bacchus 172.23.17.156 (00:11:32:ba:7f:17) - 1 Gbs -- [X] lambda01 172.23.18.212 (7c:c2:55:11:9c:ac) +- [ ] penguin2 +- [X] lambda01 172.23.18.212 (7c:c2:55:11:9c:ac) - currently 172.23.17.41 - [X] tux03i 172.23.17.181 (00:0a:f7:c1:00:8d) - 10 Gbs [X] tux03 128.169.5.101 (00:0a:f7:c1:00:8b) - 1 Gbs -- [ ] tux04i 172.23.17.170 (14:23:f2:4f:e6:10) -- [ ] tux04 128.169.5.119 (14:23:f2:4f:e6:11) +- [X] tux04i 172.23.17.170 (14:23:f2:4f:e6:10) +- [X] tux04 128.169.5.119 (14:23:f2:4f:e6:11) - [X] tux05 172.23.18.129 (14:23:f2:4f:35:00) - [X] tux06 172.23.17.188 (14:23:f2:4e:29:10) - [X] tux07 172.23.17.191 (14:23:f2:4e:7d:60) - [X] tux08 172.23.17.186 (14:23:f2:4f:4e:b0) - [X] tux09 172.23.17.182 (14:23:f2:4e:49:10) - [X] space 128.169.5.175 (e4:3d:1a:80:6c:40) +- [ ] space-i 172.23.18.153 (cc:48:3a:13:db:4c) - [ ] octopus01f 172.23.18.221 (2c:ea:7f:60:bf:61) - [ ] octopus02f 172.23.22.159 (2c:ea:7f:60:bd:61) - [ ] octopus03f 172.23.19.187 (2c:ea:7f:60:ac:2b) @@ -25,6 +27,8 @@ c for console or control ``` - [ ] DNS entries no longer visible +- [X] penguin2-c 172.23.31.83 +- [ ] octolair01 172.23.16.228 - [X] lambda01-c 172.23.17.173 (3c:ec:ef:aa:e5:50) - [X] tux01-c 172.23.31.85 (58:8A:5A:F9:3A:22) - [X] tux02-c 172.23.30.40 (58:8A:5A:F0:E6:E4) diff --git a/topics/deploy/our-virtuoso-instances.gmi b/topics/deploy/our-virtuoso-instances.gmi index 0336018..3ac56ae 100644 --- a/topics/deploy/our-virtuoso-instances.gmi +++ b/topics/deploy/our-virtuoso-instances.gmi @@ -9,6 +9,8 @@ We run three instances of virtuoso. The public SPARQL endpoint is accessible at => https://sparql.genenetwork.org/sparql +These are now generally run as part of genenetwork2 containers(!) 
+ ## Configuration All our virtuoso instances are deployed in Guix system containers. The configuration for these containers is at diff --git a/topics/deploy/paths-in-flask-applications.gmi b/topics/deploy/paths-in-flask-applications.gmi new file mode 100644 index 0000000..77bc201 --- /dev/null +++ b/topics/deploy/paths-in-flask-applications.gmi @@ -0,0 +1,22 @@ +# Paths in Flask Applications + +## Tags + +* type: doc, docs, documentation +* assigned: fredm +* keywords: application paths, flask, absolute path, relative path + +## Content + +Always build and use absolute paths for the resources you use in your application. Assuming that the application will always be run with the root of the application's repository/package as the working directory is a recipe for failure. + +To demonstrate, see the following issue: +=> /issues/genenetwork2/haley-knott-regression-mapping-error + +In this case, the path issue was not caught in the CI/CD environment since it runs the application with the repository root as its working directory. This issue will also not show up in most development environments since it is easier to run the application from the root of the repository than to set up the PYTHONPATH variables. + +In the new containers making use of the "(genenetwork services genenetwork)" module in gn-machines[fn:1], the working directory where the application is invoked has no relation to the application's package; in fact, the working directory is the root of the container's file system ("/"). 
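A minimal sketch of the absolute-path advice above, assuming a hypothetical package layout in which data files sit in a directory next to the module that needs them. The helper name `resource_path` and the file names are illustrative only and do not come from the GeneNetwork code base:

```python
from pathlib import Path


def resource_path(module_file: str, *parts: str) -> Path:
    """Build an absolute path relative to the directory of a module file.

    `module_file` is normally the caller's `__file__`, so the result does
    not depend on the current working directory of the process.
    """
    return Path(module_file).resolve().parent.joinpath(*parts)


# Hypothetical usage inside a module of the application:
#   GENOTYPES = resource_path(__file__, "data", "genotypes.txt")
# This points at the same file whether the service is started from the
# repository root or from "/" inside a container.
```

Flask also exposes the application's package directory as `app.root_path`, which can serve the same purpose when an application object is at hand.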
+ +# Footnotes + +[fn:1] https://git.genenetwork.org/gn-machines/ diff --git a/topics/deploy/setting-up-or-migrating-production-across-machines.gmi b/topics/deploy/setting-up-or-migrating-production-across-machines.gmi new file mode 100644 index 0000000..631a000 --- /dev/null +++ b/topics/deploy/setting-up-or-migrating-production-across-machines.gmi @@ -0,0 +1,202 @@ +# Setting Up or Migrating Production Across Machines + +## Tags + +* type: documentation, docs, doc +* status: in-progress +* assigned: fredm +* priority: undefined +* keywords: migration, production, genenetwork +* interested-parties: pjotrp, zachs + +## Introduction + +Recent events (late 2024 and early 2025) have led to us needing to move the production system from one machine to another several times, due to machine failures, disk space, security concerns, and the like. + +A number of tasks must be completed for a migration to succeed. Each of the following sections details one of these tasks. + +## Copy Over Auth Database + +We need to synchronise the authorisation database. We can copy this over from the production system, or from the backups. + +* TODO: Indicate where the backups for the auth database are here! + +Steps (flesh out better): + +* Extract backup (or copy from existing production system) +* Stop the (new) container (if it's running) +* Back up the (new) container's auth-db file ( +* Place the auth db file in the correct place in the container's filesystem: +* Back up existing secrets +* Log in to the `/auth/admin/dashboard` of the auth server (e.g. https://cd.genenetwork.org/auth/admin/dashboard) +* If client with the CLIENT_ID in the secrets exists +* 1. Update the URIs for that client; if it doesn't exist, create an entirely new client and replace both the CLIENT_ID and CLIENT_SECRET in the secrets file. +* 2. Click on the "Change Secret" button and generate a new secret. 
Replace the secret in the secrets file with the newly generated secret +* If client with the CLIENT_ID in the secrets DOES NOT exist, register a new client, setting up the appropriate URIs and endpoints, and then add/replace both the CLIENT_ID and CLIENT_SECRET in the secrets file. +* Restart (new) container + +## Set Up the Database + +=> /topics/systems/restore-backups Extract the latest database from the backups. +=> /topics/deploy/installation Configure MariaDB according to this document. + +## Set Up the File System + +* TODO: List the necessary directories and describe what purpose each serves. This will be from the perspective of the container; actual paths on the host system are left to the builder's choice, and can vary wildly. +* TODO: Prefer explicit binding rather than implicit: this makes the shell scripts longer, but no assumptions have to be made and everything is explicitly spelled out. + +The container(s) need access to various files and directories from the host system in order to work correctly. + +Filesystem bindings could be linked to wildly different paths on different physical host machines; therefore, we shall examine the bindings from the point of view of the paths within the container, rather than forcing a particular file system layout on the host systems themselves. + +Each of the sections below details a specific binding: + +### /var/genenetwork + +This binding must be READWRITE within the container. + +The purpose is to hold varying files that are specific to the genenetwork system(s). Examples of the files are: + +* "gn-meta" and "synteny" files for GN3 +* genotype files +* session files for various systems (GN2, gn-uploader, etc.) + +### /var/lib/acme + +This binding must be READWRITE within the container. + +This is used to store TLS certificates for the various services within the container by the ACME (Automatic Certificate Management Environment) script. + +### /var/lib/redis + +This binding must be READWRITE within the container. 
+ +This is used by the redis daemon to persist its state(s). + +### /var/lib/virtuoso + +This binding must be READWRITE within the container. + +Used by the virtuoso daemon to save its state, and maybe some log files. + +### /export/data/virtuoso/ + +This binding must be READONLY within the container. (Really?) + +This is used for importing data into virtuoso, say by sharing Turtle (TTL) files within the binding. + +--- +At this point the binding is READONLY because any TTL files to load are imported from outside the container. If the transformation of data from MariaDB to TTL form is built into the production containers down the line, then this might change to READWRITE to allow the transformation tool to write to it. + +### /var/log + +This binding must be READWRITE within the container. + +Allows logs from various services running in the container to be accessible on the host system. This is useful for debugging issues with the running systems. + +### /etc/genenetwork + +This binding must be READWRITE within the container. + +Useful for storing various configuration files/data for the service(s) running inside the container. + +### /var/lib/xapian + +This binding must be READWRITE within the container. + +Stores the processed search indexes for the xapian search system. + +### /var/lib/genenetwork/sqlite/gn-auth + +This binding must be READWRITE within the container. + +The authorisation database is stored here. The directory needs to be writable to avoid permissions issues within the container when attempting to write data into the database. + +### /var/lib/genenetwork/sqlite/genenetwork3 + +This binding must be READWRITE within the container. + +This stores various SQLite databases in use with GN3. These are: + +* Database for the GNQA system +* ... + +### /run/mysqld + +This binding must be READWRITE within the container. + +This binding is the link to the host directory that holds the socket file for the running MariaDB instance. 
+ +### /opt/gn/tmp + +This binding must be READWRITE within the container. + +Holds temporary files for the various services that run within the container. Some of the generated files from various services are also stored here. + +**PROPOSAL**: Move all generated files here, or have a dedicated directory for holding generated files? + + +### /var/genenetwork/sessions + +This binding must be READWRITE within the container. + +Holds session files for various services within the container. See also the /var/genenetwork binding. + +### /var/lib/genenetwork/uploader + +This binding must be READWRITE within the container. + +**gn-uploader** specific data files. Types of data files that could go here are: + +* File uploads +* (Reusable) Cache files and generated files +* ... others? + +### /var/lib/genenetwork/sqlite/gn-uploader + +This binding must be READWRITE within the container. + +Holds various SQLite databases used with the **gn-uploader** service, e.g.: + +* Background jobs database +* ... + +### /var/lib/genenetwork/gn-guile + +This binding must be READWRITE within the container. + +Various data files for the **gn-guile** service, such as: + +* The bare **gn-docs** repository (Previously bound at `/export/data/gn-docs`: now deprecated). + +## Redis + +We currently (2025-06-11) use Redis for: + +* Tracking user collection (this will be moved to SQLite database) +* Tracking background jobs (this is being moved out to SQLite databases) +* Tracking running-time (not sure what this is about) +* Others? + +We do need to copy over the redis save file whenever we do a migration, at least until the user collections and background jobs features have been moved completely out of Redis. 
+ +## Container Configurations: Secrets + +* TODO: Detail how to extract/restore the existing secrets configurations in the new machine + +## Build Production Container + +* TODO: Add notes on building +* TODO: Add notes on setting up systemd + +## NGINX + +* TODO: Add notes on streaming and configuration of it thereof + +## SSL Certificates + +* TODO: Add notes on acquisition and setup of SSL certificates + +## DNS + +* TODO: Migrate DNS settings diff --git a/topics/deploy/uthsc-email.gmi b/topics/deploy/uthsc-email.gmi new file mode 100644 index 0000000..05f2ba5 --- /dev/null +++ b/topics/deploy/uthsc-email.gmi @@ -0,0 +1,64 @@ +# UTHSC E-mail + +Similar to many organizations UT uses outlook and exchange for their E-mail. Thanks to mobile support it is possible to work with email using other tools outside the UT network. + +## Prospect E-mail client + +People have success using Prospect as an E-mail client. You can follow the instructions on the UT website that are similar to Android support. + +## Davmail IMAP bridge + +An interesting solution is to create an IMAP bridge. It is a little slower, but can be set up! That way you can use your favorite E-mail tool (and filters!). + +I have had success setting up davmail with the following settings and testing with thunderbird first: + +``` +apt install davmail openjfx thunderbird +``` + +Start davmail with + +``` +davmail -d +``` + +Stop davmail and edit the ~/.davmail.properties file with the following. 
+ +``` +davmail.mode=O365Interactive +davmail.url=https://outlook.office365.com/EWS/Exchange.asmx +davmail.oauth.clientId=d3590ed6-52b3-4102-aeff-aad2292ab01c +davmail.enableOauth2=true +davmail.oauth.deviceCode=true +davmail.oauth.enableOauth2=true +davmail.oauth.redirectUri=urn:ietf:wg:oauth:2.0:oob +davmail.oauth.tenantId=common +davmail.imapPort=1143 +davmail.smtpPort=1025 +davmail.logFilePath=/home/yours/.davmail/davmail.log +log4j.logger.httpclient.wire=DEBUG +log4j.rootLogger=DEBUG +log4j.logger.org.apache.http.wire=DEBUG +``` + +Restart davmail and point thunderbird to + +``` +IMAP Server: localhost:1143 +SMTP Server: localhost:1025 +Username: your-email@uthsc.edu +``` + +Note that you should set the UT password in the 2FA browser when it pops up. Do *not* set it in Thunderbird, also when it asks for it to send out SMTP. + +When something fails make sure to track the log in ~/.davmail/davmail.log + +# Using Mutt + +Some useful links: + +=> https://jonathanh.co.uk/blog/exchange-mutt/ +=> https://movementarian.org/blog/posts/mutt-and-office365/ +=> https://www.vanormondt.net/~peter/blog/2021-03-16-mutt-office365-mfa.html + +If someone can get the last one to work we won't even need davmail any more! diff --git a/topics/deploy/uthsc-vpn-with-free-software.gmi b/topics/deploy/uthsc-vpn-with-free-software.gmi index 344772c..aeba322 100644 --- a/topics/deploy/uthsc-vpn-with-free-software.gmi +++ b/topics/deploy/uthsc-vpn-with-free-software.gmi @@ -6,10 +6,24 @@ It is possible to connect to the UTHSC VPN using only free software. For this, y To connect, run openconnect-sso as follows. A browser window will pop up for you to complete the Duo authentication. Once done, you will be connected to the VPN. ``` -$ openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC +$ openconnect-sso --server vpn-server --authgroup UTHSC ``` Note that openconnect-sso should be run as a regular user, not as root. 
After passing Duo authentication, openconnect-sso will try to gain root priviliges to set up the network routes. At that point, it will prompt you for your password using sudo. +## Recommended way + +The recommended way is to use Arun's g-expression setup using guix. See below. It should just work, provided you have the chained certificate that you can get from the browser or one of us and point to the right server. Simply + +``` +$(guix build -f uthsc-vpn.scm) +``` + +See + +=> ./uthsc-vpn.scm + +Get the final details from us. UT does not like it when we put it online even though there is no real risk. + ## Avoid tunneling all your network traffic through the VPN (aka Split Tunneling) openconnect, by default, tunnels all your traffic through the VPN. This is not good for your privacy. It is better to tunnel only the traffic destined to the specific hosts that you want to access. This can be done using the vpn-slice script. @@ -17,7 +31,7 @@ openconnect, by default, tunnels all your traffic through the VPN. This is not g For example, to connect to the UTHSC VPN but only access the hosts tux01 and tux02e through the VPN, run the following command. ``` -$ openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC -- --script 'vpn-slice tux01 tux02e' +$ openconnect-sso --server vpn-server --authgroup UTHSC -- --script 'vpn-slice tux01 tux02e' ``` The vpn-slice script looks up the hostnames tux01 and tux02e on the VPN DNS and adds /etc/hosts entries and routes to your system. vpn-slice can also set up more complicated routes. To learn more, read the vpn-slice documentation. @@ -44,50 +58,50 @@ export OPENSSL_CONF=/tmp/openssl.cnf ``` Then, run the openconnect-sso client as usual. -## Putting it all together using Guix G-expressions +## Misconfigured UTHSC TLS certificate -Remembering to do all these steps is a hassle. Writing a shell script to automate this is a good idea, but why write shell scripts when we have G-expressions! 
Here's a G-expression script that I prepared earlier. -=> uthsc-vpn.scm -Download it, tweak the %hosts variable to specify the hosts you are interested in, and run it like so: +The UTHSC TLS certificate does not validate on some systems. You can work around this by downloading the certificate chain and adding it to your system: +* Navigate with browser to https://vpn-server/. Inspect the certificate in the browser (lock icon next to search bar) and export .pem file +* Move it to /usr/local/share/ca-certificates (with .crt extension) or equivalent +* On Debian/Ubuntu update the certificate store with update-ca-certificates +You should see ``` -$(guix build -f uthsc-vpn.scm) +Updating certificates in /etc/ssl/certs... +1 added, 0 removed; done. ``` +Thanks Niklas. See also +=> https://superuser.com/a/719047/914881 -# Troubleshooting - -Older versions would not show a proper dialog for sign-in. Try - +However, adding certificates to your system manually is not good security practice. It is better to limit the added certificate to the openconnect process. You can do this using the REQUESTS_CA_BUNDLE environment variable like so: ``` -export QTWEBENGINE_CHROMIUM_FLAGS=--disable-seccomp-filter-sandbox +REQUESTS_CA_BUNDLE=/path/to/uthsc/certificate.pem openconnect-sso --server vpn-server --authgroup UTHSC ``` -## Update certificate - -When the certificate expires you can download the new one with: +## Putting it all together using Guix G-expressions -* Navigate with browser to https://uthscvpn1.uthsc.edu/. Inspect the certificate in the browser (lock icon next to search bar) and export .pem file -* Move it to /usr/local/share/ca-certificates (with .crt extension) or equivalent -* On Debian/Ubuntu update the certificate store with update-ca-certificates +Remembering to do all these steps is a hassle. Writing a shell script to automate this is a good idea, but why write shell scripts when we have G-expressions! Here's a G-expression script that I prepared earlier. 
+=> uthsc-vpn.scm +Download it, download the UTHSC TLS certificate chain to uthsc-certificate.pem, tweak the %hosts variable to specify the hosts you are interested in, and run it like so: +``` +$(guix build -f uthsc-vpn.scm) +``` -You should see +to add a route by hand after you can do ``` -Updating certificates in /etc/ssl/certs... -1 added, 0 removed; done. +ip route add 172.23.17.156 dev tun0 ``` -Thanks Niklas. See also - -=> https://superuser.com/a/719047/914881 +# Troubleshooting -On GUIX you may need to point to the updated certificates file with: +Older versions would not show a proper dialog for sign-in. Try ``` -env REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt openconnect-sso --server uthscvpn1.uthsc.edu --authgroup UTHSC +export QTWEBENGINE_CHROMIUM_FLAGS=--disable-seccomp-filter-sandbox ``` ## Acknowledgement -Many thanks to Pjotr Prins and Erik Garrison without whose earlier work this guide would not be possible. +Many thanks to Arun, Pjotr and Erik without whose earlier work this guide would not be possible. 
=> https://github.com/pjotrp/linux-at-university-of-tennessee => https://github.com/ekg/openconnect-sso-docker diff --git a/topics/deploy/uthsc-vpn.scm b/topics/deploy/uthsc-vpn.scm index c714731..f204cdf 100644 --- a/topics/deploy/uthsc-vpn.scm +++ b/topics/deploy/uthsc-vpn.scm @@ -1,11 +1,30 @@ -(use-modules ((gnu packages guile-xyz) #:select (guile-ini guile-lib guile-smc)) - ((gnu packages vpn) #:select (openconnect-sso vpn-slice)) - (guix gexp)) +(use-modules ((gnu packages check) + #:select (python-pytest python-pytest-asyncio python-pytest-httpserver)) + ((gnu packages freedesktop) #:select (python-pyxdg)) + ((gnu packages python-build) #:select (python-poetry-core python-toml)) + ((gnu packages python-crypto) + #:select (python-keyring python-pyotp)) + ((gnu packages python-web) #:select (python-requests python-urllib3)) + ((gnu packages python-xyz) + #:select (python-attrs python-charset-normalizer + python-colorama python-prompt-toolkit python-pysocks + python-structlog)) + ((gnu packages guile-xyz) #:select (guile-ini guile-lib guile-smc)) + ((gnu packages qt) #:select (python-pyqt-6 python-pyqtwebengine-6)) + ((gnu packages vpn) #:select (openconnect vpn-slice)) + ((gnu packages xml) #:select (python-lxml-4.9)) + (guix build-system pyproject) + (guix build-system python) + (guix download) + (guix gexp) + (guix git-download) + ((guix licenses) #:prefix license:) + (guix packages)) ;; Put in the hosts you are interested in here. 
(define %hosts (list "octopus01" - "tux01.genenetwork.org")) + "spacex")) (define (ini-file name scm) "Return a file-like object representing INI file with @var{name} and @@ -19,6 +38,127 @@ (call-with-output-file #$output (cut scm->ini #$scm #:port <>)))))) +(define python-urllib3-1.26 + (package + (inherit python-urllib3) + (version "1.26.15") + (source + (origin + (method url-fetch) + (uri (pypi-uri "urllib3" version)) + (sha256 + (base32 + "01dkqv0rsjqyw4wrp6yj8h3bcnl7c678qkj845596vs7p4bqff4a")))) + (build-system python-build-system))) + +(define python-charset-normalizer-2.10 + (package + (inherit python-charset-normalizer) + (version "2.1.0") + (source + (origin + (method url-fetch) + (uri (pypi-uri "charset-normalizer" version)) + (sha256 + (base32 "04zlajr77f6c7ai59l46as1idi0jjgbvj72lh4v5wfpz2s070pjp")))) + (build-system python-build-system) + (arguments (list)) + (native-inputs + (modify-inputs (package-native-inputs python-charset-normalizer) + (delete "python-setuptools"))))) + +(define python-requests-2.28 + (package + (inherit python-requests) + (name "python-requests") + (version "2.28.1") + (source (origin + (method url-fetch) + (uri (pypi-uri "requests" version)) + (sha256 + (base32 + "10vrr7bijzrypvms3g2sgz8vya7f9ymmcv423ikampgy0aqrjmbw")))) + (build-system python-build-system) + (arguments (list #:tests? #f)) + (native-inputs (list)) + (propagated-inputs + (modify-inputs (package-propagated-inputs python-requests) + (replace "python-charset-normalizer" python-charset-normalizer-2.10) + (replace "python-urllib3" python-urllib3-1.26))))) + +(define-public openconnect-sso + (package + (name "openconnect-sso") + ;; 0.8.0 was released in 2021, the latest update on master HEAD is from + ;; 2023. + (properties '((commit . "94128073ef49acb3bad84a2ae19fdef926ab7bdf") + (revision . 
"0"))) + (version (git-version "0.8.0" + (assoc-ref properties 'revision) + (assoc-ref properties 'commit))) + (source + (origin + (method git-fetch) + (uri (git-reference + (url "https://github.com/vlaci/openconnect-sso") + (commit (assoc-ref properties 'commit)))) + (file-name (git-file-name name version)) + (sha256 + (base32 "08cqd40p9vld1liyl6qrsdrilzc709scyfghfzmmja3m1m7nym94")))) + (build-system pyproject-build-system) + (arguments + `(#:phases + (modify-phases %standard-phases + (add-after 'unpack 'use-poetry-core + (lambda _ + ;; Patch to use the core poetry API. + (substitute* "pyproject.toml" + (("poetry.masonry.api") + "poetry.core.masonry.api")))) + (add-after 'unpack 'patch-openconnect + (lambda* (#:key inputs #:allow-other-keys) + (substitute* "openconnect_sso/app.py" + (("\"openconnect\"") + (string-append "\"" + (search-input-file inputs "/sbin/openconnect") + "\"")))))))) + (inputs + (list openconnect + python-attrs + python-colorama + python-keyring + python-lxml-4.9 + python-prompt-toolkit + python-pyotp + python-pyqt-6 + python-pyqtwebengine-6 + python-pysocks + python-pyxdg + python-requests + python-structlog + python-toml)) + (native-inputs + (list python-poetry-core + python-pytest + python-pytest-asyncio + python-pytest-httpserver)) + (home-page "https://github.com/vlaci/openconnect-sso") + (synopsis "OpenConnect wrapper script supporting Azure AD (SAMLv2)") + (description + "This package provides a wrapper script for OpenConnect supporting Azure AD +(SAMLv2) authentication to Cisco SSL-VPNs.") + (license license:gpl3))) + +;; Login to the UTHSC VPN fails with an SSLV3_ALERT_HANDSHAKE_FAILURE +;; on newer python-requests. 
+(define openconnect-sso-uthsc + (package + (inherit openconnect-sso) + (name "openconnect-sso-uthsc") + (inputs + (modify-inputs (package-inputs openconnect-sso) + (replace "python-requests" python-requests-2.28))))) + (define uthsc-vpn (with-imported-modules '((guix build utils)) #~(begin @@ -34,8 +174,10 @@ ("system_default" . "system_default_sect")) ("system_default_sect" ("Options" . "UnsafeLegacyRenegotiation"))))) - (invoke #$(file-append openconnect-sso "/bin/openconnect-sso") - "--server" "uthscvpn1.uthsc.edu" + (setenv "REQUESTS_CA_BUNDLE" + #$(local-file "uthsc-certificate.pem")) + (invoke #$(file-append openconnect-sso-uthsc "/bin/openconnect-sso") + "--server" "$vpn-server" ; ask us for end-point or see UT docs "--authgroup" "UTHSC" "--" "--script" (string-join (cons #$(file-append vpn-slice "/bin/vpn-slice") diff --git a/topics/documentation/guides_vs_references.gmi b/topics/documentation/guides_vs_references.gmi new file mode 100644 index 0000000..7df0be2 --- /dev/null +++ b/topics/documentation/guides_vs_references.gmi @@ -0,0 +1,24 @@ +# Guides Vs References + +Before coming up with docs, figure out their use. It can either be as a guide (provides solutions to problems encountered) or a reference (similar to man pages, where we provide detailed explanations). + +## For guides: + +* Be as brief as possible, providing reference links for users that want to explore, i.e. don't aim for completeness, but rather practicality. +* Prefer providing code or command snippets where possible. +* Preferably, have another team member review the docs. This helps eliminate blind spots due to our current knowledge. +* Organize the document in such a way that it starts with the most actionable steps. +* Avoid stream-of-consciousness writing. + +### Example + +Wrong: + +When setting up guix OS, I couldn't get `tmux` to start, getting `tmux: invalid LC_ALL, LC_CTYPE or LANG`. Running `locale -a` failed too. 
It took me a while to figure out the solution for this problem, and I attempted to reinstall `glibc-locales` which didn't help. After a lot of research, I found that the root cause was that my applications were built on a different version of `glibc`. I ran `guix update` and the problem disappeared. + +Correct: + +`tmux` failing with `tmux: invalid LC_ALL, LC_CTYPE or LANG` could be caused by having packages built on a different version of `glibc`. Attempt: + +> locale -a # should also fail +> guix update # rebuilds your packages with your current glibc diff --git a/topics/editing/case-attributes.gmi b/topics/editing/case-attributes.gmi new file mode 100644 index 0000000..1a86131 --- /dev/null +++ b/topics/editing/case-attributes.gmi @@ -0,0 +1,110 @@ +# Editing Case-Attributes + +## Tags + +* type: document +* keywords: case-attribute, editing +* assigned: fredm, zachs, acenteno, bonfacem +* status: requirements gathering + +## Introduction + +Case-attributes are metadata for samples. They include sex, age, etc. of the various individuals, and exist separately from "normal" traits mainly because they're non-numeric. On the GN2 traits page, they are shown as extra columns under the "Reviews and Edit Data" section. + +Case-attributes are determined at the group level. E.g. for BXD, case attributes would apply at the level of each sample, across all BXD data. Every strain has a unique attribute and it's fixed, not variable. + +We need to differentiate these two things: + +* Case-Attribute labels/names/categories (e.g. Sex, Height, Cage-handler, etc.) +* Case-Attribute values (e.g. Male/Female, 20cm, Frederick, etc.) + +Currently, both labels and values are set at the group level: + +=> https://github.com/genenetwork/genenetwork1/blob/0f170f0b748a4e10eaf8538f6bcbf88b573ce8e7/web/webqtl/showTrait/DataEditingPage.py Case-Attributes on GeneNetwork1 +is a good starting point to help with understanding how case-attributes were implemented and how they worked. 
+ +Critical bug existed where editing one case-attribute affected all case-attributes defined for a group. + +Case attributes can have the following data-types: + +* Free-form text (no constraints) - see the `Status` column +* Enumerations - textual data, but where the user can only pick from specific values +* Links - The value displayed also acts as a link - e.g. the 'JAX:*' values in the `RRID` column + +## HOWTO + +Example SQL query to fetch case-attribute data: + +``` +SELECT + caxrn.*, ca.Name AS CaseAttributeName, + ca.Description AS CaseAttributeDescription, + iset.InbredSetId AS OrigInbredSetId +FROM + CaseAttribute AS ca INNER JOIN CaseAttributeXRefNew AS caxrn + ON ca.Id=caxrn.CaseAttributeId +INNER JOIN + StrainXRef AS sxr + ON caxrn.StrainId=sxr.StrainId +INNER JOIN + InbredSet AS iset + ON sxr.InbredSetId=iset.InbredSetId +WHERE + caxrn.value != 'x' + AND caxrn.value IS NOT NULL; +``` + +CaseAttributeXRefNew differs from CaseAttributeXRef: + +``` +mysql> describe CaseAttributeXRef; ++------------------+----------------------+------+-----+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++------------------+----------------------+------+-----+---------+-------+ +| ProbeSetFreezeId | smallint(5) unsigned | NO | PRI | 0 | | +| StrainId | smallint(5) unsigned | NO | PRI | 0 | | +| CaseAttributeId | smallint(5) | NO | PRI | 0 | | +| Value | varchar(100) | NO | | | | ++------------------+----------------------+------+-----+---------+-------+ +4 rows in set (0.01 sec) + +mysql> describe CaseAttributeXRefNew; ++-----------------+------------------+------+-----+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++-----------------+------------------+------+-----+---------+-------+ +| InbredSetId | int(5) unsigned | NO | PRI | NULL | | +| StrainId | int(20) unsigned | NO | PRI | NULL | | +| CaseAttributeId | int(5) unsigned | NO | PRI | NULL | | +| Value | varchar(100) | NO | | NULL | | 
++-----------------+------------------+------+-----+---------+-------+ +4 rows in set (0.01 sec) +``` + +=> https://github.com/genenetwork/genenetwork3/blob/dd0b29c07017ec398c447ca683dd4b4be18d73b7/scripts/update-case-attribute-tables-20230818 Script to update the CaseAttribute and CaseAttributeXRefNew tables + +## Tasks + +* @bmunyoki: Model case-attributes correctly in RDF. +* @bmunyoki, @zachs: Implement case-attributes editing in GN3 that correctly models case-attributes at the group level. CRUD operations with the correct authorization. People who can edit sample data should not be able to edit case-attributes because case-attributes are defined at the group level, and editing case-attributes at the group level will affect other samples. +* @rob: Confirm to team whether "N" and "SE" are case-attributes. @bmunyoki AFAICT, no. + + +Possible set of privileges subject to discussion: + +* group:resource:add-case-attributes - Allows user to add a completely new case attribute +* group:resource:edit-case-attributes - Allows user to edit an existing case attribute +* group:resource:delete-case-attributes - Allows user to delete an existing case attribute +* group:resource:view-case-attributes - Allows user to view case attributes and their value + +Given groups are not directly linked to any auth resource, we may introduce some level of indirection. Adding a new resource type that handles groups may solve this. 
+ +## See Also + +=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$myIoafLp_dIONnyNvEI0k2xf3Y8-LyiI_mkP2vBN08o?via=matrix.org Discussion on Case-Attributes Editing in Matrix +=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$P6SNnpY-nAZsDr3VZlRi05m6MT32lXBsCl-BYLh-YLM?via=matrix.org More Discussion on Matrix +=> /issues/case-attr-edit-error Case Attribute Editing Problems +=> /issues/fix-case-attribute-work Fix Case Attribute Work (Same Columns) +=> /issues/fix-case-attribute-editing Editing Case Attribute +=> /issues/consecutive-crud-applications-when-uploading-data Fix Case Attribute Work (Consecutive CRUD applications) +=> /issues/edit-metadata-bugs Cannot Edit Metadata of BXD Traits Effectively +=> /topics/data-uploads/datasets Some Historical Context diff --git a/topics/editing/case_attributes.gmi b/topics/editing/case_attributes.gmi deleted file mode 100644 index 5a11026..0000000 --- a/topics/editing/case_attributes.gmi +++ /dev/null @@ -1,180 +0,0 @@ -# Editing Case-Attributes - -## Tags - -* type: document -* keywords: case-attribute, editing -* assigned: fredm, zachs, acenteno -* status: requirements gathering - -## Introduction - -Case-Attributes are essentially the metadata for the samples. In the GN2 system, they are the extra columns in the table in the "Reviews and Edit Data" accordion tab besides the value and its error margin. - -To quote @zachs - -> "Case Attributes" are basically just sample metadata. So stuff like the sex, age, etc of the various individuals (and exist separately from "normal" traits mainly because they're non-numeric) - -They are the metadata for the various sample in a trait. The case attributes are determined at the group-level: - -> Since they're metadata (or "attributes" in this case) for samples, they're group-level so for BXD, case attributes would apply at the level of each sample, across all BXD data - -Also From email: -> Every strain has a unique attribute and it's fixed, not variable. 
- -## Direction - -We need to differentiate two things: -* Case-Attribute labels/names/categories (e.g. Sex, Height, Cage-handler, etc) -* Case-Attribute values (e.g. Male/Female, 20cm, Frederick, etc.) - -As is currently implemented (as of before 2023-08-31), both the labels and values are set at group level. - -A look at -=> https://github.com/genenetwork/genenetwork1/blob/0f170f0b748a4e10eaf8538f6bcbf88b573ce8e7/web/webqtl/showTrait/DataEditingPage.py Case-Attributes on GeneNetwork1 -is a good starting point to help with understanding how case-attributes were implemented and how they worked. - -## Status - -There is code that existed for the case-attributes editing, but it had a critical bug where the data for existing attributes would be deleted/replaced randomly when one made a change. This lead to a pause in this effort. - -The chosen course of action will, however, not make use of this existing code. Instead, we will reimplement the feature with code in GN3, exposing the data and its editing via API endpoints. - -## Database - -The existing database tables of concern to us are: - -* InbredSet -* CaseAttribute -* StrainXRef -* Strain -* CaseAttributeXRefNew - -We can fetch case-attribute data from the database with: - -``` -SELECT - caxrn.*, ca.Name AS CaseAttributeName, - ca.Description AS CaseAttributeDescription, - iset.InbredSetId AS OrigInbredSetId -FROM - CaseAttribute AS ca INNER JOIN CaseAttributeXRefNew AS caxrn - ON ca.Id=caxrn.CaseAttributeId -INNER JOIN - StrainXRef AS sxr - ON caxrn.StrainId=sxr.StrainId -INNER JOIN - InbredSet AS iset - ON sxr.InbredSetId=iset.InbredSetId -WHERE - caxrn.value != 'x' - AND caxrn.value IS NOT NULL; -``` - -which gives us all the information we need to rework the database schema. - -Since the Case-Attributes are group-level, we need to move the `InbredSetId` to the `CaseAttribute` table from the `CaseAttributeXRefNew` table. 
- -For more concrete relationship declaration, we can have the `CaseAttributeXRefNew` table have it primary key be composed of the `InbredSetId`, `StrainId` and `CaseAttributeId`. That has the added advantage that we can index the table on `InbredSetId` and `StrainId`. - -That leaves the `CaseAttribute` table with the following columns: - -* InbredSetId: Foreign Key from `InbredSet` table -* Id: The CaseAttribute identifier -* Name: Textual name for the Case-Attribute -* Description: Textual description fro the case-attribute - -while the `CaseAttributeXRefNew` table ends up with the following columns: - -* InbredSetId: Foreign Key from `InbredSet` table -* StrainId: The strain -* CaseAttributeId: The case-attribute identifier -* Value: The value for the case-attribute for this specific strain - -There will not be any `NULL` values allowed for any of the columns in both tables. If a strain has no value, we simply delete the corresponding record from the `CaseAttributeXRefNew` table. - -To that end, the following script has been added to ease the migration of the table schemas: -=> https://github.com/genenetwork/genenetwork3/blob/dd0b29c07017ec398c447ca683dd4b4be18d73b7/scripts/update-case-attribute-tables-20230818 -The script is meant to be run only once, and makes the changes mentioned above for both tables. - -## Data Types - -> ... (and exist separately from "normal" traits mainly because they're non-numeric) - -The values for Case-Attributes are non-numeric data. This will probably be mostly textual data. - -As an example: -=> https://genenetwork.org/show_trait?trait_id=10010&dataset=BXDPublish Trait Data and Analysis for BXD_10010 -we see Case-Attributes as: - -* Free-form text (no constraints) - see the `Status` column -* Enumerations - textual data, but where the user can only pick from specific values -* Links - The value displayed also acts as a link - e.g. 
the 'JAX:*' values in the `RRID` column - - -=> https://genenetwork.org/show_trait?trait_id=10002&dataset=CCPublish For this trait - -We see: -* Numeric data - see the `N` and `SE` columns -though that might be a misunderstanding of the quote - -> In the following link for example, every column after Value is a case attribute - https://genenetwork.org/show_trait?trait_id=10010&dataset=BXDPublish - -**TODO**: Verify whether `N` and `SE` are Case-Attributes - -## Authorisation - -From email: -> it's probably not okay to let anyone who can edit sample data for a trait also edit case attributes, since they're group level - -and from matrix: -> The weird bug aside, Bonface had (mostly) successfully implemented editing these through the CSV files in the same way as any other sample data, but for authorization reasons this probably doesn't make sense (since a user having access to editing sample data for specific traits doesn't imply that they'd have access for editing case attributes across the entire group) - -From this, it implies we might need a new set of privileges for dealing with case-attributes, e.g. -* group:resource:add-case-attributes - Allows user to add a completely new case attribute -* group:resource:edit-case-attributes - Allows user to edit an existing case attribute -* group:resource:delete-case-attributes - Allows user to delete an existing case attribute -* group:resource:view-case-attributes - Allows user to view case attributes and their value - -Considering, however, that groups (InbredSets) are not directly linked to any auth resource, this might mean some indirection of sorts, or maybe add a new resource type that handles groups. - -## Features - -* Editing existing case-attributes: YES -* Adding new case attributes: ??? -* Deleting existing case attributes: ??? - -Strains/samples are shared across traits. The values for the case attributes are the same for a particular strain/sample for all traits within a particular InbredSet (group). 
- -## Related and Unsynthesised Chats - -=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$myIoafLp_dIONnyNvEI0k2xf3Y8-LyiI_mkP2vBN08o?via=matrix.org -``` -Zachary SloanZ -I'm pretty sure multiple phenotypes and mRNA datasets can belong to the same experiment (and definitely for the purposes of case attributes -since the mRNA datasets are split by tissue -genotype traits should all be considered part of the same "experiment" (at least as long as we're still only databasing a single genotype file for each group) - -pjotrp -: Case attribute editing will still need to be group level, at least until the whole feature is completely changed. Since they're basically just phenotypes we choose to show in the trait page table, and phenotypes are at the group level -``` - -=> https://matrix.to/#/!EhYjlMbZHbGPjmAFbh:matrix.org/$P6SNnpY-nAZsDr3VZlRi05m6MT32lXBsCl-BYLh-YLM?via=matrix.org -``` -Zachary SloanZ -21:14 -Groups are defined by their list of samples/strains, and the "case attributes" are just "the characteristics of those samples/strains we choose to show on the trait page" (if we move away from the "group" concept entirely that could change, but if we did that we probably would also replace "case attributes" with something else because the way that's implemented is kind of weird to begin with) -ZB -``` - -## Related issues - -=> /issues/case-attr-edit-error -=> /issues/fix-case-attribute-work -=> /issues/fix-case-attribute-editing -=> /issues/consecutive-crud-applications-when-uploading-data -=> /issues/edit-metadata-bugs - -## References - -=> /topics/data-uploads/datasets diff --git a/topics/engineering/improving-wiki-rif-search-in-genenetwork.gmi b/topics/engineering/improving-wiki-rif-search-in-genenetwork.gmi new file mode 100644 index 0000000..74e7178 --- /dev/null +++ b/topics/engineering/improving-wiki-rif-search-in-genenetwork.gmi @@ -0,0 +1,119 @@ +# Improving RIF+WIKI Search + +* author: bonfacem +* reviewed-by: jnduli + +At the time of this writing, 
WIKI and/or RIF Search is extremely slow in MySQL, e.g. searching: "WIKI=nicotine MEAN=(12.103 12.105)" causes an Nginx time-out in Genenetwork2. This blog post discusses how we improved the WIKI+RIF search using XAPIAN and some of our key learnings. + +### TLDR; Key Learnings from Adding RIF+WIKI to the Index + +* xapian-compacting is IO bound. +* Instrument your indexing script and choose a parallel process_count that fits your needs. +* Do NOT store positional data unless you need it. +* Consider stemming your data and removing stop-words from your data ahead of indexing. + +### Slow MySQL Performance + +When indexing genes, we have a complex query [0] which returns 48,308,714 rows. + +Running an "EXPLAIN" on [0] yields: + +``` +1 +------+-------------+----------------+--------+-----------------------------+------------------+---------+------------------------------------------------------------+-------+-------------+ +2 | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +3 +------+-------------+----------------+--------+-----------------------------+------------------+---------+------------------------------------------------------------+-------+-------------+ +4 | 1 | SIMPLE | ProbeSetFreeze | ALL | PRIMARY | NULL | NULL | NULL | 931 | | +5 | 1 | SIMPLE | ProbeFreeze | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.ProbeSetFreeze.ProbeFreezeId | 1 | Using where | +6 | 1 | SIMPLE | Tissue | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.ProbeFreeze.TissueId | 1 | | +7 | 1 | SIMPLE | InbredSet | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.ProbeFreeze.InbredSetId | 1 | Using where | +8 | 1 | SIMPLE | Species | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.InbredSet.SpeciesId | 1 | | +9 | 1 | SIMPLE | ProbeSetXRef | ref | ProbeSetFreezeId,ProbeSetId | ProbeSetFreezeId | 2 | db_webqtl.ProbeSetFreeze.Id | 27287 | | +10 | 1 | SIMPLE | ProbeSet | eq_ref | PRIMARY | PRIMARY | 4 | db_webqtl.ProbeSetXRef.ProbeSetId | 1 | | 
+11 | 1 | SIMPLE | Geno | eq_ref | species_name | species_name | 164 | db_webqtl.InbredSet.SpeciesId,db_webqtl.ProbeSetXRef.Locus | 1 | Using where | ++------+-------------+----------------+--------+-----------------------------+------------------+---------+------------------------------------------------------------+-------+-------------+ +``` + +From the above table, we note that we have "ref" under the "type" column in line 9. The "type" column describes how the rows are found from the table (I.e the join type) [2]. In this case, "ref" means a non-unique index or prefix is used to find all the rows which we can see by running "SHOW INDEXES FROM ProbeSetXRef" (note the Non-unique value of 1 for ProbeSetFreezeId): + +``` ++--------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | ++--------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +| ProbeSetXRef | 0 | PRIMARY | 1 | DataId | A | 46061750 | NULL | NULL | | BTREE | | | +| ProbeSetXRef | 1 | ProbeSetFreezeId | 1 | ProbeSetFreezeId | A | 1688 | NULL | NULL | | BTREE | | | +| ProbeSetXRef | 1 | ProbeSetId | 1 | ProbeSetId | A | 11515437 | NULL | NULL | | BTREE | | | +| ProbeSetXRef | 1 | Locus_2 | 1 | Locus | A | 1806 | 5 | NULL | YES | BTREE | | | ++--------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ +``` + +We get a performance hit on the join: "INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id" since ProbeSetXRef.ProbeSetFreezeId is a non-unique index. 
What this means for our query is that for each row scanned in the ProbeSetFreeze table, there are several rows in the ProbeSetXRef table that will satisfy the JOIN condition. This is analogous to nested loops in programming. + +In the RIF Search, we append "INNER JOIN GeneRIF_BASIC ON GeneRIF_BASIC.symbol = ProbeSet.Symbol" to [0]. Running an EXPLAIN on this new query yields: + +``` +1 +------+-------------+----------------+--------+---------------------------------------+--------------+---------+------------------------------------------------------------+---------+-----------------------+ +2 | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +3 +------+-------------+----------------+--------+---------------------------------------+--------------+---------+------------------------------------------------------------+---------+-----------------------+ +4 | 1 | SIMPLE | GeneRIF_BASIC | index | NULL | symbol | 777 | NULL | 1366287 | Using index | +5 | 1 | SIMPLE | ProbeSet | ref | PRIMARY,symbol_IDX,ft_ProbeSet_Symbol | symbol_IDX | 403 | func | 1 | Using index condition | +6 | 1 | SIMPLE | ProbeSetXRef | ref | ProbeSetFreezeId,ProbeSetId | ProbeSetId | 4 | db_webqtl.ProbeSet.Id | 4 | | +7 | 1 | SIMPLE | ProbeSetFreeze | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.ProbeSetXRef.ProbeSetFreezeId | 1 | | +8 | 1 | SIMPLE | ProbeFreeze | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.ProbeSetFreeze.ProbeFreezeId | 1 | Using where | +9 | 1 | SIMPLE | InbredSet | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.ProbeFreeze.InbredSetId | 1 | Using where | +10 | 1 | SIMPLE | Tissue | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.ProbeFreeze.TissueId | 1 | | +11 | 1 | SIMPLE | Species | eq_ref | PRIMARY | PRIMARY | 2 | db_webqtl.InbredSet.SpeciesId | 1 | | +12 | 1 | SIMPLE | Geno | eq_ref | species_name | species_name | 164 | db_webqtl.InbredSet.SpeciesId,db_webqtl.ProbeSetXRef.Locus | 1 | Using where | +13 
+------+-------------+----------------+--------+---------------------------------------+--------------+---------+------------------------------------------------------------+---------+-----------------------+ +``` + +From the above we see that we have an extra "ref" on line 5 which adds extra overhead. Additionally, now under the "ref" column we see "func" with a "Using index condition" under the "Extra" column. This means that we are using some function during this join [3]. Specifically, this is because the "symbol" column in the GeneRIF_BASIC table is indexed, but the "Symbol" column in the ProbeSet table is not indexed. Regardless, this slows the query down by some orders of magnitude. + +### Adding RIF+WIKI Search to the Existing Gene Index + +Our current indexer[4] works by indexing the results from [0] in chunks of 100,000 into separate xapian databases stored in different directories. This happens by spawning different child processes from the main indexer script. The final step in this process is to compact all the different databases into one database. + +To add RIF+WIKI indices to the existing gene index, we built a global cache. In each child process, we fetch the relevant RIF+WIKI entry from this cache and index it. This increased our indexing time and space consumption. At one point we ran out of RAM, causing an intermittent outage on 2024-06-21 (search for "Outage for 2024-06-20" in the following link): + +=> https://issues.genenetwork.org/topics/meetings/jnduli_bmunyoki Meeting notes + +When troubleshooting our outage, we realized the indexing script consumed all the RAM. This was because the child processes spawned by the index script each consumed around 3GB of RAM, with the combined RAM usage of all child processes exceeding the system RAM. 
To remedy this, we settled on a total_child_process count of 67, limiting the number of spawned children and putting a cap on the total amount of RAM the indexing script could consume. You can see the fix in this commit: + +=> https://github.com/genenetwork/genenetwork3/commit/99d0d1200d7dcd81e27ce65ab84bab145d9ae543 feat: set 67 parallel processes to run in prod + +To try to speed up our indexing, we attempted to parallelize our compacting. Parallelising gave some improvement in our compacting time, but nothing significant. On a SATA drive, compacting 3 different databases which had been compacted from 50 different databases was significantly faster than compacting one database at once from 150 different databases. The conclusion we could draw from this was that the compacting process is IO bound. This is useful data because it informs the type of drive you would want to run our indexing script on, and in our case, an NVMe drive is an ideal candidate because of the fast IO speeds it has. + +To attempt to reduce the index script's space consumption and improve the script's performance, we first removed stop-words and most common words from the global cache, and stemmed words from other documents. This reduced the space footprint to 152 Gb. This was still unacceptable per our needs. Further research into how xapian indexing works pointed us to positional data in the XAPIAN index. In XAPIAN, positional data lets you perform phrase searches such as: "nicotine NEAR mouse" which loosely translates to "search for the term nicotine which occurs near the term mouse." One thing we noticed in the RIF+WIKI search is that we don't need this type of search, a trade-off we were willing to make to make search faster and our XAPIAN database smaller. The impact of dropping positional data from RIF+WIKI data was immediate. Our indexing time on the NVMe drive dropped to a record low of 1 hour 9 minutes with a size of 73 Gb! 
The table below summarizes our findings: + + +``` +| | Indexing Time (min) | Space (Gb) | % Inc Size (from G+P) | % Inc Time | +|------------------------------------------------------------------------------------------------------------------ -----| +|G+P (no stop-words, no-stemming, pos. data) | 75 | 60 | 0 | 0 | +|G+P+W+R (no stop-words, no stemming, pos. data)| 429 | 152 | 153.3 | 472 | +|G+P+W+R (stop-words, stemming, no pos. data) | 69 | 73 | 21.6 | -8 | + +Key: +---- +G: Genes +P: Phenotypes +W: Wiki +R: RIF +``` + +### Some Example Searches + +With RIF+WIKI search added, here are some searches you can try out in CD genenetwork instance: + +* wiki:nicotine AND mean:12.103..12.105 +* rif:isoneuronal AND mean:12.103..12.105 +* species:mouse wiki:addiction rif:heteroneuronal mean:12.103..12.105 +* symbol:shh rif:glioma wiki:nicotine + +### References + +=> https://github.com/genenetwork/genenetwork3/blob/52cd294c2f1d06dddbd6ff613b11f8bc43066038/scripts/index-genenetwork#L54-L89 [0] Gene Indexing SQL Query +=> https://mariadb.com/kb/en/explain/ [1] MariaDB EXPLAIN +=> https://stackoverflow.com/a/4528433 [2] What does eq_ref and ref types mean in MySQL explain? +=> https://planet.mysql.com/entry/?id=29724 [3] The meaning of ref=func in MySQL EXPLAIN +=> https://issues.genenetwork.org/topics/engineering/instrumenting-ram-usage [3] Instrument RAM Usage +=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/index-genenetwork#L54 index-genenetwork diff --git a/topics/engineering/instrumenting-ram-usage.gmi b/topics/engineering/instrumenting-ram-usage.gmi new file mode 100644 index 0000000..4f7ab96 --- /dev/null +++ b/topics/engineering/instrumenting-ram-usage.gmi @@ -0,0 +1,32 @@ +# Instrumenting RAM usage + +* author: bonfacem +* reviewed-by: jnduli + +On 2024-06-21, TUX02 experienced an outage because we ran out of RAM on the server. Here we outline how to instrument processes that consume RAM, in particular, what to watch out for. 
+ +=> https://issues.genenetwork.org/topics/meetings/jnduli_bmunyoki Meeting Notes + +The output of "free -m -h" looks like: + +``` + total used free shared buff/cache available +Mem: 251G 88G 57G 6.2G 105G 155G +Swap: 29G 20G 9.8G +``` + +When running "free", you can refresh the output regularly. As an example, to get human readable output every 2 seconds: + +> free -m -h -s 2 + +It's tempting to check the "free" column to see how much RAM is left. However, memory that Linux is using for disk caching is not counted as free, so this column understates what applications can still obtain. Disk caching doesn't prevent applications from getting the memory they want[1]. What we need to be aware of instead are: + +* available: Make sure this is within acceptable thresholds. +* swap used: Make sure this does not change significantly. + +Also, use htop/top and filter for the process you are monitoring (preferably ordering by RAM usage) to see how much RAM a process and its children (if any) consume. + +## References + +=> https://www.linuxatemyram.com/index.html [0] Linux ate my ram! +=> https://www.linuxatemyram.com/play.html [1] Experiments and fun with Linux disk cache diff --git a/topics/engineering/setting-up-a-basic-pre-commit-hook-for-linting-scheme-files.gmi b/topics/engineering/setting-up-a-basic-pre-commit-hook-for-linting-scheme-files.gmi new file mode 100644 index 0000000..5324de8 --- /dev/null +++ b/topics/engineering/setting-up-a-basic-pre-commit-hook-for-linting-scheme-files.gmi @@ -0,0 +1,31 @@ +# Setting Up a Basic Pre-Commit Hook for Linting Scheme Files + +* author: bonfacem +* reviewed-by: jnduli + +Git executes hooks before/after events such as: commit, push and receive. A pre-commit hook runs before a commit is finalized [0]. This post shows how to create a pre-commit hook for linting scheme files using `guix style`. 
+ +``` +# Step 1: Create the hook +touch .git/hooks/pre-commit + +# Step 2: Make the hook executable +chmod +x .git/hooks/pre-commit + +# Step 3: Copy the following to .git/hooks/pre-commit + +#!/bin/sh + +# Run guix style on staged .scm files +for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.scm$'); do + if ! guix style --whole-file "$file"; then + echo "Linting failed for $file. Please fix the errors and try again." + exit 1 + fi + git add "$file" +done +``` + +## References: + +=> https://www.slingacademy.com/article/git-pre-commit-hook-a-practical-guide-with-examples/ [0] Git Pre-Commit Hook: A Practical Guide (with Examples) diff --git a/topics/engineering/using-architecture-decision-records-in-genenetwork.gmi b/topics/engineering/using-architecture-decision-records-in-genenetwork.gmi new file mode 100644 index 0000000..43d344c --- /dev/null +++ b/topics/engineering/using-architecture-decision-records-in-genenetwork.gmi @@ -0,0 +1,56 @@ +# Using Architecture Decision Records at GeneNetwork + +* author: bonfacem +* reviewed-by: fredm, jnduli + +> One of the hardest things to track during the life of a project is the motivation behind certain decisions. A new person coming on to a project may be perplexed, baffled, delighted, or infuriated by some past decision. +> -- Michael Nygard + +When building or maintaining software, there are often moments when we ask, "What were they thinking?" This happens when we are trying to figure out why something was done a certain way, leading to speculation, humor, or criticism[0]. Given the constraints we face when writing code, it's important to make sure that key decisions are well-documented and transparent. Architecture Decision Records (ADRs) are one such tool. They provide a structured way to capture the reasoning behind key decisions. + +ADRs consist of 4 key sections [0]: + +* Status: An ADR begins with a proposed status. After discussions, it will be accepted or rejected. 
It is also possible for a decision to be superseded by a newer ADR later on. +* Context: The context section outlines the situation or problem, providing the background and constraints relevant to the decision. This section is meant to frame the issue concisely, not as a lengthy blog post or detailed explanation. +* Decision: This section clearly defines the chosen approach and the specific actions that will be taken to address the issue. +* Consequences: This part lays out the impact or outcomes of the decision, detailing the expected results and potential trade-offs. + +Optionally, when an ADR is rejected, you can add a section: + +* Rejection Rationale: Briefly provides some context for why the ADR was rejected. + +At GeneNetwork, we manage ADRs within our issue tracker, organizing them under the path "/topics/ADR/<project-name>/XXX-name.gmi". The "XXX" represents a three-digit number, allowing for an easy, chronological order of the proposals as they are created. + +Here is a template for a typical ADR in Genenetwork: + +``` +# [<project>/ADR-<XXX>] Title + +* author: author-name +* status: proposed +* reviewed-by: A, B, C + +## Context + +Some context. + +## Decision + +Decisions. + +## Consequences + +Consequences. +``` + +Here are some examples of Genenetwork specific ADRs: + +=> https://issues.genenetwork.org/topics/ADR/gn3/000-add-test-cases-for-rdf [gn3/ADR-000] Add RDF Test Case +=> https://issues.genenetwork.org/topics/ADR/gn3/000-remove-stace-traces-in-gn3-error-response [gn3/ADR-001] Remove Stack Traces in GN3 + +### References + +=> https://www.oreilly.com/library/view/mastering-api-architecture/9781492090625/ [0] Gough, J., Bryant, D., & Auburn, M. (2022). Mastering API Architecture: Design, Operate, and Evolve API-based Systems. O'Reilly Media, Incorporated. +=> https://adr.github.io/ [1] Architectural Decision Records. 
Homepage of the ADR GitHub organization +=> https://docs.aws.amazon.com/prescriptive-guidance/latest/architectural-decision-records/adr-process.html [2] Amazon's ADR process +=> https://cloud.google.com/architecture/architecture-decision-records [3] Google Cloud Center Architecture Decision Records Overview diff --git a/topics/engineering/working-with-virtuoso-locally.gmi b/topics/engineering/working-with-virtuoso-locally.gmi new file mode 100644 index 0000000..af249a5 --- /dev/null +++ b/topics/engineering/working-with-virtuoso-locally.gmi @@ -0,0 +1,70 @@ +# Working with Virtuoso for Local Development + +* author: bonfacem +* reviewed-by: jnduli + +Using guix, install the Virtuoso server: + +``` +guix install virtuoso-ose # or any other means to install virtuoso +cd /path/to/virtuoso/database/folder +cp $HOME/.guix-profile/var/lib/virtuoso/db/virtuoso.ini ./virtuoso.ini +# modify the virtuoso.ini file to save files to the folder you'd prefer +virtuoso-t +foreground +wait +debug +``` + +## Common Virtuoso Operations + +Use isql to load up data: + +``` +isql +# subsequent commands run in isql prompt +# this folder is relative to the folder virtuoso was started from +ld_dir ('path/to/folder/with/ttls', '*.ttl', 'http://genenetwork.org'); +rdf_loader_run(); +checkpoint; +``` + +Add data using HTTP: + +``` +# Replace dba:dba with <user>:<password> +curl --digest --user 'dba:dba' --verbose --url\ +"http://localhost:8890/sparql-graph-crud-auth?graph=http://genenetwork.org"\ +-T test-data.ttl +``` + +Delete data using HTTP: + +``` +# Replace dba:dba with <user>:<password> +curl --digest --user 'dba:dba' --verbose --url\ +"http://localhost:8890/sparql-graph-crud-auth?graph=http://genenetwork.org" -X DELETE +``` + +Query the graph data: + +``` +curl --verbose --url\ +"http://localhost:8890/sparql-graph-crud?graph=http://genenetwork.org" +``` + +Check out more cURL examples here: + +=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtGraphProtocolCURLExamples SPARQL 1.1 Graph 
Store HTTP Protocol cURL Example Collection + +## Setting Passwords + +Virtuoso's default user is "dba" and its default password is "dba". To change a password, use isql to run: + +``` +set password "dba" "dba"; +CHECKPOINT; +``` + +## More + +Read a fuller, more complete tutorial on Virtuoso here: + +=> https://issues.genenetwork.org/topics/systems/virtuoso Virtuoso diff --git a/topics/genenetwork-releases.gmi b/topics/genenetwork-releases.gmi new file mode 100644 index 0000000..e179629 --- /dev/null +++ b/topics/genenetwork-releases.gmi @@ -0,0 +1,77 @@ +# GeneNetwork Releases + +## Tags + +* status: open +* priority: +* assigned: +* type: documentation +* keywords: documentation, docs, release, releases, genenetwork + +## Introduction + +The sections that follow note down the commits used for various stable (and stable-ish) releases of genenetwork. + +The tagging of the commits will need to distinguish repository-specific tags from overall system tags. + +In this document, we only concern ourselves with the overall system tags, which shall have the template: + +``` +genenetwork-system-v<major>.<minor>.<patch>[-<commit>] +``` + +The portions in angle brackets will be replaced with the actual version numbers. + +## genenetwork-system-v1.0.0 + +This is the first, guix-system-container-based, stable release of the entire genenetwork system. 
+The commits involved are: + +=> https://github.com/genenetwork/genenetwork2/commit/314c6d597a96ac903071fcb6e50df3d9e88935e9 GN2: 314c6d5 +=> https://github.com/genenetwork/genenetwork3/commit/0d902ec267d96b87648669a7a43b699c8a22a3de GN3: 0d902ec +=> https://git.genenetwork.org/gn-auth/commit/?id=8e64f7f8a392b8743a4f36c497cd2ec339fcfebc: gn-auth: 8e64f7f +=> https://git.genenetwork.org/gn-libs/commit/?id=72a95f8ffa5401649f70978e863dd3f21900a611: gn-libs: 72a95f8 + +The guix channels used for deployment of the system above are as follows: + +``` +(list (channel + (name 'guix-bioinformatics) + (url "https://git.genenetwork.org/guix-bioinformatics/") + (branch "master") + (commit + "039a3dd72c32d26b9c5d2cc99986fd7c968a90a5")) + (channel + (name 'guix-forge) + (url "https://git.systemreboot.net/guix-forge/") + (branch "main") + (commit + "bcb3e2353b9f6b5ac7bc89d639e630c12049fc42") + (introduction + (make-channel-introduction + "0432e37b20dd678a02efee21adf0b9525a670310" + (openpgp-fingerprint + "7F73 0343 F2F0 9F3C 77BF 79D3 2E25 EE8B 6180 2BB3")))) + (channel + (name 'guix-past) + (url "https://gitlab.inria.fr/guix-hpc/guix-past") + (branch "master") + (commit + "5fb77cce01f21a03b8f5a9c873067691cf09d057") + (introduction + (make-channel-introduction + "0c119db2ea86a389769f4d2b9c6f5c41c027e336" + (openpgp-fingerprint + "3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5")))) + (channel + (name 'guix) + (url "https://git.savannah.gnu.org/git/guix.git") + (branch "master") + (commit + "2394a7f5fbf60dd6adc0a870366adb57166b6d8b") + (introduction + (make-channel-introduction + "9edb3f66fd807b096b48283debdcddccfea34bad" + (openpgp-fingerprint + "BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA"))))) +``` diff --git a/topics/genenetwork/Case_Attributes_GN2 b/topics/genenetwork/Case_Attributes_GN2 new file mode 100644 index 0000000..52a956f --- /dev/null +++ b/topics/genenetwork/Case_Attributes_GN2 @@ -0,0 +1,2 @@ +# Update Case Attributes to capture hierarchy info +## The 
following provides guidelines and insight regarding case attributes as used in GeneNetwork Webservice searches diff --git a/topics/genenetwork/genenetwork-services.gmi b/topics/genenetwork/genenetwork-services.gmi new file mode 100644 index 0000000..717fdd8 --- /dev/null +++ b/topics/genenetwork/genenetwork-services.gmi @@ -0,0 +1,122 @@ +# GeneNetwork Services + +## Tags + +* type: documentation +* keywords: documentation, docs, doc, services, genenetwork services + +## GeneNetwork Core Services + +GeneNetwork is composed of a number of different services. This document attempts to document all the services that make up GeneNetwork and document what links give access to the services. + +### GeneNetwork2 + +This is the main user-interface to the entire GeneNetwork system. + +#### Links + +=> https://github.com/genenetwork/genenetwork2 Repository +=> https://genenetwork.org/ GN2 on production +=> https://fallback.genenetwork.org/ GN2 on old production +=> https://cd.genenetwork.org/ GN2 on CI/CD +=> https://staging.genenetwork.org/ GN2 on staging + +### GeneNetwork3 + +This is the main API server for GeneNetwork. + +#### Links + +=> https://github.com/genenetwork/genenetwork3 Repository +=> https://genenetwork.org/api3/ GN3 on production +=> https://fallback.genenetwork.org/api3/ GN3 on old production +=> https://cd.genenetwork.org/api3/ GN3 on CI/CD +=> https://staging.genenetwork.org/api3/ GN3 on staging + +### Sparql Service + +The SparQL service is served from a Virtuoso-OSE service. + +=> https://issues.genenetwork.org/topics/deploy/our-virtuoso-instances We have notes on our virtuoso instances here. + + +#### Links + +=> https://github.com/genenetwork/genenetwork3 Repository +=> https://sparql.genenetwork.org/sparql/ sparql-service on production +* ??? sparql-service on old production +* ??? sparql-service on CI/CD +* ??? sparql-service on staging + +### GN-Auth + +This is the authorisation server for the GeneNetwork system. 
+ +#### Links + +=> https://git.genenetwork.org/gn-auth/ Repository +=> https://auth.genenetwork.org/ gn-auth on production +=> https://fallback.genenetwork.org/gn-auth/ gn-auth on old production +* ??? gn-auth on CI/CD +=> https://staging-auth.genenetwork.org/ gn-auth on staging + +### GN-Uploader + +This service is to be used for uploading data to GeneNetwork. It is currently in development (best case, alpha). + +#### Links + +=> https://git.genenetwork.org/gn-uploader/ Repository +* ??? gn-uploader on production +* ??? gn-uploader on old production +* ??? gn-uploader on CI/CD +=> https://staging-uploader.genenetwork.org/ gn-uploader on staging + +### Aliases Server + +An extra server to respond with aliases for genetic (etc.) symbols. + +This is currently a project in racket, but we should probably pull in the features in this repository into one of the others (probably GeneNetwork3) and trash this repository. + +#### Links + +=> https://github.com/genenetwork/gn3 Repository +=> https://genenetwork.org/gn3/ aliases-server on production +=> https://fallback.genenetwork.org/gn3/ aliases-server on old production +=> https://cd.genenetwork.org/gn3/ aliases-server on CI/CD +=> https://staging.genenetwork.org/gn3/ aliases-server on staging + +### Markdown Editing Server + +#### Links + +=> https://git.genenetwork.org/gn-guile/ Repository +=> https://genenetwork.org/facilities/ markdown-editing-server on production +=> https://fallback.genenetwork.org/facilities/ markdown-editing-server on old production +=> https://cd.genenetwork.org/facilities/ markdown-editing-server on CI/CD +=> https://staging.genenetwork.org/facilities/ markdown-editing-server on staging + +## Support Services + +These are other services that support the development and maintenance of the core services. 
+ +### Issue Tracker + +We use a text-based issue tracker that is accessible via +=> https://issues.genenetwork.org/ + +The repository for this service is at +=> https://github.com/genenetwork/gn-gemtext-threads/ + +### Repositories Server + +This is where a lot of the genenetwork repositories live. You can access it at +=> https://git.genenetwork.org/ + +### Continuous Integration Service + +… + +=> https://ci.genenetwork.org/ + +### … diff --git a/topics/genenetwork/genenetwork-streaming-functionality.gmi b/topics/genenetwork/genenetwork-streaming-functionality.gmi new file mode 100644 index 0000000..4f81eea --- /dev/null +++ b/topics/genenetwork/genenetwork-streaming-functionality.gmi @@ -0,0 +1,43 @@ +# Genenetwork Streaming Functionality + +## Tags +* type: documentation +* Keywords: documentation, docs, genenetwork, streaming + +### Introduction +Genenetwork implements streaming functionality that logs results from a running external process to a terminal emulator. + +The streaming functionality can be divided into several sections. + +### Streaming UI +The terminal emulator is implemented using the `xterm.js` library and +logs results from the GN3 API. + +See: +=> https://github.com/xtermjs/xterm.js + +### Streaming API +This is the main endpoint for streaming: + +See reference: +=> https://github.com/genenetwork/genenetwork3/gn3/api/streaming.py + +### How to Integrate + +#### Import the `enable_streaming` Decorator + +``` +from gn3.computations.streaming import enable_streaming +``` + +#### Apply the Decorator to Your Endpoint that Runs an External Process + +Note: To run the external process, use the `run_process` function, +which captures the `stdout` in a file identified by the `run_id`. 
+ +``` +@app.route('/your-endpoint') +@enable_streaming +def your_endpoint(streaming_output_file): + run_process(command, streaming_output_file, run_id) +``` diff --git a/topics/genenetwork/publications-on-genenetwork.gmi b/topics/genenetwork/publications-on-genenetwork.gmi new file mode 100644 index 0000000..aea1f63 --- /dev/null +++ b/topics/genenetwork/publications-on-genenetwork.gmi @@ -0,0 +1,14 @@ +# Publications on Genenetwork + +## Tags + +* type: documentation +* keywords: documentation, docs, doc, publications + +## Important points + +A publication can relate to more than one Dataset (or family), i.e. you can have a publication with phenotypes from the BXD and CXB populations: from @robw: + +``` +Yes. A single publication can make use of several different families of strains. Our. hippocampus paper with Rupert included both BXD, CXB, and Mouse Diversity panel. We (awkwardly) put them all into BXDs with "Other" and also in CXB and also in Mouse Diversity. Definitely not optimal, but the code would have been way more work that just entering in three ways. +``` diff --git a/topics/genenetwork/starting_gn1.gmi b/topics/genenetwork/starting_gn1.gmi index efbfd0f..e31061f 100644 --- a/topics/genenetwork/starting_gn1.gmi +++ b/topics/genenetwork/starting_gn1.gmi @@ -51,9 +51,7 @@ On an update of guix the build may fail. Try #######################################' # Environment Variables - private ######################################### - # sql_host = '[1]tux02.uthsc.edu' - # sql_host = '128.169.4.67' - sql_host = '172.23.18.213' + sql_host = '170.23.18.213' SERVERNAME = sql_host MYSQL_SERVER = sql_host DB_NAME = 'db_webqtl' diff --git a/topics/genetics/pangenotypes.gmi b/topics/genetics/pangenotypes.gmi new file mode 100644 index 0000000..9b3d534 --- /dev/null +++ b/topics/genetics/pangenotypes.gmi @@ -0,0 +1,52 @@ +# Pangenotypes + +Here we discuss different storage solutions for pangenotypes. 
+ +## GRG format + + +Looking for graph genotyping I ran into Genotype Representation Graphs GRG + +=> https://pmc.ncbi.nlm.nih.gov/articles/PMC11071416/ + +It has a binary storage format that represents something like: + +``` +# GRG file example: genotype graph +# Nodes section: NODE <id> <label> allele=<genotype> +NODE 1 GeneA allele=AA +NODE 2 GeneB allele=AG +NODE 3 GeneC allele=GG +NODE 4 GeneD allele=AA +NODE 5 GeneE allele=AG + +# Edges section: EDGE <from_id> <to_id> +EDGE 1 2 +EDGE 1 3 +EDGE 2 4 +EDGE 3 4 +EDGE 4 5 +EDGE 5 1 +``` + +the tooling + +=> https://github.com/aprilweilab/grgl.git + +builds with + +``` +guix shell -C -N coreutils gcc-toolchain make cmake openssl nss-certs git pkg-config zlib +``` + +I did some tests and read the source code. The nice thing is that they have very similar ideas. Unfortunately the implementation is not what we want. I wonder why people alway reinvent data structures :/. To get an idea: + +=> https://github.com/aprilweilab/grgl/blob/main/src/serialize.cpp + +I would like to take similar ideas and take it to an efficient in-memory graph structure that is easily extensible. RDF is key for extensions (and queries). A fast RDF implementation we are going to try is + +=> https://pyoxigraph.readthedocs.io/en/stable/index.html + +Toshiaki pointed out we should look at qlever instead: + +=> https://github.com/ad-freiburg/qlever diff --git a/topics/genetics/standards/gemma-genotype-format.gmi b/topics/genetics/standards/gemma-genotype-format.gmi new file mode 100644 index 0000000..6ca5998 --- /dev/null +++ b/topics/genetics/standards/gemma-genotype-format.gmi @@ -0,0 +1,99 @@ +# PanGEMMA Genotype Format + +Here we describe the genotype DB format that is used by GN and pangemma. Essentially it contains the genotypes as markers x samples (rows x cols). Unlike some earlier formats it also carries metadata and allows for track changes to the genotypes. 
+ +The current reference implementation for creating the file lives at + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/geno2mdb.rb + +Note that we'll likely create new versions in python, guile and/or rust. + +# Storage + +We use the LMDB b-tree format to store and retrieve records based on an index. LMDB is very fast as it uses the memory map facilities of the underlying operating system. + +=> https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database + +LMDB supports multiple 'tables' in one file. We also use a metadata table named 'info'. Another table named 'track-changes' keeps track of modifications to the genotypes. This allows the genotypes to change over time - still giving people access to the original information if they need it. + +# Genotypes in the 'geno' table + +Genotypes are stored as fixed size rows of genotypes. Genotypes can be represented as 4-byte floats 'f*' or a list of bytes 'C*' (note these format specifiers come from ruby pack - python has similar but slightly different specifiers). The idea being that storing floats gives enough precision for probabilities and single bytes can represent all other cases. In the future we may add 2-byte integers, but that is probably not necessary. + +For the float version we use NaN to designate a missing value (NA). + +For the byte version we use the value 255 or 0xFF to designate a missing value (NA). The other 255 values (including 0) are used either as an index - so A,B,H could be 0,1,2 - or we use it to project a range of values. In many cases 255 values is enough to represent genotype variation in a population. Otherwise opt for the float option. + +The index to the rows is currently built out of keys. These keys hold the chromosome number as a single byte 'C', the position as a 4-byte long integer 'L>' and the row number in the original file as a 4-byte long 'L>'. These numbers are stored big-endian (the '>' modifier) so the byte-wise ordering of the keys matches their numeric ordering and the index is always correctly sorted(!). 
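The key and row layout described above can be tried out in a few lines of Ruby. This is a minimal sketch, not code from geno2mdb.rb: the pack specifiers 'CL>L>', 'C*' and 'f*' and the NA conventions come from the text, while the helper names `make_key` and `read_key` and the marker values are made up for illustration. It packs three keys, sorts them as plain byte strings, and repeats the exercise with little-endian integers ('<') to show why the byte order matters for a sorted index.

```ruby
# Sketch only: pack specifiers are from the text, helper names are hypothetical.
KEY_PACK = "CL>L>"   # chromosome byte, then position and row number as 32-bit big-endian
MISSING_BYTE = 255   # NA marker for the byte ('C*') row format

def make_key(chr, pos, row)
  [chr, pos, row].pack(KEY_PACK)
end

def read_key(key)
  key.unpack(KEY_PACK)
end

# Three made-up markers (chr, pos, row), deliberately out of order.
markers = [[2, 1, 2], [1, 256, 0], [1, 255, 1]]

# Big-endian keys: sorting the raw byte strings gives chromosome/position order.
sorted_be = markers.map { |m| make_key(*m) }.sort.map { |k| read_key(k) }

# Little-endian keys for comparison: byte-wise order no longer follows position.
sorted_le = markers.map { |m| m.pack("CL<L<") }.sort.map { |k| k.unpack("CL<L<") }

# Rows: the byte format uses 255 for NA, the float format uses NaN.
byte_row = [0, 1, 2, MISSING_BYTE].pack("C*")
float_row = [0.0, 1.5, Float::NAN].pack("f*")

puts sorted_be.inspect
puts sorted_le.inspect
puts byte_row.unpack("C*").inspect
puts float_row.unpack("f*").last.nan?
```

With the little-endian variant position 256 sorts before position 255 on chromosome 1, which is exactly what a sorted index has to avoid.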
+ +# Metadata in the 'info' table + +The default metadata is stored in the info table as + +``` +meta = { + "type" => "gemma-geno", + "format" => options[:format], + "version" => 1.0, + "eval" => EVAL.to_s, + "key-format" => CHRPOS_PACK, + "rec-format" => PACK, + "geno" => json +} +``` + +where CHRPOS_PACK gives the key layout 'CL>L>' and PACK the genotype list, e.g. 'f*'. The format line gives the 'standard' storage type, e.g. 'Gf' for the floats and eval is the command used to transform values. The only field we really have to use for unpacking the data is format or rec-format because key-format does not change. The info table has some extra records that may be used: + +``` + info['numsamples'] = [numsamples].pack("Q") # uint64 + info['nummarkers'] = [geno.size].pack("Q") + info['meta'] = meta.to_json.to_s + info['format'] = options[:format].to_s + info['options'] = options.to_s +``` + +where 'numsamples' and 'nummarkers' are counts. 'meta' holds the JSON record shown above. 'format' mirrors format in the meta record and 'options' shows the options as they were fed to the program that generated the file. + +# Tracking changes + +Note: this is a proposal and has not yet been implemented. But the idea is to store records by time stamp. Each record will describe the change so the last genotypes can be rolled forward at the user's wish. In case of a replacement it could be: + +``` +timestamp => +{ + "marker" => name, + "chr" => chr, + "pos" => pos, + "line" => line, + "action" => "update", + "author" => author, + "genotypes" => list +} +``` + +Where list contains the *updated* genotypes. +Likewise for a marker insertion or deletion. + +The track changes can also specify that a change only applies to a trait, a list of traits, a specific set of samples, or a group. E.g. 
+ +``` +timestamp => +{ + "marker" => name, + "chr" => chr, + "pos" => pos, + "line" => line, + "action" => "update", + "author" => author, + "genotypes" => list, + "for-traits" => list, + "for-samples" => list, + "for-group" => name +} +``` + +The 'geno' database will therefore always hold the *first* version. These records make it possible to roll forward on changes and present an updated genotype matrix. Used genotypes are retained. This, naturally, can be handled in a cache. So any rewritten genotype files will be available in cache for a period of time. +In the future a tool, such as GEMMA, could support dynamic application of these edits. That way we only have to cache the latest version. + +This way users may be able to select changes (i.e. pick and choose), use all (latest) or use original (init). + +For the editing we should provide an API. diff --git a/topics/genetics/test-pangenome-derived-genotypes.gmi b/topics/genetics/test-pangenome-derived-genotypes.gmi new file mode 100644 index 0000000..3a8473a --- /dev/null +++ b/topics/genetics/test-pangenome-derived-genotypes.gmi @@ -0,0 +1,1005 @@ +# Test pangenome derived genotypes + +Here we follow up on the work we did on precompute PublishData: + +=> ../systems/mariadb/precompute-publishdata + +But now run against pangenome derived genotypes. +For the BXD we have 23M markers(!) of which 8M are *not* on the reference genome. 
+ +# Tasks + +* [ ] Document lmdb geno and marker information +* [ ] Extract epoch information +* [ ] Add BED file and link SNPS +* [ ] Check MAF filter - it may be too stringent +* [ ] Use ravanan/CWL to push to Octopus +* [ ] Reintroduce nodes that were not annotated for position (Flavia) +* [ ] GWA plotter +* [ ] Speed up IO for GEMMA by using lmdb for genotypes and marker file +* [ ] Use 1.5LOD score to compute QTLs instead of using 50M distance +* [ ] Reduce GEMMA GRM RAM requirements (not urgent) +* [ ] Fix -lmm 4 ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt + +# Summary + +To get the mapping and generate the assoc output in mdb format we run a variant of gemma-wrapper. + +The workflow essentially is: + +* capture the significant markers from GEMMA's mdb output (as created by gemma-wrapper) +* These are transformed into RDF using the 'gemma-mdb-to-rdf.rb' script +* Next we upload that RDF into virtuoso +* from there download a table of start-stop data using SPARQL +* We compute QTL locations using 'sparql-qtl-detect.rb' +* Upload that RDF also into virtuoso + +For mapping virtuoso contains four important ttl files: + +* marker positions in pangenome-marker graph +* mapped markers in pangenome-mapped graph +* computed QTL positions in pangenome-qtl graph +* trait values in traits graph (nyi) + + +``` +gemma-batch-run.sh +``` + +Next we convert that output to RDF with + +``` +../bin/gemma-mdb-to-rdf.rb --header > output.ttl +time ../bin/gemma-mdb-to-rdf.rb --anno snps-matched.txt.mdb tmp/panlmm/*-gemma-GWA.tar.xz >> output.ttl # two hours for 7000 traits +time serdi -i turtle -o ntriples output.ttl > output.n3 +``` + +(note that n3 files are less error prone and serdi does better than rapper with huge files) and copy the file to the virtuoso instance and load it with isql (note it may be worth search-replacing the gnt:run tag to something descriptive). 
+ +``` +cd /export/guix-containers/virtuoso/data/virtuoso/ttl/ +guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891 +SQL> ld_dir('/export/data/virtuoso/ttl','test-run-3000.n3','http://pan-test.genenetwork.org'); +Done. -- 3 msec. +# for testing the validity and optional delete problematic ones: +SQL> SELECT * FROM DB.DBA.load_list; +SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ; +SQL> DELETE from DB.DBA.LOAD_LIST where LL_STATE = 1; +# commit changes +SQL> rdf_loader_run (); // about 1 min per GB n3 +SQL> checkpoint; +Done. -- 16 msec. +SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10; +34200686 +``` + +Note it may be a good idea to drop graphs first. That is why we have separate subgraph spaces for every large TTL file: + +``` +log_enable(3,1); +SQL> SPARQL CLEAR GRAPH <http://pan-test.genenetwork.org>; +SQL> SPARQL CLEAR GRAPH <http://pan-mapped.genenetwork.org>; // 10 min +SQL> SPARQL CLEAR GRAPH <http://pangenome-marker.genenetwork.org>; +SQL> ld_dir('/export/data/virtuoso/ttl','pangenome-markers.n3','http://pangenome-marker.genenetwork.org'); +SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10; +``` + +For pangenomes we have a marker file, a QTL file + +As a test, fetch a table of the traits with their SNPs + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: 
<http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> + +SELECT * FROM <http://pangenome-mapped.genenetwork.org> WHERE { +?traitid a gnt:mappedTrait; + gnt:run gn:test . +?snp gnt:mappedSnp ?traitid ; + gnt:locus ?locus ; + gnt:lodScore ?lod ; + gnt:af ?af . +?locus rdfs:label ?nodeid ; + gnt:chr ?chr ; + gnt:pos ?pos . +FILTER (contains(?nodeid,"Marker") && ?pos < 1000) +} LIMIT 100 +``` + +OK, we are ready to run a little workflow. First create a sorted list of IDs. + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> + +SELECT DISTINCT ?trait FROM <http://pangenome-mapped.genenetwork.org> WHERE { +?traitid a gnt:mappedTrait; + gnt:run gn:test ; + gnt:traitId ?trait. +} +``` + +See also + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/list-traits.sparql + +Sort that list and save as 'pan-ids-sorted.txt'. Next run + +``` +../../bin/workflow/qtl-detect-batch-run.sh +``` + +and load those in virtuoso. List new QTL + +``` +SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE { + ?traitid a gnt:mappedTrait ; + gnt:traitId ?t . + MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL + # ?traitid gnt:run gn:test . 
# use if you want the new QTL + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlLOD ?lod ; + gnt:qtlStart ?s ; + gnt:qtlStop ?e . + ?qtl gnt:mappedSnp ?snp . + FILTER (?t = "10002" && ?lod >= 5.0 ) . +} LIMIT 100 +``` + +# Prior work + +For the first traits (presented at CTC'25) gemma was run as + +``` +echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..." +gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \ + -p ${PHENO_FILE} \ + -gk \ + -o percentile_result > percentile.kinship.143.txt + +echo "[$(date)] Kinship matrix calculation completed for PERCENTILE." +echo "[$(date)] Starting association analysis for PERCENTILE..." +gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \ + -p ${PHENO_FILE} \ + -k ./output/percentile_result.cXX.txt \ + -lmm 4 \ + -maf 0.05 \ + -o percentile_association > percentile.assoc.143.txt +``` + +Note no LOCO. + +The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything in RAM, it is a bit large for my workstation. I opted to use tux04 since no one is using it. Had to reboot the machine because it is unreliable and had crashed. + +There I rebuilt gemma and set up a first run: + +``` +tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$ +/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk +``` + +Without LOCO this took about 18 minutes (186% CPU), 110Gb of RAM. We ought to work on this ;) Next + +``` +/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05 +``` + +To run gemma on the current 23M BXD pangenome derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). 13K traits would be 43 months on a single machine. We'll need something better. As Rob writes: + +> The huge majority of variants will have r2 of 1 with hundreds ir thousands of neighbors. 
This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions. + +A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows: + +``` +chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle l_mle p_wald p_lrt p_score +-9 A1-0 -9 0 A T 0.171 -nan -nan -nan 1.000000e+05 1.000000e+05 -nan -nan -nan +-9 A2-0 -9 0 A T 0.170 -nan -nan -nan 1.000000e+05 1.000000e+05 -nan -nan -nan +``` + +Turns out I was using the wrong pheno file. Let's try again. + +``` +/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05 +``` + +As a check I can diff against the original output. So, I replicated the original run! It also ran faster at 400% CPU in 35 minutes. + +(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things). + +## Moving to gemma-wrapper + +gemma-wrapper has extra facilities, such as LOCO and caching and lmdb output. Last time we used it in + +=> ../genetics/systems/mariadb/precompute-publishdata + +in a guix container it looked like + +``` +#! /bin/env sh + +export TMPDIR=./tmp +curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json +jq ".[] | .Id" < bxd-publish.json > ids.txt +./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json + +for id in $(cat ids.txt) ; do + echo Precomputing $id + if [ ! 
-e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then + curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json + ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt + ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json + fi +done +``` + +Let's try running the big stuff instead: + +``` +./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz -gk +``` + +## Individuals + +gemma does not really track individuals. The order of genotype columns should just be the same as in the pheno file. +In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following + +``` +{ + "type": "gn-geno-to-gemma", + "genofile": "BXD.geno", + "samples": [ + "BXD1", + "BXD2", + "BXD5", +... + ], + "numsamples": 237, + "header": [ + "# File name: BXD_experimental_DGA_7_Dec_2021", +... +``` + +To get this + +``` +cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e s,$,\"\,,| cut -f 1 143samples.pc-list.tsv|sed -e s,_.\*,,|sed -e s,^,\",|sed -e "s,$,\"\\,," > bxd_inds.list.txt +"BXD100", +"BXD101", +"BXD102", +``` + +Next I turned it into a JSON file by hand as 'bxd_inds.list.json'. + +## Markers + +With GEMMA marker names are listed in the geno file. GEMMA also can use a SNP file that gives the chromosome and location. +Without the SNP file gemma-wrapper complains it needs the SNP/marker annotation file. This is logical because for LOCO it needs to know what chromosome a marker is on. + +The next step is to take the nodes file and extract all rows from the genotype file that match nodes with chromosomes defined. Andrea is going to deliver all positions for all nodes, but for now we can use what we have. 
Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt: + +``` +mm10#1#chr3 23209565 93886997 +mm10#1#chr3 23209564 93886999 +mm10#1#chr3 23209563 93887016 +... +``` + +In the genotype file we find, for example + +``` +A23209564-0, A, T, 1.919141867395325, 0.9306930597711228, 1.8201319833577734, 0.7607260422339468, 1.427392726736106, 1.2310230984252724, 1.6633662444541875, 0.6105610229068721, ... +``` + +bit funny, but you get the idea. So we can take the mm10 file and write out the genotype file again for all matching nodes with a matching SNP file that should contain for this node: + +``` +A23209564-0 93886999 3 +``` + +To rewrite above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like + +``` +#! ruby + +ARGF.each_line do |line| + tag,name,pos = line.strip.split(/\t/) + tag =~ /chr(.*)$/ + chrom = $1 + print "A#{name}-0\t#{pos}\t#{chrom}\n" +end +``` + +Now, another problem is that not all SNPs have a position in the genotype file (yet). As we can't display them I can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. That throwaway script looks like + +``` +bimbamfn = ARGV.shift +snpfn = ARGV.shift +snps = {} +open(snpfn).each_line do |snpl| + name = snpl.split(/\t/)[0] + snps [name] = 1 +end +open(bimbamfn).each_line do |line| + marker = line.split(/[,\s]/)[0] + if snps[marker] + print line + end +end +``` + +takes a while to run, but as this is a one-off that does not matter. Reducing the file leads to 13667900 markers with genotypes. The original SNP file has 14927024 lines. Hmmm. The overlap is therefor not perfect (we have more annotations than genotypes now). To check this I'll run a diff. 
+ +``` +cut -f 1 -d "," 143samples.percentile.bimbam.bimbam-reduced > 143samples.percentile.bimbam.bimbam-reduced-markers +sort 143samples.percentile.bimbam.bimbam-reduced-markers > markers-sorted.txt +diff --speed-large-files 143samples.percentile.bimbam.bimbam-reduced-markers markers-sorted.txt +< A80951-0 +< A80952-0 +< A80953-0 +... +cut -f 1 snps.txt |sort > snps-col1-sorted.txt +diff --speed-large-files snps-col1-sorted.txt markers-sorted.txt +241773d228996 +< A10314686-0 +241777d228999 +< A10314689-0 +241781d229002 +< A10314692-0 +grep A10314686 snps-col1-sorted.txt markers-sorted.txt +snps-col1-sorted.txt:A10314686-0 +snps-col1-sorted.txt:A10314686-0 +markers-sorted.txt:A10314686-0 +``` + +Ah, we have duplicate annotation lines in the SNP file. + +``` +grep A10314686-0 snps.txt +A10314686-0 20257882 8 +A10314686-0 20384895 8 +grep A10314692-0 snps.txt +A10314692-0 20257575 8 +A10314692-0 20384588 8 +``` + +So the same node is annotated as two SNPs. This is due to the node covering multiple individuals (paths). Turns out a chunk of them map on different chromosomes too. I think we ought to drop them until we have a better understanding of what they represent (they may be mismapping artifacts). + +I updated the script. Now I see it skips A280000 because there is no marker annotation for that node. Good. Also the number of genotype markers got further reduced to 13209385. +I checked the gemma code and the SNP annotation file should match the genotype file line for line. Unsurprising, perhaps, but now I need to rewrite both. After adapting the script we now have two files with the same number of lines. 
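The clean-up described above can be sketched roughly as follows. This is not the actual throwaway script: `match_markers` is a hypothetical helper, and it assumes the formats shown earlier (tab-separated name, position and chromosome in the SNP file, and the marker name as the first comma-separated field in the genotype file). It drops markers that have no annotation or more than one annotation line and returns genotype and annotation lines in the same order, so the two outputs match line for line. The genotype values in the sample data are made up.

```ruby
# Hypothetical helper: keep only markers with exactly one annotation line and
# return genotype and SNP lines in matching order.
def match_markers(snp_lines, geno_lines)
  counts = Hash.new(0)   # number of annotation lines per marker
  annot = {}             # annotation line per marker
  snp_lines.each do |line|
    name = line.split(/\t/)[0]
    counts[name] += 1
    annot[name] = line
  end
  geno_out = []
  snp_out = []
  geno_lines.each do |line|
    marker = line.split(/[,\s]/)[0]
    next unless counts[marker] == 1   # skip missing and duplicated annotations
    geno_out << line
    snp_out << annot[marker]
  end
  [geno_out, snp_out]
end

# Annotation lines use positions quoted in the text; A10314686-0 is duplicated.
snps = [
  "A23209564-0\t93886999\t3",
  "A10314686-0\t20257882\t8",
  "A10314686-0\t20384895\t8"
]
# Genotype values are illustrative only; A280000-0 has no annotation.
genos = [
  "A23209564-0, A, T, 1.91, 0.93",
  "A10314686-0, A, T, 0.50, 1.20",
  "A280000-0, A, T, 1.00, 0.00"
]

geno_out, snp_out = match_markers(snps, genos)
puts geno_out
puts snp_out
```

Only A23209564-0 survives in this sample, once in each output, which is the line-for-line correspondence gemma expects between the two files.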
+ +Rerunning with the new files: + +``` +gemma -g new-genotypes.txt -p pheno_filtered_143.txt -gk +gemma -g new-genotypes.txt -p pheno_filtered_143.txt -k output/result.cXX.txt -maf 0.05 -lmm 4 -a snps-matched.txt +``` + +And, even though the results differ somewhat in size -- due to the different number of markers -- the results look very similar to what was produced before. Good! + +Now we have confirmation and all the pieces we can run the same set with gemma-wrapper and LOCO. + +## gemma-wrapper + +The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix. I replaced that by using our pfff hashing for larger files. + +``` +/bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json +``` + +For this computation each gemma maxed out at 80Gb RAM (total 640Gb). We are really hitting limits here. In the near future we need to check why so much data is retained. As we only have 150 individuals it is a marker thing. + +``` +/bin/time -v ../bin/gemma-wrapper -v --json --lmdb --loco --input K.json -- -g new-genotypes.txt -p pheno_filtered_143.txt -a snps-matched.txt -debug -maf 0.05 -lmm 9 > GWA.json +``` + +This time gemma requires only 25Gb per chromosome, so we can run it in one go in RAM on this large server. Much of the time is spent in IO, so I think that when we start using mmap (lmdb) we can speed it up significantly. +gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores. + +Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one. + +After some fixes we now have loco in an lmdb output. The mdb file comes in at 693Mb. That will make 9TB for 13K traits. 
Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (4.0) made the size 17Mb. That makes it 215Gb total. Which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now down to 7Mb and 3.2Mb compressed. That'll total less than 100Gb for 13K traits. Good. + +## Final hookup + +Now that gemma-wrapper works (and test results are confirmed) we have to wire it up to fetch traits from the DB. We also have to make sure the trait values align with the individuals in the genotype file. Earlier I was running the script gemma-batch-run.sh: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-batch-run.sh + +which looks like: + +``` +export TMPDIR=./tmp +curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json +jq ".[] | .Id" < bxd-publish.json > ids.txt +# ---- Compute GRM +./bin/gemma-wrapper --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json + +# ---- For all entries run LMM +for id in $(cat ids.txt) ; do + echo Precomputing $id + if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then + curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json + ./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json + fi +done +``` + +We already have ids.txt and the GRM. What is required is the trait values from the DB. What we need to do is run gn-guile somewhere with access to the DB. Also I need to make sure the current gemma-wrapper tar-balls up the result. + +OK, we are running. Looks like the smaller datasets only use 11Gb RES RAM per chromosome. Which means we can run two computes in parallel on this machine. + +The first run came through! 
I forgot the --reduce flag, so it came as 190Mb. I'll fix that. 34 individuals ran in 7 minutes. +We are currently running at one trait per 6 min. We can double that on this machine. + +The following puzzles me a bit + +``` +## number of analyzed individuals = 31 +## number of covariates = 1 +## number of phenotypes = 1 +## leave one chromosome out (LOCO) = 14 +## number of total SNPs/var = 13209385 +## number of SNPS for K = 12322657 +## number of SNPS for GWAS = 886728 +## number of analyzed SNPs = 13122153 +``` + +why is the number of SNPs for GWAS low? Perhaps a threshold of 10% for maf is a bit stringent. See below. + +Anyway, we are running traits and the first 500 we'll use for analysis. + +Meanwhile I'll look at deploying on octopus and maybe speeding up GEMMA. See + +=> issues/genetics/speeding-up-gemma + +# MAF + +GEMMA has a MAF filter. For every SNP a maf is computed by adding the geno value: + +``` +maf += geno +``` + +when all genotype values are added up MAF is divided by 2x the number of individuals (minus missing). + +``` +maf /= 2.0 * (double)(ni_test - n_miss); +``` + +and this is held against the maf passed on the command line. The 2.0 therefore assumes all values are between 0 and 2. + +Actually I now realise we are using LOCO. So the number of SNPs for GWAS is the number on one chromosome. That makes sense! +Still we have to be careful about the MAF range. In our genotype file the values are between 0 and 2. So that is fine in itself. + +# RDF + +Next step is to generate RDF. The SNP annotation was slow, so I moved that to lmdb. Parsing 400 traits now takes 3 minutes. The RDF file is under 1Gb and the SNP annotation RDF is 330Mb. Not too bad! + +``` +guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891 +SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-snps-400.n3','http://pan-test.genenetwork.org'); +Done. -- 3 msec.
+# for testing the validity and optional delete problematic ones: +SQL> SELECT * FROM DB.DBA.load_list; +SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ; +# commit changes +SQL> rdf_loader_run (); +SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-400.n3','http://pan-test.genenetwork.org'); +SQL> rdf_loader_run (); +SQL> checkpoint; +Done. -- 16 msec. +SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10; +34200686 +``` + +Or in the web interface: + +``` +SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } +``` + +## Query + +The RDF is formed as: + +``` +gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedTrait; + rdfs:label "GEMMA BXDPublish trait 10383 mapped with LOCO (defaults)"; + gnt:trait gn:publishXRef_10383; + gnt:loco true; + gnt:run gn:test; + gnt:time "2025/11/10 08:12"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "10383"; + gnt:nind 14; + gnt:mean 18.0; + gnt:std 10.9479; + gnt:skew 0.3926; + gnt:kurtosis -1.1801; + skos:altLabel "BXD_10383"; + gnt:filename "0233fa0cf277ee7d749de08b32f97c8be6478639-BXDPublish-10383-gemma-GWA.tar.xz"; + gnt:hostname "napoli"; + gnt:user "wrk". +gn:A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639; + gnt:locus gn:A8828461_0; + gnt:lodScore 4.8; + gnt:af 0.536; + gnt:effect -32.859. +``` + +and SNPs are annotated as + +``` +gn:A8828461_0 a gnt:marker; + rdfs:label "A8828461-0"; + gnt:chr "1"; + gnt:pos 3304440. +gn:A8828464_0 a gnt:marker; + rdfs:label "A8828464-0"; + gnt:chr "1"; + gnt:pos 3304500. 
+``` + +To get all tested traits you can list: + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> + +SELECT * FROM <http://pan-test.genenetwork.org> WHERE { +?trait a gnt:mappedTrait; + gnt:run gn:test ; + gnt:traitId ?traitid ; + gnt:kurtosis ?kurtosis . +} limit 100 +``` + +To get all SNPs for trait "10001" + +``` +SELECT * FROM <http://pan-test.genenetwork.org> WHERE { +?traitid a gnt:mappedTrait; + gnt:run gn:test ; + gnt:traitId "10381" . +?snp gnt:mappedSnp ?traitid ; + gnt:locus ?locus . +?locus rdfs:label ?nodeid ; + gnt:chr ?chr ; + gnt:pos ?pos . +} +``` + +Lists: + +``` +| http://genenetwork.org/id/A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 | "A8828461-0" | "1" | 3304440 | +``` + +## Scoring/annotating QTL + +Next step is annotating the QTL in RDF. Earlier I wrote a script rdf-analyse-gemma-hits. It uses rapper to read two RDF files (two runs) and annotates the QTL and differences between the files. The code is not pretty: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/bin/rdf-analyse-gemma-hits.rb#L1 + +The supporting library is a bit better: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/lib/qtlrange.rb#L1 + +Basically we have a QTL locus (QLocus) that tracks chr,pos,af and lod for each hit. 
+QRange is a set of QLocus which also tracks some stats chr,min,max,snps,max_af,lod. +It can compute whether two QTL (QRange) overlap. +Next we have a container that tracks the QTL (QRanges) on a chromosome. + +Finally there is a diff function that can show the differences on a chromosome (QRanges) for two mapped traits. + +Maybe the naming could be a bit better, but the code is clear as it stands. One thing to note is that we use a fixed distance MAX_SNP_DISTANCE_BPS of 50M that decides whether a SNP falls in the same QTL. As Rob and Flavia pointed out, it would be worth trying to base it on dropping LOD scores (1.5 from the top). + +So, the library is fine, but the calling program is not great. The reason is that I parse RDF directly, teasing apart the logic we do in above SPARQL. I track state in dictionaries (hashes of hashes) and the result ends up convoluted. Also a lot of state in RAM. I chose RDF direct parsing because it makes for easier development. The downside is that I need to parse the whole file to make sure I have everything related to a trait. To fetch SNP results from SPARQL directly is slow too. I am in a bind.
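
The QLocus/QRange idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in lib/qtlrange.rb: the names `Locus`, `SketchRange` and `build_ranges` are hypothetical, positions are assumed to be in base pairs, and the 50M distance is simply the value quoted in the text.

```ruby
# Hypothetical sketch of distance-based QTL binning; not the real qtlrange.rb.
MAX_SNP_DISTANCE_BPS = 50_000_000 # fixed distance quoted in the text (assumed bp)

# One hit: marker id, chromosome, position (bp), allele frequency, LOD.
Locus = Struct.new(:id, :chr, :pos, :af, :lod)

# A run of nearby loci on one chromosome.
class SketchRange
  attr_reader :chr, :min, :max, :loci

  def initialize(locus)
    @chr = locus.chr
    @min = @max = locus.pos
    @loci = [locus]
  end

  # True when the locus is on this chromosome and within the fixed distance.
  def near?(locus)
    locus.chr == chr &&
      locus.pos >= min - MAX_SNP_DISTANCE_BPS &&
      locus.pos <= max + MAX_SNP_DISTANCE_BPS
  end

  # Add a locus and widen the range boundaries if needed.
  def add(locus)
    @loci << locus
    @min = [min, locus.pos].min
    @max = [max, locus.pos].max
  end

  def lod_max
    loci.map(&:lod).max
  end

  # Two ranges overlap when they share a chromosome and their intervals intersect.
  def overlaps?(other)
    chr == other.chr && min <= other.max && other.min <= max
  end
end

# Walk the loci in chromosome/position order and bin them into ranges.
def build_ranges(loci)
  ranges = []
  loci.sort_by { |l| [l.chr, l.pos] }.each do |locus|
    range = ranges.find { |r| r.near?(locus) }
    if range
      range.add(locus)
    else
      ranges << SketchRange.new(locus)
    end
  end
  ranges
end
```

Swapping the fixed distance for a LOD-drop rule would only change `near?`, which is one reason to keep the binning decision in a single predicate.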
+ +Using curl: + +``` +time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query=" +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> +SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId ?trait ; gnt:kurtosis ?k . } +``` + + +``` +time curl -G http:///sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query=" +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> +SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId \"10001\" ; gnt:kurtosis ?k . ?snp gnt:mappedSnp ?traitid ; gnt:locus ?locus . } +" > test.out +real 0m1.612s +user 0m0.020s +sys 0m0.000s +``` + +To get the trait info for 400 traits takes a second. 
So, that is no big deal. To get the 6K SNPs for one trait also takes a second. Hmmm. At that rate the full set of traits takes hours, compared to the minutes for direct RDF parsing. Before lmdb comes to the rescue we should try running it on the virtuoso server itself. For curl we get 0.5s. Which makes it two hours for 13K traits. But when we run the query using isql it runs in 70ms which totals 15 minutes. That is perfectly fine for running the whole set! + +One way is to simply script isql from the command line. Meanwhile, it also turns out the ODBC interface can be used from python or ruby. Here is an example in R: + +=> https://cran.stat.auckland.ac.nz/web/packages/virtuoso/index.html + +Not sure if that is fast enough, but perhaps worth trying. + +So, now we have a way to query the data around a trait in seconds. This means I can rewrite the QTL generator to go by trait. This also allows for a quick turnaround during development (good!). Also I want two scripts: one for computing the QTL and one for annotating the differences. + +Alright. The first script should simply fetch a trait with its markers from SPARQL and score the QTL (as RDF output).
The new script is at + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/sparql-qtl-detect.rb + +First, the query for one trait looks like: + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> + +SELECT ?lod ?af ?nodeid ?chr ?pos FROM <http://pan-test.genenetwork.org> WHERE { +?traitid a gnt:mappedTrait; + gnt:run gn:test ; + gnt:traitId "10002" . +?snp gnt:mappedSnp ?traitid ; + gnt:locus ?locus ; + gnt:lodScore ?lod ; + gnt:af ?af . +?locus rdfs:label ?nodeid ; + gnt:chr ?chr ; + gnt:pos ?pos . +} ORDER BY DESC(?lod) +``` + +rendering some 22K markers for trait 10002 as a TSV: + +``` +"lod" "af" "nodeid" "chr" "pos" +7.5 0.547 "A13459298-0" "8" 98658490 +7.1 0.154 "A13402313-0" "8" 96798487 +7 0.432 "A13446492-0" "8" 97355019 +7 0.263 "A13387873-0" "8" 94934820 +7 0.585 "A4794343-0" "1" 172265488 +... +``` + +Earlier with precompute for trait 10002 we got: + +``` +[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]} +[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]} +``` + +so the hits are in range, but the LOD may be inflated because of the number of markers. 
Anyway, at this point we are merely concerned with scoring QTL. The first script is simply: + +``` +qtls = QTL::QRanges.new("10002","test") +CSV.foreach(fn,headers: true, col_sep: "\t") do |hit| + qlocus = QTL::QLocus.new(hit["nodeid"],hit["chr"],hit["pos"].to_i,hit["af"].to_f,hit["lod"].to_f) + qtls.add_locus(qlocus) +end +print qtls +``` + +and prints a long list of QTL, most containing a single hit. + +``` +[10002,test] =>{"1"=>[#<QRange Chr1 𝚺1 3099543..3099543 AF=0.583 LOD=5.8..5.8>, #<QRange Chr1 𝚺1 65908328..65908328 AF=0.627 LOD=5.7..5.7>, #<QRange Chr1 𝚺1 81604902..81604902 AF=0.451 LOD=5.5..5.5>, #<QRange Chr1 𝚺2 85087169..85087177 AF=0.781 LOD=5.5..5.6>, #<QRange Chr1 𝚺1 93740525..93740525 AF=0.762 LOD=6.5..6.5>, #<QRange Chr1 𝚺1 114086053..114086053 AF=0.568 LOD=5.7..5.7>,... +``` + +For trait 10002 tweaking thresholds and rebinning we get + +``` +#<QRange Chr8 𝚺2 34.303454..35.675301 AF=0.571 LOD=5.7..5.8> +#<QRange Chr8 𝚺621 91.752748..102.722635 AF=0.663 LOD=5.6..7.5> +#<QRange Chr1 𝚺16 65.908328..175.232335 AF=0.781 LOD=5.6..7.0> +#<QRange Chr4 𝚺5 56.498971..126.135422 AF=0.657 LOD=5.6..6.4> +#<QRange Chr12 𝚺3 23.037869..58.306731 AF=0.643 LOD=5.8..6.2> +#<QRange Chr10 𝚺2 13.442071..13.442088 AF=0.641 LOD=5.8..6.0> +#<QRange Chr10 𝚺3 94.246536..103.438796 AF=0.608 LOD=5.9..6.2> +#<QRange Chr3 𝚺2 47.644513..82.451061 AF=0.548 LOD=5.7..6.2> +#<QRange Chr9 𝚺2 97.445077..120.263403 AF=0.717 LOD=5.8..5.8> +#<QRange Chr11 𝚺2 27.4058..56.30011 AF=0.559 LOD=5.7..5.7> +``` + +with a LOD>5.5 cut-off. That seems justified because LOD scores are inflated.
Compare this with the earlier mapping using 'traditional' genotypes: + +``` +[10002,LOCO] =>{ +"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, + #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], +"8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]} +``` + +we can see the significance of chr8 has gone up with pangenome mapping (relative to chr1) and we find 2 QTL now on chr8, a new one to the left. Chr1 looks similar. We have some other candidates that may or may not be relevant (all narrow!). + +Note this *is* a random trait(!) and suggests the landscape of QTLs will change pretty dramatically. Note also that Andrea will give new genotypes and smoothing to follow. But it is encouraging. + +I played a bit with the QTL output, and for now settled on tracking nodes that have a LOD>5.0. We drop QTL based on the following: + +``` +qtl.lod.max < 6.0 or (qtl.lod.max < 7.5 - qtl.snps.size/2) +``` + +I.e. a single SNP QTL has to have a LOD of 7.0. A 2-SNP QTL has to have a LOD of 6.5. This begets + +``` +[10002,test] =>{ +"1"=>[#<QRange Chr1 𝚺69 3.099543..192.718161 AF=0.781 LOD=5.1..7.0>], +"4"=>[#<QRange Chr4 𝚺12 56.498971..147.86044 AF=0.676 LOD=5.1..6.4>], +"8"=>[#<QRange Chr8 𝚺2774 34.303454..116.023702 AF=0.899 LOD=5.1..7.5>], +"10"=>[#<QRange Chr10 𝚺7 82.334108..105.062097 AF=0.623 LOD=5.1..6.2>], +"12"=>[#<QRange Chr12 𝚺9 21.707644..72.57041 AF=0.77 LOD=5.1..6.2>]} +``` + +which are all worth considering (I think). Obviously we could annotate all QTL in RDF triples and filter on that using SPARQL. But this makes processing a bit faster without having to deal with too much noise. We can fine tune later. + +Now two more steps to go: + +* [X] Fetch all mapped traits using SPARQL and write RDF +* [X] Compare QTL between datasets and annotate new hits + +## Fetch all mapped traits + +``` +SELECT * FROM <http://pan-test.genenetwork.org> WHERE { +?traitid a gnt:mappedTrait; + gnt:run gn:test ; + gnt:traitId "10002" . 
+?snp gnt:mappedSnp ?traitid ; + gnt:locus ?locus ; + gnt:lodScore ?lod ; + gnt:af ?af . +?locus rdfs:label ?nodeid ; + gnt:chr ?chr ; + gnt:pos ?pos . +} ORDER BY DESC(?lod) +``` + +The first step is to fetch this data. Let's try SPARQL over the web first. + +## Compare QTL sets + +The previous code I wrote to compare QTLs essentially walks the QTLs and annotates a new QTL if there is no overlap between the two sets. Again, this code is too convoluted: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/18e7a3ac8a11becba84325499116621ad095f28e/lib/qtlrange.rb#L190 + +The principle is straightforward, however. The code for reading the SPARQL output for a trait is + +``` + CSV.foreach(fn,headers: true, col_sep: "\t") do |hit| + trait_id = hit["traitid"] if not trait_id + lod = hit["lod"].to_f + if lod > 5.0 # set for pangenome input + qlocus = QTL::QLocus.new(hit["snp"],hit["chr"],hit["pos"].to_f/10**6,hit["af"].to_f,lod) + qtls.add_locus(qlocus) + end + end +``` + +So we can use SPARQL to build two sets on the fly and then run the diff. + +Actually, when thinking about this I realised it should not be too hard to do in SPARQL to find the 'new' QTL. + +``` + +SELECT * WHERE { +?traitid a gnt:mappedTrait ; + gnt:traitId "10002" . +} +http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d +http://genenetwork.org/id/HK_trait_BXDPublish_10002_gemma_GWA_hk_assoc_txt +http://genenetwork.org/id/GEMMAMapped_test_LOCO_BXDPublish_10002_gemma_GWA_82087f23 +``` + +lists the three versions of compute for traits. To fetch all QTL for first mapping: + +``` +SELECT ?qtl ?lod ?chr ?start ?stop (count(?snp) as ?snps) WHERE { +?traitid a gnt:mappedTrait ; + gnt:traitId "10002" . +?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?start ; + gnt:qtlStop ?stop ; + gnt:qtlLOD ?lod . +?qtl gnt:mappedSnp ?snp . +} +``` + +gets 3 QTL. Now I did not store HK in RDF, but to show the filtering principle we can fetch two traits and compare QTL. 
+The following gets two QTL from trait "10002" on CHR1 and holds that against that of trait "10079": + +``` +SELECT ?t ?s1 ?e1 ?t2 ?s2 ?e2 WHERE { + ?traitid a gnt:mappedTrait ; + gnt:traitId ?t . + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?s1 ; + gnt:qtlStop ?e1 . + { + SELECT * WHERE { + ?tid a gnt:mappedTrait ; + gnt:traitId "10079" ; + gnt:traitId ?t2 . + ?qtl2 gnt:mappedQTL ?tid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?s2 ; + gnt:qtlStop ?e2 . + } + } + FILTER (?t = "10002") . +} LIMIT 10 + +"10002",171.172,183.154,"10079",172.235,172.235 +"10002",72.2551,73.3771,"10079",172.235,172.235 +``` + +Note we pivot on two traits and one chromosome, so we find all pairs. +To say if a QTL is *new* or different we can add another FILTER + +``` +FILTER ((?s2 > ?s1 && ?e2 > ?e1) || (?s2 < ?s1 && ?e2 < ?e1)) . +"t","s1","e1","t2","s2","e2" +"10002",72.2551,73.3771,"10079",172.235,172.235 +``` + +that says that this ?qtl2 does not overlap with ?qtl. I.e. here it is a new QTL! + +This new insight means we should store *all* QTL in RDF, including the single SNP ones, because it is easy to filter on them. Note that there may be a more elegant way to query traits pairwise. This is just the first thing that worked. It may need more tuning if there are more than two QTL on a chromosome. E.g. the comparison between 10002 and 10413 finds: + +``` +"t","s1","e1","t2","s2","e2" +"10002",72.2551,73.3771,"10413",32.3113,42.4624 +"10002",171.172,183.154,"10413",171.04,171.041 +"10002",171.172,183.154,"10413",32.3113,42.4624 +"10002",72.2551,73.3771,"10413",171.04,171.041 +``` + +I.e. it does find new QTL here and you still need to do a little set analysis. In words you should be able to "remove all overlapping QTL from a chromosome". Maybe we can filter the other way - select overlapping QTL and remove those from the result set. + +``` +BIND ((?s2 >= ?s1 && ?e2 <= ?e1) || (?s1 >= ?s2 && ?e1 <= ?e2) as ?overlap) .
+"10002",171.172,183.154,"10079",172.235,172.235,1 +"10002",72.2551,73.3771,"10079",172.235,172.235,0 +``` + +now drop all ?t's that are overlapping. It appears to work with: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/show-qtls-two-traits.sparql + +I'll need to test it on the pangenome set. + +# Listing QTL + +To get all QTL from a run you can use something like + +``` +SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE { + ?traitid a gnt:mappedTrait ; + gnt:traitId ?t . + MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL + # ?traitid gnt:run gn:test . # use if you want the new QTL + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlLOD ?lod ; + gnt:qtlStart ?s ; + gnt:qtlStop ?e . + ?qtl gnt:mappedSnp ?snp . + FILTER (?t = "10002" && ?lod >= 5.0 ) . +} LIMIT 100 +``` + +Note we filter on a trait name and LOD score. + +For panQTL (gnt:run == gn:test) this results in + +``` +"t" "lod" "snps" "chr" "start" "end" +"10002" 6.4 3 "15" 87.671663 98.028911 +"10002" 6.4 12 "4" 56.498971 147.86044 +"10002" 7 69 "1" 3.099543 192.718161 +"10002" 7.5 2774 "8" 34.303454 116.023702 +"10002" 6.2 7 "10" 82.334108 105.062097 +"10002" 6.2 2 "3" 47.644513 82.451061 +"10002" 6.2 1 "3" 130.145235 130.145235 +"10002" 6 2 "10" 13.442071 13.442088 +"10002" 6.2 9 "12" 21.707644 72.57041 +``` + +For the traditional genotypes (gnt:run != gn:test) + +``` +"t" "lod" "snps" "chr" "start" "end" +"10002" 5.3 91 "1" 171.172 183.154 +"10002" 5.1 15 "1" 72.2551 73.3771 +``` + + +# Listing SNPs + +Now we have all QTLs in the DB, as well as underlying SNPs, one interesting question to ask is what SNPs are repeated across our traits. This, if you remember, is the key idea of reversed genetics. +Of course, with our pangenome-derived genotypes, we now have thousands of SNPs per trait. Let's see if we can rank them by number of traits. 
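
One way to do that ranking, once the (trait, SNP) pairs have been pulled out of SPARQL, is sketched below. This is only an illustration under assumptions: the helper name `rank_snps_by_traits` and the input shape (an array of `[trait_id, snp_id]` pairs) are hypothetical and not part of gemma-wrapper.

```ruby
require "set"

# Hypothetical helper: rank SNPs by the number of distinct traits they hit.
# hits is an array of [trait_id, snp_id] pairs, e.g. rows from a SPARQL TSV.
def rank_snps_by_traits(hits)
  # Collect the set of traits seen for every SNP (a Set drops duplicate rows).
  traits_per_snp = Hash.new { |h, k| h[k] = Set.new }
  hits.each { |trait_id, snp_id| traits_per_snp[snp_id] << trait_id }

  # Most widely shared SNPs first; ties are ordered by SNP id.
  traits_per_snp
    .map { |snp_id, traits| [snp_id, traits.size] }
    .sort_by { |snp_id, count| [-count, snp_id] }
end
```

With millions of hits it may be cheaper to do the counting inside the triple store with a GROUP BY; the in-memory version is mostly useful for quick checks on a subset of traits.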
 + +For our 1000 traits we map about 7.7M SNPs with a LOD>5 + + +# Using sparql from emacs + +Note: if you are doing SPARQL quite a bit, I recommend using sparql-mode in emacs! It is easy, faster and you can use git :) + +=> https://github.com/ljos/sparql-mode + +``` +M-x sparql-query-region [ENTER] http://sparql-test.genenetwork.org/sparql/ [ENTER] +``` diff --git a/topics/genome-browser/hoot-genome-browser.gmi b/topics/genome-browser/hoot-genome-browser.gmi new file mode 100644 index 0000000..219fda5 --- /dev/null +++ b/topics/genome-browser/hoot-genome-browser.gmi @@ -0,0 +1,21 @@ +# Hoot Genome Browser + +Together with Andrew we have created a genome browser that runs in WASM. Safari recently (202509) added critical hoot support, so we should have it in all important browsers now! + +With this task tracker we want to embed the existing browser in GN and add tracks for mapped QTL. + +# Tags + +* assigned: andrewt, pjotrp +* priority: high +* status: open, in progress +* keywords: mapping + +# Tasks + +* [ ] Embed hoot browser in GN2 as a pilot +* + [ ] Guix package for JS and minimal JBrowse2? +* + [ ] Embedding code in GN2 +* [ ] Create two tracks for QTL comparisons - vector data available +* [ ] Create BED file for matched QTL - use SPARQL live? +* [ ] Annotated SNPs diff --git a/topics/gn-learning-team/next-steps.gmi b/topics/gn-learning-team/next-steps.gmi new file mode 100644 index 0000000..b427923 --- /dev/null +++ b/topics/gn-learning-team/next-steps.gmi @@ -0,0 +1,48 @@ +# Next steps + +Wednesday we had a wrap-up meeting of the gn-learning efforts. + +## Data uploading + +The goal of these meetings was to learn how to upload data into GN. In the process Felix has become the de facto uploader, next to Arthur. A C. elegans dataset was uploaded and Felix is preparing + +* More C.
elegans +* HSRat +* Killifish +* Medaka + +Updates are here: + +=> https://issues.genenetwork.org/tasks/felixl + +We'll keep focussing on that work and hopefully we'll get more parties interested in doing some actual work down the line. + +## Hosting GN in Wageningen + +Harm commented that he thought these meetings were valuable, particularly because we learnt a lot about GN ins and outs. Harm suggests we focus on hosting GN in Wageningen for C. elegans and Arabidopsis. +Pjotr says that is a priority this year, even if we start on a privately hosted machine in NL. Wageningen requires Docker images and Bonface says that is possible - with some work. So: + +* Host GN in NL +* Make GN specific for C.elegans and Arabidopsis - both trim and add datasets +* Create Docker container +* Host Docker container in Wageningen +* Present to other parties in Wageningen + +Having above datasets will help this effort succeed. + +## AI + +Harm is also very interested in the AI efforts and wants to pursue that in the context of above server - i.e., functionality arrives when it lands in GN. + +## Wormbase + +Jameson suggests we can work with Wormbase and the CaeNDR folks once we have a running system. Interactive data analysis is very powerful and could run in conjunction with those sites. + +=> https://caendr.org/ +=> https://wormbase.org/ + +Other efforts are Flybase and Arabidopsis Magic which we can host, in principle. + +## Mapping methods + +Jameson will continue with his work on residuals. diff --git a/topics/gn-uploader/genome-details.gmi b/topics/gn-uploader/genome-details.gmi new file mode 100644 index 0000000..f8a12f6 --- /dev/null +++ b/topics/gn-uploader/genome-details.gmi @@ -0,0 +1,42 @@ +# Genome Details + +This file is probably misnamed.
+ +*TODO*: Update name once we know where this fits + +## Tags + +* type: documentation, doc, docs +* assigned: fredm +* priority: docs +* status: open +* keywords: gn-uploader, uploader, genome + +## Location + +### centiMorgan (cM) + +We no longer use centiMorgan in GeneNetwork + +From the email threads: + +``` +> … +> Sorry, we now generally do not use centimorgans. Chr 19 is 57 cM +> using markers that exclude telomeres in most crosses. +> … +``` + +and + +``` +> … +> I know that cM is a bit more variable because it's not a direct measurement, … +> … +``` + +### Megabasepairs (Mbp) + +The uploader will store any provided physical location values (in megabasepairs) in the +=> https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#Geno Geno table +specifically in the `Mb` field of that table. diff --git a/topics/gn-uploader/genotypes-assemblies-markers-and-genenetwork.gmi b/topics/gn-uploader/genotypes-assemblies-markers-and-genenetwork.gmi new file mode 100644 index 0000000..db0ddf3 --- /dev/null +++ b/topics/gn-uploader/genotypes-assemblies-markers-and-genenetwork.gmi @@ -0,0 +1,40 @@ +# Genotypes, Assemblies, Markers and GeneNetwork + +## Tags + +* type: documentation, docs, doc +* keywords: genotype, assembly, markers, data, database, genenetwork, uploader + +## Markers + +``` +The marker is the SNP… + +— Rob (Paraphrased) +``` + +SNPs (Single Nucleotide Polymorphisms) are specific locations of interest within the genome, where the pair of nucleotides can take different forms. + +A SNP and its immediate neighbourhood (a number of megabase pairs before and after the SNP) form a sequence that is effectively the marker, e.g. 
for mouse (Mus musculus) you could have the following sequence from the GRCm38 genome assembly (mm10): + +``` +GAGATAAAGATGGGTCCCTTGGCACAGGACTGGCCCACATTTCCaatataaattacaacaattttttttaaatttttaaaCAAAACAAGCATCTCACACAC/TTGAAAAAGAAGATGCATTCAAAGAAAATAGATGTTTCAATGTATTTAAGATAATCAAGAGATAACCATGACCATATCATGAGGAAACTTAAGAATTGGCA +``` + +where the position with `C/T` represents the SNP of interest and thus the marker. + +You can search this on the UCSC Genome Browser, specifically the +=> https://genome.ucsc.edu/cgi-bin/hgBlat BLAT search +to get the name of the marker, and some extra details regarding it. + +## Genome Assemblies + +The genome assembly used will "determine" the position of the marker on the genome — newer assemblies will (generally) give a better position accounting for more of the issues discovered in older assemblies. + +With most of the newer assemblies, the positions do not shift very drastically. + +## GeneNetwork + +Currently (September 2024), GeneNetwork uses the GRCm38 (mm10) assembly for mice. + +Unfortunately, since the system was built for mice, the tables (e.g. Geno table) do not account for the fact that you could have markers (and other data) from species other than Mus musculus. You thus have the Geno table with fields like `Mb_mm8`, `Chr_mm8` which are very mouse-specific. diff --git a/topics/gn-uploader/samplelist-details.gmi b/topics/gn-uploader/samplelist-details.gmi new file mode 100644 index 0000000..2e64d8a --- /dev/null +++ b/topics/gn-uploader/samplelist-details.gmi @@ -0,0 +1,17 @@ +# Explanation of how Sample Lists are handled in GN2 (and may be handled moving forward) + +## Tags + +* status: open +* assigned: fredm, zsloan +* priority: medium +* type: documentation +* keywords: strains, gn-uploader + +## Description + +Regarding the order of samples/strains, it can basically be whatever we decide it is. It just needs to stay consistent (like if there are multiple genotype files). 
It only really affects how the strains are displayed, and any other genotype files we use for mapping need to share the same order. + +I think this is the case regardless of whether it's strains or individuals (and both the code and files make no distinction). Sometimes it just logically makes sense to sort them in a particular way for display purposes (like BXD1, BXD2, etc), but technically everything would still work the same if you swapped those columns across all genotype files. Users would be confused about why BXD2 is before BXD1, but everything would still work and all calculations would give the same results. + +zsloan's proposal for handling sample lists in the future is to just store them in a JSON file in the genotype_files/genotype directory. diff --git a/topics/gn-uploader/types-of-data.gmi b/topics/gn-uploader/types-of-data.gmi new file mode 100644 index 0000000..1f53dec --- /dev/null +++ b/topics/gn-uploader/types-of-data.gmi @@ -0,0 +1,63 @@ +# Types of Data in GeneNetwork + +## Tags + +* assigned: +* priority: +* status: open +* type: documentation +* keywords: gn-uploader, uploader, genenetwork, documentation, doc, docs, data, data type, types of data + +## Description + +There are five (5) main types of data in GeneNetwork + +* Classical Phenotypes (PublishData) +* High Content Data +* Genotype Data +* Cofactors and Attributes +* Metadata + +### Classical Phenotypes + +This is usually low-content data e.g. body weight, tail length, etc. + +This is currently saved in the `Publish*` tables in the database. + +This data is saved as is i.e. not log-transformed + +### High Content Data + +This includes mainly molecular data such as +* mRNA assay data +* genetic expression data +* probes +* tissue type and data + +These data are saved in the `ProbeSet*` database tables (and other closely related tables like the `Tissue*` tables - fred added this: verify). + +These could be saved in the database in a log-transformed form - verify.
+ +How do you check for log-transformation in the data? + +### Genotype Data + +This is core data, and all other data seem to rely on its existence. + +Useful for: +* correlations, cofactor and PheWAS computations. +* mapping purposes +* search and display +* editing and curation + +### Cofactors and Attributes + +This data can be alphanumeric (mix of numerical and non-numerical) data. + +It is not intended for mapping. + +### Metadata + +This data should (ideally) always accompany any and all of the data types above. It provides contextual information regarding the data it accompanies, and is useful for search, and other contextualising operations. + +It is alphanumeric data, and mostly cannot be used for numeric computations. diff --git a/topics/guix/genenetwork-fixating-guix.gmi b/topics/guix/genenetwork-fixating-guix.gmi new file mode 100644 index 0000000..844b0fd --- /dev/null +++ b/topics/guix/genenetwork-fixating-guix.gmi @@ -0,0 +1,34 @@ +# Fixating Guix for GN + +The GeneNetwork services depend on a rather complicated Guix deployment. The problem is not guix, but GN itself :) +But we were getting bitten by updates on upstream, as well as updates on our different targets/services. + +# Using channels that affect GN production + +To avoid duplication of work and unknown rabbit holes we decided to fixate guix trunk and other dependencies by using Guix channels. This means all GN development happens on a single version of Guix! That version is defined here: + +=> https://git.genenetwork.org/gn-machines/tree/.guix-channel + +Note that guix-forge and guix-bioinformatics are *also* fixated. The idea is that we only upgrade GN packages in gn-machines itself by inheriting definitions. E.g. 
+ +=> https://git.genenetwork.org/gn-machines/tree/guix/gn-machines/genenetwork.scm + +We will probably get rid of the guix-past and guix-rust-past-crates sub-channels soon by removing those packages that depend on those (genenetwork1 will get its own tree, and @alexm will upgrade the rust packages). + +If someone wants to update guix channel or guix-bioinformatics channel they should not update this file. The one in charge is @fredm. Fred has to be in control because we don't want to break production. It is forbidden to touch this channel file. + +People can patch the packages and gn-machines, but if it involves CI/CD and/or production in any way, Fred will have to know about it. + +# Service level channels + +For individual services, such as genenetwork2, genenetwork3, gn-auth, etc., we have local channel files. These should mirror above gn-machines channel file to make sure we can migrate your code easily. E.g. + +=> https://github.com/genenetwork/genenetwork3/blob/main/.guix-channel + +Should match + +=> https://git.genenetwork.org/gn-machines/tree/.guix-channel + +If that is not the case we have a major problem! So before sending patches to Fred make sure the channels match. + +To be honest, I think we should fetch these channels automagically from gn-machines as a first step. diff --git a/topics/guix/guix-profiles.gmi b/topics/guix/guix-profiles.gmi index 578bb82..15f7683 100644 --- a/topics/guix/guix-profiles.gmi +++ b/topics/guix/guix-profiles.gmi @@ -15,9 +15,9 @@ Note that a recently tested channel can be fetched from cd.genenetwork.org. That Alternatively put the following into a channels.scm file. ``` (list (channel - (name 'gn-bioinformatics) - (url "https://gitlab.com/genenetwork/guix-bioinformatics") - (branch "master"))) + (name 'gn-machines) + (url "https://git.genenetwork.org/gn-machines") + (branch "main"))) ``` Build a profile using ``` @@ -55,23 +55,6 @@ And everything should be in the PATH, PYTHONPATH etc. 
From time to time, updates to upstream Guix break our guix-bioinformatics channel. As a result, `guix pull` breaks and our work comes to a standstill for a few days until this can be fixed. While it is important to gradually move towards the latest and greatest, we would like to avoid being at the mercy of externalities and would prefer to update in a more controlled way without interrupting everyone's work. -To this end, we hard-code the guix-bioinformatics channel to depend on a *specific* commit of upstream Guix that is tied to guix-bioinformatics, for example: +To this end, we hard-code the guix-bioinformatics channel to depend on a specific commit of upstream Guix that is tied to guix-bioinformatics. This is why the recommended channels.scm file above does not include a %default-guix-channel. However, this comes with the drawback that your entire system will be stuck at that specific commit of upstream Guix (but not if you use another profile as described above). We highly recommend using a separate `guix pull` profile specifically for GeneNetwork work, as described above. -``` -(list (channel - (name 'gn-bioinformatics) - (url "https://gitlab.com/genenetwork/guix-bioinformatics") - (branch "master") - (commit - "feff05b47c305d8c944499fbc00fd2126f2b881d"))) -``` - -This is why the recommended channels.scm file above does not include a %default-guix-channel. However, this comes with the drawback that your entire system will be stuck at that specific commit of upstream Guix (but not if you use another profile as described above). We highly recommend using a separate `guix pull` profile specifically for GeneNetwork work, as described above. - -This scheme also comes with the added bonus that all members on the team and the continuous integration system will be using exactly the same Guix. Above channels.scm file is only exposed on a *succesful* build. - -## Notes - -We recently had to switch to gitlab because our git server went down on Penguin2. 
We may move to a cgit solution soon, see - -=> ../issues/cant-use-guix-bioinformatics-with-guix-pull.gmi +This scheme also comes with the added bonus that all members on the team and the continuous integration system will be using exactly the same Guix. \ No newline at end of file diff --git a/topics/guix/packages.gmi b/topics/guix/packages.gmi index a52f49b..b4a393c 100644 --- a/topics/guix/packages.gmi +++ b/topics/guix/packages.gmi @@ -2,15 +2,27 @@ To deploy GN we have packages in Guix itself (that comes with a distribution), in guix-bioinformatics and in guix-past (for older packages). +When you develop a new package it is best to run against a recent version of guix. Note that with GeneNetwork this is different as we 'fixate' guix at an older version. See + +=> genenetwork-fixating-guix + Typically run a guix pull to get the latest guix: ``` mkdir -p ~/opt -guix package -i guix -p ~/opt/guix +guix pull -p ~/opt/guix-pull +unset GUIX_PROFILE # for Debian +source ~/opt/guix-pull ``` and checkout guix-past and guix-bioinformatics using git. +Note that a codeberg pull may be faster (Guix recently moved main development to Codeberg): + +``` +guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix +``` + Now Guix should be happy with ``` @@ -20,7 +32,7 @@ genenetwork2 3.11-2.1328932 out /home/wrk/guix-bioinformatics/gn genenetwork3 0.1.0-2.e781996 out /home/wrk/guix-bioinformatics/gn/packages/genenetwork.scm:107:4 ``` -and we can try building +note that using the -L switch is a bit of a hack. Normally we use channels. 
We can try building ``` ~/opt/guix-pull/bin/guix build -L ~/guix-bioinformatics/ -L ~/guix-past/modules/ genenetwork2 diff --git a/topics/gunicorn/deploying-app-under-url-prefix.gmi b/topics/gunicorn/deploying-app-under-url-prefix.gmi new file mode 100644 index 0000000..b2e382f --- /dev/null +++ b/topics/gunicorn/deploying-app-under-url-prefix.gmi @@ -0,0 +1,121 @@ +# Deploying Your Flask Application Under a URL Prefix With GUnicorn + +## TAGS + +* type: doc, documentation, docs +* author: fredm, zachs +* keywords: flask, gunicorn, SCRIPT_NAME, URL prefix + +## Introduction + +You have your application and are ready to deploy it. However, for some reason, you want to deploy it under a URL prefix rather than at a top-level domain. + +This short article details the things you need to set up. + +## Set up Your WebServer (Nginx) + +You need to tell your webserver to serve the application under a particular URL prefix. You do this using that particular webserver's reverse-proxying configuration. For this article, we will use Nginx as the server. + +Normally, you'd simply do something like: + +``` +server { + server_name your.server.domain; + + ⋮ + + location /the-prefix/ { + proxy_pass http://127.0.0.1:8080/; + proxy_set_header Host $host; + ⋮ + } + + ⋮ +} +``` + +Here, your top-level domain will be https://your.server.domain and you therefore want to access your shiny new application at https://your.server.domain/the-prefix/ + +For a simple application, with no sessions or anything, this should work, somewhat, though you might run into trouble with things like static files (e.g. css, js, etc) if the application does not use the same ones as the application on the TLD. + +If you are using sessions, you might also run into an issue where there is an interaction in the session management of both applications, especially if the application on the TLD makes use of services from the application at the url prefix.
This is mostly due to redirects from the url-prefix app getting lost and hitting the TLD app. + +To fix this, we change the configuration above to: + +``` +server { + server_name your.server.domain; + + ⋮ + + location /the-prefix/ { + proxy_pass http://127.0.0.1:8080/the-prefix/; + proxy_set_header Host $host; + ⋮ + } + + ⋮ +} +``` + +but now you get errors, since there is no endpoint in your shiny new app that is at the route /the-prefix/***. + +Enter Gunicorn! + + +## Setting up SCRIPT_NAME for GUnicorn + +### The "Hacky" Way + +At the point of invocation of GUnicorn, we set the SCRIPT_NAME environment variable to the value "/the-prefix" (note that there is no trailing slash; this is very important). You should now have something like: + +``` +$ export SCRIPT_NAME="/the-prefix" +$ gunicorn --bind 0.0.0.0:8082 --workers … +``` + +The first line tells GUnicorn what the URL prefix is. It will use this to compute what URL to pass to the flask application. + +For example, say you try accessing the endpoint + +``` +https://your.server.domain/the-prefix/auth/authorise?response_type=code&client_id=some-id&redirect_uri=some-uri +``` + +Gunicorn will split that URL into 2 parts around the value of the SCRIPT_NAME environment variable, giving you: + +* https://your.server.domain +* /auth/authorise?response_type=code&client_id=some-id&redirect_uri=some-uri + +It will then pass on the second part to flask. This is why the value of SCRIPT_NAME should not have a trailing slash. + +Note that using the SCRIPT_NAME environment variable is a convenience feature provided by GUnicorn, not a WSGI feature. If you ever change your WSGI server, there is no guarantee this fix will work. + +### Using WSGI Routing MiddleWare + +A better way is to make use of a WSGI routing middleware. You could do this by defining a separate WSGI entry point in your application's repository.
+ +``` +# wsgi_url_prefix.py +from werkzeug.wrappers import Response +from werkzeug.middleware.dispatcher import DispatcherMiddleware + +from app import create_app + +def init_prefixed_app(theapp): + theapp.wsgi_app = DispatcherMiddleware( + Response("Not Found", 404), + { + "/the-prefix": theapp.wsgi_app + }) + return theapp + + +app = init_prefixed_app(create_app()) +``` + +## References + +=> https://docs.gunicorn.org/en/latest/faq.html#how-do-i-set-script-name +=> https://dlukes.github.io/flask-wsgi-url-prefix.html +=> https://www.reddit.com/r/Python/comments/juwj3x/comment/gchdsld/ diff --git a/topics/hpc/octopus/slurm-user-guide.gmi b/topics/hpc/octopus/slurm-user-guide.gmi index f7ea6d4..d0a3cc4 100644 --- a/topics/hpc/octopus/slurm-user-guide.gmi +++ b/topics/hpc/octopus/slurm-user-guide.gmi @@ -37,7 +37,6 @@ To get a shell prompt on one of the nodes (useful for testing your environment) srun -N 1 --mem=32G --pty /bin/bash ``` - # Differences ## Guix (look ma, no modules) diff --git a/topics/lmms/bulklmm/readme.gmi b/topics/lmms/bulklmm/readme.gmi new file mode 100644 index 0000000..8bd96a8 --- /dev/null +++ b/topics/lmms/bulklmm/readme.gmi @@ -0,0 +1 @@ +This is a stub diff --git a/topics/lmms/gemma/permutations.gmi b/topics/lmms/gemma/permutations.gmi new file mode 100644 index 0000000..4c8932a --- /dev/null +++ b/topics/lmms/gemma/permutations.gmi @@ -0,0 +1,1014 @@ +# Permutations + +Currently we use gemma-wrapper to compute the significance level - by shuffling the phenotype vector 1000x. +As this is a lengthy procedure we have not incorporated it into the GN web service. The new bulklmm may work +in certain cases (genotypes have to be complete, for one). + +Because of many changes gemma-wrapper is not working for permutations.
I have a few steps to take care of: + +* [X] read R/qtl2 format for phenotype + +# R/qtl2 and GEMMA formats + +See + +=> data/R-qtl2-format-notes + +# One-offs + +## Phenotypes + +For a study Dave handed me phenotype and covariate files for the BXD. Phenotypes look like: + +``` + +Record ID,21526,21527,21528,21529,21530,21531,21532,21537,24398,24401,24402,24403,24404,24405,24406,24407,24408,24412,27513,27514,27515,27516, +27517 +BXD1,18.5,161.5,6.5,1919.450806,3307.318848,0.8655,1.752,23.07,0.5,161.5,18.5,6.5,1919.450806,3307.318848,0.8655,1.752,0.5,32,1.5,1.75,2.25,1. +25,50 +BXD100,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x +BXD101,20.6,176.199997,4.4,2546.293945,4574.802734,1.729,3.245,25.172001,0.6,176.199997,20.6,4.4,2546.294189,4574.802734,1.7286,3.2446,0.6,32, +1.875,2.375,2.75,1.75,38 +BXD102,18.785,159.582993,6.167,1745.671997,4241.505859,0.771,2.216,22.796667,0.25,159.583328,18.785,6.166667,1745.672485,4241.506348,0.770667, +2.216242,0.25,28.08333,1.5,2,2.875,1.5,28.5 +... +``` + +which is close to the R/qtl2 format. GEMMA meanwile expects a tab delimited file where x=NA. You can pass in the column number with the -n switch. One thing GEMMA lacks it the first ID which has to align with the genotype file. The BIMBAM geno format, again, does not contain the IDs. See + +=> http://www.xzlab.org/software/GEMMAmanual.pdf + +What we need to do is create and use R/qtl2 format files because they can be error checked on IDs and convert those, again, to BIMBAM for use by GEMMA. In the past I wrote Python converters for gemma2lib: + +=> https://github.com/genetics-statistics/gemma2lib + +I kinda abandoned the project, but you can see a lot of functionality, e.g. + +=> https://github.com/genetics-statistics/gemma2lib/blob/master/gemma2/format/bimbam.py + +We also have bioruby-table as a generic command line tool + +=> https://github.com/pjotrp/bioruby-table + +which is an amazingly flexible tool and can probably do the same. I kinda abandoned that project too. 
You know, bioinformatics is a graveyard of projects :/ + +OK, let's try. The first step is to convert the phenotype file to something GEMMA can use. We have to make sure that the individuals align with the genotype file(!). So, because we work with GN's GEMMA files, the steps are: + +* [X] Read the JSON layout file - 'sample_list' is essentially the header of the BIMBAM geno file +* [X] Use the R/qtl2-style phenotype file to write a correct GEMMA pheno file (multi column) +* [X] Compare results with GN pheno output + +Running GEMMA by hand it complained + +``` +## number of total individuals = 235 +## number of analyzed individuals = 26 +## number of covariates = 1 +## number of phenotypes = 1 +## number of total SNPs/var = 21056 +## number of analyzed SNPs = 21056 +Calculating Relatedness Matrix ... +rsm10000000001, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0.5, 0, 1, 0, 1, 0.5, 0, 1, 0, 0, 0, 1, 1, 0, 0.5, 1, 1, 0.5, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0.5, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0.5, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0.5, 0, 0, 0.5, 0, 1, 0, 1, 0, 0, 1, 0.5, 0, 1, 0, 0.5, 1, 1, 1, 1, 0.5, 0, 0, 0.5, 1, 0.5, 0.5, 0.5, 1, 0.5, 1, 0.5, 0.5, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0.5, 0, 0, 1, 0, 0.5, 1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5 +237 != 235 +WARNING: Columns in geno file do not match # individuals in phenotypes +ERROR: Enforce failed for not enough genotype fields for marker in src/gemma_io.cpp at line 1470 in BimbamKin +``` + +GEMMA on production is fine. So, I counted BXDs. For comparison, GN's pheno outputs 241 BXDs. Daves pheno file has 241 BXDs (good). But when using my script we get 235 BXDs. 
Ah, apparently they are different from what we use on GN because GN does not use the parents and the F1s for GEMMA. So, my script should complain when a match is not made. Turns out the JSON file only contains 235 'mappable' BXDs and refers to BXD.8 which is from Apr 26, 2023. The header says `BXD_experimental_DGA_7_Dec_2021` and GN says WGS March 2022. So which one is it? I'll just go with the latest, but genotype naming is problematic and the headers are not updated. + +> MOTTO: Always complain when there are problems! + +Luckily GEMMA complained, but the script should have complained too. The JSON file with 235 genometypes does not represent the actual 237 genometypes. We'll work on that in the next section. + +Meanwhile let's add this code to gemma-wrapper. The code can be found here: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rqtl2-pheno-to-gemma.py + +## Genotypes + +The pheno script now errors with + +``` +ERROR: sets differ {'BXD065xBXD102F1', 'C57BL/6J', 'DBA/2J', 'BXD077xBXD065F1', 'D2B6F1', 'B6D2F1'} +``` + +Since these are parents and F1s, and are all NAs in Dave's phenotypes, they are easy to remove. So, now we have 235 samples in the phenotype file and 237 genometypes in the genotype file (according to GEMMA). A quick check shows that BXD.geno has 236 genometypes. Same for the bimbam on production. We now have 3 values: 235, 236 and 237. The question is why these do not agree. + +### Genotype probabilities for GEMMA + +Another problem on production is that we are not using the standard GEMMA values. So GEMMA complains with + +``` +WARNING: The maximum genotype value is not 2.0 - this is not the BIMBAM standard and will skew l_lme and effect sizes +``` + +This explains why we divide the effect size by 2 in the GN production code. Maybe it is a better idea to fix the geno files!
+ +* [X] Generate BIMBAM file from GENO .geno files (via R/qtl2) +* [X] Check bimbam files on production + +So we need to convert .geno files as they are the current source of genotypes in GN and contain the sample names that we need to align with pheno files. For this we'll output two files - one JSON file with metadata and sample names and the actual BIMBAM file GEMMA requires. I notice that I actually never had the need to parse a geno file! Zach wrote a tool `gn2/maintenance/convert_geno_to_bimbam.py` that also writes the GN JSON file and I'll take some ideas from that. We'll also need to convert to R/qtl2 as that is what Dave can use and then on to BIMBAM. So, let's add that code to gemma-wrapper again. + +This is another tool at + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py + +where the generated JSON file helps create the pheno file. We ended up with 237 genometypes/samples to match the genotype file and all of Dave's samples matched. Also, now I was able to run GEMMA successfully and passed in the pheno column number with + +``` +gemma -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 +gemma -lmm 9 -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -k output/result.cXX.txt -n 5 +``` + +the pheno file can include the sample names as long as there are no spaces in them. For marker rs3718618 we get values -9 0 X Y 0.317 7.930689e+02 1.779940e+02 1.000000e+05 7.532662e-05. The last value translates to + +``` +-Math.log10(7.532662e-05) => 4.123051519468808 +``` + +and that matches GN's run of GEMMA w.o. LOCO. + +The next step is to make the -n switch run with LOCO on gemma-wrapper. 
+ +``` +./bin/gemma-wrapper --loco --json -- -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > K.json +./bin/gemma-wrapper --keep --force --json --loco --input K.json -- -lmm 9 -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > GWA.json +``` + +Checking the output we get + +``` +-Math.log10(3.191755e-05) => 4.495970452606926 +``` + +and that matches Dave's output for LOCO and marker rs3718618. All good, so far. Next step permute. + +## Permute + +Now we have gemma-wrapper working we need to fix it to work with the latest type of files. + +* [X] randomize phenotypes using -n switch +* [X] Permute gemma and collect results +* [X] Unseed randomizer or make it an option +* [X] Fix tmpdir +* [X] Show final score +* [X] Compare small and large BXD set + +For the first one, the --permutate-phenotype switch takes the input pheno file. Because we pick a column with gemma we can randomize all input lines together. So, in the above example, we shuffle BXD_pheno_Dave-GEMMA.txt. Interestingly it looks like we are already shuffling by line in gemma-wrapper. + +The good news is that it runs, but the outcome is wrong: + +``` +["95 percentile (significant) ", 1000.0, -3.0] +["67 percentile (suggestive) ", 1000.0, -3.0] +``` + +Inspecting the phenotype files they are shuffled, e.g. + +``` +BXD073xBXD065F1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA +BXD49 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA +BXD86 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA +BXD161 15.623 142.908997 4.0 2350.637939 3294.824951 1.452 2.08 20.416365 0.363636 142.909088 15.622727 4.0 2350.638672 3294.825928 1.45 +1636 2.079909 0.363636 33.545448 2.125 2.0 2.375 1.25 44.5 +BXD154 20.143 195.5 4.75 1533.689941 4568.76416 0.727 2.213748 27.9275 0.75 195.5 20.142857 4.75 1533.690796 4568.76416 0.72675 2.2137 +48 0.75 54.5 0.75 1.75 3.0 1.5 33.0 +``` + +which brings out an interesting point. 
Most BXDs in the genotype file are missing from this experiment. We are computing LOD scores as if we have a full BXD population. So, what we are asking here is: if we have all BXD genotypes and we randomly assign phenotypes against a subset, what is the chance we get a hit at random? I don't think this is a bad assumption, but it is not exactly what Gary Churchill had in mind in his 1994 paper: + +=> https://pubmed.ncbi.nlm.nih.gov/7851788/ Empirical threshold values for quantitative trait mapping + +The idea is to shuffle genotypes against phenotypes. If there is a high correlation we get a result. Shuffling breaks that correlation, and that should work for both the large and the small BXD set. Taking the best 'random' result from each of 1000 permutations and reading off, say, the 95th percentile of those sets the significance level. +With our new precompute we should be able to show the difference. Anyway, that is one problem, the other is that the stats somehow do not add up to the final result. Score min is set at + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/7769f209bcaff2472ba185234fad47985e59e7a3/bin/gemma-wrapper#L667 + +The next line says 'if false'. Alright, that explains part of it at least as the next block was disabled for slurm and is never run. I should rip the slurm stuff out, actually, as Arun has come up with a much better solution. But that is for later.
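
The shuffling step described above can be sketched in a few lines of Python. This is a minimal illustration and not the gemma-wrapper code: the function name `permute_phenotypes` and the `(sample, value)` layout are assumptions made for the example. The sample order stays fixed (it has to line up with the genotype columns) and only the phenotype values move, which is what breaks the genotype/phenotype correlation. The optional seed makes a run repeatable for troubleshooting.

```python
import random


def permute_phenotypes(rows, seed=None):
    """Reassign phenotype values to samples at random.

    rows is a list of (sample_id, value) pairs. The sample order is kept,
    because it must stay aligned with the genotype file; only the values
    are shuffled. Passing a seed gives the same shuffle on every run.
    """
    rng = random.Random(seed)
    ids = [sample for sample, _ in rows]
    values = [value for _, value in rows]
    rng.shuffle(values)
    return list(zip(ids, values))


# Four values taken from the reduced sample list used in these notes
rows = [("BXD1", 18.5), ("BXD24", 27.510204),
        ("BXD29", 17.204), ("BXD43", 21.825397)]
for sample, value in permute_phenotypes(rows, seed=1):
    print(sample, value)
```

When the phenotype file has several columns, the same approach works if each value is the whole row, so that all columns move together.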
+ +Disabling that, the permutation run stopped with + +``` +Add parallel job: time -v /bin/gemma -loco X -k 02fe8482913a998e6e9559ff5e3f1b89e904d59d.X.cXX.txt.cXX.txt -o 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt -p phenotypes-1 -lmm 9 -g BXD-test.txt -n 5 -a BXD.8_snps.txt -outdir /tmp/d20240823-4481-xfrnp6 +DEBUG: Reading 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt.1.assoc.txt +./bin/gemma-wrapper:672:in `foreach': No such file or directory @ rb_sysopen - 55b49eb774f638d16fd267313d8b4d1d6d2a0a25.X.assoc.txt.1.assoc.txt (Errno::ENOENT) +``` + +so it created a file, but can't find it because outdir is not shared. Now tmpdir is in the outer block so the file should still exist. For troubleshooting the first step is to seed the randomizer (seed) so we get the same run every time. +It turns out there are a number of problems. First of all the permutation output was numbered and the result was not found. Fixing that gave a first result without the -parallel switch: + +``` +[0.0008489742, 0.03214928, 0.03426648, 0.0351207, 0.0405179, 0.04688354, 0.0692488, 0.1217158, 0.1270747, 0.1880325] +["95 percentile (significant) ", 0.0008489742, 3.1] +["67 percentile (suggestive) ", 0.0351207, 1.5] +``` + +That is pleasing and it suggests that we have a significant result for the trait of interest: `volume of the first tumor that developed`. Running LOCO without parallel is slow (how did we survive in the past!).
+ +The run with 100 permutations shows + +``` +[0.0001626146, 0.0001993085, 0.000652191, 0.0007356249, 0.0008489742, 0.0009828207, 0.00102203, 0.001091924, 0.00117823, 0.001282312, 0.001471041, 0.001663572, 0.001898194, 0.003467039, 0.004655921, 0.005284387, 0.005628393, 0.006319995, 0.006767502, 0.007752473, 0.008757406, 0.008826192, 0.009018125, 0.009735282, 0.01034488, 0.01039465, 0.0122644, 0.01231366, 0.01265093, 0.01317425, 0.01348443, 0.013548, 0.01399461, 0.01442383, 0.01534904, 0.01579931, 0.01668551, 0.01696015, 0.01770371, 0.01838937, 0.01883068, 0.02011034, 0.02234977, 0.02362105, 0.0242342, 0.02520063, 0.02536663, 0.0266905, 0.02932001, 0.03116032, 0.03139836, 0.03176087, 0.03214928, 0.03348359, 0.03426648, 0.0351207, 0.03538503, 0.0354338, 0.03609931, 0.0371134, 0.03739827, 0.03787489, 0.04022586, 0.0405179, 0.04056273, 0.04076034, 0.04545012, 0.04588635, 0.04688354, 0.04790254, 0.05871501, 0.05903692, 0.05904868, 0.05978341, 0.06103624, 0.06396175, 0.06628317, 0.06640048, 0.06676557, 0.06848021, 0.0692488, 0.07122914, 0.07166011, 0.0749728, 0.08174019, 0.08188341, 0.08647539, 0.0955264, 0.1019648, 0.1032776, 0.1169525, 0.1182405, 0.1217158, 0.1270747, 0.1316735, 0.1316905, 0.1392859, 0.1576149, 0.1685975, 0.1880325] +["95 percentile (significant) ", 0.0009828207, 3.0] +["67 percentile (suggestive) ", 0.01442383, 1.8] +``` + +Not too far off! + +The command was + +``` +./bin/gemma-wrapper --debug --no-parallel --keep --force --json --loco --input K.json --permutate 100 --permute-phenotype BXD_pheno_Dave-GEMMA.txt -- -lmm 9 -g BXD-test.txt -n 5 -a BXD.8_snps.txt +``` + +It is fun to see that when I did a second run the + +``` +[100, ["95 percentile (significant) ", 0.0002998286, 3.5], ["67 percentile (suggestive) ", 0.01167864, 1.9]] +``` + +significance value was 3.5. Still, our hit is a whopper - based on this.
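
To make the threshold lines above easier to read, here is a small Python sketch of how such numbers can be derived from the sorted list of per-permutation minimum p-values. It is an assumption on my part that the index rule below matches gemma-wrapper; the function name `threshold` is made up for the example. The rule does reproduce the values printed for the 10-permutation run: sort the best p-value of every permutation, pick the entry at position n*(100-percentile)/100, and report it as -log10(p).

```python
import math


def threshold(min_pvalues, percentile):
    """Return (p-value, -log10(p)) at the given integer percentile.

    min_pvalues holds the smallest p-value found in each permutation.
    A higher percentile selects a smaller p-value, i.e. a stricter cutoff.
    """
    ranked = sorted(min_pvalues)
    index = len(ranked) * (100 - percentile) // 100
    p = ranked[index]
    return p, round(-math.log10(p), 1)


# Minimum p-values from the 10-permutation run shown earlier
run10 = [0.0008489742, 0.03214928, 0.03426648, 0.0351207, 0.0405179,
         0.04688354, 0.0692488, 0.1217158, 0.1270747, 0.1880325]

print("95 percentile (significant)", threshold(run10, 95))
print("67 percentile (suggestive)", threshold(run10, 67))
```

With only 10 permutations the 95 percentile is simply the single best random hit, which is one way to see why a handful of permutations gives an unstable threshold.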
+ +## Run permutations in parallel + +Next I introduced and fixed parallel support for permutations, now we can run gemma LOCO with decent speed - about 1 permutation per 3s! That is one trait in an hour on my machine. + +=> https://github.com/genetics-statistics/gemma-wrapper/commit/a8d3922a21c7807a9f20cf9ffb62d8b16f18c591 + +Now we can run 1000 permutations in an hour, rerunning above we get + +``` +["95 percentile (significant) ", 0.0006983356, 3.2] +["67 percentile (suggestive) ", 0.01200505, 1.9] +``` + +which proves that 100 permutations is not enough. It is a bit crazy to think that 5% of randomized phenotypes will get a LOD score of 3.2 or higher! + +Down the line I can use Arun's CWL implementation to fire this on a cluster. Coming... + +## Reduce genotypes for permutations + +In the next phase we need to check if shuffling the full set of BXDs makes sense for computing permutations. Since I wrote a script for this exercise to transform BIMBAM genotypes I can reuse that: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/a8d3922a21c7807a9f20cf9ffb62d8b16f18c591/bin/gn-geno-to-gemma.py#L31 + +If we check the sample names we can write a reduced genotype matrix. Use that to compute the GRM. Next permute with the smaller BXD sample set and genotypes. + +Instead of modifying above script I decided to add another one + +``` +bimbam-filter.py --json BXD.geno.json --sample-file BXD_pheno_Dave-GEMMA-samples.txt BXD_geno.txt > BXD_geno-samples.txt +``` + +which takes as inputs the json file from gn-geno-to-gemma and the GEMMA input file. This is not to mix targets and keeping the code simple. 
Now create the GRM with + +``` +./bin/gemma-wrapper --loco --json -- -gk -g BXD_geno-samples.txt -p BXD_pheno_Dave-GEMMA-samples.txt -n 5 -a BXD.8_snps.txt > K-samples.json +./bin/gemma-wrapper --keep --force --json --loco --input K-samples.json -- -lmm 9 -g BXD_geno-samples.txt -p BXD_pheno_Dave-GEMMA-samples.txt -n 5 -a BXD.8_snps.txt > GWA-samples.json +``` + +Now the hit got reduced: + +``` +-Math.log10(1.111411e-04) +=> 3.9541253091741235 +``` + +and with 1000 permutations + +``` +./bin/gemma-wrapper --debug --parallel --keep --force --json --loco --input K-samples.json --permutate 1000 --permute-phenotype BXD_pheno_Dave-GEMMA-samples.txt -- -lmm 9 -g BXD_geno-samples.txt -n 5 -a BXD.8_snps.txt +["95 percentile (significant) ", 0.0004184217, 3.4] +["67 percentile (suggestive) ", 0.006213012, 2.2] +``` + +we are still significant. Though the question is now why results differ so much, compared to using the full BXD genotypes. + +## Why do we have a difference with the full BXD genotypes? + +GEMMA strips out the missing phenotypes in a list. Only the actual phenotypes are used. We need to check how the GRM is used and what genotypes are used by GEMMA. For the GRM the small genotype file compares vs the large: + +``` +Samples small large +BXD1 <-> BXD1 0.248 0.253 +BXD24 <-> BXD24 0.255 0.248 +BXD1 <-> BXD24 -0.040 -0.045 +BXD1 <-> BXD29 0.010 0.009 +``` + +You can see there is a small difference in the computation of K even though it looks pretty close. This is logical because with the full BXD set all genotypes are used. With a smaller BXD set only those genotypes are used. We expect a difference in values, but not much of a difference in magnitude (shift). The only way to prove that K impacts the outcome is to take the larger matrix and reduce it to the smaller one using those values. I feel another script coming ;) + +Above numbers are without LOCO. 
With LOCO on CHR18 + +``` +Samples small large +BXD1 <-> BXD1 0.254 0.248 +BXD1 <-> BXD24 -0.037 -0.042 +``` + +again a small shift. OK, let's try computing with a reduced matrix and compare results for rs3718618. Example: + +``` +gemma -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt -o full-bxd +gemma -lmm 9 -k output/full-bxd.cXX.txt -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt -o full-bxd +``` + +we get three outcomes where full-bxd is the full set, +``` +output/full-bxd.assoc.txt:18 rs3718618 7.532662e-05 +output/full-reduced-bxd.assoc.txt:18 rs3718618 2.336439e-04 +output/small-bxd.assoc.txt:18 rs3718618 2.338226e-04 +``` + +even without LOCO you can see a huge jump for the full BXD kinship matrix, just looking at our hit rs3718618: + +``` +-Math.log10(7.532662e-05) +=> 4.123051519468808 +-Math.log10(2.338226e-04) +=> 3.631113514641496 +``` + +With LOCO the difference may be even greater. + +So, which one to use? Truth is that the GRM is a blunt instrument. Essentially every combination of two samples/strains/genometypes gets compressed into a single number that gives a distance between the genomes. This number represents a hierarchy of relationships computed in differences in DNA (haplotypes) between those individuals. The more DNA variation is represented in the calculation, the more 'fine tuned' this GRM matrix becomes. Instinctively the larger matrix, or full BXD population, is a better estimate of distance between the individuals than just using a subset of DNA. + +So, I still underwrite using the full BXD for computing the GRM. To run GEMMA, I have just proven we can use the reduced GRM which will be quite a bit faster too, as the results are the same. For permutations we *should* use the reduced form of the full BXD GRM as it does not make sense to shuffle phenotypes against BXDs we don't use. So I need to recompute that. 
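
Reducing the full GRM to the samples that carry a phenotype amounts to picking the matching rows and columns. The sketch below shows the idea in plain Python; it is not the grm-filter.py script, and the function name `reduce_grm` and the list-of-lists layout are assumptions for illustration. In the spirit of the motto above it complains when a requested sample is not in the matrix. The matrix uses the 'large' values quoted in the table where they are available; the two entries marked as placeholders are not from any real run.

```python
def reduce_grm(grm, samples, keep):
    """Select the rows and columns of a square GRM for the samples in keep.

    grm is a list of rows, samples names the rows/columns of grm in order,
    keep lists the sample names to retain (in output order).
    """
    index = {name: i for i, name in enumerate(samples)}
    missing = [name for name in keep if name not in index]
    if missing:
        # Always complain when there are problems
        raise ValueError(f"samples not in GRM: {missing}")
    picked = [index[name] for name in keep]
    return [[grm[i][j] for j in picked] for i in picked]


samples = ["BXD1", "BXD24", "BXD29"]
PLACEHOLDER = 0.0  # hypothetical value, not from the notes
full = [
    [0.253, -0.045, 0.009],
    [-0.045, 0.248, PLACEHOLDER],
    [0.009, PLACEHOLDER, PLACEHOLDER],
]

for row in reduce_grm(full, samples, ["BXD1", "BXD24"]):
    print(row)
```

The reduced matrix keeps the kinship values estimated from the full population, which is the point of this approach: only the dimensions shrink, the values do not get re-estimated.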
+ +## Recomputing significance with the reduced GRM matrix + +* [ ] Recompute significance with reduced GRM + +I can reuse the script I wrote for the previous section. + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/grm-filter.py + +So, the idea is to rerun permutations with the small set, but with the reduced GRM from the full BXD population. That ought to be straightforward by using the new matrix as an input for GWA. The only problem is that LOCO generates a GRM for every chromosome, so we need to make gemma-wrapper aware of the matrix reduction. As the reduction is fast we can do it for every run of gemma-wrapper and destroy it automatically with tmpdir. So: + +* [X] Compute the full GRM for every LOCO (if not cached) - already part of gemma-wrapper +* [X] Run through GRMs and reduce them in tmpdir +* [X] Plug new GRM name into computations - which really updates the JSON file that is input for GWA + +The interesting bit is that GEMMA requires input of phenotypes, but does not use them to compute the GRM. + +After giving it some thought we want GRM reduction to work in production GN because of the speed benefit. That means modifying gemma-wrapper to take a list of samples/genometypes as input - and we'll output that with GN. It is a good idea anyhow because it can give us some improved error feedback down the line. + +We'll use the --input switch to gemma-wrapper by providing the full list of genometypes that are used to compute the GRM and the 'reduced' list of genometypes that are used to reduce the GRM and compute GWA after. +So the first step is to create this JSON input file. We already created the "gn-geno-to-gemma" output that has a full list of samples as parsed from the GN .geno file. Now we need a script to generate the reduced samples JSON and merge that to "gn-geno-to-gemma-reduced" by adding a "samples-reduced" vector. + +The rqtl2-pheno-to-gemma.py script I wrote above already takes the "gn-geno-to-gemma" JSON.
It now adds to the JSON: + +``` + "samples-column": 2, + "samples-reduced": { + "BXD1": 18.5, + "BXD24": 27.510204, + "BXD29": 17.204, + "BXD43": 21.825397, + "BXD44": 23.454, + "BXD60": 22.604, + "BXD63": 19.171, + "BXD65": 21.607, + "BXD66": 17.056999, + "BXD70": 17.962999, + "BXD73b": 20.231001, + "BXD75": 19.952999, + "BXD78": 19.514, + "BXD83": 18.031, + "BXD87": 18.258715, + "BXD89": 18.365, + "BXD90": 20.489796, + "BXD101": 20.6, + "BXD102": 18.785, + "BXD113": 24.52, + "BXD124": 21.762142, + "BXD128a": 18.952, + "BXD154": 20.143, + "BXD161": 15.623, + "BXD210": 23.771999, + "BXD214": 19.533117 + }, + "numsamples-reduced": 26 +``` + +which is kinda cool because now I can reduce and write the pheno file in one go. Implementation: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rqtl2-pheno-to-gemma.py + +OK, we are going to input the resulting JSON file into gemma-wrapper. At the GRM stage we ignore the reduction but we need to add these details to the outgoing JSON. So the following commands can run: + +``` +./bin/gemma-wrapper --loco --json --input BXD_pheno_Dave-GEMMA.txt.json -- -gk -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > K.json +``` + +where K.json has a json["input"] which essentially is above structure. + +``` +./bin/gemma-wrapper --keep --force --json --loco --input K.json -- -lmm 9 -g BXD-test.txt -p BXD_pheno_Dave-GEMMA.txt -n 5 -a BXD.8_snps.txt > GWA.json +``` + +Now I have to deal with phenotype files as they are rewritten. We should still cater for `-p` for GEMMA. We already have `--permute-phenotypes filen` for gemma-wrapper. Now we are adding `--phenotypes` to gemma-wrapper which replaces both! +Note that we can use -p if --phenotypes is NOT defined. 
Problem is we have a few paths now: + +* [X] Check phenotypes are directly passed into GEMMA with -p switch +* [X] Check phenotypes are passed in as a file with --phenotypes switch +* [X] Check phenotypes are coming in using the JSON file + +Fixed the first one with + +=> https://github.com/genetics-statistics/gemma-wrapper/commit/2b7570a7f0ba0d1080c730b208823c0622dd8f2c + +though that does not do caching (yet). Next, doing LOCO, I noticed xz was phenomenally slow. Turns out it was not xz, but when using `tar -C` we switch into the path and somehow xz kept growing its output. + +At this point David told me that we don't have to do epoch or covariates. So it is just the traits. After getting side-tracked by a slow running python program for haplotype assessment we start up again. + +So, now we can pass in a trait using JSON. This is probably not a great idea when you have a million values, but for our purposes it will do. K.json contains the reduced samples. Next GWA is run on that. I had to fix minor niggles and get `parallel` to give more useful debug info. + +Next write the pheno file and pass it in! + +``` +./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g test/data/input/BXD_geno.txt.gz -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 -n 2 -debug +``` + +note the '-n 2' switch to get the second generated column in the phenotype file. We had our first successful run! To run permutations I get: + +``` +./bin/gemma-wrapper:722:in `<main>': You should supply --permute-phenotypes with gemma-wrapper --permutate (RuntimeError) +``` + +and, of course, as this reduced file is generated it is not available yet. That was an easy fix/hack. Next I got + +``` +./bin/gemma-wrapper:230:in `block in <main>': Do not use the GEMMA -p switch with gemma-wrapper if you are using JSON phenotypes! +``` + +Hmm. This is a bit harder. The call to GWAS takes a kinship matrix and it gets reduced with every permutation.
That is probably OK because it runs quickly, but I'll need to remove the -p switch... OK. Done that and permutations are running in a second for 28 BXD! That implies computing significance in the web service comes into view - especially if we use a cluster on the backend. + +It is interesting to see that 60% of time is spent in the kernel - which means still heavy IO on GEMMA's end - even with the reduced data: + +``` +%Cpu0 : 39.1 us, 51.0 sy +%Cpu1 : 34.0 us, 54.8 sy +%Cpu2 : 35.8 us, 54.5 sy +%Cpu3 : 37.5 us, 49.8 sy +%Cpu4 : 36.0 us, 53.3 sy +%Cpu5 : 29.5 us, 57.9 sy +%Cpu6 : 42.7 us, 44.7 sy +%Cpu7 : 35.9 us, 52.2 sy +%Cpu8 : 27.0 us, 60.7 sy +%Cpu9 : 24.5 us, 63.2 sy +%Cpu10 : 29.8 us, 58.9 sy +%Cpu11 : 25.3 us, 62.7 sy +%Cpu12 : 28.1 us, 58.9 sy +%Cpu13 : 34.2 us, 52.8 sy +%Cpu14 : 34.6 us, 52.2 sy +%Cpu15 : 37.5 us, 51.8 sy +``` + +There is room for more optimization. + +The good news is that the peak we have turns out to be statistically significant: + +``` +["95 percentile (significant) ", 0.0004945423, 3.3] +["67 percentile (suggestive) ", 0.009975183, 2.0] +``` + +Even though that was with a low number of permutations, there was actually a real bug: it turns out I only picked the values from the X chromosome (ugh!). It looks different now. + +For the peaks of + +=> https://genenetwork.org/show_trait?trait_id=21526&dataset=BXDPublish + +after 1000 permutations (I tried a few times) the significance threshold with MAF 0.05 ends up at approx. + +["95 percentile (significant) ", 1.434302e-05, 4.8] +["67 percentile (suggestive) ", 0.0001620244, 3.8] + +If that holds, it means that for this trait BXD_21526 the peaks on chr 14 at LOD 3.5 are not significant, but close to suggestive (aligning with Dave's findings and comments). It is interesting to see the numbers quickly stabilize by 100 permutations (see attached). Now, this is before correcting for epoch effects and other covariates. And I took the data from Dave as is (the distribution looks fairly normal).
Also there is a problem with MAF I have to look into: + +GEMMA in GN2 shows the same result when setting MAF to 0.05 or 0.1 (you can try that). The GN2 GEMMA code for LOCO does pass in -maf (though I see that non-LOCO does not - ugh again). I need to run GEMMA to see if the output should differ and I'll need to see the GN2 logs to understand what is happening. Maybe it just says that the hits are haplotype driven - and that kinda makes sense because there is a range of them. + +That leads me to think that we only need to check for epoch when we have a single *low* MAF hit, say 0.01 for 28 mice. As we actively filter on MAF right now we won't likely see an epoch hit. + + +## Protocol for permutations + +First we run GEMMA just without LOCO using default settings that GN uses + +``` +# Convert the GN geno file to BIMBAM geno file +./bin/gn-geno-to-gemma.py BXD.geno > BXD.geno.txt +# Match pheno file +./bin/rqtl2-pheno-to-gemma.py BXD_pheno_Dave.csv --json BXD.geno.json > BXD_pheno_matched.txt + Wrote GEMMA pheno 237 from 237 with genometypes (rows) and 24 collections (cols)! +gemma -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5 +gemma -lmm 9 -g BXD.geno.txt -p BXD_pheno_matched.txt -k output/result.cXX.txt -n 5 +``` + +So far the output is correct. + +``` +-Math.log10(7.532460e-05) +=> 4.123063165904243 +``` + +Try with gemma-wrapper + +``` +./bin/gemma-wrapper --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5 -a BXD.8_snps.txt > K.json +cp output/bab43175329bd14d485e582b7ad890cf0ec28915.cXX.txt /tmp +``` + +Works, but the following failed without the -n switch: + +``` +./bin/gemma-wrapper --debug --verbose --force --json --lmdb --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -p BXD_pheno_matched.txt -n 5 +``` + +and worked with. 
That is logical, if you see output like + +``` +19 rs30886715 46903165 0 X Y 0.536 0.000000e+00 0.000000e+00 1.000000e-05 1.000000e+00 +19 rs6376540 46905638 0 X Y 0.536 0.000000e+00 0.000000e+00 1.000000e-05 1.000000e+00 +19 rs50610897 47412184 0 X Y 0.538 0.000000e+00 0.000000e+00 1.000000e-05 1.000000e+00 +``` + +It means the phenotype column that was parsed has empty values. In this case the BXD strain names. GEMMA should show a meaningful error. + +Now that works we can move to a full LOCO + + +``` +./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5 -a BXD.8_snps.txt > K.json +./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 5 +./bin/./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb +``` + +and we get + +``` +18,69216071,rs3718618,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,69825784,rs50446650,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,68189477,rs29539715,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +``` + +When we converted BXD.geno to its BIMBAM BXD.geno.txt we also got a BXD.geno.json file which contains a list of the individuals/genometypes that were used in the genotype file. + +Now we reduce the traits file to something GEMMA can use for permutations - adding the trait number and output BXD_pheno_Dave.csv.json + +```sh +./bin/rqtl2-pheno-to-gemma.py BXD_pheno_Dave.csv --json BXD.geno.json -n 5 > BXD_pheno_matched-5.txt +``` + +The matched file should be identical to the earlier BXD_pheno_matched.txt file. Meanwhile, if you inspect the JSON file you should see + +``` +jq < BXD_pheno_Dave.csv.json + "samples-column": 5, + "trait": "21529", + "samples-reduced": { + "BXD1": 1919.450806, + "BXD101": 2546.293945, + "BXD102": 1745.671997, +``` + +So far we are OK! + +At this point we have a reduced sample set, a BIMBAM file and a phenotype file GEMMA can use! 
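As a rough illustration of what the matched phenotype file contains: one value per genometype in the order of the genotype file, with NA for every genometype that has no measurement. The Python sketch below is a guess at that alignment step, not the code of rqtl2-pheno-to-gemma.py; the function name is hypothetical, BXD2 is an invented unmeasured sample, and the three values are taken from the "samples-reduced" listing above.

```python
def pheno_column(all_samples, reduced):
    """Align measured values to the full sample order, using NA for gaps.

    `all_samples` is the ordered list of genometypes in the genotype file.
    `reduced` maps a genometype name to its measured trait value.
    """
    return [str(reduced[s]) if s in reduced else "NA" for s in all_samples]


# Values from the "samples-reduced" listing; BXD2 stands in for an
# unmeasured genometype.
reduced = {"BXD1": 1919.450806, "BXD101": 2546.293945, "BXD102": 1745.671997}
column = pheno_column(["BXD1", "BXD2", "BXD101", "BXD102"], reduced)
print("\n".join(column))
```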
+ +``` +./bin/gemma-wrapper --loco --json --input BXD_pheno_Dave.csv.json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -a BXD.8_snps.txt -n 5 > K.json +``` + +Note that at this step we actually create a full GRM. Reducing happens in the next mapping stage. + +``` +./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 +``` + +Note the use of '-n' switch. We should change that. + +``` +./bin/./bin/view-gemma-mdb /tmp/test/8599834ee474b9da9ff39cc4954d662518a6b5c8.mdb --sort +``` + +Look for rs3718618 at 69216071 and I am currently getting the wrong result for trait 21529 and it is not clear why that is: + +``` +chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +16,88032783,?,0.538,-134.1339,75.7837,0.0,0.0009,3.02 +16,88038734,?,0.538,-134.1339,75.7837,0.0,0.0009,3.02 +(...) +18,69216071,?,0.462,10.8099,93.3936,0.0,0.8097,0.09 +``` + +The failing command is: + +``` +/bin/gemma -loco 18 -k /tmp/test/reduced-GRM-18.txt.tmp -o 69170e8a2d2f08905daa14461eca1d82a676b4c4.18.assoc.txt -p /tmp/test/reduced-pheno.txt.tmp -n 2 -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 -outdir /tmp/test +``` + +produces + +``` +18 rs3718618 69216071 0 X Y 0.462 -2.161984e+01 9.339365e+01 1.000000e-05 8.097026e-01 +``` + +The pheno file looks correct, so it has to be the reduced GRM. And this does not look good either: + +``` +number of SNPS for K = 7070 +number of SNPS for GWAS = 250 +``` + +When running GEMMA on genenetwork.org we get a peak for LOCO at that position for rs3718618. I note that the non-LOCO version at 4.1 vs 4.5 for LOCO has a higher peak. We should compute the significance for both! + +Now, when I run the non-LOCO version by hand I get + +``` +-Math.log10(7.532460e-05) +=> 4.123063165904243 +``` + +## Finally + +So, we rolled back to not using reduced phenotypes for now. 
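The -logP values quoted throughout these notes are the negative base-10 logarithm of GEMMA's p-value (the notes use Ruby's Math.log10 for this). A small Python sketch of the same arithmetic, checked against p-values that appear in this document; the function name is made up for illustration.

```python
import math


def neg_log10(p_value):
    """Convert a p-value into the -log10(p) score used in these notes."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


# p-values copied from the GEMMA and permutation outputs in this document.
hit = neg_log10(7.532460e-05)
significant = round(neg_log10(1.051208e-05), 1)
suggestive = round(neg_log10(0.0001483188), 1)
print(hit, significant, suggestive)
```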
+ +For trait 21529 after 1000 permutations we get for LOCO: + +``` +["95 percentile (significant) ", 1.051208e-05, 5.0] +["67 percentile (suggestive) ", 0.0001483188, 3.8] +``` + +which means our GWA hit at 4.5 is not so close to being significant. + +Next I made sure the phenotypes got shuffled against the BXD used - which is arguably the right thing to do. +It should not have a huge impact because the BXDs share haplotypes - so randomized association should end up in the same ballpark. The new result after 1000 permutations is: + +``` +["95 percentile (significant) ", 8.799303e-06, 5.1] +["67 percentile (suggestive) ", 0.0001048443, 4.0] +``` + +## More for Dave + + +Run and permute: + +``` +./bin/gemma-wrapper --lmdb --debug --phenotypes BXD_pheno_matched.txt --verbose --force --loco --json --input K.json -- -g BXD.geno.txt -a BXD.8. -lmm 9 -maf 0.05 -n 2 -p BXD_pheno_matched.txt +./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 2 --verbose --force --loco --json --input K.json -- -g BXD.geno.txt -a test/data/input/BXD_snps.txt -lmm 9 -maf 0.05 +``` + +``` +21526 How old was the mouse when a tumor was first detected? +chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +14,99632276,?,0.462,-0.6627,0.3322,100000.0,0.0003,3.56 +14,99694520,?,0.462,-0.6627,0.3322,100000.0,0.0003,3.56 +17,80952261,?,0.538,0.6528,0.3451,100000.0,0.0005,3.31 +["95 percentile (significant) ", 6.352578e-06, 5.2] +["67 percentile (suggestive) ", 0.0001007502, 4.0] +``` + +``` +24406 What was the weight of the first tumor that developed, at death?
+chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +11,9032629,?,0.536,0.1293,0.0562,100000.0,0.0,4.36 +11,9165457,?,0.536,0.1293,0.0562,100000.0,0.0,4.36 +11,11152439,?,0.5,0.126,0.0562,100000.0,0.0001,4.21 +11,11171143,?,0.5,0.126,0.0562,100000.0,0.0001,4.21 +11,11525458,?,0.5,0.126,0.0562,100000.0,0.0001,4.21 +11,8786241,?,0.571,0.1203,0.0581,100000.0,0.0002,3.78 +11,8836726,?,0.571,0.1203,0.0581,100000.0,0.0002,3.78 +11,19745817,?,0.536,0.1183,0.061,100000.0,0.0003,3.46 +11,19833554,?,0.536,0.1183,0.061,100000.0,0.0003,3.46 +["95 percentile (significant) ", 1.172001e-05, 4.9] +["67 percentile (suggestive) ", 0.0001175644, 3.9] +``` + +``` +27515 No description +chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +4,103682035,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57 +4,103875085,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57 +4,104004372,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57 +4,104156915,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57 +4,104166428,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57 +4,104584276,?,0.481,-0.1653,0.0585,100000.0,0.0,5.57 +4,103634906,?,0.519,-0.1497,0.0733,100000.0,0.0002,3.67 +4,103640707,?,0.519,-0.1497,0.0733,100000.0,0.0002,3.67 +["95 percentile (significant) ", 7.501004e-06, 5.1] +["67 percentile (suggestive) ", 7.804668e-05, 4.1] +``` + +## Dealing with significance + +Now the significance thresholds appear to be a bit higher than we expect. So, let's see what is going on. First I check the randomization of the phenotypes. That looks great. There are 1000 different phenotype files and they randomized only the BXD we used. Let's zoom in on our most interesting 27515. When running in GN2 I get more hits - they are at the same level, but somehow SNPs have dropped off. 
In those runs our SNP of interest shows only a few higher values: + +``` +./6abd89211d93b0d03dc4281ac3a0abe7fc10da46.4.assoc.txt.assoc.txt:4 rs28166983 103682035 0 X Y 0.481 -2.932957e-01 7.337327e-02 1.000000e+05 2.700506e-04 +./b6e58d6092987d0c23ae1735d11d4a293782c511.4.assoc.txt.assoc.txt:4 rs28166983 103682035 0 X Y 0.481 -2.413067e-01 6.416133e-02 1.000000e+05 5.188637e-04 +./4266656951ab0c5f3097ddb4bf917448d7542dd5.4.assoc.txt.assoc.txt:4 rs28166983 103682035 0 X Y 0.481 2.757074e-01 6.815899e-02 1.000000e+05 2.365318e-04 +./265e44a4c078d2a608b7117bbdcb9be36f56c7de.4.assoc.txt.assoc.txt:4 rs28166983 103682035 0 X Y 0.481 2.358494e-01 5.743872e-02 1.000000e+05 1.996261e-04 +napoli:/export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/tmp/test$ rg 103682035 .|grep 5$ +./b29f08a4b1061301d52f939087f1a4c1376256f0.4.assoc.txt.assoc.txt:4 rs28166983 103682035 0 X Y 0.481 -2.841255e-01 6.194426e-02 1.000000e+05 5.220922e-05 +./3e5b12e9b7478b127b47c23ccdfba2127cf7e2b2.4.assoc.txt.assoc.txt:4 rs28166983 103682035 0 X Y 0.481 -2.813968e-01 6.379554e-02 1.000000e+05 8.533857e-05 +``` + +but none as high as the original hit of 5.57 + +``` +irb(main):001:0> -Math.log10(2.700506e-04) +=> 3.5685548534637 +irb(main):002:0> -Math.log10(5.220922e-05) +=> 4.282252795052573 +irb(main):003:0> -Math.log10(8.533857e-05) +=> 4.06885463879464 +``` + +All good. This leaves two things to look into. First, I see less hits than with GN2(!). Second, qnorm gives a higher peak in GN2. + +* [X] Check for number of SNPs + +The number of SNPs is not enough: + +``` +GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022 +Reading Files ... 
+## number of total individuals = 237 +## number of analyzed individuals = 26 +## number of covariates = 1 +## number of phenotypes = 1 +## leave one chromosome out (LOCO) = 1 +## number of total SNPs/var = 21056 +## number of SNPS for K = 6684 +## number of SNPS for GWAS = 636 +## number of analyzed SNPs = 21056 +``` + +Even when disabling MAF filtering we still see a subset of SNPs. I am wondering what GN2 does here. + +## Missing SNPs + +In our results we miss SNPs that are listed on GN2, but do appear in our genotypes, e.g. + +``` +BXD.8_snps.txt +19463:rsm10000013598, 69448067, 18 +``` + +First of all we find we used a total of 6360 SNPs out of the original 21056. For this SNP the genotype files show: + +``` +BXD_geno.txt +19463:rsm10000013598, X, Y, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1, 0.5, 1, 1, 1, 1, 0, 1, 0, 1, 0.5, 0, 0, 0, 1, 0.5, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0.5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 1, 0, 0, 0, 1, 1, 1, 0.5, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 1, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0.5, 1, 0.5, 0, 0.5 +``` + +and in our updated + +``` +BXD.geno.txt 
+rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1 +``` + +That looks good. Turns out we need the annotation file(?!) + +I figured out where the missing SNPs went. Turns out that, if you pass in an annotation file, and if it is not complete, GEMMA drops the non-annotated SNPs unceremoniously. Getting the right annotation file fixed it. GEMMA should obviously not behave like that ;). Anyway, I am in sync with GN2 now. Unfortunately, with permutations, the significance threshold did not change much (which kinda makes sense). + +I want to see why gemma is giving this number. If I can't find it fast I'll try to run bulklmm or R/qtl2 lmm instead and see if they disagree with gemma and if we can get close to what Rob expects. 
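The silent dropping of non-annotated SNPs described above can be caught before GEMMA runs by comparing marker names between the two files. A minimal Python sketch, assuming both files are comma-separated with the marker name in the first column (as in the excerpts above); the function names are hypothetical and the genotype rows are shortened to a few values.

```python
def marker_ids(lines):
    """Collect the first comma-separated field of every non-empty line."""
    ids = []
    for line in lines:
        line = line.strip()
        if line:
            ids.append(line.split(",")[0].strip())
    return ids


def missing_annotations(geno_lines, anno_lines):
    """List genotype markers that have no entry in the annotation file."""
    annotated = set(marker_ids(anno_lines))
    return [m for m in marker_ids(geno_lines) if m not in annotated]


# Shortened stand-ins for BXD.geno.txt and an incomplete annotation file.
geno = [
    "rs3718618,X,Y,2,0,2,0",
    "rsm10000013598,X,Y,2,0,2,0",
]
anno = [
    "rs3718618, 69216071, 18",
]
missing = missing_annotations(geno, anno)
print(missing)
```

An empty list would mean every genotyped marker is annotated; anything else is a candidate for the missing hits.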
+ + +``` +gemma -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 22 +gemma -lmm 9 -g BXD.geno.txt -p BXD_pheno_matched.txt -k output/result.cXX.txt -n 22 +``` + +Now that works we can move to a full LOCO + +``` +./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.txt -p BXD_pheno_matched.txt -n 5 -a BXD.8_snps.txt > K.json +./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 5 +./bin/./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb +``` + +and we get + +``` +chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +18,69216071,rs3718618,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,69448067,rsm10000013598,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,69463065,rsm10000013599,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,69803489,rsm10000013600,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,69825784,rs50446650,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,69836387,rsm10000013601,0.635,-195.5784,82.1243,100000.0,0.0,4.5 +18,68188822,rsm10000013579,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68189477,rs29539715,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68195226,rsm10000013580,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68195289,rsm10000013581,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68195758,rsm10000013582,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68454446,rs30216358,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68514475,rs6346101,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68521138,rsm10000013583,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68526029,rs29984158,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68542739,rsm10000013584,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68543456,rsm10000013585,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68564736,rsm10000013586,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +18,68565230,rsm10000013587,0.596,-189.7332,79.7479,100000.0,0.0,4.49 +``` + +which is in line with GN2. 
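For orientation, the permutation step that follows can be pictured as: shuffle the trait values among the genometypes that were measured, rerun the scan, record the smallest p-value of each run, and read the thresholds off the distribution of those minima. The Python sketch below follows that general recipe under stated assumptions; it is not gemma-wrapper's code, the function names and the percentile indexing are my own choices, and the list of minima is synthetic.

```python
import random


def shuffle_observed(phenotypes, rng):
    """Shuffle observed values among themselves; "NA" entries stay put."""
    observed = [v for v in phenotypes if v != "NA"]
    rng.shuffle(observed)
    values = iter(observed)
    return [v if v == "NA" else next(values) for v in phenotypes]


def thresholds(min_pvalues, levels=(95, 67)):
    """For each level, return the p-value below which only
    (100 - level)% of the per-permutation minima fall."""
    ordered = sorted(min_pvalues)
    n = len(ordered)
    return {level: ordered[min(n - 1, (100 - level) * n // 100)]
            for level in levels}


rng = random.Random(42)
trait = ["NA", 8.1, 7.9, "NA", 12.3, 13.4]
shuffled = shuffle_observed(trait, rng)

# Synthetic stand-in for the smallest p-value of each of 1000 permuted runs.
minima = [i / 1000 for i in range(1, 1001)]
cutoffs = thresholds(minima)
print(shuffled, cutoffs)
```

Keeping the NA positions fixed mirrors the choice made earlier to shuffle phenotypes only against the BXDs that were actually used.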
+ +Run and permute: + +``` +./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 2 --verbose --force --loco --json --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 +``` + +* [X] Test significance effect for higher and lower MAF than 0.05 + +Lower MAF increases significance thresholds? + +``` +0.05? +["95 percentile (significant) ", 6.268117e-06, 5.2] +["67 percentile (suggestive) ", 7.457537e-05, 4.1] + +0.01 +["95 percentile (significant) ", 5.871237e-06, 5.2] +["67 percentile (suggestive) ", 7.046853e-05, 4.2] +``` + +* [ ] Check distribution of hits with permutations + +## What about significance + +What we are trying to do here is to decide on a significance level that says that the chance of a hit caused by a random event is less than 1 in a thousand. We are currently finding levels of 5.0 and from earlier work it should be less than 4.0. We are essentially following Gary Churchill's '94 paper: ``Empirical threshold values for quantitative trait mapping''. The significance level depends on the shape of the data - i.e., the shape of both genotypes and the trait under study. If the significance level is 5.0 it means that, at alpha=0.05, 5% of random trait vectors can be expected to show a LOD score of 5 or higher. + +What GEMMA does is look for a correlation between a marker, e.g.
+ +``` +BXD.geno.txt +rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1 +``` + +and a trait that is measured for a limited number against these individuals/strains/genometypes. We also correct for kinship between the individuals, but that is tied to the individuals, so we can ignore that for now. So you get a vector of: + +``` +marker rsm10000013598 +ind trait +0 8.1 +0 7.9 +2 12.3 +2 13.4 +``` + +We permute the data after breaking the correlation between left and right columns. When running 1000 permutations for this particular hit we find that the shuffled never gets a higher value then for our main run. That is comforting because random permutations are always less correlated (for this marker). + +If we do this genome-wide we also see a randomly positioned highest hit across all chromosomes after shuffling the trait vector and our hit never appears the highest. E.g. + +``` +[10, ["2", "rs13476914", "170826974"], ["95 percentile (significant) ", 1.870138e-05, 4.7], ["67 percentile (suggestive) ", 6.3797e-05, 4.2]] +[11, ["6", "rsm10000004149", "25227945"], ["95 percentile (significant) ", 1.870138e-05, 4.7], ["67 percentile (suggestive) ", 6.3797e-05, 4. 2]] +[12, ["9", "rsm10000006852", "81294046"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive) ", 4.216931e-05, 4.4]] +[13, ["2", "rsm10000001382", "57898368"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive) ", 6.3797e-05, 4. 
2]] +[14, ["1", "rsm10000000166", "94030054"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive) ", 6.3797e-05, 4.2]] +[15, ["X", "rsm10000014672", "163387262"], ["95 percentile (significant) ", 1.555683e-05, 4.8], ["67 percentile (suggestive) ", 6.3797e-05, 4.2]] +``` + +### Shuffling a normally distributed trait + + +So the randomization works well. Still, our 95% is close to 5.0 and that is by chance. What happens when we change the shape of the data? Let's create a new trait, so the distribution is random and normal: + +``` +> rnorm(25, mean = 10, sd = 2) + [1] 10.347116 9.475156 11.747876 10.969742 11.374611 12.283834 11.499779 + [8] 11.123520 10.830300 11.640049 10.392085 11.586836 11.540470 10.700869 +[15] 8.802858 10.238498 11.099536 8.832104 6.463636 10.347956 11.222558 +[22] 8.658024 7.796304 10.684967 9.540483 +``` + +These random trait values render a hit of -Math.log10(8.325683e-04) = 3.0! Now we permute and we get: + +["95 percentile (significant) ", 5.22093e-06, 5.3] +["67 percentile (suggestive) ", 7.303966e-05, 4.1] + +So the shape of a normally distributed trait gives a higher threshold - it is easier to get a hit by chance. + +### Genotypes + +So 95% of random shuffled trait runs still gives us 5.x. So this has to be a property of the genotypes in conjunction with the method GEMMA applies. With regard to genotypes, the BXD are not exactly random because they share markers from two parents which run along haplotypes. I.e. we are dealing with a patchwork of similar genotypes. You may expect that would suppress the chance of finding random hits. Let's try to prove that by creating fully random genotypes and an extreme haplotype set. And, for good measure something in between. + +* [X] Fully random genotypes + +In the next phase we are going to play a bit with the haplotypes. First we fully randomize the genotype matrix. This way we break all haplotypes.
As BIMBAM is a simple format we'll just modify an existing BIMBAM file. It looks like + +``` +rs3677817,X,Y,1.77,0.42,0.18,0.42,1.42,0.34,0.69,1.57,0.52,0.1,0.37,1.27,0.62,1.87,1.71,1.65,1.83,0.04,1.05,0.52,1.92,0.57,0.61,0.11,1.49,1.07,1.48,1.7,0.5,1.75,1.74,0.29,0.37,1.78,1.91,1.37,1.64,0.32,0.09,1.21,1.58,0.4,1.0,0.62,1.1,0.7,0.35,0.86,0.7,0.46,1.14,0.04,1.87,1.96,0.61,1.34,0.63,1.04,1.95,0.22,0.54,0.31,0.14,0.95,1.45,0.93,0.37,0.79,1.37,0.87,1.79,0.41,1.73,1.25,1.49,1.57,0.39,1.61,0.37,1.85,1.83,1.71,1.5,1.78,1.34,1.29,1.41,1.54,1.05,0.3,0.87,1.85,0.5,0.19,1.54,0.53,0.26,1.47,0.67,0.84,0.18,0.79,0.68,1.48,0.4,1.83,1.76,1.09,0.2,1.48,0.24,0.53,0.41,1.24,1.38,1.31,1.73,0.52,1.86,1.21,0.58,1.68,0.79,0.4,1.41,0.07,0.57,0.42,0.47,0.49,0.05,0.77,1.33,0.15,1.41,0.03,0.24,1.66,1.39,2.0,0.23,1.4,1.05,0.79,0.51,0.66,1.24,0.29,1.12,0.46,0.92,1.12,1.53,1.78,1.22,1.35,0.1,0.43,0.41,1.89,0.09,0.13,1.04,0.24,1.4,1.25,0.24,0.26,0.31,0.36,0.31,1.34,1.23,1.91,0.7,0.08,1.43,0.17,1.9,0.06,1.42,1.94,0.43,0.54,1.96,1.29,0.64,0.82,1.85,1.63,0.23,1.79,0.52,1.65,1.43,0.95,1.13,0.59,0.07,0.66,1.79,0.92,1.89,1.2,0.51,0.18,0.96,0.44,0.46,0.88,0.39,0.89,1.68,0.07,1.46,1.61,1.73,0.56,1.33,1.67,0.16,1.78,0.61,1.55,0.88,0.15,1.98,1.96,0.61,0.04,0.12,1.4,1.65,0.71,1.3,1.85,0.49 +``` + +We'll stick in the old hit for good measure and run our genotypes: + +``` +./bin/gemma-wrapper --loco --json -- -gk -g BXD.geno.rand.txt -p BXD_pheno_matched.txt -n 5 -a BXD.8_snps.txt > K.json +./bin/gemma-wrapper --debug --verbose --force --loco --json --lmdb --input K.json -- -g BXD.geno.rand.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 -p BXD_pheno_matched.txt -n 22 +./bin/./bin/view-gemma-mdb --sort /tmp/test/ca55b05e8b48fb139179fe09c35cff0340fe13bc.mdb +./bin/view-gemma-mdb /tmp/e279abbebee8e41d7eb9dae...-gemma-GWA.tar.xz --anno BXD.8_snps.txt|head -20 +chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +X,139258413,rsm10000014629,0.496,0.2248,0.093,100000.0,0.0,4.58 
+6,132586518,rsm10000003691,0.517,0.2399,0.1068,100000.0,0.0001,4.17 +2,161895805,rs27350606,0.585,-0.2303,0.1059,100000.0,0.0001,4.0 +X,47002415,rsm10000014323,0.562,-0.1904,0.0877,100000.0,0.0001,3.99 +3,32576363,rsm10000001568,0.468,-0.2251,0.104,100000.0,0.0001,3.97 +14,19281191,rs52350512,0.5,-0.2454,0.1154,100000.0,0.0001,3.88 +7,111680092,rs32385258,0.536,0.2022,0.0968,100000.0,0.0002,3.79 +4,151267320,rsm10000002095,0.604,-0.2257,0.1102,100000.0,0.0002,3.69 +2,157353289,rs27323024,0.455,0.2188,0.1072,100000.0,0.0002,3.67 +19,56503719,rsm10000013894,0.617,0.2606,0.1302,100000.0,0.0003,3.58 +``` + +Interestingly our trait did not do that well: + +``` +18,69448067,rsm10000013598,0.635,0.0941,0.0774,100000.0,0.0167,1.78 +``` + +It shows how large the impact of the GRM is. We can run our permutations. + +``` +./bin/gemma-wrapper --debug --phenotypes BXD_pheno_matched.txt --permutate 1000 --phenotype-column 22 --verbose --force --loco --json --input K.json -- -g BXD.geno.rand.txt -a BXD.8_snps.txt -lmm 9 -maf 0.05 +["95 percentile (significant) ", 1.478479e-07, 6.8] +["67 percentile (suggestive) ", 1.892087e-06, 5.7] +``` + +Well that went through the roof :). It makes sense when you think about it. Randomizing genotypes of 21K SNPs gives you a high chance of finding SNPs that correlate with the trait. 
Let's go the other way and give 20% of indidivuals the exact same haplotypes, basically copying + +``` +rsm10000013598,X,Y,2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1 +``` + +``` +./bin/bimbam-rewrite.py --inject inject.geno.txt BXD.geno.txt --perc=20 > BXD.geno.20.txt +rg -c "2,0,2,0,2,0,2,0,0,0,0,0,0,2,2,2,0,0,0,0,2,2,2,2,2,0,0,0,0,2,0,2,2,2,0,0,0,0,2,0,2,2,2,2,0,0,2,0,0,0,2,2,0,2,0,0,2,2,2,0,0,2,2,2,2,2,2,2,2,2,2,0,0,2,2,0,2,2,2,2,0,2,2,2,2,2,2,2,0,0,2,2,0,2,0,0,2,2,2,0,2,2,2,0,1,1,1,1,1,1,2,2,1,2,2,2,2,0,2,0,2,1,0,0,0,2,1,0,2,2,2,2,2,0,0,2,2,0,2,2,0,2,2,2,2,2,2,2,2,0,2,2,2,2,2,0,0,0,0,0,2,0,0,2,0,2,1,0,2,0,0,0,0,0,0,0,0,1,2,0,0,0,2,2,2,1,0,2,2,2,2,0,2,0,0,0,2,2,2,2,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,1,1,1,0,1,2,1,0,1" BXD.geno.20.txt +4276 +``` + +so 4K out of 20K SNPs has identical haplotypes which correlate with our trait of interest: + +``` +["95 percentile (significant) ", 5.16167e-06, 5.3] +["67 percentile (suggestive) ", 6.163728e-05, 4.2] +``` + +and at 40% haplotype injection we get + +``` +["95 percentile (significant) ", 3.104788e-06, 5.5] +["67 percentile (suggestive) ", 7.032406e-05, 4.2] +``` + +* [X] Haplotype equal genotypes 20% and 40% + +All looks interesting, but does not help. + +Also when we halve the number of SNPs the results are similar too. + +``` +["95 percentile (significant) ", 6.026549e-06, 5.2] +["67 percentile (suggestive) ", 8.571557e-05, 4.1] +``` + +Even though the threshold is high, it is kind of interesting to see that no matter what you do you end up similar levels. 
After a meeting with Rob and Saunak the latter pointed out that these numbers are not completely surprising. For LMMs we need to use an adaptation - i.e. shuffle the trait values after rotation and transformation and then reverse that procedure. There is only the assumption of normality that Churchill does not require. The good news is that BulkLMM contains that method and thresholds will be lower. The bad news is that I'll have to adapt it because it does not handle missing data. + +Oh yes, rereading the Churchill paper from 1994 I now realise he also suggests a per-marker significance method that will end up lower - we saw that already in an earlier comparison. Saunak, however, says that we *should* use experiment-wide thresholds. + +## BulkLMM + +* [ ] Run bulklmm + + +## Dealing with epoch + +Rob pointed out that the GRM does not necessarily represent epoch and that may influence the significance level. I.e. we should check for that. I agree that the GRM distances are not precise enough (blunt instrument) to capture a few variants that appeared in a new epoch of mice. I.e., the mice from the 90s may be different from the mice today in a few DNA variants that won't be reflected in the GRM. + +* [ ] Deal with epoch + +We have two or more possible solutions to deal with hierarchy in the population. + +## Covariates + +* [ ] Try covariates Dave + +## Later + +* [ ] Check running our trait without LOCO with both standard and random GRMs +* [ ] Test non-loco effect for rsm10000013598 - looks too low and does not agree with GN2 +* [X] Try qnorm run +* [ ] Fix non-use of MAF in GN for non-LOCO +* [ ] Fix running of -p switch when assoc cache exists (bug) + +Quantile-Based Permutation Thresholds for Quantitative Trait Loci Hotspots +https://academic.oup.com/genetics/article/191/4/1355/5935078 +by Karl, Ritsert et al.
2012 diff --git a/topics/lmms/rqtl2/genenetwork-rqtl2-implementation.gmi b/topics/lmms/rqtl2/genenetwork-rqtl2-implementation.gmi new file mode 100644 index 0000000..452930f --- /dev/null +++ b/topics/lmms/rqtl2/genenetwork-rqtl2-implementation.gmi @@ -0,0 +1,71 @@ +# Implementation of QTL Analysis Using r-qtl2 in GeneNetwork +## Tags + +* Assigned: alexm +* Keywords: RQTL, GeneNetwork2, implementation +* Type: Feature +* Status: In Progress + +## Description + +This document outlines the implementation of a QTL analysis tool in GeneNetwork using r-qtl2 (see docs: https://kbroman.org/qtl2/) and explains what the script does. +This PR contains the implementation of the r-qtl2 script for genenetwork: +=> https://github.com/genenetwork/genenetwork3/pull/201 + +## Tasks + +The script currently aims to achieve the following: + +* [x] Parsing arguments required for the script +* [x] Data validation for the script +* [x] Generating the cross file +* [x] Reading the cross file +* [x] Calculating genotype probabilities +* [x] Performing Geno Scan (scan1) using HK, LOCO, etc. +* [x] Finding LOD peaks +* [x] Performing permutation tests +* [x] Conducting QTL analysis for multiparent populations +* [ ] Generating required plots + +## How to Run the Script + +The script requires an input file containing all the necessary data to generate the control file. Example: + +```json +{ + "crosstype": "riself", + "geno_file": "grav2_geno.csv", + "geno_map_file": "grav2_gmap.csv", + "pheno_file": "grav2_pheno.csv", + "phenocovar_file": "grav2_phenocovar.csv" +} + +``` +In addition other parameters required are + +* output file (A file path of where the output for the script will be generated) +* --directory ( A workspace of where to generate the control file) + +Optional parameters include +* --output_file: The file path where the output for the script will be generated. +* --directory: The workspace directory where the control file will be generated. 
+ +Optional parameters: + +* --cores: The number of cores to use (set to 0 for using all cores). +* --method: The scanning method to use (e.g., Haley-Knott, Linear Mixed Model, or LMM with Leave-One-Chromosome-Out). +* --pstrata: Use permutation strata. +* --threshold: Minimum LOD score for a peak. + + +An example of how to run the script: + +```sh + +Rscript rqtl2_wrapper.R --input_file [file_path] --directory [workspace_dir] --output_file [file_path] --nperm 100 --cores 3 + +``` +## Related issues: +https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2 +=> ./using-rqtl2 +=> ./gn-rqtl-design-implementation diff --git a/topics/lmms/rqtl2/gn-rqtl-design-implementation.gmi b/topics/lmms/rqtl2/gn-rqtl-design-implementation.gmi new file mode 100644 index 0000000..f37da42 --- /dev/null +++ b/topics/lmms/rqtl2/gn-rqtl-design-implementation.gmi @@ -0,0 +1,203 @@ +# RQTL Implementation for GeneNetwork Design Proposal + +## Tags + +* Assigned: alexm, +* Keywords: RQTL, GeneNetwork2, Design +* Type: Enhancements, +* Status: In Progress + + + +## Description + +This document outlines the design proposal for the re-implementation of the RQTL feature in GeneNetwork providing also a console view to track the stdout from the external process. + +### Problem Definition + +The current RQTL implementation faces the following challenges: + +- Lack of adequate error handling for the API and scripts. + +- Insufficient separation of concerns between GN2 and GN3. + +- lack way for user to track the progress of the r-qtl script being executed + +- There is lack of a clear way in which the r-qtl script is executed + +We will address these challenges and add enhancements by: + +- Rewriting the R script using r-qtl2 instead of r-qtl. + +- Establishing clear separation of concerns between GN2 and GN3, eliminating file path transfers between the two. + +- Implementing better error handling for both the API and the RQTL script. 
+ +- Running the script as a job in a task queue. + +- Piping stdout from the script to the browser through a console for real-time monitoring. + +- Improving the overall design and architecture of the system. + + + +## High-Level Design +This is divided into three major components: + +* GN3 RQTL-2 Script implementation +* RQTL Api +* Monitoring system for the rqtl script + + +### GN3 RQTL-2 Script implementation +We currently have an rqtl script written in rqtl https://github.com/genenetwork/genenetwork3/blob/main/scripts/rqtl_wrapper.R +There is a newer rqtl implementation (rqtl-2) which is +a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs. +To see the difference between the two see documentation: +=> https://kbroman.org/qtl2/assets/vignettes/rqtl_diff.html +We aim to implement a separate script using this while maintaining the one +implemented using rqtl1 (rqtl). +(TODO) This probably needs to be split into a new issue (with enough knowledge), to capture +each computation step in the r script. + +### RQTL Api + + +This component will serve as the entry point for running RQTL in GN3. At this stage, we need to improve the overall architecture and error handling. This process will be divided into the following steps: + +- Data Validation +In this step, we must validate that all required data to run RQTL is provided in the JSON format. This includes the mapping method, genotype file, phenotype file, etc. Please refer to the r-qtl2 documentation for an overview of the requirements: +=> https://rqtl.org/ + +- Data Preprocessing +During this stage, we will transform the data into a format that R can understand. This includes converting boolean values to the appropriate representations, preparing the RQTL command with all required values, and adding defaults where necessary. + +- Data Computation +In this stage, we will pass the RQTL script command to the task queue to run as a job.
+ +- Output Data Processing +In this step, we need to retrieve the results output by the script in a specified format, such as JSON or CSV, and process the data. This may include outputs like RQTL pair scans and generated diagrams. Please refer to the documentation for an overview: +=> https://rqtl.org/ + + + +**Subtasks:** + +- [ ] Add the rqtl api endpoint (10%) +- [ ] Input data validation (15%) +- [ ] Input data processing (20%) +- [ ] Passing data to r-script for the computation (40%) +- [ ] Output data processing (80%) +- [ ] Add unit tests for this module (100%) + + +### Monitoring system for the rqtl script + +This component involves creating a monitoring system to track the state of the external process and output relevant information to the user. +We need a way to determine the status of the current job, for example +QUEUED, STARTED, IN PROGRESS, COMPLETED (see the Deep Dive for more on this). + + +## Deep Dive + + +### Running the External Script +The RQTL implementation is in R, and we need a strategy for executing this script as an external process. This can be subdivided into several key steps: + +- **Task Queue Integration**: + + - We will utilize a task queue system + to manage script execution. + We already have an implementation in gn3: + +- https://github.com/genenetwork/genenetwork3/blob/0820295202c2fe747c05b93ce0f1c5a604442f69/gn3/commands.py#L101 + +- **Job Submission**: + - Each API call will create a new job in the task queue, which will handle the execution of the R script. + +- **Script Execution**: + - This stage involves executing the R script in a controlled environment, ensuring all necessary dependencies are loaded. + +- **Monitoring and Logging**: + +- The system will include monitoring tools to track the status of each job. Users will receive real-time updates on job progress and logs for the current task. + +In this stage, we can have different states for the current job, such as QUEUED, IN PROGRESS, and COMPLETED.
+ +We need to output to the user which stage of computation we are currently on during the script +execution. + +- During the QUEUED state, the standard output (stdout) should display the command to be executed along with all its arguments. + +- During the STARTED stage, the stdout should notify the user that execution has begun. + +- In the IN PROGRESS stage, we need to fetch logs from the script being executed at each computation step. Please refer to this documentation for an overview of the different computations we +shall have : +=> https://rqtl.org/ + +- During the DONE step, the system should output the results from the R/qtl script to the user. + + +- **Result Retrieval**: + - Once the R script completes (either successfully or with an error), results will be returned to the API call. + +- **Error Handling**: + - Better error handling will be implemented to manage potential issues during script execution. This includes capturing errors from the R script and providing meaningful feedback to users through the application. + +### Additional Error Handling Considerations +This will involve: +* API error handling +* Error handling within the R script + +## Additional UI Considerations +We need to rethink where to output the external process stdout in the UI. Currently, we can add flags to the URL to enable this functionality, e.g., `URL/page&flags&console=1`. +Also the design suggestion is to output the results in a terminal emulator for +example xterm ,See more: https://xtermjs.org/, A current implementation already exists +for gn3 see +=> https://github.com/genenetwork/genenetwork2/blob/abe324888fc3942d4b3469ec8d1ce2c7dcbd8a93/gn2/wqflask/templates/wgcna_setup.html#L89 + +### Design Suggestions: +#### With HTMX, offer a split screen +This will include an output page and a monitoring system page. + +#### Popup button for preview +A button that allows users to preview and hide the console output. 
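The job states named in the Deep Dive (QUEUED, STARTED, IN PROGRESS, COMPLETED) could be modelled as a small state machine that also formats the lines sent to the console view. The following is only a sketch under stated assumptions: `JobState`, `TRANSITIONS`, `advance` and `console_line` are hypothetical names, the `FAILED` state is an addition not named in the proposal, and nothing here reflects the existing gn3 task queue code.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "QUEUED"
    STARTED = "STARTED"
    IN_PROGRESS = "IN PROGRESS"
    COMPLETED = "COMPLETED"  # the proposal also calls this step DONE
    FAILED = "FAILED"        # assumption: not named in the proposal


# Allowed moves between states; terminal states have no successors.
TRANSITIONS = {
    JobState.QUEUED: {JobState.STARTED, JobState.FAILED},
    JobState.STARTED: {JobState.IN_PROGRESS, JobState.FAILED},
    JobState.IN_PROGRESS: {JobState.IN_PROGRESS, JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


def advance(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new


def console_line(state: JobState, detail: str = "") -> str:
    """Format one line of output for the console view."""
    return f"[{state.value}] {detail}".rstrip()


state = JobState.QUEUED
print(console_line(state, "Rscript rqtl2_wrapper.R --input_file [file_path]"))
state = advance(state, JobState.STARTED)
print(console_line(state, "execution has begun"))
```

Keeping the transitions in one table makes it easy to reject out-of-order updates, for instance a job reported as QUEUED after it has already completed.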
+ + + + + +## Long-Term Goals +We aim to run computations on clusters rather than locally. This project will serve as a pioneer for that approach. + +## Related Issues +=> https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2 + +### Tasks + +* stage 1 (20%) * + + - [x] implement the rqtl script using rqtl2 + +* stage 2 (40%) * + +- [ ] Implement the RQTL API endpoints +- [ ] validation and preprocessing for data from the client +- [ ] Implement state-of-the-art error handling +- [ ] Add unit tests for the rqtl api module +- [ ] Make improvements to the current R script if possible + +* stage 3 (60%)* + +- [ ] Task queue integration (refer to the Deep Dive section) +- [ ] Implement a monitoring and logging system for job execution (refer to the deep dive section +- [ ] Fetch results from running jobs +- [ ] Processing output from the external script + +* stage 4 (80%) * +- [ ] Implement a console preview UI for user feedback +- [ ] Refactor the GN2 UI + +* stage 5 (100%) * + +- [ ] Run this computation on clusters \ No newline at end of file diff --git a/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi b/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi new file mode 100644 index 0000000..8e5332a --- /dev/null +++ b/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter.gmi @@ -0,0 +1,84 @@ +# R/qtl2 LMDB Adapter +## Tags + +* assigned: alexm +* priority: medium +* type: feature, documentation +* status: WIP +* keywords: rqtl2, lmdb, adapter, cross + +## Description +We want to add support for reading crosses from LMDB. +Currently, R/qtl2 (https://kbroman.org/qtl2/) only supports reading from CSV files. 
+ +## Tasks + +* [x] Dump genotypes to LMDB +* [x] Dump cross metadata to LMDB +* [-] Create a `read_lmdb_cross` adapter +* [] Dump phenotypes to LMDB + +## Using the Adapter + +### Dumping the Genotypes +You can find the `lmdb_matrix.py` script here: + +=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/lmdb_matrix.py + +```sh +guix shell python-click python-lmdb python-wrapper python-numpy -- \ + python lmdb_matrix.py import-genotype \ + <path-to-genotype-file> <path-to-lmdb-store> +``` + +## Dumping the Cross Metadata + +The script can be found here: +=> https://github.com/genenetwork/genenetwork3/pull/235/files # lmdb_cross_metadata.py + +You need to provide a cross file path. The currently supported formats are JSON and YAML. + +Example: + +```sh +guix shell python-click python-lmdb python-wrapper python-pyyaml -- \ + python dump_metadata.py dump-cross [LMDB_PATH] [CROSS_FILE_PATH] --file-format yaml/json + +# Example + +python dump_metadata.py dump-cross "./test_lmdb_data" "./cross_file.json" +``` + +### Running the R/qtl2 LMDB Adapter Script + +The script `rqtl_lmdb_adapter.r` can be found here: +=> https://github.com/genenetwork/genenetwork3/pull/235/files # rqtl_lmdb_adapter.r + +```sh +guix shell r r-thor r-rjson r-qtl2 -- \ + Rscript [PATH_TO_ADAPTER_SCRIPT] [LMDB_PATH] + +# Example +Rscript https://github.com/genenetwork/genenetwork3/pull/235/files ./lmdb_path + + +``` +### using this with rqtl2 Example + +```r +cross <- read_lmdb_cross(LMDB_DB_PATH) +summary(cross) +cat("Is this cross okay", check_cross2(cross), "\n") +warnings() # enable warnings for the debug purposes only! 
+pr <- calc_genoprob(cross) +out <- scan1(pr, cross$pheno, cores=4) +par(mar=c(5.1, 4.1, 1.1, 1.1)) +ymx <- maxlod(out) +plot(out, cross$gmap, lodcolumn=1, col="slateblue") # test generating of qtl plots + +``` + + +### References +=> https://kbroman.org/qtl2/assets/vignettes/developer_guide.html + diff --git a/topics/lmms/rqtl2/using-rqtl2.gmi b/topics/lmms/rqtl2/using-rqtl2.gmi new file mode 100644 index 0000000..7f671ba --- /dev/null +++ b/topics/lmms/rqtl2/using-rqtl2.gmi @@ -0,0 +1,44 @@ +# R/qtl2 + +# Tags + +* assigned: pjotrp, alexm +* priority: high +* type: enhancement +* status: open +* keywords: database, gemma, reaper, rqtl2 + +# Description + +R/qtl2 handles multi-parent populations, such as DO, HS rat and the collaborative cross (CC). It also comes with an LMM implementation. Here we describe using and embedding R/qtl2 in GN2. + +# Tasks + + +## R/qtl2 + +R/qtl2 is packaged in guix and can be run in a shell with + + +``` +guix shell -C r r-qtl2 +R +library(qtl2) +``` + +R/qtl2 also comes with many tests. When starting up with development tools in the R/qtl2 checked out git repo + +```sh +cd qtl2 +guix shell -C -D r r-qtl2 r-devtools make coreutils gcc-toolchain +make test +Warning: Your system is mis-configured: '/var/db/timezone/localtime' is not a symlink +i Testing qtl2 +Error in dyn.load(dll_copy_file) : +unable to load shared object '/tmp/RtmpWaf4td/pkgload31850824d/qtl2.so': /gnu/store/hs6jjk97kzafl3qn4wkdc8l73bfqqmqh-gfortran-11.4.0-lib/lib/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /tmp/RtmpWaf4td/pkgload31850824d/qtl2.so) +Calls: <Anonymous> ... <Anonymous> -> load_dll -> library.dynam2 -> dyn.load +Execution halted +make: *** [Makefile:9: test] Error 1 +``` + +not sure what the problem is yet. 
diff --git a/topics/meetings/gn-kilifi-2025-standup.gmi b/topics/meetings/gn-kilifi-2025-standup.gmi new file mode 100644 index 0000000..c20e847 --- /dev/null +++ b/topics/meetings/gn-kilifi-2025-standup.gmi @@ -0,0 +1,177 @@ +# Stand-up Notes + +## 25-8-2025 +### Felix +* Completed first draft of the abstract +* HS genotyping; polishing founders vcf file: +* - sorting xsome order +* - remove duplicate markers + +### Johannes +* Extending the rag system, make it more of an agent: +* - fixing bugs +* - reduce runtime + +### Bonface +* CD went down! +* revert it to last commit +* set it up locally (not very easy) +* coordinate with Fred and Aarun for help + +## 27-8-2025 +### Felix +* updating the abstract: meeting with Kauthar, for more tips and insight +* HS rats genotyping +* still no significant output + +### Bonface +* fixed GN3, GN guile, and GN auth +* reproduce GN2 error, not yet fixed though + +### Johannes +* LLM transformer is taking too long to run +* change input documentations to observe results + +## 29-8-2025 +### Felix +* Still on the abstract +* Errors with hs genotyping script + +### Johannes +* Still on the rag system: experimenting with documents as input + +### Bonface +* CD down again, tux02 is the culprit +* Running gn2 outside container in the meantime + +## 1-9-2025 +### Felix +* improving the smoothing scripts for hs: building a snakemake pipeline; redrafting the abstract introductory part + +### Bonface +* looked on deployment to fix..., +* install gpu drivers for penguine 2 with alex +* it has gpu drivers tesla k8, but hardware installation is important for the meantime + +### Alex +* rqtl2/lmdb runs and rebooted +* dumping bxd phenotype dataset for testing + +### Johannes +* no big updates +* literature review on rag systems and how they work + +## 3-9-2025 +### Felix +* Improved the inferring script +* Abstract on hold first + +### Johannes +* fix documents using agents: the issue is parallelization +* reviewing fahamu AI (the python part)
+ +### Bonface +* Gn2 finally runs +* Documenting the progress and steps +* Gn-auth is broken; Fred to solve it +* Updated tesla k8 gpu drivers + + +### Alex +* Rqtl-wrapper for BXD is done +* focusing on cross-information; lmdb + +## 5-9-2025 +### Felix +* managed to generate haplotype blocks for hs rats + +### Johannes +* RAG experimentation script: balg01, running slow; prepare for msc defense + +### Alex +* dumping phenotypes to lmdb + +### Bonface +* install drivers on tesla k8 +* review Johannes' scripts +* fixing gn2 +* review alex's patches, {pull requests via emails} + +* AOB; +* API tokens for RAG and coding + +## 10-9-2025 +### Felix +* Managed to run gemma and generate plots for the original hs genotype data and hs smoothed genotype data +* Still working on fine-tuning the statistical metrics to suit the plots + +### Bonface +* All the GN machines are up and running, except GN Auth +* Sent documentation to Pjotr on the progress with troubleshooting GN machines + +### Alex +* writing queries to move phenotype matrices from gn2 db to lmdb + +### Johannes +* defense + +## 17-9-2025 +### Felix +* Completed abstract writing +* Working on generating genotype files for hs in gn2 format + +### Bonface +* Gn machines are up and running +* Still some issues with manually restarting the container +* Worked with Alex to set up an external bootable ssd for Johannes + +### Alex +* Succeeded dumping phenotypes to lmdb +* Been reading literature around phenotypes in gn2 +* Assisted Bonz in setting up external bootable ssd for Johannes + +### Johannes +* Managed to get access to bux01 server for the llms work + +## 19-09-2025 +### Felix +* Finalized HS genotypes/haplotypes + +### Bonface +* GN machines are running on CD +* troubleshooting assisted by Aaron + +### Johannes +* Experimenting with RAG systems + +### Alex +* Writing LMDB scripts to dump phenotypes to LMDB from the common gn DB + +## [Review] MONTH OCTOBER, 2025 +*** Metric wise, plans and goals to achieve
+ +### Bonface +Thoughts: Good start after relocation. + +* CD/CI: Make sure that tests/infra is super stable +* Look at suggested forge/guix-bioinformatics upgrades that affect prod +* Support work for GNQA: sane gn-qna(fahamu)/AI deploys; Review Johannes' work; Help with tuning models; play catch up (cover up knowledge gap); figure out how to compare the different models. +* Follow up on my phd + +### Alex +Thoughts: very productive + +* Rqtl2 lmdb adapter running on production; depends on traction with Rqtl2 upstream by karl Browman to gn-bioinformatics +* Collaborate with Felix on his scripts + +### Johannes + +Thoughts: Not very productive, barely, looking forward for more improvement + +* Experimenting on the qtl data with the AI system being currently built; also in need of optimization +* Comparing LLMs/GNQA agents; getting assistance from Shelby's work + +### Felix +* HS genotypes to be fully supported in GN2 +* Poster presentation +* PhD: ML objective (ML fundamentals); Complete 1st manuscript draft; Process my student pass diff --git a/topics/meetings/gn-nairobi-2025.gmi b/topics/meetings/gn-nairobi-2025.gmi new file mode 100644 index 0000000..fb357a5 --- /dev/null +++ b/topics/meetings/gn-nairobi-2025.gmi @@ -0,0 +1,17 @@ +# Meeting Notes + +## 2024-01-10 +* @flisso: Prepare gn-uploader presentation for KEMRI. +* @flisso: Put c-elegans dataset to staging. +* @flisso: PHEWAS --- extract phenotypes from genenetwork and analyse them using PHEWAS. +* @alexm: Clean up R/Qtl 1. +* @alexm: Add R/Qtl 2 in gn. +* @alexm: Fix UI issues around GN AI. +* @bonfacem: Fix UI for group pages. +* @bonfacem: Add git hooks to cd container for self-hosted repositories. +* @bonfacem: Share developer work container and have Alex test it out. +* @bonfacem: Prepare RDF presentation for KEMRI. + +Nice to have: +* @bonfacem: Start dataset metadata editing work. +* @flisso: Write PhD concept note. 
diff --git a/topics/meetings/jnduli_bmunyoki.gmi b/topics/meetings/jnduli_bmunyoki.gmi index 5af7221..26621d1 100644 --- a/topics/meetings/jnduli_bmunyoki.gmi +++ b/topics/meetings/jnduli_bmunyoki.gmi @@ -1,5 +1,462 @@ # Meeting Notes +## 2024-10-15 +* DONE: @flisso: Follow up with the Medaka team on verification of genotype sample names +* DONE: @flisso: Understand uploader scripts and help improve then. +* CANCELLED: @flisso: Set up virtuoso. @bonfacem shall share notes on this. +* NOT DONE: @flisso: Write PhD concept note. +* DONE: @alexm @jnduli: R/Qtl script. +* DONE: @bonfacem: Test the production container locally and provide @fredm some feedback. +* DONE: @bonfacem: Wrap-up re-writing gn-guile to be part of genenetwork-webservices. +* NOT DONE: @bonfacem: Start dataset metadata editing work. + +## 2024-10-08 +* NOT DONE: @bonfacem RIF Indexing for RIF page in Xapian. +* IN PROGRESS: @bonfacem: Test the production container locally and provide @fredm some feedback. +* IN PROGRESS: @bonfacem: Re-writing gn-guile to be part of genenetwork-webservices. +* NOT DONE: @shelbys @bonfacem: Getting RDF into R2R. +* NOT DONE: @flisso: Follow up with the Medaka team on verification of genotype sample names. NOTE: Medaka team are yet to respond. +* IN PROGRESS: @flisso: Figure out how to add C Elegans data in staging. NOTE: Got access to staging server. Ran example tests. Still working on some errors. +* NOT DONE: @flisso: Set up virtuoso. @bonfacem shall share notes on this. +* NOT DONE: @flisso: Write PhD concept note. NOTE: Doing some lit review. +* @shelbys: Be able to test things on lambda01 for LLM tests. +* @alexm @jnduli: R/Qtl script. + +## 2024-10-18 +* IN-PROGRESS: @priscilla @flisso: Set up mariadb and virtuoso to test out some GN3 endpoints. NOTE: Mariadb set-up +* NOT DONE: @priscilla @flisso @bmunyoki: Improve docs while hacking on the above. +* DONE: @jnduli Remove gn-auth code from GN3. +* DONE: @jnduli Resolve current issue with broken auth in gn-qa. 
+* DONE: @jnduli @alexm Work on the R/Qtl design doc. +* IN-PROGRESS: @alexm: R/Qtl script. NOTE: Reviewed by @jnduli. +* DONE: @flisso MIKK genotyping. NOTE: Verification pending from Medaka team. +* DONE: @flisso Make sure we have C Elegans and HS Rats dataset to testing, and have the genotyping pipeline working. NOTE: Issues with tux02 staging server. +* DONE: @shelbys: Modify existing Grant write-up for pangenomes. NOTES: Some more edits to be done. +* NOT DONE: @shelbys @bonfacem: Getting RDF into R2R. +* NOT DONE: @bonfacem RIF Indexing for RIF page in Xapian. +* DONE: @bonfacem Work on properly containerizing gn-guile. NOTE: Send in patches to @alexm, @aruni, and @fredm to review later today. +* DONE: @bonfacem: Fix the virtuoso CI job in CD: NOTE: I'm awaiting feedback from @arun/@fredm. + +## 2024-10-11 +* WIP @priscilla @flisso: Try out API endpoints that don't require auth. NOTE: Priscilla got to set-up guix channels for gn3. Felix ran into problems. Priscilla set up the MySQL in her Ubuntu system. +* NOT DONE: @jnduli Harden hook system for gn-auth. +* WIP: @jnduli Remove gn-auth code from GN3. NOTE: Sent latest patches to Fred. Running issue, some patches may have caused gn-qa to fail. +* DONE: @jnduli @bonfacem Finish up RIF Editing project. +* NOT DONE: @jnduli @alexm Create issue on describing the monitoring system. +* NOT DONE: @jnduli @alexm Create issue on prompt engineering in GN to improve what we already have. +* WIP: @alex Work on R/Qtl. NOTE: @jnduli/@bonfacem help out with this. NOTE: Finished writing the design doc for gn-qa. +* DONE: Looked at documentation for R/Qtl. +* NOT DONE: @alex: Review @bmunyoki's work on RIF/Indexing. +* WIP: @flisso: Make sure we have C Elegans dataset and MIKK genotypes to production. NOTE: Issues with data entry scripts. Fred/Zach working to set up test environment. +* WIP: @flisso: MIKK genotyping. NOTE: Still testing the pipeline. Halfway there. 
+* NOT DONE: @flisso: Make sure we have HS Rats in testing stage. +* WIP: @flisso: Make progress in learning back-end coding WRT GN. NOTE: Issue setting up GN3. +* WIP: @shelbys: Modify existing Grant write-up for pangenomes. NOTE: Reviewed by Pj and Eric. More mods based of feedback. Paper got accepted by BioArxiv. Added some docs to R2R evaluation code. +* DONE: @shelbys: Finish getting all the R2R scores from the first study. NOTE: Got scores for all the scores from first papers using R2R instead of Fahamu. +* NOT DONE: @bonfacem RIF Indexing for RIF page in Xapian. +* WIP: @bonfacem Work on properly containerizing gn-guile. +* DONE: @bonfacem Fix the gn-transform-database in CI. Sent patches to Arun for review. +* DONE: @bonfacem Fixed broken utf-8 characters in gn-gemtext. + +## 2024-10-04 +* IN PROGRESS: @priscilla @bonfacem Setting up GN3. @priscilla try out API endpoints that don't require auth. NOTE: @priscilla Able to set up guix as a package manager. Trouble with Guix set-up with GN3. @bonfacem good opportunity to improve docs in GN3. +* IN PROGRESS: @jnduli Harden hook system for gn-auth. +* IN PROGRESS: @jnduli Remove gn-auth code from GN3. +* DONE: @jnduli Finish UI changes for RIF editing. NOTE: Demo done in GN Learning team. +* IN PROGRESS: @alex Work on R/Qtl. NOTE: Met with Karl Brohman/PJ. Been reading the docs. Will track this issue in GN. +* NOT DONE: @alex @bonfacem Work on properly containerizing gn-guile. +* DONE: @bonfacem API/Display of NCBI Rif metadata. +* IN PROGRESS: @bonfacem @alex RIF Indexing for RIF page in Xapian. +* IN PROGRESS: @flisso Push data to production. Commence work on Arabidopsis data and HS Rats data. NOTE: C-Elegans pushed in process of being pushed to testing server, then later production. WIP with HS Rats data in collab with Palmer. +* DONE: @flisso: Learning how to use SQL WRT C Elegans data. +* IN PROGRESS: @shelbys Re-formatting grant to use pangenomes. Waiting for Garisson for feedback. 
+* DONE: @shelbys Got the R2R for the human generated questions. TODO: Run this for GPT 4.0 model. + +## 2024-09-27 + +* DONE: @jnduli @bonfacem @alex Look at base files refactor and merge work. +* DONE: @priscilla continue to upload more papers. NOTE: Uploaded an extra 200 papers. +* NOT DONE: @priscilla @flisso Set up GN3. Goal is to be able to query some APIs in cURL. +* IN PROGRESS: @jnduli Improve hook systems for gn-auth. NOTE: Still figuring out a cleaner implementation for some things. +* IN PROGRESS: @jnduli Trying to remove auth code GN3 code. NOTE: Idea, though unsure about safety. @fred to review work and make sure things are safe. +* DONE: @jnduli @bonfacem @alex Push most recent changes to production. Figure out what needs doing. NOTE: @Zach is in charge of deployment. @fredm is working on the production container. +* DONE: @alex Close down remaining issues on issue tracker. NOTE: Merged work on cleaning up base files. Few more minor modifications to the UI. +* NOT DONE: @alex investigate the dumped static files for encoding issues. +* IN PROGRESS: @bonfacem NCBI Metadata - Modelling and Display. NOTE: Done with the modelling. Almost done with API/UI work. +* DONE: @bonfacem Fix broken CD tests. NOTE: We have tests running inside the guix build phase. +* IN-PROGRESS: @flisso Continue work on uploading datasets: C Elegans and MIKK. NOTE: Managed to create data files that need to be upleaded to the testing gn2 stage server. +* NOT DONE: @flisso @jnduli help @flisso with SQL. + +## 2024-09-20 +* NOT DONE: @priscilla @flisso @bmunyoki @jnduli set up GN ecosystem and review UI PRs +* DONE: @priscilla continue to upload more papers. NOTE: Shared access to drive to @bmunyoki. We are at 800 papers. +* DONE: @bmunyoki update tux02/01 with recent RIF modifications +* DONE: @jnduli Finish up experiments on hook system. NOTE: Patches got merged. Needs to make some things more concrete. 
+* NOT DONE: @alex @bonfacem investigate the dumped static files for encoding issues. +* DONE: Refactoring base files for GN2. +* IN PROGRESS: @flisso: Continue work on uploading datasets: C Elegans and MIKK. Note: Waiting for the original MIKK genotype file from the Medaka team. C Elegans yet to process the annotation file---some info is missing. +* NOT DONE: @flisso: Do code reviews on Sarthak's script. +* NOT DONE: @bmunyoki NCBI Metadata - Modelling and Display. +* DONE: @bmunyoki update tux02/01 with recent RIF modifications. NOTE: CD tests are broken and need to be fixed. + +## 2024-09-13 +* NOT DONE: @jnduli @bmunyoki fetch ncbi metadata and display them in GN2 +* DONE: @jnduli @bmunyoki add auth layer to edit rifs functionality +* DONE: @jnduli complete design doc for hooks system for gn-auth. NOTE: More experimentation with this. +* DONE: @jnduli @alex bug fixes for LLM integration. +* DONE: @priscilla added more papers to the LLM ~ 250 papers. +* NOT DONE: @priscilla @flisso @bmunyoki @jnduli set up GN ecosystem and review UI PRs +* DONE: @bmunyoki modify edit api to also write to RIF +* NOT DONE: @bmunyoki update tux02/01 with recent RIF modifications +* DONE: @bmunyoki Add test cases for RDF +* DONE: @alex Bug fix for session expiry. +* DONE: @alex Update links for static content to use self-hosted git repo. +* IN PROGRESS: @flisso Upload C Elegans Dataset. Nb: MIKK one has some issues, so work is paused for now. NOTE: Waiting for annotation and phenotype file for the C Elegans Dataset. +* DONE @flisso: Reviewed gemma wrapper scripts. + + +Nice to have: +* @bmunyoki build system container for gn-guile and write documentation for creating containers + +## 2024-09-06 + +* DONE: @bmunyoki Replicate GN1 WIKI+RIF in GN2. 
+* DONE: @bmunyoki update server to include latest code changes +* IN PROGRESS: @bmunyoki modify edit api to also write to RIF +* NOT DONE: @bmunyoki build system container for gn-guile and write documentation for creating containers +* DONE: @bmunyoki @flisso update case attributes to capture hierarchy info +* DONE: @bmunyoki prepare presentation for RIF work to GN learning team (goal is to present on Wednesday next week) +* NOT DONE: @bmunyoki update tux02/01 with recent RIF modifications +* NOT DONE: @jnduli @bmunyoki fetch ncbi metadata and display them in GN2 +* NOT DONE: @jnduli complete design doc for hooks system for gn-auth; Focus for next week. +* DONE: @alexm @jnduli integrate LLM in GN2 and GN3: On the look-out for bug-fixes. +* IN PROGRESS: @jnduli add auth layer to edit rifs functionality +* DONE: @flisso generate genotype file on Medaka fish dataset: @arthur to have a look at this. +* IN PROGRESS: @flisso code reviews for gemma-wrapper with @pjotr +* DONE: @flisso update gemtext documentation +* DONE: @flisso help Masters students with their proposal defences +* @priscilla add more papers to LLM +* NOT DONE: @priscilla @flisso @bmunyoki @jnduli set up GN ecosystem and review UI PRs + + +## 2024-09-02 (Sync with @flisso+@bonfacem) + +### Case-Attributes + +* @bmunyoki understood case attributes by reverse-engineering the relevant tables from GeneNetwork's database. + +* One source of confusion for @bmunyoki is that we have the same "CaseAttribute.Name" that applies to different strains. Example Query: + +``` +SELECT * FROM CaseAttribute JOIN CaseAttributeXRef ON CaseAttribute.CaseAttributeId = CaseAttributeXRef.CaseAttributeId WHERE CaseAttribute.Name = "Sex"\G +``` + +* @rob wants fine-grained access control with case attributes. + +* @flisso, case-attributes are GN invention. Case Attributes are extra metadata about a given dataset beyond the phenotype measurements. E.g. 
We can have the phenotype: "Central nervous system"; whereby we collect the values, and SE. However, we can also collect extra metadata like "Body Weight", "Sex", "Status", etc, and in GN terminology, that extra metadata is called Case Attributes. + +* @bmunyoki. Most of the confusion around case-attributes is because of how we store case-attributes. We don't have unique identifiers for case-attributes. + +## 2024-08-30 + +* IN PROGRESS: @bmunyoki Replicate GN1 WIKI+RIF in GN2. +* DONE: @bmunyoki and @alex help Alex deploy gn-guile code on tux02, run this in a tmux session. +* DONE: @bmunyoki api for history for all tasks +* DONE: @bmunyoki UI layer for RDF history +* @bmunyoki modify edit api to also write to RIF +* @bmunyoki build system container for gn-guile and write documentation for creating containers +* NOT DONE: @jnduli complete design doc for hooks system for gn-auth +* DONE: @alexm @jnduli create branches to testing for LLM in GN2 and GN3 +* IN PROGRESS: @alexm @jnduli integrate LLM in GN2 and GN3 +* IN PROGRESS: @jnduli add auth layer to edit rifs functionality +* DONE: @bmunyoki @felix sync on case attributes and document +* DONE: @flisso managed to upload <TODO> dataset to production + + +### nice to haves + +* nice_to_have: @bmunyoki experiment and document updating gn-bioinformatics set up packages (to support rshiny) + +## 2024-08-23 +* @shelby re-ingest data and run RAGAs against the queries already in the system to perform comparison with new papers. +* @shelby figure out Claude Sonnet stuff. +* IN PROGRESS: @felix @fred push RQTL bundles to uploader, also includes metadata. +* IN PROGESS: @felix look for means to fix metadata challenge ie. trouble associating data we upload and metadata that provides descriptions. +* DONE: @bmunyoki API: Get all RIF metadata by symbols from rdf. 
+* NOT DONE: @bmunyoki UI: Modify traits page to have "GN2 (GeneWiki)", to be picked after RDF is updated in tux02 +* DONE: @bmunyoki UI: Integrate with API +* NOT DONE: @bmunyoki Replicate GN1 WIKI+RIF in GN2. +* IN PROGRESS: @bmunyoki and @alex help Alex deploy gn-guile code on tux02. +* DONE: @bmunyoki @jnduli review gn2 UI change for markdown editor +* NOT DONE: @bmunyoki create template for bio paper +* DONE: @alex sync with Boni to set up gn-guile +* DONE: @alex @bmunyoki @jnduli sync to plan out work for llm integration +* DONE: @jnduli edit WIKI+RIF +* NOT DONE: @jnduli set up gn-uploader locally and improve docs +* NOT DONE: @jnduli complete design doc for hooks system for gn-auth +* DONE: @felix to document email threads on gemtext + +## 2024-08-22 + +=> https://issues.genenetwork.org/issues/edit-rif-metadata APIs for wiki editing and broke down wiki-editing task to sub-projects. + +## 2024-08-20 + +Integrating GNQA to GN2 website and how it will work? + +1. Have the context information displayed to the right of the GN2 xapian search page +2. When someone clicks the context info page, it opens the search from GNQA which has all the references. +3. Cache queries since many searches are the same. + +Problems: + +1. search has xapian specific terminology. How do we handle this? Remove xapian prefixes and provide the key words to search. +2. how do we handle cache expiry? + - no expiry for now. + - store them in a database table. + - every quarter year, the search can be updated. + - group:bxd, species: mouse -> bxd mouse + mouse bxd: -> when caching the ordering for the search terms shouldn't matter much. + +Game Plan: + +1. Productionize the code relating to LLM search. Get the code for LLMs merged into main branch. +2. UI changes to show the search context from LLM. +3. Figuring out caching: + - database table structure + - cache expiry (use 1 month for now) + - modify LLM search to pick from cache if it exists. +4. 
Have another qa branch that fixes all errors since we had the freeze. +5. Only logged in users will have access to this functionality. + +## 2024-08-16 +* @jnduli Fix failing unit tests on GN-Auth. +* @jnduli Exploring Mechanical Rob for Integration Tests. GN-Auth should be as stable as possible. +* @jnduli Research e-mail patch workflow and propose a sane workflow for GN through an engineering blog post. +* @jnduli Help @alexm with auth work. +* @felix @fred push RQTL bundles to uploader. +* @felix look for means to fix metadata challenge ie. trouble associating data we upload and metadata that provides descriptions. +* @felix @jnduli programming learning: started building a web server to learn backend using Flask. +* @felix @jnduli Read Shelby's paper and provide feedback by the end of Saturday. + +## 2024-08-16 +* DONE: @jnduli Fix failing unit tests on GN-Auth. +* NOT DONE: @jnduli Exploring Mechanical Rob for Integration Tests. GN-Auth should be as stable as possible. +* NOT DONE: @jnduli Research e-mail patch workflow and propose a sane workflow for GN through an engineering blog post. +* DONE: @jnduli Help @alexm with auth work. +* IN PROGRESS: @felix @fred push RQTL bundles to uploader, also includes metadata. +* IN PROGRESS: @felix look for means to fix metadata challenge ie. trouble associating data we upload and metadata that provides descriptions. +* DONE: @felix @jnduli programming learning: started building a web server to learn backend using Flask. Learning html and css and will share the progress with this. +* DONE: @felix ~@jnduli~ Read Shelby's paper and provide feedback by the end of Saturday. +* DONE: @felix tested the time tracker script. +* IN PROGRESS: @bmunyoki implementation code work to edit Rif + WIki SQL n RDF data. We'll break this down. +* @bmunyoki and @alex help Alex deploy gn-guile code on tux02. +* NOT DONE: @bmunyoki Replicate GN1 WIKI+RIF in GN2. +* @shelby @bonfacem @alex Integrate QNQA Search to global search. 
+* @shelby handling edits with the current open paper + +Nice To Have: +* DEPRIORITIZED: @felix figure out how to fix large data uploads ie. most data sets are large e.g. 45GB. Uploader cannot handle these large files. +* DONE: @felix Try out John's time tracking tool and provide feedback. +* @shelby run RAGAs against the queries already in the system to perform comparison with new papers: re-ingesting, now at 1500 papers. +* @bmunyoki Send out emails to the culprit on failing tests in CI/CD. + +## 2024-08-15 +### RTF Editing (bmunyoki+alexm) + +In our static content, we don't really store RTF; instead we store, HTML. As an example compare these 2 documements and note their difference: + +=> https://github.com/bitfocus/rtf2text/blob/master/sample.rtf => [Proper RTF] sample.rtf +=> https://github.com/genenetwork/gn-docs/blob/master/general/datasets/Br_u_1203_rr/acknowledgment.rtf => [GN] acknowledgement.rtf + +* TODO @alexm Rename all the *rtf to *html during transform to make things clearer. Send @bonfacem PR. + +## 2024-08-13 +### Markdown Editor (bmunyoki+alexm) + +* @alexm @bonfacem Tested the Markdown Editor locally and it works fine. Only issue is that someone can make edits without logging in. +* API end-points to be only exposed locally. +* @alexm: Fix minor bug for when showing the diff. Have a back arrow. +* @bonfacem, @alexm: Deploy gn-guile; make sure it's only exposed locally. +* [blocking] @alexm having issues setting up gn-auth. @jnduli to help out to set up gn-auth and work out any quirks. @alexm to make sure you can't make edits without being logged in. +* @bmunyoki to set merge ge-editor UI work once basic auth is figured out. +* [nice-to-have] @alexm work on packaging: "diff2html-ui.min.js", "diff.min.js", "marked.min.js", "index.umd.js", "diff2html.min.js". +* [nice-to-have] @alexm to check-out djlint for linting jinja templates. +* @bonfacem share pre-commit hooks for setting up djlint and auto-pep8. 
+* [nice-to-have] @alexm to checkout: + +> djlint gn2/wqflask/templates/gn_editor.htmll --profile=jinja --reformat --format-css --format-js +=> https://www.djlint.com/ dj Lint; Lint & Format HTML Templates + +## 2024-08-09 + +* @shelby figure out Claude Sonnet stuff: NOT DONE, main focus was on the paper +* @shelby planning session for next work and tasks for Priscilla. DONE: Priscilla was given some work. Loop in Priscilla for our meetings. +* @shelby format output for ingested paper so that we can test the RAG engine. IN PROGRESS. Most focus has been on editing paper and some funding pursuit. +* @shelby run RAGAs against the queries already in the system to perform comparison with new papers. NOT DONE. +* @bmunyoki implementation code work to edit Rif + WIki SQL n RDF data. IN PROGRESS. Updated the RDF transform for geneWIKI; Now we can do a single GET for a single comment in RDF. +* @bmunyoki @shelby group paper on dissertation to target Arxiv. NOT DONE. +* @bmunyoki and @alex help Alex deploy gn-guile code on tux02. NOT DONE. Currently auth is a blocker. +* @bmunyoki review UI code editor work. DONE. +* @alex address comments in UI work. DONE. +* @felix @fred push RQTL bundles to uploader. In Progress: OOM Killer killing upload process. +* @felix look for means to fix metadata challenge ie. trouble associating data we upload and metadata that provides descriptions. The metadata doesn't meeting requirements. In Progress: Some things to be confirmed with Rob/PJ on coming up with a good format for adding metadata. NOT DONE. +* @felix figure out how to fix large data uploads ie. most data sets are large e.g. 45GB. Uploader cannot handle these large files. +* @felix @jnduli programming learning: started building a web server to learn backend using Flask. NOT DONE. +* @felix (@bmunyoki / @alex) learning emacs so that he figures out how to track times. @jnduli shared his time-tracking tool with @felix. DONE. +* @jnduli fix group creation bug in gn-auth. 
DONE: Group creation wasn't exactly a bug; updated docs, and fixed the masquerade API. +* @jnduli edit rif metadata using gn3. NOT DONE +* @jnduli update documentation for gn-auth setup. DONE +* @jnduli investigate more bugs related to gn-auth. DONE + +Note: When setting up sync between @jnduli and @felix, add @bmunyoki too. + + +## 2024-08-02 + +* DONE: @bmunyoki virtuoso and xapian updated in prod +* @bmunyoki code work to edit Rif + WIki SQL n RDF data: WIP, we have desired API, but we need to implement code. +* NOT DONE: @bmunyoki group paper on dissertation to target Arxiv +* DONE: @bmunyoki fix case insensitivity in Xapian search +* DONE: @jnduli review Alex patches +* DONE: @bmunyoki: updated gn2 and gn3 on git.genetwork server. Shared QA code with @shelby on a special branch. +* @bmunyoki @jnduli: fixed minor bug on xapian reflected with stemming. +* @shelby figure out Claude Sonnet stuff: NOT DONE, main focus was on the paper +* IN PROGRESS: @shelby edit paper with @pjtor +* @shelby planning session for next work and tasks for Priscilla. +* @shelby use RAGAS to test R2R with the new papers (follow up on the ingestion of papers tasks) +* @shelby and @boni to discuss R2R and interfacing with Virtuoso: deprioritized, we'll figure out interfacing with R2R. Implementation to happen later. +* DONE: @jnduli get up to speed on gn-auth +* @alex have an instance of gn-guile running on production: Code in prod, but needs to liase with Boni to get this working. +* @jgart getting genecup and rshiny containers to run as normal users instead of root users. May use libvirts APIs; or podman/docker as normal user; or rewriting the services as guix home services: system container doesn't have work around this, there's no work around. Because guix by default needs root to run as a system container. We also need sudo since at root level we define our system containers in a systemd that needs to be run as root. Why systemd? Systemd no one needs to run this. 
+ +### Meeting with Sevila on Masters Papers + +- mainly stylistic changes provided. +- provide an email explaining how long ethical review took, so that he follows up on unexpected delays. +- met up with Dr Betsy, once done with defences in October (hopefully), and Boni may get his degree before graduation next year, to facilitate Boni applying for PhD. + +### Guix Root Container + +- With docker, to prevent the need for sudo, we usually create a docker group, and add users that need to run this to this group. Can this happen in guix? +- Guix has a guix group. Why haven't we done this??? @jgart and @boni + +## 2024-07-26 +Plan for this week: + +* NOT DONE, needs a meeting: @bmunyoki virtuoso and xapian are up-to-date in prod. Boni doesn't have root access in production, so coordination with Fred and Zach is causing delays. +* API design DONE, actual CODE incomplete: @bmunyoki update RIF+WIKI on SQL and RDF from gn2 website +* DONE: @bmunyoki and @shelby review dissertation for Masters +* DONE, needs to review new changes: @bmunyoki and @jgart to review patches for `genecup` and `rshiny`. +* @bmunyoki and @jnduli to review patches for markdown parser +* DONE, patches sent. @alexm add validation and document to markdown parser. +* DONE: @shelby ingest ageing data to RAG, 10% left to complete. +* DONE: @shelby do another round for editing on the AI paper +* IN PROGRESS: @shelby RAG engine only works with OpenAI, figure out Claude Sonnet integration +* IN PROGRESS: @jnduli get up to speed on gn-auth +* @jgart enabling acme service in genecup and rshiny containers. +* @jnduli and @bmunyoki to attempt to get familiar with R2R + +Nice to have: +* @bmunyoki fix CI job for GN transformer database i.e. instead of checksums just run full job once per month: scheme script created that dumps the example files, next step is to create Gexp that runs this script. Bandwidth constraints. 
+ +## 2024-07-23 +### LLM Meeting (@shelby+@bmunyoki) +* There's no clear way of ingesting human-readable data with context into the RAG Graph from RDF. +* What specific graph should we ingest into the RAG Graph from RDF? @bmunyoki suggested RIF, PubMed Metadata. We'll figure this out. +* @bonfacem recommended: Much better to work with SPARQL than directly with TTL files. +* We've uploaded rdf triples, yet they lose their strength as the RAG system is not undergirded with a knowledge graph. @bonfacem should read the following for more context and should reach out to @shelby on how to move forward with SPARQL more concretely: + +=> https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph#r2r-knowledge-graph-configuration + +* We need to test the knowledge graph backend of R2R to see how feasible it is to use with the existing data (RDF). +* Fahamu just stored the object and lost the subject+predicate +* Loop in Alex. + + +## 2024-07-19 +Plan for this week: + +* DONE: @jgart getting `genecup` app to run in a guix container i.e. `gunicorn service` should then run `genecup`, similar to how gn2 and gn-uploader work. Patches sent to Boni, include `genecup` and `rshiny` and the container patches are tested. +* @jgart enable acme certificates for `genecup` container: Should just enable a single form, let's use arun's email since it's what we use for all our services. Reverse proxy happens inside the container. Add a comment explaining that this shouldn't be standard python set up. +* IN PROGRESS: @bmunyoki virtuoso and xapian are up-to-date in prod: +* NOT DONE: @bmunyoki update RIF+WIKI on SQL and RDF from gn2 website +* IN PROGRESS: @bmunyoki fix CI job for GN transformer database i.e. instead of checksums just run full job once per month: scheme script created that dumps the example files, next step is to create Gexp that runs this script. Bandwidth constraints. +* @bmunyoki and @shelby review dissertation for Masters: @bonz needs to send updated version. 
Also reviewed another masters by Johannes. +* ON HOLD: @alexm rewrite UI code using htmx +* INPROGRESS: @alexm address review comments in markdown parser. Api endpoints are getting reimplemented. Needs to add validation and documentation and send v2 patches for review. +* DONE: @shelby compile ingesting 500 more papers into RAG engine +* @shelby ingesting ageing research into the RAG engine: diabetes reseach is ingested, ageing will be done later. +* NOT DONE: @shelby RAG engine only works with OpenAI, figure out Claude Sonnet integration +* DONE: @shelby @bmunyoki @alexm to define the problem with RDF triple stores +* DONE: @jnduli finish up on RIF update +* IN PROGRESS: @jnduli get up to speed on gn-auth + +AOB + +* RAG engine uses R2R for the integration. It would be great if we could integrate this into guix. @shelby will send @jgart the paper on how we use the RAG. + + +## 2024-07-12 + +Plan for this week: + +* @shelby use Claude Sonnet with R2R RAG engine with 1000 papers and fix bugs: 500 papers ingested into R2R, remaining with 500. +* @shelby final run through for paper 1 before Pjotr's review. DONE, configurations fixed. New repo gnai that contains the results and will contain R2R stuff. +* NOT DONE: @shelby and @bmunyoki review dissertation paper for Masters +* @shelby @bmunyoki @alexm to define the problem with RDF triple stores +* @alexm integrate the markdown parser: DONE, patches sent to Boni +* @alexm rewrite UI code using htmx: NOT DONE +* @bmunyoki investigate why xapian index isn't getting rebuilt: DONE +* @bmunyoki investigate discrepancies between wiki and rif search: DONE, get this to prod to be tested +* @jnduli update the generif_basic table from NCBI: IN PROGRESS. +* @jnduli blog post of preference for documentation: DONE. + +We have qa.genenetwork.com. We need to have this set up to `qa.genenetwork.com/paper1` so that we always have the system that was used for this. How? 
+ +Nice to Haves + +* @bmunyoki Nice to have tag for paper1: Fix this with Boni and get done later on/iron them out then. +* @bmunyoki fix CI job that transforms gn tables to TTL: Move this to running a cron job once per month instead of + + +## 2024-06-24 + +Plan for this Week: + +* CANCELED: @bmunyoki Remove boolean prefixes from search where it makes sense. +* DONE: @bmunyoki GeneWiki + GeneRIF search in production. Mostly needs to be run in prod to see impact. +* DONE: @jnduli Children process termination when we kill the main index-genenetwork script +* CANCELED: @bmunyoki Follow up on getting virtuoso child pages in production +* IN PROGRESS @alexm push endpoints for editting and making commits for markdown files +* DONE: @all Reply to survey from Shelby +* DONE: @jnduli Fix JS import orders (without messing up the rest of Genenetwork) +* DONE: @jnduli fix search results when nothing is found +* CANCELED: @jnduli test out running guix cron jobs locally +* NOT DONE: @Jnduli mention our indexing documentation in gn2 README + +Note: For qa.genenetwork.com, we chose to pause work on this until papers are done. + +Review for last week + +* DONE: @bmunyoki rebuild guix container with new mcron changes +* WIP: @jnduli attempts to make UI change that shows all supported keys in the search: Blocked because our JS imports aren't ordered correctly and using `boolean_prefixes` means our searches don't work as we'd expect. +* WIP: @bmunyoki create an issue with all the problems experienced with search and potential solutions. Make sure it has replication steps, and plans for solutions. Issue was created but we need to get a better understanding for how cis and trans searches work. +* TODO: @bmunyoki and @jnduli genewiki indexing: PR for WIKI indexing is completed, but we didn't test it out due to the outage caused by RAM and our script. We don't have a way to easily instrument how much RAM our process uses and how to kill the process. 
+* DONE: @bmunyoki demoes and documents how to run and test guix cron job for indexing +* DONE: @bmunyoki trains @jnduli on how to review patchsets from emails +* DONE: @jnduli Follow up notes on setting up local index-genenetwork search +* DONE: @alexm handling with graduation, AFK +* TODO: @bmunyoki follow up with Rob to makes sure he tests search after everything is complete: He got some feedback and Rob is out of Town but wants RIF and Wiki search by July 2nd. + +Nice to haves: + +* TODO: minor: bonfacem makes sure that mypy/pylint in CI runs against the index-genenetwork script. +* TODO: @bmunyoki follow up how do we make sure that xapian prefix changes in code retrigger xapian indexing? + - howto: xapian prefix changes, let's maintain a hash for the file and store it in xapian + - howto: for RDF changes, since we have ttl files, if this ever changes we trigger the script. It's also nice to be able to automatically also load up data to virtuoso if this file changes. + + ## 2024-06-21 ### Outage for 2024-06-20 diff --git a/topics/octopus/lizardfs/README.gmi b/topics/octopus/lizardfs/lizard-maintenance.gmi index 78316ef..a34ef3e 100644 --- a/topics/octopus/lizardfs/README.gmi +++ b/topics/octopus/lizardfs/lizard-maintenance.gmi @@ -1,4 +1,4 @@ -# Information about lizardfs, and some usage suggestions +# Lizard maintenance On the octopus cluster the lizardfs head node is on octopus01, with disks being added mainly from the other nodes. SSDs are added to the lizardfs-chunkserver.service systemd service and SDDs added to the lizardfs-chunkserver-hdd.service. The storage pool is available on all nodes at /lizardfs, with the default storage option of "slow", which corresponds to two copies of the data, both on SDDs. @@ -73,6 +73,17 @@ Chunks deletion state: 2ssd 7984 - - - - - - - - - - ``` +<<<<<<< HEAD +This table essentially says that slow and fast are replicating data (if they are in column 0 it is OK!). 
This looks good for fast: + +``` +Chunks replication state: + Goal 0 1 2 3 4 5 6 7 8 9 10+ + slow - 137461 448977 - - - - - - - - + fast 6133152 - 5 - - - - - - - - +``` +This table essentially says that slow and fast are replicating data (if they are in column 0 it is OK!). + To query how the individual disks are filling up and if there are any errors: List all disks @@ -83,17 +94,62 @@ lizardfs-admin list-disks octopus01 9421 | less Other commands can be found with `man lizardfs-admin`. +## Info + +``` +lizardfs-admin info octopus01 9421 +LizardFS v3.12.0 +Memory usage: 2.5GiB23 + +Total space: 250TiB Available space: 10TiB +Trash space: 510GiB +Trash files: 188 +Reserved space: 21GiB Reserved files: 18 +FS objects: 7369883 +Directories: 378782 +Files: 6858803 +Chunks: 9100088 +Chunk copies: 20017964 +Regular copies (deprecated): 20017964 +``` + +``` +lizardfs-admin chunks-health octopus01 9421 +Chunks availability state: + Goal Safe Unsafe Lost + slow 1323220 1 - + fast 6398524 - 5 + +Chunks replication state: + Goal 0 1 2 3 4 5 6 7 8 9 10+ + slow - 218663 1104558 - - - - - - - - + fast 6398524 - 5 - - - - - - - - + +Chunks deletion state: + Goal 0 1 2 3 4 5 6 7 8 9 10+ + slow - 104855 554911 203583 76228 39425 19348 8659 3276 20077 292859 + fast 6380439 18060 30 - - - - - - - - +``` ## Deleted files -Lizardfs also keeps deleted files, by default for 30 days. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with: +Lizardfs also keeps deleted files, by default for 30 days in `/mnt/lizardfs-meta/trash`. 
If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with: ``` $ mfsmount /path/to/unused/mount -o mfsmeta ``` For more information see the lizardfs documentation online -=> https://dev.lizardfs.com/docs/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory +=> https://lizardfs-docs.readthedocs.io/en/latest/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory + +## Start lizardfs-mount (lizardfs reader daemon) after a system reboot + +``` +sudo bash +systemctl daemon-reload +systemctl restart lizardfs-mount +systemctl status lizardfs-mount +``` ## Gotchas @@ -179,3 +235,54 @@ KeyringMode=inherit [Install] WantedBy=multi-user.target ``` + +# To deplete and remove a drive in LizardFS + +**1. Mark the chunkserver (or specific disk) for removal** + +Edit the chunkserver's disk configuration file (typically `/etc/lizardfs/mfshdd.cfg`) and prefix the drive path with an asterisk: + +``` +*/mnt/disk_to_remove +``` + +Restart the chunkserver process on the node + +```bash +systemctl stop lizardfs-chunkserver +systemctl start lizardfs-chunkserver +``` + +**3. Monitor the evacuation progress** + +The master will begin migrating chunks off the marked drive. You can monitor progress with: + +```bash +lizardfs-admin list-disks octopus01 9421 +lizardfs-admin list-disks octopus01 9421|grep 172.23.19.59 -A 7 +172.23.19.59:9422:/mnt/sdc/lizardfs_vol/ + to delete: yes + damaged: no + scanning: no + last error: no errors + total space: 3.6TiB + used space: 3.4TiB + chunks: 277k +``` + +Look for the disk showing evacuation status. The "to delete" chunks count should decrease over time as data is replicated elsewhere. + +You can also check the CGI web interface if you have it running—it shows disk status and chunk counts. + +**4. 
Remove the drive once empty** + +Once all chunks have been evacuated (the disk shows 0 chunks or is marked as empty), you can safely: + +1. Remove the line from `mfshdd.cfg` entirely +2. Reload the configuration again +3. Physically remove or repurpose the drive + +**Important notes:** +- Ensure you have enough free space on other disks to absorb the migrating chunks +- The evacuation time depends on the amount of data and network/disk speed +- Don't forcibly remove a drive before evacuation completes, or you risk data loss if replication goals aren't met diff --git a/topics/octopus/maintenance.gmi b/topics/octopus/maintenance.gmi new file mode 100644 index 0000000..00cc575 --- /dev/null +++ b/topics/octopus/maintenance.gmi @@ -0,0 +1,98 @@ +# Octopus/Tux maintenance + +## To remember + +`fdisk -l` to see disk models +`lsblk -nd` to see mounted disks + +## Status + +octopus02 +- Devices: 2 3.7T SSDs + 2 894.3G SSDs + 2 4.6T HDDs +- **Status: Slurm not OK, LizardFS not OK** +- Notes: + - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`, + - **I don't see 2 drives that are physically mounted** + +octopus03 +- Devices: 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK +- Notes: **I don't see 2 drives that are physically mounted** + +octopus04 +- Devices: 4 7.3 T SSDs (Neil) + 1 4.6T HDD + 1 3.7T SSD + 2 894.3G SSDs +- Status: Slurm NO, LizardFS OK (we don't share the HDD) +- Notes: no + +octopus05 +- Devices: 1 7.3 T SSDs (Neil) + 5 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK +- Notes: no + +octopus06 +- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK (we don't share the HDD) +- Notes: no + +octopus07 +- Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK +- Notes: **I don't see 1 device that is physically mounted** + +octopus08 +- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs +- 
Status: Slurm OK, LizardFS OK (we don't share the HDD) +- Notes: no + +octopus09 +- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK (we don't share the HDD) +- Notes: no + +octopus10 +- Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK (we don't share the HDD) +- Notes: **I don't see 1 device that is physically mounted** + +octopus11 +- Devices: 1 7.3 T SSDs (Neil) + 5 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK +- Notes: on + +tux05 +- Devices: 1 3.6 NVMe + 1 1.5T NVMe + 1 894.3G NVMe +- Status: Slurm OK, LizardFS OK (we don't share anything) +- Notes: **I don't have a picture to confirm physically mounted devices** + +tux06 +- Devices: 2 3.6 T SSDs (1 from Neil) + 1 1.5T NVMe + 1 894.3G NVMe +- Status: Slurm OK, LizardFS (we don't share anything) +- Notes: + - **Last picture reports 1 7.3 T SSD (Neil) that is missing** + - **Disk /dev/sdc: 3.64 TiB (Samsung SSD 990: free and usable for lizardfs** + - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs** + +tux07 +- Devices: 3 3.6 T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe +- Status: Slurm OK, LizardFS +- Notes: + - **Disk /dev/sdb: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs** + - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): mounted at /mnt/sdb and shared on LIZARDFS: TO CHECK BECAUSE IT HAS NO PARTITIONS** + +tux08 +- Devices: 3 3.6 T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe +- Status: Slurm OK, LizardFS +- Notes: no + +tux09 +- Devices: 1 3.6 T SSDs + 1 1.5T NVMe + 1 894.3G NVMe +- Status: Slurm OK, LizardFS +- Notes: **I don't see 1 device that is physically mounted** + +## Neil disks +- four 8TB SSDs on the right of octopus04 +- one 8TB SSD in the left slot of octopus05 +- six 8TB SSDs bottom-right slot of octopus06,07,08,09,10,11 +- one 4TB NVMe and one 8TB SSDs on tux06, NVME in the bottom-right of the group of 4 on the left, SSD on the bottom-left of the 
group of 4 on the right +- one 4TB NVMe on tux07, on the top-left of the group of 4 on the right +- one 4TB NVMe on tux08, on the top-left of the group of 4 on the right diff --git a/topics/octopus/moosefs/moosefs-maintenance.gmi b/topics/octopus/moosefs/moosefs-maintenance.gmi new file mode 100644 index 0000000..1032cde --- /dev/null +++ b/topics/octopus/moosefs/moosefs-maintenance.gmi @@ -0,0 +1,252 @@ +# Moosefs + +We use moosefs as a network distributed storage system with redundancy. The setup is to use SSDs for fast access and spinning storage for redundancy/backups (in turn these are in RAID5 configuration). In addition we'll experiment with a non-redundant fast storage access using the fastest drives and network connections. + +# Configuration + +## Ports + +We should use different ports than lizard. Lizard uses 9419-24 by default. So let's use +9519- ports. + +* 9519 for moose meta logger +* 9520 for chunk server connections +* 9521 for mount connections +* 9522 for slow HDD chunks (HDD) +* 9523 for replicating SSD chunks (SSD) +* 9524 for fast non-redundant SSD chunks (FAST) + +## Topology + +Moosefs uses topology to decide where to fetch data. We can host the slow spinning HDD drives in a 'distant' location, so that data is fetched last. + +## Disks + +Some disks are slower than others. To test we can do: + +``` +root@octopus03:/export# dd if=/dev/zero of=test1.img bs=1G count=1 +1+0 records in +1+0 records out +1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.20529 s, 487 MB/s +/sbin/sysctl -w vm.drop_caches=3 +root@octopus03:/export# dd if=test1.img of=/dev/null bs=1G count=1 +1+0 records in +1+0 records out +1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.649035 s, 1.7 GB/s +rm test1.img +``` + +Above is on a RAID5 setup. 
Typical values are: + +``` + Write Read +Octopus Dell NVME 1.2 GB/s 2.0 GB/s +Octopus03 RAID5 487 MB/s 1.7 GB/s +Octopus01 RAID5 127 MB/s 163 MB/s +Samsung SSD 870 408 MB/s 565 MB/s +``` + +``` +mfs#octopus03:9521 3.7T 4.0G 3.7T 1% /moosefs-fast +``` + +## Command line + +``` +. /usr/local/guix-profiles/moosefs/etc/profile +mfscli -H octopus03 -P 9521 -SCS +``` + +## Config + +``` +root@octopus03:/etc/mfs# diff example/mfsexports.cfg.sample mfsexports.cfg +2c2,4 +< * / rw,alldirs,admin,maproot=0:0 +--- +> 172.23.21.0/24 / rw,alldirs,maproot=0,ignoregid +> 172.23.22.0/24 / rw,alldirs,maproot=0,ignoregid +> 172.23.17.0/24 / rw,alldirs,maproot=0,ignoregid +``` + +``` +root@octopus03:/etc/mfs# diff example/mfsmaster.cfg.sample mfsmaster.cfg +4a5,10 +> ## Only one metadata server in LizardFS shall have 'master' personality. +> PERSONALITY = master +> +> ## Password for administrative connections and commands. +> ADMIN_PASSWORD = nolizard +> +6c12 +< # WORKING_USER = nobody +--- +> WORKING_USER = mfs +9c15 +< # WORKING_GROUP = +--- +> WORKING_GROUP = mfs +27c33 +< # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs +--- +> DATA_PATH = /export/var/lib/mfs +34c40 +< # EXPORTS_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfsexports.cfg +--- +> EXPORTS_FILENAME = /etc/mfs/mfsexports.cfg +87c93 +< # MATOML_LISTEN_PORT = 9419 +--- +> MATOML_LISTEN_PORT = 9519 +103c109 +< # MATOCS_LISTEN_PORT = 9420 +--- +> MATOCS_LISTEN_PORT = 9520 +219c225 +< # MATOCL_LISTEN_PORT = 9421 +--- +> MATOCL_LISTEN_PORT = 9521 +``` + +``` +root@octopus03:/etc/mfs# cat mfsgoals.cfg +# safe - 2 copies, 1 on slow disk, 1 on fast disk +11 slow: HDD SSD + +# Fast storage - 1 copy on fast disks, no redundancy +12 fast: FAST +``` + +``` ++++ b/mfs/mfschunkserver-fast.cfg + # user to run daemon as (default is nobody) +-# WORKING_USER = nobody ++WORKING_USER = mfs + + # group to run daemon as (optional - if empty then default user group will be 
used) +-# WORKING_GROUP = ++WORKING_GROUP = mfs + + # name of process to place in syslog messages (default is mfschunkserver) + # SYSLOG_IDENT = mfschunkserver +@@ -28,6 +28,7 @@ + + # where to store daemon lock file (default is /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs) + # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs ++DATA_PATH=/var/lib/mfs + + # when set to one chunkserver will not abort start even when incorrect entries are found in 'mfshdd.cfg' file + # ALLOW_STARTING_WITH_INVALID_DISKS = 0 +@@ -41,6 +42,7 @@ + + # alternate location/name of mfshdd.cfg file (default is /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg); this +file will be re-read on each process reload, regardless if the path was changed + # HDD_CONF_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg ++HDD_CONF_FILENAME = /etc/mfs/mfsdisk-fast.cfg + + # speed of background chunk tests in MB/s per disk (formally entry defined in mfshdd.cfg). Value can be given as a decimal number (default is +1.0) + # deprecates: HDD_TEST_FREQ (if HDD_TEST_SPEED is not defined, but there is redefined HDD_TEST_FREQ, then HDD_TEST_SPEED = 10 / HDD_TEST_FREQ) +@@ -109,10 +111,10 @@ + # BIND_HOST = * + + # MooseFS master host, IP is allowed only in single-master installations (default is mfsmaster) +-# MASTER_HOST = mfsmaster ++MASTER_HOST = octopus03 + + # MooseFS master command port (default is 9420) +-# MASTER_PORT = 9420 ++MASTER_PORT = 9520 + + # timeout in seconds for master connections. 
Value >0 forces given timeout, but when value is 0 then CS asks master for timeout (default is 0 +- ask master) + # MASTER_TIMEOUT = 0 +@@ -134,5 +136,5 @@ + # CSSERV_LISTEN_HOST = * + + # port to listen for client (mount) connections (default is 9422) +-# CSSERV_LISTEN_PORT = 9422 ++CSSERV_LISTEN_PORT = 9524 +``` + +``` ++++ b/mfs/mfsmount.cfg +mfsmaster=octopus03,nosuid,nodev,noatime,nosuid,mfscachemode=AUTO,mfstimeout=30,mfswritecachesize=2048,mfsreadaheadsize=2048,mfsport=9521 +/moosefs-fast +``` + +## systemd + + +``` +root@octopus03:/etc# cat systemd/system/moosefs-master.service +Description=MooseFS master server daemon +Documentation=man:mfsmaster +After=network.target +Wants=network-online.target + +[Service] +Type=forking +TimeoutSec=0 +ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -d start -c /etc/mfs/mfsmaster.cfg -x +ExecStop=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg stop +ExecStop=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg reload +ExecReload=/bin/kill -HUP $MAINPID +User=mfs +Group=mfs +Restart=on-failure +RestartSec=60 +OOMScoreAdjust=-999 + +[Install] +WantedBy=multi-user.target +``` + +``` + cat systemd/system/moosefs-mount.service +[Unit] +Description=Moosefs mounts +After=syslog.target network.target + +[Service] +Type=forking +TimeoutSec=600 +ExecStart=/usr/local/guix-profiles/moosefs/bin/mfsmount -c /etc/mfs/mfsmount.cfg +ExecStop=/usr/bin/umount /moosefs-fast + +[Install] +WantedBy=multi-user.target +root@octopus04:/etc# cat systemd/system/moosefs-chunkserver-fast.service +[Unit] +Description=MooseFS Chunkserver (Fast) +After=network.target + +[Service] +Type=simple +ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfschunkserver -f -c /etc/mfs/mfschunkserver-fast.cfg +User=mfs +Group=mfs +Restart=on-failure +RestartSec=5 +LimitNOFILE=65535 + +[Install] +WantedBy=multi-user.target +``` + +``` +cat systemd/system/moosefs-mount.service +[Unit] +Description=Moosefs mounts 
+After=syslog.target network.target + +[Service] +Type=forking +TimeoutSec=600 +ExecStart=/usr/local/guix-profiles/moosefs/bin/mfsmount -c /etc/mfs/mfsmount.cfg +ExecStop=/usr/bin/umount /moosefs-fast + +[Install] +WantedBy=multi-user.target +``` diff --git a/topics/octopus/octopussy-needs-love.gmi b/topics/octopus/octopussy-needs-love.gmi new file mode 100644 index 0000000..8c6315d --- /dev/null +++ b/topics/octopus/octopussy-needs-love.gmi @@ -0,0 +1,266 @@ +# Octopussy needs love + +At UTHSC, Memphis, TN, around October 2020 Efraim and I installed Octopus on Debian+Guix with lizard as a distributed network storage system and slurm for job control. Around October 2023 we added 5 genoa tux05-09 machines, doubling the cluster in size. See + +=> https://genenetwork.org/gn-docs/facilities + +Octopus made a lot of work possible we can't really do on larger HPCs and led to a bunch of high impact studies and publications, particularly on pangenomics. + +In the coming period we want to replace lizard with moosefs. Lizard is no longer maintained and as it was a fork of Moose, it is only logical to go forward on that one. We also looked at Ceph, but apparently Ceph is not great for systems that carry no redundancy. So far, lizard has been using redundancy, but we figure we can do without if the occasional (cheap) SSD goes bad. + +We also need to look at upgrading some of the Dell BIOS - particularly tux05-09 - as they can be occasionally problematic with non-OEM SSDs. + +On the worker nodes it may be wise to upgrade Debian. Followed by an upgrade to the head nodes and other supporting machines. Even though we rely on Guix for latest and greatest, there may be good upgrades in the underlying Linux kernel and drivers. + +Our Slurm PBS is up-to-date because we run that completely on Guix and Arun supports the latest and greatest. + +Another thing we ought to fix is to introduce centralized user management. So far we have had few users and just got by.
But sometimes it bites us that users have different UIDs on the nodes. + +## Architecture overview + +* O1 is the old head node hosting lizardfs - will move to compute +* O2 is the old backup hosting the lizardfs shadow - will move to compute +* O3 is the new head node hosting moosefs +* O4 is the backup head node hosting moosefs shadow - will act as a compute node too + +All the other nodes are for compute. O1 and O4 will be the last nodes to remain on older Debian. They will handle the last bits of lizard. + +# Tasks + +* [X] Create moosefs package +* [X] Install moosefs +* [X] Upgrade bios (all tuxes) +* [ ] Migrate lizardfs nodes to moosefs (one at a time) +* [ ] Add server monitoring with sheepdog +* [ ] Upgrade Debian +* - [ ] Maybe, just maybe, boot the nodes from a central server +* [ ] Introduce centralized user management + +# Progress + +## Lizardfs and Moosefs + +Our Lizard documentation lives at + +=> lizardfs/README + +Efraim wrote a lizardfs package for Guix at the time in guix-bioinformatics, but we ended up deploying with Debian. Going back now, the package does not look too taxing (I think we dropped it because the Guix system configuration did not play well). + +=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/file-systems.scm + +Looking at the Debian package + +=> https://salsa.debian.org/debian/moosefs + +It carries no special patches, but a few nice hints in *.README.debian. I think it is worth trying to write a Guix package so we can easily upgrade (even on an aging Debian). Future proofing is key.
+ +The following built moosefs in a guix shell: + +``` +guix shell -C -D -F coreutils make autoconf automake fuse libpcap zlib pkg-config python libtool gcc-toolchain +autoreconf -f -i +make +``` + +Next I created a guix package that installs with: + +``` +guix build -L ~/guix-bioinformatics -L ~/guix-past/modules moosefs +``` + +See + +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=236903baaab0f84f012a55700c1917265a2b701c + +Next stop testing and deploying! + +## Choosing a head node + +Currently octopus01 is the head node. It probably is a good idea to change that, so we can safely upgrade the new server. The first choice would be octopus02 (o2). We can mirror the moose daemons on octopus01 (o1) later. Let's see what that looks like. + +A quick assessment of o1 shows that we have 14T storage on o1 that takes care of /home and /gnu. But only 1.2T is used. + +o2 has also quite a few disks (up 1417 days!), but a bunch of SSDs appears to error out. E.g. + +``` +Sep 04 07:44:56 octopus02 mfschunkserver[22766]: can't create lock file /mnt/sdd1/lizardfs_vol/.lock, marking hdd as damaged: Input/output error +UUID=277c05de-64f5-48a8-8614-8027a53be212 /mnt/sdd1 xfs rw,exec,nodev,noatime,nodiratime,largeio,inode64 0 1 +``` + +Lizard also complains 4 SSDs have been wiped out. +We'll need to reboot the server to see what storage still may work. The slurm connection appears to be misconfigured: + +``` +[2025-12-20T09:36:27.846] error: service_connection: slurm_receive_msg: Insane message length +[2025-12-20T09:36:28.415] error: unpackstr_xmalloc: Buffer to be unpacked is too large (1700881509 > 1073741824) [2025-12-20T09:36:28.415] error: unpacking header [2025-12-20T09:36:28.415] error: destroy_forward: no init [2025-12-20T09:36:28.415] error: slurm_receive_msg_and_forward: [[nessus6.uthsc.edu]:35553] failed: Message receive failure +``` + +looks like Andrea is the only one using the machine right now though some others logged in. 
Before rebooting I'll block users, ask Andrea to move off, and deplete slurm and lizard. But o2 is a large RAM machine, so we should not use that as a head node. + +Let's take a look at o3. This one has less RAM. Flavia is running some tools, but I don't think the machine is really used right now. Slurm is running, but shows similar configuration issues as o2. Let's take a look at slurm + +=> ../systems/hpc/octopus-maintenance +=> ../hpc/octopus/slurm-user-guide + +Alright, I depleted and removed slurm from o3. I think it would be wise to also deplete the lizard drives on that machine. + +The big users on lizard are: + +``` +1.6T dashbrook +1.8T pangenomes +2.1T erikg +3.4T aruni +3.4T junh +8.4T hchen +9.2T salehi +13T guarracino +16T flaviav +``` + +it seems we can clean some of that up! We have some backup storage that we can use. Alternatively move to ISAAC. + +We'll slowly start depleting the lizard. See also + +=> lizardfs/README + +O3 has 4 lizard drives. We'll start by depleting one. + + +# O2 + +``` +172.23.22.159:9422:/mnt/sde1/lizardfs_vol/ + to delete: no + damaged: yes + scanning: no + last error: no errors + total space: 0B + used space: 0B + chunks: 0 +172.23.22.159:9422:/mnt/sdd1/lizardfs_vol/ + to delete: no + damaged: yes + scanning: no + last error: no errors + total space: 0B + used space: 0B + chunks: 0 +172.23.22.159:9422:/mnt/sdc1/lizardfs_vol/ + to delete: no + damaged: yes + scanning: no + last error: no errors + total space: 0B + used space: 0B + chunks: 0 +``` + +Stopped the chunk server. +sde remounted after xfs_repair. The others were not visible, so rebooted. 
The following storage should add to the total again: + +``` +/dev/sdc1 4.6T 3.9T 725G 85% /mnt/sdc1 +/dev/sdd1 4.6T 4.2T 428G 91% /mnt/sdd1 +/dev/sdf1 4.6T 4.2T 358G 93% /mnt/sdf1 +/dev/sde 3.7T 3.7T 4.0G 100% /mnt/sde +/dev/sdg1 3.7T 3.7T 3.9G 100% /mnt/sdg1 +``` + +After adding this storage and people removing material it starts to look better: + +``` +mfs#octopus01:9421 171T 83T 89T 49% /lizardfs +``` + +# O3 + +I have marked the disks (4x4T) on o3 for deletion - that will subtract 7T. This is in preparation for upgrading Linux and migrating those disks to moosefs. Continue below. + +# T5 + +T5 requires a new bios - it has the same one as the unreliable T4. I also need to see if there are any disks in the bios we don't see right now. T5 has two small fast SSDs and one larger one (3.5T). + +I managed to install the new bios, but I had trouble getting into linux because of some network/driver issues. ipmi was suspect. Finally managed rescue mode by adding 'systemd.unit=emergency.target' in the grub line. 'single' is no longer enough (grrr). One to keep in mind. + +Had to disable ipmi modules. See my idrac.org. + +# T6 + +Tux06 (T6) contains two unused drives that appear to have contained XFS. xfs_repair did not really help... +The BIOS on T6 is newer than on T4+T5. That probably explains why the higher T numbers have no disk issues, while T4+T5 had problems with non-OEM! Anyway, as I was at it, I updated the BIOS for all. + +T6 has 4 SSDs, 2x 3.5T. Both unused. The lizard chunk server is failing, so might as well disable it. + +I am using T6 to test network boots because it is not serving lizard. + +# T7 + +On T7 root was full(!?). Culprit was Andrea with /tmp/sweepga_genomes_111850/. +T7 has 3x3.5T with one unused. + +# T8 + +T8 has 3x3.5T, all used. After the BIOS upgrade the efi partition did not boot. After a few reboots it did get into grub and I made a copy of the efi partition on sdd (just in case). + +# T9 + +T9 has 1x3.5T. Used.
I had to reduce HDD_LEAVE_SPACE_DEFAULT to give the chunkserver some air. + +# O3 + O4 + +Back to O3, our future head node. Lizard has mostly been depleted, though every drive has a few chunks left. I just pulled down the chunkserver and lizard appears to be fine (no errors). Good! + +Next install Linux. I have two routes, one is using debootstrap, the other is via PXE. I want to try the latter. + +So far, I managed to boot into ipxe on Octopus. +The linux kernel loads over http, but it does not show output. Likely I need to: + +* [X] Build ipxe with serial support +* [X] Test the installer with serial support +* [X] Add NFS support +* [X] debootstrap install of new Debian on /export/nfs/nodes/debian14 +* [X] Make available through NFS and boot through IPXE + +I managed to boot T6 over the network. +Essentially we have a running Debian last stable on T6 that is completely run over NFS! +In the next steps I need to figure out: + +* [X] Mount NFS with root access +* [ ] Every PXE node needs its own hard disk configuration +* [ ] Mount NFS from octopus01 +* [ ] Start slurm + +We can have this as a test node pretty soon. +But first we have to start moosefs and migrate data. + +I am doing some small tests and will put (old) T6 back on slurm again. + +To get every node booted with its own version of fstab and state logging on a local disk we need to pull some trick with initrd. + +Basically the NFS boot initrd needs to contain a script that invokes changes for every node. The node hostname and primary partition can be passed on from ipxe using the kernel parameters myhost=client01 localdisk=/dev/sda1. So that is the differentiator. The script in /etc/nodes/initramfs-tools/update-node-etc will remount /tmp and /var onto $localdisk and copy /etc there too. Next it will symlink a few files, such as /etc/hostname and /etc/fstab to adjust for local settings. + +This way we will deploy all nodes centrally.
One aspect is that we don't need dynamic user management as it is centrally orchestrated! The user files can be copied from the head node when they change. + +O4 is going to be the backup head node. It will act as a compute node too, until we need it as the head node. O4 is currently not on the slurm queue. + +* [X] Update guix on O1 +* [X] Install guix moosefs +* [X] Start moosefs master on O3 +* [X] Start moosefs metalogger on O4 +* [ ] Check moosefs logging facilities +* [ ] See if we can mark drives so it is easier to track them +* [ ] Test broken (?) /dev/sdf on octopus03 + +We can start moose master on O3. We should use different ports than lizard. Lizard uses 9419-24 by default. So let's use +9519- ports. See + +=> moosefs/moosefs-maintenance.gmi + +# P2 + +Penguin2 has 80T of spinning disk storage. We are going to use that for redundancy. Basically these disks get a moosefs goal of HDD 'slow' and we'll configure them on a remote rack - so chunks get fetched from local chunk servers (first). This will gain us 40T of immediate storage. Adding more spinning disks will free up SSDs further. + +* [X] P2 Update Guix +* [X] Install moosefs +* [ ] Create HDD chunk server diff --git a/topics/octopus/recent-rust.gmi b/topics/octopus/recent-rust.gmi new file mode 100644 index 0000000..7ce8968 --- /dev/null +++ b/topics/octopus/recent-rust.gmi @@ -0,0 +1,76 @@ +# Use a recent Rust on Octopus + + +For impg we currently need a rust that is more recent than what we have in Debian +or Guix. No panic, because Rust has few requirements. + +Install latest rust using the script + +``` +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +``` + +Set path + +``` +. ~/.cargo/env +``` + +Update rust + +``` +rustup default stable +``` + +Next update Rust + +``` +octopus01:~/tmp/impg$ . 
~/.cargo/env +octopus01:~/tmp/impg$ rustup default stable +info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu' +info: latest update on 2025-05-15, rust version 1.87.0 (17067e9ac 2025-05-09) +info: downloading component 'cargo' +info: downloading component 'clippy' +info: downloading component 'rust-docs' +info: downloading component 'rust-std' +info: downloading component 'rustc' +(...) +``` + +and build the package + +``` +octopus01:~/tmp/impg$ cargo build +``` + +Since we are not in guix we get the local dependencies: + +``` +octopus01:~/tmp/impg$ ldd target/debug/impg + linux-vdso.so.1 (0x00007ffdb266a000) + libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe404001000) + librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe403ff7000) + libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe403fd6000) + libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe403fd1000) + libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe403e11000) + /lib64/ld-linux-x86-64.so.2 (0x00007fe404682000) +``` + +Log in on another octopus, say 02, and you can run impg from this directory: + +``` +octopus02:~$ ~/tmp/impg/target/debug/impg +Command-line tool for querying overlaps in PAF files + +Usage: impg <COMMAND> + +Commands: + index Create an IMPG index + partition Partition the alignment + query Query overlaps in the alignment + stats Print alignment statistics + +Options: + -h, --help Print help + -V, --version Print version +``` diff --git a/topics/octopus/set-up-guix-for-new-users.gmi b/topics/octopus/set-up-guix-for-new-users.gmi new file mode 100644 index 0000000..f459559 --- /dev/null +++ b/topics/octopus/set-up-guix-for-new-users.gmi @@ -0,0 +1,38 @@ +# Set up Guix for new users + +This document describes how to set up Guix for new users on a machine in which Guix is already installed (such as octopus01). + +## Create a per-user profile for yourself by running your first guix pull + +"Borrow" some other user's guix to run guix pull.
In the example below, we use root's guix, but it might as well be any guix. +``` +$ /var/guix/profiles/per-user/root/current-guix/bin/guix pull +``` +This should create your very own Guix profile at ~/.config/guix/current. You may invoke guix from this profile as +``` +$ ~/.config/guix/current/bin/guix ... +``` +But, you'd normally want to make this more convenient. So, add ~/.config/guix/current/bin to your PATH. To do this, add the following to your ~/.profile +``` +GUIX_PROFILE=~/.config/guix/current +. $GUIX_PROFILE/etc/profile +``` +Thereafter, you may run any guix command simply as +``` +$ guix ... +``` + +## Pulling from a different channels.scm + +By default, guix pull pulls the latest commit of the main upstream Guix channel. You may want to pull from additional channels as well. Put the channels you want into ~/.config/guix/channels.scm, and then run guix pull. For example, here's a channels.scm if you want to use the guix-bioinformatics channel. +``` +$ cat ~/.config/guix/channels.scm +(list (channel + (name 'gn-bioinformatics) + (url "https://git.genenetwork.org/guix-bioinformatics") + (branch "master"))) +``` +And, +``` +$ guix pull +``` diff --git a/topics/octopus/slurm-upgrade.gmi b/topics/octopus/slurm-upgrade.gmi new file mode 100644 index 0000000..822f68e --- /dev/null +++ b/topics/octopus/slurm-upgrade.gmi @@ -0,0 +1,89 @@ +# How to upgrade slurm on octopus + +This document closely mirrors the official upgrade guide. The official upgrade guide is very thorough. Please refer to it and update this document if something is not clear. +=> https://slurm.schedmd.com/upgrades.html Official slurm upgrade guide + +## Preparation + +It is possible to upgrade slurm in-place without upsetting running jobs. But, for our small cluster, we don't mind a little downtime. So, it is simpler if we schedule some downtime with other users and make sure there are no running jobs. + +slurm can only be upgraded safely in small version increments. 
For example, it is safe to upgrade version 18.08 to 19.05 or 20.02, but not to 20.11 or later. This compatibility information is in the RELEASE_NOTES file of the slurm git repo with the git tag corresponding to the version checked out. Any configuration file changes are also outlined in this file. +=> https://github.com/SchedMD/slurm/ slurm git repository + +## Backup + +Stop the slurmdbd, slurmctld and slurmd services. +``` +# systemctl stop slurmdbd slurmctld slurmd slurmrestd +``` +Backup the slurm StateSaveLocation (/var/spool/slurmd/ctld in our case) and the slurm configuration directory. +``` +# cp -av /var/spool/slurmd/ctld /somewhere/safe/ +# cp -av /etc/slurm /somewhere/safe/ +``` +Backup the slurmdbd MySQL database. Enter the password when prompted. The password is specified in StoragePass of /etc/slurm/slurmdbd.conf. +``` +$ mysqldump -u slurm -p --databases slurm_acct_db > /somewhere/safe/slurm_acct_db.sql +``` + +## Upgrade slurm on octopus01 (the head node) + +Clone the gn-machines git repo. +``` +$ git clone https://git.genenetwork.org/gn-machines +``` +Edit slurm.scm to build the version of slurm you are upgrading to. Ensure it builds successfully using +``` +$ guix build -f slurm.scm +``` +Upgrade slurm. +``` +# ./slurm-head-deploy.sh +``` +Make any configuration file changes outlined in RELEASE_NOTES. Next, run the slurmdbd daemon, wait for it to start up successfully and then exit with Ctrl+C. During upgrades, slurmdbd may take extra time to update the database. This may cause systemd to timeout and kill slurmdbd. So, we do it this way, instead of simply starting the slurmdbd systemd service. +``` +# sudo -u slurm slurmdbd -D +``` +Reload the new systemd configuration files. Then, start the slurmdbd, slurmctld and slurmd services one at a time ensuring that each starts up correctly before proceeding on to the next. 
+``` +# systemctl daemon-reload +# systemctl start slurmdbd +# systemctl start slurmctld +# systemctl start slurmd +# systemctl start slurmrestd +``` + +## Upgrade slurm on the worker nodes + +Repeat the steps below on every worker node. + +Stop the slurmd service. +``` +# systemctl stop slurmd +``` +Upgrade slurm, passing slurm-worker-deploy.sh the slurm store path obtained from building slurm using guix build on octopus01. Recall that you cannot invoke guix build on the worker nodes. +``` +# ./slurm-worker-deploy.sh /gnu/store/...-slurm +``` +Copy over any configuration file changes from octopus01. Then, reload the new systemd configuration files and start slurmd. +``` +# systemctl daemon-reload +# systemctl start slurmd +``` + +## Tip: Running the same command on all worker nodes + +It is a lot of typing to run the same command on all worker nodes. You could make this a little less cumbersome with the following bash for loop. +``` +for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09; +do + ssh $node your command +done +``` +You can even do this for sudo commands using the -S flag of sudo that makes it read the password from stdin. Assuming your password is in the pass password manager, the bash for loop would then look like: +``` +for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09; +do + pass octopus | ssh $node sudo -S your command +done +``` \ No newline at end of file diff --git a/topics/pangenome/impg/impg-agc-bindings.gmi b/topics/pangenome/impg/impg-agc-bindings.gmi new file mode 100644 index 0000000..2451c0a --- /dev/null +++ b/topics/pangenome/impg/impg-agc-bindings.gmi @@ -0,0 +1,246 @@ +# IMPG AGC bindings + +In this document we will create a build setup that allows us to use AGC (a C++ library) from a recent Rust compiler. The original binding proves tricky. So we break it down into parts. 
Also we try out the new Rust cargo support in Guix. + +Fortunately the AGC include file contains a limited list of functions that have C ABI bindings: + +```c +EXTERNC agc_t* agc_open(char* fn, int prefetching); +EXTERNC int agc_close(agc_t* agc); +EXTERNC int agc_get_ctg_len(const agc_t *agc, const char *sample, const char *name); +EXTERNC int agc_get_ctg_seq(const agc_t *agc, const char *sample, const char *name, int start, int end, char *buf); +EXTERNC int agc_n_sample(const agc_t* agc); +EXTERNC int agc_n_ctg(const agc_t *agc, const char *sample); +EXTERNC char* agc_reference_sample(const agc_t* agc); +EXTERNC char **agc_list_sample(const agc_t *agc, int *n_sample); +EXTERNC char **agc_list_ctg(const agc_t *agc, const char *sample, int *n_ctg); +EXTERNC int agc_list_destroy(char **list); +EXTERNC int agc_string_destroy(char *sample); +``` + +Even for a C++ library it is very thoughtful to provide a C ABI! Both the current Rust binding and the Python example in AGC actually use the C++ class - which means they need to build against a matching C++ source tree. +It should be straightforward to create a Rust module that calls into the shared library directly using the C ABI instead of importing and building all the source code. + +One early choice is a separation of concerns. We will try to build the library independently of the Rust package. This follows a standard model. For example cargo should not build zlib - it is provided by the environment. The bindings, meanwhile, are defined and built in cargo. + +# Tasks + +* [X] Fix AGC passing exceptions through C ABI +* [X] Get guix to compile impg (here testlibagc) with AGC +* [ ] Add optimization +* [ ] Make sure spoa build in spoa-rs is optimized +* [ ] Create static binary for distribution +* [ ] Create singularity example + +# Steps + +## Setting up Guix with rust + +Guix provides a reproducible build environment. If you get over the fact that it is Lisp, it proves a remarkably nice way to handle dependencies.
The first step is to set up guix so you get a recent set of dependencies. For this run guix pull and set it up in a profile + +```sh +guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix +``` + +it takes a few minutes. Next set the environment + +```sh +unset GUIX_PROFILE +. ~/opt/guix-pull/etc/profile +``` + +and list the packages + +```sh +guix package -A rust +rust 1.85.1 rust-src,tools,out,cargo gnu/packages/rust.scm:1454:4 +``` + +should show a recent edition of rust (typically about half a year old, the rust-team in guix is now working on 1.89). Note you can also pull an older version of guix (and rust) by passing in the git hash value of the codeberg repo. This allows you to go back to the dependency tree of, say, three months ago. It allows for a level of sanity not seen in other software deployment systems. + +Note that we tend not to be too recent with packages as Guix is used to deploy *stable* systems. If you want a more recent version of rust you can write your own guix package - it is not that hard. We may attempt it later for this exercise. + +Note also that newbies run guix-pull too often. I typically do it every three months, or so. So the slowness of guix-pull should not really count. + +One thing that is a bit funny now is that we currently can't list most cargo packages in guix because the crates are now 'local' to a package. We have to check the source tree: + +=> https://codeberg.org/guix/guix/src/branch/master/gnu/packages/rust-crates.scm + +## Building AGC in guix + +AGC is a C++ program with a C ABI. The README suggests there are no dependencies, but that is misleading. It sources other dependencies and builds them (bit like git submodules). I managed to build AGC using a guix shell with: + +```sh +guix shell -C guix gcc-toolchain make libdeflate pkg-config xz mimalloc coreutils sed minizip-ng lzlib zlib:static zstd:static zstd:lib zstd zlib +make PLATFORM=avx2 libagc +``` + +Note it pulls in too much.
To make it compile the patch I applied is + +```diff +--- a/agc/makefile ++++ b/agc/makefile +@@ -14,14 +14,14 @@ $(call SET_SRC_OBJ_BIN,src,obj,bin) + + # *** Project configuration + $(call CHECK_NASM) +-$(call ADD_MIMALLOC, $(3RD_PARTY_DIR)/mimalloc) ++# $(call ADD_MIMALLOC, $(3RD_PARTY_DIR)/mimalloc) + $(call PROPOSE_ISAL, $(3RD_PARTY_DIR)/isa-l) +-$(call PROPOSE_ZLIB_NG, $(3RD_PARTY_DIR)/zlib-ng) +-$(call CHOOSE_GZIP_DECOMPRESSION) +-$(call ADD_LIBDEFLATE, $(3RD_PARTY_DIR)/libdeflate) +-$(call ADD_LIBZSTD, $(3RD_PARTY_DIR)/zstd) ++# $(call PROPOSE_ZLIB_NG, $(3RD_PARTY_DIR)/zlib-ng) ++# $(call CHOOSE_GZIP_DECOMPRESSION) ++# $(call ADD_LIBDEFLATE, $(3RD_PARTY_DIR)/libdeflate) ++# $(call ADD_LIBZSTD, $(3RD_PARTY_DIR)/zstd) + $(call ADD_RADULS_INPLACE,$(3RD_PARTY_DIR)/raduls-inplace) +-$(call ADD_PYBIND11,$(3RD_PARTY_DIR)/pybind11/include) ++# $(call ADD_PYBIND11,$(3RD_PARTY_DIR)/pybind11/include) + $(call SET_STATIC, $(STATIC_LINK)) + + $(call SET_C_CPP_STANDARDS, c11, c++20) +@@ -57,7 +57,7 @@ $(OUT_BIN_DIR)/agc: \ + $(CXX) -o $@ \ + $(MIMALLOC_OBJ) \ + $(OBJ_APP) $(OBJ_CORE) $(OBJ_COMMON) \ +- $(LIBRARY_FILES) $(LINKER_FLAGS) $(LINKER_DIRS) ++ $(LIBRARY_FILES) -lzstd -lz -ldeflate $(LINKER_FLAGS) $(LINKER_DIRS)^M + + libagc: $(OUT_BIN_DIR)/libagc + $(OUT_BIN_DIR)/libagc: +``` + +Essentially it disables the 3rd-party dependency builds, in favour of using the Guix ones. + +Note that Bioconda installs AGC as a binary: + +=> https://github.com/bioconda/bioconda-recipes/blob/master/recipes/agc/meta.yaml + +So it circumvents building AGC by downloading the provided static binaries. It only downloads the binary, not the library. + +## The current cargo package + +The current cargo bindings package named agc-rs vendors in (in its turn) the AGC github repository. Similarly to git modules.
It is kinda ironic that we left git submodules for something that is not better (maybe even worse because it does not do the hash values, but a versioned branch/tag -- who is to say what happened upstream). + +## Changes + +So we propose to take a different approach when it comes to distributing software. First premise is that we will prepare pre-built *binaries* for external use that can be handled by conda and singularity. Both these deployers can handle external dependencies, so we can just use a standard AGC build/distribution. That is key to keeping sane - so not have cargo build AGC itself as it is just a library with a decent C ABI. + +To make it work with Rust we can create a cargo module that binds to the C ABI using FFI (and not care where the AGC library comes from). One great feature is we can use the C ABI without having to generate bindings using clang and all that. A C ABI can be written and maintained by hand in Rust. + +For C++ only libraries, the narrative gets a bit harder. If the C++ interface is rich it may be best to use a bindings generator. In general it should be possible to provide a C ABI that calls into C++, however, in C. This means we can take the same deployment approach (in general) for pure C++ libraries, provided we can write a short C ABI. I have done this for vcflib, for example, to write the Zig version of vcflib: + +=> https://github.com/vcflib/vcflib/blob/master/src/vcf-c-api.cpp + +To support AGC in Rust we need to: + +* [X] Create a Rust binding that uses the AGC C ABI instead of the C++ one, so we can use a statically built AGC lib and don't need the source tree for cargo + +We will also write a + +* [ ] Guix build to create the optimized AGC static lib +* [ ] Guix build that creates an optimized impg + +And that last one allows us to distribute prebuilt binaries in CONDA and apptainer/singularity/docker. 
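
To make the "written and maintained by hand" point concrete, here is a minimal sketch of what such a binding can look like. It is only an illustration and not the code of any published crate. The three AGC declarations are copied from the C prototypes listed earlier; the type name `AgcT`, the helper `c_strlen`, and the idea of linking through a build script that prints `cargo:rustc-link-lib=agc` are assumptions. So that the sketch builds on a machine without libagc, nothing calls the AGC functions; the live call goes to `strlen` from the C library, which is bound in exactly the same way. The `unsafe extern` spelling needs Rust 1.82 or later.

```rust
use std::ffi::{c_char, c_int, CString};

/// Opaque stand-in for the C `agc_t` type; only ever used behind a pointer.
/// The name `AgcT` is made up for this sketch.
#[repr(C)]
pub struct AgcT {
    _private: [u8; 0],
}

// Hand-written declarations for three of the C prototypes listed earlier.
// Nothing below calls them, so this file builds without libagc. A real -sys
// crate would also have to tell cargo to link the library (assumption: via a
// build script that prints `cargo:rustc-link-lib=agc`).
#[allow(dead_code)]
unsafe extern "C" {
    fn agc_open(file_name: *mut c_char, prefetching: c_int) -> *mut AgcT;
    fn agc_n_sample(agc: *const AgcT) -> c_int;
    fn agc_close(agc: *mut AgcT) -> c_int;
}

// The same pattern against a function that the C library always provides.
unsafe extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

/// Safe wrapper: builds a NUL-terminated copy of `text` and keeps it alive
/// for the duration of the foreign call.
pub fn c_strlen(text: &str) -> Option<usize> {
    // CString::new fails if `text` contains an interior NUL byte.
    let c_text = CString::new(text).ok()?;
    // SAFETY: `c_text` is a valid NUL-terminated string that outlives the call.
    Some(unsafe { strlen(c_text.as_ptr()) })
}
```

A wrapper around `agc_open`, `agc_n_sample` and `agc_close` would follow the same shape as `c_strlen`: convert the Rust string, make the call inside one small `unsafe` block, and hand back an ordinary Rust value.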
+ +Note that this is the same approach as taken by + +=> https://github.com/rust-lang/libz-sys/blob/main/build.rs + +which binds against libz. It *optionally* builds the source tree of zlib which is included as a submodule + +=> https://github.com/rust-lang/libz-sys/tree/main/src + +In our case, a rebuild can be useful when AGC lib can not be found. Note that the cargo edition of libz-sys does not invoke make or cmake. It builds it by 'hand'! + +There is also libz-rs, but that is a somewhat typical Rust rewrite of libz: + +=> https://github.com/trifectatechfoundation/zlib-rs + +I also took a quick look at the rust spoa crate. Here a build is always forced, but I don't think it actually optimizes the build. Add a note to my tasks. + +## First guix package by Fred + +Fred drafted a first guix package which can build impg with + +``` +guix build -L .guix/modules -f guix.scm + +/gnu/store/cdjiq6aalpc849hl8irmbn8xax9mq2b6-impg-0.3.1/bin/impg +Command-line tool for querying overlaps in PAF files + +Usage: impg <COMMAND> + +Commands: + index Create an IMPG index + lace Lace files together (graphs or VCFs) + partition Partition the alignment + query Query overlaps in the alignment + similarity Compute pairwise similarity between sequences in a region + stats Print alignment statistics + +Options: + -h, --help Print help + -V, --version Print version +``` + +It builds against rust 1.85 and uses the new cargo support in Guix. It does not have to rebuild the cargo packages already in guix. Nice and a good start! + +=> https://github.com/pangenome/impg/blob/f5ebaf8b511ee06bdeb193ef509836c26cd4793a/.guix/modules/impg/impg.scm#L4 + +we'll still need to add AGC, static output and optimizations. + +## Adding a guix package for AGC + +As a first step we build a package for AGC that compiles libagc.a using AVX2: + +=> https://github.com/pjotrp/impg/commit/ed16948cc4145ff933a19ba54c3bc1fe4cec709f + +we used the vendored in source for raduls-inplace and isa-l. 
Not sure they are really required, but I think it is harmless here. + +## Make sure libagc.a is linked to impg + +To create a rust package for binding libagc it is worth reading: + +=> https://doc.rust-lang.org/cargo/reference/build-scripts.html#a-sys-packages + +* The library crate should link to the native library libfoo. This will often probe the current system for libfoo before resorting to building from source. +* The library crate should provide declarations for types and functions in libfoo, but not higher-level abstractions. + +So we should create an agc-rs crate that provides a high-level interface to the upcoming libagc-sys crate. No wonder these crates proliferate. + +# Using a linked libagc.so + +I managed to create a crate that binds libagc.so against Rust: + +=> https://github.com/pjotrp/libagc-sys + +See also the included test in lib.rs. It binds against the updated agc: + +=> https://github.com/refresh-bio/agc/compare/main...pjotrp:agc:main + +which contains the fixes that don't allow C++ exceptions to pass through the C ABI. +Also I fixed one function and added a shared lib as output. + +Finally, rather than messing with the impg code tree (which keeps changing), I created a test crate that mirrors impg: + +=> https://github.com/pjotrp/testlibagc + +which can be build and run with + +``` +cargo build --release +target/release/testagc-sys +Number of samples: 4 +``` + +At least we have a reference implementation for binding successfully against a shared C library with a very *light* and standardised interface. It obviously also works in Guix. We can use it to benchmark against the new (impressive) Rust implementation by Erik. It also acts as a template for future bindings. + +Note that we should discourage C++ bindings. Mostly because there is no standard C++ ABI (in contrast to the C one), so avoid the use of the cxx crates - unless you really know what you are doing. 
+ +Potential future work is: + +- [ ] Optimized runtime +- [ ] Static binary for distribution diff --git a/topics/programming/autossh-for-keeping-ssh-tunnels.gmi b/topics/programming/autossh-for-keeping-ssh-tunnels.gmi new file mode 100644 index 0000000..a977232 --- /dev/null +++ b/topics/programming/autossh-for-keeping-ssh-tunnels.gmi @@ -0,0 +1,65 @@ +# Using autossh to Keep SSH Tunnels Alive + +## Tags +* keywords: ssh, autossh, tunnel, alive + + +## TL;DR + +``` +guix package -i autossh # Install autossh with Guix +autossh -M 0 -o "ServerAliveInterval 60" -o "ServerAliveCountMax 5" -L 4000:127.0.0.1:3306 alexander@remoteserver.org +``` + +## Introduction + +Autossh is a utility for automatically restarting SSH sessions and tunnels if they drop or become inactive. It's particularly useful for long-lived tunnels in unstable network environments. + +See official docs: + +=> https://www.harding.motd.ca/autossh/ + +## Installing autossh + +Install autossh using Guix: + +``` +guix package -i autossh +``` + +Basic usage: + +``` +autossh [-V] [-M monitor_port[:echo_port]] [-f] [SSH_OPTIONS] +``` + +## Examples + +### Keep a database tunnel alive with autossh + +Forward a remote MySQL port to your local machine: + +**Using plain SSH:** + +``` +ssh -L 5000:localhost:3306 alexander@remoteserver.org +``` + +**Using autossh:** + +``` +autossh -L 5000:localhost:3306 alexander@remoteserver.org +``` + +### Better option + +``` +autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -L 5000:localhost:3306 alexander@remoteserver.org +``` + +#### Option explanations: + +- `ServerAliveInterval`: Seconds between sending keepalive packets to the server (default: 0). +- `ServerAliveCountMax`: Number of unanswered keepalive packets before SSH disconnects (default: 3). + +You can also configure these options in your `~/.ssh/config` file to simplify command-line usage. 
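As a sketch of that last point, the tunnel settings from the "Better option" example can live in a `Host` stanza. The helper below only prints the stanza; the alias `dbtunnel` and the function name `tunnel_stanza` are made up for illustration, and the host, user and ports are simply copied from the examples above.

```sh
# Print a ~/.ssh/config stanza for the MySQL tunnel.
# "dbtunnel" is a hypothetical alias; pick any name you like.
tunnel_stanza() {
    cat <<'EOF'
Host dbtunnel
    HostName remoteserver.org
    User alexander
    LocalForward 5000 localhost:3306
    ServerAliveInterval 30
    ServerAliveCountMax 3
EOF
}

# Review the output first, then append it once by hand:
#   tunnel_stanza >> ~/.ssh/config
# After that the tunnel should start with:
#   autossh -M 0 -N dbtunnel
```

With the stanza in place the forwarding and keepalive options no longer need to be repeated on the command line; `-N` tells ssh not to run a remote command, which is usually what you want for a pure tunnel.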
diff --git a/topics/programming/better-logging.gmi b/topics/programming/better-logging.gmi index dca8c0d..d80bb0d 100644 --- a/topics/programming/better-logging.gmi +++ b/topics/programming/better-logging.gmi @@ -1,14 +1,17 @@ -# Improving Logging in GN2 +# Improving Alerting/Logging in GN2 -## What Are We Trying To Solve? +## Problem Statement -We prioritise maintaining user functionality over speed in GN [with time this speed will be improved]. As such we should be pay more attention at not breaking any currently working GN2 functionality. And when/if we do, trouble-shooting should be easy. On this front, one way is to stream-line logging in both GN2/GN3 and make it more script friendly - only report when something fails, not to instrument variables - and in so doing make the process of monitoring easier. +Current logging in the genenetwork ecosystem is noisy and difficult to parse programatically which makes it hard to: + +* Integrate logs into some observability pipeline (E.g. sheepdog). +* Troubleshoot issues as they occur. We always learn of bugs from users. ## Goals -* Have script-friendly error/info logs. -* Remove noise from GN2. -* Separate logging into different files: error logs, info logs. Add this somewhere with Flask itself instead of re-directing STDOUT to a file. +* Standardize logging format and config across GN2 flask apps and gn-guile. +* Adopt structured logging. +* Extend sheep-dog to be able to parse gn logs and send alerts on e-mail or matrix. 
### Non-goals @@ -27,3 +30,5 @@ We prioritise maintaining user functionality over speed in GN [with time this sp ## Resources => https://realpython.com/python-logging/ Logging in Python +=> https://signoz.io/guides/python-logging-best-practices/ Python Logging Best Practices - Obvious and Not-So-Obvious +=> https://signoz.io/blog/what-is-opentelemetry/ What is OpenTelemetry diff --git a/topics/rust/guix-rust-bootstrap.gmi b/topics/rust/guix-rust-bootstrap.gmi new file mode 100644 index 0000000..cd3c322 --- /dev/null +++ b/topics/rust/guix-rust-bootstrap.gmi @@ -0,0 +1,173 @@ +# Guix Rust Bootstrap + +To develop code against rust you often need a recent edition of rust. With Guix this is possible because you don't depend on the underlying linux distribution to provide recent versions of glibc and other libraries. Here we have a recipe that should work anywhere on Linux. + +I succeeded in running the latest Rust on Octopus and building packages with guix. + +To make it work the following steps are required: + +* Update guix with guix-pull if your guix is older than 3 months +* Unset GUIX_PROFILE on some systems +* Set your updated guix profile vars +* Create a container that has all dependencies for rust itself +* Run rustup +* Run cargo with LD_LIBRARY_PATH set to $GUIX_ENVIRONMENT/lib + +# Get Guix updated + +Important is to have a recent version of Guix. This is achieved with 'guix pull' and making sure it works. + + +```sh +mkdir -p ~/opt +guix pull -p ~/opt/guix-pull --url=https://codeberg.org/guix/guix +``` + +it takes a few minutes. Next set the environment + +```sh +unset GUIX_PROFILE +. ~/opt/guix-pull/etc/profile +``` + +This will point the path to a recent guix. You can make sure with + +``` +guix describe + guix 772c456 + repository URL: https://codeberg.org/guix/guix + branch: master + commit: 772c456717e755829397a6ff6dba4c1e135426d8 +``` + +which can be validated against the Guix tree. 
Running + + +```sh +guix package -A rust +rust 1.85.1 rust-src,tools,out,cargo gnu/packages/rust.scm:1454:4 +``` + +shows the current *stable* version in Guix. Now, of course, we want something more to get rust latest. + +# Update Rust and Cargo to latest (stable) + +The trick is to set up a container with Rust in your git working directory: + +``` +mkdir -p ~/.cargo ~/.rustup # to prevent rebuilds +guix shell --share=$HOME/.cargo --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +. ~/.cargo/env +rustup default stable +``` + +Now rustc shows it is recent: + +``` +rustc --version +rustc 1.90.0 (1159e78c4 2025-09-14) +``` + +Next run 'cargo build' with: + +``` +env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib cargo build + Compiling libagc-sys v0.1.0 (/home/wrk/iwrk/opensource/code/pangenome/libagc-sys) + Finished 'dev' profile [unoptimized + debuginfo] target(s) in 0.06s +$ ./target/debug/libagc-sys +./target/debug/libagc-sys: error while loading shared libraries: libgcc_s.so.1: cannot open shared object file: No such file or directory +$ env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ./target/debug/libagc-sys +Hello, world! +``` + +and your source should build and run. Note the libgcc_s.so.1 error. + +## What if you get a libgcc or librt error? + +The problem is that cargo picks up the wrong libgcc: + +``` +$ ls /gnu/store/*/lib/libgcc_s.so.1 +/gnu/store/m2vhzr0dy352cn59sgcklcaykprrr4j6-gcc-14.3.0-lib/lib/libgcc_s.so.1 +/gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/lib/libgcc_s.so.1 +/gnu/store/v3bq3shn333kh7m6gj3r58l0v7mkn4in-profile/lib/libgcc_s.so.1 +/gnu/store/xm7i1gvi0i9pyndlkv627r08rsw1ny96-gcc-15.2.0-lib/lib/libgcc_s.so.1 +``` + +This is because Guix itself builds on an older libgcc and librt. 
You need to tell it explicitly what library to load that built your cargo: + +``` +ldd ~/.cargo/bin/cargo + linux-vdso.so.1 (0x00007ffd409b2000) + libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd2cf433000) + librt.so.1 => /lib/librt.so.1 (0x00007fd2cf42e000) +``` + +in the container: + +``` +ls -l /lib/libgcc_s.so.1 +lrwxrwxrwx 1 65534 overflow 82 Jan 1 1970 /lib/libgcc_s.so.1 -> /gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/lib/libgcc_s.so.1 +``` + +which happens to be the one in $GUIX_ENVIRONMENT/lib! So setting the library path solves it. + +The reason that we don't get the automatically resolving libraries that you normally have in guix is that we have updated rust by *hand* using rustup. Guix has no control over this process. + +# spoa-rs on octopus01 + +I just did above to build spoa-rs. Only had to add cmake to the shell packages. + +# sweepga on octopus01 + +I just built sweepga. Only had the add clang to the shell: + +```sh +guix shell --share=$HOME/.cargo --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl zlib cmake clang +. ~/.cargo/env +env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib cargo build +env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ./target/debug/sweepga +[sweepga::start::0.000*1.00] 2025-10-11 15:27:28 | ./target/debug/sweepga +[sweepga::detect::0.000*1.00] Using .1aln workflow (FastGA native format) +[sweepga] ERROR: No valid input provided +``` + +To run on the cluster you likely don't want to use the container. Make a note of GUIX_ENVIRONMENT: + +``` +echo $GUIX_ENVIRONMENT/ +/gnu/store/6khi7iv7l75595hwlfc1nwmdcv72m24s-profile/ +``` + +It has your libs! 
So, outside the container you can run + +``` +export GUIX_ENVIRONMENT=/gnu/store/6khi7iv7l75595hwlfc1nwmdcv72m24s-profile +env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib /home/wrk/tmp/sweepga/target/debug/sweepga +``` + +# Updating the container + +Now your build may fail because you are missing a crucial library or tool. This is a feature of guix containers as it makes dependencies explicit. + +Just add them to the guix shell command. Let's say we add zlib + +``` +guix shell --share=$HOME/.cargo --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl zlib +``` + +# Troubleshooting + +## Collisions + +Guix may complain about collisions. These are mostly naming issues: + +``` +warning: collision encountered: + /gnu/store/nym6kiinrg2mb8z4lwnvfx5my8df9vrs-glibc-for-fhs-2.41/bin/ldd + /gnu/store/rbs3nrx9z6sfawn3fa8r8z1kffdbnk8q-gcc-toolchain-15.2.0/bin/ldd +warning: choosing /gnu/store/nym6kiinrg2mb8z4lwnvfx5my8df9vrs-glibc-for-fhs-2.41/bin/ldd +``` + +it will link one into your environment. You can still use both tools by using the full path and can normally ignore the warning. diff --git a/topics/systems/backup-drops.gmi b/topics/systems/backup-drops.gmi index 191b185..a29e605 100644 --- a/topics/systems/backup-drops.gmi +++ b/topics/systems/backup-drops.gmi @@ -4,6 +4,10 @@ To make backups we use a combination of sheepdog, borg, sshfs, rsync. sheepdog i This system proves pretty resilient over time. Only on the synology server I can't get it to work because of some CRON permission issue. +For doing the actual backups see + +=> ./backups-with-borg.gmi + # Tags * assigned: pjotrp @@ -13,7 +17,7 @@ This system proves pretty resilient over time. Only on the synology server I can ## Borg backups -It is advised to use a backup password and not store that on the remote. +Despite our precautions it is advised to use a backup password and *not* store that on the remote.
## Running sheepdog on rabbit @@ -59,14 +63,14 @@ where remote can be an IP address. Warning: if you introduce this `AllowUsers` command all users should be listed or people may get locked out of the machine. -Next create a special key on the backup machine's ibackup user (just hit enter): +Next create a special password-less key on the backup machine's ibackup user (just hit enter): ``` su ibackup ssh-keygen -t ecdsa -f $HOME/.ssh/id_ecdsa_backup ``` -and copy the public key into the remote /home/bacchus/.ssh/authorized_keys +and copy the public key into the remote /home/bacchus/.ssh/authorized_keys. Now test it from the backup server with @@ -82,13 +86,20 @@ On the drop server you can track messages by tail -40 /var/log/auth.log ``` +or on recent linux with systemd + +``` +journalctl -r +``` + Next ``` ssh -v -i ~/.ssh/id_ecdsa_backup bacchus@dropserver ``` -should give a Broken pipe(!). In auth.log you may see something like +should give a Broken pipe(!) or -- more recently -- it says `This service allows sftp connections only`. +When running sshd with a verbose switch you may see something like fatal: bad ownership or modes for chroot directory component "/export/backup/" @@ -106,10 +117,23 @@ So, as root ``` cd /export mkdir -p backup/bacchus/drop -chown bacchus.bacchus backup/bacchus/drop/ +chown bacchus:bacchus backup/bacchus/drop/ chmod 0700 backup/bacchus/drop/ ``` +Another error may be: + +``` +fusermount3: mount failed: Operation not permitted +``` + +This means you need to set the suid on the fusermount3 command. Bit nasty in Guix. + +``` +apt-get install fuse(3) sshfs +chmod 4755 /usr/bin/fusermount +``` + If auth.log says error: /dev/pts/11: No such file or directory on ssh, or received disconnect (...) disconnected by user we are good to go! Note: at this stage it may pay to track the system log with @@ -171,3 +195,56 @@ sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/. 
The recent scripts can be found at => https://github.com/genenetwork/gn-deploy-servers/blob/master/scripts/tux01/backup_drop.sh + +# borg-borg + + +Backups work for production according to sheepdog. They run at 5am CST. Which (I guess) is OK. On the remote server we are going to forward the backup to a server on a different continent at 4pm GMT. I have been running that by hand lately, so time to sheepdog it! + +The manual command is + +``` +rsync -e "ssh -i ~/.ssh/id_ecdsa_borgborg" -vaP tux03 $HOST:/export/backup/bacchus/drop/ +``` + +With sheepdog we can make it: + +``` +sheepdog_run.rb -v --tag "drop-mount-$name" -c "sshfs -o $SFTP_SETTING,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@$host:/ ~/mnt/$name" +sheepdog_run.rb --always -v --tag "drop-rsync-$name" -c "rsync -vrltDP borg/* ~/mnt/$name/drop/$HOST/ --delete" +sheepdog_run.rb -v --tag "drop-unmount-$name" -c "fusermount -u ~/mnt/$name" +``` + +For some reason this took a while to figure out. Part of it is that the machine on the other end has a rather slow CPU! An +Intel(R) Celeron(R) CPU J1900 @ 1.99GHz launched over 10 years ago. We still use it because of its low energy consumption. Once it starts pumping a file it is up to speed + +``` +tux03/tux03-containers/data/0/239 + 154,501,120 29% 11.20MB/s 0:00:32 +``` + +So one backup of a backup has started running and I made it a CRON job. Next stop is borgborg on the receiving HOST. The CRON job looks like + +``` +0 3 * * * env BORG_PASSPHRASE=none /home/wrk/iwrk/deploy/deploy/bin/sheepdog_borg.rb -t borgborg --always -v -b /export/backup/bacchus/borgborg/drop /export/backup/bacchus/drop --args '--stats' >> ~/cron.log 2>&1 +``` + +note the backups are already password protected. No need to do that again. Now this backup is going to go onto optical media twice a year with the password printed on the backup. That should keep it for 100 years. 
+ +You can track this backup progress daily on the sheepdog status + +=> http://sheepdog.genenetwork.org/sheepdog/status.html + +i.e. in reverse order the flow is: + +``` +2025-09-18 08:35:00 +0200 FAIL host borgborg-backup +2025-09-18 16:19:45 -0500 SUCCESS balg01 drop-rsync-zero +2025-09-18 05:59:46 +0000 SUCCESS tux03 mariadb-check +2025-09-18 05:26:01 +0000 SUCCESS tux03 drop-rsync-balg01 +2025-09-18 05:25:48 +0000 SUCCESS tux03 borg-tux03-sql-backup +2025-09-18 04:44:38 +0000 SUCCESS tux03 mariabackup-make-consistent +2025-09-18 04:44:25 +0000 SUCCESS tux03 mariabackup-dump +``` + +The borgborg should be fixed now. I am missing the container backups. What is going on there? These were last backed up on 'Sun, 2025-09-14 00:00:52'. Ah, I set the CRON job to runs once a week. That should be fixed now and it should show up. diff --git a/topics/systems/backups-with-borg.gmi b/topics/systems/backups-with-borg.gmi new file mode 100644 index 0000000..dbd9192 --- /dev/null +++ b/topics/systems/backups-with-borg.gmi @@ -0,0 +1,449 @@ +# Borg backups + +We use borg for backups. Borg is an amazing tool and after 25+ years of making backups it just feels right. +With the new tux04 production install we need to organize backups off-site. The first step is to create a +borg runner using sheepdog -- sheepdog we use for monitoring success/failure. +Sheepdog essentially wraps a Unix command and sends a report to a local or remote redis instance. +Sheepdog also includes a web server for output: + +=> http://sheepdog.genenetwork.org/sheepdog/status.html + +which I run on one of my machines. + +# Tags + +* assigned: pjotrp +* keywords: systems, backup, sheepdog, database + +# Install borg + +Usually I use a version of borg from guix. This should really be done as the borg user (ibackup). 
+ +``` +ibackup@tux03:~$ mkdir ~/opt +ibackup@tux03:~$ guix package -i borg -p ~/opt/borg +~/opt/borg/bin/borg --version + 1.2.2 +``` + +# Create a new backup dir and user + +The backup should live on a *different* disk from the things we backup, so when that disk fails we have another. In fact in 2025 we had a corruption of the backups(!) We could recover from the original data + older backups. Not great. But if it had been the same disk it would have been worse. + +The SQL database lives on /export and the containers live on /export2. /export3 is a largish slow drive, so perfect. + +By convention I point /export/backup to the real backup dir on /export3/backup/borg/ Another convention is that we use an ibackup user which has the backup passphrase in ~/.borg-pass. As root: + +``` +mkdir /export/backup/borg +chown ibackup:ibackup /export/backup/borg +chown ibackup:ibackup /home/ibackup/.borg-pass +su ibackup +``` + +Now you should be able to load the passphrase and create the backup dir + +``` +id + uid=1003(ibackup) +. ~/.borg-pass +cd /export/backup/borg +~/opt/borg/bin/borg init --encryption=repokey-blake2 genenetwork +``` + +Note that we typically start from an existing backup. These go back a long time. + +Now we can run our first backup. Note that ibackup should be a member of the mysql and gn groups + +``` +mysql:x:116:ibackup +``` + +# First backup + +Run the backup the first time: + +``` +id + uid=1003(ibackup) groups=1003(ibackup),116(mysql) +~/opt/borg/bin/borg create --progress --stats genenetwork::first-backup /export/mysql/database/* +``` + +You may first need to update permissions to give group access + +``` +chmod g+rx -R /var/lib/mysql/* +``` + +When that works borg reports: + +``` +Archive name: first-backup +Archive fingerprint: 376d32fda9738daa97078fe4ca6d084c3fa9be8013dc4d359f951f594f24184d +Time (start): Sat, 2025-02-08 04:46:48 +Time (end): Sat, 2025-02-08 05:30:01 +Duration: 43 minutes 12.87 seconds +Number of files: 799 +Utilization of max. 
archive size: 0% +------------------------------------------------------------------------------ + Original size Compressed size Deduplicated size +This archive: 534.24 GB 238.43 GB 237.85 GB +All archives: 534.24 GB 238.43 GB 238.38 GB + Unique chunks Total chunks +Chunk index: 200049 227228 +------------------------------------------------------------------------------ +``` + +50% compression is not bad. borg is incremental so it will only backup differences next round. + +Once borg works we could run a CRON job. But we should use the sheepdog monitor to make sure backups keep going without failure going unnoticed. + +# Using the sheepdog + +=> https://github.com/pjotrp/deploy sheepdog code + +## Clone sheepdog + +=> https://github.com/pjotrp/deploy#install sheepdog install + +Essentially clone the repo so it shows up in ~/deploy + +``` +cd /home/ibackup +git clone https://github.com/pjotrp/deploy.git +/export/backup/scripts/tux04/backup-tux04.sh +``` + +## Setup redis + +All sheepdog messages get pushed to redis. You can run it locally or remotely. + +By default we use redis, but syslog and others may also be used. The advantage of redis is that it is not bound to the same host, can cross firewalls using an ssh reverse tunnel, and is easy to query. + +=> https://github.com/pjotrp/deploy#install sheepdog install + +In our case we use redis on a remote host and the results get displayed by a webserver. Also some people get E-mail updates on failure. The configuration is in + +``` +/home/ibackup# cat .config/sheepdog/sheepdog.conf . +{ + "redis": { + "host" : "remote-host", + "password": "something" + } +} +``` + +If you see localhost with port 6377 it is probably a reverse tunnel setup: + +=> https://github.com/pjotrp/deploy#redis-reverse-tunnel + +Update the fields according to what we use. Main thing is that is the definition of the sheepdog->redis connector. If you also use sheepdog as another user you'll need to add a config. 
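To tell a missing tunnel apart from a misconfigured sheepdog, it can help to probe the redis port directly. This is a minimal bash sketch, assuming the reverse tunnel listens on localhost:6377 as described above; `port_open` is a made-up helper and not part of sheepdog.

```bash
# port_open HOST PORT
# Succeeds when a TCP connection to HOST:PORT can be opened.
# Relies on bash's built-in /dev/tcp redirection; the subshell
# makes sure the file descriptor is closed again afterwards.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Example: check the local end of the redis tunnel
#   port_open 127.0.0.1 6377 && echo "tunnel is up" || echo "tunnel is down"
```

This only shows that something is listening on the port. It does not check the redis password or that the listener is actually redis.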
+ +Sheepdog should show a warning when you configure redis and it is not connecting. + +## Scripts + +Typically I run the cron job from root CRON so people can find it. Still it is probably a better idea to use an ibackup CRON. In my version a script is run that also captures output: + +```cron root +0 6 * * * /bin/su ibackup -c /export/backup/scripts/tux04/backup-tux04.sh >> ~/cron.log 2>&1 +``` + +The script contains something like + +```bash +#! /bin/bash +if [ "$EUID" -eq 0 ] + then echo "Please do not run as root. Run as: su ibackup -c $0" + exit +fi +rundir=$(dirname "$0") +# ---- for sheepdog +source $rundir/sheepdog_env.sh +cd $rundir +sheepdog_borg.rb -t borg-tux04-sql --group ibackup -v -b /export/backup/borg/genenetwork /export/mysql/database/* +``` + +and the accompanying sheepdog_env.sh + +``` +export GEM_PATH=/home/ibackup/opt/deploy/lib/ruby/vendor_ruby +export PATH=/home/ibackup/opt/deploy/deploy/bin:/home/wrk/opt/deploy/bin:$PATH +``` + +If it reports + +``` +/export/backup/scripts/tux04/backup-tux04.sh: line 11: /export/backup/scripts/tux04/sheepdog_env.sh: No such file or directory +``` + +you need to install sheepdog first. + +If all shows green (and takes some time) we made a backup. Check the backup with + +``` +ibackup@tux04:/export/backup/borg$ borg list genenetwork/ +first-backup Sat, 2025-02-08 04:39:50 [58715b883c080996ab86630b3ae3db9bedb65e6dd2e83977b72c8a9eaa257cdf] +borg-tux04-sql-20250209-01:43-Sun Sun, 2025-02-09 01:43:23 [5e9698a032143bd6c625cdfa12ec4462f67218aa3cedc4233c176e8ffb92e16a] +``` +and you should see the latest. The contents with all files should be visible with + +``` +borg list genenetwork::borg-tux04-sql-20250209-01:43-Sun +``` + +Make sure you do not see just a symlink. + +# More backups + +Our production server runs databases and file stores that need to be backed up too.
+ +# Drop backups + +Once backups work it is useful to copy them to a remote server, so when the machine stops functioning we have another chance at recovery. See + +=> ./backup-drops.gmi + +# Recovery + +With tux04 we ran into a problem where all disks were getting corrupted(!) Probably due to the RAID controller, but we still need to figure that one out. + +Anyway, we have to assume the DB is corrupt. Files are corrupt AND the backups are corrupt. Borg backup has checksums which you can + +``` +borg check repo +``` + +it has a --repair switch which we needed to remove some faults in the backup itself: + +``` +borg check --repair repo +``` + +# Production backups + +Now backups were supposed to run, but they don't show up yet. Ah, it is not yet 3am CST. Meanwhile we drop the backups on another server. Just in case we lose *both* drives on the production server and/or the server itself. To achieve this we have set up a user 'bacchus' with limited permissions on the remote. All bacchus can do is copy the files across. So, we add an ssh key and invoke the commands: + +``` +sheepdog_run.rb -v --tag "drop-mount-$name" -c "sshfs -o $SFTP_SETTING,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@$host:/ ~/mnt/$name" +sheepdog_run.rb --always -v --tag "drop-rsync-$name" -c "rsync -vrltDP borg/* ~/mnt/$name/drop/$HOST/ --delete" +sheepdog_run.rb -v --tag "drop-unmount-$name" -c "fusermount -u ~/mnt/$name" +``` + +essentially mounting the remote dir, rsync files across, and unmount. All monitored by sheepdog. Copying files over sshfs is not the fastest route, but it is very secure because of the limited permissions. On the remote we have space and for now we'll use the old backups as a starting point. When it works I'll disable and remove the old tux04 backups. Actually I'll disable the cron job now and make sure mariadb did not start (so no one can use that by mistake). All checked! + +Meanwhile the system log at point of failure shows no information. 
This means it is a hard crash the Linux kernel is not even aware of and it points out it is not a kernel/driver/software issue on our end. It really sucks. We'll work on it: + +=> tux04-disk-issues + +OK, so I prepared the old production backups on the remote and we run an update by hand. And after some fiddling with permissions it worked: + +``` +ibackup@tux03:/export/backup/scripts/tux03$ ./backup_drop_balg01.sh +fusermount: entry for /home/ibackup/mnt/balg01 not found in /etc/mtab +{:cmd=>"sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@balg01.genenetwork.org:/ ~/mnt/balg01", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :verbose=>true, :tag=>"drop-mount-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"} +sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,IdentityFile=~/.ssh/id_ecdsa_backup bacchus@balg01.genenetwork.org:/ ~/mnt/balg01 +No event to report <sheepdog_run> {:cmd=>"rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :always=>true, :verbose=>true, :tag=>"drop-rsync-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"} +rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete sending incremental file list +deleting genenetwork/integrity.1148 +(...) 
+sent 22,153,007 bytes received 352 bytes 3,408,209.08 bytes/sec +total size is 413,991,028,933 speedup is 18,687.51 +{:time=>"2025-09-12 07:51:52 +0000", :elapsed=>5, :user=>"ibackup", :host=>"tux03", :command=>"rsync -vrltDP borg/* ~/mnt/balg01/drop/tux03/ --delete", :tag=>"drop-rsync-balg01", :stdout=>nil, :stderr=>nil, :status=>0, :err=>"SUCCESS"} +Pushing out event <sheepdog_run> to <localhost:6377> +{:cmd=>"fusermount -u ~/mnt/balg01", :channel=>"run", :host=>"localhost", :port=>6377, :password=>"*", :verbose=>true, :tag=>"drop-unmount-balg01", :config=>"/home/ibackup/.config/sheepdog/sheepdog.conf"} +fusermount -u ~/mnt/balg01 +No event to report <sheepdog_run> +``` + +And on the remote I can see the added backup: + +``` +tux03-new Wed, 2025-09-10 04:33:21 [dd4bbdc30898327b62d8ccdc63c5285f916d5643bffe942b73561fe297540eae] +``` + +All good. Now we add this to CRON and track sheepdog to see if there are problems popping up. It now confirms: 'SUCCESS tux03 drop-rsync-balg01'. + +The backup drop setup is documented here: + +=> https://issues.genenetwork.org/topics/systems/backup-drops + +I am looking into setting up the backups again. Tux04 crashed a few days ago, yet again, so we were saved from that debacle! I rebooted to get at the old backups (they are elsewhere, but that is the latest). Setting up backups is slightly laborious, described here: + +=> https://issues.genenetwork.org/topics/systems/backups-with-borg + +we use sheepdog for monitoring + +=> http://sheepdog.genenetwork.org/sheepdog/status.html + +Code: + +=> https://github.com/pjotrp/deploy + +a tool that does a lot of checks in the background every day! Compressed backup sizes: + +``` +283G genenetwork +103G tux04-containers +``` + +the local network speed between tux04 and tux03 is 100 Mbs. Not bad, but it takes more an hour to move across. 
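The transfer estimate can be checked with a bit of shell arithmetic. This is a rough sketch that assumes the two sizes above are in GB (10^9 bytes) and reads the quoted speed as about 100 megabytes per second; both readings are assumptions, and `transfer_minutes` is a made-up helper.

```sh
# transfer_minutes SIZE_GB RATE_MB_PER_S
# Prints the transfer time in whole minutes (integer arithmetic).
transfer_minutes() {
    echo $(( $1 * 1000 / $2 / 60 ))
}

# genenetwork (283G) plus tux04-containers (103G) at ~100 MB/s
transfer_minutes $((283 + 103)) 100   # prints 64
```

That gives a little over an hour for a full copy. If the link were 100 megabits per second instead, the same data would take well over eight hours.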
+ +First manual backup worked: + +``` +ibackup@tux03:/export/backup/borg$ borg create genenetwork::tux03-new /export/mariadb/export/backup/mariadb/latest --stats --progress +Archive name: tux03-new +Archive fingerprint: dd4bbdc30898327b62d8ccdc63c5285f916d5643bffe942b73561fe297540eae +Time (start): Wed, 2025-09-10 09:33:21 +Time (end): Wed, 2025-09-10 10:02:52 +Duration: 29 minutes 31.00 seconds +Number of files: 907 +Utilization of max. archive size: 0% +------------------------------------------------------------------------------ + Original size Compressed size Deduplicated size +This archive: 536.84 GB 238.56 GB 3.68 MB +All archives: 65.60 TB 29.15 TB 303.71 GB + + Unique chunks Total chunks +Chunk index: 253613 24717056 +------------------------------------------------------------------------------ +``` + +Next we set up sheepdog for monitoring automated backups. Next to the +code repos we have a script repo at +'tux02.genenetwork.org:/home/git/pjotrp/gn-deploy-servers' which +currently handles monitoring for our servers, including: bacchus epysode +octopus01 penguin2 rabbit shared thebird tux01 tux02 tux04. Now tux03. The main backup script looks like + +``` +rm -rf $backupdir/latest +tag="mariabackup-dump" +sheepdog_run.rb --always -v --tag $tag -c "mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELEC +--kill-long-queries-timeout=120 --target-dir=$backupdir/latest/ --user=webqtlout --password=webqtlout" +tag="mariabackup-make-consistent" +sheepdog_run.rb --always -v --tag $tag -c "mariabackup --prepare --target-dir=$backupdir/latest/" +sheepdog_borg.rb -t borg-tux04-sql --always --group ibackup -v -b /export/backup/borg/genenetwork $backupdir --args ' +--stats' +``` + +What it does is make a full copy of mariadb databases and make sure it is consistent. Next we use borg to make a backup. The reason a DB have a consistent copy is that the running DB may change during the backup. And that is no good! 
We use sheepdog to monitor these command - i.e. on failure we get notified. First we run it by hand to make sure it works. First errors, for example + +``` +ibackup@tux03:/export/backup/scripts/tux03$ ./backup.sh +{:cmd=>"mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=/export/backup/mariadb/latest/ --user=webqtlout --password=webqtlout", :channel=>"run", :host=>"localhost", : port=>6379, :always=>true, :verbose=>true, :tag=>"mariabackup-dump", :config=>"/home/ibackup/.redis.conf"} mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-di r=/export/backup/mariadb/latest/ --user=webqtlout --password=webqtlout +[00] 2025-09-10 10:31:19 Connecting to MariaDB server host: localhost, user: webqtlout, password: set, port: not set, s +ocket: not set +[00] 2025-09-10 10:31:19 Using server version 10.11.11-MariaDB-0+deb12u1-log +(...) +[00] 2025-09-10 10:31:19 InnoDB: Using liburing +[00] 2025-09-10 10:31:19 mariabackup: The option "innodb_force_recovery" should only be used with "--prepare". +[00] 2025-09-10 10:31:19 mariabackup: innodb_init_param(): Error occurred. +``` + +The good thing is that the actual command is listed, so we can fix things a step at a time. + +``` +mariabackup --backup --innodb-io-capacity=200 --kill-long-query-type=SELECT --kill-long-queries-timeout=120 --target-dir=/export/backup/mariadb/latest/ --user=webqtlout --password=* +``` + +I had to disable 'innodb_force_recovery=1' to make it work. Also permissions have to allow the backup user with 'chmod u+rX -R /var/lib/mysql/*'. + +Now that works I need to make sure sheepdog can send its updates to the remote machine (in NL). It is a bit complicated because we set up an ssh tunnel that can only run redis commands. 
It looks like + +``` +3 * * * * /usr/bin/ssh -i ~/.ssh/id_ecdsa_sheepdog -f -NT -o ServerAliveInterval=60 -L 6377:127.0.0.1:6379 redis-tun@sheepdog.genenetwork.org >> tunnel.log 2>&1 +``` + +Now when I run sheepdog_status it reports + +``` +2025-09-10 06:01:02 -0500 (@tux04) FAIL 1 <00m00s> mariadb-test02 +2025-09-10 06:01:02 -0500 (@tux04) FAIL 1 <00m00s> mariadb-test01 +``` + +which is correct because I switched mariadb off on tux04! + +Now Mariadb on tux03 is showing errors. The problem is that it actually is in an inconsistent state (sigh). Basically I am getting endless errors like: + +``` +Retrying read of log at LSN=1537842295040 +Retrying read of log at LSN=1537842295040 +Retrying read of log at LSN=1537842295040 +``` + +There is a way to fix the replay log - probably harmless in our case. + +But what we *should* do is move this database out of the way - I may need it for Arthur - and do a proper backup recovery. I bumped off an E-mail to Arthur and started recovery. It also takes an hour to extract a borg backup of this size. I keep GN running in parallel (meanwhile) using the old DB. Bit of extra work, but less work than trying to recover from a broken DB. The good thing is we get to test backups. Btw this is exactly why it is *not* easy to migrate/update/copy/sync databases by 'just copying files'. They are too easily in an inconsistent state. There was some E-mail thread about that this year. Maybe it is a flaw of mysql/mariadb because the replay log is inconsistent when it is left open. + +``` +ibackup@tux03:/export/mariadb/restore$ borg extract /export/backup/borg/genenetwork::borg-tux04-sql-20250906-04:16-Sat --progress + 71.1% Extracting: export/backup/mariadb/latest/db_webqtl/ProbeSetData.MYI +``` + +So we rolled back the DB until further complaints. And made a new backup... This is how we keep ourselves busy. + +Turns out the new backup is problematic too! It completes, but still has redo issues. 
It ends with: + +``` +Redo log (from LSN 1537842295024 to 1537842295040) was copied. +``` + +The error was + +``` +Retrying read of log at LSN=1537842295040 +``` + +so it is the last record (or all of them!). Kranky. I used + +``` +RESET MASTER +``` + +to clear out the redo log. It says 'Log flushed up to 1537842295040'. Good. Try another backup. Still not working. The mysql log says '[Warning] Could not read packet: fd: 24 state: 1 read_length: 4 errno: 11 vio_errno: 1158 length: 0'. But this does not appear to be related. + +``` +perror 11 +OS error code 11: Resource temporarily unavailable +``` + +hmmm. Still not related. The error relates to the file: + +``` +ls -l /proc/574984/fd|grep '24 ' +lrwx------ 1 mysql mysql 64 Sep 11 07:46 124 -> /export/mariadb/export/backup/mariadb/latest/db_webqtl/IndelAll.ibd +``` + +Probably a good idea to check all tables! OK, let's test this table first. + +``` +mysqlcheck -c db_webqtl -u webqtlout -pwebqtlout IndelAll +db_webqtl.IndelAll OK +``` + +looks OK. Try all + +``` +time mysqlcheck -c -u webqtlout -pwebqtlout db_webqtl +real 33m39.642s +``` + +all tables are good. Alright, I think we can make backups and the warning may go away with a future mariadb version. My assessment is that this Warning is harmless. Let's move forward by setting up sheepdog and borg backup. First backup run should show up soon as 'SUCCESS tux03 borg-tux03-sql-backup' in + +=> http://sheepdog.genenetwork.org/sheepdog/status.html + +Now it works I add it as a CRON job to run daily. Sheepdog will tell me whether we are healthy or not. + + + +## Backups (part 3) + +As an aside. Last night, according to sheepdog, tux03 made a perfect backup run and dropped the data on a server in a different location. + +=> http://sheepdog.genenetwork.org/sheepdog/status.html + +There is more to do, however. First of all we don't backup everything. We should also backup the containers and the state of the machine. 
Finally we need to make sure the backups are backed up(!) The reason is that if a backup is corrupted it will just propagate - it has happened to us. A backup of a backup will have sane versions from before the corruption. These days, you also have to anticipate bad actors injecting stuff. You won't find that if they have penetrated the backup system. We are quite keen on having offline backups for that reason alone. + +For backup of the containers we need to run as root (unfortunately). I see now we did not have a proper backup on tux04. The last one was from 2025-03-04. Now we generate these containers, but it is still a bad idea not to back up the small databases. Anyway, first add the containers to the backup and more state. I set it up and added the CRON job. See if it pops up on sheepdog. diff --git a/topics/systems/ci-cd.gmi b/topics/systems/ci-cd.gmi index 6aa17f2..e20a37a 100644 --- a/topics/systems/ci-cd.gmi +++ b/topics/systems/ci-cd.gmi @@ -1,4 +1,6 @@ -# CI/ CD for genetwork projects +# CI/CD for genetwork projects + +Continuous integration (CI) and continuous deployment (CD) are critical parts of making sure software development does not mess up our deployment(s). We need various levels of tests to be run, from unit tests to the more complicated ones like integration, performance, regression, etc tests, and of course, they cannot all be run for each and every commit, and will thus need to be staggered across the entire deployment cycle to help with quick iteration of the code. @@ -22,16 +24,86 @@ As part of the CI/CD effort, it is necessary that there is GeneNetwork is interested in doing two things on every commit (or periodically, say, once an hour/day): -* CI: run unit tests +* CI: run unit tests on git push * CD: rebuild and redeploy a container running GN3 -Arun has figured out the CI part. It runs a suitably configured laminar CI service in a Guix container created with `guix system container'. 
A cron job periodically triggers the laminar CI job (note: this is no longer true). +Arun has figured out the CI part. It runs a suitably configured laminar CI service in a Guix container created with `guix system container'. => https://git.systemreboot.net/guix-forge/about/ -CD hasn't been figured out. Normally, Guix VMs and containers created by `guix system` can only access the store read-only. Since containers don't have write access to the store, you cannot `guix build' from within a container or deploy new containers from within a container. This is a problem for CD. How do you make Guix containers have write access to the store? +We have the quick running tests, e.g. unit tests, run on each commit to branch "main". Once those are successful, the CI/CD system we choose should automatically pick the latest commit that passed the quick running tests for further testing and deployment. +Once the next battery of tests is passed, the CI/CD system will create a build/artifact to be deployed to staging and have the next battery of tests run against it. If that passes, then that artifact could be deployed to production with details of the commit and deployment dependencies. + +## Adding a web-hook + +### Github hooks + +IIRC actions run artifacts inside github's infrastructure. We use webhooks instead. + +Update the hook at + +=> https://github.com/genenetwork/genenetwork3/settings/hooks + +A web hook basically calls an endpoint on a git push event. The webhook for genenetwork3 has recently not been called (ever? it says: 'This hook has never been triggered.'). The webhook for genenetwork2, however, has been called. 
+ +=> ./screenshot-github-webhook.png + +To trigger CI manually, run this with the project name: + +``` +curl https://ci.genenetwork.org/hooks/example-gn3 +``` + +I just tested and it appears this triggered a redeploy of gn2: + +``` +curl -XGET "https://ci.genenetwork.org/hooks/genenetwork2" +``` + +For gemtext we have a github hook that adds a forge-project and looks like + +```lisp +(define gn-gemtext-threads-project + (forge-project + (name "gn-gemtext-threads") + (repository "https://github.com/genenetwork/gn-gemtext-threads/") + (ci-jobs (list (forge-laminar-job + (name "gn-gemtext-threads") + (run (with-packages (list nss-certs openssl) + (with-imported-modules '((guix build utils)) + #~(begin + (use-modules (guix build utils)) + + (setenv "LC_ALL" "en_US.UTF-8") + (invoke #$(file-append tissue "/bin/tissue") + "pull" "issues.genenetwork.org")))))))) + (ci-jobs-trigger 'webhook))) +``` + +The normal trigger is automatic: you push code to any of the two repos (three? I'll verify), GN2 and GN3, and laminar runs the jobs, updates the code in the container and restarts services as appropriate. + +If you want to trigger the CI manually, there are webhooks available that can be called with something like: + +``` +curl -XGET "https://ci.genenetwork.org/hooks/genenetwork2" +``` + +for GN2. Change the part after /hooks/ for each of the different repos as follows: + +``` +GN2: /genenetwork2 +GN3: /genenetwork3 +gn-auth: /gn-auth (I need to verify this) +gn-uploader: Does not exist right now +``` + +Guix forge can be found at + +=> https://git.systemreboot.net/guix-forge/ + +### git.genenetwork.org hooks -Another alternative for CI/ CID were to have the quick running tests, e.g unit tests, run on each commit to branch "main". Once those are successful, the CI/CD system we choose should automatically pick the latest commit that passed the quick running tests for for further testing and deployment, maybe once an hour or so. 
Once the next battery of tests is passed, the CI/CD system will create a build/artifact to be deployed to staging and have the next battery of tests runs against it. If that passes, then that artifact could be deployed to production, and details on the commit and +TBD #### Possible Steps @@ -81,7 +153,7 @@ Below are some possible steps (and tasks) to undertake for automated deployment * Generate guix declaration for re-generating the release * Archive container image, documentation and guix declaration for possible rollback -#### Some Work Done +#### See also => /topics/systems/gn-services GN Services diff --git a/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi b/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi index 131474c..f3cbbd6 100644 --- a/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi +++ b/topics/systems/debug-and-developing-code-with-genenetwork-system-container.gmi @@ -1,12 +1,59 @@ # Debugging and developing code -Once we get to the stage of having a working system container it would be nice to develop code against it. The idea is to take an existing running system container and start modifying code *inside* the container by brining in an external path. +Once we get to the stage of having a working system container it would be nice to develop code against it. The idea is to take an existing running system container and start modifying code *inside* the container by bringing in an external path. -First build and start a guix system container as described in +In principle we'll build guix system containers as described in -=> /topics/guix/guix-system-containers-and-how-we-use-them.gmi +=> /topics/guix/guix-system-containers-and-how-we-use-them -The idea is to do less `guix pull' and system container builds, so as to speed up development. The advantage of using an existing system container is that the full deployment is the same on our other running systems! 
No more path hacks, in other words. +The idea is to minimise `guix pull' and system container builds, so as to speed up development. The advantage of using an existing system container is that the full deployment is the same on our other running systems! No more path hacks, in other words. + +## Philosophy + +For development containers we will: + +* Use sane default values - for URLs, paths etc. +* Add services incrementally (i.e., not one big blob) +* Run tests inside the container (not during build time) +* Build indexes etc. outside the container - or make it optional + +Also: + +* We should be able to run gn3 and gn-guile (aka gn4) as a guix shell without anything else +* We should be able to run gn2 with only gn3 and/or gn-guile as a guix shell with external DBs. +* We should be able to run gn2+gn3+gn-guile as a system container with external DBs. +* We should be able to run gn-auth with gn2 as a system container +* We should be able to run the uploader as a system container + +I.e. no https, no authentication by default (as long as we run on localhost). The localhost URLs and file paths can be defaults because there will only be one development container running on a single machine. + +System containers are a bit overkill for development. Still, in some cases we'll need a system container. For example when testing integration of gn-auth, uploader, gn2 etc. We have the CD deployment that gets updated when git repos change. We also have a development container written by @bonz that needs to be improved and documented. + +=> https://git.genenetwork.org/gn-machines/tree/genenetwork-local-container.scm?h=gn-local-development-container + +Note it is on a special branch for now. + +Databases and files will simply be shared on default paths - /export/guix-containers/gndev/... +And if you need different combinations it should be relatively easy to compose a new shell or system container. 
+ +# Tags + +* type: bug +* status: open +* priority: high +* assigned: pjotrp +* interested: pjotrp,bonfacem,fredm +* keywords: development, deployment, server + +# Tasks + +Create a dev environment for: + +* [ ] GN3 +* [ ] gn-guile +* [ ] GN2 +* [ ] gn-auth +* [ ] gn-uploader # GN3 in system container @@ -258,6 +305,12 @@ guix-vm-run: $cmd ``` +## Virtuoso in a system container + +See + +=> ./virtuoso + # Troubleshooting ## Updating the VM does not show latest fixes diff --git a/topics/systems/dns-changes.gmi b/topics/systems/dns-changes.gmi index 7f1d8f1..30aae58 100644 --- a/topics/systems/dns-changes.gmi +++ b/topics/systems/dns-changes.gmi @@ -27,6 +27,7 @@ We are moving thing to a new DNS hosting service. We have accounts on both. To m * Import DNS settings on DNSimple (cut-N-paste) + Edit delegation - make sure the delegation box is set => https://support.dnsimple.com/articles/delegating-dnsimple-registered + + Registration menu item comes up after transfer... * Approve transfer on GoDaddy a few minutes later (!!), see + https://dcc.godaddy.com/control/transfers * Add DNSSec diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi index a0a2f16..d034575 100644 --- a/topics/systems/hpc/octopus-maintenance.gmi +++ b/topics/systems/hpc/octopus-maintenance.gmi @@ -2,10 +2,23 @@ ## Slurm -Status of slurm +Status of slurm (as of 202512) ``` sinfo +workers* up infinite 8 idle octopus[03,05-11] +allnodes up infinite 3 alloc tux[06,08-09] +allnodes up infinite 11 idle octopus[02-03,05-11],tux[05,07] +tux up infinite 3 alloc tux[06,08-09] +tux up infinite 2 idle tux[05,07] +1tbmem up infinite 1 idle octopus02 +headnode up infinite 1 idle octopus01 +highmem up infinite 2 idle octopus[02,11] +386mem up infinite 6 idle octopus[03,06-10] +lowmem up infinite 7 idle octopus[03,05-10] +``` + +``` sinfo -R squeue ``` @@ -29,7 +42,7 @@ UnkillableStepProgram = (null) UnkillableStepTimeout = 60 sec ``` -check valid configuration with 
`slurmd -C` and update nodes with +check valid configuration with 'slurmd -C' and update nodes with ``` scontrol reconfigure @@ -45,13 +58,13 @@ Basically the root user can copy across. ## Execute binaries on mounted devices -To avoid `./scratch/script.sh: Permission denied` on `device_file`: +To avoid './scratch/script.sh: Permission denied' on 'device_file': -- `sudo bash` -- `ls /scratch -l` to check where `/scratch` is -- `vim /etc/fstab` -- replace `noexec` with `exec` for `device_file` -- `mount -o remount [device_file]` to remount the partition with its new configuration. +- 'sudo bash' +- 'ls /scratch -l' to check where '/scratch' is +- 'vim /etc/fstab' +- replace 'noexec' with 'exec' for 'device_file' +- 'mount -o remount [device_file]' to remount the partition with its new configuration. Some notes: @@ -67,7 +80,7 @@ x-systemd.device-timeout= 10.0.0.110:/export/3T /mnt/3T nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0 -## Installation of `munge` and `slurm` on a new node +## Installation of 'munge' and 'slurm' on a new node Current nodes in the pool have: @@ -78,7 +91,7 @@ sbatch --version slurm-wlm 18.08.5-2 ``` -To install `munge`, go to `octopus01` and run: +To install 'munge', go to 'octopus01' and run: ```shell guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm @@ -86,7 +99,7 @@ guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm systemctl status munge # to check if the service is running and where its service file is ``` -We need to setup the rights for `munge`: +We need to setup the rights for 'munge': ```shell sudo bash @@ -100,7 +113,7 @@ mkdir -p /var/lib/munge chown munge:munge /var/lib/munge/ mkdir -p /etc/munge -# copy `munge.key` (from a working node) to `/etc/munge/munge.key` +# copy 'munge.key' (from a working node) to '/etc/munge/munge.key' chown -R munge:munge /etc/munge mkdir -p /run/munge @@ -112,7 +125,7 @@ chown munge:munge /var/log/munge 
mkdir -p /var/run/munge # todo: not sure why it needs such a folder chown munge:munge /var/run/munge -# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service` +# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service' systemctl daemon-reload systemctl enable munge @@ -120,25 +133,25 @@ systemctl start munge systemctl status munge ``` -To test the new installation, go to `octopus01` and then: +To test the new installation, go to 'octopus01' and then: ```shell munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge ``` -If you get `STATUS: Rewound credential (16)`, it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with +If you get 'STATUS: Rewound credential (16)', it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with ```shell sudo date MMDDhhmmYYYY.ss ``` -To install `slurm`, go to `octopus01` and run: +To install 'slurm', go to 'octopus01' and run: ```shell guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm ``` -We need to setup the rights for `slurm`: +We need to setup the rights for 'slurm': ```shell sudo bash @@ -152,8 +165,8 @@ mkdir -p /var/lib/slurm chown munge:munge /var/lib/slurm/ mkdir -p /etc/slurm -# copy `slurm.conf` to `/etc/slurm/slurm.conf` -# copy `cgroup.conf` to `/etc/slurm/cgroup.conf` +# copy 'slurm.conf' to '/etc/slurm/slurm.conf' +# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf' chown -R slurm:slurm /etc/slurm @@ -163,7 +176,7 @@ chown slurm:slurm /run/slurm mkdir -p /var/log/slurm chown slurm:slurm /var/log/slurm -# copy `slurm.service` to `/etc/systemd/system/slurm.service` +# copy 'slurm.service' to '/etc/systemd/system/slurm.service' /export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information 
@@ -173,12 +186,24 @@ systemctl start slurm systemctl status slurm ``` -On `octopus01` (the master): +On 'octopus01' (the master): ```shell sudo bash -# add the new node to `/etc/slurm/slurm.conf` +# add the new node to '/etc/slurm/slurm.conf' systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master ``` + + +# Removing a node + +We are removing o3 so it can become the new head node: + +``` +scontrol update nodename=octopus03 state=drain reason="removing" +scontrol show node octopus03 | grep State +scontrol update nodename=octopus03 state=down reason="removed" + State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A +``` diff --git a/topics/systems/hpc/performance.gmi b/topics/systems/hpc/performance.gmi index ce6a111..ee604b5 100644 --- a/topics/systems/hpc/performance.gmi +++ b/topics/systems/hpc/performance.gmi @@ -12,6 +12,23 @@ For disk speeds make sure there is no load and run hdparm -Ttv /dev/sdc1 ``` +Cheap and cheerful: + +Write test: + +``` +dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct +``` + +Read test: + +``` +/sbin/sysctl -w vm.drop_caches=3 +dd if=./test of=/dev/zero bs=512k count=2048 +``` + + + ## Networking To check the network devices installed use diff --git a/topics/systems/linux/GPU-on-balg01.gmi b/topics/systems/linux/GPU-on-balg01.gmi new file mode 100644 index 0000000..d0cb3fc --- /dev/null +++ b/topics/systems/linux/GPU-on-balg01.gmi @@ -0,0 +1,201 @@ +# Installing GPU on Balg01 server + +lspci shows the card, an L4 + +=> https://www.techpowerup.com/gpu-specs/l4.c4091 + +``` +lspci|grep NVIDIA +NVIDIA Corporation AD104GL +``` + +The machine had raspi and Tesla support installed (?!), so I removed that: + +``` +apt-get remove firmware-nvidia-tesla-gsp +``` + +Disabled nouveau drivers + +```/etc/modprobe.d/blacklist-nouveau.conf +blacklist nouveau +options nouveau modeset=0 +``` + +``` +dpkg --purge raspi-firmware +update-initramfs -u +reboot (can skip for a bit) +``` + +## Create 
fallback boot partition + +Well, before rebooting I should have created another fallback boot partition with a more recent debian. +Unfortunately I had not prepared space on one of the disks (something I normally do). Turned out /dev/sdc on /export3 was not really used lately, so I could move that data and reuse that partition. + +``` +/dev/sdc1 1.8T 552G 1.2T 33% /export3 +``` + +it is a very slow drive (btw), not sure why. I ran badblocks but it does not make a difference. The logs show: + +``` +Oct 04 09:34:37 balg01 kernel: I/O error, dev sdc, sector 23392285 op 0x9:(WRITE_ZEROES) flags 0x8000000 > +O +``` + +but it looks more like a driver problem than an actual disk error. Well, maybe on the new debian install it will be fine. +At this point it is just to install a fallback boot partition, so no real worries. + +On using debootstrap, grub etc. the old partition came back fine and I tested I can also boot into the new Debian install. Especially with remote servers this is a great comfort. + +## CUDA continued + +Now that we have a fallback boot partition it is a bit easier to mess with CUDA drivers. + +To install the CUDA drivers you may need to disable 'secure boot' in the bios. + +``` +apt install build-essential gcc make cmake dkms +apt install linux-headers-$(uname -r) +``` + +In the Debian selector, choose data center and L series: Driver Version:580.95.05 CUDA Toolkit:13.0 Release Date:Wed Oct 01, 2025 File Size:844.44 MB + +Note I installed the nvidia-open drivers. If things are not working we should look at the proprietary stuff. 
I used the 'local repository installation' instructions of + +=> https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#debian-installation + + +``` +apt-get install nvidia-libopencl1 nvidia-open nvidia-driver-cuda +``` + +The first one is to prevent + +``` +libnppc11 : Conflicts: nvidia-libopencl1 +``` + +now this should run + +``` +balg01:~# nvidia-smi +Sat Oct 4 11:56:19 2025 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA L4 Off | 00000000:81:00.0 Off | 0 | +| N/A 57C P0 29W / 72W | 0MiB / 23034MiB | 2% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ +``` + +## Testing GPU + + +Using Guix python I ran: + +``` +pip install "gpu-benchmark-tool[nvidia]" +``` + +of course it downloads a ridiculous amount of binaries... But then we can run + +``` +export PATH=/home/wrk/.local/bin:$PATH +gpu-benchmark benchmark --duration=30 +``` + +that did not work. 
CUDA samples are packaged in Debian and requires building the scripts: + +``` +apt-get install nvidia-cuda-samples nvidia-cuda-toolkit-gcc +cd /usr/share/doc/nvidia-cuda-toolkit/examples/Samples/6_Performance/transpose +export CUDA_PATH=/usr +make +./transpose +> [NVIDIA L4] has 58 MP(s) x 128 (Cores/MP) = 7424 (Cores) +> Compute performance scaling factor = 1.00 +... +Test passed +``` + +Note that this removed nvidia-smi. Let's look at versions: + +``` +pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb +pool/contrib/n/nvidia-cuda-samples/nvidia-cuda-samples_11.8~dfsg-2_all.deb +pool/non-free/n/nvidia-cuda-toolkit/nvidia-cuda-toolkit-gcc_11.8.0-5~deb12u1_amd64.deb +pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb +``` + +while + +``` +Filename: ./nvidia-open_580.95.05-1_amd64.deb +Package: nvidia-driver-cuda +Version: 580.95.05-1 +Section: NVIDIA +Source: nvidia-graphics-drivers +Provides: nvidia-cuda-mps, nvidia-smi +``` + +and it turns out to be a mixture. I have to take real care not to mix in Debian packages! 
For example this package is a Debian original: + +``` +ii nvidia-cuda-gdb 11.8.86~11.8.0-5~deb12u1 amd64 NVIDIA CUDA Debugger (GDB) +``` + +``` +apt remove --purge nvidia-* cuda-* libnvidia-* +``` + +says + +``` +Note, selecting 'libnvidia-gpucomp' instead of 'libnvidia-gpucomp-580.95.05' +``` + +To view installed packages belonging to Debian itself: + +``` +dpkg -l|grep nvid|grep deb12 +dpkg -l|grep cuda|grep deb12 +``` + +Let's reinstall and make sure only NVIDIA packages are used: + +``` +wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb +dpkg -i cuda-keyring_1.1-1_all.deb +apt-get update +apt-get install cuda-toolkit cuda-compiler-12-2 +``` + +Now we have: + +``` +/usr/local/cuda-12.3/bin/nvcc --version +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2023 NVIDIA Corporation +Built on Wed_Nov_22_10:17:15_PST_2023 +``` + +# Pytorch + +CUDA environment variable for pytorch is probably useful: + +=> https://docs.pytorch.org/docs/stable/cuda_environment_variables.html diff --git a/topics/systems/linux/add-boot-partition.gmi b/topics/systems/linux/add-boot-partition.gmi new file mode 100644 index 0000000..564e044 --- /dev/null +++ b/topics/systems/linux/add-boot-partition.gmi @@ -0,0 +1,52 @@ +# Add (2nd) boot and other partitions + +As we handle machines remotely it is often useful to have a secondary boot partition that can be used from grub. 
+ +Basically, create a similar sized boot partition on a different disk and copy the running one over with: + +``` +parted -a optimal /dev/sdb +(parted) p +Model: NVMe CT4000P3SSD8 (scsi) +Disk /dev/sdb: 4001GB +Sector size (logical/physical): 512B/512B +Partition Table: gpt +Disk Flags: + +Number Start End Size File system Name Flags + 1 32.0GB 4001GB 3969GB ext4 bulk + +(parted) rm 1 +mklabel gpt +mkpart fat23 1 1GB +set 1 esp on +align-check optimal 1 +mkpart ext4 1GB 32GB +mkpart swap 32GB 48GB +set 2 boot on # other flags are raid, swap, lvm +set 3 swap on +mkpart scratch 48GB 512GB +mkpart ceph 512GB -1 +``` + +We also took the opportunity to create a new scratch partition (for moving things around) and a ceph partition (for testing). +Resulting in + +``` +Number Start End Size File system Name Flags + 1 1049kB 1000MB 999MB fat23 boot, esp + 2 1000MB 24.0GB 23.0GB ext4 boot, esp + 3 24.0GB 32.0GB 8001MB swap swap + 4 32.0GB 512GB 480GB ext4 scratch + 5 512GB 4001GB 3489GB ceph +``` + +Now that the drive is ready we can copy the existing boot partitions. Make sure you don't get the direction wrong and that the target partition is larger. +Here the original boot disk is /dev/sda (894GB). We copy that to the new disk /dev/sdb (3.64TB) + +``` +root@tux05:/home/wrk# dd if=/dev/sda1 of=/dev/sdb1 +root@tux05:/home/wrk# dd if=/dev/sda2 of=/dev/sdb2 +``` + +Next, test mount the dirs and reboot. You may want to run e2fsck and resize2fs on the new partitions (or their equivalent if you use xfs or something). diff --git a/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi new file mode 100644 index 0000000..81e721f --- /dev/null +++ b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi @@ -0,0 +1,74 @@ +# GPU Graphics Driver Set-Up + +Tux02 has the Tesla K80 (GK210GL) GPU. For machine learning, we want the official proprietary NVIDIA drivers. 
+ +## Installation + +* Debian 12 moved NVIDIA driver into the non-free-firmware repo. Add the following to "/etc/apt/sources.list" and run "sudo apt update": + +``` +deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware +``` + +* Make sure the correct kernel headers are installed: + +``` +sudo apt install linux-headers-$(uname -r) +``` + +* Install "nvidia-tesla-470-driver"⁰ (The NVIDIA line-up of programmable "Tesla" devices, used primarily for simulations and large-scale calculations, also require separate driver packages to function correctly compared to the consumer-grade GeForce GPUs that are instead targeted for desktop and gaming usage)¹: + +``` +sudo apt purge 'nvidia-*' +sudo apt install nvidia-tesla-470-driver +``` + +* Black list nouveau since it conflicts with NVIDIA's driver, and regenerate the initramfs "sudo update-initramfs -u": + +``` +echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf +echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf +``` + +* Reboot and test the nvidia drivers: + +``` +sudo reboot +nvidia-smi + +# optional if you want to use nvidia-cuda-toolkit +sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit +``` + +## Issues + +Holding on reboot until I check in with the rest of team regarding some initd raspi hook: + +``` +update-initramfs: Generating /boot/initrd.img-6.1.0-9-amd64 +raspi-firmware: missing /boot/firmware, did you forget to mount it? +run-parts: /etc/initramfs/post-update.d//z50-raspi-firmware exited with return code 1 +dpkg: error processing package initramfs-tools (--configure): + installed initramfs-tools package post-installation script subprocess returned error exit status 1 +Processing triggers for libgdk-pixbuf-2.0-0:amd64 (2.42.10+dfsg-1+deb12u1) ... 
+Errors were encountered while processing: + initramfs-tools +``` + +Removed the firmware by running: + +``` +sudo apt purge raspi-firmware + +# Configure all packages that are installed but not yet fully configured +sudo dpkg --configure -a + +# Update initramfs since we updated our drivers +sudo update-initramfs -u +``` + +## References + +=> https://us.download.nvidia.com/XFree86/Linux-x86_64/470.129.06/README/supportedchips.html ⁰ Nvidia 470.129.06 Supported Chipsets. +=> https://wiki.debian.org/NvidiaGraphicsDrivers#Tesla_Drivers ¹ Debian Tesla Drivers. +=> https://wiki.debian.org/NvidiaGraphicsDrivers/Configuration ² NVIDIA Proprietary Driver: Configuration. diff --git a/topics/systems/mariadb/mariadb.gmi b/topics/systems/mariadb/mariadb.gmi index ae0ab19..ec8b739 100644 --- a/topics/systems/mariadb/mariadb.gmi +++ b/topics/systems/mariadb/mariadb.gmi @@ -16,6 +16,8 @@ To install Mariadb (as a container) see below and Start the client and: ``` +mysql +show databases MariaDB [db_webqtl]> show binary logs; +-----------------------+-----------+ | Log_name | File_size | @@ -60,4 +62,11 @@ Stop the running mariadb-guix.service. Restore the latest backup archive and ove => https://www.borgbackup.org/ Borg => https://borgbackup.readthedocs.io/en/stable/ Borg documentation -# +# Upgrade mariadb + +It is wise to upgrade mariadb once in a while. In a disaster recovery it is better to move forward in versions too. +Before upgrading make sure there is a decent backup of the current setup. + +See also + +=> issues/systems/tux04-disk-issues.gmi diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi index 0c89fe5..3442d4e 100644 --- a/topics/systems/mariadb/precompute-mapping-input-data.gmi +++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi @@ -2,7 +2,7 @@ GN relies on precomputed mapping scores for search and other functionality. 
Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries. -At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that track the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See +At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that tracks the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See => /topics/data/precompute/steps @@ -43,16 +43,41 @@ And after: # Info -## Original qtlreaper version +## Original qtlreaper version for PublishData + +See the writeup at + +=> ./precompute-publishdata + +## Original qtlreaper version for ProbeSetData The original reaper precompute lives in => https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py -This script first fetches inbredsets +More recent incarnations are at v8, including a PublishData version that can be found in + +=> https://github.com/genenetwork/genenetwork2/tree/testing/scripts/maintenance + +Note that the locations are on space: + +``` +cd /mount/space2/lily-clone/acenteno/GN-Data +ls -l +python QTL_Reaper_v8_space_good.py 116 +-- +python UPDATE_Mean_MySQL_tab.py +cd /mount/space2/lily-clone/gnshare/gn/web/webqtl/maintainance +ls -l +python QTL_Reaper_cal_lrs.py 7 +``` + +The first task is to prepare an update script that can run a set at a time and compute GEMMA output (instead of reaper). 
+ +The script first fetches inbredsets ``` - select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet LIMIT 5; +select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet LIMIT 5; +----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+ | Id | InbredSetId | InbredSetName | Name | SpeciesId | FullName | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode | +----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+ diff --git a/topics/systems/mariadb/precompute-publishdata.gmi b/topics/systems/mariadb/precompute-publishdata.gmi new file mode 100644 index 0000000..74c278f --- /dev/null +++ b/topics/systems/mariadb/precompute-publishdata.gmi @@ -0,0 +1,3370 @@ +# Precompute PublishData + +Based on the QTL_Reaper_cal_lrs.py aka QTL_Reaper_v8_PublishXRef.py. This script simply updates PublishXRef table with a highest hit as computed by qtlreaper. + +In a first attempt to update the database we are going to do just that using GEMMA. + +For the new script we will pass in the genotype file as well as the phenotype file, so gemma-wrapper can process it. I wrote quite a few scripts already + +=> https://github.com/genetics-statistics/gemma-wrapper/tree/master/bin + +So we can convert a .geno file to BIMBAM. I need to extract GN traits to a R/qtl2 or lmdb trait format file and use that as input. 
+ +* [X] Visit use of PublishXRef +* [X] geno -> BIMBAM (BXD first) +* [X] Get PublishData trait(s) and convert to gemma, R/qtl2 or lmdb +* - [X] see scripts/lmdb-publishdata-export.scm +* - [X] see scripts for ProbeSetData +* - [X] Make sure the BXDs are mappable +* [X] Run gemma-wrapper +* [X] We should map by trait-id, data id is not intuitive: curl http://127.0.0.1:8091/dataset/bxd-publish/values/8967044.json > 10002-pheno.json +* [X] Check why Zach/GN JSON file lists different mappable BXDs +* [X] Update DB on run-server +* [X] Add batch run and some metadata so we can link back from results +* [X] Create a DB/table containing hits and old reaper values +* [X] Convert this info to RDF +* [X] Run virtuoso server +* [X] When loading traits compute mean, se, skew, kurtosis and store them as metadata in lmdb +* [ ] Why is X not showing in LMM precompute for trait 51064 +* [X] Correctly handle Infinite LOD +* [X] Ask interesting questions about the overlap between reaper and gemma +* [ ] Update PublishXRef and store old reaper value(?) 
+* [ ] Correctly Handle gn-guile escalating errors +* [X] RDF point back to original data file +* [ ] Fix Infinity also in LMM run (156 SNPs only) +* [ ] Make time stamp, host, user a compute 'origin' block in RDF +* [X] RDF mark QTL +* [ ] Make sure the trait fetcher handles authorization or runs localhost only +* [ ] gemma-wrapper --force does not work for GRM and re-check GRM does not change on phenotype +* [ ] Use SNP URIs when possible (instead of inventing our own) - and BED information so we can locate them +* [ ] Check lmdb duplicate key warning +* [ ] run gemma with pangenome-derived genotypes +* [ ] run gemma with qnorm +* [ ] run gemma with sex covariate +* [ ] run gemma again with the hit as a covariate +* [ ] Check invalid data sets/traits and feed them to Rob/Arthur +* [ ] Add metadata for bimodality indicator in addition to kurtosis (see below) +* [ ] Provide SPARQL to find QTL and return metadata about traits +* [ ] Provide PheWAS examples +* [ ] Add BED information on Genes +* [ ] Update Xapian search - also to handle gene aliases +* [ ] Create GN UI with Zach + +For the last we should probably add a few columns. Initially we'll only store the maximum hit. + +After + +* [ ] provide distributed storage of files using https + +# Visit use of PublishXRef + +In GN2 this table is used in search, auth, and router. For search it is to look for trait hits (logically). For the router it is to fetch trait info as well as dataset info. + +In GN3 this table is used for partial correlations. Also to fetch API trait info and to build the search index. + +In GN1 usage is similar. + +# geno -> BIMBAM + +We can use the script in gemma-wrapper + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py + +There is probably something similar in GN2. And I have another version somewhere. 
+ +To identify the geno file the reaper script uses + +```python +cursor.execute('select Id, Name from InbredSet') +results = cursor.fetchall() +InbredSets = {} +for item in results: + InbredSets[item[0]] = genotypeDir+str(item[1])+'.geno' +``` + +which assumes one single geno file for the BXD that is indexed by the InbredSetID (a number). Note it ignores the many genotype files we have per inbredset (today). Also there is a funny hardcoded + +```python + if InbredSetId==3: + InbredSetId=1 +``` + +(no comment). + +Later we'll output to lmdb when GEMMA supports it. + +There are about 100 InbredSets. Genotype files can be found on production in +/export/guix-containers/genenetwork/var/genenetwork/genotype-files/genotype. For the BXD alone there are + +``` +BXD.2.geno BXD-Heart-Metals_old.geno BXD-Micturition.6.geno +BXD.4.geno BXD-JAX-AD.4.geno BXD-Micturition.8.geno +BXD.5.geno BXD-JAX-AD.8.geno BXD-Micturition.geno +BXD.6.geno BXD-JAX-AD.geno BXD-Micturition_old.4.geno +BXD.7.geno BXD-JAX-AD_old.geno BXD-Micturition_old.6.geno +BXD.8.geno BXD-JAX-OFS.geno BXD-Micturition_old.geno +BXD-AE.4.geno BXD-Longevity.4.geno BXD_mm8.geno +BXD-AE.8.geno BXD-Longevity.8.geno BXD-NIA-AD.4.geno +BXD-AE.geno BXD-Longevity.9.geno BXD-NIA-AD.8.geno +BXD-AE_old.geno BXD-Longevity.array.geno BXD-NIA-AD.geno +BXD-Bone.geno BXD-Longevity.classic.geno BXD-NIA-AD_old2.geno +BXD-Bone_orig.geno BXD-Longevity.geno BXD-NIA-AD_old.geno +BXD.geno BXD-Longevity_old.4.geno BXD_Nov_23_2010_before_polish_101_102_103.geno +BXD-Harvested.geno BXD-Longevity_old.8.geno BXD_Nov_24_2010_before_polish_55_81.geno +BXD-Heart-Metals.4.geno BXD-Longevity_old.geno BXD_old.geno +BXD-Heart-Metals.8.geno BXD-MBD-UTHSC.geno BXD_unsure.geno +BXD-Heart-Metals.geno BXD-Micturition.4.geno BXD_UT-SJ.geno +``` + +Not really reflected in the DB: + +``` +MariaDB [db_webqtl]> select Id, Name from InbredSet where name like '%BXD%'; ++----+------------------+ +| Id | Name | ++----+------------------+ +| 1 | BXD | +| 58 | 
BXD-Bone | +| 64 | BXD-Longevity | +| 68 | BXD_Dev | +| 76 | DOD-BXD-GWI | +| 84 | BXD-Heart-Metals | +| 86 | BXD-AE | +| 91 | BXD-Micturition | +| 92 | BXD-JAX-AD | +| 93 | BXD-NIA-AD | +| 94 | CCBXD-TM | +| 96 | BXD-JAX-OFS | +| 97 | BXD-MBD-UTHSC | ++----+------------------+ +``` + +Bit of a mess. Looks like some files are discarded. Let's see what the reaper script does. + +We should also look into distributed storage. One option is webdav. + +# Get PublishData trait(s) and convert to R/qtl2 or lmdb + +Let's see how the scripts do it. Note that we already did that for the probeset script in + +=> precompute-mapping-input-data + +The code is reflected in + +=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute/list-traits-to-compute.scm + +Now I need to do the exact same thing, but for PublishData. + +Let's connect to a remote GN DB: + +``` +ssh -L 3306:127.0.0.1:3306 -f -N tux02.genenetwork.org +``` + +and follow + +=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v8_PublishXRef.py + +The script takes a number of values 'PublishFreezeIds'. Alternatively it picks it up by SpeciesId (hard effing coded, of course). + +=> https://github.com/genenetwork/genenetwork2/blob/fcde38b0f37f12508a01b16b7820029aa951bded/scripts/maintenance/QTL_Reaper_v8_PublishXRef.py#L62 + +Next it picks the geno file from the InbredSetID with + +``` +select InbredSetId from PublishFreeze where PublishFreeze.Id = 1; +``` + +Here we are initially going to focus on BXD=1 datasets only. + +``` +MariaDB [db_webqtl]> select Id,InbredSetId from PublishFreeze where InbredSetId = 1; ++----+-------------+ +| Id | InbredSetId | ++----+-------------+ +| 1 | 1 | ++----+-------------+ +``` + +(we are halfway through the script now). 
Next we capture some metadata + +``` +MariaDB [db_webqtl]> select PhenotypeId, Locus, DataId, Phenotype.Post_publication_description from PublishXRef, Phenotype where PublishXRef.PhenotypeId = Phenotype.Id and InbredSetId=1 limit 5; ++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ +| PhenotypeId | Locus | DataId | Post_publication_description | ++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ +| 4 | rs48756159 | 8967043 | Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg] | +| 10 | rsm10000005699 | 8967044 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg] | +| 15 | rsm10000013713 | 8967045 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] | +| 20 | rs48756159 | 8967046 | Central nervous system, morphology: Cerebellum volume [mm3] | +| 25 | rsm10000005699 | 8967047 | Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3] | ++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ +``` + +it captures LRS + +``` +MariaDB [db_webqtl]> select LRS from PublishXRef where PhenotypeId=4 and InbredSetId=1; ++--------------------+ +| LRS | ++--------------------+ +| 13.497491147108706 | ++--------------------+ +``` + +and finally the trait values that are used for mapping + +``` +select Strain.Name, PublishData.value from Strain, PublishData where Strain.Id = PublishData.StrainId and PublishData.Id = 8967043; ++-------+-----------+ +| Name | value | ++-------+-----------+ +| BXD1 | 61.400002 | +| BXD2 | 
49.000000 | +| BXD5 | 62.500000 | +| BXD6 | 53.099998 | +| BXD8 | 59.099998 | +| BXD9 | 53.900002 | +| BXD11 | 53.099998 | +| BXD12 | 45.900002 | +| BXD13 | 48.400002 | +| BXD14 | 49.400002 | +| BXD15 | 47.400002 | +| BXD16 | 56.299999 | +| BXD18 | 53.599998 | +| BXD19 | 50.099998 | +| BXD20 | 48.200001 | +| BXD21 | 50.599998 | +| BXD22 | 53.799999 | +| BXD23 | 48.599998 | +| BXD24 | 54.900002 | +| BXD25 | 49.599998 | +| BXD27 | 47.400002 | +| BXD28 | 51.500000 | +| BXD29 | 50.200001 | +| BXD30 | 53.599998 | +| BXD31 | 49.700001 | +| BXD32 | 56.000000 | +| BXD33 | 52.099998 | +| BXD34 | 53.700001 | +| BXD35 | 49.700001 | +| BXD36 | 44.500000 | +| BXD38 | 51.099998 | +| BXD39 | 54.900002 | +| BXD40 | 49.900002 | +| BXD42 | 59.400002 | ++-------+-----------+ +``` + +Note that we need to filter out the parents - the original reaper script does not do that! My gn-guile code does handle that: + +``` +SELECT StrainId,Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId =1 AND Used_for_mapping<>'Y' limit 5; ++----------+----------+ +| StrainId | Name | ++----------+----------+ +| 1 | B6D2F1 | +| 2 | C57BL/6J | +| 3 | DBA/2J | +| 150 | A/J | +| 151 | AXB1 | ++----------+----------+ +etc. 
+``` + +Also Bonz' script + +=> https://git.genenetwork.org/gn-guile/tree/scripts/lmdb-publishdata-export.scm + +has an interesting query: + +``` +MariaDB [db_webqtl]> +SELECT DISTINCT PublishFreeze.Name, PublishXRef.Id FROM PublishData + INNER JOIN Strain ON PublishData.StrainId = Strain.Id + INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId + INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId + LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId + LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId + WHERE PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1 + ORDER BY PublishFreeze.Id, PublishXRef.Id limit 5; ++------------+-------+ +| Name | Id | ++------------+-------+ +| BXDPublish | 10001 | +| BXDPublish | 10002 | +| BXDPublish | 10003 | +| BXDPublish | 10004 | +| BXDPublish | 10005 | ++------------+-------+ +5 rows in set (0.239 sec) +``` + +that shows we have 13689 BXDPublish datasets. It also has + +``` +SELECT +JSON_ARRAYAGG(JSON_ARRAY(Strain.Name, PublishData.Value)) AS data, + MD5(JSON_ARRAY(Strain.Name, PublishData.Value)) as md5hash +FROM + PublishData + INNER JOIN Strain ON PublishData.StrainId = Strain.Id + INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId + INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId +LEFT JOIN PublishSE ON + PublishSE.DataId = PublishData.Id AND + PublishSE.StrainId = PublishData.StrainId +LEFT JOIN NStrain ON + NStrain.DataId = PublishData.Id AND + NStrain.StrainId = PublishData.StrainId +WHERE + PublishFreeze.Name = "BXDPublish" AND + PublishFreeze.public > 0 AND + PublishData.value IS NOT NULL AND + PublishFreeze.confidentiality < 1 +ORDER BY + LENGTH(Strain.Name), Strain.Name LIMIT 5; +``` + +best to pipe that to a file. It outputs JSON and an MD5SUM straight from mariadb. Interesting. 
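The same idea, a checksum over the (strain, value) pairs so that a changed trait can be detected, can also be sketched on the client side. This is only an illustration in Python: `trait_digest` is a hypothetical helper, and because the JSON serialization differs from MariaDB's, the digest should not be expected to match the `md5hash` column from the query above.

```python
import hashlib
import json


def trait_digest(pairs):
    """Return an MD5 hex digest over (strain name, value) pairs.

    Hypothetical helper, not part of gn-guile or GN3.
    """
    # Sort first so the digest does not depend on the row order.
    payload = json.dumps(sorted(pairs), separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


digest = trait_digest([("BXD1", 61.400002), ("BXD2", 49.0)])
```

Sorting before hashing is a design choice: it makes the digest stable even when the SQL query returns rows in a different order.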
+ +Finally, let's have a look at the existing GN API + +``` +SELECT + Strain.Name, Strain.Name2, PublishData.value, PublishData.Id, PublishSE.error, NStrain.count + FROM + (PublishData, Strain, PublishXRef, PublishFreeze) + LEFT JOIN PublishSE ON + (PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId) + LEFT JOIN NStrain ON + (NStrain.DataId = PublishData.Id AND + NStrain.StrainId = PublishData.StrainId) + WHERE + PublishXRef.InbredSetId = 1 AND + PublishXRef.PhenotypeId = 4 AND + PublishData.Id = PublishXRef.DataId AND + PublishData.StrainId = Strain.Id AND + PublishXRef.InbredSetId = PublishFreeze.InbredSetId AND + PublishFreeze.public > 0 AND + PublishFreeze.confidentiality < 1 + ORDER BY + Strain.Name; + +-------+-------+-----------+---------+-------+-------+ +| Name | Name2 | value | Id | error | count | ++-------+-------+-----------+---------+-------+-------+ +| BXD1 | BXD1 | 61.400002 | 8967043 | 2.38 | NULL | +| BXD11 | BXD11 | 53.099998 | 8967043 | 1.1 | NULL | +| BXD12 | BXD12 | 45.900002 | 8967043 | 1.09 | NULL | +| BXD13 | BXD13 | 48.400002 | 8967043 | 1.63 | NULL | +... +``` + +which actually blocks non-public sets and shows std err, as well as counts when available(?) It does not exclude the parents for mapping (btw). That probably happens on the mapping page itself. 
+ +Probably the most elegant query is in the GN3 API: + +``` +SELECT st.Name, ifnull(pd.value, 'x'), ifnull(ps.error, 'x'), ifnull(ns.count, 'x') + FROM PublishFreeze pf JOIN PublishXRef px ON px.InbredSetId = pf.InbredSetId + JOIN PublishData pd ON pd.Id = px.DataId JOIN Strain st ON pd.StrainId = st.Id + LEFT JOIN PublishSE ps ON ps.DataId = pd.Id AND ps.StrainId = pd.StrainId + LEFT JOIN NStrain ns ON ns.DataId = pd.Id AND ns.StrainId = pd.StrainId + WHERE px.PhenotypeId = 4 limit 5; ++------+-----------------------+-----------------------+-----------------------+ +| Name | ifnull(pd.value, 'x') | ifnull(ps.error, 'x') | ifnull(ns.count, 'x') | ++------+-----------------------+-----------------------+-----------------------+ +| BXD1 | 61.400002 | 2.38 | x | +| BXD2 | 49.000000 | 1.25 | x | +| BXD5 | 62.500000 | 2.32 | x | +| BXD6 | 53.099998 | 1.22 | x | +| BXD8 | 59.099998 | 2.07 | x | ++------+-----------------------+-----------------------+-----------------------+ +``` + +written by Zach and Bonface. See + +=> https://github.com/genenetwork/genenetwork3/blame/main/gn3/db/sample_data.py + + + +We can get a list of the 13689 BXD datasets we can use. Note that we start with public data because we'll feed it to AI and all privacy will be gone afterwards. We'll design a second API that makes use of Fred's authentication/authorization later. +Let's start with the SQL statement listed on: + + +We can run mysql through an ssh tunnel with + +``` +ssh -L 3306:127.0.0.1:3306 -f -N tux02.genenetwork.org +mysql -A -h 127.0.0.1 -uwebqtlout -pwebqtlout db_webqtl +``` + +and test the query, i.e. 
+ +``` +MariaDB [db_webqtl]> SELECT DISTINCT PublishFreeze.Name, PublishXRef.Id FROM PublishData + -> INNER JOIN Strain ON PublishData.StrainId = Strain.Id + -> INNER JOIN PublishXRef ON PublishData.Id = PublishXRef.DataId + -> INNER JOIN PublishFreeze ON PublishXRef.InbredSetId = PublishFreeze.InbredSetId + -> LEFT JOIN PublishSE ON PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId + -> LEFT JOIN NStrain ON NStrain.DataId = PublishData.Id AND NStrain.StrainId = PublishData.StrainId + -> WHERE PublishFreeze.public > 0 AND PublishFreeze.confidentiality < 1 + -> ORDER BY PublishFreeze.Id, PublishXRef.Id limit 5; ++------------+-------+ +| Name | Id | ++------------+-------+ +| BXDPublish | 10001 | +| BXDPublish | 10002 | +| BXDPublish | 10003 | +| BXDPublish | 10004 | +| BXDPublish | 10005 | +``` + +Let's take this apart a little. First of all PublishFreeze has only one record for BXDPublish where ID=1. PublishData may be used to check valid fields, but the real information is in PublishXRef. A simple + +``` + select count(*) from PublishXRef WHERE InbredSetId=1; ++----------+ +| count(*) | ++----------+ +| 13711 | ++----------+ +``` + +counts a few extra datasets (it was 13689). It may mean that PublishXRef contains some records that are still not public? Anyway, +let's go for the full dataset for precompute right now. We'll add an API endpoint to gn-guile so it can be used later. + +Note GN2 on the menu search + +=> https://genenetwork.org/search?species=mouse&group=BXD&type=Phenotypes&dataset=BXDPublish&search_terms_or=*&search_terms_and=&accession_id=None&FormID=searchResult + +gives 13,729 entries, including recent BXD_51094. That is because that production database is newer. 
If we look at our highest records: + +``` +select * from PublishXRef WHERE InbredSetId=1 ORDER BY ID DESC limit 3; ++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+ +| Id | InbredSetId | PhenotypeId | PublicationId | DataId | mean | Locus | LRS | additive | Sequence | comments | ++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+ +| 51060 | 1 | 45821 | 39794 | 41022015 | NULL | rsm10000000968 | 13.263934206457122 | 2.1741201177177185 | 1 | | +| 51049 | 1 | 45810 | 39783 | 41022004 | 8.092333210508029 | rsm10000014174 | 16.8291804498215 | 18.143229769230775 | 1 | | +| 51048 | 1 | 45809 | 39782 | 41022003 | 6.082199917286634 | rsm10000009222 | 14.462661474938166 | 4.582111488461538 | 1 | | ++-------+-------------+-------------+---------------+----------+-------------------+----------------+--------------------+--------------------+----------+----------+ +``` + +You can see they match that list (51060 got updated on production). The ID matches record BXD_51060 on the production search table. 
+We can look at the DataId with + +``` +select Id,PhenotypeId,DataId from PublishXRef WHERE InbredSetId=1 ORDER BY ID DESC limit 3; ++-------+-------------+----------+ +| Id | PhenotypeId | DataId | ++-------+-------------+----------+ +| 51060 | 45821 | 41022015 | +| 51049 | 45810 | 41022004 | +| 51048 | 45809 | 41022003 | ++-------+-------------+----------+ +``` + +And get the actual values with + +``` +select * from PublishData WHERE Id=41022003; ++----------+----------+-----------+ +| Id | StrainId | value | ++----------+----------+-----------+ +| 41022003 | 2 | 9.136000 | +| 41022003 | 3 | 4.401000 | +| 41022003 | 9 | 4.360000 | +| 41022003 | 29 | 15.745000 | +| 41022003 | 98 | 4.073000 | +| 41022003 | 99 | -0.580000 | +``` + +which match the values on + +=> https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish + +The phenotypeid is useful for some metadata: + + +``` +select * from Phenotype WHERE ID=45809; +| 45809 | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight] | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight] | Central nervous system, metabolism, nutrition, toxicology: Difference score for Iron (Fe) concentration in cortex (CTX) between 20 to 120-day-old and 300 to 918-day-old males mice fed Envigo diet 7912 containing 240, 93, and 63 ppm Fe, Cu and Zn, respectively [µg/g wet weight] | [ug/mg wet weight] | Fe300-120CTXMale | Fe300-120CTXMale | NULL | acenteno | Jones B | joneslab | +``` + +Since I am going for the simpler query I'll add an API endpoint named +datasets/bxd-publish/list (so 
others can use that too). We'll return +tuples for each entry so we can extend it later. First we need the +DataID so we can point into PublishData. We expect the endpoint to +return something like + +``` ++-------+-------------+----------+ +| Id | PhenotypeId | DataId | ++-------+-------------+----------+ +| 51060 | 45821 | 41022015 | +| 51049 | 45810 | 41022004 | +| 51048 | 45809 | 41022003 | +... +``` + +Alright, let's write some code. The following patch returns on the endpoint: + +``` +[ + { + "Id": 10001, + "PhenotypeId": 4, + "DataId": 8967043 + }, + { + "Id": 10002, + "PhenotypeId": 10, + "DataId": 8967044 + }, + { + "Id": 10003, + "PhenotypeId": 15, + "DataId": 8967045 + }, +... +``` + +in about 3 seconds. It will run a lot faster on a local network. But for our purpose it is fine. The code I wrote is here: + +=> https://git.genenetwork.org/gn-guile/commit/?id=1590be15f85e30d7db879c19d2d3b4bed201556a + +Note the simple SQL query (compared to the first one). +Next step is to fetch the trait values we can feed to GEMMA. 
The full query using the PhenotypeId and DataId in GN is: + +``` +SELECT Strain.Name, Strain.Name2, PublishData.value, PublishData.Id, PublishSE.error, NStrain.count + FROM + (PublishData, Strain, PublishXRef, PublishFreeze) + LEFT JOIN PublishSE ON + (PublishSE.DataId = PublishData.Id AND PublishSE.StrainId = PublishData.StrainId) + LEFT JOIN NStrain ON + (NStrain.DataId = PublishData.Id AND + NStrain.StrainId = PublishData.StrainId) + WHERE + PublishXRef.InbredSetId = 1 AND + PublishXRef.PhenotypeId = 4 AND + PublishData.Id = PublishXRef.DataId AND + PublishData.StrainId = Strain.Id AND + PublishXRef.InbredSetId = PublishFreeze.InbredSetId AND + PublishFreeze.public > 0 AND + PublishFreeze.confidentiality < 1; ++-------+-------+-----------+---------+-------+-------+ +| Name | Name2 | value | Id | error | count | ++-------+-------+-----------+---------+-------+-------+ +| BXD1 | BXD1 | 61.400002 | 8967043 | 2.38 | NULL | +| BXD2 | BXD2 | 49.000000 | 8967043 | 1.25 | NULL | +| BXD5 | BXD5 | 62.500000 | 8967043 | 2.32 | NULL | +| BXD6 | BXD6 | 53.099998 | 8967043 | 1.22 | NULL | +... +``` + +(result includes parents). We can simplify this for GEMMA because it only wants the name and (mean) value. 
+ +The short version when you have the data ID is: + +``` +SELECT Strain.Name, PublishData.value FROM Strain, PublishData WHERE PublishData.Id=41022003 and Strain.Id=StrainID; ++----------+-----------+ +| Name | value | ++----------+-----------+ +| C57BL/6J | 9.136000 | +| DBA/2J | 4.401000 | +| BXD9 | 4.360000 | +| BXD32 | 15.745000 | +| BXD43 | 4.073000 | +| BXD44 | -0.580000 | +| BXD48 | -1.810000 | +| BXD51 | 4.294000 | +| BXD60 | -0.208000 | +| BXD62 | -0.013000 | +| BXD63 | 3.221000 | +| BXD66 | 2.472000 | +| BXD69 | 12.886000 | +| BXD70 | -1.973000 | +| BXD78 | 19.511999 | +| BXD79 | 7.845000 | +| BXD73a | 3.201000 | +| BXD87 | -3.054000 | +| BXD48a | 11.585000 | +| BXD100 | 7.088000 | +| BXD102 | 8.485000 | +| BXD124 | 13.442000 | +| BXD170 | -1.274000 | +| BXD172 | 18.587000 | +| BXD186 | 10.634000 | ++----------+-----------+ +``` + +which matches GN perfectly (some individuals were added). Alright, let's add an endpoint for this named +'dataset/bxd-publish/values/dataid/41022003'. Note we only deal with public data (so far). Later we may come up with more generic +endpoints and authorization. At this point the API is either on the local network (this one is) or public. 
+ +The first version returns this data from the endpoint: + +``` +time curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003 +[{"Name":"C57BL/6J","value":9.136},{"Name":"DBA/2J","value":4.401},{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745},{"Name":"BXD43","value":4.073},{"Name":"BXD44","value":-0.58},{"Name":"BXD48","value":-1.81},{"Name":"BXD51","value":4.294},{"Name":"BXD60","value":-0.208},{"Name":"BXD62","value":-0.013},{"Name":"BXD63","value":3.221},{"Name":"BXD66","value":2.472},{"Name":"BXD69","value":12.886},{"Name":"BXD70","value":-1.973},{"Name":"BXD78","value":19.511999},{"Name":"BXD79","value":7.845},{"Name":"BXD73a","value":3.201},{"Name":"BXD87","value":-3.054},{"Name":"BXD48a","value":11.585},{"Name":"BXD100","value":7.088},{"Name":"BXD102","value":8.485},{"Name":"BXD124","value":13.442},{"Name":"BXD170","value":-1.274},{"Name":"BXD172","value":18.587},{"Name":"BXD186","value":10.634}] +real 0m0.537s +user 0m0.002s +sys 0m0.005s +``` + +Note it includes the parents. We should drop them. In this case we can simply check for (string-contains name "BXD"). 
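As an illustration only, the name check can be sketched in Python (the endpoint itself is written in Guile; `drop_founders` is a hypothetical name, and the records are taken from the response shown above):

```python
def drop_founders(records, tag="BXD"):
    """Keep only records whose strain name contains `tag`.

    Mirrors the (string-contains name "BXD") check. Hypothetical helper.
    """
    return [r for r in records if tag in r["Name"]]


records = [
    {"Name": "C57BL/6J", "value": 9.136},
    {"Name": "DBA/2J", "value": 4.401},
    {"Name": "BXD9", "value": 4.36},
    {"Name": "BXD32", "value": 15.745},
]
mappable = drop_founders(records)
```

A substring check like this is only good enough for the BXD; it would not generalize to other inbred sets.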
The database records allow for a filter, so we get + +``` +curl http://127.0.0.1:8091/dataset/bxd-publish/mapping/values/41022003 +[{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745},{"Name":"BXD43","value":4.073},{"Name":"BXD44","value":-0.58},{"Name":"BXD48","value":-1.81},{"Name":"BXD51","value":4.294},{"Name":"BXD60","value":-0.208},{"Name":"BXD62","value":-0.013},{"Name":"BXD63","value":3.221},{"Name":"BXD66","value":2.472},{"Name":"BXD69","value":12.886},{"Name":"BXD70","value":-1.973},{"Name":"BXD78","value":19.511999},{"Name":"BXD79","value":7.845},{"Name":"BXD73a","value":3.201},{"Name":"BXD87","value":-3.054},{"Name":"BXD48a","value":11.585},{"Name":"BXD100","value":7.088},{"Name":"BXD102","value":8.485},{"Name":"BXD124","value":13.442},{"Name":"BXD170","value":-1.274},{"Name":"BXD172","value":18.587},{"Name":"BXD186","value":10.634}] +``` + +That code went in as + +=> https://git.genenetwork.org/gn-guile/commit/?id=9ad0793eb477611c700f4a5b02f60ac793bfae96 + +It took a bit longer than I wanted because I made a mistake converting the results to a hash table. It broke the JSON conversion and the error was not so helpful. + +To write a CSV it turns out I have written + +=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm?id=9ad0793eb477611c700f4a5b02f60ac793bfae96#n18 + +which takes the GN BXD.json file and our trait file. 
BXD.json captures the genotype information GN has: + +``` +{ + "mat": "C57BL/6J", + "pat": "DBA/2J", + "f1s": ["B6D2F1", "D2B6F1"], + "genofile" : [{ + "title" : "WGS-based (Mar2022)", + "location" : "BXD.8.geno", + "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44", "BXD45", "BXD48", "BXD48a", "BXD49", "BXD50", "BXD51", "BXD52", "BXD53", "BXD54", "BXD55", "BXD56", "BXD59", "BXD60", "BXD61", +(...) +"BXD065xBXD077F1", "BXD069xBXD090F1", "BXD071xBXD061F1", "BXD073bxBXD065F1", "BXD073bxBXD077F1", "BXD073xBXD034F1", "BXD073xBXD065F1", "BXD073xBXD077F1", "BXD074xBXD055F1", "BXD077xBXD062F1", "BXD083xBXD045F1", "BXD087xBXD100F1", "BXD065bxBXD055F1", "BXD102xBXD077F1", "BXD102xBXD73bF1", "BXD170xBXD172F1", "BXD172xBXD197F1", "BXD197xBXD009F1", "BXD197xBXD170F1"] +``` + +The code maps the trait values I generated against these columns to see which individuals overlap, which also corrects for unmappable individuals (anyway). + +The function 'write-pheno-file', listed above, does not work however because of the format of the endpoint. Remember it generates + +``` +[{"Name":"BXD9","value":4.36},{"Name":"BXD32","value":15.745}...] +``` + +While this function expects the shorter + +``` +{"BXD9":4.36,"BXD32":15.745...} +``` + +Now, for endpoints there is no real standard. We have written ideas up here: + +=> https://git.genenetwork.org/gn-docs/tree/api + +and, most recently + +=> https://git.genenetwork.org/gn-docs/tree/api/GN-REST-API-v2.md + +Where I make a case for having the metadata as a separate endpoint that can be reasoned on by people and machines (and AI). +That means I should default to the short version of the data and describe that layout using metadata. 
This we can do later. + +I modified the endpoint to return the shorter hash: + +``` +time curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003 +{"BXD9":4.36,"BXD32":15.745...} +``` + +Next, to align with + +=> https://github.com/genenetwork/gn-docs/blob/master/api/GN-REST-API-v2.md + +I gave the API the json extension, so we have http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.json + +This allows writing a special handler for GEMMA output (.gemma extension) downloading the pheno file with + +``` +curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.gemma +NA +NA +NA +NA +NA +4.36NA +NA +NA +NA +(...) +``` + +which GEMMA can use directly: it matches the order of the individuals in the BXD.8.geno file, and the founders/parents are not included. Note that all of this now only works for the BXD (on PublishData) and I am using BXD.json as described in + +=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-mapping-input-data + +I.e., it is Zach's listed stopgap solution. Code is here: + +=> https://git.genenetwork.org/gn-guile/log/ + +The next step is to run gemma, as we are on par with my earlier work on ProbeSetData. I wrote a gemma runner for that too at + +=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm#n79 + +Now here I use guile to essentially script running GEMMA. There is no real advantage to that, so I will simply tell gemma-wrapper to use the output of the above .gemma endpoint to fetch the trait values. Basically gemma-wrapper can specify the standard gemma -p switch, or pass in --phenotypes, that are used for permutations. + +Now the new method we want to introduce is that the trait values are read from a REST API, instead of a file. The dirty way is to provide that functionality directly to gemma-wrapper, but we plan to get rid of that code (useful as it is -- it duplicates what Arun's ravanan does and ravanan has the advantage that it can be run on a cluster). 
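The two transformations discussed here, reshaping the list of records into the short hash and then emitting one value per line in genotype order, can be sketched as follows. This is a Python illustration under assumptions, not the gn-guile code: `records_to_hash` and `gemma_pheno_lines` are hypothetical names, and the sample list is just the first six entries of the BXD.json `sample_list`.

```python
def records_to_hash(records):
    """Turn [{"Name": n, "value": v}, ...] into the short {n: v} form."""
    return {r["Name"]: r["value"] for r in records}


def gemma_pheno_lines(sample_list, values, missing="NA"):
    """Return one phenotype per genotyped individual, in genotype order.

    Individuals without a trait value become `missing`; names that are
    not in sample_list (e.g. the founders) are dropped automatically.
    """
    lines = []
    for name in sample_list:
        value = values.get(name)
        lines.append(missing if value is None else str(value))
    return lines


records = [{"Name": "BXD9", "value": 4.36}, {"Name": "BXD32", "value": 15.745}]
values = records_to_hash(records)
samples = ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9"]
lines = gemma_pheno_lines(samples, values)
```

Driving the output from the genotype sample list, rather than from the trait records, is what keeps the phenotype rows aligned with the genotype columns.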
+ +So we simply download the data and write it to a file with a small script. To run: + +``` +curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.gemma > 41022003-pheno.txt +``` + +Next we create a container for gemma-wrapper (and includes the gemma that GN uses): + +``` +. .guix-deploy +env TMPDIR=tmp ruby ./bin/gemma-wrapper --force --json \ + --loco -- \ + -g BXD.8_geno.txt.gz \ + -p 41022003-pheno.txt \ + -a BXD.8_snps.txt \ + -gk > K.json +``` + +this bailed out with + +Executing: parallel --results /tmp/test --joblog /tmp/test/5f3849a9e61b70e3d562b20c5eade5a699923c68-parallel.log < /tmp/test/parallel-commands.txt +Command exited with non-zero status 20 + +When running an individual chromosome (from the parallel log) we get two warnings and an error: + +``` +**** WARNING: The maximum genotype value is not 2.0 - this is not the BIMBAM standard and will skew l_lme and effect sizes +**** WARNING: Columns in geno file do not match # individuals in phenotypes +ERROR: Enforce failed for not enough genotype fields for marker in src/gemma_io.cpp at line 1470 in BimbamKin +``` + +Looks familiar! +The first warning we'll ignore for now, as we just want the hits initially. The second warning relates to the error that there is a mismatch in number of inds. + +This topic I have covered in the past, particularly trying to debug Dave's conflicting results: + +=> https://issues.genenetwork.org/topics/lmms/gemma/permutations + +It makes somewhat depressive reading, though we have a solution. + +Note the correct conversion we only have to do once (basically the code I wrote earlier +to fetch BXD traits needs to work with the latest BXD genotypes). +The real problem is that gemma itself does not compare individual names (at all), so any corrections need to be done beforehand. In this case our pheno file contains 212 inds from the earlier BXD.json file. + +``` +wc -l 41022003-pheno.txt +212 41022003-pheno.txt +``` + +And that is off. 
Let's try the tool I wrote during that exercise. It can create a different json file after parsing BXD.geno +that has in the header: + +> # Date Modified: April 23, 2024 by Arthur Centeno, Suheeta Roy. March 22, 2022 by Rob Williams, David Ashbrook, and Danny Arends to remove excessive cross-over events in strains BXD42 (Chr9), BXD81 (Chrs1, 5, 10), BXD99 (Chr1), and BXD100 (Chrs2 and 6); and to add Taar1 maker on Chr 10 for T. Phillips-Richards. Jan 19, 2017: Danny Arends computed BXD cM values and recombinations between markers. Rob W. Williams fixed errors on most chromosomes and added Affy eQTL markers. BXD223 now has been added based on David Ashbrook's spreadsheet genotype information. + +``` +md5sum BXD.geno: + a78aa312b51ac15dd8ece911409c5b98 BXD.geno +gemma-wrapper$ ./bin/gn-geno-to-gemma.py BXD.geno > BXD.geno.txt +``` + +creates a .json file (that is different from Zach/GN's) and a bimbam file GEMMA can use. Now in the next step I need to adapt above code to use this format. What I *should* have done, instead of writing gemma phenotypes directly, is write the R/qtl2 format that includes the ind names (so we can compare and validate those) and *then* parse that data against our new JSON file created by gn-geno-to-gemma.py using the rqtl2-pheno-to-gemma.py script. Both Python scripts are already part of gemma-wrapper: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-geno-to-gemma.py +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rqtl2-pheno-to-gemma.py + +The idea was to create the rqtl2 API endpoint, or I'll adapt the 2nd script to take the endpoint as input and then correct for GEMMA's requirements. + +OK, updated the endpoints and the code for rqtl2-pheno-to-gemma.py so it accepts a URL instead of a file. 
So the idea is +to run + +``` +./bin/rqtl2-pheno-to-gemma.py BXD_pheno_Dave.csv --json BXD.geno.json > BXD_pheno_matched.txt +``` + +A line in BXD_pheno_Dave.csv is: + +``` +BXD113,24.52,205.429001,3.643,2203.312012,3685.907959,1.199,2.019,29.347143,0.642857,205.428574,24.520409,3.642857,2203.312012,3685.908203,1.198643,2.018643,0.642857,33.785709,1.625,2,1.625,1,22.75 +``` + +Now if I read the Rqtl2 docs it says: + +> We split the numeric phenotypes from the mixed-mode covariates, as two separate CSV files. Each file forms a matrix of individuals × phenotypes (or covariates), with the first column being individual IDs and the first row being phenotype or covariate names. Sex and line IDs (if needed) can be columns in the covariate data. + +This differs from the BXD Dave layout (it is transposed). Karl added in the docs: + +> All of these CSV files may be transposed relative to the form described below. You just need to include, in the control file, a line like: "geno_transposed: true". + +So, OK, we can use the transposed form. First we make it possible to parse json: + +``` +curl http://127.0.0.1:8091/dataset/bxd-publish/values/41022003.json > 41022003-pheno.json +jq < 41022003-pheno.json +{ + "C57BL/6J": 9.136, + "DBA/2J": 4.401, + "BXD9": 4.36, + "BXD32": 15.745, +(...) +``` + +Note that it includes the parents. Feed it to + +``` +./bin/rqtl2-pheno-to-gemma.py 41022003-pheno.json --json BXD.geno.json +``` + +where BXD.geno.json is not the Zach/GN json file, but lists the actual BXDs in GEMMA's bimbam file. + +One question is why Zach's JSON file gives a different number of mappable BXDs. I made a note of that to check.
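
The R/qtl2 layout quoted above (individuals in rows, the first column holding the individual IDs, the first row holding phenotype names) is easy to produce from the short JSON hash. A minimal sketch, assuming the hash from the .json endpoint; the function name `rqtl2_pheno_csv` and the column header `id` are illustrative choices, not something gemma-wrapper or gn-guile defines:

```python
import csv
import io

def rqtl2_pheno_csv(values, trait_name):
    """Write an individuals x phenotypes CSV: header row, then one row per individual."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", trait_name])   # first row: phenotype names
    for ind, value in values.items():
        writer.writerow([ind, value])     # first column: individual ID
    return out.getvalue()

# Values as shown by jq above, parents included.
values = {"C57BL/6J": 9.136, "DBA/2J": 4.401, "BXD9": 4.36, "BXD32": 15.745}
print(rqtl2_pheno_csv(values, "41022003"), end="")
```

Keeping the names next to the values is the point: a later step can validate them against the samples in BXD.geno.json before writing the nameless column GEMMA wants.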
+ +I wrote a new script and we had our first GEMMA run with lmdb output: + +``` +wrk@napoli /export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper [env]$ tar tvf /tmp/3fddda2374509c7b346> +-rw-r--r-- wrk/users 294912 2025-08-06 05:49 3fddda2374509c7b346b7819ae358ed23be9cb46-gemma-GWA.mdb +``` + +The script is just 10 lines of code (after the command line handler) + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gn-pheno-to-gemma.rb + +Excellent, now we can run gemma and the next step is to look at the largest hit. + +So the trait we try to run is 41022003 = https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish. The inputs match up. When we run GEMMA in GN it has a 4.0 score on chr 12 and 3.9 on chr 19. + +Running gemma-wrapper we get + +``` +LOCO K computation with caching and JSON output + +gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json + +LMM's using the K's captured in K.json using the --input switch + +gemma-wrapper --json --force --lmdb --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json +``` + +We can view the lmdb file with something like: + +``` +./bin/view-gemma-mdb --sort /tmp/66b8c19be87e9566358ce904682a56250eb05748-gemma-GWA.tar.xz --anno BXD.8_snps.txt > test.out +/tmp/3fddda2374509c7b346b7819ae358ed23be9cb46-gemma-GWA.tar.xz +chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +7,67950073,rsm10000004928,0.543,1.5226,1.3331,100000.0,0.0002,3.79 +7,68061665,rs32453663,0.543,1.5226,1.3331,100000.0,0.0002,3.79 +7,68111284,rs32227186,0.543,1.5226,1.3331,100000.0,0.0002,3.79 +19,30665443,rsm10000014129,0.522,2.2128,1.0486,100000.0,0.0002,3.77 +19,30671753,rs31207057,0.522,2.2128,1.0486,100000.0,0.0002,3.77 +12,40785621,rsm10000009222,0.565,2.8541,1.3576,100000.0,0.0002,3.75 +12,40786657,rs29124638,0.565,2.8541,1.3576,100000.0,0.0002,3.75 
+12,40842857,rs13481410,0.565,2.8541,1.3576,100000.0,0.0002,3.75 +12,40887762,rsm10000009223,0.565,2.8541,1.3576,100000.0,0.0002,3.75 +12,40887894,rsm10000009224,0.565,2.8541,1.3576,100000.0,0.0002,3.75 +12,40900825,rs50979658,0.565,2.8541,1.3576,100000.0,0.0002,3.75 +12,41054766,rs46705481,0.565,2.8541,1.3576,100000.0,0.0002,3.75 +``` + +Interestingly the hits are very similar to what is on production now, though not the same! That points out that I am not using the production database on this recent dataset. Let's try an older one. BXD_10002 has data id 8967044 + +``` +curl http://127.0.0.1:8091/dataset/bxd-publish/values/8967044.json > 10002-pheno.json +./bin/gn-pheno-to-gemma.rb -p 10002-pheno.json --geno-json BXD.geno.json > 10002-pheno.txt +gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p 10002-pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json +gemma-wrapper --json --force --lmdb --loco --input K.json -- -g BXD.geno.txt -p 10002-pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json +./bin/view-gemma-mdb --sort /tmp/c4ffedf358698814c6e29a54a2a51cb6c66328d0-gemma-GWA.tar.xz --anno BXD.8_snps.txt > test.out +``` + +Luckily this is a perfect match: + +``` +1,179861787,rsm10000000444,0.559,0.8837,0.3555,100000.0,0.0,4.99 +1,179862838,rs30712622,0.559,0.8837,0.3555,100000.0,0.0,4.99 +1,179915631,rsm10000000787,0.559,0.8837,0.3555,100000.0,0.0,4.99 +1,179919811,rsm10000000788,0.559,0.8837,0.3555,100000.0,0.0,4.99 +(...) +8,94479237,rs32095272,0.441,1.0456,0.4362,100000.0,0.0,4.75 +8,94765445,rsm10000005684,0.441,1.0456,0.4362,100000.0,0.0,4.75 +8,94785223,rsm10000005685,0.441,1.0456,0.4362,100000.0,0.0,4.75 +8,94840921,rsm10000005686,0.441,1.0456,0.4362,100000.0,0.0,4.75 +``` + +The lmdb file contains the full vector and compresses to 100K. For 13K traits that equals about 1Gb. + +First I wanted to check how Zach's list of mappable inds compares to mine. 
A simple REPL exercise shows: + +``` +zach = JSON.parse(File.read('BXD.json')) +pj = JSON.parse(File.read('BXD.geno.json')) +s1 = zach["genofile"][0]["sample_list"] +=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",... +s2 = pj["samples"] +=> ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18",... +s1.size() +=> 235 +s2.size() +=> 237 + s2-s1 +=> ["BXD077xBXD065F1", "BXD065xBXD102F1"] +``` + +So it turns out the newer geno file contains these two new inds that are *also* in the .geno file and confuses the hell out of my scripts ;). The GN2 webserver probably uses the header of the geno file to fetch the correct number. The trait page also lists these inds, so (I guess) the BXD.json file ought to be updated. + +Now that is explained and we are good. + +## Running at scale + +In the next step we need to batch run GEMMA. Initially we'll run on one server. gemma-wrapper takes care of running only once, so we can restart the pipeline at any point (we'll move to ravanan after to run on the cluster). At this point the API uses the dataid to return the trait values. I think that is not so intuitive, so I modified the endpoint to give the same results for: + +``` +curl http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json > 10002-pheno.json +curl http://127.0.0.1:8091/dataset/bxd-publish/dataid/values/8967044.json > 10002-pheno.json +``` + +Now that works we can get a list of all BXDPublish datasets that I wrote earlier: + +``` +curl http://127.0.0.1:8091/dataset/bxd-publish/list > bxd-publish.json +[ + { + "Id": 10001, + "PhenotypeId": 4, + "DataId": 8967043 + }, + { + "Id": 10002, + "PhenotypeId": 10, + "DataId": 8967044 + }, + { + "Id": 10003, + "PhenotypeId": 15, + "DataId": 8967045 + }, +``` + +so we can use this to create our batch list. There are 13711 datasets listed on this DB. 
We can use jq to extract all Ids + +``` +jq ".[] | .Id" < bxd-publish.json > ids.txt +``` + +All set to run our first batch! Now we replicate our guix-wrapper environment, start the gn-guile server and fire up a batch script that pulls the data from the database and runs gemma for every step. + + +To get precompute going we need a server set up with a recent database. I don't want to use the production server. The fastest other server we have is balg01, and it is not busy right now, so let's use that. First we recover a DB from our backup, as described in + +=> topics/systems/mariadb/precompute-mapping-input-data + +(btw that examples show we started on precompute since November 2023 - 1.5 years ago). On that server mariadb is running as +/usr/local/guix-profiles/gn-latest/bin/mariadbd --datadir=/export/mariadb/tux01. We can simply overwrite that database as it +is an installation of Feb 18 2024. We extract: + +``` +borg extract --progress /export/backup/bacchus/drop/tux04/genenetwork::borg-tux04-sql-20250807-04:16-Thu +``` + +After extracting the backup we need to update permissions and point mariadb to the new dir: balg01:/export/mariadb/tux04/latest/. +Restarting the DB and it all appears to work. + +Before I move the code across we need to make sure metadata on the traits get added to the lmdb mapping data. I actually wrote the code for that here. 
This adds the metadata to lmdb: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/a0eb8ed829072cb539b32affe135a7930989ca30/bin/gemma2lmdb.py#L99 + +gemma-wrapper writes data like this: + +``` + "meta": { + "type": "gemma-wrapper", + "version": "0.99.7-pre1", + "population": "BXD", + "name": "HC_U_0304_R", + "trait": "101500_at", + "url": "https://genenetwork.org/show_trait?trait_id=101500_at&dataset=HC_U_0304_R", + "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz", + "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-gemma-GWA.tar.xz", + "dataid": 75629, + "probesetid": 1097, + "probesetfreezeid": 7 + } +``` + +This was done for probesetdata and needs to be adapted for our BXD PublishData exercise. Also I want the archive_GWA file name to include the trait name/ID so we can find it quickly on the storage (without having to parse/query all lmdb files). + +From the gemma-wrapper invocation you can see I added a few switches to pass in this information: + +=> https://git.genenetwork.org/gn-guile/tree/gn/runner/gemma.scm#n97 + +``` + --meta NAME Pass in metadata as JSON file + --population NAME Add population identifier to metadata + --name NAME Add dataset identifier to metadata + --id ID Add identifier to metadata + --trait TRAIT Add trait identifier to metadata +``` + +We can add BXD as population and BXDPublish as a dataset identifier. Set id with dataid, and trait id with PublishXRefID and point it back to GN, so we can click + +=> https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish + +Another thing I want to add are the existing qtlreaper hit values. That way we can assess where the biggest impact was of using gemma over qtlreaper. To achieve this we will create a new API endpoint that can serve that data. 
Remember we get the trait values with: + +=> http://127.0.0.1:8091/dataset/bxd-publish/values/10002.json + +so we can add an endpoint that lists the mapping results + +=> http://127.0.0.1:8091/dataset/bxd-publish/trait-hits/10002.json + +we also will have + +=> http://127.0.0.1:8091/dataset/bxd-publish/trait-info/10002.json + +That will return more metadata and point into our RDF store. Note that this is now all very specific to bxd-publish. Later we'll have to think how to generalise these endpoints. We are just moving forward to do the BXD precompute run. + +Interestingly GN2 shows this information (well, only the highest hit) on the search page, but not on the trait page. As we can get hits from multiple sources we should (eventually) account for that with something like: + +``` +=> http://127.0.0.1:8091/dataset/bxd-publish/trait-hits/10002.json +{ "qtlreaper-hk": + { + [ + { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., } + ] + } + "gemma-loco": + { + [ + { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., } + { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., } + { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., } + ] + } +} +``` + +Eventually we may list gemma, Rqtl2 hits with and without LOCO and with and without covariates. Once we build this support we can adapt our search tools. + +Obviously this won't fit the current PublishXRef format, so -- for now -- we will just mirror its contents: + +``` +{ "qtlreaper-hk": + { + [ + { "name":..., "chr": ..., "pos":..., "LRS":..., "additive":..., } + ] + } +} +``` + +To get compute going I am going to skip above because we can update the lmdb files later. 
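
Note that the layouts sketched above put a list directly inside braces, which a strict JSON parser will reject. One possible valid shape, shown here only as a sketch, maps each method name straight to its list of hits. The function name `trait_hits` and all values in the example hit are placeholders, not data from GN:

```python
import json

def trait_hits(reaper_hits, gemma_hits=None):
    """Build a trait-hits payload: method name -> list of hit records."""
    payload = {"qtlreaper-hk": list(reaper_hits)}
    if gemma_hits:
        payload["gemma-loco"] = list(gemma_hits)
    return payload

# Placeholder hit using the field names from the text above.
hit = {"name": "marker1", "chr": "1", "pos": 1.0, "LRS": 10.0, "additive": 0.5}

# For now only the PublishXRef contents would be mirrored under qtlreaper-hk.
print(json.dumps(trait_hits([hit]), indent=2))
```

With this shape a client can ask for `payload["gemma-loco"]` once it exists and fall back to `payload["qtlreaper-hk"]` otherwise.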
+The first fix is to add the trait name to the file names and the following record to lmdb: + + "meta": { + "type": "gemma-wrapper", + "version": "0.99.7-pre1", + "population": "BXD", + "name": "BXDPublish", + "table": "PublishData", + "traitid": 10002, // aka PublishXrefId + "url": "https://genenetwork.org/show_trait?trait_id=51048&dataset=BXDPublish, + "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz", + "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-BXDPublish-10002-gemma-GWA.tar.xz", + "dataid": 8967044, + } + +This required modifications to gemma-wrapper. + +Running: + +``` +gemma-wrapper --json --force --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk -debug > K.json +gemma-wrapper --json --force --lmdb --population BXD --name BXDPublish --trait 10002 --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json +``` + +begets '66b8c19be87e9566358ce904682a56250eb05748-BXDPublish-10002-gemma-GWA.tar.xz'. When I check the meta data in the lmdb file it is set to + +``` +"meta": {"type": "gemma-wrapper", "version": "1.00-pre1", "population": "BXD", "name": "BXDPublish", "trait": "10002", "geno_filename": "BXD.geno.txt", "geno_hash": "3b65ed252fa47270a3ea867409b0bdc5700ad6f6", "loco": true, "url": "https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish", "archive_GRM": "185eb08dc3897c7db5d7ea987170898035768f93-gemma-cXX.tar.xz", "archive_GWA":"66b8c19be87e9566358ce904682a56250eb05748-BXDPublish-10002-gemma-GWA.tar.xz", "table": "PublishData", "traitid": 10002, "dataid": 0} +``` + +which is good enough (for now). I may still add the dataid, but it requires a SQL call. Code is here: + +=> https://github.com/genetics-statistics/gemma-wrapper/commit/49587523fc93bdcf0265da9da97f8d6d2a9e1008 + +I should note that up to this point I would have had no advantage from AI programming. 
I know there are topics that I'll work on where I may benefit, but for this type of architecture work, with very little code writing, it does not really help. I certainly have the intention of using AI! For the next steps, unfortunately, there is still little to be gained. Where we'll probably gain is: + +- Using the RDF data store and documenting the endpoint(s) +- Refactoring some of GN2's code to introduce lmdb +- Deduplicating GN2/GN3 SQL code +- Improving the REST API and writing documentation and tests +- Analysing existing code bases, such as GEMMA itself + +Next step is getting the data churn going! After that we'll list all the hits, which requires processing the lmdb output. + +Precompute of 13K traits has its first test run on balg01. + +It is going at 30 gemma runs per minute, so perhaps 8 hours for the full run if it keeps going. But I am hitting errors. + +After that the next step will be to digest hits from the precomputed vectors in lmdb. + +## Yesterday's tux02 crash + +All servers work on tux02 except for BNW. + +I tried to restart BNW, but it is giving an error, including the mystifying shepherd error (that I have as a sticker on my laptop): + +> 2025-08-11 01:13:41 error in finalization thread: Success + +It is on our end, so no need to ping Yan. I'll fix it when I have time (I did below). + +## Precompute + +To get precompute up and running I need to create the environment on balg01. The DB I updated a few days ago, so that should be fine. + +First we check out the guile webserver: + +``` +git clone tux02.genenetwork.org:/home/git/public/gn-guile gn-guile-8092 +``` + +Now gn-guile is already running and serving aliases, so we want to run this as an internal endpoint right now with something like + +``` +unset GUIX_PROFILE +. /usr/local/guix-profiles/guix-pull/etc/profile +guix shell -L ~/guix-bioinformatics --container --network --file=guix.scm -- guile -L . 
--fresh-auto-compile -e main web/webserver.scm 8092 +``` + +so, this renders + +``` +curl http://127.0.0.1:8092/dataset/bxd-publish/values/10002.json +{"BXD1":54.099998,"BXD2":50.099998,"BXD5":53.299999,"BXD6":55.099998 +``` + +Next step is to set up gemma-wrapper. Now this failed because guix was not happy. We have been updating things these last weeks. Rather than trying to align with recent changes I could have rolled back to the version I am using on my desktop. But I decided not to let those bits rot and updated guix from + +guix describe Thu Mar 14 21:33:55 2024 + +to + +guix describe Sun Aug 10 18:18:20 2025 + +Should use a newer version first! Let's try + +``` +guix pull --url=https://codeberg.org/guix/guix -p ~/opt/guix-pull +``` + +(that took a while, so I took the opportunity to fix BNW -- turns out someone disabled BNW in shepherd by creating a systemd version that did not start properly). + +After the pull there were quite a few problems with gemma dependencies that needed fixing. First problem + +``` +guix package: warning: failed to load '(gn packages gemma)': +In procedure abi-check: #<record-type <git-reference>>: record ABI mismatch; recompilation needed +``` + +required + +``` +find ~/.cache/guile -name "*.go" -delete +``` + +I also had to point guix-past to the new codeberg record! And now, magically, things started working. + +So, now I have an identical setup on my desktop and on the balg server. Next is to write a script that will batch run gemma-wrapper for every BXD PublishData ID. We created that list with jq earlier. + +``` +curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json +jq ".[] | .Id" < bxd-publish.json > ids.txt +``` + +For every ID in that list we are going to fetch the trait values with + +``` +#! 
/bin/env sh +export TMPDIR=./tmp +curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json +jq ".[] | .Id" < bxd-publish.json > ids.txt +./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json + +for id in $(cat ids.txt) ; do + echo Precomputing $id + curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json + ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt + ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json +done +``` + +I hard copied the following files + +``` +BXD.geno.json +BXD.geno.txt +BXD.8_snps.txt +``` + +One thing I need to check is that the GRM is actually a constant. I forgot what GEMMA does. + +We hit an error + +``` +/gnu/store/vvl1g1l0j19w39kry2xcsawvlhbyb87j-ruby-3.4.4/lib/ruby/3.4.0/json/common.rb:221:in 'JSON::Ext::Parser.parse': +unexpected token at '' (JSON::ParserError) +FATAL ERROR: gemma-wrapper bailed out with pid 340588 exit 20 +./bin/gemma-wrapper:494:in 'block (2 levels) in <main>' +./bin/gemma-wrapper:479:in 'IO.open' +./bin/gemma-wrapper:479:in 'block in <main>' +./bin/gemma-wrapper:832:in '<main>'Precomputing 10137 +``` + +The JSON file for 10136 is empty. Hmmm. + +I also see + +``` +WARNING: failed to update lmdb record with key b'\r\x02n\x7f\x10' -- probably a duplicate 13:40795920 (b'\r':40795920) +``` + +For the first error, the webserver actually stopped on `In procedure accept: Too many open files`. The problem looks similar to + +=> https://issues.guix.gnu.org/60226 + +and Arun's patch + +=> https://cgit.git.savannah.gnu.org/cgit/guix/mumi.git/commit/?id=897967a84d3f51da2b1cc8c3ee942fd14f4c669b + +I raised ulimit, but may need to restart the webserver several times. 
We are computing though: + +``` +-rw-r--r-- 1 wrk wrk 82968 Aug 11 05:16 ab51d69f79601cfa7399feebca619ea1a71c1270-BXDPublish-10146-gemma-GWA.tar.xz +-rw-r--r-- 1 wrk wrk 82772 Aug 11 05:16 e6739ace8ca4931fc51baa1844b3b5ceac592104-BXDPublish-10147-gemma-GWA.tar.xz +-rw-r--r-- 1 wrk wrk 81848 Aug 11 05:16 60880fc7e8c86dffb17f28664e478204ea26f827-BXDPublish-10148-gemma-GWA.tar.xz +-rw-r--r-- 1 wrk wrk 79336 Aug 11 05:16 c914d6221b004dec98d60e08c0fdf8791c09cb41-BXDPublish-10149-gemma-GWA.tar.xz +-rw-r--r-- 1 wrk wrk 83536 Aug 11 05:16 3d72b19730edab29bdc593cb6a1a86dd789d351f-BXDPublish-10150-gemma-GWA.tar.xz +-rw-r--r-- 1 wrk wrk 69060 Aug 11 05:16 0e965f1778425071a5497d0fe69f2dc2e534ef60-BXDPublish-10151-gemma-GWA.tar.xz +-rw-r--r-- 1 wrk wrk 69072 Aug 11 05:16 4de26e62a75727bc7edd6b266dfcd7753d185f1a-BXDPublish-10152-gemma-GWA.tar.xz +(...) +``` + +There are some scarily small datasets: + +``` +GET /dataset/bxd-publish/values/10198.json +;;; ("8967240") + +;;; ((("C57BL/6J" . 1.62) ("BXD1" . 2.37) ("BXD5" . 2.73) ("BXD9" . 3.52) ("BXD11" . 0.18) ("BXD12" . 3.69) ("BXD16" . 0.29) ("BXD21" . 2.34) ("BXD27" . 3.38) ("BXD32" . 0.24))) +``` + +i.e. https://genenetwork.org/show_trait?trait_id=10198&dataset=BXDPublish + +Not sure we should be running GEMMA on those! + + +The computation initially stopped at 70% (we are now at 98%). + +To get past 70% I ran the webserver without fibers as suggested by Arun's patch: + +=> https://cgit.git.savannah.gnu.org/cgit/guix/mumi.git/commit/?id=897967a84d3f51da2b1cc8c3ee942fd14f4c669b + +This was because we were getting errors like `In procedure accept: Too many open files` on GET /dataset/bxd-publish/values/23486.json + +After removing fibers precompute just continued where it left off. As it should. 
The fix is: + +=> https://git.genenetwork.org/gn-guile/commit/?id=289da2e13e07928cdb8a1d165483a3a3cd9ae1c6 + +Now that is running I want to make sure I can point back to metadata and perhaps fetch some information to enrich our lmdb files for further processing. Earlier we captured some metadata with + +Next we capture some metadata + +``` +MariaDB [db_webqtl]> select PhenotypeId, Locus, DataId, Phenotype.Post_publication_description from PublishXRef, Phenotype where PublishXRef.PhenotypeId = Phenotype.Id and InbredSetId=1 limit 5; ++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ +| PhenotypeId | Locus | DataId | Post_publication_description | ++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ +| 4 | rs48756159 | 8967043 | Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg] | +| 10 | rsm10000005699 | 8967044 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg] | +| 15 | rsm10000013713 | 8967045 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] | +| 20 | rs48756159 | 8967046 | Central nervous system, morphology: Cerebellum volume [mm3] | +| 25 | rsm10000005699 | 8967047 | Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3] | ++-------------+----------------+---------+----------------------------------------------------------------------------------------------------------------------------+ +``` + +The qtlreaper hits are also of interest. 
Note Bonz has brilliantly captured this in RDF, see + +=> https://github.com/genenetwork/gn-docs/blob/master/rdf-documentation/phenotype-metadata.md + +which is parseable by machines(!). Let's try to use RDF first. The query: + +``` +SELECT * WHERE { + <http://genenetwork.org/id/traitBxd_10002> ?p ?o . +} +``` + +renders + +``` +"http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://genenetwork.org/category/Phenotype" +"http://genenetwork.org/term/belongsToGroup","http://genenetwork.org/id/setBxd" +"http://www.w3.org/2004/02/skos/core#altLabel","BXD_10002" +"http://purl.org/dc/terms/description","Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]" +"http://genenetwork.org/term/abbreviation","ADJCBLWT" +"http://genenetwork.org/term/additive",2.08179 +"http://genenetwork.org/term/locus","http://genenetwork.org/id/Rsm10000005699" +"http://genenetwork.org/term/lodScore",4.77938 +"http://genenetwork.org/term/mean",52.2206 +"http://genenetwork.org/term/sequence",1 +"http://genenetwork.org/term/submitter","robwilliams" +"http://genenetwork.org/term/traitId","10002" +"http://purl.org/dc/terms/isReferencedBy","http://rdf.ncbi.nlm.nih.gov/pubmed/11438585" +``` + +which covers pretty much what we need. Note that this is coming from our public endpoint and can be used to instruct AI agents(!) + +Now we want to fetch these values for all these traitBxd (yes, we need to fix some naming) with a single query: + +``` +SELECT count(*) WHERE { + ?s gnt:belongsToGroup gn:setBxd. +} limit 5 +``` + +returns 14039 traits. Good! Let's get all properties + +``` + +SELECT * WHERE { + ?s gnt:belongsToGroup gn:setBxd; + gnt:traitId ?id; + gnt:locus ?locus; + gnt:lodScore ?lrs; + dct:description ?descr. 
+} limit 50 +``` + +[Try](https://sparql.genenetwork.org/sparql?default-graph-uri=&query=%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E+%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E+%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E+%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E+%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E+%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E+%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E+%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E+%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E+%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E+%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E+%0D%0A%0D%0A%0D%0A%0D%0ASELECT+*+WHERE+%7B%0D%0A++++%3Fs+gnt%3AbelongsToGroup+gn%3AsetBxd%3B%0D%0A+++++++++gnt%3AtraitId+%3Fid%3B%0D%0A+++++++++gnt%3Alocus+%3Flocus%3B%0D%0A+++++++++%23+gnt%3Achr+%3Fchr%3B%0D%0A+++++++++%23+gnt%3Apos+%3Fpos%3B%0D%0A+++++++++gnt%3AlodScore+%3Flrs%3B%0D%0A+++++++++dct%3Adescription+%3Fdescr.%0D%0A%7D+limit+50&format=text%2Fhtml&timeout=0&signal_void=on) + +If we want to get the chr+location we can query one: + +``` +SELECT * WHERE { +gn:Rs47436964 ?p ?o. +} +``` + +renders + +``` +http://www.w3.org/2000/01/rdf-schema#label "rs47436964" +chr "12" +mb 65.0498 +``` + +Now the label is not so interesting, so in one query we can do: + +``` +SELECT ?id ?lod ?chr ?mb ?descr WHERE { + ?s gnt:belongsToGroup gn:setBxd; + gnt:traitId ?id; + gnt:locus ?locus; + gnt:lodScore ?lod; + dct:description ?descr. + ?locus gnt:chr ?chr; + gnt:mb ?mb. 
+} order by desc(?lod) limit 50 +``` + +which gets, for example a massive reaper HK QTL at + +``` +"21588" 34.558 "12" 116.67 "Cofactor, genetics, genomics: Structural variants SVs on chromosome 12, raw uncorrected sum of calls using LongRanger on linked-read sequencing data [n]" +``` + +The description of the phenotype is unfortunate. I think it is a synthetic QTL. The title is "SVs_Chr12". Luckily most traits give more an idea of what it is about. + +[SPARQL](https://sparql.genenetwork.org/sparql?default-graph-uri=&query=%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E+%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E+%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E+%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E+%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E+%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E+%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E+%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E+%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E+%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E+%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E+%0D%0A%0D%0A%0D%0A%0D%0ASELECT+%3Fid+%3Flrs+%3Fchr+%3Fmb+%3Fdescr+WHERE+%7B%0D%0A++++%3Fs+gnt%3AbelongsToGroup+gn%3AsetBxd%3B%0D%0A+++++++++gnt%3AtraitId+%3Fid%3B%0D%0A+++++++++gnt%3Alocus+%3Flocus%3B%0D%0A+++++++++gnt%3AlodScore+%3Flrs%3B%0D%0A+++++++++dct%3Adescription+%3Fdescr.%0D%0A++++%3Flocus+gnt%3Achr+%3Fchr%3B%0D%0A+++++++++++++++gnt%3Amb+%3Fmb.%0D%0A%7D+order+by+desc%28%3Flrs%29+limit+50&format=text%2Fhtml&timeout=0&signal_void=on) + +To run this query on all 13K traits takes just a 
second! The resulting 3Mb TSV I'll share. Note that there is no code necessary to get to this point! Just SPARQL queries on a public endpoint. + +Now, what we want to do is take these results and combine them with the full vector data stored in lmdb. +The first thing we can do is list the top hit from every trait and combine that with the above data. That way we can quickly assess which trait hits will change using GEMMA instead of HK reaper. One thing to note is the formula LRS/4.6=LOD. The GN2 interface shows LRS. + +Meanwhile I am waiting for precompute. Most of it is done, but some interesting errors: + +``` +Precomputing 20484 +;;; ("41012208") +SQL Connection ERROR! file not found +``` + +especially since it appears this is a cache hit. OK, I'll check tomorrow. For now we have 12837 completed vectors! +After some reruns we have 13491 vectors, i.e. 98% of BXD PublishData. + +Some remaining problems: + +``` +Executing: parallel --results /tmp/test --joblog /tmp/test/79d6dbd2fbd55b159c35d903ba10d9cab14f7816-parallel.log < /tmp/test/parallel-commands.txt +Command exited with non-zero status 20 +``` + +the trait values are all 1.0. + +``` +BXD1 1.0 +BXD2 1.0 +BXD5 1.0 +BXD6 1.0 +BXD8 1.0 +BXD9 1.0 +BXD11 1.0 +BXD12 1.0 +BXD13 1.0 +BXD14 1.0 +BXD15 1.0 +BXD16 1.0 +BXD18 1.0 +BXD19 1.0 +``` + +We'll look into those later. + +Next step is to collect all the highest hits and we can do that with + +``` +./bin/view-gemma-mdb --sort tmp/tmp/9179b...923f181-gemma-GWA.mdb --anno BXD.8_snps.txt |head -2 +Reading tmp/tmp/9179b192fc1c19142d97607b64c04bf5a923f181-gemma-GWA.mdb... +chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +10,125580028,rsm10000007478,0.655,0.014,0.0134,100000.0,0.0005,3.34 +``` + +That is great, but now we need to put the data in a place where we can analyse it, including the difference with qtlreaper. We can do a one-off using some tabular format. 
But that would mean we would have to redo things later to get it in SQL and/or present it some other way. So, basically, we need a flexible storage format that allows us to query things -- without predicting how people want to use that data and -- importantly -- have machines do it. Here comes RDF as the solution. As Mark Wilkinson has it: in my lab we only do RDF. No hacks (please). + +So, let's adapt the output of view-gemma-mdb and convert that to RDF. Bonz has done many such exercises in + +=> https://git.genenetwork.org/gn-transform-databases/tree/ + +e.g. for the earlier phenotypes RDF+SPARQL we used to get the reaper values + +=> https://git.genenetwork.org/gn-transform-databases/tree/examples/phenotype.scm + +In this code SQL queries are embedded. I would argue these need to be replaced with REST API calls. But hey. + +First step is to include the ID with ./bin/view-gemma-mdb and some other metadata as fields, that we so thoughtfully included in the mdb metadata. This results in: + +``` +Reading /tmp/tmphvi6grqm/2b8e7c7cfe98f7e44bb2f07f057cc1adedf29c38-gemma-GWA.mdb... +name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +BXDPublish,22200,1,4858261,rsm10000000111,0.5,0.0246,0.0537,100000.0,0.0192,1.72 +BXDPublish,22200,1,182581091,rsm10000000451,0.548,-0.009,0.0537,100000.0,0.139,0.86 +BXDPublish,22200,1,182635325,rsm10000000452,0.548,-0.009,0.0537,100000.0,0.139,0.86 +``` + +Now remember the HK reaper data is already in RDF. If we push this data in we should be able to query the combined datasets. 
Let's convert this to RDF that looks like: + +``` +gn:GEMMAMappedLOCO_22200 a gnt:mappedTrait; + label "GEMMA trait 22200 mapped with LOCO (defaults)"; + gnt:LOCO true; + gnt:belongsToGroup gn:setBxd; + gnt:traitId "22200"; + skos:altLabel "BXD_22200"; + gnt:locus gn:rsm10000000111; + gnt:lodScore 1.72; + gnt:af 0.5; + gnt:effect 0.0246; +``` + +If the marker is not yet defined we can add: + +``` +gn:rsm10000000111 a gnt:marker; + label "rsm10000000111I"; + gnt:chr "1"; + gnt:mb 4.858261; + gnt:pos 4858261. +``` + +This means we can pivot on the trait id between reaper and gemma results. It will also be easy to store multiple +GEMMA hits. +I note that GEMMA does not store the mean +value. We can fetch that from trait values. + +Rob wrote: + +> We will want to harvest the sample size for each trait. That will be a critical parameter for filtering. Knowing the skew and kurtosis also highly valuable in filtering and diagnostics. Many users forget to log their data and this introduces serious problems since you have a tail of outliers. Obviously a dumb mistake to have traits with all values of 1. Perhaps you can assign the task of fixing/deleting that traits to Arthur and me. Just send a list. + +I'll make a list to send to Arthur and you - it is on my tasks. With regard to trait info we should compute that as metadata when doing the precompute (as we have the trait values at that point!). I have added that to the task list. + +=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-publishdata + +We'll do a rerun with this data soon, as it only took a day. + +Alright, I am keen to move forward on our precompute, because this is the fun phase. Getting the metadata in place should be easy, now we are on RDF. First we are going to simply mirror PublishXRef information for HK reaper and GEMMA runs. Reaper is already in RDF (mostly), so let's add some functionality to gemma-wrapper. 
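The CSV-to-Turtle conversion described above can be sketched in a few lines. This is a minimal Python illustration, not the actual gemma-wrapper code: the function name is hypothetical, the subject naming simply follows the gn:GEMMAMappedLOCO_22200 example, and I assume -logP is what ends up in gnt:lodScore and that gnt:mb is pos divided by one million.

```python
import csv
import io

def gemma_csv_to_turtle(csv_text):
    """Turn view-gemma-mdb style CSV rows into Turtle (one trait hit plus marker per row)."""
    blocks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        marker = row["marker"]
        mb = int(row["pos"]) / 1e6  # base pairs to megabases
        blocks.append(
            # Hypothetical subject naming, modelled on the example in the text
            f"gn:GEMMAMappedLOCO_{row['id']} a gnt:mappedTrait;\n"
            f"    gnt:traitId \"{row['id']}\";\n"
            f"    gnt:locus gn:{marker};\n"
            f"    gnt:lodScore {row['-logP']};\n"
            f"    gnt:af {row['af']};\n"
            f"    gnt:effect {row['beta']} .\n"
            f"gn:{marker} a gnt:marker;\n"
            f"    rdfs:label \"{marker}\";\n"
            f"    gnt:chr \"{row['chr']}\";\n"
            f"    gnt:mb {mb};\n"
            f"    gnt:pos {row['pos']} .\n"
        )
    return "\n".join(blocks)

SAMPLE = """name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP
BXDPublish,22200,1,4858261,rsm10000000111,0.5,0.0246,0.0537,100000.0,0.0192,1.72
"""

print(gemma_csv_to_turtle(SAMPLE))
```

Note that every row for the same trait reuses the same subject here, so several SNPs would pile up on one node. Splitting hits into their own nodes is probably the cleaner design once more than the top hit is stored.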
+ +The viewer for 1e59d19a679359516ecd97cf20375c80e987ee3e-BXDPublish-22282-gemma-GWA.tar.xz gives + +``` +name,id,chr,pos,marker,af,beta,se,l_mle,l_lrt,-logP +BXDPublish,22282,5,110385941,rs29780222,0.484,-0.0802,0.0356,2.0341,0.0,4.51 +BXDPublish,22282,5,110421808,rsm10000002804,0.484,-0.0802,0.0356,2.0341,0.0,4.51 +BXDPublish,22282,5,110479038,rsm10000002805,0.484,-0.0802,0.0356,2.0341,0.0,4.51 +BXDPublish,22282,5,110515858,rs33083878,0.484,-0.0802,0.0356,2.0341,0.0,4.51 +``` + +Note that the sorting is arbitrary because -logP is identical! My take is that we should include all hits (read SNP names) for comparison with HK reaper. We will be able to parse range locations - so we can check 50K base pairs up and downstream too. + +Looking at SNPs we should look at using existing URIs instead of inventing new ones. I'll make a note of that too (to move forward). Looking at the first hit rs29780222 some googling finds https://www.informatics.jax.org/marker/MGI:1925270. I need to check with the GN database what is known there. Adding a BED file to RDF makes sense. Yet another task to add. + +OK, back to focussing on generating RDF with what we have now. A first attempt is + +``` +gn:GEMMAMapped_LOCO_e987ee3e_BXDPublish_22282_gemma_GWA a gnt:mappedTrait; + rdfs:label "GEMMA BXDPublish trait 22282 mapped with LOCO (defaults)"; + gnt:trait gn:publishXRef_22282; + gnt:loco true; + gnt:time "2025/08/11 10:15"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "22282"; + skos:altLabel "BXD_22282"; + gnt:locus gn:rs29780222; + gnt:lodScore 4.51; + gnt:af 0.484; + gnt:effect -0.08; +``` + +which looks nice already. We want to support more SNPs, however, so we split those up and now this dataset shows 84 SNPs at a -logP cut-off of 4.0. We'll improve on that later (and will use precompute to estimate levels for the BXD). We always show the single highest score, no matter what. 
The cool thing is that we have *all* peaks now in RDF and we can query that: + +``` +gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedTrait; + rdfs:label "GEMMA BXDPublish trait 22282 mapped with LOCO (defaults)"; + gnt:trait gn:publishXRef_22282; + gnt:loco true; + gnt:time "2025/08/11 10:15"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "22282"; + skos:altLabel "BXD_22282". +gn:rs29780222_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e; + gnt:locus gn:rs29780222; + gnt:lodScore 4.51; + gnt:af 0.484; + gnt:effect -0.08. +gn:rsm10000002804_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e; + gnt:locus gn:rsm10000002804; + gnt:lodScore 4.51; + gnt:af 0.484; + gnt:effect -0.08. +(...) +gn:rs33400361_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e; + gnt:locus gn:rs33400361; + gnt:lodScore 4.07; + gnt:af 0.452; + gnt:effect -0.078. +gn:rsm10000002851_BXDPublish_22282_gemma_GWA_e987ee3e a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_22282_gemma_GWA_e987ee3e; + gnt:locus gn:rsm10000002851; + gnt:lodScore 4.07; + gnt:af 0.452; + gnt:effect -0.078. +``` + +Next step is to use rapper to see if this is valid RDF. + +``` +rapper --input turtle test.ttl +``` + +For this one trait: rapper: Parsing returned 513 triples. It may look like a lot of data, but RDF stores are pretty good at creating small enough representations. All identifiers are stored once as a string and referenced by 64-bit pointers. + +For the locus I notice Bonz capitalized the SNP identifiers. We don't want that. But I'll stick it in for now. 
The code is here: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-mdb-to-rdf.rb + +Basically we run + +``` +rm -f test.rdf +for x in tmp/*.xz ; do + env GEM_PATH=tmp/ruby GEM_HOME=tmp/ruby ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.rdf +done +``` + +for the 98% BXD PublishData that rendered 1512885 triples. It needs some minor fixes, such as a LOD of infinity and the use of ? for an unknown locus. + +To load the file on production: + +``` +guix shell -C -N virtuoso-ose -- isql +# or +/gnu/store/9d81kdw2frn6b3fwqphsmkssc9zblir1-virtuoso-ose-7.2.11/bin/isql -u dba -P "*" -S 8981 +OpenLink Virtuoso Interactive SQL (Virtuoso) +Version 07.20.3238 as of Jan 1 1970 +Type HELP; for help and EXIT; to exit. +Connected to OpenLink Virtuoso +Driver: 07.20.3238 OpenLink Virtuoso ODBC Driver +ld_dir("/home/wrk/","test.ttl","http://pjotr.genenetwork.org") +SQL> rdf_loader_run(); +Done. -- 13 msec. +SQL> checkpoint; +Done. -- 243 msec. +SQL> +``` + +But the triples do not show up. Same for: + +``` +root@tux04:/export/guix-containers/genenetwork/data/virtuoso/ttl# curl --digest -v --user 'dba:*' --url "http://localhost:8982/sparql-graph-crud-auth?graph=http://pjotr.genenetwork.org" -T test.ttl +``` + + +I tried to upload to production, but this crashed the virtuoso server :/. +So I built a new virtuoso instance using gn-machines: + +=> https://git.genenetwork.org/gn-machines/commit/?id=90fa4fdacffe26c57649cb0515d0679ca19c27cc + +Now we can run isql locally as + +``` +guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891 + +SQL> ld_dir('/export/data/virtuoso/ttl','test.n3','http://pjotr.genenetwork.org'); +Done. -- 3 msec. +# check validity and optionally delete problematic entries: +SQL> SELECT * FROM DB.DBA.load_list; +SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ; +# commit changes +SQL> rdf_loader_run (); +SQL> checkpoint; +Done. -- 16 msec. 
+SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o }; +15 +``` + +If an error exists all uploads will be blocked unless DB.DBA.LOAD_LIST is emptied (DELETE). +An error may look like: + +``` +ERROR : Character data are not allowed here by XML structure rules +at line 2 column 3 of source text +@prefix dct: <http://purl.org/dc/terms/> . +``` + +I don't know why, but only n3 triples appeared to work. The full manual is here: + +=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Virtuoso bulk uploader + +## Fixing hanging virtuoso on production + +Going back to production I cleaned up the DB.DBA.LOAD_LIST as described above. Running isql can be done outside the container: + +``` +guix shell virtuoso-ose -- isql 8981 +SQL> DELETE from DB.DBA.LOAD_LIST; +SQL> checkpoint; +``` + +SPARQL queries inside isql are fast: + +``` +SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o }; +1206882 +SQL> SPARQL SELECT count(*) FROM <http://genenetwork.org> WHERE { ?s ?p ?o }; +46982542 +``` + +The web socket is not connected. This does not respond: + +``` +curl http://localhost:8982/sparql/ +``` + +herd stop/start virtuoso made no difference. Nor did nginx or nscd. Hmm. Restarting the full container it starts up at + +``` +root@tux04:/export/guix-containers/genenetwork/var/log# tail virtuoso.log + 2025-08-17 07:47:07 07:47:07 HTTP server online at localhost:9893 + 2025-08-17 07:47:07 07:47:07 Server online at localhost:9892 (pid 43) +curl localhost:9893/sparql +``` + +Aha, the domain is pointing to the wrong virtuoso server... I modified nginx on tux04 and, at least, we have SPARQL running on http. For https nginx is pointing to https://127.0.0.1:8993. Hmmm. That is not the same as what the logs tell me. Looks like there is still some problem with the production container. Well, we can solve that later. + +I'll first run virtuoso on a server. Starting from a guix from half a year ago: + +``` +. 
/usr/local/guix-profiles/guix-pull-3-link/etc/profile +cd ~/gn-machines +./virtuoso-deploy.sh +curl localhost:8892/sparql/ +``` + +Configure nginx to listen + +``` +server { + server_name sparql-test.genenetwork.org; + listen 80; + access_log /var/log/nginx/sparql-test-access.log; + error_log /var/log/nginx/sparql-test-error.log; + location / { + proxy_pass http://localhost:8892; + proxy_set_header Host $host; + } +} +``` + +Added DNS-entry and we should be able to see + +=> http://sparql-test.genenetwork.org/sparql/ + +Now I need to load the important data into this SPARQL server. On tux02 I find a recent set: + +``` + 4096 Dec 5 2024 wip + 260886 Jul 21 19:57 schema.ttl +443454617 Jul 21 19:57 generif-old.ttl + 44902 Jul 21 19:57 classification.ttl +339900838 Jul 21 19:58 genelist.ttl + 42509383 Jul 21 19:58 genbank.ttl +152936953 Jul 21 19:58 genotype.ttl + 1460511 Jul 21 19:58 dataset-metadata.ttl +700627810 Jul 21 19:58 generif.ttl + 10491221 Jul 21 19:58 strains.ttl + 1388 Jul 21 19:58 species.ttl + 23495986 Jul 21 19:58 publication.ttl + 16879 Jul 21 19:58 tissue.ttl + 18537935 Jul 21 19:58 phenotype.ttl +root@tux02:/export/data/genenetwork-virtuoso# du -sh . +1.7G . +``` + +Which is about 2Gb uncompressed. Not bad. To load the ttl files I have to move them into +/export/guix-containers/virtuoso/data/virtuoso/ttl. + +``` +guix shell virtuoso-ose -- isql 8891 exec="ld_dir('/export/data/virtuoso/ttl','*.ttl','http://genenetwork.org');" +guix shell virtuoso-ose -- isql 8891 exec="rdf_loader_run();" +``` + +That takes a few minutes for 29746544 triples. Not bad at all! + +``` +guix shell virtuoso-ose -- isql 8891 exec="SELECT * FROM DB.DBA.load_list;" +guix shell virtuoso-ose -- isql 8891 exec="checkpoint;" +``` + +Let's list all the tissues we have with + +``` +SELECT * WHERE { + ?s rdf:type gnc:tissue . + ?s rdfs:label ?o . 
+} +"http://genenetwork.org/id/tissueA1c" "Primary Auditory (A1) Cortex mRNA" +"http://genenetwork.org/id/tissueAcc" "Anterior Cingulate Cortex mRNA" +"http://genenetwork.org/id/tissueAdr" "Adrenal Gland mRNA" +"http://genenetwork.org/id/tissueAmg" "Amygdala mRNA" +"http://genenetwork.org/id/tissueBebv" "Lymphoblast B-cell mRNA" +"http://genenetwork.org/id/tissueBla" "Bladder mRNA" +(...) +``` + +=> http://sparql-test.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+gn%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fid%2F%3E%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0D%0APREFIX+gnc%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fcategory%2F%3E%0D%0APREFIX+gnt%3A+%3Chttp%3A%2F%2Fgenenetwork.org%2Fterm%2F%3E%0D%0APREFIX+sdmx-measure%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23%3E%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E%0D%0APREFIX+qb%3A+%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E%0D%0APREFIX+xkos%3A+%3Chttp%3A%2F%2Frdf-vocabulary.ddialliance.org%2Fxkos%23%3E%0D%0APREFIX+pubmed%3A+%3Chttp%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubmed%2F%3E%0D%0A%0D%0ASELECT+*+WHERE+%7B%0D%0A%3Fs+rdf%3Atype+gnc%3Atissue+.%0D%0A%3Fs+rdfs%3Alabel+%3Fo+.%0D%0A%7D%0D%0A&format=text%2Fhtml&timeout=0&signal_void=on Try it! + +## Getting to our first PublishData queries + +Next we need to upload our fresh PublishData RDF. We generated that with: + +``` +rm -f test.ttl ; for x in tmp/*.xz ; do ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.ttl; done +``` + +Takes 10 minutes. rapper still returns an error for 'gnt:lodScore Infinity;'. I'll fix that down the line. 
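One way the 'gnt:lodScore Infinity;' problem could be handled is sketched below. A bare Infinity is not a valid Turtle number, but xsd:double does have the lexical forms INF, -INF and NaN, so a non-finite score can be written as a typed literal. This is a Python illustration with a hypothetical helper name, not the fix in gemma-mdb-to-rdf.rb; it assumes the xsd: prefix is declared in the output, and whether infinite scores should be kept at all (rather than capped or dropped) is a separate decision.

```python
import math

def turtle_double(value):
    """Format a float as a Turtle literal; non-finite values use xsd:double lexical forms."""
    if math.isnan(value):
        return '"NaN"^^xsd:double'
    if math.isinf(value):
        return '"INF"^^xsd:double' if value > 0 else '"-INF"^^xsd:double'
    return repr(float(value))

print("gnt:lodScore " + turtle_double(4.51) + ";")
print("gnt:lodScore " + turtle_double(float("inf")) + ";")
```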
+ +Put test.ttl in /export/guix-containers/virtuoso/data/virtuoso/ttl and use the isql commands to update virtuoso. I use a separate graph named 'http://pjotr.genenetwork.org' so we can easily delete the triples. + +``` +guix shell virtuoso-ose -- isql 8891 exec="ld_dir('/export/data/virtuoso/ttl','test.ttl','http://pjotr.genenetwork.org'); rdf_loader_run();" +``` + +OK, we have the data together. Time for our first queries. Interesting questions are: + +* How many hits do we have for qtlreaper and how many for gemma in total +* How many hits do we have for qtlreaper and how many for gemma that have a hit of 4.0 or higher +* How many of these hits for qtlreaper differ from those of gemma +* What datasets have been mapped in qtlreaper, but not in gemma + +### How many hits do we have for qtlreaper and how many for gemma in total + +Remember we had this query for reaper: + +``` +SELECT * WHERE { + ?s gnt:belongsToGroup gn:setBxd; + gnt:traitId ?id; + gnt:locus ?locus; + gnt:lodScore ?lrs; + dct:description ?descr. 
+} limit 5 +"http://genenetwork.org/id/traitBxd_10001","10001","http://genenetwork.org/id/Rs48756159",2.93169,"Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of both sexes [mg]" +"http://genenetwork.org/id/traitBxd_10002","10002","http://genenetwork.org/id/Rsm10000005699",4.77938,"Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]" +"http://genenetwork.org/id/traitBxd_10003","10003","http://genenetwork.org/id/Rsm10000013713",3.38682,"Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg]" +"http://genenetwork.org/id/traitBxd_10004","10004","http://genenetwork.org/id/Rs48756159",2.56076,"Central nervous system, morphology: Cerebellum volume [mm3]" +"http://genenetwork.org/id/traitBxd_10005","10005","http://genenetwork.org/id/Rsm10000005699",5.02907,"Central nervous system, morphology: Cerebellum volume, adjusted for covariance with brain size [mm3]" +``` + +We can run a similar query for GEMMA results with trait id "10001" and locus names. + +``` +SELECT * WHERE { + ?s gnt:mappedSnp ?id; + gnt:locus ?locus; + gnt:lodScore ?lrs. + filter(?lrs > 4.0). +} limit 5 +``` + +To find distinct datasets for GEMMA: + +``` +SELECT count(*) WHERE { + ?id gnt:name "BXDPublish" . +} limit 5 +``` + +Counting the total number of hits, we have 13576 reaper hits and 231911 GEMMA hits. For GEMMA we have 13491 uniquely mapped datasets. + +### Count hits that are significant + +For GEMMA 223232 hits are 4.0 or higher. For Reaper we count 1098. Almost all reaper values are between 2.0 and 4.0. When we count GEMMA below 4.0 we get 8679 datasets - and that makes sense because for GEMMA we list all SNPs that are over 4.0 and, for datasets that stay below it, only the highest SNP. In both cases the majority of traits are below our threshold. 
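Counts like the ones above can also be requested from a script instead of the web form. The sketch below only builds the request URL, so it runs without network access. It uses the test endpoint named earlier and the query and format parameters visible in the links in this document; the function names are hypothetical, and that the endpoint accepts format=text/csv is an assumption on my side.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

ENDPOINT = "http://sparql-test.genenetwork.org/sparql/"  # test endpoint named in the text

PREFIXES = """PREFIX gn: <http://genenetwork.org/id/>
PREFIX gnt: <http://genenetwork.org/term/>
"""

def gemma_count_query(threshold):
    """SPARQL that counts GEMMA SNP hits with a score at or above the threshold."""
    return PREFIXES + f"""SELECT (COUNT(*) AS ?n) WHERE {{
  ?s gnt:mappedSnp ?id;
     gnt:lodScore ?lod.
  FILTER(?lod >= {threshold}).
}}"""

def query_url(query, fmt="text/csv"):
    """Encode the query as a GET request (the text/csv format is an assumption)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": fmt})

url = query_url(gemma_count_query(4.0))
print(url)
```

As a sanity check on the numbers quoted above: 223232 hits at 4.0 or higher plus 8679 below add up to the 231911 GEMMA hits in total.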
+ +### Start looking at the difference + +For every reaper SNP 'locus' we want to find the GEMMA sets that contain that particular SNP. In other words, those are the hits that GEMMA found that compare with qtlreaper. We pivot on SNP ?locus and ?traitid. + +``` +SELECT count(*) WHERE { + ?reaper gnt:belongsToGroup gn:setBxd; + gnt:traitId ?traitid; + gnt:locus ?locus; + gnt:lodScore ?lrs . + ?gemma gnt:mappedSnp ?id2; + gnt:locus ?locus; + gnt:lodScore ?lrs2. + ?id2 gnt:name "BXDPublish" ; + gnt:traitId ?traitid. + filter(?lrs2 >= 4.0). +} limit 5 +``` + +Now we find 4222 overlapping traits! Of these, 2924 have a GEMMA LOD score >= 4.0, and 892 have a reaper score > 4.0 (out of 1098). That implies that some 200 significant reaper scores (1098 - 892 = 206) correspond to (completely) different SNPs for GEMMA. + +The next step is to list these differences. That is a reverse query. In plain English it should be something like: + +> List all sets where reaper has a SNP (r_snp) that does not appear in its GEMMA computation (g_snps). + +This is rather hard to do in SPARQL. We can make a list, however, of the overlapping traits with a lod score>4.0 with + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> + +SELECT ?traitid WHERE { + # --- get the reaper SNPs + ?r_trait gnt:belongsToGroup gn:setBxd; + gnt:traitId ?traitid; + gnt:locus ?snp. 
+ # --- get gemma trait that matches reaper traitid (pivot on traitid) + ?g_trait gnt:name "BXDPublish" ; + gnt:traitId ?traitid. + # --- g_snp is the SNP scored within a gemma trait run + ?g_snp gnt:mappedSnp ?g_trait; + gnt:locus ?snp; + gnt:lodScore ?g_lrs. + filter(?g_lrs >= 4.0). +} limit 5 +``` + +Resulting in 2925 overlapping results. For example, it lists trait + +=> https://genenetwork.org/show_trait?trait_id=12014&dataset=BXDPublish + +where both reaper and gemma show a top hit for rs13478947. + +An attempt at the reverse count using MINUS: + +``` +SELECT count(distinct ?traitid) WHERE { + # --- get the reaper SNPs + ?r_trait gnt:belongsToGroup gn:setBxd; + gnt:traitId ?traitid; + gnt:locus ?snp. + # --- get gemma trait that matches reaper traitid (pivot on traitid) + ?g_trait gnt:name "BXDPublish" ; + gnt:traitId ?traitid. + # --- g_snp is the SNP scored within a gemma trait run + ?g_snp gnt:mappedSnp ?g_trait; + gnt:lodScore ?g_lrs. + MINUS { ?g_snp gnt:locus ?snp . } + filter(?g_lrs >= 4.0). +} +``` + +Now we can make a second list for all gemma results where g_lrs >= 4.0. The difference is our set. + +``` +SELECT DISTINCT ?traitid WHERE { + # --- get gemma trait that matches reaper traitid (pivot on traitid) + ?g_trait gnt:name "BXDPublish" ; + gnt:traitId ?traitid. + # --- g_snp is the SNP scored within a gemma trait run + ?g_snp gnt:mappedSnp ?g_trait; + gnt:locus ?snp; + gnt:lodScore ?g_lrs. + filter(?g_lrs >= 4.0). +} +``` + +One example is trait 23777 where reaper has rsm10000008413 and gemma ranks SNPs, but rsm10000008413 with LRS 3.44 is below the threshold. That does not make such a strong case because both results are on Chr11 and not too far from each other (58 vs 73 Mb). Still, it may be a difference of interest. GEMMA's main hit rs13480386 is also ranked by reaper (in GN2). +I think we need to refine our method. Peaks on Chr9 and 15 are also of interest. 
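The "difference is our set" step is easy to express outside SPARQL once both lists are exported. Below is a minimal Python sketch with a hypothetical function name. The two dictionaries are illustrative only, filled with the two traits discussed above (12014 and 23777), and assume the exports have been reduced to trait id, reaper SNP and the set of GEMMA SNPs at 4.0 or higher.

```python
def reaper_snp_missing_in_gemma(reaper_top, gemma_hits):
    """Traits with significant GEMMA hits where the reaper SNP is not among them."""
    return sorted(
        trait
        for trait, snp in reaper_top.items()
        if gemma_hits.get(trait) and snp not in gemma_hits[trait]
    )

# Illustrative data based on the two traits mentioned in the text
reaper_top = {"12014": "rs13478947", "23777": "rsm10000008413"}
gemma_hits = {"12014": {"rs13478947"}, "23777": {"rs13480386"}}

print(reaper_snp_missing_in_gemma(reaper_top, gemma_hits))
```

Traits without any GEMMA hit at the threshold are skipped on purpose, because the interest is in cases where both methods find something but disagree on where.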
+ +See + +=> https://genenetwork.org/show_trait?trait_id=23777&dataset=BXDPublish + +Another trait 14905 shows a whopper on Chr4 with gemma and one on Chr8 with reaper. +This is rather a good example. To improve the power of our search I think I should extend the GEMMA results with all hits above 3.0. That greatly increases the chance that a reaper marker is seen. To do an even better job we should run reaper precompute and also store the highest ranked markers (rather than one single hit). That way we get a true picture of the overlap and differences. While we are at it, we should store the trait values with the sample size etc. + +But first let's try finding those that differ on chromosome hits: + +Hmmm. The following is not working quite right because it shows all the differences with 200K results. I tried + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> + +SELECT DISTINCT ?traitid ?chr1 ?chr2 ?url ?descr WHERE { + # --- get the reaper SNPs + ?r_trait gnt:belongsToGroup gn:setBxd; + gnt:traitId ?traitid; + gnt:locus ?snp ; + dct:description ?descr. + # --- get gemma trait that matches reaper traitid (pivot on traitid) + ?g_trait gnt:name "BXDPublish" ; + gnt:traitId ?traitid. + # --- g_snp is the SNP scored within a gemma trait run + ?g_snp gnt:mappedSnp ?g_trait; + gnt:lodScore ?g_lrs ; + gnt:locus ?snp2 . 
+ # --- get Chr positions of both snps + ?snp gnt:chr ?chr1 . + ?snp2 gnt:chr ?chr2 . + MINUS { ?g_snp gnt:locus ?snp . } + filter(?g_lrs >= 4.0). + filter(?chr2 != ?chr1) . + BIND(REPLACE(?traitid, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url) +} LIMIT 15 +``` + +What I am attempting is set analysis. SPARQL is so powerful that you are tempted to try, but it is far simpler to do in any programming language. I tooted about this rediscovery: + +=> https://genomic.social/@pjotrprins@mastodon.social/115059451578588805 + +I created a list for Rob using some simple shell commands, so he can see what the challenge is. I wrote + +> Attached a list of traits that show a reaper SNP that is not significant (LOD 4.0) for GEMMA and still show a significant hit for GEMMA. You can test run them on GN2 and see that the story is ambiguous. To do a proper job we should store more hits for GEMMA (say from LOD 3.0) and do a precompute exercise with reaper storing all top hits. That way we can probably do better and even get a list for Claude. + +One example is trait 23777 where reaper has rsm10000008413 and gemma ranks SNPs, but rsm10000008413 with LRS 3.44 is below the threshold. That does not make such a strong case because both results are on Chr11 and not too far from each other (58 vs 73 Mb). Still, it may be a difference of interest. GEMMA's main hit rs13480386 is also ranked by reaper (in GN2). I think we need to refine our method. Peaks on Chr9 and 15 are also of interest. + +See + +=> https://genenetwork.org/show_trait?trait_id=23777&dataset=BXDPublish + +Another trait 14905 shows a whopper on Chr4 with gemma and one on Chr8 with reaper. This is rather a good example. To improve the power of our search I think I should extend the GEMMA results with all hits above 3.0. That greatly increases the chance that a reaper marker is seen. 
To do an even better job we should run reaper precompute and also store the highest ranked markers (rather than one single hit). That way we get a true picture of the overlap and differences. While we are at it, we should store the trait values with the sample size etc. + +So, reruns of GEMMA and reaper are on the books. While we are at it we can adapt reruns for + +* qnormalized data* +* auto winsorizing +* sex covariate +* run gemma without LOCO +* cis covariate, using the current hit and recompute with that as a covariate* +* epistatic covariates + +and that should all be reasonably easy for the 13K traits. + +## More metadata + +But first we set up a new run with more metadata. In the lmdb files we should add the trait values, the mean, SE, skew, kurtosis, any DOIs. + +gemma-wrapper can take trait values as produced by our gn-guile endpoint (in .json). First step is to add these values to the metadata. The existing permutate switch takes a pheno file and outputs that during a run. We can use that to pass in the pheno file. + + +Now we should write out the gemma phenotypes to make sure they align. We have essentially moved the functionality from gn-pheno-to-gemma.rb into gemma-wrapper, so we need to pass in the geno information too. + +The command becomes + +``` +./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json +./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --lmdb --phenotypes 10002-pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json +``` + +We now store the trait values into the metadata and they go into lmdb! 
+ +``` + "meta": { + "type": "gemma-wrapper", + "version": "1.00-pre1", + "population": "BXD", + "name": "BXDPublish", + "trait": "1", + "geno_filename": "BXD.geno.txt", + "geno_hash": "3b65ed252fa47270a3ea867409b0bdc5700ad6f6", + "loco": true, + "url": "https://genenetwork.org/show_trait?trait_id=1&dataset=BXDPublish", + "archive_GRM": "185eb08dc3897c7db5d7ea987170898035768f93-gemma-cXX.tar.xz", + "archive_GWA": "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-1-gemma-GWA.tar.xz", + "trait_values": { + "BXD1": 54.099998, + "BXD2": 50.099998, + "BXD5": 53.299999, +... +``` + +Commit is here: + +=> https://github.com/genetics-statistics/gemma-wrapper/commit/9ad5f762823031da08fc51c2a6adae983e6e8314 + +Now gemma2lmdb is actually written in python, so we can make use of scipy functions using the trait values. + +So, for example, we can compute: + +``` +mean= 52.22058749999999 std= 2.968538937833582 kurtosis= 0.03143766680654192 skew= -0.1315270039489698 +for +[54.099998, 50.099998, 53.299999, 55.099998, 57.299999, 51.200001, 53.599998, 46.799999, 50.599998, 49.299999, 45.700001, 52.5, 52.0, 51.099998, 52.400002, 49.0, 51.599998, 50.700001, 55.5, 52.599998, 53.099998, 53.5, 53.200001, 58.700001, 50.799999, 53.299999, 51.900002, 54.099998, 52.299999, 46.099998, 51.799999, 57.0, 48.599998, 56.599998] +``` + +Using + +=> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html +=> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html + +The code is in the gemma-wrapper repo. + +I'll set up a new run and export to RDF. Some additions first. + +Even though we store trait values, I should add the number of individuals too. We store that as nind. + +Now that we have these metrics, no metadata is complete without its publication. PublishXRef contains a PublicationID. 
It points into the Publication table that contains, for example: + +``` +| Id | PubMed_ID | Abstract | Authors | Title | Journal | Volume | Pages | Month | Year | +| 116 | 11438585 | To discover genes influencing cerebellum development, we conducted a complex trait analysis of variation in the size of the adult mouse cerebellum. We analyzed two sets of recombinant inbred BXD strains and an F2 intercross of the common inbred strains, C57BL/6J and DBA/2J. We measured cerebellar size as the weight or volume of fixed or histologically processed tissue. Among BXD recombinant inbred strains, the cerebellum averages 52 mg (12.4% of the brain) and ranges 18 mg in size. In F2 mice, the cerebellum averages 62 mg (12.9% of the brain) and ranges approximately 20 mg in size. Five quantitative trait loci (QTLs) that significantly control variation in cerebellar size were mapped to chromosomes 1 (Cbs1a), 8 (Cbs8a), 14 (Cbs14a), and 19 (Cbs19a, Cbs19b). In combination, these QTLs can shift cerebellar size to an appreciable 35% of the observed range. To assess regional genetic control of the cerebellum, we also measured the volume of the cell-rich, internal granule layer (IGL) in a set of BXD strains. The IGL ranges from 34 to 43% of total cerebellar volume. The QTL Cbs8a is significantly linked to variation in IGL volume and is suggestively linked to variation in the number of cerebellar folia. The QTLs we have discovered are among the first loci shown to modulate the size and architecture of the adult mouse cerebellum. | Airey DC, Lu L, Williams RW | Genetic control of the mouse cerebellum: identification of quantitative trait loci modulating size and architecture | J Neuroscience | 21 | 5099-5109 | NULL | 2001 | +``` + +That is a nice example. +But we also find many publications without abstracts, e.g. | 7276 | 15792 | NULL | Williams EG, Andreux P, Houtkooper R, Auwerx J | Recombinant Inbred BXD Mice as a Model for the Metabolic Syndrome. 
+ +In fact, 22K entries out of 29K miss the abstract. Also I can't find this last paper by Evan Williams. The closest is "Systems Genetics of Metabolism: The Use of the BXD Murine Reference Panel for Multiscalar Integration of Traits" which is probably worth reading. + +=> https://www.cell.com/cell/pdfExtended/S0092-8674(12)01007-0?__cf_chl_tk=kYZ49R4P29zOzYPeuWdrXVJC61HyhpHwFtq8lS2_rlk-1756022056-1.0.1.1-uY.PpAbgi8FO54P4_wYp_f6Nm84CdfHNQEI1WOmngFE + +I have no idea where the number 15792 comes from. It is not a pubmed ID. Some quick checks: + +``` +MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 limit 3; ++----------+ +| 427 | ++----------+ +MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 and Pubmed_ID<99999 limit 3; ++----------+ +| 2 | ++----------+ +MariaDB [db_webqtl]> select count(*) from Publication WHERE Pubmed_ID>0 and Pubmed_ID<999999 limit 3; ++----------+ +| 10 | ++----------+ +select count(*) from Publication WHERE NOT Abstract is NULL limit 3; ++----------+ +| 6750 | ++----------+ +``` + +so, out of 29K entries, we have a very limited number of useful PMIDs, but we have some 6750 abstracts - mostly related to the BXD. Meanwhile some 16572 entries (about half) appear to have valid titles. Almost all records have authors, however. + +It really is a bit of a mess. What we need to do is harvest what we have and then collect pubmed ids for the missing BXD PublishData records and use that to fetch up-to-date abstracts and author lists. We can even adapt my Pubmed script that I use for bibtex. A search for just the combination of these authors + +``` +pubmed2bib.sh 'Williams EG, Andreux P, Houtkooper R, Auwerx J [au]' +``` + +renders + +``` +@article{Andreux:2012, + keywords = { }, + pmid = {22939713}, + pmcid = {3604687}, + note = {{PMC3604687}}, + IDS = {PMC3604687, PMID:22939713}, + author = {Andreux, P. A. and Williams, E. G. and Koutnikova, H. and Houtkooper, R. H. and Champy, M. F. and Henry, H. 
and Schoonjans, K. and Williams, R. W. and Auwerx, J.}, + title = {{Systems genetics of metabolism: the use of the BXD murine reference panel for multiscalar integration of traits}}, + journal = {Cell}, + year = {2012}, + volume = {150}, + number = {6}, + pages = {1287-1299}, + doi = {10.1016/j.cell.2012.08.012}, + url = {http://www.ncbi.nlm.nih.gov/pubmed/22939713}, + abstract = {Metabolic homeostasis is achieved by complex molecular and cellular networks that differ significantly among individuals and are difficult to model with genetically engineered lines of mice optimized to study single gene function. Here, we systematically acquired metabolic phenotypes by using the EUMODIC EMPReSS protocols across a large panel of isogenic but diverse strains of mice (BXD type) to study the genetic control of metabolism. We generated and analyzed 140 classical phenotypes and deposited these in an open-access web service for systems genetics (www.genenetwork.org). Heritability, influence of sex, and genetic modifiers of traits were examined singly and jointly by using quantitative-trait locus (QTL) and expression QTL-mapping methods. Traits and networks were linked to loci encompassing both known variants and novel candidate genes, including alkaline phosphatase (ALPL), here linked to hypophosphatasia. The assembled and curated phenotypes provide key resources and exemplars that can be used to dissect complex metabolic traits and disorders.}, +} +``` + +So, yes, it is the likely candidate. We can use this information to suggest updates. It just proves again how useful manual curation is. + +Note that this information is collected at the experimental level (rather than the trait level), so it really does not belong in the GEMMA lmdb data. Every trait has an entry in PublishXRef that points back to the Publication ID. So we can take it later (and fix it!). + +# Rerun GEMMA precompute + +Let's set up a full rerun for the 13K BXD PublishData entries with this new information. 
That should allow us to see how skew and kurtosis and experimental size affect the outcome. Remember we have the batch run script: + +``` +#! /bin/env sh + +export TMPDIR=./tmp +curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json +jq ".[] | .Id" < bxd-publish.json > ids.txt +./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json + +for id in $(cat ids.txt) ; do + echo Precomputing $id + if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then + curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json + ./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt + ./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json + fi +done +``` + +that can be simplified because gemma-wrapper now replaces gn-pheno-to-gemma.rb. First Guix had to install scipy which pulls in inkscape and Jupyter among other things. It is really too much! But at least Guix makes it easy to reproduce the environment I use on my desktop on the server. 
Now we get a beautiful record in every lmdb GEMMA run: + +``` +"archive_GWA": "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10001-gemma-GWA.tar.xz", "trait_values": {"BXD +1": 61.400002, "BXD2": 49.0, "BXD5": 62.5, "BXD6": 53.099998, "BXD8": 59.099998, "BXD9": 53.900002, "BXD11": 53.099998, + "BXD12": 45.900002, "BXD13": 48.400002, "BXD14": 49.400002, "BXD15": 47.400002, "BXD16": 56.299999, "BXD18": 53.599998 +, "BXD19": 50.099998, "BXD20": 48.200001, "BXD21": 50.599998, "BXD22": 53.799999, "BXD23": 48.599998, "BXD24": 54.90000 +2, "BXD25": 49.599998, "BXD27": 47.400002, "BXD28": 51.5, "BXD29": 50.200001, "BXD30": 53.599998, "BXD31": 49.700001, " +BXD32": 56.0, "BXD33": 52.099998, "BXD34": 53.700001, "BXD35": 49.700001, "BXD36": 44.5, "BXD38": 51.099998, "BXD39": 5 +4.900002, "BXD40": 49.900002, "BXD42": 59.400002}, "table": "PublishData", "traitid": 10001, "dataid": 0}}, "nind": 34, + "mean": 52.1353, "std": 4.1758, "skew": 0.6619, "kurtosis": 0.0523, +``` + +and the job is running.... + +Next stop is to rerun reaper and variations on gemma. Last night it halted at 9K. The webserver gave an SQL error and just stopped/waited. As it is not using threads it will block. It says: SQL Connection ERROR! file not found + +# HK + +We want to rerun reaper to get more top ranked hits (and peaks). Now I also realize GEMMA can also do LR and it would be interesting to see how that differs from reaper. The '-lm' switch says: + +``` + -lm [num] specify analysis options (default 1). + options: 1: Wald test + 2: Likelihood ratio test + 3: Score test + 4: 1-3 +``` + +the documentation points out that we don't need a GRM. Exactly. Now we could try and embed this in gemma-wrapper, but that is overkill. Part of the complexity of gemma-wrapper is related to handling the GRM with LOCO. Here we have a simple command that needs to be iterated. We don't need to record trait values, kurtosis etc. because that is already part of the previous exercise (and is constant). 
So the main complications are to create the trait vector, run gemma, and write an lmdb file. For now this will be a one-off, so we are not going to bother with caching and all that. + +``` +gemma -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lm 2 -o trait-BXDPublish-$id-gemma-GWA-hk +``` + +This produces a file + +``` +chr rs ps n_mis n_obs allele1 allele0 af p_lrt +1 rsm10000000001 3001490 0 237 X Y 0.527 -nan +1 rs31443144 3010274 0 237 X Y 0.525 -nan +1 rs6269442 3492195 0 237 X Y 0.525 -nan +1 rs32285189 3511204 0 237 X Y 0.525 -nan +``` + +Hmm. All p_lrt are NaN. Oh, I need to make sure the second column is used: + +``` +gemma -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -lm 2 -o tmp/trait-BXDPublish-$id-gemma-GWA-hk +chr rs ps n_mis n_obs allele1 allele0 af p_lrt +1 rsm10000000001 3001490 0 23 X Y 0.739 8.331149e-01 +1 rs31443144 3010274 0 23 X Y 0.739 8.331149e-01 +1 rs6269442 3492195 0 23 X Y 0.739 8.331149e-01 +1 rs32285189 3511204 0 23 X Y 0.739 8.331149e-01 +1 rs258367496 3659804 0 23 X Y 0.739 8.331149e-01 +``` + +much better! Now we need to turn this into an lmdb file. We can adapt gemma2lmdb.py to do that. But I am not going to do that. The attraction of repurposing code is always there, but it will mean diluting the meaning of the code - basically ifthen blocks - and making the code less readable. This is one reason the Linux kernel does not share code between device drivers. Even for these simple tools I prefer to split out at the risk of not being DRY. I hope you can see what I mean with: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma2lmdb.py + +which is now pretty straightforward for parsing LMM output of GEMMA into lmdb. We are going to do the same thing for a simpler output. But when writing it suddenly struck me we don't need lmdb here in the first place! lmdb is for the full vector output and there is no reason to retain it. All we want is the top hits. 
Great, that simplifies matters even more. Which btw points out how baffling it is to me that people think they can replace programmers with AI. Well, maybe for the obvious code... You just see how much code will be garbage. + +Now we have the same idea in gemma-mdb-to-rdf.rb - and for the same reason as before I am not going to adapt that code. + +Fun fact, HK returns the same hits for GEMMA and reaper versions. Good. The top GEMMA p_lrt is 2.720446e-06, so -log10(p), which we use as the LOD, is 5.56. Multiplying by 4.61 renders an LRS of about 25, where GN2 shows an LRS of 22. Oh well, we are not too concerned, as long as the ranking is correct. + +So for GN trait + +=> https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish + +we now get for GEMMA HK: + +``` +gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedTrait; + rdfs:label "GEMMA_BXDPublish output/trait-BXDPublish-1-gemma-GWA-hk.assoc.txt trait HK mapped"; + gnt:GEMMA_HK true; + gnt:belongsToGroup gn:setBxd; + gnt:trait gn:publishXRef_1; + gnt:time "2025-08-25 10:14:23 +0000"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "1"; + skos:altLabel "BXD_1". +gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rsm10000005699 ; + gnt:lodScore 5.6 . +gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rs47899232 ; + gnt:lodScore 5.6 . +gn:rs3661882_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rs3661882_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rs3661882 ; + gnt:lodScore 5.3 . 
+gn:rs33490412_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rs33490412_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rs33490412 ; + gnt:lodScore 5.3 . +gn:rsm10000005703_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rsm10000005703_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rsm10000005703 ; + gnt:lodScore 5.3 . +(...) +``` + +Code is here: + +=> https://github.com/genetics-statistics/gemma-wrapper/commit/a17901d927d21a1686c0ac0d1552695f0096b84b + +Generate RDF incl. skew, kurtosis etc + +``` +./bin/gemma-mdb-to-rdf.rb --header > test.ttl +time for x in tmp/*.xz ; do + ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> test.ttl +done +``` + +Renders + +``` +gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedTrait; + rdfs:label "GEMMA BXDPublish trait 10001 mapped with LOCO (defaults)"; + gnt:trait gn:publishXRef_10001; + gnt:loco true; + gnt:time "2025/08/24 08:22"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "10001"; + gnt:nind 34; + gnt:mean 52.1353; + gnt:std 4.1758; + gnt:skew 0.6619; + gnt:kurtosis 0.0523; + skos:altLabel "BXD_10001". +gn:Rsm10000005700_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d; + gnt:locus gn:Rsm10000005700; + gnt:lodScore 6.2; + gnt:af 0.382; + gnt:effect 1.626. +n:Rs32133186_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d; + gnt:locus gn:Rs32133186; + gnt:lodScore 6.2; + gnt:af 0.382; + gnt:effect 1.626. +... +``` + +Funny thing is that the hash values are now all the same because gemma-wrapper no longer includes the trait values. That is a harmless bug that I'll fix for the next run. + +The GEMMA run ended up generating 1,576,110 triples. The gemma-mdb-to-rdf script took 42 minutes. 
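
As an aside on the LOD and LRS figures quoted for the HK run: the arithmetic can be checked with a few lines of Python. This is only a sketch. It assumes, as the text does, that -log10(p) is what ends up as the LOD score, and that the 4.61 multiplier is 2*ln(10). The function names are made up for this note and are not part of gemma-wrapper.

```python
import math

def lod_from_p(p):
    """Assumption: the LOD we report is simply -log10 of the p-value."""
    return -math.log10(p)

def lrs_from_lod(lod):
    """LRS = 2*ln(10)*LOD; 2*ln(10) is the ~4.61 multiplier used in the text."""
    return 2.0 * math.log(10.0) * lod

# p_lrt of the top HK hit quoted in the text
p_lrt = 2.720446e-06
lod = lod_from_p(p_lrt)
lrs = lrs_from_lod(lod)
print(f"LOD={lod:.3f} LRS={lrs:.2f}")
```

Running this should land close to the 5.56 and 25 mentioned earlier, which suggests the gap with the LRS of 22 in GN2 does not come from the conversion itself.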
+ +After GEMMA LMM completed its run we set up the HK run which should reflect reaper. + +# On bimodality (of trait values) + +Kurtosis is not a great predictor of bimodality. + +=> https://aldenbradford.com/bimodality.html + +Rob says that for the BXD bimodality works best. Maybe annotate with + +=> https://skeptric.com/dip-statistic/ + +We'll skip it for now - I added a task above. + +# Combine results + +First we upload the data into virtuoso after dropping the old graph. We can do that again, now introducing new sub graphs + +``` +rapper -i turtle test.ttl > test.n3 +guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891 +SQL> log_enable(3,1); +SQL> DELETE FROM rdf_quad WHERE g = iri_to_id ('http://pjotr.genenetwork.org'); +SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o }; + 0 +SQL> ld_dir('/export/data/virtuoso/ttl','test.n3','http://lmm2.genenetwork.org'); + Done. -- 3 msec. +# for testing the validity and optionally deleting problematic ones: +SQL> SELECT * FROM DB.DBA.load_list; +SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ; +# commit changes +SQL> rdf_loader_run (); +SQL> checkpoint; +Done. -- 16 msec. +SQL> SPARQL SELECT count(*) FROM <http://pjotr.genenetwork.org> WHERE { ?s ?p ?o }; + 1576102 +``` + +and after HK we are at 6838444 triples for this exercise. 
Note that you can clean up the load list with + +``` +DELETE from DB.DBA.LOAD_LIST; +``` + + +Let's list all the tissues we have with + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX owl: <http://www.w3.org/2002/07/owl#> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> +PREFIX skos: <http://www.w3.org/2004/02/skos/core#> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> +PREFIX qb: <http://purl.org/linked-data/cube#> +PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> + +SELECT * WHERE { ?s rdf:type gnc:tissue . ?s rdfs:label ?o . } + +"http://genenetwork.org/id/tissueA1c" "Primary Auditory (A1) Cortex mRNA" +"http://genenetwork.org/id/tissueAcc" "Anterior Cingulate Cortex mRNA" +"http://genenetwork.org/id/tissueAdr" "Adrenal Gland mRNA" +"http://genenetwork.org/id/tissueAmg" "Amygdala mRNA" +"http://genenetwork.org/id/tissueBebv" "Lymphoblast B-cell mRNA" +"http://genenetwork.org/id/tissueBla" "Bladder mRNA" +(...) +``` + +Two other quick queries confirm that our data is loaded correctly. One quick test we would want to do is to see if all reaper hits overlap with GEMMA_HK. That would be a comfort. + +The reaper hits are found with + +``` +SELECT * WHERE { + ?s gnt:belongsToGroup gn:setBxd; + gnt:traitId ?id; + gnt:locus ?locus; + gnt:lodScore ?lrs; + dct:description ?descr. 
+} limit 50 +``` + +The HK hits are defined as + +``` +gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedTrait; + rdfs:label "GEMMA_BXDPublish output/trait-BXDPublish-1-gemma-GWA-hk.assoc.txt trait HK mapped"; + gnt:GEMMA_HK true; + gnt:belongsToGroup gn:setBxd; + gnt:trait gn:publishXRef_1; + gnt:time "2025-08-25 10:14:23 +0000"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "1"; + skos:altLabel "BXD_1". +gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rsm10000005699 ; + gnt:lodScore 5.6 . +gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rs47899232_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rs47899232 ; + gnt:lodScore 5.6 . +``` + +So the hits can be listed as + +``` +SELECT count(*) WHERE { + ?reaper gnt:belongsToGroup gn:setBxd; + gnt:traitId ?traitid; + gnt:locus ?locus; + gnt:lodScore ?lrs . + ?gemma gnt:mappedSnp ?id2; + gnt:locus ?locus; + gnt:lodScore ?lrs2. + ?id2 gnt:name "BXDPublish" ; + gnt:GEMMA_HK true; + gnt:traitId ?traitid. +} limit 5 +``` + +Unfortunately I made a mistake mapping the SNPs. This should have linked back. So instead of: + +``` +gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; +``` + +I should have generated + +``` +gn:rsm10000005699_HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:HK_output_trait_BXDPublish_1_gemma_GWA_hk_assoc_txt ; + +``` + +Doh! These SNPs are dangling now. Bit hard to see sometimes with these identifiers. OK, set up another rdf generation run. +Now I see it show an error for a few traits, e.g. 
+ +``` +./bin/gemma2rdf.rb:74:in "initialize": No such file or directory @ rb_sysopen - ./tmp/trait-BXDPublish-18078-gemma-GWA-hk.assoc.txt (Errno::ENOENT) +``` + +For later (again) as the majority is coming through. + +``` +SQL> ld_dir('/export/data/virtuoso/ttl','gemma-GWA-hk.ttl','http://hk.genenetwork.org'); +SQL> rdf_loader_run (); +SQL> SPARQL SELECT count(*) FROM <http://hk.genenetwork.org> WHERE { ?s ?p ?o }; + 5262347 +``` + +Try again + +``` +SELECT count(*) WHERE { + ?reaper gnt:belongsToGroup gn:setBxd; + gnt:traitId ?traitid; + gnt:locus ?locus; + gnt:lodScore ?lrs . + ?trait gnt:GEMMA_HK true; + gnt:traitId ?traitid. + # filter(?lrs2 >= 4.0). + ?snp gnt:mappedSnp ?trait ; + gnt:locus ?locus ; + gnt:lodScore ?lrs2 . +} +"traitid","locus","lrs","lrs2" +"21188","http://genenetwork.org/id/Rs31400538",2.73982,3.42 +"21194","http://genenetwork.org/id/Rs29514307",3.94845,4.7 +"21199","http://genenetwork.org/id/Rs50530980",2.60066,3.27 +"21203","http://genenetwork.org/id/Rs13483656",2.57406,3.24 +"21205","http://genenetwork.org/id/Rsm10000000057",2.90985,3.6 +"21210","http://genenetwork.org/id/Rsm10000000182",2.67097,3.34 +"21217","http://genenetwork.org/id/Rs29525970",3.80402,4.54 +"21220","http://genenetwork.org/id/Rs46586055",2.50946,3.17 +"21221","http://genenetwork.org/id/Rs47967883",2.54473,3.21 +"21223","http://genenetwork.org/id/Rs29327089",3.94623,4.69 +"21230","http://genenetwork.org/id/Rs30026335",2.78151,3.46 +"21238","http://genenetwork.org/id/Rs32170136",2.83393,3.52 +"21267","http://genenetwork.org/id/Rsm10000000063",2.54818,3.21 +``` + +counts 9261 overlapping SNPs. So, about 4000 traits are not mapping exactly. Also interesting is that GEMMA HK LRS/LOD is consistently higher than reaper. + +For the non-overlapping traits we find, for example 10023, has no significant HK hit. For GEMMA_HK it is simply ignored and for reaper Bonz included the lodScore of 1.77. If we count the significant hits for reaper LOD>3.0 we find 4541 hits. 
Out of these 4506 hits overlap with GEMMA_HK. That is perfect! + +``` +SELECT ?traitid WHERE { + ?reaper gnt:belongsToGroup gn:setBxd; + gnt:traitId ?traitid; + gnt:locus ?locus; + gnt:lodScore ?lrs . + ?trait gnt:GEMMA_HK true; + gnt:traitId ?traitid. + filter(?lrs >= 3.0). + ?snp gnt:mappedSnp ?trait ; + gnt:locus ?locus ; + gnt:lodScore ?lrs2 . +} +``` + +Essentially every reaper result is replicated in GEMMA_HK and now we have all SNPs that can be compared against the LMM results. + +# On Normality + +But first we want to take a look at normality for the datasets, now that we have stored nind, mean, std, skew and kurtosis. At this stage let's just count datasets. So, out of 13427 GEMMA LMM traits 12416 have more than 16 individuals. When looking at abs(skew)<0.8 we have 7691 fairly normal traits. Adding an abs(kurtosis)<1.0 we have 6289 traits. So about half of them are fairly normal. So if we quantile normalize these vectors it may have some impact. Let that be another task I add above (run gemma with qnorm). + +The query was + +``` +SELECT count(*) WHERE { + ?trait gnt:loco true; + gnt:traitId ?traitid; + gnt:nind ?nind; + gnt:skew ?skew; + gnt:kurtosis ?kurtosis. + filter(?nind > 16 and abs(?skew) < 0.8 and abs(?kurtosis) < 1.0). +} LIMIT 40 +``` + +# Pubmed + +As an aside, I made an interesting discovery. Some of the pubmed IDs that I thought were wrong may actually be OK. Maybe Bonz did some screening because his RDF differs from what is in MySQL. + +# Preparing for comparison + +OK, we are finally at the point where we can compare LMM results with HK (read reaper). This is a 'set analysis' because we want to see what SNPs differ between the two results for every trait and highlight those where peaks are different. We have captured in RDF all the SNPs that are considered (fairly) significant for both LMM and HK. + +The easiest way is to capture all SNPs and write the analysis in code. 
There may be a way to do this in SPARQL but it will take me more time and we'll end with less flexibility. Now there are two main ways to go about it. I can dump a table with all SNPs using SPARQL itself and process the tabular data (this, btw, may be a good input for AI). Another option is to use an RDF library and parse the RDF triples directly (without Virtuoso in the middle). That should allow for quicker processing and also a shorter turnaround if I need to modify RDF (the process of updating, uploading, checking and writing SPARQL queries is quite long). There is one thing in writing software that is very important: you want a quick turnaround, otherwise you are just staring at a prompt ;). So it pays to learn these short cuts. It also allows accessing lmdb files and even SQL if useful. +Note that we can still *also* use SPARQL to output RDF triples. So if we want more powerful filtering and/or add metadata it will all work. + +## Reading RDF + +So, I wrote a first script to digest our RDF from GEMMA. The RDF library in Guix is a bit old, so we have to upgrade that in Guix. + +For testing I created a small TTL file and converted it to N3 with rapper. + +``` +rapper -i turtle test-2000.ttl > test-2000.n3 +``` + +What we want to do is walk the dataset and harvest SNPs that belong to a run. As a start. + +First I needed to add the relevant RDF packages to Guix. 
+ +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=fcbe2919a1e4b168e8ec9ac995a6512360d56ac8 + +The following code fetches all traits with all SNPs: + +``` + graph = RDF::Graph.load(fn) + datasets = graph.query(RDF::Query.new { + pattern [:dataset, RDF.type, GNT.mappedTrait] + }) + datasets.each { |trait| + p "-------" + p trait.dataset + snps = graph.query(RDF::Query.new { + pattern [ :snp, GNT.mappedSnp, trait.dataset ] + }) + p snps + } +``` + +Resulting in + +``` +"-------" +#<RDF::URI:0x9ec0 URI:http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10007_gemma_GWA_7c00f36d> +[#<RDF::Query::Solution:0x9ed4({:snp=>#<RDF::URI:0x9ee8 URI:http://genenetwork.org/id/Rsm10000005697_BXDPublish_10007_gemma_GWA_7c00f36d>})>] +``` + +At the next step we want to do somewhat more sophisticated queries. This thing has SPARQL support with the graph in RAM, but I want to try the native interface first. + +The first hurdle was that loading RDF triples is extremely slow. So I wanted to try the RDF Raptor C extension, but that sent me down a temporary Guix rabbit hole because nss-certs moved. Also the raptor gem was ancient, and was showing errors, so I updated to the latest github code. + +Anyway guix-bioinformatics was updated to support that. Next I tried loading with raptor and that made the difference. At least the triples are read in minutes rather than hours, but the next step, building the large graph, takes a lot of time too. This sucks. + +Creating and inspecting each statement is fast enough. A statement looks like: + +``` +#<RDF::Statement:0x7a8(<http://genenetwork.org/id/HK_trait_BXDPublish_10001_gemma_GWA_hk_assoc_txt> <http://genenetwork.org/term/trait> <http://genenetwork.org/id/publishXRef_10001> .)> +``` + +So, rather than including all triples, we first filter out the ones we are not interested in and that speeds things up. That worked until I included all SNPs. Are we delivered here? These libraries may be too slow. Analysing 200K triples took forever. 
Constructing the graph through an enumerator is a really slow step. The graph query is also slow. But adding the raptor read triples to an array only took 7s. That makes it pretty clear we should process the 'raw' data directly. + +The current script collects all SNPs by GEMMA trait: + +``` +time ./bin/rdf-analyse-gemma-hits.rb test.nt +Parsing test.nt... + +real 0m12.314s +user 0m12.117s +sys 0m0.196s +``` + +Next stop we make it a set and do the same for HK. And we can do set analysis. The first round is pretty impressive, it looks like trait 10001 has exactly the same SNPs for HK and GEMMA. That is a nice confirmation. Actually 10001 is an interesting test case because in GN you can see HK and GEMMA find different secondary peaks: + +=> https://genenetwork.org/show_trait?trait_id=10001&dataset=BXDPublish + +At the GEMMA threshold we set (LOD>4.0) all hits are on chr8 and they overlap with HK. Down the line we could look at lower values, but let's stick with this for now. + +For 10004 we find some different SNPs. 
The mapping looks similar in GN: + +=> https://genenetwork.org/show_trait?trait_id=10001&dataset=BXDPublish + +The difference is: + +``` +["10004", #<Set: {#<RDF::URI:0x1a18 URI:http://genenetwork.org/id/Rs47899232>, #<RDF::URI:0x1a54 URI:http://genenetwork.org/id/Rsm10000005699>, #<RDF::URI:0xf78 URI:http://genenetwork.org/id/Rsm10000005700>, #<RDF::URI:0xf3c URI:http://genenetwork.org/id/Rs32133186>, #<RDF::URI:0xf00 URI:http://genenetwork.org/id/Rs32818171>, #<RDF::URI:0xec4 URI:http://genenetwork.org/id/Rsm10000005701>, #<RDF::URI:0xe88 URI:http://genenetwork.org/id/Rsm10000005702>, #<RDF::URI:0xdd4 URI:http://genenetwork.org/id/Rsm10000005703>, #<RDF::URI:0xfb4 URI:http://genenetwork.org/id/Rs33490412>, #<RDF::URI:0xff0 URI:http://genenetwork.org/id/Rs3661882>, #<RDF::URI:0x102c URI:http://genenetwork.org/id/Rsm10000005704>, #<RDF::URI:0x1068 URI:http://genenetwork.org/id/Rs32579649>, #<RDF::URI:0x10a4 URI:http://genenetwork.org/id/Rsm10000005705>}>] +``` + +This locus Rs47899232 is not in my test set, so it looks like it is under the threshold. If you look at Chr8 you can see the GEMMA hit shifted somewhat to the right from HK Chr8: 68.799000 to LOCO Chr8: 95.704608. The LOCO hit is also visible in HK, but dropped below significance. + +So we can do this analysis now! But just looking at SNPs is going to be laborious. At this stage we are mostly interested in the highest peak and whether it changed. What we need to do is capture regions, i.e. the chromosome positions, and map out if they moved. + +In the next phase I am going to take all SNP positions and map their region (+- 10,000 bps). For every trait we'll have a list of *regions* linked to significant hits. If these regions differ then the peaks differ, and we can highlight them. + +# Getting SNPs and their positions + +To get SNPs and their positions a simple SPARQL query will do. Bonz has created a TTL, e.g. + +``` +gn:Rs47899232 rdf:type gnc:Genotype . +gn:Rs47899232 rdfs:label "rs47899232" . 
+gn:Rs47899232 gnt:chr "8" . +gn:Rs47899232 gnt:mb "95.704608"^^xsd:double . +gn:Rs47899232 gnt:belongsToSpecies gn:Mus_musculus . +gn:Rs47899232 gnt:chrNum "0"^^xsd:int . +gn:Rsm10000005700 rdf:type gnc:Genotype . +gn:Rsm10000005700 rdfs:label "rsm10000005700" . +gn:Rsm10000005700 gnt:chr "8" . +gn:Rsm10000005700 gnt:mb "95.712996"^^xsd:double . +gn:Rsm10000005700 gnt:belongsToSpecies gn:Mus_musculus . +gn:Rsm10000005700 gnt:chrNum "0"^^xsd:int . +``` + +A few things are a bit puzzling, but at this stage what we mostly care about are the identifier, label, chr and mb. GN, for some reason, tracks mb as a floating point. I don't like that, but it will work for tracking positions. To get a table we use the following query: + +``` +SELECT * WHERE { + ?snp a gnc:Genotype; + gnt:belongsToSpecies gn:Mus_musculus ; + rdfs:label ?name ; + gnt:chr ?chr ; + gnt:mb ?mb . + +} +``` + +we save that as a TSV and have 120K SNPs formatted like: + +``` +"http://genenetwork.org/id/Rs47899232" "rs47899232" "8" 95.7046 +``` + +# Ranges + +In the next step we want to define peak ranges. It would be nice to visualize them as a line, e.g. for HK and LOCO: + +``` +Chr 1 2 3 ... +HK ---X-------------------X----- +LOCO ---X----X--------------X----- +``` + +That way we can see that a peak appeared on Chr 1. Down the line we can use the same info to compare traits A and B: + +``` +Chr 1 2 3 ... +A ---X-------------------X----- +B ---X------------------------- +``` + +where we see some chromosome area is shared. Rob sent me this nice 2008 paper: + +=> https://pubmed.ncbi.nlm.nih.gov/19008955/ + +which states that a remarkably diverse set of traits maps to a region on mouse distal chromosome 1 (Chr 1) that corresponds to human Chr 1q21-q23. This region is highly enriched in quantitative trait loci (QTLs) that control neural and behavioral phenotypes, including motor behavior, escape latency, emotionality, seizure susceptibility (Szs1), and responses to ethanol, caffeine, pentobarbital, and haloperidol. 
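
The line view for HK and LOCO sketched in this section could be drawn roughly as follows. This is a minimal Python sketch and not the code used for the analysis. The function name, the chromosome lengths and the peak positions are all made up for illustration; positions are in Mb, like the gnt:mb values.

```python
def render_track(peaks, chrom_lengths_mb, width=10):
    """Draw one text line: every chromosome gets `width` cells, X marks a peak.

    peaks:            {chr: [position_in_mb, ...]}
    chrom_lengths_mb: {chr: length_in_mb}, iterated in insertion order
    """
    line = []
    for chrom, length in chrom_lengths_mb.items():
        cells = ["-"] * width
        for pos in peaks.get(chrom, []):
            # scale the position to a cell, clamping to the last cell
            idx = min(int(pos / length * width), width - 1)
            cells[idx] = "X"
        line.append("".join(cells))
    return "".join(line)

# Hypothetical lengths and peaks, only to show the output shape
lengths = {"1": 200.0, "2": 180.0}
hk = {"1": [60.0]}
loco = {"1": [60.0, 170.0]}
print("HK   " + render_track(hk, lengths))
print("LOCO " + render_track(loco, lengths))
```

With a fixed number of cells per chromosome the lines for two runs (or two traits) stay aligned, so a new or missing X stands out immediately.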
+ +And we are still doing this research today. + +Anyway, for our purposes, for each trait we have a range of SNPs. If they are close to each other they form a 'peak'. What I am going to do is combine the SNPs we are comparing into one set first. Use that to define the ranges (say within 10K BPs). Next we go back to the computed SNPs and figure out what fits a range. We will pick out those ranges that are unique to a trait. But first we'll just visualize. + +As this involves some logic we will have to do it in real code (again). First we show how many SNPs we have combined for HK+LOCO and how many differ, e.g. + +``` +["10001", 78, 0] +["10002", 208, 92] +["10003", 96, 0] +["10004", 35, 13] +["10005", 76, 0] +``` + +so, for 10001 we have 78 SNPs and the LOCO ones overlap with HK. We showed before that for every set we have the SNP ids. + +For the first time in this exercise I have to write some real new code (before I was just tying together existing work and fixing bugs on the fly). The reason is that we have to track QTL peak ranges by inserting SNP positions. Not only that, we also need to make sure that these ranges do not overlap and build faithfully. For example, the order of adding SNPs matters - we grow a range by adding SNPs on the same chromosome. If a SNP falls out of range (e.g. 10K BPs away) we create a new range. But when another SNP falls in the middle we need to merge them into one range (or peak). This requires some logic and I am creating a new module for it. + +The current code creates the following peaks on chr1: + +``` +@chromosome={"1"=>[#<QRange 𝚺14 173.339..173.679>, #<QRange 𝚺9 175.615..176.205>, #<QRange 𝚺2 174.541..174.679>, #<QRange 𝚺7 175.437..176.032>, #<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺10 179.862..180.284>, #<QRange 𝚺22 181.476..183.154>, #<QRange 𝚺9 179.916..180.412>, #<QRange 𝚺4 177.555..177.901>, #<QRange 𝚺29 171.749..173.532>, #<QRange 𝚺8 171.172..172.175>] +``` + +The sigma tells you how many SNPs are in there. 
There is some overlap, so I need to fix that. When I set the distance at 50,000 BPS we get too many peaks. We need some other heuristic to decide what is a peak and what not. Probably look at the direction the significance is going. I.e. when it drops and rises again we may have a local peak. Would be nice to track those as separate ranges. + +Rob suggested a bin size of 500,000 BPs for the BXD. Let's try that first. This results in an orderly combined LOCO+HK results for trait 10002: + +``` +#<QTL::QRanges:0x00007f99f277c840 @chromosome={"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]}> +``` + +Next we do this for LOCO and HK separately: + +``` +[10002,combined] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]} +[10002,HK] =>{"1"=>[#<QRange 𝚺14 179.862..181.546>], "8"=>[#<QRange 𝚺102 94.3743..112.929>]} +[10002,LOCO] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771>, #<QRange 𝚺91 171.172..183.154>], "8"=>[#<QRange 𝚺32 94.4792..97.3382>]} +["10003", 96, 0] +["10004", 35, 13] +[10004,combined] =>{"8"=>[#<QRange 𝚺35 68.7992..97.3516>]} +[10004,HK] =>{"8"=>[#<QRange 𝚺22 68.7992..74.9652>]} +[10004,LOCO] =>{"8"=>[#<QRange 𝚺13 95.6926..97.3516>]} +``` + +Resulting in a new QTL for 10002,LOCO. And with 10004 we see the QTL shift to the right. Nice! + +We'll want to track the LOD score too, so let's load that using the RDF file we parse anyway. + +``` +[10002,HK] =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]} +[10002,LOCO] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]} +[10004,HK] =>{"8"=>[#<QRange 𝚺22 68.7992..74.9652 LOD=3.14..3.23>]} +[10004,LOCO] =>{"8"=>[#<QRange 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]} +``` + +Speaks for itself. 
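The range building described above (grow a range while SNPs stay within a set distance, start a new one otherwise, and merge when a SNP bridges two ranges) can be sketched in Ruby as follows. This is only a minimal illustration under my own assumptions, not the code in lib/qtlrange.rb: the `Peak` struct, the `build_peaks` helper, the 0.5 Mb default (the 500,000 BPs Rob suggested) and the SNP positions are all made up for the example.

```ruby
# Hypothetical sketch, NOT the implementation in lib/qtlrange.rb.
# A peak is a closed interval of SNP positions (in Mb) on one chromosome.
Peak = Struct.new(:start, :stop, :snps)

# Insert SNP positions one at a time. A SNP joins every peak it lies within
# max_dist of; when it touches several peaks they are merged into one.
def build_peaks(positions, max_dist = 0.5)
  peaks = []
  positions.each do |pos|
    near, far = peaks.partition do |pk|
      pos >= pk.start - max_dist && pos <= pk.stop + max_dist
    end
    merged = Peak.new(pos, pos, 1)
    near.each do |pk|
      merged.start = [merged.start, pk.start].min
      merged.stop  = [merged.stop, pk.stop].max
      merged.snps += pk.snps
    end
    peaks = far + [merged]
  end
  peaks.sort_by(&:start)
end

# Made-up positions: 72.9 arrives last and bridges the peaks at 72.5 and 73.3
peaks = build_peaks([72.5, 180.0, 73.3, 72.9])
peaks.each { |pk| puts "#{pk.snps} SNPs #{pk.start}..#{pk.stop}" }
```

Because a new SNP is compared against all existing peaks, the result should not depend on the order in which SNPs arrive, which is the property the notes above ask for.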
+ +# Analyzing peaks + + + +Now we have the peaks for different runs (HK and LOCO). We would like to see how many of the traits are affected - gaining or losing or moving peaks. Also, before we introduce the GEMMA values to GN, we would like to assess how many of the peaks are really different. + +With the above example we can see that 10002 gained a peak on chr1. With 10004 we see that the peak on chr8 shifted position. These are the things we want to capture. Also we want to bring back some metadata to show what the trait is about. Finally we want to point to the full vector lmdb file which I forgot to include in the original parsing though I did include the hash, e.g. + +``` +gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedTrait; + rdfs:label "GEMMA BXDPublish trait 10001 mapped with LOCO (defaults)"; + gnt:trait gn:publishXRef_10001; + gnt:loco true; + gnt:time "2025/08/24 08:22"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "10001"; +``` + +I should add + +``` + gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10001-gemma-GWA.tar.xz" + gnt:hostname "balg01" +``` + +so we can find it back easily. + +Next step is to say something about the peaks. Let's enrich our RDF store to show these results. Basically for 10002 we can add RDF statements for + +``` +[10002,HK] =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]} +[10002,LOCO] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]} +``` + +e.g. + +``` +gn:qtl00001_LOCO + gnt:qtlChr "1"; + gnt:qtlStart 72.2551 ; + gnt:qtlStop 73.3771 ; + gnt:qtlLOD 5.1 ; + gnt:SNPs 15 ; +gn:qtl00002_LOCO + gnt:qtlChr "1"; + gnt:qtlStart 171.172 ; + gnt:qtlStop 183.154 ; + gnt:qtlLOD 5.3 ; + gnt:SNPs 91 ; + gnt:qtlOverlaps gn:qtl00001_HK. +``` + +This way, in SPARQL, we can query all QTL that are not in HK.
For the QTL that are in HK we can also see if they shifted. Actually for SPARQL we don't really need the last statement - it is just a convenience. We will also add the actual SNP identifiers so the SNP counter is not really necessary either (let SPARQL count): + +``` +gn:QTL_CHR1_722551_GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d + gnt:mappedQTL gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d + rdfs:label "GEMMA BXDPublish LOCO QTL on 1:722551 trait 10002"; + gnt:qtlChr "1"; + gnt:qtlStart 72.2551 ; + gnt:qtlStop 73.3771 ; + gnt:qtlLOD 5.1 ; + gnt:qtlSNP gn:Rs13475920_BXDPublish_10002_gemma_GWA_7c00f36d + gnt:qtlSNP gn:Rs31428112_BXDPublish_10002_gemma_GWA_7c00f36d + (...) +``` + +I have two things to solve now. First we need to check whether QTLs between the two runs overlap. And then there is a bug in the QTL computation from SNP positions. I am seeing some inconsistencies wrt binning. + +The problem I was referring to yesterday turns out to be alright. I thought that when I was using the combined SNPs from HK and LOCO that there was only one peak. But there are two: + +``` +[10002,combined] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=..>, #<QRange 𝚺91 171.172..183.154 LOD=..>]}, +[10002,HK] =>{"1"=> #<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], +[10002,LOCO] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>] +``` + +It is interesting to see that HK misses out on one peak completely and the second peak completely overlaps with LOCO (including all SNPs). All good, so far. OK. 
Let's add some logic to see what peaks match or don't match: + +``` +[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]} +[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]} +["10002: NO HK match for LOCO Chr 1 QTL!", #<QRange Chr1 𝚺15 72.2551..73.3771 LOD=4.0..5.1>] +[10004,HK] =>{"8"=>[#<QRange Chr8 𝚺22 68.7992..74.9652 LOD=3.14..3.23>]} +[10004,LOCO] =>{"8"=>[#<QRange Chr8 𝚺13 95.6926..97.3516 LOD=4.1..4.6>]} +["10004: NO HK match for LOCO Chr 8 QTL!", #<QRange Chr8 𝚺13 95.6926..97.3516 LOD=4.1..4.6>] +``` + +So 10002 correctly says there is a new QTL on chr1 and for 10004 a new QTL on chr8. Now, for 10004 it appears the HK version is in a different location, but I think it suffices to point out 'apparently' new QTL. + +Alright, so we can now annotate new/moved QTL! We are going to feed this back into virtuoso by writing RDF as I showed yesterday. + +Next step is to say something about the peaks. Let's enrich our RDF store to show these results. Basically for 10002 we add RDF statements for + +``` +[10002,HK] =>{"1"=>[#<QRange 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange 𝚺102 94.3743..112.929 LOD=3.1..5.57>]} +[10002,LOCO] =>{"1"=>[#<QRange 𝚺15 72.2551..73.3771 LOD=4.0..5.1>, #<QRange 𝚺91 171.172..183.154 LOD=4.5..5.3>], "8"=>[#<QRange 𝚺32 94.4792..97.3382 LOD=4.5..4.8>]} +``` + +E.g. + +``` +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 + gnt:mappedQTL gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d; + rdfs:label "GEMMA BXDPublish QTL"; + gnt:qtlChr "8"; + gnt:qtlStart 94.4792 ; + gnt:qtlStop 97.3382 ; + gnt:qtlLOD 4.8 . +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rsm10000005689_BXDPublish_10002_gemma_GWA_7c00f36d . 
+gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rs232396986_BXDPublish_10002_gemma_GWA_7c00f36d . +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr8_94_97 gnt:mappedSnp gn:Rsm10000005690_BXDPublish_10002_gemma_GWA_7c00f36d . +(...) +``` + +and if it is a new QTL compared to HK we annotate a newly discovered QTL: + +``` +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_1_72_73 a gnt:newlyDiscoveredQTL . +gn:GEMMAMapped_LOCO_BXDPublish_10004_gemma_GWA_7c00f36d_8_96_97 a gnt:newlyDiscoveredQTL . +``` + +Note we skipped the results that show no SNP changes - I should add them later to give full QTL cover. + +Code is here: + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/rdf-analyse-gemma-hits.rb +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/lib/qtlrange.rb + +Now we have all the RDF to figure out what traits have new QTL compared to reaper! +I'll upload them in virtuoso for further analysis. + +I want to do a run that shows what traits have changed QTLs. +Basically the command is + +``` +./bin/rdf-analyse-gemma-hits.rb test-hk-2000.ttl test-2000.ttl -o RDF +``` + +let's try to run with the full ttl files. Actually I converted them to n3 because of some error: + +``` +rapper --input turtle gemma-GWA.ttl > gemma-GWA.n3 +rapper --input turtle gemma-GWA-hk.ttl > gemma-GWA-hk.n3 +time ./bin/rdf-analyse-gemma-hits.rb gemma-GWA-hk.n3 gemma-GWA.n3 > test.out +real 3m21.979s +user 3m21.076s +sys 0m0.716s +``` + +3.5 minutes is fine for testing stuff (if already a little tedious). The first run failed because I have renamed GEMMA_HK to GemmaHK. 
Another bug I hit was with: + +``` +[10009,HK] =>{"15"=>[#<QRange Chr15 𝚺30 25.6987..74.5398 LOD=3.01..3.27>]} +[10009,LOCO] =>{"10"=>[#<QRange Chr10 𝚺1 76.2484..76.2484 LOD=3.5..3.5>]} +/export/local/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/lib/qtlrange.rb:126:in `block (2 levels) in qtl_diff': undefined method `each' for nil (NoMethodError) +``` + +There are a few more bugs to fix - mostly around empty results, e.g. if a trait had no SNPs. Also HK would render an infinite lodScore (`gnt:lodScore Infinity`) and that reduced the result set. I set a LOD of infinity to 99.0. So at least it'll stand out. After fixing it the run takes 12 minutes, a lot slower than 3.5 minutes! Still OK, for now. + +The first run shows 7943 new QTL. Turns out that a bunch of them are non-significant, so we need to filter those. Remember we kept the highest hit, even if significance was low. A quick filter shows that with LMM 2802 traits show new QTLs (out of 13K). Out of those 1984 traits did not compute a QTL at all with HK. That looks exciting, but we need to validate. Let's take a look at + +``` +[10727,HK] =>{} +[10727,LOCO] =>{"15"=>[#<QRange Chr15 𝚺9 62.3894..63.6584 LOD=4.4..4.4>]} +["10727: NO HK match for LOCO Chr 15 QTL!", [#<QRange Chr15 𝚺9 62.3894..63.6584 LOD=4.4..4.4>]] +``` + +=> https://genenetwork.org/show_trait?trait_id=10727&dataset=BXDPublish + +That looks correct to me. Rob you may want to check. And another: + +``` +[51064,HK] =>{"10"=>[#<QRange Chr10 𝚺12 92.3035..108.525 LOD=3.08..4.15>], "19"=>[#<QRange Chr19 𝚺34 8.93047..34.2017 LOD=3.06..3.41>], "3"=>[#<QRange Chr3 𝚺5 138.273..138.581 LOD=3.06..3.06>], "X"=>[#<QRange ChrX 𝚺5 160.766..163.016 LOD=3.48..3.48>]} +[51064,LOCO] =>{"19"=>[#<QRange Chr19 𝚺37 29.9654..34.2017 LOD=4.3..5.5>]} +``` + +=> https://genenetwork.org/show_trait?trait_id=51064&dataset=BXDPublish + +Looks correct. With HK we see QTL on Chr 3,10,19 and X. On GN LMM we see a whopper on chr 19, as well as X.
I need to see why GEMMA is not finding that X in precompute! Made a note of that too. + +# Updating RDF + +Now that we have QTL output we can upload that to RDF. + +To make the traits accessible we need to add some metadata on the trait description, publication and authors. All this information can also be used to build a UI. + +For this I am going to regenerate the RDF without running gemma again to make sure it is complete and mark the new QTL. One change is that if a LOD is infinite we set it to 99.1. The number will stand out. The idea is that when a P-value ends up rounded to zero we can pick it up easily as a conversion. This turns out to be relevant, for example: + +``` +gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedTrait; + rdfs:label "GEMMA_BXDPublish ./tmp/trait-BXDPublish-13032-gemma-GWA-hk.assoc.txt trait HK mapped"; + gnt:GEMMA_HK true; + gnt:belongsToGroup gn:setBxd; + gnt:trait gn:publishXRef_13032; + gnt:time "2025-08-27 06:44:45 +0000"; + gnt:name "BXDPublish"; + gnt:traitId "13032"; + skos:altLabel "BXD_13032". + +gn:rsm10000005888_HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rsm10000005888 ; + gnt:lodScore Infinity . + +gn:rsm10000005889_HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt a gnt:mappedLocus; + gnt:mappedSnp gn:HK_trait_BXDPublish_13032_gemma_GWA_hk_assoc_txt ; + gnt:locus gn:Rsm10000005889 ; + gnt:lodScore Infinity . +``` + +The trait has +1 and -1 values: + +=> https://genenetwork.org/show_trait?trait_id=13032&dataset=BXDPublish + +HK on GN shows a map, but no result table. Hmmm. The SNPs listed here as Infinity don't really show in GN - and GEMMA finds no hits there. I think, on consideration, since we don't use HK other than for comparison I should just drop these results. It looks dodgy. Aha, in the GEMMA run these actually show up as not a number (NaN), so I should drop them!
+ +``` +chr rs ps n_mis n_obs allele1 allele0 af p_lrt +9 rsm10000005888 31848339 0 23 X Y 0.348 -nan +9 rsm10000005864 27578739 0 23 X Y 0.391 1.770379e-10 +``` + +Funny enough they are on the same chromosome as the highest ranking hits. + +Let's generate RDF and look at the differences: + +``` +export RDF=gemma-GWA-hk2.ttl +wrk@balg01 ~/services/gemma-wrapper [env]$ ./bin/gemma2rdf.rb --header > $RDF +wrk@balg01 ~/services/gemma-wrapper [env]$ for id in $(cat ids.txt) ; do traitfn=trait-BXDPublish-$id-gemma-GWA-hk ; ./bin/gemma2rdf.rb $TMPDIR/$traitfn.assoc.txt >> $RDF ; done +``` + +Took 43 min. The diff with the original looks good. Note I don't track origin files for this. Maybe I should, but I don't think we'll really use those. Next, generate the GEMMA LOCO RDF again + +``` +RDF=gemma-GWA.ttl +wrk@balg01 ~/services/gemma-wrapper [env]$ ./bin/gemma-mdb-to-rdf.rb --header > $RDF +time for x in tmp/*.xz ; do + ./bin/gemma-mdb-to-rdf.rb $x --anno BXD.8_snps.txt --sort >> $RDF +done +``` + +Runs in 50min for 13K traits. + +The output now points to the lmdb vector files: + +``` ++ gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10080-gemma-GWA.tar.xz"; ++ gnt:hostname "balg01"; +``` + +## Digest QTL to RDF + +In the next step we want to show the QTL in RDF. First I created a small subset for testing that I can run with + +``` +time ./bin/rdf-analyse-gemma-hits.rb test-hk-2000.n3 test-2000.n3 +``` + +It shows, for example, + +``` +gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25 + gnt:mappedQTL gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d; + rdfs:label "GEMMA BXDPublish QTL"; + gnt:qtlChr "4"; + gnt:qtlStart 24.7356 ; + gnt:qtlStop 24.7356 ; + gnt:qtlLOD 3.6 . +gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25 gnt:mappedSnp gn:Rsm10000001919_BXDPublish_10012_gemma_GWA_7c00f36d . +gn:GEMMAMapped_LOCO_BXDPublish_10012_gemma_GWA_7c00f36d_QTL_Chr4_25_25 a gnt:newQTL .
+``` + +in other words a QTL with LOD 3.6 and a single SNP that is new compared to the HK output. We want to annotate a bit more, because I want to show the maximum allele frequency contained by the SNPs. That is not too hard as it is contained in the mapped SNP info: + +``` +gn:Rsm10000005700_BXDPublish_10001_gemma_GWA_7c00f36d a gnt:mappedLocus; + gnt:mappedSnp gn:GEMMAMapped_LOCO_BXDPublish_10001_gemma_GWA_7c00f36d; + gnt:locus gn:Rsm10000005700; + gnt:lodScore 6.2; + gnt:af 0.382; + gnt:effect 1.626. +``` + + +With precompute I added allele frequencies to the QTL. So for trait 10002 we get: + +``` +[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]} +[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]} +``` + +and with RDF: + +``` +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 + gnt:mappedQTL gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d; + rdfs:label "GEMMA BXDPublish QTL"; + gnt:qtlChr "1"; + gnt:qtlStart 72.2551 ; + gnt:qtlStop 73.3771 ; + gnt:qtlAF 0.574 ; + gnt:qtlLOD 5.1 . +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rsm10000000582_BXDPublish_10002_gemma_GWA_7c00f36d . +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rsm10000000583_BXDPublish_10002_gemma_GWA_7c00f36d . +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 gnt:mappedSnp gn:Rs37034472_BXDPublish_10002_gemma_GWA_7c00f36d . +...etc... +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d_QTL_Chr1_72_73 a gnt:newQTL . +``` + +Important: we only store LOCO QTL (which we reckon are 'truth'), not the HK QTL. We also marked QTL that are *not* in HK with the gnt:newQTL annotation. 
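Deciding whether a LOCO QTL is 'new' comes down to an interval overlap test against the HK ranges on the same chromosome. A minimal sketch, assuming a stand-in `QRange` struct and a hypothetical `new_qtl` helper (neither is the actual code in lib/qtlrange.rb), using the 10002 ranges printed above:

```ruby
# Hypothetical sketch; this QRange is a stand-in for illustration only,
# not the class defined in lib/qtlrange.rb.
QRange = Struct.new(:chr, :start, :stop) do
  # Two closed intervals overlap when neither one ends before the other
  # starts. The comparisons are inclusive, so shared boundaries count.
  def overlaps?(other)
    chr == other.chr && start <= other.stop && other.start <= stop
  end
end

# Return the LOCO ranges that have no overlapping HK range.
def new_qtl(hk, loco)
  loco.reject { |l| hk.any? { |h| l.overlaps?(h) } }
end

# Ranges for trait 10002 (positions in Mb)
hk   = [QRange.new("1", 179.862, 181.546), QRange.new("8", 94.3743, 112.929)]
loco = [QRange.new("1", 72.2551, 73.3771),
        QRange.new("1", 171.172, 183.154),
        QRange.new("8", 94.4792, 97.3382)]

fresh = new_qtl(hk, loco)
fresh.each { |q| puts "new QTL on chr#{q.chr}: #{q.start}..#{q.stop}" }
```

Only the Chr 1 range at 72.2551..73.3771 lacks an HK counterpart, so that is the one that would get the gnt:newQTL annotation in this sketch.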
+ +For AF filtering we track this information on the trait: + +``` +gn:GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d a gnt:mappedTrait; + rdfs:label "GEMMA BXDPublish trait 10002 mapped with LOCO (defaults)"; + gnt:trait gn:publishXRef_10002; + gnt:loco true; + gnt:time "2025/08/24 08:22"; + gnt:belongsToGroup gn:setBxd; + gnt:name "BXDPublish"; + gnt:traitId "10002"; + gnt:nind 34; + gnt:mean 52.2206; + gnt:std 2.9685; + gnt:skew -0.1315; + gnt:kurtosis 0.0314; + skos:altLabel "BXD_10002"; + gnt:filename "c143bc7928408fdc53affed0dacdd98d7c00f36d-BXDPublish-10002-gemma-GWA.tar.xz"; + gnt:hostname "balg01"; + gnt:user "wrk". +``` + +So, for the first QTL, an AF of 0.574 means the minor allele is carried by (1-0.574)*34 = 14 out of 34 individuals, which is great. When we get to 1 or 2 individuals it may be kinda dodgy. For a dataset this size the AF threshold should be 0.06 (and 0.94). If we have 15 individuals we should be closer to 0.1 (0.9). Anyway, we can compute these on the fly in SPARQL. I would rather show too many false positives. + +Also note that AF is not a problem with our BXD genotyping. Even so, we are going to use pangenome genotypes next and it will be important for that. + +Let's do a full QTL compute with + +``` +time ./bin/rdf-analyse-gemma-hits.rb gemma-GWA-hk2.n3 gemma-GWA.n3 -o RDF > QTL.rdf +``` + +And we should have the queryable mapped QTL we wished for! But some inspection shows: + +``` +[10015,HK] =>{"12"=>[#<QRange Chr12 𝚺2 3.2..9.74252 LOD=3.74..3.74>], "2"=>[#<QRange Chr2 𝚺259 4.03246..52.4268 LOD=3.11..16.01>]} +[10015,LOCO] =>{"2"=>[#<QRange Chr2 𝚺256 4.03246..57.8635 AF=0.542 LOD=4.0..15.2>]} +["10015: NO HK match, QTL LOCO Chr 2!", #<QRange Chr2 𝚺256 4.03246..57.8635 AF=0.542 LOD=4.0..15.2>] +``` + +which is strange because there is overlap on that particular QTL Chr2! They are obviously the same. A subtle bug.
Instead of + +``` +- return true if qtl.min > @min and qtl.max < @max +- return true if qtl.min < @min and qtl.max > @min +- return true if qtl.min < @max and qtl.max > @max +``` + +I now have: + +``` ++ return true if qtl.min >= @min and qtl.max <= @max # qtl falls within boundaries ++ return true if qtl.min <= @min and qtl.max >= @min # qtl over left boundary ++ return true if qtl.min <= @max and qtl.max >= @max # qtl over right boundary + +``` + +I had to include the boundaries themselves. + +Now we also still log false positives with + +``` +[10009,HK] =>{"15"=>[#<QRange Chr15 𝚺30 25.6987..74.5398 LOD=3.01..3.27>]} +[10009,LOCO] =>{"10"=>[#<QRange Chr10 𝚺1 76.2484..76.2484 AF=0.5 LOD=3.5..3.5>]} +["10009: NO HK results, new QTL(s) LOCO Chr 10!", [#<QRange Chr10 𝚺1 76.2484..76.2484 AF=0.5 LOD=3.5..3.5>]] +``` + +note the LOD score. I should not mark new QTL that are below 4.0. Now we count 2351 new QTL and that is in line with my earlier quick counts. + +Note the current script eats RAM because it holds all LOD scores and SNPs in memory. That is fine for our 13K classical traits but will probably not work for millions of traits. It runs in 8 minutes. That is cool too. + +# Updating RDF in virtuoso + +Similar to what we did before we are going to update Virtuoso on the sparql-test server using the CLI isql commands discussed above.
+ +In August I uploaded: + +``` +SELECT * FROM DB.DBA.load_list; +/export/data/virtuoso/ttl/gemma-GWA-hk.ttl http://hk.genenetwork.org 2 2025.8.27 8:31.57 122123000 2025.8.27 8:32.6 104530000 0 NULL NULL +/export/data/virtuoso/ttl/test.n3 http://lmm2.genenetwork.org 2 2025.8.27 6:47.44 947047000 2025.8.27 6:47.49 73865000 0 NULL NULL +``` + +Also, to list all available graphs you can do + +``` +SELECT DISTINCT ?g + WHERE { GRAPH ?g {?s ?p ?o} } +ORDER BY ?g +http://genenetwork.org +http://hk.genenetwork.org +http://lmm2.genenetwork.org +``` + +The first graph is for all Bonz' RDF. I can now safely delete the other two, to start with a fresh slate. +The graph has 36584993 triples. Deleting HK remains 31322646 and LMM2 remains 29746544 triples. + +``` +ld_dir('/export/data/virtuoso/ttl','QTL.rdf','http://qtl.genenetwork.org'); +``` + +Ouch, we got an error. With the proper prefix values and renaming the file to QTL.ttl it worked with 183562 new triples! +Next we loaded the updated TTL files. HK imported 3196834 triples. LMM imported 1616383 and we total 34743323 triples. Which is less than the previous set - because we cleaned out the SNPs that had a LOD of infinite. + +After a checkpoint, time to SPARQL! This query lists all new QTL with their traits: + +``` +PREFIX gn: <http://genenetwork.org/id/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +SELECT ?trait, ?chr, ?start, ?stop, ?lod WHERE { + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?start ; + gnt:qtlStop ?stop ; + a gnt:newQTL ; + gnt:qtlLOD ?lod . + ?traitid gnt:traitId ?trait . +} LIMIT 20 + +"trait" "chr" "start" "stop" "lod" +"26116" "7" 36.9408 36.9408 4 +"26118" "2" 3.19074 4.29272 4.3 +"26118" "9" 60.6863 64.4059 4.3 +"26126" "17" 71.754 72.1374 4.7 +"26135" "15" 93.3404 94.2523 5.5 +(...) +``` + +So we list all traits that have a *NEW* QTL using GEMMA compared to HK. 
We have a few thousand trait updates that have new QTL. Let's add the number of samples/genometypes, so we can ignore the smaller sets. Or better, count them first. We simplify the query first: + +``` +SELECT count(DISTINCT ?trait) WHERE { + ?qtl a gnt:newQTL ; + gnt:mappedQTL ?traitid . + ?traitid gnt:traitId ?trait ; + gnt:nind ?nind. +} LIMIT 20 +``` + +Counts 2040 traits with at least one new QTL. When we FILTER (?nind > 16) we get 2019 traits. So only a tiny minority has fewer individuals and we can skip that filter. + +Of course we visited several traits before to see if the QTL were correct. I'll make a list for Rob to check, expanding the trait to a clickable URL: + +Let's look for the new QTL. + +``` +SELECT ?trait, ?chr, ?start, ?stop, ?lod, ?url WHERE { + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?start ; + gnt:qtlStop ?stop ; + a gnt:newQTL ; + gnt:qtlLOD ?lod . + ?traitid gnt:traitId ?trait . + BIND(REPLACE(?trait, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url) +} LIMIT 20 + +"trait" "chr" "start" "stop" "lod" "url" +"26116" "7" 36.9408 36.9408 4 "https://genenetwork.org/show_trait?trait_id=26116&dataset=BXDPublish" +"26118" "2" 3.19074 4.29272 4.3 "https://genenetwork.org/show_trait?trait_id=26118&dataset=BXDPublish" +"26118" "9" 60.6863 64.4059 4.3 "https://genenetwork.org/show_trait?trait_id=26118&dataset=BXDPublish" +"26126" "17" 71.754 72.1374 4.7 "https://genenetwork.org/show_trait?trait_id=26126&dataset=BXDPublish" +"26135" "15" 93.3404 94.2523 5.5 "https://genenetwork.org/show_trait?trait_id=26135&dataset=BXDPublish" +``` + +Now when I click the link for 26118 I can run HK and GEMMA and I can confirm we have a new result on CHR2 and CHR9. +Very cool. Now we want to show the trait info and authors, so we can see who we want to approach with this new information. + +Now in the phenotype RDF we have + +``` +gn:traitBxd_10001 rdf:type gnc:Phenotype .
+gn:traitBxd_10001 gnt:belongsToGroup gn:setBxd . +gn:traitBxd_10001 gnt:traitId "10001" . +gn:traitBxd_10001 dct:description "Central nervous system, morphology: Cerebellum weight, whole, bilateral in adults of + both sexes [mg]" . +gn:traitBxd_10001 gnt:submitter "robwilliams" . +gn:traitBxd_10001 dct:isReferencedBy pubmed:11438585 . +``` + +The submitter is mostly one of the GN team. The pubmed id may help find the authors. Bonz RDF'd it as + +``` +pubmed:11438585 rdf:type fabio:ResearchPaper . +pubmed:11438585 fabio:hasPubMedId pubmed:11438585 . +pubmed:11438585 dct:title "Genetic control of the mouse cerebellum: identification of quantitative trait loci modulatin +g size and architecture" . +pubmed:11438585 fabio:Journal "J Neuroscience" . +pubmed:11438585 prism:volume "21" . +pubmed:11438585 fabio:page "5099-5109" . +pubmed:11438585 fabio:hasPublicationYear "2001"^^xsd:gYear . +pubmed:11438585 dct:creator "Airey DC" . +pubmed:11438585 dct:creator "Lu L" . +pubmed:11438585 dct:creator "Williams RW" . +``` + +So we can fetch that when it is available. You can run the query here: + +=> http://sparql-test.genenetwork.org/sparql/ + +Just copy paste: + +``` +PREFIX dct: <http://purl.org/dc/terms/> +PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/> +PREFIX gn: <http://genenetwork.org/id/> +PREFIX gnt: <http://genenetwork.org/term/> +PREFIX gnc: <http://genenetwork.org/category/> +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> +PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> +PREFIX fabio: <http://purl.org/spar/fabio/> + +SELECT ?trait, ?chr, ?start, ?stop, ?lod, ?year, ?submitter, SAMPLE(?author as ?one_author), ?url, ?descr WHERE { + ?qtl gnt:mappedQTL ?traitid ; + gnt:qtlChr ?chr ; + gnt:qtlStart ?start ; + gnt:qtlStop ?stop ; + a gnt:newQTL ; + gnt:qtlLOD ?lod . + ?traitid gnt:traitId ?trait . 
+ OPTIONAL { ?phenoid gnt:traitId ?trait ; + a gnc:Phenotype ; + gnt:belongsToGroup gn:setBxd ; + gnt:submitter ?submitter ; + dct:description ?descr ; + dct:isReferencedBy ?pubid . } . + ?pubid dct:creator ?author ; + fabio:hasPublicationYear ?pubyear . + BIND(concat(str(?pubyear)) as ?year) + BIND(REPLACE(?trait, "(\\d+)","https://genenetwork.org/show_trait?trait_id=$1&dataset=BXDPublish") AS ?url) +} ORDER by ?trait +LIMIT 100 +"10002" "1" 72.2551 73.3771 5.1 "2001" "robwilliams" "Lu L" "https://genenetwork.org/show_trait?trait_id=10002&dataset=BXDPublish" "Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]" +"10004" "8" 95.6926 97.3516 4.6 "2001" "robwilliams" "Lu L" "https://genenetwork.org/show_trait?trait_id=10004&dataset=BXDPublish" "Central nervous system, morphology: Cerebellum volume [mm3]" +"10013" "2" 160.117 160.304 4.8 "1996" "robwilliams" "Alexander RC" "https://genenetwork.org/show_trait?trait_id=10013&dataset=BXDPublish" "Central nervous system, behavior: Saline control response 0.9% ip, locomotor activity from 0-60 min after injection just prior to injection of 5 mg/kg amphetamine [cm]" +(...) +``` + + +Currently authors are not 'ranked' in RDF, so I pick a random one. I can add ranking later, so we get the first author. We also have the option to fetch all traits that, for example, involve Dave Ashbrook. + +We can also look for details like skewness by adding + +``` + ?traitid gnt:traitId ?trait ; + gnt:skew ?skew . +``` + +# Testing pangenome derived genotypes + +We continue testing new genotypes in this document: + +=> ../genetics/test-pangenome-derived-genotypes + +# Introducing epochs + +see + +=> topics/data/epochs diff --git a/topics/systems/migrate-p2.gmi b/topics/systems/migrate-p2.gmi deleted file mode 100644 index c7fcb90..0000000 --- a/topics/systems/migrate-p2.gmi +++ /dev/null @@ -1,12 +0,0 @@ -* Penguin2 crash - -This week the boot partition of P2 crashed. 
We have a few lessons here, not least having a fallback for all services ;) - -* Tasks - -- [ ] setup space.uthsc.edu for GN2 development -- [ ] update DNS to tux02 128.169.4.52 and space 128.169.5.175 -- [ ] move CI/CD to tux02 - - -* Notes diff --git a/topics/systems/restore-backups.gmi b/topics/systems/restore-backups.gmi index 518c56d..b97af2b 100644 --- a/topics/systems/restore-backups.gmi +++ b/topics/systems/restore-backups.gmi @@ -26,7 +26,7 @@ The last backup on 'tux02' is from October 2022 - after I did a reinstall. That According to sheepdog the drops are happening to 'space' and 'epysode', but 'tux02' is missing: -=> https://rabbit.genenetwork.org/sheepdog/index.html +=> http://sheepdog.genenetwork.org/sheepdog/status.html ## Mariadb diff --git a/topics/systems/screenshot-github-webhook.png b/topics/systems/screenshot-github-webhook.png new file mode 100644 index 0000000..08feed3 --- /dev/null +++ b/topics/systems/screenshot-github-webhook.png Binary files differdiff --git a/topics/systems/security.gmi b/topics/systems/security.gmi new file mode 100644 index 0000000..a7192d4 --- /dev/null +++ b/topics/systems/security.gmi @@ -0,0 +1,61 @@ +# Security + +We secure our system by running recent stable versions of Linux distributions. We also are minimalistic on what we install and run and web services typically run in guix system containers (a kind of light weight Docker setup). + +# ssh + +Secure shell is very important. First we disable password logins. We use keys only. We also limit AllowUsers because it is flexible to remove and add users. + +``` +--- a/ssh/sshd_config ++++ b/ssh/sshd_config +@@ -54,7 +54,7 @@ Include /etc/ssh/sshd_config.d/*.conf + #IgnoreRhosts yes + + # To disable tunneled clear text passwords, change to no here! +-#PasswordAuthentication yes ++PasswordAuthentication no + #PermitEmptyPasswords no + ++AllowUsers marco daniel ... +``` + +Note that keys should be password protected. 
+ +# Firewalling + +We typically use the monitored CISCO firewalling UTHSC provides. In addition we use nftables, e.g. in /etc/nftables.conf + +``` +table inet filter { + set udp_accepted { + type inet_service + flags interval + elements = { 60000-61000 } # for mosh + } + chain input { + type filter hook input priority filter; policy drop; + ct state { established, related } accept + iifname "lo" accept + iifname "lo" ip saddr != 127.0.0.0/8 drop + tcp dport ssh limit rate 5/minute accept + tcp dport { http, https } accept + tcp dport mysql ip saddr { list of ips } accept + udp dport @udp_accepted accept + reject with icmp port-unreachable + } + chain forward { + type filter hook forward priority filter; policy accept; + } + chain output { + type filter hook output priority filter; policy accept; + } +} +``` + +Enable this with + +``` +systemctl enable nftables +nft list ruleset +``` diff --git a/topics/systems/synchronising-the-different-environments.gmi b/topics/systems/synchronising-the-different-environments.gmi new file mode 100644 index 0000000..207b234 --- /dev/null +++ b/topics/systems/synchronising-the-different-environments.gmi @@ -0,0 +1,68 @@ +# Synchronising the Different Environments + +## Tags + +* status: open +* priority: +* type: documentation +* assigned: fredm +* keywords: doc, docs, documentation + +## Introduction + +We have different environments we run for various reasons, e.g. + +* Production: This is the user-facing environment. This is what GeneNetwork is about. +* gn2-fred: production-adjacent. It is meant to test out changes before they get to production. It is **NOT** meant for users. +* CI/CD: Used for development. The latest commits get auto-deployed here. It's the first place (outside of developer machines) where errors and breakages are caught and/or revealed. This will break a lot. Do not expose to users! +* staging: Uploader environment. 
This is where Felix, Fred and Arthur flesh out the upload process, and tasks, and also test out the uploader. + +These different environments demand synchronisation, in order to have mostly similar results and failure modes. + +## Synchronisation of the Environments + +### Main Database: MariaDB + +* [ ] TODO: Describe process + +=> https://issues.genenetwork.org/topics/systems/restore-backups Extract borg archive +* Automate? Will probably need some checks for data sanity. + +### Authorisation Database + +* [ ] TODO: Describe process + +* Copy backup from production +* Update/replace GN2 client configs in database +* What other things? + +### Virtuoso/RDF + +* [ ] TODO: Describe process + +* Copy TTL (Turtle) files from (where?). Production might not always be latest source of TTL files. +=> https://issues.genenetwork.org/issues/set-up-virtuoso-on-production Run setup to "activate" database entries +* Can we automate this? What checks are necessary? + +## Genotype Files + +* [ ] TODO: Describe process + +* Copy from source-of-truth (currently Zach's tux01 and/or production). +* Rsync? + +### gn-docs + +* [ ] TODO: Describe process + +* Not sure changes from other environments should ever take + +### AI Summaries (aka. gnqna) + +* [ ] TODO: Describe process + +* Update configs (should be once, during container setup) + +### Others? + +* [ ] TODO: Describe process diff --git a/topics/systems/update-production-checklist.gmi b/topics/systems/update-production-checklist.gmi new file mode 100644 index 0000000..2cb0761 --- /dev/null +++ b/topics/systems/update-production-checklist.gmi @@ -0,0 +1,197 @@ +# Update production checklist + +Last migration round is the move to tux03 (Sept2025)! 
+ +# Tasks + +* [X] Install underlying Debian +* [X] Get guix going +* [X] Check database settings +* [X] Check gemma working +* [X] Check global search +* [X] Check authentication +* [X] Check sending E-mails +* [X] Check SPARQL +* [X] Make sure info.genenetwork.org and 'space' can reach the DB +* [ ] Backups + +The following are at the system level + +* [X] Firewalling and other security measures (sshd) +* [X] Check tmpdirs (cleanup?) +* [X] Make sure journalctl is persistent (check for reboots) +* [X] Update certificates in CRON (no longer if not part of Guix) +* [X] Run trim in CRON +* [ ] Monitors (sheepdog) + +# Install underlying Debian + +For our production systems we use Debian as a base install. Once installed: + +* [X] set up git in /etc and limit permissions to root user +* [X] add ttyS0 support for grub and kernel - so out-of-band works +* [X] start ssh server and configure it to not accept passwords +* [X] start nginx and check external networking +* [X] mount old root +* [X] Clean up /etc/profile (remove global profile.d loading) +* [X] set up E-mail routing + +It may help to mount the old root if you have it. Now it is on + +``` +mount /dev/sdd2 /mnt/old-root/ +``` + +# Get Guix going + +* [X] Bind mount /gnu on a large partition +* [X] Move /gnu/store to larger partition +* [X] Install Guix daemon +* [X] Update Guix daemon and setup in systemd (if necessary) +* [X] Make available in /usr/local/guix-profiles + +Next move the /gnu store to a large partition and hard mount it in /etc/fstab with + +``` +/export2/gnu /gnu none defaults,bind 0 0 +``` + +We can bootstrap with the Debian guix package (though I prefer the guix-install.sh script these days, mostly because it is more modern).
+ +=> https://guix.gnu.org/manual/en/html_node/Binary-Installation.html + + +Run guix pull + +``` +guix pull --url=https://codeberg.org/guix/guix -p ~/opt/guix-pull +``` + +Use that also to install guix in /usr/local/guix-profiles + +``` +guix package -i guix -p /usr/local/guix-profiles/guix +``` + +and update the daemon in systemd accordingly. After that I tend to remove /usr/bin/guix + +The Debian installer configures guix. I tend to remove the profiles from /etc/profile so people have a minimal profile. + +# Check database + +* [X] Install mariadb +* [X] Recover database +* [X] Test permissions +* [X] Mariadb update my.cnf + +Basically, recovering the database from a backup and setting permissions is the best start. We usually take the default mariadb unless production is already on a newer version - so we move to guix deployment. + +On tux02 mariadb-10.5.8 is running. On Debian it is now 10.11.11-0+deb12u1, so we should be good. On Guix it is 10.10 at this point. + +``` +apt-get install mariadb-server +``` + +Next unpack the database files and set permissions to the mysql user. And (don't forget) update the /etc/mysql config files. + +Restart mysql until you see: + +``` +mysql -u webqtlout -p -e "show databases" ++---------------------------+ +| Database | ++---------------------------+ +| 20081110_uthsc_dbdownload | +| db_GeneOntology | +| db_webqtl | +| db_webqtl_s | +| go | +| information_schema | +| kegg | +| mysql | +| performance_schema | +| sys | ++---------------------------+ +``` + +=> topics/systems/mariadb/mariadb.gmi + +## Recover database + +We use borg for backups. First restore the backup on the PCIe. Also a test for overheating! + + +# Check sending E-mails + +The swaks package is quite useful to test for a valid receive host: + +``` +swaks --to testing-my-server@gmail.com --server smtp.network +=== Trying smtp.network:25... +=== Connected to smtp.network.
+<- 220 mailrouter8.network ESMTP NO UCE + -> EHLO tux04.network +<- 250-mailrouter8.network +<- 250-PIPELINING +<- 250-SIZE 26214400 +<- 250-VRFY +<- 250-ETRN +<- 250-STARTTLS +<- 250-ENHANCEDSTATUSCODES +<- 250-8BITMIME +<- 250-DSN +<- 250 SMTPUTF8 + -> MAIL FROM:<root@tux04.network> +<- 250 2.1.0 Ok + -> RCPT TO:<pjotr2020@thebird.nl> +<- 250 2.1.5 Ok + -> DATA +<- 354 End data with <CR><LF>.<CR><LF> + -> Date: Thu, 06 Mar 2025 08:34:24 +0000 + -> To: pjotr2020@thebird.nl + -> From: root@tux04.network + -> Subject: test Thu, 06 Mar 2025 08:34:24 +0000 + -> Message-Id: <20250306083424.624509@tux04.network> + -> X-Mailer: swaks v20201014.0 jetmore.org/john/code/swaks/ + -> + -> This is a test mailing + -> + -> + -> . +<- 250 2.0.0 Ok: queued as 4157929DD + -> QUIT +<- 221 2.0.0 Bye === Connection closed with remote host +``` + +An exim configuration can be + +``` +dc_eximconfig_configtype='smarthost' +dc_other_hostnames='genenetwork.org' +dc_local_interfaces='127.0.0.1 ; ::1' +dc_readhost='' +dc_relay_domains='' +dc_minimaldns='false' +dc_relay_nets='' +dc_smarthost='smtp.network' +CFILEMODE='644' +dc_use_split_config='false' +dc_hide_mailname='false' +dc_mailname_in_oh='true' +dc_localdelivery='maildir_home' +``` + +And this should work: + +``` +swaks --to myemailaddress --from john@network --server localhost +``` + +# Backups + +* [ ] Create an ibackup user. +* [ ] Install borg (usually guix version) +* [ ] Create a borg passphrase + +=> topics/systems/backups-with-borg.gmi +=> topics/systems/backup-drops.gmi diff --git a/topics/systems/virtuoso.gmi b/topics/systems/virtuoso.gmi index e911a8b..bd7424a 100644 --- a/topics/systems/virtuoso.gmi +++ b/topics/systems/virtuoso.gmi @@ -8,6 +8,10 @@ We run instances of virtuoso for our graph databases. Virtuoso is remarkable sof ## Running virtuoso ### Running virtuoso in a guix system container +See also + +=> ../deploy/our-virtuoso-instances + We have a Guix virtuoso service in the guix-bioinformatics channel. 
The easiest way to run virtuoso is to use the virtuoso service to run it in a guix system container. The only downside of this method is that, since guix system containers require root privileges to start up, you will need root privileges on the machine you are running this on. Here is a basic guix system configuration that runs virtuoso listening on port 8891, and with its HTTP server listening on port 8892. Among other things, the HTTP server provides a SPARQL endpoint to interact with. @@ -104,11 +108,16 @@ After running virtuoso, you will want to change the default password of the `dba In a typical production virtuoso installation, you will want to change the password of the dba user and disable the dav user. Here are the commands to do so. Pay attention to the single versus double quoting. ``` -SQL> set password "dba" "rFw,OntlJ@Sz"; +SQL> set password "dba" "dba"; SQL> UPDATE ws.ws.sys_dav_user SET u_account_disabled=1 WHERE u_name='dav'; SQL> CHECKPOINT; ``` +We now store the passwords in secrets: + +* CI/CD: /export2/guix-containers/genenetwork-development/etc/genenetwork/conf/gn3/secrets.py +* Production: /export/guix-containers/genenetwork/etc/genenetwork/genenetwork3/gn3-secrets.py + ## Loading data into virtuoso Virtuoso supports at least three different ways to load RDF. @@ -151,6 +160,19 @@ Start isql with something like guix shell --expose=verified-data=/var/lib/data virtuoso-ose -- isql -U dba -P password 8981 ``` +The password is in the container secrets file. +Inside a container, you can also do + +``` +root@tux04 ~# /gnu/store/9d81kdw2frn6b3fwqphsmkssc9zblir1-virtuoso-ose-7.2.11/bin/isql -u dba -P password -S 8981 +OpenLink Virtuoso Interactive SQL (Virtuoso) +Version 07.20.3238 as of Jan 1 1970 +Type HELP; for help and EXIT; to exit. + +*** Error 28000: [Virtuoso Driver]CL034: Bad login + +``` + To delete a graph: ``` @@ -166,6 +188,18 @@ rdf_loader_run(); checkpoint; ``` +You may not have permissions to dir.
Check + +``` +select virtuoso_ini_path(); +``` + +the file should contain the relevant dir + +``` +DirsAllowed=/dir +``` + => http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksGuideDeleteLargeGraphs How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store? When virtuoso has just been started up with a clean state (that is, the virtuoso state directory was empty before virtuoso started), uploading large amounts of data using the SPARQL 1.1 Graph Store HTTP Protocol fails the first time. It succeeds only the second time. It is not clear why. I can only recommend retrying as in this commit: @@ -274,3 +308,7 @@ To dump data into a ttl file, first make sure that you are in the guix environme => https://github.com/genenetwork/dump-genenetwork-database/ Dump Genenetwork Database See the README for instructions. + +For the public GN endpoint visit + +=> https://sparql.genenetwork.org/sparql/ diff --git a/topics/testing/mechanical-rob.gmi b/topics/testing/mechanical-rob.gmi index 9413b47..baf111a 100644 --- a/topics/testing/mechanical-rob.gmi +++ b/topics/testing/mechanical-rob.gmi @@ -1,9 +1,74 @@ # Mechanical Rob -We need to run Mechanical Rob tests as part of our continuous integration tests. +## Tags -The Mechanical Rob CI tests are functioning again now. To see how to run Mechanical Rob, see the CI job definition in the genenetwork-machines repo. +* type: documentation, docs +* assigned: bonfacem, rookie101, fredm +* priority: medium +* status: open +* keywords: tests, testing, mechanical-rob -=> genenetwork-machines/src/branch/main/genenetwork-development.scm +## What is Mechanical Rob? -The invocation procedure is bound to change as the many environment variables in genenetwork2 are cleared up. +Mechanical Rob is our name for what could be considered our integration tests. + +The idea is that we observe how Prof. Robert Williams (Rob) (and other scientists) use(s) GeneNetwork and create a "mechanical" facsimile of that. 
The purpose is to ensure that the system works correctly with each and every commit in any of our various repositories. + +If any commit causes any part of the Mechanical Rob system to raise an error, then we know, immediately, that something is broken, and the culprit can get onto fixing that with haste. + +## Show Me Some Code!!! + +Nice! I like your enthusiasm. + +You can find the +=> https://github.com/genenetwork/genenetwork2/tree/testing/test/requests Mechanical Rob code here +within the genenetwork2 repository. + +You can also see how it is triggered in the gn-machines repository in +=> https://git.genenetwork.org/gn-machines/tree/genenetwork-development.scm this module. +Search for "genenetwork2-mechanical-rob" within that module and you should find how the system is triggered. + +## How About Running it Locally + +All the above is nice and all, but sometimes you just want to run the checks locally. + +In that case, you can run Mechanical Rob locally by following the steps below: +(note that these steps are mostly the same as those for running GN2 locally). + + +1. Get a guix shell for GN2 development: +``` +$ cd genenetwork2/ +$ guix shell --container --network \ + --expose=</path/to/directory/with/genotypes> \ + --expose=</path/to/local/genenetwork3> \ + --expose=</path/to/settings/file> \ + --expose=</path/to/secrets/file> \ + --file=guix.scm bash +``` +The last `bash` is to ensure we install the Bourne-Again Shell which we use to launch the application. The `</path/to/local/genenetwork3>` can be omitted if you do not need the latest code in GN3 to be included in your running GN2. + +2. Set up the appropriate environment variables: +``` +[env]$ export HOME=</path/to/home/directory> +[env]$ export GN2_SETTINGS=</path/to/settings/file> +[env]$ export SERVER_PORT=5003 +[env]$ export GN2_PROFILE="${GUIX_ENVIRONMENT}" +[env]$ export GN3_PYTHONPATH=</path/to/local/genenetwork3> # Only needed if you need to test GN3 updates +``` + +3.
Run the mechanical-rob tests +``` +[env]$ bash bin/genenetwork2 gn2/default_settings.py -c \ + test/requests/test-website.py \ + --all "http://localhost:${SERVER_PORT}" +``` +Of course, here we are assuming that `SERVER_PORT` has the value of the port on which GN2 is running. + + +## Possible Improvements + +Look into using geckodriver to help with the mechanical-rob tests. +`geckodriver` comes with the +=> https://icecatbrowser.org/index.html GNU IceCat browser +which is present as a package in GNU Guix. diff --git a/topics/xapian/xapian-indexing.gmi b/topics/xapian/xapian-indexing.gmi index 1c82018..68ab7a6 100644 --- a/topics/xapian/xapian-indexing.gmi +++ b/topics/xapian/xapian-indexing.gmi @@ -2,18 +2,48 @@ Due to the enormous size of the GeneNetwork database, indexing it in a reasonable amount of time is a tricky process that calls for careful identification and optimization of the performance bottlenecks. This document is a description of how we achieve it. -Indexing happens in the following three phases. +Indexing happens in these phases. * Phase 1: retrieve data from SQL -* Phase 2: index text -* Phase 3: write Xapian index to disk +* Phase 2: retrieve metadata from RDF +* Phase 3: index text +* Phase 4: write Xapian index to disk -Phases 1 and 3 (that is, the retrieval of data from SQL and writing of the Xapian index to disk) are I/O bound processes. Phase 2 (the actual indexing of text) is CPU bound. So, we parallelize phase 2 while keeping phases 1 and 3 sequential. +Phases 1, 2 and 4 are I/O bound processes. Phase 3 (the actual indexing of text) is CPU bound. So, we parallelize phase 3 while keeping phases 1, 2 and 4 sequential. -There is a long delay in retrieving data from SQL and loading it into memory. In this time, the CPU is waiting on I/O and idling away. In order to avoid this, we retrieve SQL data chunk by chunk and spawn off phase 2 worker processes. Thus, we interleave phase 1 and 2 so that they don't block each other.
Despite this, on tux02, the indexing script is only able to keep around 10 of the 128 CPUs busy. As phase 1 is dishing out jobs to phase 2 worker processes, before it can finish dishing out jobs to all 128 CPUs, the earliest worker processes finish and exit. The only way to avoid this and improve CPU utilization would be to further optimize the I/O of phase 1. +There is a long delay in retrieving data from SQL and loading it into memory. In this time, the CPU is waiting on I/O and idling away. In order to avoid this, we retrieve SQL data chunk by chunk and spawn off phase 3 worker processes. We get RDF data in one large call before any processing is done. Thus, we interleave phase 1 and 3 so that they don't block each other. Despite this, on tux02, the indexing script is only able to keep around 10 of the 128 CPUs busy. As phase 1 is dishing out jobs to phase 3 worker processes, before it can finish dishing out jobs to all 128 CPUs, the earliest worker processes finish and exit. The only way to avoid this and improve CPU utilization would be to further optimize the I/O of phase 1. Building a single large Xapian index is not scalable. See detailed report on Xapian scalability. => xapian-scalability So, we let each process of phase 3 build its own separate Xapian index. Finally, we compact and combine them into one large index. When writing smaller indexes in parallel, we take care to lock access to the disk so that only one process is writing to the disk at any given time. If many processes try to simultaneously write to the disk, the write speed is slowed down, often considerably, due to I/O contention. -It is important to note that the performance bottlenecks identified in this document are machine-specific. For example, on my laptop with only 2 cores, CPU performance in phase 2 is the bottleneck. Phase 1 I/O waits on the CPU to finish instead of the other way around.
+It is important to note that the performance bottlenecks identified in this document are machine-specific. For example, on my laptop with only 2 cores, CPU performance in phase 3 is the bottleneck. Phase 1 I/O waits on the CPU to finish instead of the other way around. + +## Local Development + +For local development, see: + +=> https://issues.genenetwork.org/topics/database/working-with-virtuoso-locally Working with Virtuoso for Local Development + +Ping @bmunyoki for the ttl folder backups. + +Set up mysql with instructions from + +=> https://issues.genenetwork.org/topics/database/setting-up-local-development-database + +and load up the backup file using: + +> mariadb gn2 < /path/to/backup/file.sql + +A backup file can be generated using: + +> mysqldump -u mysqluser -pmysqlpasswd --opt --where="1 limit 100000" db_webqtl > out.sql +> xz out.sql + +And run the index script using: + +> python3 scripts/index-genenetwork create-xapian-index /tmp/xapian "mysql://gn2:password@localhost/gn2" "http://localhost:8890/sparql" + +Verify the index with: + +> xapian-delve /tmp/xapian |
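The interleaving of chunked SQL retrieval with parallel text indexing described in the xapian-indexing document can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the actual scripts/index-genenetwork code: `chunks`, `index_chunk` and `run` are hypothetical names, the rows are plain strings rather than SQL results, and the worker merely counts words where a real worker would build its own small Xapian index. Disk locking and the final compaction step are left out.

```python
import multiprocessing as mp
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows.

    Hypothetical stand-in for phase 1 (chunk-by-chunk SQL retrieval).
    """
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def index_chunk(chunk):
    """Hypothetical stand-in for phase 3: CPU-bound work on one chunk.

    Here we only count whitespace-separated words; a real worker would
    index the text into its own separate Xapian database.
    """
    return sum(len(str(row).split()) for row in chunk)


def run(rows, chunk_size=2, workers=2):
    """Feed chunks to a pool of worker processes and collect the results."""
    with mp.Pool(processes=workers) as pool:
        # imap consumes the generator in a background thread, so workers
        # can start on early chunks while later chunks are still produced.
        return list(pool.imap(index_chunk, chunks(rows, chunk_size)))


if __name__ == "__main__":
    print(run(["a b", "c", "d e f", "g"]))  # prints [3, 4]
```

Because each worker only sees its own chunk, the workers never share an index, which mirrors the idea of building separate small indexes and combining them afterwards.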
