Diffstat (limited to 'issues')
74 files changed, 3655 insertions, 57 deletions
diff --git a/issues/CI-CD/cd-is-slow.gmi b/issues/CI-CD/cd-is-slow.gmi
new file mode 100644
index 0000000..9b0e1ee
--- /dev/null
+++ b/issues/CI-CD/cd-is-slow.gmi
@@ -0,0 +1,276 @@
+# CD is slow
+
+The pages are slow and some are broken.
+
+We found out that there are quite a few network calls using DNS - and DNS was slow. The configured DNS server was not responding. Using Google's DNS made things go fast again. We will probably introduce dnsmasq in the container to make things even faster.
+
+# Tags
+
+* type: bug
+* status: in progress
+* priority: high
+* assigned: pjotrp
+* interested: pjotrp, bonfacem
+* keywords: deployment, server
+
+# Tasks
+
+* [ ] Use dnsmasq caching - it is a guix system service
+* [ ] Run fewer gunicorn processes on CD (2 should do)
+* [ ] Increase debugging output for GN2
+* [ ] Fix GN3 hook for github (it is not working)
+* [X] gn-guile lacks certificates it can use for sparql
+
+# Measuring
+
+bonfacekilz:
+I'm currently instrumenting the requests to see what hogs up the time. Loading the landing page takes 32 seconds!
+
+Something's off. From outside the container:
+
+```
+bonfacem@tux02 ~ $ guix shell python-wrapper python-requests -- python time.py
+Status: 200
+Time taken: 32.989222288131714 seconds
+```
+
+From inside the container:
+
+```
+2025-07-18 14:46:36 INFO:gn2.wqflask:Landing page rendered in 8.12 seconds
+```
+
+And I see:
+
+## CD
+
+```
+> curl -w @- -o /dev/null -s https://cd.genenetwork.org <<EOF
+\n
+DNS lookup: %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time: %{time_total}s\n
+EOF
+
+DNS lookup: 8.117543s
+Connect time: 8.117757s
+TLS handshake: 8.197767s
+Pre-transfer: 8.197861s
+Start transfer: 33.096467s
+Total time: 33.096601s
+```
+
+## Production
+
+```
+> curl -w @- -o /dev/null -s https://genenetwork.org <<EOF
+\n
+DNS lookup: %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time: %{time_total}s\n
+EOF
+
+DNS lookup: 8.075794s
+Connect time: 8.076402s
+TLS handshake: 8.147322s
+Pre-transfer: 8.147370s
+Start transfer: 8.797107s
+Total time: 8.797299s
+```
+
+## On tux02 (outside CD container)
+
+```
+> curl -w @- -o /dev/null -s http://localhost:9092 <<EOF
+\n
+DNS lookup: %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time: %{time_total}s\n
+EOF
+
+DNS lookup: 0.000068s
+Connect time: 0.000543s
+TLS handshake: 0.000000s
+Pre-transfer: 0.000606s
+Start transfer: 24.851069s
+Total time: 24.851166s
+```
+
+This does not look like an nginx problem (at least on tux02 itself). Also, the nginx configuration was not really changed; ditto for the mysql configuration. I can still test both, but it looks like the problem is inside the system container.
+
+The container logs are at
+
+```
+root@tux02:/export2/guix-containers/genenetwork-development/var/log/cd# tail -100 genenetwork2.log
+```
+
+There are some interesting errors there that need resolving, such as
+
+## gn-guile error
+
+```
+tail gn-guile.log
+2025-07-20 04:49:49 X.509 certificate of 'sparql.genenetwork.org' could not be verified:
+2025-07-20 04:49:49 signer-not-found invalid
+```
+
+Guile is not finding the certificates for our virtuoso server.
+It does work with curl, try
+
+```
+curl -G https://query.wikidata.org/sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="SELECT DISTINCT * where {
+  wd:Q158695 wdt:P225 ?o .
+} limit 5"
+{
+  "head" : {
+    "vars" : [ "o" ]
+  },
+  "results" : {
+    "bindings" : [ {
+      "o" : {
+        "type" : "literal",
+        "value" : "Arabidopsis thaliana"
+      }
+    } ]
+  }
+}
+```
+
+Also inside the container:
+
+```
+curl http://localhost:8091/gene/aliases/Shh
+```
+
+renders the same error! X.509 certificate of 'query.wikidata.org' could not be verified. So it is a gn-guile issue.
+
+## GN2 error reporting
+
+Also, there are too many gunicorn processes and, strikingly, no debug output. I also see a missing robots.txt file (even though LLMs hardly honour them).
+
+Let's try to get inside the container with nsenter:
+
+```
+ps xau|grep genenetwork-development-container
+root      115940  0.0  0.0 163692 26296 ?  Ssl  Jul18  0:00 /gnu/store/ylwk2vn18dkzkj0nxq2h4vjzhz17bm7c-guile-3.0.9/bin/guile --no-auto-compile /usr/local/bin/genenetwork-development-container
+pgrep -P 115940
+115961
+```
+
+Use this child PID and a recent nsenter:
+
+```
+/gnu/store/w7a3frdmffpw3hvxpvvxwxgzfhyqdm6n-profile/bin/nsenter -m -p -t 115961 /run/current-system/profile/bin/bash -login
+```
+
+System tools are in '/run/current-system/profile/bin/'.
+
+Make it a one-liner with
+
+```
+/gnu/store/w7a3frdmffpw3hvxpvvxwxgzfhyqdm6n-profile/bin/nsenter -m -p -t $(pgrep -P `ps xau|grep genenetwork-development-container|awk '{print $2}'|sort -r|head -1`) /run/current-system/profile/bin/bash -login
+```
+
+Once inside we can pick up curl (I note the system container has full access to the /gnu/store on the host):
+
+```
+root@tux02 /# /gnu/store/vdaspmq10c3zmqhp38lfqy812w6r4xg3-curl-8.6.0/bin/curl -w @- -o /dev/null -s http://localhost:9092 <<EOF
+\n
+DNS lookup: %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time: %{time_total}s\n
+EOF
+
+DNS lookup: 0.000064s
+Connect time: 0.000478s
+TLS handshake: 0.000000s
+Pre-transfer: 0.000551s
+Start transfer: 24.792926s
+Total time: 24.793015s
+```
+
+That rules out container and nginx streaming issues.
+
+So the problem is with GN and its DBs. gn-machines is used from /home/aruni and its checkout dates from March. Has CD been slow since then? I don't think so. Also, the changes to the actual scripts are even older, and the guix-bioinformatics repo shows no changes.
+Remaining culprits I suspect are:
+
+* [*] MySQL
+* [ ] Interaction of gn-auth with gn2
+* [ ] Interaction of gnqa with gn2
+
+Running a standard test on mysql shows it is fine:
+
+```
+time mysql -u webqtlout -pwebqtlout db_webqtl < $rundir/../shared/sql/test02.sql
+Name  FullName  Name  Symbol  CAST(ProbeSet."description" AS BINARY)  CAST(ProbeSet."Probe_Target_Description" AS BINARY)  Chr  Mb  Mean  LRS  Locus  pValue  additive  geno_chr  geno_mb
+HC_M2_0606_P  Hippocampus Consortium M430v2 (Jun06) PDNN  1457545_at  9530036O11Rik  long non-coding RNA, expressed sequence tag (EST) AK035474 with high bladder expression  antisense EST 14 Kb upstream of Shh  5  28.480441  6.7419292929293  15.2845189682605  rsm10000001525  0.055  0.0434848484848485  3  9.671673
+HC_M2_0606_P  Hippocampus Consortium M430v2 (Jun06) PDNN  1427571_at  Shh  sonic hedgehog (hedgehog)  last exon  5  28.457886  6.50113131313131  9.58158655605723  rs8253327  0.697  0.0494097096188748  1  191.908118
+HC_M2_0606_P  Hippocampus Consortium M430v2 (Jun06) PDNN  1436869_at  Shh  sonic hedgehog (hedgehog)  mid distal 3' UTR  5  28.457155  9.279090909090911  12.7711275309832  rs8253327  0.306  -0.214087568058076  1  191.908118
+
+real  0m0.010s
+user  0m0.004s
+sys   0m0.000s
+```
+
+# Profiling CD
+
+Ran a profiler against a traits page. See the following:
+
+=> /issues/CI-CD/profiling-flask
+
+## Results/Interpretation
+
+* By fixing gn-guile and gene-alias resolution, times dropped by ~10s. However, the page still takes 37.9s to render.
+
+* Resolving a DNS name takes around 4.585s per lookup, and we make 7 requests, totalling 32.09s. Typically, a traits page should take 8.79s. The remainder, (- 37.9 32.09) = 5.8s, is within that normal page time; the DNS lookups explain the slowness:
+
+```
+ ncall     tottime     percall    cumtime  percall  filename:lineno(function)
+----------------------------------------------------------------------------
+ 7      0.00002618   3.741e-05     32.09    4.585   socket.py:938(getaddrinfo)
+```
+
+* The above is consistent with all the analysis I've done across all the profile dumps.
+
+* Testing my theory out (imports added here for completeness):
+
+```
+import time
+from urllib.parse import urljoin
+
+import requests
+from flask import current_app
+
+
+@app.route("/test-network")
+def test_network():
+    start = time.time()
+    http_url = urljoin(
+        current_app.config["GN_SERVER_URL"],
+        "version"
+    )
+    result = requests.get(http_url)
+    duration = time.time() - start
+    app.logger.error(f"{http_url}: {duration:.4f}s")
+
+    start = time.time()
+    local_url = "http://localhost:9093/api/version"
+    result = requests.get(local_url)
+    duration = time.time() - start
+    app.logger.error(f"{local_url}: {duration:.4f}s")
+    return result.json()
+```
+
+* Results:
+
+```
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] https://cd.genenetwork.org/api3/version: 8.1647s
+2025-07-24 10:20:43 ERROR:gn2.wqflask:https://cd.genenetwork.org/api3/version: 8.1647s
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] result: 1.0
+2025-07-24 10:20:43 ERROR:gn2.wqflask:result: 1.0
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] http://localhost:9093/api/version: 0.0088s
+2025-07-24 10:20:43 ERROR:gn2.wqflask:http://localhost:9093/api/version: 0.0088s
+2025-07-24 10:20:43 [2025-07-24 10:20:43 +0000] [101] [ERROR] result: 1.0
+```
+
+## Possible Mitigations
+
+* Switch gn-auth.genenetwork.org over to localhost.
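+## Quick DNS Check
+
+To confirm the DNS theory independently of Flask, here is a minimal sketch that times the same getaddrinfo call the profiler flagged. The hostname list is an assumption; adjust it to whatever the container actually resolves:
+
+```
+import socket
+import time
+
+# Hosts assumed to be resolved during a traits-page render.
+HOSTS = ["cd.genenetwork.org", "genenetwork.org", "auth.genenetwork.org"]
+
+for host in HOSTS:
+    start = time.time()
+    socket.getaddrinfo(host, 443)  # the call that dominates the profile above
+    print(f"{host}: {time.time() - start:.3f}s")
+```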
diff --git a/issues/CI-CD/development-container-checklist.gmi b/issues/CI-CD/development-container-checklist.gmi
new file mode 100644
index 0000000..7cf4687
--- /dev/null
+++ b/issues/CI-CD/development-container-checklist.gmi
@@ -0,0 +1,101 @@
+# Deploying GeneNetwork CD
+
+## Prerequisites
+
+Ensure you have `fzf' installed and Guix is set up with your preferred channel configuration.
+
+## Step 1: Pull the Latest Profiles
+
+```
+guix pull -C channels.scm -p ~/.guix-extra-profiles/gn-machines --allow-downgrades
+guix pull -C channels.scm -p ~/.guix-extra-profiles/gn-machines-shepherd-upgrade --allow-downgrades
+```
+
+## Step 2: Source the Correct Profile
+
+```
+. ,choose-profile
+```
+
+### Contents of `,choose-profile'
+
+This script lets you interactively select a profile using `fzf':
+
+```
+#!/usr/bin/env sh
+
+export GUIX_PROFILE="$(guix package --list-profiles | fzf --multi)"
+. "$GUIX_PROFILE/etc/profile"
+
+hash guix
+
+echo "Currently using: $GUIX_PROFILE"
+```
+
+## Step 3: Verify the Profile
+
+```
+guix describe
+```
+
+## Step 4: Pull the Latest Code
+
+```
+cd gn-machines
+git pull
+```
+
+## Step 5: Run the Deployment Script
+
+```
+./genenetwork-development-deploy.sh
+```
+
+## Step 6: Restart the Development Container
+
+```
+sudo systemctl restart genenetwork-development-container
+```
+
+## Step 7: Verify Changes
+
+Manually confirm that the intended changes were applied correctly.
+
+# Accessing the Development Container on tux02
+
+To enter the running container shell, ensure you're using the *parent* PID of the `shepherd' process.
+
+## Step 1: Identify the Correct PID
+
+Use this command to locate the correct container parent process:
+
+```
+ps -u root -f --forest | grep -A4 '/usr/local/bin/genenetwork-development-container' | grep shepherd
+```
+
+## Step 2: Enter the Container
+
+Replace `46804' with your actual parent PID:
+
+```
+sudo /home/bonfacem/.config/guix/current/bin/guix container exec 46804 \
+  /gnu/store/m6c5hgqg569mbcjjbp8l8m7q82ascpdl-bash-5.1.16/bin/bash \
+  --init-file /home/bonfacem/.guix-profile/etc/profile --login
+```
+
+## Notes
+
+* Ensure the PID is the container’s *shepherd parent*, not a child process.
+* Always double-check your environment and profiles before deploying.
diff --git a/issues/CI-CD/failing-services-startup.gmi b/issues/CI-CD/failing-services-startup.gmi new file mode 100644 index 0000000..751e61c --- /dev/null +++ b/issues/CI-CD/failing-services-startup.gmi @@ -0,0 +1,236 @@ +# Failing Services' Startup + +## Tags + +* type: bug +* status: closed, completed +* priority: high +* assigned: fredm, bonfacem +* interested: pjotrp, bonfacem, aruni +* keywords: deployment, CI, CD + +## Description + +Upgrading guix to `34453b97005ff86355399df89c8827c57839d9c7` for CI/CD fails with: + +``` +2025-08-20 16:05:20 Backtrace: +2025-08-20 16:05:20 6 (primitive-load "/gnu/store/xbxd2zihw9dssrhips925gri0yn?") +2025-08-20 16:05:20 In ice-9/eval.scm: +2025-08-20 16:05:20 191:35 5 (_ _) +2025-08-20 16:05:20 In gnu/build/linux-container.scm: +2025-08-20 16:05:20 368:8 4 (call-with-temporary-directory #<procedure 7f014aa3a3f0?>) +2025-08-20 16:05:20 476:16 3 (_ "/tmp/guix-directory.VWRNbv") +2025-08-20 16:05:20 62:6 2 (call-with-clean-exit #<procedure 7f014aa1de80 at gnu/b?>) +2025-08-20 16:05:20 321:20 1 (_) +2025-08-20 16:05:20 In guix/build/syscalls.scm: +2025-08-20 16:05:20 1231:10 0 (_ 268566528) +2025-08-20 16:05:20 +2025-08-20 16:05:20 guix/build/syscalls.scm:1231:10: In procedure unshare: 268566528: Invalid argument +2025-08-20 16:05:20 Backtrace: +2025-08-20 16:05:20 4 (primitive-load "/gnu/store/xbxd2zihw9dssrhips925gri0yn?") +2025-08-20 16:05:20 In ice-9/eval.scm: +2025-08-20 16:05:20 191:35 3 (_ #f) +2025-08-20 16:05:20 In gnu/build/linux-container.scm: +2025-08-20 16:05:20 368:8 2 (call-with-temporary-directory #<procedure 7f014aa3a3f0?>) +2025-08-20 16:05:20 485:7 1 (_ "/tmp/guix-directory.VWRNbv") +2025-08-20 16:05:20 In unknown file: +2025-08-20 16:05:20 0 (waitpid #f #<undefined>) +2025-08-20 16:05:20 +2025-08-20 16:05:20 ERROR: In procedure waitpid: +2025-08-20 16:05:20 Wrong type (expecting exact integer): #f +``` + +Failing services: + +* genenetwork3: consistently +* genenetwork2: consistently +* gn-auth: intermittently + +## Troubleshooting Notes + +### Unable to run genenetwork2 in a shell container with the "-C" flag + +With the following channels: + +``` +$ guix describe +Generation 3 Aug 28 2025 03:56:44 (current) + gn-bioinformatics cffafde + repository URL: file:///home/bonfacem/guix-bioinformatics/ + branch: master + commit: cffafde125f3e711418d3ebb62eacd48a3efa8cf + guix-forge 3c8dc85 + repository URL: https://git.genenetwork.org/guix-forge/ + branch: main + commit: 3c8dc85a584c98bc90088ec1c85933d4d10e7383 + guix-past b14d7f9 + repository URL: https://codeberg.org/guix-science/guix-past + branch: master + commit: b14d7f997ae8eec788a7c16a7252460cba3aaef8 + guix 34453b9 + repository URL: https://codeberg.org/guix/guix + branch: master + commit: 34453b97005ff86355399df89c8827c57839d9c7 +``` + +Running: + +``` +$ guix shell -C genenetwork2 +``` + +Produces: + +``` +guix shell: error: unshare: 268566528: Invalid argument +Backtrace: + 16 (primitive-load "/export3/local/home/bonfacem/.guix-ext…") +In guix/ui.scm: + 2399:7 15 (run-guix . _) + 2362:10 14 (run-guix-command _ . _) +In ice-9/boot-9.scm: + 1752:10 13 (with-exception-handler _ _ #:unwind? _ # _) +In guix/status.scm: + 842:4 12 (call-with-status-report _ _) +In guix/store.scm: + 703:3 11 (_) +In ice-9/boot-9.scm: + 1752:10 10 (with-exception-handler _ _ #:unwind? 
_ # _) +In guix/store.scm: + 690:37 9 (thunk) + 1331:8 8 (call-with-build-handler _ _) + 1331:8 7 (call-with-build-handler #<procedure 7fc86bb50de0 at g…> …) +In guix/scripts/environment.scm: + 1205:11 6 (proc _) +In guix/store.scm: + 2212:25 5 (run-with-store #<store-connection 256.100 7fc87a46d820> …) +In guix/scripts/environment.scm: + 911:8 4 (_ _) +In gnu/build/linux-container.scm: + 485:7 3 (call-with-container _ _ #:namespaces _ #:host-uids _ # …) +In unknown file: + 2 (waitpid #f #<undefined>) +In ice-9/boot-9.scm: + 1685:16 1 (raise-exception _ #:continuable? _) + 1685:16 0 (raise-exception _ #:continuable? _) + +ice-9/boot-9.scm:1685:16: In procedure raise-exception: +Wrong type (expecting exact integer): #f +``` + +This is fixed by increasing the value of respawn-delay (default is 0.5s) to 5s. + + +### Unable to write to a temporary directory and issues with running git inside the g-exp + +Stack trace: +``` +2025-09-03 12:23:32 In ice-9/eval.scm: +2025-09-03 12:23:32 191:35 3 (_ #f) +2025-09-03 12:23:32 In gnu/build/linux-container.scm: +2025-09-03 12:23:32 368:8 2 (call-with-temporary-directory #<procedure 7f012241d3f0?>) +2025-09-03 12:23:32 485:7 1 (_ "/tmp/guix-directory.Bl6jtx") +2025-09-03 12:23:32 In unknown file: +2025-09-03 12:23:32 0 (waitpid #f #<undefined>) +2025-09-03 12:23:32 + +``` + +Cryptic message. Running the g-exps as a program shows: + +``` +Receiving objects: 100% (698/698), 16.18 MiB | 30.29 MiB/s, done. +Resolving deltas: 100% (49/49), done. +================================================== +error: cannot run less: No such file or directory +fatal: unable to execute pager 'less' +Backtrace: + 5 (primitive-load "/gnu/store/c9bvy90s5mglp6xdfkc1s4qkzj8?") +In ice-9/eval.scm: + 619:8 4 (_ #f) +In ice-9/boot-9.scm: + 142:2 3 (dynamic-wind #<procedure 7fa954b25880 at ice-9/eval.s?> ?) + 142:2 2 (dynamic-wind #<procedure 7fa94b7970c0 at ice-9/eval.s?> ?) +In ice-9/eval.scm: + 619:8 1 (_ #(#(#<directory (guile-user) 7fa954b03c80>))) +In guix/build/utils.scm: + 822:6 0 (invoke "git" "log" "--max-count" "1") + +guix/build/utils.scm:822:6: In procedure invoke: +ERROR: + 1. 
&invoke-error: + program: "git" + arguments: ("log" "--max-count" "1") + exit-status: 128 + term-signal: #f + stop-signal: #f +``` + +Fixed by adding "less" to the with-packages form and setting: + +``` +(setenv "TERM" "xterm-256color") + +``` + +### gn-auth: sqlite3.OperationalError: unable to open database file + +Despite having all file perms correctly set with 0644, we see: + +``` +Traceback (most recent call last): + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/flask/app.py", line 917, in full_dispatch_request + rv = self.dispatch_request() + ^^^^^^^^^^^^^^^^^^^^^^^ + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/flask/app.py", line 902, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/views.py", line 102, in authorise + return with_db_connection(__authorise__) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/db/sqlite3.py", line 63, in with_db_connection + return func(conn) + ^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/views.py", line 90, in __authorise__ + return server.create_authorization_response(request=request, grant_user=user) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/authlib/oauth2/rfc6749/authorization_server.py", line 297, in create_authorization_response + args = grant.create_authorization_response(redirect_uri, grant_user) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 31, in create_authorization_response + response = super().create_authorization_response( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/gnu/store/ag1m9bv22iwm3sq87xly35y138l6kzd7-profile/lib/python3.11/site-packages/authlib/oauth2/rfc6749/grants/authorization_code.py", line 158, in create_authorization_response + self.save_authorization_code(code, self.request) + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 45, in save_authorization_code + return __save_authorization_code__( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 106, in __save_authorization_code__ + return with_db_connection(lambda conn: save_authorisation_code(conn, code)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/db/sqlite3.py", line 63, in with_db_connection + return func(conn) + ^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/grants/authorisation_code_grant.py", line 106, in <lambda> + return with_db_connection(lambda conn: save_authorisation_code(conn, code)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/export/data/repositories/gn-auth/gn_auth/auth/authentication/oauth2/models/authorization_code.py", line 92, in save_authorisation_code + cursor.execute( +sqlite3.OperationalError: unable to open database file +``` + +Fixed above by correctly mapping: + +``` +- (source auth-db-path) ++ (source (dirname 
auth-db-path))
+```
+
+in the relevant g-exp, and making sure that the parent directory is set to #o775 (rwx for user and group).
+
+## Also See
+
+=> https://issues.guix.gnu.org/78356 Broken system and home containers
+=> https://codeberg.org/guix/guix/src/commit/34453b97005ff86355399df89c8827c57839d9c7/guix/build/syscalls.scm#L1218-L1233 How "unshare" is defined
+=> https://codeberg.org/guix/guix/src/commit/34453b97005ff86355399df89c8827c57839d9c7/gnu/build/linux-container.scm#L321 Where `unshare` is called
diff --git a/issues/CI-CD/profiling-flask.gmi b/issues/CI-CD/profiling-flask.gmi
new file mode 100644
index 0000000..2d0c539
--- /dev/null
+++ b/issues/CI-CD/profiling-flask.gmi
@@ -0,0 +1,33 @@
+# Profiling GN
+
+Use this simple structure:
+
+```
+from flask import Flask
+from werkzeug.middleware.profiler import ProfilerMiddleware
+
+
+app = Flask(__name__)
+app.config["PROFILE"] = True
+app.wsgi_app = ProfilerMiddleware(
+    app.wsgi_app,
+    restrictions=[40, "main"],
+    profile_dir="profiler_dump",
+    filename_format="{time:.0f}-{method}-{path}-{elapsed:.0f}ms.prof",
+)
+```
+
+You can use gprof2dot to visualise the profile:
+
+```
+guix shell gprof2dot -- gprof2dot -f pstats 1753202013-GET-show_trait-37931ms.prof > 1753202013-GET-show_trait-37931ms.prof.dot
+guix shell xdot -- xdot 1753202013-GET-show_trait-37931ms.prof.dot
+```
+
+Or snakeviz to visualise it:
+
+```
+scp genenetwork:/home/bonfacem/profiling/1753202013-GET-show_trait-37931ms.prof /tmp/
+snakeviz /tmp/1753202013-GET-show_trait-37931ms.prof
+```
diff --git a/issues/CI-CD/troubleshooting-within-the-development-container.gmi b/issues/CI-CD/troubleshooting-within-the-development-container.gmi
new file mode 100644
index 0000000..3aa8c3b
--- /dev/null
+++ b/issues/CI-CD/troubleshooting-within-the-development-container.gmi
@@ -0,0 +1,46 @@
+# Troubleshooting inside the GN dev container
+
+* type: systems, debugging, container
+* keywords: container, troubleshooting, logs, webhooks
+
+You need to find the development container so that you can begin troubleshooting:
+
+```
+ps -u root -f --forest | grep -A4 '/usr/local/bin/genenetwork-development-container' | grep shepherd
+```
+
+Example output:
+
+```
+root 16182 16162 0 03:57 ? 00:00:04 \_ /gnu/store/n87px1cazqkav83npg80ccp1n777j08s-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/b4n5ax7l1ccia7sr123fqcjqi4vy03pv-shepherd-1.0.2/bin/shepherd --config /gnu/store/5ahb3745wlpa5mjsbk8j6frn78khvzzw-shepherd.conf
+```
+
+Get into the container:
+
+```
+# Use the correct pid and guix/bash path.
+
+sudo /home/bonfacem/.config/guix/current/bin/guix container exec 16182 /gnu/store/m6c5hgqg569mbcjjbp8l8m7q82ascpdl-bash-5.1.16/bin/bash --init-file /home/bonfacem/.guix-profile/etc/profile --login
+```
+
+All the GN-related logs can be found in "/var/log/cd":
+
+```
+genenetwork2.log
+genenetwork3.log
+gn-auth.log
+gn-guile.log
+```
+
+All the nginx logs are in "/var/log/nginx".
+
+Sometimes it's useful to trigger webhooks while troubleshooting. Here are all the relevant webhooks:
+
+```
+/gn-guile
+/genenetwork2
+/genenetwork3
+/gn-libs
+/gn-auth
+```
+
+Inside the container, we have "coreutils-minimal" and "curl", which you can use to troubleshoot.
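+For scripted troubleshooting from the host, triggering one of those hooks could look like the sketch below. The listener address and port are an assumption here; substitute whatever the CI hooks service actually listens on:
+
+```
+import requests
+
+# Hypothetical hooks listener; replace with the real host:port.
+HOOKS_BASE = "http://localhost:9091"
+
+# Kick off a genenetwork2 rebuild via its webhook path (from the list above).
+response = requests.post(f"{HOOKS_BASE}/genenetwork2", timeout=30)
+print(response.status_code, response.text)
+```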
diff --git a/issues/acme-error.gmi b/issues/acme-error.gmi
new file mode 100644
index 0000000..b31d04b
--- /dev/null
+++ b/issues/acme-error.gmi
@@ -0,0 +1,106 @@
+# uACME Error: "urn:ietf:params:acme:error:unauthorized"
+
+## Tags
+
+* status: closed, completed
+* priority: high
+* type: bug
+* assigned: fredm
+* keywords: uacme, certificates, "urn:ietf:params:acme:error:unauthorized"
+
+## Description
+
+Sometimes, when we attempt to request TLS certificates from Let's Encrypt using uacme, we run into an error of the following form:
+
+```
+uacme: polling challenge status at https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599167439271/jFB2Pg
+uacme: challenge https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599167439271/jFB2Pg failed with status invalid
+uacme: the server reported the following error:
+{
+  "type": "urn:ietf:params:acme:error:unauthorized",
+  "detail": "128.xxx.xxx.xxx: Invalid response from http://sparql.genenetwork.org/.well-known/acme-challenge/N-P-mhiK04c-Iophbem4iFYsaByeaxeSyXHSijx3e6k: 404",
+  "status": 403
+}
+uacme: running /gnu/store/zwqavgjqyk0f0krv8ndwhv3767f6cnx1-uacme-hook failed http-01 sparql.genenetwork.org N-P-mhiK04c-Iophbem4iFYsaByeaxeSyXHSijx3e6k N-P-mhiK04c-Iophbem4iFYsaByeaxeSyXHSijx3e6k.9dRdXFhCbqeDGWYndRd_hTh920rplmy-ef-_aLgjJJE
+uacme: failed to authorize order at https://acme-v02.api.letsencrypt.org/acme/order/2399017717/438986245271
+```
+
+From the above error, we note that the request for the "/.well-known/..." path fails with a 404 code. Why?
+
+Let's try figuring it out; connect to the running container:
+
+```
+$ sudo guix container exec 89086 /run/current-system/profile/bin/bash --login
+root@sparql /# cd /var/run/acme/acme-challenge/
+root@sparql /var/run/acme/acme-challenge# while true; do ls; sleep 0.5; clear; done
+```
+
+In a separate terminal, connect to the same container and run `/usr/bin/acme renew`.
+
+The loop we created to list what files are created in the challenge directory outputs the file
+
+```
+root@sparql /var/run/acme/acme-challenge# while true; do ls; sleep 0.5; clear; done
+Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE
+Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE
+⋮
+```
+
+but we are still getting the same error:
+
+```
+uacme: challenge https://acme-v02.api.letsencrypt.org/acme/chall/2399017717/599184604221/7mTNdA failed with status invalid
+uacme: the server reported the following error:
+{
+  "type": "urn:ietf:params:acme:error:unauthorized",
+  "detail": "128.169.5.101: Invalid response from http://sparql.genenetwork.org/.well-known/acme-challenge/Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE: 404",
+  "status": 403
+}
+uacme: running /gnu/store/zwqavgjqyk0f0krv8ndwhv3767f6cnx1-uacme-hook failed http-01 sparql.genenetwork.org Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE Rm7qCec3naVvqPldGSGI9W4i9AceW0X3MUNSAbC7SVE.9dRdXFhCbqeDGWYndRd_hTh920rplmy-ef-_aLgjJJE
+uacme: failed to authorize order at https://acme-v02.api.letsencrypt.org/acme/order/2399017717/438997397751
+```
+
+meaning that, somehow, nginx is not able to serve up this file.
+
+## Discovered Cause: 2025-10-20
+
+There are 2 layers of nginx: the host nginx, and the internal/container nginx.
+
+The host nginx was proxying directly to the virtuoso http server rather than proxying to the internal/container nginx. This led to the failure because the internal/container nginx handles the TLS/SSL certificates for the site.
+
+The host nginx should have offloaded the handling of the TLS/SSL certificates to the internal/container nginx; since requests were not going through the internal nginx, the challenge files it serves were never reachable, hence the failure.
+
+An outline of the error condition and the fix is given in the sections below:
+
+### Error Condition: Wrong proxying
+
+In the host's "nginx.conf":
+```
+⋮
+    proxy_pass http://localhost:<virtuoso-http-server-port>;
+⋮
+```
+
+In the internal/container "nginx.conf":
+```
+⋮
+    proxy_pass http://localhost:<virtuoso-http-server-port>;
+⋮
+```
+
+### Solution/Fix
+
+In the host's "nginx.conf":
+```
+⋮
+    proxy_pass http://localhost:<container-nginx-http-port>;
+⋮
+```
+
+In the internal/container "nginx.conf":
+```
+⋮
+    proxy_pass http://localhost:<virtuoso-http-server-port>;
+⋮
+```
diff --git a/issues/ai-search.gmi b/issues/ai-search.gmi
new file mode 100644
index 0000000..6611731
--- /dev/null
+++ b/issues/ai-search.gmi
@@ -0,0 +1,12 @@
+# Prompt As UI
+
+* assigned: bonfacem
+* status: to-do
+
+## Description
+
+* New search page with a ChatGPT-esque UX customised for GN that presents significant hits and metadata. Goal: a better search experience for GN users. Feed in metadata, and figure out ...
+* Build an agent/model to improve the GN AI experience (related to the above).
+* Add pre-computed hits (already in RDF) when available.
+
+=> https://duckduckgo.com/?q=DuckDuckGo+AI+Chat&ia=chat&duckai=1
diff --git a/issues/auth/masquarade-as-bug.gmi b/issues/auth/masquarade-as-bug.gmi
index 12c2c5f..36fe34a 100644
--- a/issues/auth/masquarade-as-bug.gmi
+++ b/issues/auth/masquarade-as-bug.gmi
@@ -2,6 +2,7 @@
 * assigned: fredm
 * tags: critical
+* status: closed, completed
 
 Right now you can't masquerade as another user. Here's the trace:
diff --git a/issues/correlation-timing-out.gmi b/issues/correlation-timing-out.gmi
index 419524d..bed8692 100644
--- a/issues/correlation-timing-out.gmi
+++ b/issues/correlation-timing-out.gmi
@@ -5,7 +5,7 @@
 * assigned: fredm, zsloan, alexm
 * type: bug
 * priority: high
-* status: ongoing
+* status: closed, completed
 * keywords: correlations
 
 ## Description
@@ -17,3 +17,7 @@ Do correlations against the same dataset
 This might be the same issue as the one in
 => /issues/correlation-missing-file correlation-missing-file.gmi
 but I'm not sure.
+
+## Close as completed
+
+This is fixed.
diff --git a/issues/fix-spam-entries-in-gn-auth-production.gmi b/issues/fix-spam-entries-in-gn-auth-production.gmi
index db88eec..5ef7a42 100644
--- a/issues/fix-spam-entries-in-gn-auth-production.gmi
+++ b/issues/fix-spam-entries-in-gn-auth-production.gmi
@@ -2,6 +2,7 @@ # Tags
 
+* status: closed, completed
 * assigned: fredm
 * keywords: auth
 
@@ -13,4 +14,8 @@ We have spam entries in gn-auth in production in the groups table:
 
 ```
 b59229de-2fce-4a3d-82f1-d9eeee9b7009|Business For Sale Adelaide|{"group_description": "Welcome to Business2Sell, the ultimate online platform for those seeking affordable business opportunities in Adelaide. As a trusted first-party provider, we offer the ideal marketplace for buying or selling businesses across the country. Whether you're an aspiring entrepreneur looking for your next venture or a business owner ready to sell, Business2Sell provides the perfect platform for you. Our user-friendly interface and extensive listings make it effortless to discover a wide range of businesses, all within your budget.
Join our vibrant community of buyers and sellers today, and let us help you achieve your business goals in Adelaide with ease and confidence.\r\nhttps://www.business2sell.com.au/businesses/sa/adelaide"}
```
+
+## Close as completed
+
+We added email verification when registering, which should help reduce the success of these automated bots.
+
+We also added tooling to help with user and group management, which is helping to clean up this spam data.
diff --git a/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi b/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi
new file mode 100644
index 0000000..b804e10
--- /dev/null
+++ b/issues/genenetwork/guix-bioinformatics-remove-guix-rust-past-crates-channel.gmi
@@ -0,0 +1,23 @@
+# guix-bioinformatics: Remove `guix-rust-past-crates` channel
+
+## Tags
+
+* assigned: alexm, bonfacem
+* interested: fredm
+* priority: normal
+* status: open
+* type: bug
+* keywords: guix-bioinformatics, guix-rust-past-crates, guix, rust, crates
+
+## Description
+
+GNU Guix recently changed[1] the way it handles the packaging of rust packages.
+
+The old rust packages got moved to the "guix-rust-past-crates" channel to help avoid huge breakages for systems depending on the older packaging system. "guix-bioinformatics" used a number of rust packages defined in the old form, and we needed a quick fix; thus the introduction of the "guix-rust-past-crates" channel as a dependency.
+
+We need to move away from depending on this channel by updating all the rust crates we use to the new packaging model.
+
+## Footnotes
+
+=> https://guix.gnu.org/en/blog/2025/a-new-rust-packaging-model/ [1]
diff --git a/issues/genenetwork/markdown-editing-service-not-deployed.gmi b/issues/genenetwork/markdown-editing-service-not-deployed.gmi
index e7a1717..9d72e4e 100644
--- a/issues/genenetwork/markdown-editing-service-not-deployed.gmi
+++ b/issues/genenetwork/markdown-editing-service-not-deployed.gmi
@@ -3,7 +3,7 @@ ## Tags
 
 * type: bug
-* status: open
+* status: closed, completed, fixed
 * assigned: fredm
 * priority: critical
 * keywords: production, container, tux04
@@ -32,3 +32,8 @@ If you do an edit and refresh the page, it will show up in the system, but it wi
 
 Set `CGIT_REPO_PATH="https://git.genenetwork.org/gn-guile"` which seems to allow the commit to work, but we do not actually get the changes pushed to the remote in any useful sense.
 
 It seems to me that we need to configure the environment in such a way that it will be able to push the changes to remote.
+
+
+## Close as Completed
+
+The markdown editing service is deployed and configured correctly.
diff --git a/issues/genenetwork2/broken-collections-features.gmi b/issues/genenetwork2/broken-collections-features.gmi
new file mode 100644
index 0000000..4239929
--- /dev/null
+++ b/issues/genenetwork2/broken-collections-features.gmi
@@ -0,0 +1,44 @@
+# Broken Collections Features
+
+## Tags
+
+* type: bug
+* status: open
+* priority: high
+* assigned: zachs, fredm
+* keywords: gn2, genenetwork2, genenetwork 2, collections
+
+## Description
+
+There are some features in the search results page and/or the collections page that are broken. These are:
+
+* "CTL" feature
+* "MultiMap" feature
+* "Partial Correlations" feature
+* "Generate Heatmap" feature
+
+### Reproduce Issue
+
+* Go to https://genenetwork.org
+* Select "Mouse (Mus musculus, mm10)" for "Species"
+* Select "BXD Family" for "Group"
+* Select "Traits and Cofactors" for "Type"
+* Select "BXD Published Phenotypes" for "Dataset"
+* Type "locomotion" in the "Get Any" field (without the quotes)
+* Click "Search"
+* In the results page, select the traits with the following "Record" values: "BXD_10050", "BXD_10051", "BXD_10088", "BXD_10091", "BXD_10092", "BXD_10455", "BXD_10569", "BXD_10570", "BXD_11316", "BXD_11317"
+* Click the "Add" button and add them to a new collection
+* In the resulting collections page, click the button for any of the failing features listed above
+
+### Failure modes
+
+* The "CTL" and "WGCNA" features have a failure mode that might have been caused by recent changes making use of AJAX calls, rather than submitting the form manually.
+* The "MultiMap" and "Generate Heatmap" features raise exceptions that need to be investigated and resolved.
+* The "Partial Correlations" feature seems to run forever.
+
+## Break-out Issues
+
+We break out the issues above into separate pages to track the progress of the fixes for each feature separately:
+
+=> /issues/genenetwork3/ctl-maps-error
+=> /issues/genenetwork3/generate-heatmaps-failing
diff --git a/issues/genenetwork2/error-adding-trait-to-collections-while-logged-out.gmi b/issues/genenetwork2/error-adding-trait-to-collections-while-logged-out.gmi
new file mode 100644
index 0000000..4048f3e
--- /dev/null
+++ b/issues/genenetwork2/error-adding-trait-to-collections-while-logged-out.gmi
@@ -0,0 +1,29 @@
+# Error Adding Trait(s) to Collection(s) While Logged Out
+
+## Tags
+
+* type: bug
+* status: open
+* priority: high
+* assigned: zachs, fredm
+* keywords: gn2, genenetwork2, genenetwork 2, collections
+
+## Description
+
+### Steps to Reproduce
+
+* Go to https://genenetwork.org/
+* Ensure you are logged out
+* Now, navigate to https://genenetwork.org/show_trait?trait_id=1435464_at&dataset=HC_M2_0606_P
+* Click the green "Add" button
+* Note the silent failure, then the "MissingTokenError: missing_token: Please log in again to continue." notification
+
+### Expected
+
+On clicking "Add" we should have got a modal dialog with a selection of collections we can add the trait to.
+
+### Notes
+
+#### Cause of Failure
+
+=> https://github.com/genenetwork/genenetwork2/blob/c057054b69e673108410894ce87c5059aebb7b68/gn2/wqflask/collect.py#L81-L84
+The first fetch with `oauth2_get` fails when the user is not logged in, raising a "MissingToken" exception. This is then handled by the default handlers, and we are presented with the login prompt.
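+
+A gentler behaviour would be to catch the failure at the call site and fall back to an anonymous default, instead of invoking the default "please log in" handler. A minimal sketch, assuming authlib's OAuthError covers the MissingTokenError seen above; the endpoint string and the empty-list fallback are hypothetical:
+
+```
+from authlib.integrations.base_client import OAuthError
+
+def user_collections_or_default(oauth2_get):
+    """Sketch: degrade gracefully for anonymous users.
+
+    `oauth2_get` is the helper linked above; the endpoint and the
+    fallback value are hypothetical.
+    """
+    try:
+        return oauth2_get("auth/user/collections")
+    except OAuthError:  # raised as MissingTokenError when logged out
+        return []
+```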
diff --git a/issues/genenetwork2/fix-display-for-time-consumed-for-correlations.gmi b/issues/genenetwork2/fix-display-for-time-consumed-for-correlations.gmi new file mode 100644 index 0000000..0c8e9c8 --- /dev/null +++ b/issues/genenetwork2/fix-display-for-time-consumed-for-correlations.gmi @@ -0,0 +1,15 @@ +# Fix Display for the Time Consumed for Correlations + +## Tags + +* type: bug +* status: closed, completed +* priority: low +* assigned: @alexm, @bonz +* keywords: gn2, genenetwork2, genenetwork 2, gn3, genenetwork3 genenetwork 3, correlations, time display + +## Description + +The breakdown of the time consumed for the correlations computations, displayed at the bottom of the page, is not representative of reality. The time that GeneNetwork3 (or background process) takes for the computations is not actually represented in the breakdown, leading to wildly inaccurate displays of total time. + +This will need to be fixed. diff --git a/issues/genenetwork/genenetwork2_configurations.gmi b/issues/genenetwork2/genenetwork2_configurations.gmi index 7d08db0..4ba0a89 100644 --- a/issues/genenetwork/genenetwork2_configurations.gmi +++ b/issues/genenetwork2/genenetwork2_configurations.gmi @@ -4,7 +4,7 @@ * assigned: fredm * priority: normal -* status: open +* status: closed, obsoleted * keywords: configuration, config, gn2, genenetwork, genenetwork2 * type: bug @@ -72,3 +72,10 @@ For `wqflask/run_gunicorn.py`, the route can remain as is, since this is an entr ### Non-Executable Configuration Files Eschew executable formats (*.py) for configuration files and prefer non-executable formats e.g. *.cfg, *.json, *.conf etc + + +## Closed as obsoleted + +I am closing this issue as obsoleted, since a lot of things have changed since this issue was set up. The `bin/genenetwork2` script no longer exists and most of the paths mentioned have changed. + +The configuration issue(s) mentioned above still abound, but the changes will have to be incremental to avoid breaking the system. diff --git a/issues/genenetwork2/gn-libs-update.gmi b/issues/genenetwork2/gn-libs-update.gmi new file mode 100644 index 0000000..efb9c19 --- /dev/null +++ b/issues/genenetwork2/gn-libs-update.gmi @@ -0,0 +1,38 @@ +# Update gn-libs in CD container + +* assigned: bonfacem +* priority: high +* status: to-do +* deadline: <2026-01-07 Wed> + +## Description + +GN2 tests fail: + +``` +==================================== ERRORS ==================================== +____________ ERROR collecting gn2/tests/unit/base/test_data_set.py _____________ +ImportError while importing test module '/var/lib/laminar/run/genenetwork2/2358/genenetwork2/gn2/tests/unit/base/test_data_set.py'. +Hint: make sure your test modules/packages have valid Python names. +Traceback: +gn2/tests/unit/base/test_data_set.py:8: in <module> + from gn2.wqflask import app +gn2/wqflask/__init__.py:45: in <module> + from gn_libs.logging import setup_logging, setup_modules_logging +E ModuleNotFoundError: No module named 'gn_libs.logging' +______________ ERROR collecting gn2/tests/unit/base/test_trait.py ______________ +ImportError while importing test module '/var/lib/laminar/run/genenetwork2/2358/genenetwork2/gn2/tests/unit/base/test_trait.py'. +Hint: make sure your test modules/packages have valid Python names. 
+Traceback:
+gn2/tests/unit/base/test_trait.py:6: in <module>
+    from gn2.base.trait import GeneralTrait
+gn2/base/trait.py:3: in <module>
+    from gn2.wqflask import app
+gn2/wqflask/__init__.py:45: in <module>
+    from gn_libs.logging import setup_logging, setup_modules_logging
+E   ModuleNotFoundError: No module named 'gn_libs.logging'
+```
+
+Note that a "gn-guile" definition has been added to gn-machines.
+
+* closed
diff --git a/issues/genenetwork2/handle-oauth-errors-better.gmi b/issues/genenetwork2/handle-oauth-errors-better.gmi
index 462ded5..77ad7ad 100644
--- a/issues/genenetwork2/handle-oauth-errors-better.gmi
+++ b/issues/genenetwork2/handle-oauth-errors-better.gmi
@@ -3,7 +3,7 @@ ## Tags
 
 * type: bug
-* status: open
+* status: closed, completed
 * priority: high
 * assigned: fredm
 * interested: zachs, robw
@@ -15,3 +15,7 @@ When a session expires, for whatever reason, a notification is displayed to the
 
 => ./session_expiry_oauth_error.png
 
 The message is a little jarring to the end user. Make it gentler, and probably more informative, so the user is not as surprised.
+
+## Close as complete
+
+This should be fixed at this point. Closing this as complete.
diff --git a/issues/genenetwork2/mapping-error.gmi b/issues/genenetwork2/mapping-error.gmi
index 2e28491..7e7d0a7 100644
--- a/issues/genenetwork2/mapping-error.gmi
+++ b/issues/genenetwork2/mapping-error.gmi
@@ -3,7 +3,7 @@ ## Tags
 
 * type: bug
-* status: open
+* status: closed
 * priority: medium
 * assigned: zachs, fredm, flisso
 * keywords: gn2, genenetwork2, genenetwork 2, mapping
@@ -49,3 +49,18 @@ TypeError: 'NoneType' object is not iterable
 
 ### Updates
 
 This is likely just because the genotype file doesn't exist in the necessary format (BIMBAM). We probably need to convert the R/qtl2 genotypes to BIMBAM.
+
+## Stalled
+
+This is currently stalled until we can upload genotypes via the uploader.
+
+## Notes
+
+### 2025-12-31
+
+I am closing this issue as WONTFIX for the following reasons:
+
+- A better fix is to prevent mapping in the first place, if no genotypes exist for the given trait(s)
+- The issue relies on a not-yet-implemented feature (genotypes upload) for its fix
+- The issue does not exist on production
diff --git a/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi b/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi
new file mode 100644
index 0000000..e38f653
--- /dev/null
+++ b/issues/genenetwork2/mechanical-rob-add-partial-correlations-tests.gmi
@@ -0,0 +1,22 @@
+# mechanical-rob: Add Partial Correlations Tests
+
+## Tags
+
+* assigned: fredm
+* priority: medium
+* status: open
+* keywords: genenetwork2, gn2, mechanical-rob, partial correlations, tests, regression
+* type: enhancement
+
+## Description
+
+Add regression tests to verify that the partial correlations feature still works as expected.
+
+### TODOS
+
+- [-] Tests for "entry-point" page
+- [x] Tests for partial correlation using Pearson's R against select traits
+- [ ] Tests for partial correlation using Spearman's Rho against select traits
+- [ ] Tests for partial correlation using Pearson's R against an entire dataset
+- [ ] Tests for partial correlation using Spearman's Rho against an entire dataset
diff --git a/issues/genenetwork2/refresh-token-failure.gmi b/issues/genenetwork2/refresh-token-failure.gmi
index dd33341..c488820 100644
--- a/issues/genenetwork2/refresh-token-failure.gmi
+++ b/issues/genenetwork2/refresh-token-failure.gmi
@@ -2,7 +2,7 @@ ## Tags
 
-* status: open
+* status: closed, obsoleted
 * priority: high
 * type: bug
 * assigned: fredm, zsloan, zachs
@@ -106,3 +106,6 @@ The following commits were done as part of the troubleshooting:
 
 => https://git.genenetwork.org/guix-bioinformatics/commit/?id=955e4ce9370be9811262d7c73fa5398385cc04d8
+
+# Closed as Obsolete
+
+We no longer rely on refresh tokens. This issue is no longer present.
diff --git a/issues/genenetwork2/remove-bin-genenetwork2-script.gmi b/issues/genenetwork2/remove-bin-genenetwork2-script.gmi
new file mode 100644
index 0000000..da11be7
--- /dev/null
+++ b/issues/genenetwork2/remove-bin-genenetwork2-script.gmi
@@ -0,0 +1,114 @@
+# Remove `bin/genenetwork2` Script
+
+## Tags
+
+* type: improvement
+* status: closed, completed
+* priority: medium
+* assigned: fredm, bonfacem, alexm, zachs
+* interested: pjotrp, aruni
+* keywords: gn2, bin/genenetwork2, startup script
+
+## Description
+
+The `bin/genenetwork2` script was used for a really long time to launch Genenetwork2, and has served that purpose with honour and dedication. We applaud that.
+
+It is, however, time to retire the script, since at this point in time it serves more to obfuscate the startup than as a helpful tool.
+
+On production, we have all but abandoned the use of the script, and we need to do the same for CI/CD and, eventually, development.
+
+This issue tracks the process, and the problems that come up during the move to retire the script.
+
+### Process
+
+* [x] Identify how to run unit tests without the script
+* [x] Document how to run unit tests without the script
+* [x] Identify how to run mechanical-rob tests without the script
+* [x] Document how to run mechanical-rob tests without the script
+* [x] Update CI/CD definitions to get rid of the references to the script
+* [x] Delete the script from the repository
+
+### Setup
+
+First, we need to set up the following mandatory environment variables:
+
+* GN2_PROFILE
+* GN2_SETTINGS
+* JS_GUIX_PATH
+* GEMMA_COMMAND
+* PLINK_COMMAND
+* GEMMA_WRAPPER_COMMAND
+* REQUESTS_CA_BUNDLE
+
+Within a guix shell, you could do that with something like:
+
+```
+export GN2_PROFILE="${GUIX_ENVIRONMENT}"
+export GN2_SETTINGS="/home/frederick/genenetwork/gn2_settings.conf"
+export JS_GUIX_PATH="${GN2_PROFILE}/share/genenetwork2/javascript"
+export GEMMA_COMMAND="${GN2_PROFILE}/bin/gemma"
+export PLINK_COMMAND="${GN2_PROFILE}/bin/plink2"
+export GEMMA_WRAPPER_COMMAND="${GN2_PROFILE}/bin/gemma-wrapper"
+export REQUESTS_CA_BUNDLE="${GUIX_ENVIRONMENT}/etc/ssl/certs/ca-certificates.crt"
+```
+
+Note that you can define all the variables derived from "GN2_PROFILE" in your settings file, if such a settings file is computed.
+
+### Running Unit Tests
+
+To run unit tests, run pytest at the root of the repository.
+
+```
+$ cd /path/to/genenetwork2
+$ pytest
+```
+
+### Running "mechanical-rob" Tests
+
+At the root of the repository, run something like:
+
+```
+python test/requests/test-website.py --all http://localhost:5033
+```
+
+Change the port as appropriate.
+
+### Launching the Application
+
+In addition to the minimum set of envvars defined in the "Setup" section above, we need the following variable defined to get the application to launch:
+
+* FLASK_APP
+
+In a guix shell, you could do:
+
+```
+export FLASK_APP="gn2.wsgi"
+```
+
+Now you can launch the application with flask with something like:
+
+```
+flask run --port=5033 --with-threads
+```
+
+or with green unicorn with something like:
+
+```
+gunicorn --reload \
+    --workers 3 \
+    --timeout 1200 \
+    --log-level="debug" \
+    --keep-alive 6000 \
+    --max-requests 10 \
+    --bind="127.0.0.1:5033" \
+    --max-requests-jitter 5 \
+    gn2.wsgi:application
+```
+
+You can change the gunicorn settings to fit your scenario.
+
+## Close as completed
+
+The script has been deleted.
diff --git a/issues/genenetwork2/viewing-dataset-error.gmi b/issues/genenetwork2/viewing-dataset-error.gmi
new file mode 100644
index 0000000..6910275
--- /dev/null
+++ b/issues/genenetwork2/viewing-dataset-error.gmi
@@ -0,0 +1,65 @@
+# Unable to View Dataset Metadata
+
+* assigned: bonfacem
+* priority: high
+* status: to-do
+* deadline: <2026-01-08 Thu>
+
+## Description
+
+Heading to:
+
+=> https://cd.genenetwork.org/datasets/BXDPublish
+
+We get the following stack trace:
+
+```
+ GeneNetwork 3.12-rc1 https://genenetwork.org/datasets/BXDPublish
+Traceback (most recent call last):
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/flask/app.py", line 917, in full_dispatch_request
+    rv = self.dispatch_request()
+         ^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/flask/app.py", line 902, in dispatch_request
+    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/gn2/wqflask/views.py", line 1565, in get_dataset
+    return render_template(
+           ^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/flask/templating.py", line 150, in render_template
+    return _render(app, template, context)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/flask/templating.py", line 131, in _render
+    rv = template.render(context)
+         ^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/jinja2/environment.py", line 1301, in render
+    self.environment.handle_exception()
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/jinja2/environment.py", line 936, in handle_exception
+    raise rewrite_traceback_stack(source=source)
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/gn2/wqflask/templates/dataset.html", line 1, in top-level template code
+    {% extends "index_page.html" %}
+    ^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/gn2/wqflask/templates/index_page.html", line 1, in top-level template code
+    {% extends "base.html" %}
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/gn2/wqflask/templates/base.html", line 337, in top-level template code
+    {% block content %}{% endblock %}
+    ^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/gn2/wqflask/templates/dataset.html", line 66, in block 'content'
+    <a href="https://git.genenetwork.org/gn-docs/log/general/datasets/{{ dataset.id.split('/')[-1] }}" target="_blank">History</a>
+    ^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/gnu/store/9515mkxb3zq721b8mnh9xgjvvgmvbavq-profile/lib/python3.11/site-packages/jinja2/environment.py", line 485, in getattr
+    return getattr(obj, attribute)
+           ^^^^^^^^^^^^^^^^^^^^^^^
+jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'id'
+```
+
+## Resolution
+
+=> https://github.com/genenetwork/genenetwork3/commit/7706990fbfd5e13617298999a5e30b6e8a4ed610
+
+=> https://github.com/genenetwork/genenetwork3/pull/239
+
+* closed
diff --git a/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip b/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip
new file mode 100644
index 0000000..7681b88
--- /dev/null
+++ b/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip
Binary files differ
diff --git a/issues/genenetwork3/broken-aliases.gmi b/issues/genenetwork3/broken-aliases.gmi
index 5735a1c..2bfbdae 100644
--- a/issues/genenetwork3/broken-aliases.gmi
+++ b/issues/genenetwork3/broken-aliases.gmi
@@ -5,23 +5,184 @@
 * type: bug
 * status: open
 * priority: high
-* assigned: fredm
+* assigned: pjotrp
 * interested: pjotrp
 * keywords: aliases, aliases server
 
+## Tasks
+
+* [X] Rewrite server in gn-guile
+* [X] Fix menu search
+* [X] Fix global search aliases
+* [ ] Deploy and test aliases in GN2
 
 ## Repository
 
 => https://github.com/genenetwork/gn3
 
+moved to the gn-guile repo.
+
 ## Bug Report
 
 ### Actual
 
 * Go to https://genenetwork.org/gn3/gene/aliases2/Shh,Brca2
-* Not that an exception is raised, with a "404 Not Found" message
+* Note that an exception is raised, with a "404 Not Found" message
 
 ### Expected
 
 * We expected a list of aliases to be returned for the given symbols as is done in https://fallback.genenetwork.org/gn3/gene/aliases2/Shh,Brca2
 
+## Resolution
+
+Actually the server is up, but it is not part of the main deployment because it is written in Racket - and we don't have much Racket support in Guix. I wrote the code in the days after my bike accident:
+
+=> https://github.com/genenetwork/gn3/blob/master/gn3/web/wikidata.rkt
+
+and it is probably easiest to move it to gn-guile. Guile is another Scheme after all ;). Only fitting that I spent days in hospital again recently (for a different reason). gn-guile already has its own web server and provides a REST API for our markdown editor, for example. On tux04 it responds with
+
+```
+curl http://127.0.0.1:8091/version
+"4.0.0"
+```
+
+What we want is to add the aliases server that should respond to
+
+```
+curl http://localhost:8000/gene/aliases/Shh # direct on tux01
+["9530036O11Rik","Dsh","Hhg1","Hx","Hxl3","M100081","ShhNC","ShhNC"]
+curl https://genenetwork.org/gn3/gene/aliases2/Shh,Brca2
+[["Shh",["9530036O11Rik","Dsh","Hhg1","Hx","Hxl3","M100081","ShhNC","ShhNC"]],["Brca2",["Fancd1","RAB163"]]]
+```
+
+Note this is used by search functionality in GN, as well as the gene aliases list on the mapping page. In principle we cache it for the duration of the running server so as not to overload wikidata.
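+As a sketch of that caching idea: gn-guile is Scheme, so the Python below is purely illustrative of process-lifetime memoisation, with a stand-in for the SPARQL round trip:
+
+```
+from functools import lru_cache
+
+def fetch_aliases_from_wikidata(symbol):
+    """Stand-in for the wikidata SPARQL round trip."""
+    raise NotImplementedError
+
+@lru_cache(maxsize=None)
+def gene_aliases(symbol):
+    # The first call per symbol hits wikidata; repeats are served from
+    # memory for the lifetime of the server process.
+    return tuple(fetch_aliases_from_wikidata(symbol))
+```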
No one uses aliases2, that I can tell, so we only implement the first 'aliases'. + +Note the wikidata interface has been stable all this time. That is good. + +Turns out we already use wikidata in the gn-guile implementation for fetching the wikidata id for a species (as part of metadata retrieval). I wrote that about two years ago as part of the REST API expansion. + +Unfortunately + +``` +(sparql-scm (wd-sparql-endpoint-url) (wikidata-gene-alias "Q24420953")) +``` + +throws a 403 forbidden error. + +This however works: + +``` +scheme@(gn db sparql) [15]> (sparql-wd-species-info "Q83310") +;;; ("https://query.wikidata.org/sparql?query=%0ASELECT%20DISTINCT%20%3Ftaxon%20%3Fncbi%20%3Fdescr%20where%20%7B%0A%20%20%20%20wd%3AQ83310%20wdt%3AP225%20%3Ftaxon%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP685%20%3Fncbi%20%3B%0A%20%20%20%20%20%20schema%3Adescription%20%3Fdescr%20.%0A%20%20%20%20%3Fspecies%20wdt%3AP685%20%3Fncbi%20.%0A%20%20%20%20FILTER%20%28lang%28%3Fdescr%29%3D%27en%27%29%0A%7D%20limit%205%0A%0A") +$11 = "?taxon\t?ncbi\t?descr\n\"Mus musculus\"\t\"10090\"\t\"species of mammal\"@en\n" +``` + +(if you can see the mouse ;). + +Ah, this works + +``` +scheme@(gn db sparql) [17]> (sparql-tsv (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" )) +;;; ("https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fwikidata_id%0A%20%20%20%20%20%20%20%20%20%20%20%20WHERE%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20wdt%3AP31%20wd%3AQ7187%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP703%20%3Fspecies%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20VALUES%20%28%3Fspecies%29%20%7B%20%28wd%3AQ15978631%20%29%20%28%20wd%3AQ83310%20%29%20%28%20wd%3AQ184224%20%29%20%7D%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20rdfs%3Alabel%20%22Shh%22%40en%20.%0A%20%20%20%20%20%20%20%20%7D%0A") +$12 = "?wikidata_id\n<http://www.wikidata.org/entity/Q14860079>\n<http://www.wikidata.org/entity/Q24420953>\n" +``` + +But this does not + +``` +scheme@(gn db sparql) [17]> (sparql-scm (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" )) +ice-9/boot-9.scm:1685:16: In procedure raise-exception: +In procedure utf8->string: Wrong type argument in position 1 (expecting bytevector): "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx/1.18.0</center>\r\n</body>\r\n</html>\r\n" +``` + +Going via tsv does work + +``` +scheme@(gn db sparql) [18]> (tsv->scm (sparql-tsv (wd-sparql-endpoint-url) (wikidata-query-geneids "Shh" ))) + +;;; ("https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fwikidata_id%0A%20%20%20%20%20%20%20%20%20%20%20%20WHERE%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20wdt%3AP31%20wd%3AQ7187%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP703%20%3Fspecies%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20VALUES%20%28%3Fspecies%29%20%7B%20%28wd%3AQ15978631%20%29%20%28%20wd%3AQ83310%20%29%20%28%20wd%3AQ184224%20%29%20%7D%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fwikidata_id%20rdfs%3Alabel%20%22Shh%22%40en%20.%0A%20%20%20%20%20%20%20%20%7D%0A") +$13 = ("?wikidata_id") +$14 = (("<http://www.wikidata.org/entity/Q14860079>") ("<http://www.wikidata.org/entity/Q24420953>")) +``` + +that is nice enough. + +We now got a working alias server that is part of gn-guile. E.g. 
+
+```
+curl http://127.0.0.1:8091/gene/aliases/Brca2
+["breast cancer 2","breast cancer 2, early onset","Fancd1","RAB163","BRCA2, DNA repair associated"]
+```
+
+gn-guile also has the 'commit/' handler by Alex, documented as 'curl -X POST http://127.0.0.1:8091/commit' in git-markdown-editor.md. Let's see how that is wired up. The web interface is at, for example, https://genenetwork.org/editor/edit?file-path=general/help/facilities.md. Part of gn2's
+
+```
+gn2/wqflask/views.py
+398:@app.route("/editor/edit", methods=["GET"])
+408:@app.route("/editor/settings", methods=["GET"])
+414:@app.route("/editor/commit", methods=["GET", "POST"])
+```
+
+which has the code
+
+```
+@app.route("/editor/edit", methods=["GET"])
+@require_oauth2
+def edit_gn_doc_file():
+    file_path = urllib.parse.urlencode(
+        {"file_path": request.args.get("file-path", "")})
+    response = requests.get(f"http://localhost:8091/edit?{file_path}")
+    response.raise_for_status()
+    return render_template("gn_editor.html", **response.json())
+```
+
+running over localhost. This is unfortunately hard coded, and we should change that! In the guix system configuration it is already a variable, as 'genenetwork-configuration-gn-guile-port 8091'. gn-guile should also be visible from outside, so that is a separate configuration.
+
+Also, I note that the mapping page does three requests to wikidata (for mouse, rat and human). That could really be one.
+
+# Search
+
+Aliases are also used in search. You can tell that aliases are not used when GN search renders too few results. When aliases work, we expect to list '2310010I16Rik' with
+
+=> https://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=sh*&search_terms_and=&FormID=searchResult
+
+Sheepdog tests for that and it has been failing for a while.
+
+Global search finds way more results, but also lacks that alias! Meanwhile GN1 does find that alias for record 1431728_at. GN2 finds it with hippocampus mRNA
+
+=> https://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=1431728_at%0D%0A&search_terms_and=&accession_id=None&FormID=searchResult
+
+in standard search.
+But neither 1431728_at nor '2310010I16Rik' has a hit in *global* search, and the result for Shh should include the record in both search systems.
+
+# Deploy
+
+We introduced a new environment variable that does not show up on CD, part of the mapping page:
+
+=>
+
+In the logs on /export2:
+
+```
+root@tux02:/export2/guix-containers/genenetwork-development/var/log/cd# tail -100 genenetwork2.log
+2025-07-20 04:19:43   File "/genenetwork2/gn2/base/trait.py", line 157, in wikidata_alias_fmt
+2025-07-20 04:19:43     GN_GUILE_SERVER_URL + "gene/aliases/" + self.symbol.upper())
+2025-07-20 04:19:43 NameError: name 'GN_GUILE_SERVER_URL' is not defined
+```
+
+One thing I ran into is http://genenetwork.org/gn3-proxy/ - what is that for?
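+Coming back to the hard-coded editor URL above: here is a minimal sketch of reading the gn-guile base URL from the Flask configuration instead, assuming a GN_GUILE_SERVER_URL setting is wired into the config (the deploy log above suggests that was the intent); `app` and `require_oauth2` are as in the snippet above:
+
+```
+import urllib.parse
+import requests
+from flask import current_app, render_template, request
+
+@app.route("/editor/edit", methods=["GET"])
+@require_oauth2
+def edit_gn_doc_file():
+    file_path = urllib.parse.urlencode(
+        {"file_path": request.args.get("file-path", "")})
+    # e.g. GN_GUILE_SERVER_URL = "http://localhost:8091/" in the settings file
+    base_url = current_app.config["GN_GUILE_SERVER_URL"]
+    response = requests.get(f"{base_url}edit?{file_path}")
+    response.raise_for_status()
+    return render_template("gn_editor.html", **response.json())
+```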
+ +## Deploy Updates: 2025-08-15 +=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=269f99f1e1f0c253ecdd99f04bc7c6697012b0aa Update commit of gn-guile used on production + +This does not fix the issue on https://gn2-fred.genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P; instead we get + +``` +fredm@tux04:~$ curl http://localhost:8091/gene/aliases/Brca2 +Resource not found: /gene/aliases/Brca2 +``` diff --git a/issues/genenetwork3/ctl-maps-error.gmi b/issues/genenetwork3/ctl-maps-error.gmi new file mode 100644 index 0000000..6726357 --- /dev/null +++ b/issues/genenetwork3/ctl-maps-error.gmi @@ -0,0 +1,46 @@ +# CTL Maps Error + +## Tags + +* type: bug +* status: open +* priority: high +* assigned: alexm, zachs, fredm +* keywords: CTL, CTL Maps, gn3, genenetwork3, genenetwork 3 + +## Description + +Trying to run the CTL Maps feature in the collections page as described in +=> /issues/genenetwork2/broken-collections-feature + +We get an error in the results page of the form: + +``` +{'error': '{\'code\': 1, \'output\': \'Loading required package: MASS\\nLoading required package: parallel\\nLoading required package: qtl\\nThere were 13 warnings (use warnings() to see them)\\nError in xspline(x, y, shape = 0, lwd = lwd, border = col, lty = lty, : \\n invalid value specified for graphical parameter "lwd"\\nCalls: ctl.lineplot -> draw.spline -> xspline\\nExecution halted\\n\'}'} +``` + +On the CLI the same error is rendered: +``` +Loading required package: MASS +Loading required package: parallel +Loading required package: qtl +There were 13 warnings (use warnings() to see them) +Error in xspline(x, y, shape = 0, lwd = lwd, border = col, lty = lty, : + invalid value specified for graphical parameter "lwd" +Calls: ctl.lineplot -> draw.spline -> xspline +Execution halted +``` + +On my local development machine, the command run was +``` +Rscript /home/frederick/genenetwork/genenetwork3/scripts/ctl_analysis.R /tmp/01828928-26e6-4cad-bbc8-59fd7a7977de.json +``` + +Here is a zipped version of the json file (follow the link and click download): +=> https://github.com/genenetwork/gn-gemtext-threads/blob/main/issues/genenetwork3/01828928-26e6-4cad-bbc8-59fd7a7977de.json.zip + +After troubleshooting for a while, I suspect +=> https://github.com/genenetwork/genenetwork3/blob/27d9c9d6ef7f37066fc63af3d6585bf18aeec925/scripts/ctl_analysis.R#L79-L80 this is the offending code. + +=> https://cran.r-project.org/web/packages/ctl/ctl.pdf The manual for the ctl library +indicates that our call above might be okay, which might mean something changed in the dependencies that the ctl library uses. diff --git a/issues/genenetwork/genenetwork3_configuration.gmi b/issues/genenetwork3/genenetwork3_configuration.gmi index fcab572..cdd7c15 100644 --- a/issues/genenetwork/genenetwork3_configuration.gmi +++ b/issues/genenetwork3/genenetwork3_configuration.gmi @@ -1,10 +1,10 @@ -# Genenetwork2 Configurations +# Genenetwork3 Configurations ## Tags * assigned: fredm * priority: normal -* status: open +* status: closed, completed * keywords: configuration, config, gn2, genenetwork, genenetwork2 * type: bug @@ -13,3 +13,7 @@ The configuration file should only ever contain settings, and no code. Remove all code from the default settings file. Eschew executable formats (*.py) for configuration files and prefer non-executable formats e.g.
*.cfg, *.json, *.conf etc + +## Closed as Completed + +See commit https://github.com/genenetwork/genenetwork3/commit/977efbb54da284fb3e8476f200206d00cb8e64cd diff --git a/issues/genenetwork3/generate-heatmaps-failing.gmi b/issues/genenetwork3/generate-heatmaps-failing.gmi index 03256e6..522dc27 100644 --- a/issues/genenetwork3/generate-heatmaps-failing.gmi +++ b/issues/genenetwork3/generate-heatmaps-failing.gmi @@ -28,3 +28,37 @@ On https://gn2-fred.genenetwork.org the heatmaps fails with a note ("ERROR: unde => https://github.com/scipy/scipy/issues/19972 This issue should not be present with python-plotly@5.20.0 but since guix-bioinformatics pins the guix version to `b0b988c41c9e0e591274495a1b2d6f27fcdae15a`, we are not able to pull in newer versions of packages from guix. + + +### Update 2025-04-08T10:59CDT + +Got the following error when I ran the background command manually: + +``` +$ export RUST_BACKTRACE=full +$ /gnu/store/dp4zq4xiap6rp7h6vslwl1n52bd8gnwm-profile/bin/qtlreaper --geno /home/frederick/genotype_files/genotype/genotype/BXD.geno --n_permutations 1000 --traits /tmp/traits_test_file_n2E7V06Cx7.txt --main_output /tmp/qtlreaper/main_output_NGVW4sfYha.txt --permu_output /tmp/qtlreaper/permu_output_MJnzLbrsrC.txt +thread 'main' panicked at src/regression.rs:216:25: +index out of bounds: the len is 20 but the index is 20 +stack backtrace: + 0: 0x61399d77d46d - <unknown> + 1: 0x61399d7b5e13 - <unknown> + 2: 0x61399d78b649 - <unknown> + 3: 0x61399d78f26f - <unknown> + 4: 0x61399d78ee98 - <unknown> + 5: 0x61399d78f815 - <unknown> + 6: 0x61399d77d859 - <unknown> + 7: 0x61399d77d679 - <unknown> + 8: 0x61399d78f3f4 - <unknown> + 9: 0x61399d6f4063 - <unknown> + 10: 0x61399d6f41f7 - <unknown> + 11: 0x61399d708f18 - <unknown> + 12: 0x61399d6f6e4e - <unknown> + 13: 0x61399d6f9e93 - <unknown> + 14: 0x61399d6f9e89 - <unknown> + 15: 0x61399d78e505 - <unknown> + 16: 0x61399d6f8d55 - <unknown> + 17: 0x75ee2b945bf7 - __libc_start_call_main + 18: 0x75ee2b945cac - __libc_start_main@GLIBC_2.2.5 + 19: 0x61399d6f4861 - <unknown> + 20: 0x0 - <unknown> +``` + +This looks like a classic off-by-one in qtlreaper: the code indexes position 20 in a length-20 array. diff --git a/issues/genenetwork3/rqtl2-mapping-error.gmi b/issues/genenetwork3/rqtl2-mapping-error.gmi new file mode 100644 index 0000000..b43d66f --- /dev/null +++ b/issues/genenetwork3/rqtl2-mapping-error.gmi @@ -0,0 +1,46 @@ +# R/qtl2 Maps Error + +## Tags + +* type: bug +* status: closed, completed +* priority: high +* assigned: alexm, zachs, fredm +* keywords: R/qtl2, R/qtl2 Maps, gn3, genenetwork3, genenetwork 3 + +## Reproduce + +* Go to https://genenetwork.org/ +* In the "Get Any" field, enter "synap*" and press the "Enter" key +* In the search results, click on the "1435464_at" trait +* Expand the "Mapping Tools" accordion section +* Select the "R/qtl2" option +* Click "Compute" +* In the "Computing the Maps" page that results, click on "Display System Log" + +### Observed + +A traceback is observed, with an error of the following form: + +``` +⋮ +FileNotFoundError: [Errno 2] No such file or directory: '/opt/gn/tmp/gn3-tmpdir/JL9PvKm3OyKk.txt' +``` + +### Expected + +The mapping runs successfully and the results are presented in the form of a mapping chart/graph and a table of values. + +### Debug Notes + +The directory "/opt/gn/tmp/gn3-tmpdir/" exists, and is actually used by other mappings (i.e. the "R/qtl" and "Pair Scan" mappings) successfully.
+ +This might imply a code issue: Perhaps +* a path is hardcoded, or +* the wrong path value is passed + +The same error occurs on https://cd.genenetwork.org but does not seem to prevent CD from running the mapping to completion. Maybe something is missing on production — what, though? + +## Closed as Completed + +This seems fixed now. diff --git a/issues/genetics/speeding-up-gemma.gmi b/issues/genetics/speeding-up-gemma.gmi new file mode 100644 index 0000000..91bab17 --- /dev/null +++ b/issues/genetics/speeding-up-gemma.gmi @@ -0,0 +1,492 @@ +# Speeding up GEMMA + +GEMMA is slow, but usually fast enough. Earlier I wrote gemma-wrapper to speed things up. In genenetwork.org, by using gemma-wrapper with LOCO, most traits are mapped in a few seconds on a large server (30 individuals x 200K markers). By expanding markers to over 1 million, however, runtimes degrade to 6 minutes. Increasing the number of individuals to 1000 may slow mapping down to hour(s). As we are running 'precompute' on 13K traits - and soon maybe millions - it would be beneficial to reduce runtimes again. + +One thing to look at is Sen's bulklmm. It can do phenotypes in parallel, provided there is no missing data. This is perfect for permutations which we'll also do. For multiple phenotypes it is a bit tricky however, because you'll have to mix and match experiments to show the same individuals (read: samples). + +So the approach is to first analyze steps in GEMMA and see where it is particularly inefficient. Maybe we can do something about that. I note I started the pangemma effort (and the mgamma effort before). The idea is to use a propagator network for incremental improvements and also to introduce a new build system and testing framework. In parallel we'll try to scale out on HPC using Arun's ravanan software. + +There is no such thing as a free lunch. So, let's dive in. + +# Description + +# Tags + +* assigned: pjotrp +* type: feature +* priority: high + +# Tasks + +* [X] Try gzipped version +* [X] Run without debug +* [ ] Use lmdb for genotypes +* - [X] convert genotypes to lmdb +* - [X] replace GEMMA ReadGenotypes +* - [X] replace reading genotypes in AnalyzeBimbam +* - [+] Apply similar SNP filtering as the original +* - [X] Add SNP info to the Geno file +* - [X] Try different geno encodings +* - [+] Fix support for NAs - also in compute +* [X] Use lmdb for SNPs (probably part of Geno file) +* [X] Match output +* [ ] Write lmdb for output with filter +* [X] Optimize openblas for target architecture +* [ ] Use profiler +* [ ] Hash genotypes? Try buf.hash or xxhash +* [ ] Skip highly correlated markers with backtracking +* [ ] Perhaps try a faster malloc library for GEMMA +* [ ] Fix sqrt(NaN) when running big file example with -debug +* [ ] Fix/check assumption that geno is between 0 and 2 +* [ ] Try 64-bit integer index for lmdb +* [ ] Other improvements... + +# Summary + +Convert a geno file to mdb with + +``` +./bin/anno2mdb.rb mouse_hs1940.anno.txt +./bin/geno2mdb.rb mouse_hs1940.geno.txt --anno mouse_hs1940.anno.txt.mdb --eval Gf # convert to floating point +real 0m14.042s +user 0m12.639s +sys 0m0.402s +``` + +``` +../bin/anno2mdb.rb snps-matched.txt +../bin/geno2mdb.rb pangenome-13M-genotypes.txt --geno-json bxd_inds.list.json --anno snps-matched.txt.mdb --eval Gf +../bin/geno2mdb.rb pangenome-13M-genotypes.txt --geno-json bxd_inds.list.json --anno snps-matched.txt.mdb --eval Gb +``` + +Even with floats, a 30G pangenome genotype file got reduced to 12G. A quick full run of the mdb version takes 6 minutes.
That is a massive 3x speedup. It also used less RAM (because it is one process instead of 20) and had 40x core usage, much of it in the Linux kernel: + +``` +/bin/time -v ./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -debug +LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib /bin/time -v ./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -no-check +real 5m47.587s +user 39m33.796s +sys 211m1.143s + +Command being timed: "./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt.mdb -lmm 9 -maf 0.1 -n 2 -no-check" + User time (seconds): 2169.77 + System time (seconds): 11919.04 + Percent of CPU this job got: 3919% + Elapsed (wall clock) time (h:mm:ss or m:ss): 5:59.48 + Maximum resident set size (kbytes): 13377040 +``` + +As we only read the genotype file once, this shows how much of the work is IO bound! Moving to lmdb was the right choice to speed up pangemma. + +Old gemma does: + +``` + Command being timed: "/bin/gemma -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt.gz -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -no-check" + User time (seconds): 2017.25 + System time (seconds): 62.21 + Percent of CPU this job got: 240% + Elapsed (wall clock) time (h:mm:ss or m:ss): 14:24.17 + Maximum resident set size (kbytes): 9736884 +``` + +So we are at 3x speed. + +With Gb byte encoding the file got further reduced from 13Gb to 4Gb. + +What is more exciting is that LOCO now runs in 30s - compared to gemma's earlier 6 minutes, so that is over 10x speed, using about 1/3 of the RAM. Note the CPU usage: + +``` + Command being timed: "./build/bin/Release/gemma -k tmp/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.3.cXX.txt.cXX.txt -p tmp/pheno.json.txt -g tmp/pangenome-13M-genotypes.txt-Gb.mdb -loco 2 -lmm 9 -maf 0.1 -n 2 -no-check" + User time (seconds): 177.81 + System time (seconds): 934.92 + Percent of CPU this job got: 3391% + Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.80 + Maximum resident set size (kbytes): 4326308 +``` + +It looks like disk IO is no longer the bottleneck. The Gb version is much smaller than Gf, but runtime is only slightly better. So it is time for the profiler to find how we can make use of the other cores! But, for now, I am going to focus on getting the pipeline set up with ravanan. + +# Analysis + +As a test case we'll take one of the runs: + +``` +time -v /bin/gemma -loco 11 -k /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -o 680029457111fdd460990f95853131c87ea20c57.11.assoc.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -outdir /export2/data/wrk/services/gemma-wrapper/tmp/tmp/panlmm/d20251111-588798-f81icw +``` + +which I simplify to + +``` +/bin/time -v /bin/gemma -loco 11 -k 93f6b39ec06c09fb9ba9ca628b5fb990921b6c60.11.cXX.txt.cXX.txt -p pheno.json.txt -g pangenome-13M-genotypes.txt -a snps-matched.txt -lmm 9 -maf 0.1 -n 2 -debug +Reading Files ...
+number of total individuals = 143 +number of analyzed individuals = 20 +number of total SNPs/var = 13209385 +number of SNPS for K = 12376792 +number of SNPS for GWAS = 832593 +number of analyzed SNPs = 13111938 +``` + +The timer says: + +``` +User time (seconds): 365.33 +System time (seconds): 16.59 +Percent of CPU this job got: 128% +Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.01 +Average shared text size (kbytes): 0 +Average unshared data size (kbytes): 0 +Average stack size (kbytes): 0 +Average total size (kbytes): 0 +Maximum resident set size (kbytes): 11073412 +Average resident set size (kbytes): 0 +Major (requiring I/O) page faults: 0 +Minor (reclaiming a frame) page faults: 5756557 +Voluntary context switches: 1365 +Involuntary context switches: 478 +Swaps: 0 +File system inputs: 0 +File system outputs: 143704 +Socket messages sent: 0 +Socket messages received: 0 +Signals delivered: 0 +Page size (bytes): 4096 +Exit status: 0 +``` + +The genotype file is unzipped at 30G. Let's try running the gzipped version (which will be beneficial on a compute cluster anyhow), which comes in at 9.2G. We know that Gemma is not the most efficient when it comes to IO. So testing is crucial. +Critically, the run gets slower: + +``` +Percent of CPU this job got: 118% +Elapsed (wall clock) time (h:mm:ss or m:ss): 7:43.56 +``` + +The problem is that decompression runs on a single thread in GEMMA, so it is actually slower than reading the gigantic raw text file. + +## Running without debug + +Without the debug switch gemma runs at the same speed with 128% CPU. That won't help much. + +## Optimizing GEMMA+OpenBLAS+GSL + +Compiling with optimization can be low hanging fruit - despite the fact that we seem to be IO bound at 128% CPU. Still, aggressive compiler optimizations may make a difference. The current build reads: + +``` +GEMMA Version = 0.98.6 (2022-08-05) +Build profile = /gnu/store/8rvid272yb53bgascf5c468z0jhsyflj-profile +GCC version = 14.3.0 +GSL Version = 2.8 +OpenBlas = OpenBLAS 0.3.30 - OpenBLAS 0.3.30 DYNAMIC_ARCH NO_AFFINITY Cooperlake MAX_THREADS=128 +arch = Cooperlake +threads = 96 +parallel type = threaded +``` + +This uses the gemma-gn2 package in + +=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/gemma.scm#n27 + +which is currently not built with arch optimizations (even though 'Cooperlake' suggests otherwise). Another potential optimization is to use a fast malloc library. We do, however, already compile with a recent gcc, thanks to Guix. No need to improve on that. + +## Introduce lmdb for genotypes + +Rather than focussing on gzip, another potential improvement is to use lmdb with mmap. We are not going to upgrade the original gemma code (which is in maintenance mode). We are going to upgrade the new pangemma project instead: + +=> https://git.genenetwork.org/pangemma/ + +Reason being that this is our experimental project. + +So I just managed to build pangemma/gemma in Guix. Next step is to introduce lmdb genotypes. Genotypes come essentially as a matrix of markers x individuals. In the case of GN geno files and BIMBAM files they are simply stored as tab delimited values and/or probabilities.
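For reference, a BIMBAM mean genotype file has one marker per line: marker id, minor and major allele, then one dosage value per individual. A made-up two-marker, four-individual example (assuming the standard GEMMA BIMBAM input format):

```
rs3677817, A, G, 0.00, 1.00, 2.00, 0.98
rs3677820, T, C, 2.00, 0.00, 1.02, 1.00
```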
Reading these genotypes in GEMMA happens in + +``` +src/param.cpp +1261:void PARAM::ReadGenotypes(gsl_matrix *UtX, gsl_matrix *K, const bool calc_K) { +1280:void PARAM::ReadGenotypes(vector<vector<unsigned char>> &Xt, gsl_matrix *K, +``` + +calling into + +``` +gemma_io.cpp +644:bool ReadFile_geno(const string &file_geno, const set<string> &setSnps, +1752:bool ReadFile_geno(const string file_geno, vector<int> &indicator_idv, +1857:bool ReadFile_geno(const string &file_geno, vector<int> &indicator_idv, +``` + +which are called from gemma.cpp. Also lmm.cpp reads the geno file in the AnalyzeBimbam function (see file_geno): + +``` +src/lmm.cpp +61: file_geno = cPar.file_geno; +1664: debug_msg(file_geno); +1665: auto infilen = file_geno.c_str(); +2291: cout << "error reading genotype file:" << file_geno << endl; +``` + +Note that SNPs are also read from a file (see file_snps). We already have an lmdb version for that! + +So, reading genotypes happens in multiple places. In fact, the file is read 1x for computing K and 2x for GWA. And it is worse than this, because LOCO runs GWA 20x, rereading the same files. Reading it once using lmdb should speed things up. + +We'll start with the 30G 143samples.percentile.bimbam.bimbam-reduced2 file. The conversion to lmdb only needs to happen once. We want to track both column and row names in the same lmdb, and we will use a meta JSON record for that. On the command line we'll state whether the genotypes are stored as char or int. Floats will be packed into either of those. We'll experiment a bit to see what the default should be. A genotype is usually a number/character or a probability. In the latter case we don't have to have high precision and can choose to store an index into a range of values. We can also opt for Float16 or something more ad hoc because we don't have to store the exponent. + +But let's start with a standard float here, to keep things simple. To write the first version of code I'll use a byte conversion: + +``` +./bin/geno2mdb.rb BXD.geno.bimbam --eval '{"0"=>0,"1"=>1,"2"=>2,"NA"=>-1}' --pack 'C*' --geno-json BXD.geno.json +``` + +The lmdb file contains a metadata record that looks like: + +``` +{ + "type": "gemma-geno", + "version": 1, + "eval": "G0-2", + "key-format": "string", + "rec-format": "C*", + "geno": { + "type": "gn-geno-to-gemma", + "genofile": "BXD.geno", + "samples": [ + "BXD1", + "BXD2", + "BXD5", +etc. +``` + +I.e. it is a self-contained, efficient genotype format. There is another trick: we can use Plink-style compression with + +``` +./bin/geno2mdb.rb BXD.geno.bimbam --eval '{"0"=>0,"1"=>1,"2"=>2,"NA"=>4}' --geno-json BXD.geno.json --gpack 'l.each_slice(4).map { |slice| slice.map.with_index.sum {|val,i| val << (i*2) } }.pack("C*")' +``` + +reducing the original uncompressed BIMBAM from 9.9Mb to 2.7Mb. This is still a lot larger than the gzip compressed BIMBAM, but as I pointed out earlier the uncompressed version is faster by a wide margin. Compressing the lmdb file gets it in range of the compressed BIMBAM btw. So that is always an option. + +Next we create a floating point version. That reduces the file to 30% with + +``` +geno2mdb.rb fp.bimbam --geval 'g.to_f' --pack 'F*' --geno-json bxd_inds.list.json +``` + +and compressing the probabilities into a byte reduces the file to 10%: + +``` +geno2mdb.rb fp.bimbam --geval '(g.to_f*255.0).to_i' --pack 'C*' --geno-json bxd_inds.list.json +``` + +And now the compressed version is also 4x smaller.
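To make the layout concrete, here is a hypothetical Python sketch of the idea (using the py-lmdb bindings; this is an illustration, not the actual geno2mdb.rb output): one packed record per marker plus a JSON metadata record, giving O(1) mmap'd row access:

```python
import json
import struct
import lmdb

# Hypothetical mini-version of the genotype store described above.
env = lmdb.open("genotypes.mdb", map_size=2**30)
samples = ["BXD1", "BXD2", "BXD5"]
with env.begin(write=True) as txn:
    # Self-describing metadata record (samples, encoding).
    txn.put(b"meta", json.dumps(
        {"type": "gemma-geno", "rec-format": "F*",
         "geno": {"samples": samples}}).encode())
    # One packed float per individual for this marker.
    txn.put(b"rs3677817", struct.pack("<3f", 0.0, 1.0, 2.0))

# Reading a marker back is a single mmap'd lookup:
with env.begin() as txn:
    print(struct.unpack("<3f", txn.get(b"rs3677817")))
```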
We'll have to run gemma at scale to see what the impact is, but an uncompressed 10x reduction should have an impact on the IO bottleneck. Note how easy it is to try these things with my little Ruby script. + +=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/geno2mdb.rb + +## Use lmdb genotypes from pangemma + +Rather than writing new code in C++ I proceeded to embed guile in pangemma. If it turns out to be a performance problem we can always fall back to C. Here we show a simple test written in guile that gets called from main.cpp: + +=> https://git.genenetwork.org/pangemma/commit/?id=5b6b5e2ad97b4733125c0845cfae007e8094a687 + +## Some analysis of GEMMA + +GEMMA::BatchRun reads files and executes (b gemma.cpp:1657) +cPar.ReadFiles() + ReadFile_anno + ReadFile_pheno + ReadFile_geno (gemma_io.cpp:652) - first read to fetch SNPs info, num (ns_test) and total SNPs (ns_total). + - it also does some checks + Note: These can all be handled by the lmdb files. So it saves one run. + +Summary of Mutated Outputs: +* indicator_snp: Binary indicators for which SNPs passed filtering +* snpInfo: Complete metadata for all SNPs in the file +* ns_test: Count of SNPs passing filters +checkpoint("read-geno-file",file_geno); + +Next start LMM9 gemma.cpp:2571 + ReadFile_kin + EigenDecomp_Zeroed + 2713 CalcUtX(U, W, UtW); + 2714 CalcUtX(U, Y, UtY); + CalcLambda + CalcLmmVgVeBeta + CalcPve + cPar.PrintSummary() + debug_msg("fit LMM (one phenotype)"); + cLmm.AnalyzeBimbam lmm.cpp:1665 and + LMM::Analyze lmm.cpp:1704 + + +Based on LLM code analysis, here's what gets mutated in the 'LMM' and 'PARAM' classes: + +### By 'ReadFile_geno': +This is a **standalone function** (not a member of LMM), but it mutates LMM members when passed as parameters: + +1. **'indicator_snp'** - cleared and populated with 0/1 filter flags +2. **'snpInfo'** - cleared and populated with SNP metadata +3. **'ns_test'** - set to count of SNPs that passed all filters + +### By 'LMM::AnalyzeBimbam': +(which calls 'LMM::Analyze') + +**Directly mutated in 'LMM::Analyze':** + +1. **'sumStat'** - PRIMARY OUTPUT + - Cleared at start (implied) + - Populated with one SUMSTAT entry per analyzed SNP + - Contains: beta, se, lambda_remle, lambda_mle, p_wald, p_lrt, p_score, logl_H1 + +2. **'time_UtX'** - timing accumulator + - '+= time_spent_on_matrix_multiplication' + +3. **'time_opt'** - timing accumulator + - '+= time_spent_on_optimization' + +**Read but NOT mutated:** +- 'indicator_snp' - read to determine which SNPs to process +- 'indicator_idv' - read to determine which individuals to include +- 'ni_total', 'ni_test' - used for loop bounds and assertions +- 'n_cvt' - number of covariates, used in calculations +- 'l_mle_null', 'l_min', 'l_max', 'n_region', 'logl_mle_H0' - analysis parameters +- 'a_mode' - determines which statistical tests to run +- 'd_pace' - controls progress bar display + +### Summary Table: + +| Member Variable | Mutated By | Purpose | +|----------------|------------|---------| +| 'indicator_snp' | 'ReadFile_geno' | Which SNPs passed filters | +| 'snpInfo' | 'ReadFile_geno' | SNP metadata (chr, pos, alleles, etc.) | +| 'ns_test' | 'ReadFile_geno' | Count of SNPs to analyze | +| 'sumStat' | 'Analyze' | **Main output**: Statistical results per SNP | +| 'time_UtX' | 'Analyze' | Performance profiling | +| 'time_opt' | 'Analyze' | Performance profiling | +The key output is **'sumStat'** which contains all the association test results.
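Since each SNP's result lands in its own sumStat entry, collision-free parallel collection is straightforward. A hedged Python sketch of the idea (hypothetical analyze function, not pangemma code): each worker writes to a disjoint, preallocated slot, so no locking is needed:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_snp(idx, genotype_row):
    # Stand-in for the per-SNP regression; returns summary statistics.
    return {"snp": idx, "beta": sum(genotype_row) / len(genotype_row)}

def analyze_all(genotypes, workers=8):
    sum_stat = [None] * len(genotypes)  # disjoint writes: no locks needed
    def work(idx):
        sum_stat[idx] = analyze_snp(idx, genotypes[idx])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, range(len(genotypes))))
    return sum_stat
```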
+ +PARAM variables directly mutated by these functions: + + indicator_snp (by ReadFile_geno) + snpInfo (by ReadFile_geno) + ns_test (by ReadFile_geno) + +LMM variables mutated: + + indicator_snp (by ReadFile_geno if passed LMM's copy) + snpInfo (by ReadFile_geno if passed LMM's copy) + ns_test (by ReadFile_geno if passed LMM's copy) + sumStat (by Analyze - this is LMM-only, not in PARAM) + time_UtX, time_opt (by Analyze) + +The actual analysis results (sumStat) exist only in LMM, not in PARAM. + +## Coding for lmdb support + +From the above it should be clear that, if we have the genotypes and SNP annotations in lmdb, we can skip reading the genotype file the first time. We can also rewrite the 'analyze' functions to fetch this information on the fly. + +Note that OpenBLAS will have to run single threaded when introducing SNP-based threads. + +## Fine grained multithreading + +From the above it can be concluded that we can batch process SNPs in parallel. The only output is sumStat and that is written at once at the end. So, if we can collect the sumStat data without collision it should just work. + +Interestingly both Guile and C++ have recently introduced fibers. Boost.Fiber looks pretty clean (note two fixes to the sketch: buffered_channel capacity must be a power of two, and the channel must only be closed after all workers finish): + +``` +#include <boost/fiber/all.hpp> +#include <vector> +#include <iostream> + +namespace fibers = boost::fibers; + +// Worker fiber: pushes squared values into the channel. +void compute_worker(int start, int end, + fibers::buffered_channel<int>& channel) { + for (int i = start; i < end; ++i) { + channel.push(i * i); + } +} + +void parallel_compute_fibers() { + // Capacity must be a power of two for buffered_channel. + fibers::buffered_channel<int> channel(128); + + // Spawn worker fibers + fibers::fiber f1([&]() { + compute_worker(0, 100, channel); + }); + + fibers::fiber f2([&]() { + compute_worker(100, 200, channel); + }); + + // Close the channel only after *both* workers are done, + // otherwise a still-running worker loses its pushes. + fibers::fiber closer([&]() { + f1.join(); + f2.join(); + channel.close(); // Signal completion + }); + + // Collect results until the channel is closed and drained + std::vector<int> results; + int value; + while (fibers::channel_op_status::success == channel.pop(value)) { + results.push_back(value); + } + + closer.join(); + + std::cout << "Total results: " << results.size() << std::endl; +} +``` + +Compare that with guile: + +``` +(use-modules (fibers) + (fibers channels)) + +;; Worker that streams individual results +(define (compute-worker-streaming start end result-channel) + (let loop ((i start)) + (when (< i end) + (put-message result-channel (* i i)) + (loop (+ i 1)))) + ;; Send completion signal + (put-message result-channel 'done)) + +;; Collector fiber +(define (result-collector result-channel num-workers) + (let loop ((results '()) + (done-count 0)) + (if (= done-count num-workers) + (reverse results) + (let ((msg (get-message result-channel))) + (if (eq? msg 'done) + (loop results (+ done-count 1)) + (loop (cons msg results) done-count)))))) + +(define (parallel-compute-streaming) + (run-fibers + (lambda () + (let ((result-channel (make-channel))) + + ;; Spawn workers + (spawn-fiber + (lambda () (compute-worker-streaming 0 100 result-channel))) + (spawn-fiber + (lambda () (compute-worker-streaming 100 200 result-channel))) + + ;; Collect results + (result-collector result-channel 2))))) +``` + +Boost.Fiber is a relatively mature library now, with 8+ years of development and real-world usage. +Interestingly, Boost.Fiber has work stealing built in. We can look at that later: + +=> https://www.boost.org/doc/libs/1_66_0/libs/fiber/doc/html/fiber/worker.html + +What about LOCO? Actually we can use the same fiber strategy for each chromosome as a per CHR process.
We can set the number of threads differently based on the number of SNPs per chromosome, so all chromosomes take (about) the same time. Later, we can bring LOCO into one process, with the advantage that the genotype data is only read once. In both cases the kinship matrices are in RAM anyway. + +# Reducing the size of the genotype file + +The first version of lmdb genotypes used simple floats. That reduced the pangenome text version from 30Gb to 12Gb with about a 3x speedup of gemma. Next I tried a byte representation of the genotypes, which (as shown above) brought the file down further to 4Gb. + +# Optimizing SNP handling + +GEMMA originally used a separate SNP annotation file, which proves inefficient. Now that we transform the geno information to lmdb, we might as well include chr+pos. We'll make the key out of that and add a table with marker annotations. + +# Optimizing the index + +I opted for using a CHR+POS index (byte+long value). There are a few things to consider. There may be duplicates and there may be missing values. Also, LMDB likes an integer index. The built-in dupsort does not work for us, so we need to create a unique pos for every variant. I'll do that by adding the line number. diff --git a/issues/global-search-performance.gmi b/issues/global-search-performance.gmi new file mode 100644 index 0000000..d5165b6 --- /dev/null +++ b/issues/global-search-performance.gmi @@ -0,0 +1,16 @@ +# Global Search Performance/Optimisation + +* assigned: bonfacem, pjotrp +* status: in-progress + +## Description + +Global search is slow. Search latency sits at ~30 seconds. Historically, identical queries completed within 1–4 seconds. The root cause is unclear; maybe hardware problems. + +Performance appears to have returned to normal without code changes. + +## Tasks + +* [ ] Compare CD with prod wrt latency. See if hardware is really the problem. +* [ ] Check out query plans and execution paths for global search (xapian delve, fall-back sql, woven-in gnqna) +* [ ] Document performance. diff --git a/issues/gn-auth/email_verification.gmi b/issues/gn-auth/email_verification.gmi index fff3d54..07e2b04 100644 --- a/issues/gn-auth/email_verification.gmi +++ b/issues/gn-auth/email_verification.gmi @@ -12,7 +12,7 @@ When setting up e-mail verification, the following configurations should be set for gn-auth: -SMTP_HOST = "smtp.uthsc.edu" +SMTP_HOST = "smtp.uthsc" SMTP_PORT = 25 (not 587, which is what we first tried) SMTP_TIMEOUT = 200 # seconds diff --git a/issues/gn-auth/fix-refresh-token.gmi b/issues/gn-auth/fix-refresh-token.gmi index 1a6a825..222b731 100644 --- a/issues/gn-auth/fix-refresh-token.gmi +++ b/issues/gn-auth/fix-refresh-token.gmi @@ -2,7 +2,7 @@ ## Tags -* status: open +* status: closed, obsolete * priority: high * assigned: fredm * type: feature-request, bug @@ -51,3 +51,8 @@ This actually kills 2 birds with the one stone: * Get the refresh token from the cookies rather than from the body * Maybe: make refreshing the access token unaware of threads/workers + + +## Close as Obsolete + +We no longer do refresh tokens at all; they were a pain to deal with, so I simply removed them from the system.
diff --git a/issues/gn-auth/pass-on-unknown-get-parameters.gmi b/issues/gn-auth/pass-on-unknown-get-parameters.gmi new file mode 100644 index 0000000..a349800 --- /dev/null +++ b/issues/gn-auth/pass-on-unknown-get-parameters.gmi @@ -0,0 +1,17 @@ +# Pass on Unknown GET Parameters + +## Tags + +* status: open +* priority: medium +* type: feature-request, enhancement +* assigned: fredm, zsloan +* keywords: gn-auth, authorisation + +## Description + +A developer or user may need to access some feature hidden behind a flag (so-called "feature flags"). Some of these flags are set using known (to the application and developer/user) GET parameters. + +If the user provides these GET parameters before login, then goes through the login process, the unknown GET parameters are silently dropped, and the user has to manually set them up again. This, while not a big deal, is annoying and wastes a few seconds each time. + +This feature request proposes to pass any unknown GET parameters untouched through the authentication/authorisation server and back to the authenticating client during the login process, to mitigate this small annoyance. diff --git a/issues/gn-auth/rework-view-resource-page.gmi b/issues/gn-auth/rework-view-resource-page.gmi new file mode 100644 index 0000000..2d6e145 --- /dev/null +++ b/issues/gn-auth/rework-view-resource-page.gmi @@ -0,0 +1,22 @@ +# Rework "View-Resource" Page + +## Tags + +* status: closed, completed +* priority: medium +* type: enhancement +* assigned: fredm, zsloan +* keywords: gn-auth, resource, resources, view resource + +## Description + +The view resource page ('/oauth2/resource/<uuid>/view') was built with only Genotype, Phenotype, and mRNA resources in mind. + +We have since moved on, and added more types of resources (group, system, inbredset-group, etc). This leads to the page breaking for these other types of resources. + +We need to update the UI and route to ensure the page renders correctly for each type, or at the very least, redirects to the correct page (e.g. in the case of groups, which have a separate "view group" page). + + +## Close as complete + +This is fixed now. diff --git a/issues/gn-guile/activations-on-production-not-running-as-expected.gmi b/issues/gn-guile/activations-on-production-not-running-as-expected.gmi new file mode 100644 index 0000000..be9cc00 --- /dev/null +++ b/issues/gn-guile/activations-on-production-not-running-as-expected.gmi @@ -0,0 +1,57 @@ +# gn-guile: Activations on Production not Running as Expected + +## Tags + +* status: closed, completed, fixed +* priority: high +* type: bug +* assigned: bonfacem, fredm, aruni +* keywords: gn-guile, deployment, activation-service-type + +## Description + +With the recent changes to guix's `least-authority-wrapper` we can no longer write to the root filesystem ("/"). That is not much of a problem. + +So I tried adding `#:directory (dirname gn-doc-git-checkout)` to the `make-forkexec-constructor` for the `gn-guile-shepherd-service` and that actually changes the working directory of the process, as I would expect. + +In `genenetwork-activation` I add: + +``` + ;; setup correct ownership for gn-docs + (for-each (lambda (file) + (chown file + (passwd:uid (getpw "genenetwork")) + (passwd:gid (getpw "genenetwork")))) + (find-files #$(dirname gn-doc-git-checkout) + #:directories? #t)) +``` + +which, ideally, should change ownership of the parent directory of the bare git checkout for "gn-docs" when we build/start the container.
This does not happen — the directory is still owned by root. + +My thinking goes, the "genenetwork" user[1] is not yet created at the point when the activation[2] is run, leading to the service failing to start. + +The reason I think this is that when I do: + +``` +fredm@tux04:/...$ sudo guix container exec <container-pid> /run/current-system/profile/bin/bash --login +root@genenetwork-gn2-fred /# chown -R genenetwork:genenetwork /var/lib/genenetwork/ +``` + +The bound directory's ownership changes, and we can now enable and start the service: + +``` +root@genenetwork-gn2-fred /# herd enable gn-guile +root@genenetwork-gn2-fred /# herd start gn-guile +``` + +which starts the service as expected. We can also simply restart the entire container at this point, and it works too. + +## Footnotes + +=> https://git.genenetwork.org/gn-machines/tree/genenetwork/services/genenetwork.scm?id=e425671e69a321a032134fafee974442e8c1ce6f#n167 [1] "genenetwork" user declaration +=> https://git.genenetwork.org/gn-machines/tree/genenetwork/services/genenetwork.scm?id=e425671e69a321a032134fafee974442e8c1ce6f#n680 [2] Activation of services (see also the account-service-type being extended with the "genenetwork" user). + +## Close as Fixed + +This issue is fixed, with newer Guix and changes that @bonz did to the gn-machines repo. diff --git a/issues/gn-guile/rendering-images-within-markdown-documents.gmi b/issues/gn-guile/rendering-images-within-markdown-documents.gmi new file mode 100644 index 0000000..fe3ed39 --- /dev/null +++ b/issues/gn-guile/rendering-images-within-markdown-documents.gmi @@ -0,0 +1,22 @@ +# Rendering Images Linked in Markdown Documents + +## Tags + +* status: open +* priority: high +* type: bug +* assigned: alexm, bonfacem, fredm +* keywords: gn-guile, images, markdown + +## Description + +Rendering images linked within markdown documents does not work as expected — we cannot render images if they have a relative path. +As an example see the commit below: +=> https://github.com/genenetwork/gn-docs/commit/783e7d20368e370fb497974f843f985b51606d00 + +In that commit, we are forced to use the full github URI to get the images to load correctly when rendered via gn-guile. This has two unfortunate consequences: + +* It makes editing more difficult, since the user has to remember to find and use the full github URL for their images. +* It ties the data and code to github + +This needs to be fixed, such that any and all paths relative to the markdown file are resolved at render time automatically. diff --git a/issues/gn-guile/rework-hard-dependence-on-github.gmi b/issues/gn-guile/rework-hard-dependence-on-github.gmi new file mode 100644 index 0000000..751e9fe --- /dev/null +++ b/issues/gn-guile/rework-hard-dependence-on-github.gmi @@ -0,0 +1,21 @@ +# Rework Hard Dependence on Github + +## Tags + +* status: open +* priority: medium +* type: bug +* assigned: alexm +* assigned: bonfacem +* assigned: fredm +* keywords: gn-guile, github + +## Description + +Currently, we have a hard dependence on Github for our source repository — you can see this in lines 31, 41, 55 and 59 of the code linked below: + +=> https://git.genenetwork.org/gn-guile/tree/web/view/markdown.scm?id=0ebf6926db0c69e4c444a6f95907e0971ae9bf40 + +The most likely reason is that the "edit online" functionality might not exist in a lot of other popular source forges.
+ +This is rendered moot, however, since we do provide a means to edit the data on Genenetwork itself. We might as well get rid of this option, only allow the "edit online" feature on Genenetwork, and stop relying on its presence in the forges we use. diff --git a/issues/gn-libs/jobs-allow-job-cascades.gmi b/issues/gn-libs/jobs-allow-job-cascades.gmi new file mode 100644 index 0000000..f659f32 --- /dev/null +++ b/issues/gn-libs/jobs-allow-job-cascades.gmi @@ -0,0 +1,26 @@ +# Jobs: Allow Job Cascades + +## Tags + +* status: open +* priority: medium +* type: enhancement +* assigned: fredm, zsloan +* keywords: gn-libs, genenetwork, async jobs, asynchronous jobs, background jobs + +## Description + +Some jobs could require more than a single command/script to be run to complete. + +Rather than refactoring/rewriting the entire "async jobs" feature, I propose adding a way to note who started a job, i.e. +* the user, OR +* another job + +This could be tracked in an extra field in the database, say "started_by", which can have values of the form +* "user:<user-id>" +* "job:<job-id>" +where the parts in the angle brackets (i.e. "<user-id>" and "<job-id>") are replaced by actual ids. + +## Related Issues + +=> /issues/gn-libs/jobs-track-who-jobs-belong-to diff --git a/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi b/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi new file mode 100644 index 0000000..00eaf21 --- /dev/null +++ b/issues/gn-libs/jobs-track-who-jobs-belong-to.gmi @@ -0,0 +1,23 @@ +# Jobs: Track Who Jobs Belong To + +## Tags + +* status: open +* priority: medium +* type: enhancement +* assigned: fredm, zsloan +* keywords: gn-libs, genenetwork, async jobs, asynchronous jobs, background jobs + +## Description + +Some features in Genenetwork require long-running processes to be triggered and run in the background. We have a way to trigger such background processes, but there is no way of tracking who started what job, and therefore, no real way for a user to list only their jobs. + +This issue will track the introduction of such tracking (a sketch of what it could look like follows below).
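A minimal sketch of what such tracking could look like, serving both this issue and the cascade idea above (hypothetical schema; the actual gn-libs jobs table will differ):

```python
import sqlite3

conn = sqlite3.connect("jobs.db")
# "started_by" holds "user:<user-id>" or "job:<job-id>" as proposed above.
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    command TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued',
    started_by TEXT NOT NULL)""")

def jobs_for_user(user_id):
    """List only the jobs this user started."""
    return conn.execute(
        "SELECT job_id, command, status FROM jobs WHERE started_by = ?",
        (f"user:{user_id}",)).fetchall()
```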
Such tracking will enable building new job-related functionality, such as a user being able to: +* list their past, unexpired jobs +* delete past jobs +* possibly rerun jobs that failed but are recoverable +* see currently running jobs, and their status + +## Related Issues + +=> /issues/gn-libs/jobs-allow-job-cascades diff --git a/issues/gn-uploader/AuthorisationError-gn-uploader.gmi b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi new file mode 100644 index 0000000..262ad19 --- /dev/null +++ b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi @@ -0,0 +1,70 @@ +# AuthorisationError in gn uploader + +## Tags +* assigned: fredm +* status: closed, obsoleted +* priority: critical +* type: error +* keywords: authorisation, permission + +## Description + +Trying to create a population for the Kilifish dataset in the gn-uploader webpage, +I encountered the following error: +```sh +Traceback (most recent call last): + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/flask/app.py", line 917, in full_dispatch_request + rv = self.dispatch_request() + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/flask/app.py", line 902, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return] + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/authorisation.py", line 23, in __is_session_valid__ + return session.user_token().either( + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/pymonad/either.py", line 89, in either + return right_function(self.value) + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/authorisation.py", line 25, in <lambda> + lambda token: function(*args, **kwargs)) + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/population/views.py", line 185, in create_population + ).either( + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/pymonad/either.py", line 91, in either + return left_function(self.monoid[0]) + File "/gnu/store/wxb6rqf7125sb6xqd4kng44zf9yzsm5p-profile/lib/python3.10/site-packages/uploader/monadic_requests.py", line 99, in __fail__ + raise Exception(_data) +Exception: {'error': 'AuthorisationError', 'error-trace': 'Traceback (most recent call last): + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/flask/app.py", line 917, in full_dispatch_request + rv = self.dispatch_request() + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/flask/app.py", line 902, in dispatch_request + return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return] + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/authlib/integrations/flask_oauth2/resource_protector.py", line 110, in decorated + return f(*args, **kwargs) + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/resources/inbredset/views.py", line 95, in create_population_resource + ).then( + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/pymonad/monad.py", line 152, in then + result = self.map(function) + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/pymonad/either.py", line 106, in map + return
self.__class__(function(self.value), (None, True)) + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/resources/inbredset/views.py", line 98, in <lambda> + "resource": create_resource( + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/resources/inbredset/models.py", line 25, in create_resource + return _create_resource(cursor, + File "/gnu/store/38iayxz7dgm86f2x76kfaa6gwicnnjg4-profile/lib/python3.10/site-packages/gn_auth/auth/authorisation/checks.py", line 56, in __authoriser__ + raise AuthorisationError(error_description) +gn_auth.auth.errors.AuthorisationError: Insufficient privileges to create a resource +', 'error_description': 'Insufficient privileges to create a resource'} + +``` +The error above resulted from the attempt to upload the following information in the gn-uploader `create population` section. +Input details are as follows: +Full Name: Kilifish F2 Intercross Lines +Name: KF2_Lines +Population code: KF2 +Description: Kilifish second generation population +Family: Crosses, AIL, HS +Mapping Methods: GEMMA, QTLReaper, R/qtl +Genetic type: intercross + +Pressing the `Create Population` button led to the error above. + +## Closed as Obsolete + +* The service this was happening on (https://staging-uploader.genenetwork.org) is no longer running +* Most of the authorisation issues are resolved in newer code diff --git a/issues/export-uploaded-data-to-RDF-store.gmi b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi index c39edec..3ef05cd 100644 --- a/issues/export-uploaded-data-to-RDF-store.gmi +++ b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi @@ -6,7 +6,7 @@ * priority: medium * type: feature-request * status: open -* keywords: API, data upload +* keywords: API, data upload, gn-uploader ## Description @@ -73,10 +73,16 @@ The metadata is useful for searching for the data. The "metadata->rdf" project[4 * [ ] How do we handle this? +## Related Issues and Topics + +=> https://issues.genenetwork.org/topics/next-gen-databases/design-doc +=> https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter +=> https://issues.genenetwork.org/issues/dump-sample-data-to-lmdb +=> https://issues.genenetwork.org/topics/database/genotype-database ## Footnotes -=> https://gitlab.com/fredmanglis/gnqc_py 1: QC/Data upload project repository +=> https://git.genenetwork.org/gn-uploader/ 1: QC/Data upload project (gn-uploader) repository => https://github.com/genenetwork/genenetwork3/pull/130 2: Munyoki's Pull request => https://github.com/BonfaceKilz/gn-dataset-dump 3: Dataset -> LMDB export repository -=> https://github.com/genenetwork/dump-genenetwork-database 4: Metadata -> RDF export repository +=> https://git.genenetwork.org/gn-transform-databases/ 4: Metadata -> RDF export repository diff --git a/issues/gn-uploader/guix-build-gn-uploader-error.gmi b/issues/gn-uploader/guix-build-gn-uploader-error.gmi index 44a5c4b..aeb6308 100644 --- a/issues/gn-uploader/guix-build-gn-uploader-error.gmi +++ b/issues/gn-uploader/guix-build-gn-uploader-error.gmi @@ -86,7 +86,7 @@ Filesystem Size Used Avail Use% Mounted on so we know that's not a problem. -A similar thing had shown up on space.uthsc.edu. +A similar thing had shown up on our space server.
### More Troubleshooting Efforts diff --git a/issues/gn-uploader/handling-tissues-in-uploader.gmi b/issues/gn-uploader/handling-tissues-in-uploader.gmi index 826af15..0c43040 100644 --- a/issues/gn-uploader/handling-tissues-in-uploader.gmi +++ b/issues/gn-uploader/handling-tissues-in-uploader.gmi @@ -2,11 +2,11 @@ ## Tags -* status: open +* status: closed, wontfix * priority: high * assigned: fredm * type: feature-request -* keywords: gn-uploader, tissues +* keywords: gn-uploader, tissues, archived ## Description @@ -112,3 +112,9 @@ ALTER TABLE Tissue MODIFY Id INT(5) UNIQUE NOT NULL; * [1] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#ProbeFreeze * [2] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#Tissue + +## Closed as WONTFIX + +I am closing this issue because it was created (2024-03-28) while I had a fundamental misunderstanding of the way data is laid out in the database. + +The information on the schema/layout of the tables is still useful, but chances are, we'll look at the tables themselves anyway should we need to figure out the schema. diff --git a/issues/gn-uploader/link-authentication-authorisation.gmi b/issues/gn-uploader/link-authentication-authorisation.gmi index 90b8e5e..b64f887 100644 --- a/issues/gn-uploader/link-authentication-authorisation.gmi +++ b/issues/gn-uploader/link-authentication-authorisation.gmi @@ -2,7 +2,7 @@ ## Tags -* status: open +* status: closed, completed * assigned: fredm * priority: critical * type: feature request, feature-request @@ -13,3 +13,9 @@ The last link in the chain to the uploads is the authentication/authorisation. Once the user uploads their data, they need access to it. The auth system, by default, denies everyone access to any data that is not linked to a resource, and for which no user has roles granting them access to the data. We currently assign such data to the user manually, but that is not a sustainable way of working, especially as the uploader is exposed to more and more users. + +### Close as Completed + +The current iteration of the uploader does actually take into account the user that is uploading the data, granting them ownership of the uploaded data. By default, the data is not public, and is only accessible to the user who uploaded it. + +The user who uploads the data (and therefore owns it) can later grant access to other users of the system. diff --git a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi index 1841d36..af3b274 100644 --- a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi +++ b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi @@ -4,7 +4,7 @@ * type: bug * assigned: fredm -* status: open +* status: closed * priority: high * keywords: gn-uploader, uploader, ProbeSet @@ -20,3 +20,10 @@ applicable to our data, I don't think. ``` It seems like some of the data does not require a ProbeSet, and in that case, it should be possible to add it without one. + + +## Notes + +This "bug" is obsoleted by the fact that the implementation leading to it was entirely wrong. + +The feature that was leading to this bug no longer exists, and will have to be re-implemented from scratch with the involvement of @acenteno.
diff --git a/issues/gn-uploader/provide-page-for-uploaded-data.gmi b/issues/gn-uploader/provide-page-for-uploaded-data.gmi index 60b154b..5ab7f80 100644 --- a/issues/gn-uploader/provide-page-for-uploaded-data.gmi +++ b/issues/gn-uploader/provide-page-for-uploaded-data.gmi @@ -2,7 +2,7 @@ ## Tags -* status: open +* status: closed, completed * assigned: fredm * priority: medium * type: feature, feature request, feature-request @@ -20,3 +20,8 @@ Once a user has uploaded their data, provide them with a landing page/dashboard Depends on => /issues/gn-uploader/link-authentication-authorisation + + +## Close as complete + +The current uploader directs the user to a view of the data they uploaded on GN2. This is complete. diff --git a/issues/gn-uploader/replace-redis-with-sqlite3.gmi b/issues/gn-uploader/replace-redis-with-sqlite3.gmi new file mode 100644 index 0000000..d3f94f0 --- /dev/null +++ b/issues/gn-uploader/replace-redis-with-sqlite3.gmi @@ -0,0 +1,29 @@ +# Replace Redis with SQL + +## Tags + +* status: open +* priority: low +* assigned: fredm +* type: feature, feature-request, feature request +* keywords: gn-uploader, uploader, redis, sqlite, sqlite3 + +## Description + +We currently (as of 2024-06-27) use Redis for tracking any asynchronous jobs (e.g. QC on uploaded files). + +A lot of what we use redis for, we can do in one of the many SQL databases (we'll probably use SQLite3 anyway), which are more standardised, and easier to migrate data from and to. It has the added advantage that we can open multiple connections to the database, enabling the different processes to update the status and metadata of the same job consistently. + +Changes done here can then be migrated to the other systems, i.e. GN2, GN3, and gn-auth, as necessary. + +### 2025-12-31: Progress Update + +Initial basic implementation can be found in: + +=> https://git.genenetwork.org/gn-libs/tree/gn_libs/jobs +=> https://git.genenetwork.org/gn-uploader/commit/?id=774a0af9db439f50421a47249c57e5a0a6932301 +=> https://git.genenetwork.org/gn-uploader/commit/?id=589ab74731aed62b1e1b3901d25a95fc73614f57 + +and others. + +More work needs to be done to clean up some minor annoyances. diff --git a/issues/gn-uploader/samplelist-details.gmi b/issues/gn-uploader/samplelist-details.gmi deleted file mode 100644 index 2e64d8a..0000000 --- a/issues/gn-uploader/samplelist-details.gmi +++ /dev/null @@ -1,17 +0,0 @@ -# Explanation of how Sample Lists are handled in GN2 (and may be handled moving forward) - -## Tags - -* status: open -* assigned: fredm, zsloan -* priority: medium -* type: documentation -* keywords: strains, gn-uploader - -## Description - -Regarding the order of samples/strains, it can basically be whatever we decide it is. It just needs to stay consistent (like if there are multiple genotype files). It only really affects how the strains are displayed, and any other genotype files we use for mapping needs to share the same order. - -I think this is the case regardless of whether it's strains or individuals (and both the code and files make no distinction). Sometimes it just logically makes sense to sort them in a particular way for display purposes (like BXD1, BXD2, etc), but technically everything would still work the same if you swapped those columns across all genotype files. Users would be confused about why BXD2 is before BXD1, but everything would still work and all calculations would give the same results.
- -zsloan's proposal for handling sample lists in the future is to just store them in a JSON file in the genotype_files/genotype directory. diff --git a/issues/gn-volt-genofiles-parsing-integration.gmi b/issues/gn-volt-genofiles-parsing-integration.gmi index 8d3d149..e1b0162 100644 --- a/issues/gn-volt-genofiles-parsing-integration.gmi +++ b/issues/gn-volt-genofiles-parsing-integration.gmi @@ -5,7 +5,7 @@ * assigned: alexm, * type: improvement * priority: high -* status: in progress +* status: stalled, closed ## Notes diff --git a/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi b/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi new file mode 100644 index 0000000..5b0a1ff --- /dev/null +++ b/issues/gnqa/implement-no-login-requirement-for-gnqa.gmi @@ -0,0 +1,20 @@ +# Implement No-Login Requirement for GNQA + +## Tags + +* type: feature +* status: completed, closed +* priority: medium +* assigned: alexm, +* keywords: gnqa, user experience, authentication, login, llm + +## Description +This feature will allow usage of LLM/GNQA features without requiring user authentication, while implementing measures to filter out bots. + + +## Tasks + +* [x] If logged in: perform AI search with zero penalty +* [x] Add caching lifetime to save on token usage +* [x] Routes: check for referrer headers — if the previous search was not from the homepage, perform AI search +* [x] If global search returns more than *n* results (*n = number*), perform an AI search diff --git a/issues/gnqa/merge-gnqa-to-production.gmi b/issues/gnqa/merge-gnqa-to-production.gmi index 3d34bb1..6e5f119 100644 --- a/issues/gnqa/merge-gnqa-to-production.gmi +++ b/issues/gnqa/merge-gnqa-to-production.gmi @@ -4,6 +4,7 @@ * assigned: alexm, * keywords: production, GNQA, integration +* status: closed, completed ## Description @@ -12,5 +13,5 @@ be pushed to production. We need to allow only logged-in users to access the ser ## Tasks -* [] Integrate GN-auth for the service -* [] Push production to the current commit \ No newline at end of file +* [x] Integrate GN-auth for the service +* [x] Push production to the current commit \ No newline at end of file diff --git a/issues/gnqna/query-bug-DatabaseError.gmi b/issues/gnqna/query-bug-DatabaseError.gmi new file mode 100644 index 0000000..b8c1cfc --- /dev/null +++ b/issues/gnqna/query-bug-DatabaseError.gmi @@ -0,0 +1,37 @@ +# Query Bug: DatabaseError + +## Tags + +* assigned: fredm, bonfacem +* priority: high +* status: open +* type: bug +* keywords: gnqna + +## Description + +* Go to https://genenetwork.org/gnqna +* Type in a query +* Press "Enter" +* Observe the error "DatabaseError" with a status code of 500. + +Expected: Query returns a result. + + +## Troubleshooting: 2025-10-27 + +* GNQNA's deployment is not part of the gn-machines definitions! + +## Troubleshooting: 2025-12-31 + +If a user **IS NOT** logged in, the system responds with: + +``` +Search_Query: +Status_Code: 500 +Error/Reason: Login/Verification required to make this request +``` + +On the other hand, if a user is logged in, a query returns a result. + +We, therefore, probably need to notify the user that they need to be logged in to use this service.
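A sketch of the kind of early check that could turn the bare 500 into a helpful message (hypothetical Flask-style handler; the actual GNQA route and session handling differ):

```python
from flask import jsonify, session

def require_login_for_gnqa():
    """Fail early with a clear message instead of a DatabaseError/500."""
    if not session.get("user_token"):  # assumption: token kept in the session
        return jsonify(
            error="LoginRequired",
            error_description=("Login/Verification required "
                               "to make this request")), 401
    return None  # logged in: proceed with the query
```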
diff --git a/issues/guix-bioinformatics/guix-updates.gmi b/issues/guix-bioinformatics/guix-updates.gmi new file mode 100644 index 0000000..9c65fb9 --- /dev/null +++ b/issues/guix-bioinformatics/guix-updates.gmi @@ -0,0 +1,18 @@ +# Planned Guix Updates + +## Tags + +* status: open +* priority: medium +* type: enhancement +* assigned: fredm, bonfacem +* keywords: guix-bioinformatics, guix +* interested: pjotrp, aruni + +## Description + +The following outlines issues around the next upgrade: + +* Update the pinned guix commit to the latest and see whether inferior profiles for the laminar user are properly created. +* Rust packages (new package build system): we need to think about these. + diff --git a/issues/guix-bioinformatics/pin-channels-commits.gmi b/issues/guix-bioinformatics/pin-channels-commits.gmi new file mode 100644 index 0000000..216dd24 --- /dev/null +++ b/issues/guix-bioinformatics/pin-channels-commits.gmi @@ -0,0 +1,39 @@ +# Pin Channel Commits; Decouple from Guix + +## Tags + +* status: closed +* priority: medium +* type: enhancement +* assigned: fredm, bonfacem, aruni +* keywords: guix-bioinformatics, guix +* interested: pjotrp, aruni + +## Description + +Changes in upstream Guix often lead to deployment issues, due to breakages caused by changes in how GNU Guix does things. This interrupts our day-to-day operations, leading us to scramble to fix the breakages and make the builds sane again. + +To avoid these breakages in the future, we'll need to pin the commit(s) for all the channels we depend on, preventing surprises down the line. + +### Channel Dependencies + +We depend on the following channels in guix-bioinformatics: + +* guix: Mainline Guix channel +* guix-past: Channel for old packages, no longer maintained on guix mainline +* guix-rust-past-crates: Channel for rust packages using the old packaging form +* guix-forge: Manages building containers and whatnot. The dependence is implicit here, but it is one of the main causes of breakages + +### Tasks + +* [x] Pin guix channel +* [x] Pin guix-past +* [x] Pin guix-rust-past-crates channel +* [x] Pin guix-forge channel +* [ ] Move packages from (gn packages bioinformatics) to upstream (gnu packages bioinformatics) + +### Solution + +To allow guix-bioinformatics to continue improving, while preventing random breakages, we stopped depending on guix-bioinformatics directly; rather, we changed our main channel to gn-machines and pinned the version of guix-bioinformatics we depend on there. + +This allows us to continue updating our packages while keeping the channel dependencies relatively stable. diff --git a/issues/guix-ci-tests.gmi b/issues/guix-ci-tests.gmi new file mode 100644 index 0000000..ce56705 --- /dev/null +++ b/issues/guix-ci-tests.gmi @@ -0,0 +1,47 @@ +# Guix CI failure: guix-past build breaks due to missing (libchop) + +# Tags + +* assigned: bonfacem +* type: bug, infrastructure +* priority: high + +# Notes + +After fixing a permissions issue in the Laminar CI environment (/var/guix/profiles/per-user/laminar): + +``` +[laminar] Executing cfg/jobs/gn-libs.run Backtrace: 9 (primitive-load "/var/lib/laminar/cfg/jobs/gn-libs.run") In ice-9/boot-9.scm: 152:2 8 (with-fluid* _ _ _) In ice-9/eval.scm: 202:51 7 (_ #(#(#<directory (guile-user) 7fce0bc71c80> #<pro?> ?))) 293:34 6 (_ #(#(#<directory (guile-user) 7fce0bc71c80> #<pro?> ?))) In guix/inferior.scm: 1006:4 5 (inferior-for-channels _ #:cache-directory _ #:ttl _) In ice-9/boot-9.scm: 1752:10 4 (with-exception-handler _ _ #:unwind?
_ # _) In guix/store.scm: 690:37 3 (thunk) 1331:8 2 (call-with-build-handler #<procedure 7fce00e9f0c0 at g?> ?) In guix/inferior.scm: 951:2 1 (cached-channel-instance #<store-connection 256.100 7f?> ?) In ice-9/boot-9.scm: 1685:16 0 (raise-exception _ #:continuable? _) ice-9/boot-9.scm:1685:16: In procedure raise-exception: In procedure mkdir: Permission denied: "/var/guix/profiles/per-user/laminar" +``` + +... by (inside the container) running: + +``` +mkdir -p /var/guix/profiles/per-user/laminar +chown -R laminar:laminar /var/guix/profiles/per-user/laminar +``` + +... the CI progressed further but now fails when attempting to build guix-past. The failure is caused by an unbound variable error for the module (libchop), indicating a mismatch or missing dependency in the pinned Guix channels. + +Error Log: + +``` +(exception unbound-variable (value #f) + (value "Unbound variable: ~S") + (value (libchop)) (value #f)) + +builder for /gnu/store/gx57wj08yv0x0g1r8rbnwcp2fc58lqvx-guix-past.drv +failed to produce output path +/gnu/store/n3q0sgqwm9mwvna5215npwmdfigfyr9f-guix-past + +cannot build derivation +/gnu/store/3fwagz1p9vv3h020lwb2ab52f6wj6z1g-profile.drv: +1 dependencies couldn't be built +``` + +# Resolution + +* Inside genenetwork-development.scm, manually create `/var/guix/profiles/per-user/laminar` if it doesn't exist. +* Update the relevant .guix-channel file to match channels in guix-bioinformatics. + +* closed diff --git a/issues/implement-gn-markdown-editor.gmi b/issues/implement-gn-markdown-editor.gmi index 7d7d08f..a0d386b 100644 --- a/issues/implement-gn-markdown-editor.gmi +++ b/issues/implement-gn-markdown-editor.gmi @@ -13,7 +13,7 @@ Example of similar implementation * assigned: alexm * type: enhancement -* status: IN PROGRESS +* status: done, completed. * keywords: markdown,editor @@ -23,7 +23,7 @@ Example of similar implementation * [x] add live preview for page markdown on edit -* [] authentication(WIP) +* [x] authentication * [x] commit changes to github repo diff --git a/issues/implement_xapian_to_text_transformer.gmi b/issues/implement_xapian_to_text_transformer.gmi index a3c3dc8..192491a 100644 --- a/issues/implement_xapian_to_text_transformer.gmi +++ b/issues/implement_xapian_to_text_transformer.gmi @@ -4,7 +4,7 @@ * assigned: alexm, jnduli * keywords: llm, genenetwork2, xapian, transform * type: feature -* status: in-progress +* status: closed, completed ## Description: diff --git a/issues/prevent-weak-passwords.gmi b/issues/prevent-weak-passwords.gmi index 8e8ca2f..957a170 100644 --- a/issues/prevent-weak-passwords.gmi +++ b/issues/prevent-weak-passwords.gmi @@ -19,3 +19,11 @@ There was a request made to prevent weak passwords. Use existing libraries to check and prevent weak passwords. + +## Notes + +### 2025-12-31: Look Into Libraries + +=> https://pypi.org/project/password-strength/ password-strength + +The library above seems promising. Unfortunately, we'd have to write a guix definition for it. 
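+
+A sketch of how the library's policy API could be used once packaged (the thresholds below are placeholders to tune):
+
+```
+from password_strength import PasswordPolicy
+
+# Reject passwords shorter than 8 characters or missing
+# uppercase letters, digits or special characters.
+policy = PasswordPolicy.from_names(length=8, uppercase=1, numbers=1, special=1)
+
+def is_weak(password: str) -> bool:
+    # test() returns the list of failed requirements; empty means OK.
+    return bool(policy.test(password))
+
+print(is_weak("hunter2"))         # True: too short, no uppercase/special
+print(is_weak("Tr0ub4dor&3xyz"))  # False: satisfies all requirements
+```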
diff --git a/issues/production-container-mechanical-rob-failure.gmi b/issues/production-container-mechanical-rob-failure.gmi new file mode 100644 index 0000000..ae6bae8 --- /dev/null +++ b/issues/production-container-mechanical-rob-failure.gmi @@ -0,0 +1,224 @@ +# Production Container: `mechanical-rob` Failure + +## Tags + +* status: closed, completed, fixed +* priority: high +* type: bug +* assigned: fredm +* keywords: genenetwork, production, mechanical-rob + +## Description + +After deploying the following commits to https://gn2-fred.genenetwork.org on 2025-02-19 (UTC-0600): + +* genenetwork2: 2a3df8cfba6b29dddbe40910c69283a1afbc8e51 +* genenetwork3: 99fd5070a84f37f91993f329f9cc8dd82a4b9339 +* gn-auth: 073395ff331042a5c686a46fa124f9cc6e10dd2f +* gn-libs: 72a95f8ffa5401649f70978e863dd3f21900a611 + +I had the (not so) bright idea to run the `mechanical-rob` tests against it before pushing it to production proper. Here's where I ran into problems: some of the `mechanical-rob` tests failed, specifically the correlation tests. + +Meanwhile, a run of the same tests against https://cd.genenetwork.org with the same commits was successful: + +=> https://ci.genenetwork.org/jobs/genenetwork2-mechanical-rob/1531 See this successful run. + +This points to a possible problem with the setup of the production container that leads to failures where none should be. This needs investigation and fixing. + +### Update 2025-02-20 + +The MariaDB server is crashing. To reproduce: + +* Go to https://gn2-fred.genenetwork.org/show_trait?trait_id=1435464_at&dataset=HC_M2_0606_P +* Click on "Calculate Correlations" to expand +* Click "Compute" + +Observe that after a little while, the system fails with the following errors: + +* `MySQLdb.OperationalError: (2013, 'Lost connection to MySQL server during query')` +* `MySQLdb.OperationalError: (2006, 'MySQL server has gone away')` + +I attempted updating the configuration for MariaDB, setting the `max_allowed_packet` to 16M and then 64M, but that did not resolve the problem. + +The log files indicate the following: + +``` +2025-02-20 7:46:07 0 [Note] Recovering after a crash using /var/lib/mysql/gn0-binary-log +2025-02-20 7:46:07 0 [Note] Starting crash recovery... +2025-02-20 7:46:07 0 [Note] Crash recovery finished. +2025-02-20 7:46:07 0 [Note] Server socket created on IP: '0.0.0.0'. +2025-02-20 7:46:07 0 [Warning] 'user' entry 'webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:46:07 0 [Warning] 'db' entry 'db_webqtl webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:46:07 0 [Note] Reading of all Master_info entries succeeded +2025-02-20 7:46:07 0 [Note] Added new Master_info '' to hash table +2025-02-20 7:46:07 0 [Note] /usr/sbin/mariadbd: ready for connections. +Version: '10.5.23-MariaDB-0+deb11u1-log' socket: '/run/mysqld/mysqld.sock' port: 3306 Debian 11 +2025-02-20 7:46:07 4 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:46:07 5 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:46:07 0 [Note] InnoDB: Buffer pool(s) load completed at 250220 7:46:07 +250220 7:50:12 [ERROR] mysqld got signal 11 ; +Sorry, we probably made a mistake, and this is a bug. + +Your assistance in bug reporting will enable us to fix this for the next release.
+To report this bug, see https://mariadb.com/kb/en/reporting-bugs + +We will try our best to scrape up some info that will hopefully help +diagnose the problem, but since we have already crashed, +something is definitely wrong and this may fail. + +Server version: 10.5.23-MariaDB-0+deb11u1-log source revision: 6cfd2ba397b0ca689d8ff1bdb9fc4a4dc516a5eb +key_buffer_size=10485760 +read_buffer_size=131072 +max_used_connections=1 +max_threads=2050 +thread_count=1 +It is possible that mysqld could use up to +key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4523497 K bytes of memory +Hope that's ok; if not, decrease some variables in the equation. + +Thread pointer: 0x7f599c000c58 +Attempting backtrace. You can use the following information to find out +where mysqld died. If you see no messages after this, something went +terribly wrong... +stack_bottom = 0x7f6150282d78 thread_stack 0x49000 +/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x55f43330c14e] +/usr/sbin/mariadbd(handle_fatal_signal+0x475)[0x55f432e013b5] +sigaction.c:0(__restore_rt)[0x7f615a1cb140] +/usr/sbin/mariadbd(+0xcbffbe)[0x55f43314efbe] +/usr/sbin/mariadbd(+0xd730ec)[0x55f4332020ec] +/usr/sbin/mariadbd(+0xd1b36b)[0x55f4331aa36b] +/usr/sbin/mariadbd(+0xd1cd8e)[0x55f4331abd8e] +/usr/sbin/mariadbd(+0xc596f3)[0x55f4330e86f3] +/usr/sbin/mariadbd(_ZN7handler18ha_index_next_sameEPhPKhj+0x2a5)[0x55f432e092b5] +/usr/sbin/mariadbd(+0x7b54d1)[0x55f432c444d1] +/usr/sbin/mariadbd(_Z10sub_selectP4JOINP13st_join_tableb+0x1f8)[0x55f432c37da8] +/usr/sbin/mariadbd(_ZN10JOIN_CACHE24generate_full_extensionsEPh+0x134)[0x55f432d24224] +/usr/sbin/mariadbd(_ZN10JOIN_CACHE21join_matching_recordsEb+0x206)[0x55f432d245d6] +/usr/sbin/mariadbd(_ZN10JOIN_CACHE12join_recordsEb+0x1cf)[0x55f432d23eff] +/usr/sbin/mariadbd(_Z16sub_select_cacheP4JOINP13st_join_tableb+0x8a)[0x55f432c382fa] +/usr/sbin/mariadbd(_ZN4JOIN10exec_innerEv+0xd16)[0x55f432c63826] +/usr/sbin/mariadbd(_ZN4JOIN4execEv+0x35)[0x55f432c63cc5] +/usr/sbin/mariadbd(_Z12mysql_selectP3THDP10TABLE_LISTR4ListI4ItemEPS4_jP8st_orderS9_S7_S9_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x106)[0x55f432c61c26] +/usr/sbin/mariadbd(_Z13handle_selectP3THDP3LEXP13select_resultm+0x138)[0x55f432c62698] +/usr/sbin/mariadbd(+0x762121)[0x55f432bf1121] +/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THD+0x3d6c)[0x55f432bfdd1c] +/usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x20b)[0x55f432bff17b] +/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0xdb5)[0x55f432c00f55] +/usr/sbin/mariadbd(_Z10do_commandP3THD+0x120)[0x55f432c02da0] +/usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x2f2)[0x55f432cf8b32] +/usr/sbin/mariadbd(handle_one_connection+0x5d)[0x55f432cf8dad] +/usr/sbin/mariadbd(+0xbb4ceb)[0x55f433043ceb] +nptl/pthread_create.c:478(start_thread)[0x7f615a1bfea7] +x86_64/clone.S:97(__GI___clone)[0x7f6159dc6acf] + +Trying to get some variables. +Some pointers may be invalid and cause the dump to abort. 
+Query (0x7f599c012c50): SELECT ProbeSet.Name,ProbeSet.Chr,ProbeSet.Mb, + ProbeSet.Symbol,ProbeSetXRef.mean, + CONCAT_WS('; ', ProbeSet.description, ProbeSet.Probe_Target_Description) AS description, + ProbeSetXRef.additive,ProbeSetXRef.LRS,Geno.Chr, Geno.Mb + FROM ProbeSet INNER JOIN ProbeSetXRef + ON ProbeSet.Id=ProbeSetXRef.ProbeSetId + INNER JOIN Geno + ON ProbeSetXRef.Locus = Geno.Name + INNER JOIN Species + ON Geno.SpeciesId = Species.Id + WHERE ProbeSet.Name in ('1447591_x_at', '1422809_at', '1428917_at', '1438096_a_at', '1416474_at', '1453271_at', '1441725_at', '1452952_at', '1456774_at', '1438413_at', '1431110_at', '1453723_x_at', '1424124_at', '1448706_at', '1448762_at', '1428332_at', '1438389_x_at', '1455508_at', '1455805_x_at', '1433276_at', '1454989_at', '1427467_a_at', '1447448_s_at', '1438695_at', '1456795_at', '1454874_at', '1455189_at', '1448631_a_at', '1422697_s_at', '1423717_at', '1439484_at', '1419123_a_at', '1435286_at', '1439886_at', '1436348_at', '1437475_at', '1447667_x_at', '1421046_a_at', '1448296_x_at', '1460577_at', 'AFFX-GapdhMur/M32599_M_at', '1424393_s_at', '1426190_at', '1434749_at', '1455706_at', '1448584_at', '1434093_at', '1434461_at', '1419401_at', '1433957_at', '1419453_at', '1416500_at', '1439436_x_at', '1451413_at', '1455696_a_at', '1457190_at', '1455521_at', '1434842_s_at', '1442525_at', '1452331_s_at', '1428862_at', '1436463_at', '1438535_at', 'AFFX-GapdhMur/M32599_3_at', '1424012_at', '1440027_at', '1435846_x_at', '1443282_at', '1435567_at', '1450112_a_at', '1428251_at', '1429063_s_at', '1433781_a_at', '1436698_x_at', '1436175_at', '1435668_at', '1424683_at', '1442743_at', '1416944_a_at', '1437511_x_at', '1451254_at', '1423083_at', '1440158_x_at', '1424324_at', '1426382_at', '1420142_s_at', '1434553_at', '1428772_at', '1424094_at', '1435900_at', '1455322_at', '1453283_at', '1428551_at', '1453078_at', '1444602_at', '1443836_x_at', '1435590_at', '1434283_at', '1435240_at', '1434659_at', '1427032_at', '1455278_at', '1448104_at', '1421247_at', 'AFFX-MURINE_b1_at', '1460216_at', '1433969_at', '1419171_at', '1456699_s_at', '1456901_at', '1442139_at', '1421849_at', '1419824_a_at', '1460588_at', '1420131_s_at', '1446138_at', '1435829_at', '1434462_at', '1435059_at', '1415949_at', '1460624_at', '1426707_at', '1417250_at', '1434956_at', '1438018_at', '1454846_at', '1435298_at', '1442077_at', '1424074_at', '1428883_at', '1454149_a_at', '1423925_at', '1457060_at', '1433821_at', '1447923_at', '1460670_at', '1434468_at', '1454980_at', '1426913_at', '1456741_s_at', '1449278_at', '1443534_at', '1417941_at', '1433167_at', '1434401_at', '1456516_x_at', '1451360_at', 'AFFX-GapdhMur/M32599_5_at', '1417827_at', '1434161_at', '1448979_at', '1435797_at', '1419807_at', '1418330_at', '1426304_x_at', '1425492_at', '1437873_at', '1435734_x_at', '1420622_a_at', '1456019_at', '1449200_at', '1455314_at', '1428419_at', '1426349_s_at', '1426743_at', '1436073_at', '1452306_at', '1436735_at', '1439529_at', '1459347_at', '1429642_at', '1438930_s_at', '1437380_x_at', '1459861_s_at', '1424243_at', '1430503_at', '1434474_at', '1417962_s_at', '1440187_at', '1446809_at', '1436234_at', '1415906_at', 'AFFX-MURINE_B2_at', '1434836_at', '1426002_a_at', '1448111_at', '1452882_at', '1436597_at', '1455915_at', '1421846_at', '1428693_at', '1422624_at', '1423755_at', '1460367_at', '1433746_at', '1454872_at', '1429194_at', '1424652_at', '1440795_x_at', '1458690_at', '1434355_at', '1456324_at', '1457867_at', '1429698_at', '1423104_at', '1437585_x_at', '1437739_a_at', '1445605_s_at', '1436313_at', 
'1449738_s_at', '1437525_a_at', '1454937_at', '1429043_at', '1440091_at', '1422820_at', '1437456_x_at', '1427322_at', '1446649_at', '1433568_at', '1441114_at', '1456541_x_at', '1426985_s_at', '1454764_s_at', '1424071_s_at', '1429251_at', '1429155_at', '1433946_at', '1448771_a_at', '1458664_at', '1438320_s_at', '1449616_s_at', '1435445_at', '1433872_at', '1429273_at', '1420880_a_at', '1448645_at', '1449646_s_at', '1428341_at', '1431299_a_at', '1433427_at', '1418530_at', '1436247_at', '1454350_at', '1455860_at', '1417145_at', '1454952_s_at', '1435977_at', '1434807_s_at', '1428715_at', '1418117_at', '1447947_at', '1431781_at', '1428915_at', '1427197_at', '1427208_at', '1455460_at', '1423899_at', '1441944_s_at', '1455429_at', '1452266_at', '1454409_at', '1426384_a_at', '1428725_at', '1419181_at', '1454862_at', '1452907_at', '1433794_at', '1435492_at', '1424839_a_at', '1416214_at', '1449312_at', '1436678_at', '1426253_at', '1438859_x_at', '1448189_a_at', '1442557_at', '1446174_at', '1459718_x_at', '1437613_s_at', '1456509_at', '1455267_at', '1440480_at', '1417296_at', '1460050_x_at', '1433585_at', '1436771_x_at', '1424294_at', '1448648_at', '1417753_at', '1436139_at', '1425642_at', '1418553_at', '1415747_s_at', '1445984_at', '1440024_at', '1448720_at', '1429459_at', '1451459_at', '1428853_at', '1433856_at', '1426248_at', '1417765_a_at', '1439459_x_at', '1447023_at', '1426088_at', '1440825_s_at', '1417390_at', '1444744_at', '1435618_at', '1424635_at', '1443727_x_at', '1421096_at', '1427410_at', '1416860_s_at', '1442773_at', '1442030_at', '1452281_at', '1434774_at', '1416891_at', '1447915_x_at', '1429129_at', '1418850_at', '1416308_at', '1422858_at', '1447679_s_at', '1440903_at', '1417321_at', '1452342_at', '1453510_s_at', '1454923_at', '1454611_a_at', '1457532_at', '1438440_at', '1434232_a_at', '1455878_at', '1455571_x_at', '1436401_at', '1453289_at', '1457365_at', '1436708_x_at', '1434494_at', '1419588_at', '1433679_at', '1455159_at', '1428982_at', '1446510_at', '1434131_at', '1418066_at', '1435346_at', '1449415_at', '1455384_x_at', '1418817_at', '1442073_at', '1457265_at', '1447361_at', '1418039_at', '1428467_at', '1452224_at', '1417538_at', '1434529_x_at', '1442149_at', '1437379_x_at', '1416473_a_at', '1432750_at', '1428389_s_at', '1433823_at', '1451889_at', '1438178_x_at', '1441807_s_at', '1416799_at', '1420623_x_at', '1453245_at', '1434037_s_at', '1443012_at', '1443172_at', '1455321_at', '1438396_at', '1440823_x_at', '1436278_at', '1457543_at', '1452908_at', '1417483_at', '1418397_at', '1446589_at', '1450966_at', '1447877_x_at', '1446524_at', '1438592_at', '1455589_at', '1428629_at', '1429585_s_at', '1440020_at', '1417365_a_at', '1426442_at', '1427151_at', '1437377_a_at', '1433995_s_at', '1435464_at', '1417007_a_at', '1429690_at', '1427999_at', '1426819_at', '1454905_at', '1439516_at', '1434509_at', '1428707_at', '1416793_at', '1440822_x_at', '1437327_x_at', '1428682_at', '1435004_at', '1434238_at', '1417581_at', '1434699_at', '1455597_at', '1458613_at', '1456485_at', '1435122_x_at', '1452864_at', '1453122_at', '1435254_at', '1451221_at', '1460168_at', '1455336_at', '1427965_at', '1432576_at', '1455425_at', '1428762_at', '1455459_at', '1419317_x_at', '1434691_at', '1437950_at', '1426401_at', '1457261_at', '1433824_x_at', '1435235_at', '1437343_x_at', '1439964_at', '1444280_at', '1455434_a_at', '1424431_at', '1421519_a_at', '1428412_at', '1434010_at', '1419976_s_at', '1418887_a_at', '1428498_at', '1446883_at', '1435675_at', '1422599_s_at', '1457410_at', '1444437_at', '1421050_at', 
'1437885_at', '1459754_x_at', '1423807_a_at', '1435490_at', '1426760_at', '1449459_s_at', '1432098_a_at', '1437067_at', '1435574_at', '1433999_at', '1431289_at', '1428919_at', '1425678_a_at', '1434924_at', '1421640_a_at', '1440191_s_at', '1460082_at', '1449913_at', '1439830_at', '1425020_at', '1443790_x_at', '1436931_at', '1454214_a_at', '1455854_a_at', '1437061_at', '1436125_at', '1426385_x_at', '1431893_a_at', '1417140_a_at', '1435333_at', '1427907_at', '1434446_at', '1417594_at', '1426518_at', '1437345_a_at', '1420091_s_at', '1450058_at', '1435161_at', '1430348_at', '1455778_at', '1422653_at', '1447942_x_at', '1434843_at', '1454956_at', '1454998_at', '1427384_at', '1439828_at') AND + Species.Name = 'mouse' AND + ProbeSetXRef.ProbeSetFreezeId IN ( + SELECT ProbeSetFreeze.Id + FROM ProbeSetFreeze WHERE ProbeSetFreeze.Name = 'HC_M2_0606_P') + +Connection ID (thread ID): 41 +Status: NOT_KILLED + +Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off + +The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains +information that should help you find out what is causing the crash. +Writing a core file... +Working directory at /export/mysql/var/lib/mysql +Resource Limits: +Limit Soft Limit Hard Limit Units +Max cpu time unlimited unlimited seconds +Max file size unlimited unlimited bytes +Max data size unlimited unlimited bytes +Max stack size 8388608 unlimited bytes +Max core file size 0 unlimited bytes +Max resident set unlimited unlimited bytes +Max processes 3094157 3094157 processes +Max open files 64000 64000 files +Max locked memory 65536 65536 bytes +Max address space unlimited unlimited bytes +Max file locks unlimited unlimited locks +Max pending signals 3094157 3094157 signals +Max msgqueue size 819200 819200 bytes +Max nice priority 0 0 +Max realtime priority 0 0 +Max realtime timeout unlimited unlimited us +Core pattern: core + +Kernel version: Linux version 5.10.0-22-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.178-3 (2023-04-22) + +2025-02-20 7:50:17 0 [Note] Starting MariaDB 10.5.23-MariaDB-0+deb11u1-log source revision 6cfd2ba397b0ca689d8ff1bdb9fc4a4dc516a5eb as process 3086167 +2025-02-20 7:50:17 0 [Note] InnoDB: !!! innodb_force_recovery is set to 1 !!! 
+2025-02-20 7:50:17 0 [Note] InnoDB: Uses event mutexes +2025-02-20 7:50:17 0 [Note] InnoDB: Compressed tables use zlib 1.2.11 +2025-02-20 7:50:17 0 [Note] InnoDB: Number of pools: 1 +2025-02-20 7:50:17 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions +2025-02-20 7:50:17 0 [Note] InnoDB: Using Linux native AIO +2025-02-20 7:50:17 0 [Note] InnoDB: Initializing buffer pool, total size = 17179869184, chunk size = 134217728 +2025-02-20 7:50:17 0 [Note] InnoDB: Completed initialization of buffer pool +2025-02-20 7:50:17 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1537379110991,1537379110991 +2025-02-20 7:50:17 0 [Note] InnoDB: Last binlog file '/var/lib/mysql/gn0-binary-log.000134', position 82843148 +2025-02-20 7:50:17 0 [Note] InnoDB: 128 rollback segments are active. +2025-02-20 7:50:17 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" +2025-02-20 7:50:17 0 [Note] InnoDB: Creating shared tablespace for temporary tables +2025-02-20 7:50:17 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ... +2025-02-20 7:50:17 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB. +2025-02-20 7:50:17 0 [Note] InnoDB: 10.5.23 started; log sequence number 1537379111003; transaction id 3459549902 +2025-02-20 7:50:17 0 [Note] Plugin 'FEEDBACK' is disabled. +2025-02-20 7:50:17 0 [Note] InnoDB: Loading buffer pool(s) from /export/mysql/var/lib/mysql/ib_buffer_pool +2025-02-20 7:50:17 0 [Note] Loaded 'locales.so' with offset 0x7f9551bc0000 +2025-02-20 7:50:17 0 [Note] Recovering after a crash using /var/lib/mysql/gn0-binary-log +2025-02-20 7:50:17 0 [Note] Starting crash recovery... +2025-02-20 7:50:17 0 [Note] Crash recovery finished. +2025-02-20 7:50:17 0 [Note] Server socket created on IP: '0.0.0.0'. +2025-02-20 7:50:17 0 [Warning] 'user' entry 'webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:50:17 0 [Warning] 'db' entry 'db_webqtl webqtlout@tux01' ignored in --skip-name-resolve mode. +2025-02-20 7:50:17 0 [Note] Reading of all Master_info entries succeeded +2025-02-20 7:50:17 0 [Note] Added new Master_info '' to hash table +2025-02-20 7:50:17 0 [Note] /usr/sbin/mariadbd: ready for connections. +Version: '10.5.23-MariaDB-0+deb11u1-log' socket: '/run/mysqld/mysqld.sock' port: 3306 Debian 11 +2025-02-20 7:50:17 4 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:50:17 5 [Warning] Access denied for user 'root'@'localhost' (using password: NO) +2025-02-20 7:50:17 0 [Note] InnoDB: Buffer pool(s) load completed at 250220 7:50:17 +``` + +A possible issue is the use of the environment variable SQL_URI at this point: + +=> https://github.com/genenetwork/genenetwork2/blob/testing/gn2/wqflask/correlation/rust_correlation.py#L34 + +which is requested + +=> https://github.com/genenetwork/genenetwork2/blob/testing/gn2/wqflask/correlation/rust_correlation.py#L7 from here. + +I tried setting an environment variable "SQL_URI" with the same value as the config and rebuilt the container. That did not fix the problem. + +Running the query directly in the default mysql client also fails with: + +``` +ERROR 2013 (HY000): Lost connection to MySQL server during query +``` + +Huh, so this was not a code problem. + +Configured database to allow upgrade of tables if necessary and restarted mariadbd. + +The problem still persists. + +Note Pjotr: this is likely a mariadb bug with 10.5.23, the most recent mariadbd we use (both tux01 and tux02 are older). 
The dump shows it balks on creating a new thread: pthread_create.c:478. Looks similar to https://jira.mariadb.org/browse/MDEV-32262 + +10.5, 10.6 and 10.11 are affected. So running correlations on production crashes mysqld? I am not trying for obvious reasons ;) The threading issues of mariadb look scary - I wonder how deep they go. + +We'll test a different version of mariadb, combined with a Debian update, because Debian on tux04 is broken. diff --git a/issues/provide-link-to-register-user-in-sign-in-page.gmi b/issues/provide-link-to-register-user-in-sign-in-page.gmi index 24d7c21..b9e6a4d 100644 --- a/issues/provide-link-to-register-user-in-sign-in-page.gmi +++ b/issues/provide-link-to-register-user-in-sign-in-page.gmi @@ -3,7 +3,7 @@ ## Tags * type: bug -* status: open +* status: closed * assigned: fredm * priority: medium * keywords: register user, gn-auth, genenetwork @@ -16,3 +16,8 @@ Provide a link allowing a user to register with the system on the sign-in page. We are now using OAuth2 to enable sign-in, which means that the user is redirected from the service they were in to the authorisation service to sign-in. The service should retain a note of the service which the user came from, and redirect back to it on successful registration. + + +### Close as Completed + +@zachs seems to have fixed this. diff --git a/issues/quality-control/r-qtl2-features.gmi b/issues/quality-control/r-qtl2-features.gmi index eac53c4..bcc5d71 100644 --- a/issues/quality-control/r-qtl2-features.gmi +++ b/issues/quality-control/r-qtl2-features.gmi @@ -3,7 +3,7 @@ ## Tags * type: listing -* status: open +* status: closed, completed * assigned: fredm * priority: high * keywords: listing, bug, feature @@ -12,5 +12,9 @@ This is a listing of non-critical features and bugs that do not currently have a dedicated issue, and need to be handled some time in the future. -* [feature] "Undo Transpose": Files marked as '*_transposed: true' will have the transposition undone to ease processing down the line. +* Closed, completed: [feature] "Undo Transpose": Files marked as '*_transposed: true' will have the transposition undone to ease processing down the line. * … + +### Close as completed + +Actually open dedicated issues for bugs and features rather than collecting them here. diff --git a/issues/rdf/automate-rdf-generation-and-ingress.gmi b/issues/rdf/automate-rdf-generation-and-ingress.gmi index ef4ba9f..dedf5d8 100644 --- a/issues/rdf/automate-rdf-generation-and-ingress.gmi +++ b/issues/rdf/automate-rdf-generation-and-ingress.gmi @@ -4,7 +4,7 @@ * assigned: bonfacem * priority: high -* tags: in-progress +* tags: done * deadline: 2024-10-23 Wed We need to update Virtuoso in production. At the moment this is done manually. For the current set-up, we need to update the recent modified RIF+WIKI models: @@ -35,3 +35,5 @@ CHECKPOINT; Above steps should be automated and tested in CD before roll-out in production. Key considerations: - Pick latest important changes from git, so that we can pick what files to run instead of generating all the ttl files all the time. + +* closed diff --git a/issues/rdf/expose-rdf-to-web.gmi b/issues/rdf/expose-rdf-to-web.gmi new file mode 100644 index 0000000..e5da94a --- /dev/null +++ b/issues/rdf/expose-rdf-to-web.gmi @@ -0,0 +1,85 @@ +# Expose Versioned "rdf.genenetwork.org" Namespaces + +* assigned: bonfacem +* status: in-progress + +## Description + +We have switched all RDF namespaces from "genenetwork.org" to the versioned base "rdf.genenetwork.org/v1." These endpoints don't resolve yet.
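+
+A quick way to probe whether a v1 IRI dereferences with content negotiation (a sketch using python-requests; the term IRI below is just an illustrative example):
+
+```
+import requests
+
+IRI = "http://rdf.genenetwork.org/v1/term/shortName"
+for accept in ("text/html", "text/turtle", "application/rdf+xml"):
+    resp = requests.get(IRI, headers={"Accept": accept}, timeout=10)
+    # Anything in the 4XX/5XX range means the namespace does not resolve yet.
+    print(accept, resp.status_code, resp.headers.get("Content-Type"))
+```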
+ +## What changed + +Replaced + +* "http://genenetwork.org/id/" -> "http://rdf.genenetwork.org/v1/id/" +* "http://genenetwork.org/category/" -> "http://rdf.genenetwork.org/v1/category/" +* "http://genenetwork.org/term/" -> "http://rdf.genenetwork.org/v1/term/" + +## Current Problem + +New "rdf.genenetwork.org/v1/*" URIs return 5XX/4XX responses, which blocks validation, dereferencing and external re-use. + +## Expected Behaviour + +All rdf.genenetwork.org/v1/* namespaces resolve over HTTP. At minimum: + +* Human-readable HTML in a browser. +* RDF (Turtle or RDF/XML) via content negotiation. + +## Notes + +(Example) Queries for all terms/categories/ids: + +``` +PREFIX gn: <http://rdf.genenetwork.org/v1/id/> +PREFIX gnc: <http://rdf.genenetwork.org/v1/category/> +PREFIX gnt: <http://rdf.genenetwork.org/v1/term/> + +CONSTRUCT { + gn:Arabidopsis_thaliana ?p ?o . +} FROM <http://rdf.genenetwork.org/v1> +WHERE { + gn:Arabidopsis_thaliana ?p ?o . + ?s ?p ?o . +} + +CONSTRUCT { + gnc:phenotype ?p ?o . +} FROM <http://rdf.genenetwork.org/v1> +WHERE { + gnc:phenotype ?p ?o . + ?s ?p ?o . +} + +CONSTRUCT { + gnt:shortName ?p ?o . +} FROM <http://rdf.genenetwork.org/v1> +WHERE { + gnt:shortName ?p ?o . + ?s ?p ?o . +} + +``` + +Some terms/categories/ids descriptions will have to be manually updated since querying them as a subject returns nothing. + +When serving, support all formats offered by Virtuoso's content negotiation: + +=> https://docs.openlinksw.com/virtuoso/rdfsparqlprotocolendpoint/ + +Location of all files (tux02): /export/data/genenetwork-virtuoso/ + +For current gn3, old data is maintained in this named graph: "http://genenetwork.org" while new data is being uploaded to this named graph: "http://rdf.genenetwork.org/v1" + +## Tasks + +* [X] Set up new named graph for the rdf endpoint. Maintain old named graph for gn3 stability. +* [X] Serve: "/v1/id/<id>"; "/v1/category/<category>"; and "/v1/term/<term>" under "rdf.genenetwork.org" as "text/microdata+html" +* [X] Verify: `curl -H "Accept: text/html" http://rdf.genenetwork.org/v1/term/...` +* [X] Configure DNS (rdf.genenetwork.org) to point to gn-guile server. + +## Resolution + +=> https://git.genenetwork.org/gn-guile/commit/?id=45662839565f6482e7f034a07ae373bbeaeb9713 + +* closed diff --git a/issues/rdf/search-indexing-general-issues.gmi b/issues/rdf/search-indexing-general-issues.gmi index 3bcc36a..bd49419 100644 --- a/issues/rdf/search-indexing-general-issues.gmi +++ b/issues/rdf/search-indexing-general-issues.gmi @@ -30,3 +30,5 @@ A fix for this would be to replace "add_boolean_prefix" with "add_prefix". ## CIS/TRANS Searches The challenge with this search is that we would have to compare values for each possible result against one another, necessitating the generation of position values separately for every possible result. Also, for the devs (jnduli, bonfacem) we need to have a better understanding of how this works, which is currently vague.
+ +* closed diff --git a/issues/rdf/virtuoso-container-log-permission-crash.gmi b/issues/rdf/virtuoso-container-log-permission-crash.gmi new file mode 100644 index 0000000..14cbf4f --- /dev/null +++ b/issues/rdf/virtuoso-container-log-permission-crash.gmi @@ -0,0 +1,23 @@ +# Virtuoso Fails to Start Due to Incorrect Log File Ownership + +* assigned: bonfacem +* status: closed + +## Description + +In CD, the virtuoso container keeps exiting and being re-spawned with the following error: + +``` +Thu Jan 15 2026 11:20:44 Can't open log : Permission denied +``` + +## Resolution + +Fixed it by running (from inside the container): + +``` +chown virtuoso:virtuoso /var/lib/virtuoso +chmod 755 /var/lib/virtuoso +``` + +* closed diff --git a/issues/systems/apps.gmi b/issues/systems/apps.gmi index 51c9d24..e374250 100644 --- a/issues/systems/apps.gmi +++ b/issues/systems/apps.gmi @@ -153,7 +153,7 @@ downloading from http://cran.r-project.org/src/contrib/Archive/KernSmooth/KernSm - 'configure' phasesha256 hash mismatch for /gnu/store/n05zjfhxl0iqx1jbw8i6vv1174zkj7ja-KernSmooth_2.23-17.tar.gz: expected hash: 11g6b0q67vasxag6v9m4px33qqxpmnx47c73yv1dninv2pz76g9b actual hash: 1ciaycyp79l5aj78gpmwsyx164zi5jc60mh84vxxzq4j7vlcdb5p -hash mismatch for store item '/gnu/store/n05zjfhxl0iqx1jbw8i6vv1174zkj7ja-KernSmooth_2.23-17.tar.gz' + hash mismatch for store item '/gnu/store/n05zjfhxl0iqx1jbw8i6vv1174zkj7ja-KernSmooth_2.23-17.tar.gz' ``` Guix checks, and it is not great that CRAN allows changing tarballs with the same version number!! Luckily building with a more recent version of Guix just worked (TM). Now we create a root too: @@ -184,12 +184,42 @@ and it looks like lines like these need to be updated: => https://github.com/genenetwork/singleCellRshiny/blob/6b2a344dd0d02f65228ad8c350bac0ced5850d05/app.R#L167 -Let me ask the author Siamak Yousefi. +Let me ask the author Siamak Yousefi. I think we'll drop it. + +## longevity + +Package definition is at + +=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/mouse-longevity.scm + +Container is at + +=> https://git.genenetwork.org/gn-machines/tree/gn/services/mouse-longevity.scm + +gaeta:~/iwrk/deploy/gn-machines$ guix system container -L . -L ~/guix-bioinformatics --verbosity=3 test-r-container.scm -L ~/iwrk/deploy/guix-forge/guix +forge/nginx.scm:145:40: error: acme-service-type: unbound variable +hint: Did you forget `(use-modules (forge acme))'? + ## jumpshiny +Jumpshiny is hosted on balg01. Scripts are in tux02 git. + +=> git.genenetwork.org:/home/git/shared/source/jumpshiny + ``` -balg01:~/gn-machines$ guix system container --network -L . -L ../guix-bioinformatics/ -L ../guix-past/modules/ --substitute-urls='https: -//ci.guix.gnu.org https://bordeaux.guix.gnu.org https://cuirass.genenetwork.org' test-r-container.scm -L ../guix-forge/guix/ -/gnu/store/xyks73sf6pk78rvrwf45ik181v0zw8rx-run-container +root@balg01:/home/j*/gn-machines# . /usr/local/guix-profiles/guix-pull/etc/profile +guix system container --network -L . -L ../guix-forge/guix/ -L ../guix-bioinformatics/ -L ../guix-past/modules/ --substitute-urls='https://ci.guix.gnu.org https://bordeaux.guix.gnu.org https://cuirass.genenetwork.org' test-r-container.scm -L ../guix-forge/guix/gnu/store/xyks73sf6pk78rvrwf45ik181v0zw8rx-run-container +/gnu/store/6y65x5jk3lxy4yckssnl32yayjx9nwl5-run-container ``` + +Currently: + +Jumpshiny: as aijun, cd services/jumpshiny and ./.guix-run + + +## JUMPsem_web + +Another shiny app to run on balg01.
+ +Jumpshiny: as aijun, cd services/jumpsem and ./.guix-run diff --git a/issues/systems/octopus.gmi b/issues/systems/octopus.gmi index c510fd9..3a6d317 100644 --- a/issues/systems/octopus.gmi +++ b/issues/systems/octopus.gmi @@ -1,6 +1,9 @@ # Octopus sysmaintenance -Reopened tasks because of new sheepdog layout and add new machines to Octopus and get fiber optic network going with @andreag. See also +Reopened tasks because of the new sheepdog layout; we also need to add new machines to Octopus and get the fiber optic network going with @andreag. +IT recently upgraded the network switch, so we should have great interconnect between all nodes. We also need to work on user management and network storage. + +See also => ../../topics/systemtopics/systems/hpcs/hpc/octopus-maintenance @@ -14,7 +17,7 @@ Reopened tasks because of new sheepdog layout and add new machines to Octopus an # Tasks -* [ ] add lizardfs to nodes +* [X] add lizardfs to nodes * [ ] add PBS to nodes * [ ] use fiber optic network * [ ] install sheepdog @@ -36,6 +39,17 @@ default via 172.23.16.1 dev ens1f0np0 # Current topology +``` +vim /etc/ssh/sshd_config +systemctl reload ssh +``` + +The routing should be as on octopus01: + +``` +default via 172.23.16.1 dev eno1 +172.23.16.0/21 dev ens1f0np0 proto kernel scope link src 172.23.18.221 +172.23.16.0/21 dev eno1 proto kernel scope link src 172.23.18.188 +``` + ``` ip a ip route @@ -44,3 +58,9 @@ ip route - Octopus01 uses eno1 172.23.18.188/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps) - Octopus02 uses eno1 172.23.17.63/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps) 172.23.x.x + +# Work + +* After the switch upgrade penguin2 NFS is not visible from octopus01. I disabled the mount in fstab. +* On octopus01 I disabled the unattended upgrade script - we don't want kernel updates on this machine(!) +* Updated IP addresses in sshd_config diff --git a/issues/systems/octoraid-storage.gmi b/issues/systems/octoraid-storage.gmi new file mode 100644 index 0000000..97e0e55 --- /dev/null +++ b/issues/systems/octoraid-storage.gmi @@ -0,0 +1,18 @@ +# OctoRAID + +We are building machines that can handle cheap drives. + +# octoraid01 + +This is a Jetson with 4 22TB Seagate IronWolf Pro (ST22000NT001) enterprise NAS hard drives (7200 rpm). + +Unfortunately the stock kernel has no RAID support, so we simply mount the 4 drives (hosted on a USB-SATA bridge). + +Stress testing: + +``` +cd /export/nfs/lair01 +stress -v -d 1 +``` + +Running on multiple disks, the Jetson is holding up well! diff --git a/issues/systems/penguin2-raid5.gmi b/issues/systems/penguin2-raid5.gmi new file mode 100644 index 0000000..f03075d --- /dev/null +++ b/issues/systems/penguin2-raid5.gmi @@ -0,0 +1,61 @@ +# Penguin2 RAID 5 + +# Tags + +* assigned: @fredm, @pjotrp +* status: in progress + +# Description + +The current RAID contains 3 disks: + +``` +root@penguin2:~# cat /proc/mdstat +md0 : active raid5 sdb1[1] sda1[0] sdg1[4] +/dev/md0 33T 27T 4.2T 87% /export +``` + +using /dev/sda,sdb,sdg + +The current root and swap is on + +``` +# root +/dev/sdd1 393G 121G 252G 33% / +# swap +/dev/sdd5 partition 976M 76.5M -2 +``` + +We can therefore add four new disks in slots /dev/sdc,sde,sdf,sdh + +penguin2 has no out-of-band and no serial connector right now. That means any work needs to be done on the terminal.
+ +Boot loader menu: + +``` +menuentry 'Debian GNU/Linux' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-7ff268df-cb90-4cbc-9d76-7fd6677b4964' { + load_video + insmod gzio + if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi + insmod part_msdos + insmod ext2 + set root='hd2,msdos1' + if [ x$feature_platform_search_hint = xy ]; then + search --no-floppy --fs-uuid --set=root --hint-bios=hd2,msdos1 --hint-efi=hd2,msdos1 --hint-baremetal=ahci2,msdos1 7ff268df-cb90-4cbc-9d76-7fd6677b4964 + else + search --no-floppy --fs-uuid --set=root 7ff268df-cb90-4cbc-9d76-7fd6677b4964 + fi + echo 'Loading Linux 5.10.0-18-amd64 ...' + linux /boot/vmlinuz-5.10.0-18-amd64 root=UUID=7ff268df-cb90-4cbc-9d76-7fd6677b4964 ro quiet + echo 'Loading initial ramdisk ...' + initrd /boot/initrd.img-5.10.0-18-amd64 +} +``` + +Added to sdd MBR + +``` +root@penguin2:~# grub-install /dev/sdd +Installing for i386-pc platform. +Installation finished. No error reported. +``` diff --git a/issues/systems/t02-crash.gmi b/issues/systems/t02-crash.gmi new file mode 100644 index 0000000..bf0c5d5 --- /dev/null +++ b/issues/systems/t02-crash.gmi @@ -0,0 +1,47 @@ +## Postmortem tux02 crash + +I'll take a look at tux02 - it rebooted last night and I need to start some services. It rebooted at Aug 07 19:29:14 CDT (tux02 kernel: Linux version ...). We have two out-of-memory messages before that: + +``` +Aug 7 18:45:27 tux02 kernel: [13521994.665636] Out of memory: Kill process 30165 (guix) score 759 or sacrifice child +Aug 7 18:45:27 tux02 kernel: [13521994.758974] Killed process 30165 (guix) total-vm:498873224kB, anon-rss:223599272kB, file-rss:4kB, shmem-rss:0kB +``` + +My mosh clapped out before that + +``` +wrk pts/96 mosh [128868] Thu Aug 7 18:53 - down (00:00) +``` + +Someone killed the development container before that + +``` +Aug 7 18:06:32 tux02 systemd[1]: genenetwork-development-container.service: Killing process 86832 (20qjyhd7n9n62fa) with signal SIGKILL. +``` + +and + +``` +Aug 7 13:28:26 tux02 kernel: [13502972.611421] oom_reaper: reaped process 25224 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB +Aug 7 18:16:00 tux02 kernel: [13520227.160945] oom_reaper: reaped process 128091 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB +``` + +Guix builds running out of RAM... My conclusion is that someone has been doing some heavy lifting. Probably Fred. I'll ask him to use a different machine that is not shared by many people. First I need to bring up some processes. The shepherd had not started, so: + +``` +systemctl status user-shepherd.service +``` + +Most services started now. I need to check in half an hour. + +BNW is the one that does not start up automatically. + +``` +su shepherd +herd status +herd stop bnw +herd status bnw +tail -f /home/shepherd/logs/bnw.log +``` + +This shows a process is blocking the port. Kill as root, after making sure herd status shows it as stopped. diff --git a/issues/systems/tux02-production.gmi b/issues/systems/tux02-production.gmi index 7de911f..d811c5e 100644 --- a/issues/systems/tux02-production.gmi +++ b/issues/systems/tux02-production.gmi @@ -14,9 +14,9 @@ We are going to move production to tux02 - tux01 will be the staging machine. Th * [X] update guix guix-1.3.0-9.f743f20 * [X] set up nginx (Debian) -* [X] test ipmi console (172.23.30.40) +* [X] test ipmi console * [X] test ports (nginx) -* [?] set up network for external tux02e.uthsc.edu (128.169.4.52) +* [?]
set up network for external tux02 * [X] set up deployment environment * [X] sheepdog copy database backup from tux01 on a daily basis using ibackup user * [X] same for GN2 production environment diff --git a/issues/systems/tux04-disk-issues.gmi b/issues/systems/tux04-disk-issues.gmi index 9bba105..3df0a03 100644 --- a/issues/systems/tux04-disk-issues.gmi +++ b/issues/systems/tux04-disk-issues.gmi @@ -101,3 +101,323 @@ and nothing ;). Megacli is actually the tool to use ``` megacli -AdpAllInfo -aAll ``` + +# Database + +During a backup the DB shows this error: + +``` +2025-03-02 06:28:33 Database page corruption detected at page 1079428, retrying...\n[01] 2025-03-02 06:29:33 Database page corruption detected at page 1103108, retrying... +``` + + +Interestingly, the DB recovered on a second backup. + +The database is hosted on a solid-state drive, /dev/sde (Dell Ent NVMe FI). The log says + +``` +kernel: I/O error, dev sde, sector 2136655448 op 0x0:(READ) flags 0x80700 phys_seg 40 prio class 2 +``` + +Suggests: + +=> https://stackoverflow.com/questions/50312219/blk-update-request-i-o-error-dev-sda-sector-xxxxxxxxxxx + +> The errors that you see are interface errors, they are not coming from the disk itself but rather from the connection to it. It can be the cable or any of the ports in the connection. +> Since the CRC errors on the drive do not increase I can only assume that the problem is on the receive side of the machine you use. You should check the cable and try a different SATA port on the server. + +and someone wrote + +> analyzed that most of the reasons are caused by intensive reading and writing. This is a CDN cache node. Type reading NVME temperature is relatively high, if it continues, it will start to throttle and then slowly collapse. + +and the temperature on that drive has been 70 C. + +The mariadb log is showing errors: + +``` +2025-03-02 6:54:47 0 [ERROR] InnoDB: Failed to read page 449925 from file './db_webqtl/SnpAll.ibd': Page read from tablespace is corrupted. +2025-03-02 7:01:43 489015 [ERROR] Got error 180 when reading table './db_webqtl/ProbeSetXRef' +2025-03-02 8:10:32 489143 [ERROR] Got error 180 when reading table './db_webqtl/ProbeSetXRef' +``` + +Let's try and dump those tables when the backup is done.
+ +``` +mariadb-dump -uwebqtlout db_webqtl SnpAll +mariadb-dump: Error 1030: Got error 1877 "Unknown error 1877" from storage engine InnoDB when dumping table `SnpAll` at row: 0 +mariadb-dump -uwebqtlout db_webqtl ProbeSetXRef > ProbeSetXRef.sql +``` + +Eeep: + +``` +tux04:/etc$ mariadb-check -uwebqtlout -c db_webqtl ProbeSetXRef +db_webqtl.ProbeSetXRef +Warning : InnoDB: Index ProbeSetFreezeId is marked as corrupted +Warning : InnoDB: Index ProbeSetId is marked as corrupted +error : Corrupt +tux04:/etc$ mariadb-check -uwebqtlout -c db_webqtl SnpAll +db_webqtl.SnpAll +Warning : InnoDB: Index PRIMARY is marked as corrupted +Warning : InnoDB: Index SnpName is marked as corrupted +Warning : InnoDB: Index Rs is marked as corrupted +Warning : InnoDB: Index Position is marked as corrupted +Warning : InnoDB: Index Source is marked as corrupted +error : Corrupt +``` + +On tux01 we have a working database, so we can test with + +``` +mysqldump --no-data --all-databases > table_schema.sql +mysqldump -uwebqtlout db_webqtl SnpAll > SnpAll.sql +``` + +Running the backup with rate limiting from: + +``` +Mar 02 17:09:59 tux04 sudo[548058]: pam_unix(sudo:session): session opened for user root(uid=0) by wrk(uid=1000) +Mar 02 17:09:59 tux04 sudo[548058]: wrk : TTY=pts/3 ; PWD=/export3/local/home/wrk/iwrk/deploy/gn-deploy-servers/scripts/tux04 ; USER=roo> +Mar 02 17:09:55 tux04 sudo[548058]: pam_unix(sudo:auth): authentication failure; logname=wrk uid=1000 euid=0 tty=/dev/pts/3 ruser=wrk rhost= > +Mar 02 17:04:26 tux04 su[548006]: pam_unix(su:session): session opened for user ibackup(uid=1003) by wrk(uid=0) +``` + +Oh oh + +Tux04 is showing errors on all disks. We have to bail out. I am copying the potentially corrupted files to tux01 right now. We have backups, so nothing serious I hope. I am only worried about the myisam files we have because they have no strong internal validation: + +``` +2025-03-04 8:32:45 502 [ERROR] db_webqtl.ProbeSetData: Record-count is not ok; is 5264578601 Should be: 5264580806 +2025-03-04 8:32:45 502 [Warning] db_webqtl.ProbeSetData: Found 28665 deleted space. Should be 0 +2025-03-04 8:32:45 502 [Warning] db_webqtl.ProbeSetData: Found 2205 deleted blocks Should be: 0 +2025-03-04 8:32:45 502 [ERROR] Got an error from thread_id=502, ./storage/myisam/ha_myisam.cc:1120 +2025-03-04 8:32:45 502 [ERROR] MariaDB thread id 502, OS thread handle 139625162532544, query id 837999 localhost webqtlout Checking table +CHECK TABLE ProbeSetData +2025-03-04 8:34:02 79695 [ERROR] mariadbd: Table './db_webqtl/ProbeSetData' is marked as crashed and should be repaired +``` + +See also + +=> https://dev.mysql.com/doc/refman/8.4/en/myisam-check.html + +Tux04 will require open heart 'disk controller' surgery and some severe testing before we move back. We'll also look at tux05-8 to see if they have similar problems. + +## Recovery + +According to the logs tux04 started showing serious errors on March 2nd - when I introduced sanitizing the mariadb backup: + +``` +Mar 02 05:00:42 tux04 kernel: I/O error, dev sde, sector 2071078320 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2 +Mar 02 05:00:58 tux04 kernel: I/O error, dev sde, sector 2083650928 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 2 +... +``` + +The log started on Feb 23 when we had our last reboot. It probably is a good idea to turn on persistent logging! Anyway, it is likely files were fine until March 2nd.
Similarly, the mariadb logs also show + +``` +2025-03-02 6:53:52 489007 [ERROR] mariadbd: Index for table './db_webqtl/ProbeSetData.MYI' is corrupt; try to repair it +2025-03-02 6:53:52 489007 [ERROR] db_webqtl.ProbeSetData: Can't read key from filepos: 2269659136 +``` + +So, if we can restore a backup from March 1st we should be reasonably confident it is sane. + +The first step is to back up the existing database(!). Next, restore the new DB by changing the DB location (the symlink in /var/lib/mysql; also check /etc/mysql/mariadb.cnf). + +When upgrading it is a good idea to switch on these in mariadb.cnf: + +``` +# forcing recovery with these two lines: +innodb_force_recovery=3 +innodb_purge_threads=0 +``` + +Make sure to disable (and restart) once it is up and running! + +So the steps are: + +* [X] install updated guix version of mariadb in /usr/local/guix-profiles (don't use Debian!!) +* [X] repair borg backup +* [X] Stop old mariadb (on new host tux02) +* [X] backup old mariadb database +* [X] restore 'sane' version of DB from borg March 1st +* [X] point to new DB in /var/lib/mysql and cnf file +* [X] update systemd settings +* [X] start mariadb new version with recovery setting in cnf +* [X] check logs +* [X] once running revert on recovery setting in cnf and restart + +OK, looks like we are in business again. In the next phase we need to validate files. Normal files can be checked with + +``` +find -type f \( -not -name "md5sum.txt" \) -exec md5sum '{}' \; > md5sum.txt +``` + +and compared with another set on a different server with + +``` +md5sum -c md5sum.txt +``` + +* [X] check genotype file directory - some MAGIC files missing on tux01 + +gn-docs is a git repo, so that is easily checked + +* [X] check gn-docs and sync with master repo + + +## Other servers + +``` +journalctl -r|grep -i "I/O error"|less +# tux05 +Nov 18 02:19:55 tux05 kernel: XFS (sdc2): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x78 len 8 error 74 +Nov 05 14:36:32 tux05 kernel: blk_update_request: I/O error, dev sdb, sector 1993616 op 0x1:(WRITE) flags +0x0 phys_seg 35 prio class 0 +Jul 27 11:56:22 tux05 kernel: blk_update_request: I/O error, dev sdc, sector 55676616 op 0x0:(READ) flags +0x80700 phys_seg 26 prio class 0 +Jul 27 11:56:22 tux05 kernel: blk_update_request: I/O error, dev sdc, sector 55676616 op 0x0:(READ) flags +0x80700 phys_seg 26 prio class 0 +# tux06 +Apr 15 08:10:57 tux06 kernel: I/O error, dev sda, sector 21740352 op 0x1:(WRITE) flags 0x1000 phys_seg 4 prio class 2 +Dec 13 12:56:14 tux06 kernel: I/O error, dev sdb, sector 3910157327 op 0x9:(WRITE_ZEROES) flags 0x8000000 phys_seg 0 prio class 2 +# tux07 +Mar 27 08:00:11 tux07 mfschunkserver[1927469]: replication error: failed to create chunk (No space left) +# tux08 +Mar 27 08:12:11 tux08 mfschunkserver[464794]: replication error: failed to create chunk (No space left) +``` + +Tux04, 05 and 06 show disk errors. Tux07 and Tux08 are overloaded with a full disk, but no other errors. We need to babysit Lizard more!
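+
+A sketch of how such babysitting could be automated (the hostnames are ours; the one-week log window and passwordless ssh access are assumptions):
+
+```
+import subprocess
+
+HOSTS = ["tux04", "tux05", "tux06", "tux07", "tux08"]
+
+for host in HOSTS:
+    # Pull a week of kernel messages from each host and count I/O errors.
+    log = subprocess.run(
+        ["ssh", host, "journalctl -k --since '7 days ago'"],
+        capture_output=True, text=True, check=False,
+    ).stdout
+    errors = [line for line in log.splitlines() if "i/o error" in line.lower()]
+    print(f"{host}: {len(errors)} I/O error lines in the last week")
+```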
+ +``` +stress -v -d 1 +``` + +Write test: + +``` +dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct +``` + +Read test: + +``` +/sbin/sysctl -w vm.drop_caches=3 +dd if=./test of=/dev/zero bs=512k count=2048 +``` + + +smartctl -a /dev/sdd -d megaraid,0 + +RAID Controller in SL 3: Dell PERC H755N Front + +# The story continues + +I don't know what happened but the server gave a hard error in the logs: + +``` +racadm getsel # get system log +Record: 340 +Date/Time: 05/31/2025 09:25:17 +Source: system +Severity: Critical +Description: A high-severity issue has occurred at the Power-On +Self-Test (POST) phase which has resulted in the system BIOS to +abruptly stop functioning. +``` + +Woops! I fixed it by resetting idrac and rebooting remotely. Nasty. + +Looking around I found this link + +=> https://tomaskalabis.com/wordpress/a-high-severity-issue-has-occurred-at-the-power-on-self-test-post-phase-which-has-resulted-in-the-system-bios-to-abruptly-stop-functioning/ + +suggesting we should upgrade idrac firmware. I am not going to do that +without backups and a fully up-to-date fallback online. It may fix the +other hardware issues we have been seeing (who knows?). + +Fred, the boot sequence is not perfect yet. Turned out the network +interfaces do not come up in the right order and nginx failed because +of a missing /var/run/nginx. The container would not restart because - +with that directory missing - it could not check the certificates. + +## A week later + +``` +[SMM] APIC 0x00 S00:C00:T00 > ASSERT [AmdPlatformRasRsSmm] u:\EDK2\MdePkg\Library\BasePciSegmentLibPci\PciSegmentLib.c(766): ((Address) & (0xfffffffff0000000ULL | (3))) == 0 !!!! X64 Exception Type - 03(#BP - Breakpoint) CPU Apic ID - 00000000 !!!! +RIP - 0000000076DA4343, CS - 0000000000000038, RFLAGS - 0000000000000002 +RAX - 0000000000000010, RCX - 00000000770D5B58, RDX - 00000000000002F8 +RBX - 0000000000000000, RSP - 0000000077773278, RBP - 0000000000000000 +RSI - 0000000000000087, RDI - 00000000777733E0 R8 - 00000000777731F8, R9 - 0000000000000000, R10 - 0000000000000000 +R11 - 00000000000000A0, R12 - 0000000000000000, R13 - 0000000000000000 +R14 - FFFFFFFFA0C1A118, R15 - 000000000005B000 +DS - 0000000000000020, ES - 0000000000000020, FS - 0000000000000020 +GS - 0000000000000020, SS - 0000000000000020 +CR0 - 0000000080010033, CR2 - 0000000015502000, CR3 - 0000000077749000 +CR4 - 0000000000001668, CR8 - 0000000000000001 +DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 +GDTR - 000000007773C000 000000000000004F, LDTR - 0000000000000000 IDTR - 0000000077761000 00000000000001FF, TR - 0000000000000040 +FXSAVE_STATE - 0000000077772ED0 +!!!! Find image based on IP(0x76DA4343) u:\Build_Genoa\DellBrazosPkg\DEBUG_MYTOOLS\X64\DellPkgs\DellChipsetPkgs\AmdGenoaModulePkg\Override\AmdCpmPkg\Features\PlatformRas\Rs\Smm\AmdPlatformRasRsSmm\DEBUG\AmdPlatformRasRsSmm.pdb (ImageBase=0000000076D3E000, EntryPoint=0000000076D3E6C0) !!!! +``` + +New error in system log: + +``` +Record: 341 Date/Time: 06/04/2025 19:47:08 +Source: system +Severity: Critical Description: A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning. +``` + +The error appears to relate to AMD Brazos, which is probably part of the onboard APU/GPU.
+ +The code where it segfaulted is online at: + +=> https://github.com/tianocore/edk2/blame/master/MdePkg/Library/BasePciSegmentLibPci/PciSegmentLib.c + +and has to do with PCI registers; that could actually be caused by the new PCIe card we hosted. + +# Sept 2025 + +We moved production away from tux04, so now we should be able to work on this machine. + + +## System crash on tux04 + +And tux04 is down *again*. Wow, glad we moved off! I want to fix that machine and we had to move production off! I left the terminal open and the last message is: + +``` +tux04:~$ [SMM] APIC 0x00 S00:C00:T00 > ASSERT [AmdPlatformRasRsSmm] u:\EDK2\MdePkg\Library\BasePciSegmentLibPci\PciSegmentLib.c(766): ((Address) & (0xfffffffff0000000ULL | (3))) == 0 +!!!! X64 Exception Type - 03(#BP - Breakpoint) CPU Apic ID - 00000000 !!!! +RIP - 0000000076DA4343, CS - 0000000000000038, RFLAGS - 0000000000000002 +RAX - 0000000000000010, RCX - 00000000770D5B58, RDX - 00000000000002F8 +RBX - 0000000000000000, RSP - 0000000077773278, RBP - 0000000000000000 +RSI - 0000000000000000, RDI - 00000000777733E0 +R8 - 00000000777731F8, R9 - 0000000000000000, R10 - 0000000000000000 +R11 - 00000000000000A0, R12 - 0000000000000000, R13 - 0000000000000000 +R14 - FFFFFFFFAC41A118, R15 - 000000000005B000 +DS - 0000000000000020, ES - 0000000000000020, FS - 0000000000000020 +GS - 0000000000000020, SS - 0000000000000020 +CR0 - 0000000080010033, CR2 - 00007F67F5268030, CR3 - 0000000077749000 +CR4 - 0000000000001668, CR8 - 0000000000000001 +DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 +DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 +GDTR - 000000007773C000 000000000000004F, LDTR - 0000000000000000 +IDTR - 0000000077761000 00000000000001FF, TR - 0000000000000040 +FXSAVE_STATE - 0000000077772ED0 +!!!! Find image based on IP(0x76DA4343) u:\Build_Genoa\DellBrazosPkg\DEBUG_MYTOOLS\X64\DellPkgs\DellChipsetPkgs\AmdGenoaModulePkg\Override\AmdCpmPkg\Features\PlatformRas\Rs\Smm\AmdPlatformRasRsSmm\DEBUG\AmdPlatformRasRsSmm.pdb (ImageBase=0000000076D3E000, EntryPoint=0000000076D3E6C0) !!!! +``` + +and the racadm system log says + +``` +Record: 362 +Date/Time: 09/11/2025 21:47:02 +Source: system +Severity: Critical +Description: A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning. +``` + +I have seen that before and it is definitely a hardware/driver issue on the Dell itself. I'll work on that later. Luckily it always reboots. diff --git a/issues/systems/tux04-production.gmi b/issues/systems/tux04-production.gmi new file mode 100644 index 0000000..58ff8c1 --- /dev/null +++ b/issues/systems/tux04-production.gmi @@ -0,0 +1,279 @@ +# Production on tux04 + +Lately we have been running production on tux04. Unfortunately Debian got broken and I don't see a way to fix it (something with python versions that break apt!). Also mariadb is giving problems: + +=> issues/production-container-mechanical-rob-failure.gmi + +and that is alarming. We might as well try an upgrade. I created a new partition on /dev/sda4 using debootstrap. + +The hardware RAID has proven unreliable on this machine (and perhaps others). + +We added a drive on a PCIe riser outside the RAID. Use this for bulk data copying. We still bootstrap from the RAID. + +Luckily not too much is running on this machine, and if we mount things again, most should work.
+ +# Tasks + +* [X] cleanly shut down mariadb +* [X] reboot into new partition /dev/sda4 +* [X] git in /etc +* [X] make sure serial boot works (/etc/default/grub) +* [X] fix groups and users +* [X] get guix going +* [X] get mariadb going +* [X] fire up GN2 service +* [X] fire up SPARQL service +* [X] sheepdog +* [ ] fix CRON jobs and backups +* [ ] test full reboots + + +# Boot in new partition + +``` +blkid /dev/sda4 +/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811" +``` + +After debootstrap there are two things to take care of: the /dev directory and grub. For good measure +I also capture some state + +``` +cd ~ +ps xau > cron.log +systemctl > systemctl.txt +cp /etc/network/interfaces . +cp /boot/grub/grub.cfg . +``` + +we should still have access to the old root partition, so I don't need to capture everything. + +## /dev + +I ran MAKEDEV and that may not be needed with udev. + +## grub + +We need to tell grub to boot into the new partition. The old root is on +UUID=8e874576-a167-4fa1-948f-2031e8c3809f /dev/sda2. + +Next I ran + +``` +tux04:~$ update-grub2 /dev/sda +Generating grub configuration file ... +Found linux image: /boot/vmlinuz-5.10.0-32-amd64 +Found initrd image: /boot/initrd.img-5.10.0-32-amd64 +Found linux image: /boot/vmlinuz-5.10.0-22-amd64 +Found initrd image: /boot/initrd.img-5.10.0-22-amd64 +Warning: os-prober will be executed to detect other bootable partitions. +Its output will be used to detect bootable binaries on them and create new boot entries. +Found Debian GNU/Linux 12 (bookworm) on /dev/sda4 +Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi +Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2 +``` + +Very good. Do a diff on grub.cfg and you see it even picked up the serial configuration. It only shows it added menu entries for the new boot. Very nice. + +At this point I feel safe to boot as we should be able to get back into the old partition. + +# /etc/fstab + +The old fstab looked like + +``` +UUID=8e874576-a167-4fa1-948f-2031e8c3809f / ext4 errors=remount-ro 0 1 +# /boot/efi was on /dev/sdc1 during installation +UUID=998E-68AF /boot/efi vfat umask=0077 0 1 +# swap was on /dev/sdc3 during installation +UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none swap sw 0 0 +UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8" /export2 ext4 auto 0 2 +UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f" /export ext4 auto 0 2 +UUID="f006dd4a-2365-454d-a3a2-9a42518d6286" /export3 auto auto 0 2 +/export2/gnu /gnu none defaults,bind 0 0 +# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e" +# /dev/sdb4 /export ext4 auto 0 2 +# /dev/sdd1 /export2 ext4 auto 0 2 +``` + +# reboot + +Next we are going to reboot, and we need a serial connector to the Dell out-of-band using racadm: + +``` +ssh IP +console com2 +racadm getsel +racadm serveraction powercycle +racadm serveraction powerstatus + +``` + +The main trick is to hit ESC, wait 2 sec, and press 2 when you want the BIOS boot menu. Ctrl-\ to escape the console. Otherwise ESC (wait) ! to get to the boot menu. + +# First boot + +It still boots by default into the old root. That gave an error: + +[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286 + +This is /export3. We can fix that later. + +When I booted into the proper partition the console clapped out. Also the racadm password did not work on tmux -- I had to switch to a standard console to log in again.
Not sure why that happened, but next I got:

```
Give root password for maintenance
(or press Control-D to continue):
```

and after giving the root password I was in maintenance mode on the correct partition!

To rerun update-grub I had to add `GRUB_DISABLE_OS_PROBER=false`.

Once booted up, it was a matter of mounting partitions and ticking off the check boxes above.

The following file system contained errors:

```
/dev/sdd1 3.6T 1.8T 1.7T 52% /export2
```

# Guix

Getting guix going is a bit tricky because we want to keep the store!

```
cp -vau /mnt/old-root/var/guix/ /var/
cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/
cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/
cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/
cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/
```

I also had to add the guixbuild users and group by hand.

# nginx

We use the streaming facility. Check that

```
nginx -V
```

lists --with-stream=static, see

=> https://serverfault.com/questions/858067/unknown-directive-stream-in-etc-nginx-nginx-conf86/858074#858074

and load the module at the start of nginx.conf:

```
load_module /usr/lib/nginx/modules/ngx_stream_module.so;
```

and check that

```
nginx -t
```

passes.

Now the container responds to the browser with `Internal Server Error`.

# Container web server

Visit the container with something like

```
nsenter -at 2838 /run/current-system/profile/bin/bash --login
```

The nginx log in the container has many

```
2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org"
```

That is interesting. ACME/https is working, because GN2 is working:

```
curl https://genenetwork.org/api3/version
"1.0"
```

Looking at the logs, the first problem for GN2 appears to be redis.

Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. Machines are defined in

```
fredm@tux04:/export3/local/home/fredm/gn-machines
```

The shared dir for redis is at

--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis

with

```
root@genenetwork-production /var# ls lib/redis/ -l
-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb
```

In production.scm it is defined as

```
(service redis-service-type
         (redis-configuration
          (bind "127.0.0.1")
          (port 6379)
          (working-directory "/var/lib/redis")))
```

These are the same as the defaults in the definition of redis-service-type (in guix); not sure why we duplicate them.

After starting redis by hand I got another error: `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is that it created a DB in the wrong place.
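A quick way to confirm a malformed sqlite image like this one (the location of the auth database inside the container is an assumption):

```
# sketch -- the auth DB path is an assumption; prints "ok" on a healthy DB
sqlite3 /var/lib/auth/auth.db 'PRAGMA integrity_check;'
```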
Alright, the logs in the container say:

```
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting.
Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1.
```

This "Wrong signature" error is typically caused by a dump written by a newer version of redis, which would be odd here because we run the same redis version from the container?!

Actually it turned out the redis DB was corrupted on the SSD! Same for some other databases (ugh).

Fred copied all data to enterprise-level storage, and we rolled back to some older DBs, so hopefully we'll be OK for now.

# Reinstating backups

In the next step we need to restore backups as described in

=> /topics/systems/backups-with-borg

I already created an ibackup user. Next we test the backup script for mariadb.

One important step is to check the database:

```
/usr/bin/mariadb-check -c -u user -p* db_webqtl
```

A successful mariadb backup consists of multiple steps:

```
2025-02-27 11:48:28 +0000 (ibackup@tux04) SUCCESS 0 <32m43s> mariabackup-dump
2025-02-27 11:48:29 +0000 (ibackup@tux04) SUCCESS 0 <00m00s> mariabackup-make-consistent
2025-02-27 12:16:37 +0000 (ibackup@tux04) SUCCESS 0 <28m08s> borg-tux04-sql-backup
2025-02-27 12:16:46 +0000 (ibackup@tux04) SUCCESS 0 <00m07s> drop-rsync-balg01
```
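The first two steps plausibly map onto mariabackup's two phases; a minimal sketch, assuming a local target directory (the real invocation lives in the backup scripts referenced above):

```
# sketch of the two mariabackup phases named in the log; target dir is an assumption
mariabackup --backup  --target-dir=/export/backup/mariadb --user=ibackup --password='...'
mariabackup --prepare --target-dir=/export/backup/mariadb   # the "make-consistent" step
```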
