summary refs log tree commit diff
diff options
context:
space:
mode:
-rw-r--r--issues/CI-CD/cd-is-slow.gmi172
1 files changed, 172 insertions, 0 deletions
diff --git a/issues/CI-CD/cd-is-slow.gmi b/issues/CI-CD/cd-is-slow.gmi
new file mode 100644
index 0000000..d888e82
--- /dev/null
+++ b/issues/CI-CD/cd-is-slow.gmi
@@ -0,0 +1,172 @@
+# CD is slow
+
+The pages are slow and some are broken.
+
+# Tags
+
+* type: bug
+* status: open
+* priority: high
+* assigned: pjotrp
+* interested: pjotrp, bonfacem
+* keywords: deployment, server
+
+# Measuring
+
+bonfacekilz:
+I'm currently instrumenting the requests.  See what hogs up time.  Loading the landing page takes up 32 seconds!
+
+Something's off. From outside the container:
+
+```
+123bonfacem@tux02 ~ $ guix shell python-wrapper python-requests -- python time.py
+Status: 200
+Time taken: 32.989222288131714 seconds
+```
+
+From inside the container:
+
+```
+12025-07-18 14:46:36 INFO:gn2.wqflask:Landing page rendered in 8.12 seconds
+```
+
+And I see:
+
+## CD
+
+```
+> curl -w @- -o /dev/null -s https://cd.genenetwork.org <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  8.117543s
+Connect time: 8.117757s
+TLS handshake: 8.197767s
+Pre-transfer: 8.197861s
+Start transfer: 33.096467s
+Total time:   33.096601s
+```
+
+## Production
+```
+> curl -w @- -o /dev/null -s https://genenetwork.org <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  8.075794s
+Connect time: 8.076402s
+TLS handshake: 8.147322s
+Pre-transfer: 8.147370s
+Start transfer: 8.797107s
+Total time:   8.797299s
+```
+
+## On tux02 (outside CD container)
+
+```
+> curl -w @- -o /dev/null -s http://localhost:9092 <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  0.000068s
+Connect time: 0.000543s
+TLS handshake: 0.000000s
+Pre-transfer: 0.000606s
+Start transfer: 24.851069s
+Total time:   24.851166s
+```
+
+This does not look like an nginx problem (at least on tux02 itself). Also the nginx configuration was not really changed.
+The mysql configuration ditto. I can still test both, but it looks like the problem is inside the system container.
+
+The container logs are at
+
+```
+root@tux02:/export2/guix-containers/genenetwork-development/var/log/cd# tail -100 genenetwork2.log
+```
+
+Some interesting errors there that need resolving, such as
+
+```
+tail gn-guile.log
+2025-07-20 04:49:49 X.509 certificate of 'sparql.genenetwork.org' could not be verified:
+2025-07-20 04:49:49   signer-not-found invalid
+```
+
+Let's try to get inside the container with nsenter:
+
+```
+ps xau|grep genenetwork-development-container
+root     115940  0.0  0.0 163692 26296 ?        Ssl  Jul18   0:00 /gnu/store/ylwk2vn18dkzkj0nxq2h4vjzhz17bm7c-guile-3.0.9/bin/guile --no-auto-compile /usr/local/bin/genenetwork-development-container
+pgrep -P 115940
+115961
+```
+
+Use this PID and a recent nsenter:
+
+```
+/gnu/store/w7a3frdmffpw3hvxpvvxwxgzfhyqdm6n-profile/bin/nsenter -m -p -t 115961 /run/current-system/profile/bin/bash -login
+```
+
+System tools are in '/run/current-system/profile/bin/'
+
+We can pick up curl (I note the system container has full access to the /gnu/store on the host:
+
+```
+root@tux02 /# /gnu/store/vdaspmq10c3zmqhp38lfqy812w6r4xg3-curl-8.6.0/bin/curl  -w @- -o /dev/null -s http://localhost:9092 <<EOF
+\n
+DNS lookup:  %{time_namelookup}s\n
+Connect time: %{time_connect}s\n
+TLS handshake: %{time_appconnect}s\n
+Pre-transfer: %{time_pretransfer}s\n
+Start transfer: %{time_starttransfer}s\n
+Total time:   %{time_total}s\n
+EOF
+
+DNS lookup:  0.000064s
+Connect time: 0.000478s
+TLS handshake: 0.000000s
+Pre-transfer: 0.000551s
+Start transfer: 24.792926s
+Total time:   24.793015s
+```
+
+That rules out container and nginx streaming issues.
+
+So the problem is with GN and its DBs. The gn-machines is used from /home/aruni and it checkout is March. Has CD been slow since then? I don't think so. Also the changes to the actual scripts are even older. Also the guix-bioinformatics repo shows no changes. Remaining culprits I suspect are:
+
+* [*] MySQL
+* [ ] Interaction gn-auth with gn2
+* [ ] Interaction gnqa with gn2
+
+Running a standard test on mysql shows it is fine:
+
+```
+time mysql -u webqtlout -pwebqtlout db_webqtl < $rundir/../shared/sql/test02.sql
+Name    FullName        Name    Symbol  CAST(ProbeSet."description" AS BINARY)  CAST(ProbeSet."Probe_Target_Description" AS BINARY)    Chr     Mb      Mean    LRS     Locus   pValue  additive        geno_chr        geno_mb
+HC_M2_0606_P    Hippocampus Consortium M430v2 (Jun06) PDNN      1457545_at      9530036O11Rik   long non-coding RNA, expressed sequence tag (EST) AK035474 with high bladder expression        antisense EST 14 Kb upstream of Shh     5     28.480441        6.7419292929293 15.2845189682605        rsm10000001525  0.055   0.0434848484848485      3       9.671673
+HC_M2_0606_P    Hippocampus Consortium M430v2 (Jun06) PDNN      1427571_at      Shh     sonic hedgehog (hedgehog)     last exon        5       28.457886       6.50113131313131        9.58158655605723        rs8253327       0.697   0.0494097096188748     1       191.908118
+HC_M2_0606_P    Hippocampus Consortium M430v2 (Jun06) PDNN      1436869_at      Shh     sonic hedgehog (hedgehog)     mid distal 3' UTR        5       28.457155       9.279090909090911       12.7711275309832        rs8253327       0.306 -0.214087568058076       1       191.908118
+
+real    0m0.010s
+user    0m0.004s
+sys     0m0.000s
+```