summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorPjotr Prins2025-02-23 09:25:20 +0100
committerPjotr Prins2025-02-24 09:40:32 +0100
commit7cf95378517f7c5d85ffb2b1714cc3a603f1e41f (patch)
treee8d9e87168b9431dac48eea17174f7635d70dcf5
parent76948cc90467f74d84209186cc5582ce9dc2fdb1 (diff)
downloadgn-gemtext-7cf95378517f7c5d85ffb2b1714cc3a603f1e41f.tar.gz
Tasks
-rw-r--r--issues/systems/tux04-production.gmi247
1 files changed, 247 insertions, 0 deletions
diff --git a/issues/systems/tux04-production.gmi b/issues/systems/tux04-production.gmi
new file mode 100644
index 0000000..61804fa
--- /dev/null
+++ b/issues/systems/tux04-production.gmi
@@ -0,0 +1,247 @@
+# Production on tux04
+
+Lately we have been running production on tux04. Unfortunately Debian got broken and I don't see a way to fix it (something with python versions that break apt!). Also mariadb is giving problems:
+
+=> issues/production-container-mechanical-rob-failure.gmi
+
+and that is alarming. We might as well try an upgrade. I created a new partition on /dev/sda4 using debootstrap.
+
+Luckily not too much is running on this machine and if we mount things again, most should work.
+
+# Tasks
+
+* [X] cleanly shut down mariadb
+* [X] reboot into new partition /dev/sda4
+* [X] git in /etc
+* [X] make sure serial boot works (/etc/default/grub)
+* [X] fix groups and users
+* [X] get guix going
+* [X] get mariadb going
+* [ ] fire up GN2 service
+* [ ] fire up SPARQL service
+* [ ] fix CRON jobs and backups
+* [ ] test full reboots
+
+
+# Boot in new partition
+
+```
+blkid /dev/sda4
+/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811"
+```
+
+After debootstrap there are two things to take care of: the /dev directory and grub. For good measure
+I also capture some state
+
+```
+cd ~
+ps xau > cron.log
+systemctl > systemctl.txt
+cp /etc/network/interfaces .
+cp /boot/grub/grub.cfg .
+```
+
+we should still have access to the old root partition, so I don't need to capture everything.
+
+## /dev
+
+I ran MAKEDEV and that may not be needed with udev.
+
+## grub
+
+We need to tell grub to boot into the new partition. The old root is on
+UUID=8e874576-a167-4fa1-948f-2031e8c3809f /dev/sda2.
+
+Next I ran
+
+```
+tux04:~$ update-grub2 /dev/sda
+Generating grub configuration file ...
+Found linux image: /boot/vmlinuz-5.10.0-32-amd64
+Found initrd image: /boot/initrd.img-5.10.0-32-amd64
+Found linux image: /boot/vmlinuz-5.10.0-22-amd64
+Found initrd image: /boot/initrd.img-5.10.0-22-amd64
+Warning: os-prober will be executed to detect other bootable partitions.
+Its output will be used to detect bootable binaries on them and create new boot entries.
+Found Debian GNU/Linux 12 (bookworm) on /dev/sda4
+Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi
+Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2
+```
+
+Very good. Do a diff on grub.cfg and you see it even picked up the serial configuration. It only shows it added menu entries for the new boot. Very nice.
+
+At this point I feel safe to boot as we should be able to get back into the old partition.
+
+# /etc/fstab
+
+The old fstab looked like
+
+```
+UUID=8e874576-a167-4fa1-948f-2031e8c3809f / ext4 errors=remount-ro 0 1
+# /boot/efi was on /dev/sdc1 during installation
+UUID=998E-68AF /boot/efi vfat umask=0077 0 1
+# swap was on /dev/sdc3 during installation
+UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none swap sw 0 0
+UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8" /export2 ext4 auto 0 2
+UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f" /export ext4 auto 0 2
+UUID="f006dd4a-2365-454d-a3a2-9a42518d6286" /export3 auto auto 0 2
+/export2/gnu /gnu none defaults,bind 0 0
+# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e"
+# /dev/sdb4 /export ext4 auto 0 2
+# /dev/sdd1 /export2 ext4 auto 0 2
+```
+
+# reboot
+
+Next we are going to reboot, and we need a serial connector to the Dell out-of-band using racadm:
+
+```
+ssh IP
+console com2
+racadm getsel
+racadm serveraction powercycle
+racadm serveraction powerstatus
+
+```
+
+Main trick it so hit ESC, wait 2 sec and 2 when you want the bios boot menu. Ctrl-\ to escape console. Otherwise ESC (wait) ! to get to the boot menu.
+
+# First boot
+
+It still boots by default into the old root. That gave an error:
+
+[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286
+
+This is /export3. We can fix that later.
+
+When I booted into the proper partition the console clapped out. Also the racadm password did not work on tmux -- I had to switch to a standard console to log in again. Not sure why that is, but next I got:
+
+```
+Give root password for maintenance
+(or press Control-D to continue):
+```
+
+and giving the root password I was in maintenance mode on the correct partition!
+
+To rerun grup I had to add `GRUB_DISABLE_OS_PROBER=false`.
+
+Once booting up it is a matter of mounting partitions and tick the check boxes above.
+
+The following contained errors:
+
+```
+/dev/sdd1 3.6T 1.8T 1.7T 52% /export2
+```
+
+# Guix
+
+Getting guix going is a bit tricky because we want to keep the store!
+
+```
+cp -vau /mnt/old-root/var/guix/ /var/
+cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/
+cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/
+cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/
+cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/
+```
+
+Also had to add guixbuild users and group by hand.
+
+# nginx
+
+We use the streaming facility. Check that
+
+```
+nginx -V
+```
+
+lists --with-stream=static, see
+
+=> https://serverfault.com/questions/858067/unknown-directive-stream-in-etc-nginx-nginx-conf86/858074#858074
+
+and load at the start of nginx.conf:
+
+```
+load_module /usr/lib/nginx/modules/ngx_stream_module.so;
+```
+
+and
+
+```
+nginx -t
+```
+
+passes
+
+Now the container responds to the browser with `Internal Server Error`.
+
+# container web server
+
+Visit the container with something like
+
+```
+nsenter -at 2838 /run/current-system/profile/bin/bash --login
+```
+
+The nginx log in the container has many
+
+```
+2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org"
+```
+
+that is interesting. Acme/https is working because GN2 is working:
+
+```
+curl https://genenetwork.org/api3/version
+"1.0"
+```
+
+Looking at the logs it appears it is a redis problem first for GN2.
+
+Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. Machines are defined in
+
+```
+fredm@tux04:/export3/local/home/fredm/gn-machines
+```
+
+The shared dir for redis is at
+
+--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis
+
+with
+
+```
+root@genenetwork-production /var# ls lib/redis/ -l
+-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb
+```
+
+In production.scm it is defined as
+
+```
+(service redis-service-type
+ (redis-configuration
+ (bind "127.0.0.1")
+ (port 6379)
+ (working-directory "/var/lib/redis")))
+```
+
+The defaults are the same as the definition of redis-service-type (in guix). Not sure why we are duplicating.
+
+After starting redis by hand I get another error `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is it created
+a DB in the wrong place. Alright, the logs in the container say:
+
+```
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024).
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379.
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting.
+Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1.
+```
+
+This is caused by a newer version of redis. This is odd because we are using the same version from the container?!