Diffstat (limited to 'issues')
-rw-r--r-- | issues/systems/tux04-production.gmi | 247 |
1 files changed, 247 insertions, 0 deletions
diff --git a/issues/systems/tux04-production.gmi b/issues/systems/tux04-production.gmi
new file mode 100644
index 0000000..61804fa
--- /dev/null
+++ b/issues/systems/tux04-production.gmi
@@ -0,0 +1,247 @@
+# Production on tux04
+
+Lately we have been running production on tux04. Unfortunately Debian got broken and I don't see a way to fix it (something with python versions that break apt!). Also mariadb is giving problems:
+
+=> issues/production-container-mechanical-rob-failure.gmi
+
+and that is alarming. We might as well try an upgrade. I created a new partition on /dev/sda4 using debootstrap.
+
+Luckily not too much is running on this machine and if we mount things again, most should work.
+
+# Tasks
+
+* [X] cleanly shut down mariadb
+* [X] reboot into new partition /dev/sda4
+* [X] git in /etc
+* [X] make sure serial boot works (/etc/default/grub)
+* [X] fix groups and users
+* [X] get guix going
+* [X] get mariadb going
+* [ ] fire up GN2 service
+* [ ] fire up SPARQL service
+* [ ] fix CRON jobs and backups
+* [ ] test full reboots
+
+
+# Boot in new partition
+
+```
+blkid /dev/sda4
+/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811"
+```
+
+After debootstrap there are two things to take care of: the /dev directory and grub. For good measure
+I also capture some state
+
+```
+cd ~
+ps xau > cron.log
+systemctl > systemctl.txt
+cp /etc/network/interfaces .
+cp /boot/grub/grub.cfg .
+```
+
+we should still have access to the old root partition, so I don't need to capture everything.
+
+## /dev
+
+I ran MAKEDEV and that may not be needed with udev.
+
+## grub
+
+We need to tell grub to boot into the new partition. The old root is on
+UUID=8e874576-a167-4fa1-948f-2031e8c3809f /dev/sda2.
+
+Next I ran
+
+```
+tux04:~$ update-grub2 /dev/sda
+Generating grub configuration file ...
+Found linux image: /boot/vmlinuz-5.10.0-32-amd64
+Found initrd image: /boot/initrd.img-5.10.0-32-amd64
+Found linux image: /boot/vmlinuz-5.10.0-22-amd64
+Found initrd image: /boot/initrd.img-5.10.0-22-amd64
+Warning: os-prober will be executed to detect other bootable partitions.
+Its output will be used to detect bootable binaries on them and create new boot entries.
+Found Debian GNU/Linux 12 (bookworm) on /dev/sda4
+Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi
+Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2
+```
+
+Very good. Do a diff on grub.cfg and you see it even picked up the serial configuration. It only shows it added menu entries for the new boot. Very nice.
+
+At this point I feel safe to boot as we should be able to get back into the old partition.
+
+# /etc/fstab
+
+The old fstab looked like
+
+```
+UUID=8e874576-a167-4fa1-948f-2031e8c3809f / ext4 errors=remount-ro 0 1
+# /boot/efi was on /dev/sdc1 during installation
+UUID=998E-68AF /boot/efi vfat umask=0077 0 1
+# swap was on /dev/sdc3 during installation
+UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none swap sw 0 0
+UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8" /export2 ext4 auto 0 2
+UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f" /export ext4 auto 0 2
+UUID="f006dd4a-2365-454d-a3a2-9a42518d6286" /export3 auto auto 0 2
+/export2/gnu /gnu none defaults,bind 0 0
+# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e"
+# /dev/sdb4 /export ext4 auto 0 2
+# /dev/sdd1 /export2 ext4 auto 0 2
+```
+
+# reboot
+
+Next we are going to reboot, and we need a serial connector to the Dell out-of-band using racadm:
+
+```
+ssh IP
+console com2
+racadm getsel
+racadm serveraction powercycle
+racadm serveraction powerstatus
+
+```
+
+The main trick is to hit ESC, wait 2 sec, then hit 2 when you want the bios boot menu. Ctrl-\ to escape console. Otherwise ESC (wait) ! to get to the boot menu.
+
+# First boot
+
+It still boots by default into the old root. That gave an error:
+
+[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286
+
+This is /export3. We can fix that later.
+
+When I booted into the proper partition the console clapped out. Also the racadm password did not work on tmux -- I had to switch to a standard console to log in again. Not sure why that is, but next I got:
+
+```
+Give root password for maintenance
+(or press Control-D to continue):
+```
+
+and giving the root password I was in maintenance mode on the correct partition!
+
+To rerun grub I had to add `GRUB_DISABLE_OS_PROBER=false`.
+
+Once booted up it is a matter of mounting partitions and ticking the check boxes above.
+
+The following contained errors:
+
+```
+/dev/sdd1 3.6T 1.8T 1.7T 52% /export2
+```
+
+# Guix
+
+Getting guix going is a bit tricky because we want to keep the store!
+
+```
+cp -vau /mnt/old-root/var/guix/ /var/
+cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/
+cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/
+cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/
+cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/
+```
+
+Also had to add guixbuild users and group by hand.
+
+# nginx
+
+We use the streaming facility. Check that
+
+```
+nginx -V
+```
+
+lists --with-stream=static, see
+
+=> https://serverfault.com/questions/858067/unknown-directive-stream-in-etc-nginx-nginx-conf86/858074#858074
+
+and load at the start of nginx.conf:
+
+```
+load_module /usr/lib/nginx/modules/ngx_stream_module.so;
+```
+
+and
+
+```
+nginx -t
+```
+
+passes
+
+Now the container responds to the browser with `Internal Server Error`.
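The `nginx -V` check from the nginx section can be scripted. What follows is a minimal sketch and not part of the commit: `stream_support` is a hypothetical helper name, and it reads the configure arguments on stdin so it can be tried without nginx installed. As far as I know a plain `--with-stream` means the module is compiled in, while `--with-stream=dynamic` is the case that needs the `load_module` line.

```shell
# Classify stream support from the configure arguments printed by `nginx -V`.
# stream_support is a hypothetical helper, not an nginx command.
stream_support() {
    args=$(cat)
    # Match the flag as a whole word so that e.g. --with-stream_ssl_module
    # alone is not mistaken for --with-stream.
    if printf '%s\n' "$args" | grep -Eq -e '--with-stream=dynamic( |$)'; then
        echo "dynamic"   # module is a separate .so, load_module is needed
    elif printf '%s\n' "$args" | grep -Eq -e '--with-stream( |=|$)'; then
        echo "builtin"   # module is compiled into the nginx binary
    else
        echo "missing"   # the stream directive will be unknown
    fi
}

# Usage on a real host (nginx -V writes to stderr, hence 2>&1):
#   nginx -V 2>&1 | stream_support
```

If this prints `missing`, the `stream` block in nginx.conf cannot work with that binary and `nginx -t` is expected to fail.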
+
+# container web server
+
+Visit the container with something like
+
+```
+nsenter -at 2838 /run/current-system/profile/bin/bash --login
+```
+
+The nginx log in the container has many
+
+```
+2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org"
+```
+
+that is interesting. Acme/https is working because GN2 is working:
+
+```
+curl https://genenetwork.org/api3/version
+"1.0"
+```
+
+Looking at the logs it appears it is a redis problem first for GN2.
+
+Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. Machines are defined in
+
+```
+fredm@tux04:/export3/local/home/fredm/gn-machines
+```
+
+The shared dir for redis is at
+
+--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis
+
+with
+
+```
+root@genenetwork-production /var# ls lib/redis/ -l
+-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb
+```
+
+In production.scm it is defined as
+
+```
+(service redis-service-type
+         (redis-configuration
+          (bind "127.0.0.1")
+          (port 6379)
+          (working-directory "/var/lib/redis")))
+```
+
+The defaults are the same as the definition of redis-service-type (in guix). Not sure why we are duplicating.
+
+After starting redis by hand I get another error `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is it created
+a DB in the wrong place. Alright, the logs in the container say:
+
+```
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024).
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379.
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file
+Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting.
+Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1.
+```
+
+This is caused by a newer version of redis. This is odd because we are using the same version from the container?!
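One way to narrow down the `Wrong signature` message is to look at the first bytes of dump.rdb. This is a minimal sketch and not part of the commit: `rdb_header` is a hypothetical helper name. It relies on my understanding that an RDB dump starts with the five ASCII bytes `REDIS` followed by a four-digit format version, and that redis logs `Wrong signature` when those five bytes do not match, so treat the interpretation as an assumption to verify.

```shell
# Print the magic string and format version found at the start of an RDB file.
# rdb_header is a hypothetical helper, not a redis tool.
rdb_header() {
    # Take the first 9 bytes and drop anything non-printable, so a corrupt
    # file cannot garble the terminal.
    header=$(head -c 9 "$1" | LC_ALL=C tr -cd '[:print:]')
    magic=$(printf '%s' "$header" | cut -c1-5)
    version=$(printf '%s' "$header" | cut -c6-9)
    if [ "$magic" = "REDIS" ]; then
        echo "magic ok, rdb version $version"
    else
        echo "bad magic: $magic"
    fi
}

# Usage on the host side of the shared dir listed above:
#   rdb_header /export2/guix-containers/genenetwork/var/lib/redis/dump.rdb
```

If the magic is fine, the version number can be compared with what the redis 7.0.12 in the container is able to read. If the magic is wrong, the file itself may be damaged or written by something else, which would be a different problem than a version mismatch.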