# Production on tux04

Lately we have been running production on tux04. Unfortunately Debian got broken and I don't see a way to fix it (something with python versions that break apt!). Also mariadb is giving problems:

=> issues/production-container-mechanical-rob-failure.gmi

and that is alarming. We might as well try an upgrade. I created a new partition on /dev/sda4 using debootstrap.

Luckily not too much is running on this machine and if we mount things again, most should work.

# Tasks

* [X] cleanly shut down mariadb
* [X] reboot into new partition /dev/sda4
* [X] git in /etc
* [X] make sure serial boot works (/etc/default/grub)
* [X] fix groups and users
* [X] get guix going
* [X] get mariadb going
* [ ] fire up GN2 service
* [ ] fire up SPARQL service
* [ ] fix CRON jobs and backups
* [ ] test full reboots

# Boot in new partition

```
blkid /dev/sda4
/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811"
```

After debootstrap there are two things to take care of: the /dev directory and grub. For good measure I also captured some state:

```
cd ~
ps xau > cron.log
systemctl > systemctl.txt
cp /etc/network/interfaces .
cp /boot/grub/grub.cfg .
```

We should still have access to the old root partition, so I don't need to capture everything.

## /dev

I ran MAKEDEV, though that may not be needed with udev.

## grub

We need to tell grub to boot into the new partition. The old root is on UUID=8e874576-a167-4fa1-948f-2031e8c3809f /dev/sda2.

Next I ran

```
tux04:~$ update-grub2 /dev/sda
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.10.0-32-amd64
Found initrd image: /boot/initrd.img-5.10.0-32-amd64
Found linux image: /boot/vmlinuz-5.10.0-22-amd64
Found initrd image: /boot/initrd.img-5.10.0-22-amd64
Warning: os-prober will be executed to detect other bootable partitions.
Its output will be used to detect bootable binaries on them and create new boot entries.
Found Debian GNU/Linux 12 (bookworm) on /dev/sda4
Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi
Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2
```

Very good. Do a diff on grub.cfg and you see it even picked up the serial configuration. The diff only shows added menu entries for the new boot. Very nice.

At this point I feel safe to boot as we should be able to get back into the old partition.

# /etc/fstab

The old fstab looked like

```
UUID=8e874576-a167-4fa1-948f-2031e8c3809f / ext4 errors=remount-ro 0 1
# /boot/efi was on /dev/sdc1 during installation
UUID=998E-68AF /boot/efi vfat umask=0077 0 1
# swap was on /dev/sdc3 during installation
UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none swap sw 0 0
UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8" /export2 ext4 auto 0 2
UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f" /export ext4 auto 0 2
UUID="f006dd4a-2365-454d-a3a2-9a42518d6286" /export3 auto auto 0 2
/export2/gnu /gnu none defaults,bind 0 0
# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e"
# /dev/sdb4 /export ext4 auto 0 2
# /dev/sdd1 /export2 ext4 auto 0 2
```

# reboot

Next we are going to reboot, and we need a serial connection to the Dell out-of-band using racadm:

```
ssh IP
console com2
racadm getsel
racadm serveraction powercycle
racadm serveraction powerstatus
```

The main trick is to hit ESC, wait 2 sec and then hit 2 when you want the BIOS boot menu. Ctrl-\ escapes the console. Otherwise ESC (wait) ! to get to the boot menu.

# First boot

It still boots by default into the old root. That gave an error:

[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286

This is /export3. We can fix that later.

When I booted into the proper partition the console clapped out. Also the racadm password did not work on tmux -- I had to switch to a standard console to log in again.
Not sure why that is, but next I got:

```
Give root password for maintenance
(or press Control-D to continue):
```

and after giving the root password I was in maintenance mode on the correct partition!

To rerun grub I had to add `GRUB_DISABLE_OS_PROBER=false`.

Once booted, it is a matter of mounting partitions and ticking the check boxes above. The following contained errors:

```
/dev/sdd1 3.6T 1.8T 1.7T 52% /export2
```

# Guix

Getting guix going is a bit tricky because we want to keep the store!

```
cp -vau /mnt/old-root/var/guix/ /var/
cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/
cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/
cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/
cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/
```

I also had to add the guixbuild users and group by hand.

# nginx

We use the streaming facility. Check that

```
nginx -V
```

lists --with-stream=static, see

=> https://serverfault.com/questions/858067/unknown-directive-stream-in-etc-nginx-nginx-conf86/858074#858074

and load the module at the start of nginx.conf:

```
load_module /usr/lib/nginx/modules/ngx_stream_module.so;
```

and check that

```
nginx -t
```

passes.

Now the container responds to the browser with `Internal Server Error`.

# container web server

Visit the container with something like

```
nsenter -at 2838 /run/current-system/profile/bin/bash --login
```

The nginx log in the container has many entries like

```
2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org"
```

That is interesting. Acme/https is working because GN2 is working:

```
curl https://genenetwork.org/api3/version
"1.0"
```

Looking at the logs it appears it is a redis problem first for GN2.
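To see which upstream the container's nginx fails to reach, the refused connections in the error log can be tallied per host:port. This is a minimal sketch that reads the log on stdin, since the log path inside the container is not recorded here. The helper name `refused_upstreams` is made up for illustration.

```shell
# Count "Connection refused" errors per upstream host:port.
# Reads an nginx error log on stdin; refused_upstreams is a hypothetical helper name.
refused_upstreams() {
    grep 'Connection refused' |
        # keep only the host:port part of the upstream URL
        sed -n 's|.*upstream: "[a-z]*://\([^/"]*\).*|\1|p' |
        sort | uniq -c | sort -rn
}
```

Piping the container's nginx error log through `refused_upstreams` should list 127.0.0.1:9800 for entries like the one above, which points at whatever is meant to listen behind /gn3/ on that port.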
Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. Machines are defined in

```
fredm@tux04:/export3/local/home/fredm/gn-machines
```

The shared dir for redis is at `--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis` with

```
root@genenetwork-production /var# ls lib/redis/ -l
-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb
```

In production.scm it is defined as

```
(service redis-service-type
         (redis-configuration
          (bind "127.0.0.1")
          (port 6379)
          (working-directory "/var/lib/redis")))
```

The defaults are the same as the definition of redis-service-type (in guix). Not sure why we are duplicating.

After starting redis by hand I get another error `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is it created a DB in the wrong place. Alright, the logs in the container say:

```
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting.
Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1.
```

This is caused by a newer version of redis. This is odd because we are using the same version from the container?!
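One way to narrow this down is to look at the header of dump.rdb itself. As far as I know, an RDB file starts with the ASCII magic `REDIS` followed by a four-digit format version, and the `Wrong signature` message refers to that magic. The following is a minimal sketch under that assumption, and the helper name `rdb_header_check` is made up for illustration.

```shell
# Inspect the first 9 bytes of an RDB dump.
# Assumption: a valid dump starts with "REDIS" plus a four-digit format version.
# rdb_header_check is a hypothetical helper name.
rdb_header_check() {
    rdb_file="$1"
    # strip NUL bytes so the shell can hold the header in a variable
    rdb_header=$(head -c 9 "$rdb_file" | tr -d '\0')
    case "$rdb_header" in
        REDIS[0-9][0-9][0-9][0-9])
            echo "ok: RDB format version ${rdb_header#REDIS}" ;;
        *)
            echo "bad: header does not start with REDIS plus four digits"
            return 1 ;;
    esac
}
```

On the host this could be run as `rdb_header_check /export2/guix-containers/genenetwork/var/lib/redis/dump.rdb`. If the magic is missing, that would suggest the file itself is damaged rather than written by a different redis version.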