# Production on tux04

Lately we have been running production on tux04. Unfortunately Debian got broken and I don't see a way to fix it (something with python versions that break apt!). Also mariadb is giving problems:

=> issues/production-container-mechanical-rob-failure.gmi

and that is alarming. We might as well try an upgrade. I created a new partition on /dev/sda4 using debootstrap.

Luckily not too much is running on this machine and if we mount things again, most should work.

# Tasks

* [X] cleanly shut down mariadb
* [X] reboot into new partition /dev/sda4
* [X] git in /etc
* [X] make sure serial boot works (/etc/default/grub)
* [X] fix groups and users
* [X] get guix going
* [X] get mariadb going
* [ ] fire up GN2 service
* [ ] fire up SPARQL service
* [ ] fix CRON jobs and backups
* [ ] test full reboots


# Boot in new partition

```
blkid /dev/sda4
/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811"
```

After debootstrap there are two things to take care of: the /dev directory and grub. For good measure
I also capture some state

```
cd ~
ps xau > cron.log
systemctl > systemctl.txt
cp /etc/network/interfaces .
cp /boot/grub/grub.cfg .
```

we should still have access to the old root partition, so I don't need to capture everything.

## /dev

I ran MAKEDEV and that may not be needed with udev.

## grub

We need to tell grub to boot into the new partition. The old root is on
UUID=8e874576-a167-4fa1-948f-2031e8c3809f /dev/sda2.

Next I ran

```
tux04:~$ update-grub2 /dev/sda
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.10.0-32-amd64
Found initrd image: /boot/initrd.img-5.10.0-32-amd64
Found linux image: /boot/vmlinuz-5.10.0-22-amd64
Found initrd image: /boot/initrd.img-5.10.0-22-amd64
Warning: os-prober will be executed to detect other bootable partitions.
Its output will be used to detect bootable binaries on them and create new boot entries.
Found Debian GNU/Linux 12 (bookworm) on /dev/sda4
Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi
Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2
```

Very good. Do a diff on grub.cfg and you see it even picked up the serial configuration. It only shows it added menu entries for the new boot. Very nice.

At this point I feel safe to boot as we should be able to get back into the old partition.

# /etc/fstab

The old fstab looked like

```
UUID=8e874576-a167-4fa1-948f-2031e8c3809f /               ext4    errors=remount-ro 0       1
# /boot/efi was on /dev/sdc1 during installation
UUID=998E-68AF  /boot/efi       vfat    umask=0077      0       1
# swap was on /dev/sdc3 during installation
UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none            swap    sw              0       0
UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8"   /export2    ext4   auto     0   2
UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f"   /export     ext4   auto     0   2
UUID="f006dd4a-2365-454d-a3a2-9a42518d6286"   /export3    auto   auto     0   2
/export2/gnu /gnu none defaults,bind 0 0
# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e"
# /dev/sdb4   /export   ext4    auto   0       2
# /dev/sdd1   /export2   ext4    auto   0       2
```

# reboot

Next we are going to reboot, and we need a serial connector to the Dell out-of-band using racadm:

```
ssh IP
console com2
racadm getsel
racadm serveraction powercycle
racadm serveraction powerstatus

```

Main trick it so hit ESC, wait 2 sec and 2 when you want the bios boot menu. Ctrl-\ to escape console. Otherwise ESC (wait) ! to get to the boot menu.

# First boot

It still boots by default into the old root. That gave an error:

[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286

This is /export3. We can fix that later.

When I booted into the proper partition the console clapped out. Also the racadm password did not work on tmux -- I had to switch to a standard console to log in again. Not sure why that is, but next I got:

```
Give root password for maintenance
(or press Control-D to continue):
```

and giving the root password I was in maintenance mode on the correct partition!

To rerun grup I had to add `GRUB_DISABLE_OS_PROBER=false`.

Once booting up it is a matter of mounting partitions and tick the check boxes above.

The following contained errors:

```
/dev/sdd1       3.6T  1.8T  1.7T  52% /export2
```

# Guix

Getting guix going is a bit tricky because we want to keep the store!

```
cp -vau /mnt/old-root/var/guix/ /var/
cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/
cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/
cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/
cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/
```

Also had to add guixbuild users and group by hand.

# nginx

We use the streaming facility. Check that

```
nginx -V
```

lists --with-stream=static, see

=> https://serverfault.com/questions/858067/unknown-directive-stream-in-etc-nginx-nginx-conf86/858074#858074

and load at the start of nginx.conf:

```
load_module /usr/lib/nginx/modules/ngx_stream_module.so;
```

and

```
nginx -t
```

passes

Now the container responds to the browser with `Internal Server Error`.

# container web server

Visit the container with something like

```
nsenter -at 2838 /run/current-system/profile/bin/bash --login
```

The nginx log in the container has many

```
2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org"
```

that is interesting. Acme/https is working because GN2 is working:

```
curl https://genenetwork.org/api3/version
"1.0"
```

Looking at the logs it appears it is a redis problem first for GN2.

Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. Machines are defined in

```
fredm@tux04:/export3/local/home/fredm/gn-machines
```

The shared dir for redis is at

--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis

with

```
root@genenetwork-production /var# ls lib/redis/ -l
-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb
```

In production.scm it is defined as

```
(service redis-service-type
         (redis-configuration
          (bind "127.0.0.1")
          (port 6379)
          (working-directory "/var/lib/redis")))
```

The defaults are the same as the definition of redis-service-type (in guix). Not sure why we are duplicating.

After starting redis by hand I get another error `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is it created
a DB in the wrong place. Alright, the logs in the container say:

```
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting.
Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1.
```

This is caused by a newer version of redis. This is odd because we are using the same version from the container?!