# Production on tux04
Lately we have been running production on tux04. Unfortunately the Debian install got broken and I don't see a way to fix it (something with python versions that breaks apt!). Mariadb is also giving problems:
=> issues/production-container-mechanical-rob-failure.gmi
and that is alarming. We might as well try an upgrade. I installed a fresh Debian on a new partition, /dev/sda4, using debootstrap.
Luckily not too much is running on this machine, and if we mount things again most should work.
# Tasks
* [X] cleanly shut down mariadb
* [X] reboot into new partition /dev/sda4
* [X] git in /etc
* [X] make sure serial boot works (/etc/default/grub)
* [X] fix groups and users
* [X] get guix going
* [X] get mariadb going
* [X] fire up GN2 service
* [X] fire up SPARQL service
* [X] sheepdog
* [ ] fix CRON jobs and backups
* [ ] test full reboots
# Boot in new partition
```
blkid /dev/sda4
/dev/sda4: UUID="4aca24fe-3ece-485c-b04b-e2451e226bf7" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2e3d569f-6024-46ea-8ef6-15b26725f811"
```
After debootstrap there are two things to take care of: the /dev directory and grub. For good measure I also capture some state:
```
cd ~
ps xau > cron.log
systemctl > systemctl.txt
cp /etc/network/interfaces .
cp /boot/grub/grub.cfg .
```
We should still have access to the old root partition, so I don't need to capture everything.
## /dev
I ran MAKEDEV, though that may not be needed with udev.
## grub
We need to tell grub to boot into the new partition. The old root is on /dev/sda2 (UUID=8e874576-a167-4fa1-948f-2031e8c3809f).
Next I ran
```
tux04:~$ update-grub2 /dev/sda
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.10.0-32-amd64
Found initrd image: /boot/initrd.img-5.10.0-32-amd64
Found linux image: /boot/vmlinuz-5.10.0-22-amd64
Found initrd image: /boot/initrd.img-5.10.0-22-amd64
Warning: os-prober will be executed to detect other bootable partitions.
Its output will be used to detect bootable binaries on them and create new boot entries.
Found Debian GNU/Linux 12 (bookworm) on /dev/sda4
Found Windows Boot Manager on /dev/sdd1@/efi/Microsoft/Boot/bootmgfw.efi
Found Debian GNU/Linux 11 (bullseye) on /dev/sdf2
```
Very good. A diff against the saved grub.cfg shows that it even picked up the serial configuration, and that the only change is the added menu entries for the new boot. Very nice.
At this point I feel safe to boot as we should be able to get back into the old partition.
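The grub.cfg comparison above can also be done per menu entry instead of reading a raw diff. A minimal sketch, assuming the copy saved earlier sits in the home directory and that menu entries use the single-quoted titles Debian's grub-mkconfig writes; `menu_titles` and `new_entries` are hypothetical helper names, not part of grub:

```shell
#!/bin/sh
# Sketch: list menu entries that exist in the regenerated grub.cfg
# but not in the saved copy. Helper names are hypothetical.

# Print the title of every "menuentry '...'" line in a grub.cfg, sorted.
menu_titles() {
    sed -n "s/^[[:space:]]*menuentry '\([^']*\)'.*/\1/p" "$1" | sort -u
}

# Print titles that appear in NEW ($2) but not in OLD ($1).
new_entries() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    menu_titles "$1" > "$old_list"
    menu_titles "$2" > "$new_list"
    # comm -13 keeps only the lines unique to the second (sorted) file.
    comm -13 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}
```

Something like `new_entries ~/grub.cfg /boot/grub/grub.cfg` should then list only the entries for /dev/sda4, and `grep -n serial /boot/grub/grub.cfg` gives a quick look at the serial settings.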
# /etc/fstab
The old fstab looked like
```
UUID=8e874576-a167-4fa1-948f-2031e8c3809f / ext4 errors=remount-ro 0 1
# /boot/efi was on /dev/sdc1 during installation
UUID=998E-68AF /boot/efi vfat umask=0077 0 1
# swap was on /dev/sdc3 during installation
UUID=cbfcd84e-73f8-4cec-98ee-40cad404735f none swap sw 0 0
UUID="783e3bd6-5610-47be-be82-ac92fdd8c8b8" /export2 ext4 auto 0 2
UUID="9e6a9d88-66e7-4a2e-a12c-f80705c16f4f" /export ext4 auto 0 2
UUID="f006dd4a-2365-454d-a3a2-9a42518d6286" /export3 auto auto 0 2
/export2/gnu /gnu none defaults,bind 0 0
# /dev/sdd1: PARTLABEL="bulk" PARTUUID="b1a820fe-cb1f-425e-b984-914ee648097e"
# /dev/sdb4 /export ext4 auto 0 2
# /dev/sdd1 /export2 ext4 auto 0 2
```
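On the new partition the root entry has to point at /dev/sda4 instead. A minimal sketch of deriving the new fstab from the old one, assuming all other lines are reused unchanged and that the old root gets mounted at /mnt/old-root (the path used in the Guix section); `swap_root_uuid` is a hypothetical helper and the `nofail` option is my assumption, not something taken from the running system:

```shell
#!/bin/sh
# Sketch: rewrite the "/" entry of an fstab read on stdin.
# $1 = old root UUID, $2 = new root UUID. Helper name is hypothetical.
swap_root_uuid() {
    old_uuid=$1
    new_uuid=$2
    # Point the "/" entry at the new partition, leave every other line alone.
    sed "s|^UUID=${old_uuid}\([[:space:]]\{1,\}/[[:space:]]\)|UUID=${new_uuid}\1|"
    # Keep the old root reachable; mount point and options are assumptions.
    printf 'UUID=%s /mnt/old-root ext4 defaults,nofail 0 2\n' "$old_uuid"
}
```

With the UUIDs from the blkid output and the old fstab above this would be `swap_root_uuid 8e874576-a167-4fa1-948f-2031e8c3809f 4aca24fe-3ece-485c-b04b-e2451e226bf7 < /mnt/old-root/etc/fstab`, with the result reviewed by eye before it goes into /etc/fstab.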
# reboot
Next we are going to reboot, and for that we need a serial connection through the Dell out-of-band controller using racadm:
```
ssh IP
console com2
racadm getsel
racadm serveraction powercycle
racadm serveraction powerstatus
```
The main trick is to hit ESC, wait 2 sec, and then hit 2 when you want the BIOS boot menu. Ctrl-\ escapes the console. Otherwise ESC (wait) ! gets you to the boot menu.
# First boot
It still boots by default into the old root. That gave an error:
[FAILED] Failed to start File Syste…a-2365-454d-a3a2-9a42518d6286
This is /export3. We can fix that later.
When I booted into the proper partition the console clapped out. Also the racadm password did not work on tmux -- I had to switch to a standard console to log in again. Not sure why that is, but next I got:
```
Give root password for maintenance
(or press Control-D to continue):
```
and giving the root password I was in maintenance mode on the correct partition!
To rerun grub I had to add `GRUB_DISABLE_OS_PROBER=false` to /etc/default/grub.
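A minimal sketch of setting that option in a repeatable way, assuming the usual shell-style KEY=VALUE layout of /etc/default/grub; `set_default` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: set KEY=VALUE in a shell-style defaults file, replacing an existing
# (possibly commented-out) assignment or appending a new one.
# set_default is a hypothetical helper name.
set_default() {
    file=$1
    key=$2
    value=$3
    if grep -q "^#\{0,1\}[[:space:]]*${key}=" "$file"; then
        tmp=$(mktemp)
        sed "s|^#\{0,1\}[[:space:]]*${key}=.*|${key}=${value}|" "$file" > "$tmp"
        cat "$tmp" > "$file"   # write back in place to keep owner and mode
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

Run as root, `set_default /etc/default/grub GRUB_DISABLE_OS_PROBER false` followed by `update-grub` should then regenerate the menu with the other partitions included.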
Once booted, it is a matter of mounting partitions and ticking the check boxes above.
The following partition contained errors:
```
/dev/sdd1 3.6T 1.8T 1.7T 52% /export2
```
# Guix
Getting guix going is a bit tricky because we want to keep the store!
```
cp -vau /mnt/old-root/var/guix/ /var/
cp -vau /mnt/old-root/usr/local/guix-profiles /usr/local/
cp -vau /mnt/old-root/usr/local/bin/* /usr/local/bin/
cp -vau /mnt/old-root/etc/systemd/system/guix-daemon.service* /etc/systemd/system/
cp -vau /mnt/old-root/etc/systemd/system/gnu-store.mount* /etc/systemd/system/
```
I also had to add the guixbuild users and group by hand.
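A minimal sketch of that step, following the recipe from the Guix manual for a binary installation. The count of 10 build users and the /usr/sbin/nologin path are assumptions, and `guixbuild_commands` is a hypothetical helper that only prints the commands so they can be reviewed first:

```shell
#!/bin/sh
# Sketch: print the commands that create the build group and N build users.
# guixbuild_commands is a hypothetical helper; nothing is executed here.
guixbuild_commands() {
    count=${1:-10}
    echo "groupadd --system guixbuild"
    i=1
    while [ "$i" -le "$count" ]; do
        n=$(printf '%02d' "$i")
        # /usr/sbin/nologin is an assumption; check with "which nologin".
        echo "useradd -g guixbuild -G guixbuild -d /var/empty -s /usr/sbin/nologin -c 'Guix build user $n' --system guixbuilder$n"
        i=$((i + 1))
    done
}
```

After checking the output, `guixbuild_commands 10 | sh` as root creates the accounts. Since we keep the old store, it is probably worth checking that the group ownership of /gnu/store still resolves to guixbuild afterwards.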
# nginx
We use the streaming facility. Check that
```
nginx -V
```
lists --with-stream=dynamic, see
=> https://serverfault.com/questions/858067/unknown-directive-stream-in-etc-nginx-nginx-conf86/858074#858074
and load at the start of nginx.conf:
```
load_module /usr/lib/nginx/modules/ngx_stream_module.so;
```
and
```
nginx -t
```
passes
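Whether the `load_module` line is needed depends on how the stream module was built. A minimal sketch for checking that, assuming nginx's usual configure flags, where `--with-stream=dynamic` ships the module as a shared object and a bare `--with-stream` compiles it in; `stream_support` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: classify stream support from "nginx -V" output read on stdin.
# stream_support is a hypothetical helper name.
stream_support() {
    # Put every configure flag on its own line so it can be matched exactly.
    flags=$(tr ' ' '\n')
    if printf '%s\n' "$flags" | grep -qx -e '--with-stream=dynamic'; then
        echo dynamic      # shared object, needs load_module in nginx.conf
    elif printf '%s\n' "$flags" | grep -q -e '^--with-stream\(=.*\)\{0,1\}$'; then
        echo builtin      # compiled in, no load_module needed
    else
        echo absent       # flags like --with-stream_ssl_module do not count
    fi
}
```

nginx prints its build flags on stderr, so the call is `nginx -V 2>&1 | stream_support`.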
Now the container responds to the browser with `Internal Server Error`.
# container web server
Visit the container with something like
```
nsenter -at 2838 /run/current-system/profile/bin/bash --login
```
The nginx log in the container has many
```
2025/02/22 17:23:48 [error] 136#0: *166916 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: genenetwork.org, request: "GET /gn3/gene/aliases/st%2029:1;o;s HTTP/1.1", upstream: "http://127.0.0.1:9800/gene/aliases/st%2029:1;o;s", host: "genenetwork.org"
```
That is interesting. Acme/https must be working, because GN2 responds:
```
curl https://genenetwork.org/api3/version
"1.0"
```
Looking at the logs, it appears GN2 first runs into a redis problem.
Fred builds the container with `/home/fredm/opt/guix-production/bin/guix`. Machines are defined in
```
fredm@tux04:/export3/local/home/fredm/gn-machines
```
The shared dir for redis is at
--share=/export2/guix-containers/genenetwork/var/lib/redis=/var/lib/redis
with
```
root@genenetwork-production /var# ls lib/redis/ -l
-rw-r--r-- 1 redis redis 629328484 Feb 22 17:25 dump.rdb
```
In production.scm it is defined as
```
(service redis-service-type
(redis-configuration
(bind "127.0.0.1")
(port 6379)
(working-directory "/var/lib/redis")))
```
These settings are the same as the defaults of redis-service-type (in guix). Not sure why we are duplicating them.
After starting redis by hand I get another error: `500 DatabaseError: The following exception was raised while attempting to access http://auth.genenetwork.org/auth/data/authorisation: database disk image is malformed`. The problem is that it created a DB in the wrong place. Alright, the logs in the container say:
```
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=3977, just started
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:C 23 Feb 2025 14:04:31.040 # Configuration loaded
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * monotonic clock: POSIX clock_gettime
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.041 * Running mode=standalone, port=6379.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Server initialized
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Wrong signature trying to load DB from file
Feb 23 14:04:31 genenetwork-production shepherd[1]: [redis-server] 3977:M 23 Feb 2025 14:04:31.042 # Fatal error loading the DB: Invalid argument. Exiting.
Feb 23 14:04:31 genenetwork-production shepherd[1]: Service redis (PID 3977) exited with 1.
```
My first guess was that this is caused by a newer version of redis. That would be odd, because we are using the same version from the container?!
Actually it turned out the redis DB was corrupted on the SSD! Same for some other databases (ugh).
Fred copied all data to enterprise-level storage, and we rolled back to some older DBs, so hopefully we'll be OK for now.
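A quick way to spot this kind of damage before starting the service is to look at the first bytes of the dump: an RDB file starts with the ASCII string REDIS followed by a four-digit format version. A minimal sketch, with `rdb_signature_ok` as a hypothetical helper name; it only checks the header, not the rest of the file:

```shell
#!/bin/sh
# Sketch: check the magic bytes at the start of an RDB dump.
# rdb_signature_ok is a hypothetical helper name.
rdb_signature_ok() {
    # Read the first 9 bytes; drop NUL bytes so the shell can hold the value.
    header=$(head -c 9 "$1" | tr -d '\000')
    case $header in
        REDIS[0-9][0-9][0-9][0-9])
            echo "ok: format version ${header#REDIS}" ;;
        *)
            echo "bad signature"
            return 1 ;;
    esac
}
```

For example `rdb_signature_ok /export2/guix-containers/genenetwork/var/lib/redis/dump.rdb`. For a full check redis ships `redis-check-rdb`, which walks the whole file.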
# Reinstating backups
In the next step we need to reinstate backups as described in
=> /topics/systems/backups-with-borg
I already created an ibackup user. Next we test the backup script for mariadb.
One important step is to check the database:
```
/usr/bin/mariadb-check -c -u user -p* db_webqtl
```
A successful mariadb backup consists of multiple steps:
```
2025-02-27 11:48:28 +0000 (ibackup@tux04) SUCCESS 0 <32m43s> mariabackup-dump
2025-02-27 11:48:29 +0000 (ibackup@tux04) SUCCESS 0 <00m00s> mariabackup-make-consistent
2025-02-27 12:16:37 +0000 (ibackup@tux04) SUCCESS 0 <28m08s> borg-tux04-sql-backup
2025-02-27 12:16:46 +0000 (ibackup@tux04) SUCCESS 0 <00m07s> drop-rsync-balg01
```
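A log like this can be checked automatically, for example from a monitoring job. A minimal sketch, assuming the field layout seen in the sample above (status in the fifth field, step name in the last); `missing_steps` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: report every expected step that has no SUCCESS line in the log
# read on stdin. missing_steps is a hypothetical helper name; the field
# layout is inferred from the sample log, not from the backup scripts.
missing_steps() {
    log=$(cat)
    status=0
    for step in "$@"; do
        # Field 5 is the status, the last field is the step name.
        if ! printf '%s\n' "$log" |
            awk -v s="$step" '$5 == "SUCCESS" && $NF == s { found = 1 } END { exit !found }'
        then
            echo "$step"
            status=1
        fi
    done
    return $status
}
```

Fed with the log, `missing_steps mariabackup-dump mariabackup-make-consistent borg-tux04-sql-backup drop-rsync-balg01` prints nothing and returns 0 when all four steps succeeded. Note that it does not look at the dates, so an old success still counts unless the log is filtered to the current day first.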