Working on octopus

author: Pjotr Prins 2025-12-22 10:39:32 +0100
committer: Pjotr Prins 2026-01-05 11:12:11 +0100
commit: b1b2d68813935a747d08029e182671441a03a8a4 (patch)
tree: 3caa147663ac07eada0decf6e513a72849aada56
parent: c47420a0fc008586070f8a9212a1143c053c37eb (diff)
download: gn-gemtext-b1b2d68813935a747d08029e182671441a03a8a4.tar.gz
4 files changed, 84 insertions, 29 deletions
diff --git a/topics/hpc/octopus/slurm-user-guide.gmi b/topics/hpc/octopus/slurm-user-guide.gmi
index f7ea6d4..d0a3cc4 100644
--- a/topics/hpc/octopus/slurm-user-guide.gmi
+++ b/topics/hpc/octopus/slurm-user-guide.gmi
@@ -37,7 +37,6 @@ To get a shell prompt on one of the nodes (useful for testing your environment)
 srun -N 1 --mem=32G --pty /bin/bash
 ```
 
-
 # Differences
 
 ## Guix (look ma, no modules)
diff --git a/topics/octopus/maintenance.gmi b/topics/octopus/maintenance.gmi
index 65ea52e..00cc575 100644
--- a/topics/octopus/maintenance.gmi
+++ b/topics/octopus/maintenance.gmi
@@ -11,7 +11,7 @@ octopus02
 - Devices: 2 3.7T SSDs + 2 894.3G SSDs + 2 4.6T HDDs
 - **Status: Slurm not OK, LizardFS not OK**
 - Notes:
-  - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`, 
+  - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`,
   - **I don't see 2 drives that are physically mounted**
 
 octopus03
@@ -21,7 +21,7 @@ octopus03
 
 octopus04
 - Devices: 4 7.3 T SSDs (Neil) + 1 4.6T HDD + 1 3.7T SSD + 2 894.3G SSDs
-- Status: Slurm NO, LizardFS OK (we don't share the HDD) 
+- Status: Slurm NO, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus05
@@ -31,7 +31,7 @@ octopus05
 
 octopus06
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus07
@@ -41,17 +41,17 @@ octopus07
 
 octopus08
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus09
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no
 
 octopus10
 - Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD) 
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: **I don't see 1 device that is physically mounted**
 
 octopus11
diff --git a/topics/octopus/octopussy-needs-love.gmi b/topics/octopus/octopussy-needs-love.gmi
index 03f98a1..fc8e285 100644
--- a/topics/octopus/octopussy-needs-love.gmi
+++ b/topics/octopus/octopussy-needs-love.gmi
@@ -23,7 +23,9 @@ Another thing we ought to fix is introduce centralized user management. So far w
 * [ ] Install moosefs
 * [ ] Upgrade bios (tuxes)
 * [ ] Migrate lizardfs nodes to moosefs (one at a time)
+* [ ] Add server monitoring with sheepdog
 * [ ] Upgrade Debian
+* - [ ] Maybe, just maybe, boot the nodes from a central server
 * [ ] Introduce centralized user management
 
 # Progress
@@ -63,3 +65,32 @@ See
 => https://git.genenetwork.org/guix-bioinformatics/commit/?id=236903baaab0f84f012a55700c1917265a2b701c
 
 Next stop testing and deploying!
+
+## Choosing a head node
+
+Currently octopus01 is the head node. It is probably a good idea to change that, so we can safely upgrade the new server. The first choice would be octopus02 (o2). We can mirror the moose daemons on octopus01 (o1) later. Let's see what that looks like.
+
+A quick assessment of o1 shows that it has 14T of storage serving /home and /gnu, but only 1.2T is used.
+
+o2 also has quite a few disks (up 1417 days!), but a bunch of SSDs appear to error out, e.g.:
+
+```
+Sep 04 07:44:56 octopus02 mfschunkserver[22766]: can't create lock file /mnt/sdd1/lizardfs_vol/.lock, marking hdd as damaged: Input/output error
+UUID=277c05de-64f5-48a8-8614-8027a53be212 /mnt/sdd1 xfs rw,exec,nodev,noatime,nodiratime,largeio,inode64 0 1
+```
+
+We'll need to reboot the server to see what storage may still work. The slurm connection appears to be misconfigured:
+
+```
+[2025-12-20T09:36:27.846] error: service_connection: slurm_receive_msg: Insane message length
+[2025-12-20T09:36:28.415] error: unpackstr_xmalloc: Buffer to be unpacked is too large (1700881509 > 1073741824)
+[2025-12-20T09:36:28.415] error: unpacking header
+[2025-12-20T09:36:28.415] error: destroy_forward: no init
+[2025-12-20T09:36:28.415] error: slurm_receive_msg_and_forward: [[nessus6.uthsc.edu]:35553] failed: Message receive failure
+```
+
+It looks like Andrea is the only one using the machine right now, though some others are logged in. Before rebooting I'll block users, ask Andrea to move off, and deplete slurm and lizard. But o2 is a large RAM machine, so we should not use it as a head node.
+
+Let's take a look at o3. This one has less RAM. Flavia is running some tools, but I don't think the machine is really used right now. Slurm is running, but shows configuration issues similar to those on o2. Let's take a look at slurm:
+
+=> ../systems/hpc/octopus-maintenance
+=> ../hpc/octopus/slurm-user-guide
+
+Alright, I depleted and removed slurm from o3. I think it would be wise to also deplete the lizard drives on that machine.
diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi
index a0a2f16..d034575 100644
--- a/topics/systems/hpc/octopus-maintenance.gmi
+++ b/topics/systems/hpc/octopus-maintenance.gmi
@@ -2,10 +2,23 @@
 
 ## Slurm
 
-Status of slurm
+Status of slurm (as of 202512)
 
 ```
 sinfo
+workers*     up   infinite      8   idle octopus[03,05-11]
+allnodes     up   infinite      3  alloc tux[06,08-09]
+allnodes     up   infinite     11   idle octopus[02-03,05-11],tux[05,07]
+tux          up   infinite      3  alloc tux[06,08-09]
+tux          up   infinite      2   idle tux[05,07]
+1tbmem       up   infinite      1   idle octopus02
+headnode     up   infinite      1   idle octopus01
+highmem      up   infinite      2   idle octopus[02,11]
+386mem       up   infinite      6   idle octopus[03,06-10]
+lowmem       up   infinite      7   idle octopus[03,05-10]
+```
+
+```
 sinfo -R
 squeue
 ```
@@ -29,7 +42,7 @@ UnkillableStepProgram   = (null)
 UnkillableStepTimeout   = 60 sec
 ```
 
-check valid configuration with `slurmd -C` and update nodes with
+check valid configuration with 'slurmd -C' and update nodes with
 
 ```
 scontrol reconfigure
@@ -45,13 +58,13 @@ Basically the root user can copy across.
 
 ## Execute binaries on mounted devices
 
-To avoid `./scratch/script.sh: Permission denied` on `device_file`:
+To avoid './scratch/script.sh: Permission denied' on 'device_file':
 
-- `sudo bash`
-- `ls /scratch -l` to check where `/scratch` is
-- `vim /etc/fstab`
-- replace `noexec` with `exec` for `device_file`
-- `mount -o remount [device_file]` to remount the partition with its new configuration.
+- 'sudo bash'
+- 'ls /scratch -l' to check where '/scratch' is
+- 'vim /etc/fstab'
+- replace 'noexec' with 'exec' for 'device_file'
+- 'mount -o remount [device_file]' to remount the partition with its new configuration.
 
 Some notes:
 
@@ -67,7 +80,7 @@ x-systemd.device-timeout=
 10.0.0.110:/export/3T  /mnt/3T  nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
 
 
-## Installation of `munge` and `slurm` on a new node
+## Installation of 'munge' and 'slurm' on a new node
 
 Current nodes in the pool have:
 
@@ -78,7 +91,7 @@ sbatch --version
     slurm-wlm 18.08.5-2
 ```
 
-To install `munge`, go to `octopus01` and run:
+To install 'munge', go to 'octopus01' and run:
 
 ```shell
 guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
@@ -86,7 +99,7 @@ guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
 systemctl status munge # to check if the service is running and where its service file is
 ```
 
-We need to setup the rights for `munge`:
+We need to setup the rights for 'munge':
 
 ```shell
 sudo bash
@@ -100,7 +113,7 @@ mkdir -p /var/lib/munge
 chown munge:munge /var/lib/munge/
 
 mkdir -p /etc/munge
-# copy `munge.key` (from a working node) to `/etc/munge/munge.key`
+# copy 'munge.key' (from a working node) to '/etc/munge/munge.key'
 chown -R munge:munge /etc/munge
 
 mkdir -p /run/munge
@@ -112,7 +125,7 @@ chown munge:munge /var/log/munge
 mkdir -p /var/run/munge # todo: not sure why it needs such a folder
 chown munge:munge /var/run/munge
 
-# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service`
+# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service'
 
 systemctl daemon-reload
 systemctl enable munge
@@ -120,25 +133,25 @@ systemctl start munge
 systemctl status munge
 ```
 
-To test the new installation, go to `octopus01` and then:
+To test the new installation, go to 'octopus01' and then:
 
 ```shell
 munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
 ```
 
-If you get `STATUS: Rewound credential (16)`, it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
+If you get 'STATUS: Rewound credential (16)', it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
 
 ```shell
 sudo date MMDDhhmmYYYY.ss
 ```
 
-To install `slurm`, go to `octopus01` and run:
+To install 'slurm', go to 'octopus01' and run:
 
 ```shell
 guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
 ```
 
-We need to setup the rights for `slurm`:
+We need to setup the rights for 'slurm':
 
 ```shell
 sudo bash
@@ -152,8 +165,8 @@ mkdir -p /var/lib/slurm
 chown munge:munge /var/lib/slurm/
 
 mkdir -p /etc/slurm
-# copy `slurm.conf` to `/etc/slurm/slurm.conf`
-# copy `cgroup.conf` to `/etc/slurm/cgroup.conf`
+# copy 'slurm.conf' to '/etc/slurm/slurm.conf'
+# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf'
 
 chown -R slurm:slurm /etc/slurm
 
@@ -163,7 +176,7 @@ chown slurm:slurm /run/slurm
 mkdir -p /var/log/slurm
 chown slurm:slurm /var/log/slurm
 
-# copy `slurm.service` to `/etc/systemd/system/slurm.service`
+# copy 'slurm.service' to '/etc/systemd/system/slurm.service'
 
 /export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information
 
@@ -173,12 +186,24 @@ systemctl start slurm
 systemctl status slurm
 ```
 
-On `octopus01` (the master):
+On 'octopus01' (the master):
 
 ```shell
 sudo bash
 
-# add the new node to `/etc/slurm/slurm.conf`
+# add the new node to '/etc/slurm/slurm.conf'
 
 systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
 ```
+
+
+# Removing a node
+
+We are removing o3 so it can become the new head node:
+
+```
+scontrol update nodename=octopus03 state=drain reason="removing"
+scontrol show node octopus03 | grep State
+scontrol update nodename=octopus03 state=down reason="removed"
+  State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+```
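
The drain-then-check steps in the "Removing a node" hunk can be wrapped in a small shell helper. This is only a sketch and not part of the commit: the function names `node_state` and `is_draining` are hypothetical, and it assumes that `scontrol show node` prints a `State=` field in the form shown in the hunk above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the commit).

# Read 'scontrol show node <name>' output on stdin and print the value
# of the first State= field, e.g. DOWN+DRAIN.
node_state() {
    sed -n 's/.*State=\([^ ]*\).*/\1/p' | head -n 1
}

# Succeed when the given state string contains DRAIN, meaning the
# scheduler will not place new jobs on the node.
is_draining() {
    case "$1" in
        *DRAIN*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the head node this could be used as `state=$(scontrol show node octopus03 | node_state)` followed by `is_draining "$state" && echo "octopus03 is draining"`, before setting the node to `down`.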