| author | Pjotr Prins | 2025-12-22 10:39:32 +0100 |
|---|---|---|
| committer | Pjotr Prins | 2026-01-05 11:12:11 +0100 |
| commit | b1b2d68813935a747d08029e182671441a03a8a4 (patch) | |
| tree | 3caa147663ac07eada0decf6e513a72849aada56 | |
| parent | c47420a0fc008586070f8a9212a1143c053c37eb (diff) | |
Working on octopus
| -rw-r--r-- | topics/hpc/octopus/slurm-user-guide.gmi | 1 |
| -rw-r--r-- | topics/octopus/maintenance.gmi | 12 |
| -rw-r--r-- | topics/octopus/octopussy-needs-love.gmi | 31 |
| -rw-r--r-- | topics/systems/hpc/octopus-maintenance.gmi | 69 |
4 files changed, 84 insertions, 29 deletions
diff --git a/topics/hpc/octopus/slurm-user-guide.gmi b/topics/hpc/octopus/slurm-user-guide.gmi
index f7ea6d4..d0a3cc4 100644
--- a/topics/hpc/octopus/slurm-user-guide.gmi
+++ b/topics/hpc/octopus/slurm-user-guide.gmi
@@ -37,7 +37,6 @@ To get a shell prompt on one of the nodes (useful for testing your environment)
 srun -N 1 --mem=32G --pty /bin/bash
 ```

-
 # Differences

 ## Guix (look ma, no modules)
diff --git a/topics/octopus/maintenance.gmi b/topics/octopus/maintenance.gmi
index 65ea52e..00cc575 100644
--- a/topics/octopus/maintenance.gmi
+++ b/topics/octopus/maintenance.gmi
@@ -11,7 +11,7 @@ octopus02
 - Devices: 2 3.7T SSDs + 2 894.3G SSDs + 2 4.6T HDDs
 - **Status: Slurm not OK, LizardFS not OK**
 - Notes:
- `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`,
+ - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`,
 - **I don't see 2 drives that are physically mounted**

 octopus03
@@ -21,7 +21,7 @@ octopus03

 octopus04
 - Devices: 4 7.3 T SSDs (Neil) + 1 4.6T HDD + 1 3.7T SSD + 2 894.3G SSDs
-- Status: Slurm NO, LizardFS OK (we don't share the HDD)
+- Status: Slurm NO, LizardFS OK (we don't share the HDD)
 - Notes: no

 octopus05
@@ -31,7 +31,7 @@ octopus05

 octopus06
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no

 octopus07
@@ -41,17 +41,17 @@ octopus07

 octopus08
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no

 octopus09
 - Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: no

 octopus10
 - Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
-- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
 - Notes: **I don't see 1 device that is physically mounted**

 octopus11
diff --git a/topics/octopus/octopussy-needs-love.gmi b/topics/octopus/octopussy-needs-love.gmi
index 03f98a1..fc8e285 100644
--- a/topics/octopus/octopussy-needs-love.gmi
+++ b/topics/octopus/octopussy-needs-love.gmi
@@ -23,7 +23,9 @@ Another thing we ought to fix is introduce centralized user management. So far w
 * [ ] Install moosefs
 * [ ] Upgrade bios (tuxes)
 * [ ] Migrate lizardfs nodes to moosefs (one at a time)
+* [ ] Add server monitoring with sheepdog
 * [ ] Upgrade Debian
+* - [ ] Maybe, just maybe, boot the nodes from a central server
 * [ ] Introduce centralized user management

 # Progress
@@ -63,3 +65,32 @@ See
 => https://git.genenetwork.org/guix-bioinformatics/commit/?id=236903baaab0f84f012a55700c1917265a2b701c

 Next stop testing and deploying!
+
+## Choosing a head node
+
+Currently octopus01 is the head node. It probably is a good idea to change that, so we can safely upgrade the new server. The first choice would be octopus02 (o2). We can mirror the moose daemons on octopus01 (o1) later. Let's see what that looks like.
+
+A quick assessment of o1 shows that we have 14T storage on o1 that takes care of /home and /gnu. But only 1.2T is used.
+
+o2 has also quite a few disks (up 1417 days!), but a bunch of SSDs appears to error out. E.g.
+
+```
+Sep 04 07:44:56 octopus02 mfschunkserver[22766]: can't create lock file /mnt/sdd1/lizardfs_vol/.lock, marking hdd as damaged: Input/output error
+UUID=277c05de-64f5-48a8-8614-8027a53be212 /mnt/sdd1 xfs rw,exec,nodev,noatime,nodiratime,largeio,inode64 0 1
+```
+
+we'll need to reboot the server to see what storage still may work. The slurm connection appears to be misconfigured:
+
+```
+[2025-12-20T09:36:27.846] error: service_connection: slurm_receive_msg: Insane message length
+[2025-12-20T09:36:28.415] error: unpackstr_xmalloc: Buffer to be unpacked is too large (1700881509 > 1073741824) [2025-12-20T09:36:28.415] error: unpacking header [2025-12-20T09:36:28.415] error: destroy_forward: no init [2025-12-20T09:36:28.415] error: slurm_receive_msg_and_forward: [[nessus6.uthsc.edu]:35553] failed: Message receive failure
+```
+
+looks like Andrea is the only one using the machine right now though some others logged in. Before rebooting I'll block users, ask Andrea to move off, and deplete slurm and lizard. But o2 is a large RAM machine, so we should not use that as a head node.
+
+Let's take a look at o3. This one has less RAM. Flavia is running some tools, but I don't think the machine is really used right now. Slurm is running, but shows similar configuration issues as o2. Let's take a look at slurm
+
+=> ../systems/hpc/octopus-maintenance
+=> ../hpc/octopus/slurm-user-guide
+
+Alright, I depleted and removed slurm from o3. I think it would be wise to also deplete the lizard drives on that machine.
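A side note on the "Buffer to be unpacked is too large (1700881509 > 1073741824)" line quoted in the diff above: the rejected length can be read as four bytes. The sketch below assumes the value was taken from the wire in network (big-endian) byte order; the helper name `decode_u32` is hypothetical and not part of Slurm. It prints the four bytes as ASCII. Readable text where a binary header is expected may hint that a non-Slurm client, such as the scanner host named in the log, connected to the port. That reading is an assumption, not a confirmed diagnosis.

```shell
# Hypothetical helper: split a 32-bit integer into its four bytes
# (most significant first) and print them as ASCII characters.
decode_u32() {
  u32_value=$1
  u32_text=""
  for u32_shift in 24 16 8 0; do
    # Extract one byte, then turn it into a character via an octal escape.
    u32_byte=$(( (u32_value >> u32_shift) & 255 ))
    u32_text="$u32_text$(printf "\\$(printf '%03o' "$u32_byte")")"
  done
  printf '%s\n' "$u32_text"
}

decode_u32 1700881509   # prints: eade
```

The same helper can be applied to any other length reported in the slurmd or slurmctld logs; it only makes sense when all four bytes are printable.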
diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi
index a0a2f16..d034575 100644
--- a/topics/systems/hpc/octopus-maintenance.gmi
+++ b/topics/systems/hpc/octopus-maintenance.gmi
@@ -2,10 +2,23 @@

 ## Slurm

-Status of slurm
+Status of slurm (as of 202512)

 ```
 sinfo
+workers* up infinite 8 idle octopus[03,05-11]
+allnodes up infinite 3 alloc tux[06,08-09]
+allnodes up infinite 11 idle octopus[02-03,05-11],tux[05,07]
+tux up infinite 3 alloc tux[06,08-09]
+tux up infinite 2 idle tux[05,07]
+1tbmem up infinite 1 idle octopus02
+headnode up infinite 1 idle octopus01
+highmem up infinite 2 idle octopus[02,11]
+386mem up infinite 6 idle octopus[03,06-10]
+lowmem up infinite 7 idle octopus[03,05-10]
+```
+
+```
 sinfo -R
 squeue
 ```
@@ -29,7 +42,7 @@ UnkillableStepProgram = (null)
 UnkillableStepTimeout = 60 sec
 ```

-check valid configuration with `slurmd -C` and update nodes with
+check valid configuration with 'slurmd -C' and update nodes with

 ```
 scontrol reconfigure
@@ -45,13 +58,13 @@ Basically the root user can copy across.

 ## Execute binaries on mounted devices

-To avoid `./scratch/script.sh: Permission denied` on `device_file`:
+To avoid './scratch/script.sh: Permission denied' on 'device_file':

-- `sudo bash`
-- `ls /scratch -l` to check where `/scratch` is
-- `vim /etc/fstab`
-- replace `noexec` with `exec` for `device_file`
-- `mount -o remount [device_file]` to remount the partition with its new configuration.
+- 'sudo bash'
+- 'ls /scratch -l' to check where '/scratch' is
+- 'vim /etc/fstab'
+- replace 'noexec' with 'exec' for 'device_file'
+- 'mount -o remount [device_file]' to remount the partition with its new configuration.

 Some notes:

@@ -67,7 +80,7 @@ x-systemd.device-timeout=

 10.0.0.110:/export/3T /mnt/3T nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0

-## Installation of `munge` and `slurm` on a new node
+## Installation of 'munge' and 'slurm' on a new node

 Current nodes in the pool have:

@@ -78,7 +91,7 @@ sbatch --version
 slurm-wlm 18.08.5-2
 ```

-To install `munge`, go to `octopus01` and run:
+To install 'munge', go to 'octopus01' and run:

 ```shell
 guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
@@ -86,7 +99,7 @@ guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
 systemctl status munge # to check if the service is running and where its service file is
 ```

-We need to setup the rights for `munge`:
+We need to setup the rights for 'munge':

 ```shell
 sudo bash
@@ -100,7 +113,7 @@ mkdir -p /var/lib/munge
 chown munge:munge /var/lib/munge/

 mkdir -p /etc/munge
-# copy `munge.key` (from a working node) to `/etc/munge/munge.key`
+# copy 'munge.key' (from a working node) to '/etc/munge/munge.key'
 chown -R munge:munge /etc/munge

 mkdir -p /run/munge
@@ -112,7 +125,7 @@ chown munge:munge /var/log/munge
 mkdir -p /var/run/munge # todo: not sure why it needs such a folder
 chown munge:munge /var/run/munge

-# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service`
+# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service'

 systemctl daemon-reload
 systemctl enable munge
@@ -120,25 +133,25 @@ systemctl start munge
 systemctl status munge
 ```

-To test the new installation, go to `octopus01` and then:
+To test the new installation, go to 'octopus01' and then:

 ```shell
 munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
 ```

-If you get `STATUS: Rewound credential (16)`, it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
+If you get 'STATUS: Rewound credential (16)', it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with

 ```shell
 sudo date MMDDhhmmYYYY.ss
 ```

-To install `slurm`, go to `octopus01` and run:
+To install 'slurm', go to 'octopus01' and run:

 ```shell
 guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
 ```

-We need to setup the rights for `slurm`:
+We need to setup the rights for 'slurm':

 ```shell
 sudo bash
@@ -152,8 +165,8 @@ mkdir -p /var/lib/slurm
 chown munge:munge /var/lib/slurm/

 mkdir -p /etc/slurm
-# copy `slurm.conf` to `/etc/slurm/slurm.conf`
-# copy `cgroup.conf` to `/etc/slurm/cgroup.conf`
+# copy 'slurm.conf' to '/etc/slurm/slurm.conf'
+# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf'

 chown -R slurm:slurm /etc/slurm

@@ -163,7 +176,7 @@ chown slurm:slurm /run/slurm
 mkdir -p /var/log/slurm
 chown slurm:slurm /var/log/slurm

-# copy `slurm.service` to `/etc/systemd/system/slurm.service`
+# copy 'slurm.service' to '/etc/systemd/system/slurm.service'

 /export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information

@@ -173,12 +186,24 @@ systemctl start slurm
 systemctl status slurm
 ```

-On `octopus01` (the master):
+On 'octopus01' (the master):

 ```shell
 sudo bash

-# add the new node to `/etc/slurm/slurm.conf`
+# add the new node to '/etc/slurm/slurm.conf'

 systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
 ```
+
+
+# Removing a node
+
+We are removing o3 so it can become the new head node:
+
+```
+scontrol update nodename=octopus03 state=drain reason="removing"
+scontrol show node octopus03 | grep State
+scontrol update nodename=octopus03 state=down reason="removed"
+ State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+```
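The State check in the "Removing a node" steps above could be scripted instead of read by eye. This is a minimal sketch, assuming `scontrol show node` prints a line that starts with `State=` after optional leading spaces, as in the output quoted in the diff. The helper name `node_state` is hypothetical.

```shell
# Hypothetical helper: read `scontrol show node <name>` output on stdin
# and print only the value of the first State= field.
node_state() {
  sed -n 's/^ *State=\([^ ]*\).*/\1/p' | head -n 1
}

# Demonstration on the line recorded in the notes above.
printf ' State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A\n' | node_state
# prints: DOWN+DRAIN
```

On a machine with Slurm installed this would be used as `scontrol show node octopus03 | node_state`, for example to confirm that a node reports a drained or down state before slurm is removed from it.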
