Diffstat (limited to 'topics/systems/hpc')
-rw-r--r-- topics/systems/hpc/octopus-maintenance.gmi | 69
-rw-r--r-- topics/systems/hpc/performance.gmi | 4
2 files changed, 49 insertions, 24 deletions
diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi
index a0a2f16..d034575 100644
--- a/topics/systems/hpc/octopus-maintenance.gmi
+++ b/topics/systems/hpc/octopus-maintenance.gmi
@@ -2,10 +2,23 @@
 
 ## Slurm
 
-Status of slurm
+Status of slurm (as of 202512)
 
 ```
 sinfo
+workers*     up   infinite      8   idle octopus[03,05-11]
+allnodes     up   infinite      3  alloc tux[06,08-09]
+allnodes     up   infinite     11   idle octopus[02-03,05-11],tux[05,07]
+tux          up   infinite      3  alloc tux[06,08-09]
+tux          up   infinite      2   idle tux[05,07]
+1tbmem       up   infinite      1   idle octopus02
+headnode     up   infinite      1   idle octopus01
+highmem      up   infinite      2   idle octopus[02,11]
+386mem       up   infinite      6   idle octopus[03,06-10]
+lowmem       up   infinite      7   idle octopus[03,05-10]
+```
+
+```
 sinfo -R
 squeue
 ```
@@ -29,7 +42,7 @@ UnkillableStepProgram   = (null)
 UnkillableStepTimeout   = 60 sec
 ```
 
-check valid configuration with `slurmd -C` and update nodes with
+check valid configuration with 'slurmd -C' and update nodes with
 
 ```
 scontrol reconfigure
@@ -45,13 +58,13 @@ Basically the root user can copy across.
 
 ## Execute binaries on mounted devices
 
-To avoid `./scratch/script.sh: Permission denied` on `device_file`:
+To avoid './scratch/script.sh: Permission denied' on 'device_file':
 
-- `sudo bash`
-- `ls /scratch -l` to check where `/scratch` is
-- `vim /etc/fstab`
-- replace `noexec` with `exec` for `device_file`
-- `mount -o remount [device_file]` to remount the partition with its new configuration.
+- 'sudo bash'
+- 'ls /scratch -l' to check where '/scratch' is
+- 'vim /etc/fstab'
+- replace 'noexec' with 'exec' for 'device_file'
+- 'mount -o remount [device_file]' to remount the partition with its new configuration.
 
 Some notes:
 
@@ -67,7 +80,7 @@ x-systemd.device-timeout=
 10.0.0.110:/export/3T  /mnt/3T  nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
 
 
-## Installation of `munge` and `slurm` on a new node
+## Installation of 'munge' and 'slurm' on a new node
 
 Current nodes in the pool have:
 
@@ -78,7 +91,7 @@ sbatch --version
     slurm-wlm 18.08.5-2
 ```
 
-To install `munge`, go to `octopus01` and run:
+To install 'munge', go to 'octopus01' and run:
 
 ```shell
 guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
@@ -86,7 +99,7 @@ guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
 systemctl status munge # to check if the service is running and where its service file is
 ```
 
-We need to setup the rights for `munge`:
+We need to setup the rights for 'munge':
 
 ```shell
 sudo bash
@@ -100,7 +113,7 @@ mkdir -p /var/lib/munge
 chown munge:munge /var/lib/munge/
 
 mkdir -p /etc/munge
-# copy `munge.key` (from a working node) to `/etc/munge/munge.key`
+# copy 'munge.key' (from a working node) to '/etc/munge/munge.key'
 chown -R munge:munge /etc/munge
 
 mkdir -p /run/munge
@@ -112,7 +125,7 @@ chown munge:munge /var/log/munge
 mkdir -p /var/run/munge # todo: not sure why it needs such a folder
 chown munge:munge /var/run/munge
 
-# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service`
+# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service'
 
 systemctl daemon-reload
 systemctl enable munge
@@ -120,25 +133,25 @@ systemctl start munge
 systemctl status munge
 ```
 
-To test the new installation, go to `octopus01` and then:
+To test the new installation, go to 'octopus01' and then:
 
 ```shell
 munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
 ```
 
-If you get `STATUS: Rewound credential (16)`, it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
+If you get 'STATUS: Rewound credential (16)', it means that there is a difference between the encoding and decoding times. To fix it, go into the new machine and fix the time with
 
 ```shell
 sudo date MMDDhhmmYYYY.ss
 ```
 
-To install `slurm`, go to `octopus01` and run:
+To install 'slurm', go to 'octopus01' and run:
 
 ```shell
 guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
 ```
 
-We need to setup the rights for `slurm`:
+We need to setup the rights for 'slurm':
 
 ```shell
 sudo bash
@@ -152,8 +165,8 @@ mkdir -p /var/lib/slurm
 chown munge:munge /var/lib/slurm/
 
 mkdir -p /etc/slurm
-# copy `slurm.conf` to `/etc/slurm/slurm.conf`
-# copy `cgroup.conf` to `/etc/slurm/cgroup.conf`
+# copy 'slurm.conf' to '/etc/slurm/slurm.conf'
+# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf'
 
 chown -R slurm:slurm /etc/slurm
 
@@ -163,7 +176,7 @@ chown slurm:slurm /run/slurm
 mkdir -p /var/log/slurm
 chown slurm:slurm /var/log/slurm
 
-# copy `slurm.service` to `/etc/systemd/system/slurm.service`
+# copy 'slurm.service' to '/etc/systemd/system/slurm.service'
 
 /export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information
 
@@ -173,12 +186,24 @@ systemctl start slurm
 systemctl status slurm
 ```
 
-On `octopus01` (the master):
+On 'octopus01' (the master):
 
 ```shell
 sudo bash
 
-# add the new node to `/etc/slurm/slurm.conf`
+# add the new node to '/etc/slurm/slurm.conf'
 
 systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
 ```
+
+
+# Removing a node
+
+We are removing o3 so it can become the new head node:
+
+```
+scontrol update nodename=octopus03 state=drain reason="removing"
+scontrol show node octopus03 | grep State
+scontrol update nodename=octopus03 state=down reason="removed"
+  State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+```
diff --git a/topics/systems/hpc/performance.gmi b/topics/systems/hpc/performance.gmi
index ac5e861..ee604b5 100644
--- a/topics/systems/hpc/performance.gmi
+++ b/topics/systems/hpc/performance.gmi
@@ -14,13 +14,13 @@ hdparm -Ttv /dev/sdc1
 
 Cheap and cheerful:
 
-Read test:
+Write test:
 
 ```
 dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct
 ```
 
-Write test:
+Read test:
 
 ```
 /sbin/sysctl -w vm.drop_caches=3