-rw-r--r-- | tasks/andreag.gmi | 22 |
-rw-r--r-- | topics/systems/hpc/octopus-maintenance.gmi | 36 |
2 files changed, 58 insertions, 0 deletions
diff --git a/tasks/andreag.gmi b/tasks/andreag.gmi
new file mode 100644
index 0000000..6132b56
--- /dev/null
+++ b/tasks/andreag.gmi
@@ -0,0 +1,22 @@
+# Andrea tasks
+
+## Tags
+
+* kanban: andreag
+* assigned: andreag
+* status: in progress
+
+## Notes
+
+=> https://issues.genenetwork.org
+
+## Tasks
+
+### Meta-tasks
+
+
+* [ ] Pjotr should give root access on all nodes
+* [ ] Move /gnu to new partition on Oct01 and update nfs /etc/exports
+* [ ] /dev/sdc1 is giving errors on Oct03 (XFS)
+ - Disk /dev/sdc: 3.7 TiB, Disk model: Samsung SSD 870
+* [ ] visit all lizardfs drives, remove USB (see /etc/lizardfs; fdisk -l)
diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi
new file mode 100644
index 0000000..6f44433
--- /dev/null
+++ b/topics/systems/hpc/octopus-maintenance.gmi
@@ -0,0 +1,36 @@
+# Octopus Maintenance
+
+## Slurm
+
+Status of slurm
+
+```
+sinfo
+sinfo -R
+squeue
+```
+
+we have draining nodes, but no jobs running on them
+
+Reviving draining node (as root)
+
+```
+scontrol
+ update NodeName=octopus05 State=DOWN Reason="undraining"
+ update NodeName=octopus05 State=RESUME
+ show node octopus05
+```
+
+Kill time can lead to drain state
+
+```
+scontrol show config | grep kill
+UnkillableStepProgram = (null)
+UnkillableStepTimeout = 60 sec
+```
+
+check valid configuration with `slurmd -C` and update nodes with
+
+```
+scontrol reconfigure
+```
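The "Reviving draining node" steps in octopus-maintenance.gmi are typed into an interactive `scontrol` session. As a rough sketch, the same three steps can be issued as one-shot `scontrol` commands from a small shell function. The function name `undrain_node` and the `RUN` switch are hypothetical and not part of the commit; the node name and the "undraining" reason are taken from the notes above. By default the function only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: undrain_node and RUN are illustrative names, not from the commit.
# Prints the scontrol commands that mirror the interactive session in the notes.
# Set RUN=1 to execute them instead of printing (requires root or SlurmUser).
undrain_node() {
    node="$1"
    for args in \
        "update NodeName=$node State=DOWN Reason=undraining" \
        "update NodeName=$node State=RESUME" \
        "show node $node"
    do
        if [ "${RUN:-0}" = "1" ]; then
            # Word splitting of $args is intentional: each word is one scontrol argument.
            scontrol $args
        else
            echo "scontrol $args"
        fi
    done
}
```

Calling `undrain_node octopus05` prints the three commands for review; running it with `RUN=1` set in the environment would execute them against the live cluster, so the dry run is the safer default.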