-rw-r--r-- tasks/andreag.gmi | 22
-rw-r--r-- topics/systems/hpc/octopus-maintenance.gmi | 36
2 files changed, 58 insertions, 0 deletions
diff --git a/tasks/andreag.gmi b/tasks/andreag.gmi
new file mode 100644
index 0000000..6132b56
--- /dev/null
+++ b/tasks/andreag.gmi
@@ -0,0 +1,22 @@
+# Andrea tasks
+
+## Tags
+
+* kanban: andreag
+* assigned: andreag
+* status: in progress
+
+## Notes
+
+=> https://issues.genenetwork.org
+
+## Tasks
+
+### Meta-tasks
+
+
+* [ ] Pjotr should give root access on all nodes
+* [ ] Move /gnu to a new partition on Oct01 and update the NFS /etc/exports
+* [ ] /dev/sdc1 is giving errors on Oct03 (XFS)
+ - Disk /dev/sdc: 3.7 TiB, Disk model: Samsung SSD 870
+* [ ] visit all lizardfs drives, remove USB (see /etc/lizardfs; fdisk -l)
diff --git a/topics/systems/hpc/octopus-maintenance.gmi b/topics/systems/hpc/octopus-maintenance.gmi
new file mode 100644
index 0000000..6f44433
--- /dev/null
+++ b/topics/systems/hpc/octopus-maintenance.gmi
@@ -0,0 +1,36 @@
+# Octopus Maintenance
+
+## Slurm
+
+Check the status of Slurm:
+
+```
+sinfo
+sinfo -R
+squeue
+```
+
+We have draining nodes, but no jobs are running on them.
+
+To revive a draining node (as root):
+
+```
+scontrol
+ update NodeName=octopus05 State=DOWN Reason="undraining"
+ update NodeName=octopus05 State=RESUME
+ show node octopus05
+```
+
+A kill timeout can put a node into the drain state. Check the current settings:
+
+```
+scontrol show config | grep kill
+UnkillableStepProgram = (null)
+UnkillableStepTimeout = 60 sec
+```
+
+Check the configuration detected on a node with `slurmd -C` and update the nodes with:
+
+```
+scontrol reconfigure
+```