# Octopus Maintenance
## Slurm
Check the status of Slurm:
```
sinfo
sinfo -R
squeue
```
We sometimes have nodes stuck in the draining state even though no jobs are running on them.
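To double-check that a draining node really is idle before reviving it (the node name is only an example):
```shell
sinfo -R                 # list drained/draining nodes together with the recorded reason
squeue -w octopus05      # should print no jobs for a node that is safe to revive
```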
Reviving a draining node (as root); the `update` and `show` commands below are entered at the interactive `scontrol` prompt:
```
scontrol
update NodeName=octopus05 State=DOWN Reason="undraining"
update NodeName=octopus05 State=RESUME
show node octopus05
```
A too-short kill timeout can also lead to the drain state: Slurm drains a node when it fails to kill a job step's processes within `UnkillableStepTimeout`.
```
scontrol show config | grep kill
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
```
Check that the node configuration is valid with `slurmd -C`, then tell the daemons to re-read the configuration with
```
scontrol reconfigure
```
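If job steps regularly take longer than 60 seconds to die, one option is to raise the timeout in `slurm.conf` on the controller (the value below is only an example) and then run `scontrol reconfigure` as above:
```
# /etc/slurm/slurm.conf (example value)
UnkillableStepTimeout=120
```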
## Password management
We use a script that deploys files from octopus01 (the head node) to the other nodes. Unfortunately the user/group IDs in `/etc/passwd` do not match across nodes, so we cannot simply copy that file yet.
See `/etc/nodes` for the script, the SSH files, the sudoers configuration, etc.
In short, the root user can copy files across to the nodes.
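A minimal sketch of such a deploy script, assuming a hard-coded node list and passwordless root SSH between the nodes (the node names are illustrative; this is not the actual script in `/etc/nodes`):
```shell
#!/bin/sh
# deploy.sh - copy a file from octopus01 to the same path on every node (run as root)
# usage: ./deploy.sh /etc/hosts
set -e
FILE="$1"
for NODE in octopus02 octopus03 octopus04 octopus05; do
    scp -p "$FILE" "root@${NODE}:${FILE}"
done
```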
## Execute binaries on mounted devices
To avoid `./scratch/script.sh: Permission denied` errors when the device backing `/scratch` (`device_file` below) is mounted with `noexec`:
- `sudo bash`
- `ls /scratch -l` to check where `/scratch` is
- `vim /etc/fstab`
- replace `noexec` with `exec` in the entry for `device_file` (see the example `fstab` entry below)
- `mount -o remount [device_file]` to remount the partition with its new configuration.
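A hypothetical before/after entry; the device, filesystem, and options are illustrative, not the actual octopus `fstab`:
```
# before
/dev/sdb1  /scratch  ext4  defaults,noexec  0  2
# after: binaries on /scratch may now be executed
/dev/sdb1  /scratch  ext4  defaults,exec    0  2
```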
Some notes on getting the NFS mounts working on a node (tux09 as the example):
```shell
root@tux09:~# mkdir -p /var/lib/nfs/statd
root@tux09:~# systemctl enable rpcbind
Synchronizing state of rpcbind.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable rpcbind
root@tux09:~# systemctl list-unit-files | grep -E 'rpc-statd.service'
rpc-statd.service    static    -
```
The `/etc/fstab` entry relies on `network-online.target` and `x-systemd.device-timeout=` so that the mount waits for the network and cannot hang the boot:
```
10.0.0.110:/export/3T /mnt/3T nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
```
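With `x-systemd.automount`, the mount can be activated without a reboot; systemd derives the unit name from the mount point (a sketch, assuming the `/mnt/3T` entry above):
```shell
systemctl daemon-reload           # regenerate mount units from the new fstab entry
systemctl start mnt-3T.automount  # unit name is derived from the /mnt/3T mount point
ls /mnt/3T                        # first access triggers the actual NFS mount
```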
## Installation of `munge` and `slurm` on a new node
Current nodes in the pool have:
```shell
munge --version
munge-0.5.13 (2017-09-26)
sbatch --version
slurm-wlm 18.08.5-2
```
To install `munge`, go to `octopus01` and run:
```shell
guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
systemctl status munge # to check if the service is running and where its service file is
```
We need to set up the rights for `munge` on the new node:
```shell
sudo bash
addgroup --gid 900 munge
adduser --uid 900 --gid 900 --gecos "" --disabled-password munge
sed 's,/home/munge:/bin/bash,/var/lib/munge:/usr/sbin/nologin,g' /etc/passwd -i
mkdir -p /var/lib/munge
chown munge:munge /var/lib/munge/
mkdir -p /etc/munge
# copy `munge.key` (from a working node) to `/etc/munge/munge.key`
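# e.g., from the new node (host name illustrative, assuming root SSH to a working node):
#   scp root@octopus01:/etc/munge/munge.key /etc/munge/munge.key
chmod 0400 /etc/munge/munge.key # the key must not be readable by group/other, or munged will refuse to start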
chown -R munge:munge /etc/munge
mkdir -p /run/munge
chown munge:munge /run/munge
mkdir -p /var/log/munge
chown munge:munge /var/log/munge
mkdir -p /var/run/munge # todo: on most systems /var/run is a symlink to /run, so this is probably redundant
chown munge:munge /var/run/munge
# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service`
systemctl daemon-reload
systemctl enable munge
systemctl start munge
systemctl status munge
```
To test the new installation, go to `octopus01` and run:
```shell
munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
```
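If `munge` is set up correctly on both ends, `unmunge` should report `STATUS: Success (0)` together with the decoded credential metadata.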
To install `slurm`, go to `octopus01` and run:
```shell
guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
```
We need to set up the rights for `slurm` on the new node:
```shell
sudo bash
addgroup --gid 901 slurm
adduser --uid 901 --gid 901 --gecos "" --no-create-home --disabled-password slurm
sed 's,/home/slurm:/bin/bash,/var/lib/slurm:/bin/bash,g' /etc/passwd -i
mkdir -p /var/lib/slurm
chown slurm:slurm /var/lib/slurm/
mkdir -p /etc/slurm
# copy `slurm.conf` to `/etc/slurm/slurm.conf`
# copy `cgroup.conf` to `/etc/slurm/cgroup.conf`
chown -R slurm:slurm /etc/slurm
mkdir -p /run/slurm
chown slurm:slurm /run/slurm
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm
# copy `slurm.service` to `/etc/systemd/system/slurm.service`
/export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information
systemctl daemon-reload
systemctl enable slurm
systemctl start slurm
systemctl status slurm
```
On `octopus01` (the master):
```shell
sudo bash
# add the new node to `/etc/slurm/slurm.conf`
systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
```
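The added `NodeName` line usually mirrors what `slurmd -C` printed on the new node, plus the partition it should join; the names and numbers below are placeholders, not the real octopus configuration:
```
# /etc/slurm/slurm.conf on octopus01 (placeholder values)
NodeName=tux08 CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=main Nodes=octopus[02-05],tux08 Default=YES MaxTime=INFINITE State=UP
```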