# Octopus Maintenance

## Slurm

Check the status of Slurm:

```
sinfo
sinfo -R
squeue
```

Sometimes we have nodes stuck in the draining state even though no jobs are running on them.
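To confirm this for a specific node (octopus05 is just an example name):

```
sinfo -R                  # drained/down nodes with their reasons
sinfo -N -l -n octopus05  # node-oriented view of the node's state
squeue -w octopus05       # should list no running jobs
```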

Reviving a draining node (as root):

```
scontrol
  update NodeName=octopus05 State=DOWN Reason="undraining"
  update NodeName=octopus05 State=RESUME
  show node octopus05
```
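Conversely, to take a node out of the pool before maintenance (same example node, as root):

```
scontrol update NodeName=octopus05 State=DRAIN Reason="maintenance"
```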

A job step that cannot be killed within the kill timeout can also put a node into the drain state:

```
scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
```
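If steps regularly take longer than 60 seconds to die, the timeout can be raised in `slurm.conf`; the value below is only an example:

```
UnkillableStepTimeout=120
```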

Check that a node's configuration is valid with `slurmd -C`, and push configuration changes to the nodes with:

```
scontrol reconfigure
```
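For example, to compare what `slurmd` detects on a node against its entry in the configuration (using the profile path from the installation section below):

```
/export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -C | head -n 1  # hardware as detected on this node
grep '^NodeName=' /etc/slurm/slurm.conf                                  # what the configuration claims
```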

## Password management

So we create a script that can deploy files from octopus01 (the head node). Unfortunately the ids in /etc/passwd do not match across nodes, so we can't copy that file yet.

See /etc/nodes for the deployment script, ssh files, sudoers, etc.

Basically, the root user can copy files across to the nodes.
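A minimal sketch of such a deploy step, run as root on octopus01 (the node names and the deployed file are placeholders, not the actual script in /etc/nodes):

```shell
#!/bin/bash
# hypothetical node list; the real script and node list live under /etc/nodes
for node in octopus02 octopus03 octopus04; do
    scp -p /etc/sudoers "root@${node}:/etc/sudoers"
done
```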

## Execute binaries on mounted devices

To avoid `./scratch/script.sh: Permission denied` when `device_file` is mounted `noexec`:

- `sudo bash`
- `ls /scratch -l` to check where `/scratch` is
- `vim /etc/fstab`
- replace `noexec` with `exec` for `device_file` (see the example after this list)
- `mount -o remount [device_file]` to remount the partition with its new configuration.
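For example, the relevant `/etc/fstab` line might change as follows (device name and filesystem type are made up for illustration):

```
# before
/dev/sdb1  /scratch  ext4  defaults,noexec  0  2
# after
/dev/sdb1  /scratch  ext4  defaults,exec    0  2
```

followed by `mount -o remount /scratch`.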

Some notes from setting up NFS mounts (on tux09):

```
root@tux09:~# mkdir -p /var/lib/nfs/statd
root@tux09:~# systemctl enable rpcbind
Synchronizing state of rpcbind.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable rpcbind
root@tux09:~# systemctl list-unit-files | grep -E 'rpc-statd.service'
rpc-statd.service                                                         static          -
```

The mount waits for the network via `network-online.target` and `x-systemd.device-timeout=`; the corresponding `/etc/fstab` entry is:

```
10.0.0.110:/export/3T  /mnt/3T  nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
```


## Installation of `munge` and `slurm` on a new node

Current nodes in the pool have:

```shell
munge --version
    munge-0.5.13 (2017-09-26)
sbatch --version
    slurm-wlm 18.08.5-2
```

To install `munge`, go to `octopus01` and run:

```shell
guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm

systemctl status munge # to check if the service is running and where its service file is
```

We need to set up the user and permissions for `munge`:

```shell
sudo bash

addgroup -gid 900 munge
adduser -uid 900 -gid 900 --disabled-password munge

sed 's,/home/munge:/bin/bash,/var/lib/munge:/usr/sbin/nologin,g' /etc/passwd -i

mkdir -p /var/lib/munge
chown munge:munge /var/lib/munge/

mkdir -p /etc/munge
# copy `munge.key` (from a working node) to `/etc/munge/munge.key`
chown -R munge:munge /etc/munge

mkdir -p /run/munge
chown munge:munge /run/munge

mkdir -p /var/log/munge
chown munge:munge /var/log/munge

mkdir -p /var/run/munge # note: /var/run is usually a symlink to /run, so this may duplicate the step above
chown munge:munge /var/run/munge

# copy `munge.service` (from a working node) to `/etc/systemd/system/munge.service`

systemctl daemon-reload
systemctl enable munge
systemctl start munge
systemctl status munge
```

To test the new installation, go to `octopus01` and then:

```shell
munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
```
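You can also do a purely local round trip on the new node first (assuming `munge` and `unmunge` come from the same profile):

```shell
/export/octopus01/guix-profiles/slurm-2-link/bin/munge -n | /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
```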

If you get `STATUS: Rewound credential (16)`, the clocks on the encoding and decoding hosts disagree. To fix it, log into the new machine and set the time with:

```shell
sudo date MMDDhhmmYYYY.ss
```
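For example, to set the clock to 2024-06-25 13:30:00 (an illustrative timestamp, not a real one):

```shell
sudo date 062513302024.00
```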

To install `slurm`, go to `octopus01` and run:

```shell
guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
```

We need to set up the user and permissions for `slurm`:

```shell
sudo bash

addgroup -gid 901 slurm
adduser -uid 901 -gid 901 --no-create-home --disabled-password slurm

sed 's,/home/slurm:/bin/bash,/var/lib/slurm:/bin/bash,g' /etc/passwd -i

mkdir -p /var/lib/slurm
chown slurm:slurm /var/lib/slurm/

mkdir -p /etc/slurm
# copy `slurm.conf` to `/etc/slurm/slurm.conf`
# copy `cgroup.conf` to `/etc/slurm/cgroup.conf`

chown -R slurm:slurm /etc/slurm

mkdir -p /run/slurm
chown slurm:slurm /run/slurm

mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm

# copy `slurm.service` to `/etc/systemd/system/slurm.service`

/export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information

systemctl daemon-reload
systemctl enable slurm
systemctl start slurm
systemctl status slurm
```

On `octopus01` (the master):

```shell
sudo bash

# add the new node to `/etc/slurm/slurm.conf`

systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
```