
Installing GNU Guix on the Octopus HPC Cluster

Philosophy

Having run GNU Guix on our machines for some years with great results, we decided to build a new HPC cluster consisting of Dell PowerEdge R6515 machines, each with an AMD EPYC 7402P 24-core processor. Two machines (the head nodes) have 1TB RAM and the other nine machines have 128GB RAM.

Because we own the cluster and have admin rights, we can deploy software as we see fit. The nodes can boot from an on-demand image. We'll allow Docker, Singularity and even VMs to run. Researchers will use our cluster and we want to give them freedom.

To bootstrap the installation we use remote terminal access (serial over LAN via iDRAC) and boot into a system from a Debian 10 live USB stick. Next we start the network and install Debian 10 on the hard disks with debootstrap as a base install. We need to get up to speed quickly and this is the fastest route. In time we may have GNU Guix images which may allow us to skip the Debian install. The design of the system will allow for foreign Docker images and VMs, which means the underlying system can be very lean: a suitable target for GNU Guix.

Initially the base system will be a Debian install hosting shared GNU Guix software packages. We will migrate to a full GNU Guix setup.

Basic layout

Head nodes

We have two head nodes (octopus01 and octopus02) which are nearly identical, though octopus01 will be the main head node and octopus02 the fallback. These are large-memory machines and we will allow people to run VMs on them, separate from Slurm.

Compute nodes

We have nine compute nodes (octopus03-11). Initially we simply create Debian nodes running Slurm. The nodes will support on-demand booting of other installations, including GNU Guix.

Distributed storage

The machines are connected over 10Gb/s Ethernet and we are opting for a software-based distributed storage system spanning all machines.

Bootstrapping

Getting the machines up and running

We use remote out-of-band access through a serial interface over LAN, connecting with ssh. E.g.

ssh idrac@hostname

The first thing to do is check the serial settings:

racadm>>get iDRAC.Serial
[Key=iDRAC.Embedded.1#Serial.1]
BaudRate=57600
Command=
Enable=Enabled
HistorySize=8192
IdleTimeout=300
NoAuth=Disabled

set iDRAC.Serial.BaudRate 115200

You could set a user and password, but leave that for now. Connect to the serial interface (com2, even though we are using ttyS0 throughout):

racadm>>connect com2

You will probably see a blank screen. To leave the serial console, type control-backslash (^\) and reboot:

racadm>>serveraction powercycle
connect com2

Hit ESC+! and ENTER a few times and wait. What we want to do is select the USB boot drive. ESC+! equals F11, which selects the boot manager. It will pop up after 30s, or up to 2 minutes on a large-RAM machine.

Until we have PXE network boot and images we are going to manage installs via USB sticks. On the boot menu make a note of the service tag:

Service Tag: C5R6R53           PowerEdge R6515

Select the One-shot BIOS Boot Menu and choose 'USB 2: U3 Cruzer Micro', which holds a Debian 10 live rescue system with serial access enabled. You won't see the selection menu, so press [ENTER] a few times. Log in and you will see that the network is also up, so you can log in via ssh as well. Check the interfaces with:

ip a

At this stage we can use debootstrap to start installing the machine.

Partition disks

Partition the first drive. You can model it on octopus01, but essentially you are free to do what you want ;). An EFI partition of 500MB, a small swap space of 8GB and a Linux partition of 16GB is about the minimum. The rest of these drives should become part of the distributed store.

cfdisk /dev/sda
  delete the existing partitions
  partition 1, 500MB, EFI (ef)
  partition 2, 8GB, swap (82)
  partition 3, 16GB, Linux (83)
  mark partition 3 bootable
  write and quit
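
The same layout can also be scripted; a sketch using sfdisk (this overwrites the partition table on /dev/sda, so double-check the device first):

sfdisk /dev/sda <<'EOF'
label: dos
size=500MiB, type=ef
size=8GiB,   type=82
size=16GiB,  type=83, bootable
EOF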

We need to install dosfstools in order to format the EFI partition.

apt install dosfstools
mkfs.fat /dev/sda1
mkswap /dev/sda2
mkfs.ext4 /dev/sda3

Then we can mount the partitions in our custom directory, as we would expect them to be mounted once the system is booted.

mkdir /target
mount /dev/sda3 /target
mkdir -p /target/boot/efi
mount /dev/sda1 /target/boot/efi

Debootstrap

debootstrap --include=openssh-server buster /target http://deb.debian.org/debian/

(may take a while)

Make sure the partition is bootable (check with fdisk), then prepare and enter the chroot:

mount -t proc none /target/proc
mount -o bind /dev /target/dev
mount -t sysfs sys /target/sys
env LANG=C.UTF-8 chroot /target /bin/bash

Set locales to include en_US.UTF-8 and install a couple of useful packages. Etckeeper helps keep /etc in version control so changes are easy to see later.

apt install locales
dpkg-reconfigure locales
apt install vim less etckeeper screen tmux
passwd   # set root password, don't forget!

edit /etc/fstab

  /dev/sda3       /       ext4    errors=remount-ro       0 1
  /dev/sda1       /boot/efi   vfat    defaults        0 0
  /dev/sda2       none    swap    sw      0 0

  # IF YOU TYPO HERE AND NEED TO FIX IT AFTER BOOTING INTO IT:
  #mount -o remount,rw / --options-source=disable

Edit the hostname

echo "OctopusXX" > /etc/hostname

edit /etc/apt/sources.list and make sure the package lists are up to date

deb http://deb.debian.org/debian buster main contrib non-free
deb http://security.debian.org/ buster/updates main contrib non-free
apt update

Install the kernel and headers (these are normally missing in the debootstrap target!)

apt-cache search linux-image
apt install linux-image-amd64 linux-source
apt install firmware-linux-free grub2

Edit /etc/default/grub to give serial access (see below) and enable getty@tty1.service by symlinking it to /lib/systemd/system/getty@.service:

ln -s /lib/systemd/system/getty\@.service /etc/systemd/system/getty@tty1.service

Check the OS

cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"

Make a note of the existing grub menu entries

In /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0,115200n8"
GRUB_CMDLINE_LINUX="console=tty1 console=ttyS0,115200n8"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=1 --word=8 --parity=no --stop=1"
GRUB_TERMINAL="console serial"

And update grub

update-grub2

and enable serial in systemd

systemctl enable serial-getty@ttyS0.service
systemctl start serial-getty@ttyS0.service

or create the symlink by hand (this is what systemctl enable does for the serial getty):

mkdir -p /etc/systemd/system/getty.target.wants
ln -s /lib/systemd/system/serial-getty@.service /etc/systemd/system/getty.target.wants/serial-getty@ttyS0.service

Check the grub menu in /boot/grub/grub.cfg to make sure the serial connection is also set to display the grub console. It may not have been set!
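
A quick way to check (the exact lines depend on the grub version):

grep -E 'serial|ttyS0' /boot/grub/grub.cfg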

Set up /etc/network/interfaces to include lo and eno1

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
allow-hotplug eno1
iface eno1 inet dhcp

Exit the chroot; now you are ready to reboot the machine and boot into the installed system.

sync && reboot


Configuring the systems

Munge

Munge is an authentication service that uses a shared cryptographic key
to securely transfer messages between machines in a cluster. Slurm does
the actual workload management. Both systems need to be set up on all
the nodes in the cluster.

In order to make sure that the munge and slurm services are configured
the way we want, it is best to do some of the configuration before
actually installing the packages. The UID and GID of munge and slurm may
differ in other configurations, but they MUST be the same on every node
of the cluster.

groupadd -g 900 munge
useradd -m -c "Munge User" -d /var/lib/munge -u 900 -g munge -s /usr/sbin/nologin munge
groupadd -g 901 slurm
useradd -m -c "Slurm" -d /var/lib/slurm -u 901 -g slurm -s /bin/bash slurm
apt install munge

Installing munge also creates the shared key used for communication. This key needs to be copied to each node in the cluster. Once it is there, the permissions and ownership need to be fixed. (You can either copy the key over first and then run the commands above, or copy the key from the head node to the other nodes afterwards, replacing the key their package install created.)

chmod 0400 /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
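
Distributing the key can be scripted from the head node; a sketch, assuming the hostnames octopus02 through octopus11 and root ssh access:

# push the head node's munge key to the other nodes and fix ownership/permissions there
for h in octopus{02..11}; do
  scp -p /etc/munge/munge.key root@$h:/etc/munge/munge.key
  ssh root@$h 'chown munge:munge /etc/munge/munge.key && chmod 0400 /etc/munge/munge.key'
done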

There are two simple tests to ensure that munge is set up correctly on each node. On each node run:

munge -n | unmunge
munge -n | ssh $OTHER_HOST unmunge

Slurm


groupadd -g 901 slurm
useradd -m -c "Slurm" -d /var/lib/slurm -u 901 -g slurm -s /bin/bash slurm

We have opted not to use Slurm's built-in accounting, so its setup is not documented here.

apt install slurmd slurmctld slurm-client

Configuring Slurm takes a couple of tries, so the included configuration may not be complete. For /etc/slurm-llnl/slurm.conf:

ClusterName=linux
ControlMachine=octopus01
SlurmUser=slurm
SlurmdLogFile=/var/log/slurmd.log
SlurmctldLogFile=/var/log/slurmctld.log
SlurmctldHost=octopus01
AccountingStorageType=accounting_storage/none
StateSaveLocation=/var/spool/slurmd/ctld
ReturnToService=1
DebugFlags=NO_CONF_HASH
#COMPUTE NODES
NodeName=octopus[01-02] CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=1027731
NodeName=octopus[03-11] CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=128595
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
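
The hardware values for the NodeName lines (CPUs, sockets, threads, RealMemory) can be printed on each node in slurm.conf format:

slurmd -C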

And for /etc/slurm-llnl/cgroup.conf

MaxRAMPercent=95

Then we start slurmd and slurmctld

systemctl enable --now slurmd slurmctld

On each of the compute nodes we only need slurm.conf, so we copy it from the head node.

apt install slurmd
cp slurm.conf /etc/slurm-llnl/slurm.conf
chown root:root /etc/slurm-llnl/slurm.conf
mkdir -p /var/spool/slurmd/ctld
chown -R slurm /var/spool/slurmd
# the next two probably aren't needed
touch /var/log/slurmd.log
chown slurm /var/log/slurmd.log
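
Pushing slurm.conf to all the compute nodes can also be scripted from the head node; a sketch, assuming the hostnames and root ssh access:

for h in octopus{03..11}; do
  scp /etc/slurm-llnl/slurm.conf root@$h:/etc/slurm-llnl/slurm.conf
done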

Now it is time to start slurm

systemctl enable --now slurmd

Then on octopus01 we mark the nodes as ready

sudo scontrol reconfigure
sudo scontrol update NodeName=octopus?? State=RESUME

Then test that slurm is working correctly:

srun --ntasks=5 --nodelist=octopus?? --label /bin/hostname

Troubleshooting slurm

There are a couple of things which can go wrong while installing slurm. We want to make sure that slurm is started as the root user and then drops privileges to the slurm user. This allows slurm to run jobs as the person who submits the job, not just as the slurm user.

If a node goes into the 'DRAIN' state, then the fix is to resume it.

sudo scontrol update NodeName=octopus?? State=RESUME
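
Before resuming a node it is worth checking why it was drained:

sinfo -R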

Lizardfs

Lizardfs is a distributed file-system across multiple nodes. It is how we have decided to make shared storage generally available on all the nodes.

Head Node

As always, we start on the head node

apt install lizardfs-master lizardfs-metalogger lizardfs-adm
cp /usr/share/doc/lizardfs-master/examples/mfsmaster.cfg /etc/lizardfs/
cp /usr/share/doc/lizardfs-master/examples/mfsexports.cfg /etc/lizardfs/
cp /usr/share/doc/lizardfs-master/examples/mfsgoals.cfg /etc/lizardfs/
cp /usr/share/doc/lizardfs-metalogger/examples/mfsmetalogger.cfg /etc/lizardfs/

edit /etc/lizardfs/mfsmaster.cfg

    PERSONALITY = master
    MASTER_HOST = octopus01

edit /etc/lizardfs/mfsexports.cfg. We want to make sure that we only export the filesystem to the network blocks we want.

    172.23.17.0/24      /       rw,alldirs,maproot=0,ignoregid
    172.23.18.0/24      /       rw,alldirs,maproot=0,ignoregid
    *                   .       rw

edit /etc/lizardfs/mfsmetalogger.cfg

    MASTER_HOST = octopus01

Then it's time to reload the systemd units and start them.

systemctl daemon-reload
systemctl enable --now lizardfs-master lizardfs-metalogger

We can then use the admin tools to check on the storage

lizardfs-admin info octopus01 9421

If we choose to create a shadow master, ready to take over if something happens to the main node, then all of the configuration is the same with the exception of mfsmaster.cfg

    PERSONALITY = shadow
    MASTER_HOST = octopus01

If the time comes to switch, then "shadow" needs to change to "master" in mfsmaster.cfg, and all occurrences of octopus01 in the metalogger, chunkserver and client configuration need to point to the new master.
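
A sketch of the promotion on the shadow node (assuming the shadow runs on octopus02; repointing the metalogger, chunkservers and clients is a separate step):

sed -i 's/^PERSONALITY = shadow/PERSONALITY = master/' /etc/lizardfs/mfsmaster.cfg
systemctl restart lizardfs-master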

Storage Nodes

On the storage nodes we first format the disks as suggested by the lizardfs documentation. Also install the lizardfs-chunkserver package, which creates the user and group we need for lizardfs. See https://docs.lizardfs.com/docs/adminguide/basic_configuration.html#for-the-chunkservers

apt install lizardfs-chunkserver xfsprogs

use lsblk to make sure /dev/sdb isn't in use for anything

lsblk

create a partition table with /dev/sdb1 as a Linux partition, then format and mount it

cfdisk /dev/sdb
mkfs.xfs /dev/sdb1
mkdir /mnt/sdb1
mount /dev/sdb1 /mnt/sdb1
mkdir /mnt/sdb1/lizardfs_vol
chown -R lizardfs:lizardfs /mnt/sdb1/lizardfs_vol/

discover UUID of /dev/sdb1

lsblk /dev/sdb1 -o +UUID

edit /etc/fstab

    UUID=<uuid> /mnt/sdb1 xfs rw,noexec,nodev,noatime,nodiratime,largeio,inode64 0 1
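
To confirm the fstab entry works, remount via fstab:

umount /mnt/sdb1
mount -a
df -h /mnt/sdb1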

edit /etc/lizardfs/mfshdd.cfg

    # as recommended by Debian's documentation
    /mnt/sdb1/lizardfs_vol

We now need to set up the configuration files for the chunkserver

gzip -k -d /usr/share/doc/lizardfs-chunkserver/examples/mfschunkserver.cfg.gz
mv /usr/share/doc/lizardfs-chunkserver/examples/mfschunkserver.cfg /etc/lizardfs/
cp /usr/share/doc/lizardfs-chunkserver/examples/mfshdd.cfg /etc/lizardfs/

edit /etc/lizardfs/mfschunkserver.cfg

    MASTER_HOST = octopus01

Now that everything is configured we can add it to the storage pool

systemctl daemon-reload
systemctl enable --now lizardfs-chunkserver
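
To confirm the chunkserver has registered with the master, the admin tool can list the chunkservers (assuming the list-chunkservers subcommand is available in this lizardfs-adm version):

lizardfs-admin list-chunkservers octopus01 9421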

If you want to have both SSDs and HDDs in a machine but want lizardfs to treat them differently for storage, then you have to run a second chunkserver with a different LABEL set in its mfschunkserver.cfg. For example, if you were to add HDDs:

Partition and format the HDDs using the instructions above, add them to /etc/fstab, mount them and chown the volume directories to lizardfs. Then:

cp /etc/lizardfs/mfschunkserver.cfg /etc/lizardfs/mfschunkserver_hdd.cfg
cp /etc/lizardfs/mfshdd.cfg /etc/lizardfs/mfshdd_hdd.cfg
mkdir /var/lib/lizardfs_hdd
chown -R lizardfs:lizardfs /var/lib/lizardfs_hdd

edit /etc/lizardfs/mfschunkserver_hdd.cfg
    LABEL = HDD
    DATA_PATH = /var/lib/lizardfs_hdd
    CSSERV_LISTEN_PORT = 9522
    HDD_CONF_FILENAME = /etc/lizardfs/mfshdd_hdd.cfg

edit /etc/lizardfs/mfshdd_hdd.cfg

    change listed paths to HDD mount points
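
For example, if an HDD were mounted at /mnt/sdc1 (a hypothetical path), mfshdd_hdd.cfg would contain just that mount point:

    /mnt/sdc1/lizardfs_vol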

Then we need to create a systemd service for the new chunkserver

cp /lib/systemd/system/lizardfs-chunkserver.service /etc/lizardfs/lizardfs-chunkserver-hdd.service

edit /etc/lizardfs/lizardfs-chunkserver-hdd.service

    ExecStart=/usr/sbin/mfschunkserver -c /etc/lizardfs/mfschunkserver_hdd.cfg -d start

Then it's time to start the service

systemctl daemon-reload
systemctl enable --now /etc/lizardfs/lizardfs-chunkserver-hdd.service

Lizardfs Client

This is what is required to actually use the storage we have configured.

apt install lizardfs-client
cp /usr/share/doc/lizardfs-client/examples/mfsmount.cfg /etc/lizardfs/

edit /etc/lizardfs/mfsmount.cfg so we can tell lizardfs what to mount

    # https://docs.lizardfs.com/docs/adminguide/connectclient.html#optional-settings-for-performance-on-nix
    mfsmaster=octopus01,big_writes,nosuid,nodev,noatime
    /lizardfs

and mount lizardfs with the following command:

mkdir /lizardfs
sudo mfsmount

It may be possible to add the lizardfs mount point to your /etc/fstab

  /lizardfs     lizardfs    fuse    rw,mfsdelayedinit,mfsmaster=octopus01,big_writes,nosuid,nodev,noatime   0 0

To change the replication level you need to choose a configuration from /etc/lizardfs/mfsgoals.cfg. Some examples:

2 2_copies : _ _
19 slow : $ec(3,1) { HDD HDD HDD HDD }
20 fast : $ec(3,1) { SSD SSD SSD SSD }

It is also possible to query the server for all the available goals

lizardfs-admin list-goals octopus01 9421

Goal definitions:
Id      Name    Definition
1       1_copy  1_copy: $std _
2       2_copy  2_copy: $std {_ _}
...
18      erasure_code_3_1        erasure_code_3_1: $ec(3,1) {_ _ _ _}
19      slow    slow: $ec(3,1) {HDD HDD HDD HDD}
20      fast    fast: $std {_ _}
21      2ssd    2ssd: $std {SSD SSD}
...

$std {_ _} means two copies on any two chunkservers, $std {SSD SSD} means two copies on any two SSD-backed chunkservers, and $ec(3,1) {_ _ _ _} means the data is chunked across 4 chunkservers, with 3 data chunks and one parity chunk, so a minimum of 3 chunks is needed to retrieve the data. Erasure coding does slow down writes considerably, but it is a good way to store large amounts of data without taking up a lot of extra space. One option is to copy unchanging data to lizardfs with a faster-writing goal, 2_copy for example, and then change the goal to slow afterwards.
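
A sketch of that copy-then-re-goal workflow (the paths are hypothetical):

lizardfs setgoal -r 2_copy /lizardfs/incoming                # fast writes while copying
cp -a /data/large_dataset /lizardfs/incoming/
lizardfs setgoal -r slow /lizardfs/incoming/large_dataset    # re-replicate as erasure coded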

To change the replication level

lizardfs setgoal 2_copies /lizardfs -r

And to see the replication level:

lizardfs getgoal /lizardfs

/lizardfs: 2_copies

Lizardfs also keeps deleted files, by default for 30 days. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with:

mfsmount /path/to/unused/mount -o mfsmeta
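
A sketch of recovering a file via the metadata mount (the trash layout is described in the documentation linked below; the exact file naming may differ between versions):

mkdir -p /mnt/lizardfs-meta
mfsmount /mnt/lizardfs-meta -o mfsmeta
ls /mnt/lizardfs-meta/trash        # deleted files, with '/' in paths replaced by '|'
# moving an entry into trash/undel restores it to its original location
mv "/mnt/lizardfs-meta/trash/<entry>" /mnt/lizardfs-meta/trash/undel/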

For more information see the lizardfs documentation online https://dev.lizardfs.com/docs/adminguide/advanced_configuration.html#trash-directory