Diffstat (limited to 'topics/octopus')

 -rw-r--r--  topics/octopus/lizardfs/lizard-maintenance.gmi (renamed from topics/octopus/lizardfs/README.gmi)  113
 -rw-r--r--  topics/octopus/maintenance.gmi                   98
 -rw-r--r--  topics/octopus/moosefs/moosefs-maintenance.gmi  252
 -rw-r--r--  topics/octopus/octopussy-needs-love.gmi         266
 -rw-r--r--  topics/octopus/recent-rust.gmi                   76
 -rw-r--r--  topics/octopus/set-up-guix-for-new-users.gmi     38
 -rw-r--r--  topics/octopus/slurm-upgrade.gmi                 89

7 files changed, 929 insertions, 3 deletions
diff --git a/topics/octopus/lizardfs/README.gmi b/topics/octopus/lizardfs/lizard-maintenance.gmi
index 78316ef..a34ef3e 100644
--- a/topics/octopus/lizardfs/README.gmi
+++ b/topics/octopus/lizardfs/lizard-maintenance.gmi
@@ -1,4 +1,4 @@
-# Information about lizardfs, and some usage suggestions
+# Lizard maintenance

On the octopus cluster the lizardfs head node is on octopus01, with disks being added mainly from the other nodes. SSDs are added to the lizardfs-chunkserver.service systemd service and HDDs are added to the lizardfs-chunkserver-hdd.service. The storage pool is available on all nodes at /lizardfs, with the default storage option of "slow", which corresponds to two copies of the data, both on SSDs.

@@ -73,6 +73,17 @@ Chunks deletion state:
 2ssd      7984      -        -     -     -     -     -    -    -     -     -
 ```
+This table essentially says that slow and fast are replicating data (if they are in column 0 it is OK!). This looks good for fast:
+
+```
+Chunks replication state:
+	Goal	0	1	2	3	4	5	6	7	8	9	10+
+	slow	-	137461	448977	-	-	-	-	-	-	-	-
+	fast	6133152	-	5	-	-	-	-	-	-	-	-
+```
+
To query how the individual disks are filling up and if there are any errors:

List all disks

@@ -83,17 +94,62 @@ lizardfs-admin list-disks octopus01 9421 | less

Other commands can be found with `man lizardfs-admin`. 
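Checks like the ones above can also be scripted for monitoring. Below is a minimal sketch that sums all chunks still short of their replication goal; it assumes the `lizardfs-admin chunks-health` output format shown above, and the function name is illustrative, not part of lizardfs:

```shell
# Sum all chunks that have not yet reached their replication goal
# (everything to the right of column "0" in the replication table).
# Usage: lizardfs-admin chunks-health octopus01 9421 | under_replicated
under_replicated() {
  awk '/^Chunks replication state:/ { in_table = 1; next }
       /^Chunks/                    { in_table = 0 }
       in_table && $1 != "Goal" {
         for (i = 3; i <= NF; i++)   # $1 = goal name, $2 = column "0"
           if ($i != "-") sum += $i
       }
       END { print sum + 0 }'
}
```

A non-zero result means the master is still busy replicating; a monitoring job (e.g. under sheepdog) could alert when the number grows rather than shrinks.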
+## Info
+
+```
+lizardfs-admin info octopus01 9421
+LizardFS v3.12.0
+Memory usage: 2.5GiB
+
+Total space: 250TiB
+Available space: 10TiB
+Trash space: 510GiB
+Trash files: 188
+Reserved space: 21GiB
+Reserved files: 18
+FS objects: 7369883
+Directories: 378782
+Files: 6858803
+Chunks: 9100088
+Chunk copies: 20017964
+Regular copies (deprecated): 20017964
+```
+
+```
+lizardfs-admin chunks-health octopus01 9421
+Chunks availability state:
+	Goal	Safe	Unsafe	Lost
+	slow	1323220	1	-
+	fast	6398524	-	5
+
+Chunks replication state:
+	Goal	0	1	2	3	4	5	6	7	8	9	10+
+	slow	-	218663	1104558	-	-	-	-	-	-	-	-
+	fast	6398524	-	5	-	-	-	-	-	-	-	-
+
+Chunks deletion state:
+	Goal	0	1	2	3	4	5	6	7	8	9	10+
+	slow	-	104855	554911	203583	76228	39425	19348	8659	3276	20077	292859
+	fast	6380439	18060	30	-	-	-	-	-	-	-	-
+```

## Deleted files

-Lizardfs also keeps deleted files, by default for 30 days. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with:
+Lizardfs also keeps deleted files, by default for 30 days, in `/mnt/lizardfs-meta/trash`. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with:

```
$ mfsmount /path/to/unused/mount -o mfsmeta
```

For more information see the lizardfs documentation online

-=> https://dev.lizardfs.com/docs/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory
+=> https://lizardfs-docs.readthedocs.io/en/latest/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory
+
+## Start lizardfs-mount (the lizardfs mount daemon) after a system reboot
+
+```
+sudo bash
+systemctl daemon-reload
+systemctl restart lizardfs-mount
+systemctl status lizardfs-mount
+```

## Gotchas

@@ -179,3 +235,54 @@ KeyringMode=inherit
[Install]
WantedBy=multi-user.target
```
+
+## To deplete and remove a drive in LizardFS
+
+**1. 
Mark the chunkserver (or specific disk) for removal**
+
+Edit the chunkserver's disk configuration file (typically `/etc/lizardfs/mfshdd.cfg`) and prefix the drive path with an asterisk:
+
+```
+*/mnt/disk_to_remove
+```
+
+**2. Restart the chunkserver process on the node**
+
+```bash
+systemctl stop lizardfs-chunkserver
+systemctl start lizardfs-chunkserver
+```
+
+**3. Monitor the evacuation progress**
+
+The master will begin migrating chunks off the marked drive. You can monitor progress with:
+
+```bash
+lizardfs-admin list-disks octopus01 9421
+lizardfs-admin list-disks octopus01 9421 | grep 172.23.19.59 -A 7
+172.23.19.59:9422:/mnt/sdc/lizardfs_vol/
+	to delete: yes
+	damaged: no
+	scanning: no
+	last error: no errors
+	total space: 3.6TiB
+	used space: 3.4TiB
+	chunks: 277k
+```
+
+Look for the disk marked "to delete: yes". Its chunk count should decrease over time as data is replicated elsewhere.
+
+You can also check the CGI web interface if you have it running - it shows disk status and chunk counts.
+
+**4. Remove the drive once empty**
+
+Once all chunks have been evacuated (the disk shows 0 chunks or is marked as empty), you can safely:
+
+1. Remove the line from `mfshdd.cfg` entirely
+2. Restart the chunkserver (or reload its configuration)
+3. 
Physically remove or repurpose the drive + +**Important notes:** +- Ensure you have enough free space on other disks to absorb the migrating chunks +- The evacuation time depends on the amount of data and network/disk speed +- Don't forcibly remove a drive before evacuation completes, or you risk data loss if replication goals aren't met diff --git a/topics/octopus/maintenance.gmi b/topics/octopus/maintenance.gmi new file mode 100644 index 0000000..00cc575 --- /dev/null +++ b/topics/octopus/maintenance.gmi @@ -0,0 +1,98 @@ +# Octopus/Tux maintenance + +## To remember + +`fdisk -l` to see disk models +`lsblk -nd` to see mounted disks + +## Status + +octopus02 +- Devices: 2 3.7T SSDs + 2 894.3G SSDs + 2 4.6T HDDs +- **Status: Slurm not OK, LizardFS not OK** +- Notes: + - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`, + - **I don't see 2 drives that are physically mounted** + +octopus03 +- Devices: 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK +- Notes: **I don't see 2 drives that are physically mounted** + +octopus04 +- Devices: 4 7.3 T SSDs (Neil) + 1 4.6T HDD + 1 3.7T SSD + 2 894.3G SSDs +- Status: Slurm NO, LizardFS OK (we don't share the HDD) +- Notes: no + +octopus05 +- Devices: 1 7.3 T SSDs (Neil) + 5 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK +- Notes: no + +octopus06 +- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK (we don't share the HDD) +- Notes: no + +octopus07 +- Devices: 1 7.3 T SSDs (Neil) + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK +- Notes: **I don't see 1 device that is physically mounted** + +octopus08 +- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK (we don't share the HDD) +- Notes: no + +octopus09 +- Devices: 1 7.3 T SSDs (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs +- Status: Slurm OK, LizardFS OK (we don't share the HDD) +- 
Notes: no
+
+octopus10
+- Devices: 1 7.3T SSD (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: **I don't see 1 device that is physically mounted**
+
+octopus11
+- Devices: 1 7.3T SSD (Neil) + 5 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: no
+
+tux05
+- Devices: 1 3.6T NVMe + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS OK (we don't share anything)
+- Notes: **I don't have a picture to confirm physically mounted devices**
+
+tux06
+- Devices: 2 3.6T SSDs (1 from Neil) + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS OK (we don't share anything)
+- Notes:
+  - **Last picture reports 1 7.3T SSD (Neil) that is missing**
+  - **Disk /dev/sdc: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+  - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+
+tux07
+- Devices: 3 3.6T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS OK
+- Notes:
+  - **Disk /dev/sdb: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+  - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): mounted at /mnt/sdb and shared on LizardFS: TO CHECK BECAUSE IT HAS NO PARTITIONS**
+
+tux08
+- Devices: 3 3.6T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS OK
+- Notes: no
+
+tux09
+- Devices: 1 3.6T SSD + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS OK
+- Notes: **I don't see 1 device that is physically mounted**
+
+## Neil disks
+
+- four 8TB SSDs on the right of octopus04
+- one 8TB SSD in the left slot of octopus05
+- six 8TB SSDs in the bottom-right slot of octopus06,07,08,09,10,11
+- one 4TB NVMe and one 8TB SSD on tux06, NVMe in the bottom-right of the group of 4 on the left, SSD on the bottom-left of the group of 4 on the right
+- one 4TB NVMe on tux07, on the top-left of the group of 4 on the right
+- one 4TB NVMe on tux08, on the top-left of the group of 4 on the right
diff --git 
a/topics/octopus/moosefs/moosefs-maintenance.gmi b/topics/octopus/moosefs/moosefs-maintenance.gmi new file mode 100644 index 0000000..1032cde --- /dev/null +++ b/topics/octopus/moosefs/moosefs-maintenance.gmi @@ -0,0 +1,252 @@ +# Moosefs + +We use moosefs as a network distributed storage system with redundancy. The setup is to use SSDs for fast access and spinning storage for redundancy/backups (in turn these are in RAID5 configuration). In addition we'll experiment with a non-redundant fast storage access using the fastest drives and network connections. + +# Configuration + +## Ports + +We should use different ports than lizard. Lizard uses 9419-24 by default. So let's use +9519- ports. + +* 9519 for moose meta logger +* 9520 for chunk server connections +* 9521 for mount connections +* 9522 for slow HDD chunks (HDD) +* 9523 for replicating SSD chunks (SSD) +* 9524 for fast non-redundant SSD chunks (FAST) + +## Topology + +Moosefs uses topology to decide where to fetch data. We can host the slow spinning HDD drives in a 'distant' location, so that data is fetched last. + +## Disks + +Some disks are slower than others. To test we can do: + +``` +root@octopus03:/export# dd if=/dev/zero of=test1.img bs=1G count=1 +1+0 records in +1+0 records out +1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.20529 s, 487 MB/s +/sbin/sysctl -w vm.drop_caches=3 +root@octopus03:/export# dd if=test1.img of=/dev/null bs=1G count=1 +1+0 records in +1+0 records out +1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.649035 s, 1.7 GB/s +rm test1.img +``` + +Above is on a RAID5 setup. Typical values are: + +``` + Write Read +Octopus Dell NVME 1.2 GB/s 2.0 GB/s +Octopus03 RAID5 487 MB/s 1.7 GB/s +Octopus01 RAID5 127 MB/s 163 MB/s +Samsung SSD 870 408 MB/s 565 MB/s +``` + +``` +mfs#octopus03:9521 3.7T 4.0G 3.7T 1% /moosefs-fast +``` + +## Command line + +``` +. 
/usr/local/guix-profiles/moosefs/etc/profile +mfscli -H octopus03 -P 9521 -SCS +``` + +## Config + +``` +root@octopus03:/etc/mfs# diff example/mfsexports.cfg.sample mfsexports.cfg +2c2,4 +< * / rw,alldirs,admin,maproot=0:0 +--- +> 172.23.21.0/24 / rw,alldirs,maproot=0,ignoregid +> 172.23.22.0/24 / rw,alldirs,maproot=0,ignoregid +> 172.23.17.0/24 / rw,alldirs,maproot=0,ignoregid +``` + +``` +root@octopus03:/etc/mfs# diff example/mfsmaster.cfg.sample mfsmaster.cfg +4a5,10 +> ## Only one metadata server in LizardFS shall have 'master' personality. +> PERSONALITY = master +> +> ## Password for administrative connections and commands. +> ADMIN_PASSWORD = nolizard +> +6c12 +< # WORKING_USER = nobody +--- +> WORKING_USER = mfs +9c15 +< # WORKING_GROUP = +--- +> WORKING_GROUP = mfs +27c33 +< # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs +--- +> DATA_PATH = /export/var/lib/mfs +34c40 +< # EXPORTS_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfsexports.cfg +--- +> EXPORTS_FILENAME = /etc/mfs/mfsexports.cfg +87c93 +< # MATOML_LISTEN_PORT = 9419 +--- +> MATOML_LISTEN_PORT = 9519 +103c109 +< # MATOCS_LISTEN_PORT = 9420 +--- +> MATOCS_LISTEN_PORT = 9520 +219c225 +< # MATOCL_LISTEN_PORT = 9421 +--- +> MATOCL_LISTEN_PORT = 9521 +``` + +``` +root@octopus03:/etc/mfs# cat mfsgoals.cfg +# safe - 2 copies, 1 on slow disk, 1 on fast disk +11 slow: HDD SSD + +# Fast storage - 1 copy on fast disks, no redundancy +12 fast: FAST +``` + +``` ++++ b/mfs/mfschunkserver-fast.cfg + # user to run daemon as (default is nobody) +-# WORKING_USER = nobody ++WORKING_USER = mfs + + # group to run daemon as (optional - if empty then default user group will be used) +-# WORKING_GROUP = ++WORKING_GROUP = mfs + + # name of process to place in syslog messages (default is mfschunkserver) + # SYSLOG_IDENT = mfschunkserver +@@ -28,6 +28,7 @@ + + # where to store daemon lock file (default is 
/gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs) + # DATA_PATH = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/var/mfs ++DATA_PATH=/var/lib/mfs + + # when set to one chunkserver will not abort start even when incorrect entries are found in 'mfshdd.cfg' file + # ALLOW_STARTING_WITH_INVALID_DISKS = 0 +@@ -41,6 +42,7 @@ + + # alternate location/name of mfshdd.cfg file (default is /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg); this +file will be re-read on each process reload, regardless if the path was changed + # HDD_CONF_FILENAME = /gnu/store/yg0xb1g9mls04h4085kmfbbg8z36a7c2-moosefs-4.58.3/etc/mfs/mfshdd.cfg ++HDD_CONF_FILENAME = /etc/mfs/mfsdisk-fast.cfg + + # speed of background chunk tests in MB/s per disk (formally entry defined in mfshdd.cfg). Value can be given as a decimal number (default is +1.0) + # deprecates: HDD_TEST_FREQ (if HDD_TEST_SPEED is not defined, but there is redefined HDD_TEST_FREQ, then HDD_TEST_SPEED = 10 / HDD_TEST_FREQ) +@@ -109,10 +111,10 @@ + # BIND_HOST = * + + # MooseFS master host, IP is allowed only in single-master installations (default is mfsmaster) +-# MASTER_HOST = mfsmaster ++MASTER_HOST = octopus03 + + # MooseFS master command port (default is 9420) +-# MASTER_PORT = 9420 ++MASTER_PORT = 9520 + + # timeout in seconds for master connections. 
Value >0 forces given timeout, but when value is 0 then CS asks master for timeout (default is 0 - ask master)
+ # MASTER_TIMEOUT = 0
+@@ -134,5 +136,5 @@
+ # CSSERV_LISTEN_HOST = *
+
+ # port to listen for client (mount) connections (default is 9422)
+-# CSSERV_LISTEN_PORT = 9422
++CSSERV_LISTEN_PORT = 9524
+```
+
+```
++++ b/mfs/mfsmount.cfg
+mfsmaster=octopus03,nosuid,nodev,noatime,nosuid,mfscachemode=AUTO,mfstimeout=30,mfswritecachesize=2048,mfsreadaheadsize=2048,mfsport=9521
+/moosefs-fast
+```
+
+## systemd
+
+```
+root@octopus03:/etc# cat systemd/system/moosefs-master.service
+[Unit]
+Description=MooseFS master server daemon
+Documentation=man:mfsmaster
+After=network.target
+Wants=network-online.target
+
+[Service]
+Type=forking
+TimeoutSec=0
+ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -d start -c /etc/mfs/mfsmaster.cfg -x
+ExecStop=/usr/local/guix-profiles/moosefs/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg stop
+ExecReload=/bin/kill -HUP $MAINPID
+User=mfs
+Group=mfs
+Restart=on-failure
+RestartSec=60
+OOMScoreAdjust=-999
+
+[Install]
+WantedBy=multi-user.target
+```
+
+```
+cat systemd/system/moosefs-mount.service
+[Unit]
+Description=Moosefs mounts
+After=syslog.target network.target
+
+[Service]
+Type=forking
+TimeoutSec=600
+ExecStart=/usr/local/guix-profiles/moosefs/bin/mfsmount -c /etc/mfs/mfsmount.cfg
+ExecStop=/usr/bin/umount /moosefs-fast
+
+[Install]
+WantedBy=multi-user.target
+```
+
+```
+root@octopus04:/etc# cat systemd/system/moosefs-chunkserver-fast.service
+[Unit]
+Description=MooseFS Chunkserver (Fast)
+After=network.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/guix-profiles/moosefs/sbin/mfschunkserver -f -c /etc/mfs/mfschunkserver-fast.cfg
+User=mfs
+Group=mfs
+Restart=on-failure
+RestartSec=5
+LimitNOFILE=65535
+
+[Install]
+WantedBy=multi-user.target
+```
diff --git a/topics/octopus/octopussy-needs-love.gmi b/topics/octopus/octopussy-needs-love.gmi
new file mode 100644
index 0000000..8c6315d
--- /dev/null
+++ b/topics/octopus/octopussy-needs-love.gmi
@@ -0,0 +1,266 @@
+# Octopussy needs love
+
+At UTHSC, Memphis, TN, around October 2020 Efraim and I installed Octopus on Debian+Guix with lizard as the distributed network storage system and slurm for job control. Around October 2023 we added 5 Genoa tux05-09 machines, doubling the cluster in size. See
+
+=> https://genenetwork.org/gn-docs/facilities
+
+Octopus made a lot of work possible that we can't really do on larger HPCs and led to a bunch of high-impact studies and publications, particularly on pangenomics.
+
+In the coming period we want to replace lizard with moosefs. Lizard is no longer maintained, and as it was a fork of Moose, it is only logical to go forward with that one. We also looked at Ceph, but apparently Ceph is not great for systems that carry no redundancy. So far, lizard has been using redundancy, but we figure we can do without if the occasional (cheap) SSD goes bad.
+
+We also need to look at upgrading some of the Dell BIOSes - particularly tux05-09 - as they can be occasionally problematic with non-OEM SSDs.
+
+On the worker nodes it may be wise to upgrade Debian, followed by an upgrade to the head nodes and other supporting machines. Even though we rely on Guix for the latest and greatest, there may be good upgrades in the underlying Linux kernel and drivers.
+
+Our Slurm setup is up-to-date because we run it completely on Guix and Arun supports the latest and greatest.
+
+Another thing we ought to fix is to introduce centralized user management. So far we have had few users and just got by. 
But sometimes it bites us that users have different UIDs on the nodes.
+
+## Architecture overview
+
+* O1 is the old head node hosting lizardfs - will move to compute
+* O2 is the old backup hosting the lizardfs shadow - will move to compute
+* O3 is the new head node hosting moosefs
+* O4 is the backup head node hosting the moosefs shadow - will act as a compute node too
+
+All the other nodes are for compute. O1 and O4 will be the last nodes to remain on older Debian. They will handle the last bits of lizard.
+
+# Tasks
+
+* [X] Create moosefs package
+* [X] Install moosefs
+* [X] Upgrade BIOS (all tuxes)
+* [ ] Migrate lizardfs nodes to moosefs (one at a time)
+* [ ] Add server monitoring with sheepdog
+* [ ] Upgrade Debian
+  - [ ] Maybe, just maybe, boot the nodes from a central server
+* [ ] Introduce centralized user management
+
+# Progress
+
+## Lizardfs and Moosefs
+
+Our Lizard documentation lives at
+
+=> lizardfs/README
+
+Efraim wrote a lizardfs package for Guix at the time in guix-bioinformatics, but we ended up deploying with Debian. Going back now, the package does not look too taxing (I think we dropped it because the Guix system configuration did not play well).
+
+=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/packages/file-systems.scm
+
+Looking at the Debian package
+
+=> https://salsa.debian.org/debian/moosefs
+
+It carries no special patches, but a few nice hints in *.README.debian. I think it is worth trying to write a Guix package so we can easily upgrade (even on an aging Debian). Future proofing is key. 
The following built moosefs in a guix shell:
+
+```
+guix shell -C -D -F coreutils make autoconf automake fuse libpcap zlib pkg-config python libtool gcc-toolchain
+autoreconf -f -i
+make
+```
+
+Next I created a guix package that installs with:
+
+```
+guix build -L ~/guix-bioinformatics -L ~/guix-past/modules moosefs
+```
+
+See
+
+=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=236903baaab0f84f012a55700c1917265a2b701c
+
+Next stop: testing and deploying!
+
+## Choosing a head node
+
+Currently octopus01 is the head node. It probably is a good idea to change that, so we can safely upgrade the new server. The first choice would be octopus02 (o2). We can mirror the moose daemons on octopus01 (o1) later. Let's see what that looks like.
+
+A quick assessment of o1 shows that we have 14T storage on o1 that takes care of /home and /gnu. But only 1.2T is used.
+
+o2 also has quite a few disks (up 1417 days!), but a bunch of SSDs appear to error out. E.g.
+
+```
+Sep 04 07:44:56 octopus02 mfschunkserver[22766]: can't create lock file /mnt/sdd1/lizardfs_vol/.lock, marking hdd as damaged: Input/output error
+UUID=277c05de-64f5-48a8-8614-8027a53be212 /mnt/sdd1 xfs rw,exec,nodev,noatime,nodiratime,largeio,inode64 0 1
+```
+
+Lizard also complains that 4 SSDs have been wiped out.
+We'll need to reboot the server to see what storage may still work. The slurm connection appears to be misconfigured:
+
+```
+[2025-12-20T09:36:27.846] error: service_connection: slurm_receive_msg: Insane message length
+[2025-12-20T09:36:28.415] error: unpackstr_xmalloc: Buffer to be unpacked is too large (1700881509 > 1073741824)
+[2025-12-20T09:36:28.415] error: unpacking header
+[2025-12-20T09:36:28.415] error: destroy_forward: no init
+[2025-12-20T09:36:28.415] error: slurm_receive_msg_and_forward: [[nessus6.uthsc.edu]:35553] failed: Message receive failure
+```
+
+It looks like Andrea is the only one using the machine right now, though some others are logged in. 
Before rebooting I'll block users, ask Andrea to move off, and deplete slurm and lizard. But o2 is a large RAM machine, so we should not use that as a head node. + +Let's take a look at o3. This one has less RAM. Flavia is running some tools, but I don't think the machine is really used right now. Slurm is running, but shows similar configuration issues as o2. Let's take a look at slurm + +=> ../systems/hpc/octopus-maintenance +=> ../hpc/octopus/slurm-user-guide + +Alright, I depleted and removed slurm from o3. I think it would be wise to also deplete the lizard drives on that machine. + +The big users on lizard are: + +``` +1.6T dashbrook +1.8T pangenomes +2.1T erikg +3.4T aruni +3.4T junh +8.4T hchen +9.2T salehi +13T guarracino +16T flaviav +``` + +it seems we can clean some of that up! We have some backup storage that we can use. Alternatively move to ISAAC. + +We'll slowly start depleting the lizard. See also + +=> lizardfs/README + +O3 has 4 lizard drives. We'll start by depleting one. + + +# O2 + +``` +172.23.22.159:9422:/mnt/sde1/lizardfs_vol/ + to delete: no + damaged: yes + scanning: no + last error: no errors + total space: 0B + used space: 0B + chunks: 0 +172.23.22.159:9422:/mnt/sdd1/lizardfs_vol/ + to delete: no + damaged: yes + scanning: no + last error: no errors + total space: 0B + used space: 0B + chunks: 0 +172.23.22.159:9422:/mnt/sdc1/lizardfs_vol/ + to delete: no + damaged: yes + scanning: no + last error: no errors + total space: 0B + used space: 0B + chunks: 0 +``` + +Stopped the chunk server. +sde remounted after xfs_repair. The others were not visible, so rebooted. 
The following storage should add to the total again:
+
+```
+/dev/sdc1  4.6T  3.9T  725G   85%  /mnt/sdc1
+/dev/sdd1  4.6T  4.2T  428G   91%  /mnt/sdd1
+/dev/sdf1  4.6T  4.2T  358G   93%  /mnt/sdf1
+/dev/sde   3.7T  3.7T  4.0G  100%  /mnt/sde
+/dev/sdg1  3.7T  3.7T  3.9G  100%  /mnt/sdg1
+```
+
+After adding this storage, and people removing material, it starts to look better:
+
+```
+mfs#octopus01:9421  171T  83T  89T  49%  /lizardfs
+```
+
+# O3
+
+I have marked the disks (4x4T) on o3 for deletion - that will subtract 7T. This is in preparation for upgrading Linux and migrating those disks to moosefs. Continue below.
+
+# T5
+
+T5 requires a new BIOS - it has the same one as the unreliable T4. I also need to see if there are any disks in the BIOS we don't see right now. T5 has two small fast SSDs and one larger one (3.5T).
+
+I managed to install the new BIOS, but I had trouble getting into Linux because of some network/driver issues. ipmi was suspect. I finally managed rescue mode by adding 'systemd.unit=emergency.target' to the grub line. 'single' is no longer enough (grrr). One to keep in mind.
+
+Had to disable the ipmi modules. See my idrac.org.
+
+# T6
+
+Tux06 (T6) contains two unused drives that appear to have contained XFS. xfs_repair did not really help...
+The BIOS on T6 is newer than on T4+T5. That probably explains why the higher T numbers have no disk issues, while T4+T5 had problems with non-OEM drives! Anyway, as I was at it, I updated the BIOS for all.
+
+T6 has 4 SSDs, 2x 3.5T. Both unused. The lizard chunk server is failing, so we might as well disable it.
+
+I am using T6 to test network boots because it is not serving lizard.
+
+# T7
+
+On T7 root was full(!?). The culprit was Andrea with /tmp/sweepga_genomes_111850/.
+T7 has 3x3.5T with one unused.
+
+# T8
+
+T8 has 3x3.5T, all used. After the BIOS upgrade the efi partition did not boot. After a few reboots it did get into grub and I made a copy of the efi partition on sdd (just in case).
+
+# T9
+
+T9 has 1x3.5T. Used. 
I had to reduce HDD_LEAVE_SPACE_DEFAULT to give the chunkserver some air.
+
+# O3 + O4
+
+Back to O3, our future head node. lizard has mostly been depleted, though every drive has a few chunks left. I just pulled down the chunkserver and lizard appears to be fine (no errors). Good!
+
+Next, install Linux. I have two routes: one is using debootstrap, the other is via PXE. I want to try the latter.
+
+So far, I managed to boot into ipxe on Octopus.
+The linux kernel loads over http, but it does not show output. Likely I need to:
+
+* [X] Build ipxe with serial support
+* [X] Test the installer with serial support
+* [X] Add NFS support
+* [X] debootstrap install of new Debian on /export/nfs/nodes/debian14
+* [X] Make available through NFS and boot through IPXE
+
+I managed to boot T6 over the network.
+Essentially we have a running Debian last stable on T6 that is completely run over NFS!
+In the next steps I need to figure out:
+
+* [X] Mount NFS with root access
+* [ ] Every PXE node needs its own hard disk configuration
+* [ ] Mount NFS from octopus01
+* [ ] Start slurm
+
+We can have this as a test node pretty soon.
+But first we have to start moosefs and migrate data.
+
+I am doing some small tests and will put (old) T6 back on slurm again.
+
+To get every node booted with its own version of fstab and state logging on a local disk we need to pull some tricks with initrd.
+
+Basically the NFS boot initrd needs to contain a script that invokes changes for every node. The node hostname and primary partition can be passed on from ipxe using kernel parameters such as myhost=client01 localdisk=/dev/sda1. So that is the differentiator. The script in /etc/nodes/initramfs-tools/update-node-etc will remount /tmp and /var onto $localdisk and copy /etc there too. Next it will symlink a few files, such as /etc/hostname and /etc/fstab, to adjust for local settings.
+
+This way we will deploy all nodes centrally. 
One aspect is that we don't need dynamic user management as it is centrally orchestrated! The user files can be copied from the head node when they change. + +O4 is going to be the backup head node. It will act as a compute node too, until we need it as the head node. O4 is currently not on the slurm queue. + +* [X] Update guix on O1 +* [X] Install guix moosefs +* [X] Start moosefs master on O3 +* [X] Start moosefs metalogger on O4 +* [ ] Check moosefs logging facilities +* [ ] See if we can mark drives so it is easier to track them +* [ ] Test broken (?) /dev/sdf on octopus03 + +We can start moose master on O3. We should use different ports than lizard. Lizard uses 9419-24 by default. So let's use +9519- ports. See + +=> moosefs/moosefs-maintenance.gmi + +# P2 + +Penguin2 has 80T of spinning disk storage. We are going to use that for redundancy. Basically these disks get a moosefs goal of HDD 'slow' and we'll configure them on a remote rack - so chunks get fetched from local chunk servers (first). This will gain us 40T of immediate storage. Adding more spinning disks will free up SSDs further. + +* [X] P2 Update Guix +* [X] Install moosefs +* [ ] Create HDD chunk server diff --git a/topics/octopus/recent-rust.gmi b/topics/octopus/recent-rust.gmi new file mode 100644 index 0000000..7ce8968 --- /dev/null +++ b/topics/octopus/recent-rust.gmi @@ -0,0 +1,76 @@ +# Use a recent Rust on Octopus + + +For impg we currently need a rust that is more recent than what we have in Debian +or Guix. No panic, because Rust has few requirements. + +Install latest rust using the script + +``` +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +``` + +Set path + +``` +. ~/.cargo/env +``` + +Update rust + +``` +rustup default stable +``` + +Next update Rust + +``` +octopus01:~/tmp/impg$ . 
~/.cargo/env
+octopus01:~/tmp/impg$ rustup default stable
+info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu'
+info: latest update on 2025-05-15, rust version 1.87.0 (17067e9ac 2025-05-09)
+info: downloading component 'cargo'
+info: downloading component 'clippy'
+info: downloading component 'rust-docs'
+info: downloading component 'rust-std'
+info: downloading component 'rustc'
+(...)
+```
+
+and build the package
+
+```
+octopus01:~/tmp/impg$ cargo build
+```
+
+Since we are not in guix we get the local dependencies:
+
+```
+octopus01:~/tmp/impg$ ldd target/debug/impg
+	linux-vdso.so.1 (0x00007ffdb266a000)
+	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe404001000)
+	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe403ff7000)
+	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe403fd6000)
+	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe403fd1000)
+	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe403e11000)
+	/lib64/ld-linux-x86-64.so.2 (0x00007fe404682000)
+```
+
+Log in on another octopus - say 02 - and you can run impg from this directory:
+
+```
+octopus02:~$ ~/tmp/impg/target/debug/impg
+Command-line tool for querying overlaps in PAF files
+
+Usage: impg <COMMAND>
+
+Commands:
+  index      Create an IMPG index
+  partition  Partition the alignment
+  query      Query overlaps in the alignment
+  stats      Print alignment statistics
+
+Options:
+  -h, --help     Print help
+  -V, --version  Print version
+```
diff --git a/topics/octopus/set-up-guix-for-new-users.gmi b/topics/octopus/set-up-guix-for-new-users.gmi
new file mode 100644
index 0000000..f459559
--- /dev/null
+++ b/topics/octopus/set-up-guix-for-new-users.gmi
@@ -0,0 +1,38 @@
+# Set up Guix for new users
+
+This document describes how to set up Guix for new users on a machine on which Guix is already installed (such as octopus01).
+
+## Create a per-user profile for yourself by running your first guix pull
+
+"Borrow" some other user's guix to run guix pull. 
In the example below, we use root's guix, but it might as well be any guix. +``` +$ /var/guix/profiles/per-user/root/current-guix/bin/guix pull +``` +This should create your very own Guix profile at ~/.config/guix/current. You may invoke guix from this profile as +``` +$ ~/.config/guix/current/bin/guix ... +``` +But, you'd normally want to make this more convenient. So, add ~/.config/guix/current/bin to your PATH. To do this, add the following to your ~/.profile +``` +GUIX_PROFILE=~/.config/guix/current +. $GUIX_PROFILE/etc/profile +``` +Thereafter, you may run any guix command simply as +``` +$ guix ... +``` + +## Pulling from a different channels.scm + +By default, guix pull pulls the latest commit of the main upstream Guix channel. You may want to pull from additional channels as well. Put the channels you want into ~/.config/guix/channels.scm, and then run guix pull. For example, here's a channels.scm if you want to use the guix-bioinformatics channel. +``` +$ cat ~/.config/guix/channels.scm +(list (channel + (name 'gn-bioinformatics) + (url "https://git.genenetwork.org/guix-bioinformatics") + (branch "master"))) +``` +And, +``` +$ guix pull +``` diff --git a/topics/octopus/slurm-upgrade.gmi b/topics/octopus/slurm-upgrade.gmi new file mode 100644 index 0000000..822f68e --- /dev/null +++ b/topics/octopus/slurm-upgrade.gmi @@ -0,0 +1,89 @@ +# How to upgrade slurm on octopus + +This document closely mirrors the official upgrade guide. The official upgrade guide is very thorough. Please refer to it and update this document if something is not clear. +=> https://slurm.schedmd.com/upgrades.html Official slurm upgrade guide + +## Preparation + +It is possible to upgrade slurm in-place without upsetting running jobs. But, for our small cluster, we don't mind a little downtime. So, it is simpler if we schedule some downtime with other users and make sure there are no running jobs. + +slurm can only be upgraded safely in small version increments. 
For example, it is safe to upgrade version 18.08 to 19.05 or 20.02, but not to 20.11 or later. This compatibility information is in the RELEASE_NOTES file of the slurm git repo, with the git tag corresponding to the target version checked out. Any configuration file changes are also outlined in this file.
=> https://github.com/SchedMD/slurm/ slurm git repository

## Backup

Stop the slurmdbd, slurmctld, slurmd and slurmrestd services.
```
# systemctl stop slurmdbd slurmctld slurmd slurmrestd
```
Back up the slurm StateSaveLocation (/var/spool/slurmd/ctld in our case) and the slurm configuration directory.
```
# cp -av /var/spool/slurmd/ctld /somewhere/safe/
# cp -av /etc/slurm /somewhere/safe/
```
Back up the slurmdbd MySQL database. Enter the password when prompted. The password is specified in StoragePass of /etc/slurm/slurmdbd.conf.
```
$ mysqldump -u slurm -p --databases slurm_acct_db > /somewhere/safe/slurm_acct_db.sql
```

## Upgrade slurm on octopus01 (the head node)

Clone the gn-machines git repo.
```
$ git clone https://git.genenetwork.org/gn-machines
```
Edit slurm.scm to build the version of slurm you are upgrading to. Ensure it builds successfully using
```
$ guix build -f slurm.scm
```
Upgrade slurm.
```
# ./slurm-head-deploy.sh
```
Make any configuration file changes outlined in RELEASE_NOTES. Next, run the slurmdbd daemon in the foreground, wait for it to start up successfully, and then exit with Ctrl+C. During upgrades, slurmdbd may take extra time to update the database; this may cause systemd to time out and kill slurmdbd. That is why we do it this way instead of simply starting the slurmdbd systemd service.
```
# sudo -u slurm slurmdbd -D
```
Reload the new systemd configuration files. Then start the slurmdbd, slurmctld, slurmd and slurmrestd services one at a time, ensuring that each starts up correctly before proceeding to the next.
```
# systemctl daemon-reload
# systemctl start slurmdbd
# systemctl start slurmctld
# systemctl start slurmd
# systemctl start slurmrestd
```

## Upgrade slurm on the worker nodes

Repeat the steps below on every worker node.

Stop the slurmd service.
```
# systemctl stop slurmd
```
Upgrade slurm, passing slurm-worker-deploy.sh the slurm store path obtained from building slurm with guix build on octopus01. Recall that you cannot invoke guix build on the worker nodes.
```
# ./slurm-worker-deploy.sh /gnu/store/...-slurm
```
Copy over any configuration file changes from octopus01. Then reload the new systemd configuration files and start slurmd.
```
# systemctl daemon-reload
# systemctl start slurmd
```

## Tip: Running the same command on all worker nodes

It is a lot of typing to run the same command on all worker nodes. You can make this a little less cumbersome with the following bash for loop.
```
for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09;
do
    ssh $node your command
done
```
You can even do this for sudo commands using the -S flag of sudo, which makes it read the password from stdin. Assuming your password is in the pass password manager, the bash for loop would then look like:
```
for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09;
do
    pass octopus | ssh $node sudo -S your command
done
```
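
## Tip: Sanity-checking an upgrade path

The "small version increments" rule from the Preparation section can also be checked mechanically. The following is a minimal POSIX shell sketch, not an official tool: the release list is a hand-maintained assumption (extend it from RELEASE_NOTES as new slurm versions appear), and the two-release window is our reading of the official upgrade guide.

```shell
#!/bin/sh
# Sketch: flag slurm upgrades that skip more than two releases.
# ASSUMPTION: this release list is hand-maintained; extend it from
# RELEASE_NOTES as new slurm versions are published.
releases="17.02 17.11 18.08 19.05 20.02 20.11 21.08 22.05"

# Print the position of a version in the release list, or -1 if unknown.
index_of() {
    i=0
    for r in $releases; do
        [ "$r" = "$1" ] && { echo "$i"; return; }
        i=$((i + 1))
    done
    echo -1
}

# Report whether upgrading from $1 to $2 stays within two releases.
check_upgrade() {
    from=$(index_of "$1")
    to=$(index_of "$2")
    if [ "$from" -lt 0 ] || [ "$to" -lt 0 ]; then
        echo "unknown version: $1 or $2"
    elif [ $((to - from)) -ge 1 ] && [ $((to - from)) -le 2 ]; then
        echo "safe: $1 -> $2"
    else
        echo "UNSAFE: $1 -> $2 (go via an intermediate release)"
    fi
}

check_upgrade 18.08 20.02   # within two releases: safe
check_upgrade 18.08 20.11   # skips three releases: UNSAFE
```

The two example calls at the end reproduce the rule stated in the Preparation section: 18.08 to 20.02 is fine, while 18.08 to 20.11 must go through an intermediate release.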
