Diffstat (limited to 'topics/octopus')
-rw-r--r--  topics/octopus/lizardfs/README.gmi            13
-rw-r--r--  topics/octopus/maintenance.gmi                98
-rw-r--r--  topics/octopus/recent-rust.gmi                76
-rw-r--r--  topics/octopus/set-up-guix-for-new-users.gmi  38
-rw-r--r--  topics/octopus/slurm-upgrade.gmi              89
5 files changed, 312 insertions, 2 deletions
diff --git a/topics/octopus/lizardfs/README.gmi b/topics/octopus/lizardfs/README.gmi
index 78316ef..7c91136 100644
--- a/topics/octopus/lizardfs/README.gmi
+++ b/topics/octopus/lizardfs/README.gmi
@@ -86,14 +86,23 @@ Other commands can be found with `man lizardfs-admin`.
## Deleted files
-Lizardfs also keeps deleted files, by default for 30 days. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with:
+Lizardfs also keeps deleted files, by default for 30 days in `/mnt/lizardfs-meta/trash`. If you need to recover deleted files (or delete them permanently) then the metadata directory can be mounted with:
```
$ mfsmount /path/to/unused/mount -o mfsmeta
```
For more information see the lizardfs documentation online
-=> https://dev.lizardfs.com/docs/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory
+=> https://lizardfs-docs.readthedocs.io/en/latest/adminguide/advanced_configuration.html#trash-directory lizardfs documentation for the trash directory
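+
+To actually recover a file, the usual LizardFS/MooseFS mechanism is to move it from the trash into the special undel/ subdirectory of the trash; removing it from the trash deletes it permanently. A minimal sketch, assuming the meta mount from above is at /mnt/lizardfs-meta and using a made-up trash entry name:
+```
+# trash entries encode the original path in their file name
+ls /mnt/lizardfs-meta/trash | less
+# undelete: move the entry into trash/undel/ (hypothetical entry name)
+mv '/mnt/lizardfs-meta/trash/0000002A|home|user|data.txt' /mnt/lizardfs-meta/trash/undel/
+# or delete permanently by removing it from the trash
+rm '/mnt/lizardfs-meta/trash/0000002A|home|user|data.txt'
+```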
+
+## Start lizardfs-mount (the LizardFS client mount service) after a system reboot
+
+```
+sudo bash
+systemctl daemon-reload
+systemctl restart lizardfs-mount
+systemctl status lizardfs-mount
+```
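+
+To confirm the filesystem actually came back after restarting the unit, a quick check like the following should do (the grep pattern is an assumption; adjust it to how the mount shows up locally):
+```
+# list active lizardfs/mfs fuse mounts
+mount | grep -i mfs
+# and check that the shared space is visible
+df -h | grep -i mfs
+```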
## Gotchas
diff --git a/topics/octopus/maintenance.gmi b/topics/octopus/maintenance.gmi
new file mode 100644
index 0000000..65ea52e
--- /dev/null
+++ b/topics/octopus/maintenance.gmi
@@ -0,0 +1,98 @@
+# Octopus/Tux maintenance
+
+## To remember
+
+`fdisk -l` to see disk models
+`lsblk -nd` to list the disks the OS can see (whole disks only, no header)
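+
+To get the same information in one view, something like this lsblk invocation works (column names from util-linux):
+```
+# -n: no header, -d: whole disks only, -o: pick the columns to print
+lsblk -nd -o NAME,SIZE,MODEL,SERIAL
+```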
+
+## Status
+
+octopus02
+- Devices: 2 3.7T SSDs + 2 894.3G SSDs + 2 4.6T HDDs
+- **Status: Slurm not OK, LizardFS not OK**
+- Notes:
+ - `octopus02 mfsmount[31909]: can't resolve master hostname and/or portname (octopus01:9421)`,
+  - **2 drives that are physically installed are not visible to the OS**
+
+octopus03
+- Devices: 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: **2 drives that are physically installed are not visible to the OS**
+
+octopus04
+- Devices: 4 7.3T SSDs (Neil) + 1 4.6T HDD + 1 3.7T SSD + 2 894.3G SSDs
+- Status: Slurm not OK, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus05
+- Devices: 1 7.3T SSD (Neil) + 5 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: no
+
+octopus06
+- Devices: 1 7.3T SSD (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus07
+- Devices: 1 7.3T SSD (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: **1 device that is physically installed is not visible to the OS**
+
+octopus08
+- Devices: 1 7.3T SSD (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus09
+- Devices: 1 7.3T SSD (Neil) + 1 4.6T HDD + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: no
+
+octopus10
+- Devices: 1 7.3T SSD (Neil) + 4 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK (we don't share the HDD)
+- Notes: **1 device that is physically installed is not visible to the OS**
+
+octopus11
+- Devices: 1 7.3T SSD (Neil) + 5 3.7T SSDs + 2 894.3G SSDs
+- Status: Slurm OK, LizardFS OK
+- Notes: no
+
+tux05
+- Devices: 1 3.6T NVMe + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS OK (we don't share anything)
+- Notes: **I don't have a picture to confirm which devices are physically installed**
+
+tux06
+- Devices: 2 3.6T SSDs (1 from Neil) + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS (we don't share anything)
+- Notes:
+  - **The last picture shows 1 7.3T SSD (Neil) that is now missing**
+  - **Disk /dev/sdc: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+  - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+
+tux07
+- Devices: 3 3.6T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS
+- Notes:
+ - **Disk /dev/sdb: 3.64 TiB (Samsung SSD 990): free and usable for lizardfs**
+ - **Disk /dev/sdd: 3.64 TiB (Samsung SSD 990): mounted at /mnt/sdb and shared on LIZARDFS: TO CHECK BECAUSE IT HAS NO PARTITIONS**
+
+tux08
+- Devices: 3 3.6T SSDs + 1 1.5T NVMe (Neil) + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS
+- Notes: no
+
+tux09
+- Devices: 1 3.6T SSD + 1 1.5T NVMe + 1 894.3G NVMe
+- Status: Slurm OK, LizardFS
+- Notes: **1 device that is physically installed is not visible to the OS**
+
+## Neil disks
+- four 8TB SSDs on the right of octopus04
+- one 8TB SSD in the left slot of octopus05
+- six 8TB SSDs, one in the bottom-right slot of each of octopus06, 07, 08, 09, 10 and 11
+- one 4TB NVMe and one 8TB SSD on tux06; the NVMe is in the bottom-right of the group of 4 on the left, the SSD in the bottom-left of the group of 4 on the right
+- one 4TB NVMe on tux07, on the top-left of the group of 4 on the right
+- one 4TB NVMe on tux08, on the top-left of the group of 4 on the right
diff --git a/topics/octopus/recent-rust.gmi b/topics/octopus/recent-rust.gmi
new file mode 100644
index 0000000..7ce8968
--- /dev/null
+++ b/topics/octopus/recent-rust.gmi
@@ -0,0 +1,76 @@
+# Use a recent Rust on Octopus
+
+For impg we currently need a Rust that is more recent than what we have in Debian
+or Guix. No panic: Rust has few requirements and is easy to install per user.
+
+Install the latest Rust using the rustup installer script:
+
+```
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
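+
+If you prefer a non-interactive install (handy when repeating this on several nodes), the installer accepts a -y flag; a sketch:
+```
+# -s -- -y passes -y to the installer so it accepts the defaults without prompting
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+```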
+
+Set up your PATH by sourcing the cargo environment file:
+
+```
+. ~/.cargo/env
+```
+
+Select (and update to) the latest stable toolchain:
+
+```
+rustup default stable
+```
+
+Putting it together, a run on octopus01 looks like this:
+
+```
+octopus01:~/tmp/impg$ . ~/.cargo/env
+octopus01:~/tmp/impg$ rustup default stable
+info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu'
+info: latest update on 2025-05-15, rust version 1.87.0 (17067e9ac 2025-05-09)
+info: downloading component 'cargo'
+info: downloading component 'clippy'
+info: downloading component 'rust-docs'
+info: downloading component 'rust-std'
+info: downloading component 'rustc'
+(...)
+```
+
+and build the package
+
+```
+octopus01:~/tmp/impg$ cargo build
+```
+
+Since we are not building inside Guix, the binary links against the local (Debian) system libraries:
+
+```
+octopus01:~/tmp/impg$ ldd target/debug/impg
+ linux-vdso.so.1 (0x00007ffdb266a000)
+ libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe404001000)
+ librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe403ff7000)
+ libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe403fd6000)
+ libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe403fd1000)
+ libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe403e11000)
+ /lib64/ld-linux-x86-64.so.2 (0x00007fe404682000)
+```
+
+After logging in on another octopus node, say octopus02, you can run impg from that same directory:
+
+```
+octopus02:~$ ~/tmp/impg/target/debug/impg
+Command-line tool for querying overlaps in PAF files
+
+Usage: impg <COMMAND>
+
+Commands:
+ index Create an IMPG index
+ partition Partition the alignment
+ query Query overlaps in the alignment
+ stats Print alignment statistics
+
+Options:
+ -h, --help Print help
+ -V, --version Print version
+```
diff --git a/topics/octopus/set-up-guix-for-new-users.gmi b/topics/octopus/set-up-guix-for-new-users.gmi
new file mode 100644
index 0000000..f459559
--- /dev/null
+++ b/topics/octopus/set-up-guix-for-new-users.gmi
@@ -0,0 +1,38 @@
+# Set up Guix for new users
+
+This document describes how to set up Guix for new users on a machine in which Guix is already installed (such as octopus01).
+
+## Create a per-user profile for yourself by running your first guix pull
+
+"Borrow" some other user's guix to run guix pull. In the example below, we use root's guix, but it might as well be any guix.
+```
+$ /var/guix/profiles/per-user/root/current-guix/bin/guix pull
+```
+This should create your very own Guix profile at ~/.config/guix/current. You may invoke guix from this profile as
+```
+$ ~/.config/guix/current/bin/guix ...
+```
+But, you'd normally want to make this more convenient. So, add ~/.config/guix/current/bin to your PATH. To do this, add the following to your ~/.profile
+```
+GUIX_PROFILE=~/.config/guix/current
+. $GUIX_PROFILE/etc/profile
+```
+Thereafter, you may run any guix command simply as
+```
+$ guix ...
+```
+
+## Pulling from a different channels.scm
+
+By default, guix pull pulls the latest commit of the main upstream Guix channel. You may want to pull from additional channels as well. Put the channels you want into ~/.config/guix/channels.scm, and then run guix pull. For example, here's a channels.scm if you want to use the guix-bioinformatics channel.
+```
+$ cat ~/.config/guix/channels.scm
+(list (channel
+ (name 'gn-bioinformatics)
+ (url "https://git.genenetwork.org/guix-bioinformatics")
+ (branch "master")))
+```
+And then run:
+```
+$ guix pull
+```
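+
+After the pull finishes, you can check which channels and commits your new profile was built from with:
+```
+$ guix describe
+```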
diff --git a/topics/octopus/slurm-upgrade.gmi b/topics/octopus/slurm-upgrade.gmi
new file mode 100644
index 0000000..822f68e
--- /dev/null
+++ b/topics/octopus/slurm-upgrade.gmi
@@ -0,0 +1,89 @@
+# How to upgrade slurm on octopus
+
+This document closely mirrors the official upgrade guide, which is very thorough. Please refer to it, and update this document if something here is unclear.
+=> https://slurm.schedmd.com/upgrades.html Official slurm upgrade guide
+
+## Preparation
+
+It is possible to upgrade slurm in-place without upsetting running jobs. But, for our small cluster, we don't mind a little downtime. So, it is simpler if we schedule some downtime with other users and make sure there are no running jobs.
+
+slurm can only be upgraded safely in small version increments. For example, it is safe to upgrade version 18.08 to 19.05 or 20.02, but not to 20.11 or later. This compatibility information is in the RELEASE_NOTES file of the slurm git repo with the git tag corresponding to the version checked out. Any configuration file changes are also outlined in this file.
+=> https://github.com/SchedMD/slurm/ slurm git repository
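+
+To see which version you are starting from (and hence how large a jump the upgrade is), ask the installed daemons, for example:
+```
+$ sinfo --version
+$ slurmctld -V
+$ slurmdbd -V
+```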
+
+## Backup
+
+Stop the slurmdbd, slurmctld, slurmd and slurmrestd services.
+```
+# systemctl stop slurmdbd slurmctld slurmd slurmrestd
+```
+Backup the slurm StateSaveLocation (/var/spool/slurmd/ctld in our case) and the slurm configuration directory.
+```
+# cp -av /var/spool/slurmd/ctld /somewhere/safe/
+# cp -av /etc/slurm /somewhere/safe/
+```
+Backup the slurmdbd MySQL database. Enter the password when prompted. The password is specified in StoragePass of /etc/slurm/slurmdbd.conf.
+```
+$ mysqldump -u slurm -p --databases slurm_acct_db > /somewhere/safe/slurm_acct_db.sql
+```
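+
+If you ever need to roll back, the dump can be restored along these lines (an untested sketch; because of --databases the dump already contains the CREATE DATABASE and USE statements, and the slurm MySQL user is assumed to have sufficient privileges):
+```
+$ mysql -u slurm -p < /somewhere/safe/slurm_acct_db.sql
+```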
+
+## Upgrade slurm on octopus01 (the head node)
+
+Clone the gn-machines git repo.
+```
+$ git clone https://git.genenetwork.org/gn-machines
+```
+Edit slurm.scm to build the version of slurm you are upgrading to. Ensure it builds successfully using
+```
+$ guix build -f slurm.scm
+```
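+
+guix build prints the resulting /gnu/store path on stdout. It is worth noting it down now, because the worker nodes need it later and cannot run guix build themselves; for example, capture it in a shell variable (the variable name is just illustrative):
+```
+$ slurm_store_path=$(guix build -f slurm.scm)
+$ echo $slurm_store_path
+```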
+Upgrade slurm.
+```
+# ./slurm-head-deploy.sh
+```
+Make any configuration file changes outlined in RELEASE_NOTES. Next, run the slurmdbd daemon, wait for it to start up successfully and then exit with Ctrl+C. During upgrades, slurmdbd may take extra time to update the database. This may cause systemd to timeout and kill slurmdbd. So, we do it this way, instead of simply starting the slurmdbd systemd service.
+```
+# sudo -u slurm slurmdbd -D
+```
+Reload the new systemd configuration files. Then, start the slurmdbd, slurmctld, slurmd and slurmrestd services one at a time, ensuring that each starts up correctly before proceeding to the next.
+```
+# systemctl daemon-reload
+# systemctl start slurmdbd
+# systemctl start slurmctld
+# systemctl start slurmd
+# systemctl start slurmrestd
+```
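+
+Before moving on to the workers, a quick sanity check that the controller and the accounting database are healthy, for example:
+```
+$ sinfo
+$ sacctmgr show cluster
+```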
+
+## Upgrade slurm on the worker nodes
+
+Repeat the steps below on every worker node.
+
+Stop the slurmd service.
+```
+# systemctl stop slurmd
+```
+Upgrade slurm, passing slurm-worker-deploy.sh the slurm store path obtained from building slurm using guix build on octopus01. Recall that you cannot invoke guix build on the worker nodes.
+```
+# ./slurm-worker-deploy.sh /gnu/store/...-slurm
+```
+Copy over any configuration file changes from octopus01. Then, reload the new systemd configuration files and start slurmd.
+```
+# systemctl daemon-reload
+# systemctl start slurmd
+```
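+
+From octopus01, check that the node has rejoined the cluster and is not stuck in a drained or down state, for example:
+```
+$ sinfo -N -l
+$ scontrol show node octopus02
+```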
+
+## Tip: Running the same command on all worker nodes
+
+It is a lot of typing to run the same command on all worker nodes. You could make this a little less cumbersome with the following bash for loop.
+```
+for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09;
+do
+ ssh $node your command
+done
+```
+You can even do this for sudo commands using sudo's -S flag, which makes it read the password from stdin. Assuming your password is in the pass password manager, the bash for loop would then look like:
+```
+for node in octopus02 octopus03 octopus05 octopus06 octopus07 octopus08 octopus09 octopus10 octopus11 tux05 tux06 tux07 tux08 tux09;
+do
+ pass octopus | ssh $node sudo -S your command
+done
+```
\ No newline at end of file