diff options
Diffstat (limited to 'topics')
-rw-r--r-- | topics/octopus/guix.gmi | 81 |
1 files changed, 46 insertions, 35 deletions
diff --git a/topics/octopus/guix.gmi b/topics/octopus/guix.gmi index c9d7362..594b786 100644 --- a/topics/octopus/guix.gmi +++ b/topics/octopus/guix.gmi @@ -2,23 +2,25 @@ We needed to move /gnu on Octopus01 because we ran out of disk space. As it is shared across all nodes through NFS and, in turn, the nodes run services of that directory it required a strategic approach. -Fri, Apr 12, 2024, 18:38:29 - pjotrp: IMPORTANT: all nodes will be depleted tomorrow. Any running jobs will be killed. -Fri, Apr 12, 2024, 18:38:46 - pjotrp: you may want to stop long running jobs today. -Fri, Apr 12, 2024, 19:38:14 - Arun Isaac: Are we moving the store tomorrow? Should I be around? If so, when? I may be away in the day time. -Sat, Apr 13, 2024, 02:59:45 - andreaguarracino: I think so, but I have no idea about the time -Sat, Apr 13, 2024, 05:51:42 - erikg: Spring cleaning!!???!!! -Sat, Apr 13, 2024, 08:22:39 - pjotrp: yup -Sat, Apr 13, 2024, 08:23:10 - pjotrp: Arun Isaac: you may want to test later today when you are back. -Sat, Apr 13, 2024, 09:29:42 - pjotrp: <@erikgarrison:matrix.org "pjotrp: do we have backups of ho..."> /home is on /export and that is a disk array - so we can handle disk failure. Should we backup /home? People have all sorts of stuff there. -Sat, Apr 13, 2024, 09:36:25 - pjotrp: On Octo1 erik and flavia are using some tools in ~/.guix-profile. emacs, nucmer and perl. They may be OK, but just so you are aware. -Sat, Apr 13, 2024, 09:36:39 - pjotrp: I stopped the guix-daemon and started copying the store. -Sat, Apr 13, 2024, 10:22:56 - pjotrp: the store is now mounted on /export2/gnu: +> Fri, Apr 12, 2024, 18:38:29 - pjotrp: IMPORTANT: all nodes will be depleted tomorrow. Any running jobs will be killed. +> Fri, Apr 12, 2024, 18:38:46 - pjotrp: you may want to stop long running jobs today. +> Fri, Apr 12, 2024, 19:38:14 - Arun Isaac: Are we moving the store tomorrow? Should I be around? If so, when? I may be away in the day time. +> Sat, Apr 13, 2024, 02:59:45 - andreaguarracino: I think so, but I have no idea about the time +> Sat, Apr 13, 2024, 05:51:42 - erikg: Spring cleaning!!???!!! +> Sat, Apr 13, 2024, 08:22:39 - pjotrp: yup +> Sat, Apr 13, 2024, 08:23:10 - pjotrp: Arun Isaac: you may want to test later today when you are back. +> Sat, Apr 13, 2024, 09:29:42 - pjotrp: <@erikgarrison:matrix.org "pjotrp: do we have backups of ho..."> /home is on /export and that is a disk array - so we can handle disk failure. Should we backup /home? People have all sorts of stuff there. +> Sat, Apr 13, 2024, 09:36:25 - pjotrp: On Octo1 erik and flavia are using some tools in ~/.guix-profile. emacs, nucmer and perl. They may be OK, but just so you are aware. +> Sat, Apr 13, 2024, 09:36:39 - pjotrp: I stopped the guix-daemon and started copying the store. +> Sat, Apr 13, 2024, 10:22:56 - pjotrp: the store is now mounted on /export2/gnu: + ``` /export2/gnu /gnu none defaults,bind 0 0 ``` -Sat, Apr 13, 2024, 10:57:33 - pjotrp: /gnu successfully mounted on octo8 -Sat, Apr 13, 2024, 10:58:24 - pjotrp: also I updated the guix-daemon on octo1. -Sat, Apr 13, 2024, 11:08:38 - pjotrp: Funny thing + +> Sat, Apr 13, 2024, 10:57:33 - pjotrp: /gnu successfully mounted on octo8 +> Sat, Apr 13, 2024, 10:58:24 - pjotrp: also I updated the guix-daemon on octo1. +> Sat, Apr 13, 2024, 11:08:38 - pjotrp: Funny thing ``` root@octopus08:/etc/systemd# systemctl status lizardfs-mount ● lizardfs-mount.service - LizardFS mounts @@ -33,23 +35,27 @@ Feb 03 19:18:08 octopus08 systemd[1]: lizardfs-mount.service: Control process ex Feb 03 19:18:08 octopus08 systemd[1]: lizardfs-mount.service: Failed with result 'exit-code'. Feb 03 19:18:08 octopus08 systemd[1]: Failed to start LizardFS mounts. ``` -Sat, Apr 13, 2024, 11:08:52 - pjotrp: note the dates -Sat, Apr 13, 2024, 11:09:59 - pjotrp: /lizardfs is mounted so mfsmount is running. Weird. Nothing to do with guix btw - this is on Debian still -Sat, Apr 13, 2024, 11:10:14 - pjotrp: Checking tux06 -Sat, Apr 13, 2024, 11:12:31 - pjotrp: After +> Sat, Apr 13, 2024, 11:08:52 - pjotrp: note the dates +> Sat, Apr 13, 2024, 11:09:59 - pjotrp: /lizardfs is mounted so mfsmount is running. Weird. Nothing to do with guix btw - this is on Debian still +> Sat, Apr 13, 2024, 11:10:14 - pjotrp: Checking tux06 +> Sat, Apr 13, 2024, 11:12:31 - pjotrp: After ``` umount -l /gnu mount /gnu ``` -tux06 sees the right guix -Sat, Apr 13, 2024, 11:14:12 - pjotrp: lizardfs profile is fine on that machine -Sat, Apr 13, 2024, 11:18:07 - pjotrp: ``` + +> tux06 sees the right guix +> Sat, Apr 13, 2024, 11:14:12 - pjotrp: lizardfs profile is fine on that machine +> Sat, Apr 13, 2024, 11:18:07 - pjotrp: +``` systemctl restart lizardfs-mount systemctl restart lizardfs-chunkserver-hdd ``` -Sat, Apr 13, 2024, 11:18:29 - pjotrp: the last is still unloading. I don't think we have to do that for the other machines, just testing if I can bring it back up. -Sat, Apr 13, 2024, 11:19:42 - pjotrp: bit dangerous if we stop all chunkservers ;). But this one is restarting and appears happy enough -Sat, Apr 13, 2024, 11:20:05 - pjotrp: ``` +> Sat, Apr 13, 2024, 11:18:29 - pjotrp: the last is still unloading. I don't think we have to do that for the other machines, just testing if I can bring it back up. +> Sat, Apr 13, 2024, 11:19:42 - pjotrp: bit dangerous if we stop all chunkservers ;). But this one is restarting and appears happy enough +> Sat, Apr 13, 2024, 11:20:05 - pjotrp: + +``` Apr 13 04:13:48 tux06 mfschunkserver[1343867]: loaded charts data file from /var/lib/lizardfs_hdd/csstats.mfs Apr 13 04:13:48 tux06 mfschunkserver[1343867]: open files limit: 10000 Apr 13 04:13:48 tux06 mfschunkserver[1343867]: mfschunkserver daemon initialized properly @@ -57,13 +63,14 @@ Apr 13 04:13:48 tux06 systemd[1]: Finished lizardfs-chunkserver-hdd.service - Li Apr 13 04:13:48 tux06 mfschunkserver[1343867]: connected to Master Apr 13 04:13:48 tux06 mfschunkserver[1343867]: scanning folder /mnt/lizardserve/lizardfs_temp_vol/: complete (0s) ``` -Sat, Apr 13, 2024, 11:20:23 - pjotrp: OK, I'll remount nfs on all other machines and we should be OK -Sat, Apr 13, 2024, 11:21:10 - pjotrp: root on octo1: + +> Sat, Apr 13, 2024, 11:20:23 - pjotrp: OK, I'll remount nfs on all other machines and we should be OK +> Sat, Apr 13, 2024, 11:21:10 - pjotrp: root on octo1: ``` /dev/sda2 63G 16G 44G 26% / ``` -Sat, Apr 13, 2024, 11:34:41 - pjotrp: that is better -Sat, Apr 13, 2024, 11:42:17 - pjotrp: all nodes should see /gnu: +> Sat, Apr 13, 2024, 11:34:41 - pjotrp: that is better +> Sat, Apr 13, 2024, 11:42:17 - pjotrp: all nodes should see /gnu: ``` ---- node tux06 octopus01:/gnu 2.2T 656G 1.4T 32% /gnu @@ -74,13 +81,17 @@ octopus01:/gnu 2.2T 656G 1.4T 32% /gnu ---- node tux09 octopus01:/gnu 2.2T 656G 1.4T 32% /gnu ``` -Sat, Apr 13, 2024, 11:50:37 - pjotrp: lizard appears to be happy. OK andreaguarracino and Arun Isaac, please validate everything runs as expected. PBS does not run from guix, so that should be OK. -Sat, Apr 13, 2024, 11:53:59 - pjotrp: oh, checking, munged is also on guix on the tuxes and, oops: -Sat, Apr 13, 2024, 11:54:32 - pjotrp: ``` + +> Sat, Apr 13, 2024, 11:50:37 - pjotrp: lizard appears to be happy. OK andreaguarracino and Arun Isaac, please validate everything runs as expected. PBS does not run from guix, so that should be OK. +> Sat, Apr 13, 2024, 11:53:59 - pjotrp: oh, checking, munged is also on guix on the tuxes and, oops: +> Sat, Apr 13, 2024, 11:54:32 - pjotrp: + +``` root@tux06:/home/wrk# ls -l /export/octopus01/guix-profiles/slurm-1-link/sbin/munged ls: cannot access '/export/octopus01/guix-profiles/slurm-1-link/sbin/munged': No such file or directory ``` -Sat, Apr 13, 2024, 11:55:58 - pjotrp: must have happened during the profile cleanup andreaguarracino -Sat, Apr 13, 2024, 11:56:07 - pjotrp: fixed it -Sat, Apr 13, 2024, 12:00:21 - pjotrp: munge should be good now. Please test. -Sat, Apr 13, 2024, 12:06:14 - pjotrp: on a future cluster we should really consider net started nodes. With Efraim we looked into it, but never completed. It would make guix nodes an option, for one. + +> Sat, Apr 13, 2024, 11:55:58 - pjotrp: must have happened during the profile cleanup andreaguarracino +> Sat, Apr 13, 2024, 11:56:07 - pjotrp: fixed it +> Sat, Apr 13, 2024, 12:00:21 - pjotrp: munge should be good now. Please test. +> Sat, Apr 13, 2024, 12:06:14 - pjotrp: on a future cluster we should really consider net started nodes. With Efraim we looked into it, but never completed. It would make guix nodes an option, for one. |