Diffstat (limited to 'issues/systems')
-rw-r--r--  issues/systems/apps.gmi              | 20
-rw-r--r--  issues/systems/octopus.gmi           | 24
-rw-r--r--  issues/systems/t02-crash.gmi         | 47
-rw-r--r--  issues/systems/tux02-production.gmi  |  4
-rw-r--r--  issues/systems/tux04-disk-issues.gmi | 43
5 files changed, 133 insertions(+), 5 deletions(-)
diff --git a/issues/systems/apps.gmi b/issues/systems/apps.gmi
index b9d4155..e374250 100644
--- a/issues/systems/apps.gmi
+++ b/issues/systems/apps.gmi
@@ -194,14 +194,32 @@ Package definition is at
 
 Container is at
 
-=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/services/bxd-power-container.scm
+=> https://git.genenetwork.org/gn-machines/tree/gn/services/mouse-longevity.scm
+
+gaeta:~/iwrk/deploy/gn-machines$ guix system container -L . -L ~/guix-bioinformatics --verbosity=3 test-r-container.scm -L ~/iwrk/deploy/guix-forge/guix
+forge/nginx.scm:145:40: error: acme-service-type: unbound variable
+hint: Did you forget `(use-modules (forge acme))'?
+
 ## jumpshiny
 
 Jumpshiny is hosted on balg01. Scripts are in tux02 git.
 
+=> git.genenetwork.org:/home/git/shared/source/jumpshiny
+
 ```
 root@balg01:/home/j*/gn-machines# . /usr/local/guix-profiles/guix-pull/etc/profile
 guix system container --network -L . -L ../guix-forge/guix/ -L ../guix-bioinformatics/ -L ../guix-past/modules/ --substitute-urls='https://ci.guix.gnu.org https://bordeaux.guix.gnu.org https://cuirass.genenetwork.org' test-r-container.scm -L ../guix-forge/guix/
 gnu/store/xyks73sf6pk78rvrwf45ik181v0zw8rx-run-container
 /gnu/store/6y65x5jk3lxy4yckssnl32yayjx9nwl5-run-container
 ```
+
+Currently:
+
+Jumpshiny: as aijun, cd services/jumpshiny and ./.guix-run
+
+
+## JUMPsem_web
+
+Another shiny app to run on balg01.
+
+Jumpshiny: as aijun, cd services/jumpsem and ./.guix-run
diff --git a/issues/systems/octopus.gmi b/issues/systems/octopus.gmi
index c510fd9..3a6d317 100644
--- a/issues/systems/octopus.gmi
+++ b/issues/systems/octopus.gmi
@@ -1,6 +1,9 @@
 # Octopus sysmaintenance
 
-Reopened tasks because of new sheepdog layout and add new machines to Octopus and get fiber optic network going with @andreag. See also
+Reopened tasks because of new sheepdog layout and add new machines to Octopus and get fiber optic network going with @andreag.
+IT recently upgraded the network switch, so we should have great interconnect between all nodes. We also need to work on user management and network storage.
+
+See also
 
 => ../../topics/systems/hpc/octopus-maintenance
 
@@ -14,7 +17,7 @@ Reopened tasks because of new sheepdog layout and add new machines to Octopus an
 
 # Tasks
 
-* [ ] add lizardfs to nodes
+* [X] add lizardfs to nodes
 * [ ] add PBS to nodes
 * [ ] use fiber optic network
 * [ ] install sheepdog
@@ -36,6 +39,17 @@ default via 172.23.16.1 dev ens1f0np0
 
 # Current topology
 
+vim /etc/ssh/sshd_config
+systemctl reload ssh
+
+The routing should be as on octopus01
+
+```
+default via 172.23.16.1 dev eno1
+172.23.16.0/21 dev ens1f0np0 proto kernel scope link src 172.23.18.221
+172.23.16.0/21 dev eno1 proto kernel scope link src 172.23.18.188
+```
+
 ```
 ip a
 ip route
@@ -44,3 +58,9 @@ ip route
 - Octopus01 uses eno1 172.23.18.188/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps)
 - Octopus02 uses eno1 172.23.17.63/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps)
 172.23.x.x
+
+# Work
+
+* After the switch upgrade penguin2 NFS is not visible for octopus01. I disabled the mount in fstab
+* On octopus01 disabled unattended upgrade script - we don't want kernel updates on this machine(!)
+* Updated IP addresses in sshd_config
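The octopus.gmi hunk above fixes node routing and refreshes sshd after the switch upgrade. As a rough sketch, assuming the interface names, gateway and addresses quoted in that hunk (check `ip a` on the node first), the corresponding commands would look something like:

```
# Sketch only - interface names and the 172.23.16.1 gateway are taken from the hunk above.
ip route                                             # confirm which device currently carries the default route
ip route replace default via 172.23.16.1 dev eno1    # match the octopus01 routing shown above
grep -n 'ListenAddress' /etc/ssh/sshd_config         # spot stale IP addresses after the switch upgrade
# after correcting the ListenAddress entries:
sshd -t && systemctl reload ssh                      # validate the config, then reload as in the hunk
```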
diff --git a/issues/systems/t02-crash.gmi b/issues/systems/t02-crash.gmi
new file mode 100644
index 0000000..bf0c5d5
--- /dev/null
+++ b/issues/systems/t02-crash.gmi
@@ -0,0 +1,47 @@
+## Postmortem tux02 crash
+
+I'll take a look at tux02 - it rebooted last night and I need to start some services. It rebooted at CDT Aug 07 19:29:14 tux02 kernel: Linux version ... We have two out of memory messages before that:
+
+```
+Aug 7 18:45:27 tux02 kernel: [13521994.665636] Out of memory: Kill process 30165 (guix) score 759 or sacrifice child
+Aug 7 18:45:27 tux02 kernel: [13521994.758974] Killed process 30165 (guix) total-vm:498873224kB, anon-rss:223599272kB, file-rss:4kB, shmem-rss:0kB
+```
+
+My mosh clapped out before that
+
+```
+wrk pts/96 mosh [128868] Thu Aug 7 18:53 - down (00:00)
+```
+
+Someone killed the development container before that
+
+```
+Aug 7 18:06:32 tux02 systemd[1]: genenetwork-development-container.service: Killing process 86832 (20qjyhd7n9n62fa) with signal SIGKILL.
+```
+
+and
+
+```
+Aug 7 13:28:26 tux02 kernel: [13502972.611421] oom_reaper: reaped process 25224 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
+Aug 7 18:16:00 tux02 kernel: [13520227.160945] oom_reaper: reaped process 128091 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
+```
+
+Guix builds running out of RAM... My conclusion is that someone has been doing some heavy lifting. Probably Fred. I'll ask him to use a different machine that is not shared by many people. First I need to bring up some processes. The shepherd had not started, so:
+
+```
+systemctl status user-shepherd.service
+```
+
+most services started now. I need to check in half an hour.
+
+BNW is the one that does not start up automatically.
+
+```
+su shepherd
+herd status
+herd stop bnw
+herd status bnw
+tail -f /home/shepherd/logs/bnw.log
+```
+
+Shows a process is blocking the port. Kill as root, after making sure herd status shows it as stopped.
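For the postmortem above, a small sketch of how the OOM evidence and the blocked BNW port can be double-checked; the port number is a placeholder (the issue does not name it) and the log path assumes tux02's standard Debian syslog:

```
# Sketch - BNW_PORT is hypothetical, not taken from the issue.
grep -Ei 'out of memory|oom_reaper' /var/log/syslog | tail   # the kernel OOM kills quoted above
ss -ltnp | grep ":$BNW_PORT"                                  # which PID is still bound to bnw's port
# kill that PID as root only once 'herd status bnw' reports the service stopped,
# then restart it with 'herd start bnw' as the shepherd user.
```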
diff --git a/issues/systems/tux02-production.gmi b/issues/systems/tux02-production.gmi
index 7de911f..d811c5e 100644
--- a/issues/systems/tux02-production.gmi
+++ b/issues/systems/tux02-production.gmi
@@ -14,9 +14,9 @@ We are going to move production to tux02 - tux01 will be the staging machine. Th
 
 * [X] update guix guix-1.3.0-9.f743f20
 * [X] set up nginx (Debian)
-* [X] test ipmi console (172.23.30.40)
+* [X] test ipmi console
 * [X] test ports (nginx)
-* [?] set up network for external tux02e.uthsc.edu (128.169.4.52)
+* [?] set up network for external tux02
 * [X] set up deployment evironment
 * [X] sheepdog copy database backup from tux01 on a daily basis using ibackup user
 * [X] same for GN2 production environment
diff --git a/issues/systems/tux04-disk-issues.gmi b/issues/systems/tux04-disk-issues.gmi
index bc6e1db..3df0a03 100644
--- a/issues/systems/tux04-disk-issues.gmi
+++ b/issues/systems/tux04-disk-issues.gmi
@@ -378,3 +378,46 @@ The code where it segfaulted is online at:
 => https://github.com/tianocore/edk2/blame/master/MdePkg/Library/BasePciSegmentLibPci/PciSegmentLib.c
 
 and has to do with PCI registers and that can actually be caused by the new PCIe card we hosted.
+
+# Sept 2025
+
+We moved production away from tux04, so now we should be able to work on this machine.
+
+
+## System crash on tux04
+
+And tux04 is down *again*. Wow, glad we moved off! I want to fix that machine and we had to move production off! I left the terminal open and the last message is:
+
+```
+tux04:~$ [SMM] APIC 0x00 S00:C00:T00 > ASSERT [AmdPlatformRasRsSmm] u:\EDK2\MdePkg\Library\BasePciSegmentLibPci\PciSegmentLib.c(766): ((Address) & (0xfffffffff0000000ULL | (3))) == 0
+!!!! X64 Exception Type - 03(#BP - Breakpoint) CPU Apic ID - 00000000 !!!!
+RIP - 0000000076DA4343, CS - 0000000000000038, RFLAGS - 0000000000000002
+RAX - 0000000000000010, RCX - 00000000770D5B58, RDX - 00000000000002F8
+RBX - 0000000000000000, RSP - 0000000077773278, RBP - 0000000000000000
+RSI - 0000000000000000, RDI - 00000000777733E0
+R8 - 00000000777731F8, R9 - 0000000000000000, R10 - 0000000000000000
+R11 - 00000000000000A0, R12 - 0000000000000000, R13 - 0000000000000000
+R14 - FFFFFFFFAC41A118, R15 - 000000000005B000
+DS - 0000000000000020, ES - 0000000000000020, FS - 0000000000000020
+GS - 0000000000000020, SS - 0000000000000020
+CR0 - 0000000080010033, CR2 - 00007F67F5268030, CR3 - 0000000077749000
+CR4 - 0000000000001668, CR8 - 0000000000000001
+DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
+DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
+GDTR - 000000007773C000 000000000000004F, LDTR - 0000000000000000
+IDTR - 0000000077761000 00000000000001FF, TR - 0000000000000040
+FXSAVE_STATE - 0000000077772ED0
+!!!! Find image based on IP(0x76DA4343) u:\Build_Genoa\DellBrazosPkg\DEBUG_MYTOOLS\X64\DellPkgs\DellChipsetPkgs\AmdGenoaModulePkg\Override\AmdCpmPkg\Features\PlatformRas\Rs\Smm\AmdPlatformRasRsSmm\DEBUG\AmdPlatformRasRsSmm.pdb (ImageBase=0000000076D3E000, EntryPoint=0000000076D3E6C0) !!!!
+```
+
+and the racadm system log says
+
+```
+Record:      362
+Date/Time:   09/11/2025 21:47:02
+Source:      system
+Severity:    Critical
+Description: A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning.
+```
+
+I have seen that before and it is definitely a hardware/driver issue on the Dell itself. I'll work on that later. Luckily it always reboots.
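The tux04 crash record above was taken from the Dell iDRAC log. As a sketch, assuming remote racadm access to tux04's iDRAC (the address and credentials below are placeholders, not from the issue), the surrounding system event log can be pulled with:

```
# Placeholders: <idrac-ip>, <user>, <password> are not from the issue.
racadm -r <idrac-ip> -u <user> -p <password> getsel -i   # number of SEL records
racadm -r <idrac-ip> -u <user> -p <password> getsel      # dump the full system event log, including record 362
```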