Diffstat (limited to 'issues/systems')
-rw-r--r--  issues/systems/apps.gmi              20
-rw-r--r--  issues/systems/octopus.gmi           24
-rw-r--r--  issues/systems/t02-crash.gmi         47
-rw-r--r--  issues/systems/tux02-production.gmi   4
-rw-r--r--  issues/systems/tux04-disk-issues.gmi 43
5 files changed, 133 insertions, 5 deletions
diff --git a/issues/systems/apps.gmi b/issues/systems/apps.gmi
index b9d4155..e374250 100644
--- a/issues/systems/apps.gmi
+++ b/issues/systems/apps.gmi
@@ -194,14 +194,32 @@ Package definition is at
 
 Container is at
 
-=> https://git.genenetwork.org/guix-bioinformatics/tree/gn/services/bxd-power-container.scm
+=> https://git.genenetwork.org/gn-machines/tree/gn/services/mouse-longevity.scm
+
+```
+gaeta:~/iwrk/deploy/gn-machines$ guix system container -L . -L ~/guix-bioinformatics --verbosity=3 test-r-container.scm -L ~/iwrk/deploy/guix-forge/guix
+forge/nginx.scm:145:40: error: acme-service-type: unbound variable
+hint: Did you forget `(use-modules (forge acme))'?
+```
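The Guile hint already points at the fix: the container file has to import the (forge acme) module before acme-service-type becomes visible. A sketch of the change to test-r-container.scm (the second module name is only a placeholder for whatever the file already imports):

```scheme
;; Add (forge acme) to the file's imports so acme-service-type resolves,
;; as the hint suggests:
(use-modules (forge acme)     ; exports acme-service-type
             (gnu services))  ; placeholder: modules the file already uses
```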
+
 
 ## jumpshiny
 
 Jumpshiny is hosted on balg01. Scripts are in tux02 git.
 
+=> git.genenetwork.org:/home/git/shared/source/jumpshiny
+
 ```
 root@balg01:/home/j*/gn-machines# . /usr/local/guix-profiles/guix-pull/etc/profile
 guix system container --network -L . -L ../guix-forge/guix/ -L ../guix-bioinformatics/ -L ../guix-past/modules/ --substitute-urls='https://ci.guix.gnu.org https://bordeaux.guix.gnu.org https://cuirass.genenetwork.org' test-r-container.scm -L ../guix-forge/guix/
 /gnu/store/xyks73sf6pk78rvrwf45ik181v0zw8rx-run-container
 /gnu/store/6y65x5jk3lxy4yckssnl32yayjx9nwl5-run-container
 ```
+
+Currently:
+
+Jumpshiny: as user aijun, cd to services/jumpshiny and run ./.guix-run
+
+
+## JUMPsem_web
+
+Another shiny app to run on balg01.
+
+JUMPsem: as user aijun, cd to services/jumpsem and run ./.guix-run
diff --git a/issues/systems/octopus.gmi b/issues/systems/octopus.gmi
index c510fd9..3a6d317 100644
--- a/issues/systems/octopus.gmi
+++ b/issues/systems/octopus.gmi
@@ -1,6 +1,9 @@
 # Octopus sysmaintenance
 
-Reopened tasks because of new sheepdog layout and add new machines to Octopus and get fiber optic network going with @andreag. See also
+Reopened tasks because of the new sheepdog layout; we also need to add new machines to Octopus and get the fiber optic network going with @andreag.
+IT recently upgraded the network switch, so we should have a fast interconnect between all nodes. We also need to work on user management and network storage.
+
+See also
 
 => ../../topics/systems/hpc/octopus-maintenance
 
@@ -14,7 +17,7 @@ Reopened tasks because of new sheepdog layout and add new machines to Octopus an
 
 # Tasks
 
-* [ ] add lizardfs to nodes
+* [X] add lizardfs to nodes
 * [ ] add PBS to nodes
 * [ ] use fiber optic network
 * [ ] install sheepdog
@@ -36,6 +39,17 @@ default via 172.23.16.1 dev ens1f0np0
 
 # Current topology
 
+```
+vim /etc/ssh/sshd_config
+systemctl reload ssh
+```
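The sshd edit pins the addresses sshd binds to; a hypothetical fragment of what that change typically looks like (ListenAddress is a standard OpenSSH directive; the address is octopus01's from the routing table in these notes):

```
# /etc/ssh/sshd_config (fragment) - example only: bind sshd to the node's
# management address (172.23.18.188 is octopus01 in these notes)
ListenAddress 172.23.18.188
```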
+
+The routing should be set up as on octopus01:
+
+```
+default via 172.23.16.1 dev eno1
+172.23.16.0/21 dev ens1f0np0 proto kernel scope link src 172.23.18.221
+172.23.16.0/21 dev eno1 proto kernel scope link src 172.23.18.188
+```
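When bringing another node in line, the default-route interface is the thing to compare. A small sketch that extracts it from `ip route` output, using the octopus01 table above as canned sample input:

```shell
# Canned `ip route` output from octopus01 (copied from the notes above);
# on a live node, use: routes=$(ip route)
routes='default via 172.23.16.1 dev eno1
172.23.16.0/21 dev ens1f0np0 proto kernel scope link src 172.23.18.221
172.23.16.0/21 dev eno1 proto kernel scope link src 172.23.18.188'

# Print the interface that carries the default route
default_if=$(printf '%s\n' "$routes" | awk '/^default/ {for (i = 1; i <= NF; i++) if ($i == "dev") print $(i+1)}')
echo "$default_if"   # eno1
```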
+
 ```
 ip a
 ip route
@@ -44,3 +58,9 @@ ip route
 - Octopus01 uses eno1 172.23.18.188/21 gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps)
 - Octopus02 uses eno1 172.23.17.63/21  gateway 172.23.16.1 (eno1: Link is up at 1000 Mbps)
                       172.23.x.x
+
+# Work
+
+* After the switch upgrade penguin2 NFS is not visible for octopus01. I disabled the mount in fstab
+* On octopus01 disabled unattended upgrade script - we don't want kernel updates on this machine(!)
+* Updated IP addresses in sshd_config
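Disabling the unattended upgrade script on a stock Debian install usually comes down to the apt periodic settings; a sketch of the file involved (assuming the standard unattended-upgrades package is what was running):

```
# /etc/apt/apt.conf.d/20auto-upgrades - "0" disables both the list refresh
# and the upgrade run, so no surprise kernel updates
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
```

Masking the timers (systemctl mask apt-daily.timer apt-daily-upgrade.timer) is a belt-and-braces alternative.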
diff --git a/issues/systems/t02-crash.gmi b/issues/systems/t02-crash.gmi
new file mode 100644
index 0000000..bf0c5d5
--- /dev/null
+++ b/issues/systems/t02-crash.gmi
@@ -0,0 +1,47 @@
+# Postmortem tux02 crash
+
+I'll take a look at tux02 - it rebooted last night and I need to start some services. It rebooted at Aug 07 19:29:14 CDT (tux02 kernel: Linux version ...). We have two out-of-memory messages before that:
+
+```
+Aug  7 18:45:27 tux02 kernel: [13521994.665636] Out of memory: Kill process 30165 (guix) score 759 or sacrifice child
+Aug  7 18:45:27 tux02 kernel: [13521994.758974] Killed process 30165 (guix) total-vm:498873224kB, anon-rss:223599272kB, file-rss:4kB, shmem-rss:0kB
+```
+
+My mosh session clapped out before that:
+
+```
+wrk      pts/96       mosh [128868]    Thu Aug  7 18:53 - down   (00:00)
+```
+
+Someone killed the development container before that
+
+```
+Aug  7 18:06:32 tux02 systemd[1]: genenetwork-development-container.service: Killing process 86832 (20qjyhd7n9n62fa) with signal SIGKILL.
+```
+
+and
+
+```
+Aug  7 13:28:26 tux02 kernel: [13502972.611421] oom_reaper: reaped process 25224 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
+Aug  7 18:16:00 tux02 kernel: [13520227.160945] oom_reaper: reaped process 128091 (guix), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
+```
+
+Guix builds were running out of RAM... My conclusion is that someone has been doing some heavy lifting - probably Fred. I'll ask him to use a different machine, one that is not shared by many people. First I need to bring up some processes. The shepherd had not started, so:
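The kernel's kill message already quantifies the damage; a quick sketch pulling the resident memory out of such a line (the log line is copied verbatim from above):

```shell
# OOM kill line from the tux02 log above
line='Aug  7 18:45:27 tux02 kernel: [13521994.758974] Killed process 30165 (guix) total-vm:498873224kB, anon-rss:223599272kB, file-rss:4kB, shmem-rss:0kB'

# anon-rss is the RAM the process actually held when it was killed
anon_rss_kb=$(printf '%s\n' "$line" | grep -o 'anon-rss:[0-9]*' | cut -d: -f2)
echo "$((anon_rss_kb / 1024 / 1024)) GB"   # 213 GB - one guix build ate most of the box
```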
+
+```
+systemctl status user-shepherd.service
+```
+
+Most services have started now. I need to check again in half an hour.
+
+BNW is the one that does not start up automatically.
+
+```
+su shepherd
+herd status
+herd stop bnw
+herd status bnw
+tail -f /home/shepherd/logs/bnw.log
+```
+
+This shows a process is blocking the port. Kill it as root, after making sure herd status shows the service as stopped.
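Finding the stray process is a matter of asking ss which pid still owns the port; the sketch below parses a canned ss line (pid, port, and process name are made up for illustration - on the real machine, run ss -ltnp and take the actual values):

```shell
# Made-up `ss -ltnp` output line for a process still holding the BNW port
ssline='LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("bnw",pid=12345,fd=21))'

# Extract the pid; once `herd status bnw` reports stopped, kill it as root
pid=$(printf '%s\n' "$ssline" | grep -o 'pid=[0-9]*' | cut -d= -f2)
echo "$pid"   # 12345
```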
diff --git a/issues/systems/tux02-production.gmi b/issues/systems/tux02-production.gmi
index 7de911f..d811c5e 100644
--- a/issues/systems/tux02-production.gmi
+++ b/issues/systems/tux02-production.gmi
@@ -14,9 +14,9 @@ We are going to move production to tux02 - tux01 will be the staging machine. Th
 
 * [X] update guix guix-1.3.0-9.f743f20
 * [X] set up nginx (Debian)
-* [X] test ipmi console (172.23.30.40)
+* [X] test ipmi console
 * [X] test ports (nginx)
-* [?] set up network for external tux02e.uthsc.edu (128.169.4.52)
+* [?] set up network for external tux02
 * [X] set up deployment environment
 * [X] sheepdog copy database backup from tux01 on a daily basis using ibackup user
 * [X] same for GN2 production environment
diff --git a/issues/systems/tux04-disk-issues.gmi b/issues/systems/tux04-disk-issues.gmi
index bc6e1db..3df0a03 100644
--- a/issues/systems/tux04-disk-issues.gmi
+++ b/issues/systems/tux04-disk-issues.gmi
@@ -378,3 +378,46 @@ The code where it segfaulted is online at:
 => https://github.com/tianocore/edk2/blame/master/MdePkg/Library/BasePciSegmentLibPci/PciSegmentLib.c
 
 and has to do with PCI registers and that can actually be caused by the new PCIe card we hosted.
+
+# Sept 2025
+
+We moved production away from tux04, so now we should be able to work on this machine.
+
+
+## System crash on tux04
+
+And tux04 is down *again*. Wow, glad we moved production off! I still want to fix that machine. I left the terminal open and the last message is:
+
+```
+tux04:~$ [SMM] APIC 0x00 S00:C00:T00 > ASSERT [AmdPlatformRasRsSmm] u:\EDK2\MdePkg\Library\BasePciSegmentLibPci\PciSegmentLib.c(766): ((Address) & (0xfffffffff0000000ULL | (3))) == 0
+!!!! X64 Exception Type - 03(#BP - Breakpoint)  CPU Apic ID - 00000000 !!!!
+RIP  - 0000000076DA4343, CS  - 0000000000000038, RFLAGS - 0000000000000002
+RAX  - 0000000000000010, RCX - 00000000770D5B58, RDX - 00000000000002F8
+RBX  - 0000000000000000, RSP - 0000000077773278, RBP - 0000000000000000
+RSI  - 0000000000000000, RDI - 00000000777733E0
+R8   - 00000000777731F8, R9  - 0000000000000000, R10 - 0000000000000000
+R11  - 00000000000000A0, R12 - 0000000000000000, R13 - 0000000000000000
+R14  - FFFFFFFFAC41A118, R15 - 000000000005B000
+DS   - 0000000000000020, ES  - 0000000000000020, FS  - 0000000000000020
+GS   - 0000000000000020, SS  - 0000000000000020
+CR0  - 0000000080010033, CR2 - 00007F67F5268030, CR3 - 0000000077749000
+CR4  - 0000000000001668, CR8 - 0000000000000001
+DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
+DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
+GDTR - 000000007773C000 000000000000004F, LDTR - 0000000000000000
+IDTR - 0000000077761000 00000000000001FF,   TR - 0000000000000040
+FXSAVE_STATE - 0000000077772ED0
+!!!! Find image based on IP(0x76DA4343) u:\Build_Genoa\DellBrazosPkg\DEBUG_MYTOOLS\X64\DellPkgs\DellChipsetPkgs\AmdGenoaModulePkg\Override\AmdCpmPkg\Features\PlatformRas\Rs\Smm\AmdPlatformRasRsSmm\DEBUG\AmdPlatformRasRsSmm.pdb (ImageBase=0000000076D3E000, EntryPoint=0000000076D3E6C0) !!!!
+```
+
+and the racadm system log says
+
+```
+Record:      362
+Date/Time:   09/11/2025 21:47:02
+Source:      system
+Severity:    Critical
+Description: A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning.
+```
+
+I have seen that before and it is definitely a hardware/driver issue on the Dell itself. I'll work on that later. Luckily it always reboots.