summaryrefslogtreecommitdiff
path: root/issues/systems/reboot-tux01-tux02.gmi
blob: dcd2677b376925d75048c24853bd23ae164fc148 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
# Rebooting the GN production machine(s)

I needed to add the hard disks in the BIOS to make them visible - one of the annoying aspects of these Dell machines. First on Tux02 I cheched the borg backups to see if we have a recent copy of MariaDB, GN2 etc. The DB is from 2 days ago and the genotypes of GN2 are a week old (because of a permission problem). I'll add a copy by hand of both - an opportunity to test the new 10Gbs router.

# Tasks

Before rebooting

On Tux02:

* [X] Check backups of DB and services
* [X] Copy trees between machines

On both:

* [X] Check network interface definitions (what happens on reboot)
* [X] Check IPMI access - should get serial login


# Info

## Reconfiguring disks

We can add a lot of SATA disks. To configure run racadm and set serial settings

: racadm set iDRAC.Serial.Enable 1
Dump everything

: racdump

: get nic.nicconfig.4
: hwinventory

Reboot

: racadm>>serveraction powercycle

: racadm>>serveraction hardreset
: racadm>>serveraction powerstatus
: racadm>>connect com2

racadm>>
config -g cfgSerial -o cfgSerialBaudRate 115200
config -g cfgSerial -o cfgSerialCom2RedirEnable 1
config -g cfgSerial -o cfgSerialSshEnable 1
config -g cfgIpmiSol -o cfgIpmiSolEnable 1
config -g cfgIpmiSol -o cfgIpmiSolBaudRate 115200
get iDRAC.serial


Now you should see the menu in

: console com2

## Routing

On tux02 eno2d1 is the 10Gbs network interface. Unfortunately I can't get it to connect at 10Gbs with Tux01 because the latter is using that port for the outside world.

Playing with 10Gbs on Tux01 sent the hardware in a tail spin, what to think of this solution on

```
bnxt_en 0000:01:00.1 (unnamed net_device) (uninitialized): Error (timeout: 500015) msg {0x0 0x0} len:0

Solution was to power down the server(s) and *remove* power cords for 5 minutes.
```

=> https://www.dell.com/community/PowerEdge-Hardware-General/Critical-network-bnxt-en-module-crashes-on-14G-servers/td-p/6031769

The Linux kernel shows some fixes that are not on Tux01 yet

=> https://lkml.org/lkml/2021/2/17/970

In our case a simple reboot worked, fortunately.

## Tags

* assigned: pjotrp
* status: unclear
* priority: medium
* type: system administration
* keywords: systems, tux01, tux02, production