# Tux04/Tux05 disk issues

We are facing disk issues with Tux04:

```
May 02 20:57:42 tux04 kernel: Buffer I/O error on device sdf1, logical block 859240457
```

The same happened on tux05 (same batch). The controllers report no issues. Just to be sure we added a copy of the boot partition.

=> topics/system/linux/add-boot-partition

# Tags

* assigned: pjotrp, aruni
* type: systems
* keywords: hardware
* status: unclear
* priority: medium

# Info

```
journalctl |grep mega
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_00], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_00], [NVMe Dell DC NVMe PE8 .2.0], lu id: 0x9a9ad026002ee4ac, S/N: SSBBN7299I250C41H, 960 GB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_01], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_01], [NVMe Dell Ent NVMe FI .0.0], lu id: 0x3655523054a001820025384500000002, S/N: S6URNE0TA00182, 1.60 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_02], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_02], [NVMe UMIS RPJTJ512MGE 0630], lu id: 0x8a13205102504a04, S/N: SS0L25210X8RC25E14WA, 512 GB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_03], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_03], [NVMe CT4000P3SSD8 R30A], lu id: 0x550000f077a77964, S/N: 2314E6C3E33E, 4.00 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_04], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_04], [NVMe CT4000P3SSD8 R30A], lu id: 0x830000f077a77964, S/N: 2314E6C3E2E2, 4.00 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_05], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_05], [NVMe CT4000P3SSD8 R30A], lu id: 0x4d0000907da77964, S/N: 2327E6E9CB05, 4.00 TB
```

Switched on smartmontools.
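To keep an eye on errors like the one above, here is a quick sketch (the `count_io_errors` helper name is made up) that tallies Buffer I/O errors per device from kernel log lines on stdin:

```shell
# Sketch: count "Buffer I/O error" lines per device from kernel log input.
# Assumes the standard kernel message format shown above.
count_io_errors() {
  awk '/Buffer I\/O error on device/ {
         dev = $0
         sub(/.*on device /, "", dev)   # drop everything up to the device name
         sub(/,.*/, "", dev)            # keep just the device, e.g. sdf1
         n[dev]++
       }
       END { for (d in n) print d, n[d] }'
}
```

Run as root with, e.g., `journalctl -k | count_io_errors` to see whether the errors are confined to one device.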
```
smartctl -a /dev/sdf -d megaraid,0
```

shows no errors.

```
tux04:/$ lspci |grep RAID
41:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx
```

Download megacli from

=> https://hwraid.le-vert.net/wiki/DebianPackages

```
apt-get update
apt-get install megacli
megacli -LDInfo -L5 -a0
```

```
tux04:/$ megacli -PDList -a0|grep -i S.M
megacli -PDList -a0
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
tux04:/$ megacli -PDList -a0|grep -i Firm
Firmware state: Online, Spun Up
Device Firmware Level: .2.0
Firmware state: Online, Spun Up
Device Firmware Level: .0.0
Firmware state: Online, Spun Up
Device Firmware Level: 0630
Firmware state: Online, Spun Up
Device Firmware Level: R30A
Firmware state: Online, Spun Up
Device Firmware Level: R30A
Firmware state: Online, Spun Up
Device Firmware Level: R30A
```

So the drives are OK and the controller is not complaining. Smartctl self tests do not work on this controller:

```
tux04:/$ smartctl -t short -d megaraid,0 /dev/sdf -c
Short Background Self Test has begun
Use smartctl -X to abort test
```

and nothing happens ;). Megacli is actually the tool to use:

```
megacli -AdpAllInfo -aAll
```

# Database

During a backup the DB shows this error:

```
2025-03-02 06:28:33 Database page corruption detected at page 1079428, retrying...
[01] 2025-03-02 06:29:33 Database page corruption detected at page 1103108, retrying...
```

Interestingly, the DB recovered on a second backup. The database is hosted on /dev/sde, a Dell Ent NVMe FI drive.
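To see whether the corruption reports cluster on the same pages across backup runs, a small sketch (the `corrupt_pages` helper is hypothetical, assuming the backup log format shown above) that pulls out the reported page numbers:

```shell
# Sketch: extract corrupted page numbers from backup log lines on stdin.
# Matches lines like "... Database page corruption detected at page NNN, retrying..."
corrupt_pages() {
  awk 'match($0, /detected at page [0-9]+/) {
         # "detected at page " is 17 characters; print just the digits after it
         print substr($0, RSTART + 17, RLENGTH - 17)
       }'
}
```

Used as, e.g., `corrupt_pages < backup.log | sort | uniq -c` to spot recurring pages.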
The log says

```
kernel: I/O error, dev sde, sector 2136655448 op 0x0:(READ) flags 0x80700 phys_seg 40 prio class 2
```

This suggests an interface problem rather than a failing disk:

=> https://stackoverflow.com/questions/50312219/blk-update-request-i-o-error-dev-sda-sector-xxxxxxxxxxx

> The errors that you see are interface errors, they are not coming from the disk itself but rather from the connection to it. It can be the cable or any of the ports in the connection.
> Since the CRC errors on the drive do not increase I can only assume that the problem is on the receive side of the machine you use. You should check the cable and try a different SATA port on the server.

Another commenter (in a machine-translated post) wrote that in their experience most of these errors are caused by intensive reading and writing: their machine was a CDN cache node, and under heavy reads the NVMe temperature climbed; if that continues, the drive starts to throttle and then slowly collapses.

The temperature on that drive has been 70 C.
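Given the 70 C reading, it may be worth watching the drive temperature. A sketch (the `check_temp` helper is hypothetical; the exact "Temperature:" label in smartctl output varies by drive, and behind this controller the drive may need to be addressed with `-d megaraid,N`):

```shell
# Sketch: warn when the temperature reported on stdin exceeds a threshold.
# Assumes smartctl output containing a line like "Temperature: 70 Celsius".
check_temp() {  # usage: smartctl -a /dev/sde | check_temp 65
  awk -v max="$1" '
    /^Temperature:/ {
      t = $2
      # force numeric comparison; -v values are strings in awk
      print (t + 0 > max + 0 ? "WARN " t "C > " max "C" : "OK " t "C")
    }'
}
```

This could be dropped into a cron job or a monitoring check against whatever throttling threshold the drive's datasheet gives.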