# Tux04/Tux05 disk issues

We are facing disk issues with Tux04:

```
May 02 20:57:42 tux04 kernel: Buffer I/O error on device sdf1, logical block 859240457
```

The same happened on tux05 (same batch). The controllers report no issues. Just to be sure we added a copy of the boot partition.

=> topics/system/linux/add-boot-partition

# Tags

* assigned: pjotrp, aruni
* type: systems
* keywords: hardware
* status: unclear
* priority: medium

# Info

```
journalctl |grep mega
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_00], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_00], [NVMe Dell DC NVMe PE8 .2.0], lu id: 0x9a9ad026002ee4ac, S/N: SSBBN7299I250C41H, 960 GB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_01], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_01], [NVMe Dell Ent NVMe FI .0.0], lu id: 0x3655523054a001820025384500000002, S/N: S6URNE0TA00182, 1.60 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_02], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_02], [NVMe UMIS RPJTJ512MGE 0630], lu id: 0x8a13205102504a04, S/N: SS0L25210X8RC25E14WA, 512 GB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_03], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_03], [NVMe CT4000P3SSD8 R30A], lu id: 0x550000f077a77964, S/N: 2314E6C3E33E, 4.00 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_04], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_04], [NVMe CT4000P3SSD8 R30A], lu id: 0x830000f077a77964, S/N: 2314E6C3E2E2, 4.00 TB
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_05], opened
May 01 01:40:45 tux04 smartd[2440]: Device: /dev/bus/0 [megaraid_disk_05], [NVMe CT4000P3SSD8 R30A], lu id: 0x4d0000907da77964, S/N: 2327E6E9CB05, 4.00 TB
```

Switched on smartmontools.
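To keep an eye on errors like the one above, here is a quick sketch (the `count_io_errors` helper name is made up) that tallies Buffer I/O errors per device from kernel log lines on stdin:

```shell
# Sketch: count "Buffer I/O error" lines per device from kernel log input.
# Assumes the standard kernel message format shown above.
count_io_errors() {
  awk '/Buffer I\/O error on device/ {
         dev = $0
         sub(/.*on device /, "", dev)   # drop everything up to the device name
         sub(/,.*/, "", dev)            # keep just the device, e.g. sdf1
         n[dev]++
       }
       END { for (d in n) print d, n[d] }'
}
```

Run as root with, e.g., `journalctl -k | count_io_errors` to see whether the errors are confined to one device.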
```
smartctl -a /dev/sdf -d megaraid,0
```

shows no errors.

```
tux04:/$ lspci |grep RAID
41:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx
```

Download megacli from

=> https://hwraid.le-vert.net/wiki/DebianPackages

```
apt-get update
apt-get install megacli
megacli -LDInfo -L5 -a0
```

```
tux04:/$ megacli -PDList -a0|grep -i S.M
megacli -PDList -a0
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
tux04:/$ megacli -PDList -a0|grep -i Firm
Firmware state: Online, Spun Up
Device Firmware Level: .2.0
Firmware state: Online, Spun Up
Device Firmware Level: .0.0
Firmware state: Online, Spun Up
Device Firmware Level: 0630
Firmware state: Online, Spun Up
Device Firmware Level: R30A
Firmware state: Online, Spun Up
Device Firmware Level: R30A
Firmware state: Online, Spun Up
Device Firmware Level: R30A
```

So the drives are OK and the controller is not complaining. Smartctl self tests do not work on this controller:

```
tux04:/$ smartctl -t short -d megaraid,0 /dev/sdf -c
Short Background Self Test has begun
Use smartctl -X to abort test
```

and nothing happens ;). Megacli is actually the tool to use:

```
megacli -AdpAllInfo -aAll
```

# Database

During a backup the DB shows this error:

```
2025-03-02 06:28:33 Database page corruption detected at page 1079428, retrying...
[01] 2025-03-02 06:29:33 Database page corruption detected at page 1103108, retrying...
```

Interestingly, the DB recovered on a second backup. The database is hosted on /dev/sde, a Dell Ent NVMe FI drive.
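To see whether the corruption reports cluster on the same pages across backup runs, a small sketch (the `corrupt_pages` helper is hypothetical, assuming the backup log format shown above) that pulls out the reported page numbers:

```shell
# Sketch: extract corrupted page numbers from backup log lines on stdin.
# Matches lines like "... Database page corruption detected at page NNN, retrying..."
corrupt_pages() {
  awk 'match($0, /detected at page [0-9]+/) {
         # "detected at page " is 17 characters; print just the digits after it
         print substr($0, RSTART + 17, RLENGTH - 17)
       }'
}
```

Used as, e.g., `corrupt_pages < backup.log | sort | uniq -c` to spot recurring pages.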
The log says

```
kernel: I/O error, dev sde, sector 2136655448 op 0x0:(READ) flags 0x80700 phys_seg 40 prio class 2
```

This suggests an interface problem rather than a failing disk:

=> https://stackoverflow.com/questions/50312219/blk-update-request-i-o-error-dev-sda-sector-xxxxxxxxxxx

> The errors that you see are interface errors, they are not coming from the disk itself but rather from the connection to it. It can be the cable or any of the ports in the connection.
> Since the CRC errors on the drive do not increase I can only assume that the problem is on the receive side of the machine you use. You should check the cable and try a different SATA port on the server.

Another commenter (in a machine-translated post) wrote that in their experience most of these errors are caused by intensive reading and writing: their machine was a CDN cache node, and under heavy reads the NVMe temperature climbed; if that continues, the drive starts to throttle and then slowly collapses.

The temperature on that drive has been 70 C.
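Given the 70 C reading, it may be worth watching the drive temperature. A sketch (the `check_temp` helper is hypothetical; the exact "Temperature:" label in smartctl output varies by drive, and behind this controller the drive may need to be addressed with `-d megaraid,N`):

```shell
# Sketch: warn when the temperature reported on stdin exceeds a threshold.
# Assumes smartctl output containing a line like "Temperature: 70 Celsius".
check_temp() {  # usage: smartctl -a /dev/sde | check_temp 65
  awk -v max="$1" '
    /^Temperature:/ {
      t = $2
      # force numeric comparison; -v values are strings in awk
      print (t + 0 > max + 0 ? "WARN " t "C > " max "C" : "OK " t "C")
    }'
}
```

This could be dropped into a cron job or a monitoring check against whatever throttling threshold the drive's datasheet gives.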