action #177973
Dying disk on qa-power8-3: Needs replacement? size:S
Description
Motivation
While working on #169939, @gpathak observed the following messages in the boot logs:
[ 28.432972][ C0] ipr 0001:08:00.0: 8150: Permanent IOA failure
[ 28.432986][ C0] ipr: 00000000: 04448200 13512400 FFFFFFFF 103034F0
...
[ 28.433297][ C0] ipr: 000003B0: 0040EF00 00A27DD0 14411245 EF000014
[ 28.433302][ C0] ipr: 000003C0: 000000B0 00A27DD0 144111C3 CE000000
[ 28.433307][ C0] ipr: 000003D0: 49434F4D 57414954 14410EB6 CE000000
[ 28.433354][ C0] ipr 0001:08:00.0: FFF4: Disk device problem
[ 28.433360][ C0] ipr: -----Failing Device Information-----
[ 28.433364][ C0] ipr: World Wide Unique ID: 5000CCA01D06CF5C0000000000000000
[ 28.433370][ C0] ipr: Device Resource Path: 00-03
[ 28.433374][ C0] ipr: Primary Problem Description: Device detected hardware error
[ 28.433379][ C0] ipr: Secondary Problem Description: Status Check
[ 28.433384][ C0] ipr: SCSI Sense Data:
[ 28.433387][ C0] ipr: 00000000: 70000400 00000018 00000000 44000000
[ 28.433393][ C0] ipr: 00000010: 00000000 F4400000 00000000 00000000
[ 28.433398][ C0] ipr: SCSI Command Descriptor Block:
[ 28.433402][ C0] ipr: 00000000: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
[ 28.433407][ C0] ipr: Additional IOA Data:
[ 28.433411][ C0] ipr: 00000000: 455300CC 07B00007 00000000 84000030
[ 28.433416][ C0] ipr: 00000010: 00000000 00000000 0B7EDFC0 00000000
[ 28.433421][ C0] ipr: 00000020: 00000000 0B7ED8A0 C8008000 00000000
[ 28.433427][ C0] ipr: 00000030: 00000000 00000000 00000480 8F000000
[ 28.433432][ C0] ipr: 00000040: 001F9D1B 00000000 00000000 00000000
...
[ 28.433505][ C0] ipr: 00000120: 43490018 00000002 0003FFFF FFFFFFFF
[ 28.433510][ C0] ipr: 00000130: 5000CCA0 1D06CF5D 00001770 545209C0
[ 129.774662][ T11] sd 0:0:3:0: [sdc] Asking for cache data failed
An online search turned up this IBM website,
which suggests a problem with the existing hard disk, or possibly a loose cable.
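The SCSI sense data quoted in the log can be decoded by hand to cross-check the "Device detected hardware error" description. Below is a minimal Python sketch, assuming the ipr driver prints the sense buffer as 32-bit words in buffer byte order and that the buffer uses the fixed sense format (response code 0x70 or 0x71). The helper name `decode_fixed_sense` and the `SENSE_KEYS` table are hypothetical and only cover a few common sense keys.

```python
# Sketch: decode fixed-format SCSI sense data as printed in the ipr log.
# Assumption: the hex words appear in the same order as the bytes in the buffer.

SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}


def decode_fixed_sense(hex_words):
    """Return response code, sense key, ASC and ASCQ from fixed-format sense data."""
    raw = bytes.fromhex("".join(hex_words))
    if len(raw) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    response_code = raw[0] & 0x7F  # the top bit is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    sense_key = raw[2] & 0x0F  # low nibble of byte 2
    return {
        "response_code": response_code,
        "sense_key": sense_key,
        "sense_key_name": SENSE_KEYS.get(sense_key, "unknown"),
        "asc": raw[12],   # additional sense code
        "ascq": raw[13],  # additional sense code qualifier
    }


# First line of "SCSI Sense Data" from the boot log in this ticket
words = "70000400 00000018 00000000 44000000".split()
sense = decode_fixed_sense(words)
print(sense)
```

Under these assumptions the sense key is 0x4 (HARDWARE ERROR) with ASC/ASCQ 0x44/0x00, which the SCSI additional sense code list names "internal target failure". That matches the driver's own summary but does not by itself distinguish a failing disk from a cabling or controller problem.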
Acceptance criteria
- AC1: There is no error message at boot time
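One way AC1 could be verified is to scan the kernel boot log for the messages quoted in the motivation. The following is a rough Python sketch, not an established check: the pattern list is taken only from the messages seen in this ticket, and the names `ERROR_PATTERNS` and `find_disk_errors` are hypothetical.

```python
import re

# Assumption: these four messages, copied from this ticket, are the ones to watch for.
ERROR_PATTERNS = [
    r"Permanent IOA failure",
    r"Disk device problem",
    r"Device detected hardware error",
    r"Asking for cache data failed",
]
_ERROR_RE = re.compile("|".join(ERROR_PATTERNS))


def find_disk_errors(log_text):
    """Return the log lines that match any of the known error patterns."""
    return [line for line in log_text.splitlines() if _ERROR_RE.search(line)]


# Sample input: a few lines from the boot log quoted above
sample = """\
[ 28.432972][ C0] ipr 0001:08:00.0: 8150: Permanent IOA failure
[ 28.433354][ C0] ipr 0001:08:00.0: FFF4: Disk device problem
[ 28.433370][ C0] ipr: Device Resource Path: 00-03
[ 129.774662][ T11] sd 0:0:3:0: [sdc] Asking for cache data failed
"""
hits = find_disk_errors(sample)
for line in hits:
    print(line)
```

On the worker itself, the text passed to `find_disk_errors` could come from `dmesg` or `journalctl -k -b`. An empty result would indicate that AC1 is met for that boot.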
Suggestions
- Get in touch with #discuss-powerpc-architecture; maybe we already have some hardware
- Order, replace or confirm "it works after all" as needed
Updated by gpathak about 1 month ago
- Copied from action #169939: Upgrade Power8 o3 workers to openSUSE Leap 15.6 size:M added
Updated by gpathak about 1 month ago
- File deleted (clipboard-202501101213-mq9d6.png)
Updated by gpathak about 1 month ago
- File deleted (clipboard-202501101222-maqgi.png)
Updated by gpathak about 1 month ago
- File deleted (clipboard-202501101823-slqbk.png)
Updated by gpathak about 1 month ago
- File deleted (clipboard-202501101826-poyf3.png)
Updated by gpathak about 1 month ago
- File deleted (clipboard-202501101827-kt565.png)
Updated by gpathak about 1 month ago
- File deleted (clipboard-202501131748-nqil7.png)
Updated by gpathak about 1 month ago
- File deleted (clipboard-202501141602-ndo23.png)
Updated by gpathak about 1 month ago
- Tracker changed from coordination to action
Updated by okurz about 1 month ago
It sure sounds like broken hardware but did those errors only appear in a newer, unstable kernel?
Updated by gpathak about 1 month ago
okurz wrote in #note-12:
It sure sounds like broken hardware but did those errors only appear in a newer, unstable kernel?
Oh, I totally missed this point. Actually, the crash wasn't happening on the older kernel, which is why I never checked the boot logs for this message.
Could it be due to a changed SCSI firmware in newer kernel/OS releases?
Updated by okurz about 1 month ago
What I consider more likely is that the part handling how CPU and memory are addressed is buggy, which might cause these I/O errors as symptoms.
Updated by tinita about 1 month ago
- Subject changed from Dying disk on qa-power8-3: Needs replacement? to Dying disk on qa-power8-3: Needs replacement? size:S
- Description updated (diff)
- Status changed from New to Workable