action #177973

open

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

Dying disk on qa-power8-3: Needs replacement?

Added by gpathak 4 days ago. Updated 3 days ago.

Status: New
Priority: Low
Assignee: -
Category: Organisational
Start date: 2024-11-14
Due date:
% Done: 0%
Estimated time:

Description

Motivation

While working on #169939, @gpathak observed the following messages in the boot logs:

[   28.432972][    C0] ipr 0001:08:00.0: 8150: Permanent IOA failure
[   28.432986][    C0] ipr: 00000000: 04448200 13512400 FFFFFFFF 103034F0
...
[   28.433297][    C0] ipr: 000003B0: 0040EF00 00A27DD0 14411245 EF000014
[   28.433302][    C0] ipr: 000003C0: 000000B0 00A27DD0 144111C3 CE000000
[   28.433307][    C0] ipr: 000003D0: 49434F4D 57414954 14410EB6 CE000000
[   28.433354][    C0] ipr 0001:08:00.0: FFF4: Disk device problem
[   28.433360][    C0] ipr: -----Failing Device Information-----
[   28.433364][    C0] ipr: World Wide Unique ID: 5000CCA01D06CF5C0000000000000000
[   28.433370][    C0] ipr: Device Resource Path: 00-03
[   28.433374][    C0] ipr: Primary Problem Description: Device detected hardware error 
[   28.433379][    C0] ipr: Secondary Problem Description:  Status Check                   
[   28.433384][    C0] ipr: SCSI Sense Data:
[   28.433387][    C0] ipr: 00000000: 70000400 00000018 00000000 44000000
[   28.433393][    C0] ipr: 00000010: 00000000 F4400000 00000000 00000000
[   28.433398][    C0] ipr: SCSI Command Descriptor Block: 
[   28.433402][    C0] ipr: 00000000: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
[   28.433407][    C0] ipr: Additional IOA Data:
[   28.433411][    C0] ipr: 00000000: 455300CC 07B00007 00000000 84000030
[   28.433416][    C0] ipr: 00000010: 00000000 00000000 0B7EDFC0 00000000
[   28.433421][    C0] ipr: 00000020: 00000000 0B7ED8A0 C8008000 00000000
[   28.433427][    C0] ipr: 00000030: 00000000 00000000 00000480 8F000000
[   28.433432][    C0] ipr: 00000040: 001F9D1B 00000000 00000000 00000000
...
[   28.433505][    C0] ipr: 00000120: 43490018 00000002 0003FFFF FFFFFFFF
[   28.433510][    C0] ipr: 00000130: 5000CCA0 1D06CF5D 00001770 545209C0
[  129.774662][   T11] sd 0:0:3:0: [sdc] Asking for cache data failed

An online search led to this IBM website, which suggests that the cause might be a problem with the existing hard disk or a loose cable.
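
As a cross-check of the "Primary Problem Description" line, the SCSI sense data in the log can be decoded by hand. The following is a minimal sketch, not part of the original report. It assumes that the ipr driver prints the sense buffer as consecutive big-endian 32-bit words, so that the hex dump reads as the raw bytes in order, and that the buffer uses the fixed sense format (response code 70h). The function name `decode_fixed_sense` and the small sense key table are illustrative only.

```python
# Sense words copied from the two "SCSI Sense Data" lines of the boot log.
SENSE_WORDS = (
    "70000400 00000018 00000000 44000000 "
    "00000000 F4400000 00000000 00000000"
)

# Illustrative subset of the standard SCSI sense keys.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}


def decode_fixed_sense(words: str) -> dict:
    """Decode the key fields of fixed-format SCSI sense data.

    Assumption: `words` is a hex dump whose bytes appear in buffer order.
    """
    data = bytes.fromhex("".join(words.split()))
    if len(data) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    response_code = data[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError(f"not fixed-format sense data: {response_code:#x}")
    sense_key = data[2] & 0x0F
    return {
        "response_code": response_code,
        "sense_key": sense_key,
        "sense_key_name": SENSE_KEYS.get(sense_key, "other"),
        "additional_length": data[7],
        "asc": data[12],   # additional sense code
        "ascq": data[13],  # additional sense code qualifier
    }


if __name__ == "__main__":
    result = decode_fixed_sense(SENSE_WORDS)
    print(f"sense key {result['sense_key']:X}h ({result['sense_key_name']})")
    print(f"ASC/ASCQ {result['asc']:02X}h/{result['ascq']:02X}h")
```

Under these assumptions the sense key is 4h (HARDWARE ERROR) and the ASC/ASCQ pair is 44h/00h, which the T10 code assignments list as "internal target failure". This agrees with the "Device detected hardware error" text in the log, but on its own it does not tell whether the disk, the cabling, or the adapter is at fault.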

Acceptance criteria

Suggestions


Files

Crash-Log.7z (2.07 MB), gpathak, 2025-01-23 05:17
Crash-Log.tar.gz (4.37 MB), gpathak, 2025-01-23 09:39
qa-power8-softlockup-2.log (112 KB), gpathak, 2025-01-24 11:29
qa-power8-3-kernel-error.log (94.2 KB), gpathak, 2025-02-15 08:43
clipboard-202502271847-zjmq6.png (30.6 KB), gpathak, 2025-02-27 13:17

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #169939: Upgrade Power8 o3 workers to openSUSE Leap 15.6 size:M (Resolved, gpathak, 2024-11-14)

Actions #1

Updated by gpathak 4 days ago

  • Copied from action #169939: Upgrade Power8 o3 workers to openSUSE Leap 15.6 size:M added
Actions #2

Updated by gpathak 4 days ago

  • File deleted (clipboard-202501101213-mq9d6.png)
Actions #3

Updated by gpathak 4 days ago

  • File deleted (clipboard-202501101222-maqgi.png)
Actions #4

Updated by gpathak 4 days ago

  • File deleted (clipboard-202501101823-slqbk.png)
Actions #5

Updated by gpathak 4 days ago

  • File deleted (clipboard-202501101826-poyf3.png)
Actions #6

Updated by gpathak 4 days ago

  • File deleted (clipboard-202501101827-kt565.png)
Actions #7

Updated by gpathak 4 days ago

  • File deleted (clipboard-202501131748-nqil7.png)
Actions #8

Updated by gpathak 4 days ago

  • File deleted (clipboard-202501141602-ndo23.png)
Actions #9

Updated by gpathak 4 days ago

  • File deleted (qa-power8-crash)
Actions #10

Updated by gpathak 4 days ago

  • File deleted (crash-qa-power8)
Actions #11

Updated by gpathak 4 days ago

  • Tracker changed from coordination to action
Actions #12

Updated by okurz 3 days ago

It sure sounds like broken hardware but did those errors only appear in a newer, unstable kernel?

Actions #13

Updated by gpathak 3 days ago

okurz wrote in #note-12:

It sure sounds like broken hardware but did those errors only appear in a newer, unstable kernel?

Oh, I totally missed this point. The crash did not happen on the older kernel, which is why I never checked the boot logs for this message.
Could it be due to changed SCSI firmware in newer kernel/OS releases?

Actions #14

Updated by okurz 3 days ago

What I consider more likely is that the part handling how CPU and memory are addressed is buggy, which might cause these I/O errors as symptoms.
