action #162293

open

coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

SMART errors on bootup of w31+w32, possibly more

Added by okurz about 1 month ago. Updated 30 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2024-06-14
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

While struggling with w31+w32, which upgraded themselves to Leap 15.6 and then crashed multiple times after booting into kernel 6.4, we observed SMART errors early during bootup. These errors might explain the kernel crashes or might be a separate problem. We downgraded to Leap 15.5 for now and took it out of production, but it still runs as an openQA worker.

Acceptance criteria

  • AC1: w31 boots up fine without SMART errors

Steps to reproduce

  • reboot worker31 and then follow the output on ssh -t jumpy@qe-jumpy.prg2.suse.org "ipmitool -I lanplus -H openqaworker31.qe-ipmi-ur -U … -P … sol activate"
  • observe SMART errors very early during firmware initialization
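
The SMART messages scroll by quickly during firmware initialization, so it can help to record the serial-over-LAN session and filter it afterwards. The following is a minimal sketch, not an established procedure: the helper name `filter_smart_lines`, the log file name `sol.log` and the match pattern are assumptions, and the exact wording of the firmware messages is not known from this ticket.

```shell
# Hypothetical helper: print only the lines of a console log that mention
# SMART, written either as "SMART" or "S.M.A.R.T." (case-insensitive).
filter_smart_lines() {
    # Serial console output usually contains carriage returns; drop them first.
    # "|| true" keeps the exit status at 0 when nothing matches.
    tr -d '\r' | grep -i -E 's\.?m\.?a\.?r\.?t' || true
}

# Usage sketch, left commented out because the credentials are elided in the ticket:
# ssh -t jumpy@qe-jumpy.prg2.suse.org "ipmitool -I lanplus -H openqaworker31.qe-ipmi-ur -U … -P … sol activate" | tee sol.log
# filter_smart_lines < sol.log
```

The pattern also matches unrelated words such as `smartd`, so the filtered output still needs a manual look.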

Suggestions

  • Check the content of /var/crash and clean up after investigation
  • Check the status of SMART from the running Linux system and then also the messages on bootup
  • Crosscheck the SMART status on other salt-controlled machines; the same errors were observed at least on w32
  • Consider replacing defective hardware
  • Ensure that no services are in a failed state anymore
  • Bring back the system into production
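
As a possible starting point for the SMART check from the running system, the sketch below decodes the exit status of `smartctl -H`, which is a bit mask as described in smartctl(8). The function names `decode_smartctl_status` and `check_all_disks` are hypothetical, and the sketch assumes that smartmontools is installed and that `sudo` is available on the worker.

```shell
# Translate a smartctl exit status (bit mask, see smartctl(8)) into messages.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    bit=0
    # One message per bit, in order from bit 0 to bit 7.
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "SMART or ATA command failed" \
        "SMART status: DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "$msg"
        fi
        bit=$((bit + 1))
    done
}

# Assumption: run the health check on every device that smartctl detects.
check_all_disks() {
    for dev in $(sudo smartctl --scan | awk '{print $1}'); do
        sudo smartctl -H "$dev" >/dev/null
        rc=$?
        echo "$dev: $(decode_smartctl_status "$rc")"
    done
}
```

Calling `check_all_disks` on w31 and comparing the result with a healthy worker might show whether the firmware messages correspond to a drive that Linux also reports as failing. Failed services and leftover crash dumps can be checked separately with `systemctl --failed` and `ls /var/crash`.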

Rollback steps

  • hostname=worker31.oqa.prg2.suse.org; ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
  • ssh osd "sudo salt 'worker31.oqa.prg2.suse.org' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"

Related issues 2 (2 open, 0 closed)

  • Copied from openQA Project - action #157975: Upgrade osd workers to openSUSE Leap 15.6 (Blocked, okurz)
  • Copied to openQA Project - action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S (Blocked, dheidler, 2024-06-14)

#1

Updated by okurz about 1 month ago

  • Copied from action #157975: Upgrade osd workers to openSUSE Leap 15.6 added
#2

Updated by okurz about 1 month ago

  • Description updated (diff)
  • Priority changed from Normal to High
#3

Updated by okurz about 1 month ago

  • Subject changed from SMART errors on bootup of w31 to SMART errors on bootup of w31+w32, possibly more
  • Description updated (diff)
#4

Updated by okurz about 1 month ago

  • Copied to action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
#5

Updated by okurz about 1 month ago

Also observed on w34

#6

Updated by okurz 30 days ago

  • Priority changed from High to Normal
#7

Updated by okurz 30 days ago

  • Target version changed from Ready to Tools - Next