action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

action #162293

closed

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

SMART errors on bootup of worker31, worker32 and worker34 size:M

Added by okurz 9 months ago. Updated 6 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

nicksinger

Category:

Regressions/Crashes

Target version:

openQA Project (public) - Ready

Start date:

2024-06-14

Due date:

% Done:

Estimated time:

Tags:

infra

Description

Observation¶

Struggling with worker31, worker32 and worker34 that upgraded themselves to Leap 15.6 and then crashed multiple times after booting into kernel 6.4 we observed that early during bootup there were SMART errors shown. Possibly this might explain kernel crashes or might be separate errors. We downgraded to Leap 15.5 for now and took it out of production but still run as openQA worker.

Acceptance criteria¶

AC1: w31 boots up fine without SMART errors
AC2: w32 boots up fine without SMART errors
AC3: w33 boots up fine without SMART errors

Steps to reproduce¶

reboot worker31 and then follow the output on ssh -t jumpy@qe-jumpy.prg2.suse.org "ipmitool -I lanplus -H openqaworker31.qe-ipmi-ur -U … -P … sol activate"
observe SMART errors very early during firmware initialization

Suggestions¶

Check the content of /var/crash and clean up after investigation
Check the status of SMART from the running Linux system and then also the messages on bootup
Crosscheck the SMART status on other salt controlled machines, at least observed the same on w32
Consider replacing defective hardware
Ensure no failed services again
Bring back the system into production

Rollback steps¶

hostname=worker31.oqa.prg2.suse.org ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
ssh osd "worker31.oqa.prg2.suse.org' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"

Related issues 4 (0 open — 4 closed)

Actions

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #162293

SMART errors on bootup of worker31, worker32 and worker34 size:M

Observation¶

Acceptance criteria¶

Steps to reproduce¶

Suggestions¶

Rollback steps¶

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by livdywan 7 months ago

Updated by livdywan 7 months ago

Updated by livdywan 7 months ago

Updated by okurz 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by openqa_review 7 months ago

Updated by nicksinger 7 months ago

Updated by mkittler 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by okurz 7 months ago

Updated by okurz 7 months ago

Updated by nicksinger 7 months ago

Updated by okurz 7 months ago

Updated by livdywan 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by openqa_review 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by livdywan 7 months ago

Updated by livdywan 7 months ago

Updated by livdywan 7 months ago

Updated by livdywan 7 months ago · Edited

Updated by nicksinger 6 months ago

Updated by livdywan 6 months ago

Updated by okurz 6 months ago