action #122158: [alert] qa-power8-4-kvm host up alert - machine not up, nothing obvious on SoL but IPMI works size:M - openQA Infrastructure (public) - openSUSE Project Management Tool


action #122158

closed

openQA Project (public) - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

action #114565: recover qa-power8-4+qa-power8-5 size:M

[alert] qa-power8-4-kvm host up alert - machine not up, nothing obvious on SoL but IPMI works size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

High

Assignee:

dheidler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2022-12-19

Due date:

% Done:

Estimated time:

Tags:

infra, reactive work

Description

Observation¶

The alert https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert has been failing since 2022-12-18 03:00, so it was the weekly reboot that triggered this; see https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert&orgId=1&from=1671320482076&to=1671348249406 for details. The machine is reachable over IPMI, but SoL does not show anything obvious.

Suggestions¶

Pause related alert(s)
Remove from salt
Trigger OSD deployment which failed due to this
Trigger reboot
Look into the other recent related tickets, e.g. about worker cache sqlite lookup something
#114565 and other related

Rollback steps¶

Bring back to salt
Unpause alert after resolution
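The "Remove from salt" suggestion and the "Bring back to salt" rollback step can be sketched as the following pair of `salt-key` calls, to be run on the salt master. This is only an illustration: the helper name `salt_key_argv` and the minion ID `qa-power8-4-kvm.qa.suse.de` are assumptions and not taken from this ticket.

```python
from typing import List


def salt_key_argv(action: str, minion_id: str) -> List[str]:
    """Build a salt-key command line for removing or re-accepting a minion key.

    `action` is a hypothetical label used only in this sketch:
    "remove" maps to `salt-key -d`, "accept" maps to `salt-key -a`.
    `-y` answers the confirmation prompt automatically.
    """
    flags = {"remove": "-d", "accept": "-a"}
    if action not in flags:
        raise ValueError(f"unknown action: {action!r}")
    return ["salt-key", "-y", flags[action], minion_id]


if __name__ == "__main__":
    minion = "qa-power8-4-kvm.qa.suse.de"  # assumed minion ID
    print(" ".join(salt_key_argv("remove", minion)))
    print(" ".join(salt_key_argv("accept", minion)))
```

The commands are only printed here, so they can be reviewed before anything is executed on the salt master.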


Updated by okurz over 2 years ago

Description updated (diff)
Priority changed from Normal to High


Updated by okurz over 2 years ago

paused alert, removed from salt, triggered https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/554010


Updated by livdywan over 2 years ago

Deployment succeeded.
power8-4 unexpectedly came back up.


Updated by dheidler over 2 years ago

Status changed from Workable to In Progress
Assignee set to dheidler

Will test reboot stability as described in https://progress.opensuse.org/projects/openqav3/wiki/#Best-practices-for-infrastructure-work
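A reboot stability test of this kind could look roughly like the sketch below. It is not the procedure from the wiki, just a minimal loop under stated assumptions: the function name `check_reboot_stability` is made up, and the caller supplies `trigger_reboot` and `is_up` (for example an SSH reboot command and a ping check).

```python
import time
from typing import Callable


def check_reboot_stability(
    trigger_reboot: Callable[[], None],
    is_up: Callable[[], bool],
    runs: int = 30,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Reboot the host `runs` times and return how many reboots succeeded.

    Stops at the first reboot after which the host does not come back
    within `timeout_s` seconds.
    """

    def wait_for(state: bool) -> bool:
        # Poll until is_up() reports the wanted state or the timeout expires.
        deadline = clock() + timeout_s
        while clock() < deadline:
            if is_up() == state:
                return True
            sleep(poll_s)
        return False

    successes = 0
    for run in range(1, runs + 1):
        trigger_reboot()
        # The host must first go down and then come back up; otherwise a
        # reboot that never happened would be counted as a success.
        if not wait_for(False) or not wait_for(True):
            print(f"run {run}: host did not come back within {timeout_s}s")
            break
        successes += 1
    return successes
```

A result equal to `runs` means every reboot in the series came back within the timeout. The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting in real time.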


Updated by openqa_review over 2 years ago

Due date set to 2023-01-04

Setting due date based on mean cycle time of SUSE QE Tools


Updated by dheidler over 2 years ago

Status changed from In Progress to Feedback

Reboot seems to be stable.
So added that machine back to salt, which worked fine.
And according to https://stats.openqa-monitor.qa.suse.de/alerting/list the alerts were already enabled.


Updated by okurz over 2 years ago

dheidler wrote:

Reboot seems to be stable.

over how many tries? 30?

So added that machine back to salt, which worked fine.
And according to https://stats.openqa-monitor.qa.suse.de/alerting/list the alerts were already enabled.

OK for now. We will treat the machine as "stable" and will be informed by failing alerts if the problem persists. I assume that some recent update may also have made Leap 15.3 ppc machines less stable. If that is true, we could add all our ppc machines to the automatic recovery, the same as we already have for qa-power8-5, powerqaworker-qam-1 and the aarch64 machines.

cdywan wrote:

Deployment succeeded.

power8-4 unexpectedly came back up.

For additional information: It did not come up "unexpectedly"; nsinger triggered a reboot over IPMI, no magic done :) He also checked SoL and found that nothing is visible there and that the machine does not react to any input on SoL.
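For reference, the IPMI steps mentioned here (checking power state, power cycling, attaching to SoL) could be expressed with `ipmitool` roughly as follows. This is a sketch, not the exact commands nsinger ran: the BMC host name and credentials are placeholders, and whether a power cycle or a reset was used is not stated in this ticket.

```python
from typing import List

# ipmitool subcommands used in this sketch
POWER_STATUS = ("chassis", "power", "status")
POWER_CYCLE = ("chassis", "power", "cycle")
SOL_ACTIVATE = ("sol", "activate")


def ipmitool_argv(bmc_host: str, user: str, password: str, *subcommand: str) -> List[str]:
    """Build an ipmitool command line using the lanplus (IPMI v2.0) interface."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user, "-P", password,
        *subcommand,
    ]


if __name__ == "__main__":
    # Placeholder BMC address and credentials, not the real ones.
    bmc, user, password = "bmc.example", "USER", "PASSWORD"
    for sub in (POWER_STATUS, POWER_CYCLE, SOL_ACTIVATE):
        print(" ".join(ipmitool_argv(bmc, user, password, *sub)))
```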


Updated by dheidler over 2 years ago

okurz wrote:

over how many tries? 30?

yes


Updated by okurz over 2 years ago

Tags changed from infra to infra, reactive work
Due date deleted (~~2023-01-04~~)
Status changed from Feedback to Resolved

The worker is back in production and the corresponding alerts are OK again. Only the "package loss alert" is still paused due to #121282.
