action #122158

closed

openQA Project (public) - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

action #114565: recover qa-power8-4+qa-power8-5 size:M

[alert] qa-power8-4-kvm host up alert - machine not up, nothing obvious on SoL but IPMI works size:M

Added by okurz about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2022-12-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

The alert https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert has been failing since 2022-12-18 03:00, so the weekly reboot triggered this; see https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert&orgId=1&from=1671320482076&to=1671348249406 for details. The machine is reachable over IPMI, but SoL does not show anything obvious.

Suggestions

  • Pause related alert(s)
  • Remove from salt
  • Trigger OSD deployment which failed due to this
  • Trigger reboot
  • Look into the other recent related tickets, e.g. the one about worker cache sqlite lookups
  • #114565 and other related
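
The "check over IPMI" and "trigger reboot" steps above could be sketched as follows. This is only an illustration, assuming `ipmitool` is installed and the BMC accepts the `lanplus` interface; the BMC hostname and credentials are hypothetical placeholders, not values from this ticket. By default the sketch only prints the commands instead of executing them.

```python
import subprocess


def ipmi_command(host, user, password, *action):
    """Build an ipmitool command line for the BMC at `host`."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *action]


def run(cmd, dry_run=True):
    """Return the command as a string (dry run) or execute it and return stdout."""
    if dry_run:
        return " ".join(cmd)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Hypothetical BMC address and credentials, replace with real ones.
BMC = ("qa-power8-4-bmc.example.org", "ADMIN", "secret")

# Query the power state, then power-cycle the machine.
power_status = ipmi_command(*BMC, "chassis", "power", "status")
power_cycle = ipmi_command(*BMC, "chassis", "power", "cycle")
# Attach to the serial-over-LAN console to watch the boot (interactive).
sol_console = ipmi_command(*BMC, "sol", "activate")

for cmd in (power_status, power_cycle, sol_console):
    print(run(cmd))
```

Keeping `dry_run=True` as the default makes it harder to power-cycle a production worker by accident; pass `dry_run=False` only for the non-interactive commands.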

Rollback steps

  • Bring back to salt
  • Unpause alert after resolution
Actions #1

Updated by okurz about 2 years ago

  • Description updated (diff)
  • Priority changed from Normal to High
Actions #2

Updated by okurz about 2 years ago

Paused the alert, removed the machine from salt and triggered https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/554010

Actions #3

Updated by livdywan about 2 years ago

  • Deployment succeeded.
  • power8-4 unexpectedly came back up.
Actions #4

Updated by dheidler about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #5

Updated by openqa_review about 2 years ago

  • Due date set to 2023-01-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by dheidler about 2 years ago

  • Status changed from In Progress to Feedback

Reboot seems to be stable.
So added that machine back to salt, which worked fine.
And according to https://stats.openqa-monitor.qa.suse.de/alerting/list the alerts were already enabled.

Actions #7

Updated by okurz about 2 years ago

dheidler wrote:

Reboot seems to be stable.

over how many tries? 30?

So added that machine back to salt, which worked fine.
And according to https://stats.openqa-monitor.qa.suse.de/alerting/list the alerts were already enabled.

OK for now. We will treat the machine as "stable" and will be informed by failing alerts if the problem persists. I assume that some recent update may also have made Leap 15.3 ppc machines less stable. If that is true, we could add all our ppc machines to the automatic recovery, the same as we already have for qa-power8-5, powerqaworker-qam-1 and the aarch64 machines.

cdywan wrote:

  • Deployment succeeded.
  • power8-4 unexpectedly came back up.

For additional information: it did not come up "unexpectedly"; nsinger triggered a reboot over IPMI, no magic done :) He also checked SoL and found that nothing is visible there and that the machine does not react to any input on SoL.
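
How the existing automatic recovery works is not described in this ticket. As a rough sketch of the general idea only, a watchdog could count consecutive failed reachability checks and request a power cycle once a threshold is reached. The class name, the threshold of 3 and the use of `ping` are assumptions made for illustration.

```python
import subprocess
from collections import deque


def is_reachable(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` succeeds (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class RecoveryWatch:
    """Track recent check results and decide when to trigger a recovery."""

    def __init__(self, failures_needed=3):
        # Only the last `failures_needed` results are kept.
        self.recent = deque(maxlen=failures_needed)

    def record(self, up):
        """Store one check result; return True if a recovery should be triggered."""
        self.recent.append(up)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and not any(self.recent):
            # Reset so the host gets time to boot before the next trigger.
            self.recent.clear()
            return True
        return False


# Example with simulated results: one success followed by three failures.
watch = RecoveryWatch(failures_needed=3)
for result in (True, False, False, False):
    if watch.record(result):
        print("host down for 3 consecutive checks, power cycle would be triggered")
```

In a real setup `record(is_reachable(host))` would be called periodically, and the trigger would issue an IPMI power cycle like the manual reboot described above.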

Actions #8

Updated by dheidler almost 2 years ago

okurz wrote:

over how many tries? 30?

yes
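
The ticket does not say how the 30 reboots were performed. A minimal sketch of such a stability check, assuming hypothetical `reboot` and `is_up` callables supplied by the caller, could look like this. Note that a real check should also confirm the host actually went down after each reboot request, which this sketch does not do.

```python
import time


def count_successful_reboots(reboot, is_up, tries=30, timeout_s=600, poll_s=10,
                             sleep=time.sleep, clock=time.monotonic):
    """Reboot up to `tries` times; return how many times the host came back.

    Stops at the first attempt where the host does not come back in time.
    """
    successes = 0
    for attempt in range(1, tries + 1):
        reboot()
        deadline = clock() + timeout_s
        came_back = False
        # Poll until the host answers or the timeout expires.
        while clock() < deadline:
            if is_up():
                came_back = True
                break
            sleep(poll_s)
        if not came_back:
            print(f"attempt {attempt}: host did not come back within {timeout_s}s")
            break
        successes += 1
    return successes


# Demonstration with stubs instead of a real machine.
result = count_successful_reboots(reboot=lambda: None, is_up=lambda: True, tries=30)
print(f"{result}/30 reboots succeeded")
```

The `sleep` and `clock` parameters are injectable so the loop can be exercised quickly without waiting for real timeouts.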

Actions #9

Updated by okurz almost 2 years ago

  • Tags changed from infra to infra, reactive work
  • Due date deleted (2023-01-04)
  • Status changed from Feedback to Resolved

The worker is back in production and the corresponding alerts are OK again. Only the "package loss alert" is still paused due to #121282.
