action #122158
closed
openQA Project (public) - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4
action #114565: recover qa-power8-4+qa-power8-5 size:M
[alert] qa-power8-4-kvm host up alert - machine not up, nothing obvious on SoL but IPMI works size:M
Description
Observation
The alert at https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert has been failing since 2022-12-18 03:00, so the weekly reboot is what triggered this. See https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert&orgId=1&from=1671320482076&to=1671348249406 for details. The machine is reachable over IPMI but SoL does not show anything obvious.
Suggestions
- Pause related alert(s)
- Remove from salt
- Trigger OSD deployment which failed due to this
- Trigger reboot
- Look into the other recent related tickets, e.g. about worker cache sqlite lookup something
- #114565 and other related
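The "Trigger reboot" suggestion above can be sketched as a small wrapper around `ipmitool`. This is only an illustration, assuming the BMC speaks IPMI over LAN (`lanplus`); the BMC host name and credentials are hypothetical placeholders and not taken from this ticket.

```python
import subprocess
from typing import List, Optional


def ipmi_command(bmc_host: str, user: str, password: str, *action: str) -> List[str]:
    """Build an ipmitool command line for the given BMC without running it."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, *action]


def run_ipmi(bmc_host: str, user: str, password: str, *action: str,
             dry_run: bool = True) -> Optional[str]:
    """Run an ipmitool action; with dry_run=True nothing is executed."""
    cmd = ipmi_command(bmc_host, user, password, *action)
    if dry_run:
        return None
    # check=True raises CalledProcessError if ipmitool exits non-zero
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Hypothetical BMC address and credentials, for illustration only
BMC = "qa-power8-4-bmc.example"
status_cmd = ipmi_command(BMC, "ADMIN", "secret", "power", "status")
cycle_cmd = ipmi_command(BMC, "ADMIN", "secret", "power", "cycle")
sol_cmd = ipmi_command(BMC, "ADMIN", "secret", "sol", "activate")
```

`power status` only reports the chassis power state, `power cycle` forces the reboot, and `sol activate` attaches to the serial console to watch the boot. `run_ipmi` defaults to a dry run so that nothing is executed by accident.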
Rollback steps
- Bring back to salt
- Unpause alert after resolution
Updated by okurz about 2 years ago
- Description updated
- Priority changed from Normal to High
Updated by okurz about 2 years ago
paused alert, removed from salt, triggered https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/554010
Updated by livdywan about 2 years ago
- Deployment succeeded.
- power8-4 unexpectedly came back up.
Updated by dheidler about 2 years ago
- Status changed from Workable to In Progress
- Assignee set to dheidler
Will test reboot stability as described in https://progress.opensuse.org/projects/openqav3/wiki/#Best-practices-for-infrastructure-work
Updated by openqa_review about 2 years ago
- Due date set to 2023-01-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler about 2 years ago
- Status changed from In Progress to Feedback
Reboot seems to be stable.
So added that machine back to salt, which worked fine.
And according to https://stats.openqa-monitor.qa.suse.de/alerting/list the alerts were already enabled.
Updated by okurz about 2 years ago
dheidler wrote:
> Reboot seems to be stable.
over how many tries? 30?
> So added that machine back to salt, which worked fine.
> And according to https://stats.openqa-monitor.qa.suse.de/alerting/list the alerts were already enabled.
ok for now. We will treat the machine as "stable" and will be informed by failing alerts if the problem persists. I assume that maybe some recent update also made Leap 15.3 ppc machines less stable. If that is true we could add all our ppc machines to the automatic recovery same as we already have for qa-power8-5 and powerqaworker-qam-1 and aarch64 machines.
cdywan wrote:
> - Deployment succeeded.
> - power8-4 unexpectedly came back up.
For additional information: It did not come up "unexpectedly" but nsinger triggered a reboot over IPMI, no magic done :) He also checked SoL and found that there is nothing visible there and that the machine does not react to any action on SoL.
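The reboot stability check discussed above, including the question of how many tries it covered, could look roughly like the following. This is a minimal sketch and not the procedure from the linked wiki: the `reboot` and `is_up` callables are hypothetical hooks supplied by the caller, and the defaults (30 tries, 600 s timeout, 60 s settle time) are assumptions.

```python
import time
from typing import Callable


def reboot_stability(
    reboot: Callable[[], None],
    is_up: Callable[[], bool],
    tries: int = 30,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    settle_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Reboot up to `tries` times; return how many reboots the host survived."""
    successes = 0
    for _ in range(tries):
        reboot()
        # Give the host time to actually go down, otherwise is_up() may
        # still report the old, pre-reboot state (use e.g. settle_s=60).
        sleep(settle_s)
        deadline = clock() + timeout_s
        while clock() < deadline:
            if is_up():
                successes += 1
                break
            sleep(poll_s)
        else:
            # Host did not come back in time: stop so it can be inspected.
            break
    return successes
```

In practice `reboot` might run `sudo reboot` over ssh and `is_up` might try to open the ssh port; both are left to the caller here so the loop can be exercised without touching a real machine. A result equal to `tries` would answer the "over how many tries?" question directly.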
Updated by okurz about 2 years ago
- Tags changed from infra to infra, reactive work
- Due date deleted (2023-01-04)
- Status changed from Feedback to Resolved
The worker is back in production and the corresponding alerts are OK again. Only the "package loss alert" is still paused due to #121282