action #108671
closed
Resilient IPMI recovery of o3 machines in monitor-o3 size:M
Description
Observation
The corresponding GitLab pipeline failed
Acceptance criteria
- AC1: Temporary loss of IPMI connectivity doesn't fail the pipeline
Suggestions
- Check if machine is actually online or needs recovery
- Retry the IPMI command in the monitor-o3 pipeline if it fails, e.g. do what we do in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L23 or https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50/diffs#587d266bb27a4dc3022bbed44dfa19849df3044c_36_40
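The retry suggestion could look roughly like the following. This is only a minimal sketch and not the code from the linked scripts: `run_with_retry` and its parameters are hypothetical names, and it assumes the pipeline invokes the IPMI tool as an external command whose exit code signals success.

```python
import subprocess
import time


def run_with_retry(command, attempts=3, delay=5.0, run=subprocess.run, sleep=time.sleep):
    """Run `command` until it exits with 0 or `attempts` tries are used up.

    Returns True on success and False if every attempt failed.
    `run` and `sleep` are injectable so the logic can be tested without IPMI.
    """
    for attempt in range(1, attempts + 1):
        try:
            returncode = run(command).returncode
        except OSError as error:
            # The command could not be started at all (e.g. binary missing).
            returncode = None
            print(f"attempt {attempt}/{attempts} could not start: {error}")
        if returncode == 0:
            return True
        print(f"attempt {attempt}/{attempts} failed (exit code {returncode})")
        if attempt < attempts:
            # Give a temporarily unreachable BMC some time before retrying.
            sleep(delay)
    return False
```

A call such as `run_with_retry(["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, "power", "status"])` would then only report failure once all attempts are exhausted; `host`, `user` and `password` are placeholders here.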
Updated by livdywan over 2 years ago
- Copied from action #107917: Recovery of imagetester via IPMI failed size:M added
Updated by nicksinger over 2 years ago
I don't think this ticket is workable as it is. There are more recent runs of the pipeline which work perfectly fine: https://gitlab.suse.de/openqa/monitor-o3/-/jobs/889023
Would you mind updating this ticket here to reflect the idea of @mkittler to implement a retry (https://progress.opensuse.org/issues/107917#note-21)?
Updated by okurz over 2 years ago
- Subject changed from Recovery of openqaworker1 via IPMI failed size:M to Resilient IPMI recovery of o3 machines in monitor-o3 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
- Status changed from Workable to Feedback
SR to add a retry: https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/4
Note that there's another flaw (which I haven't changed so far): If o3 is down we will needlessly power cycle all (physical) worker hosts.
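To illustrate the flaw, a guard along these lines would avoid it. This is a hedged sketch only: `hosts_to_recover` is a hypothetical helper, and it assumes the monitor first learns whether o3 itself answered and then gets a per-worker online status from it. How monitor-o3 really determines this is not shown in this ticket.

```python
def hosts_to_recover(o3_reachable, worker_online):
    """Return the names of workers that should be power cycled.

    o3_reachable: True if o3 itself answered the status query
    worker_online: mapping of worker name -> True if o3 reports it online
    """
    if not o3_reachable:
        # Without o3 we cannot tell which workers are really offline,
        # so do not power cycle anything.
        return []
    return sorted(name for name, online in worker_online.items() if not online)
```

With o3 down the function returns an empty list regardless of the worker mapping, so no host is touched until o3 is back.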
Updated by mkittler over 2 years ago
- Status changed from Feedback to In Progress
I've merged the SR and also a 2nd one to avoid the problem mentioned in the previous comment.
I've tested the different cases locally (using power status or invalid IPMI commands instead of power cycle), so it should work.
I've also just retriggered a pipeline, but it cannot find the script anymore, so I'll have a look at what's wrong.
Updated by mkittler over 2 years ago
- Status changed from In Progress to Resolved
Updated by livdywan over 2 years ago
- Status changed from Resolved to Feedback
mkittler wrote:
The good case is now working with https://gitlab.suse.de/openqa/monitor-o3/-/commit/1b659e6cd2d3f11a352b2712892ac54a385be38b: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706
I don't think so...
https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350704
/scripts-5544-902568/step_script: line 157: ./monitor-and-recover: No such file or directory
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
But since then @okurz added the mentioned commit and the re-triggered pipeline works: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706