Resilient IPMI recovery of o3 machines in monitor-o3 size:M
The corresponding GitLab pipeline failed
- AC1: Temporary loss of IPMI connectivity doesn't fail the pipeline
- Check if machine is actually online or needs recovery
- Retry the IPMI call in the monitor-o3 pipeline if it fails, e.g. do what we do in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L23 or https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50/diffs#587d266bb27a4dc3022bbed44dfa19849df3044c_36_40
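The retry suggested above could look roughly like the following. This is only a minimal sketch, not the code from the linked scripts: the function name run_with_retry, its parameters and the linear back-off are assumptions made for illustration.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=1.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0 or the attempts are used up.

    Hypothetical helper: name, defaults and back-off are illustrative.
    Returns True on success, False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = runner(cmd, capture_output=True, text=True, timeout=60)
            if result.returncode == 0:
                return True
        except (subprocess.TimeoutExpired, OSError):
            # A hanging or missing command counts as a failed attempt.
            pass
        if attempt < attempts:
            # Wait a bit longer after each failure before trying again.
            sleep(delay * attempt)
    return False
```

A caller would pass the IPMI command as an argument list, for example ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, "power", "status"], and only fail the pipeline if run_with_retry returns False. That way a short loss of IPMI connectivity costs a few seconds instead of a red pipeline.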
#2 Updated by nicksinger 3 months ago
I don't think this ticket is workable as it is. There are more recent runs of the pipeline which work perfectly fine: https://gitlab.suse.de/openqa/monitor-o3/-/jobs/889023
Would you mind updating this ticket here to reflect the idea of mkittler to implement a retry (https://progress.opensuse.org/issues/107917#note-21)?
- Status changed from Workable to Feedback
SR to add a retry: https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/4
Note that there's another flaw (which I haven't changed so far): If o3 is down we will needlessly power cycle all (physical) worker hosts.
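One way to avoid this flaw is to check the server first and skip recovery entirely when it is unreachable. The following is a sketch under assumptions, not the change that went into monitor-o3: the function names, the URL parameter and the dict of worker states are all hypothetical.

```python
import urllib.error
import urllib.request


def server_reachable(url, timeout=10):
    """Return True if an HTTP request to url succeeds (hypothetical check)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


def hosts_to_power_cycle(server_up, workers_online):
    """Pick the worker hosts that should be recovered.

    workers_online maps a host name to True (online) or False (offline),
    as seen from the server. If the server itself is down, that view is
    meaningless, so nothing is power cycled.
    """
    if not server_up:
        return []
    return sorted(host for host, online in workers_online.items() if not online)
```

With this split, the decision logic can be tested without any network or IPMI access, e.g. hosts_to_power_cycle(False, {"worker1": False}) returns an empty list.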
- Status changed from Feedback to In Progress
I've merged the SR and also a 2nd one to avoid the problem mentioned in the previous comment.
I've tested the different cases locally (using power status or invalid IPMI commands instead of power cycle), so it should work.
I've also just retriggered a pipeline but it can no longer find the script, so I'll look into what's wrong.
- Status changed from In Progress to Resolved
- Status changed from Resolved to Feedback
I don't think so...
/scripts-5544-902568/step_script: line 157: ./monitor-and-recover: No such file or directory
- Status changed from Feedback to Resolved
But since then okurz added the mentioned commit and the re-triggered pipeline works: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706