action #108671

Resilient IPMI recovery of o3 machines in monitor-o3 size:M

Added by cdywan 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2022-03-07
Due date:
% Done: 0%
Estimated time:

Description

Observation

The corresponding GitLab pipeline failed

Acceptance criteria

  • AC1: Temporary loss of IPMI connectivity doesn't fail the pipeline

Suggestions


Related issues

Copied from openQA Infrastructure - action #107917: Recovery of imagetester via IPMI failed size:M | Resolved | 2022-03-07 | 2022-03-26

History

#1 Updated by cdywan 3 months ago

  • Copied from action #107917: Recovery of imagetester via IPMI failed size:M added

#2 Updated by nicksinger 3 months ago

I don't think this ticket is workable as it is. There are more recent runs of the pipeline which work perfectly fine: https://gitlab.suse.de/openqa/monitor-o3/-/jobs/889023
Would you mind updating this ticket to reflect mkittler's idea of implementing a retry (https://progress.opensuse.org/issues/107917#note-21)?

#3 Updated by cdywan 3 months ago

  • Description updated (diff)

#4 Updated by okurz 3 months ago

  • Subject changed from Recovery of openqaworker1 via IPMI failed size:M to Resilient IPMI recovery of o3 machines in monitor-o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#6 Updated by mkittler 3 months ago

  • Status changed from Workable to Feedback

SR to add a retry: https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/4

Note that there's another flaw (which I haven't addressed so far): if o3 is down, we will needlessly power cycle all (physical) worker hosts.
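For illustration only, the retry idea could look roughly like the sketch below. It is not the code from the linked merge request; the helper name `run_with_retry`, the attempt count, and the delay are assumptions made up for this example. The command is retried until it exits with 0 or the attempts run out, so a short loss of IPMI connectivity does not immediately fail the job (AC1).

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=10.0, runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0 or the attempts are exhausted.

    Returns True on success and False if every attempt failed.
    `runner` and `sleep` are injectable so the logic can be tested
    without real IPMI hardware. All defaults are hypothetical.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}/{attempts} failed with exit code {result.returncode}")
        if attempt < attempts:
            sleep(delay)  # give a flaky IPMI connection time to come back
    return False


# Hypothetical usage (host name and credentials are placeholders):
# run_with_retry(["ipmitool", "-I", "lanplus", "-H", "worker-ipmi.example",
#                 "-U", "user", "-P", "secret", "chassis", "power", "cycle"])
```

A fixed delay keeps the sketch short; a real pipeline might prefer an increasing delay or a retry at the CI job level instead.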

#7 Updated by mkittler 3 months ago

  • Status changed from Feedback to In Progress

I've merged the SR and also a 2nd one to avoid the problem mentioned in the previous comment.

I've tested the different cases locally (using power status or invalid IPMI commands instead of power cycle) so it should work.
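As a rough sketch of the two ideas above (not the merged code): recovery is skipped entirely when o3 itself is unreachable, and a dry-run switch substitutes the harmless `power status` for `power cycle` during local testing. The function name `plan_recovery` and its inputs are assumptions for this example; how the real script determines that o3 or a worker is down is not stated in this ticket.

```python
def plan_recovery(o3_reachable, worker_healthy, dry_run=False):
    """Return a dict mapping worker host -> IPMI subcommand to run.

    o3_reachable:   whether the o3 web UI/API answered (hypothetical input)
    worker_healthy: dict of host name -> bool (hypothetical input)
    dry_run:        use "power status" instead of "power cycle"
    """
    if not o3_reachable:
        # With o3 down, every worker looks broken; power cycling them would be pointless.
        return {}
    action = ["power", "status"] if dry_run else ["power", "cycle"]
    return {host: list(action) for host, healthy in worker_healthy.items() if not healthy}
```

Keeping the decision separate from the actual `ipmitool` call makes each case (o3 down, one worker down, dry run) checkable without touching real hardware.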

I've also just retriggered a pipeline, but it can no longer find the script, so I'll look into what's wrong.

#9 Updated by cdywan 3 months ago

  • Status changed from Resolved to Feedback

mkittler wrote:

The good case is now working with https://gitlab.suse.de/openqa/monitor-o3/-/commit/1b659e6cd2d3f11a352b2712892ac54a385be38b: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706

I don't think so...

https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350704

/scripts-5544-902568/step_script: line 157: ./monitor-and-recover: No such file or directory

#10 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

But since then okurz added the mentioned commit and the re-triggered pipeline works: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706
