action #108671

closed

Resilient IPMI recovery of o3 machines in monitor-o3 size:M

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2022-03-07
Due date:
% Done: 0%
Estimated time:

Description

Observation

The corresponding GitLab pipeline failed

Acceptance criteria

  • AC1: Temporary loss of IPMI connectivity doesn't fail the pipeline

Suggestions


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #107917: Recovery of imagetester via IPMI failed size:M (Resolved, mkittler, 2022-03-07, 2022-03-26)

Actions #1

Updated by livdywan over 2 years ago

  • Copied from action #107917: Recovery of imagetester via IPMI failed size:M added
Actions #2

Updated by nicksinger over 2 years ago

I don't think this ticket is workable as it is. There are more recent runs of the pipeline which work perfectly fine: https://gitlab.suse.de/openqa/monitor-o3/-/jobs/889023
Would you mind updating this ticket to reflect @mkittler's idea of implementing a retry (https://progress.opensuse.org/issues/107917#note-21)?

Actions #3

Updated by livdywan over 2 years ago

  • Description updated (diff)
Actions #4

Updated by okurz over 2 years ago

  • Subject changed from Recovery of openqaworker1 via IPMI failed size:M to Resilient IPMI recovery of o3 machines in monitor-o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler over 2 years ago

  • Assignee set to mkittler
Actions #6

Updated by mkittler over 2 years ago

  • Status changed from Workable to Feedback

SR to add a retry: https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/4

Note that there's another flaw (which I haven't fixed so far): if o3 is down, we will needlessly power cycle all (physical) worker hosts.
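For readers without access to the merge request, the retry idea can be sketched roughly as follows. This is a minimal illustration, not the code from the SR: the function name `run_with_retry`, its parameters and the default values are hypothetical, and the use of `ipmitool` in the usage note is an assumption, since the ticket does not say which IPMI client monitor-o3 calls.

```python
import subprocess
import time


def run_with_retry(command, attempts=3, delay=5.0,
                   run=subprocess.run, sleep=time.sleep):
    """Run `command` until it exits with status 0 or attempts run out.

    `run` and `sleep` are injectable so the logic can be exercised
    without touching real hardware. Returns True on success, False if
    every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        result = run(command)
        if result.returncode == 0:
            return True
        # Wait before the next try, but not after the final failure.
        if attempt < attempts:
            sleep(delay)
    return False
```

Assuming `ipmitool` is the client, a call could look like `run_with_retry(["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, "chassis", "power", "cycle"])`, where `host`, `user` and `password` are placeholders. A short loss of IPMI connectivity then costs only one or two retries instead of failing the pipeline, which is what AC1 asks for.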

Actions #7

Updated by mkittler over 2 years ago

  • Status changed from Feedback to In Progress

I've merged the SR and also a 2nd one to avoid the problem mentioned in the previous comment.

I've tested the different cases locally (using power status or invalid IPMI commands instead of power cycle) so it should work.

I've also just retriggered a pipeline, but it can no longer find the script, so I'll look into what's wrong.
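The guard against needlessly power cycling workers while o3 itself is down could look something like the sketch below. It is only an illustration of the idea, not the second SR: the function `hosts_to_power_cycle` and the shape of its inputs are hypothetical, and how monitor-o3 actually determines whether o3 or a worker is reachable is not described in this ticket.

```python
def hosts_to_power_cycle(o3_reachable, worker_online):
    """Return the worker hosts that should be power cycled.

    `worker_online` maps a host name to True (online) or False (offline).
    If o3 itself is unreachable, the worker state cannot be trusted, so
    no host is selected for recovery.
    """
    if not o3_reachable:
        return []
    return sorted(host for host, online in worker_online.items() if not online)
```

For example, `hosts_to_power_cycle(True, {"openqaworker1": False, "imagetester": True})` selects only `openqaworker1`, while the same call with `o3_reachable=False` selects nothing. For local testing without rebooting machines, the selected hosts can be fed to a harmless command such as a power status query instead of a power cycle.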

Actions #9

Updated by livdywan over 2 years ago

  • Status changed from Resolved to Feedback

mkittler wrote:

The good case is now working with https://gitlab.suse.de/openqa/monitor-o3/-/commit/1b659e6cd2d3f11a352b2712892ac54a385be38b: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706

I don't think so...

https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350704

/scripts-5544-902568/step_script: line 157: ./monitor-and-recover: No such file or directory
Actions #10

Updated by mkittler over 2 years ago

  • Status changed from Feedback to Resolved

But since then @okurz added the mentioned commit and the re-triggered pipeline works: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706
