action #108671

closed

Resilient IPMI recovery of o3 machines in monitor-o3 size:M

Added by livdywan about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-03-07
Due date:
% Done: 0%
Estimated time:
Description

Observation

The corresponding GitLab pipeline failed

Acceptance criteria

  • AC1: Temporary loss of IPMI connectivity doesn't fail the pipeline

Suggestions


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #107917: Recovery of imagetester via IPMI failed size:M, Resolved, mkittler, 2022-03-07, 2022-03-26

Actions #1

Updated by livdywan about 2 years ago

  • Copied from action #107917: Recovery of imagetester via IPMI failed size:M added
Actions #2

Updated by nicksinger about 2 years ago

I don't think this ticket is workable as it is. There are more recent runs of the pipeline which work perfectly fine: https://gitlab.suse.de/openqa/monitor-o3/-/jobs/889023
Would you mind updating this ticket here to reflect the idea of @mkittler to implement a retry (https://progress.opensuse.org/issues/107917#note-21)?

Actions #3

Updated by livdywan about 2 years ago

  • Description updated (diff)
Actions #4

Updated by okurz about 2 years ago

  • Subject changed from Recovery of openqaworker1 via IPMI failed size:M to Resilient IPMI recovery of o3 machines in monitor-o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #6

Updated by mkittler about 2 years ago

  • Status changed from Workable to Feedback

SR to add a retry: https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/4

Note that there's another flaw (which I haven't changed so far): If o3 is down we will needlessly power cycle all (physical) worker hosts.
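For illustration only, the retry idea could look roughly like the sketch below. It is not the code from the linked SR. The function name `run_with_retry`, the number of attempts, and the delay are assumptions made up for this example; the actual `monitor-and-recover` script may do this differently.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=10.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0 or attempts are exhausted.

    Returns True on success, False if every attempt failed.
    attempts and delay are illustrative defaults, not values from the SR.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return True
        # Wait before the next try, but not after the final attempt.
        if attempt < attempts:
            sleep(delay)
    return False
```

A call might then look like `run_with_retry(["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, "power", "cycle"])`, where `host`, `user` and `password` are placeholders. A temporary loss of IPMI connectivity would only fail the pipeline if all attempts fail.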

Actions #7

Updated by mkittler about 2 years ago

  • Status changed from Feedback to In Progress

I've merged the SR and also a 2nd one to avoid the problem mentioned in the previous comment.

I've tested the different cases locally (using power status or invalid IPMI commands instead of power cycle) so it should work.
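As a rough sketch of the two points above (not the merged code): the recovery step can be skipped entirely when o3 itself is unreachable, and a harmless `power status` can stand in for `power cycle` during local testing. The names `hosts_to_recover` and `ipmi_power_args`, and the way reachability is represented, are assumptions for this example.

```python
def hosts_to_recover(o3_reachable, worker_online):
    """Return the worker hosts that should be power cycled.

    worker_online maps a host name to True (online) or False (offline).
    If o3 itself is down we cannot tell which workers are really offline,
    so nothing is power cycled.
    """
    if not o3_reachable:
        return []
    return sorted(host for host, online in worker_online.items() if not online)


def ipmi_power_args(dry_run=False):
    """Return the ipmitool sub-command: harmless status query when testing."""
    return ["power", "status"] if dry_run else ["power", "cycle"]
```

With `dry_run=True` the same code path can be exercised locally without rebooting any machine.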

I've also just retriggered a pipeline, but it cannot find the script anymore. So I'll have a look at what's wrong.

Actions #9

Updated by livdywan about 2 years ago

  • Status changed from Resolved to Feedback

mkittler wrote:

The good case is now working with https://gitlab.suse.de/openqa/monitor-o3/-/commit/1b659e6cd2d3f11a352b2712892ac54a385be38b: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706

I don't think so...

https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350704

/scripts-5544-902568/step_script: line 157: ./monitor-and-recover: No such file or directory
Actions #10

Updated by mkittler about 2 years ago

  • Status changed from Feedback to Resolved

But since then @okurz added the mentioned commit and the re-triggered pipeline works: https://gitlab.suse.de/openqa/monitor-o3/-/pipelines/350706
