action #97382

closed

ARM automatic reboot pipeline does not fail if ipmitool fails size:S

Added by nicksinger over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2021-08-23
Due date:
% Done: 0%
Estimated time:
Description

The most recent recovery attempt for openqaworker-arm-3 triggered a pipeline which failed but is shown as "succeeded": https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/534098#L34

A quick look at https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker shows we already have "set -e" in place, so it is unclear why the exit code of the failing ipmitool call did not reach the pipeline runner.

AC1: Let the user know that "Error: Unable to establish IPMI v2 / RMCP+ session" is not the final reason why the job ended, e.g.: "IPMI tool failed after x retries. creating Infra service ticket now"
AC2: Check if the ticket creation was successful. Make the pipeline status depend on that final step so one can clearly see whether the pipeline did something or not. This also helps with monitoring the situation, as subscribed people would receive a mail if everything fails (meaning manual investigation from our side is needed)


Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #97244: openqaworker-arm-3 is offline and EngInfra wants us to create JiraSD tickets instead of infra size:M (Resolved, dheidler, 2021-08-19 to 2021-09-17)

Actions #1

Updated by okurz over 2 years ago

  • Target version set to Ready
Actions #2

Updated by dheidler over 2 years ago

  • Related to action #97244: openqaworker-arm-3 is offline and EngInfra wants us to create JiraSD tickets instead of infra size:M added
Actions #3

Updated by dheidler over 2 years ago

  • Subject changed from ARM automatic reboot pipeline does not fail if ipmitool fails to ARM automatic reboot pipeline does not fail if ipmitool fails size:S
  • Status changed from New to In Progress
  • Assignee set to dheidler
Actions #4

Updated by dheidler over 2 years ago

This happens because the ipmitool call is used in this context:

if ! $ipmitool chassis status; then

A command tested in an if condition is exempt from set -e, so when ipmitool fails the script does not abort; it just goes on to create an infra ticket.
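A minimal sketch of this behavior, not taken from ipmi-recover-worker itself: failing_ipmitool is a hypothetical stand-in for the real "$ipmitool chassis status" call, and fallback_taken is an illustrative variable. It shows that with set -e active, a failing command used as an if condition does not end the script.

```shell
#!/bin/bash
set -e

# Hypothetical stand-in for "$ipmitool chassis status"; it always fails here.
failing_ipmitool() {
    echo "Error: Unable to establish IPMI v2 / RMCP+ session" >&2
    return 1
}

fallback_taken=no

# The exit status of a command tested by "if" is consumed by the condition,
# so "set -e" does not abort the script at this point.
if ! failing_ipmitool; then
    fallback_taken=yes
    echo "ipmitool failed, falling back to ticket creation"
fi

# Execution reaches this line, and the script exits with status 0.
echo "fallback taken: $fallback_taken"
```

Because the script ends normally, a CI runner that only looks at the exit status would report the job as successful even though the last tool output was an error message.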

Actions #5

Updated by dheidler over 2 years ago

  • Status changed from In Progress to Rejected

Okurz wrote:

if ipmi failed for X times and we resorted to reporting a ticket this should be a successful pipeline

This seems to be expected behavior so I will reject this ticket.

Actions #6

Updated by nicksinger over 2 years ago

  • Status changed from Rejected to New

I think we should still improve the printed messages here. IMHO it is highly confusing if a job succeeds when the last message is "Error: Unable to establish IPMI v2 / RMCP+ session". I will adjust the title and include some ACs with what could be improved. @dheidler feel free to unassign yourself if you don't want to continue working on these improvements.

Actions #7

Updated by nicksinger over 2 years ago

  • Description updated (diff)
Actions #8

Updated by dheidler over 2 years ago

I will improve the script with some more log output.
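One possible shape for such log output, as a sketch under assumptions rather than the change that was actually merged: ipmi_status is a hypothetical stand-in for the real ipmitool call, and max_retries=3 is an assumed value. The point is that the last line of the job log states what the script decided to do, as AC1 asks.

```shell
#!/bin/bash
# Hypothetical stand-in for "$ipmitool chassis status"; it always fails here.
ipmi_status() {
    echo "Error: Unable to establish IPMI v2 / RMCP+ session" >&2
    return 1
}

max_retries=3   # assumed value, not taken from the real script
attempt=1
ipmi_ok=no
final_message=""

while [ "$attempt" -le "$max_retries" ]; do
    if ipmi_status; then
        ipmi_ok=yes
        break
    fi
    echo "ipmitool attempt $attempt of $max_retries failed"
    attempt=$((attempt + 1))
    # A real script would probably wait between attempts.
done

if [ "$ipmi_ok" = no ]; then
    # Explicit closing message so the raw ipmitool error is not the last log line.
    final_message="IPMI tool failed after $max_retries retries. Creating infra service ticket now."
    echo "$final_message"
    # The ticket creation step would follow here.
fi
```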

Actions #9

Updated by dheidler over 2 years ago

  • Status changed from New to In Progress

The ticket is created by the line

printf %b "Subject: $subject\n\n$EMAIL\n\n" | msmtp --from "$from" -t "$contact"

which should change in the near future (see https://progress.opensuse.org/issues/97244?).

When this command fails, the pipeline should already fail due to set -e -o pipefail so I think AC2 is already present.
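The effect of set -e -o pipefail on such a pipeline can be sketched as follows. failing_mailer is a hypothetical stand-in for msmtp, and each case runs in its own bash process so that a failure does not end the demonstration itself.

```shell
#!/bin/bash
# Case 1: the last command of the pipeline (the mailer) fails.
# failing_mailer is a hypothetical stand-in for msmtp.
mailer_status=0
bash -c '
    set -e -o pipefail
    failing_mailer() { cat >/dev/null; return 1; }
    printf "Subject: test\n\nbody\n\n" | failing_mailer
    echo "not reached"
' || mailer_status=$?

# Case 2: an earlier command in the pipeline fails.
# Without pipefail the pipeline status is that of the last command only.
without_pipefail=0
bash -c 'set -e; false | cat; echo "continues"' >/dev/null || without_pipefail=$?

with_pipefail=0
bash -c 'set -e -o pipefail; false | cat; echo "not reached"' >/dev/null || with_pipefail=$?

echo "mailer failure: exit status $mailer_status"
echo "left side failure without pipefail: exit status $without_pipefail"
echo "left side failure with pipefail: exit status $with_pipefail"
```

In this sketch a failing mailer already stops the script through set -e because it is the last command of the pipeline; pipefail additionally covers a failure on the printf side.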

Actions #10

Updated by dheidler over 2 years ago

  • Status changed from In Progress to Feedback
Actions #11

Updated by nicksinger over 2 years ago

Merged, thanks.

dheidler wrote:

The ticket is created by the line

printf %b "Subject: $subject\n\n$EMAIL\n\n" | msmtp --from "$from" -t "$contact"

which should change in the near future (see https://progress.opensuse.org/issues/97244?).

When this command fails, the pipeline should already fail due to set -e -o pipefail so I think AC2 is already present.

I see, this is fire-and-forget and we don't really have a way to tell if a ticket was created. I will extend the other ticket to include this AC. I don't see it fulfilled but understand that it is currently unfeasible to implement with this approach.

Actions #12

Updated by dheidler over 2 years ago

  • Status changed from Feedback to Resolved

With https://progress.opensuse.org/issues/97244 that should change and until then I guess we can close this one.
