action #128273
closed[alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? size:M
Description
Observation
We received multiple emails on 2023-04-23 around 1500Z related to the attempted automatic recovery of openqaworker-arm-1+2+. It is unclear if an SD ticket was automatically created about that.
Acceptance criteria
- AC1: The root problem was addressed
- AC2: The reason for the multi-level recovery attempt is understood
Suggestions
- DONE: Ensure that all three openqaworker-arm-1+2+3 are up and running again -> They are up and running, no problem there
- Check the timing and order of the executed recovery steps, e.g. from https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660173 for arm-3 and the related jobs for arm-1+2
- Understand the source of the error and address it; we may need to fix something there
Updated by okurz over 1 year ago
- Subject changed from [alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? to [alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? size:M
- Status changed from New to Workable
estimated with nsinger
Updated by nicksinger over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
arm-1
15:13: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660066 => Recovery over IPMI (succeeded)
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660172 => 3 PDU retries, failed to complete but pipeline succeeded
arm-2
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1531713 => 3 PDU retries, failed to complete but pipeline succeeded
arm-3
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660173 => PDU login failed, no retry and pipeline failed
Our current configuration doesn't supply an infra-ticket e-mail address but falls back to the default osd-admins (https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L22). We got these two messages as expected for arm-1 and arm-2.
What stands out to me are the "command cancelled" mails we got from the PDU. My best guess is that they are caused by the current implementation of the expect script, which sends an "Immediate Reboot" followed by an "Immediate On". A "reboot" has a delay of 5 seconds (e.g. https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1531709#L96), but the expect script doesn't wait, so the "Immediate On" can arrive inside this "sleep period" of the PDU, which cancels the previous reboot sequence.
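The suspected race can be sketched as follows. This is not the actual recovery script — the helper name is hypothetical and the real script drives the PDU menu via expect/telnet — but it shows the fix: wait out the PDU's internal reboot delay before sending the next command.

```shell
#!/bin/sh
# Sketch of the suspected fix (hypothetical helper; the real script uses
# expect over telnet). The PDU delays a "reboot" by ~5 seconds, so any
# command sent during that window cancels the pending reboot sequence.

PDU_REBOOT_DELAY=5  # seconds, as seen in the job log

send_pdu_command() {
    # stand-in for the expect/telnet interaction with the PDU
    echo "PDU <- $1"
}

send_pdu_command "Immediate Reboot"
# Wait out the PDU's internal delay before sending anything else,
# otherwise the "Immediate On" cancels the pending reboot.
sleep "$PDU_REBOOT_DELAY"
send_pdu_command "Immediate On"
```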
The second observation is the failed recovery attempt for arm-3. Here expect was unable to log in to the PDU and got no response via telnet. I have seen this behavior before when a session is already open. Given that we had 3 pipelines running at the exact same time, all connecting to qaps06nue, I think we just got unlucky and two expect scripts were running simultaneously, both trying to log in to the PDU.
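One way to avoid two pipelines holding a PDU session at the same time would be a file lock around the expect invocation. This is only a sketch under the assumption that all jobs run on the same runner host; the lock path and timeout are made up.

```shell
#!/bin/sh
# Sketch: serialize PDU access across concurrent pipeline jobs with a
# file lock (hypothetical lock path/timeout; only works if all jobs run
# on the same runner host).
LOCKFILE=/tmp/qaps06nue-pdu.lock

(
    # Wait up to 120s for exclusive access to the PDU before starting
    # the expect session; concurrent telnet logins leave it unresponsive.
    flock --timeout 120 9 || { echo "PDU busy, giving up" >&2; exit 1; }
    echo "talking to PDU"   # stand-in for the expect/telnet script
) 9>"$LOCKFILE"
```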
Suggestions:
- Adjust the expect script to either use "power off" followed by "power on", or only "reboot"
- Adjust the recovery script to allow expect to fail without failing the pipeline: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L71
- Ensure only one pipeline accesses our PDU
  3.1 Make pipelines run in sequence, not in parallel
  3.2 Use a different method to issue commands to the PDU; I had great success with SNMP previously
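For suggestion 3.2, controlling an outlet over SNMP avoids the interactive telnet session entirely. The sketch below assumes an APC-style PDU exposing the PowerNet-MIB sPDUOutletCtl table; the OID, community string, and outlet number are assumptions to verify against qaps06nue before use.

```shell
#!/bin/sh
# Sketch of PDU outlet control via SNMP instead of telnet/expect.
# OID, community and outlet number are ASSUMPTIONS for an APC-style PDU
# (PowerNet-MIB sPDUOutletCtl); verify against the real qaps06nue first.
PDU_HOST=qaps06nue
OUTLET=3           # hypothetical outlet number for the worker
OID=".1.3.6.1.4.1.318.1.1.4.4.2.1.3.$OUTLET"

# sPDUOutletCtl values: 1 = on, 2 = off, 3 = reboot
CMD="snmpset -v1 -c private $PDU_HOST $OID i 3"
echo "$CMD"   # dry-run; drop the echo to actually send the command
```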
Updated by nicksinger over 1 year ago
nicksinger wrote:
Suggestions:
- Adjust the expect script to either use "power off" followed by "power on", or only "reboot"
=> https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/28
=> Asked @okurz in slack if there was a rationale behind this: https://suse.slack.com/archives/C02AJ1E568M/p1682614288203629
- Adjust the recovery script to allow expect to fail without failing the pipeline: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L71
=> https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29
- Ensure only one pipeline accesses our PDU
  3.1 Make pipelines run in sequence, not in parallel
  3.2 Use a different method to issue commands to the PDU; I had great success with SNMP previously
=> If expect fails, MR#29 waits for 30 seconds. This should be enough for other pipelines to finish; if not, we might have to revisit this topic.
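The approach described above can be sketched as a retry loop. Only the 30-second wait comes from the comment; the helper name and the retry count are assumptions, so see the MR itself for the actual implementation.

```shell
#!/bin/sh
# Sketch of the MR#29 approach: tolerate a failing expect run and retry
# after a wait. Helper name and retry count are ASSUMPTIONS; only the
# 30 s wait is taken from the ticket comment.

run_expect_script() {
    # stand-in for the real expect invocation against the PDU; in
    # practice this can fail while another pipeline holds the session
    return 0
}

for attempt in 1 2 3; do
    if run_expect_script; then
        echo "recovery command sent (attempt $attempt)"
        break
    fi
    echo "expect failed, waiting for competing pipelines to finish"
    sleep 30   # long enough for a parallel pipeline's PDU session to end
done
```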
Updated by nicksinger over 1 year ago
Coming back to the initial root cause of this mass recovery, I looked into Grafana: https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&from=1682314203000&to=1682371803000
I only see arm-1 being down in the relevant time frame. However, roughly 3 hours later (~18:00) all arm machines went down - I'm not sure if that is related to these pipelines.
Suggestion:
- check if the alert-triggering mechanism via separate alert channels still works as expected, or if a single machine causes all three pipelines to run
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
nicksinger wrote:
- check if the alert-triggering mechanism via separate alert channels still works as expected, or if a single machine causes all three pipelines to run
The pipeline/alerting flow is still working as expected after the migration. There is a separate "contact point" per worker in https://stats.openqa-monitor.qa.suse.de/alerting/notifications?alertmanager=grafana, and https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines shows single pipelines for e.g. arm-3 and arm-2 at different timestamps (so no consecutive firing as feared). I think after https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29 is merged we can close this ticket.
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29 merged. I agree that we can close right away.