action #128273

closed

[alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-04-25
Due date:
2023-05-12
% Done:

0%

Estimated time:

Description

Observation

We received multiple emails on 2023-04-23 around 15:00Z about the attempted automatic recovery of openqaworker-arm-1+2+. It is unclear whether an SD ticket was automatically created for this.

Acceptance criteria

  • AC1: The root problem was addressed
  • AC2: The reason for the multi-level recovery attempt is understood

Suggestions

  • DONE: Ensure that all three openqaworker-arm-1+2+3 are up and running again -> They are up and running, no problem there
  • Check the chronological order of the execution steps, e.g. from https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660173 for arm-3 and the related jobs for arm-1+2
  • Understand the error source and address it; maybe we need to fix something there
Actions #1

Updated by okurz over 1 year ago

  • Subject changed from [alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? to [alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? size:M
  • Status changed from New to Workable

estimated with nsinger

Actions #2

Updated by nicksinger over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

arm-1
15:13: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660066 => Recovery over IPMI (succeeded)
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660172 => 3 PDU retries, failed to complete but pipeline succeeded

arm-2
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1531713 => 3 PDU retries, failed to complete but pipeline succeeded

arm-3
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660173 => PDU login failed, no retry and pipeline failed

Our current configuration doesn't supply an infra ticket e-mail address but falls back to the default osd-admins (https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L22). We got these two messages as expected for arm-1 and arm-2.
What stands out to me are the "command cancelled" mails we got from the PDU. My only guess is that they are caused by the current implementation of the expect-script, which issues an "Immediate Reboot" followed by an "Immediate On". A "reboot" has a delay of 5 seconds (e.g. https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1531709#L96) but the expect-script doesn't wait, so the "Immediate On" can arrive within this "sleep period" of the PDU and cancel the previous reboot sequence.
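
To make the guess above concrete, here is a toy model of the suspected race. It is not the PDU firmware logic and not our expect-script; the class `ToyOutlet`, the function `simulate` and the rule "a new command cancels a still-pending delayed one" are assumptions made up for illustration. Only the 5 second reboot delay and the "command cancelled" wording come from the observations in this ticket.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

REBOOT_DELAY_S = 5.0  # reboot delay seen in the job log referenced in this ticket


@dataclass
class ToyOutlet:
    """Toy model of one PDU outlet (hypothetical, for illustration only)."""

    pending: Optional[Tuple[str, float]] = None  # (command, time it is due)
    log: List[str] = field(default_factory=list)

    def issue(self, command: str, now: float, delay: float = 0.0) -> None:
        # Assumption: a command arriving while another one is still
        # waiting for its delay to expire cancels the waiting one.
        if self.pending is not None and now < self.pending[1]:
            self.log.append(f"{self.pending[0]}: command cancelled")
            self.pending = None
        if delay > 0:
            self.pending = (command, now + delay)
        else:
            self.log.append(f"{command}: executed")

    def advance(self, now: float) -> None:
        # Execute the pending command once its delay has expired.
        if self.pending is not None and now >= self.pending[1]:
            self.log.append(f"{self.pending[0]}: executed")
            self.pending = None


def simulate(gap_s: float) -> List[str]:
    """Send a delayed "reboot", then an immediate "on" after gap_s seconds."""
    outlet = ToyOutlet()
    outlet.issue("reboot", now=0.0, delay=REBOOT_DELAY_S)
    outlet.advance(gap_s)
    outlet.issue("on", now=gap_s)
    outlet.advance(gap_s + REBOOT_DELAY_S)
    return outlet.log


print(simulate(0.0))  # "on" sent right away, inside the delay window
print(simulate(6.0))  # "on" sent after the delay has expired
```

In this model the reboot only survives if the second command is sent after the delay has passed, which is why either waiting in the expect-script or sending a single command would avoid the cancellation.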

The second observation is a failed recovery attempt for arm-3. Here expect was unable to log in to the PDU and got no response via telnet. I have seen this behavior before when a session is already open. Given that we had 3 pipelines running at the exact same time, all connecting to qaps06nue, I think we just got unlucky and two expect-scripts were trying to log in to the PDU at the same time.

Suggestions:

  1. Adjust expect-script to either use "power off" followed by a "power on" or only "reboot"
  2. Adjust recovery script to allow expect to fail without the pipeline failing: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L71
  3. Ensure only one pipeline accesses our PDU
     3.1 Make pipelines run in sequence, not in parallel
     3.2 Use a different method to issue commands to the PDU. I had great success with SNMP previously.
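
As an illustration of suggestion 3, mutual exclusion around the PDU session could be sketched with an advisory file lock. This is only a sketch under a strong assumption: all recovery jobs would have to run on the same host (or share a filesystem that supports `flock`), which may not hold for our CI runners. The helper name `pdu_lock` and the lock path are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def pdu_lock(path: str):
    """Hold an exclusive advisory lock on `path` while the body runs.

    Hypothetical helper: it only serializes processes that can see the
    same lock file.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until no other process holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)
```

A recovery job would then wrap its telnet session in `with pdu_lock("/tmp/qaps06nue.lock"):` so that a second job waits instead of hitting an already open session.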
Actions #3

Updated by nicksinger over 1 year ago

nicksinger wrote:

Suggestions:

  1. Adjust expect-script to either use "power off" followed by a "power on" or only "reboot"

=> https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/28
=> Asked @okurz in slack if there was a rationale behind this: https://suse.slack.com/archives/C02AJ1E568M/p1682614288203629

  2. Adjust recovery script to allow expect to fail without the pipeline failing: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L71

=> https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29

  3. Ensure only one pipeline accesses our PDU
     3.1 Make pipelines run in sequence, not in parallel
     3.2 Use a different method to issue commands to the PDU. I had great success with SNMP previously.

=> If expect fails, MR#29 waits for 30 seconds. This should be enough for other pipelines to finish. If not, we might have to revisit this topic.
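
The idea of "wait and try again, but don't fail the pipeline" can be sketched as follows. This is not the code from MR#29; the function `run_with_retry`, its parameters and the default of 3 attempts are assumptions for illustration. Only the 30 second wait is taken from the comment above.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    cmd: Sequence[str],
    retries: int = 3,
    wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run cmd up to `retries` times and return True on the first success.

    A failing command never raises here, so the caller decides whether a
    failed PDU step should fail the whole pipeline.
    """
    for attempt in range(1, retries + 1):
        result = subprocess.run(cmd, check=False)
        if result.returncode == 0:
            return True
        if attempt < retries:
            # Give a concurrent session time to finish before retrying.
            sleep(wait_s)
    return False
```

The `sleep` parameter exists only so the waiting can be replaced in tests; a caller would normally leave it at the default.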

Actions #4

Updated by nicksinger over 1 year ago

Coming back to the initial root cause of this mass-recovery, I looked into Grafana: https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&from=1682314203000&to=1682371803000
I only see arm-1 being down in the relevant time frame. However, roughly 3h later (~18:00) all arm machines went down. I am not sure whether that is related to these pipelines.

Suggestion:

  1. check if the alert triggering mechanism via separate alert-channels still works as expected or if a single machine causes all three pipelines to run
Actions #5

Updated by openqa_review over 1 year ago

  • Due date set to 2023-05-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Feedback

nicksinger wrote:

  1. check if the alert triggering mechanism via separate alert-channels still works as expected or if a single machine causes all three pipelines to run

The pipeline/alerting flow is still working as expected after the migration. There is a separate "contact point" per worker in https://stats.openqa-monitor.qa.suse.de/alerting/notifications?alertmanager=grafana and https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines shows single pipelines for e.g. arm-3 and arm-2 at different timestamps (so no consecutive firing as feared). I think we can close this ticket once https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29 is merged.

Actions #7

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved