action #128273
closed[alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? size:M
Description
Observation
We received multiple emails on 2023-04-23 around 1500Z related to the attempted automatic recovery of openqaworker-arm-1+2+. It is unclear if an SD ticket was automatically created about that.
Acceptance criteria
- AC1: The root problem was addressed
- AC2: The reason for the multi-level recovery attempt is understood
Suggestions
- DONE: Ensure that all three openqaworker-arm-1+2+3 are up and running again -> They are up and running, no problem there
- Check the timing and order of the executed recovery steps, e.g. from https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660173 for arm-3 and the related jobs for arm-1+2
- Understand the source of the error and address it; we may need to fix something there
Updated by okurz over 1 year ago
- Subject changed from [alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? to [alert] openqaworker-arm-1+2+ failed to recover, problem in name resolution, network connection? size:M
- Status changed from New to Workable
estimated with nsinger
Updated by nicksinger over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
arm-1
15:13: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660066 => Recovery over IPMI (succeeded)
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660172 => 3 PDU retries, failed to complete but pipeline succeeded
arm-2
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1531713 => 3 PDU retries, failed to complete but pipeline succeeded
arm-3
15:53: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/660173 => PDU login failed, no retry and pipeline failed
Our current configuration doesn't supply an infra-ticket e-mail address but falls back to the default osd-admins (https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L22). We got these two messages as expected for arm-1 and arm-2.
What stands out to me are the "command cancelled" mails we got from the PDU. My best guess is that they are caused by the current implementation of the expect script, which sends an "Immediate Reboot" followed by an "Immediate On". A "reboot" has a delay of 5 seconds (e.g. https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1531709#L96), but the expect script doesn't wait, so the "Immediate On" can arrive inside this "sleep period" of the PDU, which cancels the previous reboot sequence.
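The suspected race can be sketched as follows. This is not the actual recovery script — the helper name is hypothetical and the real script drives the PDU menu via expect/telnet — but it shows the fix: wait out the PDU's internal reboot delay before sending the next command.

```shell
#!/bin/sh
# Sketch of the suspected fix (hypothetical helper; the real script uses
# expect over telnet). The PDU delays a "reboot" by ~5 seconds, so any
# command sent during that window cancels the pending reboot sequence.

PDU_REBOOT_DELAY=5  # seconds, as seen in the job log

send_pdu_command() {
    # stand-in for the expect/telnet interaction with the PDU
    echo "PDU <- $1"
}

send_pdu_command "Immediate Reboot"
# Wait out the PDU's internal delay before sending anything else,
# otherwise the "Immediate On" cancels the pending reboot.
sleep "$PDU_REBOOT_DELAY"
send_pdu_command "Immediate On"
```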
The second observation is the failed recovery attempt for arm-3. Here expect was unable to log in to the PDU and got no response via telnet. I have seen this behavior before when a session is already open. Given that we had 3 pipelines running at the exact same time, all connecting to qaps06nue, I think we just got unlucky and two expect scripts were running simultaneously, both trying to log in to the PDU.
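One way to avoid two pipelines holding a PDU session at the same time would be a file lock around the expect invocation. This is only a sketch under the assumption that all jobs run on the same runner host; the lock path and timeout are made up.

```shell
#!/bin/sh
# Sketch: serialize PDU access across concurrent pipeline jobs with a
# file lock (hypothetical lock path/timeout; only works if all jobs run
# on the same runner host).
LOCKFILE=/tmp/qaps06nue-pdu.lock

(
    # Wait up to 120s for exclusive access to the PDU before starting
    # the expect session; concurrent telnet logins leave it unresponsive.
    flock --timeout 120 9 || { echo "PDU busy, giving up" >&2; exit 1; }
    echo "talking to PDU"   # stand-in for the expect/telnet script
) 9>"$LOCKFILE"
```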
Suggestions:
- Adjust the expect script to either use "power off" followed by "power on", or only "reboot"
- Adjust the recovery script to allow expect to fail without failing the pipeline: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L71
- Ensure only one pipeline accesses our PDU
  3.1 Make pipelines run in sequence, not in parallel
  3.2 Use a different method to issue commands to the PDU; I had great success with SNMP previously
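For suggestion 3.2, controlling an outlet over SNMP avoids the interactive telnet session entirely. The sketch below assumes an APC-style PDU exposing the PowerNet-MIB sPDUOutletCtl table; the OID, community string, and outlet number are assumptions to verify against qaps06nue before use.

```shell
#!/bin/sh
# Sketch of PDU outlet control via SNMP instead of telnet/expect.
# OID, community and outlet number are ASSUMPTIONS for an APC-style PDU
# (PowerNet-MIB sPDUOutletCtl); verify against the real qaps06nue first.
PDU_HOST=qaps06nue
OUTLET=3           # hypothetical outlet number for the worker
OID=".1.3.6.1.4.1.318.1.1.4.4.2.1.3.$OUTLET"

# sPDUOutletCtl values: 1 = on, 2 = off, 3 = reboot
CMD="snmpset -v1 -c private $PDU_HOST $OID i 3"
echo "$CMD"   # dry-run; drop the echo to actually send the command
```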
Updated by nicksinger over 1 year ago
nicksinger wrote:
Suggestions:
- Adjust the expect script to either use "power off" followed by "power on", or only "reboot"
=> https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/28
=> Asked @okurz in slack if there was a rationale behind this: https://suse.slack.com/archives/C02AJ1E568M/p1682614288203629
- Adjust the recovery script to allow expect to fail without failing the pipeline: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-recover-worker#L71
=> https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29
- Ensure only one pipeline accesses our PDU
  3.1 Make pipelines run in sequence, not in parallel
  3.2 Use a different method to issue commands to the PDU; I had great success with SNMP previously
=> If expect fails, MR#29 waits for 30 seconds. This should be enough for other pipelines to finish; if not, we might have to revisit this topic.
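The approach described above can be sketched as a retry loop. Only the 30-second wait comes from the comment; the helper name and the retry count are assumptions, so see the MR itself for the actual implementation.

```shell
#!/bin/sh
# Sketch of the MR#29 approach: tolerate a failing expect run and retry
# after a wait. Helper name and retry count are ASSUMPTIONS; only the
# 30 s wait is taken from the ticket comment.

run_expect_script() {
    # stand-in for the real expect invocation against the PDU; in
    # practice this can fail while another pipeline holds the session
    return 0
}

for attempt in 1 2 3; do
    if run_expect_script; then
        echo "recovery command sent (attempt $attempt)"
        break
    fi
    echo "expect failed, waiting for competing pipelines to finish"
    sleep 30   # long enough for a parallel pipeline's PDU session to end
done
```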
Updated by nicksinger over 1 year ago
Coming back to the initial root cause of this mass recovery, I looked into Grafana: https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&from=1682314203000&to=1682371803000
I only see arm-1 being down in the relevant time frame. However, roughly 3 hours later (~18:00) all arm machines went down - I'm not sure if that is related to these pipelines.
Suggestion:
- check if the alert-triggering mechanism via separate alert channels still works as expected, or if a single machine causes all three pipelines to run
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
nicksinger wrote:
- check if the alert-triggering mechanism via separate alert channels still works as expected, or if a single machine causes all three pipelines to run
The pipeline/alerting flow is still working as expected after the migration. There is a separate "contact point" per worker in https://stats.openqa-monitor.qa.suse.de/alerting/notifications?alertmanager=grafana, and https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines shows single pipelines for e.g. arm-3 and arm-2 at different timestamps (so no consecutive firing as feared). I think after https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29 is merged we can close this ticket.
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/29 merged. I agree that we can close right away.