action #114586

closed

fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Start date: 2022-07-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

Both openqaworker-arm-1+2 seem to be stuck in reboot-and-recover loops according to alerts and observations from https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions. I paused the corresponding alerts so that they no longer trigger https://gitlab.suse.de/openqa/grafana-webhook-actions. Maybe the problem is actually within the gitlab CI runners, which seem to have problems resolving hosts like openqaworker-arm-1, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1069000#L8253:

ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
…

After pausing the alerts and cancelling the still-running gitlab CI recovery job for openqaworker-arm-1, I could follow in ipmi-openqaworker-arm-1-ipmi sol activate that the machine was coming up just fine (with the usual horribly long boot times).

Acceptance criteria

  • AC1: All o3 arm workers can be recovered by the automatic pipeline
  • AC2: Arm workers that are fine are not recovered and continue to run
  • AC3: We know what the expected domain setup is

Suggestions

  • Check and compare DNS resolution in the corresponding environments, both within gitlab CI runners and locally.
  • Maybe as a workaround it helps to use the FQDN in the gitlab CI environment
  • Fix the issue with the help of SUSE-IT as they recently changed the gitlab CI runners, maybe that had some impact
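
The first two suggestions could be checked with a small script run both locally and inside a gitlab CI job. This is only a sketch: the function names are made up for illustration, and the domain is a placeholder (example.invalid), because the expected domain setup is exactly what AC3 still has to establish. It uses the system resolver via socket.getaddrinfo, so search domains configured in the environment apply in the same way as for ping.

```python
import socket


def resolve(name, resolver=socket.getaddrinfo):
    """Resolve one name; report either its addresses or the resolver error."""
    try:
        infos = resolver(name, None)
    except socket.gaierror as err:
        # Same class of failure as "No address associated with hostname"
        return {"name": name, "ok": False, "detail": str(err)}
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    addresses = sorted({info[4][0] for info in infos})
    return {"name": name, "ok": True, "detail": ", ".join(addresses)}


def compare_resolution(short_name, domain, resolver=socket.getaddrinfo):
    """Resolve the short hostname and the FQDN built from an assumed domain."""
    fqdn = f"{short_name}.{domain}"
    return [resolve(short_name, resolver), resolve(fqdn, resolver)]


def format_report(results):
    """Render results as one line per name for easy diffing between environments."""
    return "\n".join(
        f"{r['name']}: {'OK' if r['ok'] else 'FAIL'} {r['detail']}" for r in results
    )
```

Calling print(format_report(compare_resolution("openqaworker-arm-1", "example.invalid"))) in both environments, with the placeholder replaced by the real domain once known, gives two short reports that can be compared directly. If only the FQDN resolves inside the CI runner, that would support the FQDN workaround and point at a missing search domain there.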

Rollback steps
