action #114586
Updated by okurz over 2 years ago
## Observation

Both openqaworker-arm-1+2 seem to be stuck in reboot&recover loops according to alerts and observations from https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions. I paused the corresponding alerts so that they do not also trigger https://gitlab.suse.de/openqa/grafana-webhook-actions. Maybe the problem is actually within the gitlab CI runners, which seem to have problems resolving hosts like openqaworker-arm-1, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1069000#L8253:

```
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
…
```

After pausing the alerts and cancelling the still running gitlab CI recovery job for openqaworker-arm-1 I could follow in `ipmi-openqaworker-arm-1-ipmi sol activate` that the machine was coming up just fine (with the usual horribly long boot times).

## Suggestions

* Check and compare DNS resolution in the relevant environments, within the gitlab CI runners as well as locally.
* As a workaround it may help to use the FQDN in the gitlab CI environment.
* Fix the issue with the help of SUSE-IT; they recently changed the gitlab CI runners, which may have had some impact.

## Rollback steps

* Unpause the alerts "openqaworker-arm-1 offline", "openqaworker-arm-2 offline" and "openqaworker-arm-3 offline" as well as the corresponding long-time alerts from https://monitor.qa.suse.de/alerting/list?state=not_ok
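
## DNS check sketch

The DNS comparison proposed under Suggestions could be sketched roughly as follows. This is only an illustration, not part of the existing recovery scripts: the helper names `resolve` and `compare` are made up for this example, and the domain passed to `compare` is a placeholder that has to be replaced with the real domain of the workers. Running the same script locally and inside a gitlab CI job would show whether the short hostname, the FQDN, or both fail to resolve in each environment.

```python
import socket


def resolve(name, resolver=socket.getaddrinfo):
    """Return (addresses, None) on success or (None, error text) on failure."""
    try:
        infos = resolver(name, None)
    except socket.gaierror as err:
        # Same class of failure as "No address associated with hostname" from ping
        return None, str(err)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos}), None


def compare(short_name, domain, resolver=socket.getaddrinfo):
    """Resolve both the short hostname and the FQDN and report each outcome.

    `domain` is an assumption supplied by the caller; the ticket does not
    state which domain the workers live in.
    """
    results = {}
    for name in (short_name, f"{short_name}.{domain}"):
        addresses, error = resolve(name, resolver)
        results[name] = addresses if error is None else f"FAILED: {error}"
    return results
```

For example, `compare("openqaworker-arm-1", "<domain>")` returns a dictionary with one entry for the short name and one for the FQDN. If only the short name fails inside the CI runner, that points to a missing search domain there and supports the FQDN workaround.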