action #114586
Updated by livdywan over 2 years ago
## Observation
Both openqaworker-arm-1+2 seem to be stuck in reboot&recover loops according to alerts and observations from https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions. I paused the corresponding alerts so that they do not also trigger https://gitlab.suse.de/openqa/grafana-webhook-actions. Maybe the problem is actually within the gitlab CI runners, which seem to have problems resolving hosts like openqaworker-arm-1, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1069000#L8253:
```
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
…
```
After pausing the alerts and cancelling the still running gitlab CI recovery job for openqaworker-arm-1 I could follow in `ipmi-openqaworker-arm-1-ipmi sol activate` that the machine was coming up just fine (with the usual horribly long boot times).
## Acceptance criteria
- **AC1**: All o3 arm workers can be recovered by the automatic pipeline
- **AC2**: Arm workers that are fine are not recovered and continue to run
- **AC3**: We know what the expected domain setup is
## Suggestions
* Check and compare DNS resolution in the corresponding environments, within gitlab CI runners as well as locally.
* Maybe as a workaround it helps to use the FQDN in the gitlab CI environment
* Fix the issue with the help of SUSE-IT as they recently changed the gitlab CI runners, maybe that had some impact
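
To compare resolution between a gitlab CI runner and a local machine, something like the following minimal sketch could be run in both environments. It only uses the Python standard library. The helper names `resolve` and `compare_short_and_fqdn` are made up for illustration, and the domain passed in is an assumption that needs to be replaced with the actual expected domain (see AC3).

```python
import socket


def resolve(host, resolver=socket.getaddrinfo):
    """Return the sorted addresses for host, or [] if resolution fails."""
    try:
        infos = resolver(host, None)
    except socket.gaierror:
        # Covers errors like "No address associated with hostname"
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def compare_short_and_fqdn(short_name, domain, resolver=socket.getaddrinfo):
    """Resolve the short hostname and the FQDN built from it.

    `domain` is an assumption supplied by the caller, not a known value.
    """
    fqdn = f"{short_name}.{domain}"
    return {
        short_name: resolve(short_name, resolver),
        fqdn: resolve(fqdn, resolver),
    }
```

For example, `compare_short_and_fqdn("openqaworker-arm-1", "example.org")` returns a dict with one entry per name, where `example.org` is only a placeholder domain. If the short name yields an empty list in the CI runner but the FQDN resolves, that would point to a missing or changed DNS search domain in the runner environment and support the FQDN workaround.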
## Rollback steps
* Unpause the alerts "openqaworker-arm-1 offline", "openqaworker-arm-2 offline" and "openqaworker-arm-3 offline", as well as the corresponding long-time alerts, from https://monitor.qa.suse.de/alerting/list?state=not_ok