action #114586

Updated by okurz over 1 year ago

## Observation 
 Both openqaworker-arm-1+2 seem to be stuck in reboot&recover loops according to alerts and observations from https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions. I paused the corresponding alerts so that they no longer trigger https://gitlab.suse.de/openqa/grafana-webhook-actions. Maybe the problem is actually within the gitlab CI runners, which seem to have problems resolving hosts like openqaworker-arm-1, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1069000#L8253: 

 ``` 
 ping: openqaworker-arm-1: No address associated with hostname 
 ping: openqaworker-arm-1: No address associated with hostname 
 ping: openqaworker-arm-1: No address associated with hostname 
 … 
 ``` 

 After pausing the alerts and cancelling the still running gitlab CI recovery job for openqaworker-arm-1 I could follow in `ipmi-openqaworker-arm-1-ipmi sol activate` that the machine was coming up just fine (with the usual horribly long boot times). 

 ## Suggestions 
 * Check and compare DNS resolution in the corresponding environments, within gitlab CI runners as well as locally. 
 * Maybe as a workaround it helps to use the FQDN in the gitlab CI environment 
 * Fix the issue with the help of SUSE-IT as they recently changed the gitlab CI runners, maybe that had some impact 
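
The DNS comparison suggested above could be sketched as follows: a small script to run both locally and inside a gitlab CI job, so the two outputs can be compared. This is only a sketch under assumptions: the function names `resolve` and `compare` are hypothetical, and the domain `example.org` is a placeholder, the real domain of the workers has to be filled in.

```python
import socket


def resolve(name, resolver=socket.getaddrinfo):
    """Return the sorted unique addresses for name, or an error string."""
    try:
        infos = resolver(name, None)
    except socket.gaierror as err:
        # Same class of failure as "No address associated with hostname"
        return f"error: {err}"
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address
    return sorted({info[4][0] for info in infos})


def compare(short_name, domain, resolver=socket.getaddrinfo):
    """Resolve both the short hostname and the FQDN built from it."""
    fqdn = f"{short_name}.{domain}"
    return {name: resolve(name, resolver) for name in (short_name, fqdn)}


if __name__ == "__main__":
    # "example.org" is a placeholder (assumption), not the real worker domain
    for name, result in compare("openqaworker-arm-1", "example.org").items():
        print(f"{name}: {result}")
```

If the short name fails but the FQDN resolves in the CI environment, that would point to a missing DNS search domain in the runners and support the FQDN workaround.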

 ## Rollback steps 
 * Unpause the alerts "openqaworker-arm-1 offline", "openqaworker-arm-2 offline" and "openqaworker-arm-3 offline", as well as the corresponding long-time alerts, from https://monitor.qa.suse.de/alerting/list?state=not_ok
