action #114586
closed
fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M
0%
Description
Observation
Both openqaworker-arm-1+2 seem to be stuck in reboot-and-recover loops, according to alerts and observations from https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions. I paused the corresponding alerts so that they no longer trigger https://gitlab.suse.de/openqa/grafana-webhook-actions. Maybe the problem is actually within the GitLab CI runners, which seem to have problems resolving hosts like openqaworker-arm-1, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1069000#L8253:
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
…
After pausing the alerts and cancelling the still-running GitLab CI recovery job for openqaworker-arm-1, I could follow in ipmi-openqaworker-arm-1-ipmi sol activate that the machine was coming up just fine (with the usual horribly long boot times).
Acceptance criteria
- AC1: All o3 arm workers can be recovered by the automatic pipeline
- AC2: Arm workers that are fine are not recovered and continue to run
- AC3: We know what the expected domain setup is
Suggestions
- Check and compare DNS resolution in the relevant environments, within the GitLab CI runners as well as locally.
- Maybe as a workaround it helps to use the FQDN in the GitLab CI environment
- Fix the issue with the help of SUSE-IT, as they recently changed the GitLab CI runners; maybe that had some impact
Rollback steps
- Unpause alerts "openqaworker-arm-1 offline" and "openqaworker-arm-2 offline" and "openqaworker-arm-3 offline" and the corresponding long-time alerts as well from https://monitor.qa.suse.de/alerting/list?state=not_ok
Updated by okurz over 2 years ago
- Subject changed from fix openqaworker-arm-1+2 recovery pipeline (was: likely stuck in reboot loop) to fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop)
- Priority changed from High to Immediate
openqaworker-arm-3 is also affected now, see failed gitlab CI pipeline for recovery, paused alert
Updated by jbaier_cz over 2 years ago
I just tested that inside a different GitLab CI environment:
$ cat /etc/resolv.conf
### /etc/resolv.conf is a symlink to /var/run/netconfig/resolv.conf
### autogenerated by netconfig!
#
# Before you change this file manually, consider to define the
# static DNS configuration using the following variables in the
# /etc/sysconfig/network/config file:
# NETCONFIG_DNS_STATIC_SEARCHLIST
# NETCONFIG_DNS_STATIC_SERVERS
# NETCONFIG_DNS_FORWARDER
# or disable DNS configuration updates via netconfig by setting:
# NETCONFIG_DNS_POLICY=''
#
# See also the netconfig(8) manual page and other documentation.
#
### Call "netconfig update -f" to force adjusting of /etc/resolv.conf.
search cloud.suse.de
nameserver 10.162.191.2
nameserver 10.160.0.1
$ ping -c1 openqaworker-arm-1.suse.de
PING openqaworker-arm-1.suse.de (10.160.0.245) 56(84) bytes of data.
64 bytes from openqaworker-arm-1.suse.de (10.160.0.245): icmp_seq=1 ttl=62 time=0.198 ms
--- openqaworker-arm-1.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.198/0.198/0.198/0.000 ms
$ ping -c1 openqaworker-arm-1
ping: openqaworker-arm-1: No address associated with hostname
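The search domains are missing: the search list only contains cloud.suse.de, so the short name openqaworker-arm-1 is never expanded to openqaworker-arm-1.suse.de.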
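To make the effect of the search list easier to see, here is a minimal sketch that prints the names a resolver would derive from a short hostname, based on the last "search" line of a resolv.conf-style file. The function name search_candidates is hypothetical, and the sketch ignores details such as the ndots option.

```shell
#!/bin/sh
# Hypothetical helper: list the names derived from a short hostname and
# the "search" domains in a resolv.conf-style file.
# Simplification: only the last "search" line is used; "options ndots"
# and other resolver details are ignored.
search_candidates() {
    name="$1"
    conf="${2:-/etc/resolv.conf}"
    awk -v name="$name" '
        /^search[ \t]/ { line = $0 }          # remember the last search line
        END {
            n = split(line, d, /[ \t]+/)      # d[1] is the keyword "search"
            for (i = 2; i <= n; i++) print name "." d[i]
        }
    ' "$conf"
}
```

Against the resolv.conf shown above, `search_candidates openqaworker-arm-1` yields only openqaworker-arm-1.cloud.suse.de, which matches the failed lookup of the short name.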
Updated by jbaier_cz over 2 years ago
I created https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/21 to mitigate the issue.
Updated by okurz over 2 years ago
merged. Retriggered a job to check, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1074564
EDIT: Ok, that did not work because it's not picking up the new code. Just triggering a new pipeline https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/445615 with MACHINE=openqaworker-arm-3.
I suggest we still report a ticket with SUSE-IT; maybe the change was unintended on their side.
Updated by jbaier_cz over 2 years ago
okurz wrote:
Retriggered a job to check, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1074564
Retrigger won't work as it will reuse the old configuration. We need a new run based on the new commit.
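For reference, a new pipeline with a MACHINE variable can also be started from a script through the GitLab pipeline trigger API. This is only a sketch: the thread does not say how the pipeline was actually started, and GITLAB_URL, PROJECT_ID, REF and TRIGGER_TOKEN are placeholders, not values from this ticket.

```shell
#!/bin/sh
# Placeholders, NOT taken from the ticket: adjust to the real project.
GITLAB_URL="${GITLAB_URL:-https://gitlab.example.com}"
PROJECT_ID="${PROJECT_ID:-123}"
REF="${REF:-master}"

# Start a NEW pipeline on the current tip of $REF and pass MACHINE as a
# CI variable. With DRY_RUN=1 the request is only described, not sent.
trigger_recovery() {
    machine="$1"
    url="$GITLAB_URL/api/v4/projects/$PROJECT_ID/trigger/pipeline"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "POST $url ref=$REF variables[MACHINE]=$machine"
        return 0
    fi
    # TRIGGER_TOKEN is a pipeline trigger token that must be provided
    # by the caller (assumption: one exists for the project).
    curl --fail --silent --show-error --request POST \
        --form "token=$TRIGGER_TOKEN" \
        --form "ref=$REF" \
        --form "variables[MACHINE]=$machine" \
        "$url"
}
```

Running `DRY_RUN=1 trigger_recovery openqaworker-arm-3` shows the request without contacting the server.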
Updated by okurz over 2 years ago
jbaier_cz wrote:
Retrigger won't work as it will reuse the old configuration. We need a new run based on the new commit.
yes, done
Updated by jbaier_cz over 2 years ago
Much better (at least the DNS part): the name now resolves, even though the ping itself gets no reply.
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
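The two failure modes seen in this ticket (name does not resolve vs. name resolves but the host does not answer) can be told apart with a small check like the following. It is a sketch, assuming a Linux system with getent and iputils ping; the function name diagnose is hypothetical and not part of the recovery pipeline.

```shell
#!/bin/sh
# Hypothetical helper: distinguish a DNS failure from an unreachable host.
diagnose() {
    host="$1"
    if ! getent hosts "$host" >/dev/null 2>&1; then
        # Lookup failed (or getent is not available on this system)
        echo "$host: DNS lookup failed"
    elif ! ping -c1 -W2 "$host" >/dev/null 2>&1; then
        # Name resolves, but no reply within 2 seconds
        echo "$host: resolves but does not answer ping"
    else
        echo "$host: resolves and answers ping"
    fi
}
```

A call such as `diagnose openqaworker-arm-3.suse.de` then reports which of the two steps failed.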
Updated by okurz over 2 years ago
agreed.
Can you report a ticket with SUSE-IT? Maybe the change was unintended on their side.
Updated by livdywan over 2 years ago
- Subject changed from fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) to fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to jbaier_cz
Updated by jbaier_cz over 2 years ago
According to https://suse.slack.com/archives/C029APBKLGK/p1658998825405069, this change is a side effect of the new faster worker nodes and it will probably stay that way. Apparently, we should use FQDNs everywhere.
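One way to follow the "FQDN everywhere" advice in scripts that still receive short names is to normalize the name before use. The sketch below is an illustration only, not the content of the merge request above; the helper to_fqdn is hypothetical, and the default domain suse.de is an assumption based on the hostnames in the ping output earlier in this ticket.

```shell
#!/bin/sh
# Assumption: suse.de is the right domain for the affected workers.
DEFAULT_DOMAIN="${DEFAULT_DOMAIN:-suse.de}"

# Hypothetical helper: return an FQDN for a possibly short hostname.
to_fqdn() {
    case "$1" in
        *.*) printf '%s\n' "$1" ;;                      # already has a domain part
        *)   printf '%s.%s\n' "$1" "$DEFAULT_DOMAIN" ;; # short name: append domain
    esac
}
```

For example, `ping -c1 "$(to_fqdn "$MACHINE")"` no longer depends on the search list of the CI runner.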
Updated by okurz over 2 years ago
ok, good to know. Then could you please quickly check our other common GitLab CI pipeline configs to see whether we might run into the same problem there?
Updated by jbaier_cz over 2 years ago
- Status changed from In Progress to Feedback
I found similar usage only in https://gitlab.suse.de/openqa/monitor-o3, but since we ping through ssh via o3 itself there, DNS works fine. So likely none of our other pipelines are currently in danger.
I created two more MRs: one to provide a more convenient way to start the pipeline manually, and one to lengthen the timeout before the next power cycle.
As of now, all 3 arm workers are up; I will apply the rollback steps.