action #114586: fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #114586

closed

fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Immediate
Assignee: jbaier_cz
Target version: openQA Project (public) - Ready
Start date: 2022-07-24

Description

Observation

Both openqaworker-arm-1 and openqaworker-arm-2 seem to be stuck in reboot-and-recover loops according to alerts and observations from https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions. I paused the corresponding alerts so that they no longer trigger https://gitlab.suse.de/openqa/grafana-webhook-actions. Maybe the problem is actually within the GitLab CI runners, which seem to have problems resolving hosts like openqaworker-arm-1, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1069000#L8253:

ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
…

After pausing the alerts and cancelling the still-running GitLab CI recovery job for openqaworker-arm-1, I could follow in ipmi-openqaworker-arm-1-ipmi sol activate that the machine was coming up just fine (with the usual horribly long boot times).

Acceptance criteria

AC1: All o3 arm workers can be recovered by the automatic pipeline
AC2: Arm workers that are fine are not recovered and continue to run
AC3: We know what the expected domain setup is

Suggestions

Check and compare DNS resolution in the corresponding environments, within the GitLab CI runners as well as locally.
As a workaround, it may help to use the FQDN in the GitLab CI environment.
Fix the issue with the help of SUSE-IT: they recently changed the GitLab CI runners, which may have had some impact.
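
The first two suggestions could be tried with a small helper like the following. This is only a sketch: the function names to_fqdn and compare_resolution are made up for illustration, and the default domain suse.de is an assumption based on the FQDN that shows up in the ping output in this ticket.

```shell
# Hypothetical helper: qualify a short hostname with a default domain.
# DEFAULT_DOMAIN is an assumption, not taken from any real pipeline config.
DEFAULT_DOMAIN="${DEFAULT_DOMAIN:-suse.de}"

to_fqdn() {
    local host="$1"
    case "$host" in
        *.*) printf '%s\n' "$host" ;;                      # already has a domain part
        *)   printf '%s.%s\n' "$host" "$DEFAULT_DOMAIN" ;; # append the default domain
    esac
}

# Report whether the short name and its FQDN resolve in the current environment.
compare_resolution() {
    local host="$1" name
    for name in "$host" "$(to_fqdn "$host")"; do
        if getent hosts "$name" >/dev/null; then
            echo "$name: resolves"
        else
            echo "$name: does NOT resolve"
        fi
    done
}
```

Running compare_resolution openqaworker-arm-1 once locally and once inside a CI job would show whether only the short name is affected.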

Rollback steps

Unpause the alerts "openqaworker-arm-1 offline", "openqaworker-arm-2 offline" and "openqaworker-arm-3 offline", as well as the corresponding long-time alerts, at https://monitor.qa.suse.de/alerting/list?state=not_ok

Updated by okurz over 2 years ago

Subject changed from fix openqaworker-arm-1+2 recovery pipeline (was: likely stuck in reboot loop) to fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop)
Priority changed from High to Immediate

openqaworker-arm-3 is also affected now; see the failed GitLab CI pipeline for recovery. Alert paused.

Updated by okurz over 2 years ago

Description updated (diff)

Updated by jbaier_cz over 2 years ago

I just tested that inside a different GitLab CI pipeline:

$ cat /etc/resolv.conf
### /etc/resolv.conf is a symlink to /var/run/netconfig/resolv.conf
### autogenerated by netconfig!
#
# Before you change this file manually, consider to define the
# static DNS configuration using the following variables in the
# /etc/sysconfig/network/config file:
#     NETCONFIG_DNS_STATIC_SEARCHLIST
#     NETCONFIG_DNS_STATIC_SERVERS
#     NETCONFIG_DNS_FORWARDER
# or disable DNS configuration updates via netconfig by setting:
#     NETCONFIG_DNS_POLICY=''
#
# See also the netconfig(8) manual page and other documentation.
#
### Call "netconfig update -f" to force adjusting of /etc/resolv.conf.
search cloud.suse.de
nameserver 10.162.191.2
nameserver 10.160.0.1
$ ping -c1 openqaworker-arm-1.suse.de
PING openqaworker-arm-1.suse.de (10.160.0.245) 56(84) bytes of data.
64 bytes from openqaworker-arm-1.suse.de (10.160.0.245): icmp_seq=1 ttl=62 time=0.198 ms
--- openqaworker-arm-1.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.198/0.198/0.198/0.000 ms
$ ping -c1 openqaworker-arm-1
ping: openqaworker-arm-1: No address associated with hostname

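The search domains are missing: the runner only searches cloud.suse.de, so the short name openqaworker-arm-1 does not resolve while openqaworker-arm-1.suse.de does.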
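
To check this quickly in other environments, the search list can be read from resolv.conf directly. A minimal sketch, where search_domains and has_search_domain are hypothetical helper names and the file path is a parameter so it can be tried on a copy:

```shell
# Print the effective search domains from a resolv.conf-style file, one per line.
# If several "search" lines exist, the last one is used, as resolvers do.
search_domains() {
    local file="${1:-/etc/resolv.conf}"
    awk '$1 == "search" { line = $0 }
         END { n = split(line, a, " "); for (i = 2; i <= n; i++) print a[i] }' "$file"
}

# Succeed if the given domain is part of the search list.
has_search_domain() {
    local domain="$1" file="${2:-/etc/resolv.conf}"
    search_domains "$file" | grep -qxF "$domain"
}
```

With the resolv.conf shown above, has_search_domain suse.de fails, which matches the failing ping for the short hostname.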

Updated by jbaier_cz over 2 years ago

I created https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/21 to mitigate the issue.

Updated by okurz over 2 years ago

Merged. Retriggered a job to check, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1074564

EDIT: OK, that did not work because it's not picking up the new code. I am just triggering a new pipeline https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/445615 with MACHINE=openqaworker-arm-3.

I suggest still reporting a ticket with SUSE-IT; maybe the change was unintended on their side.

Updated by jbaier_cz over 2 years ago

okurz wrote:

Retriggered a job to check, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1074564

Retrigger won't work as it will reuse the old configuration. We need a new run based on the new commit.
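
For reference, a new pipeline on the current branch head can also be started from the command line through GitLab's pipeline trigger API. The following is a sketch only: the function name, the API base URL, the project id, the trigger token and the branch name are all placeholders, and only the MACHINE variable comes from this ticket.

```shell
# Start a new pipeline via POST /projects/:id/trigger/pipeline.
# All arguments are placeholders; set DRY_RUN=1 to print the command
# instead of running it.
trigger_pipeline() {
    local api_url="$1" project_id="$2" token="$3" ref="$4" machine="$5"
    ${DRY_RUN:+echo} curl --fail --request POST \
        --form "token=$token" \
        --form "ref=$ref" \
        --form "variables[MACHINE]=$machine" \
        "$api_url/projects/$project_id/trigger/pipeline"
}
```

For example, DRY_RUN=1 trigger_pipeline https://gitlab.example.com/api/v4 123 TOKEN master openqaworker-arm-3 prints the request that would be sent. Unlike retrying an old job, this creates a pipeline from the latest commit on the given ref.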

Updated by okurz over 2 years ago

jbaier_cz wrote:

Retrigger won't work as it will reuse the old configuration. We need a new run based on the new commit.

yes, done

Updated by jbaier_cz over 2 years ago

Much better (at least the DNS part)

PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
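
The output above is a different failure than before: the name now resolves, but the host does not answer. A recovery script could tell the two cases apart roughly like this. It is a sketch that matches on the message strings seen in this ticket plus one other common resolver error; classify_ping is a hypothetical name.

```shell
# Classify captured ping output as dns_failure, unreachable, up or unknown.
classify_ping() {
    local output="$1"
    case "$output" in
        *"No address associated with hostname"*|*"Name or service not known"*)
            echo dns_failure ;;   # the name could not be resolved at all
        *" 100% packet loss"*)
            echo unreachable ;;   # resolved, but no reply received
        *", 0% packet loss"*)
            echo up ;;            # every packet was answered
        *)
            echo unknown ;;       # partial loss or unexpected output
    esac
}
```

Usage could look like classify_ping "$(ping -c1 openqaworker-arm-3.suse.de 2>&1)", so that only an unreachable result leads to a power cycle and a dns_failure result is reported as a pipeline problem instead.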

Updated by okurz over 2 years ago

Agreed.
Can you report a ticket with SUSE-IT? Maybe the change was unintended on their side.

#10

Updated by livdywan over 2 years ago

Subject changed from fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) to fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M
Description updated (diff)
Status changed from New to In Progress
Assignee set to jbaier_cz

#11

Updated by jbaier_cz over 2 years ago

According to https://suse.slack.com/archives/C029APBKLGK/p1658998825405069, this change is a side effect of the new, faster worker nodes and will probably stay that way. Apparently, we should use FQDNs everywhere.

#12

Updated by okurz over 2 years ago

OK, good to know. Then could you please quickly check our other common GitLab CI pipeline configs to see whether we might run into the same problem there?

#13

Updated by jbaier_cz over 2 years ago

Status changed from In Progress to Feedback

I found similar usage only in https://gitlab.suse.de/openqa/monitor-o3, but since we ping through ssh via o3 itself there, DNS works fine. So likely none of our other pipelines are currently in danger.

I created two more MRs: one for a more convenient way to start the pipeline manually and one to lengthen the timeout before the next power cycle.
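
The idea behind the longer timeout, given the long boot times of the arm workers, can be sketched as a wait loop with a deadline. This is not the content of the MR; wait_until and the numbers in the usage note are made up for illustration.

```shell
# Run a check command repeatedly until it succeeds or the timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    while ! "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1    # check still failing after the timeout
        fi
        sleep "$interval"
    done
    return 0            # check succeeded within the timeout
}
```

A call such as wait_until 1800 30 ping -c1 openqaworker-arm-1.suse.de would give the machine up to 30 minutes to answer before the next power cycle is considered.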

As of now, all 3 arm workers are up, so I will apply the rollback steps.

#14

Updated by jbaier_cz over 2 years ago

Status changed from Feedback to Resolved
