action #114586

closed

fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version:
Start date: 2022-07-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

Both openqaworker-arm-1+2 seem to be stuck in reboot & recover loops according to alerts and observations from https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions. I paused the corresponding alerts so that they no longer trigger https://gitlab.suse.de/openqa/grafana-webhook-actions. Maybe the problem is actually within the gitlab CI runners, which seem to have problems resolving hosts like openqaworker-arm-1, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1069000#L8253:

ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
ping: openqaworker-arm-1: No address associated with hostname
…

After pausing the alerts and cancelling the still-running gitlab CI recovery job for openqaworker-arm-1, I could follow in ipmi-openqaworker-arm-1-ipmi sol activate that the machine was coming up just fine (with the usual horribly long boot times).

Acceptance criteria

  • AC1: All o3 arm workers can be recovered by the automatic pipeline
  • AC2: Arm workers that are fine are not recovered and continue to run
  • AC3: We know what the expected domain setup is

Suggestions

  • Check and compare DNS resolution in the corresponding environments, both within the gitlab CI runners and locally.
  • Maybe as a workaround it helps to use the FQDN in the gitlab CI environment.
  • Fix the issue with the help of SUSE-IT, as they recently changed the gitlab CI runners; maybe that had some impact.
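
The FQDN workaround suggested above could look roughly like the following shell sketch. The helper name to_fqdn and the default domain suse.de are assumptions for illustration (the domain is taken from the hostname that resolves in the test further down in this ticket); the actual pipeline code may differ.

```shell
#!/bin/sh
# Hypothetical helper: append a default domain to short hostnames.
# DEFAULT_DOMAIN is an assumption, not taken from the pipeline config.
DEFAULT_DOMAIN="${DEFAULT_DOMAIN:-suse.de}"

to_fqdn() {
    case "$1" in
        *.*) printf '%s\n' "$1" ;;                      # already contains a dot
        *)   printf '%s.%s\n' "$1" "$DEFAULT_DOMAIN" ;; # short name, add domain
    esac
}

# Example (needs network access, so it is left commented out):
# ping -c1 "$(to_fqdn openqaworker-arm-1)"
```

Names that already contain a dot are passed through unchanged, so the helper should be safe to apply to every hostname the pipeline receives.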

Rollback steps

Actions #1

Updated by okurz about 2 years ago

  • Subject changed from fix openqaworker-arm-1+2 recovery pipeline (was: likely stuck in reboot loop) to fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop)
  • Priority changed from High to Immediate

openqaworker-arm-3 is also affected now, see the failed gitlab CI pipeline for recovery; alert paused.

Actions #2

Updated by okurz about 2 years ago

  • Description updated (diff)

Actions #3

Updated by jbaier_cz about 2 years ago

I just tested that inside a different GitLab CI environment:

$ cat /etc/resolv.conf
### /etc/resolv.conf is a symlink to /var/run/netconfig/resolv.conf
### autogenerated by netconfig!
#
# Before you change this file manually, consider to define the
# static DNS configuration using the following variables in the
# /etc/sysconfig/network/config file:
#     NETCONFIG_DNS_STATIC_SEARCHLIST
#     NETCONFIG_DNS_STATIC_SERVERS
#     NETCONFIG_DNS_FORWARDER
# or disable DNS configuration updates via netconfig by setting:
#     NETCONFIG_DNS_POLICY=''
#
# See also the netconfig(8) manual page and other documentation.
#
### Call "netconfig update -f" to force adjusting of /etc/resolv.conf.
search cloud.suse.de
nameserver 10.162.191.2
nameserver 10.160.0.1
$ ping -c1 openqaworker-arm-1.suse.de
PING openqaworker-arm-1.suse.de (10.160.0.245) 56(84) bytes of data.
64 bytes from openqaworker-arm-1.suse.de (10.160.0.245): icmp_seq=1 ttl=62 time=0.198 ms
--- openqaworker-arm-1.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.198/0.198/0.198/0.000 ms
$ ping -c1 openqaworker-arm-1
ping: openqaworker-arm-1: No address associated with hostname

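The search domains are missing: the runner's resolv.conf only has "search cloud.suse.de", so the short name does not resolve while the FQDN does.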
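
To compare runners and local machines, the search list can be read from resolv.conf directly. The following is only a sketch; the function names search_domains and has_search_domain are made up for this example and are not part of any existing tooling.

```shell
#!/bin/sh
# Print every domain listed on "search" lines of a resolv.conf-style file,
# one per line. The file path is an argument so the function can be tried
# on a copy instead of the live /etc/resolv.conf.
search_domains() {
    awk '$1 == "search" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Succeed if the given domain is in the search list of the given file.
has_search_domain() {
    search_domains "$1" | grep -qxF "$2"
}

# Example:
# has_search_domain /etc/resolv.conf suse.de || echo "short names will not resolve, use FQDNs"
```

With the resolv.conf shown above, has_search_domain would fail for suse.de because only cloud.suse.de is listed.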

Actions #5

Updated by okurz about 2 years ago

merged. Retriggered a job to check, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1074564

EDIT: OK, that did not work because the retriggered job does not pick up the new code. I am now triggering a new pipeline https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/445615 with MACHINE=openqaworker-arm-3.
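
For reference, a new pipeline with a MACHINE variable could also be started through the GitLab pipeline trigger API, roughly as sketched below. The host and project path come from the URLs in this ticket; the ref "master", the TRIGGER_TOKEN variable and the function names are assumptions, and this is not necessarily how the pipeline above was started.

```shell
#!/bin/sh
# Assumptions: GITLAB_HOST and PROJECT come from the URLs in this ticket;
# the ref "master" and a trigger token in TRIGGER_TOKEN are hypothetical.
GITLAB_HOST="gitlab.suse.de"
PROJECT="openqa/grafana-webhook-actions"

# Build the trigger endpoint; the project path must be URL-encoded.
trigger_url() {
    encoded=$(printf '%s' "$PROJECT" | sed 's|/|%2F|g')
    printf 'https://%s/api/v4/projects/%s/trigger/pipeline\n' "$GITLAB_HOST" "$encoded"
}

# Start a new pipeline on the current head of the ref for one machine.
trigger_recovery() {
    curl --fail -X POST \
        -F "token=$TRIGGER_TOKEN" \
        -F "ref=${REF:-master}" \
        -F "variables[MACHINE]=$1" \
        "$(trigger_url)"
}

# Example (needs a valid token and network access):
# TRIGGER_TOKEN=... trigger_recovery openqaworker-arm-3
```

Unlike retrying an existing job, a pipeline created this way runs on the latest commit of the ref, so it picks up merged changes.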

I suggest still filing a ticket with SUSE-IT; maybe the change was unintended on their side.

Actions #6

Updated by jbaier_cz about 2 years ago

okurz wrote:

Retriggered a job to check, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/1074564

Retrigger won't work as it will reuse the old configuration. We need a new run based on the new commit.

Actions #7

Updated by okurz about 2 years ago

jbaier_cz wrote:

Retrigger won't work as it will reuse the old configuration. We need a new run based on the new commit.

yes, done

Actions #8

Updated by jbaier_cz about 2 years ago

Much better (at least the DNS part)

PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
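
The two failure modes seen in this ticket (name not resolvable vs. name resolvable but host not answering) can be told apart from the ping output. A rough sketch, with a made-up function name classify_ping, that only knows the message patterns quoted in this ticket:

```shell
#!/bin/sh
# Classify captured ping output (read from stdin) into one of:
#   dns-failure  - the name could not be resolved
#   unreachable  - the name resolved but no reply came back
#   up           - at least one reply was received
#   unknown      - none of the patterns matched
# Only the messages quoted in this ticket are recognised; other resolver
# errors end up as "unknown".
classify_ping() {
    out=$(cat)
    case "$out" in
        *"No address associated with hostname"*) echo dns-failure ;;
        *" 100% packet loss"*)                    echo unreachable ;;
        *" received, "*)                          echo up ;;
        *)                                        echo unknown ;;
    esac
}

# Example (ping reports resolver errors on stderr, hence 2>&1):
# ping -c1 openqaworker-arm-3.suse.de 2>&1 | classify_ping
```

The leading space in " 100% packet loss" matters: without it, a check for "0% packet loss" would also match the 100% case.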

Actions #9

Updated by okurz about 2 years ago

agreed.
Can you file a ticket with SUSE-IT? Maybe the change was unintended on their side.

Actions #10

Updated by livdywan about 2 years ago

  • Subject changed from fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) to fix openqaworker-arm-1+2+3 recovery pipeline (was: likely stuck in reboot loop) size:M
  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to jbaier_cz

Actions #11

Updated by jbaier_cz about 2 years ago

According to https://suse.slack.com/archives/C029APBKLGK/p1658998825405069, this change is a side-effect of the new faster worker nodes and it will probably stay that way. Apparently, we should use FQDNs everywhere.

Actions #12

Updated by okurz about 2 years ago

ok, good to know. Then could you please quickly check other common gitlab CI pipeline configs to see whether we might run into the same problem there?

Actions #13

Updated by jbaier_cz about 2 years ago

  • Status changed from In Progress to Feedback

I found similar usage only in https://gitlab.suse.de/openqa/monitor-o3, but since we ping through ssh via o3 itself, DNS works fine there. So likely none of our other pipelines are currently in danger.

I created 2 more MRs: one to provide a more convenient way to manually start the pipeline and one to make the timeout before the next power cycle longer.

As of now, all 3 arm workers are up; I will apply the rollback steps.
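
The idea behind a longer timeout before the next power cycle can be sketched as a generic wait loop. This is not the code from the MRs; the function name wait_until and the numbers in the example are hypothetical.

```shell
#!/bin/sh
# Run a check command repeatedly until it succeeds or the timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
# Only the sleep time is counted, not the duration of the check itself.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Example (hypothetical values; needs network access):
# wait_until 1800 30 ping -c1 openqaworker-arm-1.suse.de || echo "still down, power cycle again"
```

Given the long boot times of the arm workers mentioned in the description, the timeout has to be generous enough that a machine which is still booting is not power cycled again.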

Actions #14

Updated by jbaier_cz about 2 years ago

  • Status changed from Feedback to Resolved