action #159303 (closed)

[alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S

Added by livdywan 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date: 2024-06-25
% Done: 0%
Estimated time:

Description

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604

ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
openqaworker18.qa.suse.cz:
worker31.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
qesapworker-prg6.qa.suse.cz:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
sapworker2.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
openqaworker14.qa.suse.cz:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqaworker1.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [Not connected]

Suggestions

  • DONE: Retry the pipeline, could be temporary - the failure seems to have persisted for at least a couple of hours so far
  • Ensure that after the machine is up again in #159270 the deployment is retried and works fine
  • Wait for #157753 to have complete automation
  • This is related to #157753. openqaworker-arm-1 is expected to be sometimes unresponsive, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can keep the osd-deployment from being held up by this one host, which is expected to be broken from time to time (see the sketch after this list). Let's assume that our automatic recovery is good enough and always recovers the machine soon enough
  • Teach livdywan that it was just one machine failing, not all the listed ones
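
A minimal sketch (hypothetical, not the actual pipeline code) of how a Salt call in the pre-deploy stage could tolerate a single dead minion instead of failing the whole run - either by bounding the wait or by excluding the known-flaky host via a compound matcher:

# do not wait indefinitely for minions that never return
salt --timeout 30 --hide-timeout '*' test.ping

# or exclude the known-broken host explicitly
salt -C '* not openqaworker-arm-1.qe.nue2.suse.org' test.ping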

Out of scope

  • Recovering openqaworker-arm-1: #159270

Related issues 2 (0 open, 2 closed)

Related to QA - action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M (Resolved, ybonatakis)

Related to openQA Infrastructure - action #159270: openqaworker-arm-1 is Unreachable size:S (Resolved, ybonatakis, 2024-04-19)

Actions #1

Updated by okurz 3 months ago

  • Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Actions #2

Updated by okurz 3 months ago

  • Category set to Regressions/Crashes

This is related to #157753. openqaworker-arm-1 is expected to be sometimes unresponsive, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can keep the osd-deployment from being held up by this one host, which is expected to be broken from time to time.

Actions #3

Updated by okurz 3 months ago

  • Related to action #159270: openqaworker-arm-1 is Unreachable size:S added
Actions #4

Updated by okurz 3 months ago

  • Subject changed from [alert] osd-deployment pre-deploy pipeline fails Minion did not return with many workers not responding to [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by nicksinger 3 months ago

  • Status changed from Workable to Blocked
  • Assignee set to nicksinger

The recovery pipeline can be triggered and runs as expected: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513420
In Grafana the corresponding "Contact point" for "Trigger reboot of openqaworker-arm-1" is still present but labeled as "Unused". The deployment was successful after we recovered arm-1 manually: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1088854

Basically this has to wait until #157753 is done before we can continue.

Actions #6

Updated by nicksinger 3 months ago

  • Priority changed from High to Normal

Also, since the immediate problem was mitigated, I think we can lower the priority.

Actions #7

Updated by nicksinger 2 months ago

  • Status changed from Blocked to Resolved

As discussed in the unblock meeting, we consider the main point, validating that the automatic recovery works, to be covered. If we encounter conflicts between deployment and recovery in the future, we will create a new ticket.

Actions #8

Updated by mkittler about 1 month ago

  • Status changed from Resolved to Workable

It happened again (see https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2690871); @nicksinger said I can reopen the ticket.

Actions #9

Updated by okurz about 1 month ago

  • Status changed from Workable to In Progress
Actions #10

Updated by openqa_review about 1 month ago

  • Due date set to 2024-06-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by nicksinger about 1 month ago

As of today the deployment failed again because arm-1 was offline. I restarted it and could reproduce the issue: salt 'openqaworker-arm-1.qe.nue2.suse.org' test.ping hangs for a long time while I can still SSH onto the machine. That would explain why we don't see the automatic recovery kicking in. The journal of salt-minion is full of the following messages:

Jun 12 10:42:31 openqaworker-arm-1 salt-minion[40944]: System management is locked by the application with pid 36967 (zypper).
Jun 12 10:42:31 openqaworker-arm-1 salt-minion[40944]: Close this application before trying again.
Jun 12 10:42:31 openqaworker-arm-1 salt-minion[40944]: [ERROR   ] retcode: 7
Jun 12 10:42:35 openqaworker-arm-1 salt-minion[40944]: [ERROR   ] Command 'systemd-run' failed with return code: 7
Jun 12 10:42:35 openqaworker-arm-1 salt-minion[40944]: [ERROR   ] stderr: Running scope as unit: run-r7ecec40949804290a5ef758b39625e54.scope

The zypper PID in question looks like this in ps:

root     36947  0.0  0.0   4068  2652 ?        S    Jun11   0:00 /bin/sh -c set -x; retry -r 3 -- zypper --no-refresh -n dup --replacefiles || ( if grep --perl-regexp -q "^2024-06-11.*(inconsistent rpm database|DB_PAGE_NOTFOUND)" /var/log/zypper.log; then rpm --rebuilddb || exit 1; fi && retry -r 3 -- zypper --no-refresh -n dup --replacefiles )
root     36960  0.0  0.0   4068  2652 ?        S    Jun11   0:00  \_ /bin/sh -e /usr/local/bin/retry -r 3 -- zypper --no-refresh -n dup --replacefiles
root     36967  0.0  0.0 351612 251568 ?       Sl   Jun11   0:42      \_ zypper --no-refresh -n dup --replacefiles
root     36997  0.0  0.0  17368  3784 ?        S    Jun11   0:00          \_ /usr/bin/systemd-inhibit --what=sleep:shutdown:idle --who=zypp --mode=block --why=Zypp commit running. /usr/bin/cat
root     36998  0.0  0.0   2168   476 ?        S    Jun11   0:00          |   \_ /usr/bin/cat
root     36999  0.0  0.0   4068  2772 ?        S    Jun11   0:00          \_ /bin/bash /usr/lib/zypp/plugins/commit/btrfs-defrag-plugin.sh
root     37005  0.0  0.0  17232  6836 ?        S    Jun11   0:00          \_ /usr/lib/zypp/plugins/commit/snapper-zypp-plugin
root     37006  0.0  0.0  15120 11196 ?        S    Jun11   0:00          \_ /usr/bin/python3 /usr/lib/zypp/plugins/commit/zyppnotify
root     37013  0.0  0.0  54248 48572 ?        D    Jun11   0:00          \_ rpm --root / --dbpath /usr/lib/sysimage/rpm --define _dump_posttrans 1 -U --percent --noglob --force --nodeps -- /var/cache/zypp/packages/devel_openQA/aarch64/openQA-common-4.6.1718010509.ef66dc1e-lp155.6749.1.aarch64.rpm

That looks like an automated call from within our pipelines. I see similar messages with other PIDs in the journal as well, so this command does eventually complete:

openqaworker-arm-1:~ # journalctl -u salt-minion | grep "System management is locked" | cut -d ":" -f 4- | sort | uniq -c
  96760  System management is locked by the application with pid 36967 (zypper).
 314541  System management is locked by the application with pid 37599 (zypper).
      4  [ERROR   ] output: System management is locked by the application with pid 37599 (zypper).
      4  [ERROR   ] stdout: System management is locked by the application with pid 37599 (zypper).

So maybe we should have a look at the underlying filesystem.
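
A minimal sketch of what to check in this situation, assuming the default libzypp lock file location (/run/zypp.pid); the rest is plain procps. The ps output above already shows the rpm child in state "D" (uninterruptible sleep), which usually points at a hanging mount or filesystem rather than at zypper itself:

# which process currently holds the libzypp lock? (assumption: default lock file path)
cat /run/zypp.pid
ps -o pid,stat,wchan:32,cmd -p "$(cat /run/zypp.pid)"

# list all processes stuck in uninterruptible sleep
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'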

Actions #12

Updated by nicksinger about 1 month ago

Btw: the machine has been up since "Jun 6 12:10". Starting "Jun 8 00:27" we see more and more entries like:

[Jun 8 00:27] nfs: server openqa.suse.de not responding, still trying
[  +0.039969] nfs: server openqa.suse.de OK
[Jun 8 00:39] nfs: server openqa.suse.de not responding, still trying
[  +0.039751] nfs: server openqa.suse.de OK
[Jun 8 00:46] nfs: server openqa.suse.de not responding, still trying
[  +0.039737] nfs: server openqa.suse.de OK

As of today:

[Jun10 00:43] nfs: server openqa.suse.de not responding, still trying
[Jun10 01:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 02:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 03:48] nfs: server openqa.suse.de not responding, still trying
[Jun10 04:48] nfs: server openqa.suse.de not responding, still trying
[Jun10 05:50] nfs: server openqa.suse.de not responding, still trying
[Jun10 06:55] nfs: server openqa.suse.de not responding, still trying
[Jun10 07:14] systemd-sysv-generator[11946]: SysV service '/etc/init.d/boot.local' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[Jun10 07:16] systemd-sysv-generator[12044]: SysV service '/etc/init.d/boot.local' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[Jun10 07:52] nfs: server openqa.suse.de not responding, still trying
[Jun10 07:56] nfs: server openqa.suse.de not responding, still trying
[Jun10 08:54] nfs: server openqa.suse.de not responding, still trying
[Jun10 09:54] nfs: server openqa.suse.de not responding, still trying
[Jun10 10:56] nfs: server openqa.suse.de not responding, still trying
[Jun10 12:01] nfs: server openqa.suse.de not responding, still trying
[Jun10 12:59] nfs: server openqa.suse.de not responding, still trying
[Jun10 14:02] nfs: server openqa.suse.de not responding, still trying
[Jun10 14:07] nfs: server openqa.suse.de not responding, still trying
[Jun10 15:05] nfs: server openqa.suse.de not responding, still trying
[Jun10 16:04] nfs: server openqa.suse.de not responding, still trying
[Jun10 16:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 17:08] nfs: server openqa.suse.de not responding, still trying
[Jun10 18:06] nfs: server openqa.suse.de not responding, still trying
[Jun10 19:08] nfs: server openqa.suse.de not responding, still trying
[Jun10 19:21] nfs: server openqa.suse.de not responding, still trying
[Jun10 19:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 20:10] nfs: server openqa.suse.de not responding, still trying
[Jun10 21:13] nfs: server openqa.suse.de not responding, still trying
[Jun10 22:13] nfs: server openqa.suse.de not responding, still trying
[Jun10 23:16] nfs: server openqa.suse.de not responding, still trying
[Jun11 00:18] nfs: server openqa.suse.de not responding, still trying
[Jun11 01:22] nfs: server openqa.suse.de not responding, still trying
[Jun11 02:18] nfs: server openqa.suse.de not responding, still trying
[Jun11 03:21] nfs: server openqa.suse.de not responding, still trying
[Jun11 04:22] nfs: server openqa.suse.de not responding, still trying
[Jun11 05:23] nfs: server openqa.suse.de not responding, still trying
[Jun11 06:23] nfs: server openqa.suse.de not responding, still trying
[Jun11 07:26] nfs: server openqa.suse.de not responding, still trying
[Jun11 07:52] nfs: server openqa.suse.de not responding, still trying

I would be surprised if zypper hangs because of a dangling NFS mount but it was the only thing I was able to spot in dmesg.

Actions #13

Updated by nicksinger about 1 month ago

nicksinger wrote in #note-12:

I would be surprised if zypper hangs because of a dangling NFS mount but it was the only thing I was able to spot in dmesg.

Or maybe not? NFS provides /var/lib/openqa/share and the worker process uses that directory. When zypper wants to update os-autoinst, it fails to restart the worker process because the old one refuses to stop until NFS is responsive again.
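
A minimal sketch for checking this without blocking the shell, assuming the share is mounted at /var/lib/openqa/share as described above; the timeout value is arbitrary:

# does the mount answer at all? stat blocks on a dead NFS mount, so bound it
timeout 10 stat -t /var/lib/openqa/share || echo "share not responding"

# show NFS mounts with their negotiated options (version, proto, server address)
nfsstat -m

# which processes still hold files open on the share? (fuser can itself hang on a dead mount)
timeout 10 fuser -vm /var/lib/openqa/share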

Actions #14

Updated by nicksinger about 1 month ago

  • Status changed from In Progress to Feedback

Indeed, that was the culprit. Our automatic recovery didn't work, so I created an MR to check more places for a broken NFS mount: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1205

While trying to get the mountpoint to work again, I noticed that IPv6 causes problems. Mounting OSD via IPv4 worked instantly. I couldn't make out any firewall issues or any other difference compared to IPv4. I tried nmap for port scanning, looked at the traffic in tcpdump on both sides and even set up my own "service" with netcat to confirm connectivity over both protocol versions and TCP+UDP. Right when I wanted to enable more verbose output with rpcdebug -m nfs -s proc, NFS over IPv6 worked again (of course…). So for now the only actions I took were to recover the share on arm-1 and to improve the NFS recovery script. We might need to wait until this happens again or can be reproduced to debug further and come up with possible solutions/workarounds.
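
A minimal sketch for the next occurrence, assuming the export path matches the local mount point and using a hypothetical scratch mount point /mnt/nfs-test; proto=tcp vs. proto=tcp6 forces the transport family:

mkdir -p /mnt/nfs-test

# try the share over IPv4 and over IPv6 explicitly
mount -t nfs -o proto=tcp openqa.suse.de:/var/lib/openqa/share /mnt/nfs-test && umount /mnt/nfs-test
mount -t nfs -o proto=tcp6 openqa.suse.de:/var/lib/openqa/share /mnt/nfs-test && umount /mnt/nfs-test

# enable client-side NFS debugging while reproducing, watch dmesg, then clear the flags again
rpcdebug -m nfs -s proc
dmesg --follow
rpcdebug -m nfs -c proc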

Actions #16

Updated by nicksinger about 1 month ago

  • Status changed from Feedback to Resolved

Let's close this ticket then and see if we can debug further if this happens again.
