action #159303
closed[alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S
0%
Description
See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604
ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
openqaworker18.qa.suse.cz:
worker31.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
qesapworker-prg6.qa.suse.cz:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
sapworker2.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
openqaworker14.qa.suse.cz:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqaworker1.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
Minion did not return. [Not connected]
Suggestions¶
- DONE Retry the pipeline, could be temporary - seems to persist over a couple hours at least so far
- Ensure that after the machine is up again in #159270 the deployment is retried and works fine
- Wait for #157753 to have complete automation
- This is related to #157753 . openqaworker-arm-1 is expected to be sometimes not responsive but automatic recovery should handle that already. Please check if the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can avoid the osd-deployment to be hung up because of this one host which is expected to be broken from time to time. Let's assume that our automatic recovery is good enough and always recovers the machine soon enough
- Teach livdywan that it was just one machine failing, not all the listed ones
Out of scope¶
- Recovering openqaworker-arm-1: #159270
Updated by okurz 5 months ago
- Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Updated by okurz 5 months ago
- Category set to Regressions/Crashes
This is related to #157753 . openqaworker-arm-1 is expected to be sometimes not responsive but automatic recovery should handle that already. Please check if the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can avoid the osd-deployment to be hung up because of this one host which is expected to be broken from time to time.
Updated by okurz 5 months ago
- Related to action #159270: openqaworker-arm-1 is Unreachable size:S added
Updated by okurz 5 months ago
- Subject changed from [alert] osd-deployment pre-deploy pipeline fails Minion did not return with many workers not responding to [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 5 months ago
- Status changed from Workable to Blocked
- Assignee set to nicksinger
The recovery pipeline can be triggered and runs as expected: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513420
In grafana the according "Contact point" for "Trigger reboot of openqaworker-arm-1" is still present but labeled as "Unused". The deployment was successful after we recovered arm-1 manually: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1088854
Basically this has to wait until #157753 is done to continue further.
Updated by nicksinger 5 months ago
- Priority changed from High to Normal
Also as the immediate problem was mitigated I think we can lower the prio.
Updated by nicksinger 4 months ago
- Status changed from Blocked to Resolved
As discussed in the unblock, we see the main point of validating if the automatic recovery works as covered. If we encounter conflicts in the future between deployment and recovery, we will create a new ticket.
Updated by mkittler 3 months ago
- Status changed from Resolved to Workable
It happened again (see https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2690871), @nicksinger said I can reopen the ticket.
Updated by openqa_review 3 months ago
- Due date set to 2024-06-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 3 months ago
As of today the deployment failed again because of arm-1 being offline. I restarted it and could reproduce the issue. salt 'openqaworker-arm-1.qe.nue2.suse.org' test.ping
is hanging for a long time while I can still ssh onto the machine. That would explain why we don't see an automatic recovery kicking in. The journal of salt-minion is full of the following messages:
Jun 12 10:42:31 openqaworker-arm-1 salt-minion[40944]: System management is locked by the application with pid 36967 (zypper).
Jun 12 10:42:31 openqaworker-arm-1 salt-minion[40944]: Close this application before trying again.
Jun 12 10:42:31 openqaworker-arm-1 salt-minion[40944]: [ERROR ] retcode: 7
Jun 12 10:42:35 openqaworker-arm-1 salt-minion[40944]: [ERROR ] Command 'systemd-run' failed with return code: 7
Jun 12 10:42:35 openqaworker-arm-1 salt-minion[40944]: [ERROR ] stderr: Running scope as unit: run-r7ecec40949804290a5ef758b39625e54.scope
the zypper PID in question is looking like this in ps:
root 36947 0.0 0.0 4068 2652 ? S Jun11 0:00 /bin/sh -c set -x; retry -r 3 -- zypper --no-refresh -n dup --replacefiles || ( if grep --perl-regexp -q "^2024-06-11.*(inconsistent rpm database|DB_PAGE_NOTFOUND)" /var/log/zypper.log; then rpm --rebuilddb || exit 1; fi && retry -r 3 -- zypper --no-refresh -n dup --replacefiles )
root 36960 0.0 0.0 4068 2652 ? S Jun11 0:00 \_ /bin/sh -e /usr/local/bin/retry -r 3 -- zypper --no-refresh -n dup --replacefiles
root 36967 0.0 0.0 351612 251568 ? Sl Jun11 0:42 \_ zypper --no-refresh -n dup --replacefiles
root 36997 0.0 0.0 17368 3784 ? S Jun11 0:00 \_ /usr/bin/systemd-inhibit --what=sleep:shutdown:idle --who=zypp --mode=block --why=Zypp commit running. /usr/bin/cat
root 36998 0.0 0.0 2168 476 ? S Jun11 0:00 | \_ /usr/bin/cat
root 36999 0.0 0.0 4068 2772 ? S Jun11 0:00 \_ /bin/bash /usr/lib/zypp/plugins/commit/btrfs-defrag-plugin.sh
root 37005 0.0 0.0 17232 6836 ? S Jun11 0:00 \_ /usr/lib/zypp/plugins/commit/snapper-zypp-plugin
root 37006 0.0 0.0 15120 11196 ? S Jun11 0:00 \_ /usr/bin/python3 /usr/lib/zypp/plugins/commit/zyppnotify
root 37013 0.0 0.0 54248 48572 ? D Jun11 0:00 \_ rpm --root / --dbpath /usr/lib/sysimage/rpm --define _dump_posttrans 1 -U --percent --noglob --force --nodeps -- /var/cache/zypp/packages/devel_openQA/aarch64/openQA-common-4.6.1718010509.ef66dc1e-lp155.6749.1.aarch64.rpm
that looks like an automated call from within our pipelines. I see similar messages but with other PIDs in the journal as well so this command will eventually complete:
openqaworker-arm-1:~ # journalctl -u salt-minion | grep "System management is locked" | cut -d ":" -f 4- | sort | uniq -c
96760 System management is locked by the application with pid 36967 (zypper).
314541 System management is locked by the application with pid 37599 (zypper).
4 [ERROR ] output: System management is locked by the application with pid 37599 (zypper).
4 [ERROR ] stdout: System management is locked by the application with pid 37599 (zypper).
So maybe we should have a look at the underlying filesystem.
Updated by nicksinger 3 months ago
Btw: the machine is up since "Jun 6 12:10". Starting "Jun 8 00:27" we see more and more entries like:
[Jun 8 00:27] nfs: server openqa.suse.de not responding, still trying
[ +0.039969] nfs: server openqa.suse.de OK
[Jun 8 00:39] nfs: server openqa.suse.de not responding, still trying
[ +0.039751] nfs: server openqa.suse.de OK
[Jun 8 00:46] nfs: server openqa.suse.de not responding, still trying
[ +0.039737] nfs: server openqa.suse.de OK
As of today:
[Jun10 00:43] nfs: server openqa.suse.de not responding, still trying
[Jun10 01:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 02:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 03:48] nfs: server openqa.suse.de not responding, still trying
[Jun10 04:48] nfs: server openqa.suse.de not responding, still trying
[Jun10 05:50] nfs: server openqa.suse.de not responding, still trying
[Jun10 06:55] nfs: server openqa.suse.de not responding, still trying
[Jun10 07:14] systemd-sysv-generator[11946]: SysV service '/etc/init.d/boot.local' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[Jun10 07:16] systemd-sysv-generator[12044]: SysV service '/etc/init.d/boot.local' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[Jun10 07:52] nfs: server openqa.suse.de not responding, still trying
[Jun10 07:56] nfs: server openqa.suse.de not responding, still trying
[Jun10 08:54] nfs: server openqa.suse.de not responding, still trying
[Jun10 09:54] nfs: server openqa.suse.de not responding, still trying
[Jun10 10:56] nfs: server openqa.suse.de not responding, still trying
[Jun10 12:01] nfs: server openqa.suse.de not responding, still trying
[Jun10 12:59] nfs: server openqa.suse.de not responding, still trying
[Jun10 14:02] nfs: server openqa.suse.de not responding, still trying
[Jun10 14:07] nfs: server openqa.suse.de not responding, still trying
[Jun10 15:05] nfs: server openqa.suse.de not responding, still trying
[Jun10 16:04] nfs: server openqa.suse.de not responding, still trying
[Jun10 16:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 17:08] nfs: server openqa.suse.de not responding, still trying
[Jun10 18:06] nfs: server openqa.suse.de not responding, still trying
[Jun10 19:08] nfs: server openqa.suse.de not responding, still trying
[Jun10 19:21] nfs: server openqa.suse.de not responding, still trying
[Jun10 19:46] nfs: server openqa.suse.de not responding, still trying
[Jun10 20:10] nfs: server openqa.suse.de not responding, still trying
[Jun10 21:13] nfs: server openqa.suse.de not responding, still trying
[Jun10 22:13] nfs: server openqa.suse.de not responding, still trying
[Jun10 23:16] nfs: server openqa.suse.de not responding, still trying
[Jun11 00:18] nfs: server openqa.suse.de not responding, still trying
[Jun11 01:22] nfs: server openqa.suse.de not responding, still trying
[Jun11 02:18] nfs: server openqa.suse.de not responding, still trying
[Jun11 03:21] nfs: server openqa.suse.de not responding, still trying
[Jun11 04:22] nfs: server openqa.suse.de not responding, still trying
[Jun11 05:23] nfs: server openqa.suse.de not responding, still trying
[Jun11 06:23] nfs: server openqa.suse.de not responding, still trying
[Jun11 07:26] nfs: server openqa.suse.de not responding, still trying
[Jun11 07:52] nfs: server openqa.suse.de not responding, still trying
I would be surprised if zypper hangs because of a dangling NFS mount but it was the only thing I was able to spot in dmesg.
Updated by nicksinger 3 months ago
nicksinger wrote in #note-12:
I would be surprised if zypper hangs because of a dangling NFS mount but it was the only thing I was able to spot in dmesg.
Or maybe not? NFS provides /var/lib/openqa/share
and the worker-process is using that directory. When zypper wants to update os-autoinst it fails to restart the process because the old one refuses to stop until nfs is responsive again.
Updated by nicksinger 3 months ago
- Status changed from In Progress to Feedback
Indeed that was the culprit of the issue. Our automatic recovery didn't work and I created an MR to look at more places for a broken nfs mount: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1205
While trying to get the mountpoint to work again, I noticed that v6 causes problems. Mounting OSD via v4 worked instantly. I couldn't make out any firewall issues or any difference compared to v4. I tried nmap for portscanning, looked at the traffic in tcpdump on both sides and even setup my own "service" with netcat to confirm connectivity over both protocol versions and tcp+udp. Right when I wanted to enable more verbose output with rpcdebug -m nfs -s proc
NFS over v6 worked again (of courseā¦). So for now the only action I did was to recover the share on arm1 and improve the NFS recovery script. We might need to wait until this happens again or can be reproduced again to debug further and come up with possible solutions/workarounds.
Updated by nicksinger 3 months ago
- Status changed from Feedback to Resolved
Lets close this ticket then and see if we can debug more if this happens again.