action #168337
Parent task (closed): openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
[tools]test fails in bootloader_zkvm - auto_review:"qemu-img.*Failed to get shared.*No locks available"
Description
It seems that none of the files under the NFS mount on the s390x worker can be read by the qemu-img info command.
# qemu-img info --output=json /var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2
qemu-img: Could not open '/var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2': Failed to get shared "write" lock: No locks available
Is another process using the image [/var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2]?
s390zl12:/var/lib/libvirt/images # df
Filesystem 1K-blocks Used Available Use% Mounted on
-----
/dev/mapper/3600507638081855cd80000000000004b-part1 411724200 277619460 113116888 72% /var/lib/libvirt/images
openqa.oqa.prg2.suse.org:/var/lib/openqa/share/factory 15030298624 11063209984 3967088640 74% /var/lib/openqa/share/factory
openqa.oqa.prg2.suse.org:/var/lib/openqa/share/factory/hdd/fixed 6427781120 4164560896 2263220224 65% /var/lib/openqa/share/factory/hdd/fixed
Many s390x jobs are blocked.
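To tell this failure apart from other qemu-img errors when checking several workers, the error text can be matched directly. A minimal sketch, assuming bash; the function name is_lock_error is hypothetical and not part of openQA:

```shell
# Hypothetical helper: succeeds if the given qemu-img error text shows the
# NFS locking failure reported in this ticket.
is_lock_error() {
    printf '%s\n' "$1" | grep -q 'Failed to get shared "write" lock: No locks available'
}

# Usage sketch on a worker ($image is the path of the qcow2 file to check):
#   err=$(qemu-img info --output=json "$image" 2>&1 >/dev/null)
#   is_lock_error "$err" && echo "lock failure on $(hostname)"
```

The usage lines capture only stderr, so a successful run with JSON on stdout does not produce a match.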
Observation
openQA test in scenario sle-12-SP5-Server-DVD-Updates-s390x-mau-bind@s390x-kvm fails in bootloader_zkvm
Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
Reproducible
Fails since (at least) Build 20241016-1
sudo salt \* cmd.run 'qemu-img info --output=json /var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2 | grep "No locks available"'
Expected result
Last good: 20241015-1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by rfan1 2 months ago
I am not sure whether the system update of openqa.suse.de caused the issue, but I can see that the kernel was installed on 2024-10-16:
rfan@openqa:/var/lib/openqa/factory/hdd> rpm -qi kernel-default-6.4.0-150600.23.25.1.x86_64
Name : kernel-default
Version : 6.4.0
Release : 150600.23.25.1
Architecture: x86_64
Install Date: Wed 16 Oct 2024 06:14:00 PM UTC
Group : System/Kernel
Size : 198417504
License : GPL-2.0-only
Signature : RSA/SHA256, Wed 02 Oct 2024 09:35:17 AM UTC, Key ID 70af9e8139db7c82
Source RPM : kernel-default-6.4.0-150600.23.25.1.nosrc.rpm
Build Date : Wed 02 Oct 2024 09:29:32 AM UTC
Build Host : h01-ch5a
Relocations : (not relocatable)
Packager : https://www.suse.com/
Vendor : SUSE LLC <https://www.suse.com/>
URL : https://www.kernel.org/
Summary : The Standard Kernel
Description :
The standard kernel for both uniprocessor and multiprocessor systems.
Source Timestamp: 2024-10-01 10:54:01 +0000
GIT Revision: ea7c56db3e5d5339db9d2ca791dee6bb0a2188b1
GIT Branch: SLE15-SP6
Distribution: SUSE Linux Enterprise 15
Updated by pherranz 2 months ago
Also lots of failures due to: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37, <$fh> line 10
https://openqa.suse.de/tests/15706910
https://openqa.suse.de/tests/15706887
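One plausible explanation, which is an assumption and not confirmed in this ticket, is that these errors are a follow-up symptom: when qemu-img info fails it prints nothing on stdout, so the JSON decoder is handed an empty string and gives up at character offset 0. A minimal bash sketch of guarding against that case before decoding; the function name has_json_start is hypothetical:

```shell
# Hypothetical helper: succeeds only if the text, after optional leading
# whitespace, starts with '{' or '[', i.e. it could be a JSON document at all.
has_json_start() {
    # Remove leading whitespace: the inner expansion isolates that whitespace,
    # the outer one strips it as a literal prefix.
    local trimmed="${1#"${1%%[![:space:]]*}"}"
    case "$trimmed" in
        '{'*|'['*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage sketch ($image is the path of the qcow2 file to check):
#   out=$(qemu-img info --output=json "$image" 2>/dev/null)
#   has_json_start "$out" || echo "qemu-img returned no JSON, check stderr first"
```

This only rules out empty or obviously non-JSON output; it does not validate the document.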
Updated by okurz 2 months ago
- Related to action #157981: Upgrade osd webUI host to openSUSE Leap 15.6 size:S added
Updated by okurz 2 months ago
It seems s390x, unreal6 and openqaw5-xen are affected, so svirt+xen as well.
As written in https://suse.slack.com/archives/C02CANHLANP/p1729154139545639?thread_ts=1729129131.475009&cid=C02CANHLANP
a complete OS upgrade rollback is possible in theory, but it would leave us without a reproducer, and the majority of tests don't seem to be affected. I would prefer to stay on the upgraded system, but we can consider partial package downgrades. Should we maybe try the previous kernel on OSD?
Updated by nicksinger 2 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by okurz 2 months ago
As a workaround, or maybe a solution, I have now mounted with the "nolock" option on unreal6 and it is working fine. I will change all current mounts on the clients transiently for now.
sudo salt \* cmd.run 'umount -f -l /var/lib/openqa/share && mount -t nfs -o noauto,nofail,retry=30,ro,x-systemd.automount,x-
ut=30m,nolock openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share; qemu-img info --output=json /var/lib/openqa/share/factory/hdd/SLES
ild20241016-1-Server-DVD-Updates-s390x-kvm.qcow2 | grep "No locks available"'
Updated by okurz 2 months ago
OSD was unresponsive and not operational for several minutes. I have now triggered
for i in failed incomplete; do host=openqa.suse.de failed_since="2024-10-17 08:00" result="result='$i'" local/os-autoinst/scripts/openqa-advanced-retrigger-jobs; done
which restarted about 30 jobs
Updated by nicksinger 2 months ago
- Priority changed from Urgent to High
I found that adding nolock apparently also adds local_lock=all, which on its own was enough on zl12 to make it work again. I like it more than disabling locking completely (despite not knowing the possible problems). It might also give me a hint about what fails to write remotely, and why, and how to debug/fix it. Now every worker can access OSD again and execute jobs, so the urgency is mitigated.
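To verify which workers already carry the changed options, the option string of a mount (for example the fourth field of a line in /proc/mounts) can be inspected. A minimal sketch; has_mount_option is a hypothetical helper name, and the loop in the usage comment is illustrative only:

```shell
# Hypothetical helper: succeeds if the comma-separated mount option string $1
# contains exactly the option $2 (e.g. "nolock" or "local_lock=all").
has_mount_option() {
    # Wrapping both sides in commas avoids partial matches such as
    # "lock" matching inside "nolock".
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage sketch: print NFS mount points that use local locking
#   while read -r _dev mnt fstype opts _rest; do
#       case "$fstype" in
#           nfs*) has_mount_option "$opts" "local_lock=all" && echo "$mnt" ;;
#       esac
#   done < /proc/mounts
```

Matching whole options rather than substrings keeps local_lock=all and local_lock=none apart.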
Updated by szarate 2 months ago
- Subject changed from [tools]test fails in bootloader_zkvm, qemu-img: Could not open '/var/lib/openqa/share/factory/hdd/*-s390x-kvm.qcow2': Failed to get shared "write" lock: No locks available to [tools]test fails in bootloader_zkvm - auto_review:"qemu-img.*Failed to get shared.*No locks available"
Purposefully not adding a soft-fail, to avoid hiding other potential problems in an update.
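The auto_review expression in the new subject is a regular expression. As a quick sanity check, it can be tested against the error message from the description with grep -E. A minimal sketch; the variable names are illustrative and the image path is shortened for the example:

```shell
# Regular expression taken from the ticket subject
auto_review='qemu-img.*Failed to get shared.*No locks available'
# Error line as reported in the description (image path shortened)
line="qemu-img: Could not open 'image.qcow2': Failed to get shared \"write\" lock: No locks available"

# grep -E treats the pattern as an extended regular expression; -q only sets the exit status
if printf '%s\n' "$line" | grep -Eq "$auto_review"; then
    matched=yes
else
    matched=no
fi
echo "matched=$matched"
```

The same check can be repeated with lines from other failing jobs to see whether the expression is too broad or too narrow.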
Updated by openqa_review 2 months ago
- Due date set to 2024-11-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger about 2 months ago
- Status changed from In Progress to Resolved
The situation was already fine again on Friday and I was mainly waiting for further feedback. The main fix was to add nolock to our shares: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1291