action #168337

closed

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

[tools]test fails in bootloader_zkvm - auto_review:"qemu-img.*Failed to get shared.*No locks available"

Added by rfan1 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-10-17
Due date:
% Done: 0%
Estimated time:

Description


It seems that none of the files under the s390x NFS mount can be read by the qemu-img info command.

# qemu-img info --output=json /var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2
qemu-img: Could not open '/var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2': Failed to get shared "write" lock: No locks available
Is another process using the image [/var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2]?

s390zl12:/var/lib/libvirt/images # df
Filesystem                                                         1K-blocks        Used  Available Use% Mounted on
-----
/dev/mapper/3600507638081855cd80000000000004b-part1                411724200   277619460  113116888  72% /var/lib/libvirt/images
openqa.oqa.prg2.suse.org:/var/lib/openqa/share/factory           15030298624 11063209984 3967088640  74% /var/lib/openqa/share/factory
openqa.oqa.prg2.suse.org:/var/lib/openqa/share/factory/hdd/fixed  6427781120  4164560896 2263220224  65% /var/lib/openqa/share/factory/hdd/fixed

Many s390x jobs are blocked.

Observation

openQA test in scenario sle-12-SP5-Server-DVD-Updates-s390x-mau-bind@s390x-kvm fails in
bootloader_zkvm

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20241016-1

sudo salt \* cmd.run 'qemu-img info --output=json /var/lib/openqa/share/factory/hdd/SLES-12-SP5-s390x-mru-install-minimal-with-addons-Build20241016-1-Server-DVD-Updates-s390x-kvm.qcow2 | grep "No locks available"'

Expected result

Last good: 20241015-1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #157981: Upgrade osd webUI host to openSUSE Leap 15.6 size:S (Status: Resolved, Assignee: nicksinger)

Actions #1

Updated by rfan1 2 months ago

I am not sure whether the system update of openqa.suse.de caused the issue, but I can see that the kernel was installed on 10/16:

rfan@openqa:/var/lib/openqa/factory/hdd> rpm -qi kernel-default-6.4.0-150600.23.25.1.x86_64
Name        : kernel-default
Version     : 6.4.0
Release     : 150600.23.25.1
Architecture: x86_64
Install Date: Wed 16 Oct 2024 06:14:00 PM UTC
Group       : System/Kernel
Size        : 198417504
License     : GPL-2.0-only
Signature   : RSA/SHA256, Wed 02 Oct 2024 09:35:17 AM UTC, Key ID 70af9e8139db7c82
Source RPM  : kernel-default-6.4.0-150600.23.25.1.nosrc.rpm
Build Date  : Wed 02 Oct 2024 09:29:32 AM UTC
Build Host  : h01-ch5a
Relocations : (not relocatable)
Packager    : https://www.suse.com/
Vendor      : SUSE LLC <https://www.suse.com/>
URL         : https://www.kernel.org/
Summary     : The Standard Kernel
Description :
The standard kernel for both uniprocessor and multiprocessor systems.


Source Timestamp: 2024-10-01 10:54:01 +0000
GIT Revision: ea7c56db3e5d5339db9d2ca791dee6bb0a2188b1
GIT Branch: SLE15-SP6
Distribution: SUSE Linux Enterprise 15

Actions #2

Updated by okurz 2 months ago

  • Tags set to infra, reactive work, s390x, osd
  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #3

Updated by pherranz 2 months ago

Also lots of failures due to: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37, <$fh> line 10
https://openqa.suse.de/tests/15706910
https://openqa.suse.de/tests/15706887

Actions #4

Updated by okurz 2 months ago

  • Related to action #157981: Upgrade osd webUI host to openSUSE Leap 15.6 size:S added
Actions #5

Updated by okurz 2 months ago

  • Priority changed from High to Urgent
Actions #6

Updated by okurz 2 months ago

It seems s390x, unreal6 and openqaw5-xen are affected, so svirt+xen as well.

As written in https://suse.slack.com/archives/C02CANHLANP/p1729154139545639?thread_ts=1729129131.475009&cid=C02CANHLANP

A complete OS upgrade rollback is possible in theory, but it would not leave us with a reproducer, and the majority of tests don't seem to be affected. I would prefer to stay on the upgraded system, but we can consider partial package downgrades. Should we maybe try the previous kernel on OSD?

Actions #7

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #8

Updated by nicksinger 2 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #9

Updated by okurz 2 months ago

As a workaround, or maybe a solution, I have now mounted with the "nolock" option on unreal6 and it's working fine. I will change all current mounts transiently on clients for now.

sudo salt \* cmd.run 'umount -f -l /var/lib/openqa/share && mount -t nfs -o noauto,nofail,retry=30,ro,x-systemd.automount,x-
ut=30m,nolock openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share; qemu-img info --output=json /var/lib/openqa/share/factory/hdd/SLES
ild20241016-1-Server-DVD-Updates-s390x-kvm.qcow2 | grep "No locks available"'
Actions #10

Updated by okurz 2 months ago

  • Parent task set to #157969
Actions #11

Updated by okurz 2 months ago

OSD was unresponsive and not operational for multiple minutes. I now triggered

for i in failed incomplete; do host=openqa.suse.de failed_since="2024-10-17 08:00" result="result='$i'" local/os-autoinst/scripts/openqa-advanced-retrigger-jobs; done

which restarted about 30 jobs

Actions #12

Updated by nicksinger 2 months ago

  • Priority changed from Urgent to High

I found that adding nolock apparently also adds local_lock=all, which was enough on zl12 to make it work again, and I like it more than disabling locking completely (despite not knowing the possible problems). It might also give me a hint on what fails to write remotely, why, and how to debug/fix it. Now every worker can access OSD again and execute jobs, so the urgency is mitigated.
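
To check which locking-related options are in effect on a client, the options field of the mount table can be inspected. The following is only a minimal sketch and was not part of the ticket: the helper names `mount_options_for` and `has_mount_option` are hypothetical, and the sample mount line is illustrative rather than captured from zl12.

```shell
#!/bin/sh
# Hypothetical helper: print the options field (4th column) of the entry
# mounted at MOUNTPOINT, reading a file in /proc/mounts format.
mount_options_for() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Hypothetical helper: succeed if OPTION occurs in the comma-separated list.
has_mount_option() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Illustrative sample line (an assumption, not real output from a worker).
sample_mounts=$(mktemp)
printf '%s\n' 'openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share nfs ro,nolock,local_lock=all 0 0' > "$sample_mounts"

opts=$(mount_options_for /var/lib/openqa/share "$sample_mounts")
for o in nolock local_lock=all; do
    if has_mount_option "$o" "$opts"; then
        echo "$o: set"
    else
        echo "$o: not set"
    fi
done
rm -f "$sample_mounts"
```

On a real client the second argument to `mount_options_for` can be omitted so that /proc/mounts is read instead of the sample file.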

Actions #13

Updated by szarate 2 months ago

  • Subject changed from [tools]test fails in bootloader_zkvm, qemu-img: Could not open '/var/lib/openqa/share/factory/hdd/*-s390x-kvm.qcow2': Failed to get shared "write" lock: No locks available to [tools]test fails in bootloader_zkvm - auto_review:"qemu-img.*Failed to get shared.*No locks available"

Purposefully not adding soft-failed, to avoid hiding other potential issues in an update.
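
As a quick sanity check, the auto_review expression from the new subject can be tried against the error message quoted in the description. This is a rough sketch that assumes the expression behaves like an extended regular expression under `grep -E`; the helper name `matches_lock_error` and the shortened image path `example.qcow2` are made up for illustration.

```shell
#!/bin/sh
# Expression copied from the auto_review part of the ticket subject.
pattern='qemu-img.*Failed to get shared.*No locks available'

# Hypothetical helper: succeed if stdin contains a line matching the pattern.
matches_lock_error() {
    grep -Eq "$pattern"
}

# Error text as quoted in the description, with a shortened, made-up path.
sample="qemu-img: Could not open '/var/lib/openqa/share/factory/hdd/example.qcow2': Failed to get shared \"write\" lock: No locks available"

if printf '%s\n' "$sample" | matches_lock_error; then
    echo "match"
else
    echo "no match"
fi
```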

Actions #14

Updated by openqa_review 2 months ago

  • Due date set to 2024-11-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by nicksinger about 2 months ago

  • Status changed from In Progress to Resolved

The situation was already fine again on Friday and I was mainly waiting for further feedback. The main fix was to add nolock to our shares: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1291
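
The content of the merge request is not reproduced in this ticket, so the following is only a sketch of the general idea of adding nolock to an NFS option list without duplicating it. The helper name `ensure_option` is hypothetical, and the option lists merely reuse options visible in the mount command from comment #9.

```shell
#!/bin/sh
# Hypothetical helper: append OPTION to a comma-separated option list
# unless it is already present.
ensure_option() {
    case ",$2," in
        *",$1,"*) printf '%s\n' "$2" ;;     # already present, keep list as-is
        ,,)       printf '%s\n' "$1" ;;     # empty list, option stands alone
        *)        printf '%s\n' "$2,$1" ;;  # append to the existing list
    esac
}

ensure_option nolock 'noauto,nofail,retry=30,ro'   # prints noauto,nofail,retry=30,ro,nolock
ensure_option nolock 'ro,nolock'                   # prints ro,nolock
```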

Actions #16

Updated by okurz about 2 months ago

  • Due date deleted (2024-11-01)