action #138746
closed: [tools] s390x VM randomly fails to open QCOW disk image: Permission denied
Description
s390x tests randomly fail to boot because the VM does not have permission to open the disk image. Multiple workers have the same issue. Restarting the job usually fixes the issue. Examples:
https://openqa.suse.de/tests/12711015#step/bootloader_zkvm/31
https://openqa.suse.de/tests/12711015/logfile?filename=autoinst-log.txt
https://openqa.suse.de/tests/12716015#step/bootloader_zkvm/31
https://openqa.suse.de/tests/12716015/logfile?filename=autoinst-log.txt
https://openqa.suse.de/tests/12708886#step/bootloader_start/34
https://openqa.suse.de/tests/12708886/logfile?filename=autoinst-log.txt
[2023-10-28T00:17:57.550325+02:00] [debug] [pid:56810] [run_ssh_cmd(virsh start openQA-SUT-6 2> >(tee /tmp/os-autoinst-openQA-SUT-6-stderr.log >&2))] stderr:
error: Failed to start domain 'openQA-SUT-6'
error: internal error: process exited while connecting to monitor: 2023-10-27T22:17:57.331249Z qemu-system-s390x: -blockdev {"driver":"file","filename":"/var/lib/libvirt/images//SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231027-1-Server-DVD-Updates-s390x-kvm.qcow2","node-name":"libvirt-3-storage","cache":{"direct":false,"no-flush":true},"auto-read-only":true,"discard":"unmap"}: Could not open '/var/lib/libvirt/images//SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231027-1-Server-DVD-Updates-s390x-kvm.qcow2': Permission denied
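To check whether another failed job shows the same signature, the log can be scanned for the error quoted above. This is a minimal sketch, not part of os-autoinst or openQA; the function name `find_denied_images` is made up for illustration, and it assumes the log text has already been downloaded (for example from the `autoinst-log.txt` links above).

```python
import re

# Signature of the failure quoted in the ticket description:
# "Could not open '<path>': Permission denied"
DENIED = re.compile(r"Could not open '(?P<path>[^']+)': Permission denied")


def find_denied_images(log_text):
    """Return the image paths QEMU could not open, in order of appearance."""
    return [match.group("path") for match in DENIED.finditer(log_text)]
```

An empty result means the job failed for some other reason and is probably unrelated to this ticket.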
Updated by MDoucha 12 months ago
Looking at the first example in the ticket description, it appears that 3 different jobs ran on the same worker at the same time. All of them rsynced the disk image to the svirt host and then tried to boot. But the first job was blocked by a qemu-img create process owned by another worker slot.
https://openqa.suse.de/tests/12711015
https://openqa.suse.de/tests/12711016
https://openqa.suse.de/tests/12711017
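If the cause really is several worker slots touching the same image at once, one generic way to avoid that is an exclusive lock around the image preparation. The following is only a sketch of that general technique, not what os-autoinst does: the helper name `image_lock` and the `.lock` file next to the image are assumptions, and it only helps if every slot takes the lock on the host where the image lives.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def image_lock(image_path):
    """Hold an exclusive advisory lock while preparing or booting an image.

    Hypothetical helper: the lock file name (image path + ".lock") is an
    assumption made for this sketch.
    """
    fd = os.open(image_path + ".lock", os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until no other holder has the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor also releases the flock.
        os.close(fd)
```

A slot would then wrap its copy, `qemu-img create` and `virsh start` steps in `with image_lock(path):` so that a second slot waits instead of racing. Advisory locks only work if all participants use them.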
Updated by livdywan 12 months ago
- Target version set to future
We took a brief look. It wasn't clear to us where exactly the images are stored: it's not the cache, which is separate and is being freed, as can be seen in the logs. So it's likely not critical right now, but please let us know if it happens more frequently, and add more details.
Updated by mkittler 11 months ago
- Status changed from New to Feedback
Then it is likely best to disable the feature again: https://github.com/os-autoinst/os-autoinst/pull/2401
Considering all the problems we've encountered so far, it is probably not worth it. One can still enable it for tests where it can actually be used.
Note that the permission denied error could have a different cause. At this point, it likely doesn't make much sense to investigate further; it is better to just disable the feature. Otherwise, I would have to be involved again on every bug related to the asset copying. And it probably is in fact the feature (maybe rsync behaves slightly differently when source and destination are on different hosts?).
Updated by okurz 11 months ago
- Subject changed from [kernel] s390x VM randomly fails to open QCOW disk image: Permission denied to [tools] s390x VM randomly fails to open QCOW disk image: Permission denied
- Status changed from Feedback to New
- Target version changed from future to Ready
mkittler wrote in #note-5:
Then it is likely best to disable the feature again: https://github.com/os-autoinst/os-autoinst/pull/2401
Considering all the problems we've encountered so far, it is probably not worth it. One can still enable it for tests where it can actually be used.
Note that the permission denied error could have a different cause. At this point, it likely doesn't make much sense to investigate further; it is better to just disable the feature. Otherwise, I would have to be involved again on every bug related to the asset copying. And it probably is in fact the feature (maybe rsync behaves slightly differently when source and destination are on different hosts?).
I would not underestimate the benefit of the feature, given that for a long time there were various problems and performance bottlenecks in this area. I guess we will have to adopt this ticket into the scope of "[tools]" then.
Updated by livdywan 11 months ago
okurz wrote in #note-6:
mkittler wrote in #note-5:
Note that the permission denied error could have a different cause. At this point, it likely doesn't make much sense to investigate further; it is better to just disable the feature. Otherwise, I would have to be involved again on every bug related to the asset copying. And it probably is in fact the feature (maybe rsync behaves slightly differently when source and destination are on different hosts?).
We can always re-run jobs with the setting flipped to confirm whether a case is related. It could even be done in investigation jobs. Assuming the jobs are otherwise stable.
Updated by mkittler 11 months ago
- Status changed from New to Feedback
Assuming the jobs are otherwise stable.
Unfortunately, problems with this copy command were often sporadic.
For now I'd just keep it disabled; test developers can still decide for themselves whether they want to enable the optimization.
Updated by okurz 11 months ago
- Status changed from Feedback to New
Please link the corresponding ticket about bringing in the svirt worker cache and make sure there is an open ticket about having a reliable, efficient cache approach for svirt workers by default. I am pretty sure we originally had a ticket about that request. Then we can resolve this one because the "Permission denied" problem has been "solved".
Updated by mkittler 11 months ago
- Status changed from New to Resolved
I am pretty sure we have a ticket about that request originally.
I definitely was not working on a ticket when I came up with https://github.com/os-autoinst/os-autoinst/pull/2357. The PR was based on an idea that came up in chat. However, it looks like this ticket is the one you're looking for: #44468
I'll update that ticket to reflect recent developments.