action #138746

closed

[tools] s390x VM randomly fails to open QCOW disk image: Permission denied

Added by MDoucha 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-10-30
Due date:
% Done: 0%
Estimated time:
Description

s390x tests randomly fail to boot because the VM does not have permission to open the disk image. Multiple workers have the same issue. Restarting the job usually fixes the issue. Examples:

https://openqa.suse.de/tests/12711015#step/bootloader_zkvm/31
https://openqa.suse.de/tests/12711015/logfile?filename=autoinst-log.txt

https://openqa.suse.de/tests/12716015#step/bootloader_zkvm/31
https://openqa.suse.de/tests/12716015/logfile?filename=autoinst-log.txt

https://openqa.suse.de/tests/12708886#step/bootloader_start/34
https://openqa.suse.de/tests/12708886/logfile?filename=autoinst-log.txt

[2023-10-28T00:17:57.550325+02:00] [debug] [pid:56810] [run_ssh_cmd(virsh  start openQA-SUT-6 2> >(tee /tmp/os-autoinst-openQA-SUT-6-stderr.log >&2))] stderr:
  error: Failed to start domain 'openQA-SUT-6'
  error: internal error: process exited while connecting to monitor: 2023-10-27T22:17:57.331249Z qemu-system-s390x: -blockdev {"driver":"file","filename":"/var/lib/libvirt/images//SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231027-1-Server-DVD-Updates-s390x-kvm.qcow2","node-name":"libvirt-3-storage","cache":{"direct":false,"no-flush":true},"auto-read-only":true,"discard":"unmap"}: Could not open '/var/lib/libvirt/images//SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231027-1-Server-DVD-Updates-s390x-kvm.qcow2': Permission denied
Actions #1

Updated by MDoucha 6 months ago

Looking at the first example in the ticket description, it appears that 3 different jobs ran on the same worker at the same time. All of them rsynced the disk image to the svirt host and then tried to boot. But the first job was blocked by a qemu-img create process owned by another worker slot.

https://openqa.suse.de/tests/12711015
https://openqa.suse.de/tests/12711016
https://openqa.suse.de/tests/12711017
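
The interleaving described above, where several jobs copy the same image to one destination and one of them tries to use the file while another process is still working on it, can in general be avoided by serializing writers with a lock and publishing the file with an atomic rename. The following is only a minimal sketch of that general technique in Python, not what os-autoinst does. The function name `publish_image`, the `.lock` suffix and the `.tmp-` prefix are hypothetical, and the sketch assumes a POSIX host where source and destination are both reachable as local paths.

```python
import fcntl
import os
import shutil
import tempfile


def publish_image(src: str, dest: str, mode: int = 0o644) -> None:
    """Copy src to dest so that readers never observe a partial file.

    Hypothetical helper for illustration; not part of os-autoinst.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    lock_path = dest + ".lock"
    # One lock file per destination serializes concurrent writers.
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Write to a temporary file in the same directory so the
            # final rename stays on one filesystem and is atomic.
            fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
            try:
                with os.fdopen(fd, "wb") as tmp, open(src, "rb") as source:
                    shutil.copyfileobj(source, tmp)
                # mkstemp creates the file with mode 0600; set the
                # intended permissions before the file becomes visible.
                os.chmod(tmp_path, mode)
                os.replace(tmp_path, dest)
            except BaseException:
                # Clean up the temporary file if anything went wrong.
                if os.path.exists(tmp_path):
                    os.unlink(tmp_path)
                raise
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With this pattern a reader either sees the previous complete file or the new complete file with its final permissions, never an intermediate state. Whether the failure in this ticket is caused by such an intermediate state has not been confirmed.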

Actions #2

Updated by livdywan 6 months ago

  • Target version set to future

We took a brief look. It wasn't clear to us where exactly the images are stored. It's not the cache, which is separate and is being freed, as can be seen in the logs. So it's likely not critical right now, but please let us know if it happens more frequently and add more details.

Actions #3

Updated by MDoucha 6 months ago

The path from the error message is stored on the svirt host, which is separate from the worker machine. The disk image files get rsynced from the worker cache to the svirt host via network.
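
When a "Permission denied" like the one in the description shows up, it can help to record the mode and ownership of the file on the host where it actually lives. Below is a minimal sketch in Python, assuming it is run on the svirt host against the path from the error message. The helper name `describe_permissions` is hypothetical and nothing here is part of os-autoinst.

```python
import os
import stat


def describe_permissions(path: str) -> dict:
    """Return mode, owner and group IDs of path, e.g. for a bug report."""
    info = os.stat(path)
    return {
        "path": path,
        # Symbolic mode string in the style of `ls -l`.
        "mode": stat.filemode(info.st_mode),
        "uid": info.st_uid,
        "gid": info.st_gid,
        # Whether the user running this script could read the file;
        # the qemu process may run as a different user.
        "readable_by_caller": os.access(path, os.R_OK),
    }
```

Since the failures are sporadic, a snapshot taken after the fact may already show the final permissions; the output is only a data point, not a diagnosis.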

Actions #4

Updated by okurz 6 months ago

  • Subject changed from s390x VM randomly fails to open QCOW disk image: Permission denied to [kernel] s390x VM randomly fails to open QCOW disk image: Permission denied
  • Assignee set to mkittler

@mkittler very likely related to your work on the svirt asset cache

Actions #5

Updated by mkittler 6 months ago

  • Status changed from New to Feedback

Then it is likely best to disable the feature again: https://github.com/os-autoinst/os-autoinst/pull/2401

Considering all the problems we've encountered so far it is probably not worth it. One can still enable it for tests where it can actually be used.

Note that the permission denied error could have a different cause. At this point, though, it likely doesn't make much sense to investigate further; it is simpler to just disable the feature. Otherwise, I would have to be involved again on every bug related to the asset copying. And the feature probably is in fact the cause (maybe rsync behaves slightly differently when source and destination are on different hosts?).

Actions #6

Updated by okurz 6 months ago

  • Subject changed from [kernel] s390x VM randomly fails to open QCOW disk image: Permission denied to [tools] s390x VM randomly fails to open QCOW disk image: Permission denied
  • Status changed from Feedback to New
  • Target version changed from future to Ready

mkittler wrote in #note-5:

Then it is likely best to disable the feature again: https://github.com/os-autoinst/os-autoinst/pull/2401

Considering all the problems we've encountered so far it is probably not worth it. One can still enable it for tests where it can actually be used.

Note that the permission denied error could have a different cause. At this point, though, it likely doesn't make much sense to investigate further; it is simpler to just disable the feature. Otherwise, I would have to be involved again on every bug related to the asset copying. And the feature probably is in fact the cause (maybe rsync behaves slightly differently when source and destination are on different hosts?).

I would not underestimate the benefit of the feature, given that for a long time there were various problems and performance bottlenecks in this area. I guess we will have to adopt this ticket into the scope of "[tools]" then.

Actions #7

Updated by livdywan 6 months ago

okurz wrote in #note-6:

mkittler wrote in #note-5:

Note that the permission denied error could have a different cause. At this point, though, it likely doesn't make much sense to investigate further; it is simpler to just disable the feature. Otherwise, I would have to be involved again on every bug related to the asset copying. And the feature probably is in fact the cause (maybe rsync behaves slightly differently when source and destination are on different hosts?).

We can always re-run jobs with the setting flipped to confirm if a case is related. It could even be done in investigation jobs. Assuming the jobs are otherwise stable.

Actions #8

Updated by mkittler 5 months ago

  • Status changed from New to Feedback

Assuming the jobs are otherwise stable.

Unfortunately, problems with this copy command were often sporadic.

For now I'd just keep it disabled and test developers might still decide themselves whether they want to enable the optimization.

Actions #9

Updated by okurz 5 months ago

  • Status changed from Feedback to New

Please link the corresponding ticket about bringing in the svirt worker cache and make sure there is an open ticket about having a reliable, efficient cache approach for svirt workers by default. I am pretty sure we have a ticket about that request originally. Then we can resolve here because the "Permission denied" problem has been "solved".

Actions #10

Updated by mkittler 5 months ago

  • Status changed from New to Resolved

I am pretty sure we have a ticket about that request originally.

I definitely have not worked on a ticket when coming up with https://github.com/os-autoinst/os-autoinst/pull/2357. The PR was based on an idea that came up in chat. However, it looks like this ticket is the one you're looking for: #44468

I'll update that ticket to reflect recent developments.
