action #54173

closed

[virtualization][xen][sporadic] qemu-img Failed to get "write" lock on xen workers (test fails in bootloader_svirt - test fails in bootloader_start)

Added by riafarov almost 5 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Infrastructure
Target version: -
Start date: 2019-07-12
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Formatting '/var/lib/libvirt/images/openQA-SUT-2b.img', fmt=qcow2 size=21474836480 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-07-12T10:10:45.345 CEST] [debug] Command's stderr:
qemu-img: /var/lib/libvirt/images/openQA-SUT-2b.img: Failed to get "write" lock
Is another process using the image?

Under certain conditions, a lock on the image file is left behind. There are multiple ways to address this, such as using a random postfix in the image name plus cleanup. It seems that the destroy command from the previous run fails, so we need to check why.

Fixing that would address the root cause of the issue.

An alternative is to call lsof to see which process is using the image, and to add more diagnostics for the case where the issue occurs.
We could also retry when the command fails, or use the qemu-img info command beforehand to test whether we can get the write lock.
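
The retry and lsof ideas above could be sketched roughly as follows. This is a minimal illustration and not the code used in the test distribution: the helper names (run_command, create_image), the number of attempts and the delay are assumptions made for this example. Only the qemu-img create and lsof invocations and the error string come from the log above.

```python
import subprocess
import time

# Error text taken from the qemu-img stderr in the observation above.
LOCK_ERROR = 'Failed to get "write" lock'


def run_command(argv):
    """Run a command and return (exit_code, stdout, stderr) as text."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def create_image(path, size, run=run_command, attempts=3, delay=5.0,
                 sleep=time.sleep):
    """Create a qcow2 image, retrying while the write lock is still held.

    Hypothetical helper: returns the lsof output collected after each
    failed attempt and raises RuntimeError if the image cannot be created.
    """
    diagnostics = []
    for attempt in range(1, attempts + 1):
        code, _out, err = run(
            ["qemu-img", "create", "-f", "qcow2", path, str(size)])
        if code == 0:
            return diagnostics
        if LOCK_ERROR not in err:
            # A different failure: retrying would only hide it.
            raise RuntimeError(f"qemu-img create failed: {err.strip()}")
        # Record which process still has the image open.
        _code, holders, _err = run(["lsof", path])
        diagnostics.append(holders.strip())
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(
        f"{path} is still locked after {attempts} attempts: {diagnostics}")
```

The command runner is passed in as a parameter so the retry logic can be exercised without qemu-img or lsof installed. Retrying only works around the symptom; the leftover VM from the previous run still has to be cleaned up.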

openQA test in scenario sle-12-SP5-Server-DVD-x86_64-minimal+base@yast-xen-pv@svirt-xen-pv fails in
bootloader_start

Test suite description

Maintainer: jrivera. Select a minimal textmode installation by starting with the default and unselecting all patterns except for "base" and "minimal". Not to be confused with the new system role "minimal" introduced with SLE15. Test modules 'grub_disable_timeout' and 'grub_test' in xen-pv are not scheduled because grub2 doesn't support the xfb console.

Reproducible

Fails since (at least) Build 0222 (current job)

Expected result

Last good: 0219 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 8 (0 open, 8 closed)

Related to openQA Tests - action #55058: [virtualization][xen][u] test fails in bootloader_svirt - domain xml file seems to be corrupt (Resolved, szarate, 2019-08-02)
Related to openQA Tests - action #43673: [functional][u] - test fails in bootloader_zkvm - worker couldn't get write lock for the disk image (Resolved, szarate, 2018-11-12)
Has duplicate openQA Tests - action #53849: [sle][functional][u] test fails in bootloader_zkvm - qemu-img create failed (Rejected, 2019-07-04)
Has duplicate openQA Infrastructure - action #54758: [tools][u] test fails in bootloader_svirt - Failed to get "write" lock via virsh (Rejected, 2019-07-29)
Has duplicate openQA Tests - action #54863: [functional][u] test fails in bootloader_svirt - Missing domains in libvirt but still runnning in XEN. (Resolved, szarate, 2019-07-30)
Has duplicate openQA Tests - action #54866: [functional][u] test fails in bootloader_svirt (Rejected, szarate, 2019-07-30)
Has duplicate openQA Tests - action #55961: [functional][u][sporadic] test fails in bootloader_svirt - qemu-img: Failed to get "write" lock Is another process using the image? (Rejected, riafarov, 2019-08-26)
Blocks openQA Tests - action #58787: [qe-core] test fails in bootloader_svirt - qemu-img create with backing file failed (Rejected, 2019-10-29)

Actions #1

Updated by riafarov almost 5 years ago

  • Subject changed from [virtualization][xen][sporadic] qemu-img Failed to get "write" lockon xen workers to [virtualization][xen][sporadic] qemu-img Failed to get "write" lock on xen workers
Actions #2

Updated by riafarov almost 5 years ago

  • Description updated (diff)
Actions #3

Updated by riafarov almost 5 years ago

After investigating with Matthias Griesmeier, we have identified the root cause.
The problem is that for xen, whenever we call wait_boot anywhere in the test suite (which is the case for any test suite that needs an explicit reboot), we set SVIRT_KEEP_VM_RUNNING in the method attach_to_running.
When this variable is set, no destroy happens, even if the job fails. This leads to the problem described above.

It seems we need to at least unset this variable once the machine has booted, but from the code it looks like attach_to_running is executed when the machine is already running.
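
To make the mechanism easier to follow, here is a toy model of the behaviour described in this comment. It is not the os-autoinst or test distribution code: the functions below are hypothetical Python stand-ins that only mirror the names used above, and after_boot is an assumed hook for the proposed mitigation.

```python
def attach_to_running(variables):
    """Model of the described behaviour: attaching sets the keep flag."""
    variables["SVIRT_KEEP_VM_RUNNING"] = 1


def after_boot(variables):
    """Assumed mitigation: drop the flag once the machine has booted."""
    variables.pop("SVIRT_KEEP_VM_RUNNING", None)


def should_destroy_vm(variables):
    """Cleanup at job end destroys the VM only when the flag is not set."""
    return not variables.get("SVIRT_KEEP_VM_RUNNING")


job_vars = {}
attach_to_running(job_vars)
# With the flag still set, a failing job leaves the VM and its lock behind.
leaked = not should_destroy_vm(job_vars)

after_boot(job_vars)
# Once the flag is cleared, cleanup would destroy the VM again.
cleaned = should_destroy_vm(job_vars)
```

In this model, leaked is true until after_boot runs, which matches the observation that a failed job keeps the image locked for the next run.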

Actions #4

Updated by SLindoMansilla over 4 years ago

This ticket is a duplicate of #53849
But, I will keep this one since it has more information about the problem.

Actions #5

Updated by SLindoMansilla over 4 years ago

  • Has duplicate action #53849: [sle][functional][u] test fails in bootloader_zkvm - qemu-img create failed added
Actions #6

Updated by szarate over 4 years ago

  • Has duplicate action #54758: [tools][u] test fails in bootloader_svirt - Failed to get "write" lock via virsh added
Actions #7

Updated by szarate over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to szarate
Actions #8

Updated by szarate over 4 years ago

  • Has duplicate action #54863: [functional][u] test fails in bootloader_svirt - Missing domains in libvirt but still runnning in XEN. added
Actions #9

Updated by szarate over 4 years ago

  • Related to action #55058: [virtualization][xen][u] test fails in bootloader_svirt - domain xml file seems to be corrupt added
Actions #10

Updated by szarate over 4 years ago

After doing some cleanup yesterday, some svirt jobs are able to run again; poo#55058 is still there, though.

Actions #11

Updated by riafarov over 4 years ago

Now I can see the same issue on s390x with svirt too, whereas our initial analysis detected the issue only for xen:
Formatting '/var/lib/libvirt/images/openQA-SUT-3a.img', fmt=qcow2 size=32212254720 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-08-08T10:59:22.161 CEST] [debug] Command's stderr:
qemu-img: /var/lib/libvirt/images/openQA-SUT-3a.img: Failed to get "write" lock
https://openqa.suse.de/tests/3224320/file/autoinst-log.txt

In the logs we can see that the domain is defined.
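
A leftover domain like this could be detected before the image is created by asking libvirt which domains it knows about. The following is a rough sketch under assumptions: the helper names are made up for this example, and it assumes the domain is named openQA-SUT-<instance> as mentioned elsewhere in this ticket. It relies on virsh list --all --name, which prints one domain name per line.

```python
import subprocess


def fetch_domain_list():
    """Return the raw output of `virsh list --all --name`."""
    result = subprocess.run(
        ["virsh", "list", "--all", "--name"],
        capture_output=True, text=True, check=True)
    return result.stdout


def parse_domain_names(virsh_output):
    """Split the virsh output into domain names, skipping blank lines."""
    return [line.strip() for line in virsh_output.splitlines()
            if line.strip()]


def domain_is_defined(instance, virsh_output):
    """Check whether the domain for this worker instance still exists."""
    return f"openQA-SUT-{instance}" in parse_domain_names(virsh_output)
```

On a worker host this would be called as domain_is_defined(3, fetch_domain_list()). The parsing is kept separate from the virsh call so it can be checked against captured output.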

Actions #12

Updated by szarate over 4 years ago

  • Has duplicate action #54866: [functional][u] test fails in bootloader_svirt added
Actions #13

Updated by szarate over 4 years ago

  • Status changed from In Progress to Workable
Actions #14

Updated by riafarov over 4 years ago

  • Has duplicate action #55961: [functional][u][sporadic] test fails in bootloader_svirt - qemu-img: Failed to get "write" lock Is another process using the image? added
Actions #15

Updated by SLindoMansilla over 4 years ago

  • Subject changed from [virtualization][xen][sporadic] qemu-img Failed to get "write" lock on xen workers to [virtualization][xen][sporadic] qemu-img Failed to get "write" lock on xen workers (test fails in bootloader_svirt - test fails in bootloader_start)

Because of the several duplicates, let's make it easier for openQA result reviewers to find this ticket.

Actions #16

Updated by szarate over 4 years ago

  • Status changed from Workable to In Progress
Actions #18

Updated by szarate over 4 years ago

  • Status changed from In Progress to Feedback

PR is open, waiting for reviews.

Actions #19

Updated by szarate over 4 years ago

PR merged, waiting for deployment. However, I guess that some more changes will be needed.

Actions #20

Updated by szarate over 4 years ago

Next step: Reproduce the actual failure https://progress.opensuse.org/issues/54173#note-11, and survive it.

Actions #21

Updated by szarate over 4 years ago

Now the failure is simply "handled", but maybe virsh dominfo openQA-SUT-$instance --title will be needed. I need to check other jobs in this run to see whether the patch actually solved anything.

https://openqa.suse.de/tests/3418789/file/autoinst-log.txt

Actions #22

Updated by szarate over 4 years ago

Now I have a data point: This job failed with this problem... https://openqa.suse.de/tests/3591586

Will look a bit deeper later.

Actions #23

Updated by szarate over 4 years ago

  • Blocks action #58787: [qe-core] test fails in bootloader_svirt - qemu-img create with backing file failed added
Actions #24

Updated by JRivrain over 4 years ago

I see that it is still happening in this case, https://openqa.suse.de/tests/3698948, where bootloader_svirt was replaced by bootloader_start.

Actions #26

Updated by okurz about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: default@svirt-hyperv-uefi
https://openqa.suse.de/tests/3822960

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #27

Updated by okurz about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: default@svirt-hyperv-uefi
https://openqa.suse.de/tests/3885550

Actions #28

Updated by okurz about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode@svirt-xen-hvm
https://openqa.suse.de/tests/3928603

Actions #29

Updated by okurz about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: default@svirt-hyperv-uefi
https://openqa.suse.de/tests/3983226

Actions #31

Updated by openqa_review about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: default@svirt-hyperv-uefi
https://openqa.suse.de/tests/4038458

Actions #32

Updated by szarate about 4 years ago

https://gitlab.suse.de/foursixnine/openqa-job-retrigger-tool/-/merge_requests/2: for now this is running on phobos; I'm working on a container to run it on GitLab.

Actions #33

Updated by okurz about 4 years ago

Interesting approach :) I suggest you also take a look at https://github.com/os-autoinst/scripts/blob/master/openqa-monitor-incompletes which I call combined with https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues and this is automatically called in https://gitlab.suse.de/openqa/auto-review/pipelines . What is done there is that autoinst-log.txt is automatically parsed against regexes defined in the subject lines of progress tickets, based on special markers. This can easily be extended by replacing openqa-monitor-incompletes with a tool that queries for the failed jobs you want to find labels for, if this is what you want to achieve? :) If you like, we can have a brainstorming chat about that.
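
The log matching described here could look roughly like the sketch below. It is not the openqa-label-known-issues implementation. In particular, the marker syntax auto_review:"<regex>" is an assumption of this example, since the comment only speaks of "special markers", and the function names are made up.

```python
import re

# Assumed marker format: auto_review:"<regex>" somewhere in the subject.
# Limitation of this sketch: the regex itself must not contain a double
# quote, because the first closing quote ends the marker.
MARKER = re.compile(r'auto_review:"(?P<regex>.+?)"')


def extract_review_regex(subject):
    """Return the regex embedded in a ticket subject, or None."""
    match = MARKER.search(subject)
    return match.group("regex") if match else None


def log_matches_ticket(log_text, subject):
    """Check whether a job log matches the regex from a ticket subject."""
    pattern = extract_review_regex(subject)
    if pattern is None:
        return False
    return re.search(pattern, log_text) is not None
```

For example, a subject carrying auto_review:"Failed to get .write. lock" would match the qemu-img error quoted in this ticket, with the dots standing in for the inner quotes.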

Actions #34

Updated by szarate about 3 years ago

  • Status changed from Feedback to Closed

This hasn't been seen for quite a long time, closing

Actions #35

Updated by szarate about 3 years ago

  • Related to action #43673: [functional][u] - test fails in bootloader_zkvm - worker couldn't get write lock for the disk image added
Actions #36

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_minimal_base+sdk_withhome@s390x-kvm-sle15
https://openqa.suse.de/tests/5421382
