action #46742

test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Could not open .*.qcow.*No such file or directory", likely premature deletion of files from cache

Added by okurz over 1 year ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2019-01-28
Due date:
2020-02-18
% Done:

0%

Estimated time:
Difficulty:
Duration: 277

Description

Observation

In https://openqa.opensuse.org/tests/841053/file/autoinst-log.txt

[2019-01-28T00:52:33.034 CET] [debug] Loading a VM snapshot lastgood
[2019-01-28T00:52:33.035 CET] [debug] Loading snapshot (Current VM state is running).
[2019-01-28T00:52:33.051 CET] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":51576,"seconds":1548633153}}
[2019-01-28T00:52:33.053 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-01-28T00:52:35.439 CET] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 26050 (/usr/bin/isotovideo: backen)
[2019-01-28T00:52:35.623 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/hd0-overlay10 /var/lib/openqa/pool/7/raid/hd0-overlay11 42949672960
[2019-01-28T00:52:35.651 CET] [debug] qemu-img: warning: Could not verify backing image. This may become an error in future versions.
Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open '/var/lib/openqa/pool/7/opensuse-42.3-x86_64-GM-kde@64bit-2G.qcow2': No such file or directory
Formatting '/var/lib/openqa/pool/7/raid/hd0-overlay11', fmt=qcow2 size=42949672960 backing_file=/var/lib/openqa/pool/7/raid/hd0-overlay10 cluster_size=65536 lazy_refcounts=off refcount_bits=16

[2019-01-28T00:52:35.651 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/cd0-overlay10 /var/lib/openqa/pool/7/raid/cd0-overlay11 125829120
[2019-01-28T00:52:35.683 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/cd0-overlay11', fmt=qcow2 size=125829120 backing_file=/var/lib/openqa/pool/7/raid/cd0-overlay10 cluster_size=65536 lazy_refcounts=off refcount_bits=16

[2019-01-28T00:52:35.684 CET] [debug] starting: /usr/bin/qemu-system-x86_64 -vga cirrus -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 2048 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot once=d,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :97,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay11-file,filename=/var/lib/openqa/pool/7/raid/hd0-overlay11,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay11,file=hd0-overlay11-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay11,serial=hd0 -blockdev driver=file,node-name=cd0-overlay11-file,filename=/var/lib/openqa/pool/7/raid/cd0-overlay11,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay11,file=cd0-overlay11-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay11,serial=cd0 -incoming defer
Attempt 0 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 1 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 2 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 3 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 4 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 5 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 6 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 7 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 8 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 9 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 10 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 11 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 12 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 13 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 14 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 15 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 16 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 17 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 18 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 19 at /usr/lib/os-autoinst/osutils.pm line 130.
[2019-01-28T00:52:54.848 CET] [debug] Backend process died, backend errors are reported below in the following lines:
can't open qmp at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 404.

[2019-01-28T00:52:54.848 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
last frame
[2019-01-28T00:52:54.880 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-01-28T00:52:54.882 CET] [debug] QEMU: QEMU emulator version 2.11.2(openSUSE Leap 15.0)
[2019-01-28T00:52:54.882 CET] [debug] QEMU: Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
[2019-01-28T00:52:54.882 CET] [debug] QEMU: qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay11,file=hd0-overlay11-file,cache.no-flush=on: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open '/var/lib/openqa/pool/7/opensuse-42.3-x86_64-GM-kde@64bit-2G.qcow2': No such file or directory
[2019-01-28T00:52:54.882 CET] [debug] sending magic and exit
[2019-01-28T00:52:54.883 CET] [debug] received magic close
[2019-01-28T00:52:54.884 CET] [debug] THERE IS NOTHING TO READ 15 4 3
[2019-01-28T00:52:54.913 CET] [debug] backend process exited: 0
[2019-01-28T00:52:54.916 CET] [debug] commands process exited: 0
[2019-01-28T00:52:55.917 CET] [debug] sysread failed: 
syswrite failed Broken pipe at /usr/lib/os-autoinst/myjsonrpc.pm line 40.
    myjsonrpc::send_json(GLOB(0x55b06c28be48), HASH(0x55b06c2c8b98)) called at /usr/lib/os-autoinst/autotest.pm line 313
    autotest::query_isotovideo("backend_reset_console", HASH(0x55b075dd80e0)) called at /usr/lib/os-autoinst/basetest.pm line 645
    basetest::rollback_activated_consoles(consoletest_finish=HASH(0x55b06d9443d8)) called at /usr/lib/os-autoinst/autotest.pm line 375
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 214
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 214
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 267
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x55b06ddb32e8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 325
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 325
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x55b06ddb32e8), CODE(0x55b06dcba0c8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 476
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x55b06ddb32e8)) called at /usr/lib/os-autoinst/autotest.pm line 268
    autotest::start_process() called at /usr/bin/isotovideo line 251

Related issues

Related to openQA Project - action #12180: [webui] Prevent tests to be triggered when required assets are not present (anymore) (New, 2016-05-31)

Related to openQA Project - action #34783: Don't let jobs incomplete if mandatory resources are missing (Resolved, 2018-04-12)

Related to openQA Infrastructure - action #19238: setup pool devices+mounts+folders with salt (was: ext2 on workers busted) (Resolved, 2017-05-19)

Related to openQA Project - action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download (Resolved, 2019-10-08)

Has duplicate openQA Project - action #55511: job failed to revert to lastgood snapshot "Could not open backing file: Could not open '/var/lib/openqa/pool/2/….iso': No such file or directory" (Rejected, 2019-08-14)

Has duplicate openQA Project - action #54464: qemu-img convert failed (Rejected, 2019-07-19)

Has duplicate openQA Tests - action #34270: [tools] caching issue in openQA? (Rejected, 2018-04-04)

Copied to openQA Project - action #60140: job incompletes failing on initial asset download with "Asset was already requested by another job" (Resolved, 2019-01-28)

History

#1 Updated by rpalethorpe over 1 year ago

It is very unusual to get this message when absolute paths are being used. Possibly it is actually telling the truth and the file has really disappeared. It might be worth automatically checking that the backing image still exists and is accessible if this error happens a lot. I have had a lot of trouble with QEMU's handling of backing files, so I wouldn't be surprised if something is still wrong there, but I also would not be surprised if something unrelated is deleting/corrupting the file.

I think this should just go on the backlog of os-autoinst/QEMU problems for now.
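
The suggestion of automatically checking the backing image could be sketched like this (a hypothetical helper, not part of os-autoinst; it relies on `qemu-img info --output=json`, whose output contains a `backing-filename` key for overlay images):

```python
import json
import os
import subprocess

def backing_file_of(qemu_img_json):
    """Extract the backing file path from `qemu-img info --output=json` output."""
    return json.loads(qemu_img_json).get("backing-filename")

def missing_backing_file(overlay):
    """Return the overlay's backing file path if it is missing on disk, else None.

    A hypothetical check that could run before (re)starting qemu, turning the
    confusing "Could not open backing file" chain into one clear error.
    """
    info = subprocess.run(
        ["qemu-img", "info", "--output=json", overlay],
        capture_output=True, text=True, check=True,
    ).stdout
    backing = backing_file_of(info)
    if backing is not None and not os.path.exists(backing):
        return backing
    return None
```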

#2 Updated by okurz 10 months ago

  • Has duplicate action #55511: job failed to revert to lastgood snapshot "Could not open backing file: Could not open '/var/lib/openqa/pool/2/….iso': No such file or directory" added

#3 Updated by okurz 9 months ago

#4 Updated by okurz 8 months ago

  • Priority changed from Normal to High

This is a nuisance and is actually happening quite often. Bumping prio as I rejected a duplicate, #54464

#5 Updated by coolo 8 months ago

Can we have some data for 'quite often'?

#6 Updated by okurz 8 months ago

Yes. For OSD:

time for i in $(sudo -u geekotest psql --no-align --tuples-only \
    --command="select id from jobs where (result='incomplete' and t_finished >= '2019-09-01');" openqa); do
    id=$(echo -n "0$i")
    log=/var/lib/openqa/testresults/$(echo $id | cut -c -5)/$id*/autoinst-log.txt
    test -e $log && grep -l "Could not open backing file" $log
done | wc -l
141

real    1m18.282s

For 5978 incomplete jobs since 2019-09-01 that is 2.3%; for 111781 jobs overall in this period it is 0.13%.

On o3 we also have 2.3% failing like this out of the incompletes.

There is the hypothesis that this is a problem caused by concurrent worker jobs together with caching. Therefore it might help to look into the log files of the failing workers together with those of the cache services for the same time window.

#7 Updated by coolo 8 months ago

  • Target version set to Ready

One more High prio then

#8 Updated by kraih 8 months ago

  • Assignee set to kraih

#9 Updated by okurz 8 months ago

As discussed in rocket chat:

The error is hardly related to qemu itself but only to the presence or absence of asset files.
qemu is just complaining that it cannot find a file, so it's more like "file vanishes during test run". That qemu is involved should be of no concern.
It's only that "qemu-img" is used to create a qcow2 image with a "backing file", which basically acts as a "copy-on-write" target file. The original is in /var/lib/openqa/cache, and apparently this file, the "backing file", vanishes. Most likely hypothesis: worker 1 requests file A, the cache service downloads file A, and worker 1 starts a job using file A; then worker 2 requests file B, the cache service realizes its quota is exceeded and deletes file A (to make room for file B); worker 1's job fails because file A was deleted.
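
The hypothesised purge-on-download race can be illustrated with a toy model (all names and sizes are made up, and the real cache service tracks last use rather than insertion order; the point is only that nothing marks an asset as "in use by a running job"):

```python
from collections import OrderedDict

class ToyAssetCache:
    """Toy size-limited cache with oldest-first purging.

    Illustrates the hypothesised race: a download for one worker can purge
    the asset another worker's running job still depends on.
    """

    def __init__(self, limit):
        self.limit = limit
        self.assets = OrderedDict()  # name -> size, oldest first

    def download(self, name, size):
        # Purge the oldest assets until the new one fits under the limit.
        while sum(self.assets.values()) + size > self.limit:
            purged, _ = self.assets.popitem(last=False)
            print(f"purging {purged}")  # a running job may lose its backing file here
        self.assets[name] = size

cache = ToyAssetCache(limit=50)
cache.download("file-A.qcow2", 48)  # worker 1 starts a job using file-A
cache.download("file-B.iso", 5)     # worker 2's request purges file-A
assert "file-A.qcow2" not in cache.assets  # worker 1's job now fails
```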

The only known way so far to reproduce (not necessarily replicate) this is to run 10k jobs in production; the error should appear at least once. I would suggest to crosscheck my hypothesis by setting up a local worker host with the caching service and two worker instances, with a cache limit just big enough to hold the first file for a job that you want to clone, and then cloning a second job that also needs an asset which, in sum with the first file, exceeds the cache limit configured in /etc/openqa/workers.ini. The staging machines are already set up that way.
The worker instances run on openqaworker11. I recommend setting the variable "CACHELIMIT" to a reasonably small size (as above) in GB and restarting both services. Probably the easiest is to set the cache size to 5 GB, clone https://openqa.opensuse.org/tests/1057111, which downloads an ISO that should not exceed 4.7 GB, and then clone https://openqa.opensuse.org/tests/1056969, which tries to download a qcow file. That should exceed the quota.
You don't need jobs that "finish quickly", but rather ones that take long enough that the second job is triggered while the first one is still running.
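
The reproduction setup described above boils down to a worker configuration roughly like the following (a sketch only; the HOST value is a placeholder, while CACHEDIRECTORY, CACHELIMIT, and the 5 GB figure are from the instructions above):

```ini
# /etc/openqa/workers.ini (sketch)
[global]
HOST = http://localhost
CACHEDIRECTORY = /var/lib/openqa/cache
# Cache quota in GB; 5 GB is just big enough for the first job's ISO,
# so the second job's qcow download forces a purge while job 1 still runs.
CACHELIMIT = 5
```

After changing this, both the worker and cache services need a restart for the new limit to take effect.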

#10 Updated by kraih 7 months ago

  • Target version changed from Ready to Current Sprint

#11 Updated by kraih 7 months ago

  • Status changed from New to In Progress

Since I can't replicate this problem with a test case, I've started refactoring the cache service and will be adding better logging along the way to make debugging a bit easier.

#12 Updated by okurz 7 months ago

  • Related to action #12180: [webui] Prevent tests to be triggered when required assets are not present (anymore) added

#13 Updated by okurz 7 months ago

  • Related to action #34783: Don't let jobs incomplete if mandatory resources are missing added

#14 Updated by okurz 7 months ago

  • Subject changed from test incompletes trying to revert to qemu snapshot "Could not open backing file: Could not open '/var/lib/openqa/pool/7/opensuse-42.3-x86_64-GM-kde@64bit-2G.qcow2': No such file or directory" to test incompletes trying to revert to qemu snapshot "Could not open backing file: Could not open ….qcow2': No such file or directory", likely premature deletion of files from cache

#15 Updated by okurz 6 months ago

It seems your latest changes have caused a regression, e.g. in https://openqa.opensuse.org/tests/1091856/file/autoinst-log.txt we see

[2019-11-21T20:45:00.0507 CET] [debug] Downloading openSUSE_Krypton_Qt_5.14.x86_64-5.12.80-Build16.2.iso, request sent to Cache Service (148635)
[2019-11-21T20:45:05.0571 CET] [debug] Download of openSUSE_Krypton_Qt_5.14.x86_64-5.12.80-Build16.2.iso processed: Asset was already requested by another job

with the job aborting as incomplete.

As I was not seeing any of these issues until just some days ago, I suspect https://github.com/os-autoinst/openQA/pull/2463 as the culprit.

EDIT: Created a specific ticket #60140 in case it is not closely related and also to describe the symptom in the subject line better.

#16 Updated by okurz 6 months ago

  • Copied to action #60140: job incompletes failing on initial asset download with "Asset was already requested by another job" added

#17 Updated by kraih 6 months ago

  • Status changed from In Progress to Feedback

A patch has been applied that will give us much better cache service logs, so the next time this happens we should be able to tell from the openqa-worker-cacheservice-minion journal exactly why the asset was deleted: https://github.com/os-autoinst/openQA/pull/2548

#18 Updated by okurz 6 months ago

  • Status changed from Feedback to Workable

Changes are deployed to o3 and osd. I looked up tests on osd that recently failed with this error:

for i in $(sudo -u geekotest psql --no-align --tuples-only \
    --command="select id from jobs where (result='incomplete' and t_finished >= '2019-12-01');" openqa); do
    id=$(echo -n "0$i")
    log=/var/lib/openqa/testresults/$(echo $id | cut -c -5)/$id*/autoinst-log.txt
    test -e $log && grep -l "Could not open backing file" $log
done

output:

/var/lib/openqa/testresults/03649/03649083-sle-15-SP2-Online-x86_64-Build99.1-wayland-desktopapps-documentation@64bit-virtio-vga/autoinst-log.txt
/var/lib/openqa/testresults/03665/03665361-sle-15-SP1-Server-DVD-Incidents-Kernel-x86_64-Build4.12.14-711.1.gf5d22ce-ltp_net_sctp@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03660/03660772-sle-12-SP4-Server-DVD-Incidents-Kernel-x86_64-Build4.12.14-425.1.g4305857-ltp_mm_oom@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03658/03658101-sle-15-SP2-Full-x86_64-Build101.1-migration_offline_sle15sp1_ha_alpha_node01@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03656/03656494-sle-15-SP2-JeOS-for-kvm-and-xen-x86_64-Build4.3-jeos-extratest_kvm@uefi-virtio-vga/autoinst-log.txt
/var/lib/openqa/testresults/03665/03665341-sle-12-SP4-Server-DVD-Incidents-Kernel-x86_64-Build4.12.14-426.1.gf529200-ltp_net_sctp@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03649/03649096-sle-15-SP2-Online-x86_64-Build99.1-wayland-desktopapps-gnome@64bit-virtio-vga/autoinst-log.txt
/var/lib/openqa/testresults/03665/03665879-sle-15-SP2-Online-aarch64-Build101.1-extra_tests_kdump_multipath@aarch64-virtio/autoinst-log.txt
/var/lib/openqa/testresults/03649/03649098-sle-15-SP2-Online-x86_64-Build99.1-wayland-desktopapps-other@64bit-virtio-vga/autoinst-log.txt
/var/lib/openqa/testresults/03660/03660542-sle-15-Server-DVD-Incidents-Kernel-x86_64-Build4.12.14-1901.1.g5b29a34-ltp_net_sctp@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03649/03649097-sle-15-SP2-Online-x86_64-Build99.1-wayland-desktopapps-message@64bit-virtio-vga/autoinst-log.txt
/var/lib/openqa/testresults/03661/03661236-sle-15-SP2-Regression-on-Migration-from-SLE15-SPX-to-SLE15-SP2-x86_64-Build101.1-online_sles15_pscc_Addons-basesys-srv-desk-dev-contm-lgm-pcm-wsm_all_full@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03649/03649873-sle-15-SP2-Online-ppc64le-Build99.1-extra_tests_on_gnome@ppc64le/autoinst-log.txt
/var/lib/openqa/testresults/03658/03658102-sle-15-SP2-Full-x86_64-Build101.1-migration_offline_sle15sp1_ha_alpha_node02@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03648/03648285-sle-15-SP2-Online-ppc64le-Build99.1-extra_tests_in_textmode@ppc64le/autoinst-log.txt
/var/lib/openqa/testresults/03659/03659555-sle-15-SP2-Migration-from-SLE15-SPX-to-SLE15-SP2-x86_64-Build101.1-autoupgrade_sles15sp1_pscc_basesys-srv_all_full@64bit_cirrus/autoinst-log.txt
/var/lib/openqa/testresults/03659/03659878-sle-15-SP1-Desktop-DVD-Updates-x86_64-Build20191204-1-qam-gnome@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03648/03648636-sle-15-SP2-Online-x86_64-Build99.1-extra_tests_in_textmode@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03652/03652984-sle-12-SP5-Server-DVD-Incidents-Kernel-x86_64-Build4.12.14-141.1.ga767afb-ltp_net_sctp@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03652/03652753-sle-15-Server-DVD-Incidents-Kernel-x86_64-Build4.12.14-1895.1.gcc52525-ltp_net_sctp@64bit/autoinst-log.txt
/var/lib/openqa/testresults/03652/03652778-sle-15-SP1-Server-DVD-Incidents-Kernel-x86_64-Build4.12.14-707.1.g7d2dfd0-ltp_net_sctp@64bit/autoinst-log.txt

on osd. You can also get the details via the web UI, of course. For example, the last job with id 03652778 would be https://openqa.suse.de/t03652778, run by openqaworker3:13. Can you take a look into the service's journal and investigate further? You can also now look for jobs which have a comment like https://openqa.suse.de/tests/03652778#comment-216934 with the content "poo#46742 premature deletion of files from cache" written by the user "auto-review".

#19 Updated by okurz 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: wayland-desktopapps-message
https://openqa.suse.de/tests/3720719

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed

#20 Updated by kraih 5 months ago

  • Status changed from Workable to Feedback

Looks like most of the OSD workers were only recently updated (around Dec 10), so I could not find any useful journal entries for the jobs above. Some workers are also so busy that the openqa-worker-cacheservice-minion journal only goes back 7 days, so I'll have to be quick when investigating new cases in the future.

#21 Updated by coolo 5 months ago

Checking https://openqa.opensuse.org/tests/1124373 on openqaworker1:

Downloaded at 9pm:

[2019-12-26T21:03:09.0313 CET] [info] Downloading openSUSE-Leap-15.0-DVD-x86_64.iso, request #181245 sent to Cache Service
[2019-12-26T21:04:20.0200 CET] [info] Download of openSUSE-Leap-15.0-DVD-x86_64.iso processed:
[info] [#181244] Cache size of "/var/lib/openqa/cache" is 48GiB, with limit 50GiB
[info] [#181244] Downloading "openSUSE-Leap-15.0-DVD-x86_64.iso" from "http://openqa1-opensuse/tests/1124372/asset/iso/openSUSE-Leap-15.0-DVD-x86_64.iso"
[info] [#181244] Cache size 48GiB + needed 3.6GiB exceeds limit of 50GiB, purging least used assets
[info] [#181244] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:K-MicroOS-DVD-x86_64-Build185.2-Media.iso" because we need space for new assets, reclaiming 1.2GiB
[info] [#181244] Purging "/var/lib/openqa/cache/openqa1-opensuse/windows-10-x86_64-1511@64bit_win.qcow2" because we need space for new assets, reclaiming 8.8GiB
[info] [#181244] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" is 3.6GiB, with ETag ""e9800000-56c575dc99965""
[info] [#181244] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" successful, new cache size is 41GiB

Failed to open at 1am:

[2019-12-27T01:24:25.675 CET] [debug] QEMU: qemu-system-x86_64: -blockdev driver=qcow2,node-name=cd0-overlay14,file=cd0-overlay14-file,cache.no-flush=on: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open backing file: Could not open '/var/lib/openqa/pool/11/openSUSE-Leap-15.0-DVD-x86_64.iso': No such file or directory

Because purged at 01:23 according to journal:

openqaworker1:~ # journalctl -u openqa-worker-cacheservice-minion --since='24 hours ago' | grep openSUSE-Leap-15.0-DVD-x86_64.iso
Dez 26 21:03:09 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:03:09.13869] [21662] [info] [#181244] Downloading: "openSUSE-Leap-15.0-DVD-x86_64.iso"
Dez 26 21:03:09 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:03:09.14347] [21662] [info] [#181244] Downloading "openSUSE-Leap-15.0-DVD-x86_64.iso" from "http://openqa1-opensuse/tests/1124372/asset/iso/openSUSE-Leap-15.0-DVD-x86_64.iso"
Dez 26 21:04:18 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:04:18.21386] [21662] [info] [#181244] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" is 3.6GiB, with ETag ""e9800000-56c575dc99965""
Dez 26 21:04:18 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:04:18.21397] [21662] [info] [#181244] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" successful, new cache size is 41GiB
Dez 27 01:23:15 openqaworker1 openqa-workercache[2721]: [2019-12-27 01:23:15.39021] [22652] [info] [#181337] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" because we need space for new assets, reclaiming 3.6GiB
Dez 27 01:34:22 openqaworker1 openqa-workercache[2721]: [2019-12-27 01:34:22.40626] [26508] [info] [#181357] Downloading: "openSUSE-Leap-15.0-DVD-x86_64.iso"
Dez 27 01:34:22 openqaworker1 openqa-workercache[2721]: [2019-12-27 01:34:22.41048] [26508] [info] [#181357] Downloading "openSUSE-Leap-15.0-DVD-x86_64.iso" from "http://openqa1-opensuse/tests/1124479/asset/iso/openSUSE-Leap-15.0-DVD-x86_64.iso"
Dez 27 01:35:18 openqaworker1 openqa-workercache[2721]: [2019-12-27 01:35:18.44267] [26508] [info] [#181357] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" is 3.6GiB, with ETag ""e9800000-56c575dc99965""
Dez 27 01:35:18 openqaworker1 openqa-workercache[2721]: [2019-12-27 01:35:18.44274] [26508] [info] [#181357] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" successful, new cache size is 48GiB
Dez 27 07:20:49 openqaworker1 openqa-workercache[2703]: [2019-12-27 07:20:49.65200] [10741] [info] [#181423] Downloading: "openSUSE-Leap-15.0-DVD-x86_64.iso"
Dez 27 07:20:49 openqaworker1 openqa-workercache[2703]: [2019-12-27 07:20:49.65826] [10741] [info] [#181423] Downloading "openSUSE-Leap-15.0-DVD-x86_64.iso" from "http://openqa1-opensuse/tests/1124562/asset/iso/openSUSE-Leap-15.0-DVD-x86_64.iso"
Dez 27 07:20:49 openqaworker1 openqa-workercache[2703]: [2019-12-27 07:20:49.69699] [10741] [info] [#181423] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.0-DVD-x86_64.iso" has not changed, updating last use

#22 Updated by kraih 5 months ago

  • Status changed from Feedback to Workable

That actually looks like a great case to investigate; all the logs from the time it happened still exist.

#23 Updated by kraih 5 months ago

So, the job in question was running for many hours, and in the time between the download of the affected asset and its deletion there were 38 downloads.

Dec 26 21:03:09 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:03:09.13869] [21662] [info] [#181244] Downloading: "openSUSE-Leap-15.0-DVD-x86_64.iso"
Dec 26 21:03:15 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:03:15.14592] [21699] [info] [#181246] Downloading: "openSUSE-Leap-15.1-DVD-x86_64.iso"
Dec 26 21:05:34 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:05:34.57778] [23194] [info] [#181254] Downloading: "openSUSE-Leap-15.0-GNOME-Live-x86_64-Snapshot21.19-Media.iso"
Dec 26 21:06:37 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:06:37.13295] [23797] [info] [#181260] Downloading: "Argon.x86_64-15.1-Build2.200.iso"
Dec 26 21:32:01 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:32:01.01051] [31721] [info] [#181262] Downloading: "openSUSE-Leap-15.2-DVD-x86_64-Build545.2-Media.iso"
Dec 26 21:32:01 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:32:01.59090] [31722] [info] [#181263] Downloading: "openSUSE-Leap-15.2-DVD-x86_64-Build545.2-Media.iso"
Dec 26 21:32:09 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:32:09.16080] [31733] [info] [#181266] Downloading: "opensuse-15.0-x86_64-GM-gnome@64bit.qcow2"
Dec 26 21:32:13 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:32:13.87605] [31741] [info] [#181268] Downloading: "openSUSE-Leap-15.2-DVD-x86_64-Build545.2-Media.iso"
Dec 26 21:33:07 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:33:07.90376] [32275] [info] [#181270] Downloading: "opensuse-15.1-x86_64-GM-gnome@64bit.qcow2"
Dec 26 21:33:17 openqaworker1 openqa-workercache[2721]: [2019-12-26 21:33:17.90721] [32306] [info] [#181271] Downloading: "openSUSE-Leap-15.2-DVD-x86_64-Build545.2-Media.iso"
Dec 26 22:05:27 openqaworker1 openqa-workercache[2721]: [2019-12-26 22:05:27.88886] [9364] [info] [#181273] Downloading: "openSUSE-Leap-15.0-Rescue-CD-x86_64-Snapshot21.19-Media.iso"
Dec 26 22:21:42 openqaworker1 openqa-workercache[2721]: [2019-12-26 22:21:42.62590] [15246] [info] [#181275] Downloading: "openSUSE_Krypton.x86_64-5.12.80-Build15.2.iso"
Dec 26 22:31:00 openqaworker1 openqa-workercache[2721]: [2019-12-26 22:31:00.58286] [19563] [info] [#181279] Downloading: "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build185.2-Media.iso"
Dec 26 22:31:01 openqaworker1 openqa-workercache[2721]: [2019-12-26 22:31:01.08708] [19570] [info] [#181280] Downloading: "openSUSE-Staging-DVD-x86_64.qcow2"
Dec 26 23:02:11 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:02:11.93639] [930] [info] [#181288] Downloading: "openSUSE-Leap-15.2-DVD-x86_64-Build545.2-Media.iso"
Dec 26 23:05:30 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:05:30.49369] [2250] [info] [#181290] Downloading: "openSUSE-Leap-15.0-KDE-Live-x86_64-Snapshot21.19-Media.iso"
Dec 26 23:05:31 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:05:31.05619] [2254] [info] [#181292] Downloading: "openSUSE-Leap-15.0-KDE-Live-x86_64-Snapshot21.19-Media.iso"
Dec 26 23:19:44 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:19:44.84675] [8063] [info] [#181296] Downloading: "windows-10-x86_64-1511@64bit_win.qcow2"
Dec 26 23:19:44 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:19:44.85117] [8064] [info] [#181297] Downloading: "opensuse-Tumbleweed-x86_64-20191225-gnome-wayland@64bit_virtio.qcow2"
Dec 26 23:19:46 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:19:46.85822] [8066] [info] [#181298] Downloading: "opensuse-Tumbleweed-x86_64-20191225-gnome@64bit.qcow2"
Dec 26 23:19:48 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:19:48.92653] [8067] [info] [#181299] Downloading: "opensuse-Tumbleweed-x86_64-20191225-gnome@64bit.qcow2"
Dec 26 23:19:51 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:19:51.47949] [8076] [info] [#181300] Downloading: "openSUSE-Tumbleweed-DVD-x86_64-Snapshot20191225-Media.iso"
Dec 26 23:19:54 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:19:54.03760] [8080] [info] [#181301] Downloading: "openSUSE-Tumbleweed-DVD-x86_64-Snapshot20191225-Media.iso"
Dec 26 23:19:56 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:19:56.60052] [8087] [info] [#181302] Downloading: "openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20191225-Media.iso"
Dec 26 23:20:35 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:20:35.49563] [8494] [info] [#181306] Downloading: "openSUSE-Tumbleweed-DVD-x86_64-Snapshot20191225-Media.iso"
Dec 26 23:23:01 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:23:01.54136] [9771] [info] [#181308] Downloading: "openSUSE-Tumbleweed-DVD-x86_64-Snapshot20191225-Media.iso"
Dec 26 23:31:16 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:31:16.66389] [14539] [info] [#181311] Downloading: "windows-10-x86_64-1511@64bit_win.qcow2"
Dec 26 23:31:16 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:31:16.66536] [14538] [info] [#181310] Downloading: "opensuse-Tumbleweed-x86_64-20191225-gnome-wayland@64bit_virtio.qcow2"
Dec 26 23:31:26 openqaworker1 openqa-workercache[2721]: [2019-12-26 23:31:26.67093] [14620] [info] [#181312] Downloading: "openSUSE-Tumbleweed-DVD-x86_64-Snapshot20191225-Media.iso"
Dec 27 00:01:56 openqaworker1 openqa-workercache[2721]: [2019-12-27 00:01:56.51828] [27190] [info] [#181316] Downloading: "openSUSE-Staging:K-Kubic-DVD-x86_64-Build186.1-Media.iso"
Dec 27 00:02:30 openqaworker1 openqa-workercache[2721]: [2019-12-27 00:02:30.53508] [27555] [info] [#181318] Downloading: "openSUSE-Leap-15.0-GNOME-Live-x86_64-Snapshot21.19-Media.iso"
Dec 27 00:19:21 openqaworker1 openqa-workercache[2721]: [2019-12-27 00:19:21.74634] [3723] [info] [#181324] Downloading: "openSUSE-13.1-gnome64.qcow2"
Dec 27 00:19:21 openqaworker1 openqa-workercache[2721]: [2019-12-27 00:19:21.75052] [3724] [info] [#181325] Downloading: "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build186.1-Media.iso"
Dec 27 00:21:03 openqaworker1 openqa-workercache[2721]: [2019-12-27 00:21:03.29345] [5290] [info] [#181328] Downloading: "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build186.1-Media.iso"
Dec 27 00:21:33 openqaworker1 openqa-workercache[2721]: [2019-12-27 00:21:33.16411] [5855] [info] [#181333] Downloading: "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build186.1-Media.iso"
Dec 27 00:54:37 openqaworker1 openqa-workercache[2721]: [2019-12-27 00:54:37.14213] [16658] [info] [#181335] Downloading: "openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build186.1-Media.iso"
Dec 27 01:22:56 openqaworker1 openqa-workercache[2721]: [2019-12-27 01:22:56.54810] [22652] [info] [#181337] Downloading: "openSUSE-Tumbleweed-Rescue-CD-x86_64-Snapshot20191225-Media.iso"
Dec 27 01:22:56 openqaworker1 openqa-workercache[2721]: [2019-12-27 01:22:56.55354] [22653] [info] [#181338] Downloading: "openSUSE-Tumbleweed-XFCE-Live-x86_64-Snapshot20191225-Media.iso"

Of those downloads, 22 were unique, meaning they would end up with a fresher last-use timestamp.

"Argon.x86_64-15.1-Build2.200.iso"
"openSUSE-13.1-gnome64.qcow2"
"openSUSE-Leap-15.0-DVD-x86_64.iso"
"openSUSE-Leap-15.0-GNOME-Live-x86_64-Snapshot21.19-Media.iso"
"openSUSE-Leap-15.0-KDE-Live-x86_64-Snapshot21.19-Media.iso"
"openSUSE-Leap-15.0-Rescue-CD-x86_64-Snapshot21.19-Media.iso"
"openSUSE-Leap-15.1-DVD-x86_64.iso"
"openSUSE-Leap-15.2-DVD-x86_64-Build545.2-Media.iso"
"openSUSE-Staging-DVD-x86_64.qcow2"
"openSUSE-Staging:K-Kubic-DVD-x86_64-Build186.1-Media.iso"
"openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build185.2-Media.iso"
"openSUSE-Staging:K-Tumbleweed-DVD-x86_64-Build186.1-Media.iso"
"openSUSE-Tumbleweed-DVD-x86_64-Snapshot20191225-Media.iso"
"openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20191225-Media.iso"
"openSUSE-Tumbleweed-Rescue-CD-x86_64-Snapshot20191225-Media.iso"
"openSUSE-Tumbleweed-XFCE-Live-x86_64-Snapshot20191225-Media.iso"
"openSUSE_Krypton.x86_64-5.12.80-Build15.2.iso"
"opensuse-15.0-x86_64-GM-gnome@64bit.qcow2"
"opensuse-15.1-x86_64-GM-gnome@64bit.qcow2"
"opensuse-Tumbleweed-x86_64-20191225-gnome-wayland@64bit_virtio.qcow2"
"opensuse-Tumbleweed-x86_64-20191225-gnome@64bit.qcow2"
"windows-10-x86_64-1511@64bit_win.qcow2"

I did not check exact image sizes, but that looks suspiciously close to 50 GiB. So I'm guessing that the cleanup logic in the cache service is working as it should and the job is simply too slow to be running on such a busy worker. If this outcome is unacceptable, we might have to look into a real locking solution that actively prevents files from being deleted while the jobs that requested them are still running.
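The cleanup behaviour described here amounts to a least-recently-used purge up to a size limit. A minimal shell sketch of that idea, using a throwaway temporary directory and simplified to a file-count limit instead of the real 50 GiB size limit (all file names are made up for illustration, this is not the cache service's actual code):

```shell
set -e
cache=$(mktemp -d)
# Three "assets"; fake their last-use times so old.qcow2 is the stalest.
for f in old.qcow2 mid.iso new.iso; do echo data > "$cache/$f"; done
touch -d '2 hours ago' "$cache/old.qcow2"
touch -d '1 hour ago'  "$cache/mid.iso"
# Purge least-recently-used files until at most $limit remain.
limit=2
while [ "$(ls "$cache" | wc -l)" -gt "$limit" ]; do
    # ls -tu sorts by access time, newest first, so tail gives the LRU file
    oldest=$(ls -tu "$cache" | tail -n 1)
    rm "$cache/$oldest"
done
ls "$cache"
```

A job that downloaded old.qcow2 but has not started using it yet loses the race exactly like in the log above, which is why a locking or hard-link scheme is needed.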

#24 Updated by kraih 5 months ago

  • Status changed from Workable to Feedback

#25 Updated by kraih 5 months ago

  • Status changed from Feedback to In Progress

We've discussed the problem a bit in Rocket Chat and came to the conclusion that instead of locks we could also try using hard links if /var/lib/openqa/cache and /var/lib/openqa/pool are on the same partition. This has the advantage that the cache service could keep working as it does right now, but purged assets would persist in the pool until after the job is done.
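The reason this works is that a hard link gives the pool directory its own reference to the asset's inode, so the cache service can unlink its name without destroying the data; the kernel only frees the blocks once the last link is gone. A minimal sketch with made-up paths (both directories must be on the same filesystem, which is exactly the constraint discussed above):

```shell
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/cache" "$tmp/pool"
echo "disk image" > "$tmp/cache/foo.qcow2"
# Hard link into the pool: both names now refer to the same inode.
ln "$tmp/cache/foo.qcow2" "$tmp/pool/foo.qcow2"
# Cache cleanup removes its name, but the inode survives via the pool link.
rm "$tmp/cache/foo.qcow2"
result=$(cat "$tmp/pool/foo.qcow2")
echo "$result"
rm -r "$tmp"
```

A symlink would not help here, since it would dangle as soon as the cache purged the target; only a hard link keeps the data alive for the running job.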

#26 Updated by kraih 5 months ago

Opened a PR with a possible solution. https://github.com/os-autoinst/openQA/pull/2652

#27 Updated by kraih 5 months ago

  • Status changed from In Progress to Feedback

The fix has been merged; the problem might be resolved now.

#28 Updated by kraih 5 months ago

One problem now is that many, if not most, workers still have /var/lib/openqa/pool and /var/lib/openqa/cache on different devices. That will have to change over time. Until then we should make sure that we don't see any cases of this issue on workers where hard links are active.

#29 Updated by okurz 5 months ago

  • Related to action #19238: setup pool devices+mounts+folders with salt(was: ext2 on workers busted) added

#30 Updated by okurz 5 months ago

  • Has duplicate action #34270: [tools]caching issue in openQA? added

#31 Updated by okurz 4 months ago

  • Related to action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download added

#32 Updated by okurz 4 months ago

  • Subject changed from test incompletes trying to revert to qemu snapshot "Could not open backing file: Could not open ….qcow2': No such file or directory", likely premature deletion of files from cache to test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Could not open .*.qcow.*No such file or directory", likely premature deletion of files from cache

#33 Updated by okurz 4 months ago

  • Status changed from Feedback to In Progress
  • Assignee changed from kraih to okurz

Are you OK if I take over? The feature in openQA can be considered done, but we can only provide verification after adapting our infrastructure. My proposal in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/250 ensures that …/pool and …/cache reside on the same partition so that hard links are used.

EDIT: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/250 adds support to our salt recipes for setting up NVMe devices as RAID devices. I am using openqaworker11 for a test reinstallation; see #19238 for details.
If this doesn't work, I can at least experiment further on w13, e.g. with

for i in {001..100}; do openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3835852 WORKER_CLASS=openqaworker13 _GROUP=0 BUILD=poo46742_okurz TEST=memtest_poo46742_okurz_$i; done

-> https://openqa.suse.de/tests/overview?build=poo46742_okurz&distri=sle&version=15-SP2

100/100 passed, so we can be confident that the new setup is fine. As I haven't found a way to reinstall from PXE (see #19238#note-10), we will slowly move over the setup.

#34 Updated by okurz 4 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam-regression-gnome@64bit-virtio-vga
https://openqa.suse.de/tests/3868840

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed

#35 Updated by okurz 4 months ago

  • Due date set to 2020-02-18
  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal

All o3 and openQA workers with NVMe now use the same filesystem for cache+pool to make use of the new feature; see #19238#note-12. We can now wait some time, monitor that no jobs show these symptoms again, and then call the issue resolved.

#36 Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved

The SQL query select id,reason,test from jobs where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and id in (select job_id from comments where text ~ 'poo#46742')) order by id desc; did not return any jobs on either o3 or osd. The last job failing on o3 referencing this ticket is 1159527 from 21 days ago. I guess we have actually finally tackled this issue \o/ Thx all. One final idea I have regarding a hint in docs: https://github.com/os-autoinst/openQA/pull/2760
