Project

General

Profile

Actions

action #152545

closed

Files have been deleted from the cache while the job was running size:M

Added by szarate 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-12-13
Due date:
2023-12-28
% Done:

0%

Estimated time:

Description

Observation

Looking at https://openqa.suse.de/tests/13038326 the cache seemed to be downloading a lot of the stuff, however while it was starting, the qcow was removed from the pool (I guess the

[2023-12-13T07:43:40.520891Z] [debug] [pid:18006] Command `/usr/bin/chattr +C /var/lib/openqa/pool/2/raid` terminated with 1
  /usr/bin/chattr: Operation not supported while reading flags on /var/lib/openqa/pool/2/raid
[2023-12-13T07:43:40.521015Z] [debug] [pid:18006] Configuring storage controllers and block devices
[2023-12-13T07:43:40.521653Z] [debug] [pid:18006] running `/usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2`
[2023-12-13T07:43:40.631962Z] [info] [pid:18006] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  qemu-img: Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': No such file or directory
[2023-12-13T07:43:40.632507Z] [info] [pid:18006] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-12-13T07:43:40.632988Z] [debug] [pid:18006] Passing remaining frames to the video encoder
[2023-12-13T07:43:41.589400Z] [debug] [pid:18006] Waiting for video encoder to finalize the video
[2023-12-13T07:43:41.589502Z] [debug] [pid:18006] The built-in video encoder (pid 18047) terminated
[2023-12-13T07:43:41.590353Z] [debug] [pid:18006] sending magic and exit
[2023-12-13T07:43:41.590705Z] [debug] [pid:17761] received magic close
[2023-12-13T07:43:41.593864Z] [debug] [pid:17761] backend process exited: 0
[2023-12-13T07:43:41.694734Z] [warn] [pid:17761] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
          backend::driver::start_vm(backend::driver=HASH(0x1003bbbab58)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
          OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 100
          OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x1003422b1e0)) called at /usr/bin/isotovideo line 134
          eval {...} called at /usr/bin/isotovideo line 122

If you look at the logs you can observe that 1.2GB is needed, then effectively 11 GB is needed and the new assets are too big for the amount of space gained:

[info] [#28462] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#28462] Downloading "SLES-15-SP5-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/13038326/asset/hdd/SLES-15-SP5-ppc64le-Installtest.qcow2"
[info] [#28462] Cache size 49 GiB + needed 1.2 GiB exceeds limit of 50 GiB, purging least used assets
[info] [#28462] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-:31833:kernel-livepatch-SLE15-SP4_Update_6-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2" because we need space for new assets, reclaiming 2.9 GiB
[info] [#28462] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" is 1.2 GiB, with ETag ""4bf00000-60b0c04e34447""
[info] [#28462] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" successful (28 MiB/s), new cache size is 47 GiB
[...]
[info] [#28465] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
[info] [#28465] Downloading "SLE-15-SP5-Full-ppc64le-GM-Media1.iso" from "http://openqa.suse.de/tests/13038326/asset/iso/SLE-15-SP5-Full-ppc64le-GM-Media1.iso"
[info] [#28465] Cache size 49 GiB + needed 11 GiB exceeds limit of 50 GiB, purging least used assets
[...]
[info] [#28465] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" because we need space for new assets, reclaiming 1.2 GiB
[info] [#28465] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" is 11 GiB, with ETag ""2d1468800-5fc052ed06a80""
[info] [#28465] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" successful (24 MiB/s), new cache size is 49 GiB

The hard-linking approach didn't work because it only happens after the purging. Then the worker fell back to symlinking instead of just failing already at the linking:

    [2023-12-13T07:34:59.119198Z] [debug] [pid:96149] Symlinking asset because hardlink failed: Cannot create link from "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" to "/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2": Invalid cross-device link at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 182.

SLE-15-SP5-Full-ppc64le-GM-Media1.iso is 11GB and SLES-15-SP5-ppc64le-Installtest gets deleted to free space before the job runs

Acceptance criteria

  • AC1: Jobs are no longer incompleting because a required asset is purged later in the setup
  • AC2: Ensure the asset cache is behaving as intended (so it really start cleaning only from the oldest asset and does not clean more than it needs to - also in cases like that)

Out of scope

Implementing a new feature in openQA, like linking directly after the download (before starting the download of the next asset) at the cost of possibly overallocating

Suggestions


Related issues 1 (1 open0 closed)

Related to openQA Project - action #121573: Asset/HDD goes missing while job is runningNew2022-12-06

Actions
Actions

Also available in: Atom PDF