action #152545

Files have been deleted from the cache while the job was running size:M

Added by szarate 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
mkittler
Category:
Regressions/Crashes
Target version:
Ready
Start date:
2023-12-13
Due date:
2023-12-28
% Done:
0%

Estimated time:

Description

Observation

Looking at https://openqa.suse.de/tests/13038326, the cache service seemed to be downloading a lot of assets; however, while the job was starting, the qcow2 was removed from the pool (I guess the cache service purged it):

[2023-12-13T07:43:40.520891Z] [debug] [pid:18006] Command `/usr/bin/chattr +C /var/lib/openqa/pool/2/raid` terminated with 1
  /usr/bin/chattr: Operation not supported while reading flags on /var/lib/openqa/pool/2/raid
[2023-12-13T07:43:40.521015Z] [debug] [pid:18006] Configuring storage controllers and block devices
[2023-12-13T07:43:40.521653Z] [debug] [pid:18006] running `/usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2`
[2023-12-13T07:43:40.631962Z] [info] [pid:18006] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  qemu-img: Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': No such file or directory
[2023-12-13T07:43:40.632507Z] [info] [pid:18006] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-12-13T07:43:40.632988Z] [debug] [pid:18006] Passing remaining frames to the video encoder
[2023-12-13T07:43:41.589400Z] [debug] [pid:18006] Waiting for video encoder to finalize the video
[2023-12-13T07:43:41.589502Z] [debug] [pid:18006] The built-in video encoder (pid 18047) terminated
[2023-12-13T07:43:41.590353Z] [debug] [pid:18006] sending magic and exit
[2023-12-13T07:43:41.590705Z] [debug] [pid:17761] received magic close
[2023-12-13T07:43:41.593864Z] [debug] [pid:17761] backend process exited: 0
[2023-12-13T07:43:41.694734Z] [warn] [pid:17761] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
          backend::driver::start_vm(backend::driver=HASH(0x1003bbbab58)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
          OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 100
          OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x1003422b1e0)) called at /usr/bin/isotovideo line 134
          eval {...} called at /usr/bin/isotovideo line 122

If you look at the logs you can observe that first 1.2 GiB is needed, then effectively 11 GiB, and the new assets are too big for the amount of space freed:

[info] [#28462] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#28462] Downloading "SLES-15-SP5-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/13038326/asset/hdd/SLES-15-SP5-ppc64le-Installtest.qcow2"
[info] [#28462] Cache size 49 GiB + needed 1.2 GiB exceeds limit of 50 GiB, purging least used assets
[info] [#28462] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-:31833:kernel-livepatch-SLE15-SP4_Update_6-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2" because we need space for new assets, reclaiming 2.9 GiB
[info] [#28462] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" is 1.2 GiB, with ETag ""4bf00000-60b0c04e34447""
[info] [#28462] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" successful (28 MiB/s), new cache size is 47 GiB
[...]
[info] [#28465] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
[info] [#28465] Downloading "SLE-15-SP5-Full-ppc64le-GM-Media1.iso" from "http://openqa.suse.de/tests/13038326/asset/iso/SLE-15-SP5-Full-ppc64le-GM-Media1.iso"
[info] [#28465] Cache size 49 GiB + needed 11 GiB exceeds limit of 50 GiB, purging least used assets
[...]
[info] [#28465] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" because we need space for new assets, reclaiming 1.2 GiB
[info] [#28465] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" is 11 GiB, with ETag ""2d1468800-5fc052ed06a80""
[info] [#28465] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" successful (24 MiB/s), new cache size is 49 GiB
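
To make the accounting concrete, here is a minimal sketch replaying these numbers (plain Python with rounded, illustrative values; the actual cache service is Perl, and the exact purge order depends on the recorded last use of each asset):

    # Replaying the purge accounting from the log above (all sizes in GiB)
    LIMIT = 50

    # cache minion job #28462: the 1.2 GiB qcow2 is requested
    size = 49
    assert size + 1.2 > LIMIT      # exceeds the limit -> purge LRU assets
    size = size - 2.9 + 1.2        # reclaim 2.9 GiB, then add the qcow2

    # cache minion job #28465: the 11 GiB ISO is requested for the same job
    assert size + 11 > LIMIT       # exceeds the limit again -> purge again;
    # the qcow2 is no longer pending and is now among the least recently
    # used assets, so it is deleted before the job ever opens it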

The hard-linking approach didn't help here because the linking only happens after the purging. The worker then fell back to symlinking instead of failing right away at the linking step:

    [2023-12-13T07:34:59.119198Z] [debug] [pid:96149] Symlinking asset because hardlink failed: Cannot create link from "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" to "/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2": Invalid cross-device link at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 182.

SLE-15-SP5-Full-ppc64le-GM-Media1.iso is 11 GiB, so SLES-15-SP5-ppc64le-Installtest.qcow2 is deleted to free space before the job even runs.
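
That fallback boils down to the following logic (a rough Python sketch of what lib/OpenQA/Worker/Engines/isotovideo.pm does, not the actual Perl code):

    import errno
    import os

    def link_asset(cache_path: str, pool_path: str) -> None:
        try:
            # a hard link keeps the data alive even if the cache entry is purged
            os.link(cache_path, pool_path)
        except OSError as e:
            if e.errno != errno.EXDEV:  # "Invalid cross-device link"
                raise
            # a symlink still points into the cache, so a later purge of the
            # cached file deletes the job's only copy - the failure mode here
            os.symlink(cache_path, pool_path)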

Acceptance criteria

  • AC1: Jobs are no longer incompleting because a required asset is purged later in the setup
  • AC2: Ensure the asset cache is behaving as intended (so it really starts cleaning only from the oldest asset and does not clean more than it needs to, also in cases like this one)

Out of scope

Implementing a new feature in openQA, like linking directly after the download (before starting the download of the next asset) at the cost of possibly overallocating; see the sketch below.
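
Just to illustrate that rejected idea (hypothetical helper names, not real openQA functions):

    def download_into_cache(asset): ...  # hypothetical stand-ins
    def link_into_pool(asset): ...

    # current behavior (simplified): linking into the pool only happens after
    # all downloads, so a later download's purge can remove an earlier asset
    def setup_current(assets):
        for asset in assets:
            download_into_cache(asset)
        for asset in assets:
            link_into_pool(asset)

    # the out-of-scope idea: hard-link each asset right after its download,
    # pinning it at the cost of possibly overallocating disk space
    def setup_link_early(assets):
        for asset in assets:
            download_into_cache(asset)
            link_into_pool(asset)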

Suggestions


Related issues 1 (1 open, 0 closed)

Related to openQA Project - action #121573: Asset/HDD goes missing while job is running (New, 2022-12-06)

Actions #1

Updated by szarate 5 months ago

@okurz I'm kinda missing a "bugs" category for this project, as this isn't a regression

Actions #2

Updated by okurz 5 months ago

  • Tags set to reactive work
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready

szarate wrote in #note-1:

@okurz I'm kinda missing a "bugs" category for this project, as this isn't a regression

We don't have "bugs" because that term is ambiguous. So we treat something as a "regression" if there was clearly a point in time when it was working fine. This helps us because we can then research in our git history where we broke things. Everything else is a "feature request", as in: we shouldn't invest time trying to find out when it broke but just treat it as a new area, which means adding according tests and implementing to make it work as expected by the user. In this case here, though, I do assume this is a regression (or at least an obvious "crash"), so I am adding the according category and also treating it as urgent.

To everybody, please help to fill the ticket description according to https://progress.opensuse.org/projects/openqav3/wiki#Defects

Actions #3

Updated by okurz 5 months ago

  • Related to action #121573: Asset/HDD goes missing while job is running added
Actions #4

Updated by livdywan 5 months ago

  • Description updated (diff)
Actions #5

Updated by livdywan 5 months ago

  • Subject changed from Files have been deleted from the cache while the job was running to Files have been deleted from the cache while the job was running size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by mkittler 5 months ago

  • Assignee set to mkittler
Actions #7

Updated by mkittler 5 months ago

  • Description updated (diff)
  • Status changed from Workable to In Progress
Actions #8

Updated by mkittler 5 months ago

  • Description updated (diff)
Actions #9

Updated by mkittler 5 months ago · Edited

It looks like only PowerPC jobs are affected, at least recently. The most recent non-PowerPC job returned by the following query is from 2023-09-05 10:11:32:

    select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id), reason, arch
    from jobs
    where result = 'incomplete' and reason like '%qemu-img%No such file%' and arch != 'ppc64le'
    order by id desc limit 20;

So I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/698 to increase the asset cache size of our (still in use) PowerPC workers.

By the way, some of our other workers already have CACHELIMIT: 600 but that's not consistently set (likely depending on the disk space actually available).
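
For reference, this setting ends up in the worker configuration; in /etc/openqa/workers.ini the relevant snippet looks like this (value in GiB, 50 by default):

    [global]
    CACHEDIRECTORY = /var/lib/openqa/cache
    CACHELIMIT = 600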


EDIT: It looks like this helped, e.g. https://openqa.suse.de/tests/13061244.

The following query revealed just a single job ID, so I also restarted that job: https://openqa.suse.de/tests/13062733

    select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id), reason, arch
    from jobs
    where result = 'incomplete' and reason like '%qemu-img%No such file%' and arch = 'ppc64le' and clone_id is null
    order by id desc limit 20;

EDIT: The restarted job failed because its parallel sibling is missing an asset. However, that is an ordinary 404 error and not this issue.

Actions #10

Updated by mkittler 5 months ago

I had a look at the code for AC2:

  • We flag assets as pending before starting the download.
  • We immediately update the last use of an asset once it has been downloaded and clear its "pending" flag.
  • We really delete the least recently used assets first.
  • We never delete pending assets or the asset we have just downloaded (but possibly assets we have downloaded before for the same job).
  • The tests in t/25-cache.t look good.

So I would not assume that we have a problem with the cleanup algorithm itself (see the sketch below).
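
A condensed, self-contained sketch of those invariants (plain Python around a dict; the real implementation is Perl on top of SQLite):

    import time

    cache = {}  # asset name -> {"size": GiB, "last_use": epoch, "pending": bool}

    def download(name, size_gib, limit=50):
        # flag the asset as pending before the download starts
        cache[name] = {"size": 0.0, "last_use": time.time(), "pending": True}
        # purge least recently used assets first, never touching pending ones
        for victim in sorted((n for n in cache if not cache[n]["pending"]),
                             key=lambda n: cache[n]["last_use"]):
            if sum(a["size"] for a in cache.values()) + size_gib <= limit:
                break
            del cache[victim]  # an asset fetched earlier for the same job is
                               # not pending anymore, so it can be purged here
        # ... the actual download happens here ...
        cache[name]["size"] = size_gib
        cache[name]["last_use"] = time.time()  # last use updated right away
        cache[name]["pending"] = False         # pending flag cleared

Within a single call nothing pending is ever removed; the gap is that across two calls for the same job the first asset has already lost its protection.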

Note that the cache on mania filled up very fast after increasing it to 100 GiB. With so much traffic, and with assets as big as 11 GiB (more than 20% of the total cache size before I increased it), I would say the observed behavior makes sense without assuming any problem in the cleanup algorithm.

Let's see whether the 100 GiB are enough. We actually have 2 TiB of free disk space left on mania (of 2.2 TiB in total), so I guess increasing the cache size further on mania specifically wouldn't be problematic at all.

Actions #11

Updated by mkittler 5 months ago

  • Priority changed from Urgent to High
Actions #12

Updated by openqa_review 5 months ago

  • Due date set to 2023-12-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

There have been no further failures since yesterday.

Actions #14

Updated by mkittler 5 months ago

  • Status changed from Feedback to Resolved

It still looks good, so AC1 is fulfilled; for AC2 I checked our code/tests again in #152545#note-10.
