action #152545
Files have been deleted from the cache while the job was running size:M
Status: closed
Description
Observation
Looking at https://openqa.suse.de/tests/13038326 the cache service seemed to be downloading a lot of assets, however while the job was starting, the qcow2 was removed from the pool (I guess the cache cleanup purged it to make room for other assets):
[2023-12-13T07:43:40.520891Z] [debug] [pid:18006] Command `/usr/bin/chattr +C /var/lib/openqa/pool/2/raid` terminated with 1
/usr/bin/chattr: Operation not supported while reading flags on /var/lib/openqa/pool/2/raid
[2023-12-13T07:43:40.521015Z] [debug] [pid:18006] Configuring storage controllers and block devices
[2023-12-13T07:43:40.521653Z] [debug] [pid:18006] running `/usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2`
[2023-12-13T07:43:40.631962Z] [info] [pid:18006] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
qemu-img: Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': No such file or directory
[2023-12-13T07:43:40.632507Z] [info] [pid:18006] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-12-13T07:43:40.632988Z] [debug] [pid:18006] Passing remaining frames to the video encoder
[2023-12-13T07:43:41.589400Z] [debug] [pid:18006] Waiting for video encoder to finalize the video
[2023-12-13T07:43:41.589502Z] [debug] [pid:18006] The built-in video encoder (pid 18047) terminated
[2023-12-13T07:43:41.590353Z] [debug] [pid:18006] sending magic and exit
[2023-12-13T07:43:41.590705Z] [debug] [pid:17761] received magic close
[2023-12-13T07:43:41.593864Z] [debug] [pid:17761] backend process exited: 0
[2023-12-13T07:43:41.694734Z] [warn] [pid:17761] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
backend::driver::start_vm(backend::driver=HASH(0x1003bbbab58)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 100
OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x1003422b1e0)) called at /usr/bin/isotovideo line 134
eval {...} called at /usr/bin/isotovideo line 122
If you look at the cache service log you can observe that first 1.2 GiB is needed, then effectively 11 GiB, and the new assets are too big for the amount of space gained (see the sketch after the log excerpt):
[info] [#28462] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#28462] Downloading "SLES-15-SP5-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/13038326/asset/hdd/SLES-15-SP5-ppc64le-Installtest.qcow2"
[info] [#28462] Cache size 49 GiB + needed 1.2 GiB exceeds limit of 50 GiB, purging least used assets
[info] [#28462] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-:31833:kernel-livepatch-SLE15-SP4_Update_6-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2" because we need space for new assets, reclaiming 2.9 GiB
[info] [#28462] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" is 1.2 GiB, with ETag ""4bf00000-60b0c04e34447""
[info] [#28462] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" successful (28 MiB/s), new cache size is 47 GiB
[...]
[info] [#28465] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
[info] [#28465] Downloading "SLE-15-SP5-Full-ppc64le-GM-Media1.iso" from "http://openqa.suse.de/tests/13038326/asset/iso/SLE-15-SP5-Full-ppc64le-GM-Media1.iso"
[info] [#28465] Cache size 49 GiB + needed 11 GiB exceeds limit of 50 GiB, purging least used assets
[...]
[info] [#28465] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" because we need space for new assets, reclaiming 1.2 GiB
[info] [#28465] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" is 11 GiB, with ETag ""2d1468800-5fc052ed06a80""
[info] [#28465] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" successful (24 MiB/s), new cache size is 49 GiB
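To make the accounting visible in the log excerpt concrete, here is a minimal sketch of the limit check (assumed behaviour for illustration, not the actual openQA cache service code):

```python
# Minimal sketch (assumed behaviour for illustration, NOT the actual openQA
# cache service code) of the limit check visible in the log: a download
# triggers purging whenever the current cache size plus the size of the new
# asset exceeds the configured limit. Whole assets are purged, so more space
# may be reclaimed than strictly needed (2.9 GiB for a ~0.2 GiB deficit above).
def gib_to_purge(cache_size_gib: float, needed_gib: float, limit_gib: float = 50.0) -> float:
    """Return how many GiB must be reclaimed before the new asset fits."""
    return max(0.0, cache_size_gib + needed_gib - limit_gib)

# The two downloads from the log: the 1.2 GiB qcow2 only needs a tiny purge,
# but the 11 GiB ISO forces roughly 10 GiB of purging, far more than the
# 1.2 GiB the qcow2 download had just added back to the cache.
print(gib_to_purge(49.0, 1.2))   # ~0.2 GiB to reclaim before the qcow2 download
print(gib_to_purge(49.0, 11.0))  # 10.0 GiB to reclaim before the ISO download
```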
The hard-linking approach didn't help here because the linking only happens after the purging. The worker then fell back to symlinking instead of failing already at the linking step (a sketch of this fallback follows below):
[2023-12-13T07:34:59.119198Z] [debug] [pid:96149] Symlinking asset because hardlink failed: Cannot create link from "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" to "/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2": Invalid cross-device link at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 182.
SLE-15-SP5-Full-ppc64le-GM-Media1.iso is 11 GiB, so SLES-15-SP5-ppc64le-Installtest.qcow2 gets deleted to free space before the job runs.
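The log message implies a hardlink-then-symlink fallback. The following sketch is illustrative only (it is not the code referenced in isotovideo.pm) and shows why a symlinked asset can disappear underneath a running job when the cache is cleaned up later:

```python
# Illustrative sketch (not the actual code in OpenQA::Worker::Engines::isotovideo)
# of the fallback the log message implies: try a hard link from the cache into
# the pool directory and, if that fails because cache and pool are on different
# file systems (EXDEV, "Invalid cross-device link"), fall back to a symlink.
# A hard-linked file survives a later cache purge; a symlink dangles as soon
# as the cache copy is deleted, which is what the job above ran into.
import errno
import os

def link_asset(cached_path: str, pool_path: str) -> str:
    """Link a cached asset into the pool, preferring a hard link."""
    try:
        os.link(cached_path, pool_path)
        return "hardlink"
    except OSError as error:
        if error.errno != errno.EXDEV:
            raise
        os.symlink(cached_path, pool_path)
        return "symlink"
```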
Acceptance criteria
- AC1: Jobs no longer end up incomplete because a required asset is purged later during the setup
- AC2: Ensure the asset cache is behaving as intended (so it really starts cleaning only from the oldest assets and does not clean more than it needs to, also in cases like this one)
Out of scope
Implementing a new feature in openQA, like linking assets directly after the download (before starting the download of the next asset), at the cost of possibly overallocating disk space
Suggestions
- Restarting alone does not help, see https://openqa.suse.de/tests/13060373. So we need to find other means of mitigation first and only restart affected jobs afterwards
- First of all, mitigate the issue, e.g. restart affected jobs, restart the cache service or whatever; use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger
- The cleanup makes space to get below the configured limit, but deletes the new assets as well
- Increase the asset cache size (temporarily?) so this is less likely to happen
- For AC2, check openQA code of the asset cache and related tests
- Check if enough space is available at the end of the cleanup and clean up more if needed (up to n times); see the sketch below
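A rough sketch of that last suggestion; the helpers free_space() and purge_least_recently_used() are hypothetical and do not exist in openQA:

```python
# Rough sketch of the "clean more stuff (up to n times)" suggestion with
# hypothetical helpers; free_space() and purge_least_recently_used() are
# made-up callables for illustration and do not exist in openQA.
def ensure_space(needed_gib, free_space, purge_least_recently_used, max_passes=3):
    """Re-check free space after each cleanup pass and retry up to max_passes times."""
    for _ in range(max_passes):
        if free_space() >= needed_gib:
            return True
        purge_least_recently_used(needed_gib - free_space())
    return free_space() >= needed_gib
```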
Updated by szarate about 1 year ago
@okurz I'm kinda missing a "bugs" category for this project, as this isn't a regression
Updated by okurz about 1 year ago
- Tags set to reactive work
- Category set to Regressions/Crashes
- Priority changed from Normal to Urgent
- Target version set to Ready
szarate wrote in #note-1:
@okurz I'm kinda missing a "bugs" category for this project, as this isn't a regression
We don't have "bugs" because that term is ambiguous. So we treat something as a "regression" if there was clearly a point in time when it was working fine. This helps us because we can then research in our git history where we broke things. Everything else is a "feature request" as in: we shouldn't invest time trying to find out when this broke but just treat it as a new area, which means adding according tests and implementing the behaviour so it works as expected by the user. In this case here, though, I do assume this is a regression – or at least an obvious "crash" – so I am adding the according category and also treating it as urgent.
To everybody, please help to fill the ticket description according to https://progress.opensuse.org/projects/openqav3/wiki#Defects
Updated by okurz about 1 year ago
- Related to action #121573: Asset/HDD goes missing while job is running added
Updated by livdywan about 1 year ago
- Subject changed from Files have been deleted from the cache while the job was running to Files have been deleted from the cache while the job was running size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 1 year ago
- Description updated (diff)
- Status changed from Workable to In Progress
Updated by mkittler about 1 year ago · Edited
It looks like only PowerPC jobs are affected, at least recently. (The most recent non-PowerPC job returned by select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id), reason, arch from jobs where result = 'incomplete' and reason like '%qemu-img%No such file%' and arch != 'ppc64le' order by id desc limit 20; is from 2023-09-05 10:11:32.)
So I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/698 to increase the asset cache size of our (still in use) PowerPC workers.
By the way, some of our other workers already have CACHELIMIT: 600, but that's not consistently set (likely depending on the actually available disk space).
EDIT: It looks like this helped, e.g. https://openqa.suse.de/tests/13061244.
Considering that select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id), reason, arch from jobs where result = 'incomplete' and reason like '%qemu-img%No such file%' and arch = 'ppc64le' and clone_id is null order by id desc limit 20; just revealed a single job ID, I also restarted that job: https://openqa.suse.de/tests/13062733
EDIT: The restarted job failed because its parallel sibling is missing an asset. However, that is a usual 404 error and not this issue.
Updated by mkittler about 1 year ago
I had a look at the code for AC2:
- We flag assets as pending before starting the download.
- We immediately update the last use of an asset once it has been downloaded and clear its "pending" flag.
- We really delete the least recently used assets first.
- We never delete pending assets and the asset we have just downloaded (but possibly assets we have downloaded before for the same job).
- The tests in t/25-cache.t look good.
So I would not assume that we have a problem with the cleanup algorithm.
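To illustrate those properties, here is a minimal model based on my own reading (assumptions for illustration, not the actual cache service implementation or its tests):

```python
# Illustrative model of the properties listed above (my own assumptions, not
# the actual cache service implementation nor t/25-cache.t): purging walks the
# assets from least to most recently used, skips pending assets and the asset
# that has just been downloaded, but an asset downloaded earlier for the same
# job is not protected anymore and can therefore be purged again.
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    size_gib: float
    last_use: int        # monotonically increasing "last used" counter
    pending: bool = False

def purge_candidates(assets, needed_gib, limit_gib, just_downloaded):
    """Return the asset names that would be purged to fit the new download."""
    purged = []
    total = sum(a.size_gib for a in assets)
    for asset in sorted(assets, key=lambda a: a.last_use):  # LRU first
        if total + needed_gib <= limit_gib:
            break
        if asset.pending or asset.name == just_downloaded:
            continue
        purged.append(asset.name)
        total -= asset.size_gib
    return purged
```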
Note that the cache on mania was filled up very fast after increasing it to 100 GiB. With so much traffic and assets that are 11 GiB big (more than 20 % of the total cache size before I increased it) I would say that the observed behavior makes sense without assuming we have any problem with the cleanup algorithm.
Let's see whether the 100 GiB are enough. We actually have 2 TiB free disk space left on mania (of 2.2 TiB in total) so I guess increasing the cache size further on mania specifically wouldn't be problematic at all.
Updated by openqa_review about 1 year ago
- Due date set to 2023-12-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 1 year ago
- Status changed from In Progress to Feedback
There have been no further failures since yesterday.
Updated by mkittler about 1 year ago
- Status changed from Feedback to Resolved
It still looks good, so AC1 is fulfilled, and for AC2 I checked our code/tests again in #152545#note-10.