action #152545
Files have been deleted from the cache while the job was running size:M (closed)
Description
Observation
Looking at https://openqa.suse.de/tests/13038326, the cache service seemed to be downloading a lot of assets. However, while the job was starting, the qcow2 had been removed from the pool (I guess by the cache cleanup):
[2023-12-13T07:43:40.520891Z] [debug] [pid:18006] Command `/usr/bin/chattr +C /var/lib/openqa/pool/2/raid` terminated with 1
/usr/bin/chattr: Operation not supported while reading flags on /var/lib/openqa/pool/2/raid
[2023-12-13T07:43:40.521015Z] [debug] [pid:18006] Configuring storage controllers and block devices
[2023-12-13T07:43:40.521653Z] [debug] [pid:18006] running `/usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2`
[2023-12-13T07:43:40.631962Z] [info] [pid:18006] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
qemu-img: Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': No such file or directory
[2023-12-13T07:43:40.632507Z] [info] [pid:18006] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-12-13T07:43:40.632988Z] [debug] [pid:18006] Passing remaining frames to the video encoder
[2023-12-13T07:43:41.589400Z] [debug] [pid:18006] Waiting for video encoder to finalize the video
[2023-12-13T07:43:41.589502Z] [debug] [pid:18006] The built-in video encoder (pid 18047) terminated
[2023-12-13T07:43:41.590353Z] [debug] [pid:18006] sending magic and exit
[2023-12-13T07:43:41.590705Z] [debug] [pid:17761] received magic close
[2023-12-13T07:43:41.593864Z] [debug] [pid:17761] backend process exited: 0
[2023-12-13T07:43:41.694734Z] [warn] [pid:17761] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
backend::driver::start_vm(backend::driver=HASH(0x1003bbbab58)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 100
OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x1003422b1e0)) called at /usr/bin/isotovideo line 134
eval {...} called at /usr/bin/isotovideo line 122
Looking at the cache service logs you can see that first 1.2 GiB is needed, then 11 GiB, and the new assets are too big for the amount of space gained:
[info] [#28462] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#28462] Downloading "SLES-15-SP5-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/13038326/asset/hdd/SLES-15-SP5-ppc64le-Installtest.qcow2"
[info] [#28462] Cache size 49 GiB + needed 1.2 GiB exceeds limit of 50 GiB, purging least used assets
[info] [#28462] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-:31833:kernel-livepatch-SLE15-SP4_Update_6-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2" because we need space for new assets, reclaiming 2.9 GiB
[info] [#28462] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" is 1.2 GiB, with ETag ""4bf00000-60b0c04e34447""
[info] [#28462] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" successful (28 MiB/s), new cache size is 47 GiB
[...]
[info] [#28465] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
[info] [#28465] Downloading "SLE-15-SP5-Full-ppc64le-GM-Media1.iso" from "http://openqa.suse.de/tests/13038326/asset/iso/SLE-15-SP5-Full-ppc64le-GM-Media1.iso"
[info] [#28465] Cache size 49 GiB + needed 11 GiB exceeds limit of 50 GiB, purging least used assets
[...]
[info] [#28465] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" because we need space for new assets, reclaiming 1.2 GiB
[info] [#28465] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" is 11 GiB, with ETag ""2d1468800-5fc052ed06a80""
[info] [#28465] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" successful (24 MiB/s), new cache size is 49 GiB
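The purge decision visible in the log lines above boils down to: if the current cache size plus the size of the new download exceeds the limit, delete least recently used assets until the download fits. A minimal sketch of that logic (not the actual openQA cache service code; function and field names are made up for illustration):

use strict;
use warnings;

# Minimal sketch: purge least recently used assets until the new download fits
# under the limit. Note that nothing here protects assets that an earlier
# download for the same job already fetched, which is exactly what happens in
# the log above.
sub make_room_for {
    my ($cache_size, $needed, $limit, @assets_oldest_first) = @_;
    for my $asset (@assets_oldest_first) {
        last if $cache_size + $needed <= $limit;
        unlink $asset->{path};          # "Purging ... reclaiming N GiB"
        $cache_size -= $asset->{size};
    }
    return $cache_size;
}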
The hard-linking approach didn't help because the linking only happens after the purging. The worker then fell back to symlinking instead of failing right away at the linking step:
[2023-12-13T07:34:59.119198Z] [debug] [pid:96149] Symlinking asset because hardlink failed: Cannot create link from "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" to "/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2": Invalid cross-device link at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 182.
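The fallback the worker performs here is essentially: try a hard link first and, if that fails (e.g. with "Invalid cross-device link" because cache and pool are on different filesystems), create a symlink instead. A rough sketch using Perl builtins (not the actual isotovideo.pm code):

use strict;
use warnings;

# Rough sketch of the fallback: a hard link keeps the data alive even if the
# cache entry is purged later, a symlink does not.
sub link_asset {
    my ($cached_file, $pool_file) = @_;
    return 1 if link $cached_file, $pool_file;    # preferred: shares the inode with the cache file
    warn "Symlinking asset because hardlink failed: $!\n";
    return symlink $cached_file, $pool_file;      # breaks if the cached file is deleted later
}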
SLE-15-SP5-Full-ppc64le-GM-Media1.iso is 11 GiB, and SLES-15-SP5-ppc64le-Installtest.qcow2 gets deleted to free space before the job runs.
Acceptance criteria
- AC1: Jobs no longer become incomplete because a required asset is purged later during the job setup
- AC2: Ensure the asset cache behaves as intended (it really starts cleaning from the oldest asset only and does not clean more than it needs to, also in cases like this one)
Out of scope
Implementing a new feature in openQA, such as linking assets directly after each download (before starting the download of the next asset), at the cost of possibly over-allocating space
Suggestions
- Restarting alone does not help, see https://openqa.suse.de/tests/13060373. So we need to find other means of mitigation first and only restart affected jobs afterwards
- First of all, mitigate the issue, e.g. restart affected jobs, restart the cache service or similar; use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger
- The cleanup makes space to get below the configured limit, but deletes the new assets as well
- Increase the asset cache size (temporarily?) so this is less likely to happen
- For AC2, check openQA code of the asset cache and related tests
- Check whether enough space is available at the end of the cleanup and, if not, clean up more assets (up to n times); see the sketch after this list
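The last suggestion could look roughly like the sketch below; the method names are hypothetical and only illustrate re-checking the available space and retrying the cleanup a bounded number of times:

use strict;
use warnings;

# Hypothetical retry loop around the cleanup (names invented for illustration):
# re-check the available space after each cleanup run, up to $max_attempts times.
sub ensure_enough_space {
    my ($cache, $needed, $max_attempts) = @_;
    for (1 .. $max_attempts) {
        return 1 if $cache->available_space >= $needed;    # hypothetical accessor
        $cache->purge_least_recently_used($needed);        # hypothetical cleanup call
    }
    return $cache->available_space >= $needed;
}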
Updated by okurz 10 months ago
- Tags set to reactive work
- Category set to Regressions/Crashes
- Priority changed from Normal to Urgent
- Target version set to Ready
szarate wrote in #note-1:
@okurz I'm kind of missing a "bugs" category for this project, as this isn't a regression
We don't have "bugs" because that term is ambiguous. So we treat something as "regression" if there was clearly a point in time when this was working fine. This would help us because then we can research in our git history where we broke things. Everything else is "feature request" as in: We shouldn't invest time trying to find out when this broke but just treat it as a new area which would mean add according tests and implement to make it work as expected by the user. In this case here though I do assume this is a regression – or at least an obvious "crash" - so adding according category and also treating it as urgent.
To everybody, please help to fill the ticket description according to https://progress.opensuse.org/projects/openqav3/wiki#Defects
Updated by okurz 10 months ago
- Related to action #121573: Asset/HDD goes missing while job is running added
Updated by mkittler 10 months ago · Edited
It looks like only PowerPC jobs are affected, at least recently. (The most recent non-PowerPC job returned by select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id), reason, arch from jobs where result = 'incomplete' and reason like '%qemu-img%No such file%' and arch != 'ppc64le' order by id desc limit 20; is from 2023-09-05 10:11:32.)
So I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/698 to increase the asset cache size of our (still in use) PowerPC workers.
By the way, some of our other workers already have CACHELIMIT: 600, but that is not set consistently (likely depending on the disk space actually available).
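For reference, the cache size is configured per worker host in workers.ini, in GiB; the snippet below only illustrates the relevant keys, the values are examples:

# /etc/openqa/workers.ini (excerpt, example values)
[global]
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 100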
EDIT: It looks like this helped, e.g. https://openqa.suse.de/tests/13061244.
Considering that select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id), reason, arch from jobs where result = 'incomplete' and reason like '%qemu-img%No such file%' and arch = 'ppc64le' and clone_id is null order by id desc limit 20; revealed just a single job, I also restarted that job: https://openqa.suse.de/tests/13062733
EDIT: The restarted job failed because its parallel sibling is missing an asset. However, that is a usual 404 error and not this issue.
Updated by mkittler 10 months ago
I had a look at the code for AC2:
- We flag assets as pending before starting the download.
- We immediately update the last use of an asset once it has been downloaded and clear its "pending" flag.
- We really delete the least recently used assets first.
- We never delete pending assets or the asset we have just downloaded (but possibly assets we have downloaded before for the same job).
- The tests in t/25-cache.t look good.
So I would not assume that we have a problem with the cleanup algorithm; a rough sketch of these properties follows below.
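As a rough illustration of the properties listed above (not the actual openQA cache service code), the selection of purge candidates behaves like this:

use strict;
use warnings;

# Sketch only: pending assets and the asset that was just downloaded are never
# purge candidates; everything else is considered least recently used first.
sub purge_candidates {
    my ($assets, $just_downloaded) = @_;
    my @candidates = grep { !$_->{pending} && $_->{path} ne $just_downloaded } @$assets;
    return sort { $a->{last_use} <=> $b->{last_use} } @candidates;
}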
Note that the cache on mania filled up very fast after increasing it to 100 GiB. With so much traffic and assets that are 11 GiB big (more than 20 % of the total cache size before I increased it), I would say that the observed behavior makes sense without assuming any problem with the cleanup algorithm.
Let's see whether 100 GiB is enough. We actually have 2 TiB of free disk space left on mania (of 2.2 TiB in total), so I guess increasing the cache size further on mania specifically wouldn't be problematic at all.
Updated by openqa_review 10 months ago
- Due date set to 2023-12-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 10 months ago
- Status changed from Feedback to Resolved
It still looks good, so AC1 is fulfilled, and for AC2 I checked our code/tests again in #152545#note-10.