Project

General

Profile

action #152545

Updated by mkittler 5 months ago

## Observation 

 Looking at https://openqa.suse.de/tests/13038326 the cache seemed to be downloading a lot of the stuff, however while it was starting, the qcow was removed from the pool (I guess the  

 ``` 
 [2023-12-13T07:43:40.520891Z] [debug] [pid:18006] Command `/usr/bin/chattr +C /var/lib/openqa/pool/2/raid` terminated with 1 
   /usr/bin/chattr: Operation not supported while reading flags on /var/lib/openqa/pool/2/raid 
 [2023-12-13T07:43:40.521015Z] [debug] [pid:18006] Configuring storage controllers and block devices 
 [2023-12-13T07:43:40.521653Z] [debug] [pid:18006] running `/usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2` 
 [2023-12-13T07:43:40.631962Z] [info] [pid:18006] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines: 
   qemu-img: Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': Could not open '/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2': No such file or directory 
 [2023-12-13T07:43:40.632507Z] [info] [pid:18006] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json 
 [2023-12-13T07:43:40.632988Z] [debug] [pid:18006] Passing remaining frames to the video encoder 
 [2023-12-13T07:43:41.589400Z] [debug] [pid:18006] Waiting for video encoder to finalize the video 
 [2023-12-13T07:43:41.589502Z] [debug] [pid:18006] The built-in video encoder (pid 18047) terminated 
 [2023-12-13T07:43:41.590353Z] [debug] [pid:18006] sending magic and exit 
 [2023-12-13T07:43:41.590705Z] [debug] [pid:17761] received magic close 
 [2023-12-13T07:43:41.593864Z] [debug] [pid:17761] backend process exited: 0 
 [2023-12-13T07:43:41.694734Z] [warn] [pid:17761] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104. 
           backend::driver::start_vm(backend::driver=HASH(0x1003bbbab58)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18 
           OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 100 
           OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x1003422b1e0)) called at /usr/bin/isotovideo line 134 
           eval {...} called at /usr/bin/isotovideo line 122 
 ``` 

 If you look at the logs you can observe that 1.2GB is needed, then effectively 11 GB is needed and the new assets are too big for the amount of space gained: 

 ``` 
 [info] [#28462] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB 
 [info] [#28462] Downloading "SLES-15-SP5-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/13038326/asset/hdd/SLES-15-SP5-ppc64le-Installtest.qcow2" 
 [info] [#28462] Cache size 49 GiB + needed 1.2 GiB exceeds limit of 50 GiB, purging least used assets 
 [info] [#28462] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-:31833:kernel-livepatch-SLE15-SP4_Update_6-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2" because we need space for new assets, reclaiming 2.9 GiB 
 [info] [#28462] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" is 1.2 GiB, with ETag ""4bf00000-60b0c04e34447"" 
 [info] [#28462] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" successful (28 MiB/s), new cache size is 47 GiB 
 [...] 
 [info] [#28465] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB 
 [info] [#28465] Downloading "SLE-15-SP5-Full-ppc64le-GM-Media1.iso" from "http://openqa.suse.de/tests/13038326/asset/iso/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" 
 [info] [#28465] Cache size 49 GiB + needed 11 GiB exceeds limit of 50 GiB, purging least used assets 
 [...] 
 [info] [#28465] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" because we need space for new assets, reclaiming 1.2 GiB 
 [info] [#28465] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" is 11 GiB, with ETag ""2d1468800-5fc052ed06a80"" 
 [info] [#28465] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP5-Full-ppc64le-GM-Media1.iso" successful (24 MiB/s), new cache size is 49 GiB 
 ``` 

 The hard-linking approach didn't work because it only happens after the purging. Then the worker fell back to symlinking instead of just failing already at the linking: 
 ``` 
     [2023-12-13T07:34:59.119198Z] [debug] [pid:96149] Symlinking asset because hardlink failed: Cannot create link from "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-ppc64le-Installtest.qcow2" to "/var/lib/openqa/pool/2/SLES-15-SP5-ppc64le-Installtest.qcow2": Invalid cross-device link at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 182. 
 ``` 

 SLE-15-SP5-Full-ppc64le-GM-Media1.iso is 11GB and SLES-15-SP5-ppc64le-Installtest gets deleted to free space before the job runs 

 ## Acceptance criteria 
 * **AC1**: Jobs are no longer incompleting because a required asset is purged later in the setup 
 * **AC2**: Ensure the asset cache is behaving as intended (so it really start cleaning only from the oldest asset and does not clean more than it needs to - also in cases like that) 

 ## Out of scope 
 Implementing a new feature in openQA, like linking directly after the download (before starting the download of the next asset) at the cost of possibly overallocating 

 ## Suggestions 
 * Restarting alone does not help, see see https://openqa.suse.de/tests/13060373. So we need to find other means of mitigation first and only restart affected jobs afterwards ~~At the very first mitigate the issue, e.g. restart affected jobs, restart the cache service or whatever, use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger~~ 
 * The cleanup makes space to get below the configured limit - but deletes the new assets as well 
 * Increase the asset cache size (temporarily?) so this is not happening anymore so likely 
 * For AC2, check openQA code of the asset cache and related tests 
 * Check if enough space is available at the end of the cleanup and clean more stuff (up to n times)

Back