action #97304

Updated by mkittler over 2 years ago

### observation 

 I've recently observed multiple occurrences where the parent job (e.g. https://openqa.suse.de/tests/6859366) successfully creates an asset (e.g. `hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2`) but the chained children end up incomplete (e.g. https://openqa.suse.de/tests/6859372). They can no longer download the asset because it has already been cleaned up on the web UI host, as can be seen in the logs:

 ``` 
 [2021-08-19T18:10:34.0628 CEST] [debug] [pid:21356] Checking whether asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (2777677824) fits into group 306 (581430272) 
 [2021-08-19T18:15:59.0996 CEST] [debug] [pid:21356] { 
   assets    => [ 
 … 
                { 
                  fixed         => 0, 
                  groups        => { 306 => 6859366 }, 
                  id            => 27413793, 
                  max_job       => 6859366, 
                  name          => "hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2", 
                  parents       => { 8 => 1 }, 
                  pending       => 0, 
                  picked_into => 0, 
                  size          => 2777677824, 
                  t_created     => "2021-08-19 15:35:50", 
                  type          => "hdd", 
                }, 
 [2021-08-19T18:16:07.0773 CEST] [info] [pid:21356] Removing asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (belonging to job groups: 306 within parent job groups 8) 
 [2021-08-19T18:16:08.0067 CEST] [info] [pid:21356] GRU: removed /var/lib/openqa/share/factory/hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 
 ``` 

 So the asset was deleted at 2021-08-19T18:16:08 CEST, but the job using the asset was only started at 2021-08-19 21:51:43 CEST.
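
 To make the gap between cleanup and job start explicit, here is a minimal sketch that subtracts the two timestamps quoted above. The variable names are only for illustration; both values are taken as CEST, as stated.

 ```python
 from datetime import datetime

 # Timestamps quoted in this ticket (both CEST, so no timezone conversion is needed).
 asset_removed = datetime.fromisoformat("2021-08-19T18:16:08")
 child_started = datetime.fromisoformat("2021-08-19T21:51:43")

 # Time the child job started after its asset had already been removed.
 gap = child_started - asset_removed
 print(gap)  # 3:35:35
 ```

 So the child job only started more than three and a half hours after the asset was gone.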

 All jobs have the asset correctly listed in the job settings (`HDD_1=SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2` in the child and `PUBLISH_HDD_1` in the parent). 

 ### expected behavior 
 "Pending" assets are preserved: all assets associated with a job that is not `done` or `cancelled` are *not* subject to the asset cleanup.
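
 The rule can be sketched roughly as follows. This is not the code in openQA; `Job`, `Asset`, `is_pending` and `removable` are hypothetical names for illustration, and only the states `done` and `cancelled` are taken from the description above (`scheduled` is just an example of a non-final state).

 ```python
 from dataclasses import dataclass, field

 # Only these two states are named in the ticket; any other state counts as "not finished".
 FINAL_STATES = {"done", "cancelled"}


 @dataclass
 class Job:
     id: int
     state: str


 @dataclass
 class Asset:
     name: str
     jobs: list = field(default_factory=list)  # jobs associated with this asset


 def is_pending(asset):
     """An asset is pending while any associated job is neither done nor cancelled."""
     return any(job.state not in FINAL_STATES for job in asset.jobs)


 def removable(assets):
     """Only non-pending assets may be considered by the cleanup."""
     return [asset for asset in assets if not is_pending(asset)]


 parent = Job(6859366, "done")
 child = Job(6859372, "scheduled")  # example state for a job that has not run yet

 associated = Asset("hdd/example.qcow2", jobs=[parent, child])
 unassociated = Asset("hdd/example.qcow2", jobs=[parent])  # child association missing

 print(is_pending(associated))    # True
 print(is_pending(unassociated))  # False
 ```

 Note that the rule can only protect an asset if the waiting job is actually associated with it. In the second case the asset is not pending even though a child job still needs it.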

 ### further information 
 1. We already have code which implements the expected behavior (in `lib/OpenQA/Schema/ResultSet/Assets.pm`) and there are also unit tests (in `t/14-grutasks.t`) to verify that it works correctly. I had already extended those tests (in https://github.com/os-autoinst/openQA/commit/22185d2d8f126990e8e1e4b6543d88f6bbc947ac) because we saw the same problem before (see #64544) but couldn't do more at the time.
 2. It might be worth checking whether the implementation is correct, but given the previous point a bug there is unlikely. Possibly the jobs were never correctly associated with the assets (despite the job settings being correct)?
 3. For later investigation I've stored the database dump of OSD from that time on `storage.qa.suse.de:/storage/osd-archive/osd-dump-for-poo-97304-2021-08-19.dump`.
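
 One way to test the hypothesis from point 2 would be a query that lists jobs whose settings reference an asset without a matching job/asset association. The sketch below uses an in-memory SQLite database with a simplified, hypothetical schema (the table and column names are assumptions and not necessarily those of the real openQA database). The rows are made-up sample data that models the hypothesis; they are not taken from the dump.

 ```python
 import sqlite3

 # Shortened placeholder name; the real asset name is much longer.
 ASSET_NAME = "example-64bit.qcow2"

 conn = sqlite3.connect(":memory:")
 # Simplified, hypothetical schema for illustration only.
 conn.executescript("""
 CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT);
 CREATE TABLE assets (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
 CREATE TABLE jobs_assets (job_id INTEGER, asset_id INTEGER);
 CREATE TABLE job_settings (job_id INTEGER, setting_key TEXT, value TEXT);
 """)

 # Sample data: the parent is associated with the asset, the child is not.
 conn.executemany("INSERT INTO jobs VALUES (?, ?)",
                  [(6859366, "done"), (6859372, "scheduled")])
 conn.execute("INSERT INTO assets VALUES (?, ?, ?)", (27413793, "hdd", ASSET_NAME))
 conn.execute("INSERT INTO jobs_assets VALUES (?, ?)", (6859366, 27413793))
 conn.executemany("INSERT INTO job_settings VALUES (?, ?, ?)",
                  [(6859366, "PUBLISH_HDD_1", ASSET_NAME),
                   (6859372, "HDD_1", ASSET_NAME)])

 # Jobs whose settings name an asset but which lack a row in jobs_assets.
 missing = conn.execute("""
 SELECT s.job_id, s.setting_key, a.id
 FROM job_settings s
 JOIN assets a ON a.name = s.value
 LEFT JOIN jobs_assets ja ON ja.job_id = s.job_id AND ja.asset_id = a.id
 WHERE ja.job_id IS NULL
 ORDER BY s.job_id
 """).fetchall()

 print(missing)  # [(6859372, 'HDD_1', 27413793)]
 ```

 If a comparable query against the archived dump returned the child jobs, that would point to the association step rather than to the cleanup logic itself.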
