action #80108

HDD images not available for aarch64 Tumbleweed (cleaned-up too early?)

Added by ggardet_arm 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2020-11-20
Due date:
2020-12-17
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

We have some incomplete jobs due to missing qcow2 images.

Checking https://openqa.opensuse.org/admin/assets I can find some HDD images from previous snapshots, such as hdd/opensuse-Tumbleweed-aarch64-20201114-textmode@aarch64.qcow2 whereas the same image for 20201119 is missing.

Steps to reproduce

TBC

Acceptance criteria

  • AC1: assets are only deleted if the corresponding assets from previous builds (or "older" assets) of comparable size have been deleted first

Suggestions

  • Get mentioned logs from aarch64.o.o
  • Look into logs, crosscheck with assets, e.g. in o3

Workaround

Retrigger image creation jobs

History

#1 Updated by ggardet_arm 2 months ago

I restarted the various create_hdd_* tests to create the qcow2 images again. I hope they will not be cleaned up too early again.

#2 Updated by coolo about 2 months ago

Indeed, at 11am CET the asset was removed for not fitting into job group 3. That looks more like a bug than an infrastructure problem, though.
I saved the affected log file as /root/openqa_gru.poo80108.xz for someone to pick it up.

#3 Updated by okurz about 2 months ago

  • Tags set to asset cleanup, o3, aarch64, incomplete, premature cleanup
  • Project changed from openQA Infrastructure to openQA Project
  • Description updated (diff)
  • Category set to Concrete Bugs
  • Status changed from New to Workable
  • Priority changed from Urgent to Normal
  • Target version set to Ready

ok, treating as bug :)

As you found a workaround and I am not aware of this issue elsewhere, I am lowering the priority to "Normal" but adding the ticket to our backlog to crosscheck the situation.

#4 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

#5 Updated by mkittler about 2 months ago

AC1: assets are only deleted if the corresponding assets from previous builds (or "older" assets) of comparable size have been deleted first

Currently, the "age" of an asset is determined by the age of the most recent job that has been using the asset. This job does not necessarily belong to the latest build. However, should we really change that behavior?
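To illustrate why this ordering can remove an asset of the current build, here is a minimal sketch of a greedy, size-limited selection ordered by the most recent job. It is not openQA's actual implementation; the `Asset` class, the `latest_job_id` field, the asset names, and all sizes are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    size: int           # bytes
    latest_job_id: int  # hypothetical: id of the most recent job using the asset


def pick_assets(assets, group_limit):
    """Return (kept, removed) asset names for one job group with a byte budget."""
    remaining = group_limit
    kept, removed = [], []
    # The asset used by the most recent job counts as "youngest" and goes first.
    for asset in sorted(assets, key=lambda a: a.latest_job_id, reverse=True):
        if asset.size <= remaining:
            remaining -= asset.size
            kept.append(asset.name)
        else:
            removed.append(asset.name)
    return kept, removed


# Hypothetical scenario: a job of the older build ran after the last job of
# the newer build, so the older asset is considered first and uses the budget.
assets = [
    Asset("new-build.qcow2", size=2_000_000_000, latest_job_id=150),
    Asset("old-build.qcow2", size=2_000_000_000, latest_job_id=200),
]
kept, removed = pick_assets(assets, group_limit=3_000_000_000)
print("kept:", kept)
print("removed:", removed)
```

Under these assumptions the asset of the newer build is the one that gets removed, simply because a job of the older build ran later.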


For the record, the names of the concerning assets are:

  • opensuse-Tumbleweed-aarch64-20201119-textmode@aarch64.qcow2
  • opensuse-Tumbleweed-aarch64-20201119-Tumbleweed-kde@aarch64.qcow2
  • opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64.qcow2

while e.g. opensuse-Tumbleweed-aarch64-20201114-textmode@aarch64.qcow2 survived the cleanup.


I moved the logs and changed permissions so one can download them with rsync openqa.opensuse.org:/space/logs/openqa_gru.poo80108.xz …. The relevant lines for one of the removed assets are:

grep -B 1 -A 1 'opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64.qcow2' openqa_gru.poo80108
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Asset hdd/opensuse-Tumbleweed-aarch64-20201119-xfce@aarch64-uefi-vars.qcow2 (330752) picked into group 3
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64.qcow2 (2252079104) fits into group 3 (461266876)
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64-uefi-vars.qcow2 (330752) fits into group 3 (461266876)
--
}
[2020-11-20T11:00:14.0829 UTC] [info] [pid:984] Removing asset hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64.qcow2 (belonging to job groups: 3)
[2020-11-20T11:00:14.0839 UTC] [info] [pid:984] GRU: removed /var/lib/openqa/share/factory/hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64.qcow2
[2020-11-20T11:00:14.0848 UTC] [info] [pid:984] Removing asset hdd/opensuse-Tumbleweed-aarch64-20201119-textmode@aarch64.qcow2 (belonging to job groups: 3)

And a few seconds before within the same cleanup task the asset from the previous build is indeed picked into a group:

[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset hdd/fixed/opensuse-15.2-aarch64-GM-kde@aarch64.qcow2 (2726100992) fits into group 3 (461597628)
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset hdd/opensuse-Tumbleweed-aarch64-20201119-xfce@aarch64-uefi-vars.qcow2 (330752) fits into group 3 (461597628)
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Asset hdd/opensuse-Tumbleweed-aarch64-20201119-xfce@aarch64-uefi-vars.qcow2 (330752) picked into group 3
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64.qcow2 (2252079104) fits into group 3 (461266876)
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64-uefi-vars.qcow2 (330752) fits into group 3 (461266876)
[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Asset hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64-uefi-vars.qcow2 (330752) picked into group 3
[2020-11-20T11:00:11.0452 UTC] [debug] [pid:984] Checking whether asset hdd/opensuse-Tumbleweed-aarch64-20201119-textmode@aarch64.qcow2 (960102400) fits into group 3 (460936124)

One would have expected that the asset for the current build is considered first. Either this is caused by a bug or there really is a newer job for the previous build than for the current build.
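To crosscheck a larger log, the "Checking whether asset ... fits into group ..." lines can be filtered for assets that exceed the remaining group size. The following is a rough sketch: the regular expression is derived only from the lines quoted above, and it assumes that the number in the last parentheses is the remaining size of the group in bytes (in the excerpt it drops from 461597628 to 461266876 after a 330752-byte asset is picked). The function name is made up for this example.

```python
import re

# Pattern derived from the log lines quoted in this comment (an assumption
# about the general log format, not taken from the openQA source code).
CHECK_RE = re.compile(
    r"Checking whether asset (?P<asset>\S+) \((?P<size>\d+)\) "
    r"fits into group (?P<group>\d+) \((?P<remaining>\d+)\)"
)


def assets_not_fitting(lines):
    """Return (asset, size, group, remaining) for checks where size > remaining."""
    result = []
    for line in lines:
        match = CHECK_RE.search(line)
        if match is None:
            continue
        size = int(match["size"])
        remaining = int(match["remaining"])
        if size > remaining:
            result.append((match["asset"], size, int(match["group"]), remaining))
    return result


# Two lines copied from the excerpt above.
sample = [
    "[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset "
    "hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64.qcow2 "
    "(2252079104) fits into group 3 (461266876)",
    "[2020-11-20T11:00:11.0451 UTC] [debug] [pid:984] Checking whether asset "
    "hdd/opensuse-Tumbleweed-aarch64-20201119-gnome-wayland@aarch64-uefi-vars.qcow2 "
    "(330752) fits into group 3 (461266876)",
]
for asset, size, group, remaining in assets_not_fitting(sample):
    print(f"{asset}: {size} bytes > {remaining} bytes left in group {group}")
```

Running it over the decompressed log, e.g. with `assets_not_fitting(open("openqa_gru.poo80108"))`, would list every asset that was too large at the time it was checked.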

#6 Updated by mkittler about 2 months ago

It would have been useful if someone had checked what the latest job of opensuse-Tumbleweed-aarch64-20201114-textmode@aarch64.qcow2 was. By the way, it is possible to save the whole "asset status" via curl https://openqa.opensuse.org/admin/assets/status > asset_status_backup.json. The status from the time when the asset from the older build was still present and the asset from the newer build had already been cleaned up would have been useful.
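To have such a status available the next time this happens, one could save timestamped copies of the asset status at regular intervals. The sketch below only covers the saving part and makes no assumption about the structure of the JSON. The helper names, the file naming scheme, and the injectable `fetch` parameter are invented for this example; only the URL comes from the curl command above.

```python
import datetime
import pathlib
import urllib.request

STATUS_URL = "https://openqa.opensuse.org/admin/assets/status"


def snapshot_path(directory, now):
    """Build a file name such as asset_status_20201120T110000Z.json."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return pathlib.Path(directory) / f"asset_status_{stamp}.json"


def save_snapshot(directory, fetch=None, now=None):
    """Fetch the asset status and store it unmodified under a timestamped name."""
    if fetch is None:
        # Default: download from the URL used in the curl command.
        def fetch():
            with urllib.request.urlopen(STATUS_URL, timeout=60) as response:
                return response.read()
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    path = snapshot_path(directory, now)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return path
```

Calling `save_snapshot("asset_status_backups")` from a periodic job shortly before and after the cleanup would preserve the state needed to tell which job made the older asset count as "younger".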

#7 Updated by openqa_review about 2 months ago

  • Due date set to 2020-12-17

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by mkittler about 1 month ago

  • Status changed from Workable to New
  • Assignee deleted (mkittler)

I don't consider this ticket workable. It is not clear to me whether this is really a bug: the previous build might have had a more recent job at the time, and the asset status from that time hasn't been preserved, so this can no longer be verified. It is also not clear whether we should really adjust the behavior of the cleanup algorithm to make preserving the latest build the highest goal.

#9 Updated by okurz about 1 month ago

  • Status changed from New to Resolved
  • Assignee set to okurz

You already did a lot to investigate. I am also not sure whether anything is really not working as expected. The least I could do is bump the asset limit in openSUSE Tumbleweed AArch64 from 200G to 240G for o3, as we can spare that space right now.
