coordination #64881

open

coordination #103941: [saga][epic] Scale up: Efficient, event-based handling of storage on new, clean instances

[epic] Reconsider triggering cleanup jobs

Added by mkittler over 4 years ago. Updated over 2 years ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2021-08-31
Due date:
% Done: 38%
Estimated time: (Total: 0.00 h)

Description

Motivation

Currently, the cleanup jobs are triggered on a time basis by a systemd timer. So far this ticket is just a collection of ideas on how we can improve this.

The time-based trigger might not be frequent enough, especially if the cleanup is aborted midway for some reason. E.g. on o3 we see that results/screenshots can pile up considerably, which makes the error rate even higher: as a consequence, the cleanup jobs can take quite long (see https://progress.opensuse.org/issues/55922) and are possibly interrupted with the result "worker went away".

Triggering the cleanup blindly whenever new jobs are scheduled, as we did before, is not nice either. It creates tons of Minion jobs which terminate right away because a cleanup job is already running, and which clutter the dashboard.

Acceptance criteria

  • AC1: Idle instances of openQA, e.g. personal single-user developer instances, only trigger cleanup jobs when quota usage is likely to change, e.g. when new builds or jobs are scheduled or jobs complete
  • AC2: Cleanup jobs are only triggered when a useful effect is expected, e.g. not 100 times in quick succession

Suggestions

One possibility would be for the jobs to delete themselves if they can't acquire the lock. Another possibility would be to acquire the lock before creating the job: if that's not possible, there will simply be no job, and if it is possible, the job needs to adopt (take over) the lock.
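The second suggestion (acquire the lock first, create the job only on success) can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not openQA or Minion code: `CleanupLock`, `enqueue_cleanup_if_idle`, the `"cleanup"` job name and the list used as a queue are all hypothetical stand-ins, and a real implementation would rely on the job queue's own locking.

```python
import threading
import time


class CleanupLock:
    """Hypothetical in-memory stand-in for a named lock with an expiry time."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._expires_at = float("-inf")  # not held initially

    def try_acquire(self, ttl_seconds, now=None):
        """Return True and take the lock if it is free or expired, else False."""
        now = time.monotonic() if now is None else now
        with self._mutex:
            if now < self._expires_at:
                return False  # somebody else still holds the lock
            self._expires_at = now + ttl_seconds
            return True

    def release(self):
        """Called by the cleanup job once it has finished."""
        with self._mutex:
            self._expires_at = float("-inf")


def enqueue_cleanup_if_idle(lock, queue, ttl_seconds=3600, now=None):
    """Create a cleanup job only if the lock could be acquired up front.

    When the lock is held, no job is created at all, so nothing
    clutters the queue or the dashboard.
    """
    if not lock.try_acquire(ttl_seconds, now=now):
        return False
    # The enqueued job now owns the lock and must release it when done.
    queue.append("cleanup")
    return True
```

With this shape, a burst of triggers while a cleanup is pending results in exactly one queued job; the expiry time guards against a lock that is never released because the job was interrupted.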

Note that triggering the cleanup more frequently will not magically solve problems unless the quotas for the result storage duration are adjusted as well. Now that we keep track of the result size, we could additionally add a size-based threshold for results (maybe specify a maximum total size and a percentage for each job group).
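One way to read the size-based threshold idea is sketched below in Python. It assumes that the per-group percentage is a share of the configured maximum total size, which is only one possible interpretation of the suggestion; the function name, parameter names and the plain dictionaries are hypothetical and do not reflect any existing openQA setting.

```python
def groups_over_quota(group_sizes, max_total_bytes, group_share_percent):
    """Return how many bytes each job group exceeds its allowed share by.

    group_sizes:          mapping of job group name -> bytes currently used
    max_total_bytes:      assumed global limit for all results
    group_share_percent:  mapping of job group name -> percentage of the
                          global limit that this group may use
    Groups without a configured share are treated as having a share of 0.
    """
    over = {}
    for group, used in group_sizes.items():
        share = group_share_percent.get(group, 0)
        allowed = max_total_bytes * share // 100
        if used > allowed:
            over[group] = used - allowed
    return over
```

A cleanup trigger could then be limited to the groups returned here, which would also fit AC2 since nothing is triggered while every group stays within its share.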


Subtasks 4 (3 open, 1 closed)

  • action #97763: Event-based cleanup jobs triggered based on quota size:M (Status: Resolved, Assignee: livdywan, Start date: 2021-08-31)
  • action #99258: openQA enables event-based cleanup out of the box (Status: New, Start date: 2021-09-24)
  • action #101376: Use cleanup triggers on finished jobs by default (Status: New, Start date: 2021-08-31)
  • action #101602: Research how to properly communicate these changes based on systemd files size:S (Status: Workable)

Related issues 2 (1 open, 1 closed)

  • Related to openQA Project - coordination #76984: [epic] Automatically remove assets+results based on available free space (Status: New, Start date: 2021-01-21)
  • Related to openQA Infrastructure - coordination #96974: [epic] Improve/reconsider thresholds for skipping cleanup (Status: Resolved, Assignee: okurz, Start date: 2021-09-20)
