coordination #64881 (open)
Parent task: coordination #103941: [saga][epic] Scale up: Efficient, event-based handling of storage on new, clean instances

[epic] Reconsider triggering cleanup jobs

Added by mkittler over 4 years ago. Updated almost 3 years ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version: QA (public, currently private due to #173521) - future
Start date: 2021-08-31
Due date:
% Done: 38%
Estimated time: (Total: 0.00 h)

Description

Motivation

Currently the cleanup jobs are triggered time-based via a systemd timer. So far this ticket is just a collection of ideas on how we can improve this.

The time-based trigger might not be frequent enough, especially if the cleanup is aborted midway for some reason. E.g. on o3 we see that results/screenshots can pile up considerably, which makes the error rate even higher: as a consequence the cleanup jobs can take quite long (see https://progress.opensuse.org/issues/55922) and are possibly interrupted with the result "worker went away".

Triggering the cleanup blindly whenever new jobs are scheduled, as we did before, is not nice either: it creates tons of Minion jobs which immediately terminate because a cleanup job is already running, cluttering the dashboard.

Acceptance criteria

  • AC1: Idle instances of openQA, e.g. personal single-user developer instances, only trigger cleanup jobs when quota usage is likely to change, e.g. when new builds or jobs are scheduled or jobs complete
  • AC2: Cleanup jobs are only triggered when a useful effect is expected, e.g. not 100 times in a row shortly after each other

Suggestions

One possibility to solve this would be that the jobs delete themselves if they cannot acquire the lock. Another possibility would be to acquire the lock before creating the job; if that is not possible, no job is created at all (and if it is possible, the job needs to take over the already acquired lock).

Note that triggering the cleanup more frequently will not magically solve problems without adjusting the quotas for the result storage duration. Now that we keep track of the result size, we could additionally add a size-based threshold for results (maybe specify a max. total size and a percentage for each job group), as sketched below.


Subtasks 4 (3 open, 1 closed)

  • action #97763: Event-based cleanup jobs triggered based on quota size:M (Resolved, livdywan, 2021-08-31)
  • action #99258: openQA enables event-based cleanup out of the box (New, 2021-09-24)
  • action #101376: Use cleanup triggers on finished jobs by default (New, 2021-08-31)
  • action #101602: Research how to properly communicate these changes based on systemd files size:S (Workable)

Related issues 2 (1 open, 1 closed)

  • Related to openQA Project (public) - coordination #76984: [epic] Automatically remove assets+results based on available free space (New, 2021-01-21)
  • Related to openQA Infrastructure (public) - coordination #96974: [epic] Improve/reconsider thresholds for skipping cleanup (Resolved, okurz, 2021-09-20)
#1

Updated by okurz over 4 years ago

  • Category set to Feature requests
#2

Updated by okurz over 4 years ago

  • Target version set to Ready
#3

Updated by okurz over 4 years ago

  • Description updated (diff)
  • Status changed from New to Workable
#5

Updated by okurz over 4 years ago

We discussed this topic in the QA tools weekly meeting 2020-07-28. "Expiring jobs" would expire based purely on time, and since we want to look into event-based triggers they are not that helpful. But as kraih explained we can go another way: we can create Minion jobs that create locks which expire after a time that can be longer than the runtime of the actual Minion job. So what we should do:

  • Add back the trigger for the cleanup job
  • Use a configurable "dead-time" for locks
  • Optional, after that: based on configuration, call "df", compare the free space with a configurable limit and only trigger the cleanup job if the threshold is exceeded (see the sketch below)
#6

Updated by okurz about 4 years ago

  • Priority changed from Normal to Low
#7

Updated by okurz about 4 years ago

  • Related to coordination #76984: [epic] Automatically remove assets+results based on available free space added
#8

Updated by okurz about 4 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

I think we can look into #76984 first

#9

Updated by okurz over 3 years ago

  • Status changed from Blocked to Feedback

We have not yet done #76984 but I think that #88121 brought us further.

@mkittler I would very much appreciate your feedback on how you see this ticket after #88121. We can discuss here or also have a video chat.

#10

Updated by okurz over 3 years ago

  • Status changed from Feedback to Blocked

By default the cleanup systemd timers are started as a dependency of systemd/openqa-webui.service. Currently the space-aware cleanup is also triggered when the timers fire. We could instead trigger the space-aware cleanup when "jobs post" or "isos post" is called and no longer pull in the systemd timer by default (see the sketch at the end of this comment).

Waiting for #91782 which we see as related.

We also have the feature of keeping a minimum amount of space but have not enabled it in production yet.

#11

Updated by okurz over 3 years ago

  • Parent task set to #64746
#12

Updated by okurz over 3 years ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

blocker #91782 resolved

#13

Updated by livdywan over 3 years ago

  • Due date set to 2021-08-31
  • Status changed from New to Feedback
  • Assignee set to livdywan

Some thoughts from the planning poker:

  • We may want backups in place for tackling this (#94555)
  • Maybe we want event-based triggers instead of (systemd) timers?
  • We could have a workshop about this topic, covering the user story, personal setups and experience with production setups, e.g. assets being cleaned up too soon - see workshop topics

I'll take the ticket and look into a workshop slot with @mkittler as a resident expert

#14

Updated by livdywan over 3 years ago

  • Status changed from Feedback to Resolved

We could have a workshop about this topic, covering the user story, personal setups and experience with production setups, e.g. assets being cleaned up too soon - see workshop topics
I'll take the ticket and look into a workshop slot with mkittler as a resident expert

Workshop slot taken for this Friday.

#15

Updated by livdywan over 3 years ago

  • Status changed from Resolved to Feedback

Meh. Why does Redmine keep flipping around states?

#16

Updated by livdywan over 3 years ago

cdywan wrote:

We could have a workshop about this topic, covering the user story, personal setups and experience with production setups, e.g. assets being cleaned up too soon - see workshop topics
I'll take the ticket and look into a workshop slot with mkittler as a resident expert

Workshop slot taken for this Friday.

#17

Updated by livdywan over 3 years ago

  • Related to action #97304: Assets deleted even if there are still pending jobs size:M added
#18

Updated by mkittler over 3 years ago

  • Related to coordination #96974: [epic] Improve/reconsider thresholds for skipping cleanup added
#19

Updated by mkittler over 3 years ago

Note that #97304 is not really related except for the fact that it is about cleanup. It is a problem independent of triggering the cleanup, which is what this ticket is about. It was also not a direct outcome of the workshop and was not discussed there; I just looked at incompletes with the cleanup topic in mind.

Actually, #96974 is more related. It is basically about the very same problem as this ticket. I've created it to note down a few ideas I had.

#20

Updated by mkittler over 3 years ago

  • Related to deleted (action #97304: Assets deleted even if there are still pending jobs size:M)
#21

Updated by livdywan over 3 years ago

  • Subject changed from Reconsider triggering cleanup jobs to [epic] Reconsider triggering cleanup jobs
#22

Updated by okurz over 3 years ago

  • Tracker changed from action to coordination
#23

Updated by okurz about 3 years ago

@cdywan With the single subtask resolved, I think the next step could be a subtask about using your new feature switch as the default to cover AC1.

#24

Updated by livdywan about 3 years ago

  • Copied to action #101376: Use cleanup triggers on finished jobs by default added
#25

Updated by livdywan about 3 years ago

  • Status changed from Feedback to Blocked
#26

Updated by okurz almost 3 years ago

  • Parent task changed from #64746 to #103941
#27

Updated by okurz almost 3 years ago

  • Status changed from Blocked to New
  • Assignee deleted (livdywan)
  • Target version changed from Ready to future