action #164988

open

coordination #92323: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data

Better accounting for openqa-investigation jobs size:S

Added by okurz 8 months ago. Updated about 12 hours ago.

Status: Feedback
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2024-08-06
Due date:
% Done: 0%
Estimated time:

Description

Motivation

#164979 alerted us that /results was nearly full. We found that groupless jobs are now the biggest offender: heavy jobs that fail often also trigger heavy openqa-investigate jobs.

Acceptance criteria

  • AC1: Big investigation jobs do not fill up our disk space; instead we just keep fewer of them.

Suggestions

  • Count investigation jobs towards the group of the original job
    • Investigation jobs are groupless so that they are not considered for the result of the corresponding group.
    • Users probably do not want this either: investigation jobs should not shorten the retention of normal jobs, but they should still be kept for a short time.
    • The way the cleanup algorithm currently works also makes this hard to implement. It goes through jobs group by group, and factoring in groupless jobs here without good relations in the database is neither straightforward nor efficient.
  • Use a dedicated group for all investigation jobs
    • Sounds most promising: just create a new group and schedule investigation jobs to be part of it.
    • There is a caveat: having all investigation jobs in one group does not solve the problem that investigation jobs for a particular scenario become very big. If we put everything in one group, one scenario might cause other investigation jobs to be stored only very briefly.
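
The caveat of a single shared group can be illustrated with a toy model. This is only a sketch under assumptions: the `Job` class, the `keep_within_budget` function, the job sizes and the size budget are all hypothetical and do not reflect how the openQA cleanup is actually implemented. The model simply keeps the newest jobs of one group until a size budget is used up.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int      # higher id means newer job (assumption of this toy model)
    scenario: str
    size_gb: float


def keep_within_budget(jobs, budget_gb):
    """Keep the newest jobs of one group until the size budget is exhausted."""
    kept, used = [], 0.0
    for job in sorted(jobs, key=lambda j: j.job_id, reverse=True):
        if used + job.size_gb > budget_gb:
            break  # everything older than this job is dropped
        kept.append(job)
        used += job.size_gb
    return kept


# Hypothetical content of a single shared investigation group.
jobs = [
    Job(1, "small-a", 1.0),
    Job(2, "small-b", 1.0),
    Job(3, "heavy", 4.0),
    Job(4, "heavy", 4.0),
]

kept = keep_within_budget(jobs, budget_gb=8.0)
kept_scenarios = sorted({job.scenario for job in kept})
print(kept_scenarios)  # ['heavy']
```

In this model the two recent jobs of the heavy scenario use up the whole budget, so the investigation jobs of the small scenarios are removed right away.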

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #164979: [alert][grafana] File systems alert for WebUI /results size:S | Resolved | mkittler | 2024-08-21

Actions #1

Updated by okurz 8 months ago

  • Copied from action #164979: [alert][grafana] File systems alert for WebUI /results size:S added
Actions #2

Updated by okurz 8 months ago

  • Tags deleted (alert, infra)
  • Description updated (diff)
Actions #3

Updated by tinita 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by livdywan 7 months ago

  • Subject changed from Better accounting for openqa-investigation jobs to Better accounting for openqa-investigation jobs size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz 7 months ago

  • Priority changed from Normal to Low
Actions #6

Updated by okurz 6 months ago

  • Parent task set to #92323
Actions #7

Updated by okurz 4 months ago

  • Target version changed from Ready to Tools - Next
Actions #8

Updated by okurz about 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #9

Updated by ybonatakis 13 days ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #10

Updated by ybonatakis 13 days ago · Edited

A new Investigations group has been created under Others on both OSD[0] and O3[1].
The settings keep the defaults, which differ slightly between the two instances.

For instance, "Keep results for" is 21 days on OSD as opposed to 40 days on O3.

[0] https://openqa.suse.de/group_overview/637
[1] https://openqa.opensuse.org/group_overview/132

Actions #11

Updated by ybonatakis 12 days ago

https://github.com/os-autoinst/scripts/pull/381

Struggling with the test. Submitted only the change in the investigation script.

Actions #12

Updated by ybonatakis 12 days ago

  • Status changed from In Progress to Feedback

ybonatakis wrote in #note-11:

https://github.com/os-autoinst/scripts/pull/381

Struggling with the test. Submitted only the change in the investigation script.

I did add a test, but I decided to go with what works, as I couldn't make the new test case work when modifying the host explicitly. I guess no real request is made to O3, but it would be nice to understand why the test breaks in different ways with my other attempts to inject the host.

Actions #14

Updated by okurz 1 day ago

  • Status changed from Feedback to Workable

As you were confused during the daily, please review this and let us know where you need feedback or help.

Actions #15

Updated by ybonatakis 1 day ago

  • Status changed from Workable to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1424

But I am not sure where to go for O3. Is it something we set manually?

Actions #16

Updated by tinita about 23 hours ago

Yet another idea: How about a new "Investigation" parent group, which can have subgroups per original group? If an investigation subgroup is defined for a group, investigation jobs go there; otherwise, as a fallback, they go into the main Investigation (sub)group.
E.g. openQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.

I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.
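
The fallback lookup described above could be sketched roughly as follows. This is a minimal illustration, not the openqa-investigate implementation: the function name `investigation_group`, the way the subgroup name is derived from the original group name, and the "Maintenance" group are assumptions made for this example; only the group names quoted above come from this ticket.

```python
FALLBACK_GROUP = "Investigation / Misc"


def investigation_group(original_group, configured_subgroups):
    """Return the group an investigation job should be scheduled in.

    original_group: name of the group of the original job, or None if groupless.
    configured_subgroups: names of investigation subgroups that actually exist.
    """
    if not original_group:
        return FALLBACK_GROUP
    # Assumed naming scheme: "Development / Agama Devel"
    # becomes "Investigation - Development - Agama Devel".
    parts = [part.strip() for part in original_group.split("/")]
    candidate = "Investigation - " + " - ".join(parts)
    # Only the big groups get a dedicated subgroup; everything else falls back.
    return candidate if candidate in configured_subgroups else FALLBACK_GROUP


configured = {
    "Investigation - openQA",
    "Investigation - Development - Agama Devel",
}

print(investigation_group("openQA", configured))
print(investigation_group("Development / Agama Devel", configured))
print(investigation_group("Maintenance", configured))  # hypothetical group without a subgroup
```

With such a lookup, retention would only need to be tuned on the few subgroups that exist, while all remaining investigation jobs share the settings of the fallback group.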

Actions #17

Updated by mkittler about 12 hours ago

Sounds like a good idea.

Actions #18

Updated by ybonatakis about 12 hours ago

tinita wrote in #note-16:

Yet another idea: How about a new "Investigation" parent group, which can have subgroups per original group? If an investigation subgroup is defined for a group, investigation jobs go there; otherwise, as a fallback, they go into the main Investigation (sub)group.
E.g. openQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.

I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.

I kind of liked the idea on first read, but we may not need to go that far. The main concern is that investigation jobs should not consume disk space with logs and results, right?
If we have different groups, we have to keep track of their settings. I would say that is unnecessary and does not give us much benefit in the end.
