action #164988

closed

coordination #92323: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data

coordination #179888: [epic] Creating, tracking, accounting "supporting jobs"

Better accounting for openqa-investigation jobs size:S

Added by okurz 8 months ago. Updated about 4 hours ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2024-08-06
Due date:
% Done: 0%
Estimated time:

Description

Motivation

#164979 alerted us about /results being nearly full. We found that groupless jobs are now the biggest offender: heavy jobs fail often, and each failure also triggers heavy openqa-investigate jobs.

Acceptance criteria

  • AC1: Big investigation jobs will not fill up our disk space; we would instead just keep fewer of them.

Suggestions

  • Count investigation jobs towards the group of the original job
    • Investigation jobs are groupless to avoid being considered for the result of the corresponding group
    • It is probably also not wanted by users: investigation jobs should not shorten how long normal jobs are kept, but should still be kept for a short time themselves.
    • The way the cleanup algorithm currently works also makes this hard to implement. It goes through jobs group by group, and factoring in groupless jobs here without good relations in the database is not straightforward or efficient.
  • Use a dedicated group for all investigation jobs
    • Sounds most promising - just create a new group and schedule investigation jobs to be part of it.
    • There is a caveat: Having all investigation jobs in one group does not solve the problem that investigation jobs for a particular scenario become very big. If we put everything in one group, one scenario might cause other investigation jobs to be kept only for a very short time.
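
The first suggestion is essentially an accounting change: charge the disk usage of each investigation job to the group of the job that triggered it. A minimal sketch of that bookkeeping follows. It assumes a hypothetical in-memory job record with `group`, `origin_group` and `size_gb` fields; none of these names come from openQA's database schema, and the group and job names are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # All field names are hypothetical, chosen for illustration only.
    name: str
    size_gb: float
    group: Optional[str] = None         # None means the job is groupless
    origin_group: Optional[str] = None  # group of the job that triggered an investigation


def usage_per_group(jobs):
    """Sum result sizes per group, charging groupless investigation jobs to their origin group."""
    usage = defaultdict(float)
    for job in jobs:
        key = job.group or job.origin_group or "(groupless)"
        usage[key] += job.size_gb
    return dict(usage)


jobs = [
    Job("install", 2.0, group="Tumbleweed"),
    Job("install:investigate:retry", 2.0, origin_group="Tumbleweed"),
    Job("adhoc", 0.5),
]
print(usage_per_group(jobs))
```

As the suggestion notes, the hard part in practice is not the sum itself but having a reliable relation from an investigation job back to its originating group.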

Related issues 3 (2 open, 1 closed)

Related to openQA Project (public) - coordination #179221: [epic] Support keeping only jobs in database, remove all logs, assets, test results sooner - New - 2025-03-19

Copied from openQA Infrastructure (public) - action #164979: [alert][grafana] File systems alert for WebUI /results size:S - Resolved - mkittler - 2024-08-21

Copied to openQA Project (public) - action #179894: [spike][timeboxed:10h] Count assets+results of openqa-investigate jobs towards the originating group - New

Actions #1

Updated by okurz 8 months ago

  • Copied from action #164979: [alert][grafana] File systems alert for WebUI /results size:S added
Actions #2

Updated by okurz 8 months ago

  • Tags deleted (alert, infra)
  • Description updated (diff)
Actions #3

Updated by tinita 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by livdywan 7 months ago

  • Subject changed from Better accounting for openqa-investigation jobs to Better accounting for openqa-investigation jobs size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz 7 months ago

  • Priority changed from Normal to Low
Actions #6

Updated by okurz 6 months ago

  • Parent task set to #92323
Actions #7

Updated by okurz 4 months ago

  • Target version changed from Ready to Tools - Next
Actions #8

Updated by okurz about 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #9

Updated by ybonatakis 14 days ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #10

Updated by ybonatakis 14 days ago · Edited

A new Investigations group is created under Others for both OSD[0] and O3[1].
The settings keep the defaults, but those differ slightly between the two instances.

For instance, "Keep results for" is 21 days on OSD as opposed to 40 days on O3.

[0] https://openqa.suse.de/group_overview/637
[1] https://openqa.opensuse.org/group_overview/132

Actions #11

Updated by ybonatakis 14 days ago

https://github.com/os-autoinst/scripts/pull/381

struggling with the test. submitted only the change in the investigation script

Actions #12

Updated by ybonatakis 14 days ago

  • Status changed from In Progress to Feedback

ybonatakis wrote in #note-11:

https://github.com/os-autoinst/scripts/pull/381

struggling with the test. submitted only the change in the investigation script

I did add a test, but I decided to go with what works, as I couldn't make the new test case work when modifying the host explicitly. I guess there is no real request to O3, but it would be nice to understand why the test breaks in different ways with my other attempts to inject the host.

Actions #14

Updated by okurz 3 days ago

  • Status changed from Feedback to Workable

As you were confused during the daily please review this and let us know where you need feedback or help.

Actions #15

Updated by ybonatakis 2 days ago

  • Status changed from Workable to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1424.

But I am not sure where to go for O3. Is it something we set manually?

Actions #16

Updated by tinita 2 days ago

Yet another idea: How about a new "Investigation" parent group, which can have sub groups per original group. If there is an investigation subgroup defined for a group, investigation jobs go there, otherwise as a fallback they go into the main Investigation (sub) group.
E.g. OpenQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.

I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.
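
The lookup described in this proposal could be sketched roughly as follows. This is only an illustration of the fallback rule: the function name, the set of defined subgroups and the "Kernel / LTP" group are assumptions, and the naming scheme is inferred from the two examples given above.

```python
# Fallback group name taken from the proposal above.
FALLBACK = "Investigation / Misc"


def investigation_group(original_group, defined_subgroups):
    """Return the investigation subgroup for a job group, or the fallback if none is defined."""
    # Naming scheme inferred from the examples: "A / B" becomes "Investigation - A - B".
    candidate = "Investigation - " + original_group.replace(" / ", " - ")
    return candidate if candidate in defined_subgroups else FALLBACK


# Only the "big" groups get a dedicated subgroup; everything else falls back.
defined = {"Investigation - openQA", "Investigation - Development - Agama Devel"}
print(investigation_group("Development / Agama Devel", defined))
print(investigation_group("openQA", defined))
print(investigation_group("Kernel / LTP", defined))  # hypothetical group without a subgroup
```

With this shape, adding a dedicated subgroup for another heavy group only means creating the group and adding its name to the set; no per-group mapping has to be maintained for the small ones.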

Actions #17

Updated by mkittler 1 day ago

Sounds like a good idea.

Actions #18

Updated by ybonatakis 1 day ago

tinita wrote in #note-16:

Yet another idea: How about a new "Investigation" parent group, which can have sub groups per original group. If there is an investigation subgroup defined for a group, investigation jobs go there, otherwise as a fallback they go into the main Investigation (sub) group.
E.g. OpenQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.

I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.

I kind of liked the idea on first read, but we may not need to go that far. The main concern is to keep investigation jobs from consuming disk space with logs and results, right?
If we have different groups we have to keep track of their settings. I would say it is unnecessary and doesn't give us much benefit in the end.

Actions #20

Updated by okurz about 13 hours ago

  • Related to coordination #179221: [epic] Support keeping only jobs in database, remove all logs, assets, test results sooner added
Actions #21

Updated by ybonatakis about 12 hours ago · Edited

settings on OSD adjusted

Actions #22

Updated by ybonatakis about 10 hours ago

  • Status changed from Feedback to Resolved

Also on O3:

ariel:/home/ybonatakis # grep investigation_gid -rn /etc/openqa/openqa.ini
313:job_done_hook_failed = env from_email=o3-admins@suse.de scheme=http enable_force_result=true email_unreviewed=true investigation_gid=132 exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
315:job_done_hook = env scheme=http enable_force_result=true email_unreviewed=true investigation_gid=132 exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook

and group settings adjusted as:
Keep logs for 5 days
Keep important logs for 15 days
Keep results for 25 days
Keep important results for 0 (default-no change) days

Please check if anything does not look as expected. I already see jobs running on OSD, so I am going to resolve this ticket and mark the jobs with its issue number.
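
The hook lines in the grep output above pass an `exclude_group_regex` next to `investigation_gid`. To see which group names that pattern covers, here is a small sketch. It assumes the pattern is applied as an unanchored search against a "parent group / group" string; how the hook script actually evaluates it, and what it does on a match, is not shown in this ticket. Python's `re.search` stands in for whatever the script uses, and "openSUSE Tumbleweed" is just an example of a name without a parent group.

```python
import re

# Pattern copied from the job_done_hook lines in openqa.ini on O3.
EXCLUDE_GROUP_REGEX = r"(Development|Open Build Service|Others|Kernel).*/.*"


def is_excluded(group_name):
    """True if the group name matches the exclusion pattern anywhere (unanchored search)."""
    return re.search(EXCLUDE_GROUP_REGEX, group_name) is not None


for name in ["Development / Agama Devel", "Others / Investigations", "openSUSE Tumbleweed"]:
    print(name, "->", "excluded" if is_excluded(name) else "not excluded")
```

Under these assumptions a group under "Others", such as the new Investigations group, would itself match the pattern, while a top-level group without a "/" in its name would not.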

Actions #23

Updated by okurz about 8 hours ago

  • Parent task changed from #92323 to #179888
Actions #24

Updated by okurz about 8 hours ago

  • Copied to action #179894: [spike][timeboxed:10h] Count assets+results of openqa-investigate jobs towards the originating group added
Actions #25

Updated by ybonatakis about 4 hours ago

3h ago I restarted openqa-webui, and someone restarted openqa-gru 1.5h ago on O3. There are some jobs running in https://openqa.opensuse.org/group_overview/132 but I still see investigation jobs which are not in the group. Will keep tracking this.
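
Spotting investigation jobs that ended up outside the group could be scripted. The sketch below works on job records that have already been fetched by some other means. It assumes each record is a dict with `name` and `group_id` keys and that investigation jobs can be recognised by `:investigate:` in their name; both are assumptions, and the sample records are invented.

```python
INVESTIGATION_GID = 132  # group ID from the O3 hook configuration in this ticket


def stray_investigation_jobs(jobs, gid=INVESTIGATION_GID, marker=":investigate:"):
    """Return jobs whose name contains the marker but whose group_id differs from gid."""
    return [j for j in jobs if marker in j["name"] and j.get("group_id") != gid]


# Invented sample records, not real O3 data.
sample = [
    {"id": 1, "name": "foo:investigate:retry", "group_id": 132},
    {"id": 2, "name": "bar:investigate:last_good_build", "group_id": None},
    {"id": 3, "name": "baz", "group_id": 1},
]
print([j["id"] for j in stray_investigation_jobs(sample)])
```

Jobs scheduled before the services were restarted would be expected to show up in such a list, so comparing their creation time with the restart time is a useful next check.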
