action #164988: Better accounting for openqa-investigation jobs size:S - openQA Project (public) - openSUSE Project Management Tool

action #164988

closed

coordination #92323: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data

coordination #179888: [epic] Creating, tracking, accounting "supporting jobs"

Better accounting for openqa-investigation jobs size:S

Added by okurz 8 months ago. Updated about 4 hours ago.

Status:

Resolved

Priority:

Low

Assignee:

ybonatakis

Category:

Feature requests

Target version:

Ready

Start date:

2024-08-06

Description

Motivation¶

#164979 alerted us that /results was nearly full. We found that groupless jobs are now the biggest offender: heavy jobs fail often, and each failure also triggers heavy openqa-investigate jobs.

Acceptance criteria¶

AC1: Big investigation jobs do not fill up our disk space; instead we simply keep fewer of them.

Suggestions¶

- Count investigation jobs towards the group of the original job
  - Investigation jobs are groupless so that they are not considered for the result of the corresponding group.
  - This is probably also not wanted by users: investigation jobs should not cause normal jobs to be stored for a shorter time, but should still be kept for a short time themselves.
  - The way the cleanup algorithm currently works also makes this hard to implement. It goes through jobs group by group, and factoring in groupless jobs without good relations in the database is not straightforward or efficient.
- Use a dedicated group for all investigation jobs
  - Sounds most promising: just create a new group and schedule investigation jobs to be part of it.
  - Caveat: having all investigation jobs in one group does not solve the problem that investigation jobs for a particular scenario become very big. If we put everything in one group, one scenario might cause other investigation jobs to be stored only very briefly.
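
The caveat about a shared group can be made concrete with a little arithmetic. The sketch below assumes a single time-based retention setting per group and a hypothetical disk budget; the function names, scenario names and numbers are illustrative and are not taken from openQA's cleanup code.

```python
def steady_state_usage_gib(daily_gib_by_scenario, keep_days):
    """Approximate disk usage of a group once it has been running for keep_days.

    Assumes every scenario produces a constant amount of results per day and
    that all jobs in the group share one retention period (an assumption).
    """
    return sum(daily_gib_by_scenario.values()) * keep_days


def max_keep_days(daily_gib_by_scenario, budget_gib):
    """Longest whole-day retention that keeps the group within budget_gib."""
    total_daily = sum(daily_gib_by_scenario.values())
    if total_daily == 0:
        return None  # nothing is produced, so the budget imposes no limit
    return budget_gib // total_daily


# Hypothetical numbers: two small scenarios and one heavy scenario, in GiB/day.
shared_group = {"small_a": 1, "small_b": 1, "heavy": 18}
without_heavy = {"small_a": 1, "small_b": 1}

print(max_keep_days(shared_group, budget_gib=100))   # 5
print(max_keep_days(without_heavy, budget_gib=100))  # 50
```

Under these assumed numbers, the single heavy scenario cuts the retention that fits the budget from 50 days to 5 days for every scenario in the shared group.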

Related issues 3 (2 open — 1 closed)

Updated by okurz 8 months ago

Copied from action #164979: [alert][grafana] File systems alert for WebUI /results size:S added

Updated by okurz 8 months ago

Tags deleted (~~alert, infra~~)
Description updated (diff)

Updated by tinita 7 months ago

Target version changed from Tools - Next to Ready

Updated by livdywan 7 months ago

Subject changed from Better accounting for openqa-investigation jobs to Better accounting for openqa-investigation jobs size:S
Description updated (diff)
Status changed from New to Workable

Updated by okurz 7 months ago

Priority changed from Normal to Low

Updated by okurz 6 months ago

Parent task set to #92323

Updated by okurz 4 months ago

Target version changed from Ready to Tools - Next

Updated by okurz about 2 months ago

Target version changed from Tools - Next to Ready

Updated by ybonatakis 14 days ago

Status changed from Workable to In Progress
Assignee set to ybonatakis

#10

Updated by ybonatakis 14 days ago · Edited

A new Investigations group has been created under Others on both OSD[0] and O3[1].
The settings keep the defaults, which differ slightly between the two instances.

For instance, "Keep results for" is 21 days on OSD but 40 days on O3.

[0] https://openqa.suse.de/group_overview/637
[1] https://openqa.opensuse.org/group_overview/132

#11

Updated by ybonatakis 14 days ago

https://github.com/os-autoinst/scripts/pull/381

Struggling with the test, so I submitted only the change to the investigation script.

#12

Updated by ybonatakis 14 days ago

Status changed from In Progress to Feedback

ybonatakis wrote in #note-11:

https://github.com/os-autoinst/scripts/pull/381

Struggling with the test, so I submitted only the change to the investigation script.

I did add a test, but I decided to go with what works, as I couldn't make the new test case work when modifying the host explicitly. I guess no real request is made to O3, but it would be nice to understand why the test breaks in different ways with my other attempts to inject the host.

#13

Updated by ybonatakis 7 days ago

I guess this is still open due to https://github.com/os-autoinst/scripts/pull/381#discussion_r2007995030

#14

Updated by okurz 3 days ago

Status changed from Feedback to Workable

As you were confused during the daily please review this and let us know where you need feedback or help.

#15

Updated by ybonatakis 2 days ago

Status changed from Workable to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1424

But I am not sure where to go for O3. Is it something we set manually?

#16

Updated by tinita 2 days ago

Yet another idea: How about a new "Investigation" parent group, which can have subgroups per original group? If there is an investigation subgroup defined for a group, investigation jobs go there; otherwise, as a fallback, they go into the main Investigation (sub)group.
E.g. OpenQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.

I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.
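
A minimal sketch of the fallback lookup described above, assuming the subgroup name is derived from the original group name by joining its path parts with " - ". The function name, the naming rule and the set of defined subgroups are hypothetical and only mirror the examples in this comment.

```python
FALLBACK_GROUP = "Investigation / Misc"

# Hypothetical set of investigation subgroups that were created explicitly,
# i.e. only for the groups that produce many or large investigation jobs.
DEFINED_SUBGROUPS = {
    "Investigation - openQA",
    "Investigation - Development - Agama Devel",
}


def investigation_group_for(original_group, defined=DEFINED_SUBGROUPS):
    """Return the investigation subgroup name for a job's original group."""
    if not original_group:
        return FALLBACK_GROUP  # groupless jobs use the fallback as well
    parts = [part.strip() for part in original_group.split("/")]
    candidate = "Investigation - " + " - ".join(parts)
    return candidate if candidate in defined else FALLBACK_GROUP


print(investigation_group_for("openQA"))
print(investigation_group_for("Development / Agama Devel"))
print(investigation_group_for("Kernel / LTP"))  # no subgroup defined: fallback
```

Only the explicitly defined subgroups need their own retention settings; every other group shares the settings of the fallback group.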

#17

Updated by mkittler 1 day ago

Sounds like a good idea.

#18

Updated by ybonatakis 1 day ago

tinita wrote in #note-16:

Yet another idea: How about a new "Investigation" parent group, which can have subgroups per original group? If there is an investigation subgroup defined for a group, investigation jobs go there; otherwise, as a fallback, they go into the main Investigation (sub)group.
E.g. OpenQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.

I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.

I kind of liked the idea on first read, but we may not need to go that far. The main concern is that investigation jobs should not consume disk space with logs and results, right?
If we have different groups, we have to keep track of their settings. I would say it is unnecessary and doesn't give us much benefit in the end.

#19

Updated by okurz about 13 hours ago · Edited

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1424 merged. Please compare settings in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openqa-salt.ini?ref_type=heads#L57 vs. https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openqa-salt.ini?ref_type=heads#L64

#20

Updated by okurz about 13 hours ago

Related to coordination #179221: [epic] Support keeping only jobs in database, remove all logs, assets, test results sooner added

#21

Updated by ybonatakis about 12 hours ago · Edited

Settings on OSD adjusted.

#22

Updated by ybonatakis about 10 hours ago

Status changed from Feedback to Resolved

Also on O3:

ariel:/home/ybonatakis # grep investigation_gid -rn /etc/openqa/openqa.ini
313:job_done_hook_failed = env from_email=o3-admins@suse.de scheme=http enable_force_result=true email_unreviewed=true investigation_gid=132 exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
315:job_done_hook = env scheme=http enable_force_result=true email_unreviewed=true investigation_gid=132 exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook

and the group settings were adjusted to:
Keep logs for 5 days
Keep important logs for 15 days
Keep results for 25 days
Keep important results for 0 days (default, no change)

Please check whether anything does not look as expected. I already see jobs running on OSD, so I am going to resolve this ticket and mark the jobs with its issue number.
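
The hook lines above pass investigation_gid through the environment. As a rough sketch of how a consumer could interpret that variable, assuming an unset or empty value means the investigation job stays groupless: the function name and the validation are hypothetical, and this is not the actual implementation in os-autoinst/scripts.

```python
import os


def investigation_group_id(environ=None):
    """Return the group ID for investigation jobs, or None to stay groupless."""
    env = os.environ if environ is None else environ
    raw = env.get("investigation_gid", "").strip()
    if not raw:
        return None  # assumption: unset or empty keeps the job groupless
    try:
        gid = int(raw)
    except ValueError:
        raise ValueError(f"investigation_gid is not an integer: {raw!r}") from None
    if gid < 0:
        raise ValueError(f"investigation_gid must not be negative: {gid}")
    return gid


print(investigation_group_id({"investigation_gid": "132"}))  # 132
print(investigation_group_id({}))                            # None
```

Rejecting malformed values early would make a configuration typo visible in the hook output rather than silently scheduling groupless jobs.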

#23

Updated by okurz about 8 hours ago

Parent task changed from #92323 to #179888

#24

Updated by okurz about 8 hours ago

Copied to action #179894: [spike][timeboxed:10h] Count assets+results of openqa-investigate jobs towards the originating group added

#25

Updated by ybonatakis about 4 hours ago

I restarted openqa-webui 3h ago, and someone restarted openqa-gru on O3 1.5h ago. There are some jobs running in https://openqa.opensuse.org/group_overview/132, but I also see investigation jobs that are not in the group. I will keep tracking this.
