action #164988
closed coordination #92323: [saga][epic] Scale up: Fine-grained control over use and removal of results, assets, test data
coordination #179888: [epic] Creating, tracking, accounting "supporting jobs"
Better accounting for openqa-investigation jobs size:S
0%
Description
Motivation
#164979 alerted us that /results was nearly full. We found that groupless jobs are now the biggest offender: heavy jobs fail often, and each failure also triggers heavy openqa-investigate jobs.
Acceptance criteria
- AC1: Big investigation jobs will not fill up our disk space; instead we simply keep fewer of them.
Suggestions
- Count investigation jobs towards the group of the original job
  - Investigation jobs are groupless so that they are not considered for the result of the corresponding group.
  - It is probably also not wanted by users: investigation jobs should not cause normal jobs to be stored for a shorter time, but should still be kept for a short time themselves.
  - The way the cleanup algorithm currently works also makes this hard to implement. It goes through jobs group by group, and factoring in groupless jobs here without good relations in the database is not straightforward or efficient.
- Use a dedicated group for all investigation jobs
  - Sounds most promising: just create a new group and schedule investigation jobs to be part of it.
  - There is a caveat: having all investigation jobs in one group does not solve the problem that investigation jobs for a particular scenario become very big. If we put everything in one group, one scenario might cause other investigation jobs to be stored only very briefly.
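The caveat about a single shared group can be illustrated with a toy model. This is not openQA's cleanup code: the `Job` class, the `cleanup_oldest_first` function, the sizes and the 30 GB budget are all made up for illustration, and the sketch assumes a shared size budget with oldest-first eviction.

```python
from dataclasses import dataclass


@dataclass
class Job:
    scenario: str    # hypothetical scenario label
    age_days: int    # how long ago the job finished
    size_gb: float   # space used by its results and logs


def cleanup_oldest_first(jobs, budget_gb):
    """Toy model: keep the newest jobs that fit into budget_gb, drop the rest."""
    kept, used = [], 0.0
    for job in sorted(jobs, key=lambda j: j.age_days):
        if used + job.size_gb > budget_gb:
            break  # everything older than this job is dropped as well
        kept.append(job)
        used += job.size_gb
    return kept


# One heavy and one light scenario, one investigation job per day each.
heavy = [Job("heavy", day, 10.0) for day in range(5)]
light = [Job("light", day, 1.0) for day in range(5)]

shared = cleanup_oldest_first(heavy + light, budget_gb=30.0)
alone = cleanup_oldest_first(light, budget_gb=30.0)

oldest_light_shared = max(j.age_days for j in shared if j.scenario == "light")
oldest_light_alone = max(j.age_days for j in alone)
```

In this model the light scenario keeps all five days of jobs when it has the budget to itself, but only the two newest days once it shares the group with the heavy scenario.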
Updated by okurz 8 months ago
- Copied from action #164979: [alert][grafana] File systems alert for WebUI /results size:S added
Updated by okurz about 2 months ago
- Target version changed from Tools - Next to Ready
Updated by ybonatakis 14 days ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by ybonatakis 14 days ago · Edited
A new Investigations group has been created under Others for both OSD[0] and O3[1].
The settings keep the defaults, which differ slightly between the two instances. For instance, on OSD "Keep results for" is 21 days, whereas on O3 it is 40.
[0] https://openqa.suse.de/group_overview/637
[1] https://openqa.opensuse.org/group_overview/132
Updated by ybonatakis 14 days ago
https://github.com/os-autoinst/scripts/pull/381
struggling with the test. submitted only the change in the investigation script
Updated by ybonatakis 14 days ago
- Status changed from In Progress to Feedback
ybonatakis wrote in #note-11:
https://github.com/os-autoinst/scripts/pull/381
struggling with the test. submitted only the change in the investigation script
I did add a test, but I decided to go with what works, as I couldn't get the new test case to work when modifying the host explicitly. I guess no real request is made to O3, but it would be nice to understand why the test breaks in different ways with my other attempts to inject the host.
Updated by ybonatakis 7 days ago
I guess this is still open due to https://github.com/os-autoinst/scripts/pull/381#discussion_r2007995030
Updated by ybonatakis 2 days ago
- Status changed from Workable to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1424
But I am not sure where to go for O3. Is it something we set manually?
Updated by tinita 2 days ago
Yet another idea: How about a new "Investigation" parent group, which can have sub groups per original group. If there is an investigation subgroup defined for a group, investigation jobs go there, otherwise as a fallback they go into the main Investigation (sub) group.
E.g. OpenQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.
I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.
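A minimal sketch of the fallback lookup described above, assuming the investigation subgroups are configured as a simple mapping from the original group name to a subgroup name. The mapping, the function name `investigation_group_for` and the way group names are spelled as keys are hypothetical; only the example group names are taken from this comment.

```python
# Hypothetical configuration: only "big" groups get their own subgroup.
INVESTIGATION_SUBGROUPS = {
    "openQA": "Investigation - openQA",
    "Development / Agama Devel": "Investigation - Development - Agama Devel",
}

# Fallback subgroup for every group without a dedicated entry.
FALLBACK_GROUP = "Investigation / Misc"


def investigation_group_for(original_group,
                            subgroups=INVESTIGATION_SUBGROUPS,
                            fallback=FALLBACK_GROUP):
    """Return the investigation subgroup for a job's original group."""
    return subgroups.get(original_group, fallback)
```

With this shape, adding a shorter retention for one noisy group only needs one new mapping entry plus the settings of that one subgroup; all other groups keep using the fallback.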
Updated by ybonatakis 1 day ago
tinita wrote in #note-16:
Yet another idea: How about a new "Investigation" parent group, which can have sub groups per original group. If there is an investigation subgroup defined for a group, investigation jobs go there, otherwise as a fallback they go into the main Investigation (sub) group.
E.g. OpenQA investigation jobs would go into "Investigation - openQA", "Development / Agama Devel" would go into "Investigation - Development - Agama Devel". Others go into "Investigation / Misc".
And then such individual investigation subgroups can be configured to keep results/logs for a shorter time.
This way we don't have to define an extra group for every group, just for the big ones.
I can't see another way of doing this automatically if we have such different cases where some groups create a lot of investigation jobs and others don't.
I kind of liked the idea on first read, but we may not need to go that far. The main concern is that investigation jobs should not consume disk space with logs and results, right?
If we have different groups, we have to keep track of their settings. I would say that is unnecessary and does not give us much benefit in the end.
Updated by okurz about 13 hours ago · Edited
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1424 merged. Please compare settings in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openqa-salt.ini?ref_type=heads#L57 vs. https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openqa-salt.ini?ref_type=heads#L64
Updated by okurz about 13 hours ago
- Related to coordination #179221: [epic] Support keeping only jobs in database, remove all logs, assets, test results sooner added
Updated by ybonatakis about 10 hours ago
- Status changed from Feedback to Resolved
Also on O3:
ariel:/home/ybonatakis # grep investigation_gid -rn /etc/openqa/openqa.ini
313:job_done_hook_failed = env from_email=o3-admins@suse.de scheme=http enable_force_result=true email_unreviewed=true investigation_gid=132 exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
315:job_done_hook = env scheme=http enable_force_result=true email_unreviewed=true investigation_gid=132 exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*' /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
and the group settings were adjusted to:
Keep logs for 5 days
Keep important logs for 15 days
Keep results for 25 days
Keep important results for 0 days (default, no change)
Please check if anything does not look as expected. I already see jobs running on OSD, so I am going to resolve this ticket and mark the jobs with its issue number.
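To make the `exclude_group_regex` value in the hook lines above easier to read, here is a small sketch of which group names it matches. It assumes the pattern is applied with search semantics to a "Parent / Group" style name; how the hook script actually applies it is not shown in this ticket. The helper `is_excluded` and every group name except "Development / Agama Devel" are made up for illustration.

```python
import re

# Pattern copied from the job_done_hook lines above.
EXCLUDE_GROUP_REGEX = r"(Development|Open Build Service|Others|Kernel).*/.*"


def is_excluded(group_name, pattern=EXCLUDE_GROUP_REGEX):
    """Return True if the group name matches the exclude pattern (assumed search semantics)."""
    return re.search(pattern, group_name) is not None


# Hypothetical group names, except "Development / Agama Devel".
examples = {
    name: is_excluded(name)
    for name in [
        "Development / Agama Devel",
        "Kernel / Some Kernel Group",
        "Kernel",
        "openSUSE Tumbleweed",
    ]
}
```

Under these assumptions the pattern only matches names that contain one of the listed prefixes followed by a slash, so a bare "Kernel" without a sub group part would not be excluded.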
Updated by okurz about 8 hours ago
- Copied to action #179894: [spike][timeboxed:10h] Count assets+results of openqa-investigate jobs towards the originating group added
Updated by ybonatakis about 4 hours ago
3h ago I restarted openqa-webui, and someone restarted openqa-gru 1.5h ago on O3. There are some jobs running in https://openqa.opensuse.org/group_overview/132 but I also see investigation jobs that are not in the group. I will keep tracking this.