Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)
See https://w3.nue.suse.com/~okurz/job_group_results_2020-10-30.png : there seems to be a very sudden increase in the job group "Maintenance: Test Repo/Maintenance: SLE 15 SP2 Updates". I wonder whether someone changed the result settings or whether many recent results have simply accumulated. I will just monitor for now :)
EDIT 2020-11-04: we have received an email alert from Grafana for /results.
- AC1: /results is way below the alarm threshold again to have headroom for some weeks at least
- Review the trend of individual job groups
- Reduce result retention periods after coordinating with job group stakeholders or owners
- Tags set to storage, results, osd, job group settings
- Subject changed from sudden increase in job group results for SLE 15 SP2 Incidents to Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)
- Description updated (diff)
- Status changed from Feedback to Workable
- Assignee deleted
- Priority changed from Normal to Urgent
By now there was an email alert message confirming a problem:
[osd-admins] [Alerting] File systems alert
From: Grafana <email@example.com>
To: firstname.lastname@example.org
Sender: osd-admins <email@example.com>
List-Id: <osd-admins.suse.de>
Date: 03/11/2020 20.20

[Alerting] File systems alert
results: Used Percentage 90.008
Checking right now, we are back to 85% after a sudden decrease visible in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&fullscreen&panelId=74&from=1604430390434&to=1604494503308 , likely just the regular results cleanup.
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&fullscreen&panelId=74&from=1601949566797&to=1604500482068 shows an increase over the last 30 days from 70% up to the point of the alert, when usage crossed 90%.
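As a rough cross-check of how much headroom that trend leaves, here is a minimal sketch that extrapolates the growth linearly. The linear model and the function name days_until_alert are assumptions made for illustration only; real usage moves in steps because of the periodic cleanup.

```shell
#!/bin/sh
# Hypothetical helper: estimate the days until the alert threshold is reached,
# assuming usage keeps growing linearly at the rate observed over a past period.
# usage: days_until_alert START_PCT END_PCT PERIOD_DAYS CURRENT_PCT THRESHOLD_PCT
days_until_alert() {
    awk -v s="$1" -v e="$2" -v d="$3" -v c="$4" -v t="$5" 'BEGIN {
        rate = (e - s) / d                      # percentage points per day
        if (rate <= 0) { print "inf"; exit }    # no growth, threshold never reached
        printf "%.1f\n", (t - c) / rate
    }'
}

# 70% -> 90% within 30 days, currently at 85%, alert at 90%
days_until_alert 70 90 30 85 90
```

With the numbers from this ticket the estimate is about a week, which is less than the "some weeks" asked for in AC1.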
- Status changed from Workable to In Progress
- Assignee set to okurz
Discussion is ongoing about what could have caused that; maybe it was a personal bug investigation action. I will see whether this leads to a conclusion and a resolution. Also, I prepared https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/396 so that we remove lengthy videos before we trigger any alerts. As the check first calls "df", the cost of calling it much more often is negligible, so I extended https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/396 to cover that as well.
- Status changed from In Progress to Feedback
Ran
[ "$(df --output=pcent /results/testresults | sed '1d;s/[^0-9]//g')" -ge 84 ] && time find /results/testresults -type f -iname '*.ogv' -mtime +28 -delete
explicitly now and we are down to 81%. Enough headroom until my MR is approved for sure :)
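For readability, the one-liner above can be split into two small functions. This is only a sketch of the same logic, not the content of the merge request; the function names used_percent and cleanup_videos are hypothetical, and it relies on GNU df (--output) and GNU find (-delete), as the original command does.

```shell
#!/bin/sh
# Sketch: delete old video files once a file system crosses a usage threshold.
# Function and variable names are made up for illustration.

# Print the used percentage of the file system holding $1 as a bare integer.
used_percent() {
    df --output=pcent "$1" | sed '1d;s/[^0-9]//g'
}

# usage: cleanup_videos DIR THRESHOLD_PCT MIN_AGE_DAYS
# Delete *.ogv files below DIR matching -mtime +MIN_AGE_DAYS,
# but only if the file system is at least THRESHOLD_PCT percent full.
cleanup_videos() {
    dir=$1 threshold=$2 min_age_days=$3
    if [ "$(used_percent "$dir")" -ge "$threshold" ]; then
        find "$dir" -type f -iname '*.ogv' -mtime "+$min_age_days" -delete
    fi
}
```

The call equivalent to the command above would be: cleanup_videos /results/testresults 84 28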