action #163394
open: Consider extending our logging of broken workers in Grafana (Better understand "Broken workers alert" retroactively)
Start date: 2024-07-05
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
The "Broken workers alert" (https://stats.openqa-monitor.qa.suse.de/alerting/grafana/dZ025mf4z/view?orgId=1) is hard to understand retroactively because we only collect the total number of broken workers, not their names. https://openqa.suse.de/admin/workers shows only the current status of each worker and keeps no history.
Suggestions
- Consider extending our metrics to also collect each worker instance together with its state. This would let us investigate past issues in more detail.
- Include this information in alert messages if possible. Try not to introduce new alerts.
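To illustrate the first suggestion, here is a minimal sketch of how per-worker state could be turned into InfluxDB line protocol. It is not how our current collection works. The measurement name `openqa_worker_state` and the function names are hypothetical, and the input is assumed to be a list of worker records with `host`, `instance` and `status` keys, which is what I would expect the openQA workers API to provide but have not verified here.

```python
def escape_tag(value):
    """Escape commas, equals signs and spaces, which are special in tag values."""
    text = str(value)
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def escape_field_string(value):
    """Escape backslashes and double quotes inside a string field value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def workers_to_lines(workers, measurement="openqa_worker_state"):
    """Return one line-protocol line per worker.

    Assumption: each worker is a dict with "host", "instance" and "status".
    Host and instance become tags, the state is stored as fields.
    """
    lines = []
    for worker in workers:
        tags = "host={},instance={}".format(
            escape_tag(worker["host"]), escape_tag(worker["instance"])
        )
        broken = 1 if worker["status"] == "broken" else 0
        fields = 'broken={}i,status="{}"'.format(
            broken, escape_field_string(worker["status"])
        )
        # No timestamp is appended, so the receiver assigns the write time.
        lines.append("{},{} {}".format(measurement, tags, fields))
    return lines
```

Keeping the state in fields rather than tags means the number of series grows only with the number of worker instances, not with the number of states. Something like Telegraf's exec input with the influx data format could run a wrapper around this and print the lines, and a query summing `broken` grouped by `host` and `instance` would then show which workers were broken at a given time.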
Updated by okurz 5 months ago
- Tags set to infra, monitoring, grafana, telegraf, influxdb
- Category set to Feature requests
- Target version set to future
More information would indeed be helpful. However, that would also increase the space we need to store monitoring data. Also, the most likely reason for more broken workers is the system load limit, which keeps workers in "broken" until the system load is back under the configured limit. I would rather improve that behavior in openQA itself.