action #163394

open

Consider extending our logging of broken workers in grafana (Better understand "Broken workers alert" retroactively)

Added by nicksinger 22 days ago. Updated 21 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2024-07-05
Due date:
% Done: 0%
Estimated time:

Description

Observation

The "Broken workers alert" (https://stats.openqa-monitor.qa.suse.de/alerting/grafana/dZ025mf4z/view?orgId=1) is hard to understand retroactively because we only collect the total number of broken workers, not their names. https://openqa.suse.de/admin/workers only shows the current status of each worker and keeps no history.

Suggestions

  • Consider extending our metrics to also collect each worker (host and instance) together with its state. This would allow us to better understand past issues.
  • Include this information in alert messages if possible. Try not to introduce new alerts.
Actions #1

Updated by okurz 21 days ago

  • Tags set to infra, monitoring, grafana, telegraf, influxdb
  • Category set to Feature requests
  • Target version set to future

More information would indeed be helpful. However, that would also increase the storage space we need for monitoring data. Also, the most likely reason for more broken workers is the system load limit, which keeps workers in "broken" until the system load is under the configured limit again. I would rather improve that behavior in openQA itself.
