action #163394

open

Consider extending our logging of broken workers in grafana (Better understand "Broken workers alert" retroactively)

Added by nicksinger 8 months ago. Updated 18 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2024-07-05
Due date:
% Done: 0%
Estimated time:

Description

Observation

The "Broken workers alert" (https://stats.openqa-monitor.qa.suse.de/alerting/grafana/dZ025mf4z/view?orgId=1) is hard to understand retroactively because we only collect the total number of broken workers, not their names. https://openqa.suse.de/admin/workers shows only the current status of each worker and keeps no history.

Suggestions

  • Consider extending our metrics to also collect each worker (instance) together with its state. This would allow us to better understand past issues.
  • Include this information in alert messages if possible. Try not to introduce new alerts.
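
To make the two suggestions more concrete, here is a minimal sketch of what per-worker data could look like. It is not taken from our salt states or from any existing telegraf configuration. It assumes that the worker list is available as records with `host`, `instance` and `status` keys, and that the data would be written in InfluxDB line protocol. The measurement name `openqa_worker_state`, the function names and the sample workers are all hypothetical.

```python
def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    text = str(value)
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def escape_string_field(value):
    """Escape backslashes and double quotes inside a quoted string field."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def workers_to_lines(workers, measurement="openqa_worker_state", only_broken=False):
    """Return one line-protocol point per worker instance, without a timestamp.

    The measurement name and the record keys are assumptions for illustration.
    """
    lines = []
    for worker in workers:
        state = worker["status"]
        if only_broken and state != "broken":
            continue
        tags = "host={},instance={}".format(
            escape_tag(worker["host"]), escape_tag(worker["instance"])
        )
        fields = 'state="{}",broken={}i'.format(
            escape_string_field(state), 1 if state == "broken" else 0
        )
        lines.append("{},{} {}".format(measurement, tags, fields))
    return lines


def broken_worker_names(workers):
    """Return sorted "host:instance" names of all workers in state "broken"."""
    return sorted(
        "{}:{}".format(w["host"], w["instance"])
        for w in workers
        if w["status"] == "broken"
    )


# Hypothetical sample data, not taken from a real openQA instance.
sample = [
    {"host": "worker-a", "instance": 2, "status": "broken"},
    {"host": "worker-b", "instance": 1, "status": "idle"},
]
for line in workers_to_lines(sample):
    print(line)
print("Broken workers: " + ", ".join(broken_worker_names(sample)))
```

Passing `only_broken=True` would emit points only for workers that are currently broken, which might keep the amount of stored data small. The output of `broken_worker_names` is the kind of short list that could be added to the existing alert message instead of creating a new alert.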

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #176763: [alert] Flaky broken workers alert size:S (Resolved, jbaier_cz, 2025-02-07)

Actions #1

Updated by okurz 8 months ago

  • Tags set to infra, monitoring, grafana, telegraf, influxdb
  • Category set to Feature requests
  • Target version set to future

More information would indeed be helpful. However, that would also increase the storage space we need for monitoring data. Also, the reason for more broken workers is very likely the system load limit, which keeps workers in "broken" until the system load is under the configured limit again. I would rather improve that behavior in openQA itself.

Actions #2

Updated by jbaier_cz 22 days ago

  • Related to action #176763: [alert] Flaky broken workers alert size:S added

Actions #3

Updated by jbaier_cz 18 days ago

Maybe we can at least report the names of the broken workers? https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1374
