action #176763

closed

[alert] Flaky broken workers alert size:S

Added by mkittler about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-02-07
Due date:
% Done: 0%
Estimated time:

Description

Observation

The alert fired three times, each time around midnight: from 2025-02-05T22:46Z to 2025-02-05T22:51Z, from 2025-02-06T22:56Z to 2025-02-06T23:26Z, and from 2025-02-07T01:46Z to 2025-02-07T01:56Z.

We also saw this alert recently before these occurrences, see #175836.

See https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-02-05T11%3A02%3A48.013Z&to=2025-02-07T09%3A23%3A58.259Z&timezone=utc&var-host_disks=%24__all for the monitoring data.

Rollback actions

  • Remove alert silence alertname=Broken workers alert

Related issues 1 (1 open, 0 closed)

Related to openQA Infrastructure (public) - action #163394: Consider extending our logging of broken workers in grafana (Better understand "Broken workers alert" retroactively) (New, 2024-07-05)

Actions #1

Updated by okurz about 2 months ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by mkittler about 2 months ago

The alert itself makes sense: it fires if one or more workers are broken continuously for 15 minutes, and that was the case here. Setting the grace period to 30 minutes would have helped to prevent these alerts. Should we just do that?
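To illustrate the effect of a longer grace period, here is a minimal sketch, assuming one sample of the broken-worker count per minute. The function name `alert_fires` and the sample data are hypothetical and are not taken from the actual Grafana alert rule.

```python
def alert_fires(broken_counts, pending_minutes):
    """Return True if at least one worker is broken for
    `pending_minutes` consecutive one-minute samples."""
    streak = 0
    for count in broken_counts:
        # Extend the streak while workers are broken, reset it otherwise.
        streak = streak + 1 if count >= 1 else 0
        if streak >= pending_minutes:
            return True
    return False


# Hypothetical data: one worker broken for 20 minutes, then recovered.
samples = [0] * 5 + [1] * 20 + [0] * 5

print(alert_fires(samples, 15))  # 15-minute grace period
print(alert_fires(samples, 30))  # 30-minute grace period
```

With this made-up series the 15-minute grace period triggers and the 30-minute one does not. Whether the same holds for the real incidents depends on how long the workers actually stayed broken.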

Actions #3

Updated by ybonatakis about 2 months ago

I silenced the alert for 14 days as it keeps coming back. Last event: http://monitor.qa.suse.de/goto/h3uAGBFHg?orgId=1

Actions #4

Updated by ybonatakis about 2 months ago

Many messages like the following were logged in that timeframe:
Feb 07 22:13:39 openqa openqa-websockets-daemon[17853]: [debug] [pid:17853] Worker 3054 rejected job(s) 16698642: The average load (16.76 22.09 30.50) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.

I checked the websockets service logs with journalctl -u openqa-websockets.service --since "2025-02-07 22:12:51" --until "2025-02-07 22:43:18"
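To count or compare such rejections across a longer journalctl dump, the message could be parsed as in the following sketch. The regular expression is derived only from the single log line quoted above, so other message variants are not covered. The helper name `parse_rejection` is hypothetical.

```python
import re

# Pattern based solely on the one log line quoted in this ticket.
REJECTION_RE = re.compile(
    r"Worker (?P<worker>\d+) rejected job\(s\) (?P<jobs>[\d, ]+): "
    r"The average load \((?P<loads>[\d. ]+)\) is exceeding "
    r"the configured threshold of (?P<threshold>\d+(?:\.\d+)?)\."
)


def parse_rejection(line):
    """Extract worker ID, job IDs, load averages and threshold,
    or return None if the line is not a load-based rejection."""
    match = REJECTION_RE.search(line)
    if match is None:
        return None
    return {
        "worker": int(match.group("worker")),
        "jobs": [int(j) for j in re.findall(r"\d+", match.group("jobs"))],
        "loads": [float(x) for x in match.group("loads").split()],
        "threshold": float(match.group("threshold")),
    }


line = (
    "Feb 07 22:13:39 openqa openqa-websockets-daemon[17853]: [debug] "
    "[pid:17853] Worker 3054 rejected job(s) 16698642: The average load "
    "(16.76 22.09 30.50) is exceeding the configured threshold of 16. "
    "The worker will temporarily not accept new jobs until the load is "
    "lower again."
)
info = parse_rejection(line)
print(info)
```

Feeding each journal line through this helper and grouping by `worker` would show which workers rejected jobs due to load in the given window.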

Actions #5

Updated by okurz about 2 months ago

  • Priority changed from High to Urgent
Actions #6

Updated by jbaier_cz about 2 months ago

  • Assignee set to jbaier_cz

I will investigate

Actions #7

Updated by jbaier_cz about 2 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

A mitigation is applied and the alert is OK at the moment, hence lowering the priority back to High.

Actions #8

Updated by jbaier_cz about 2 months ago

  • Status changed from New to In Progress
Actions #9

Updated by jbaier_cz about 2 months ago

  • Related to action #163394: Consider extending our logging of broken workers in grafana (Better understand "Broken workers alert" retroactively) added
Actions #10

Updated by jbaier_cz about 2 months ago

I tried to gather some more info, but failed to obtain anything useful. I followed the suggestion from #176763#note-2 and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1369. I feel that in this case I can't do much more and that #163394 could be helpful here.

Actions #11

Updated by jbaier_cz about 2 months ago

  • Status changed from In Progress to Feedback
Actions #12

Updated by livdywan about 2 months ago

  • Subject changed from [alert] Flaky broken workers alert to [alert] Flaky broken workers alert size:S

Briefly discussed in the estimations

Actions #13

Updated by okurz about 2 months ago

  • Due date set to 2025-02-27
Actions #15

Updated by jbaier_cz about 2 months ago

  • Status changed from Feedback to Blocked

Merged but not deployed, blocked by #177324

Actions #16

Updated by jbaier_cz about 2 months ago

  • Status changed from Blocked to In Progress
Actions #17

Updated by jbaier_cz about 2 months ago

I recreated the alert and prepared https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1382. That will hopefully provision the alert correctly this time.

Actions #18

Updated by jbaier_cz about 1 month ago

  • Due date deleted (2025-02-27)
  • Status changed from In Progress to Resolved

Merged and deployed, panel looks good and already shows some data.
