action #176763: [alert] Flaky broken workers alert size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #176763

closed

[alert] Flaky broken workers alert size:S

Added by mkittler 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee: jbaier_cz
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date: 2025-02-07
Due date:
% Done:
Estimated time:
Tags: alert, infra, reactive work

Description

Observation

The alert fired three times, each time around midnight: from 2025-02-05T22:46Z to 2025-02-05T22:51Z, from 2025-02-06T22:56Z to 2025-02-06T23:26Z, and from 2025-02-07T01:46Z to 2025-02-07T01:56Z.

We also saw this alert recently, before these occurrences; see #175836.

See https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-02-05T11%3A02%3A48.013Z&to=2025-02-07T09%3A23%3A58.259Z&timezone=utc&var-host_disks=%24__all for the monitoring data.

Rollback actions

Remove alert silence alertname=Broken workers alert

Related issues 1 (1 open — 0 closed)

Updated by okurz 4 months ago

Priority changed from Normal to High
Target version set to Ready

Updated by mkittler 4 months ago

The alert itself makes sense. It fires if one or more workers are consistently broken for 15 minutes, and that was the case here. Setting the grace period to 30 minutes would have helped to prevent the alerts. Should we just do that?
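
To illustrate the effect of a longer grace period, here is a minimal sketch in Python. It is not the Grafana alert rule itself. It assumes that each broken episode began exactly 15 minutes before the reported firing start and ended when the alert resolved; real evaluation intervals may shift these boundaries. The helper names `ts` and `firing_windows` are hypothetical.

```python
from datetime import datetime, timedelta


def ts(text):
    """Parse a UTC timestamp as written in the ticket, e.g. 2025-02-05T22:46Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%MZ")


def firing_windows(broken_episodes, grace):
    """Return the (start, end) windows in which the alert would fire.

    An episode only fires once it has lasted longer than the grace period.
    """
    windows = []
    for start, end in broken_episodes:
        fire_at = start + grace
        if fire_at < end:
            windows.append((fire_at, end))
    return windows


# Firing windows as reported in the ticket description.
OBSERVED = [
    (ts("2025-02-05T22:46Z"), ts("2025-02-05T22:51Z")),
    (ts("2025-02-06T22:56Z"), ts("2025-02-06T23:26Z")),
    (ts("2025-02-07T01:46Z"), ts("2025-02-07T01:56Z")),
]

# ASSUMPTION: each broken episode started exactly one grace period (15 min)
# before the alert began firing and ended when the alert resolved.
CURRENT_GRACE = timedelta(minutes=15)
EPISODES = [(start - CURRENT_GRACE, end) for start, end in OBSERVED]

if __name__ == "__main__":
    for minutes in (15, 30):
        windows = firing_windows(EPISODES, timedelta(minutes=minutes))
        print(f"grace {minutes} min: {len(windows)} firing window(s)")
```

Under these assumptions, a 30-minute grace period would suppress the two shorter episodes, while the longest one (2025-02-06) would still fire, only later.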

Updated by ybonatakis 4 months ago

I silenced the alert for 14 days as it keeps coming. Last event: http://monitor.qa.suse.de/goto/h3uAGBFHg?orgId=1

Updated by ybonatakis 4 months ago

There are many messages like the following within that timeframe:
Feb 07 22:13:39 openqa openqa-websockets-daemon[17853]: [debug] [pid:17853] Worker 3054 rejected job(s) 16698642: The average load (16.76 22.09 30.50) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.

I checked the websockets service log with journalctl -u openqa-websockets.service --since "2025-02-07 22:12:51" --until "2025-02-07 22:43:18"
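
To see which workers reject jobs most often, the journal output can be summarized with a short script. The following is a rough sketch, assuming that all rejection messages follow the format of the single log line quoted above; the regular expression and the helper names `parse_rejection` and `count_by_worker` are hypothetical and not part of openQA.

```python
import re
from collections import Counter

# ASSUMPTION: every rejection message matches the format of the one line
# quoted in this ticket; other message variants are simply skipped.
REJECT_RE = re.compile(
    r"Worker (?P<worker>\d+) rejected job\(s\) (?P<jobs>[^:]+): "
    r"The average load \((?P<load>[\d. ]+)\) is exceeding "
    r"the configured threshold of (?P<threshold>\d+(?:\.\d+)?)"
)


def parse_rejection(line):
    """Extract worker ID, load averages and threshold, or None if no match."""
    match = REJECT_RE.search(line)
    if match is None:
        return None
    return {
        "worker": int(match["worker"]),
        "jobs": match["jobs"].strip(),
        "load": tuple(float(value) for value in match["load"].split()),
        "threshold": float(match["threshold"]),
    }


def count_by_worker(lines):
    """Count rejection messages per worker ID."""
    counts = Counter()
    for line in lines:
        parsed = parse_rejection(line)
        if parsed is not None:
            counts[parsed["worker"]] += 1
    return counts


# The log line quoted in the comment above.
SAMPLE = (
    "Feb 07 22:13:39 openqa openqa-websockets-daemon[17853]: [debug] "
    "[pid:17853] Worker 3054 rejected job(s) 16698642: The average load "
    "(16.76 22.09 30.50) is exceeding the configured threshold of 16. "
    "The worker will temporarily not accept new jobs until the load is "
    "lower again."
)

if __name__ == "__main__":
    print(parse_rejection(SAMPLE))
    print(count_by_worker([SAMPLE, "unrelated line"]))
```

The lines passed to `count_by_worker` could come from the journalctl output for the same time range, for example piped in via standard input.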

Updated by okurz 4 months ago

Priority changed from High to Urgent

Updated by jbaier_cz 4 months ago

Assignee set to jbaier_cz

I will investigate

Updated by jbaier_cz 4 months ago

Description updated (diff)
Priority changed from Urgent to High

A mitigation is applied and the alert is OK at the moment, hence lowering the priority back to High.

Updated by jbaier_cz 4 months ago

Status changed from New to In Progress

Updated by jbaier_cz 4 months ago

Related to action #163394: Consider extending our logging of broken workers in grafana (Better understand "Broken workers alert" retroactively) added

#10

Updated by jbaier_cz 4 months ago

I tried to gather some more info, but failed to obtain anything useful. I followed the suggestion from #176763#note-2 and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1369. I feel that in this case I can't do much more and that #163394 could be helpful here.

#11