action #182210: [Alert] Multiple broken workers size:S - openQA Project (public) - openSUSE Project Management Tool

action #182210

open

[Alert] Multiple broken workers size:S

Added by gpuliti 23 days ago. Updated 8 days ago.

Status:

Workable

Priority:

Normal

Assignee:

robert.richardson

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2025-05-12

Due date:

% Done:

Estimated time:

Tags:

alert, alerts, infra, reactive work

Description

Observation¶

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-05-11T20:46:25.633Z&to=2025-05-12T00:07:17.690Z&timezone=UTC&var-host_disks=$__all

The panel shows "broken workers" and no "limited workers". That is correct because right now the webUI is not running limited, i.e. no workers are being refused a connection. However, we still need to check what made those worker instances end up as broken.

Acceptance criteria¶

AC1: We are still receiving alerts if there are too many "broken" workers
AC2: We are not receiving alerts if there are "limited" (as defined in miniondb data) workers just for some hours

Suggestions¶

- Check the dashboard https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-05-11T20%3A46%3A25.633Z&to=2025-05-12T00%3A07%3A17.690Z&timezone=UTC&var-host_disks=%24__all&viewPanel=panel-96
- Check the journal on osd for the referenced timestamp to find out whether those are actually "broken" workers needing handling or just "limited" workers, in which case we only need to adapt monitoring and alerting
  - e.g. sudo journalctl -S '2025-05-11 20:46:25' -U '2025-05-12 00:07:17'
- See https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1374 related to #176763 and #163394 splitting "broken" and "limited". Maybe that's not working as intended … probably it's fine
- Consider revisiting what "limited" means in gitlab: change it or remove it
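
The journalctl time window in the suggestion above mirrors the from/to range of the dashboard link. As a minimal sketch (the helper name journalctl_window is made up for illustration, and it assumes the journal host's clock is in UTC like the dashboard range), the command can be derived from any such link:

```python
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def journalctl_window(dashboard_url: str) -> str:
    """Build a journalctl command covering the dashboard's from/to range.

    Hypothetical helper: assumes the URL carries ISO 8601 "from" and "to"
    parameters with millisecond precision and a trailing "Z" (UTC).
    """
    # parse_qs also undoes percent-encoding such as %3A for ":".
    query = parse_qs(urlparse(dashboard_url).query)

    def fmt(value: str) -> str:
        ts = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
        # Drop the milliseconds, matching the format used in this ticket.
        return ts.strftime("%Y-%m-%d %H:%M:%S")

    since = fmt(query["from"][0])
    until = fmt(query["to"][0])
    return f"sudo journalctl -S '{since}' -U '{until}'"


if __name__ == "__main__":
    url = (
        "https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1"
        "&from=2025-05-11T20%3A46%3A25.633Z&to=2025-05-12T00%3A07%3A17.690Z"
        "&timezone=UTC&var-host_disks=%24__all&viewPanel=panel-96"
    )
    print(journalctl_window(url))
```

If the host is not running in UTC, the timestamps would need converting to local time before being passed to journalctl.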

Rollback steps¶

Remove silence with "Broken workers alert rule_uid=dZ025mf4z" from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana

Out of scope¶

Making this distinction visible in the web UI e.g. https://openqa.suse.de/admin/workers

Updated by gpuliti 23 days ago

Tags changed from alerts to alerts, reactive work, infra, alert
Category set to Regressions/Crashes
Priority changed from Normal to High
Target version set to Ready

Updated by mkittler 23 days ago

I suppose this was about the panel here: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-05-11T20%3A46%3A25.633Z&to=2025-05-12T00%3A07%3A17.690Z&timezone=UTC&var-host_disks=%24__all&viewPanel=panel-96

Updated by gpuliti 23 days ago

I left the link to the whole dashboard in place because I think an overview of the whole situation might be handy.

Updated by robert.richardson 22 days ago

Subject changed from [Alert] Multiple broken workers to [Alert] Multiple broken workers size:S
Description updated (diff)
Status changed from New to Workable

Updated by livdywan 19 days ago

I'm not sure I can commit to providing a fix right now, but keeping the SLOs in mind, I'm trying to at least conduct an initial investigation based on the suggestions.

sudo journalctl -S '2025-05-11 20:46:25' -U '2025-05-12 00:07:17'

Could this be related?

May 11 21:03:10 openqa openqa-webui-daemon[3496]: [warn] [pid:3496] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3065: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335390-sle-15-SP5-Server-DVD-Incidents-Kernel-KOTD-aarch64-Build5.14.21-150500.37.1.gb680b98-ltp_fs@aarch64-virtio/details-gf16.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.

and there's a variation of it:

May 11 21:03:10 openqa openqa-webui-daemon[30459]: [warn] [pid:30459] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3080: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335533-sle-15-SP6-Server-DVD-Incidents-Kernel-KOTD-aarch64-Build6.4.0-150600.1097.1.gdcc1d06-ltp_openposix@aarch64-virtio/details-sigaction_17-17.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.
May 11 21:03:10 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3080 from worker_status (broken)
May 11 21:03:10 openqa openqa-webui-daemon[2759]: [debug] [pid:2759] GruTask 40777354 already gone, skip assigning jobs (message: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  insert or update on table "gru_dependencies" violates foreign key constraint "gru_dependencies_fk_gru_task_id"
May 11 21:03:10 openqa openqa-webui-daemon[2759]: DETAIL:  Key (gru_task_id)=(40777354) is not present in table "gru_tasks". [for Statement "INSERT INTO gru_dependencies ( gru_task_id, job_id) VALUES ( ?, ? )" with ParamValues: 1='40777354', 2='17656478'] at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/Gru.pm line 160
May 11 21:03:10 openqa openqa-webui-daemon[2759]: )
May 11 21:03:10 openqa openqa-webui-daemon[2759]: [debug] [pid:2759] Job 17335539 duplicated as 17656478

I also noticed this, which points at a worker entering an unavailable state:

May 11 21:05:40 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Worker 2628 rejected job(s) 17653831: The average load (26.17 26.54 25.46) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.

And actually more concrete instances of a "broken" worker:

May 11 21:07:21 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3082 from worker_status (broken)
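
To tell workers reported as "broken" apart from workers that merely rejected jobs because of load, the journal lines quoted above could be tallied per worker ID. This is only a rough sketch: the two regular expressions are derived from the three message shapes quoted in this comment and may not match other variants, and the function name summarize is made up.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this ticket (assumed stable).
BROKEN = re.compile(r"Updating seen of worker (\d+) from worker_status \(broken\)")
OVERLOADED = re.compile(
    r"Worker (\d+) rejected job\(s\).*average load .* is exceeding"
)

SAMPLE = [
    "May 11 21:03:10 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] "
    "Updating seen of worker 3080 from worker_status (broken)",
    "May 11 21:05:40 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] "
    "Worker 2628 rejected job(s) 17653831: The average load (26.17 26.54 25.46) "
    "is exceeding the configured threshold of 25. The worker will temporarily "
    "not accept new jobs until the load is lower again.",
    "May 11 21:07:21 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] "
    "Updating seen of worker 3082 from worker_status (broken)",
]


def summarize(lines):
    """Count 'broken' status updates and load-based rejections per worker ID."""
    broken, overloaded = Counter(), Counter()
    for line in lines:
        match = BROKEN.search(line)
        if match:
            broken[int(match.group(1))] += 1
            continue
        match = OVERLOADED.search(line)
        if match:
            overloaded[int(match.group(1))] += 1
    return broken, overloaded


if __name__ == "__main__":
    broken, overloaded = summarize(SAMPLE)
    print("broken:", dict(broken))
    print("load-rejected:", dict(overloaded))
```

On a real host the lines would come from the journalctl output for the alert window rather than from SAMPLE.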

Updated by robert.richardson 16 days ago

Status changed from Workable to In Progress
Assignee set to robert.richardson

Updated by openqa_review 15 days ago

Due date set to 2025-06-03

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz 14 days ago

Due date deleted (~~2025-06-03~~)
Status changed from In Progress to Resolved

We looked at the telegraf and grafana rule definitions again and found that everything is defined correctly. "limited" refers to "limited by global job limit", and "broken" means "broken", which we call "unavailable" in the webUI, but excluding "system load exceeded" on worker hosts.

> And actually more concrete instances of a "broken" worker:
>
> May 11 21:07:21 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3082 from worker_status (broken)

We couldn't find those references anymore. I also ran the query

select count(id) as broken_workers from workers where error is not null and t_updated > (timezone('UTC', now()) - interval '1 hour') and not error like 'graceful disconnect%' and not error like 'limited%' and not error like '%Cache service queue already full %' and not error like '%average load%exceeding%';

and found 10 "broken workers" at one point, but after some minutes the query returned 0, so the 1h window had shifted enough that none were left. I guess we shouldn't do anything, as we can't reproduce the original problem and have verified that monitoring and alerting itself works fine.
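
The exclusions in the query above can be restated as a small predicate, which may make it easier to check which worker error messages count towards "broken workers". This is an illustrative sketch only: the function name counts_as_broken is made up, the one-hour t_updated window is left out, and the example error strings other than the load message are invented for demonstration.

```python
import re


def counts_as_broken(error):
    """Mirror the error filters of the SQL query (time window not modelled)."""
    if error is None:
        return False  # error is not null
    if error.startswith("graceful disconnect"):
        return False  # not error like 'graceful disconnect%'
    if error.startswith("limited"):
        return False  # not error like 'limited%'
    if "Cache service queue already full " in error:
        return False  # not error like '%Cache service queue already full %'
    # not error like '%average load%exceeding%'; "%" also spans newlines.
    if re.search(r"average load.*exceeding", error, re.DOTALL):
        return False
    return True


if __name__ == "__main__":
    examples = [
        None,
        "The average load (26.17 26.54 25.46) is exceeding the configured threshold of 25.",
        "limited (hypothetical message text)",
        "some other error (hypothetical message text)",
    ]
    for error in examples:
        print(repr(error), "->", counts_as_broken(error))
```

Like the SQL LIKE operator, the checks are case-sensitive.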

Updated by okurz 11 days ago

Status changed from Resolved to Workable

Reopening as there is another alert from today, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?from=2025-05-23T19:28:15.413Z&orgId=1&to=2025-05-24T04:23:28.031Z&viewPanel=panel-96&timezone=UTC&var-host_disks=$__all

Updated by okurz 11 days ago

Description updated (diff)

I added a silence with a corresponding rollback step.

Updated by okurz 8 days ago

Priority changed from High to Normal
