action #182210
[Alert] Multiple broken workers size:S (open)
Description
Observation
The panel shows "broken workers" and no "limited workers". That is correct because right now the webUI is not running in limited mode, i.e. no workers are being refused connection. However we still need to check what made those worker instances end up as broken.
Acceptance criteria
- AC1: We are still receiving alerts if there are too many "broken" workers
- AC2: We are not receiving alerts if there are "limited" workers (as defined in the miniondb data) for just some hours
Suggestions
- Check the dashboard https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-05-11T20%3A46%3A25.633Z&to=2025-05-12T00%3A07%3A17.690Z&timezone=UTC&var-host_disks=%24__all&viewPanel=panel-96
- Check the journal on osd for the referenced timestamp to find out if those are actually "broken" workers needing handling or just "limited" workers where we just need to adapt monitoring+alerting
- e.g. the following, with a narrowed-down variant sketched after this list:
sudo journalctl -S '2025-05-11 20:46:25' -U '2025-05-12 00:07:17'
- See https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1374 related to #176763 and #163394 splitting "broken" and "limited". Maybe that's not working as intended … probably it's fine
- Consider what "limited" means in gitlab and change or remove it
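To answer the "broken or just limited" question directly from the journal, the output can be narrowed to the relevant worker status updates, for example like this (a sketch; -t filters on the syslog identifier visible in the log excerpts further down):
sudo journalctl -S '2025-05-11 20:46:25' -U '2025-05-12 00:07:17' -t openqa-websockets-daemon | grep -F 'worker_status (broken)'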
Rollback steps
- Remove silence with "Broken workers alert rule_uid=dZ025mf4z" from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana
Out of scope
- Making this distinction visible in the web UI e.g. https://openqa.suse.de/admin/workers
Updated by robert.richardson 22 days ago
- Subject changed from [Alert] Multiple broken workers to [Alert] Multiple broken workers size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan 19 days ago
I'm not sure I can commit to providing a fix right now, but keeping the SLOs in mind I'm trying to at least conduct an initial investigation based on the suggestions.
sudo journalctl -S '2025-05-11 20:46:25' -U '2025-05-12 00:07:17'
Could this be related?
May 11 21:03:10 openqa openqa-webui-daemon[3496]: [warn] [pid:3496] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3065: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335390-sle-15-SP5-Server-DVD-Incidents-Kernel-KOTD-aarch64-Build5.14.21-150500.37.1.gb680b98-ltp_fs@aarch64-virtio/details-gf16.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.
and there's a variation of it:
May 11 21:03:10 openqa openqa-webui-daemon[30459]: [warn] [pid:30459] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3080: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335533-sle-15-SP6-Server-DVD-Incidents-Kernel-KOTD-aarch64-Build6.4.0-150600.1097.1.gdcc1d06-ltp_openposix@aarch64-virtio/details-sigaction_17-17.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.
May 11 21:03:10 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3080 from worker_status (broken)
May 11 21:03:10 openqa openqa-webui-daemon[2759]: [debug] [pid:2759] GruTask 40777354 already gone, skip assigning jobs (message: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: insert or update on table "gru_dependencies" violates foreign key constraint "gru_dependencies_fk_gru_task_id"
May 11 21:03:10 openqa openqa-webui-daemon[2759]: DETAIL: Key (gru_task_id)=(40777354) is not present in table "gru_tasks". [for Statement "INSERT INTO gru_dependencies ( gru_task_id, job_id) VALUES ( ?, ? )" with ParamValues: 1='40777354', 2='17656478'] at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/Gru.pm line 160
May 11 21:03:10 openqa openqa-webui-daemon[2759]: )
May 11 21:03:10 openqa openqa-webui-daemon[2759]: [debug] [pid:2759] Job 17335539 duplicated as 17656478
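If more result files from that period are corrupted, a scan like the following could surface them (a sketch on my side, assuming jq is available on osd; the paths follow the layout of the warnings above):
# list details-*.json files written in the alert window that fail to parse
find /var/lib/openqa/testresults -name 'details-*.json' -newermt '2025-05-11 20:46' ! -newermt '2025-05-12 00:08' | while read -r f; do jq empty "$f" 2>/dev/null || echo "unparsable: $f"; done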
I also noticed this, which points at a worker entering an unavailable state:
May 11 21:05:40 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Worker 2628 rejected job(s) 17653831: The average load (26.17 26.54 25.46) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
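That rejection is the worker's own load check rather than a scheduler decision. Assuming the threshold comes from the CRITICAL_LOAD_AVG_THRESHOLD worker setting (my assumption, I haven't checked the config on that host), it would look like this:
# /etc/openqa/workers.ini (sketch; 25 matches the threshold worker 2628 reported)
[global]
CRITICAL_LOAD_AVG_THRESHOLD = 25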
And there are actually more concrete instances of a "broken" worker:
May 11 21:07:21 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3082 from worker_status (broken)
Updated by robert.richardson 16 days ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by openqa_review 15 days ago
- Due date set to 2025-06-03
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 14 days ago
- Due date deleted (2025-06-03)
- Status changed from In Progress to Resolved
We looked at the definitions of the telegraf and grafana rules again and found that everything is defined correctly. "limited" refers to "limited by the global job limit" and "broken" means "broken", which the webUI calls "unavailable", but excluding "system load exceeded" on worker hosts.
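For reference, the "limited" side can be counted with the same error-prefix convention that the query below uses to exclude it (a sketch, not the exact telegraf definition):
select count(id) as limited_workers from workers
where error like 'limited%'
and t_updated > (timezone('UTC', now()) - interval '1 hour');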
And there are actually more concrete instances of a "broken" worker:
May 11 21:07:21 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3082 from worker_status (broken)
We couldn't find those references anymore. I also ran the query
select count(id) as broken_workers from workers
where error is not null
and t_updated > (timezone('UTC', now()) - interval '1 hour')
and not error like 'graceful disconnect%'
and not error like 'limited%'
and not error like '%Cache service queue already full %'
and not error like '%average load%exceeding%';
and found 10 "broken workers" at one point, but after some minutes the query returned 0, so the 1h window had shifted enough to leave none. I suggest we don't do anything as we can't reproduce the original problem and have verified that monitoring and alerting itself works fine.
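To watch how the count drops out of the shifting 1h window, the query can be re-run periodically, e.g. (a minimal sketch, assuming local psql access to the "openqa" database as the geekotest user, both assumptions about the osd setup):
while true; do sudo -u geekotest psql openqa -tAc "select count(id) from workers where error is not null and t_updated > (timezone('UTC', now()) - interval '1 hour') and not error like 'graceful disconnect%' and not error like 'limited%' and not error like '%Cache service queue already full %' and not error like '%average load%exceeding%'"; sleep 300; done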
Updated by okurz 11 days ago
- Status changed from Resolved to Workable
Reopening as there is another alert from today, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?from=2025-05-23T19:28:15.413Z&orgId=1&to=2025-05-24T04:23:28.031Z&viewPanel=panel-96&timezone=UTC&var-host_disks=$__all