action #182210

open

[Alert] Multiple broken workers size:S

Added by gpuliti 23 days ago. Updated 8 days ago.

Status: Workable
Priority: Normal
Category: Regressions/Crashes
Target version:
Start date: 2025-05-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-05-11T20:46:25.633Z&to=2025-05-12T00:07:17.690Z&timezone=UTC&var-host_disks=$__all

The panel shows "broken workers" but no "limited workers". That is correct because the webUI is currently not running limited, i.e. no workers are being refused a connection. However, we still need to find out what caused those worker instances to end up as broken.

Acceptance criteria

  • AC1: We are still receiving alerts if there are too many "broken" workers
  • AC2: We are not receiving alerts if there are "limited" (as defined in miniondb data) workers just for some hours

Suggestions

Rollback steps

Out of scope

Actions #1

Updated by gpuliti 23 days ago

  • Tags changed from alerts to alerts, reactive work, infra, alert
  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #3

Updated by gpuliti 23 days ago

I left the whole dashboard because I think it might be handy to have an overview of the whole situation

Actions #4

Updated by robert.richardson 22 days ago

  • Subject changed from [Alert] Multiple broken workers to [Alert] Multiple broken workers size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by livdywan 19 days ago

I'm not sure I can commit to providing a fix right now, but keeping the SLOs in mind, I'm trying to at least conduct an initial investigation based on the suggestions.

sudo journalctl -S '2025-05-11 20:46:25' -U '2025-05-12 00:07:17'

Could this be related?

May 11 21:03:10 openqa openqa-webui-daemon[3496]: [warn] [pid:3496] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3065: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335390-sle-15-SP5-Server-DVD-Incidents-Kernel-KOTD-aarch64-Build5.14.21-150500.37.1.gb680b98-ltp_fs@aarch64-virtio/details-gf16.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.

and there's a variation of it:

May 11 21:03:10 openqa openqa-webui-daemon[30459]: [warn] [pid:30459] Unable to incomplete/duplicate or reschedule jobs abandoned by worker 3080: Malformed/unreadable JSON file "/var/lib/openqa/testresults/17335/17335533-sle-15-SP6-Server-DVD-Incidents-Kernel-KOTD-aarch64-Build6.4.0-150600.1097.1.gdcc1d06-ltp_openposix@aarch64-virtio/details-sigaction_17-17.json": malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 37.
May 11 21:03:10 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3080 from worker_status (broken)
May 11 21:03:10 openqa openqa-webui-daemon[2759]: [debug] [pid:2759] GruTask 40777354 already gone, skip assigning jobs (message: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  insert or update on table "gru_dependencies" violates foreign key constraint "gru_dependencies_fk_gru_task_id"
May 11 21:03:10 openqa openqa-webui-daemon[2759]: DETAIL:  Key (gru_task_id)=(40777354) is not present in table "gru_tasks". [for Statement "INSERT INTO gru_dependencies ( gru_task_id, job_id) VALUES ( ?, ? )" with ParamValues: 1='40777354', 2='17656478'] at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/Gru.pm line 160
May 11 21:03:10 openqa openqa-webui-daemon[2759]: )
May 11 21:03:10 openqa openqa-webui-daemon[2759]: [debug] [pid:2759] Job 17335539 duplicated as 17656478

I also noticed this, which points at a worker entering an unavailable state:

May 11 21:05:40 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Worker 2628 rejected job(s) 17653831: The average load (26.17 26.54 25.46) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.

And actually more concrete instances of a "broken" worker:

May 11 21:07:21 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3082 from worker_status (broken) 
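
To get an overview of which workers were reported as broken in a given journal window, the lines above could be filtered with a small script. This is only a minimal sketch: the pattern is inferred from the two "Updating seen of worker ... (broken)" lines quoted in this comment, not from the openQA source, and the function name broken_worker_ids is made up for illustration.

```python
import re

# Pattern inferred from the websockets-daemon debug lines quoted in this
# ticket; this is an assumption, not a format guaranteed by openQA.
BROKEN_RE = re.compile(
    r"Updating seen of worker (\d+) from worker_status \(broken\)"
)


def broken_worker_ids(lines):
    """Return the sorted, de-duplicated IDs of workers reported as broken.

    `lines` is any iterable of journal lines, e.g. an open file or
    the output of journalctl split into lines.
    """
    ids = set()
    for line in lines:
        match = BROKEN_RE.search(line)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)
```

Fed with the journalctl output for the alert window, this would list each affected worker once, which makes it easier to see whether the alert was caused by a few flapping workers or by many different ones.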
Actions #6

Updated by robert.richardson 16 days ago

  • Status changed from Workable to In Progress
  • Assignee set to robert.richardson
Actions #7

Updated by openqa_review 15 days ago

  • Due date set to 2025-06-03

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by okurz 14 days ago

  • Due date deleted (2025-06-03)
  • Status changed from In Progress to Resolved

We looked at the definitions of the telegraf and grafana rules again and found that everything is defined correctly. "limited" refers to "limited by global job limit", and "broken" refers to workers that the webUI shows as "unavailable", excluding those whose worker host reports "system load exceeded".

And actually more concrete instances of a "broken" worker:

May 11 21:07:21 openqa openqa-websockets-daemon[27442]: [debug] [pid:27442] Updating seen of worker 3082 from worker_status (broken)

We couldn't find those references anymore. I also ran the following query:

select count(id) as broken_workers from workers where error is not null and t_updated > (timezone('UTC', now()) - interval '1 hour') and not error like 'graceful disconnect%' and not error like 'limited%' and not error like '%Cache service queue already full %' and not error like '%average load%exceeding%';

It returned 10 "broken workers" at first, but after some minutes it returned 0, so the 1h window had shifted far enough to leave none. I guess we shouldn't do anything, as we can't reproduce the original problem and have verified that monitoring and alerting itself works fine.
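
For reference, the filter logic of that query can be mirrored in a few lines of Python, which may help when reasoning about which error messages count as "broken". This is a sketch under assumptions: the helper names is_broken and count_broken are hypothetical, workers are modelled as plain dicts with "error" and "t_updated" keys, and the LIKE patterns are translated by hand rather than taken from any openQA code.

```python
from datetime import datetime, timedelta


def is_broken(error):
    """Apply the error filters of the query to a single worker's error text."""
    if error is None:  # "error is not null"
        return False
    # not like 'graceful disconnect%' / not like 'limited%'
    if error.startswith("graceful disconnect") or error.startswith("limited"):
        return False
    # not like '%Cache service queue already full %' (note the trailing space)
    if "Cache service queue already full " in error:
        return False
    # not like '%average load%exceeding%': both substrings, in this order
    load_at = error.find("average load")
    if load_at != -1 and "exceeding" in error[load_at + len("average load"):]:
        return False
    return True


def count_broken(workers, now: datetime,
                 window: timedelta = timedelta(hours=1)) -> int:
    """Count workers updated within `window` before `now` that are broken."""
    return sum(
        1
        for worker in workers
        if worker["t_updated"] > now - window and is_broken(worker["error"])
    )
```

The sliding one-hour window is why the count can drop from 10 to 0 within minutes without any worker changing state: a worker only needs to go unupdated for an hour to fall out of the result.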

Actions #10

Updated by okurz 11 days ago

  • Description updated (diff)

I added a silence with a corresponding rollback step.

Actions #11

Updated by okurz 8 days ago

  • Priority changed from High to Normal