action #179753 (closed)
[alert] Worker cache service is sometimes not available leading to broken workers alert size:S
Description
Observation
We recently had two instances where the broken workers alert fired:
- https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-03-28T08%3A39%3A24.744Z&to=2025-03-29T02%3A45%3A49.085Z&timezone=utc&var-host_disks=%24__all&editPanel=96
- https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-03-30T19%3A32%3A23.443Z&to=2025-03-31T01%3A08%3A58.565Z&timezone=utc&var-host_disks=%24__all&editPanel=96
I checked the list of broken workers during the first alert and found that some slots on worker30 were affected with "Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout".
The second alert had already resolved by the time I saw it, so I checked the journal on worker30 and found a few related messages:
Mar 30 23:52:41 worker30 worker[91302]: [warn] [pid:91302] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 269.63 s
Mar 30 23:52:50 worker30 worker[91223]: [warn] [pid:91223] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 100.62 s
…
Mar 31 14:35:11 worker30 worker[46072]: [warn] [pid:46072] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 175.11 s
Mar 31 14:35:19 worker30 worker[46216]: [warn] [pid:46216] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 238.67 s
Mar 31 14:35:21 worker30 worker[46086]: [warn] [pid:46086] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 259.93 s
Mar 31 14:35:28 worker30 worker[46497]: [warn] [pid:46497] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 285.07 s
…
Mar 31 15:17:34 worker30 worker[46071]: [warn] [pid:46071] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 207.77 s
Mar 31 15:17:34 worker30 worker[46265]: [warn] [pid:46265] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 116.81 s
…
Mar 31 15:27:46 worker30 worker[46352]: [warn] [pid:46352] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 149.38 s
Mar 31 15:27:55 worker30 worker[46071]: [warn] [pid:46071] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 140.51 s
…
Mar 31 16:47:27 worker30 worker[46498]: [warn] [pid:46498] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 78.35 s
…
Apr 01 06:10:44 worker30 worker[46071]: [warn] [pid:46071] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 158.00 s
Apr 01 06:10:45 worker30 worker[46329]: [warn] [pid:46329] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 108.78 s
I guess the "queue already full" messages are not actually relevant here as they are filtered out.
The messages from outside the window in which the alert fired are of course not relevant for this particular instance, but they show that the "Inactivity timeout" is also reached without triggering the alert.
In all cases the situation resolved itself.
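To get a quicker overview of such incidents, the warnings can be summarised by failure reason. This is a minimal sketch; in production one would pipe `journalctl` output from the worker host into it (the exact unit name depends on the deployment), so here a small inline sample stands in for the journal:

```shell
# Summarise "Worker cache not available" warnings by failure reason.
# Real usage (unit name is an assumption, verify on the host):
#   journalctl -u 'openqa-worker-auto-restart@*' --since '2025-03-30' | summarise
summarise() {
  grep 'Worker cache not available' \
    | sed -E 's/.*via [^ ]+: (.*) - checking again.*/\1/' \
    | sort | uniq -c | sort -rn
}

summarise <<'EOF'
Mar 30 23:52:41 worker30 worker[91302]: [warn] [pid:91302] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 269.63 s
Mar 31 14:35:11 worker30 worker[46072]: [warn] [pid:46072] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 175.11 s
Mar 31 14:35:19 worker30 worker[46216]: [warn] [pid:46216] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 238.67 s
EOF
```

On the sample above this prints a count per distinct reason, which makes it easy to see whether "Inactivity timeout" or "queue already full" dominates a given window.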
There are slow SQL queries logged in the journal, e.g.:
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: [105432] [i] Slow SQLite query: "select id, args, attempts,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: (select json_group_array(distinct child.id)
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: from minion_jobs as child, json_each(child.parents) as parent_id
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: where j.id = parent_id.value) as children,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: lax, notes, parents, priority, queue, result,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',retried) as retried, retries,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',started) as started, state, task,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s','now') as time, worker
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: from minion_jobs as j
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 62278ms
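A 62 s runtime for this Minion job-listing query is suspicious, so checking the query plan on the cache service's SQLite database would be a reasonable next step. This is only a sketch against an empty stand-in table with a trimmed version of the query (it assumes the `sqlite3` CLI is installed; the real schema and database path on the worker would differ):

```shell
# Inspect the plan of a trimmed variant of the slow query against a
# stand-in minion_jobs table; the column set here is a simplification.
sqlite3 :memory: <<'EOF'
CREATE TABLE minion_jobs (
  id INTEGER PRIMARY KEY, args TEXT, attempts INTEGER, parents TEXT,
  state TEXT, expires TEXT, created TEXT
);
EXPLAIN QUERY PLAN
  SELECT id, args, attempts FROM minion_jobs AS j
  WHERE id IN (1)
    AND (state != 'inactive' OR expires IS NULL OR expires > datetime('now'))
  ORDER BY id DESC LIMIT 10 OFFSET 0;
EOF
```

If the plan shows a full table scan rather than a primary-key search, an index (or database bloat needing a `VACUUM`) would be worth investigating.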
Suggestion
- Consider increasing the grace period of the alert.
- One or more worker instances stop working
- Implement automatic restarting of the worker cache service
- Maybe workers were unable to report their status due to the web socket service being overloaded
- At least monitor the situation over the next few days to see whether recent infrastructure changes (e.g. config file loading) or something else could be the cause
- On the first occurrence the cache service wasn't restarted in the relevant time window. On the second occurrence the cache service was perhaps restarted around that time (Mon 2025-03-31 05:18:53 UTC).
Updated by okurz 3 days ago
- Subject changed from [alert] Worker cache service is sometimes not available leading to broken workers alert to [alert] Worker cache service is sometimes not available leading to broken workers alert size:S
- Description updated (diff)
- Status changed from New to Feedback
- Assignee set to okurz
- Priority changed from Urgent to Normal
Updated by okurz about 1 hour ago
- Related to action #177276: Make use of config files in openqa.ini.d for OSD specific settings size:S added
Updated by okurz about 1 hour ago
- Status changed from Feedback to Resolved
No more problems observed over the past days, so I assume this was related to the problems we had while working on #177276.