action #179753 (closed)
[alert] Worker cache service is sometimes not available leading to broken workers alert size:S
Description
Observation
We recently had two instances where the broken workers alert fired:
- https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-03-28T08%3A39%3A24.744Z&to=2025-03-29T02%3A45%3A49.085Z&timezone=utc&var-host_disks=%24__all&editPanel=96
- https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-03-30T19%3A32%3A23.443Z&to=2025-03-31T01%3A08%3A58.565Z&timezone=utc&var-host_disks=%24__all&editPanel=96
I checked the list of broken workers during the first alert and found that some slots on worker30 were affected with "Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout".
The second alert had already resolved by the time I saw it, so I checked the journal on worker30 and found a few related messages:
Mar 30 23:52:41 worker30 worker[91302]: [warn] [pid:91302] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 269.63 s
Mar 30 23:52:50 worker30 worker[91223]: [warn] [pid:91223] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 100.62 s
…
Mar 31 14:35:11 worker30 worker[46072]: [warn] [pid:46072] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 175.11 s
Mar 31 14:35:19 worker30 worker[46216]: [warn] [pid:46216] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 238.67 s
Mar 31 14:35:21 worker30 worker[46086]: [warn] [pid:46086] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 259.93 s
Mar 31 14:35:28 worker30 worker[46497]: [warn] [pid:46497] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 285.07 s
…
Mar 31 15:17:34 worker30 worker[46071]: [warn] [pid:46071] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 207.77 s
Mar 31 15:17:34 worker30 worker[46265]: [warn] [pid:46265] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 116.81 s
…
Mar 31 15:27:46 worker30 worker[46352]: [warn] [pid:46352] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 149.38 s
Mar 31 15:27:55 worker30 worker[46071]: [warn] [pid:46071] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 140.51 s
…
Mar 31 16:47:27 worker30 worker[46498]: [warn] [pid:46498] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 78.35 s
…
Apr 01 06:10:44 worker30 worker[46071]: [warn] [pid:46071] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 158.00 s
Apr 01 06:10:45 worker30 worker[46329]: [warn] [pid:46329] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 108.78 s
I guess the "queue already full" messages are not actually relevant here as they are filtered out.
The messages from outside the window in which the alert fired are of course not relevant for this particular instance, but they show that the "Inactivity timeout" is also reached without triggering the alert.
In all cases the situation resolved itself.
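To get a quicker overview of such incidents, the warnings can be summarised by failure reason. This is a minimal sketch; in production one would pipe `journalctl` output from the worker host into it (the exact unit name depends on the deployment), so here a small inline sample stands in for the journal:

```shell
# Summarise "Worker cache not available" warnings by failure reason.
# Real usage (unit name is an assumption, verify on the host):
#   journalctl -u 'openqa-worker-auto-restart@*' --since '2025-03-30' | summarise
summarise() {
  grep 'Worker cache not available' \
    | sed -E 's/.*via [^ ]+: (.*) - checking again.*/\1/' \
    | sort | uniq -c | sort -rn
}

summarise <<'EOF'
Mar 30 23:52:41 worker30 worker[91302]: [warn] [pid:91302] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Inactivity timeout - checking again for web UI 'openqa.suse.de' in 269.63 s
Mar 31 14:35:11 worker30 worker[46072]: [warn] [pid:46072] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 175.11 s
Mar 31 14:35:19 worker30 worker[46216]: [warn] [pid:46216] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 238.67 s
EOF
```

On the sample above this prints a count per distinct reason, which makes it easy to see whether "Inactivity timeout" or "queue already full" dominates a given window.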
There are slow SQL queries logged in the journal, e.g.:
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: [105432] [i] Slow SQLite query: "select id, args, attempts,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: (select json_group_array(distinct child.id)
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: from minion_jobs as child, json_each(child.parents) as parent_id
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: where j.id = parent_id.value) as children,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: lax, notes, parents, priority, queue, result,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',retried) as retried, retries,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s',started) as started, state, task,
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: strftime('%s','now') as time, worker
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: from minion_jobs as j
Mar 28 22:54:00 worker30 openqa-workercache-daemon[105432]: where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 62278ms
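A 62 s runtime for this Minion job-listing query is suspicious, so checking the query plan on the cache service's SQLite database would be a reasonable next step. This is only a sketch against an empty stand-in table with a trimmed version of the query (it assumes the `sqlite3` CLI is installed; the real schema and database path on the worker would differ):

```shell
# Inspect the plan of a trimmed variant of the slow query against a
# stand-in minion_jobs table; the column set here is a simplification.
sqlite3 :memory: <<'EOF'
CREATE TABLE minion_jobs (
  id INTEGER PRIMARY KEY, args TEXT, attempts INTEGER, parents TEXT,
  state TEXT, expires TEXT, created TEXT
);
EXPLAIN QUERY PLAN
  SELECT id, args, attempts FROM minion_jobs AS j
  WHERE id IN (1)
    AND (state != 'inactive' OR expires IS NULL OR expires > datetime('now'))
  ORDER BY id DESC LIMIT 10 OFFSET 0;
EOF
```

If the plan shows a full table scan rather than a primary-key search, an index (or database bloat needing a `VACUUM`) would be worth investigating.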
Suggestion
- Consider increasing the grace period of the alert.
- One or more worker instances stop working
- Implement automatic restarting of the worker cache service
- Maybe workers were unable to report their status due to the web socket service being overloaded
- At least monitor the situation over the next few days to see whether recent infrastructure changes (e.g. config file loading) or something else could be the cause
- On the first occurrence the cache service wasn't restarted in the relevant time window. On the second occurrence the cache service was perhaps restarted around that time (Mon 2025-03-31 05:18:53 UTC).
Updated by okurz 3 days ago
- Subject changed from [alert] Worker cache service is sometimes not available leading to broken workers alert to [alert] Worker cache service is sometimes not available leading to broken workers alert size:S
- Description updated (diff)
- Status changed from New to Feedback
- Assignee set to okurz
- Priority changed from Urgent to Normal
Updated by okurz about 1 hour ago
- Related to action #177276: Make use of config files in openqa.ini.d for OSD specific settings size:S added
Updated by okurz about 1 hour ago
- Status changed from Feedback to Resolved
No more problems observed over the past days, so I assume this was related to the problems we had while working on #177276.