action #154927

[alert] Broken workers alert was firing several hours after weekly reboot

Added by mkittler 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: Ready
Start date: 2024-02-05
Due date: 2024-02-20
% Done: 0%
Estimated time:

Description

Observation

See https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=96&editPanel=96&from=1706991565957&to=1707139468853 for the panel/timeframe

Example of how the worker log looked (it looked similar on all machines/services I checked):

Feb 04 03:33:36 worker40 worker[3881]: [error] [pid:3881] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:46 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:56 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:56 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:06 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:06 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:16 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:16 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:26 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:26 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3087
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3087
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Accepting job 13420128 from openqa.suse.de.
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Setting job 13420128 from openqa.suse.de up
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] +++ setup notes +++
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] Running on worker40:1 (Linux 5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64)

Sometimes the connection to the cache service was not even possible although the worker registration was already working:

Feb 04 03:35:12 worker34 worker[3937]: [error] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:32 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:42 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:42 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:52 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:52 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:02 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:02 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2687
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2687
Feb 04 03:36:33 worker34 worker[3937]: [warn] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused - checking again for web UI 'openqa.suse.de' in 80.81 s
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Accepting job 13423952 from openqa.suse.de.
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Setting job 13423952 from openqa.suse.de up
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Preparing Mojo::IOLoop::ReadWriteProcess::Session

Of course the worker is designed to retry, and after a while all ~100 affected worker slots were good again, with the exception of two worker slots where it apparently took several hours. Those two slots were what triggered the alert. Unfortunately we don't know which slots those were from the data in Grafana.

The cache service itself only took so long to start because it initially couldn't get a socket to listen on:

…
Feb 04 03:34:27 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: [17249] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 11.
Feb 04 03:34:32 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:32 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: [17340] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 12.
Feb 04 03:34:38 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: [17453] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:39 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 13.
Feb 04 03:34:43 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:43 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://127.0.0.1:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://[::1]:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://127.0.0.1:9530"
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://[::1]:9530"

There were actually 13 failed startup attempts on that particular host (all of which happened within the time frame of about a minute).
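
The roughly 5-second cadence between attempts matches systemd's automatic restart handling of the unit. As a hypothetical sketch (values assumed from the timestamps above, not copied from the shipped unit file), a restart policy like the following would produce exactly this pattern of 13 attempts within about a minute:

[Service]
# Restart the cache service whenever it exits, waiting 5 seconds in between
Restart=always
RestartSec=5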

Actions #1

Updated by mkittler 3 months ago

  • Description updated (diff)
Actions #2

Updated by mkittler 3 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 3 months ago

  • Tags changed from alert to alert, reactive work
  • Target version set to Ready

But that looks like a genuine openQA problem. The alert was correct, right?

Actions #4

Updated by mkittler 3 months ago

  • Description updated (diff)
Actions #5

Updated by mkittler 3 months ago

I ran the following command to see which worker slots were still logging errors for a longer time:

martchus@openqa:~> sudo salt -C 'G@roles:worker' cmd.run "journalctl --boot -u openqa-worker-auto-restart* | grep -i 'cache service info error'"

It looks like those were worker slots on petrol and diesel. The cache service was restarted multiple times on those hosts, which explains why it was sometimes not available. Note that petrol and diesel had not been rebooted like the other hosts; the cache service restarts on them in the relevant time frame are likely normal and simply due to a new version of the worker package being installed, e.g.:

Feb 05 06:38:58 diesel [RPM][2939]: install openQA-worker-4.6.1706881004.0d50a814-lp155.6302.1.ppc64le: success
…
Feb 05 06:38:59 diesel systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Feb 05 06:39:05 diesel openqa-worker-cacheservice-minion[7870]: [7870] [i] Worker 7870 stopped
Feb 05 06:39:05 diesel systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=144/n/a
Feb 05 06:39:05 diesel systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Feb 05 06:39:05 diesel systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Feb 05 06:39:05 diesel systemd[1]: Stopping OpenQA Worker Cache Service...
…
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [i] Creating process id file "/tmp/prefork.pid"
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [w] Stopping worker 7918 immediately
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [w] Stopping worker 7921 immediately
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [w] Stopping worker 7919 immediately
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [w] Stopping worker 7920 immediately
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [i] Worker 7918 stopped
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [i] Worker 7919 stopped
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [i] Worker 7921 stopped
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [w] Stopping worker 7920 immediately
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [i] Worker 7920 stopped
Feb 05 06:39:05 diesel openqa-workercache-daemon[7864]: [7864] [i] Manager 7864 stopped
Feb 05 06:39:05 diesel worker[7913]: [info] [pid:7913] Received signal HUP
Feb 05 06:39:05 diesel systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=96/n/a
Feb 05 06:39:05 diesel systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 05 06:39:05 diesel systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 05 06:39:05 diesel systemd[1]: Started OpenQA Worker Cache Service.
Feb 05 06:39:05 diesel systemd[1]: Started OpenQA Worker Cache Service Minion.
Feb 05 06:39:05 diesel systemd[1]: Reloading openQA Worker #4...
…
Feb 05 06:39:05 diesel worker[7910]: [info] [pid:7910] Received signal HUP
Feb 05 06:39:05 diesel [RPM][2939]: erase openQA-worker-4.6.1706881004.0d50a814-lp155.6301.1.ppc64le: success
Feb 05 06:39:05 diesel [RPM][2939]: install openQA-worker-4.6.1706881004.0d50a814-lp155.6302.1.ppc64le: success
Feb 05 06:39:05 diesel [RPM][2939]: Transaction ID 65c07470 finished: 0
Feb 05 06:39:05 diesel openqa-workercache-daemon[3064]: Web application available at http://127.0.0.1:9530
Feb 05 06:39:05 diesel openqa-workercache-daemon[3064]: Web application available at http://[::1]:9530
Feb 05 06:39:05 diesel openqa-workercache-daemon[3064]: [3064] [i] Cache size of "/var/lib/openqa/cache" is 99 GiB, with limit 100 GiB

I suppose that this would normally not trigger an alert but perhaps it was enough to keep the already firing alert from turning off.

None of this would be a problem if the worker services and the worker cache service were only started after the network is available (i.e. when one can listen on a socket on the loopback interface). We could probably work around this by preventing the worker from connecting to the web UI while the cache service is not available - at least for the first few minutes (at some point it should of course still connect and report itself as broken, because if there really is a problem we want to know about it and not just see that our workers aren't online for some reason).
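
As a minimal sketch of the first idea, a systemd drop-in could order both services after network-online.target (file names and paths are assumptions for illustration, not the shipped unit files):

# /etc/systemd/system/openqa-worker-cacheservice.service.d/network.conf
# (and analogously for the openqa-worker-auto-restart@.service instances)
[Unit]
# Only start once the network is considered up, so the listen socket and the
# connection to the web UI can be created right away
Wants=network-online.target
After=network-online.target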

Actions #6

Updated by mkittler 3 months ago

  • Status changed from New to In Progress

There's one small improvement on the worker side I could come up with that definitely makes sense (although it most likely won't make a big difference): https://github.com/os-autoinst/openQA/pull/5451

Actions #7

Updated by openqa_review 3 months ago

  • Due date set to 2024-02-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler 3 months ago

The PR was merged.


We could probably work around this by preventing the worker from connecting to the web UI while the cache service is not available …

Following up on that thought:

  • The worker cannot connect to the web UI anyway while the network (and thus the cache service) is not available.
  • If the network becomes available and we can connect to the web UI, the cache service availability is checked on the fly (rather than returning a cached/old value), and that check should find the cache service since the network is available. I verified by looking at the code that it really is that simple (a minimal sketch of such a check is shown right below this list).
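
Here is a minimal sketch of such an on-the-fly check, assuming the cache service exposes an info route on the default port 9530 seen in the logs (an illustration, not the actual worker code):

use Mojo::UserAgent;

# Returns 1 if the cache service answers on its info route, 0 otherwise
sub cache_service_available {
    my $host = shift // 'http://127.0.0.1:9530';
    my $ua   = Mojo::UserAgent->new(connect_timeout => 5, request_timeout => 10);
    # A refused connection (as in the log excerpts above) shows up as a
    # transaction error rather than an HTTP status
    return $ua->get("$host/info")->error ? 0 : 1;
}

print cache_service_available() ? "cache service available\n" : "cache service not available\n";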

So considering just these two points, the problem we see shouldn't occur. However, there's maybe one more thing: the cache service only retries creating its socket at a certain frequency, so there might be a short window in which the network is already available but the cache service is not running yet. It is probably that small window that causes the trouble.

You can see the delay in the log messages from the ticket description:

  • The last Can't create listen socket: Address family for hostname not supported is from 03:34:38.
  • The Web application available at http://127.0.0.1:9530 message is from 03:34:44.

That's a 6-second window in which the network might have become available and workers might have connected to the web UI as broken because the cache service was not available yet.

The cleanest solution would be to have the systemd unit openqa-worker-cacheservice.service signal when it is actually ready (ideally a single call via libsystemd after Web application available at …). Then the openqa-worker-… units could simply depend on that. The downside of this approach is that the worker units probably should not have a hard dependency on the cache service as it is an optional component. We could of course always configure this as needed in our salt states (and for o3 this doesn't matter anyway because nobody cares about workers briefly showing up as broken there).
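
A hypothetical sketch of that approach with systemd drop-ins (file names assumed; the daemon itself would additionally have to send READY=1, e.g. via sd_notify/systemd-notify, once it logs "Web application available at …"):

# /etc/systemd/system/openqa-worker-cacheservice.service.d/notify.conf
[Service]
# Tell systemd the service only counts as started once the daemon signals readiness
Type=notify

# /etc/systemd/system/openqa-worker-auto-restart@.service.d/cacheservice.conf
[Unit]
# Soft dependency only, since the cache service is an optional component
Wants=openqa-worker-cacheservice.service
After=openqa-worker-cacheservice.service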

Alternatively I could just follow up on the initially quoted idea from #154927#note-5. The code on the worker side might get a bit messy, but it should of course be possible. The worker would check the general network availability itself and keep track of the time that has elapsed since it became available. If that time is below a certain configurable threshold, it would refrain from reporting itself as broken (and either send no status update at all, which should be fine as of https://github.com/os-autoinst/openQA/pull/5451, or pretend to be idling). The disadvantage (besides the potentially complicated code) is that the scheduler would then of course assign jobs to the worker which the worker needs to reject (which might even work out of the box, but one would need to make sure it definitely happens; otherwise we'd end up with more incomplete jobs).
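
A rough sketch of that grace-period logic on the worker side (hypothetical names and threshold, not actual openQA code):

# Decide what to report to the web UI while the cache service is unavailable
my $grace_period = $ENV{WORKER_BROKEN_GRACE_PERIOD} // 300;    # seconds, assumed setting
my $network_available_since;

sub status_to_report {
    my ($cache_service_ok) = @_;
    return 'idle' if $cache_service_ok;
    # Remember when the network (and thus the web UI) first became reachable
    $network_available_since //= time;
    # Within the grace period: do not report "broken" (send no status update or
    # pretend to idle) to avoid triggering the alert on the startup race
    return 'idle' if time - $network_available_since < $grace_period;
    return 'broken';
}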

Actions #9

Updated by mkittler 3 months ago

Besides the two options mentioned in my previous comment there's of course also the option to extend the grace period of our alert.

It currently fires if there is at least one broken worker at any point in time during the last 15 minutes.

Not sure yet how to change the alert, though. Just increasing the 15 minutes wouldn't have helped. Actually, even implementing all of the previously mentioned options wouldn't help with the alert as it is.

Maybe we should think about whether to change the alert more fundamentally, or investigate why at least one worker was always in the broken state for so long.

Actions #10

Updated by mkittler 3 months ago

  • Status changed from In Progress to Resolved

I checked the logs again to find out why we had 2 broken worker slots for so long. It looks like the logs on some hosts don't go back far enough. Beyond that I couldn't find a particular culprit (a worker slot that logged cache service errors long enough to be relevant).

I added the ideas from #154927#note-8 to our ticket for various feature requests (as there's room for improvement, even though it wouldn't have helped with this concrete alert): https://progress.opensuse.org/journals/761167/diff?detail_id=714442

If this happens again at a time when we can react faster, we can easily find out which slots were broken via the web UI. For now I'm considering this ticket resolved (the alert is not firing anymore and never had to be silenced).
