action #154927
closed[alert] Broken workers alert was firing several hours after weekly reboot
0%
Description
Observation¶
See https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=96&editPanel=96&from=1706991565957&to=1707139468853 for the panel/timeframe
Example of how the worker log looked like (looked similar on all machines/services I checked):
Feb 04 03:33:36 worker40 worker[3881]: [error] [pid:3881] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:46 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:56 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:56 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:06 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:06 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:16 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:16 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:26 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:26 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3087
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3087
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Accepting job 13420128 from openqa.suse.de.
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Setting job 13420128 from openqa.suse.de up
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] +++ setup notes +++
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] Running on worker40:1 (Linux 5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64)
Sometimes the connection to the cache service is not even possible despite the worker registration already working:
Feb 04 03:35:12 worker34 worker[3937]: [error] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:32 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:42 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:42 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:52 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:52 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:02 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:02 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2687
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2687
Feb 04 03:36:33 worker34 worker[3937]: [warn] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused - checking again for web UI 'openqa.suse.de' in 80.81 s
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Accepting job 13423952 from openqa.suse.de.
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Setting job 13423952 from openqa.suse.de up
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Of course the worker is designed to try again and after a while all ~ 100 affected worker slots were good again with the exception of two worker slots where it apparently took several hours. Those two slots were triggering the alert. Unfortunately we don't know which slots those were (from the data in Grafana).
The cache service itself only took long to start because it couldn't get a socket to listen on at first:
…
Feb 04 03:34:27 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: [17249] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 11.
Feb 04 03:34:32 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:32 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: [17340] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 12.
Feb 04 03:34:38 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: [17453] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:39 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 13.
Feb 04 03:34:43 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:43 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://127.0.0.1:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://[::1]:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://127.0.0.1:9530"
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://[::1]:9530"
There were actually 13 failed startup attempts on that particular host (that all happened within the time frame of a minute).