Project

General

Profile

action #154927

Updated by mkittler 3 months ago

## Observation 
 See see https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=96&editPanel=96&from=1706991565957&to=1707139468853 for the panel/timeframe 

 Example of how the worker log looked like (looked similar on all machines/services I checked): 

 ``` 
 Feb 04 03:33:36 worker40 worker[3881]: [error] [pid:3881] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused 
 Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Project dir for host openqa.suse.de is /var/lib/openqa/share 
 Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:33:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:33:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:33:46 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:33:56 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:33:56 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:34:06 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:34:06 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:34:16 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:34:16 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:34:26 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:34:26 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:34:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:34:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de 
 Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3087 
 Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3087 
 Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Accepting job 13420128 from openqa.suse.de. 
 Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Setting job 13420128 from openqa.suse.de up 
 Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Preparing Mojo::IOLoop::ReadWriteProcess::Session 
 Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] +++ setup notes +++ 
 Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] Running on worker40:1 (Linux 5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64) 
 ``` 

 Sometimes the connection to the cache service is not even possible despite the worker registration already working: 

 ``` 
 Feb 04 03:35:12 worker34 worker[3937]: [error] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused 
 Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Project dir for host openqa.suse.de is /var/lib/openqa/share 
 Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:35:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:35:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:35:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:35:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:35:32 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:35:42 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:35:42 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:35:52 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:35:52 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:36:02 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:36:02 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:36:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:36:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:36:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:36:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds 
 Feb 04 03:36:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de 
 Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2687 
 Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2687 
 Feb 04 03:36:33 worker34 worker[3937]: [warn] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused - checking again for web UI 'openqa.suse.de' in 80.81 s 
 Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Accepting job 13423952 from openqa.suse.de. 
 Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Setting job 13423952 from openqa.suse.de up 
 Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Preparing Mojo::IOLoop::ReadWriteProcess::Session 
 ``` 

 Of course the worker is designed to try again and after a while all ~ 100 affected worker slots were good again with the exception of two worker slots where it apparently took several hours. Those two slots were triggering the alert. Unfortunately we don't know which slots those were (from the data in Grafana).

Back