Project

General

Profile

action #110785

Updated by mkittler almost 2 years ago

## Observation 
 On 2022-05-09 we found that there are many jobs scheduled but not enough running. On https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1651831201500&to=1652176797500 one can see that we have the case of many jobs scheduled in the evening of 2022-05-09. A discussion thread asking about problems was opened by Xiaojing Liu in https://suse.slack.com/archives/C02AJ1E568M/p1652083286946669 . jbaier pointed out that the majority of the queue are LTP tests (new kernel livepatches in the maintenance queue) which is happening seldomly but can still be expected 

 On openqaworker6 `journalctl -u openqa-worker-auto-restart@20` we found 

 ``` 
 May 09 03:26:12 openqaworker6 worker[33850]: [warn] [pid:33850] Worker cache not available: Cache service queue already full (10) 
 ``` 

 ## Problem 

 In https://suse.slack.com/archives/C02AJ1E568M/p1652085933670009?thread_ts=1652083286.946669&cid=C02AJ1E568M coolo stated the hypothesis 

 > The scheduler didn't schedule because there was that one worker who was kept as idle, but was actually dead. And that situation is fatal for the scheduling loop 

 It seems that the scheduling loop might not be completely blocked but slowed down a lot iterating over a big list of jobs and workers and only assigning jobs very slowly 

 The scheduler currently fails to assign jobs to various workers, possibly slowing down the whole scheduling tick: 
 ``` 
 [2022-05-09T11:06:58.325734+02:00] [debug] [pid:14772] Assigned job '8722179' to worker ID '579'  
 [2022-05-09T11:06:58.334240+02:00] [debug] [pid:14772] [Job#8722179] Prepare for being processed by worker 579  
 [2022-05-09T11:06:58.461793+02:00] [debug] [pid:14772] Sent job(s) '8722179' to worker '579'  
 [2022-05-09T11:06:58.496361+02:00] [debug] [pid:14772] Assigned job '8714856' to worker ID '912'  
 [2022-05-09T11:06:58.500502+02:00] [debug] [pid:14772] [Job#8714856] Prepare for being processed by worker 912  
 [2022-05-09T11:06:58.664234+02:00] [debug] [pid:14772] Failed sending job(s) '8714856' to worker '912'  
 [2022-05-09T11:06:58.701367+02:00] [debug] [pid:14772] Job 8714856 reset to state scheduled  
 [2022-05-09T11:06:58.759157+02:00] [debug] [pid:14772] Assigned job '8723222' to worker ID '529'  
 [2022-05-09T11:06:58.787946+02:00] [debug] [pid:14772] [Job#8723222] Prepare for being processed by worker 529  
 [2022-05-09T11:06:58.926925+02:00] [debug] [pid:14772] Created network for 8723222 : 73  
 [2022-05-09T11:06:59.179954+02:00] [debug] [pid:14772] Sent job(s) '8723222' to worker '529'  
 [2022-05-09T11:06:59.228883+02:00] [debug] [pid:14772] Assigned job '8723225' to worker ID '527'  
 [2022-05-09T11:06:59.244018+02:00] [debug] [pid:14772] [Job#8723225] Prepare for being processed by worker 527  
 [2022-05-09T11:06:59.464249+02:00] [debug] [pid:14772] Sent job(s) '8723225' to worker '527'  
 [2022-05-09T11:06:59.507154+02:00] [debug] [pid:14772] Assigned job '8722186' to worker ID '1119'  
 [2022-05-09T11:06:59.517069+02:00] [debug] [pid:14772] [Job#8722186] Prepare for being processed by worker 1119  
 [2022-05-09T11:06:59.738507+02:00] [debug] [pid:14772] Failed sending job(s) '8722186' to worker '1119'  
 [2022-05-09T11:06:59.765372+02:00] [debug] [pid:14772] Job 8722186 reset to state scheduled  
 [2022-05-09T11:06:59.825905+02:00] [debug] [pid:14772] Assigned job '8722182' to worker ID '1317'  
 [2022-05-09T11:06:59.837061+02:00] [debug] [pid:14772] [Job#8722182] Prepare for being processed by worker 1317  
 [2022-05-09T11:06:59.918401+02:00] [debug] [pid:14772] Sent job(s) '8722182' to worker '1317'  
 [2022-05-09T11:06:59.970125+02:00] [debug] [pid:14772] Assigned job '8722181' to worker ID '1266'  
 [2022-05-09T11:06:59.976989+02:00] [debug] [pid:14772] [Job#8722181] Prepare for being processed by worker 1266  
 [2022-05-09T11:07:00.076180+02:00] [debug] [pid:14772] Sent job(s) '8722181' to worker '1266'  
 [2022-05-09T11:07:00.125276+02:00] [debug] [pid:14772] Assigned job '8723223' to worker ID '1287'  
 [2022-05-09T11:07:00.149012+02:00] [debug] [pid:14772] [Job#8723223] Prepare for being processed by worker 1287  
 [2022-05-09T11:07:00.538347+02:00] [debug] [pid:14772] Sent job(s) '8723223' to worker '1287'  
 [2022-05-09T11:07:00.594073+02:00] [debug] [pid:14772] Assigned job '8722190' to worker ID '1114'  
 [2022-05-09T11:07:00.599663+02:00] [debug] [pid:14772] [Job#8722190] Prepare for being processed by worker 1114  
 [2022-05-09T11:07:00.699187+02:00] [debug] [pid:14772] Failed sending job(s) '8722190' to worker '1114'  
 [2022-05-09T11:07:00.737092+02:00] [debug] [pid:14772] Job 8722190 reset to state scheduled  
 [2022-05-09T11:07:00.800405+02:00] [debug] [pid:14772] Assigned job '8723227' to worker ID '1077'  
 [2022-05-09T11:07:00.807488+02:00] [debug] [pid:14772] [Job#8723227] Prepare for being processed by worker 1077  
 [2022-05-09T11:07:01.166880+02:00] [debug] [pid:14772] Sent job(s) '8723227' to worker '1077'  
 [2022-05-09T11:07:01.205905+02:00] [debug] [pid:14772] Assigned job '8723226' to worker ID '1297'  
 [2022-05-09T11:07:01.216100+02:00] [debug] [pid:14772] [Job#8723226] Prepare for being processed by worker 1297  
 [2022-05-09T11:07:01.415973+02:00] [debug] [pid:14772] Sent job(s) '8723226' to worker '1297'  
 [2022-05-09T11:07:01.474224+02:00] [debug] [pid:14772] Assigned job '8723224' to worker ID '1080'  
 [2022-05-09T11:07:01.486018+02:00] [debug] [pid:14772] [Job#8723224] Prepare for being processed by worker 1080  
 [2022-05-09T11:07:01.661337+02:00] [debug] [pid:14772] Sent job(s) '8723224' to worker '1080'  
 [2022-05-09T11:07:01.704444+02:00] [debug] [pid:14772] Assigned job '8722104' to worker ID '593'  
 [2022-05-09T11:07:01.709074+02:00] [debug] [pid:14772] [Job#8722104] Prepare for being processed by worker 593  
 [2022-05-09T11:07:01.812775+02:00] [debug] [pid:14772] Failed sending job(s) '8722104' to worker '593'  
 [2022-05-09T11:07:01.837325+02:00] [debug] [pid:14772] Job 8722104 reset to state scheduled  
 [2022-05-09T11:07:01.873641+02:00] [debug] [pid:14772] Assigned job '8722185' to worker ID '903'  
 [2022-05-09T11:07:01.881935+02:00] [debug] [pid:14772] [Job#8722185] Prepare for being processed by worker 903  
 [2022-05-09T11:07:01.958940+02:00] [debug] [pid:14772] Failed sending job(s) '8722185' to worker '903'  
 [2022-05-09T11:07:01.981328+02:00] [debug] [pid:14772] Job 8722185 reset to state scheduled  
 [2022-05-09T11:07:02.017183+02:00] [debug] [pid:14772] Assigned job '8715010' to worker ID '1349'  
 [2022-05-09T11:07:02.022772+02:00] [debug] [pid:14772] [Job#8715010] Prepare for being processed by worker 1349  
 [2022-05-09T11:07:02.137027+02:00] [debug] [pid:14772] Failed sending job(s) '8715010' to worker '1349'  
 [2022-05-09T11:07:02.162308+02:00] [debug] [pid:14772] Job 8715010 reset to state scheduled  
 [2022-05-09T11:07:02.199658+02:00] [debug] [pid:14772] Assigned job '8722098' to worker ID '954'  
 [2022-05-09T11:07:02.204521+02:00] [debug] [pid:14772] [Job#8722098] Prepare for being processed by worker 954  
 [2022-05-09T11:07:02.297548+02:00] [debug] [pid:14772] Failed sending job(s) '8722098' to worker '954'  
 [2022-05-09T11:07:02.318036+02:00] [debug] [pid:14772] Job 8722098 reset to state scheduled  
 [2022-05-09T11:07:02.352935+02:00] [debug] [pid:14772] Assigned job '8722195' to worker ID '1210'  
 [2022-05-09T11:07:02.357567+02:00] [debug] [pid:14772] [Job#8722195] Prepare for being processed by worker 1210  
 [2022-05-09T11:07:02.497911+02:00] [debug] [pid:14772] Failed sending job(s) '8722195' to worker '1210'  
 [2022-05-09T11:07:02.523069+02:00] [debug] [pid:14772] Job 8722195 reset to state scheduled  
 [2022-05-09T11:07:02.581416+02:00] [debug] [pid:14772] Assigned job '8715007' to worker ID '1274'  
 [2022-05-09T11:07:02.584929+02:00] [debug] [pid:14772] [Job#8715007] Prepare for being processed by worker 1274  
 [2022-05-09T11:07:02.759493+02:00] [debug] [pid:14772] Sent job(s) '8715007' to worker '1274' 
 ``` 

 Those workers are currently idling: 
 ``` 
 openqa=> select host, instance, job_id, error from workers where id in (912, 1119, 1114, 593, 903, 954, 1210);    
         host           | instance | job_id | error   
 ---------------------+----------+--------+-------  
  openqaworker9         |          9 |          |   
  openqaworker6         |         15 |          |   
  openqaworker9         |          8 |          |   
  powerqaworker-qam-1 |          4 |          |   
  openqaworker13        |          8 |          |   
  openqaworker13        |          3 |          |   
  openqaworker13        |         14 |          |   
 (7 Zeilen) 
 ``` 

 The problem persists during multiple scheduler ticks. The set of workers is always similar, now also one more shows up: 
 ``` 
 [2022-05-09T11:14:27.829472+02:00] [debug] [pid:14772] Assigned job '8715013' to worker ID '1349'  
 [2022-05-09T11:14:27.836584+02:00] [debug] [pid:14772] [Job#8715013] Prepare for being processed by worker 1349  
 [2022-05-09T11:14:27.968569+02:00] [debug] [pid:14772] Failed sending job(s) '8715013' to worker '1349'  
 [2022-05-09T11:14:27.999362+02:00] [debug] [pid:14772] Job 8715013 reset to state scheduled 
 ``` 

 Besides, the scheduler ticket takes very long. I saw one ticket that took almost 30 seconds. 

 The affected workers log looks like this: 
 ``` 
 Mai 08 21:17:49 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:25:09 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:25:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:25:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
 Mai 08 21:30:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:30:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:30:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:30:57 openqaworker13 worker[32415]: [info] [pid:32415] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210 
 ``` 
 So there were connection issues but eventually the worker could register. However, then it nevertheless didn't take any jobs despite attempts from the scheduler to assign a job to it. After restarting the worker it instantly starts working on a new job: 

 ``` 
 Mai 09 11:30:11 openqaworker13 systemd[1]: Stopping openQA Worker #14...  
 Mai 09 11:30:11 openqaworker13 worker[32415]: [info] [pid:32415] Received signal TERM  
 Mai 09 11:30:11 openqaworker13 worker[32415]: [debug] [pid:32415] Informing openqa.suse.de that we are going offline  
 Mai 09 11:30:11 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Succeeded.  
 Mai 09 11:30:11 openqaworker13 systemd[1]: Stopped openQA Worker #14.  
 Mai 09 11:30:11 openqaworker13 systemd[1]: Starting openQA Worker #14...  
 Mai 09 11:30:11 openqaworker13 systemd[1]: Started openQA Worker #14.  
 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] worker 14:  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - config file:             /etc/openqa/workers.ini  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - worker hostname:         openqaworker13  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - isotovideo version:      25  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - websocket API version: 1  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - web UI hosts:            openqa.suse.de  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - class:                   qemu_x86_64,qemu_x86_64_staging,amd,openqaworker13  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - no cleanup:              no  
 Mai 09 11:30:13 openqaworker13 worker[17023]:    - pool directory:          /var/lib/openqa/pool/14  
 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de  
 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Project dir for host openqa.suse.de is /var/lib/openqa/share  
 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Registering with openQA openqa.suse.de  
 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210  
 Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Accepting job 8722699 from openqa.suse.de.  
 Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Setting job 8722699 from openqa.suse.de up  
 Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Preparing Mojo::IOLoop::ReadWriteProcess::Session  
 Mai 09 11:30:17 openqaworker13 worker[17023]: [info] [pid:17023] +++ setup notes +++  
 Mai 09 11:30:17 openqaworker13 worker[17023]: [info] [pid:17023] Running on openqaworker13:14 (Linux 5.3.18-150300.59.63-default #1 SMP Tue Apr 5 12:47:31 UTC 2022 (d77db66) x86_64)  
 Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Job settings:  
 Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023]  
 Mai 09 11:30:17 openqaworker13 worker[17023]:       ARCH=x86_64  
 Mai 09 11:30:17 openqaworker13 worker[17023]:       BACKEND=qemu  
 Mai 09 11:30:17 openqaworker13 worker[17023]:       BOOT_HDD_IMAGE=1  
 Mai 09 11:30:17 openqaworker13 worker[17023]:       BUGZILLA_URL=http://fastzilla.suse.de/short@BUGID@.xml  
 Mai 09 11:30:17 openqaworker13 worker[17023]:       BUILD=20220508-1 
 ``` 
 So I suspect there's something wrong on the worker-side that can leave it stuck after registration. Possibly the previous connection errors are part of this special case. However, besides that, nothing else was going on: 
    
 ``` 
 Mai 08 20:31:18 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Succeeded.  
 Mai 08 20:31:18 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Scheduled restart job, restart counter is at 5.  
 Mai 08 20:31:18 openqaworker13 systemd[1]: Stopped openQA Worker #14.  
 Mai 08 20:31:18 openqaworker13 systemd[1]: Starting openQA Worker #14...  
 Mai 08 20:31:18 openqaworker13 systemd[1]: Started openQA Worker #14.  
 Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] worker 14:  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - config file:             /etc/openqa/workers.ini  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - worker hostname:         openqaworker13  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - isotovideo version:      25  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - websocket API version: 1  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - web UI hosts:            openqa.suse.de  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - class:                   qemu_x86_64,qemu_x86_64_staging,amd,openqaworker13  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - no cleanup:              no  
 Mai 08 20:31:19 openqaworker13 worker[32415]:    - pool directory:          /var/lib/openqa/pool/14  
 Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de  
 Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] Project dir for host openqa.suse.de is /var/lib/openqa/share  
 Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 20:31:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 20:36:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 20:36:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 20:36:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 20:41:31 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 20:41:41 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 20:41:43 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 20:46:43 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 20:46:53 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 20:46:54 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 20:51:54 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 20:52:04 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 20:52:05 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 20:57:05 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 20:57:15 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 20:57:16 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:02:16 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:02:26 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:02:27 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:07:27 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:07:37 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:07:38 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:12:38 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:12:48 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:12:49 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:17:49 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:25:09 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:25:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:25:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:30:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds  
 Mai 08 21:30:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de  
 Mai 08 21:30:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210  
 Mai 08 21:30:57 openqaworker13 worker[32415]: [info] [pid:32415] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210 
 ``` 

 For another worker it looks like this: 
 ``` 
 Mai 08 21:25:15 openqaworker6 worker[33863]: [info] [pid:33863] Registering with openQA openqa.suse.de  
 Mai 08 21:25:16 openqaworker6 worker[33863]: [info] [pid:33863] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/954  
 Mai 08 21:30:16 openqaworker6 worker[33863]: [warn] [pid:33863] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/954, code 502 - trying again in 10 seconds  
 Mai 08 21:30:26 openqaworker6 worker[33863]: [info] [pid:33863] Registering with openQA openqa.suse.de  
 Mai 08 21:30:26 openqaworker6 worker[33863]: [info] [pid:33863] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/954  
 Mai 08 21:30:54 openqaworker6 worker[33863]: [info] [pid:33863] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 954  
 Mai 09 03:26:13 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 03:27:53 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 03:29:33 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 03:31:13 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 05:51:20 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 05:53:00 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 05:54:40 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 05:58:00 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s  
 Mai 09 05:59:41 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
 ``` 
 The worker shows nevertheless up as "idle" (and not as "broken"). I suppose that's problematic because it makes it look like the don't assign jobs to an idling worker while it is actually busy downloading. However, when restarting the worker slot it didn't show the cache service error anymore. So maybe it wasn't *actually* busy with downloading anymore. The restarting also helped in this case to unblock the worker. Considering that only one of the stuck workers shows these "Cache service queue already full" message and that these messages stopped at Mai 09 05:59:41 I would conclude that these messages are unrelated. 

 Note that restarting such stuck workers via `SIGHUP` helps as well. So the workers aren't *that* stuck to not even react to that signal anymore. 

 ## Ideas 
 * Following the assumption that the worker(s) are at fault 
     * Check the code for problems leading to stuck workers after registration 
     * ~~Maybe Maybe receiving a signal `SIGTERM`/`SIGHUP` in the wrong moment can leave the worker in a bad state~~ Definitely not the case, it would have been logged state 
 * Following the assumption that the scheduler is at fault (seems *not* so likely at this point of the investigation anymore) 
     * Test locally by scheduling something like 100k jobs and see how the scheduler scales 
     * Test locally by scheduling many jobs on something like 1k worker instances and see how the scheduler scales 
     * Note that there's a unit test for scalability which one might simply invoke with very high numbers for scheduled jobs and available workers 
 * It could also be a problem within the websocket server 


 ## Mitigation steps conducted 
 On openqaworker6: 
 ``` 
 2022-05-09 10:52:08 journalctl -u openqa-worker-auto-restart@20 > openqa_worker_auto_restart_20-$(date +%F).log 
 2022-05-09 10:52:56 journalctl -u openqa-worker-cacheservice.service > openqa_worker_cacheservice-$(date +%F).log 
 2022-05-09 10:53:07 journalctl -u openqa-worker-cacheservice-minion.service > openqa_worker_cacheservice_minion-$(date +%F).log 
 2022-05-09 10:53:26 systemctl restart openqa-worker-cacheservice-minion.service  
 2022-05-09 10:55:28 systemctl restart openqa-worker-auto-restart@20 
 ```

Back