action #110785
Updated by livdywan over 2 years ago
## Observation On 2022-05-09 we found that there are many jobs scheduled but not enough running. On https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1651831201500&to=1652176797500 one can see that we have the case of many jobs scheduled in the evening of 2022-05-09. A discussion thread asking about problems was opened by Xiaojing Liu in https://suse.slack.com/archives/C02AJ1E568M/p1652083286946669 . jbaier pointed out that the majority of the queue are LTP tests (new kernel livepatches in the maintenance queue) which is happening seldomly but can still be expected On openqaworker6 `journalctl -u openqa-worker-auto-restart@20` we found ``` May 09 03:26:12 openqaworker6 worker[33850]: [warn] [pid:33850] Worker cache not available: Cache service queue already full (10) ``` ## Problem In https://suse.slack.com/archives/C02AJ1E568M/p1652085933670009?thread_ts=1652083286.946669&cid=C02AJ1E568M coolo stated the hypothesis > The scheduler didn't schedule because there was that one worker who was kept as idle, but was actually dead. And that situation is fatal for the scheduling loop It seems that the scheduling loop might not be completely blocked but slowed down a lot iterating over a big list of jobs and workers and only assigning jobs very slowly The scheduler currently fails to assign jobs to various workers, possibly slowing down the whole scheduling tick: ``` [2022-05-09T11:06:58.325734+02:00] [debug] [pid:14772] Assigned job '8722179' to worker ID '579' [2022-05-09T11:06:58.334240+02:00] [debug] [pid:14772] [Job#8722179] Prepare for being processed by worker 579 [2022-05-09T11:06:58.461793+02:00] [debug] [pid:14772] Sent job(s) '8722179' to worker '579' [2022-05-09T11:06:58.496361+02:00] [debug] [pid:14772] Assigned job '8714856' to worker ID '912' [2022-05-09T11:06:58.500502+02:00] [debug] [pid:14772] [Job#8714856] Prepare for being processed by worker 912 [2022-05-09T11:06:58.664234+02:00] [debug] [pid:14772] Failed sending job(s) '8714856' to worker '912' [2022-05-09T11:06:58.701367+02:00] [debug] [pid:14772] Job 8714856 reset to state scheduled [2022-05-09T11:06:58.759157+02:00] [debug] [pid:14772] Assigned job '8723222' to worker ID '529' [2022-05-09T11:06:58.787946+02:00] [debug] [pid:14772] [Job#8723222] Prepare for being processed by worker 529 [2022-05-09T11:06:58.926925+02:00] [debug] [pid:14772] Created network for 8723222 : 73 [2022-05-09T11:06:59.179954+02:00] [debug] [pid:14772] Sent job(s) '8723222' to worker '529' [2022-05-09T11:06:59.228883+02:00] [debug] [pid:14772] Assigned job '8723225' to worker ID '527' [2022-05-09T11:06:59.244018+02:00] [debug] [pid:14772] [Job#8723225] Prepare for being processed by worker 527 [2022-05-09T11:06:59.464249+02:00] [debug] [pid:14772] Sent job(s) '8723225' to worker '527' [2022-05-09T11:06:59.507154+02:00] [debug] [pid:14772] Assigned job '8722186' to worker ID '1119' [2022-05-09T11:06:59.517069+02:00] [debug] [pid:14772] [Job#8722186] Prepare for being processed by worker 1119 [2022-05-09T11:06:59.738507+02:00] [debug] [pid:14772] Failed sending job(s) '8722186' to worker '1119' [2022-05-09T11:06:59.765372+02:00] [debug] [pid:14772] Job 8722186 reset to state scheduled [2022-05-09T11:06:59.825905+02:00] [debug] [pid:14772] Assigned job '8722182' to worker ID '1317' [2022-05-09T11:06:59.837061+02:00] [debug] [pid:14772] [Job#8722182] Prepare for being processed by worker 1317 [2022-05-09T11:06:59.918401+02:00] [debug] [pid:14772] Sent job(s) '8722182' to worker '1317' [2022-05-09T11:06:59.970125+02:00] [debug] [pid:14772] Assigned job '8722181' to worker ID '1266' [2022-05-09T11:06:59.976989+02:00] [debug] [pid:14772] [Job#8722181] Prepare for being processed by worker 1266 [2022-05-09T11:07:00.076180+02:00] [debug] [pid:14772] Sent job(s) '8722181' to worker '1266' [2022-05-09T11:07:00.125276+02:00] [debug] [pid:14772] Assigned job '8723223' to worker ID '1287' [2022-05-09T11:07:00.149012+02:00] [debug] [pid:14772] [Job#8723223] Prepare for being processed by worker 1287 [2022-05-09T11:07:00.538347+02:00] [debug] [pid:14772] Sent job(s) '8723223' to worker '1287' [2022-05-09T11:07:00.594073+02:00] [debug] [pid:14772] Assigned job '8722190' to worker ID '1114' [2022-05-09T11:07:00.599663+02:00] [debug] [pid:14772] [Job#8722190] Prepare for being processed by worker 1114 [2022-05-09T11:07:00.699187+02:00] [debug] [pid:14772] Failed sending job(s) '8722190' to worker '1114' [2022-05-09T11:07:00.737092+02:00] [debug] [pid:14772] Job 8722190 reset to state scheduled [2022-05-09T11:07:00.800405+02:00] [debug] [pid:14772] Assigned job '8723227' to worker ID '1077' [2022-05-09T11:07:00.807488+02:00] [debug] [pid:14772] [Job#8723227] Prepare for being processed by worker 1077 [2022-05-09T11:07:01.166880+02:00] [debug] [pid:14772] Sent job(s) '8723227' to worker '1077' [2022-05-09T11:07:01.205905+02:00] [debug] [pid:14772] Assigned job '8723226' to worker ID '1297' [2022-05-09T11:07:01.216100+02:00] [debug] [pid:14772] [Job#8723226] Prepare for being processed by worker 1297 [2022-05-09T11:07:01.415973+02:00] [debug] [pid:14772] Sent job(s) '8723226' to worker '1297' [2022-05-09T11:07:01.474224+02:00] [debug] [pid:14772] Assigned job '8723224' to worker ID '1080' [2022-05-09T11:07:01.486018+02:00] [debug] [pid:14772] [Job#8723224] Prepare for being processed by worker 1080 [2022-05-09T11:07:01.661337+02:00] [debug] [pid:14772] Sent job(s) '8723224' to worker '1080' [2022-05-09T11:07:01.704444+02:00] [debug] [pid:14772] Assigned job '8722104' to worker ID '593' [2022-05-09T11:07:01.709074+02:00] [debug] [pid:14772] [Job#8722104] Prepare for being processed by worker 593 [2022-05-09T11:07:01.812775+02:00] [debug] [pid:14772] Failed sending job(s) '8722104' to worker '593' [2022-05-09T11:07:01.837325+02:00] [debug] [pid:14772] Job 8722104 reset to state scheduled [2022-05-09T11:07:01.873641+02:00] [debug] [pid:14772] Assigned job '8722185' to worker ID '903' [2022-05-09T11:07:01.881935+02:00] [debug] [pid:14772] [Job#8722185] Prepare for being processed by worker 903 [2022-05-09T11:07:01.958940+02:00] [debug] [pid:14772] Failed sending job(s) '8722185' to worker '903' [2022-05-09T11:07:01.981328+02:00] [debug] [pid:14772] Job 8722185 reset to state scheduled [2022-05-09T11:07:02.017183+02:00] [debug] [pid:14772] Assigned job '8715010' to worker ID '1349' [2022-05-09T11:07:02.022772+02:00] [debug] [pid:14772] [Job#8715010] Prepare for being processed by worker 1349 [2022-05-09T11:07:02.137027+02:00] [debug] [pid:14772] Failed sending job(s) '8715010' to worker '1349' [2022-05-09T11:07:02.162308+02:00] [debug] [pid:14772] Job 8715010 reset to state scheduled [2022-05-09T11:07:02.199658+02:00] [debug] [pid:14772] Assigned job '8722098' to worker ID '954' [2022-05-09T11:07:02.204521+02:00] [debug] [pid:14772] [Job#8722098] Prepare for being processed by worker 954 [2022-05-09T11:07:02.297548+02:00] [debug] [pid:14772] Failed sending job(s) '8722098' to worker '954' [2022-05-09T11:07:02.318036+02:00] [debug] [pid:14772] Job 8722098 reset to state scheduled [2022-05-09T11:07:02.352935+02:00] [debug] [pid:14772] Assigned job '8722195' to worker ID '1210' [2022-05-09T11:07:02.357567+02:00] [debug] [pid:14772] [Job#8722195] Prepare for being processed by worker 1210 [2022-05-09T11:07:02.497911+02:00] [debug] [pid:14772] Failed sending job(s) '8722195' to worker '1210' [2022-05-09T11:07:02.523069+02:00] [debug] [pid:14772] Job 8722195 reset to state scheduled [2022-05-09T11:07:02.581416+02:00] [debug] [pid:14772] Assigned job '8715007' to worker ID '1274' [2022-05-09T11:07:02.584929+02:00] [debug] [pid:14772] [Job#8715007] Prepare for being processed by worker 1274 [2022-05-09T11:07:02.759493+02:00] [debug] [pid:14772] Sent job(s) '8715007' to worker '1274' ``` Those workers are currently idling: ``` openqa=> select host, instance, job_id, error from workers where id in (912, 1119, 1114, 593, 903, 954, 1210); host | instance | job_id | error ---------------------+----------+--------+------- openqaworker9 | 9 | | openqaworker6 | 15 | | openqaworker9 | 8 | | powerqaworker-qam-1 | 4 | | openqaworker13 | 8 | | openqaworker13 | 3 | | openqaworker13 | 14 | | (7 Zeilen) ``` The problem persists during multiple scheduler ticks. The set of workers is always similar, now also one more shows up: ``` [2022-05-09T11:14:27.829472+02:00] [debug] [pid:14772] Assigned job '8715013' to worker ID '1349' [2022-05-09T11:14:27.836584+02:00] [debug] [pid:14772] [Job#8715013] Prepare for being processed by worker 1349 [2022-05-09T11:14:27.968569+02:00] [debug] [pid:14772] Failed sending job(s) '8715013' to worker '1349' [2022-05-09T11:14:27.999362+02:00] [debug] [pid:14772] Job 8715013 reset to state scheduled ``` Besides, the scheduler ticket takes very long. I saw one ticket that took almost 30 seconds. The affected workers log looks like this: ``` Mai 08 21:17:49 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:25:09 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:25:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:25:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:30:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:30:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:30:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:30:57 openqaworker13 worker[32415]: [info] [pid:32415] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210 ``` So there were connection issues but eventually the worker could register. However, then it nevertheless didn't take any jobs despite attempts from the scheduler to assign a job to it. After restarting the worker it instantly starts working on a new job: ``` Mai 09 11:30:11 openqaworker13 systemd[1]: Stopping openQA Worker #14... Mai 09 11:30:11 openqaworker13 worker[32415]: [info] [pid:32415] Received signal TERM Mai 09 11:30:11 openqaworker13 worker[32415]: [debug] [pid:32415] Informing openqa.suse.de that we are going offline Mai 09 11:30:11 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Succeeded. Mai 09 11:30:11 openqaworker13 systemd[1]: Stopped openQA Worker #14. Mai 09 11:30:11 openqaworker13 systemd[1]: Starting openQA Worker #14... Mai 09 11:30:11 openqaworker13 systemd[1]: Started openQA Worker #14. Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] worker 14: Mai 09 11:30:13 openqaworker13 worker[17023]: - config file: /etc/openqa/workers.ini Mai 09 11:30:13 openqaworker13 worker[17023]: - worker hostname: openqaworker13 Mai 09 11:30:13 openqaworker13 worker[17023]: - isotovideo version: 25 Mai 09 11:30:13 openqaworker13 worker[17023]: - websocket API version: 1 Mai 09 11:30:13 openqaworker13 worker[17023]: - web UI hosts: openqa.suse.de Mai 09 11:30:13 openqaworker13 worker[17023]: - class: qemu_x86_64,qemu_x86_64_staging,amd,openqaworker13 Mai 09 11:30:13 openqaworker13 worker[17023]: - no cleanup: no Mai 09 11:30:13 openqaworker13 worker[17023]: - pool directory: /var/lib/openqa/pool/14 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Project dir for host openqa.suse.de is /var/lib/openqa/share Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Registering with openQA openqa.suse.de Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210 Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Accepting job 8722699 from openqa.suse.de. Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Setting job 8722699 from openqa.suse.de up Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Preparing Mojo::IOLoop::ReadWriteProcess::Session Mai 09 11:30:17 openqaworker13 worker[17023]: [info] [pid:17023] +++ setup notes +++ Mai 09 11:30:17 openqaworker13 worker[17023]: [info] [pid:17023] Running on openqaworker13:14 (Linux 5.3.18-150300.59.63-default #1 SMP Tue Apr 5 12:47:31 UTC 2022 (d77db66) x86_64) Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Job settings: Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Mai 09 11:30:17 openqaworker13 worker[17023]: ARCH=x86_64 Mai 09 11:30:17 openqaworker13 worker[17023]: BACKEND=qemu Mai 09 11:30:17 openqaworker13 worker[17023]: BOOT_HDD_IMAGE=1 Mai 09 11:30:17 openqaworker13 worker[17023]: BUGZILLA_URL=http://fastzilla.suse.de/short@BUGID@.xml Mai 09 11:30:17 openqaworker13 worker[17023]: BUILD=20220508-1 ``` So I suspect there's something wrong on the worker-side that can leave it stuck after registration. Possibly the previous connection errors are part of this special case. However, besides that, nothing else was going on: ``` Mai 08 20:31:18 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Succeeded. Mai 08 20:31:18 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Scheduled restart job, restart counter is at 5. Mai 08 20:31:18 openqaworker13 systemd[1]: Stopped openQA Worker #14. Mai 08 20:31:18 openqaworker13 systemd[1]: Starting openQA Worker #14... Mai 08 20:31:18 openqaworker13 systemd[1]: Started openQA Worker #14. Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] worker 14: Mai 08 20:31:19 openqaworker13 worker[32415]: - config file: /etc/openqa/workers.ini Mai 08 20:31:19 openqaworker13 worker[32415]: - worker hostname: openqaworker13 Mai 08 20:31:19 openqaworker13 worker[32415]: - isotovideo version: 25 Mai 08 20:31:19 openqaworker13 worker[32415]: - websocket API version: 1 Mai 08 20:31:19 openqaworker13 worker[32415]: - web UI hosts: openqa.suse.de Mai 08 20:31:19 openqaworker13 worker[32415]: - class: qemu_x86_64,qemu_x86_64_staging,amd,openqaworker13 Mai 08 20:31:19 openqaworker13 worker[32415]: - no cleanup: no Mai 08 20:31:19 openqaworker13 worker[32415]: - pool directory: /var/lib/openqa/pool/14 Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] Project dir for host openqa.suse.de is /var/lib/openqa/share Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 20:31:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 20:36:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 20:36:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 20:36:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 20:41:31 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 20:41:41 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 20:41:43 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 20:46:43 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 20:46:53 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 20:46:54 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 20:51:54 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 20:52:04 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 20:52:05 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 20:57:05 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 20:57:15 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 20:57:16 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:02:16 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:02:26 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:02:27 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:07:27 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:07:37 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:07:38 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:12:38 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:12:48 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:12:49 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:17:49 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:25:09 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:25:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:25:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:30:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds Mai 08 21:30:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de Mai 08 21:30:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 Mai 08 21:30:57 openqaworker13 worker[32415]: [info] [pid:32415] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210 ``` For another worker it looks like this: ``` Mai 08 21:25:15 openqaworker6 worker[33863]: [info] [pid:33863] Registering with openQA openqa.suse.de Mai 08 21:25:16 openqaworker6 worker[33863]: [info] [pid:33863] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/954 Mai 08 21:30:16 openqaworker6 worker[33863]: [warn] [pid:33863] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/954, code 502 - trying again in 10 seconds Mai 08 21:30:26 openqaworker6 worker[33863]: [info] [pid:33863] Registering with openQA openqa.suse.de Mai 08 21:30:26 openqaworker6 worker[33863]: [info] [pid:33863] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/954 Mai 08 21:30:54 openqaworker6 worker[33863]: [info] [pid:33863] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 954 Mai 09 03:26:13 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 03:27:53 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 03:29:33 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 03:31:13 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 05:51:20 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 05:53:00 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 05:54:40 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 05:58:00 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s Mai 09 05:59:41 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s ``` The worker shows nevertheless up as "idle" (and not as "broken"). I suppose that's problematic because it makes it look like the don't assign jobs to an idling worker while it is actually busy downloading. However, when restarting the worker slot it didn't show the cache service error anymore. ## Ideas * Following the assumption that the worker(s) are at fault * Check the code for problems leading to stuck workers after registration * Following the assumption that the scheduler is at fault * Test locally by scheduling something like 100k jobs and see how the scheduler scales * Test locally by scheduling many jobs on something like 1k worker instances and see how the scheduler scales ## Mitigation steps conducted On openqaworker6 ``` 2022-05-09 10:52:08 journalctl -u openqa-worker-auto-restart@20 > openqa_worker_auto_restart_20-$(date +%F).log 2022-05-09 10:52:56 journalctl -u openqa-worker-cacheservice.service > openqa_worker_cacheservice-$(date +%F).log 2022-05-09 10:53:07 journalctl -u openqa-worker-cacheservice-minion.service > openqa_worker_cacheservice_minion-$(date +%F).log 2022-05-09 10:53:26 systemctl restart openqa-worker-cacheservice-minion.service 2022-05-09 10:55:28 systemctl restart openqa-worker-auto-restart@20 ```