action #110785

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

OSD incident 2022-05-09: Many scheduled jobs not picked up despite idle workers, blocked by one worker instance that should be broken?

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-05-09
Due date:
% Done: 0%
Estimated time:

Description

Observation

On 2022-05-09 we found that many jobs were scheduled but not enough were running. On https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1651831201500&to=1652176797500 one can see that many jobs piled up in the schedule in the evening of 2022-05-09. A discussion thread asking about problems was opened by Xiaojing Liu in https://suse.slack.com/archives/C02AJ1E568M/p1652083286946669 . jbaier pointed out that the majority of the queue are LTP tests (new kernel livepatches in the maintenance queue), which happens seldom but can still be expected.

On openqaworker6, in journalctl -u openqa-worker-auto-restart@20, we found

May 09 03:26:12 openqaworker6 worker[33850]: [warn] [pid:33850] Worker cache not available: Cache service queue already full (10)

Problem

In https://suse.slack.com/archives/C02AJ1E568M/p1652085933670009?thread_ts=1652083286.946669&cid=C02AJ1E568M coolo stated the hypothesis

The scheduler didn't schedule because there was that one worker who was kept as idle, but was actually dead. And that situation is fatal for the scheduling loop

It seems the scheduling loop might not be completely blocked but rather slowed down a lot, iterating over a big list of jobs and workers and only assigning jobs very slowly.

The scheduler currently fails to assign jobs to various workers, possibly slowing down the whole scheduling tick:

[2022-05-09T11:06:58.325734+02:00] [debug] [pid:14772] Assigned job '8722179' to worker ID '579' 
[2022-05-09T11:06:58.334240+02:00] [debug] [pid:14772] [Job#8722179] Prepare for being processed by worker 579 
[2022-05-09T11:06:58.461793+02:00] [debug] [pid:14772] Sent job(s) '8722179' to worker '579' 
[2022-05-09T11:06:58.496361+02:00] [debug] [pid:14772] Assigned job '8714856' to worker ID '912' 
[2022-05-09T11:06:58.500502+02:00] [debug] [pid:14772] [Job#8714856] Prepare for being processed by worker 912 
[2022-05-09T11:06:58.664234+02:00] [debug] [pid:14772] Failed sending job(s) '8714856' to worker '912' 
[2022-05-09T11:06:58.701367+02:00] [debug] [pid:14772] Job 8714856 reset to state scheduled 
[2022-05-09T11:06:58.759157+02:00] [debug] [pid:14772] Assigned job '8723222' to worker ID '529' 
[2022-05-09T11:06:58.787946+02:00] [debug] [pid:14772] [Job#8723222] Prepare for being processed by worker 529 
[2022-05-09T11:06:58.926925+02:00] [debug] [pid:14772] Created network for 8723222 : 73 
[2022-05-09T11:06:59.179954+02:00] [debug] [pid:14772] Sent job(s) '8723222' to worker '529' 
[2022-05-09T11:06:59.228883+02:00] [debug] [pid:14772] Assigned job '8723225' to worker ID '527' 
[2022-05-09T11:06:59.244018+02:00] [debug] [pid:14772] [Job#8723225] Prepare for being processed by worker 527 
[2022-05-09T11:06:59.464249+02:00] [debug] [pid:14772] Sent job(s) '8723225' to worker '527' 
[2022-05-09T11:06:59.507154+02:00] [debug] [pid:14772] Assigned job '8722186' to worker ID '1119' 
[2022-05-09T11:06:59.517069+02:00] [debug] [pid:14772] [Job#8722186] Prepare for being processed by worker 1119 
[2022-05-09T11:06:59.738507+02:00] [debug] [pid:14772] Failed sending job(s) '8722186' to worker '1119' 
[2022-05-09T11:06:59.765372+02:00] [debug] [pid:14772] Job 8722186 reset to state scheduled 
[2022-05-09T11:06:59.825905+02:00] [debug] [pid:14772] Assigned job '8722182' to worker ID '1317' 
[2022-05-09T11:06:59.837061+02:00] [debug] [pid:14772] [Job#8722182] Prepare for being processed by worker 1317 
[2022-05-09T11:06:59.918401+02:00] [debug] [pid:14772] Sent job(s) '8722182' to worker '1317' 
[2022-05-09T11:06:59.970125+02:00] [debug] [pid:14772] Assigned job '8722181' to worker ID '1266' 
[2022-05-09T11:06:59.976989+02:00] [debug] [pid:14772] [Job#8722181] Prepare for being processed by worker 1266 
[2022-05-09T11:07:00.076180+02:00] [debug] [pid:14772] Sent job(s) '8722181' to worker '1266' 
[2022-05-09T11:07:00.125276+02:00] [debug] [pid:14772] Assigned job '8723223' to worker ID '1287' 
[2022-05-09T11:07:00.149012+02:00] [debug] [pid:14772] [Job#8723223] Prepare for being processed by worker 1287 
[2022-05-09T11:07:00.538347+02:00] [debug] [pid:14772] Sent job(s) '8723223' to worker '1287' 
[2022-05-09T11:07:00.594073+02:00] [debug] [pid:14772] Assigned job '8722190' to worker ID '1114' 
[2022-05-09T11:07:00.599663+02:00] [debug] [pid:14772] [Job#8722190] Prepare for being processed by worker 1114 
[2022-05-09T11:07:00.699187+02:00] [debug] [pid:14772] Failed sending job(s) '8722190' to worker '1114' 
[2022-05-09T11:07:00.737092+02:00] [debug] [pid:14772] Job 8722190 reset to state scheduled 
[2022-05-09T11:07:00.800405+02:00] [debug] [pid:14772] Assigned job '8723227' to worker ID '1077' 
[2022-05-09T11:07:00.807488+02:00] [debug] [pid:14772] [Job#8723227] Prepare for being processed by worker 1077 
[2022-05-09T11:07:01.166880+02:00] [debug] [pid:14772] Sent job(s) '8723227' to worker '1077' 
[2022-05-09T11:07:01.205905+02:00] [debug] [pid:14772] Assigned job '8723226' to worker ID '1297' 
[2022-05-09T11:07:01.216100+02:00] [debug] [pid:14772] [Job#8723226] Prepare for being processed by worker 1297 
[2022-05-09T11:07:01.415973+02:00] [debug] [pid:14772] Sent job(s) '8723226' to worker '1297' 
[2022-05-09T11:07:01.474224+02:00] [debug] [pid:14772] Assigned job '8723224' to worker ID '1080' 
[2022-05-09T11:07:01.486018+02:00] [debug] [pid:14772] [Job#8723224] Prepare for being processed by worker 1080 
[2022-05-09T11:07:01.661337+02:00] [debug] [pid:14772] Sent job(s) '8723224' to worker '1080' 
[2022-05-09T11:07:01.704444+02:00] [debug] [pid:14772] Assigned job '8722104' to worker ID '593' 
[2022-05-09T11:07:01.709074+02:00] [debug] [pid:14772] [Job#8722104] Prepare for being processed by worker 593 
[2022-05-09T11:07:01.812775+02:00] [debug] [pid:14772] Failed sending job(s) '8722104' to worker '593' 
[2022-05-09T11:07:01.837325+02:00] [debug] [pid:14772] Job 8722104 reset to state scheduled 
[2022-05-09T11:07:01.873641+02:00] [debug] [pid:14772] Assigned job '8722185' to worker ID '903' 
[2022-05-09T11:07:01.881935+02:00] [debug] [pid:14772] [Job#8722185] Prepare for being processed by worker 903 
[2022-05-09T11:07:01.958940+02:00] [debug] [pid:14772] Failed sending job(s) '8722185' to worker '903' 
[2022-05-09T11:07:01.981328+02:00] [debug] [pid:14772] Job 8722185 reset to state scheduled 
[2022-05-09T11:07:02.017183+02:00] [debug] [pid:14772] Assigned job '8715010' to worker ID '1349' 
[2022-05-09T11:07:02.022772+02:00] [debug] [pid:14772] [Job#8715010] Prepare for being processed by worker 1349 
[2022-05-09T11:07:02.137027+02:00] [debug] [pid:14772] Failed sending job(s) '8715010' to worker '1349' 
[2022-05-09T11:07:02.162308+02:00] [debug] [pid:14772] Job 8715010 reset to state scheduled 
[2022-05-09T11:07:02.199658+02:00] [debug] [pid:14772] Assigned job '8722098' to worker ID '954' 
[2022-05-09T11:07:02.204521+02:00] [debug] [pid:14772] [Job#8722098] Prepare for being processed by worker 954 
[2022-05-09T11:07:02.297548+02:00] [debug] [pid:14772] Failed sending job(s) '8722098' to worker '954' 
[2022-05-09T11:07:02.318036+02:00] [debug] [pid:14772] Job 8722098 reset to state scheduled 
[2022-05-09T11:07:02.352935+02:00] [debug] [pid:14772] Assigned job '8722195' to worker ID '1210' 
[2022-05-09T11:07:02.357567+02:00] [debug] [pid:14772] [Job#8722195] Prepare for being processed by worker 1210 
[2022-05-09T11:07:02.497911+02:00] [debug] [pid:14772] Failed sending job(s) '8722195' to worker '1210' 
[2022-05-09T11:07:02.523069+02:00] [debug] [pid:14772] Job 8722195 reset to state scheduled 
[2022-05-09T11:07:02.581416+02:00] [debug] [pid:14772] Assigned job '8715007' to worker ID '1274' 
[2022-05-09T11:07:02.584929+02:00] [debug] [pid:14772] [Job#8715007] Prepare for being processed by worker 1274 
[2022-05-09T11:07:02.759493+02:00] [debug] [pid:14772] Sent job(s) '8715007' to worker '1274'
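
To get a rough impression of how many assignment attempts fail per tick, the outcomes can simply be counted in the scheduler log. A minimal sketch, assuming the messages keep the wording shown above and that the log is available under /var/log/openqa_scheduler (the path used in a later comment; on other setups the messages might only be in the journal):

# count successful vs. failed assignment attempts
grep -c "Sent job(s)" /var/log/openqa_scheduler
grep -c "Failed sending job(s)" /var/log/openqa_scheduler
# list the worker IDs that most often fail to accept jobs
grep "Failed sending job(s)" /var/log/openqa_scheduler | grep -o "worker '[0-9]*'" | sort | uniq -c | sort -rn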

Those workers are currently idling:

openqa=> select host, instance, job_id, error from workers where id in (912, 1119, 1114, 593, 903, 954, 1210);   
        host         | instance | job_id | error  
---------------------+----------+--------+------- 
 openqaworker9       |        9 |        |  
 openqaworker6       |       15 |        |  
 openqaworker9       |        8 |        |  
 powerqaworker-qam-1 |        4 |        |  
 openqaworker13      |        8 |        |  
 openqaworker13      |        3 |        |  
 openqaworker13      |       14 |        |  
(7 Zeilen)

The problem persists across multiple scheduler ticks. The set of affected workers is always similar; now one more worker shows up as well:

[2022-05-09T11:14:27.829472+02:00] [debug] [pid:14772] Assigned job '8715013' to worker ID '1349' 
[2022-05-09T11:14:27.836584+02:00] [debug] [pid:14772] [Job#8715013] Prepare for being processed by worker 1349 
[2022-05-09T11:14:27.968569+02:00] [debug] [pid:14772] Failed sending job(s) '8715013' to worker '1349' 
[2022-05-09T11:14:27.999362+02:00] [debug] [pid:14772] Job 8715013 reset to state scheduled

Besides, the scheduler tick takes very long. I saw one tick that took almost 30 seconds.

The log of an affected worker looks like this:

Mai 08 21:17:49 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:25:09 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:25:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:25:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210
Mai 08 21:30:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:30:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:30:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:30:57 openqaworker13 worker[32415]: [info] [pid:32415] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210

So there were connection issues, but eventually the worker could register. However, it nevertheless didn't take any jobs despite the scheduler's attempts to assign one to it. After restarting the worker it instantly started working on a new job:

Mai 09 11:30:11 openqaworker13 systemd[1]: Stopping openQA Worker #14... 
Mai 09 11:30:11 openqaworker13 worker[32415]: [info] [pid:32415] Received signal TERM 
Mai 09 11:30:11 openqaworker13 worker[32415]: [debug] [pid:32415] Informing openqa.suse.de that we are going offline 
Mai 09 11:30:11 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Succeeded. 
Mai 09 11:30:11 openqaworker13 systemd[1]: Stopped openQA Worker #14. 
Mai 09 11:30:11 openqaworker13 systemd[1]: Starting openQA Worker #14... 
Mai 09 11:30:11 openqaworker13 systemd[1]: Started openQA Worker #14. 
Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] worker 14: 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - config file:           /etc/openqa/workers.ini 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - worker hostname:       openqaworker13 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - isotovideo version:    25 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - websocket API version: 1 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - web UI hosts:          openqa.suse.de 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - class:                 qemu_x86_64,qemu_x86_64_staging,amd,openqaworker13 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - no cleanup:            no 
Mai 09 11:30:13 openqaworker13 worker[17023]:  - pool directory:        /var/lib/openqa/pool/14 
Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de 
Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Project dir for host openqa.suse.de is /var/lib/openqa/share 
Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Registering with openQA openqa.suse.de 
Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 09 11:30:13 openqaworker13 worker[17023]: [info] [pid:17023] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210 
Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Accepting job 8722699 from openqa.suse.de. 
Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Setting job 8722699 from openqa.suse.de up 
Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Preparing Mojo::IOLoop::ReadWriteProcess::Session 
Mai 09 11:30:17 openqaworker13 worker[17023]: [info] [pid:17023] +++ setup notes +++ 
Mai 09 11:30:17 openqaworker13 worker[17023]: [info] [pid:17023] Running on openqaworker13:14 (Linux 5.3.18-150300.59.63-default #1 SMP Tue Apr 5 12:47:31 UTC 2022 (d77db66) x86_64) 
Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] Job settings: 
Mai 09 11:30:17 openqaworker13 worker[17023]: [debug] [pid:17023] 
Mai 09 11:30:17 openqaworker13 worker[17023]:     ARCH=x86_64 
Mai 09 11:30:17 openqaworker13 worker[17023]:     BACKEND=qemu 
Mai 09 11:30:17 openqaworker13 worker[17023]:     BOOT_HDD_IMAGE=1 
Mai 09 11:30:17 openqaworker13 worker[17023]:     BUGZILLA_URL=http://fastzilla.suse.de/short@BUGID@.xml 
Mai 09 11:30:17 openqaworker13 worker[17023]:     BUILD=20220508-1

So I suspect there's something wrong on the worker side that can leave it stuck after registration. Possibly the previous connection errors are part of this special case. However, besides that, nothing else was going on:

Mai 08 20:31:18 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Succeeded. 
Mai 08 20:31:18 openqaworker13 systemd[1]: openqa-worker-auto-restart@14.service: Scheduled restart job, restart counter is at 5. 
Mai 08 20:31:18 openqaworker13 systemd[1]: Stopped openQA Worker #14. 
Mai 08 20:31:18 openqaworker13 systemd[1]: Starting openQA Worker #14... 
Mai 08 20:31:18 openqaworker13 systemd[1]: Started openQA Worker #14. 
Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] worker 14: 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - config file:           /etc/openqa/workers.ini 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - worker hostname:       openqaworker13 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - isotovideo version:    25 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - websocket API version: 1 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - web UI hosts:          openqa.suse.de 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - class:                 qemu_x86_64,qemu_x86_64_staging,amd,openqaworker13 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - no cleanup:            no 
Mai 08 20:31:19 openqaworker13 worker[32415]:  - pool directory:        /var/lib/openqa/pool/14 
Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de 
Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] Project dir for host openqa.suse.de is /var/lib/openqa/share 
Mai 08 20:31:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 20:31:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 20:36:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 20:36:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 20:36:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 20:41:31 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 20:41:41 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 20:41:43 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 20:46:43 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 20:46:53 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 20:46:54 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 20:51:54 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 20:52:04 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 20:52:05 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 20:57:05 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 20:57:15 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 20:57:16 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:02:16 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:02:26 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:02:27 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:07:27 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:07:37 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:07:38 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:12:38 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:12:48 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:12:49 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:17:49 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:17:59 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:25:09 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:25:19 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:25:20 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:30:20 openqaworker13 worker[32415]: [warn] [pid:32415] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/1210, code 502 - trying again in 10 seconds 
Mai 08 21:30:30 openqaworker13 worker[32415]: [info] [pid:32415] Registering with openQA openqa.suse.de 
Mai 08 21:30:31 openqaworker13 worker[32415]: [info] [pid:32415] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1210 
Mai 08 21:30:57 openqaworker13 worker[32415]: [info] [pid:32415] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210

For another worker it looks like this:

Mai 08 21:25:15 openqaworker6 worker[33863]: [info] [pid:33863] Registering with openQA openqa.suse.de 
Mai 08 21:25:16 openqaworker6 worker[33863]: [info] [pid:33863] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/954 
Mai 08 21:30:16 openqaworker6 worker[33863]: [warn] [pid:33863] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/954, code 502 - trying again in 10 seconds 
Mai 08 21:30:26 openqaworker6 worker[33863]: [info] [pid:33863] Registering with openQA openqa.suse.de 
Mai 08 21:30:26 openqaworker6 worker[33863]: [info] [pid:33863] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/954 
Mai 08 21:30:54 openqaworker6 worker[33863]: [info] [pid:33863] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 954 
Mai 09 03:26:13 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 03:27:53 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 03:29:33 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 03:31:13 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 05:51:20 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 05:53:00 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 05:54:40 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 05:58:00 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s 
Mai 09 05:59:41 openqaworker6 worker[33863]: [warn] [pid:33863] Worker cache not available: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s

The worker nevertheless shows up as "idle" (and not as "broken"). I suppose that's problematic because it makes it look like we don't assign jobs to an idling worker while it is actually busy downloading. However, when restarting the worker slot it didn't show the cache service error anymore, so maybe it wasn't actually busy downloading anymore. The restart also helped to unblock the worker in this case. Considering that only one of the stuck workers shows these "Cache service queue already full" messages and that they stopped at Mai 09 05:59:41, I would conclude that they are unrelated.

Note that restarting such stuck workers via SIGHUP helps as well. So the workers aren't so stuck that they no longer react to that signal.
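
For reference, a minimal sketch of such a soft restart for a single stuck slot (here slot 14 on openqaworker13, matching the logs above; a reload of the template unit is what the soft restart in a later comment uses and sends SIGHUP to the worker):

# soft-restart one worker slot without interfering with running jobs
sudo systemctl reload openqa-worker-auto-restart@14
# watch the slot re-register and pick up a job again
sudo journalctl -u openqa-worker-auto-restart@14 -f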

Ideas

  • Following the assumption that the worker(s) are at fault
    • Check the code for problems leading to stuck workers after registration
    • Maybe receiving a SIGTERM/SIGHUP signal at the wrong moment can leave the worker in a bad state (definitely not the case; it would have been logged)
  • Following the assumption that the scheduler is at fault (which seems not so likely anymore at this point of the investigation)
    • Test locally by scheduling something like 100k jobs and see how the scheduler scales
    • Test locally by scheduling many jobs on something like 1k worker instances and see how the scheduler scales
    • Note that there's a unit test for scalability which one might simply invoke with very high numbers for scheduled jobs and available workers; see the sketch after this list
  • It could also be a problem within the websocket server
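
A possible invocation of that scalability test (a sketch; the test file name and the environment variable names are assumptions based on the openQA repository and would need to be verified, and it requires the usual openQA test setup):

# run the scalability test with an inflated number of fake jobs and worker instances
SCALABILITY_TEST_JOB_COUNT=500 SCALABILITY_TEST_WORKER_COUNT=50 \
  prove -l t/43-scheduling-and-worker-scalability.t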

Mitigation steps conducted

On openqaworker6:

2022-05-09 10:52:08 journalctl -u openqa-worker-auto-restart@20 > openqa_worker_auto_restart_20-$(date +%F).log
2022-05-09 10:52:56 journalctl -u openqa-worker-cacheservice.service > openqa_worker_cacheservice-$(date +%F).log
2022-05-09 10:53:07 journalctl -u openqa-worker-cacheservice-minion.service > openqa_worker_cacheservice_minion-$(date +%F).log
2022-05-09 10:53:26 systemctl restart openqa-worker-cacheservice-minion.service 
2022-05-09 10:55:28 systemctl restart openqa-worker-auto-restart@20
Actions #1

Updated by livdywan almost 2 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #2

Updated by mkittler almost 2 years ago

  • Description updated (diff)
Actions #3

Updated by mkittler almost 2 years ago

  • Description updated (diff)
Actions #4

Updated by mkittler almost 2 years ago

  • Description updated (diff)
Actions #5

Updated by mkittler almost 2 years ago

After investigating enough stuck workers I ran salt -C 'G@roles:worker' cmd.run 'systemctl reload openqa-worker-auto-restart@*', which soft-restarts all workers (no running jobs are interfered with). Since a soft restart (SIGHUP) already helped in all cases I looked at manually, I'm confident that everything should be back to normal this way.

So now I need to look at the code to find out what could have brought the workers into that situation.

Actions #6

Updated by mkittler almost 2 years ago

  • Description updated (diff)
Actions #7

Updated by mkittler almost 2 years ago

Unfortunately, errors on the websocket server side aren't forwarded to the scheduler. I tried to find the websocket errors which would match

[2022-05-09T11:07:02.352935+02:00] [debug] [pid:14772] Assigned job '8722195' to worker ID '1210' 
[2022-05-09T11:07:02.357567+02:00] [debug] [pid:14772] [Job#8722195] Prepare for being processed by worker 1210 
[2022-05-09T11:07:02.497911+02:00] [debug] [pid:14772] Failed sending job(s) '8722195' to worker '1210' 

manually but haven't found anything except that the job is later assigned to another worker:

martchus@openqa:~> sudo journalctl -u openqa-websockets | grep 8722195
Mai 09 11:08:52 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Started to send message to 1778 for job(s) 8722195
Mai 09 11:08:52 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Worker 1778 accepted job 8722195
martchus@openqa:~> sudo journalctl -u openqa-websockets | grep 'Unable to assign job to worker 1210'

The 2nd grep shows no results because the error message is never logged but only passed to the scheduler where it is also never logged.


PR for improving the logging: https://github.com/os-autoinst/openQA/pull/4644

Actions #8

Updated by mkittler almost 2 years ago

Unfortunately the websocket server's logs don't reach back to yesterday; these are the earliest messages about worker 1210:

Mai 09 08:31:33 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Updating seen of worker 1210 from worker_status (free)
Mai 09 08:33:13 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Updating seen of worker 1210 from worker_status (free)
Mai 09 08:34:53 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Updating seen of worker 1210 from worker_status (free)
Mai 09 08:36:33 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Updating seen of worker 1210 from worker_status (free)
Mai 09 08:38:13 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Updating seen of worker 1210 from worker_status (free)
Mai 09 08:39:53 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Updating seen of worker 1210 from worker_status (free)
Mai 09 08:41:33 openqa openqa-websockets-daemon[1406]: [debug] [pid:1406] Updating seen of worker 1210 from worker_status (free)

But they come after the worker's last log message:

Mai 08 21:30:57 openqaworker13 worker[32415]: [info] [pid:32415] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1210

So the worker was not completely stuck at this point. It still sent update messages.

Actions #9

Updated by mkittler almost 2 years ago

I've already found problematic behavior within the websocket server and created a fix (https://github.com/os-autoinst/openQA/pull/4645), but I suppose it won't fix this issue. However, it shows at least one way in which the websocket server may be responsible for a worker getting stuck.

Actions #10

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-05-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz almost 2 years ago

  • Related to coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances added
Actions #12

Updated by okurz almost 2 years ago

  • Tags set to reactive work
  • Parent task set to #110833
Actions #13

Updated by mkittler almost 2 years ago

  • Tags deleted (reactive work)

With the new logging it looks like this:

martchus@openqa:~> grep -i fail /var/log/openqa_scheduler
[2022-05-10T13:08:22.622596+02:00] [warn] [pid:11980] Failed sending job(s) '8729188' to worker '570': Unable to assign job to worker 570: the worker is not connected anymore
[2022-05-10T13:13:31.971638+02:00] [warn] [pid:11980] Failed sending job(s) '8729173' to worker '927': Unable to assign job to worker 927: the worker is not connected anymore
[2022-05-10T15:45:53.517015+02:00] [warn] [pid:11980] Failed to send data to websocket server, reason: Premature connection close at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
[2022-05-10T15:45:53.517245+02:00] [warn] [pid:11980] Failed sending job(s) '8729641' to worker '1267': unknown error
martchus@openqa:~> xzgrep -i fail /var/log/openqa_scheduler.1.xz 
[2022-05-10T06:50:55.433526+02:00] [debug] [pid:14772] Failed sending job(s) '8718406' to worker '1042'
[2022-05-10T07:31:35.843194+02:00] [warn] [pid:11980] Failed to send data to websocket server, reason: Premature connection close at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
[2022-05-10T07:31:35.843455+02:00] [warn] [pid:11980] Failed sending job(s) '8718486' to worker '1061': unknown error
[2022-05-10T10:18:55.317366+02:00] [warn] [pid:11980] Failed sending job(s) '8728353' to worker '578': Unable to assign job to worker 578: the worker is not connected anymore
[2022-05-10T10:19:56.036100+02:00] [warn] [pid:11980] Failed sending job(s) '8728384' to worker '1123': Unable to assign job to worker 1123: the worker is not connected anymore
[2022-05-10T10:20:33.956226+02:00] [warn] [pid:11980] Failed sending job(s) '8728362' to worker '1118': Unable to assign job to worker 1118: the worker is not connected anymore
[2022-05-10T10:20:36.120146+02:00] [warn] [pid:11980] Failed sending job(s) '8728363' to worker '590': Unable to assign job to worker 590: the worker is not connected anymore
[2022-05-10T10:21:19.373625+02:00] [warn] [pid:11980] Failed sending job(s) '8728457, 8728455, 8728454' to worker '1070': Unable to assign job to worker 1070: the worker is not connected anymore
[2022-05-10T12:03:28.436417+02:00] [warn] [pid:11980] Failed sending job(s) '8720009' to worker '1059': Unable to assign job to worker 1059: the worker is not connected anymore
[2022-05-10T12:12:26.873022+02:00] [warn] [pid:11980] Failed to send data to websocket server, reason: Premature connection close at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
[2022-05-10T12:12:26.873190+02:00] [warn] [pid:11980] Failed sending job(s) '8728812' to worker '1061': unknown error
[2022-05-10T12:28:40.525726+02:00] [warn] [pid:11980] Failed to send data to websocket server, reason: Premature connection close at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
[2022-05-10T12:28:40.525880+02:00] [warn] [pid:11980] Failed sending job(s) '8729041' to worker '575': unknown error
[2022-05-10T12:45:32.903195+02:00] [warn] [pid:11980] Failed sending job(s) '8720053' to worker '1042': Unable to assign job to worker 1042: the worker is not connected anymore

I hadn't expected so many occurrences. However, none of those workers appears to be actually stuck.

Actions #14

Updated by mkittler almost 2 years ago

  • Tags set to reactive work
Actions #15

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

I couldn't find any stuck workers anymore by looking at the scheduler logs as before, so I'm lowering the priority. I'm still wondering what triggered the problem on Monday.

Actions #16

Updated by okurz almost 2 years ago

  • Due date deleted (2022-05-24)
  • Status changed from Feedback to Resolved

Discussed in the daily 2022-05-16. We did not see the issue again and likely it was not that severe in the first place. Still, we have noted down future ideas in the parent saga to follow up on later.
