action #33805
Make worker registration resilient when multiple webuis are not reachable (was: Worker websocket registration blocks the worker loop)
0%
Description
When the worker fails to register against several of the specified WebUIs (in this case, two are failing):
Mar 26 14:04:34 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:04:50 loewe worker[70750]: [info] Collected unknown process with pid 76883 and exit status: 0
Mar 26 14:04:50 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:06:58 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:06:58 loewe worker[70750]: [info] Collected unknown process with pid 77115 and exit status: 0
Mar 26 14:06:58 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:06:58 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:08:00 loewe worker[70750]: [info] Collected unknown process with pid 77353 and exit status: 0
Mar 26 14:08:00 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:10:07 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:10:07 loewe worker[70750]: [info] Collected unknown process with pid 77565 and exit status: 0
Mar 26 14:10:07 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:10:07 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:10:17 loewe worker[70750]: [info] Collected unknown process with pid 77629 and exit status: 0
Mar 26 14:10:17 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:12:25 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:12:25 loewe worker[70750]: [info] Collected unknown process with pid 77823 and exit status: 0
Mar 26 14:12:25 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:12:25 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:12:35 loewe worker[70750]: [info] Collected unknown process with pid 77831 and exit status: 0
Mar 26 14:12:35 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
The worker starts to slow down its updates to the WebUI, and from the dashboard it is not clear what is going on: it just looks as if the worker is slow, but actually it is not (in this case there was just 1 job running, and the machine has room for many more).
The live stream becomes slow and unresponsive, and so does the live console, with updates arriving only at regular intervals. When this is happening you can see that the worker updates the job only after the registration timeout has expired, so I think what's happening is that we are blocking the loop, which makes shared workers slow whenever other configured instances are down.
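To illustrate the suspected problem and one possible direction, here is a minimal Python sketch (not the actual openQA worker code, which is not shown in this ticket). It assumes each WebUI registration can run as its own task with a per-attempt timeout, so a hanging connection to one host does not delay job updates for the others. All names (`register`, `run_worker`, `fake_connect`), the `*.example` hosts and the shortened delays are hypothetical and exist only for the demonstration.

```python
import asyncio

# Assumption: the log shows "retry in 10s"; shortened here so the demo runs fast.
RETRY_DELAY = 0.01


async def register(host, connect, timeout, attempts):
    """Try to register with a single host, bounded by a per-attempt timeout."""
    for _ in range(attempts):
        try:
            # wait_for cancels the connection attempt once the timeout expires.
            await asyncio.wait_for(connect(host), timeout)
            return True
        except (asyncio.TimeoutError, OSError):
            # Sleeping yields to the event loop, so other work keeps running.
            await asyncio.sleep(RETRY_DELAY)
    return False


async def run_worker(hosts, connect, timeout=0.05, attempts=2, ticks=5):
    """Register with all hosts concurrently while still sending job updates."""
    # One independent task per host: a dead host only stalls its own task.
    tasks = {
        host: asyncio.create_task(register(host, connect, timeout, attempts))
        for host in hosts
    }
    updates = 0
    for _ in range(ticks):
        updates += 1  # stand-in for "send job status update to the WebUI"
        await asyncio.sleep(0.01)
    results = {host: await task for host, task in tasks.items()}
    return updates, results


async def fake_connect(host):
    """Hypothetical connection: every host except one hangs indefinitely."""
    if host != "http://reachable.example":
        await asyncio.sleep(3600)
```

Running `asyncio.run(run_worker([...], fake_connect))` with one reachable and two unreachable hosts completes all five simulated job updates while the failing registrations time out in the background. Whether the real worker loop can be restructured this way is an open question for this ticket.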
Updated by okurz over 4 years ago
- Subject changed from [tools] Worker websocket registration blocks the worker loop to Make worker registration resilient when multiple webuis are not reachable (was: Worker websocket registration blocks the worker loop)
- Category changed from Regressions/Crashes to Feature requests
- Priority changed from Normal to Low
- Target version set to future
Multiple webuis aren't that commonly used, hence downgrading to "Low" after this longer time. The original idea of making registration more resilient to failed registration attempts is valid though, so I changed this to a "feature request".