action #33805

open

Make worker registration resilient when multiple webuis are not reachable (was: Worker websocket registration blocks the worker loop)

Added by EDiGiacinto about 6 years ago. Updated about 4 years ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2018-03-26
Due date:
% Done: 0%
Estimated time:

Description

When the worker fails to register against several of the configured WebUIs (in this case, two are failing):

Mar 26 14:04:34 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:04:50 loewe worker[70750]: [info] Collected unknown process with pid 76883 and exit status: 0
Mar 26 14:04:50 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:06:58 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:06:58 loewe worker[70750]: [info] Collected unknown process with pid 77115 and exit status: 0
Mar 26 14:06:58 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:06:58 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:08:00 loewe worker[70750]: [info] Collected unknown process with pid 77353 and exit status: 0
Mar 26 14:08:00 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:10:07 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:10:07 loewe worker[70750]: [info] Collected unknown process with pid 77565 and exit status: 0
Mar 26 14:10:07 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:10:07 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:10:17 loewe worker[70750]: [info] Collected unknown process with pid 77629 and exit status: 0
Mar 26 14:10:17 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:12:25 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:12:25 loewe worker[70750]: [info] Collected unknown process with pid 77823 and exit status: 0
Mar 26 14:12:25 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:12:25 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:12:35 loewe worker[70750]: [info] Collected unknown process with pid 77831 and exit status: 0
Mar 26 14:12:35 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]

The worker starts to slow down its updates to the WebUI, and from the dashboard it is not clear what is going on: the worker just seems slow, but actually it is not (in this case there was just 1 job running, and the machine has room for many more).
The Live Stream and the live console become slow and unresponsive, updating only at regular intervals. While this is happening you can see that the worker updates the job only after the registration timeout has expired, so I think we are blocking the loop, which makes shared workers slow when other configured instances are down.
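
To illustrate the suspected problem and one possible direction, here is a minimal sketch in Python with asyncio. It is not the openQA worker code. The host names, delays, and function names (`register`, `register_with_retry`, `job_updates`) are all hypothetical. It only shows the general idea: if each registration attempt runs as its own task on the event loop, a slow or failing WebUI does not delay job updates to the others. In the log above, each attempt against http://g226.suse.de takes roughly two minutes before failing; the sketch scales that down to fractions of a second.

```python
import asyncio
import time


async def register(host: str, connect_delay: float, reachable: bool) -> bool:
    """Simulate one registration attempt (hypothetical stand-in).

    The attempt takes `connect_delay` seconds, standing in for a connect
    timeout, and then succeeds only if the host is reachable.
    """
    await asyncio.sleep(connect_delay)
    return reachable


async def register_with_retry(host, connect_delay, reachable, retry_in, attempts, log):
    """Retry registration without blocking the event loop between attempts."""
    for _ in range(attempts):
        if await register(host, connect_delay, reachable):
            log.append((host, "registered"))
            return True
        log.append((host, "failed"))
        await asyncio.sleep(retry_in)  # yields to other tasks while waiting
    return False


async def job_updates(count, interval, stamps):
    """Simulate periodic job status updates and record when each one ran."""
    start = time.monotonic()
    for _ in range(count):
        await asyncio.sleep(interval)
        stamps.append(time.monotonic() - start)


async def main():
    log, stamps = [], []
    # Assumed setup: one reachable WebUI and two unreachable ones.
    hosts = {
        "http://up.example": True,
        "http://down-a.example": False,
        "http://down-b.example": False,
    }
    # Each registration runs as its own task, so none of them holds up the rest.
    tasks = [
        asyncio.create_task(register_with_retry(h, 0.2, ok, 0.05, 2, log))
        for h, ok in hosts.items()
    ]
    # Job updates proceed while the registrations are still pending.
    await job_updates(5, 0.01, stamps)
    results = await asyncio.gather(*tasks)
    return log, stamps, results


if __name__ == "__main__":
    log, stamps, results = asyncio.run(main())
    print(results)
    print(log)
```

In this sketch all five simulated job updates finish before the first registration attempt has even timed out. With a registration call that blocks the loop, they would instead have to wait for each timeout in turn, which matches the behaviour described above.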
