action #33805

open

Make worker registration resilient when multiple webuis are not reachable (was: Worker websocket registration blocks the worker loop)

Added by EDiGiacinto about 6 years ago. Updated about 4 years ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2018-03-26
Due date:
% Done: 0%
Estimated time:

Description

When the worker fails to register against several of the configured WebUIs (in this case, two are failing):

Mar 26 14:04:34 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:04:50 loewe worker[70750]: [info] Collected unknown process with pid 76883 and exit status: 0
Mar 26 14:04:50 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:06:58 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:06:58 loewe worker[70750]: [info] Collected unknown process with pid 77115 and exit status: 0
Mar 26 14:06:58 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:06:58 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:08:00 loewe worker[70750]: [info] Collected unknown process with pid 77353 and exit status: 0
Mar 26 14:08:00 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:10:07 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:10:07 loewe worker[70750]: [info] Collected unknown process with pid 77565 and exit status: 0
Mar 26 14:10:07 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:10:07 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:10:17 loewe worker[70750]: [info] Collected unknown process with pid 77629 and exit status: 0
Mar 26 14:10:17 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]
Mar 26 14:12:25 loewe worker[70750]: [error] unable to connect to host http://g226.suse.de, retry in 10s
Mar 26 14:12:25 loewe worker[70750]: [info] Collected unknown process with pid 77823 and exit status: 0
Mar 26 14:12:25 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://copland.arch.suse.de using protocol version [1]
Mar 26 14:12:25 loewe worker[70750]: [error] unable to connect to host http://copland.arch.suse.de, retry in 10s
Mar 26 14:12:35 loewe worker[70750]: [info] Collected unknown process with pid 77831 and exit status: 0
Mar 26 14:12:35 loewe worker[70750]: [info] registering worker loewe version 7 with openQA http://g226.suse.de using protocol version [1]

The worker starts to slow down its updates to the WebUI, and from the dashboard it is not clear what is going on: the worker just seems slow, but actually it is not (in this case there was just 1 job running, and the machine has room for many more).
The live stream and the live console become slow and unresponsive, with updates arriving only at regular intervals. When this happens, you can see that the worker updates the job only after the registration timeout has expired. So I think we are blocking the loop, which makes shared workers slow when other configured instances are down.
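
To illustrate the suspected problem: in the log above, each registration attempt against http://g226.suse.de takes about two minutes before it fails (for example 14:04:50 to 14:06:58), and nothing else is logged during that time. The following is a minimal sketch of the alternative, not the openQA worker's actual code (the worker is not written in Python). It assumes a hypothetical async `connect(host)` callable and made-up names `register` and `register_all`. Each host gets its own task with a bounded timeout, so an unreachable WebUI only delays its own retries.

```python
import asyncio


async def register(host, connect, retry_delay=10.0, timeout=5.0, max_attempts=None):
    """Try to register with one host, retrying after failures.

    `connect` is a hypothetical coroutine function taking the host URL;
    it stands in for whatever opens the registration connection.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            # Bound each attempt so a hanging connection cannot stall forever.
            await asyncio.wait_for(connect(host), timeout)
            return True
        except (OSError, asyncio.TimeoutError):
            # Sleeping here yields to the event loop, so other hosts and
            # job updates keep running while this host waits to retry.
            await asyncio.sleep(retry_delay)
    return False


async def register_all(hosts, connect, **kwargs):
    """Register with all hosts concurrently, one task per host."""
    tasks = {
        host: asyncio.create_task(register(host, connect, **kwargs))
        for host in hosts
    }
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, results))
```

With `max_attempts=None` a host is retried indefinitely, which corresponds to the "retry in 10s" behaviour seen in the log. The important difference is that the retries for one host never run inside the code path that updates running jobs.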

Actions #1

Updated by okurz about 4 years ago

  • Subject changed from [tools] Worker websocket registration blocks the worker loop to Make worker registration resilient when multiple webuis are not reachable (was: Worker websocket registration blocks the worker loop)
  • Category changed from Regressions/Crashes to Feature requests
  • Priority changed from Normal to Low
  • Target version set to future

Multiple WebUIs aren't that commonly used, hence downgrading to "Low" after this long time. The original idea of making the registration more resilient to failed registration attempts is still valid, though, so I changed this to a "feature request".
