Project

General

Profile

Actions

action #105274

closed

Workers constantly logging websocket connection "finished by remote side with code 1006, no reason"

Added by AdamWill almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2022-01-21
Due date:
% Done:

0%

Estimated time:

Description

I deployed a couple of new worker hosts to our openQA instance today, and noticed that there seems to be some kind of constant websocket connection churn, like this (logs for just one worker process isolated, but I get this for every worker process):

Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] worker 26:
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - config file:           /etc/openqa/workers.ini
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - worker hostname:       openqa-x86-worker06.iad2.fedoraproject.org
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - isotovideo version:    24
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - websocket API version: 1
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - web UI hosts:          http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - class:                 ?
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - no cleanup:            no
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]:  - pool directory:        /var/lib/openqa/pool/26
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Project dir for host http://openqa-lab01.iad2.fedoraproject.org is /var/lib/openqa/share
Jan 21 21:49:28 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:49:29 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:49:29 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:50:29 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:50:39 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:50:39 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:50:39 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:51:39 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:51:49 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:51:49 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:51:49 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:52:49 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:52:59 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:53:00 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:53:00 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:54:00 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:54:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:54:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:54:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:55:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:55:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:55:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:55:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:56:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:56:30 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:56:30 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:56:30 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:57:30 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:57:40 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:57:40 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:57:40 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:58:40 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 21:58:50 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 21:58:50 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 21:58:50 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 21:59:50 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 22:00:00 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 22:00:00 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 22:00:00 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 22:01:00 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 22:01:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 22:01:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 22:01:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 22:02:10 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 22:02:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 22:02:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 22:02:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321
Jan 21 22:03:20 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [warn] Websocket connection to http://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 21 22:03:30 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registering with openQA http://openqa-lab01.iad2.fedoraproject.org
Jan 21 22:03:30 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Establishing ws connection via ws://openqa-lab01.iad2.fedoraproject.org/api/v1/ws/321
Jan 21 22:03:30 openqa-x86-worker06.iad2.fedoraproject.org worker[19244]: [info] Registered and connected via websockets with openQA host http://openqa-lab01.iad2.fedoraproject.org and worker ID 321

...and so on and so on. It goes on apparently forever. At first I thought this meant something was wrong with the new workers, but then I went and checked other existing worker hosts that are working OK, and found the same pattern there. It seems to be happening on all of them. It doesn't stop jobs from running and working, but it's a lot of log spam and probably wasted work/network traffic.

I found a couple of existing closed issues that mention this code 1006 message, but they seemed to be associated with other problems that were then resolved. Couldn't find any report of just this kind of...eternal connection loop going on without the workers actually being apparently broken.

Actions #1

Updated by AdamWill almost 3 years ago

Looking at the timestamps, it seems like the connection times out after a minute, the worker waits ten seconds and reconnects, the new connection times out after a minute...in a loop, forever.

Actions #2

Updated by okurz almost 3 years ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #3

Updated by kraih almost 3 years ago

60 seconds is a bit odd for a timeout there. Looking at the code, the Worker uses OpenQA::Client which inherits a 600 second inactivity timeout from OpenQA::UserAgent. On the WebSocket server side we even set the inactivity timeout to 0, in turn disabling it completely. So neither should be causing the problem here. Perhaps it is the reverse proxy used in the Fedora deployment disrupting the connection every 60 seconds?

Actions #4

Updated by kraih almost 3 years ago

  • Assignee set to kraih
Actions #5

Updated by kraih almost 3 years ago

  • Status changed from New to In Progress
Actions #6

Updated by openqa_review almost 3 years ago

  • Due date set to 2022-02-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by kraih almost 3 years ago

I've looked through the logs on our production workers, and while i did see the log message a few times, it was no more than once or twice a day. And there was a very clear pattern showing that it's just the reverse proxy resetting the connections (service restart and the like...).

Jan 26 06:30:16 openqaworker2 worker[24578]: [warn] [pid:24578] Websocket connection to http://openqa.suse.de/api/v1/ws/1070 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 26 06:30:16 openqaworker3 worker[32313]: [warn] [pid:32313] Websocket connection to http://openqa.suse.de/api/v1/ws/1266 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 26 06:30:16 openqaworker8 worker[56549]: [warn] [pid:56549] Websocket connection to http://openqa.suse.de/api/v1/ws/570 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 26 05:30:16 openqaworker-arm-1 worker[35370]: [warn] [pid:35370] Websocket connection to http://openqa.suse.de/api/v1/ws/477 finished by remote side with code 1006, no reason - trying again in 10 seconds

Jan 28 06:16:10 openqaworker2 worker[6998]: [warn] [pid:6998] Websocket connection to http://openqa.suse.de/api/v1/ws/1070 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 28 06:16:10 openqaworker8 worker[44847]: [warn] [pid:44847] Websocket connection to http://openqa.suse.de/api/v1/ws/570 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 28 06:16:10 openqaworker-arm-3 worker[49935]: [warn] [pid:49935] Websocket connection to http://openqa.suse.de/api/v1/ws/1281 finished by remote side with code 1006, no reason - trying again in 10 seconds

Jan 30 06:52:44 openqaworker2 worker[4230]: [warn] [pid:4230] Websocket connection to http://openqa.suse.de/api/v1/ws/1070 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 30 06:52:44 openqaworker5 worker[12747]: [warn] [pid:12747] Websocket connection to http://openqa.suse.de/api/v1/ws/961 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 30 06:52:44 openqaworker8 worker[62786]: [warn] [pid:62786] Websocket connection to http://openqa.suse.de/api/v1/ws/570 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 30 06:52:44 openqaworker-arm-2 worker[79404]: [warn] [pid:79404] Websocket connection to http://openqa.suse.de/api/v1/ws/534 finished by remote side with code 1006, no reason - trying again in 10 seconds

Jan 31 06:55:41 openqaworker2 worker[5795]: [warn] [pid:5795] Websocket connection to http://openqa.suse.de/api/v1/ws/1070 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 31 06:55:41 openqaworker5 worker[5201]: [warn] [pid:5201] Websocket connection to http://openqa.suse.de/api/v1/ws/961 finished by remote side with code 1006, no reason - trying again in 10 seconds
Jan 31 06:55:41 openqaworker8 worker[53660]: [warn] [pid:53660] Websocket connection to http://openqa.suse.de/api/v1/ws/570 finished by remote side with code 1006, no reason - trying again in 10 seconds

So i'm pretty confident that there's no bug here and it's just the proxy configuration.

Actions #8

Updated by kraih almost 3 years ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by okurz almost 3 years ago

  • Due date deleted (2022-02-12)
  • Status changed from Feedback to Resolved

@AdamWill we talked about this ticket within the team SUSE QE Tools and agreed that there shouldn't be anything that we need to improve from openQA side itself. You might benefit from talking a look into our proxy configuration e.g. https://github.com/os-autoinst/openQA/blob/master/etc/apache2/vhosts.d/openqa-common.inc#L89 (there is also a corresponding one for nginx) . There we set e.g.

ProxyTimeout 300

I hope that's enough to help you here.

Actions

Also available in: Atom PDF