Project

General

Profile

Actions

action #21836

closed

[tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de

Added by nicksinger over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2017-08-08
Due date:
% Done:

0%

Estimated time:

Description

tldr; we have too many "A message received from unknown worker connection" in our logs. These are caused by workers still trying to connect.

  • Why does this happen exactly?
  • We need to increase the information of the message

Right now we've many log entries like these. I've highlighted the one in question and added a little more context to it:

[Wed Aug  9 10:04:20 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:20 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:20 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:21 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:22 2017] [5391:info] Worker 30039 stopped
[Wed Aug  9 10:04:22 2017] [18822:info] Worker 18822 started
[Wed Aug  9 10:04:22 2017] [18822:info] Connecting to AMQP server
[Wed Aug  9 10:04:22 2017] [30051:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug  9 10:04:22 2017] [18822:info] AMQP connection established
[Wed Aug  9 10:04:22 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:22 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:22 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:23 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:24 2017] [websockets:error] Worker not found for given connection during connection close
--> [Wed Aug  9 10:04:24 2017] [websockets:warn] A message received from unknown worker connection
[Wed Aug  9 10:04:26 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:26 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:27 2017] [30051:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug  9 10:04:27 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:28 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:30 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:32 2017] [3650:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug  9 10:04:33 2017] [17486:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug  9 10:04:33 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:33 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:36 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:37 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:38 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug  9 10:04:38 2017] [17486:info] Got status update for job 1099197 that does not belong to Worker 384

Excerpt about this issue:

13:10 <nsinger> mudler: are you two still working on the scheduler somehow? We get a lot of "A message received from unknown worker connection" in our osd logs and i'm not 100% certain if we have to act on it
13:13 <mudler> nsinger, that doesn't regard too much the scheduler changes
13:13 <mudler> nsinger, was the connection dropped from the worker?
13:13 <mudler> e.g. similar to https://progress.opensuse.org/issues/21028#change-59178
13:16 <nsinger> mudler: have not checked further yet. In the logwarn mail i only get the line posted above but maybe it ignores the context
13:16 <mudler> a bit of context would help to debug the issue further
13:17 <nsinger> mudler: just checked; http://paste.suse.de/24302
13:17 <nsinger> i think the "Worker 4650 stopped" could explain it :)
13:19 <mudler> oh i see then :) 
13:19 <mudler> oki - good to know
13:20 <nsinger> mudler: i'd create a ticket if it is not already covered? 
13:22 <mudler> well - that is partially covered, worker should re-register and re-create a new ws association inside websocket server
13:22 <nsinger> mudler: can you give me a link please? Then i could add this line inside our ignore-list
13:22 <mudler> i wonder why worker kept sending message
13:24 <mudler> https://github.com/os-autoinst/openQA/blob/70954b894d39e3215d6db38b743f4cc1680c2a2b/lib/OpenQA/WebSockets/Server.pm#L192
13:24 <nsinger> mudler: honestly, i've no clue if these lines in the log are even related. The "A message received"-message is pretty useless. Maybe we should increase information first here?
13:24 <mudler> nsinger, but i don't mind if you open an issue and assign to me - so i remember to re-check later
13:24 <mudler> true, it's pretty  though to debug like so
13:25 <mudler> but hold on a bit touching those files because i'm woring on a PR that touches them as well
13:25 <mudler> working*
13:25 <mudler> (i would avoid merge mess! :p )
13:26 <nsinger> mudler: if you take care of improving the message, i'm happy not touching these files ;)
13:26 <mudler> nsinger, sure :) will add to the changes then
13:28 <mudler> nsinger, it's cool if i will let it dump the message?
13:29 <nsinger> mudler: what do you mean? "dump" as in: remove it completely?
13:30 <mudler> sorry - meant dump the full output in the logs, it could be a bit cluttering but if we want to see what's happening it would be the best
13:31 <nsinger> mudler: fine with me but please lower the loglevel to at least debug then :P
13:32 <mudler> nsinger, then i'll print the full message in log debug and keep the same error message as for now in warn

Related issues 2 (0 open2 closed)

Related to openQA Project - action #19424: [tools] logwarn: [websockets:error] Worker not found for given connection during connection closeResolvedmkittler2017-05-30

Actions
Related to openQA Project - action #20544: [tools] Research/investigate ways to optimize scheduler grab_jobResolvedEDiGiacinto2017-07-18

Actions
Actions #1

Updated by nicksinger over 6 years ago

  • Related to action #19424: [tools] logwarn: [websockets:error] Worker not found for given connection during connection close added
Actions #2

Updated by nicksinger over 6 years ago

  • Description updated (diff)
Actions #3

Updated by nicksinger over 6 years ago

  • Subject changed from Many "A message received from unknown worker connection" log entries on openqa.suse.de to [tools] Many "A message received from unknown worker connection" log entries on openqa.suse.de
Actions #4

Updated by okurz over 6 years ago

  • Related to action #20544: [tools] Research/investigate ways to optimize scheduler grab_job added
Actions #5

Updated by nicksinger over 6 years ago

  • Priority changed from Normal to High

Update from today: The scheduler on OSD completely refused to assign jobs to the workers.
A quick look into the logs showed that it gets spammed with:

[Tue Aug 29 12:56:59 2017] [websockets:warn] A message received from unknown worker connection

After restarting the websockets+scheduler OSD could slowly recover but printed even more "unknown worker connection" errors/warnings in the logs.
We definitely need to change this message to something meaningful and besides that need to find out why it kills our scheduling so hard.
Raising this to high priority since it really influences our automated testing.

Actions #6

Updated by EDiGiacinto over 6 years ago

If no workers are seen online by the scheduler, it won't assigne new jobs since now are completely dispatched over websocket. This seems to happen even more when WebUI is restarted, but workers are not - since they are not attempting to re-register to the WebUI, and even if we see the state as online (updated over different routes), scheduler can't dispatch jobs as there is no websocket connection with the workers anymore.

Actions #7

Updated by EDiGiacinto over 6 years ago

This is now handled in: https://github.com/os-autoinst/openQA/pull/1441

We should observe a rate decrease in the log messages (since we force workers to re-register)

Actions #8

Updated by EDiGiacinto over 6 years ago

  • Status changed from New to In Progress
Actions #9

Updated by EDiGiacinto over 6 years ago

  • Status changed from In Progress to Resolved

Closing, since now it's handled.

Actions #10

Updated by szarate over 6 years ago

  • Subject changed from [tools] Many "A message received from unknown worker connection" log entries on openqa.suse.de to [tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de
Actions

Also available in: Atom PDF