Actions
action #21836
closed[tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de
Start date:
2017-08-08
Due date:
% Done:
0%
Estimated time:
Description
tldr; we have too many "A message received from unknown worker connection" in our logs. These are caused by workers still trying to connect.
- Why does this happen exactly?
- We need to increase the information of the message
Right now we've many log entries like these. I've highlighted the one in question and added a little more context to it:
[Wed Aug 9 10:04:20 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:20 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:20 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:21 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:22 2017] [5391:info] Worker 30039 stopped
[Wed Aug 9 10:04:22 2017] [18822:info] Worker 18822 started
[Wed Aug 9 10:04:22 2017] [18822:info] Connecting to AMQP server
[Wed Aug 9 10:04:22 2017] [30051:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug 9 10:04:22 2017] [18822:info] AMQP connection established
[Wed Aug 9 10:04:22 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:22 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:22 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:23 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:24 2017] [websockets:error] Worker not found for given connection during connection close
--> [Wed Aug 9 10:04:24 2017] [websockets:warn] A message received from unknown worker connection
[Wed Aug 9 10:04:26 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:26 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:27 2017] [30051:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug 9 10:04:27 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:28 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:30 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:32 2017] [3650:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug 9 10:04:33 2017] [17486:info] Got status update for job 1099197 that does not belong to Worker 384
[Wed Aug 9 10:04:33 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:33 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:36 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:37 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:38 2017] [websockets:error] Worker not found for given connection during connection close
[Wed Aug 9 10:04:38 2017] [17486:info] Got status update for job 1099197 that does not belong to Worker 384
Excerpt about this issue:
13:10 <nsinger> mudler: are you two still working on the scheduler somehow? We get a lot of "A message received from unknown worker connection" in our osd logs and i'm not 100% certain if we have to act on it
13:13 <mudler> nsinger, that doesn't regard too much the scheduler changes
13:13 <mudler> nsinger, was the connection dropped from the worker?
13:13 <mudler> e.g. similar to https://progress.opensuse.org/issues/21028#change-59178
13:16 <nsinger> mudler: have not checked further yet. In the logwarn mail i only get the line posted above but maybe it ignores the context
13:16 <mudler> a bit of context would help to debug the issue further
13:17 <nsinger> mudler: just checked; http://paste.suse.de/24302
13:17 <nsinger> i think the "Worker 4650 stopped" could explain it :)
13:19 <mudler> oh i see then :)
13:19 <mudler> oki - good to know
13:20 <nsinger> mudler: i'd create a ticket if it is not already covered?
13:22 <mudler> well - that is partially covered, worker should re-register and re-create a new ws association inside websocket server
13:22 <nsinger> mudler: can you give me a link please? Then i could add this line inside our ignore-list
13:22 <mudler> i wonder why worker kept sending message
13:24 <mudler> https://github.com/os-autoinst/openQA/blob/70954b894d39e3215d6db38b743f4cc1680c2a2b/lib/OpenQA/WebSockets/Server.pm#L192
13:24 <nsinger> mudler: honestly, i've no clue if these lines in the log are even related. The "A message received"-message is pretty useless. Maybe we should increase information first here?
13:24 <mudler> nsinger, but i don't mind if you open an issue and assign to me - so i remember to re-check later
13:24 <mudler> true, it's pretty though to debug like so
13:25 <mudler> but hold on a bit touching those files because i'm woring on a PR that touches them as well
13:25 <mudler> working*
13:25 <mudler> (i would avoid merge mess! :p )
13:26 <nsinger> mudler: if you take care of improving the message, i'm happy not touching these files ;)
13:26 <mudler> nsinger, sure :) will add to the changes then
13:28 <mudler> nsinger, it's cool if i will let it dump the message?
13:29 <nsinger> mudler: what do you mean? "dump" as in: remove it completely?
13:30 <mudler> sorry - meant dump the full output in the logs, it could be a bit cluttering but if we want to see what's happening it would be the best
13:31 <nsinger> mudler: fine with me but please lower the loglevel to at least debug then :P
13:32 <mudler> nsinger, then i'll print the full message in log debug and keep the same error message as for now in warn
Actions