action #19424
closed [tools] logwarn: [websockets:error] Worker not found for given connection during connection close
Description
The following message has been appearing quite often in the logs lately:
[Tue May 30 06:10:09 2017] [websockets:error] Worker not found for given connection during connection close
[Tue May 30 06:10:41 2017] [websockets:error] Worker not found for given connection during connection close
[Tue May 30 06:15:14 2017] [websockets:error] Worker not found for given connection during connection close
[Tue May 30 06:15:46 2017] [websockets:error] Worker not found for given connection during connection close
[Tue May 30 06:20:19 2017] [websockets:error] Worker not found for given connection during connection close
[Tue May 30 06:20:51 2017] [websockets:error] Worker not found for given connection during connection close
[...]
Updated by okurz over 7 years ago
- Status changed from New to In Progress
This persisted until yesterday evening. The last such line I can see in the log file for now is:
[Fri Jun 2 23:32:27 2017] [websockets:error] Worker not found for given connection during connection close
I looked at the source code but could not find a good way to improve the error message so that it points to a specific problem or worker.
Updated by okurz over 7 years ago
Monitoring alert disabled with https://github.com/okurz/openqa_monitoring/pull/10, so don't be confused when you no longer get an email.
Updated by nicksinger over 7 years ago
- Related to action #21836: [tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de added
Updated by coolo about 7 years ago
- Priority changed from Normal to High
- Target version set to Ready
This is still going on.
Updated by mkittler almost 6 years ago
Is this still happening? I couldn't see anything in the recent websocket server logs on OSD. I'm not sure why this problem would occur only occasionally, but it should be easy to rewrite the code to get rid of it (using the same pattern as in the developer mode code).
Updated by okurz over 5 years ago
- Status changed from In Progress to Workable
I just checked the logs on o3 with
sudo grep 'Worker not found for given connection' /var/log/openqa
and still found a lot of these messages.
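To see how frequent the message really is, the matching lines could be counted per day. A minimal sketch in Python, assuming the timestamp format shown in the description; the function name count_per_day is made up for illustration and is not part of openQA:

```python
import re
from collections import Counter

# Matches lines such as
# [Tue May 30 06:10:09 2017] [websockets:error] Worker not found for given connection ...
# and captures month, day of month and year.
LINE_RE = re.compile(
    r"^\[\w{3} (\w{3}) +(\d{1,2}) \d{2}:\d{2}:\d{2} (\d{4})\] "
    r"\[websockets:error\] Worker not found for given connection"
)


def count_per_day(lines):
    """Count the 'Worker not found' error lines per calendar day."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            month, day, year = match.groups()
            counts[f"{year} {month} {day}"] += 1
    return counts


sample = [
    "[Tue May 30 06:10:09 2017] [websockets:error] Worker not found for given connection during connection close",
    "[Tue May 30 06:10:41 2017] [websockets:error] Worker not found for given connection during connection close",
    "[Fri Jun 2 23:32:27 2017] [websockets:error] Worker not found for given connection during connection close",
]
for day, count in count_per_day(sample).items():
    print(day, count)
```

The sample lines are copied from this ticket. For real data, the lines would come from the log file instead of the hard-coded list.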
Updated by mkittler over 5 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Since I'm looking at the web socket server code anyway right now, I'll have a look.
Updated by mkittler over 5 years ago
Updated by okurz over 5 years ago
PR merged. I am not sure if the PR should be the only thing needed to resolve this ticket though.
Updated by mkittler over 5 years ago
- Status changed from In Progress to Feedback
Me neither. I could only reproduce this issue by configuring a worker to connect to the same web UI twice at the same time (which is unlikely to happen in production). So let's see whether this fixes the production case as well. My theory for the production case is that the worker already tries to reconnect while the web socket server hasn't yet handled the previous disconnect.
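The theory can be illustrated with a toy model. This is only a sketch under assumptions: it is Python rather than the Perl of the actual web socket server, it assumes the server keeps one connection per worker ID, and all names (ToyRegistry, on_connect, on_close, simulate_early_reconnect) are hypothetical:

```python
class ToyRegistry:
    """Toy model of a server that tracks one connection per worker ID."""

    def __init__(self):
        self.connection_by_worker = {}
        self.errors = []

    def on_connect(self, worker_id, connection):
        # A reconnect silently replaces the previously stored connection.
        self.connection_by_worker[worker_id] = connection

    def on_close(self, connection):
        # Look up which worker the closing connection belongs to.
        for worker_id, known in self.connection_by_worker.items():
            if known is connection:
                del self.connection_by_worker[worker_id]
                return worker_id
        self.errors.append(
            "Worker not found for given connection during connection close"
        )
        return None


def simulate_early_reconnect():
    """Worker 1 reconnects before the close of its old connection is handled."""
    registry = ToyRegistry()
    old_connection, new_connection = object(), object()
    registry.on_connect(1, old_connection)
    registry.on_connect(1, new_connection)  # reconnect arrives first
    registry.on_close(old_connection)       # late close: no worker matches
    return registry.errors


print(simulate_early_reconnect())
```

In this model the same error also appears when a worker connects twice at the same time, because the second connection replaces the first in the lookup table.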
Updated by okurz over 5 years ago
mkittler wrote:
I could only reproduce this issue by configuring a worker to connect to the same web UI twice at the same time (which is unlikely to happen in production)
Can an admin really configure the worker this way by mistake, or would that require changing the code?
Updated by mkittler over 5 years ago
If you add a worker host twice in the config the worker will connect twice as if they were different hosts. No de-duplication happens. (I guess this was also the case before my worker restructuring. At least there was no explicit code for de-duplication.)
I could of course change it so the hostname/URL would be de-duplicated at least on string-level.
Updated by okurz over 5 years ago
I would rather have the worker die hard on this configuration error. Let's not try to be too smart in code when the admin messes up :)
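The two options discussed here, de-duplicating on string level versus failing hard, could look roughly as follows. This is a sketch in Python, not the worker's actual Perl code; it assumes the hosts are given as one whitespace-separated string, and the function names are hypothetical:

```python
def dedupe_hosts(host_setting):
    """Option 1: drop repeated hosts on string level, keeping the first occurrence."""
    return list(dict.fromkeys(host_setting.split()))


def require_unique_hosts(host_setting):
    """Option 2: refuse to start when a host is configured more than once."""
    hosts = host_setting.split()
    seen = set()
    duplicates = []
    for host in hosts:
        if host in seen and host not in duplicates:
            duplicates.append(host)
        seen.add(host)
    if duplicates:
        raise ValueError("host configured more than once: " + ", ".join(duplicates))
    return hosts


setting = "http://webui.example http://other.example http://webui.example"
print(dedupe_hosts(setting))
try:
    require_unique_hosts(setting)
except ValueError as error:
    print("refusing to start:", error)
```

Both variants compare plain strings only, so two different spellings of the same host (for example with and without a trailing slash) would still count as distinct.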
Updated by mkittler over 5 years ago
- Status changed from Feedback to Resolved
I've just had a brief look at the recent OSD logs and it is not happening anymore. I'd say making the worker fail in this case is a different issue.