action #41027
closedworker disconnects during cleanup
Description
Websocket noticed a disconnecting worker:
[2018-09-14T09:13:20.0190 CEST] [info] Worker 646 websocket connection closed - 1006
Looking what happened during that time:
Sep 14 09:03:12 openqaworker5 worker[19313]: [info] uploading serial_terminal.txt
Sep 14 09:03:13 openqaworker5 worker[19313]: [info] uploading worker-log.txt
Sep 14 09:13:21 openqaworker5 worker[19313]: [info] cleaning up 02047035-sle-12-Server-DVD-Incidents-x86_64-Build:6801:qpdf.1536848520-qam-gnome@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: GLOB(0x5b3cfb8)[error] Could not unlink 'serial0.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'autoinst-log.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'serial_terminal.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] got job 2047365: 02047365-sle-12-SP1-Server-DVD-Updates-x86_64-Build20180913-4-mau-qa_sharutils@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] +++ setup notes +++
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] start time: 2018-09-14 07:13:45
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] running on openqaworker5:8 (Linux 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC 2018 (7db1912) x86_64)
So the worker removed old files - and disconnected (I assume timeout) from websocket in that timeframe. No harm was
done, the job was already set to done, but the disconnect is worrysome in the log, so we should try to avoid this.
Updated by EDiGiacinto over 6 years ago
I suspect a bit more at this point that the cleanup is performing blocking actions that are exhausting the timeout of the ws connection
Updated by coolo over 6 years ago
very similiar is this case:
[2018-09-19T11:31:03.0624 CEST] [info] Worker 1123 websocket connection closed - 1006
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading vars.json
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading serial0.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading autoinst-log.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading worker-log.txt
Sep 19 11:31:15 openqaworker13 worker[8111]: [info] cleaning up 02072667-sle-15-SP1-Installer-DVD-x86_64-Build43.1-cryptlvm@uefi
Updated by coolo about 6 years ago
- Blocks action #41222: A lot of occurences of 'websocket connection closed' added
Updated by mkittler about 6 years ago
Since we're now using Minion on the worker, we could run the clean_pool
function as a Minion job and listen to its reap event to continue.
Updated by kraih about 6 years ago
1006 means that the connection was closed abnormally. So one side just closed the TCP connection without sending a CLOSE frame, most likely an inactivity timeout. Looking at the logs, it appears that it is not the cleanup but the file upload while loop that blocks the worker process (https://github.com/os-autoinst/openQA/blob/1393e2197db60bdffdb17179e7f44764f3fde053/lib/OpenQA/Worker/Jobs.pm#L284).
Updated by kraih about 6 years ago
Well, technically i suppose everything in _stop_job_2 blocks the event loop (https://github.com/os-autoinst/openQA/blob/1393e2197db60bdffdb17179e7f44764f3fde053/lib/OpenQA/Worker/Jobs.pm#L373). So, wouldn't it make sense to perform those operations in a job or subprocess?
Updated by coolo about 6 years ago
We we need to decouple the websocket listening from the work - and we already do subprocesses for some of the working parts. Not sure it helped to make the code easier to follow :(
Updated by mkittler about 6 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
So let's do the result upload via a minion task then.
WIP branch: https://github.com/Martchus/openQA/commits/upload_via_minion
Updated by mkittler almost 6 years ago
- Status changed from In Progress to Blocked
- Target version deleted (
Ready)
So I would say we postpone this until we decided our long-term goals regarding the worker. If we really re-write the worker from scratch then this problem should be kept in mind from the beginning (as stated in https://progress.opensuse.org/issues/46187#Biggest-problems-with-the-worker-code-right-now).
Updated by okurz over 5 years ago
- Category changed from 132 to Feature requests
Updated by okurz over 5 years ago
- Blocked by coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection added
Updated by okurz over 5 years ago
- Assignee set to okurz
I guess that means we are blocked by #47117 so I will pick up the ticket just to track the "blocker"
Updated by mkittler about 5 years ago
- Blocked by deleted (coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection)
Updated by mkittler about 5 years ago
- Status changed from Blocked to Resolved
- Assignee changed from okurz to kraih
- Target version set to Done
@kraih already said that "Looking at the logs, it appears that it is not the cleanup but the file upload while loop that blocks the worker process." And that also makes perfectly sense because the cleanup should really not take that long and the upload on the other hand might do. But that's already fixed by @kraih's recent changes. So I'm marking this as resolved and remove the blocker.
Updated by okurz about 5 years ago
- Related to action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade added