action #41027
closed
worker disconnects during cleanup
Added by coolo about 6 years ago.
Updated about 5 years ago.
Category:
Feature requests
Description
Websocket noticed a disconnecting worker:
[2018-09-14T09:13:20.0190 CEST] [info] Worker 646 websocket connection closed - 1006
Looking what happened during that time:
Sep 14 09:03:12 openqaworker5 worker[19313]: [info] uploading serial_terminal.txt
Sep 14 09:03:13 openqaworker5 worker[19313]: [info] uploading worker-log.txt
Sep 14 09:13:21 openqaworker5 worker[19313]: [info] cleaning up 02047035-sle-12-Server-DVD-Incidents-x86_64-Build:6801:qpdf.1536848520-qam-gnome@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: GLOB(0x5b3cfb8)[error] Could not unlink 'serial0.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'autoinst-log.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'serial_terminal.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] got job 2047365: 02047365-sle-12-SP1-Server-DVD-Updates-x86_64-Build20180913-4-mau-qa_sharutils@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] +++ setup notes +++
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] start time: 2018-09-14 07:13:45
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] running on openqaworker5:8 (Linux 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC 2018 (7db1912) x86_64)
So the worker removed old files - and disconnected (I assume timeout) from websocket in that timeframe. No harm was
done, the job was already set to done, but the disconnect is worrysome in the log, so we should try to avoid this.
I suspect a bit more at this point that the cleanup is performing blocking actions that are exhausting the timeout of the ws connection
very similiar is this case:
[2018-09-19T11:31:03.0624 CEST] [info] Worker 1123 websocket connection closed - 1006
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading vars.json
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading serial0.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading autoinst-log.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading worker-log.txt
Sep 19 11:31:15 openqaworker13 worker[8111]: [info] cleaning up 02072667-sle-15-SP1-Installer-DVD-x86_64-Build43.1-cryptlvm@uefi
- Blocks action #41222: A lot of occurences of 'websocket connection closed' added
Since we're now using Minion on the worker, we could run the clean_pool
function as a Minion job and listen to its reap event to continue.
We we need to decouple the websocket listening from the work - and we already do subprocesses for some of the working parts. Not sure it helped to make the code easier to follow :(
- Status changed from New to In Progress
- Assignee set to mkittler
- Status changed from In Progress to Blocked
- Target version deleted (
Ready)
- Assignee deleted (
mkittler)
- Category changed from 132 to Feature requests
- Blocked by coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection added
I guess that means we are blocked by #47117 so I will pick up the ticket just to track the "blocker"
- Blocked by deleted (coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection)
- Status changed from Blocked to Resolved
- Assignee changed from okurz to kraih
- Target version set to Done
@kraih already said that "Looking at the logs, it appears that it is not the cleanup but the file upload while loop that blocks the worker process." And that also makes perfectly sense because the cleanup should really not take that long and the upload on the other hand might do. But that's already fixed by @kraih's recent changes. So I'm marking this as resolved and remove the blocker.
- Related to action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade added
Also available in: Atom
PDF