Project

General

Profile

Actions

action #41027

closed

worker disconnects during cleanup

Added by coolo over 6 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2018-09-14
Due date:
% Done:

0%

Estimated time:

Description

Websocket noticed a disconnecting worker:

[2018-09-14T09:13:20.0190 CEST] [info] Worker 646 websocket connection closed - 1006

Looking what happened during that time:

Sep 14 09:03:12 openqaworker5 worker[19313]: [info] uploading serial_terminal.txt
Sep 14 09:03:13 openqaworker5 worker[19313]: [info] uploading worker-log.txt
Sep 14 09:13:21 openqaworker5 worker[19313]: [info] cleaning up 02047035-sle-12-Server-DVD-Incidents-x86_64-Build:6801:qpdf.1536848520-qam-gnome@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: GLOB(0x5b3cfb8)[error] Could not unlink 'serial0.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'autoinst-log.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'serial_terminal.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] got job 2047365: 02047365-sle-12-SP1-Server-DVD-Updates-x86_64-Build20180913-4-mau-qa_sharutils@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] +++ setup notes +++
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] start time: 2018-09-14 07:13:45
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] running on openqaworker5:8 (Linux 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC 2018 (7db1912) x86_64)

So the worker removed old files - and disconnected (I assume timeout) from websocket in that timeframe. No harm was
done, the job was already set to done, but the disconnect is worrysome in the log, so we should try to avoid this.


Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgradeResolvedokurz2019-10-022020-04-09

Actions
Blocks openQA Project (public) - action #41222: A lot of occurences of 'websocket connection closed'Resolvedokurz2018-09-19

Actions
Actions #1

Updated by EDiGiacinto over 6 years ago

I suspect a bit more at this point that the cleanup is performing blocking actions that are exhausting the timeout of the ws connection

Actions #2

Updated by coolo over 6 years ago

very similiar is this case:

[2018-09-19T11:31:03.0624 CEST] [info] Worker 1123 websocket connection closed - 1006
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading vars.json
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading serial0.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading autoinst-log.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading worker-log.txt
Sep 19 11:31:15 openqaworker13 worker[8111]: [info] cleaning up 02072667-sle-15-SP1-Installer-DVD-x86_64-Build43.1-cryptlvm@uefi
Actions #3

Updated by coolo about 6 years ago

  • Blocks action #41222: A lot of occurences of 'websocket connection closed' added
Actions #4

Updated by mkittler about 6 years ago

Since we're now using Minion on the worker, we could run the clean_pool function as a Minion job and listen to its reap event to continue.

Actions #5

Updated by kraih about 6 years ago

1006 means that the connection was closed abnormally. So one side just closed the TCP connection without sending a CLOSE frame, most likely an inactivity timeout. Looking at the logs, it appears that it is not the cleanup but the file upload while loop that blocks the worker process (https://github.com/os-autoinst/openQA/blob/1393e2197db60bdffdb17179e7f44764f3fde053/lib/OpenQA/Worker/Jobs.pm#L284).

Actions #6

Updated by kraih about 6 years ago

Well, technically i suppose everything in _stop_job_2 blocks the event loop (https://github.com/os-autoinst/openQA/blob/1393e2197db60bdffdb17179e7f44764f3fde053/lib/OpenQA/Worker/Jobs.pm#L373). So, wouldn't it make sense to perform those operations in a job or subprocess?

Actions #7

Updated by coolo about 6 years ago

We we need to decouple the websocket listening from the work - and we already do subprocesses for some of the working parts. Not sure it helped to make the code easier to follow :(

Actions #8

Updated by mkittler about 6 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

So let's do the result upload via a minion task then.

WIP branch: https://github.com/Martchus/openQA/commits/upload_via_minion

Actions #9

Updated by mkittler almost 6 years ago

  • Status changed from In Progress to Blocked
  • Target version deleted (Ready)

So I would say we postpone this until we decided our long-term goals regarding the worker. If we really re-write the worker from scratch then this problem should be kept in mind from the beginning (as stated in https://progress.opensuse.org/issues/46187#Biggest-problems-with-the-worker-code-right-now).

Actions #10

Updated by mkittler almost 6 years ago

  • Assignee deleted (mkittler)
Actions #11

Updated by okurz over 5 years ago

  • Category changed from 132 to Feature requests
Actions #12

Updated by okurz over 5 years ago

  • Blocked by coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection added
Actions #13

Updated by okurz over 5 years ago

  • Assignee set to okurz

I guess that means we are blocked by #47117 so I will pick up the ticket just to track the "blocker"

Actions #14

Updated by mkittler about 5 years ago

  • Blocked by deleted (coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection)
Actions #15

Updated by mkittler about 5 years ago

  • Status changed from Blocked to Resolved
  • Assignee changed from okurz to kraih
  • Target version set to Done

@kraih already said that "Looking at the logs, it appears that it is not the cleanup but the file upload while loop that blocks the worker process." And that also makes perfectly sense because the cleanup should really not take that long and the upload on the other hand might do. But that's already fixed by @kraih's recent changes. So I'm marking this as resolved and remove the blocker.

Actions #16

Updated by okurz about 5 years ago

  • Related to action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade added
Actions

Also available in: Atom PDF