action #41027

worker disconnects during cleanup

Added by coolo over 1 year ago. Updated 5 months ago.

Status:ResolvedStart date:14/09/2018
Priority:NormalDue date:
Assignee:kraih% Done:

0%

Category:Feature requests
Target version:Done
Difficulty:medium
Duration:

Description

Websocket noticed a disconnecting worker:

[2018-09-14T09:13:20.0190 CEST] [info] Worker 646 websocket connection closed - 1006

Looking what happened during that time:

Sep 14 09:03:12 openqaworker5 worker[19313]: [info] uploading serial_terminal.txt
Sep 14 09:03:13 openqaworker5 worker[19313]: [info] uploading worker-log.txt
Sep 14 09:13:21 openqaworker5 worker[19313]: [info] cleaning up 02047035-sle-12-Server-DVD-Incidents-x86_64-Build:6801:qpdf.1536848520-qam-gnome@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: GLOB(0x5b3cfb8)[error] Could not unlink 'serial0.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'autoinst-log.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [error] Could not unlink 'serial_terminal.txt': No such file or directory
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] got job 2047365: 02047365-sle-12-SP1-Server-DVD-Updates-x86_64-Build20180913-4-mau-qa_sharutils@64bit
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] +++ setup notes +++
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] start time: 2018-09-14 07:13:45
Sep 14 09:13:45 openqaworker5 worker[19313]: [info] running on openqaworker5:8 (Linux 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC 2018 (7db1912) x86_64)

So the worker removed old files - and disconnected (I assume timeout) from websocket in that timeframe. No harm was
done, the job was already set to done, but the disconnect is worrysome in the log, so we should try to avoid this.


Related issues

Related to openQA Project - action #57620: job is incomplete if websockets server (or webui?) is unr... Workable 02/10/2019
Blocks openQA Project - action #41222: A lot of occurences of 'websocket connection closed' New 19/09/2018

History

#1 Updated by EDiGiacinto over 1 year ago

I suspect a bit more at this point that the cleanup is performing blocking actions that are exhausting the timeout of the ws connection

#2 Updated by coolo over 1 year ago

very similiar is this case:

[2018-09-19T11:31:03.0624 CEST] [info] Worker 1123 websocket connection closed - 1006
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading vars.json
Sep 19 11:31:01 openqaworker13 worker[8111]: [info] uploading serial0.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading autoinst-log.txt
Sep 19 11:31:03 openqaworker13 worker[8111]: [info] uploading worker-log.txt
Sep 19 11:31:15 openqaworker13 worker[8111]: [info] cleaning up 02072667-sle-15-SP1-Installer-DVD-x86_64-Build43.1-cryptlvm@uefi

#3 Updated by coolo over 1 year ago

  • Blocks action #41222: A lot of occurences of 'websocket connection closed' added

#4 Updated by mkittler over 1 year ago

Since we're now using Minion on the worker, we could run the clean_pool function as a Minion job and listen to its reap event to continue.

#5 Updated by kraih about 1 year ago

1006 means that the connection was closed abnormally. So one side just closed the TCP connection without sending a CLOSE frame, most likely an inactivity timeout. Looking at the logs, it appears that it is not the cleanup but the file upload while loop that blocks the worker process (https://github.com/os-autoinst/openQA/blob/1393e2197db60bdffdb17179e7f44764f3fde053/lib/OpenQA/Worker/Jobs.pm#L284).

#6 Updated by kraih about 1 year ago

Well, technically i suppose everything in _stop_job_2 blocks the event loop (https://github.com/os-autoinst/openQA/blob/1393e2197db60bdffdb17179e7f44764f3fde053/lib/OpenQA/Worker/Jobs.pm#L373). So, wouldn't it make sense to perform those operations in a job or subprocess?

#7 Updated by coolo about 1 year ago

We we need to decouple the websocket listening from the work - and we already do subprocesses for some of the working parts. Not sure it helped to make the code easier to follow :(

#8 Updated by mkittler about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

So let's do the result upload via a minion task then.

WIP branch: https://github.com/Martchus/openQA/commits/upload_via_minion

#9 Updated by mkittler about 1 year ago

  • Status changed from In Progress to Blocked
  • Target version deleted (Ready)

So I would say we postpone this until we decided our long-term goals regarding the worker. If we really re-write the worker from scratch then this problem should be kept in mind from the beginning (as stated in https://progress.opensuse.org/issues/46187#Biggest-problems-with-the-worker-code-right-now).

#10 Updated by mkittler about 1 year ago

  • Assignee deleted (mkittler)

#11 Updated by okurz 8 months ago

  • Category changed from 132 to Feature requests

#12 Updated by okurz 7 months ago

  • Blocked by action #47117: EPIC: Fix worker->websocket->scheduler->webui connection added

#13 Updated by okurz 7 months ago

  • Assignee set to okurz

I guess that means we are blocked by #47117 so I will pick up the ticket just to track the "blocker"

#14 Updated by mkittler 5 months ago

  • Blocked by deleted (action #47117: EPIC: Fix worker->websocket->scheduler->webui connection)

#15 Updated by mkittler 5 months ago

  • Status changed from Blocked to Resolved
  • Assignee changed from okurz to kraih
  • Target version set to Done

@kraih already said that "Looking at the logs, it appears that it is not the cleanup but the file upload while loop that blocks the worker process." And that also makes perfectly sense because the cleanup should really not take that long and the upload on the other hand might do. But that's already fixed by @kraih's recent changes. So I'm marking this as resolved and remove the blocker.

#16 Updated by okurz 5 months ago

  • Related to action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade added

Also available in: Atom PDF