action #41015

closed

Don't use livehandler if no developer looks at it

Added by coolo over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2018-09-14
Due date:
% Done: 0%
Estimated time:

Description

Next morning, next outage :(

error_log was filling up with errors from attempts to access the live handler port,
and it's no surprise, as the live handler was dead (and we have no idea what's
blocking it):

openqa:/home/coolo # strace -p 28474 -f
Process 28474 attached
restart_syscall(<... resuming interrupted call ...>CProcess 28474 detached


Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #38510: Allow os-autoinst to pause on next assert_screen timeout (Resolved, mkittler, 2018-07-18)

Has duplicate openQA Project - action #41042: [tools][osd] "isos post" from rsync.pl aborted with "Use of uninitialized value in concatenation (.) or string at /opt/openqa-scripts/rsync.pl line 998. error scheduling 502 Proxy Error" (Resolved, okurz, 2018-09-14)

Actions #1

Updated by coolo over 5 years ago

  • Assignee set to mkittler

Please make sure the live handler is only involved when jobs are monitored - that was the premise of this separate service. It's not supposed to break the nightly service - even if the live handler itself is broken.

Actions #2

Updated by coolo over 5 years ago

  • Related to action #38510: Allow os-autoinst to pause on next assert_screen timeout added
Actions #3

Updated by mkittler over 5 years ago

https://github.com/os-autoinst/openQA/commit/7a97302b8a42dcaedfb34fd60a04efea0b08bc7c should prevent the immediate problem when the livehandler isn't reachable.

But yes, it would be nice if the worker would only post the upload progress if someone is watching the test. I could just use the existing has_logviewers for this.

Only problem would be the following sequence of events:

  1. Nobody is watching the job (e.g. the developer closed the tab).
  2. The job is paused due to assert_screen timeout.
  3. The developer opens the tab again. The upload progress hasn't been posted by the worker so the needle editor is not offered although the latest screenshot would be ready.

Not sure how to solve this in an elegant way. Actually I wanted to keep the worker out of it as much as possible. The problem is that the worker is responsible for uploading the test artifacts and hence is the only one that knows when the latest screenshot is ready.
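The sequence above can be sketched as a small simulation. This is only an illustration under stated assumptions, not openQA code: the `LiveHandler` and `Worker` classes, their methods, and the screenshot name are hypothetical; only the `has_logviewers` flag is a name taken from this ticket.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LiveHandler:
    """Hypothetical stand-in for the live handler service."""
    latest_screenshot: Optional[str] = None

    def post_upload_progress(self, screenshot: str) -> None:
        self.latest_screenshot = screenshot

    def needle_editor_offered(self) -> bool:
        # The editor can only be offered for a screenshot the handler knows about.
        return self.latest_screenshot is not None


@dataclass
class Worker:
    """Hypothetical worker that gates the progress post on has_logviewers."""
    handler: LiveHandler
    has_logviewers: bool = False

    def upload_artifacts(self, screenshot: str) -> None:
        # The artifact upload itself always happens (omitted here);
        # only the extra progress post is skipped when nobody is watching.
        if self.has_logviewers:
            self.handler.post_upload_progress(screenshot)


handler = LiveHandler()
worker = Worker(handler)
worker.upload_artifacts("assert_screen-timeout.png")  # steps 1 and 2: nobody watching, job pauses
worker.has_logviewers = True                           # step 3: the developer reopens the tab
print(handler.needle_editor_offered())                 # prints False
```

In this model the needle editor only becomes available after the next upload, which a paused job may never perform.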

On the other hand, what would be the big benefit of saving that post call? It is only a small extra cost on top of uploading the artifacts. And now that should actually be true, because the worker shouldn't be endlessly trying the same post again and again in the error case.
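A minimal sketch of the "try once, don't retry" idea, assuming the post is wrapped in a callable. It is not taken from the linked commit; `post_upload_progress_once` and the `send` parameter are hypothetical names used for illustration only.

```python
from typing import Callable


def post_upload_progress_once(send: Callable[[], None]) -> bool:
    """Try the progress post exactly once and report failure instead of retrying."""
    try:
        send()
    except OSError as error:
        # OSError covers refused connections and socket timeouts.
        print(f"live handler not reachable, skipping progress post: {error}")
        return False
    return True


attempts = []


def unreachable_live_handler() -> None:
    # Hypothetical sender that behaves like a dead live handler.
    attempts.append(1)
    raise ConnectionRefusedError("connection refused")


posted = post_upload_progress_once(unreachable_live_handler)
```

The artifact upload is unaffected by the result; the caller simply carries on without the progress post.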

Actions #4

Updated by coolo over 5 years ago

Your commit does not limit the problem well enough - because you still pile up apache workers waiting for the backend to
be reachable.

And I don't care too much about developers closing tabs - as soon as one developer has looked at a job, it's fine to
use the live handler for it. But what we should avoid is that jobs which are just part of the mass of jobs touch unnecessary parts.
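One way to read this rule is as a sticky per-job flag: it is set when a developer first opens the job and is never cleared when the tab is closed. The sketch below only illustrates that reading; the `Job` class, the `developer_session_opened` attribute, and the helper function are hypothetical and not openQA's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical job record with a sticky 'a developer looked at it' flag."""
    job_id: int
    developer_session_opened: bool = False

    def open_developer_session(self) -> None:
        self.developer_session_opened = True

    def close_tab(self) -> None:
        # Deliberately does not reset the flag: once looked at, always eligible.
        pass


def jobs_using_live_handler(jobs: List[Job]) -> List[int]:
    """Return the IDs of jobs that are allowed to talk to the live handler."""
    return [job.job_id for job in jobs if job.developer_session_opened]


jobs = [Job(job_id) for job_id in range(1, 6)]
jobs[2].open_developer_session()  # a developer looks at job 3
jobs[2].close_tab()               # closing the tab changes nothing
print(jobs_using_live_handler(jobs))  # prints [3]
```

With this gating, the four jobs nobody opened never contact the live handler, so a stuck live handler cannot affect them.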

Actions #5

Updated by coolo over 5 years ago

  • Has duplicate action #41042: [tools][osd] "isos post" from rsync.pl aborted with "Use of uninitialized value in concatenation (.) or string at /opt/openqa-scripts/rsync.pl line 998. error scheduling 502 Proxy Error" added
Actions #6

Updated by coolo over 5 years ago

  • Subject changed from livehandler is stuck to Don't use livehandler if no developer looks at it
  • Target version changed from Ready to Current Sprint

The actual problem might be a duplicate of another issue, but let's use this ticket to ease the load.

Actions #7

Updated by mkittler over 5 years ago

  • Status changed from New to In Progress

PR for sending the updates only if a developer session has been opened: https://github.com/os-autoinst/openQA/pull/1789

Actions #8

Updated by coolo over 5 years ago

  • Status changed from In Progress to Resolved

merged and deployed

Actions #9

Updated by szarate over 5 years ago

  • Target version changed from Current Sprint to Done