Project

General

Profile

Actions

action #163757

closed

Prevent live view viewers from making openQA unresponsive

Added by mkittler 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-07-11
Due date:
2024-07-30
% Done:

0%

Estimated time:

Description

Motivation

We noticed that OSD is really unresponsive at times. On a closer look via strace -p … with the PIDs of the Mojolicious workers we saw yesterday that almost all workers were busy serving the live view image leaving too few other workers for an otherwise response web application.

This situation can be easily simulated locally by running openQA via script/openqa prefork -c 1 -w n where n is a low number (in production we use 30). When then opening n live views no other requests can be served anymore.

Acceptance criteria

  • AC1: Live view viewers do not make openQA generally unresponsive on production instances

Further remarks

I have already implemented a change that will help, see https://github.com/os-autoinst/openQA/pull/5754. We now need to see how well it works in production (after my local tests already showed that the problem cannot longer be reproduced as mentioned under "Motivation").


Related issues 2 (0 open2 closed)

Related to openQA Infrastructure - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:MResolvedokurz2024-07-10

Actions
Related to openQA Project - action #163931: OpenQA logreport for ariel.suse-dmz.opensuse.org Can't locate object method render_specific_not_found via package OpenQA::Shared::Controller::RunningResolvedtinita2024-07-31

Actions
Actions #1

Updated by livdywan 4 months ago

  • Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #2

Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback
Actions #3

Updated by mkittler 4 months ago

The change has been deployed on o3 and works. I suppose it'll be deployed tomorrow on OSD.

Actions #4

Updated by livdywan 4 months ago

mkittler wrote in #note-3:

The change has been deployed on o3 and works. I suppose it'll be deployed tomorrow on OSD.

Deployments on OSD are not currently enabled. See #163592.

Actions #5

Updated by okurz 4 months ago

AFAIR we asked you in the daily to deploy it today

Actions #6

Updated by okurz 4 months ago

  • Status changed from Feedback to In Progress

I enabled deployment again on https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules and triggered a pipeline: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1209111 . Please actively monitor that.

Actions #7

Updated by mkittler 4 months ago

  • Status changed from In Progress to Resolved

The PR has been deployed and it seems to work. So I'm resolving this specific issue considering we still have #163592. Also opening many (40+ tabs) keeps working.

Actions #8

Updated by okurz 4 months ago

  • Related to action #163931: OpenQA logreport for ariel.suse-dmz.opensuse.org Can't locate object method render_specific_not_found via package OpenQA::Shared::Controller::Running added
Actions #9

Updated by livdywan 4 months ago

  • Status changed from Resolved to Workable

From OpenQA logreport for ariel.suse-dmz.opensuse.org:

[2024-07-15T11:58:22.581979Z] [error] [pid:7665] Unable to ask worker 1028 to start providing livestream for 4342852: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.

[2024-07-15T11:58:22.584250Z] [error] [pid:7665] Unable to ask worker 1028 to stop providing livestream for 4342852: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.

[2024-07-15T13:20:27.111415Z] [error] [pid:7516] Unable to ask worker 1070 to start providing livestream for 4343079: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.

[2024-07-15T13:20:27.113682Z] [error] [pid:7516] Unable to ask worker 1070 to stop providing livestream for 4343079: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.

[2024-07-15T13:20:29.132402Z] [error] [pid:7516] Unable to ask worker 1084 to start providing livestream for 4343062: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.

[2024-07-15T13:20:29.138160Z] [error] [pid:7516] Unable to ask worker 1084 to stop providing livestream for 4343062: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.
Actions #10

Updated by mkittler 4 months ago

  • Status changed from Workable to In Progress

This error can generally be logged when opening/closing the live view. However, I can't reproduce this locally except by shortly stopping the websocket server and then switching between "Live" and other tabs. I also cannot reproduce it on o3 itself. The live view is working and this line is not being logged right now.

I suppose it was logged on o3 because the web socket server was really not reachable (or overloaded?). But this has nothing to do with moving the code to the live view handler. It works just fine within the live view handler. The only problem I see is that this request to the web socket server is blocking. So if the web socket server is slow it will also slow down the live view handler.

Actions #11

Updated by mkittler 4 months ago

PR for avoiding the blocking call: https://github.com/os-autoinst/openQA/pull/5766

With this PR I also extended unit tests for the error case mentioned in #163757#note-9. However, this is simply a case this code might run into and not really related to this ticket. Maybe we have also seen these error messages in the past.

Actions #12

Updated by openqa_review 4 months ago

  • Due date set to 2024-07-30

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by mkittler 4 months ago · Edited

  • Status changed from In Progress to Feedback

Maybe this change increased the likelihood of #163757#note-9 because the live handler daemon usually starts faster than the main web UI (which has to fork 30 processes) so it is more likely to be already running when the web socket server is not ready yet.

I created https://github.com/os-autoinst/openQA/pull/5767 to fix dependencies on systemd level and express the new dependency of the live handler on the web socket server explicitly.

This will maybe help with #163757#note-9 but is of course not perfect because we have not implemented notifying systemd when our services are actually ready.

(Judging by grep -i 'Unable to ask worker' /var/log/openqa there was no further occurrence of those error messages anyway.)

Actions #14

Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

I am resolving this ticket now again. If this error occurs again and it is considered too noisy we can either work on systemd dependency handling, making it a warning instead or just ignoring it in logwarn. Maybe it would also make sense to add a retry but it is async code and we should not just put a sleep in livehandler code so this is not going to be completely straight forward.

Actions

Also available in: Atom PDF