action #163757
closed
Prevent live view viewers from making openQA unresponsive
Added by mkittler 6 months ago.
Updated 6 months ago.
Category:
Feature requests
Description
Motivation¶
We noticed that OSD is really unresponsive at times. On a closer look via strace -p …
with the PIDs of the Mojolicious workers we saw yesterday that almost all workers were busy serving the live view image leaving too few other workers for an otherwise response web application.
This situation can be easily simulated locally by running openQA via script/openqa prefork -c 1 -w n
where n
is a low number (in production we use 30). When then opening n
live views no other requests can be served anymore.
Acceptance criteria¶
- AC1: Live view viewers do not make openQA generally unresponsive on production instances
Further remarks¶
I have already implemented a change that will help, see https://github.com/os-autoinst/openQA/pull/5754. We now need to see how well it works in production (after my local tests already showed that the problem cannot longer be reproduced as mentioned under "Motivation").
- Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
- Status changed from In Progress to Feedback
The change has been deployed on o3 and works. I suppose it'll be deployed tomorrow on OSD.
mkittler wrote in #note-3:
The change has been deployed on o3 and works. I suppose it'll be deployed tomorrow on OSD.
Deployments on OSD are not currently enabled. See #163592.
AFAIR we asked you in the daily to deploy it today
- Status changed from Feedback to In Progress
- Status changed from In Progress to Resolved
The PR has been deployed and it seems to work. So I'm resolving this specific issue considering we still have #163592. Also opening many (40+ tabs) keeps working.
- Related to action #163931: OpenQA logreport for ariel.suse-dmz.opensuse.org Can't locate object method render_specific_not_found via package OpenQA::Shared::Controller::Running added
- Status changed from Resolved to Workable
From OpenQA logreport for ariel.suse-dmz.opensuse.org:
[2024-07-15T11:58:22.581979Z] [error] [pid:7665] Unable to ask worker 1028 to start providing livestream for 4342852: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.
[2024-07-15T11:58:22.584250Z] [error] [pid:7665] Unable to ask worker 1028 to stop providing livestream for 4342852: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.
[2024-07-15T13:20:27.111415Z] [error] [pid:7516] Unable to ask worker 1070 to start providing livestream for 4343079: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.
[2024-07-15T13:20:27.113682Z] [error] [pid:7516] Unable to ask worker 1070 to stop providing livestream for 4343079: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.
[2024-07-15T13:20:29.132402Z] [error] [pid:7516] Unable to ask worker 1084 to start providing livestream for 4343062: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.
[2024-07-15T13:20:29.138160Z] [error] [pid:7516] Unable to ask worker 1084 to stop providing livestream for 4343062: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 40.
- Status changed from Workable to In Progress
This error can generally be logged when opening/closing the live view. However, I can't reproduce this locally except by shortly stopping the websocket server and then switching between "Live" and other tabs. I also cannot reproduce it on o3 itself. The live view is working and this line is not being logged right now.
I suppose it was logged on o3 because the web socket server was really not reachable (or overloaded?). But this has nothing to do with moving the code to the live view handler. It works just fine within the live view handler. The only problem I see is that this request to the web socket server is blocking. So if the web socket server is slow it will also slow down the live view handler.
PR for avoiding the blocking call: https://github.com/os-autoinst/openQA/pull/5766
With this PR I also extended unit tests for the error case mentioned in #163757#note-9. However, this is simply a case this code might run into and not really related to this ticket. Maybe we have also seen these error messages in the past.
- Due date set to 2024-07-30
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
Maybe this change increased the likelihood of #163757#note-9 because the live handler daemon usually starts faster than the main web UI (which has to fork 30 processes) so it is more likely to be already running when the web socket server is not ready yet.
I created https://github.com/os-autoinst/openQA/pull/5767 to fix dependencies on systemd level and express the new dependency of the live handler on the web socket server explicitly.
This will maybe help with #163757#note-9 but is of course not perfect because we have not implemented notifying systemd when our services are actually ready.
(Judging by grep -i 'Unable to ask worker' /var/log/openqa
there was no further occurrence of those error messages anyway.)
- Status changed from Feedback to Resolved
I am resolving this ticket now again. If this error occurs again and it is considered too noisy we can either work on systemd dependency handling, making it a warning instead or just ignoring it in logwarn. Maybe it would also make sense to add a retry but it is async code and we should not just put a sleep in livehandler code so this is not going to be completely straight forward.
Also available in: Atom
PDF