action #159792
closedAdd better logging for 500 errors on websocket routes size:M
0%
Description
Observation¶
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=80&orgId=1&from=1714042970812&to=1714056541493
shows an alert condition (dashed red line)
https://mailman.suse.de/mlarch/SuSE/osd-admins/2024/osd-admins.2024.04/msg00148.html
is the corresponding alert which bundles two alerts and only the less significant one was commented on. We should still look into the 5xx HTTP response alert problem #159639
Affected routes include:
/api/v1/ws/3410
/liveviewhandler/tests/14146684/developer/ws-proxy/status
As there doesn't seem to be a proper error in the log, it should be made sure, that we get a usable error message indicating the error, file and line number where the error was raised.
Acceptance Criteria¶
- AC1: If a 500 error is logged by the reverse-proxy there is also a corresponding log message in the underlying service logs.
Suggestions¶
- Confirm how this is only happening on certain workers (see #159639#note-8)
- Maybe those machines are outdated? Consider updating the system
- 10.161.50.3
- 10.100.2.148
- Extend error handling and logging on likely relevant routes
- api/v1/ws
- liveviewhandler/tests/.../developer/ws-proxy/status
- Also keep in mind the openqa-websockets service (the worker connects to)
- Maybe the connection was lost (so the gateway could be reached and the websocket connection established but at some point the connection was lost) and Mojolicious doesn't show a good error message in that case
- Maybe this goes away after switching to NGINX (and also implementing the kind of monitoring for NGINX)
Updated by dheidler 5 months ago
- Copied from action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Updated by mkittler 5 months ago
There are really no errors in the lgos. However, normally the livehandler is expected to log errors in case of unhandled exceptions (leading to 500 errors). From when were the 500 responses according to Apache? Does the log of the livehandler go far enough so these errors would even be visible (right now it starts at Apr 17, so it goes 13 days into the past).
Updated by mkittler 5 months ago · Edited
I just simulated this locally by putting a die "foo"
somewhere in the code and an error was definitely logged:
script/openqa-livehandler daemon -m production
[2024-04-30 16:59:19.34722] [29935] [warn] Deprecated use of config key '[audit]: blacklist'. Use '[audit]: blocklist' instead
[info] Listening at "http://127.0.0.1:9528"
Web application available at http://127.0.0.1:9528
[info] Listening at "http://[::1]:9528"
Web application available at http://[::1]:9528
[debug] attempt to open ws proxy for job opensuse-Tumbleweed-DVD-x86_64-Build20240322-wireguard_client@64bit (4266) where URL to os-autoinst command server is unknown
[error] foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/LiveHandler/Controller/LiveViewHandler.pm line 66.
For this I invoked the web socket route mentioned in the ticket description like this:
wsdump ws://127.0.0.1:9528/liveviewhandler/tests/4266/developer/ws-proxy/status
Traceback (most recent call last):
…
File "/usr/lib/python3.11/site-packages/websocket/_handshake.py", line 150, in _get_resp_headers
raise WebSocketBadStatusException(
websocket._exceptions.WebSocketBadStatusException: Handshake status 500 Internal Server Error -+-+- {'connection': 'Upgrade', 'content-length': '14', 'content-type': 'text/html;charset=UTF-8', 'date': 'Tue, 30 Apr 2024 14:59:24 GMT', 'sec-websocket-accept': '…', 'server': 'Mojolicious (Perl)', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'upgrade': 'websocket'} -+-+- b'internal error'
(Without the -m production
the actual error is returned to the ws client instead of just "internal error".)
Updated by dheidler 5 months ago
- Copied from deleted (action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S)
Updated by dheidler 5 months ago
- Blocks action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Updated by openqa_review 5 months ago
- Due date set to 2024-05-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 5 months ago
- Status changed from In Progress to Workable
- Assignee deleted (
livdywan)
Just for transparency, I did not do much here since I prioritized looking into unrelated, urgent tickets. Will update here if I get to it, otherwise feel free to grab, no reason to block on me ;-)
Updated by mkittler 5 months ago
I did the same test as in #159792#note-4 with the web socket server and also here errors are definitely logged:
script/openqa-websockets daemon
[2024-05-15 17:17:21.69383] [13123] [warn] Deprecated use of config key '[audit]: blacklist'. Use '[audit]: blocklist' instead
[info] Listening at "http://127.0.0.1:9527"
Web application available at http://127.0.0.1:9527
[info] Listening at "http://[::1]:9527"
Web application available at http://[::1]:9527
Mojo::Reactor::Poll: I/O watcher failed: foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 58.
I connected via wsdump ws://127.0.0.1:9527/api/v1/ws/20
.
I also put the die
in a few other places and tested other routes. This changed the error message a little bit (e.g. to [error] foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 31.
) but there was still always an error logged.
Maybe those 500 errors came just from Apache after all? Maybe, once we setup the Grafana panel for NGINX, we don't see this discrepancy between 500 responses and actual errors being logged anymore?
Updated by okurz 4 months ago
- Status changed from Workable to Rejected
https://monitor.qa.suse.de/d/WebuiDb/webui-summary looks good now with nginx data coming in. It seems so far no 5xx errors. Unless we can't reproduce we shouldn't need to invest more here.
Updated by okurz 4 months ago
- Blocks deleted (action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S)
Updated by okurz 4 months ago
- Related to action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added