Project

General

Profile

Actions

action #159792

open

Add better logging for 500 errors on websocket routes size:M

Added by dheidler 15 days ago. Updated about 4 hours ago.

Status:
Workable
Priority:
High
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2024-04-26
Due date:
2024-05-28 (Due in 13 days)
% Done:

0%

Estimated time:
Tags:

Description

Observation

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=80&orgId=1&from=1714042970812&to=1714056541493
shows an alert condition (dashed red line)

https://mailman.suse.de/mlarch/SuSE/osd-admins/2024/osd-admins.2024.04/msg00148.html
is the corresponding alert which bundles two alerts and only the less significant one was commented on. We should still look into the 5xx HTTP response alert problemhttps://progress.opensuse.org/issues/159639

Affected routes include:

/api/v1/ws/3410
/liveviewhandler/tests/14146684/developer/ws-proxy/status

As there doesn't seem to be a proper error in the log, it should be made sure, that we get a usable error message indicating the error, file and line number where the error was raised.

Acceptance Criteria

  • AC1: If a 500 error is logged by the reverse-proxy there is also a corresponding log message in the underlying service logs.

Suggestions

  • Confirm how this is only happening on certain workers (see #159639#note-8)
    • Maybe those machines are outdated? Consider updating the system
    • 10.161.50.3
    • 10.100.2.148
  • Extend error handling and logging on likely relevant routes
    • api/v1/ws
    • liveviewhandler/tests/.../developer/ws-proxy/status
    • Also keep in mind the openqa-websockets service (the worker connects to)
  • Maybe the connection was lost (so the gateway could be reached and the websocket connection established but at some point the connection was lost) and Mojolicious doesn't show a good error message in that case
  • Maybe this goes away after switching to NGINX (and also implementing the kind of monitoring for NGINX)

Related issues 1 (1 open0 closed)

Blocks openQA Infrastructure - action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:SBlockeddheidler2024-04-26

Actions
Actions #1

Updated by dheidler 15 days ago

  • Copied from action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Actions #2

Updated by mkittler 15 days ago

There are really no errors in the lgos. However, normally the livehandler is expected to log errors in case of unhandled exceptions (leading to 500 errors). From when were the 500 responses according to Apache? Does the log of the livehandler go far enough so these errors would even be visible (right now it starts at Apr 17, so it goes 13 days into the past).

Actions #3

Updated by dheidler 15 days ago

Yes - they were from around 2024-04-25 (the integer at the output from influxdb that I included in the ticket where this was copied from is a unix based timestamp, if you remove enough digits on the right).

Actions #4

Updated by mkittler 15 days ago · Edited

I just simulated this locally by putting a die "foo" somewhere in the code and an error was definitely logged:

script/openqa-livehandler daemon -m production
[2024-04-30 16:59:19.34722] [29935] [warn] Deprecated use of config key '[audit]: blacklist'. Use '[audit]: blocklist' instead
[info] Listening at "http://127.0.0.1:9528"
Web application available at http://127.0.0.1:9528
[info] Listening at "http://[::1]:9528"
Web application available at http://[::1]:9528
[debug] attempt to open ws proxy for job opensuse-Tumbleweed-DVD-x86_64-Build20240322-wireguard_client@64bit (4266) where URL to os-autoinst command server is unknown
[error] foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/LiveHandler/Controller/LiveViewHandler.pm line 66.

For this I invoked the web socket route mentioned in the ticket description like this:

wsdump ws://127.0.0.1:9528/liveviewhandler/tests/4266/developer/ws-proxy/status
Traceback (most recent call last):
…
  File "/usr/lib/python3.11/site-packages/websocket/_handshake.py", line 150, in _get_resp_headers
    raise WebSocketBadStatusException(
websocket._exceptions.WebSocketBadStatusException: Handshake status 500 Internal Server Error -+-+- {'connection': 'Upgrade', 'content-length': '14', 'content-type': 'text/html;charset=UTF-8', 'date': 'Tue, 30 Apr 2024 14:59:24 GMT', 'sec-websocket-accept': '…', 'server': 'Mojolicious (Perl)', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'upgrade': 'websocket'} -+-+- b'internal error'

(Without the -m production the actual error is returned to the ws client instead of just "internal error".)

Actions #5

Updated by dheidler 9 days ago

  • Copied from deleted (action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S)
Actions #6

Updated by dheidler 9 days ago

  • Blocks action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Actions #7

Updated by okurz 9 days ago

  • Due date deleted (2024-05-14)
  • Category set to Feature requests
Actions #8

Updated by livdywan 8 days ago

  • Tags changed from infra, grafana, 5xx to 5xx
  • Subject changed from Add better logging for 500 errors on websocket routes to Add better logging for 500 errors on websocket routes size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by livdywan 2 days ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm taking a look

Actions #10

Updated by openqa_review 1 day ago

  • Due date set to 2024-05-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan about 10 hours ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

Just for transparency, I did not do much here since I prioritized looking into unrelated, urgent tickets. Will update here if I get to it, otherwise feel free to grab, no reason to block on me ;-)

Actions #12

Updated by mkittler about 4 hours ago

I did the same test as in #159792#note-4 with the web socket server and also here errors are definitely logged:

script/openqa-websockets daemon
[2024-05-15 17:17:21.69383] [13123] [warn] Deprecated use of config key '[audit]: blacklist'. Use '[audit]: blocklist' instead
[info] Listening at "http://127.0.0.1:9527"
Web application available at http://127.0.0.1:9527
[info] Listening at "http://[::1]:9527"
Web application available at http://[::1]:9527
Mojo::Reactor::Poll: I/O watcher failed: foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 58.

I connected via wsdump ws://127.0.0.1:9527/api/v1/ws/20.

I also put the die in a few other places and tested other routes. This changed the error message a little bit (e.g. to [error] foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 31.) but there was still always an error logged.


Maybe those 500 errors came just from Apache after all? Maybe, once we setup the Grafana panel for NGINX, we don't see this discrepancy between 500 responses and actual errors being logged anymore?

Actions

Also available in: Atom PDF