Project

General

Profile

Actions

action #159792

closed

Add better logging for 500 errors on websocket routes size:M

Added by dheidler 5 months ago. Updated 4 months ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-04-26
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=80&orgId=1&from=1714042970812&to=1714056541493
shows an alert condition (dashed red line)

https://mailman.suse.de/mlarch/SuSE/osd-admins/2024/osd-admins.2024.04/msg00148.html
is the corresponding alert which bundles two alerts and only the less significant one was commented on. We should still look into the 5xx HTTP response alert problem #159639

Affected routes include:

/api/v1/ws/3410
/liveviewhandler/tests/14146684/developer/ws-proxy/status

As there doesn't seem to be a proper error in the log, it should be made sure, that we get a usable error message indicating the error, file and line number where the error was raised.

Acceptance Criteria

  • AC1: If a 500 error is logged by the reverse-proxy there is also a corresponding log message in the underlying service logs.

Suggestions

  • Confirm how this is only happening on certain workers (see #159639#note-8)
    • Maybe those machines are outdated? Consider updating the system
    • 10.161.50.3
    • 10.100.2.148
  • Extend error handling and logging on likely relevant routes
    • api/v1/ws
    • liveviewhandler/tests/.../developer/ws-proxy/status
    • Also keep in mind the openqa-websockets service (the worker connects to)
  • Maybe the connection was lost (so the gateway could be reached and the websocket connection established but at some point the connection was lost) and Mojolicious doesn't show a good error message in that case
  • Maybe this goes away after switching to NGINX (and also implementing the kind of monitoring for NGINX)

Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:SResolveddheidler2024-04-26

Actions
Actions #1

Updated by dheidler 5 months ago

  • Copied from action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Actions #2

Updated by mkittler 5 months ago

There are really no errors in the lgos. However, normally the livehandler is expected to log errors in case of unhandled exceptions (leading to 500 errors). From when were the 500 responses according to Apache? Does the log of the livehandler go far enough so these errors would even be visible (right now it starts at Apr 17, so it goes 13 days into the past).

Actions #3

Updated by dheidler 5 months ago

Yes - they were from around 2024-04-25 (the integer at the output from influxdb that I included in the ticket where this was copied from is a unix based timestamp, if you remove enough digits on the right).

Actions #4

Updated by mkittler 5 months ago · Edited

I just simulated this locally by putting a die "foo" somewhere in the code and an error was definitely logged:

script/openqa-livehandler daemon -m production
[2024-04-30 16:59:19.34722] [29935] [warn] Deprecated use of config key '[audit]: blacklist'. Use '[audit]: blocklist' instead
[info] Listening at "http://127.0.0.1:9528"
Web application available at http://127.0.0.1:9528
[info] Listening at "http://[::1]:9528"
Web application available at http://[::1]:9528
[debug] attempt to open ws proxy for job opensuse-Tumbleweed-DVD-x86_64-Build20240322-wireguard_client@64bit (4266) where URL to os-autoinst command server is unknown
[error] foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/LiveHandler/Controller/LiveViewHandler.pm line 66.

For this I invoked the web socket route mentioned in the ticket description like this:

wsdump ws://127.0.0.1:9528/liveviewhandler/tests/4266/developer/ws-proxy/status
Traceback (most recent call last):
…
  File "/usr/lib/python3.11/site-packages/websocket/_handshake.py", line 150, in _get_resp_headers
    raise WebSocketBadStatusException(
websocket._exceptions.WebSocketBadStatusException: Handshake status 500 Internal Server Error -+-+- {'connection': 'Upgrade', 'content-length': '14', 'content-type': 'text/html;charset=UTF-8', 'date': 'Tue, 30 Apr 2024 14:59:24 GMT', 'sec-websocket-accept': '…', 'server': 'Mojolicious (Perl)', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'upgrade': 'websocket'} -+-+- b'internal error'

(Without the -m production the actual error is returned to the ws client instead of just "internal error".)

Actions #5

Updated by dheidler 5 months ago

  • Copied from deleted (action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S)
Actions #6

Updated by dheidler 5 months ago

  • Blocks action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Actions #7

Updated by okurz 5 months ago

  • Due date deleted (2024-05-14)
  • Category set to Feature requests
Actions #8

Updated by livdywan 5 months ago

  • Tags changed from infra, grafana, 5xx to 5xx
  • Subject changed from Add better logging for 500 errors on websocket routes to Add better logging for 500 errors on websocket routes size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by livdywan 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm taking a look

Actions #10

Updated by openqa_review 5 months ago

  • Due date set to 2024-05-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan 5 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

Just for transparency, I did not do much here since I prioritized looking into unrelated, urgent tickets. Will update here if I get to it, otherwise feel free to grab, no reason to block on me ;-)

Actions #12

Updated by mkittler 5 months ago

I did the same test as in #159792#note-4 with the web socket server and also here errors are definitely logged:

script/openqa-websockets daemon
[2024-05-15 17:17:21.69383] [13123] [warn] Deprecated use of config key '[audit]: blacklist'. Use '[audit]: blocklist' instead
[info] Listening at "http://127.0.0.1:9527"
Web application available at http://127.0.0.1:9527
[info] Listening at "http://[::1]:9527"
Web application available at http://[::1]:9527
Mojo::Reactor::Poll: I/O watcher failed: foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 58.

I connected via wsdump ws://127.0.0.1:9527/api/v1/ws/20.

I also put the die in a few other places and tested other routes. This changed the error message a little bit (e.g. to [error] foo at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/WebSockets/Controller/Worker.pm line 31.) but there was still always an error logged.


Maybe those 500 errors came just from Apache after all? Maybe, once we setup the Grafana panel for NGINX, we don't see this discrepancy between 500 responses and actual errors being logged anymore?

Actions #13

Updated by okurz 5 months ago

  • Due date deleted (2024-05-28)
Actions #14

Updated by okurz 5 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz
  • Priority changed from High to Normal

We switched to nginx so maybe we can't even reproduce those errors and have not encountered more user-facing bigger problems. let's wait for #159654 first

Actions #15

Updated by livdywan 4 months ago

  • Status changed from Blocked to Workable

okurz wrote in #note-14:

We switched to nginx so maybe we can't even reproduce those errors and have not encountered more user-facing bigger problems. let's wait for #159654 first

Blocker was resolved.

Actions #16

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #17

Updated by okurz 4 months ago

  • Status changed from Workable to Rejected

https://monitor.qa.suse.de/d/WebuiDb/webui-summary looks good now with nginx data coming in. It seems so far no 5xx errors. Unless we can't reproduce we shouldn't need to invest more here.

Actions #18

Updated by okurz 4 months ago

  • Blocks deleted (action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S)
Actions #19

Updated by okurz 4 months ago

  • Related to action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Actions

Also available in: Atom PDF