action #158808
coordination #127031: [saga][epic] openQA for SUSE customers
coordination #152955: [epic] Metric-driven project management in SUSE QE Tools team
Prevent HTTP response codes 500 as observed in OSD monitoring size:M
Description
Motivation
See #158550 and #158556. We introduced an alert based on 5xx HTTP responses and unexpectedly found that we have about 120 5xx HTTP responses every hour. We should identify why we have so many hits, fix the problem (whether it lies in openQA's behaviour or in a bug in the monitoring data) and then reduce the alert threshold accordingly.
Acceptance criteria
- AC1: The number of HTTP 5xx errors is reasonably low
- AC2: We know how many 500 errors we actually have (so our monitoring doesn't fool us)
Suggestions
- On OSD, run
grep '" \<500\> ' /var/log/apache2/access_log
which right now looks like this (a sketch for tallying these hits by hour and route follows after this list):
10.149.213.14 - - [10/Apr/2024:03:39:24 +0200] "GET /liveviewhandler/tests/13971445/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 363
2a07:de40:b2bf:1b::1117 - - [10/Apr/2024:08:47:18 +0200] "GET /liveviewhandler/tests/13991767/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36" 5079
10.149.213.14 - - [10/Apr/2024:11:52:14 +0200] "GET /liveviewhandler/tests/13993065/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 586
2a07:de40:b203:12:7ec2:55ff:fe24:de70 - - [10/Apr/2024:13:17:33 +0200] "POST /api/v1/mutex/support_server_ready?action=lock HTTP/1.1" 500 860 "-" "Mojolicious (Perl)" 2818
- Also, https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=80 shows only very few hits, so something is likely counted incorrectly in the alert query on https://monitor.qa.suse.de/alerting/grafana/d949dbae-8034-4bf4-8418-f148dfcaf89d/view?returnTo=%2Fd%2FWebuiDb%2Fwebui-summary%3ForgId%3D1%26viewPanel%3D80%26editPanel%3D80%26tab%3Dalert . It seems we are counting 0 values as well; we should only count non-zero values or adjust the query to only return 500 responses.
- Fix possible mistakes in the alert and panel queries
- Prevent the 500 errors as seen above and fix our monitoring accordingly
- After the value ending up in Grafana is reduced, adjust the alert threshold to a lower, sensible value
- Confirm that the HTTP status code tracking is correct, or whether it needs to be fixed first: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1146/diffs#e5dc019b71ceec2f7e2df28d3e06d72a110fb6b0_1648_1745
- Create follow-up tickets for fixing non-trivial causes of 500 responses
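A minimal sketch of such a tally, assuming the combined Apache log format shown in the excerpt above (status code in field 9, route in field 7); it groups 5xx hits by hour, status code and route so the biggest offenders stand out:
awk '$0 ~ /" 5[0-9][0-9] / {
      split($4, t, ":")            # $4 is the timestamp, e.g. "[10/Apr/2024:03:39:24"
      print t[1] ":" t[2], $9, $7  # date+hour, status code, route
    }' /var/log/apache2/access_log | sort | uniq -c | sort -rn | head
This uses the same '" 5xx ' pattern as the grep above, just widened from 500 to all 5xx codes, so the counts stay comparable with the alert query.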
Updated by okurz 7 months ago
- Copied from action #158556: Single-value SLI of OSD HTTP response code successful vs. all size:S added
Updated by mkittler 7 months ago
The way the alert was created was really wrong on many levels; see MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1151 for details and the fix.
With that we still have 138 5xx responses in the last 24 hours as of now. So despite the figure in the alert being computed very wrongly, the real value is actually similarly high. I'll see whether I also have to update panels (and then I can also link the panel here) and will then try to investigate why we have so many 500 errors. So far I can already tell that the biggest spike was yesterday around 2024-04-11T16:17Z. Maybe this was even a small outage.
Updated by mkittler 7 months ago
The MR was merged. I checked the panel and here the use of count() and the grouping make sense. The panel graph is also in line with the one in the alert and one can also see the spike mentioned in my previous comment: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=80&editPanel=80&from=1712710230000&to=1712822694301
Considering this is a spike of 503 errors specifically, it makes sense to assume this was just a small outage, so all the requests in that small timeframe simply ended up with a 503 response. I think this can be considered normal. I also checked the logs and the web UI service was really just restarting, which took only a few seconds:
Apr 10 16:17:03 openqa systemd[1]: Stopping The openQA web UI...
…
Apr 10 16:17:04 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
Apr 10 16:17:04 openqa systemd[1]: Stopped The openQA web UI.
Apr 10 16:17:04 openqa systemd[1]: Started The openQA web UI.
Apr 10 16:17:05 openqa openqa[11034]: [info] Loading external plugin AMQP
Apr 10 16:17:05 openqa openqa[11034]: [info] Loading external plugin ObsRsync
Apr 10 16:17:06 openqa openqa[11034]: [info] Listening at "http://127.0.0.1:9526"
Apr 10 16:17:06 openqa openqa[11034]: Web application available at http://127.0.0.1:9526
Apr 10 16:17:06 openqa openqa[11034]: [info] Listening at "http://[::1]:9526"
Apr 10 16:17:06 openqa openqa[11034]: Web application available at http://[::1]:9526
Considering that such a small outage already leads to over 60 503 responses, I'm not sure the alert makes sense with the threshold we set now. On a day where we have to restart the service more often it will most likely fire, but that would still not be concerning or actionable. Maybe this is something to discuss in the next infra daily.
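One way to double-check that such spikes really coincide with restarts (a sketch, assuming journal access on OSD and the access_log format from the description; the timestamp is the one from the excerpt above):
# list recent restarts of the web UI
journalctl -u openqa-webui --since "2024-04-10" | grep 'Stopping The openQA web UI'
# count the 5xx responses logged in the minute of the restart found above
grep '\[10/Apr/2024:16:17' /var/log/apache2/access_log | grep -c '" 5[0-9][0-9] '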
Updated by mkittler 7 months ago
A PR for a few problems I could spot in the logs: https://github.com/os-autoinst/openQA/pull/5574
Note that I mainly focused on the systemd journal logs. The problem with /var/log/apache2/access_log is that it only shows the routes but not what the error was (so it would involve some serious guesswork to make sense of them).
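For reference, a sketch of how one could pull the actual error messages from the journal around the time of a given 500 response, for example the mutex POST from the excerpt in the description (the unit names are assumptions: openqa-webui for the API routes and openqa-livehandler for the /liveviewhandler routes shown above; the time window is a placeholder):
# show warning and error level messages from the relevant services in that window
journalctl -u openqa-webui -u openqa-livehandler -p warning \
  --since "2024-04-10 13:15" --until "2024-04-10 13:20"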
Updated by openqa_review 7 months ago
- Due date set to 2024-04-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 7 months ago
- Status changed from In Progress to Resolved
The PR was merged.
The high number of 500 responses we see is mainly caused by spikes when restarting the service. Apart from that there are not that many 500 responses, and the PR fixes some of them.
The alert makes more sense now and the panel was good anyway. We might need to tweak the alert to handle situations where we restart the web UI a lot, but for now I'd consider this ticket resolved.