action #158808
coordination #127031: [saga][epic] openQA for SUSE customers
coordination #152955: [epic] Metric-driven project management in SUSE QE Tools team
Prevent HTTP response codes 500 as observed in OSD monitoring size:M
Description
Motivation
See #158550 and #158556. We introduced an alert based on 5xx HTTP responses and unexpectedly found that we have about 120 5xx HTTP responses every hour. We should identify why we have so many hits, fix the problem (whether it lies in openQA's behaviour or in a bug in the monitoring data) and then reduce the alert threshold accordingly.
Acceptance criteria
- AC1: The number of HTTP 5xx errors is reasonably low
- AC2: We know how many 500 errors we actually have (so our monitoring doesn't fool us)
Suggestions
- On OSD, run
grep '" \<500\> ' /var/log/apache2/access_log
which right now looks like this (a sketch for tallying these hits by hour and route follows after this list):
10.149.213.14 - - [10/Apr/2024:03:39:24 +0200] "GET /liveviewhandler/tests/13971445/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 363
2a07:de40:b2bf:1b::1117 - - [10/Apr/2024:08:47:18 +0200] "GET /liveviewhandler/tests/13991767/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36" 5079
10.149.213.14 - - [10/Apr/2024:11:52:14 +0200] "GET /liveviewhandler/tests/13993065/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 586
2a07:de40:b203:12:7ec2:55ff:fe24:de70 - - [10/Apr/2024:13:17:33 +0200] "POST /api/v1/mutex/support_server_ready?action=lock HTTP/1.1" 500 860 "-" "Mojolicious (Perl)" 2818
- Also, https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=80 shows only very few hits, so something is likely counted incorrectly in the alert query on https://monitor.qa.suse.de/alerting/grafana/d949dbae-8034-4bf4-8418-f148dfcaf89d/view?returnTo=%2Fd%2FWebuiDb%2Fwebui-summary%3ForgId%3D1%26viewPanel%3D80%26editPanel%3D80%26tab%3Dalert . It seems we are counting 0 values as well; we should only count non-zero values or adjust the query to only return 500 responses.
- Fix possible mistakes in the alert and panel queries
- Prevent the 500 errors as seen above and fix our monitoring accordingly
- After the value ending up in Grafana is reduced, adjust the alert threshold to a lower, sensible value
- Confirm that the HTTP status code tracking is correct, or whether it needs to be fixed first: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1146/diffs#e5dc019b71ceec2f7e2df28d3e06d72a110fb6b0_1648_1745
- Create follow-up tickets for fixing non-trivial causes of 500 responses
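A minimal sketch of such a tally, assuming the combined Apache log format shown in the excerpt above (status code in field 9, route in field 7); it groups 5xx hits by hour, status code and route so the biggest offenders stand out:
awk '$0 ~ /" 5[0-9][0-9] / {
      split($4, t, ":")            # $4 is the timestamp, e.g. "[10/Apr/2024:03:39:24"
      print t[1] ":" t[2], $9, $7  # date+hour, status code, route
    }' /var/log/apache2/access_log | sort | uniq -c | sort -rn | head
This uses the same '" 5xx ' pattern as the grep above, just widened from 500 to all 5xx codes, so the counts stay comparable with the alert query.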
Updated by okurz 7 months ago
- Copied from action #158556: Single-value SLI of OSD HTTP response code successful vs. all size:S added
Updated by mkittler 7 months ago
The way the alert was created was really wrong on many levels; see MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1151 for details and the fix.
With that we still have 138 5xx responses in the last 24 hours as of now. So despite the figure in the alert being computed very wrongly, the real value is actually similarly high. I'll see whether I also have to update panels (and then I can also link the panel here) and will then try to investigate why we have so many 500 errors. So far I can already tell that the biggest spike was yesterday around 2024-04-11T16:17Z. Maybe this was even a small outage.
Updated by mkittler 7 months ago
The MR was merged. I checked the panel and here the use of count() and the grouping make sense. The panel graph is also in line with the one in the alert and one can also see the spike mentioned in my previous comment: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=80&editPanel=80&from=1712710230000&to=1712822694301
Considering this is a spike of 503 errors specifically, it makes sense to assume this was just a small outage, so all the requests in that small timeframe simply ended up with a 503 response. I think this can be considered normal. I also checked the logs and the web UI service was really just restarting, which took only a few seconds:
Apr 10 16:17:03 openqa systemd[1]: Stopping The openQA web UI...
…
Apr 10 16:17:04 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
Apr 10 16:17:04 openqa systemd[1]: Stopped The openQA web UI.
Apr 10 16:17:04 openqa systemd[1]: Started The openQA web UI.
Apr 10 16:17:05 openqa openqa[11034]: [info] Loading external plugin AMQP
Apr 10 16:17:05 openqa openqa[11034]: [info] Loading external plugin ObsRsync
Apr 10 16:17:06 openqa openqa[11034]: [info] Listening at "http://127.0.0.1:9526"
Apr 10 16:17:06 openqa openqa[11034]: Web application available at http://127.0.0.1:9526
Apr 10 16:17:06 openqa openqa[11034]: [info] Listening at "http://[::1]:9526"
Apr 10 16:17:06 openqa openqa[11034]: Web application available at http://[::1]:9526
Considering that such a small outage already leads to over 60 503 responses, I'm not sure the alert makes sense with the threshold we set now. On a day where we have to restart the service more often it will most likely fire, but that would still not be concerning or actionable. Maybe this is something to discuss in the next infra daily.
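One way to double-check that such spikes really coincide with restarts (a sketch, assuming journal access on OSD and the access_log format from the description; the timestamp is the one from the excerpt above):
# list recent restarts of the web UI
journalctl -u openqa-webui --since "2024-04-10" | grep 'Stopping The openQA web UI'
# count the 5xx responses logged in the minute of the restart found above
grep '\[10/Apr/2024:16:17' /var/log/apache2/access_log | grep -c '" 5[0-9][0-9] '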
Updated by mkittler 7 months ago
A PR for a few problems I could spot in the logs: https://github.com/os-autoinst/openQA/pull/5574
Note that I mainly focused on the systemd journal logs. The problem with /var/log/apache2/access_log is that it only shows the routes but not what the error was (so it would involve some serious guesswork to make sense of them).
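For reference, a sketch of how one could pull the actual error messages from the journal around the time of a given 500 response, for example the mutex POST from the excerpt in the description (the unit names are assumptions: openqa-webui for the API routes and openqa-livehandler for the /liveviewhandler routes shown above; the time window is a placeholder):
# show warning and error level messages from the relevant services in that window
journalctl -u openqa-webui -u openqa-livehandler -p warning \
  --since "2024-04-10 13:15" --until "2024-04-10 13:20"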
Updated by openqa_review 7 months ago
- Due date set to 2024-04-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 7 months ago
- Status changed from In Progress to Resolved
The PR was merged.
The high number of 500 responses we see is mainly caused by spikes when restarting the service. Apart from that there are not that many 500 responses, and the PR fixes some of them.
The alert makes more sense now and the panel was good anyway. We might need to tweak the alert to handle situations where we restart the web UI a lot, but for now I'd consider this ticket resolved.