action #158808

closed

coordination #127031: [saga][epic] openQA for SUSE customers

coordination #152955: [epic] Metric-driven project management in SUSE QE Tools team

Prevent HTTP response codes 500 as observed in OSD monitoring size:M

Added by okurz 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See #158550 and #158556. We introduced an alert based on 5xx HTTP responses and unexpectedly found that we have about 120 5xx HTTP responses every hour. We should identify why we have so many hits, fix the problem, whether it is in openQA's behaviour or a bug in the monitoring data, and then reduce the alert threshold accordingly.

Acceptance criteria

  • AC1: The number of HTTP 5xx errors is reasonably low
  • AC2: We know how many 500 errors we actually have (so our monitoring doesn't fool us)

Suggestions

  • On OSD run grep '" \<500\> ' /var/log/apache2/access_log which right now yields lines like this (see the aggregation sketch after the example lines)
10.149.213.14 - - [10/Apr/2024:03:39:24 +0200] "GET /liveviewhandler/tests/13971445/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 363
2a07:de40:b2bf:1b::1117 - - [10/Apr/2024:08:47:18 +0200] "GET /liveviewhandler/tests/13991767/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36" 5079
10.149.213.14 - - [10/Apr/2024:11:52:14 +0200] "GET /liveviewhandler/tests/13993065/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 586
2a07:de40:b203:12:7ec2:55ff:fe24:de70 - - [10/Apr/2024:13:17:33 +0200] "POST /api/v1/mutex/support_server_ready?action=lock HTTP/1.1" 500 860 "-" "Mojolicious (Perl)" 2818
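
A minimal aggregation sketch for the suggestion above, assuming the standard Apache combined log format shown in the example lines (field 7 is the request path, field 9 the status code); this is only an illustration, not an existing script on OSD:

# Count 5xx responses per status code and route (query strings stripped) to see which endpoints fail most often
awk '$9 ~ /^5[0-9][0-9]$/ {split($7, p, "?"); count[$9 " " p[1]]++}
     END {for (k in count) print count[k], k}' /var/log/apache2/access_log | sort -rn | head -20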

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #158556: Single-value SLI of OSD HTTP response code successful vs. all size:S (Resolved, okurz, 2024-04-07)

Actions #1

Updated by okurz 8 months ago

  • Copied from action #158556: Single-value SLI of OSD HTTP response code successful vs. all size:S added
Actions #2

Updated by okurz 8 months ago

  • Subject changed from Prevent HTTP response codes 500 as observed in OSD monitoring to Prevent HTTP response codes 500 as observed in OSD monitoring size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler 8 months ago

The way the alert was created was really wrong on many levels; see MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1151 for details and the fix.

With that we still have 138 5xx responses in the last 24 hours as of now. So despite the figure in the alert being computed very wrongly, the real value is actually similarly high. I'll see whether I also have to update panels (and then I can also link the panel here) and will then try to investigate why we have so many 500 errors. So far I can already tell that the biggest spike was yesterday, around 2024-04-11T16:17Z. Maybe this was even a small outage.
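
To cross-check the monitoring figure against the raw data, a small sketch (the date is just an example, and this assumes the access_log format shown in the ticket description):

# Count 5xx responses per status code for a given day directly from the Apache access log
awk -v day='12/Apr/2024' '$4 ~ day && $9 ~ /^5[0-9][0-9]$/ {n[$9]++} END {for (s in n) print s, n[s]}' /var/log/apache2/access_log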

Actions #5

Updated by mkittler 8 months ago

The MR was merged. I checked the panel and here the use of count() and the grouping make sense. The panel graph is also in line with the one in the alert and one can also see the spike mentioned in my previous comment: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=80&editPanel=80&from=1712710230000&to=1712822694301

Considering this is a spike of 503 errors specifically, it makes sense to assume this was just a small outage, so all the requests in that small timeframe simply ended up with a 503 response. I think this can be considered normal. I also checked the logs and the web UI service was really just restarting, which took only a few seconds:

Apr 10 16:17:03 openqa systemd[1]: Stopping The openQA web UI...
…
Apr 10 16:17:04 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
Apr 10 16:17:04 openqa systemd[1]: Stopped The openQA web UI.
Apr 10 16:17:04 openqa systemd[1]: Started The openQA web UI.
Apr 10 16:17:05 openqa openqa[11034]: [info] Loading external plugin AMQP
Apr 10 16:17:05 openqa openqa[11034]: [info] Loading external plugin ObsRsync
Apr 10 16:17:06 openqa openqa[11034]: [info] Listening at "http://127.0.0.1:9526"
Apr 10 16:17:06 openqa openqa[11034]: Web application available at http://127.0.0.1:9526
Apr 10 16:17:06 openqa openqa[11034]: [info] Listening at "http://[::1]:9526"
Apr 10 16:17:06 openqa openqa[11034]: Web application available at http://[::1]:9526

Considering that such a small outage already leads to over 60 503 responses, I'm not sure whether the alert makes sense with the threshold where we have put it now. On a day when we have to restart the service more often it will most likely fire, but that would still not be concerning or actionable. Maybe something to discuss in the next infra daily.
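
To confirm that the burst really coincides with the restart window from the journal excerpt above (Apr 10, around 16:17 local time), one could count the 503 responses logged in those minutes; a rough sketch, again assuming the access_log format shown in the description:

# Count 503 responses logged between 16:10 and 16:19 local time on 10 Apr 2024
grep -c '10/Apr/2024:16:1[0-9].*" 503 ' /var/log/apache2/access_log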

Actions #6

Updated by mkittler 8 months ago

PRs for a few problems I could spot in the logs: https://github.com/os-autoinst/openQA/pull/5574

Note that I mainly focused on the systemd journal logs. The problem with /var/log/apache2/access_log is that it only shows the routes but not what the error actually was (so making sense of those entries would involve serious guesswork).
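
A hedged example of how the journal can be queried for the actual error messages instead (the service name is taken from the journal excerpt above; the time range and the grep pattern are just placeholders):

# Pull error output of the web UI service from the systemd journal for the relevant days
journalctl -u openqa-webui.service --since "2024-04-10" --until "2024-04-13" | grep -i error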

Actions #7

Updated by openqa_review 8 months ago

  • Due date set to 2024-04-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler 8 months ago

  • Status changed from In Progress to Resolved

The PR was merged.

The high number of 500 responses we see is just caused by spikes when restarting the service. Otherwise there are not that many 500 responses and the PR fixes some of them.

The alert makes more sense now and the panel was good anyway. We might need to tweak the alert to handle situations where we restart the web UI a lot, but for now I'd consider this ticket resolved.

Actions #9

Updated by okurz 8 months ago

  • Due date deleted (2024-04-26)