action #125468: [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #125468

closed

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category:
Target version: openQA Project (public) - Ready
Start date: 2023-03-06
Due date:
% Done:
Estimated time:
Tags: alert, flaky, infra, apache, response

Description

Observation

From email: 2023-03-05 2331

*Firing: 1 alert *
Firing
_*Apache Response Time alert *_
*Value:* [ var='A0' metric='Min' labels={} value=1.111601e+06 ]
*message:* The apache response time exceeded the alert threshold.
* Check the load of the web UI host
* Consider restarting the openQA web UI service and/or apache
Also see https://progress.opensuse.org/issues/73633
*Labels:*
* alertname: Apache Response Time alert
* rule_uid: J5M8aX04z

see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=84&from=1678050000000&to=1678057199000
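
The `from` and `to` parameters in the dashboard link are milliseconds since the Unix epoch. As a minimal sketch for working out which period the link covers, the following decodes them to UTC; the helper name `grafana_time_range` is made up for illustration and is not part of any tool used here.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

URL = ("https://monitor.qa.suse.de/d/WebuiDb/webui-summary"
       "?orgId=1&viewPanel=84&from=1678050000000&to=1678057199000")


def grafana_time_range(url):
    """Return the (from, to) range of a dashboard URL as UTC datetimes.

    Hypothetical helper: assumes both parameters are absolute
    timestamps in milliseconds since the Unix epoch.
    """
    query = parse_qs(urlparse(url).query)
    return tuple(
        datetime.fromtimestamp(int(query[key][0]) / 1000, tz=timezone.utc)
        for key in ("from", "to")
    )


start, end = grafana_time_range(URL)
print(start.isoformat(), end.isoformat())
# prints: 2023-03-05T21:00:00+00:00 2023-03-05T22:59:59+00:00
```

If the "2331" in the email is local time in CET (UTC+1), which is an assumption since the email does not state a time zone, the alert falls inside this window (22:00:00 to 23:59:59 CET).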

Acceptance criteria

AC1: Apache response time consistently stable over some days
AC2: OSD has been checked for possible causes of the alert firing during the original alert reporting period

Suggestions

Check https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=84&from=1678050000000&to=1678057199000 and look into other panels and the system logs for the same time period
If no significant problem is found on OSD itself, compare with monitoring data from multiple other hosts; maybe there was a network problem at the time?
Act accordingly to make the issue less likely to reappear

Updated by okurz about 2 years ago

Description updated (diff)

Updated by okurz about 2 years ago

Subject changed from [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? to [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M
Description updated (diff)
Status changed from New to Workable

Updated by nicksinger about 2 years ago

Status changed from Workable to In Progress
Assignee set to nicksinger

Updated by nicksinger about 2 years ago

Looking into this, I saw some data points exceeding our limit, but the average over the last 30 minutes (which is what our alert intends to check) does not appear to exceed it. So I assume we have a badly configured alert here.
While trying to come up with a better alert I realized that we have been missing metrics since 13:00 CET today. I checked the telegraf logs on OSD and came up with the following MR to remediate this: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/805

I think the missing data points need to be fixed first before we can continue here.
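
To illustrate the difference described above between single data points exceeding a limit and the 30-minute average exceeding it, here is a rough sketch. The sample values, the threshold and both function names are hypothetical; this is not the actual alert rule, only the general idea of the two checks.

```python
from statistics import mean


def exceeds_any(samples, threshold):
    """True if any single sample is above the threshold."""
    return any(value > threshold for value in samples)


def exceeds_window_mean(samples, threshold, window):
    """True if the mean of the last `window` samples is above the threshold."""
    recent = samples[-window:]
    return bool(recent) and mean(recent) > threshold


# Hypothetical response times (arbitrary units), one sample per minute:
# 29 ordinary values followed by a single short spike.
samples = [200_000] * 29 + [1_200_000]
THRESHOLD = 500_000  # hypothetical limit, not the real alert threshold

print(exceeds_any(samples, THRESHOLD))              # prints: True
print(exceeds_window_mean(samples, THRESHOLD, 30))  # prints: False
```

A rule that effectively behaves like the first check fires on every short spike, while one that behaves like the second only fires when the response time stays high for most of the window.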

Updated by openqa_review about 2 years ago

Due date set to 2023-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz about 2 years ago

As discussed, going to a 30s collection interval for the apache data in particular has not triggered timeouts yet, so this might help. Please salt it then.

Updated by nicksinger about 2 years ago

Status changed from In Progress to Workable
Assignee deleted (~~nicksinger~~)
Priority changed from High to Normal

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/811

Filtering out values with the previous MR didn't work out as expected, so I'm bumping the interval now. In general I'd like to put this ticket back in our queue, as we also reverted to "legacy alerting" in the meantime and I don't think working on it right now makes much sense.

Updated by okurz about 2 years ago

Due date deleted (~~2023-03-25~~)
Status changed from Workable to Resolved
Assignee set to nicksinger

MR merged. This might suffice.
