action #125468

closed

[alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2023-03-06
Due date:
% Done: 0%
Estimated time:

Description

Observation

From email: 2023-03-05 23:31

*Firing: 1 alert *
Firing
_*Apache Response Time alert *_
*Value:* [ var='A0' metric='Min' labels={} value=1.111601e+06 ]
*message:* The apache response time exceeded the alert threshold.
* Check the load of the web UI host
* Consider restarting the openQA web UI service and/or apache
Also see https://progress.opensuse.org/issues/73633
*Labels:*
* alertname: Apache Response Time alert
* rule_uid: J5M8aX04z

see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=84&from=1678050000000&to=1678057199000

Acceptance criteria

  • AC1: Apache response time consistently stable over some days
  • AC2: OSD has been checked for possible causes of the alert firing during the original alert reporting period

Suggestions

Actions #1

Updated by okurz almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz almost 2 years ago

  • Subject changed from [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? to [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by nicksinger almost 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #4

Updated by nicksinger almost 2 years ago

Looking into this, I saw some data points exceeding our limit, but the average over the last 30 minutes (which is what our alert is meant to check) does not exceed it. So I assume we have a badly configured alert here.
While trying to come up with a better alert I realized that we have been missing metrics since 13:00 CET today. I checked the telegraf logs on OSD and came up with the following MR to remediate this: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/805

I think the missing data points need to be fixed first before we can continue here.
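
To illustrate the difference between single data points exceeding the limit and the 30-minute average exceeding it, here is a minimal sketch in Python. It is not the actual Grafana rule: the function names, the sample data and the threshold value are all hypothetical, and it only assumes that samples are (timestamp, value) pairs.

```python
from datetime import datetime, timedelta

# Hypothetical window length, matching the "last 30 minutes" the alert intends to check.
WINDOW = timedelta(minutes=30)


def in_window(samples, end, window=WINDOW):
    """Return the samples whose timestamp lies in (end - window, end]."""
    return [(t, v) for t, v in samples if end - window < t <= end]


def evaluate(samples, threshold, end, window=WINDOW):
    """Compare a mean-based check with a per-sample check over one window."""
    recent = in_window(samples, end, window)
    values = [v for _, v in recent]
    mean = sum(values) / len(values) if values else None
    spikes = [(t, v) for t, v in recent if v > threshold]
    return {
        "mean": mean,
        "spikes": spikes,
        # Fires only if the average over the whole window exceeds the threshold.
        "mean_fires": mean is not None and mean > threshold,
        # Fires as soon as any single sample exceeds the threshold.
        "any_sample_fires": bool(spikes),
    }


# Hypothetical data: one sample per minute, with a single outlier.
start = datetime(2023, 3, 5, 22, 0)
samples = [(start + timedelta(minutes=i), 100.0) for i in range(30)]
samples[10] = (samples[10][0], 2000.0)

result = evaluate(samples, threshold=500.0, end=start + timedelta(minutes=29))
print(result["mean_fires"], result["any_sample_fires"])
```

With data like this a check on individual samples reports a violation while a check on the window average does not, which is the kind of mismatch that would make an alert fire briefly and then resolve itself.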

Actions #5

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz almost 2 years ago

As discussed, going to a 30s collection interval for the apache data in particular has not triggered timeouts yet, so this might help. Please salt it then.

Actions #7

Updated by nicksinger almost 2 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)
  • Priority changed from High to Normal

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/811

Filtering out values with the previous MR didn't work out as expected, so I'm bumping the interval now. In general I'd like to put this ticket back in our queue now, as we also reverted to "legacy alerting" in the meantime and I don't think working on it right now makes much sense.

Actions #8

Updated by okurz almost 2 years ago

  • Due date deleted (2023-03-25)
  • Status changed from Workable to Resolved
  • Assignee set to nicksinger

MR merged. This might suffice.
