action #125468
closed[alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M
Description
Observation
From email: 2023-03-05 23:31
Firing: 1 alert

Apache Response Time alert
Value: [ var='A0' metric='Min' labels={} value=1.111601e+06 ]
Message: The apache response time exceeded the alert threshold.
- Check the load of the web UI host
- Consider restarting the openQA web UI service and/or apache

Also see https://progress.opensuse.org/issues/73633

Labels:
- alertname: Apache Response Time alert
- rule_uid: J5M8aX04z
Acceptance criteria
- AC1: Apache response time consistently stable over some days
- AC2: OSD has been checked for possible causes of the alert firing during the original alert reporting period
Suggestions
- Check https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=84&from=1678050000000&to=1678057199000 and look into other panels for the same time and system logs
- If no significant problem was found on OSD itself, compare with monitoring data for multiple other hosts; maybe something was wrong with the network at the time?
- Act accordingly to make the issue less likely to reappear
Updated by okurz almost 2 years ago
- Subject changed from [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? to [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger almost 2 years ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger almost 2 years ago
Looking into this I saw some data points exceeding our limit, but the average over the last 30 minutes (which our alert is meant to check) does not exceed it. So I assume we have a badly configured alert here.
While trying to come up with a better alert I realized that we have been missing metrics since 13:00 CET today. I checked the telegraf logs on OSD and came up with the following MR to remediate this: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/805
Fixing the missing data points is a prerequisite before continuing here.
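The distinction above can be illustrated with a short sketch (not the actual Grafana alert rule; the threshold and sample values are made up): a single response-time spike can exceed the alert threshold while the mean over a 30-minute evaluation window stays well below it, so an alert on the windowed mean should not fire.

```python
# Hypothetical illustration of windowed-mean alerting. The threshold and
# per-minute samples below are invented for this sketch; only the spike
# value 1_111_601 echoes the reported alert value (microseconds).
THRESHOLD_US = 1_000_000

def window_mean(samples):
    """Mean apache response time over one evaluation window."""
    return sum(samples) / len(samples)

# One 30-minute window of per-minute samples: a single spike above the
# threshold, the other 29 samples well below it.
window = [200_000] * 29 + [1_111_601]

spikes = [s for s in window if s > THRESHOLD_US]
assert spikes                              # individual data points exceed the limit...
assert window_mean(window) < THRESHOLD_US  # ...but the 30-minute mean does not
```

Under this reading, the firing-then-resolving behaviour points at the alert evaluating something other than the intended 30-minute average.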
Updated by openqa_review almost 2 years ago
- Due date set to 2023-03-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
As discussed, going to a 30s collection interval for the apache data in particular has not triggered timeouts yet, so this might help. Please salt it then.
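A per-input 30s interval could look roughly like this in the telegraf configuration (a sketch only; the actual salt-managed config, plugin options, and server-status URL on OSD may differ):

```toml
# Hypothetical telegraf fragment: override the agent-wide collection
# interval for the apache input only. The URL is an assumption.
[[inputs.apache]]
  urls = ["http://localhost/server-status?auto"]
  interval = "30s"
```

Telegraf supports setting `interval` per input plugin, so the apache metrics can be collected less often without changing the global agent interval.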
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
- Priority changed from High to Normal
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/811
Filtering out values with the previous MR didn't work out as expected, so I'm bumping the interval now. In general I'd like to put this ticket back into our queue, as we also reverted to "legacy alerting" in the meantime and I don't think working on it right now makes much sense.
Updated by okurz over 1 year ago
- Due date deleted (2023-03-25)
- Status changed from Workable to Resolved
- Assignee set to nicksinger
MR merged. This might suffice.