action #125468
[alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M
Description
Observation
From email: 2023-03-05 2331
Firing: 1 alert
Firing: Apache Response Time alert
Value: [ var='A0' metric='Min' labels={} value=1.111601e+06 ]
message: The apache response time exceeded the alert threshold.
* Check the load of the web UI host
* Consider restarting the openQA web UI service and/or apache
Also see https://progress.opensuse.org/issues/73633
Labels:
* alertname: Apache Response Time alert
* rule_uid: J5M8aX04z
Acceptance criteria
- AC1: Apache response time consistently stable over some days
- AC2: OSD has been checked for possible causes of the alert firing during the original alert reporting period
Suggestions
- Check https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=84&from=1678050000000&to=1678057199000 and look into other panels for the same time and system logs
- If no significant problem is found on OSD itself, compare with monitoring data for multiple other hosts; maybe there was a network problem at the time?
- Act accordingly to make the issue less likely to reappear
History
#2
Updated by okurz 3 months ago
- Subject changed from [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? to [alert] [FIRING:1] (Apache Response Time alert J5M8aX04z) then resolved itself so flaky? size:M
- Description updated (diff)
- Status changed from New to Workable
#3
Updated by nicksinger 3 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
#4
Updated by nicksinger 3 months ago
Looking into this I saw some data points exceeding our limit, but the average over the last 30 minutes (which is what our alert is intended to check) does not appear to exceed it. So I assume we have a badly configured alert here.
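To illustrate the difference between single data points exceeding a limit and the 30-minute average exceeding it, here is a minimal sketch. The threshold, the sample values (apart from the 1.111601e+06 spike taken from the alert email) and the one-sample-per-minute spacing are assumptions for illustration only, not the real alert configuration.

```python
from statistics import mean

# Assumed threshold, same (unspecified) unit as the alert value; NOT the real limit.
THRESHOLD = 500_000


def spikes_above(samples, threshold):
    """Return the individual samples that exceed the threshold."""
    return [s for s in samples if s > threshold]


def window_mean_exceeds(samples, threshold):
    """Return True only if the mean over the whole window exceeds the threshold."""
    return mean(samples) > threshold


# Hypothetical 30-minute window, one sample per minute:
# 28 ordinary values followed by two spikes.
samples = [120_000] * 28 + [1_111_601, 900_000]

print(spikes_above(samples, THRESHOLD))         # the two spikes
print(window_mean_exceeds(samples, THRESHOLD))  # False: the mean stays below the limit
```

With these made-up numbers an alert evaluating individual points (or a reducer other than the mean) would fire, while one evaluating the window mean would not.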
While trying to come up with a better alert I realized that we're missing metrics since 13:00 CET today. I checked telegraf logs on OSD and came up with the following MR to remediate this: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/805
I think the missing data points need to be fixed first before we can continue here.
#5
Updated by openqa_review 3 months ago
- Due date set to 2023-03-25
Setting due date based on mean cycle time of SUSE QE Tools
#7
Updated by nicksinger 3 months ago
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
- Priority changed from High to Normal
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/811
Filtering out values with the previous MR didn't work out as expected, so I'm bumping the interval now. In general I'd like to put this ticket back in our queue now, as we also reverted to "legacy alerting" in the meantime and I don't think working on it right now makes much sense.