action #107257

closed

[alert][osd] Apache Response Time alert size:M

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2022-02-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

From grafana: [Alerting] Apache Response Time alert

The apache response time exceeded the alert threshold.

  • Check the load of the web UI host
  • Consider restarting the openQA web UI service and/or apache

Also see https://progress.opensuse.org/issues/73633

Metric name    Value
Min            2565671.000

view alert rule: http://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=84&orgId=1

Reproducible

Multiple alerts since at least 2022-02-22, likely also on the days before that.

Suggestions

  • okurz already restarted the apache service because it had been running since before the labs move. Since then we have had multiple further alerts
  • The problem is likely not apache itself but either the network or our openQA service
  • It seems we are smoothing over a fairly short time window, so we may not have enough data because of the data outages. We should therefore look into #107437 first
  • Check again how the alert behaves after #107437 is resolved
  • Optional: Reconsider how we alert on response times when we actually do not have that many responses
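
The last suggestion, alerting on response times only when enough responses were actually recorded, could be sketched roughly as below. This is a minimal illustration and not the Grafana alert rule used on OSD: the function name `should_alert`, the parameters `threshold` and `min_samples`, and the example values are all hypothetical. The idea is simply to treat a window with too few data points as "not enough data" instead of letting a single slow response trigger the alert.

```python
from statistics import mean
from typing import Optional, Sequence


def should_alert(
    samples: Sequence[Optional[float]],
    threshold: float,
    min_samples: int,
) -> bool:
    """Return True only if enough data points exist and their mean exceeds the threshold.

    `None` entries stand for missing data points, e.g. during a data outage.
    All names and values here are hypothetical, not taken from the real alert rule.
    """
    present = [s for s in samples if s is not None]
    if len(present) < min_samples:
        # Too few responses in the window: do not alert on response time.
        return False
    return mean(present) > threshold


# One slow response surrounded by gaps does not alert ...
print(should_alert([2565671.0, None, None], threshold=500000.0, min_samples=3))
# ... but a window full of slow responses does.
print(should_alert([2565671.0, 2000000.0, 1800000.0], threshold=500000.0, min_samples=3))
```

Whether missing data should be silent or raise its own "no data" alert is a separate decision; that case is tracked in #107437.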

Rollback steps


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #107437: [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M (Resolved, okurz, 2022-02-23)

Related to openQA Infrastructure (public) - action #102650: Organize labs move to new building and SRV2 size:M (Resolved, nicksinger, 2021-11-18, 2022-05-27)

Related to openQA Infrastructure (public) - action #107875: [alert][osd] Apache Response Time alert size:M (Resolved, tinita, 2022-03-04)

Actions #1

Updated by okurz almost 3 years ago

  • Priority changed from High to Urgent

recurring a lot over the day

Actions #2

Updated by livdywan almost 3 years ago

okurz wrote:

recurring a lot over the day

I can confirm. And in general we get a lot of messages. I suggest we find out whether there is a correlation, or identify another ticket during estimation, since I've reached alert fatigue.

Actions #3

Updated by okurz almost 3 years ago

  • Related to action #107437: [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M added
Actions #4

Updated by okurz almost 3 years ago

  • Related to action #102650: Organize labs move to new building and SRV2 size:M added
Actions #5

Updated by okurz almost 3 years ago

  • Subject changed from [alert][osd] Apache Response Time alert to [alert][osd] Apache Response Time alert size:M
  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #6

Updated by okurz almost 3 years ago

#107437 needs to be handled first; this ticket is blocked by it.

Actions #7

Updated by okurz almost 3 years ago

  • Status changed from Blocked to Resolved

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=84&orgId=1&from=now-12h&to=now looks good again. Alert unpaused. #107437 is resolved. We can resolve this ticket as well, as nothing else is showing up.

Actions #8

Updated by okurz almost 3 years ago

  • Related to action #107875: [alert][osd] Apache Response Time alert size:M added