action #107875

closed

[alert][osd] Apache Response Time alert size:M

Added by mkittler almost 3 years ago. Updated 5 months ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Start date: 2022-03-04
Due date: -
% Done: 0%
Estimated time: -
Tags: -

Description

Observation

We got the alert again on March 3, 2022 at 09:00:40:

[Alerting] Apache Response Time alert
The apache response time exceeded the alert threshold. * Check the load of the web UI host * Consider restarting the openQA web UI service and/or apache Also see https://progress.opensuse.org/issues/73633

Metric name: Min
Value: 18733128.83

Relevant panel: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=84


Tina wrote in chat:

if anyone was wondering about the short high load on osd, I fetched /api/v1/jobs and it took 10 minutes

but that was already on Wednesday so it shouldn't have caused this.

Further data points
- High CPU likely didn't affect scheduling, otherwise we should have had other reports of it
- High CPU presumably wouldn't cause a spike in job failures?

Suggestions

  • The apache log parsing seems to be quite heavy. Can we reduce the amount of data parsed by telegraf? (see the config sketch after this list)
  • Reduce how often telegraf takes new data points, i.e. use a larger collection interval
  • Extend the alerting measurement period from 5m to 30m (or higher) to smooth out gaps
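
A minimal sketch of what the first two suggestions could look like in a telegraf config; the log path, measurement name, grok pattern and field names are assumptions for illustration, not copied from the actual salt-states-openqa setup:

    # Assumed telegraf snippet, not the real salt-managed config
    [agent]
      ## take new data points less often (30s instead of the default 10s)
      interval = "30s"
      flush_interval = "30s"

    [[inputs.tail]]
      ## hypothetical log path; the real path on OSD may differ
      files = ["/var/log/apache2/access_log"]
      from_beginning = false
      name_override = "apache_log"
      ## COMBINED_LOG_FORMAT ships with telegraf's grok parser; the trailing
      ## response time field assumes the log format appends %D (microseconds)
      data_format = "grok"
      grok_patterns = ['%{COMBINED_LOG_FORMAT} %{NUMBER:response_time_us:int}']
      ## keep only the fields the dashboards need to reduce the amount of parsed data
      fieldpass = ["response_time_us", "resp_bytes"]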

Related issues 5 (0 open, 5 closed)

Related to openQA Infrastructure (public) - action #107257: [alert][osd] Apache Response Time alert size:M (Resolved, okurz, 2022-02-22)

Related to openQA Infrastructure (public) - action #96807: Web UI is slow and Apache Response Time alert got triggered (Resolved, okurz, 2021-08-12 to 2021-10-01)

Related to openQA Project (public) - action #94111: Optimize /api/v1/jobs (Resolved, tinita, 2021-06-16)

Related to openQA Infrastructure (public) - action #128789: [alert] Apache Response Time alert size:M (Resolved, nicksinger, 2023-04-01)

Copied to openQA Project (public) - coordination #108209: [epic] Reduce load on OSD (Resolved, okurz, 2023-04-01)

Updated by mkittler almost 3 years ago

I've also just had a look. The InfluxDB query is very slow when selecting a time range like "Last 2 days". Maybe we're collecting too many data points per time. Regardless, it indeed looks like gaps are causing this.

Some other graphs have gaps as well, but not all.

The CPU load was quite high from time to time, but the HTTP response graph shows no gaps.
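
A hedged InfluxQL sketch of the downsampling idea; the measurement and field names are assumptions, not taken from the real dashboard. Aggregating raw points into fixed time buckets keeps a "Last 2 days" panel query cheap:

    -- assumed measurement/field names, for illustration only
    SELECT mean("response_time_us")
    FROM "apache_log"
    WHERE time > now() - 2d
    GROUP BY time(5m) fill(none)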

#3

Updated by okurz almost 3 years ago

  • Priority changed from High to Urgent
#4

Updated by okurz almost 3 years ago

  • Related to action #107257: [alert][osd] Apache Response Time alert size:M added
#5

Updated by okurz almost 3 years ago

  • Related to action #96807: Web UI is slow and Apache Response Time alert got triggered added
#6

Updated by livdywan almost 3 years ago

  • Subject changed from [alert][osd] Apache Response Time alert to [alert][osd] Apache Response Time alert size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to tinita
#7

Updated by okurz almost 3 years ago

  • Status changed from Workable to In Progress

@tinita I have an idea regarding the apache response time alert ticket after looking at the graph. I prepared an MR for the dashboard:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662

You could look into the apache log parsing in telegraf.

#8

Updated by tinita almost 3 years ago

All graphs with gaps are reading from the apache_log table, but the comment "Response time measured by the apache proxy [...]" suggests that this data comes from the proxy logs and not from apache itself.

I need to find out where to find the proxy and its logs.

#9

Updated by okurz almost 3 years ago

tinita wrote:

All graphs with gaps are reading from the apache_log table, but the comment "Response time measured by the apache proxy [...]" suggests that this data comes from the proxy logs and not from apache itself.

I need to find out where to find the proxy and its logs.

We use apache as the reverse proxy for openQA, so apache == proxy.
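
For illustration, this is roughly how a reverse-proxy vhost can log the per-request response time that telegraf then parses; the exact LogFormat, paths and ports here are assumptions, only %D (time taken to serve the request, in microseconds) is a standard Apache format directive:

    # assumed vhost snippet; OSD's real configuration may differ
    LogFormat "%h %l %u %t \"%r\" %>s %b %D" openqa_with_duration
    CustomLog /var/log/apache2/access_log openqa_with_duration

    <VirtualHost *:443>
        ServerName openqa.suse.de
        # the openQA web UI listens on 9526 by default; Apache acts as the reverse proxy
        ProxyPass / http://localhost:9526/ keepalive=On
        ProxyPassReverse / http://localhost:9526/
    </VirtualHost>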

#11

Updated by openqa_review almost 3 years ago

  • Due date set to 2022-03-24

Setting due date based on mean cycle time of SUSE QE Tools

#12

Updated by okurz almost 3 years ago

#13

Updated by okurz almost 3 years ago

In the weekly we extracted #108209 into a separate ticket, so all mid- and long-term ideas should go there. Here we should really focus on short-term mitigations that avoid alerts while our system is still operable (under the known constraints).

@tinita please try out different log parsing intervals in the telegraf config for apache logs and monitor whether the alert still triggers. Maybe https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662 and https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664 are already enough.

#15

Updated by tinita almost 3 years ago

  • Status changed from In Progress to Feedback
#16

Updated by tinita almost 3 years ago

  • Status changed from Feedback to Resolved

So even after the interval change to 30s was merged, we still have gaps (there was a one-hour gap this morning, in the middle of a three-hour timeframe with high load).

But we haven't seen alerts, so I consider this ticket resolved, as we have a follow-up ticket about the high load.
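
For reference, the third suggestion from the description (extending the alerting measurement period to smooth out gaps) could look roughly like this as an InfluxQL alert query; the names and windows are assumptions, not the real dashboard definition:

    -- assumed names; evaluate the alert over 30m instead of 5m and carry the
    -- last known value across gaps so missing points don't look like spikes
    SELECT mean("response_time_us")
    FROM "apache_log"
    WHERE time > now() - 30m
    GROUP BY time(30m) fill(previous)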

#17

Updated by tinita almost 3 years ago

Just out of curiosity I created a Grafana dashboard, btw: https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard which can be interesting for seeing which types of requests we get and which user agents they come from.

#18

Updated by okurz almost 3 years ago

#19

Updated by okurz over 1 year ago

  • Related to action #128789: [alert] Apache Response Time alert size:M added
#20

Updated by okurz 5 months ago

  • Due date deleted (2022-03-24)