action #107875

[alert][osd] Apache Response Time alert size:M

Added by mkittler over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-03-04
Due date: 2022-03-24
% Done: 0%
Estimated time:
Tags:

Description

Observation

We got the alert again on March 3, 2022 at 09:00:40:

[Alerting] Apache Response Time alert
The apache response time exceeded the alert threshold.
* Check the load of the web UI host
* Consider restarting the openQA web UI service and/or apache
Also see https://progress.opensuse.org/issues/73633

Metric name: Min
Value: 18733128.83

Relevant panel: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=84


Tina wrote in chat

if anyone was wondering about the short high load on osd, I fetched /api/v1/jobs and it took 10 minutes

but that was already on Wednesday, so it shouldn't have caused this.
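
For reference, a quick way to reproduce such a measurement from the command line (a hedged example; the exact query parameters one would pass to /api/v1/jobs are not part of this ticket):

    # Time an unfiltered job listing on OSD; -o /dev/null discards the body,
    # -w prints the total transfer time in seconds.
    curl -s -o /dev/null -w 'total: %{time_total}s\n' "https://openqa.suse.de/api/v1/jobs"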

Further data points
  • High CPU likely didn't affect scheduling; otherwise we should have had other reports of it
  • It is questionable whether high CPU alone would cause a spike in job failures

Suggestions

  • The apache log parsing seems to be quite heavy. Can we reduce the amount of data parsed by telegraf? (see the config sketch below this list)
  • Reduce how often telegraf takes new data points (i.e. use a larger collection interval)
  • Extend the alerting measurement period from 5m to 30m (or higher) to smooth out gaps
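
As a rough illustration of the first two suggestions, this is what the telegraf side could look like; the actual input definition lives in salt-states-openqa (see the MRs referenced later in this ticket), so the file path, measurement name, and grok pattern below are assumptions, not the deployed config:

    # Hypothetical /etc/telegraf/telegraf.d/apache_log.conf
    # Tail the Apache access log and parse it with a grok pattern into the
    # "apache_log" measurement; the trailing NUMBER field assumes the log
    # format appends the request duration.
    [[inputs.tail]]
      files = ["/var/log/apache2/access_log"]   # assumed log location
      from_beginning = false
      name_override = "apache_log"
      data_format = "grok"
      grok_patterns = ['%{COMBINED_LOG_FORMAT} %{NUMBER:response_time_us:int}']

Reducing the amount of parsed data could then mean excluding fields or tags the alert does not need (telegraf's fieldexclude/tagexclude options); the alerting measurement period itself is a property of the Grafana alert rule, not of telegraf.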

Related issues

Related to openQA Infrastructure - action #107257: [alert][osd] Apache Response Time alert size:M (Resolved, 2022-02-22)

Related to openQA Infrastructure - action #96807: Web UI is slow and Apache Response Time alert got triggered (Resolved, 2021-08-12 to 2021-10-01)

Related to openQA Project - action #94111: Optimize /api/v1/jobs (Resolved, 2021-06-16)

Related to openQA Infrastructure - action #128789: [alert] Apache Response Time alert size:M (Workable, 2023-04-01)

Copied to openQA Project - coordination #108209: [epic] Reduce load on OSD (Blocked, 2023-04-01 to 2023-06-20)

History

#2 Updated by mkittler over 1 year ago

I've also just had a look. The InfluxDB query is very slow when selecting a time range like "Last 2 days". Maybe we're collecting too many data points per time. Regardless, it indeed looks like gaps are causing this (see the attached screenshots):

Some other graphs have gaps as well, but not all.

The CPU load was quite high from time to time, but the HTTP response graph shows no gaps.
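
For comparison, a downsampled query of the kind Grafana could issue for such a panel; measurement and field names are assumptions here (based on the apache_log table mentioned later in this ticket):

    -- Hypothetical InfluxQL: aggregating into 5-minute buckets keeps the number
    -- of returned points small even for a "Last 2 days" time range.
    SELECT mean("response_time_us")
      FROM "apache_log"
      WHERE time > now() - 2d
      GROUP BY time(5m) fill(null)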

#3 Updated by okurz about 1 year ago

  • Priority changed from High to Urgent

#4 Updated by okurz about 1 year ago

  • Related to action #107257: [alert][osd] Apache Response Time alert size:M added

#5 Updated by okurz about 1 year ago

  • Related to action #96807: Web UI is slow and Apache Response Time alert got triggered added

#6 Updated by cdywan about 1 year ago

  • Subject changed from [alert][osd] Apache Response Time alert to [alert][osd] Apache Response Time alert size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to tinita

#7 Updated by okurz about 1 year ago

  • Status changed from Workable to In Progress

tinita, I have an idea regarding the Apache response time alert after looking at the graph. I prepared an MR for the dashboard:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662

You could look into the apache log parsing done by telegraf.

#8 Updated by tinita about 1 year ago

All graphs with gaps are reading from the apache_log table, but the comment "Response time measured by the apache proxy [...]" suggests that this data comes from the proxy logs and not from apache itself.

I need to find out where the proxy and its logs are.

#9 Updated by okurz about 1 year ago

tinita wrote:

All graphs with gaps are reading from the apache_log table, but the comment "Response time measured by the apache proxy [...]" suggests that this data comes from the proxy logs and not from apache itself.

I need to find out where the proxy and its logs are.

We use apache as the reverse proxy for openQA, so apache == proxy.
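
For illustration, this is roughly how the response time could end up in the proxy's access log in the first place. This is a sketch only, not the directive actually deployed via salt; Apache's %D logs the request duration in microseconds, which would make the value 18733128.83 from the description roughly 18.7 seconds:

    # Hypothetical LogFormat: the standard "combined" format plus %D, the time
    # taken to serve the request in microseconds; telegraf would then parse this file.
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_duration
    CustomLog /var/log/apache2/access_log combined_duration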

#11 Updated by openqa_review about 1 year ago

  • Due date set to 2022-03-24

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by okurz about 1 year ago

#13 Updated by okurz about 1 year ago

In the weekly we extracted #108209 into a separate ticket, so all mid- and long-term ideas should go in there. Here we should really focus on short-term mitigations that avoid alerts while our system is still operable (under the known constraints).

tinita, please try out different log parsing intervals in the telegraf config for the apache logs and monitor whether the alert still triggers. Maybe https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662 and https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664 are already enough.
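
It is not spelled out in this ticket which interval is meant; assuming it is telegraf's agent-level collection/flush interval, the knob would look like this (values for illustration only):

    # Agent section of /etc/telegraf/telegraf.conf (hypothetical values)
    [agent]
      interval = "30s"         # how often inputs are sampled
      flush_interval = "30s"   # how often buffered metrics are written to InfluxDB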

#15 Updated by tinita about 1 year ago

  • Status changed from In Progress to Feedback

#16 Updated by tinita about 1 year ago

  • Status changed from Feedback to Resolved

So even after the interval change to 30s was merged, we still have gaps (there was a one-hour gap this morning, in the middle of a 3-hour timeframe with high load).

But we haven't seen alerts, so I consider this ticket resolved, as we have a follow-up ticket about the high load.

#17 Updated by tinita about 1 year ago

By the way, just out of curiosity I created a Grafana dashboard: https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard. It can be interesting to see which types of requests we get and which user agents.

#18 Updated by okurz about 1 year ago

#19 Updated by okurz 21 days ago

  • Related to action #128789: [alert] Apache Response Time alert size:M added
