action #107875
[alert][osd] Apache Response Time alert size:M
Status: closed
Description
Observation
We got the alert again on March 3, 2022 09:00:40:
[Alerting] Apache Response Time alert
The apache response time exceeded the alert threshold.
- Check the load of the web UI host
- Consider restarting the openQA web UI service and/or apache
Also see https://progress.opensuse.org/issues/73633
Metric name: Min
Value: 18733128.83
Relevant panel: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=84
Tina wrote in chat:
if anyone was wondering about the short high load on osd, I fetched /api/v1/jobs and it took 10 minutes
but that was already on Wednesday, so it shouldn't have caused this.
Further data points
- High CPU likely didn't affect scheduling, or we would have seen other reports of it
- High CPU alone shouldn't cause a spike in job failures, should it?
Suggestions
- The apache log parsing seems to be quite heavy. Can we reduce the amount of data parsed by telegraf? (see the sketch after this list)
- Reduce the interval at which telegraf takes new data points
- Extend the alerting measurement period from 5m to 30m (or higher) to smooth out gaps
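For context, here is a minimal sketch of the kind of telegraf input that parses the apache access log into the apache_log measurement. The file path, grok pattern and measurement name are assumptions for illustration, not the actual salt-states-openqa configuration:

```toml
# Hedged sketch of a logparser-style apache access log input (assumed values).
# Parsing every request line on each collection run is what makes this input
# heavy; the first two suggestions boil down to parsing less data or parsing
# it less often.
[[inputs.logparser]]
  files = ["/var/log/apache2/access_log"]   # assumed log location
  from_beginning = false
  [inputs.logparser.grok]
    # COMBINED_LOG_FORMAT ships with telegraf; capturing the response time
    # would additionally need a custom pattern for apache's %D log field.
    patterns = ["%{COMBINED_LOG_FORMAT}"]
    measurement = "apache_log"              # the table the alert panel queries
```

The third suggestion is a change on the Grafana side (the alert rule's evaluation window), not in telegraf.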
Updated by tinita almost 3 years ago
To me it looks like it was caused by data gaps again:
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&tab=alert&viewPanel=84&from=1646285146518&to=1646301111736
Updated by mkittler almost 3 years ago
- File screenshot_20220304_160005.png screenshot_20220304_160005.png added
- File screenshot_20220304_160755.png screenshot_20220304_160755.png added
- File screenshot_20220304_160913.png screenshot_20220304_160913.png added
I've also just had a look. The InfluxDB query is very slow when selecting a time range like "Last 2 days". Maybe we're collecting too many data points per time. Regardless, it does look like gaps are causing this:
Some other graphs have gaps as well but not all:
The CPU load was quite high from time to time but the HTTP response graph shows no gaps:
Updated by okurz almost 3 years ago
- Related to action #107257: [alert][osd] Apache Response Time alert size:M added
Updated by okurz almost 3 years ago
- Related to action #96807: Web UI is slow and Apache Response Time alert got triggered added
Updated by livdywan almost 3 years ago
- Subject changed from [alert][osd] Apache Response Time alert to [alert][osd] Apache Response Time alert size:M
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to tinita
Updated by okurz almost 3 years ago
- Status changed from Workable to In Progress
@tinita I have an idea regarding the apache response alert ticket after looking at the graph. I prepared an MR for the dashboard:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662
You could look into the apache log parsing in telegraf.
Updated by tinita almost 3 years ago
All graphs with gaps are reading from the apache_log table, but the comment "Response time measured by the apache proxy [...]"
suggests that this data comes from the proxy logs and not from apache itself.
I need to find out where to find the proxy and the logs.
Updated by okurz almost 3 years ago
tinita wrote:
All graphs with gaps are reading from the apache_log table, but the comment
Response time measured by the apache proxy [...]
suggests that this data comes from the proxy logs and not from apache itself. I need to find out where to find the proxy and the logs.
We use apache as the reverse proxy for openQA, so apache == proxy.
Updated by tinita almost 3 years ago
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664 to replace logparser with tail
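The logparser plugin is deprecated in telegraf in favour of the tail input with the grok data format, so such a replacement roughly takes the following shape (a sketch only; path, pattern and measurement name are assumptions, not the MR's actual diff):

```toml
# Rough equivalent of the deprecated logparser input, expressed via inputs.tail
# with the grok data format. All values are assumptions for illustration.
[[inputs.tail]]
  files = ["/var/log/apache2/access_log"]   # assumed log location
  from_beginning = false
  data_format = "grok"
  grok_patterns = ["%{COMBINED_LOG_FORMAT}"]
  name_override = "apache_log"              # keep writing into the same measurement
```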
Updated by openqa_review almost 3 years ago
- Due date set to 2022-03-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz almost 3 years ago
- Copied to coordination #108209: [epic] Reduce load on OSD added
Updated by okurz almost 3 years ago
In the weekly we extracted #108209 into a separate ticket, so all mid- and long-term ideas should go there. Here we should really focus on short-term mitigations to avoid alerts while our system is still operable (under the known constraints).
@tinita try out different log parsing intervals in the telegraf config for apache logs and monitor if the alert still triggers. Maybe https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662 and https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664 are already enough.
Updated by tinita almost 3 years ago
Increased the interval for the tail input: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/665
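Telegraf supports a per-plugin interval override, so assuming that is the knob this MR adjusts, the change could look roughly like the added interval line below. Note that tail is a service input, so how strictly the interval throttles parsing depends on telegraf's handling of service plugins; this is a sketch, not the MR's diff:

```toml
# Assumed shape of the change: an explicit per-plugin interval on the tail
# input so parsed apache log data is gathered less frequently.
[[inputs.tail]]
  interval = "30s"                          # the 30s mentioned in the resolution below
  files = ["/var/log/apache2/access_log"]   # assumed log location
  from_beginning = false
  data_format = "grok"
  grok_patterns = ["%{COMBINED_LOG_FORMAT}"]
  name_override = "apache_log"
```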
Updated by tinita almost 3 years ago
- Status changed from In Progress to Feedback
Updated by tinita almost 3 years ago
- Status changed from Feedback to Resolved
So even after the interval change to 30s was merged, we still have gaps (there was a one-hour gap this morning, in the middle of a 3-hour timeframe with high load).
But we haven't seen alerts, so I consider this ticket resolved, as we have a follow-up ticket about the high load.
Updated by tinita almost 3 years ago
Just out of curiosity I created a Grafana dashboard, btw: https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard which can be interesting for seeing which types of requests we get and which user agents.
Updated by okurz almost 3 years ago
- Related to action #94111: Optimize /api/v1/jobs added
Updated by okurz over 1 year ago
- Related to action #128789: [alert] Apache Response Time alert size:M added