action #127052

closed

[alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M

Added by jbaier_cz about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-04-01
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

#### Apache Response Time alert 

    alertname: Apache Response Time alert
    grafana_folder: Salt
    rule_uid: zZAh5mf4k

Value: A=963378 
Date: 31. 03. 23 23:38
Link: http://stats.openqa-monitor.qa.suse.de/d/WebuiDb?orgId=1&viewPanel=84

#### DatasourceNoData

    alertname: DatasourceNoData
    datasource_uid: 000000001
    grafana_folder: Salt
    ref_id: min_apache_response
    rule_uid: zZAh5mf4k
    rulename: Apache Response Time alert

Date: 31. 03. 23 23:55
Link: http://stats.openqa-monitor.qa.suse.de/d/WebuiDb?orgId=1&viewPanel=84

Acceptance Criteria

  • AC1: It is known what the alert means
  • AC2: Alerts are not triggering

Suggestions

  • Remove the alert if we do not know what we need it for
  • Check if and why data points are missing
  • Consider a mob session or just get together in jitsi to brainstorm a solution

Rollback steps

  • On OSD

    salt-key -y -a openqa.suse.de

Files

response time.png (2.07 MB) kraih, 2023-04-21 13:15
response-time.png (1.54 MB) kraih, 2023-04-24 13:50
year-ago.png (1.65 MB) kraih, 2023-04-24 14:42
first.png (1.67 MB) kraih, 2023-04-24 15:30
24hours.png (1.61 MB) kraih, 2023-04-27 08:59
fail.png (1.61 MB) kraih, 2023-04-29 12:54

Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure - action #127985: [alert][grafana] Handle various problems with slow monitoring https requests and such (was: DatasourceNoData HTTP Response alert) size:M (Resolved, mkittler, 2023-04-19 - 2023-05-04)

Related to openQA Infrastructure - action #126212: openqa.suse.de response times very slow. No alert fired size:M (Resolved, mkittler, 2023-03-20)

Copied to openQA Infrastructure - action #128513: Apache Response Time monitoring dashboard is slow to render and should be made lighter (New)

Copied to openQA Infrastructure - action #128789: [alert] Apache Response Time alert size:M (Resolved, nicksinger, 2023-04-01)
Actions #1

Updated by okurz about 1 year ago

  • Tags changed from alert to alert, infra
  • Priority changed from Normal to Urgent
Actions #2

Updated by mkittler about 1 year ago

It happened again. I've added a silence now.

Actions #3

Updated by mkittler about 1 year ago

The alert was triggered again and my silence didn't work. I don't know why. It matches the alert exactly and it is really always the same alert.

Actions #4

Updated by mkittler about 1 year ago

Now using the rule_uid in the silence. Maybe that works better, but I doubt it.

Actions #5

Updated by livdywan about 1 year ago

  • Subject changed from [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response to [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by nicksinger about 1 year ago

  • Priority changed from Urgent to High

I adjusted the panel to better show periods of missing data over a longer time range: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/834
With that merged you can see gaps of 1-2 hours in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-7d&to=now&viewPanel=84 - often this happens around midnight, but sometimes also in the middle of the day. The silence we added apparently worked this time, so I still see this issue as high priority but no longer urgent.

Actions #7

Updated by kraih about 1 year ago

  • Assignee set to kraih
Actions #8

Updated by okurz about 1 year ago

  • Related to action #127985: [alert][grafana] Handle various problems with slow monitoring https requests and such (was: DatasourceNoData HTTP Response alert) size:M added
Actions #9

Updated by okurz about 1 year ago

  • Related to action #126212: openqa.suse.de response times very slow. No alert fired size:M added
Actions #10

Updated by okurz about 1 year ago

Please be aware of my MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/844 to handle NoData gracefully, which also affects the alert mentioned here.

Actions #11

Updated by kraih about 1 year ago

The graph for the last 7 days looks rather suspicious, as if something is happening on a schedule when the no-data events occur:

21/04 02:05-04:00 (last: 02:07:48)
20/04 23:35-01:05 (last: 23:37:48)
20/04 00:45-03:00 (last: 00:48:24)
19/04 22:45-23:00 (last: 22:42:48)
19/04 11:30-11:50 (last: 11:30:12)
19/04 04:55-07:00 (last: 04:56:36)
19/04 00:55-03:00 (last: 00:55:00)
18/04 22:55-00:00 (last: 22:59:24)
18/04 00:00-02:00 (last: 00:08:00)
17/04 00:25-02:00 (last: 00:27:12)
16/04 02:05-03:25 (last: 02:06:36)
15/04 23:15-01:05 (last: 23:19:00)
15/04 17:45-19:00 (last: 17:49:48)
15/04 03:25-05:00 (last: 03:29:12)
14/04 23:35-02:00 (last: 23:35:36)
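If the gaps really do follow a schedule, one way to check would be to correlate them with logrotate runs. A minimal sketch, not something that was run here; the status file path and the systemd unit name are assumptions about the OSD setup:

    # When did logrotate last process the access log?
    grep openqa.access_log /var/lib/misc/logrotate.status
    # Did logrotate run inside one of the gap windows above?
    journalctl -u logrotate.service --since "2023-04-20 23:00" --until "2023-04-21 05:00" --no-pager
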
Actions #12

Updated by kraih about 1 year ago

In the data I'm not seeing an obvious connection to #127985; there is very little overlap between the no-data events. It appears much more random in the other ticket.

Actions #13

Updated by okurz about 1 year ago

My idea based on your suggestions: https://github.com/os-autoinst/openQA/pull/5092

EDIT: 2023-04-25: Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/849 to use the lightweight "/health" check
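
For comparison, a quick sketch (not taken from the ticket) of how the two endpoints could be sized up, assuming the lightweight /health route from the linked PR is deployed:

    # Response size and total time: heavy changelog vs. lightweight health check
    curl -s -o /dev/null -w '/changelog: %{size_download} bytes in %{time_total}s\n' https://openqa.suse.de/changelog
    curl -s -o /dev/null -w '/health: %{size_download} bytes in %{time_total}s\n' https://openqa.suse.de/health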

Actions #14

Updated by kraih about 1 year ago

okurz wrote:

My idea based on your suggestions: https://github.com/os-autoinst/openQA/pull/5092

For context, I noticed that our workers are DDoS-ing the OSD webui a little bit by requesting /changelog for monitoring; that PR is a solution:

    **IP_ADDRESS1** - - [21/Apr/2023:00:50:00 +0200] "GET /changelog HTTP/1.1" 200 134226 "-" "Go-http-client/1.1" 1639
    **IP_ADDRESS2** - - [21/Apr/2023:00:50:00 +0200] "GET /changelog HTTP/1.1" 200 134227 "-" "Go-http-client/1.1" 1784
    **IP_ADDRESS3** - - [21/Apr/2023:00:50:00 +0200] "GET /changelog HTTP/1.1" 200 134227 "-" "Go-http-client/1.1" 1913
    **IP_ADDRESS4** - - [21/Apr/2023:00:50:00 +0200] "GET /changelog HTTP/1.1" 200 134226 "-" "Go-http-client/1.1" 2010
    **IP_ADDRESS5** - - [21/Apr/2023:00:50:00 +0200] "GET /changelog HTTP/1.1" 200 134226 "-" "Go-http-client/1.1" 2010
    ...

The request frequency seems a bit too low to be the cause of this ticket though, despite the response size. So I'll keep looking for more in the Apache logs.
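
A sketch of the kind of check meant here; the window is one of the no-data periods from #11, and the field positions assume the combined log format with the response time appended, as in the lines above:

    # Requests per minute around one of the no-data windows
    grep '21/Apr/2023:02:' /var/log/apache2/openqa.access_log \
      | awk '{print $4}' | cut -c2-18 | sort | uniq -c
    # Only the /changelog polls in that window
    grep '21/Apr/2023:02:' /var/log/apache2/openqa.access_log | grep -c 'GET /changelog'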

Actions #15

Updated by openqa_review about 1 year ago

  • Due date set to 2023-05-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by kraih about 1 year ago

Same pattern again in the last two days.

23/04 23:45:12 - 01:59:48
23/04 01:58:12 - 04:00:00
22/04 23:06:24 - 00:00:12

I've checked the Apache logs for all cases in the past week and I'm not seeing anything particularly suspicious. Everything looks perfectly normal; there are hundreds of ordinary requests, and users interacting with the webui, during those timeframes. This leads me to believe that the problem might not be on the web server side.

Actions #17

Updated by kraih about 1 year ago

Also worth mentioning that this metric is based on inputs.tail (using /var/log/apache2/openqa.access_log, which can rotate more than once a day), whereas #127985 is based on inputs.http_response. I think I actually went into this with the wrong assumptions, expecting it to be a new-ish issue. But looking back at a graph from a year ago, we have always had these holes in the data, probably from the very start, which makes me think it might just be logrotate interfering with the telegraf data collection.
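
One way to make the holes visible directly in InfluxDB (a sketch: the database name is an assumption, the measurement and field names match the telegraf config quoted later in #27):

    # Count apache_log points per 5-minute bucket; empty buckets show up as 0
    influx -database telegraf -execute \
      "SELECT count(response_time_us) FROM apache_log WHERE time > now() - 24h GROUP BY time(5m) fill(0)"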

Actions #18

Updated by kraih about 1 year ago

It looks like this might have started with the switch from inputs.logparser to inputs.tail, because the data before the change has no holes: https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/c4e978078c285f5712265347889057ff32ab9bfe

Actions #19

Updated by kraih about 1 year ago

And I think this is the first case in our data.

2022/01/13 03:35:12 - 06:08:12
Actions #20

Updated by kraih about 1 year ago

Actually, I think it might have started with this change to the Apache logrotate config the day before: https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/659769de466389a844c77f1a4023bd2dc449a092

Actions #21

Updated by kraih about 1 year ago

Also worth mentioning that there does not appear to be any data missing in the rotated logs; there are countless requests logged during the time InfluxDB has no data. And the logrotate config already uses copytruncate, which would seem like the correct mode for telegraf inputs.tail. I wonder what else we could try.
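
For reference, what copytruncate does can be mimicked by hand to watch how a tail-style reader reacts; a minimal sketch with a hypothetical scratch file, not the production log:

    # copytruncate = copy the contents away, then truncate the original in place (same inode)
    cp /tmp/scratch.log /tmp/scratch.log.1
    : > /tmp/scratch.log
    # anything written between the copy and the truncate is lost, and a reader keeping an
    # offset past the new end of file has to notice the truncation before it resumes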

Actions #22

Updated by okurz almost 1 year ago

  • Description updated (diff)

Discussed in the unblock meeting 2023-04-26. Removed the host from salt and disabled the logrotate config for /var/log/apache2/openqa.access_log:

    salt-key -y -d openqa.suse.de
    sed -i 's/^./#&/' /etc/logrotate.d/openqa-apache

and check tomorrow if there are still data gaps

Actions #24

Updated by kraih 12 months ago

okurz wrote:

and check tomorrow if there are still data gaps

The graph for the last 24 hours looks clean.

Actions #25

Updated by kraih 12 months ago

okurz wrote:

salt-key -y -d openqa.suse.de
sed -i 's/^./#&/' /etc/logrotate.d/openqa-apache

And reverted.
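
A sketch of what the revert amounts to, assuming the comment-prefix approach from #22 and the rollback steps in the description:

    sed -i 's/^#//' /etc/logrotate.d/openqa-apache   # strip the leading '#' added in #22
    salt-key -y -a openqa.suse.de                    # let salt manage the host again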

Actions #26

Updated by kraih 12 months ago

Now we are going to need ideas for what to change in the logrotate config; theoretically copytruncate should have been correct for our use case:

/var/log/apache2/openqa.access_log {
    compress
    hourly
    maxage 30
    rotate 20
    size=+300000k
    notifempty
    missingok
    copytruncate
    compresscmd /usr/bin/xz
    uncompresscmd /usr/bin/xzdec
}
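
One way to gather ideas is to watch a rotation as it happens instead of waiting for the schedule; a minimal sketch (forcing a rotation on production is disruptive, so this is only illustrative):

    # Dry run: show what logrotate would do with this config, without touching the log
    logrotate -d /etc/logrotate.d/openqa-apache
    # Force a real rotation in verbose mode while watching the telegraf metric
    logrotate -v -f /etc/logrotate.d/openqa-apache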
Actions #27

Updated by kraih 12 months ago

There are also a few options to try in the telegraf config:

[[inputs.tail]]
  files = ["/var/log/apache2/openqa.access_log"]
  interval = "30s"
  from_beginning = false
  name_override = "apache_log"
  ## For parsing logstash-style "grok" patterns:
  data_format = "grok"
  grok_patterns = ["%{CUSTOM_LOG}"]
  grok_custom_pattern_files = []
  grok_custom_patterns = '''
      CUSTOM_LOG %{COMBINED_LOG_FORMAT} %{NUMBER:response_time_us:int}
  '''

For example, we could try changing the watch method, which defaults to inotify:

  ## Method used to watch for file updates.  Can be either "inotify" or "poll".
  # watch_method = "inotify"
Actions #28

Updated by kraih 12 months ago

  • Description updated (diff)

I'm now testing watch_method = "poll" in /etc/telegraf/telegraf.d/telegraf-webui.conf on OSD and have disabled salt with salt-key -y -d openqa.suse.de.
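
For the record, a sketch of what this manual test amounts to; the file path and the salt-key command are from this comment, the rest is an assumption about how it was applied:

    # In /etc/telegraf/telegraf.d/telegraf-webui.conf, under [[inputs.tail]], set:
    #   watch_method = "poll"
    # then restart the agent and keep salt from reverting the manual change:
    systemctl restart telegraf
    salt-key -y -d openqa.suse.de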

Actions #29

Updated by nicksinger 12 months ago

I added another silence lasting over the weekend. I can't really tell if the missing series was the problem (and therefore values were calculated incorrectly) or if we need to adjust the alert too. For now it is a good idea anyway to wait for stable metrics before adjusting the alert.

Actions #30

Updated by kraih 12 months ago

Unfortunately watch_method = "poll" did not make a difference, so I've enabled salt again. And we are going to need new ideas.

Actions #31

Updated by kraih 12 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (kraih)

Putting the ticket back, since I'm out of ideas.

Actions #32

Updated by okurz 12 months ago

  • Copied to action #128513: Apache Response Time monitoring dashboard is slow to render and should be made lighter added
Actions #33

Updated by okurz 12 months ago

  • Due date deleted (2023-05-06)
  • Status changed from Workable to Resolved
  • Assignee set to okurz

OK, so we currently have no silences related to this and no firing alert. I created #128513 to make the data more lightweight.

Actions #34

Updated by okurz 12 months ago

  • Copied to action #128789: [alert] Apache Response Time alert size:M added