action #128789: [alert] Apache Response Time alert size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

action #128789

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #108209: [epic] Reduce load on OSD

[alert] Apache Response Time alert size:M

Added by okurz about 2 years ago. Updated almost 2 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

nicksinger

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-04-01

Due date:

% Done:

Estimated time:

Tags:

infra

Description

Observation¶

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/zZAh5mf4k/view?orgId=1 alerted 2023-05-04 22:31Z

Suggestions¶

Look into the timeframe
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=84&orgId=1&from=1683230731919&to=1683273044855 and compare to other panels on OSD if it's visible what made the system busy, e.g. from https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=1683229623827&to=1683256684959
Also compare to the dashboard with openQA jobs
Research how one would save a snapshot of the current process table, like "htop", at times of alerts
Investigate the source and kind of requests from apache access logs

Rollback steps¶

Unsilence alert with uid zZAh5mf4k on https://stats.openqa-monitor.qa.suse.de/alerting/silences

Related issues 5 (0 open — 5 closed)

Actions

Copy link

Updated by okurz about 2 years ago

Copied from action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M added

Actions

Copy link

Updated by okurz about 2 years ago

Priority changed from High to Urgent

Actions

Copy link

Updated by okurz about 2 years ago

Subject changed from [alert] Apache Response Time alert to [alert] Apache Response Time alert size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by okurz about 2 years ago

Latest occurence from yesterday: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=84&from=1683401285710&to=1683438122920 , triggered alert

Actions

Copy link

Updated by livdywan about 2 years ago

okurz wrote:

Research how one would save a snapshot of the current process table, like "htop", at times of alerts

I did a little bit of research on that suggestion and found an idea to create a notification channel with a web hook which then can pull any data we're interested in whilst the alert is active. It doesn't seem like our Grafana has that feature, though?

Actions

Copy link

Updated by nicksinger about 2 years ago

Related to action #129065: [alert] HTTP Response alert fired, OSD loads slow size:M added

Actions

Copy link

Updated by nicksinger about 2 years ago

Assignee set to nicksinger

Actions

Copy link

Updated by okurz about 2 years ago

Status changed from Workable to In Progress

we already use web hooks to trigger gitlab pipelines for arm workers
add new metrics rather than a web hook
- for processes
- for ports?
- how do we avoid putting too much data?
- try https://github.com/influxdata/telegraf/blob/master/plugins/inputs/procstat/README.md which we already use on workers for some services
Consider upgrading telegraf as our version is quite old -> try first updates to https://build.opensuse.org/package/show/openSUSE:Backports:SLE-15-SP4:Update/telegraf
Include process count or process overview in telegraf data but don't overwhelm with too much data
Find upstream issues for "missing apache log" data which we consider a symptom of the same problem source

Actions

Copy link

Updated by openqa_review about 2 years ago

Due date set to 2023-05-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#10

Updated by nicksinger about 2 years ago

okurz wrote:

Find upstream issues for "missing apache log" data which we consider a symptom of the same problem source

I was unfortunately unable to figure out why telegraf is unable to follow the log after a rotate. There are several issues reported but none of them provide a solution despite being closed/resolved.
I went ahead with trying to reproduce manually with a minimal telegraf config:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false

[[outputs.file]]
  files = ["stdout", "/tmp/metrics.out"]

[[inputs.tail]]
  files = ["/tmp/testfile"]
  watch_method = "inotify"
  data_format = "value"

and started telegraf --config /etc/telegraf/telegraf.conf.tailtest --debug. While telegraf was running I used "echo 1 >> /tmp/testfile", etc. to push metrics which where recorded. After recreating that file with rm and echo I observed the same missing metrics as we see on OSD. killall -SIGHUP telegraf reopened the newly created file and recorded metrics again so I came up with the following workaround for now: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/856

Actions

Copy link

#11

Updated by okurz about 2 years ago

Great idea! I merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/856 . So should we wait over the weekend for this to see if there are gaps and if it helps us to discover something about the actual response time exceeding limits or is there something else that you have in mind that you would look into?

Actions

Copy link

#12

Updated by nicksinger about 2 years ago

I also introduced netstat to telegraf for our webui to collect the number of connections grouped by protocol which might help us to get a better picture on what bottlenecks us here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/857

Actions

Copy link

#13

Updated by jbaier_cz about 2 years ago

Related to action #129262: [openQA][webui] Can not access or refresh openQA website due to SSL failure added

Actions

Copy link

#14

Updated by livdywan about 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/860

Actions

Copy link

#15

Updated by okurz about 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/860 merged, nicksinger is preparing an MR to add grafana panels

Open point: Investigate the source and kind of requests from apache access logs

EDIT: Looking at https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard?orgId=1&refresh=5m&from=now-24h&to=now I think "openQA jobs that upload results" are overwhelming the webUI instance.

High or missed apache response times seem to correlate with high CPU usage

TODOs:

@nicksinger I suggest to pick at least some panels from https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard and put into our production dashboards
Consider bumping the apache response time alerts
Try to reduce the number of apache workers to limit concurrent requests causing high CPU usage -> #129481
Consider reducing number of openQA worker instances -> #129484
Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? -> #129487
Try nginx with enabled load limiting or load balancing features -> #129490
better nice level for velociraptor -> #129493

Actions

Copy link

#16

Updated by okurz about 2 years ago

Related to action #107875: [alert][osd] Apache Response Time alert size:M added

Actions

Copy link

#17

Updated by nicksinger about 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/862

Actions

Copy link

#18

Updated by okurz about 2 years ago

Copied to action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage added

Actions

Copy link

#19

Updated by okurz about 2 years ago

Parent task set to #108209

TODOs:

@nicksinger I suggest to pick at least some panels from https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard and put into our production dashboards
Consider bumping the apache response time alerts

Actions

Copy link

#20

Updated by okurz about 2 years ago

nsinger and me looked into alerts. We alert based on the median value of minimum response times. Because of network saturated due to uploading assets from jobs as visible in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-12h&to=now in network usage eth0, disk write to vdc aka assets and apache response sizes being in the range of gb we assume that apache has exhausted all apache workers being occupied working on the requested asset upload requests hence finishing nearly no requests during the observed outages. Due to the outages the first responses that actually go through as visible in https://stats.openqa-monitor.qa.suse.de/explore?left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22alias%22:%22Min%22,%22groupBy%22:%5B%7B%22params%22:%5B%22$__interval%22%5D,%22type%22:%22time%22%7D,%7B%22params%22:%5B%22null%22%5D,%22type%22:%22fill%22%7D%5D,%22interval%22:%2212s%22,%22intervalMs%22:1000,%22maxDataPoints%22:43200,%22measurement%22:%22apache_log%22,%22orderByTime%22:%22ASC%22,%22policy%22:%22default%22,%22query%22:%22SELECT%20min%28%5C%22response_time_us%5C%22%29%20FROM%20%5C%22apache_log%5C%22%20WHERE%20%28%5C%22host%5C%22%20%3D%20%27openqa%27%20AND%20%5C%22request%5C%22%20%21~%20%2F%5C%5C%2Fapi%5C%5C%2Fv1%5C%5C%2Fws%2F%29%20AND%20$timeFilter%20GROUP%20BY%20time%28$__interval%29%20fill%28null%29%22,%22rawQuery%22:false,%22resultFormat%22:%22time_series%22,%22select%22:%5B%5B%7B%22params%22:%5B%22response_time_us%22%5D,%22type%22:%22field%22%7D,%7B%22params%22:%5B%5D,%22type%22:%22min%22%7D%5D%5D,%22tags%22:%5B%7B%22key%22:%22host%22,%22operator%22:%22%3D%22,%22value%22:%22openqa%22%7D,%7B%22condition%22:%22AND%22,%22key%22:%22request%22,%22operator%22:%22%21~%22,%22value%22:%22%2F%5C%5C%2Fapi%5C%5C%2Fv1%5C%5C%2Fws%2F%22%7D%5D%7D%5D,%22range%22:%7B%22from%22:%221684671333664%22,%22to%22:%221684747761127%22%7D%7D&orgId=1 are in the range of 500s or slightly below meaning that "median of min over last 30m" suddenly exceed the threshold because there are mostly no data and then suddenly 500s. Our suggestion: Create two alert queries: One for absence of responses finished and the existing one changed from fill(null) to fill(0) to only alert if actually finished requests are too slow.

Actions

Copy link

#21

Updated by nicksinger about 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/866 for filling up the missing data in the alert query with zeros.
Regarding "busy apache workers" I think I have to object your hypothesis. Looking at e.g. this gap: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684702551192&to=1684713235063
You can see that we're missing data for "Apache Response Time" and "Response Size". The "Response Codes" panel reports close to 0 answers (sometime one or two 101's) in that time frame, however, we served requests in that period:

openqa:/var/log/apache2 # xzcat access_log*.xz | grep "21/May/2023:23:[3-9]" | wc -l
3196

Something doesn't add up here. Also the "Apache Workers" panel displays 100-200 "idle" workers in that time period. What is suspicious is the sudden peak in "Finishing" apache workers right at 00:00 ("Scoreboard" panel) with a sudden drop to zero idle workers which recovers quickly after again while the "Finishing" tasks (whatever that means) takes almost 2h to settle. A similar behavior can be found here too: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684617976043&to=1684628427656

Actions

Copy link

#22

Updated by nicksinger about 2 years ago

Tried to come up with the second alert here: https://stats.openqa-monitor.qa.suse.de/alerting/a0fa05d6-307f-40b3-9408-9f5d6b7d2392/edit - still missing: provision it via Salt

Actions

Copy link

#23

Updated by okurz about 2 years ago

nicksinger wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/866 for filling up the missing data in the alert query with zeros.

merged, should have been effective but we still have received another alert. Why is that?

Regarding "busy apache workers" I think I have to object your hypothesis. Looking at e.g. this gap: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684702551192&to=1684713235063
You can see that we're missing data for "Apache Response Time" and "Response Size". The "Response Codes" panel reports close to 0 answers (sometime one or two 101's) in that time frame, however, we served requests in that period:
openqa:/var/log/apache2 # xzcat access_log*.xz | grep "21/May/2023:23:[3-9]" | wc -l
3196
Something doesn't add up here. Also the "Apache Workers" panel displays 100-200 "idle" workers in that time period. What is suspicious is the sudden peak in "Finishing" apache workers right at 00:00 ("Scoreboard" panel) with a sudden drop to zero idle workers which recovers quickly after again while the "Finishing" tasks (whatever that means) takes almost 2h to settle. A similar behavior can be found here too: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684617976043&to=1684628427656

Could it be that request still bunch up and are appearing as a big bunch with "old" timestamps in the log later so that afterwards the log looks complete but still there was no new entry for hours in between?

Actions

Copy link

#24

Updated by okurz about 2 years ago

Description updated (diff)

Created silence

Actions

Copy link

#25

Updated by okurz about 2 years ago

Due date deleted (~~2023-05-26~~)
Status changed from In Progress to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/867, merged. Also https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/858 merged now so we should see the effect of http-only in the same monitoring panels as the https ones, i.e. both internal and external together.

nsinger, mkittler, okurz also conducted an experiment involving telegraf and simulated writing to logs. Also we reviewed https://httpd.apache.org/docs/2.4/mod/mod_log_config.html . The timestamp apache logs is the timestamp when a request is received, i.e. the starting time, but the log line is only written when apache has finished the request. So the log lines can refer to older requests in the past with an old timestamp. That timestamp is also the timestamp that data points will show up in grafana (because the same timestamp is used by influxdb and telegraf uses the apache timestamp when forwarding data points to influxdb.

We should keep the alert silenced until we have features implemented. Moved reactivation to #108209

Actions

Copy link

#26

Updated by dheidler about 2 years ago

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/a0fa05d6-307f-40b3-9408-9f5d6b7d2392/view?orgId=1
fired on 31.05.2023 06:50-06:55.

Actions

Copy link

#27

Updated by mkittler almost 2 years ago

It happened again today.

Actions

Copy link

#28

Updated by mkittler almost 2 years ago

Status changed from Resolved to Workable

Actions

Copy link

#29

Updated by okurz almost 2 years ago

Priority changed from Urgent to Immediate

Setting to "immediate" due to daily alert

Actions

Copy link

#30

Updated by livdywan almost 2 years ago

okurz wrote:

We should keep the alert silenced until we have features implemented. Moved reactivation to #108209

That apparently didn't happen? Do we just want to silence the alert? #108209 is not on the backlog right now so let's keep that in mind.

Actions

Copy link

#31

Updated by nicksinger almost 2 years ago

Priority changed from Immediate to Urgent

Silence added for now.

Actions

Copy link

#32

Updated by okurz almost 2 years ago

I added #108209 to the backlog but that shouldn't have mattered for a silence. nicksinger added an alert "alertname=Apache log data". That reduces priority again which is good. I mentioned also the "apache log alert" to be re-enabled as part of #129745.

@nicksinger what are your further plans for this ticket? You could just resolve the ticket assuming that we know that the situation is as it is with silenced alerts that we try to re-enable. Or you could adjust alerts with much more conservative thresholds so that we can get rid of silences after being sure that relaxed thresholds would only warn us if the situation worsens again significantly.

Actions

Copy link

#33

Updated by nicksinger almost 2 years ago

I'm tempted to remove the "No apache log" alert again as it doesn't provide us with any valuable information

Actions

Copy link

#34

Updated by okurz almost 2 years ago

nicksinger wrote:

I'm tempted to remove the "No apache log" alert again as it doesn't provide us with any valuable information

Why do you think so? We decided that the response time alert should only alert on actual response times and hence changed it accordingly and then added the new alert to alert if there is actually missing data. What has changed?

Actions

Copy link

#35

Updated by okurz almost 2 years ago

As discussed in daily infra call 2023-06-12 @nicksinger will remove the "apache log data" alert again as it does not help us to provide anymore insight in particular because the actual data panel shows complete graphs because retroactively any gaps are filled anyway.

Actions

Copy link

#36

Updated by nicksinger almost 2 years ago

Status changed from Workable to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/883

Actions

Copy link

#37

Updated by okurz almost 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/883 merged but could not be deployed yet due to other problems. Let's await the other pipelines that are currently busy and should deploy, e.g. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1631882 or https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1632007

Actions

Copy link

#38

Updated by nicksinger almost 2 years ago

Status changed from Feedback to Resolved

Pipeline and deployment was successful, stale alert from Grafana was removed based on https://gitlab.suse.de/openqa/salt-states-openqa#removing-stale-provisioned-alerts

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #128789

[alert] Apache Response Time alert size:M

Observation¶

Suggestions¶

Rollback steps¶

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by livdywan about 2 years ago

Updated by nicksinger about 2 years ago

Updated by nicksinger about 2 years ago

Updated by okurz about 2 years ago

Updated by openqa_review about 2 years ago

Updated by nicksinger about 2 years ago

Updated by okurz about 2 years ago

Updated by nicksinger about 2 years ago

Updated by jbaier_cz about 2 years ago

Updated by livdywan about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by nicksinger about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by nicksinger about 2 years ago

Updated by nicksinger about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by dheidler about 2 years ago

Updated by mkittler almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by okurz almost 2 years ago

Updated by livdywan almost 2 years ago

Updated by nicksinger almost 2 years ago

Updated by okurz almost 2 years ago

Updated by nicksinger almost 2 years ago

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by nicksinger almost 2 years ago

Updated by okurz almost 2 years ago

Updated by nicksinger almost 2 years ago