Project

General

Profile

Actions

action #128789

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

[alert] Apache Response Time alert size:M

Added by okurz about 1 year ago. Updated almost 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2023-04-01
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/zZAh5mf4k/view?orgId=1 alerted 2023-05-04 22:31Z

Suggestions

Rollback steps


Related issues 5 (0 open5 closed)

Related to openQA Infrastructure - action #129065: [alert] HTTP Response alert fired, OSD loads slow size:MResolvedmkittler2023-04-01

Actions
Related to openQA Project - action #129262: [openQA][webui] Can not access or refresh openQA website due to SSL failureResolvedokurz2023-05-15

Actions
Related to openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:MResolvedtinita2022-03-042022-03-24

Actions
Copied from openQA Infrastructure - action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:MResolvedokurz2023-04-01

Actions
Copied to openQA Project - action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usageRejectedokurz

Actions
Actions #1

Updated by okurz about 1 year ago

  • Copied from action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M added
Actions #2

Updated by okurz about 1 year ago

  • Priority changed from High to Urgent
Actions #3

Updated by okurz about 1 year ago

  • Subject changed from [alert] Apache Response Time alert to [alert] Apache Response Time alert size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by livdywan about 1 year ago

okurz wrote:

  • Research how one would save a snapshot of the current process table, like "htop", at times of alerts

I did a little bit of research on that suggestion and found an idea to create a notification channel with a web hook which then can pull any data we're interested in whilst the alert is active. It doesn't seem like our Grafana has that feature, though?

Actions #6

Updated by nicksinger about 1 year ago

  • Related to action #129065: [alert] HTTP Response alert fired, OSD loads slow size:M added
Actions #7

Updated by nicksinger about 1 year ago

  • Assignee set to nicksinger
Actions #8

Updated by okurz about 1 year ago

  • Status changed from Workable to In Progress
Actions #9

Updated by openqa_review about 1 year ago

  • Due date set to 2023-05-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by nicksinger about 1 year ago

okurz wrote:

  • Find upstream issues for "missing apache log" data which we consider a symptom of the same problem source

I was unfortunately unable to figure out why telegraf is unable to follow the log after a rotate. There are several issues reported but none of them provide a solution despite being closed/resolved.
I went ahead with trying to reproduce manually with a minimal telegraf config:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false

[[outputs.file]]
  files = ["stdout", "/tmp/metrics.out"]

[[inputs.tail]]
  files = ["/tmp/testfile"]
  watch_method = "inotify"
  data_format = "value"

and started telegraf --config /etc/telegraf/telegraf.conf.tailtest --debug. While telegraf was running I used "echo 1 >> /tmp/testfile", etc. to push metrics which where recorded. After recreating that file with rm and echo I observed the same missing metrics as we see on OSD. killall -SIGHUP telegraf reopened the newly created file and recorded metrics again so I came up with the following workaround for now: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/856

Actions #11

Updated by okurz about 1 year ago

Great idea! I merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/856 . So should we wait over the weekend for this to see if there are gaps and if it helps us to discover something about the actual response time exceeding limits or is there something else that you have in mind that you would look into?

Actions #12

Updated by nicksinger about 1 year ago

I also introduced netstat to telegraf for our webui to collect the number of connections grouped by protocol which might help us to get a better picture on what bottlenecks us here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/857

Actions #13

Updated by jbaier_cz about 1 year ago

  • Related to action #129262: [openQA][webui] Can not access or refresh openQA website due to SSL failure added
Actions #15

Updated by okurz about 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/860 merged, nicksinger is preparing an MR to add grafana panels

Open point: Investigate the source and kind of requests from apache access logs

EDIT: Looking at https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard?orgId=1&refresh=5m&from=now-24h&to=now I think "openQA jobs that upload results" are overwhelming the webUI instance.

High or missed apache response times seem to correlate with high CPU usage

TODOs:

  1. @nicksinger I suggest to pick at least some panels from https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard and put into our production dashboards
  2. Consider bumping the apache response time alerts
  3. Try to reduce the number of apache workers to limit concurrent requests causing high CPU usage -> #129481
  4. Consider reducing number of openQA worker instances -> #129484
  5. Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? -> #129487
  6. Try nginx with enabled load limiting or load balancing features -> #129490
  7. better nice level for velociraptor -> #129493
Actions #16

Updated by okurz about 1 year ago

  • Related to action #107875: [alert][osd] Apache Response Time alert size:M added
Actions #18

Updated by okurz about 1 year ago

  • Copied to action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage added
Actions #19

Updated by okurz about 1 year ago

  • Parent task set to #108209

TODOs:

  1. @nicksinger I suggest to pick at least some panels from https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard and put into our production dashboards
  2. Consider bumping the apache response time alerts
Actions #20

Updated by okurz about 1 year ago

nsinger and me looked into alerts. We alert based on the median value of minimum response times. Because of network saturated due to uploading assets from jobs as visible in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-12h&to=now in network usage eth0, disk write to vdc aka assets and apache response sizes being in the range of gb we assume that apache has exhausted all apache workers being occupied working on the requested asset upload requests hence finishing nearly no requests during the observed outages. Due to the outages the first responses that actually go through as visible in https://stats.openqa-monitor.qa.suse.de/explore?left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22alias%22:%22Min%22,%22groupBy%22:%5B%7B%22params%22:%5B%22$__interval%22%5D,%22type%22:%22time%22%7D,%7B%22params%22:%5B%22null%22%5D,%22type%22:%22fill%22%7D%5D,%22interval%22:%2212s%22,%22intervalMs%22:1000,%22maxDataPoints%22:43200,%22measurement%22:%22apache_log%22,%22orderByTime%22:%22ASC%22,%22policy%22:%22default%22,%22query%22:%22SELECT%20min%28%5C%22response_time_us%5C%22%29%20FROM%20%5C%22apache_log%5C%22%20WHERE%20%28%5C%22host%5C%22%20%3D%20%27openqa%27%20AND%20%5C%22request%5C%22%20%21~%20%2F%5C%5C%2Fapi%5C%5C%2Fv1%5C%5C%2Fws%2F%29%20AND%20$timeFilter%20GROUP%20BY%20time%28$__interval%29%20fill%28null%29%22,%22rawQuery%22:false,%22resultFormat%22:%22time_series%22,%22select%22:%5B%5B%7B%22params%22:%5B%22response_time_us%22%5D,%22type%22:%22field%22%7D,%7B%22params%22:%5B%5D,%22type%22:%22min%22%7D%5D%5D,%22tags%22:%5B%7B%22key%22:%22host%22,%22operator%22:%22%3D%22,%22value%22:%22openqa%22%7D,%7B%22condition%22:%22AND%22,%22key%22:%22request%22,%22operator%22:%22%21~%22,%22value%22:%22%2F%5C%5C%2Fapi%5C%5C%2Fv1%5C%5C%2Fws%2F%22%7D%5D%7D%5D,%22range%22:%7B%22from%22:%221684671333664%22,%22to%22:%221684747761127%22%7D%7D&orgId=1 are in the range of 500s or slightly below meaning that "median of min over last 30m" suddenly exceed the threshold because there are mostly no data and then suddenly 500s. Our suggestion: Create two alert queries: One for absence of responses finished and the existing one changed from fill(null) to fill(0) to only alert if actually finished requests are too slow.

Actions #21

Updated by nicksinger about 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/866 for filling up the missing data in the alert query with zeros.
Regarding "busy apache workers" I think I have to object your hypothesis. Looking at e.g. this gap: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684702551192&to=1684713235063
You can see that we're missing data for "Apache Response Time" and "Response Size". The "Response Codes" panel reports close to 0 answers (sometime one or two 101's) in that time frame, however, we served requests in that period:

openqa:/var/log/apache2 # xzcat access_log*.xz | grep "21/May/2023:23:[3-9]" | wc -l
3196

Something doesn't add up here. Also the "Apache Workers" panel displays 100-200 "idle" workers in that time period. What is suspicious is the sudden peak in "Finishing" apache workers right at 00:00 ("Scoreboard" panel) with a sudden drop to zero idle workers which recovers quickly after again while the "Finishing" tasks (whatever that means) takes almost 2h to settle. A similar behavior can be found here too: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684617976043&to=1684628427656

Actions #22

Updated by nicksinger about 1 year ago

Tried to come up with the second alert here: https://stats.openqa-monitor.qa.suse.de/alerting/a0fa05d6-307f-40b3-9408-9f5d6b7d2392/edit - still missing: provision it via Salt

Actions #23

Updated by okurz about 1 year ago

nicksinger wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/866 for filling up the missing data in the alert query with zeros.

merged, should have been effective but we still have received another alert. Why is that?

Regarding "busy apache workers" I think I have to object your hypothesis. Looking at e.g. this gap: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684702551192&to=1684713235063
You can see that we're missing data for "Apache Response Time" and "Response Size". The "Response Codes" panel reports close to 0 answers (sometime one or two 101's) in that time frame, however, we served requests in that period:

openqa:/var/log/apache2 # xzcat access_log*.xz | grep "21/May/2023:23:[3-9]" | wc -l
3196

Something doesn't add up here. Also the "Apache Workers" panel displays 100-200 "idle" workers in that time period. What is suspicious is the sudden peak in "Finishing" apache workers right at 00:00 ("Scoreboard" panel) with a sudden drop to zero idle workers which recovers quickly after again while the "Finishing" tasks (whatever that means) takes almost 2h to settle. A similar behavior can be found here too: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684617976043&to=1684628427656

Could it be that request still bunch up and are appearing as a big bunch with "old" timestamps in the log later so that afterwards the log looks complete but still there was no new entry for hours in between?

Actions #24

Updated by okurz about 1 year ago

  • Description updated (diff)

Created silence

Actions #25

Updated by okurz about 1 year ago

  • Due date deleted (2023-05-26)
  • Status changed from In Progress to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/867, merged. Also https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/858 merged now so we should see the effect of http-only in the same monitoring panels as the https ones, i.e. both internal and external together.

nsinger, mkittler, okurz also conducted an experiment involving telegraf and simulated writing to logs. Also we reviewed https://httpd.apache.org/docs/2.4/mod/mod_log_config.html . The timestamp apache logs is the timestamp when a request is received, i.e. the starting time, but the log line is only written when apache has finished the request. So the log lines can refer to older requests in the past with an old timestamp. That timestamp is also the timestamp that data points will show up in grafana (because the same timestamp is used by influxdb and telegraf uses the apache timestamp when forwarding data points to influxdb.

We should keep the alert silenced until we have features implemented. Moved reactivation to #108209

Actions #27

Updated by mkittler about 1 year ago

It happened again today.

Actions #28

Updated by mkittler about 1 year ago

  • Status changed from Resolved to Workable
Actions #29

Updated by okurz about 1 year ago

  • Priority changed from Urgent to Immediate

Setting to "immediate" due to daily alert

Actions #30

Updated by livdywan about 1 year ago

okurz wrote:

We should keep the alert silenced until we have features implemented. Moved reactivation to #108209

That apparently didn't happen? Do we just want to silence the alert? #108209 is not on the backlog right now so let's keep that in mind.

Actions #31

Updated by nicksinger about 1 year ago

  • Priority changed from Immediate to Urgent

Silence added for now.

Actions #32

Updated by okurz about 1 year ago

I added #108209 to the backlog but that shouldn't have mattered for a silence. nicksinger added an alert "alertname=Apache log data". That reduces priority again which is good. I mentioned also the "apache log alert" to be re-enabled as part of #129745.

@nicksinger what are your further plans for this ticket? You could just resolve the ticket assuming that we know that the situation is as it is with silenced alerts that we try to re-enable. Or you could adjust alerts with much more conservative thresholds so that we can get rid of silences after being sure that relaxed thresholds would only warn us if the situation worsens again significantly.

Actions #33

Updated by nicksinger about 1 year ago

I'm tempted to remove the "No apache log" alert again as it doesn't provide us with any valuable information

Actions #34

Updated by okurz about 1 year ago

nicksinger wrote:

I'm tempted to remove the "No apache log" alert again as it doesn't provide us with any valuable information

Why do you think so? We decided that the response time alert should only alert on actual response times and hence changed it accordingly and then added the new alert to alert if there is actually missing data. What has changed?

Actions #35

Updated by okurz about 1 year ago

As discussed in daily infra call 2023-06-12 @nicksinger will remove the "apache log data" alert again as it does not help us to provide anymore insight in particular because the actual data panel shows complete graphs because retroactively any gaps are filled anyway.

Actions #36

Updated by nicksinger about 1 year ago

  • Status changed from Workable to Feedback
Actions #37

Updated by okurz about 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/883 merged but could not be deployed yet due to other problems. Let's await the other pipelines that are currently busy and should deploy, e.g. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1631882 or https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1632007

Actions #38

Updated by nicksinger almost 1 year ago

  • Status changed from Feedback to Resolved

Pipeline and deployment was successful, stale alert from Grafana was removed based on https://gitlab.suse.de/openqa/salt-states-openqa#removing-stale-provisioned-alerts

Actions

Also available in: Atom PDF