action #128789
closed
openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
[alert] Apache Response Time alert size:M
Description
Observation
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/zZAh5mf4k/view?orgId=1 alerted 2023-05-04 22:31Z
Suggestions
- Look into the timeframe https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=84&orgId=1&from=1683230731919&to=1683273044855 and compare with other panels on OSD to see whether it is visible what made the system busy, e.g. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=1683229623827&to=1683256684959
- Also compare to the dashboard with openQA jobs
- Research how one would save a snapshot of the current process table, like "htop", at times of alerts
- Investigate the source and kind of requests from apache access logs
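For the access log investigation, a possible starting point on the command line (a sketch only, assuming the default combined log format and the usual /var/log/apache2/access_log location):
# most frequently requested paths (the request path is field 7 in the combined log format)
awk '{print $7}' /var/log/apache2/access_log | sort | uniq -c | sort -rn | head -20
# requests per HTTP method
awk '{print $6}' /var/log/apache2/access_log | tr -d '"' | sort | uniq -c | sort -rn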
Rollback steps
- Unsilence alert with uid zZAh5mf4k on https://stats.openqa-monitor.qa.suse.de/alerting/silences
Updated by okurz over 1 year ago
- Copied from action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M added
Updated by okurz over 1 year ago
- Subject changed from [alert] Apache Response Time alert to [alert] Apache Response Time alert size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
Latest occurrence from yesterday, which triggered the alert: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=84&from=1683401285710&to=1683438122920
Updated by livdywan over 1 year ago
okurz wrote:
- Research how one would save a snapshot of the current process table, like "htop", at times of alerts
I did a little bit of research on that suggestion and found the idea of creating a notification channel with a webhook, which could then pull any data we're interested in while the alert is active. It doesn't seem like our Grafana has that feature, though?
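Independent of Grafana features, the snapshot itself could be as simple as the following (just a sketch; the output path and process selection are placeholders):
# dump the 50 most CPU-hungry processes into a timestamped file, e.g. triggered by a webhook handler or a systemd timer
ps aux --sort=-%cpu | head -n 50 > "/var/log/process-snapshots/ps-$(date +%F_%H%M%S).txt"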
Updated by nicksinger over 1 year ago
- Related to action #129065: [alert] HTTP Response alert fired, OSD loads slow size:M added
Updated by okurz over 1 year ago
- Status changed from Workable to In Progress
- we already use web hooks to trigger gitlab pipelines for arm workers
- add new metrics rather than a web hook
- for processes
- for ports?
- how do we avoid collecting too much data?
- try https://github.com/influxdata/telegraf/blob/master/plugins/inputs/procstat/README.md which we already use on workers for some services (see the config sketch after this list)
- Consider upgrading telegraf as our version is quite old -> try first updates to https://build.opensuse.org/package/show/openSUSE:Backports:SLE-15-SP4:Update/telegraf
- Include process count or process overview in telegraf data but don't overwhelm with too much data
- Find upstream issues for "missing apache log" data which we consider a symptom of the same problem source
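A minimal sketch of what the procstat input mentioned above could look like (the selected processes and unit names are just examples, not a final choice):
[[inputs.procstat]]
  # example: per-process metrics for the apache processes on the web UI host
  pattern = "apache2"
[[inputs.procstat]]
  # example: the openQA web UI service itself
  systemd_unit = "openqa-webui.service"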
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 1 year ago
okurz wrote:
- Find upstream issues for "missing apache log" data which we consider a symptom of the same problem source
I was unfortunately unable to figure out why telegraf is unable to follow the log after a rotate. There are several issues reported but none of them provide a solution despite being closed/resolved.
I went ahead with trying to reproduce manually with a minimal telegraf config:
[global_tags]
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = "0s"
hostname = ""
omit_hostname = false
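# write the collected metrics to stdout and to a local file for inspection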
[[outputs.file]]
files = ["stdout", "/tmp/metrics.out"]
[[inputs.tail]]
files = ["/tmp/testfile"]
watch_method = "inotify"
data_format = "value"
and started telegraf --config /etc/telegraf/telegraf.conf.tailtest --debug. While telegraf was running I used "echo 1 >> /tmp/testfile", etc. to push metrics, which were recorded. After recreating that file with rm and echo I observed the same missing metrics as we see on OSD. killall -SIGHUP telegraf reopened the newly created file and metrics were recorded again, so I came up with the following workaround for now: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/856
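Conceptually, the workaround amounts to making telegraf re-open the rotated file, roughly like this logrotate postrotate hook (a sketch only; the actual change is in the MR above):
/var/log/apache2/access_log {
  # existing rotation settings stay as they are
  postrotate
    # make telegraf re-open its tailed files after rotation
    /usr/bin/killall -HUP telegraf || true
  endscript
}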
Updated by okurz over 1 year ago
Great idea! I merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/856 . Should we wait over the weekend to see whether gaps still appear and whether this helps us learn something about the actual response times exceeding limits, or is there something else you have in mind to look into?
Updated by nicksinger over 1 year ago
I also introduced netstat to telegraf for our webUI to collect the number of connections grouped by protocol, which might help us get a better picture of what bottlenecks us here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/857
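For reference, the corresponding telegraf input is tiny; a minimal sketch (the actual configuration is in the MR):
[[inputs.netstat]]
  # no options needed; the plugin reports the number of TCP connections per state
  # (established, time_wait, close_wait, ...) and the number of UDP sockets in use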
Updated by jbaier_cz over 1 year ago
- Related to action #129262: [openQA][webui] Can not access or refresh openQA website due to SSL failure added
Updated by okurz over 1 year ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/860 merged, nicksinger is preparing an MR to add grafana panels
Open point: Investigate the source and kind of requests from apache access logs
EDIT: Looking at https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard?orgId=1&refresh=5m&from=now-24h&to=now I think "openQA jobs that upload results" are overwhelming the webUI instance.
High or missed apache response times seem to correlate with high CPU usage
TODOs:
- @nicksinger I suggest to pick at least some panels from https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard and put into our production dashboards
- Consider bumping the apache response time alert thresholds
- Try to reduce the number of apache workers to limit concurrent requests causing high CPU usage -> #129481
- Consider reducing number of openQA worker instances -> #129484
- Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? -> #129487
- Try nginx with enabled load limiting or load balancing features -> #129490
- better nice level for velociraptor -> #129493
Updated by okurz over 1 year ago
- Related to action #107875: [alert][osd] Apache Response Time alert size:M added
Updated by okurz over 1 year ago
- Copied to action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage added
Updated by okurz over 1 year ago
- Parent task set to #108209
TODOs:
- @nicksinger I suggest to pick at least some panels from https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard and put into our production dashboards
- Consider bumping the apache response time alert thresholds
Updated by okurz over 1 year ago
nsinger and I looked into the alerts. We alert based on the median value of minimum response times. The network is saturated by asset uploads from jobs, as visible in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-12h&to=now in the network usage on eth0, the disk writes to vdc (aka assets) and apache response sizes in the range of GB. We therefore assume that all apache workers are occupied handling asset upload requests, so apache finishes nearly no other requests during the observed outages. Due to the outages, the first responses that actually go through, as visible in https://stats.openqa-monitor.qa.suse.de/explore?left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22alias%22:%22Min%22,%22groupBy%22:%5B%7B%22params%22:%5B%22$__interval%22%5D,%22type%22:%22time%22%7D,%7B%22params%22:%5B%22null%22%5D,%22type%22:%22fill%22%7D%5D,%22interval%22:%2212s%22,%22intervalMs%22:1000,%22maxDataPoints%22:43200,%22measurement%22:%22apache_log%22,%22orderByTime%22:%22ASC%22,%22policy%22:%22default%22,%22query%22:%22SELECT%20min%28%5C%22response_time_us%5C%22%29%20FROM%20%5C%22apache_log%5C%22%20WHERE%20%28%5C%22host%5C%22%20%3D%20%27openqa%27%20AND%20%5C%22request%5C%22%20%21~%20%2F%5C%5C%2Fapi%5C%5C%2Fv1%5C%5C%2Fws%2F%29%20AND%20$timeFilter%20GROUP%20BY%20time%28$__interval%29%20fill%28null%29%22,%22rawQuery%22:false,%22resultFormat%22:%22time_series%22,%22select%22:%5B%5B%7B%22params%22:%5B%22response_time_us%22%5D,%22type%22:%22field%22%7D,%7B%22params%22:%5B%5D,%22type%22:%22min%22%7D%5D%5D,%22tags%22:%5B%7B%22key%22:%22host%22,%22operator%22:%22%3D%22,%22value%22:%22openqa%22%7D,%7B%22condition%22:%22AND%22,%22key%22:%22request%22,%22operator%22:%22%21~%22,%22value%22:%22%2F%5C%5C%2Fapi%5C%5C%2Fv1%5C%5C%2Fws%2F%22%7D%5D%7D%5D,%22range%22:%7B%22from%22:%221684671333664%22,%22to%22:%221684747761127%22%7D%7D&orgId=1 , are in the range of 500s or slightly below, meaning that the "median of min over last 30m" suddenly exceeds the threshold because there is mostly no data and then suddenly values around 500s. Our suggestion: create two alert queries, one for the absence of finished responses, and the existing one changed from fill(null) to fill(0) so that it only alerts if actually finished requests are too slow.
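Based on the query visible in the explore link above, the two alert queries could look roughly like this (a sketch, not the exact provisioned alert definitions):
-- existing response time alert query, with empty intervals filled with 0 instead of null
SELECT min("response_time_us") FROM "apache_log" WHERE ("host" = 'openqa' AND "request" !~ /\/api\/v1\/ws/) AND $timeFilter GROUP BY time($__interval) fill(0)
-- hypothetical companion query that fires when no requests finish at all
SELECT count("response_time_us") FROM "apache_log" WHERE ("host" = 'openqa' AND "request" !~ /\/api\/v1\/ws/) AND $timeFilter GROUP BY time($__interval) fill(0)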
Updated by nicksinger over 1 year ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/866 for filling up the missing data in the alert query with zeros.
Regarding "busy apache workers" I think I have to object your hypothesis. Looking at e.g. this gap: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684702551192&to=1684713235063
You can see that we're missing data for "Apache Response Time" and "Response Size". The "Response Codes" panel reports close to 0 answers (sometime one or two 101's) in that time frame, however, we served requests in that period:
openqa:/var/log/apache2 # xzcat access_log*.xz | grep "21/May/2023:23:[3-9]" | wc -l
3196
Something doesn't add up here. Also, the "Apache Workers" panel displays 100-200 "idle" workers in that time period. What is suspicious is the sudden peak of "Finishing" apache workers right at 00:00 ("Scoreboard" panel) together with a sudden drop to zero idle workers, which recovers quickly again, while the "Finishing" tasks (whatever that means) take almost 2h to settle. A similar behavior can be seen here too: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684617976043&to=1684628427656
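To cross-check the scoreboard live during such a phase one could query mod_status directly (assuming it is enabled for localhost, which the "Apache Workers" panel suggests):
# machine-readable status; in the scoreboard string "G" marks workers that are gracefully finishing
curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers|Scoreboard'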
Updated by nicksinger over 1 year ago
Tried to come up with the second alert here: https://stats.openqa-monitor.qa.suse.de/alerting/a0fa05d6-307f-40b3-9408-9f5d6b7d2392/edit - still missing: provision it via Salt
Updated by okurz over 1 year ago
nicksinger wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/866 for filling up the missing data in the alert query with zeros.
Merged; it should have become effective, but we still received another alert. Why is that?
Regarding "busy apache workers" I think I have to object your hypothesis. Looking at e.g. this gap: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684702551192&to=1684713235063
You can see that we're missing data for "Apache Response Time" and "Response Size". The "Response Codes" panel reports close to 0 answers (sometime one or two 101's) in that time frame, however, we served requests in that period:openqa:/var/log/apache2 # xzcat access_log*.xz | grep "21/May/2023:23:[3-9]" | wc -l 3196
Something doesn't add up here. Also the "Apache Workers" panel displays 100-200 "idle" workers in that time period. What is suspicious is the sudden peak in "Finishing" apache workers right at 00:00 ("Scoreboard" panel) with a sudden drop to zero idle workers which recovers quickly after again while the "Finishing" tasks (whatever that means) takes almost 2h to settle. A similar behavior can be found here too: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1684617976043&to=1684628427656
Could it be that requests still bunch up and appear later as a big batch with "old" timestamps in the log, so that afterwards the log looks complete even though there were no new entries for hours in between?
Updated by okurz over 1 year ago
- Due date deleted (2023-05-26)
- Status changed from In Progress to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/867, merged. Also https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/858 merged now so we should see the effect of http-only in the same monitoring panels as the https ones, i.e. both internal and external together.
nsinger, mkittler, okurz also conducted an experiment involving telegraf and simulated log writes. We also reviewed https://httpd.apache.org/docs/2.4/mod/mod_log_config.html . The timestamp apache logs is the time when a request is received, i.e. the start time, but the log line is only written once apache has finished the request. So log lines can refer to older requests in the past with an old timestamp. That timestamp is also the timestamp under which data points show up in grafana, because telegraf uses the apache timestamp when forwarding data points to influxdb.
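For illustration, mod_log_config would also allow logging the end time and the duration explicitly; a sketch (not our production LogFormat):
# %t is the time the request was received; %{end:...}t is the time the log line is written; %{ms}T is the duration in milliseconds (apache >= 2.4.13)
LogFormat "%h %l %u %t \"%r\" %>s %b %{ms}T %{end:%d/%b/%Y:%H:%M:%S %z}t" combined_with_end_time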
We should keep the alert silenced until we have features implemented. Moved reactivation to #108209
Updated by dheidler over 1 year ago
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/a0fa05d6-307f-40b3-9408-9f5d6b7d2392/view?orgId=1
fired on 2023-05-31 06:50-06:55.
Updated by okurz over 1 year ago
- Priority changed from Urgent to Immediate
Setting to "immediate" due to daily alert
Updated by nicksinger over 1 year ago
- Priority changed from Immediate to Urgent
Silence added for now.
Updated by okurz over 1 year ago
I added #108209 to the backlog but that shouldn't have mattered for a silence. nicksinger added an alert "alertname=Apache log data". That reduces the priority again, which is good. I also mentioned the "apache log alert" to be re-enabled as part of #129745.
@nicksinger what are your further plans for this ticket? You could just resolve it, given that we know the situation is as it is, with silenced alerts that we try to re-enable later. Or you could adjust the alerts to much more conservative thresholds so that we can get rid of the silences, being sure that the relaxed thresholds would only warn us if the situation worsens significantly again.
Updated by nicksinger over 1 year ago
I'm tempted to remove the "No apache log" alert again as it doesn't provide us with any valuable information
Updated by okurz over 1 year ago
nicksinger wrote:
I'm tempted to remove the "No apache log" alert again as it doesn't provide us with any valuable information
Why do you think so? We decided that the response time alert should only fire on actual response times and changed it accordingly, and then added the new alert to fire if data is actually missing. What has changed?
Updated by okurz over 1 year ago
As discussed in the daily infra call 2023-06-12, @nicksinger will remove the "apache log data" alert again as it does not provide any additional insight, in particular because the actual data panel shows complete graphs since any gaps are filled retroactively anyway.
Updated by nicksinger over 1 year ago
- Status changed from Workable to Feedback
Updated by okurz over 1 year ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/883 merged but could not be deployed yet due to other problems. Let's await the other pipelines that are currently running and should deploy it, e.g. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1631882 or https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1632007
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
Pipeline and deployment were successful; the stale alert was removed from Grafana based on https://gitlab.suse.de/openqa/salt-states-openqa#removing-stale-provisioned-alerts