action #159654
closed
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #108209: [epic] Reduce load on OSD
high response times on osd - nginx properly monitored in grafana size:S
Added by okurz 11 months ago.
Updated 10 months ago.
Category: Feature requests
Description
Motivation
Apache in prefork mode uses a lot of resources to provide mediocre performance. We have nginx deployed on OSD with #159651. Now let's make sure it is properly monitored, as the web proxy is critical for overall performance and user experience.
Acceptance criteria
- AC1: Nginx on OSD is properly monitored in grafana
- AC2: No alerts about apache being down
Suggestions
- Follow #159651 for the actual nginx deployment
- Add changes to salt-states-openqa including monitoring: we have multiple panels regarding apache that need to be adapted for nginx as applicable (see the telegraf sketch at the end of this description)
- Ensure that we have no alerts regarding "oh no, apache is down" ;)
Out of scope
- No need for any additional metrics, just feature-parity with what we have regarding apache, e.g. response sizes, response codes, response times
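For the collection side, a minimal sketch of what the nginx counterpart of the apache status input could look like in telegraf; this assumes ngx_http_stub_status_module is exposed on a local URL such as http://localhost/nginx_status, the actual endpoint and layout in salt-states-openqa may differ:
[[inputs.nginx]]
## Assumed stub_status endpoint, adjust to the real vhost configuration
urls = ["http://localhost/nginx_status"]
response_timeout = "5s"
The per-request metrics mentioned above (response sizes, response codes, response times) would still come from parsing the nginx access log, analogous to the existing apache_log tail input.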
- Copied from action #159651: high response times on osd - nginx with enabled rate limiting features size:S added
- Subject changed from high response times on osd - nginx properly monitored in grafana to high response times on osd - nginx properly monitored in grafana size:S
- Description updated (diff)
- Status changed from New to Workable
- Target version changed from Tools - Next to Ready
- Assignee set to jbaier_cz
- Related to action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M added
- Due date set to 2024-06-18
- Status changed from Workable to In Progress
- Status changed from In Progress to Workable
Added 2 more commits which drop unrelated apache worker monitoring panels.
The last update contains two additional panels to monitor some stats provided by the ngx_http_stub_status_module. It might need some additional tweaks but at least we can see the data somewhere.
- Status changed from Workable to Feedback
Now I need to wait for merge/deploy to see if the JSON changes make some sense.
Yes, the nginx_log collection is part of the same MR, so there is likely no data in the database right now. I will monitor the monitoring and create a fix if necessary.
show field keys in the influx prompt shows apache, apache_log and nginx, but no nginx_log table, and I don't know yet how to fix that. It could also mean that I made some mistake in the telegraf configuration and the following [[inputs.tail]] section is not working as expected:
[[inputs.tail]]
files = ["/var/log/nginx/access.log"]
interval = "30s"
from_beginning = false
name_override = "nginx_log"
## For parsing logstash-style "grok" patterns:
data_format = "grok"
grok_patterns = ["%{CUSTOM_LOG}"]
grok_custom_pattern_files = []
grok_custom_patterns = '''
CUSTOM_LOG %{COMBINED_LOG_FORMAT} rt=%{NUMBER:response_time_s:float} urt="%{NUMBER:upstream_response_time_s:float}"
'''
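One possible explanation, purely a guess: nginx writes "-" as $upstream_response_time for requests that never reach the upstream, and telegraf drops log lines that do not match the grok pattern, so if no line matches, the nginx_log measurement never gets created. A hypothetical, more tolerant variant of the custom pattern (not verified against the OSD log format):
## Replaces the grok_custom_patterns above; the alternation accepts "-" where no
## upstream response time was logged instead of failing the whole match
grok_custom_patterns = '''
CUSTOM_LOG %{COMBINED_LOG_FORMAT} rt=%{NUMBER:response_time_s:float} urt="(?:%{NUMBER:upstream_response_time_s:float}|-)"
'''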
- Due date deleted (2024-06-18)
- Status changed from Feedback to Resolved