action #159654
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #108209: [epic] Reduce load on OSD
high response times on osd - nginx properly monitored in grafana size:S
Description
Motivation
Apache in prefork mode uses a lot of resources to provide mediocre performance. We have nginx on OSD deployed with #159651. Now let's make sure we have it properly monitored, as the web proxy is critical for the overall performance and user experience.
Acceptance criteria
- AC1: Nginx on OSD is properly monitored in grafana
- AC2: No alerts about apache being down
Suggestions
- Follow #159651 for the actual nginx deployment
- Add changes to salt-states-openqa including monitoring: we have multiple panels regarding apache that need to be adapted for nginx as applicable
- Ensure that we have no alerts regarding "oh no, apache is down" ;)
Out of scope
- No need for any additional metrics, just feature-parity with what we have regarding apache, e.g. response sizes, response codes, response times
Updated by okurz 3 months ago
- Copied from action #159651: high response times on osd - nginx with enabled rate limiting features size:S added
Updated by jbaier_cz about 2 months ago
First draft: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1198. I would also like to include an nginx monitoring setup with https://github.com/influxdata/telegraf/tree/master/plugins/inputs/nginx to replace the already removed [[inputs.apache]].
Updated by okurz about 1 month ago
- Related to action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M added
Updated by jbaier_cz about 1 month ago
- Status changed from Workable to In Progress
Updated by jbaier_cz about 1 month ago
- Status changed from In Progress to Workable
Added 2 more commits which drop unrelated apache worker monitoring panels.
Updated by jbaier_cz about 1 month ago
The last update contains two additional panels to monitor some stats provided by the ngx_http_stub_status_module. It might need some additional tweaks but at least we can see the data somewhere.
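For orientation, the counters exposed by ngx_http_stub_status_module come as a small plain-text page. The following is a minimal sketch of how that page can be parsed, not the way telegraf or the merged panels do it. The sample numbers are made up, and the function name and dictionary keys are chosen here for illustration only.

```python
import re

# Sample text in the layout produced by ngx_http_stub_status_module.
# The numbers are made up for illustration.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text):
    """Return the stub_status counters as a dict of integers.

    Hypothetical helper: the key names are illustrative, not taken from the ticket.
    """
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The three cumulative counters sit on a line of their own.
    counters = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unexpected stub_status output")
    accepts, handled, requests = (int(v) for v in counters.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


if __name__ == "__main__":
    print(parse_stub_status(SAMPLE))
```

Accepts, handled and requests are cumulative counters, so a panel would normally plot their rate of change, while the connection states are point-in-time gauges.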
Updated by jbaier_cz about 1 month ago
- Status changed from Workable to Feedback
Now I need to wait for merge/deploy to see if the JSON changes make some sense.
Updated by okurz about 1 month ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1198 merged. https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1 shows some panels with data but the "response codes" and "response times" panels show "No data" as of now. They might need some time to populate, or an actual fix.
Updated by jbaier_cz about 1 month ago
Yes, the nginx_log collection is part of the same MR, so there is likely no data in the database right now. I will monitor the monitoring and create a fix if necessary.
Updated by tinita about 1 month ago
"show field keys" in the influx prompt shows apache, apache_log and nginx, but no nginx_log table.
But I don't know how to fix it.
Updated by jbaier_cz about 1 month ago
It could also mean that I made some mistake in the telegraf configuration and
[[inputs.tail]]
files = ["/var/log/nginx/access.log"]
interval = "30s"
from_beginning = false
name_override = "nginx_log"
## For parsing logstash-style "grok" patterns:
data_format = "grok"
grok_patterns = ["%{CUSTOM_LOG}"]
grok_custom_pattern_files = []
grok_custom_patterns = '''
CUSTOM_LOG %{COMBINED_LOG_FORMAT} rt=%{NUMBER:response_time_s:float} urt=%"{NUMBER:upstream_response_time_s:float}"
'''
is not working as expected.
Updated by jbaier_cz about 1 month ago
Found the typo in the grok pattern, fix https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1204
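The ticket does not say which character was wrong. A plausible reading is the misplaced quote in the quoted pattern, where `urt=%"{NUMBER:...}"` was presumably meant to be `urt="%{NUMBER:...}"`. The sketch below approximates both variants with plain regular expressions to show why the first would never match. It is an illustration under assumptions, not the content of the fix MR: the log tail is hypothetical (the nginx log_format is not shown here), and NUMBER is a simplified stand-in for grok's pattern of the same name.

```python
import re

# Simplified stand-in for grok's NUMBER pattern (assumption).
NUMBER = r"[+-]?\d+(?:\.\d+)?"

# Hypothetical tail of an access-log line; the real log_format is not shown in the ticket.
TAIL = 'rt=0.123 urt="0.120"'

# Tail of the pattern as quoted in the ticket: the quote sits between "%" and "{",
# so the braces are no longer a grok reference and are treated as literal text here.
AS_QUOTED = re.compile(
    r"rt=(?P<response_time_s>" + NUMBER + r")"
    r' urt=%"\{NUMBER:upstream_response_time_s:float\}"'
)

# Presumed intent: the quotes surround the %{NUMBER:...} reference.
PRESUMED = re.compile(
    r"rt=(?P<response_time_s>" + NUMBER + r")"
    r' urt="(?P<upstream_response_time_s>' + NUMBER + r')"'
)


def extract_times(pattern, tail):
    """Return the captured timings as floats, or None if the pattern does not match."""
    match = pattern.search(tail)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    print(extract_times(AS_QUOTED, TAIL))
    print(extract_times(PRESUMED, TAIL))
```

If this reading is right, no line would have matched the original pattern and nothing would have been written to nginx_log, which is consistent with the missing measurement reported above.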
Updated by jbaier_cz about 1 month ago
New panels are there.
Updated by jbaier_cz about 1 month ago
- Due date deleted (2024-06-18)
- Status changed from Feedback to Resolved