action #159654
coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #108209: [epic] Reduce load on OSD
high response times on osd - nginx properly monitored in grafana size:S
Apache in prefork mode uses a lot of resources to provide mediocre performance. We have nginx on OSD deployed with #159651. Now let's make sure it is properly monitored, as the web proxy is critical for overall performance and user experience.
Acceptance criteria
- AC1: Nginx on OSD is properly monitored in grafana
- AC2: No alerts about apache being down
- Follow #159651 for the actual nginx deployment
- Add changes to salt-states-openqa including monitoring: we have multiple panels regarding apache that need to be adapted for nginx as applicable
- Ensure that we have no alerts regarding "oh no, apache is down" ;)
Out of scope
- No need for any additional metrics, just feature-parity with what we have regarding apache, e.g. response sizes, response codes, response times
Updated by okurz 6 months ago
- Copied from action #159651: high response times on osd - nginx with enabled rate limiting features size:S added
Updated by jbaier_cz 5 months ago
First draft; I would also like to include nginx monitoring setup with to replace the already removed [[inputs.apache]]
Updated by okurz 4 months ago
- Related to action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M added
Updated by okurz 4 months ago
merged. shows some panels with data, but the "response codes" and "response times" panels show "No data" as of now. They might need some time to populate, or they might need actual fixing.
Updated by jbaier_cz 4 months ago
It could also mean that I made some mistake in the telegraf configuration and
files = ["/var/log/nginx/access.log"]
interval = "30s"
from_beginning = false
name_override = "nginx_log"
## For parsing logstash-style "grok" patterns:
data_format = "grok"
grok_patterns = ["%{CUSTOM_LOG}"]
grok_custom_pattern_files = []
grok_custom_patterns = '''
CUSTOM_LOG %{COMBINED_LOG_FORMAT} rt=%{NUMBER:response_time_s:float} urt=%"{NUMBER:upstream_response_time_s:float}"
is not working as expected
Updated by jbaier_cz 4 months ago
Found the typo in the grok pattern, fix
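The fix itself is not quoted in this thread. As a sketch only, assuming the typo is the misplaced double quote after `%` in the `urt` capture, and assuming the excerpt above belongs to a telegraf `[[inputs.tail]]` section (the section header is not shown in the comment), the corrected configuration would presumably look like this:

```toml
# Assumption: the section header is [[inputs.tail]]; it is not part of the quoted excerpt.
[[inputs.tail]]
  files = ["/var/log/nginx/access.log"]
  interval = "30s"
  from_beginning = false
  name_override = "nginx_log"
  ## For parsing logstash-style "grok" patterns:
  data_format = "grok"
  grok_patterns = ["%{CUSTOM_LOG}"]
  grok_custom_pattern_files = []
  # Assumption: the opening quote belongs before "%{", so that the literal
  # quotes wrap the whole capture: urt="<number>".
  grok_custom_patterns = '''
CUSTOM_LOG %{COMBINED_LOG_FORMAT} rt=%{NUMBER:response_time_s:float} urt="%{NUMBER:upstream_response_time_s:float}"
'''
```

With the original `urt=%"{NUMBER:...}"`, the `%` is no longer followed directly by `{`, so the grok parser would most likely treat it as literal text instead of a capture. No log line would then match the pattern, which would be consistent with the "No data" panels reported above. This version only matches if the nginx `log_format` really writes the upstream time in double quotes, which is not shown in this ticket.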