action #159654

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - nginx properly monitored in grafana size:S

Added by okurz 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee: jbaier_cz
Category: Feature requests
Target version: Ready
Start date: 2024-04-26
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

Apache in prefork mode uses a lot of resources to provide mediocre performance. We have nginx deployed on OSD with #159651. Now let's make sure it is properly monitored, as the web proxy is critical for overall performance and user experience.

Acceptance criteria

  • AC1: Nginx on OSD is properly monitored in grafana
  • AC2: No alerts about apache being down

Suggestions

  • Follow #159651 for the actual nginx deployment
  • Add changes to salt-states-openqa including monitoring: we have multiple panels regarding apache that need to be adapted for nginx as applicable (see the query sketch after this list)
  • Ensure that we have no alerts regarding "oh no, apache is down" ;)
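
As an illustration of what "adapted for nginx" could look like, here is a hypothetical Grafana panel query in InfluxQL. The nginx_log measurement and the response_time_s field are assumptions taken from the telegraf configuration discussed later in this ticket, not the final panel definition:

SELECT mean("response_time_s") FROM "nginx_log"
WHERE $timeFilter
GROUP BY time($__interval) fill(null)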

Out of scope

  • No need for any additional metrics, just feature-parity with what we have regarding apache, e.g. response sizes, response codes, response times

Files

20240611_14h33m35s_grim.png (83.4 KB), added by jbaier_cz, 2024-06-11 12:33

Related issues (2): 0 open, 2 closed

Related to openQA Project - action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M (Resolved, mkittler, 2024-05-24)

Copied from openQA Project - action #159651: high response times on osd - nginx with enabled rate limiting features size:S (Rejected, okurz, 2024-04-26 to 2024-06-14)

Actions #1

Updated by okurz 3 months ago

  • Copied from action #159651: high response times on osd - nginx with enabled rate limiting features size:S added
Actions #2

Updated by jbaier_cz 3 months ago

  • Subject changed from high response times on osd - nginx properly monitored in grafana to high response times on osd - nginx properly monitored in grafana size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by jbaier_cz about 2 months ago

  • Assignee set to jbaier_cz
Actions #5

Updated by jbaier_cz about 2 months ago

First draft: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1198; I would also like to include an nginx monitoring setup with https://github.com/influxdata/telegraf/tree/master/plugins/inputs/nginx to replace the already removed [[inputs.apache]].
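
For reference, a minimal sketch of what such an [[inputs.nginx]] section could look like; the stub_status URL is an assumption and would need to match whatever status location the salt states actually expose:

[[inputs.nginx]]
  ## Assumed stub_status endpoint on OSD; adjust to the real location
  urls = ["http://localhost/nginx_status"]
  response_timeout = "5s"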

Actions #6

Updated by okurz about 1 month ago

  • Related to action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M added
Actions #7

Updated by okurz about 1 month ago

  • Due date set to 2024-06-18
Actions #8

Updated by jbaier_cz about 1 month ago

  • Status changed from Workable to In Progress
Actions #9

Updated by jbaier_cz about 1 month ago

  • Status changed from In Progress to Workable

Added 2 more commits which drop the unrelated apache worker monitoring panels.

Actions #10

Updated by jbaier_cz about 1 month ago

The last update contains two additional panels to monitor some stats provided by the ngx_http_stub_status_module. It might need some additional tweaks, but at least we can see the data somewhere.
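
For context, these stats come from ngx_http_stub_status_module, which is usually exposed via a dedicated location block; a minimal sketch, where the location name and access rules are assumptions rather than what the OSD salt states actually configure:

# Assumed local-only status endpoint for telegraf to scrape
location /nginx_status {
    stub_status;      # active connections, accepts, handled, requests, reading/writing/waiting
    access_log off;
    allow 127.0.0.1;
    deny all;
}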

Actions #11

Updated by jbaier_cz about 1 month ago

  • Status changed from Workable to Feedback

Now I need to wait for merge/deploy to see if the JSON changes make some sense.

Actions #12

Updated by okurz about 1 month ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1198 merged. https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1 shows some panels with data, but the "response codes" and "response times" panels show "No data" as of now. Might need some time to populate, or actual fixing.

Actions #13

Updated by jbaier_cz about 1 month ago

Yes, the nginx_log collection is part of the same MR, so there is likely no data in the database right now. I will monitor the monitoring and create a fix if necessary.

Actions #14

Updated by tinita about 1 month ago

show field keys in the influx prompt shows apache, apache_log and nginx, but no nginx_log measurement.
But I don't know how to fix it.
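
A few statements that could help narrow this down from the influx prompt; the database name telegraf and the one-hour window are assumptions:

USE telegraf
SHOW MEASUREMENTS
SHOW FIELD KEYS FROM "nginx_log"
SELECT COUNT(*) FROM "nginx_log" WHERE time > now() - 1h

If nginx_log is missing from SHOW MEASUREMENTS or the COUNT query returns nothing recent, the tail input is either not running or its grok pattern never matches, which points back at the telegraf configuration below.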

Actions #15

Updated by jbaier_cz about 1 month ago

It could also mean that I made some mistake in the telegraf configuration and

[[inputs.tail]]
  files = ["/var/log/nginx/access.log"]
  interval = "30s"
  from_beginning = false
  name_override = "nginx_log"
  ## For parsing logstash-style "grok" patterns:
  data_format = "grok"
  grok_patterns = ["%{CUSTOM_LOG}"]
  grok_custom_pattern_files = []
  grok_custom_patterns = '''
      CUSTOM_LOG %{COMBINED_LOG_FORMAT} rt=%{NUMBER:response_time_s:float} urt=%"{NUMBER:upstream_response_time_s:float}"
  '''

is not working as expected.
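
One detail in the snippet above stands out: %"{NUMBER:upstream_response_time_s:float}" is not valid grok capture syntax (the quote sits between % and {), so the whole pattern would never match and telegraf would write no nginx_log points at all, which would explain the missing measurement. A possible correction, assuming the access log format writes the upstream time as urt="<number>":

  grok_custom_patterns = '''
      CUSTOM_LOG %{COMBINED_LOG_FORMAT} rt=%{NUMBER:response_time_s:float} urt="%{NUMBER:upstream_response_time_s:float}"
  '''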

Actions #17

Updated by jbaier_cz about 1 month ago

New panels are there.

Actions #18

Updated by jbaier_cz about 1 month ago

  • Due date deleted (2024-06-18)
  • Status changed from Feedback to Resolved