action #168718

open

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

The "response codes" panel takes a considerable time to load or even runs into timeouts size:M

Added by okurz 6 months ago. Updated 9 days ago.

Status:
Workable
Priority:
Low
Category:
Regressions/Crashes
Target version:
Start date:
2024-10-17
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

The "response codes" panel takes a considerable time to load or even runs into timeouts:
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=80&orgId=1

Acceptance criteria

Suggestions

  • Look into what makes the query slow: the measurements in InfluxDB could be too big and need tailoring, or Telegraf could already be pushing too much data
    • Adjust the interval used to push new data
  • The query also sometimes times out completely, resulting in no data
    • We checked if this could be something like multiple requests on different machines but couldn't confirm that
    • Grafana might still be running operations that already timed out?
    • Even a small range like 24h is likely to hit the issue
  • Slowness also affects other panels
  • Monitor resource usage on the VM and hypervisor host
    • Make sure we have enough resources for Grafana/InfluxDB
    • Trim data we have in InfluxDB somehow to make the amount of data more manageable
  • Look into limiting the retention of data e.g. up to 1 year only
Actions #1

Updated by livdywan 6 months ago

  • Subject changed from The "response codes" panel takes a considerable time to load or even runs into timeouts to The "response codes" panel takes a considerable time to load or even runs into timeouts size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by okurz 4 months ago

  • Target version changed from Tools - Next to Ready
Actions #3

Updated by okurz 4 months ago

  • Priority changed from Normal to Low
Actions #4

Updated by jbaier_cz 3 months ago

Is that still an issue? Can't reproduce it right now, the link in the AC1 feels good to me.

Actions #5

Updated by tinita 3 months ago

I think it's quite slow for 24h and it starts to get really annoying at 7d

Actions #6

Updated by okurz 3 months ago

  • Target version changed from Ready to Tools - Next
Actions #7

Updated by okurz about 1 month ago

  • Target version changed from Tools - Next to Ready
Actions #8

Updated by robert.richardson 28 days ago

  • Status changed from Workable to In Progress
  • Assignee set to robert.richardson
Actions #9

Updated by robert.richardson 27 days ago

  • Status changed from In Progress to Workable

I noticed that the Nginx Response Time panel takes even longer to load than the Response panel (about 3x).
I also copied the different queries from the Grafana web UI, replacing $timeFilter with time > now() - 24h and $__interval with different values. As $__interval is only part of the `GROUP BY` clause, its impact on timing isn't really big:

ssh openqa-monitor.qa.suse.de
influx
use telegraf
> EXPLAIN ANALYZE SELECT...
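
The macro substitution described above could be scripted roughly as follows. This is only a sketch: the query template is a hypothetical stand-in, since the real panel queries are not quoted in this ticket, and `expand_grafana_macros` is an illustrative helper name, not part of Grafana or InfluxDB.

```python
def expand_grafana_macros(template: str, time_range: str, interval: str) -> str:
    """Replace Grafana's macros with literal InfluxQL so the query runs in the influx CLI."""
    return (
        template
        .replace("$timeFilter", f"time > now() - {time_range}")
        .replace("$__interval", interval)
    )


# Hypothetical stand-in for a panel query; the actual query text is elided in this ticket.
TEMPLATE = (
    'SELECT count("field") FROM "measurement" '
    "WHERE $timeFilter GROUP BY time($__interval)"
)

# Print one EXPLAIN ANALYZE statement per interval, ready to paste into the influx shell.
for interval in ("1s", "30s", "5m"):
    print("EXPLAIN ANALYZE " + expand_grafana_macros(TEMPLATE, "24h", interval))
```
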
Response Codes
Interval        Duration
1s              ~8.0s
30s (default)   ~6.3s
5m              ~5.8s

Response Size
Interval        Duration
1s              ~4.8s
12s (default)   ~4.3s
5m              ~3.8s

Nginx Response Time (mean_nginx_response)
Interval        Duration
1s              ~16.9s
12s (default)   ~15.6s
5m              ~15s
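
To put a number on "not really big", here is a small sketch that computes how much query time is saved by going from the finest (1s) to the coarsest (5m) interval. The durations are copied from the measurements above; `relative_saving` is just an illustrative helper. In every case the saving stays well under a third, and each query still takes several seconds even at 5m.

```python
# Durations in seconds, copied from the measurements above.
timings = {
    "Response Codes": {"1s": 8.0, "30s": 6.3, "5m": 5.8},
    "Response Size": {"1s": 4.8, "12s": 4.3, "5m": 3.8},
    "Nginx Response Time": {"1s": 16.9, "12s": 15.6, "5m": 15.0},
}


def relative_saving(durations, finest="1s", coarsest="5m"):
    """Fraction of query time saved by switching from the finest to the coarsest interval."""
    return (durations[finest] - durations[coarsest]) / durations[finest]


for panel, durations in timings.items():
    print(f"{panel}: {relative_saving(durations):.1%} less time at 5m than at 1s")
```
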
Actions #10

Updated by okurz 9 days ago

  • Target version changed from Ready to future