action #109635


Check grafana monitoring host performance size:M

Added by tinita over 2 years ago. Updated over 1 year ago.

Status:
Workable
Priority:
Low
Assignee:
-
Category:
-
Target version:
QA (public, currently private due to #173521) - future
Start date:
2022-04-07
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

Some graphs have a lot of data points and can be very slow to load. In the worst case Grafana says "No data".
Example: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=80&orgId=1&from=now-30d&to=now
30 days seems to be ok for this graph (if there are no other expensive queries running).

There are other graphs using conditions on fields which take even more time:
https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard?viewPanel=10&orgId=1&from=now-7d&to=now

Looking at htop, at least influxdb seems to be able to make use of all the CPUs for showing one graph, so maybe we could ask for more CPUs to improve the situation a bit.

It could also be that influxdb holds too much data to be able to act efficiently. In #94492 we already worked on this topic and resolved it with the database at 101GB, which is rather big but at least better than before. Now we are at 117GB again. I suggest reducing the size of the database. For that I suggest researching online and, if nothing is found, actively seeking help from the influxdb community.

The problem is happening for certain graphs only, which have a lot of data points.
Why should the total size of the DB be responsible for graphs showing no data if you select 90 days?

Maybe because with many data points the problem of a fragmented database becomes more severe. Or maybe the heavy graphs themselves are the problematic ones.
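To make the "a lot of data points" concern concrete, here is a rough back-of-envelope sketch of how many raw points a single panel query touches without downsampling. The 10s collection interval and the series count are hypothetical placeholders, not measured values from our telegraf setup:

```python
# Rough estimate of raw data points one Grafana panel query touches
# when no downsampling is applied. Interval and series count are
# assumptions for illustration, not measured values.

def raw_points(days: float, interval_s: float, series: int) -> int:
    """Raw points = (seconds in range / collection interval) * series."""
    return int(days * 86400 / interval_s) * series

# 30 days vs 90 days for a panel with 8 series at one point per 10s:
print(raw_points(30, 10, 8))   # 2073600
print(raw_points(90, 10, 8))   # 6220800
```

Tripling the time range triples the points scanned, which is consistent with 30 days being "ok" while 90 days times out with "No data".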

Acceptance criteria

  • AC1: It is known what size of VM is required for our monitoring needs
  • AC2: All panels load in a reasonable time frame

Suggestions

  • Compare the current influxdb size with community recommendations (https://docs.influxdata.com/influxdb/v1.8/guides/hardware_sizing/)
  • Look into older tickets where we already investigated table sizes and downsampling
  • Look up table sizes to find the biggest contributors
  • If individual measurements contribute significantly more than others, handle these specific measurements, e.g. if "apache response times" account for 80% of the size, handle that measurement either with downsampling or by deleting it completely
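The downsampling suggestion above could be sketched as an InfluxQL retention policy plus a continuous query. The database name "telegraf", the measurement "apache_response_times", and the retention/aggregation windows below are hypothetical placeholders; they would need to be adapted to the actual schema before running anything:

```python
# Sketch of InfluxQL statements for downsampling one heavy measurement.
# All names and durations are assumptions for illustration only.

def downsampling_statements(db: str, measurement: str,
                            keep_raw: str = "30d",
                            window: str = "1h") -> list[str]:
    """Build a raw retention policy plus a continuous query that
    keeps raw data for `keep_raw` and stores `window` means forever."""
    return [
        # Keep raw data only for a limited time:
        f'CREATE RETENTION POLICY "raw" ON "{db}" '
        f'DURATION {keep_raw} REPLICATION 1 DEFAULT',
        # Keep downsampled data indefinitely:
        f'CREATE RETENTION POLICY "downsampled" ON "{db}" '
        f'DURATION INF REPLICATION 1',
        # Aggregate raw points into hourly means:
        f'CREATE CONTINUOUS QUERY "cq_{measurement}" ON "{db}" BEGIN '
        f'SELECT mean(*) INTO "{db}"."downsampled".:MEASUREMENT '
        f'FROM "{measurement}" GROUP BY time({window}), * END',
    ]

for stmt in downsampling_statements("telegraf", "apache_response_times"):
    print(stmt)
```

Long-range dashboard panels would then query the "downsampled" retention policy, which should reduce both the DB size and the number of points per query.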

Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #107881: [retro] Conduct a zombie scrum team survey (Resolved, livdywan, 2022-03-04 to 2022-04-15)

Actions #1

Updated by tinita over 2 years ago

  • Related to action #107881: [retro] Conduct a zombie scrum team survey added
Actions #2

Updated by okurz over 2 years ago

  • Priority changed from Normal to Low
  • Target version set to Ready

I am a bit worried about adding that ticket to our backlog. Yesterday we identified "too many infrastructure issues" as a problem for us, and then we go and add more ourselves :)

Actions #3

Updated by tinita over 2 years ago

Well, we need grafana to help us identify infrastructure issues.
So this will hopefully make grafana a bit more useful.

Actions #4

Updated by okurz over 2 years ago

tinita wrote:

Looking at htop, at least influxdb seems to be able to make use of all the CPUs for showing one graph, so maybe we could ask for more CPUs to improve the situation a bit.

It could also be that influxdb holds too much data to be able to act efficiently. In #94492 we already worked on this topic and resolved it with the database at 101GB, which is rather big but at least better than before. Now we are at 117GB again. I suggest reducing the size of the database. For that I suggest researching online and, if nothing is found, actively seeking help from the influxdb community.

Actions #5

Updated by tinita over 2 years ago

The problem is happening for certain graphs only, which have a lot of data points.
Why should the total size of the DB be responsible for graphs showing no data if you select 90 days?

Actions #6

Updated by okurz over 2 years ago

  • Parent task set to #109743
Actions #7

Updated by okurz over 2 years ago

tinita wrote:

The problem is happening for certain graphs only, which have a lot of data points.
Why should the total size of the DB be responsible for graphs showing no data if you select 90 days?

Maybe because with many data points the problem of a fragmented database becomes more severe. Or maybe the heavy graphs themselves are the problematic ones.

Actions #8

Updated by okurz over 2 years ago

  • Subject changed from Check grafana monitoring host performance to Check grafana monitoring host performance size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #10

Updated by okurz over 2 years ago

  • Target version changed from Ready to future
Actions #11

Updated by okurz almost 2 years ago

  • Tags set to infra
Actions #12

Updated by okurz over 1 year ago

  • Parent task changed from #109743 to #121732