action #167728

closed

coordination #161414: [epic] Improved salt based infrastructure management

grafana dashboard for monitor.qe.nue2.suse.org size:S

Added by okurz about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-10-02
Due date:
% Done: 0%
Estimated time:
Description

Motivation

In #167719 we ran out of disk space on monitor.qe.nue2.suse.org because InfluxDB grew to more than 300G. okurz only noticed because Grafana no longer showed up-to-date data. There was no related alert, and also no alert about the shrinking free space on the host before the incident. We have Telegraf running on monitor, but we have no generic machine dashboard for it. We should have one, as we do for other "generic" machines, i.e. machines that are neither openQA workers nor the openQA webUI.

Acceptance criteria

  • AC1: A machine-specific Grafana dashboard with alert definitions exists for all machines (including monitor.qe.nue2.suse.org)
  • AC2: Machines with special roles, like openQA worker and openQA webUI, don't have multiple dashboards showing the same data (e.g. not a special openQA dashboard plus a generic one)
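To illustrate the kind of condition a disk-space alert for AC1 could evaluate, here is a minimal Python sketch. It is not the actual Grafana alert definition or anything from the salt states; the function names and the 85% threshold are hypothetical and only chosen for illustration.

```python
import shutil


def disk_used_percent(total_bytes: int, used_bytes: int) -> float:
    """Return the used share of a filesystem in percent."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def should_alert(total_bytes: int, used_bytes: int,
                 threshold_percent: float = 85.0) -> bool:
    """Return True if usage is at or above the (hypothetical) threshold."""
    return disk_used_percent(total_bytes, used_bytes) >= threshold_percent


if __name__ == "__main__":
    # Check the root filesystem of the local machine as an example.
    usage = shutil.disk_usage("/")
    percent = disk_used_percent(usage.total, usage.used)
    state = "ALERT" if should_alert(usage.total, usage.used) else "ok"
    print(f"/ is {percent:.1f}% full: {state}")
```

In a real setup the same comparison would presumably live in a Grafana alert rule on the disk metrics that Telegraf already collects, so that the host warns before the disk is full.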

Suggestions


Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #167051: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S (Resolved, nicksinger, 2024-09-19)

Copied from openQA Infrastructure - action #167722: Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M (Workable, nicksinger, 2024-10-02, 2024-11-29)
