action #167728

closed

coordination #161414: [epic] Improved salt based infrastructure management

grafana dashboard for monitor.qe.nue2.suse.org size:S

Added by okurz 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Start date: 2024-10-02
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #167719 we ran out of disk space because influxdb grew to 300G+ on monitor.qe.nue2.suse.org, and okurz only noticed because grafana stopped showing up-to-date data. There was no related alert, and also no alert about the shrinking free space on the host before the incident. We have telegraf running on monitor, but we have no generic machine dashboard for it, which we should have as for other "generic" machines, i.e. not-worker and not-webui.

Acceptance criteria

  • AC1: A machine specific grafana dashboard with alert definitions exists for all machines (including monitor.qe.nue2.suse.org)
  • AC2: Special machine roles like openQA worker and openQA webUI don't have multiple dashboards showing the same data (e.g. not both a special openQA one and a generic one)

Suggestions


Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure (public) - action #167051: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S (Resolved, nicksinger, 2024-09-19)

Copied from openQA Infrastructure (public) - action #167722: Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M (Workable, nicksinger, 2024-10-02)

Actions #1

Updated by okurz 3 months ago

  • Copied from action #167722: Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M added
Actions #2

Updated by nicksinger 3 months ago

  • Related to action #167051: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S added
Actions #3

Updated by okurz 2 months ago

  • Subject changed from grafana dashboard for monitor.qe.nue2.suse.org to grafana dashboard for monitor.qe.nue2.suse.org size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #5

Updated by gpathak 2 months ago

  • Assignee set to gpathak
Actions #6

Updated by gpathak 2 months ago

I need some input:

  • How to set up a local instance for testing changes?
  • How to perform changes on monitor.qa.suse.de without downtime?
    • SSH login or any other method
Actions #7

Updated by okurz 2 months ago

gpathak wrote in #note-6:

I need some input:

  • How to set up a local instance for testing changes?

You can try to run make test, but in most cases it is fine to trust the automatic CI tests that run when you create a merge request. An alternative is to spawn containers or VMs manually and try out changes there, but that is likely not very efficient for this issue: it is too much effort with no guarantee of seeing all side effects. Another alternative is to manually change the salt repository on the salt master openqa.suse.de, apply a high state only for monitor.qe.nue2.suse.org, and check the effect.
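
The last alternative could look roughly like the following when run on the salt master. This is only a sketch: the helper name highstate_cmd and the TARGET variable are made up for illustration, and the snippet only prints the command for review rather than running it. Passing test=True asks salt for a dry run that reports what would change without applying anything.

```shell
#!/bin/sh
# Sketch: build the salt command for a high state limited to one minion.
# Assumption: this is run on the salt master and the minion id equals the FQDN.
TARGET="${TARGET:-monitor.qe.nue2.suse.org}"

highstate_cmd() {
    # $1: minion id
    # $2: "True" for a dry run (default), "False" to really apply the states
    printf "salt '%s' state.apply test=%s\n" "$1" "${2:-True}"
}

# Print the command instead of executing it, so it can be reviewed first.
highstate_cmd "$TARGET"
```

Once the dry-run output looks right, the same command with test=False (or without the test argument) would apply the states to that one host only.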

  • How to perform changes on monitor.qa.suse.de without downtime?
    • SSH login or any other method

That is handled by the automatic CI pipelines: after you create a merge request in https://gitlab.suse.de/openqa/salt-states-openqa, the tests run, and once the MR is merged the changes are deployed automatically without downtime.

Actions #8

Updated by okurz 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #9

Updated by gpathak 2 months ago

I haven't tried any changes yet. I only ran make test and got the error below:

command -v gitlab-ci-linter >/dev/null || (sudo wget -q https://gitlab.com/orobardet/gitlab-ci-linter/uploads/c4b64fb3b94473483dd2d02f0f32e1f6/gitlab-ci-linter.linux-amd64 -O /usr/local/bin/gitlab-ci-linter && \
    sudo chmod +x /usr/local/bin/gitlab-ci-linter)
yamllint .gitlab-ci.yml
gitlab-ci-linter --gitlab-url https://gitlab.suse.de
Validating .gitlab-ci.yml... Error linting using Gitlab API https://gitlab.suse.de: API respond  404 Not Found
make: *** [Makefile:12: test] Error 5
Actions #10

Updated by gpathak 2 months ago

  • Status changed from Workable to In Progress
Actions #12

Updated by gpathak 2 months ago

  • Status changed from In Progress to Resolved