action #91779: Add monitoring for storage.qa.suse.de - openQA Infrastructure - openSUSE Project Management Tool

action #91779

closed

openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

openQA Project - coordination #80546: [epic] Scale up: Enable to store more results

Add monitoring for storage.qa.suse.de

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Low
Assignee: mkittler
Target version: openQA Project - Ready
Start date: 2021-04-26

Description

Acceptance criteria

AC1: alerts exist for free space on storage.qa.suse.de

Suggestions

Extend https://gitlab.suse.de/openqa/salt-pillars-openqa to cover storage.qa.suse.de the same way we cover the other hosts, e.g. compare to the monitoring host as well as the worker machines, of course without making storage.qa.suse.de a full "worker" host :)
Ensure that alerts exist, especially for free space, as storage.qa.suse.de is a storage host (duh)

History

Updated by okurz about 3 years ago

Parent task set to #80546

Updated by okurz about 3 years ago

Target version changed from future to Ready

Updated by okurz about 3 years ago

One option is to apply https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana/worker.json.template to the host "storage.qa.suse.de" as well. A storage host has neither "minion jobs" nor an (openQA worker) web service, but since we would not alert on "no data" we could simply ignore those panels :)

I have now manually added roles: storage to /etc/salt/grains on storage so that we can distinguish the host and target that role in top.sls. Alternatively, we could apply basic monitoring to every host and add the special openQA monitoring on top for workers only.

I tried to extend the mine.get statement in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana.sls#L3 to target something like roles:worker or roles:storage, but I did not get it to work in my experiments on the command line, e.g. sudo salt -l error --no-color -C 'openqa.suse.de' mine.get 'G@roles:worker and G@roles:storage' 'nodename' 'grain'.

As another alternative, we could copy the worker template for "storage" and delete all panels that are not relevant.
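A possible reason the experiment returned nothing: the expression quoted in the command joins the two role grains with "and", so it only selects hosts that carry both roles, while the stated goal was "worker or storage". In addition, a compound expression passed to mine.get probably needs the target type 'compound' rather than 'grain'; that is an assumption and was not verified against this setup. The following is a toy model in Python, not Salt's real matcher, and the host name "worker-example" is hypothetical. It only illustrates the difference between the two operators:

```python
# Toy model of role-based targeting. This is NOT Salt's matcher; it only
# illustrates why "and" selects nothing when each host has a single role.

def matches_roles(grains, wanted, mode):
    """Return True if the host's 'roles' grain satisfies the wanted roles."""
    roles = grains.get("roles", [])
    if isinstance(roles, str):  # a grain may be a single string or a list
        roles = [roles]
    hits = [role in roles for role in wanted]
    return all(hits) if mode == "and" else any(hits)


def select(hosts, wanted, mode):
    """Return the sorted names of all hosts matching the wanted roles."""
    return sorted(
        name for name, grains in hosts.items()
        if matches_roles(grains, wanted, mode)
    )


# Hypothetical inventory: one worker plus the storage host from this ticket.
HOSTS = {
    "worker-example": {"roles": ["worker"]},
    "storage.qa.suse.de": {"roles": "storage"},
}

if __name__ == "__main__":
    print(select(HOSTS, ["worker", "storage"], "and"))
    print(select(HOSTS, ["worker", "storage"], "or"))
```

With this inventory the "and" variant selects no host at all, whereas the "or" variant selects both.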

Updated by mkittler about 3 years ago

Status changed from Workable to In Progress
Assignee set to mkittler

Updated by mkittler about 3 years ago

SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507

With https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/da6e34e0121f1f4f4042ef3e4687873311e6e228 the systemd services monitoring/alert should now cover the storage host as well.

Updated by mkittler about 3 years ago

Status changed from In Progress to Feedback

Updated by okurz about 3 years ago

Merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507. Please monitor the deployment and ensure that the dashboard is shown correctly. Maybe you can add the "additional generic nodes" to https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 as well.

Updated by mkittler about 3 years ago

The dashboards are now shown. I've created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/510 to tweak the memory alert. The "ping" alert doesn't seem to work as there's no data. The dashboard isn't using the correct hostname in the query, but even with the query fixed there doesn't seem to be any ping data. I'm not sure why that is the case because the telegraf ping configuration actually contains these hosts and they're pingable.

Updated by mkittler about 3 years ago

With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/513#note_327858 the generic dashboard works now. Unfortunately this doesn't contain a file system alert yet because I apparently removed that part of the dashboard when removing worker-specific parts.

#10

Updated by mkittler about 3 years ago

Status changed from Feedback to Resolved

The partition usage and the corresponding alert are actually already there, just hidden within a folded section. The figures match what I see via df -h.
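The cross-check against df -h can also be scripted. The following is a minimal sketch, assuming a hypothetical alert threshold of 90 % (the ticket does not state the real threshold) and computing the usage as used / (used + available), which should roughly correspond to df's Use% column apart from rounding:

```python
import shutil


def used_percent(used_bytes, available_bytes):
    """Return used space as a percentage of used plus available space."""
    total = used_bytes + available_bytes
    if total <= 0:
        raise ValueError("used_bytes + available_bytes must be positive")
    return 100.0 * used_bytes / total


def should_alert(path, threshold_percent=90.0):
    """Return True if the file system containing 'path' exceeds the threshold.

    The default threshold is an assumption made for this sketch only.
    """
    usage = shutil.disk_usage(path)
    return used_percent(usage.used, usage.free) >= threshold_percent


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print(f"/ is {used_percent(usage.used, usage.free):.1f}% used")
    print("alert" if should_alert("/") else "ok")
```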

#11

Updated by okurz about 3 years ago

Status changed from Resolved to Feedback

The disk block is "collapsed" by default; that should be changed. I tried to save that change myself, but when I went to commit it to git, all content after that single true/false switch also showed up as "changed", so I was not sure about it. Please try to fix that yourself.

#12

Updated by mkittler about 3 years ago

SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/514/diffs

#13

Updated by okurz about 3 years ago

Merged, and it seems to be completely broken: the dashboard now shows a line with "Disk (0 panels)".

#14

Updated by mkittler about 3 years ago

I would try to revert it for now because at this point it is hard to change anything within Grafana's UI: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/515

(Note that I actually took the "collapsed": false, change from Grafana's JSON so I'm really wondering why it isn't working. Maybe I forgot to add some relevant sections of the diff to the commit.)
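For context, a sketch of why flipping only the "collapsed" flag may not be enough. It assumes that Grafana's dashboard JSON stores the panels of a collapsed row nested inside the row's own "panels" list, and expects the panels of an expanded row to follow the row in the top-level "panels" list. Under that assumption, expanding a row in a template means moving the nested panels up as well. The function name expand_row is hypothetical, and the "gridPos" of moved panels may still need adjusting:

```python
import copy


def expand_row(dashboard, row_title):
    """Return a copy of 'dashboard' with the named row expanded.

    Assumption: a collapsed row keeps its child panels in its own "panels"
    list, while an expanded row has an empty "panels" list and its children
    follow it directly in the dashboard's top-level "panels" list.
    """
    result = copy.deepcopy(dashboard)
    top_level = []
    for panel in result.get("panels", []):
        top_level.append(panel)
        if panel.get("type") == "row" and panel.get("title") == row_title:
            children = panel.get("panels", [])
            panel["collapsed"] = False
            panel["panels"] = []
            # Move the former children right behind their row.
            top_level.extend(children)
    result["panels"] = top_level
    return result


if __name__ == "__main__":
    # Hypothetical minimal dashboard with one collapsed "Disk" row.
    example = {
        "panels": [
            {
                "type": "row",
                "title": "Disk",
                "collapsed": True,
                "panels": [{"type": "graph", "title": "Partition usage"}],
            }
        ]
    }
    expanded = expand_row(example, "Disk")
    print([panel["title"] for panel in expanded["panels"]])
```

This touches every panel entry after the row, which would be consistent with a large diff showing up after toggling a single switch in the UI.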

#15

Updated by okurz about 3 years ago

How did you create the template in the first place? I guess you saved an existing dashboard and have replaced some values with variables. So we can simply save the dashboard again and replace these variables – if you can state how you did it :)

#16

Updated by mkittler about 3 years ago

I did it as you guessed but forgot to add some parts of the diff. This SR should have everything needed: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/518

#17

Updated by mkittler about 3 years ago

Status changed from Feedback to Resolved

The latest SR has been deployed and now it works.
