action #91779
closed
openQA Project (public) - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
openQA Project (public) - coordination #80546: [epic] Scale up: Enable to store more results
Add monitoring for storage.qa.suse.de
Added by okurz over 3 years ago.
Updated over 3 years ago.
Description
Acceptance criteria
- AC1: alerts exist for free space on storage.qa.suse.de
Suggestions
- Extend https://gitlab.suse.de/openqa/salt-pillars-openqa to cover storage.qa.suse.de the same way we cover the other hosts, e.g. compare to the monitoring host as well as the worker machines, of course without making storage.qa.suse.de a full "worker" host :)
- Ensure that alerts exist, especially for free space on storage.qa.suse.de as storage.qa.suse.de is a storage host (duh)
- Parent task set to #80546
- Target version changed from future to Ready
One option is to apply https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana/worker.json.template to the host "storage.qa.suse.de" as well. A storage host of course has no "minion jobs" or (openQA worker) web service, but as we would not alert on "no data" we could simply ignore these :)
I have now manually added roles: storage in /etc/salt/grains on storage so that we can distinguish the host and match on that role within top.sls. Alternatively, we apply basic monitoring to every host and add the special openQA monitoring on top for workers only.
I tried to extend the mine.get statement in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana.sls#L3 to target something like roles:worker or roles:storage, but none of my experiments on the command line succeeded, e.g.:
sudo salt -l error --no-color -C 'openqa.suse.de' mine.get 'G@roles:worker and G@roles:storage' 'nodename' 'grain'
As another alternative, we copy the worker template for "storage" and delete all panels that are not relevant.
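The last alternative (copying the worker template and deleting the panels that do not apply) could be scripted instead of done by hand. The following is only a minimal sketch under assumptions: the helper name strip_panels, the miniature dashboard and the panel titles are hypothetical and not taken from the real worker.json.template, which is a template with placeholders and would likely need to be rendered before it parses as JSON.

```python
import copy
import json


def strip_panels(dashboard, unwanted_titles):
    """Return a copy of the dashboard without panels whose title is unwanted."""
    result = copy.deepcopy(dashboard)
    kept = []
    for panel in result.get("panels", []):
        if panel.get("title") in unwanted_titles:
            continue
        # Collapsed rows carry their child panels in a nested "panels" list,
        # so filter those as well.
        if panel.get("type") == "row" and panel.get("panels"):
            panel["panels"] = [
                child for child in panel["panels"]
                if child.get("title") not in unwanted_titles
            ]
        kept.append(panel)
    result["panels"] = kept
    return result


# Hypothetical miniature dashboard, only for illustration.
worker_dashboard = {
    "title": "Dashboard for example-host",
    "panels": [
        {"id": 1, "type": "graph", "title": "Memory"},
        {"id": 2, "type": "graph", "title": "Minion jobs"},
        {
            "id": 3,
            "type": "row",
            "title": "Disk",
            "collapsed": True,
            "panels": [{"id": 4, "type": "graph", "title": "Partition usage"}],
        },
    ],
}

storage_dashboard = strip_panels(worker_dashboard, {"Minion jobs"})
print(json.dumps([panel["title"] for panel in storage_dashboard["panels"]]))
```

The input dashboard is left untouched, so the worker and storage variants can be generated from the same source.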
- Status changed from Workable to In Progress
- Assignee set to mkittler
- Status changed from In Progress to Feedback
The dashboards are now shown. I've created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/510 to tweak the memory alert. The "ping" alert doesn't seem to work as there's no data. The dashboard isn't using the correct hostname in the query, but even if the query is fixed there doesn't seem to be any ping data. I'm not sure why that is the case, because the telegraf ping configuration actually contains these hosts and they're pingable.
- Status changed from Feedback to Resolved
The partition usage and the corresponding alert are actually already there, just hidden within a folded section. The figures match what I see via df -h.
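For a cross-check that does not depend on Grafana, the same kind of figure can be computed locally. This is only a sketch: the function names and the 90% threshold are assumptions for illustration, not the values used by the actual alert. The percentage is computed as used / (used + available), which is how I understand df derives its Use% column, so small differences against other tools are possible.

```python
import shutil


def usage_percent(path):
    """Percentage of used space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    # usage.free is the space available to unprivileged users.
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0
    return 100.0 * usage.used / denominator


def is_low_on_space(path, threshold_percent=90.0):
    """True if usage is at or above the (hypothetical) alert threshold."""
    return usage_percent(path) >= threshold_percent


print(f"/: {usage_percent('/'):.1f}% used, alert: {is_low_on_space('/')}")
```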
- Status changed from Resolved to Feedback
The disk block is "collapsed" by default; that should be changed. I tried to save that change myself but was not sure about it, because all content after that single true/false switch also showed up as "changed" when I tried to save the changes to git. So please try to fix that yourself.
Merged, and seemingly completely broken: the dashboard now shows a line with "Disk (0 panels)".
I would try to revert it for now because at this point it is hard to change anything within Grafana's UI: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/515
(Note that I actually took the "collapsed": false, change from Grafana's JSON, so I'm really wondering why it isn't working. Maybe I forgot to add some relevant sections of the diff to the commit.)
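One possible explanation, not confirmed anywhere in this ticket: as far as I understand Grafana's dashboard JSON, a collapsed row keeps its child panels in a nested "panels" list, while an expanded row has an empty list and its panels follow it at the top level of the dashboard. Flipping only the "collapsed" flag would then leave the panels nested, and a complete change would touch many more lines than the single switch. The sketch below shows that transformation on a hypothetical miniature dashboard; the helper name expand_row is made up, and gridPos values may additionally need adjusting in a real dashboard.

```python
import copy


def expand_row(dashboard, row_title):
    """Return a copy of the dashboard with the named collapsed row expanded."""
    result = copy.deepcopy(dashboard)
    panels = []
    for panel in result.get("panels", []):
        is_target = (
            panel.get("type") == "row"
            and panel.get("title") == row_title
            and panel.get("collapsed")
        )
        if is_target:
            children = panel.get("panels", [])
            panel["collapsed"] = False
            panel["panels"] = []
            panels.append(panel)
            # Move the former children to the top level, right after the row.
            panels.extend(children)
        else:
            panels.append(panel)
    result["panels"] = panels
    return result


# Hypothetical miniature dashboard, only for illustration.
dashboard = {
    "panels": [
        {"id": 1, "type": "graph", "title": "Memory"},
        {
            "id": 2,
            "type": "row",
            "title": "Disk",
            "collapsed": True,
            "panels": [{"id": 3, "type": "graph", "title": "Partition usage"}],
        },
    ]
}

expanded = expand_row(dashboard, "Disk")
print([panel["title"] for panel in expanded["panels"]])
```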
How did you create the template in the first place? I guess you saved an existing dashboard and have replaced some values with variables. So we can simply save the dashboard again and replace these variables – if you can state how you did it :)
- Status changed from Feedback to Resolved
The latest SR has been deployed and now it works.