[tools][monitoring] Worker 'reachable' notifications sent from Grafana instance
Since we have an initial Grafana instance to monitor the state of machines/workers, we need to start sending email notifications from the existing Grafana monitoring.
- email notifications should be delivered to an open mailing list
- QAM and QASLE (Marita and Heiko) are subscribed to those notifications
- Heiko and Marita are receiving notifications
- emails are sent to the requested mailing list
#1 Updated by okurz about 4 years ago
- Subject changed from [tools][openqa][monitoring] Worker 'reachable' notifications to [tools][monitoring] Worker 'reachable' notifications
- Category set to Infrastructure
I guess the tag "openqa" is really implicit as we have this ticket on the "openQA tests" issue tracker :)
#5 Updated by szarate about 4 years ago
After this, the mailing list would be subscribed to the host group of the Nagios monitoring
#7 Updated by sebchlad about 4 years ago
OK cool. We can wait, but why not do something in the meantime? :)
I see our version of Grafana is 4.0+, so we can enable notifications already.
I assume setting up notifications is a matter of minutes or hours, so it makes sense to 'just do it'.
#8 Updated by szarate about 4 years ago
I prefer https://progress.opensuse.org/issues/18164#note-27 to be solved first. Having people running around red herring in chicken headless mode sounds like no fun to me.
However, this can be done in parallel ;)
#9 Updated by sebchlad about 4 years ago
- Subject changed from [tools][monitoring] Worker 'reachable' notifications to [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance
- Description updated (diff)
- Difficulty set to easy
I'm changing the description to indicate this is solely about Grafana notifications configured for sending emails.
It might take a while to have all the needed parts done in the Infra infrastructure, and the Grafana settings seem to be easy, which makes me think we can just go ahead with an intermediate solution using the Grafana instance we have now.
#12 Updated by sebchlad about 4 years ago
Having people running around red herring in chicken headless mode sounds like no fun to me.
We are talking here about directed messages to a limited number of people who know what to do.
So in theory we should not have chicken headless mode. Actually we should have exactly the opposite: notifications to a limited number of people, so they will be on the same page. Chicken headless mode stems from people being on different pages of different books ;) so they start asking questions. We all start asking questions, as we care. And eventually we all run after the same problems while understanding them differently.
We want people who care, like Marita, to be well informed and to have immediate and constant feedback on the production openQA, so they can feel safe that the situation is monitored and under control.
We do not want people to be surprised. So we do not want situations like "Oops, workers were down. We did not notice. Oops."
#13 Updated by sebchlad about 4 years ago
@marita: Adding you as a watcher. The monitoring dashboard is in place. Now we would like to have notification alerts, so you can get immediate alerts about the situation.
nicksinger: hmmm, since the current solution is "work in progress" and we might have some false alarms, I would consider sending email alerts only under certain circumstances:
- adding aggregated stats for sshd and systemd services
- perhaps per architecture?
- sending emails only in special cases: if 50% of servers report offline status, then mail people? If a ppc worker is offline, send an email? You know what I mean. We should expect loads of false alarms, so sending emails only if serious problems are visible makes sense to me. Then we can fine-tune the alerts as we improve, or in fact by then we can have alerts sent from our infra system. We should discuss with Coolo.
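To make the proposed conditions concrete, here is a minimal sketch of the decision rule in Python. It is only an illustration of the idea above, not how Grafana evaluates alerts: the `Worker` type, the `should_send_email` function, the `ppc64le` architecture label and the 50% default are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Worker:
    """Hypothetical record of one monitored worker host."""
    name: str
    arch: str     # architecture label, e.g. "x86_64" or "ppc64le" (assumed naming)
    online: bool  # result of the 'reachable' check


def should_send_email(
    workers: Iterable[Worker],
    offline_ratio_threshold: float = 0.5,                    # "50% of servers" from the comment
    critical_archs: frozenset = frozenset({"ppc64le"}),      # assumed label for ppc workers
) -> bool:
    """Return True only for the 'serious' cases discussed above."""
    workers = list(workers)
    if not workers:
        # No data at all: do not mail anyone based on an empty list.
        return False

    offline = [w for w in workers if not w.online]

    # Case 1: at least the given share of all workers is offline.
    if len(offline) / len(workers) >= offline_ratio_threshold:
        return True

    # Case 2: any worker of a critical (scarce) architecture is offline.
    return any(w.arch in critical_archs for w in offline)
```

With such a rule, a single x86_64 worker dropping out of a larger pool would not trigger a mail, while a single offline ppc worker would. The threshold and the set of critical architectures are exactly the knobs we would fine-tune later.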
#16 Updated by szarate about 4 years ago
- Checklist item changed from to [ ] Heiko and Marita are receiving notifications, [ ] emails are sent to the requested mailing list
- Status changed from New to Feedback
I'm calling this ticket done on the notification side of things.
There's still the issue of when a worker instance is considered online, which triggers a lot of false positives, but that's another story.
I will set this ticket to feedback for the time being and move it out of the current sprint.
The security group issue on the monitoring instance has been solved, and the space problem is also fixed.
#21 Updated by okurz about 4 years ago
- Subject changed from [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance to [tools][functional][u][monitoring] Worker 'reachable' notifications sent from Grafana instance
- Target version changed from Current Sprint to Milestone 20
szarate joined qsf-u
#26 Updated by nicksinger almost 3 years ago
- Status changed from Feedback to Resolved
We moved forward quite a bit. While the initial ACs were never met (we have no public ML, and Heiko and Marita aren't subscribed), I still think we can close this, as we have good monitoring and notifications in place in the meantime. If anybody objects, feel free to reopen this issue.