action #41189

[tools][monitoring] Worker 'reachable' notifications sent from Grafana instance

Added by sebchlad about 4 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
Start date: 2018-09-18
Due date:
% Done: 0%
Estimated time:

Description

Now that we have an initial Grafana instance monitoring the state of machines/workers, we need to start sending email notifications from it.

See: http://docs.grafana.org/alerting/notifications/

Requirements:

  • email notifications should be delivered to an open mailing list
  • QAM and QASLE (Marita and Heiko) are subscribed to those notifications

Checklist

  • Heiko and Marita are receiving notifications
  • emails sent to the requested mailing list

Related issues

Related to openQA Infrastructure - action #41336: Create a monitoring dashboard for openqa.suse.de (Resolved, 2018-09-19)

Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instances (Resolved, 2018-04-25)

History

#1 Updated by okurz about 4 years ago

  • Subject changed from [tools][openqa][monitoring] Worker 'reachable' notifications to [tools][monitoring] Worker 'reachable' notifications
  • Category set to Infrastructure

I guess the tag "openqa" is really implicit as we have this ticket on the "openQA tests" issue tracker :)

#2 Updated by sebchlad about 4 years ago

  • Project changed from openQA Tests to openQA Project
  • Category deleted (Infrastructure)

#3 Updated by sebchlad about 4 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added

#4 Updated by szarate about 4 years ago

  • Description updated (diff)
  • Category set to 168
  • Assignee set to szarate
  • Target version set to Current Sprint

#5 Updated by szarate about 4 years ago

Mailing list requested https://infra.nue.suse.com/Ticket/Display.html?id=121433&results=2c9585f45abca9e4c6fb768ef12f2a58

After this, the mailing list would be subscribed to the host group of the Nagios monitoring.

#6 Updated by szarate about 4 years ago

  • Related to action #41336: Create a monitoring dashboard for openqa.suse.de added

#7 Updated by sebchlad about 4 years ago

OK cool. We can wait, but why not do something in the meantime? :)

http://docs.grafana.org/alerting/notifications/

I see our version of Grafana is 4.0+, so we can enable notifications already.

I assume setting up notifications is a matter of minutes or hours, so it makes sense to 'just do it'.

#8 Updated by szarate about 4 years ago

I prefer https://progress.opensuse.org/issues/18164#note-27 to be solved first. Having people running around red herring in chicken headless mode sounds like no fun to me.

However, this can be done in parallel ;)

#9 Updated by sebchlad about 4 years ago

  • Subject changed from [tools][monitoring] Worker 'reachable' notifications to [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance
  • Description updated (diff)
  • Difficulty set to easy

I'm changing the description to indicate that this is solely about Grafana notifications configured to send emails.

It might take a while to have all the needed parts done on the Infra side, and the Grafana settings seem to be easy, so I think we can just go ahead with an intermediate solution using the Grafana instance we have now.

#10 Updated by sebchlad about 4 years ago

You mean the ticket which was opened a year ago?? :)
I prefer to get something done, so that we can improve the current imperfect solution, rather than wait an eternity for a perfect solution to our problems.

#11 Updated by szarate about 4 years ago

I meant the comment #note-27 :), but again, this can be done in parallel :)

#12 Updated by sebchlad about 4 years ago

szarate wrote:

Having people running around red herring in chicken headless mode sounds like no fun to me.

We are talking here about directed messages to a limited number of people who know what to do.
So in theory we should not have chicken headless mode. We should actually have exactly the opposite: notifications to a limited number of people, so that they are on the same page. Chicken headless mode stems from people being on different pages of different books ;) so they start asking questions. We all start asking questions, because we care. And eventually we all chase the same problems while understanding them differently.

We want people who care, like Marita, to be well informed and to have immediate and constant feedback on production openQA, so that they can feel safe that the situation is monitored and under control.
We do not want people to be surprised, so we do not want situations like "Oops, workers were down. We did not notice. Oops".

#13 Updated by sebchlad about 4 years ago

@marita: Adding you as a watcher. The monitoring dashboard is in place. Now we would like to add notifications, so that you get immediate alerts on the situation.

nicksinger: hmmm, since the current solution is "work in progress" and we might have some false alarms, I would consider sending email alerts only under certain circumstances:

  • adding aggregated stats for sshd and systemd services
  • perhaps per architecture?
  • sending emails only in special cases: if 50% of servers report offline status, then mail people? If a ppc worker is offline, send an email? You know what I mean. We should expect loads of false alarms, so sending emails only when serious problems are visible makes sense to me. Then we can fine-tune the alerts as we improve, or by then we may even have alerts sent from our infra system. We should discuss this with Coolo.
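
The "special cases" idea above could be sketched roughly as follows. This is only an illustration, not what was configured in Grafana: the `Worker` type, the `should_alert` helper, the 50% default and the `ppc64le` architecture label are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical record for one monitored worker host.
    name: str
    arch: str
    online: bool


def should_alert(workers, offline_ratio=0.5, critical_archs=("ppc64le",)):
    """Return the reasons for sending an email; an empty list means stay quiet."""
    reasons = []
    if not workers:
        return reasons
    offline = [w for w in workers if not w.online]
    # Case 1: a large share of all workers is offline at the same time.
    if len(offline) / len(workers) >= offline_ratio:
        reasons.append(f"{len(offline)}/{len(workers)} workers offline")
    # Case 2: any worker of a scarce architecture is offline.
    for w in offline:
        if w.arch in critical_archs:
            reasons.append(f"{w.arch} worker {w.name} offline")
    return reasons
```

A single offline x86_64 worker out of three would produce no mail here, while one offline worker of a scarce architecture would. Both thresholds are parameters, so they can be tuned as the false alarms are sorted out.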

#14 Updated by maritawerner about 4 years ago

Thanks a lot for the Dashboard! That is really cool!

#15 Updated by szarate about 4 years ago

So we currently have the notifications already there, and all workers should be reporting data properly, but there are two issues atm:

  • Out of space in the monitoring host
  • Security group permissions too restrictive

#16 Updated by szarate about 4 years ago

  • Checklist item changed from to [ ] Heiko and Marita are receiving notifications, [ ] emails sent to the requested mailing list
  • Status changed from New to Feedback

I'm calling this ticket done, on the notification side of things.

There's still the issue of when a worker instance is considered online, since this triggers a lot of false positives, but that's another story.

I will set this ticket to feedback for the time being and move it out of the current sprint.

The security group issue on the monitoring instance has been solved, and the space problem is also fixed.

#17 Updated by szarate about 4 years ago

  • Related to deleted (action #18164: [devops][tools] monitoring of openqa worker instances)

#18 Updated by szarate about 4 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added

#19 Updated by szarate about 4 years ago

The notification mailing list has been created, but nothing has been done with it yet.

#20 Updated by coolo about 4 years ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (168)

#21 Updated by okurz about 4 years ago

  • Subject changed from [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance to [tools][functional][u][monitoring] Worker 'reachable' notifications sent from Grafana instance
  • Target version changed from Current Sprint to Milestone 20

szarate joined qsf-u

#22 Updated by okurz about 4 years ago

  • Target version changed from Milestone 20 to Milestone 21

#23 Updated by szarate almost 4 years ago

  • Assignee changed from szarate to nicksinger

Assigning this to Nick :)

#24 Updated by mgriessmeier almost 4 years ago

  • Subject changed from [tools][functional][u][monitoring] Worker 'reachable' notifications sent from Grafana instance to [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance

#25 Updated by okurz almost 4 years ago

  • Target version deleted (Milestone 21)

removing target version as tools team does not use milestones

#26 Updated by nicksinger almost 3 years ago

  • Status changed from Feedback to Resolved

We moved forward quite a bit. While the initial ACs were never met (we have no public ML, and Heiko and Marita aren't subscribed), I still think we can close this, as we have good monitoring and notifications in place in the meantime. If anybody objects, feel free to reopen this issue.
