action #41189

closed

[tools][monitoring] Worker 'reachable' notifications sent from Grafana instance

Added by sebchlad over 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2018-09-18
Due date:
% Done: 0%
Estimated time:

Description

As we have an initial Grafana setup to monitor the state of machines/workers, we need to start sending email notifications from the existing Grafana monitoring.

See: http://docs.grafana.org/alerting/notifications/

Requirements:

  • email notifications should be delivered to an open mailing list
  • QAM and QASLE (Marita and Heiko) are subscribed to those notifications

Checklist

  • Heiko and Marita are receiving notifications
  • emails sent to the requested mailing list

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #41336: Create a monitoring dashboard for openqa.suse.de (Resolved, 2018-09-19)
Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instances (Resolved, nicksinger, 2018-04-25)

Actions #1

Updated by okurz over 5 years ago

  • Subject changed from [tools][openqa][monitoring] Worker 'reachable' notifications to [tools][monitoring] Worker 'reachable' notifications
  • Category set to Infrastructure

I guess the tag "openqa" is really implicit as we have this ticket on the "openQA tests" issue tracker :)

Actions #2

Updated by sebchlad over 5 years ago

  • Project changed from openQA Tests to openQA Project
  • Category deleted (Infrastructure)
Actions #3

Updated by sebchlad over 5 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Actions #4

Updated by szarate over 5 years ago

  • Description updated (diff)
  • Category set to 168
  • Assignee set to szarate
  • Target version set to Current Sprint
Actions #5

Updated by szarate over 5 years ago

Mailing list requested https://infra.nue.suse.com/Ticket/Display.html?id=121433&results=2c9585f45abca9e4c6fb768ef12f2a58

After this, the mailing list would be subscribed to the host group of the nagios monitoring

Actions #6

Updated by szarate over 5 years ago

  • Related to action #41336: Create a monitoring dashboard for openqa.suse.de added
Actions #7

Updated by sebchlad over 5 years ago

OK cool. We can wait, but why not do something in the meantime? :)

http://docs.grafana.org/alerting/notifications/

I see our version of grafana is 4.0+ so we can enable notifications already.

I assume setting up notifications is a matter of minutes or hours, so it makes sense to 'just do it'.

Actions #8

Updated by szarate over 5 years ago

I prefer https://progress.opensuse.org/issues/18164#note-27 to be solved first. Having people running around red herring in chicken headless mode sounds like no fun to me.

However this can be done in Parallel ;)

Actions #9

Updated by sebchlad over 5 years ago

  • Subject changed from [tools][monitoring] Worker 'reachable' notifications to [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance
  • Description updated (diff)
  • Difficulty set to easy

I'm changing the description to indicate this is solely about grafana notifications configured for sending emails.

It might take a while to have all the needed parts done in the Infra infrastructure, and the grafana settings seem to be easy, which makes me think we can just go ahead with an intermediate solution using the grafana instance we have now.

Actions #10

Updated by sebchlad over 5 years ago

You mean the ticket which was opened a year ago?? :)
I prefer to have something done, so we can improve the current imperfect solution, rather than wait an eternity for a perfect solution to our problems.

Actions #11

Updated by szarate over 5 years ago

I meant the comment #note-27 :), but again, this can be done in parallel :)

Actions #12

Updated by sebchlad over 5 years ago

szarate wrote:

Having people running around red herring in chicken headless mode sounds like no fun to me.

We are talking here about messages directed to a limited number of people who know what to do.
So in theory we should not have chicken headless mode. Actually we should have exactly the opposite: notifications to a limited number of people, so they will be on the same page; chicken headless mode stems from people being on different pages of different books ;) so they start asking questions. We all start asking questions, because we care. And eventually we all run after the same problems while understanding them differently.

We want people who care, like Marita, to be well informed and to have immediate and constant feedback on the production openQA, so they can feel safe that the situation is monitored and under control.
We do not want people to be surprised. So we do not want situations like "Oops, workers were down. We did not notice. Oops."

Actions #13

Updated by sebchlad over 5 years ago

@marita: Adding you as a watcher. The monitoring dashboard is in place. Now we would like to have notification alerts, so you can be informed about the situation immediately.

@nicksinger: hmmm, since the current solution is "work in progress" and we might have some false alarms, I would consider sending email alerts only under certain circumstances:

  • adding aggregated stats for sshd and systemd services
  • perhaps per architecture?
  • sending emails only in special cases: if 50% of servers report offline status, then mail people? If a ppc worker is offline, send an email? You know what I mean. We should expect loads of false alarms, so having emails sent only if serious problems are visible makes sense to me. Then we can fine-tune the alerts as we improve, or in fact by then we can have alerts sent from our infra system. We should discuss this with Coolo.
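
The "special cases" idea in the list above could be sketched roughly as follows. This is only an illustration in Python, not how Grafana evaluates alert rules and not what was deployed for this ticket; the `Worker` type, the `should_send_email` helper, the 50% ratio and the `ppc64le` architecture name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str     # hypothetical host name
    arch: str     # e.g. "x86_64" or "ppc64le" (assumed naming)
    online: bool  # result of the 'reachable' check


def should_send_email(workers, offline_ratio=0.5,
                      critical_archs=frozenset({"ppc64le"})):
    """Return True only when the outage looks serious enough to mail people."""
    if not workers:
        return False
    offline = [w for w in workers if not w.online]
    # Rule 1: a large share of all workers reports offline.
    if len(offline) / len(workers) >= offline_ratio:
        return True
    # Rule 2: any worker of a scarce architecture is offline.
    return any(w.arch in critical_archs for w in offline)
```

With these assumed defaults, a single offline x86_64 worker out of three would not trigger a mail, while a single offline ppc64le worker would. Both thresholds are parameters so they can be tuned as false alarms become less frequent.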
Actions #14

Updated by maritawerner over 5 years ago

Thanks a lot for the Dashboard! That is really cool!

Actions #15

Updated by szarate over 5 years ago

So we currently have the notifications already there, and all workers should be reporting data properly, but there are two issues atm:

  • Out of space in the monitoring host
  • Security group permissions too restrictive
Actions #16

Updated by szarate over 5 years ago

  • Checklist item changed from to [ ] Heiko and Marita are receiving notifications, [ ] emails sent to the requested mailing list
  • Status changed from New to Feedback

I'm calling this ticket done on the notification side of things.

There's still the issue of when a worker instance is considered online, since this triggers a lot of false positives, but that's another story.

I will set this ticket to feedback for the time being and move it out of current sprint.

Security group on the monitoring instance has been solved, space problem also fixed.

Actions #17

Updated by szarate over 5 years ago

  • Related to deleted (action #18164: [devops][tools] monitoring of openqa worker instances)
Actions #18

Updated by szarate over 5 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Actions #19

Updated by szarate over 5 years ago

Notification mailing list has been created, nothing done with it yet.

Actions #20

Updated by coolo over 5 years ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (168)
Actions #21

Updated by okurz over 5 years ago

  • Subject changed from [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance to [tools][functional][u][monitoring] Worker 'reachable' notifications sent from Grafana instance
  • Target version changed from Current Sprint to Milestone 20

szarate joined qsf-u

Actions #22

Updated by okurz over 5 years ago

  • Target version changed from Milestone 20 to Milestone 21
Actions #23

Updated by szarate over 5 years ago

  • Assignee changed from szarate to nicksinger

Assigning this to Nick :)

Actions #24

Updated by mgriessmeier over 5 years ago

  • Subject changed from [tools][functional][u][monitoring] Worker 'reachable' notifications sent from Grafana instance to [tools][monitoring] Worker 'reachable' notifications sent from Grafana instance
Actions #25

Updated by okurz about 5 years ago

  • Target version deleted (Milestone 21)

removing target version as tools team does not use milestones

Actions #26

Updated by nicksinger about 4 years ago

  • Status changed from Feedback to Resolved

We moved forward quite a bit. While the initial ACs were never met (we have no public ML, and Heiko and Marita aren't subscribed), I still think we can close this, as we have good monitoring and notifications in place in the meantime. If anybody objects, feel free to reopen this issue.
