action #12912

closed

[tools]monitoring of o3/osd

Added by okurz over 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version: -
Start date: 2016-07-28
Due date:
% Done: 0%
Estimated time:

Description

The disk space was depleted and coolo cleaned up. Why was there no monitoring? I asked on #infra --> https://nagios-devel.suse.de/icinga/cgi-bin/status.cgi?host=openqa.suse.de which needs special permissions, but
https://nagios-devel.suse.de/pnp4nagios/index.php/graph?host=openqa.suse.de&srv=fs_/var/lib/openqa&view=4 works. I asked again on #infra: [16/06/2016 10:23:34] Hello, who can help with monitoring of openqa.suse.de? It is monitored by nagios but it seems there are no notifications by email or similar. Can this be enabled? Can the infra-bot also report on that? Or can we have a nagios notification in another IRC channel?
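
As an illustration of the kind of disk-space check discussed here, the following is a minimal sketch and not the check that is actually configured on nagios-devel.suse.de. The thresholds of 80% and 90% and the function names are assumptions made for the example; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention.

```python
import shutil

# Exit codes follow the common Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_usage(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a plugin state (thresholds are assumptions)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path, warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {err}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports a total size of 0"
    used_percent = 100.0 * usage.used / usage.total
    state = classify_usage(used_percent, warn, crit)
    return state, f"DISK {STATE_NAMES[state]} - {path} is {used_percent:.1f}% full"
```

A wrapper script could call `check_disk("/var/lib/openqa")`, print the status line and exit with the returned code so that nagios picks up the state.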

On 2016-07-23 20:20, okurz wrote:

=> no output from the bot in the qa-review channel. I guess we need
to coordinate what should be in the context sent to the bot. Can you
send me some example output?

Well, the bot does not yet support any of these features. So far I
just started a listening netcat myself. Is this IRC notification custom
built or part of nagios?

Nagios will just send anything you like to anything you can configure.

The bot we use at #infra is a Supybot.
https://www.dragonsreach.it/2012/06/30/nagios-irc-notifications/
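
To make the "listening netcat" experiment concrete, here is a minimal sketch of a sender that a nagios notification command could call. It only assumes that something accepts one line of text per notification on a TCP port; the message layout, the function names and the address are hypothetical and not taken from the bot or from the linked article.

```python
import socket


def format_notification(host, service, state, output):
    """Build a one-line message; this layout is an assumption for the example."""
    return f"[nagios] {host} {service} {state}: {output}"


def write_notification(conn, message):
    """Send one newline-terminated message over an already connected socket."""
    conn.sendall(message.encode("utf-8") + b"\n")


def send_notification(address, message, timeout=5.0):
    """Connect to (host, port), e.g. a listening netcat, and send the message."""
    with socket.create_connection(address, timeout=timeout) as conn:
        write_notification(conn, message)
```

With a listener on a hypothetical port 5050, `send_notification(("localhost", 5050), format_notification("openqa.suse.de", "fs_/var/lib/openqa", "CRITICAL", "disk full"))` would deliver a single line that the bot side could relay to a channel.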

So who can expect to see which notifications under which circumstances?
I would like to inform others but I don't know about what yet.

service_notification_options w,u,c,r
host_notification_options d,r

Any service that is:

  • warning
  • unknown
  • critical
  • recovered

and any host that is:

  • down
  • recovered

will trigger a notification.
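
The mapping from the option letters to the states listed above can be sketched as follows. This is only an illustration of how the two settings read, not nagios code; the tables cover just the letters that appear in this ticket, and the function names are made up for the example.

```python
# Option letters as used in this ticket; nagios defines further letters
# that are deliberately left out of this sketch.
SERVICE_STATES = {"w": "warning", "u": "unknown", "c": "critical", "r": "recovered"}
HOST_STATES = {"d": "down", "r": "recovered"}


def parse_options(value, table):
    """Turn a string such as 'w,u,c,r' into the list of state names."""
    letters = [item.strip() for item in value.split(",") if item.strip()]
    return [table[letter] for letter in letters]


def should_notify(state, options, table):
    """Return True if `state` is enabled by the given option string."""
    return state in parse_options(options, table)
```

For example, `parse_options("d,r", HOST_STATES)` yields `["down", "recovered"]`, which matches the host part of the list above.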

"common admins" should be subscribed by email for the more important notifications, but Lars did not answer my question any further. I can take a look into IRC notifications some time.

For o3 the ones that should receive a monitoring alert are: lnussel, rbrown, coolo, okurz, dleuenberger, mlin

further details

literature


Related issues 3 (0 open, 3 closed)

Related to openQA Tests - action #16512: timeout while uploading the logs - test fails in install_and_reboot (Resolved, okurz, 2017-02-06)
Related to openQA Tests - action #17548: osd out of space (Resolved, okurz, 2017-03-06)
Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instances (Resolved, nicksinger, 2018-04-25)

Actions #1

Updated by okurz over 7 years ago

Automatic restart of apache on o3 because of memory depletion? --> There is a cron job that already restarts all apache workers every hour. It seems like it is just caching and coolo thinks everything is right. I created a ticket for infra: https://infra.nue.suse.com/SelfService/Display.html?id=55081 regarding this question. I disabled the nagios notification but I am not sure if it's just for me or for everybody. I also provided a comment on the incident within nagios.

Actions #2

Updated by okurz over 7 years ago

  • Description updated (diff)
Actions #3

Updated by okurz over 7 years ago

  • Description updated (diff)
Actions #4

Updated by okurz over 7 years ago

  • Description updated (diff)
Actions #6

Updated by okurz about 7 years ago

  • Related to action #16512: timeout while uploading the logs - test fails in install_and_reboot added
Actions #7

Updated by okurz about 7 years ago

from #17548

  • why was there no monitoring notification
  • can we move the yellow bar further down
  • on nagios clicking on the icons above graphs in the top right corner like "most recent alerts…" yields 404
Actions #8

Updated by okurz about 7 years ago

Created a ticket to add mgriessmeier, nsinger, szarate as monitoring recipients for osd: https://infra.nue.suse.com/SelfService/Display.html?id=64510

Actions #9

Updated by okurz about 7 years ago

okurz wrote:

from #17548

  • why was there no monitoring notification

https://infra.nue.suse.com/Ticket/Display.html?id=64518

  • can we move the yellow bar further down

https://infra.nue.suse.com/Ticket/Display.html?id=64522

  • on nagios clicking on the icons above graphs in the top right corner like "most recent alerts…" yields 404

https://infra.nue.suse.com/Ticket/Display.html?id=64514

Actions #10

Updated by okurz about 7 years ago

Actions #13

Updated by RBrownSUSE about 7 years ago

  • Subject changed from monitoring of o3/osd to [tools]monitoring of o3/osd
Actions #14

Updated by okurz about 7 years ago

  • Assignee deleted (okurz)

Not currently working on this myself. I recently added szarate, mgriessmeier, nsinger to the logwarn notification emails. Now mainly waiting for a reaction on the infra@suse.de tickets.

Actions #15

Updated by okurz about 7 years ago

https://infra.nue.suse.com/SelfService/Display.html?id=64510 is solved; nsinger and szarate have now been added to monitoring in the "openqa-suse" group.

Actions #17

Updated by nicksinger about 7 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Actions #18

Updated by coolo over 6 years ago

  • Related to deleted (action #18164: [devops][tools] monitoring of openqa worker instances)
Actions #19

Updated by coolo over 6 years ago

  • Status changed from In Progress to Resolved

Let's handle the worker monitoring independently.

Actions #20

Updated by okurz over 6 years ago

  • Assignee set to okurz

coolo, not everyone is an uber-brain like you are so please keep the relations. It's just this, a relation.

Actions #21

Updated by okurz over 6 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Actions #22

Updated by coolo over 6 years ago

okurz wrote:

coolo, not everyone is an uber-brain like you are

I know! I have to suffer from this basically all my life!

It's just this, a relation.

It was more than that: it was a subtask, so it was impossible to resolve this.
