action #12912
closed[tools]monitoring of o3/osd
0%
Description
Disk space was depleted; coolo cleaned up. Why was there no monitoring? Asked on #infra --> https://nagios-devel.suse.de/icinga/cgi-bin/status.cgi?host=openqa.suse.de, which needs special permissions, but
https://nagios-devel.suse.de/pnp4nagios/index.php/graph?host=openqa.suse.de&srv=fs_/var/lib/openqa&view=4 works. Asked again on #infra: [16/06/2016 10:23:34] hello, who can help with monitoring of openqa.suse.de? It is monitored by nagios but it seems there are no notifications by email or similar. Can this be enabled? Can the infra-bot also report on that? Or can we have a nagios notification in another IRC channel?
On 2016-07-23 20:20, okurz wrote:
=> no output from the bot in the qa-review channel. I guess we need
to coordinate what should be in the context sent to the bot. Can you
send me some example output?
Well, the bot does not yet support any of these features. So far I
just started a listening netcat myself. Is this IRC notification custom
built or part of nagios?
Nagios will just send anything you like to anything you can configure.
The bot we use at #infra is a Supybot.
https://www.dragonsreach.it/2012/06/30/nagios-irc-notifications/
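The listening netcat mentioned above could be fed by a small script that nagios calls as a notification command. The following is only a sketch under assumptions: the function names, the message layout, and the host/port are made up for illustration, and the bot side would still need to relay the received line to IRC.

```python
import socket


def format_notification(host: str, service: str, state: str, output: str) -> str:
    """Build a one-line message; the layout is hypothetical, not a nagios format."""
    return f"[{state}] {host}/{service}: {output}"


def send_line(target_host: str, target_port: int, line: str, timeout: float = 5.0) -> None:
    """Send a single newline-terminated line over TCP, e.g. to a listening netcat."""
    with socket.create_connection((target_host, target_port), timeout=timeout) as conn:
        conn.sendall(line.encode("utf-8") + b"\n")


if __name__ == "__main__":
    # Hypothetical values; in practice the notification command would pass them in.
    message = format_notification(
        "openqa.suse.de", "fs_/var/lib/openqa", "CRITICAL", "disk space low"
    )
    send_line("127.0.0.1", 5050, message)
```

To try it locally, start a listener on the chosen port first (for example a netcat in listen mode) and then run the script; the formatted line should show up on the listener.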
So who can expect to see which notifications under which circumstances?
I would like to inform others but I don't know about what, yet.
service_notification_options w,u,c,r
host_notification_options d,r
Any service that is:
- warning
- unknown
- critical
- recovered
and any host that is:
- down
- recovered
will trigger a notification.
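As a minimal sketch of how those option letters map to states, the following Python snippet expands an option string into the states it selects. The letter tables follow the notification options of nagios contact definitions as far as I know them; the helper name `expand_options` is made up for illustration and is not part of nagios.

```python
# Option letters for contact definitions (assumed from the nagios object documentation).
SERVICE_OPTIONS = {
    "w": "warning",
    "u": "unknown",
    "c": "critical",
    "r": "recovery",
    "f": "flapping",
    "s": "scheduled downtime",
}
HOST_OPTIONS = {
    "d": "down",
    "u": "unreachable",
    "r": "recovery",
    "f": "flapping",
    "s": "scheduled downtime",
}


def expand_options(options: str, table: dict) -> list:
    """Turn a comma-separated option string like 'w,u,c,r' into state names."""
    letters = [letter.strip() for letter in options.split(",") if letter.strip()]
    return [table[letter] for letter in letters]


if __name__ == "__main__":
    print("services:", expand_options("w,u,c,r", SERVICE_OPTIONS))
    print("hosts:", expand_options("d,r", HOST_OPTIONS))
```

With the two settings quoted above, this lists warning, unknown, critical and recovery for services, and down and recovery for hosts, which matches the explanation given in the ticket.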
"common admins" should be subscribed by email for the more important notifications but Lars did not further answer my question. I can take a look into IRC notifications some time.
For o3 the ones that should receive a monitoring alert are: lnussel, rbrown, coolo, okurz, dleuenberger, mlin
further details
literature
Updated by okurz over 8 years ago
Automatic restart of apache on o3 because of memory depletion? --> There is already a cron job that restarts all apache workers every hour; it seems like it is just caching, and coolo thinks everything is fine. I created a ticket for infra: https://infra.nue.suse.com/SelfService/Display.html?id=55081 regarding this question. I disabled the nagios notification but I am not sure if that applies just to me or to everybody. I also provided a comment on the incident within nagios.
Updated by okurz over 8 years ago
TODO: continue reading on https://nagios.nue.suse.com/docs/en/quickstart-icinga.html
Updated by okurz about 8 years ago
- Related to action #16512: timeout while uploading the logs - test fails in install_and_reboot added
Updated by okurz about 8 years ago
from #17548
- why was there no monitoring notification
- can we move the yellow bar further down
- on nagios clicking on the icons above graphs in the top right corner like "most recent alerts…" yields 404
Updated by okurz about 8 years ago
created ticket to add mgriessmeier, nsinger, szarate as monitoring recipients for osd: https://infra.nue.suse.com/SelfService/Display.html?id=64510
Updated by okurz about 8 years ago
okurz wrote:
from #17548
- why was there no monitoring notification
https://infra.nue.suse.com/Ticket/Display.html?id=64518
- can we move the yellow bar further down
https://infra.nue.suse.com/Ticket/Display.html?id=64522
- on nagios clicking on the icons above graphs in the top right corner like "most recent alerts…" yields 404
Updated by okurz about 8 years ago
- Related to action #17548: osd out of space added
Updated by szarate about 8 years ago
Updated by okurz@suse.de about 8 years ago
I already did that! Read two comments above:
https://infra.nue.suse.com/SelfService/Display.html?id=64510
Updated by RBrownSUSE about 8 years ago
- Subject changed from monitoring of o3/osd to [tools]monitoring of o3/osd
Updated by okurz about 8 years ago
- Assignee deleted (okurz)
Not currently working on this myself. Recently added szarate, mgriessmeier, nsinger to the logwarn notification emails. Now mainly waiting for a reaction on the infra@suse.de tickets.
Updated by okurz almost 8 years ago
https://infra.nue.suse.com/SelfService/Display.html?id=64510 solved, nsinger and szarate are now added to monitoring in the "openqa-suse" group.
Updated by okurz almost 8 years ago
[24 Mar 2017 14:04:41] okurz: 404 fixed ;)
[24 Mar 2017 14:05:53] okurz: for example : https://nagios-devel.suse.de/cgi-bin/summary.cgi?report=1&displaytype=1&timeperiod=custom&smon=02&sday=20&syear=2017&shour=13&smin=05&ssec=34&emon=03&eday=24&eyear=2017&ehour=13&emin=05&esec=34&hostgroup=all&servicegroup=all&host=openqa.suse.de&alerttypes=3&statetypes=3&hoststates=7&servicestates=120&limit=999 now works
Updated by nicksinger almost 8 years ago
- Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Updated by coolo over 7 years ago
- Related to deleted (action #18164: [devops][tools] monitoring of openqa worker instances)
Updated by coolo over 7 years ago
- Status changed from In Progress to Resolved
let's handle the worker monitoring as independent.
Updated by okurz over 7 years ago
- Assignee set to okurz
coolo, not everyone is an uber-brain like you are so please keep the relations. It's just this, a relation.
Updated by okurz over 7 years ago
- Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Updated by coolo over 7 years ago
okurz wrote:
coolo, not everyone is an uber-brain like you are
I know! I have to suffer from this basically all my life!
It's just this, a relation.
It was more, it was a subtask, so it was impossible to resolve this.