coordination #132275: [epic] Better o3 monitoring
Basic o3 http response alert on zabbix size:M
We had a major outage of o3 and received no monitoring alert for it, only user reports; see #132218
- AC1: A SUSE-IT maintained monitoring solution will alert us if https://openqa.opensuse.org does not return a valid response for some time
- Log in to https://zabbix.nue.suse.com/ and experiment until you have an alert for the o3 HTTP response, or ask Eng-Infra to bring back what they likely still store in some of their git repos regarding HTTP response alerts from their former icinga/nagios instance
- https://zabbix.nue.suse.com/zabbix.php?show=1&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view&hostids%5B%5D=10855 (if that link works) shows me two problems, e.g. that the zabbix agent has been unavailable for months. This might be the first thing to look into, but we shouldn't need an agent on the system to find out whether the system is reachable
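For illustration, the agentless check that AC1 asks for can be sketched in a few lines of Python. This is only a sketch of the idea, not how a Zabbix web scenario is implemented. The function names `probe` and `should_alert`, the 10 s timeout, and the threshold of three consecutive failures are assumptions made up for this example; only the URL comes from the ticket.

```python
import urllib.error
import urllib.request
from typing import Sequence

URL = "https://openqa.opensuse.org"  # target named in AC1


def probe(url: str = URL, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx HTTP status.

    No agent on the target host is needed; this is a plain HTTP request.
    The timeout value is an assumption, not a value from the ticket.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP 4xx/5xx, DNS failure, refused connection, timeout, bad URL
        return False


def should_alert(results: Sequence[bool], threshold: int = 3) -> bool:
    """Return True if the last `threshold` probes all failed.

    This models "no valid response for some time": a single failed
    probe is not enough. The threshold of 3 is a hypothetical value.
    """
    recent = list(results)[-threshold:]
    return len(recent) == threshold and not any(recent)


if __name__ == "__main__":
    # One-off manual check; a real monitor would probe on a schedule
    # and keep the history of results between runs.
    print("o3 reachable:", probe())
```

A scheduler would append each `probe()` result to a history and call `should_alert(history)` after every run, so that short blips do not page anyone while a sustained outage does.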
A web scenario for openqa.opensuse.org is available at https://zabbix.suse.de/httpdetails.php?httptestid=15 (configured as a web scenario for host ariel), and a trigger for a failure in that scenario has been created. As a last step, I will look into the notification options.
- Due date deleted (
- Status changed from In Progress to Resolved
Notifications are also working. The setup is not ideal, though. The action is configured to notify me by email upon any trigger inside the Owners/O3 host group, and my email is set to the o3-admins mailing list. Anyone should be able to change the setting if needed, so it is not a big deal. I will create a follow-up ticket for a cleaner solution (bot account).