action #132278
coordination #132275 (closed): [epic] Better o3 monitoring
Basic o3 http response alert on zabbix size:M
0%
Description
Motivation
We had a major outage of o3 and did not receive any monitoring alert for it, only user reports, see #132218
Acceptance criteria
- AC1: A SUSE-IT maintained monitoring solution will alert us if https://openqa.opensuse.org does not return a valid response for some time
Suggestions
- Log in to https://zabbix.nue.suse.com/ and play around until you have an alert for the o3 http response, or ask Eng-Infra to bring back what they likely still store in some of their git repos regarding http response alerts from their former icinga/nagios instance
- https://zabbix.nue.suse.com/zabbix.php?show=1&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view&hostids%5B%5D=10855 (if that link works) shows me two problems, e.g. that the zabbix agent has not been available for months. This might be the first thing to look into, but we shouldn't need an agent on the system to find out whether the system is reachable
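For illustration only, the agentless check that AC1 asks for could be sketched roughly as below. This is not the Zabbix configuration itself, just a standalone Python sketch of the idea. The names `probe` and `should_alert`, the 10 second timeout, and the threshold of 3 consecutive failures are assumptions made for this example; the ticket only says "for some time".

```python
import http.client
import urllib.error
import urllib.request
from typing import Sequence

O3_URL = "https://openqa.opensuse.org"  # URL taken from AC1


def probe(url: str = O3_URL, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout.

    Hypothetical helper: the timeout value is an assumption for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP 4xx/5xx.
        return False


def should_alert(results: Sequence[bool], failures_needed: int = 3) -> bool:
    """Return True if the most recent `failures_needed` probes all failed.

    Requiring several consecutive failures approximates "no valid response
    for some time" and avoids alerting on a single dropped request.
    """
    if len(results) < failures_needed:
        return False
    return not any(results[-failures_needed:])
```

A scheduler would call `probe()` periodically, append the result to a history list, and notify once `should_alert(history)` becomes true. The check runs from the outside, so it does not depend on an agent on the monitored host.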
Updated by okurz over 1 year ago
- Related to action #132218: Conduct lessons learned for "openQA is not accessible" on 2023-07-02 added
Updated by okurz over 1 year ago
- Subject changed from Basic o3 http response alert on zabbix to Basic o3 http response alert on zabbix size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
Updated by jbaier_cz over 1 year ago
Both errors in zabbix (no data from ariel and low disk space) are solved. The zabbix agent on ariel was not properly configured and enabled.
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz over 1 year ago
A web scenario for openqa.opensuse.org is available on https://zabbix.suse.de/httpdetails.php?httptestid=15 (configured as a web scenario for host ariel) and a trigger for a failure in that scenario has been created. As a last step, I will look into the notification options.
Updated by jbaier_cz over 1 year ago
- Due date deleted (2023-07-25)
- Status changed from In Progress to Resolved
Notifications are also working. The setup is not completely ideal though. The action is configured to notify me by email upon any trigger inside the Owners/O3 host group. My email is set to the o3-admins mailing list. Anyone should be able to change the setting if needed, so it is not a big deal. I will create a follow-up ticket for a cleaner solution (bot account).
Updated by jbaier_cz over 1 year ago
- Precedes action #132752: Use proper bot account for notifications in zabbix.suse.de size:M added
Updated by okurz over 1 year ago
- Related to action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 added