action #132278

closed

coordination #132275: [epic] Better o3 monitoring

Basic o3 http response alert on zabbix size:M

Added by okurz 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
% Done: 0%

Description

Motivation

We had a major outage of o3 and did not receive any monitoring alert for it, only user reports; see #132218.

Acceptance criteria

  • AC1: A SUSE-IT maintained monitoring solution will alert us if https://openqa.opensuse.org does not return a valid response for some time

Suggestions


Related issues 3 (0 open, 3 closed)

  • Related to openQA Infrastructure - action #132218: Conduct lessons learned for "openQA is not accessible" on 2023-07-02 (Resolved, okurz, 2023-07-02)
  • Related to openQA Infrastructure - action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 (Resolved, jbaier_cz, 2023-07-16)
  • Precedes openQA Infrastructure - action #132752: Use proper bot account for notifications in zabbix.suse.de size:M (Resolved, jbaier_cz, 2023-07-14)

Actions #1

Updated by okurz 5 months ago

  • Related to action #132218: Conduct lessons learned for "openQA is not accessible" on 2023-07-02 added
Actions #2

Updated by okurz 5 months ago

  • Subject changed from Basic o3 http response alert on zabbix to Basic o3 http response alert on zabbix size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by jbaier_cz 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to jbaier_cz
Actions #4

Updated by jbaier_cz 5 months ago

Both errors in Zabbix (no data from ariel and low disk space) are solved. The Zabbix agent on ariel was not properly configured and enabled.

Actions #5

Updated by openqa_review 5 months ago

  • Due date set to 2023-07-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by jbaier_cz 5 months ago

A web scenario for openqa.opensuse.org is available at https://zabbix.suse.de/httpdetails.php?httptestid=15 (configured as a web scenario for host ariel), and a trigger for a failure in that scenario has been created. As a last step, I will look into the notification options.
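
For readers unfamiliar with this kind of check, the logic behind "alert if the site does not return a valid response for some time" can be sketched roughly as follows. This is not the Zabbix configuration itself, which lives in the Zabbix UI; it is a minimal Python illustration. The function names `probe` and `should_alert`, the 10-second timeout, the threshold of three consecutive failures, and treating only HTTP 200 as "valid" are all assumptions made for the example.

```python
import http.client
import urllib.request


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout.

    Assumption: only status 200 counts as a valid response.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are all OSError subclasses;
        # HTTPException covers malformed responses.
        return False


def should_alert(history: list[bool], failures_needed: int = 3) -> bool:
    """Return True if the most recent `failures_needed` probes all failed.

    `history` holds probe results in chronological order (True = success).
    Requiring several consecutive failures (hypothetical threshold) avoids
    alerting on a single transient error.
    """
    if len(history) < failures_needed:
        return False
    return not any(history[-failures_needed:])
```

A scheduler would call `probe("https://openqa.opensuse.org")` at a fixed interval, append each result to a list, and send a notification once `should_alert` returns True. Requiring consecutive failures trades a slightly later alert for fewer flaky ones.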

Actions #7

Updated by jbaier_cz 5 months ago

  • Due date deleted (2023-07-25)
  • Status changed from In Progress to Resolved

Notifications are also working. The setup is not ideal, though: the action is configured to notify me by email on any trigger in the Owners/O3 host group, and my email address is set to the o3-admins mailing list. Anyone should be able to change the setting if needed, so it is not a big deal. I will create a follow-up ticket for a cleaner solution (bot account).

Actions #8

Updated by jbaier_cz 5 months ago

  • Precedes action #132752: Use proper bot account for notifications in zabbix.suse.de size:M added
Actions #9

Updated by okurz 5 months ago

  • Related to action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 added