action #133385

closed

Problem: Interface tun5: Link down alerting and autoresolving shortly size:S

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

This is about o3, so zabbix, not grafana!

From zabbix@suse.de

Problem started at 11:05:17 on 2023.07.26
Problem name: Interface tun5: Link down
Host: ariel.suse-dmz.opensuse.org
Severity: Average
Operational data: Current state: down (2)
Original problem ID: 510998085

followed by

Problem has been resolved at 11:12:17 on 2023.07.26
Problem name: Interface tun5: Link down
Problem duration: 7m 0s
Host: ariel.suse-dmz.opensuse.org
Severity: Average
Original problem ID: 510998085

This likely happened while o3 was rebooting as planned.

Acceptance criteria

  • AC1: No more alerts for tun5 are observed if o3 or the tunnel is just down for some minutes

Steps to reproduce

  • Temporarily shut down the autossh-old-ariel.service on new-ariel or try to trigger with reboots of o3

Suggestions

  • Log in to https://zabbix.nue.suse.com/ and explore to find your way around. If in doubt, ask jbaier_cz
  • Just bump the sensitivity of the alert or delay the actual notification
  • Try the effect with multiple reboots of o3
Actions #1

Updated by nicksinger over 1 year ago

The underlying problem was actually caused by manual actions. We just need to adjust the sensitivity of this alert so that it does not fire immediately when somebody reboots the system.

Actions #2

Updated by jbaier_cz over 1 year ago

In Zabbix, it should be possible to use the "escalation mechanism" to make the e-mail notification a second step that only runs if the trigger is still firing after some time frame. That should cover cases like this one, where the trigger fires only for a short period.

Actions #3

Updated by okurz over 1 year ago

  • Tags set to infra, o3, zabbix, false positive
  • Target version set to Ready
Actions #4

Updated by okurz over 1 year ago

  • Subject changed from Problem: Interface tun5: Link down alerting and autoresolving shortly to Problem: Interface tun5: Link down alerting and autoresolving shortly size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by jbaier_cz over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to jbaier_cz
Actions #6

Updated by openqa_review over 1 year ago

  • Due date set to 2023-08-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by jbaier_cz over 1 year ago

  • Status changed from In Progress to Feedback

A new dummy operation step has been added inside our notification action (Configuration -> Actions -> Trigger actions). The notification operation (which sends mail) is now the second step, and the first (non-existent) step has a duration of 15 minutes. This should notify us only about problems that do not resolve themselves within a 15-minute window. Let's wait and see if that is what we want.
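
As a rough illustration of the intended effect (not the actual Zabbix configuration), the delayed notification can be modelled as follows. The function name `mail_is_sent` and the simplification that a mail goes out only if the problem outlasts the first step are assumptions made for this sketch; the 15-minute step and the 7-minute problem duration are taken from this ticket.

```python
from datetime import timedelta

# Assumption: step 1 has no operation and lasts 15 minutes,
# step 2 is the operation that sends the e-mail.
FIRST_STEP_DURATION = timedelta(minutes=15)


def mail_is_sent(problem_duration, first_step=FIRST_STEP_DURATION):
    """Return True if the problem is still open when step 2 would start.

    Simplified model: a problem that resolves before the first step
    has elapsed never reaches the mail-sending step.
    """
    return problem_duration > first_step


# The tun5 event from this ticket lasted 7 minutes.
print(mail_is_sent(timedelta(minutes=7)))   # False
# A hypothetical outage of 20 minutes would still be reported.
print(mail_is_sent(timedelta(minutes=20)))  # True
```

Under this model the 7-minute outage caused by the planned reboot would not have produced a mail, while longer outages are still reported, only 15 minutes later than before.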

Actions #8

Updated by okurz over 1 year ago

Hm, an interesting approach. But isn't it possible to define a "pending" duration or similar per alert?

Actions #9

Updated by jbaier_cz over 1 year ago

okurz wrote:

Hm, an interesting approach. But isn't it possible to define a "pending" duration or similar per alert?

Yes, there are other options (this one is the simplest and most transparent). You can:

1) edit the trigger so that Zabbix does not raise the problem immediately (as the trigger comes from a template, this means editing the template and/or disabling the template and creating our own copy of the trigger; in this case it is even a discovered trigger, which adds one more layer of dependency)
2) edit the action and create a custom action for each trigger / trigger group (tag) / trigger severity and change the step duration differently in each action
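
To make option 1 more concrete, here is a minimal sketch of the idea behind a trigger that does not raise a problem immediately: it fires only once the interface has been reported down for several consecutive checks. This is plain Python for illustration, not Zabbix trigger syntax; the function name `should_raise`, the sample lists, and the threshold of 3 checks are hypothetical. The value 2 for "down" matches the operational data in the alert, and 1 for "up" is assumed.

```python
UP, DOWN = 1, 2  # DOWN = 2 as in "Current state: down (2)"; UP = 1 is assumed


def should_raise(samples, required_consecutive):
    """Return True only if the most recent samples are all DOWN.

    `samples` is a list of interface states, oldest first.
    `required_consecutive` is how many of the latest checks must be DOWN.
    """
    if required_consecutive <= 0:
        raise ValueError("required_consecutive must be positive")
    recent = samples[-required_consecutive:]
    return len(recent) == required_consecutive and all(s == DOWN for s in recent)


short_outage = [UP, DOWN, DOWN, UP]   # link came back before the threshold
long_outage = [UP, DOWN, DOWN, DOWN]  # link stayed down for three checks

print(should_raise(short_outage, 3))  # False
print(should_raise(long_outage, 3))   # True
```

The trade-off compared with the escalation step is that this logic would live in the trigger itself, so the problem would not even appear in Zabbix for short outages.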

The pairing between triggers (source of the problems) and actions (source of the e-mails) can be quite nicely customized, but it can also grow quickly.

Actions #10

Updated by okurz over 1 year ago

  • Due date deleted (2023-08-11)
  • Status changed from Feedback to Resolved

OK, thanks for the hints. Anyway, we have not seen an alert related to tun5, so this should be fine.
