action #133385
closed
Problem: Interface tun5: Link down alerting and autoresolving shortly size:S
Added by livdywan over 1 year ago.
Updated over 1 year ago.
Description
Observation
This is about o3, so zabbix, not grafana!
From zabbix@suse.de
-Problem started at 11:05:17 on 2023.07.26
Problem name: Interface tun5: Link down
Host: ariel.suse-dmz.opensuse.org
Severity: Average
Operational data: Current state: down (2)
Original problem ID: 510998085
followed by
Problem has been resolved at 11:12:17 on 2023.07.26
Problem name: Interface tun5: Link down
Problem duration: 7m 0s
Host: ariel.suse-dmz.opensuse.org
Severity: Average
Original problem ID: 510998085
likely during the time when o3 was rebooting as planned
Acceptance criteria
- AC1: No more alerts for tun5 are observed if o3 or the tunnel is just down for some minutes
Steps to reproduce
- Temporarily shut down the autossh-old-ariel.service on new-ariel or try to trigger with reboots of o3
Suggestions
- Login on https://zabbix.nue.suse.com/ and play around to find your way around. If in doubt ask jbaier_cz
- Just bump the sensitivity of the alert or delay the actual notification
- Try the effect with multiple reboots of o3
The underlying problem was actually caused by manual actions. We just need to adjust the sensitivity of this alert so that it does not fire immediately when somebody reboots the system.
In zabbix, it should be possible to use the "escalation mechanism" to make e-mail alert notification a second step after some time frame when there is no change for the firing trigger. That should cover cases like this, when the trigger fires only for a short time period.
- Tags set to infra, o3, zabbix, false positive
- Target version set to Ready
- Subject changed from Problem: Interface tun5: Link down alerting and autoresolving shortly to Problem: Interface tun5: Link down alerting and autoresolving shortly size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
- Due date set to 2023-08-11
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
A new dummy operation step has been added to our notification action (Configuration -> Actions -> Trigger actions). The notification operation (which sends mail) is now the second step, and the first (dummy) step has a duration of 15 minutes. This should notify us only about problems that do not resolve themselves within a 15-minute window. Let's wait and see if that is what we want.
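To illustrate the intended effect of the delayed second step, here is a minimal Python sketch. It is not Zabbix code or the Zabbix API; the function name `mail_is_sent` and the boundary handling (a problem resolved exactly at the 15-minute mark counts as resolved) are assumptions made for illustration. The timestamps are taken from the alert mails quoted in the description.

```python
from datetime import datetime, timedelta
from typing import Optional

# Duration of the first (dummy) escalation step, as configured in the action.
STEP_DURATION = timedelta(minutes=15)


def mail_is_sent(started: datetime,
                 resolved: Optional[datetime],
                 step_duration: timedelta = STEP_DURATION) -> bool:
    """Return True if the problem is still open when the second step begins.

    Hypothetical helper: it only models "notify in step 2 if the problem
    has not resolved during step 1".
    """
    second_step_at = started + step_duration
    # Assumption: a problem resolved exactly at the boundary is not mailed.
    return resolved is None or resolved > second_step_at


# The tun5 event from the description: down for 7 minutes.
started = datetime(2023, 7, 26, 11, 5, 17)
resolved = datetime(2023, 7, 26, 11, 12, 17)
print(mail_is_sent(started, resolved))  # False: resolved within step 1
print(mail_is_sent(started, None))      # True: still open after 15 minutes
```

With this model, the 7-minute outage during the planned reboot would not have produced a mail, while a tunnel that stays down longer than 15 minutes still would.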
Hm, an interesting approach. But isn't it possible to define a "pending" duration or similar per alert?
okurz wrote:
Hm, an interesting approach. But isn't it possible to define a "pending" duration or similar per alert?
Yes, there are other options (this one is the simplest and most transparent). You can:
1) edit the trigger so that Zabbix does not raise the problem immediately (as the trigger comes from a template, this means editing the template and/or disabling the template and creating our own copy of the trigger; in this case it is even a discovered trigger, which adds one more layer of dependency)
2) edit the action and create a custom action for each trigger / trigger group (tag) / trigger severity, with a different step duration in each action
The pairing between triggers (source of the problems) and actions (source of the e-mails) can be quite nicely customized, but it can also grow quickly.
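For comparison, option 1 (a "pending" duration on the trigger itself) can be sketched as follows. This is a plain Python model, not a Zabbix trigger expression: the function name, the one-minute sampling interval, and the value `UP = 1` are assumptions; only `down (2)` comes from the alert mail. The idea is that the problem is raised only if every sample in the pending window reports the link as down.

```python
from datetime import datetime, timedelta

DOWN = 2  # matches "Current state: down (2)" in the alert mail
UP = 1    # assumed value for a healthy link


def link_down_problem(samples, now, pending=timedelta(minutes=15)):
    """Raise a problem only if all samples within the pending window are DOWN.

    samples: list of (timestamp, state) tuples (hypothetical data layout).
    """
    window = [state for ts, state in samples if now - pending <= ts <= now]
    return bool(window) and all(state == DOWN for state in window)


def minute_samples(start, states):
    """Build one sample per minute starting at `start` (assumed interval)."""
    return [(start + timedelta(minutes=i), s) for i, s in enumerate(states)]


start = datetime(2023, 7, 26, 10, 55)
# Link up for 10 minutes, then down for 7 minutes (a short reboot).
short_outage = minute_samples(start, [UP] * 10 + [DOWN] * 7)
# Link up for 2 minutes, then down for 20 minutes (a real outage).
long_outage = minute_samples(start, [UP] * 2 + [DOWN] * 20)

print(link_down_problem(short_outage, now=short_outage[-1][0]))  # False
print(link_down_problem(long_outage, now=long_outage[-1][0]))    # True
```

The difference from the escalation approach is where the delay lives: here the problem event itself is suppressed, whereas with the delayed action step the problem is still recorded in Zabbix and only the mail is held back.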
- Due date deleted (2023-08-11)
- Status changed from Feedback to Resolved
Ok, thanks for the hints. Anyway, we have not seen an alert related to tun5 since, so this should be fine.