action #132812
closed
[alert] openqaw5-xen host up alert + infrastructure ping size:M
Added by okurz over 1 year ago.
Updated about 1 year ago.
Description
Observation
The monitoring dashboard at https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=4&orgId=1&from=now-6h&to=now shows 100% packet loss between qa-power8-4 and openqaw5-xen.
Acceptance criteria
- AC1: Alert resolved
- AC2: Alert about packet loss should only fire if we don't already have a related "host up" alert
Suggestions
- Look into the individual alerts and fix the error source
- Cross-check the definitions of the "host up" and "packet loss" alerts: do we have a redundant alerting overlap? IIRC (okurz), packet loss was intended to fire only when there is significant packet loss, not when hosts are down completely
- Ensure all rollback steps are conducted
Rollback steps
- Description updated (diff)
- Priority changed from Normal to High
- Related to action #132500: NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M added
- Subject changed from [alert] openqaw5-xen host up alert + infrastructure ping to [alert] openqaw5-xen host up alert + infrastructure ping size:M
- Status changed from New to Workable
These are the hosts reporting 100% packet loss as of right now:
- QA-Power8-4-kvm - openqaw5-xen.qa.suse.de
- QA-Power8-5-kvm - openqaw5-xen.qa.suse.de
- grenache-1 - openqaw9-hyperv.qa.suse.de
- openqa - worker-arm1.oqa.prg2.suse.org
- openqa - worker-arm2.oqa.prg2.suse.org
- openqa - worker29.oqa.prg2.suse.org
- openqa - worker30.oqa.prg2.suse.org
- openqa - worker31.oqa.prg2.suse.org
- openqa - worker32.oqa.prg2.suse.org
- openqa - worker33.oqa.prg2.suse.org
- openqa - worker34.oqa.prg2.suse.org
- openqa - worker35.oqa.prg2.suse.org
- openqa - worker36.oqa.prg2.suse.org
- openqa - worker37.oqa.prg2.suse.org
- openqa - worker38.oqa.prg2.suse.org
- openqa - worker39.oqa.prg2.suse.org
- openqa - worker40.oqa.prg2.suse.org
- powerqaworker-qam-1 - openqaw9-hyperv.qa.suse.de
- Assignee set to nicksinger
- Status changed from Workable to Feedback
I think this issue is already resolved by the various networking fixes done over the past weeks. I don't see any problematic packet loss and also didn't find related silences I'd need to remove (also no expired ones).
I was also thinking about the relation between "host up" alerts and "packet loss" alerts. One idea would be to not fire the "packet loss" alert if the loss is 100%, to avoid overlapping with the "host up" alert. But if we do this, we lose our monitoring for cases such as network A not being properly routed/reachable from network B. So the only real option would be to not fire "packet loss" if a "host up" alert is already present for the same host (as a condition for the former), which is IIUC not a feature supported by Grafana.
We could work around it by scraping active alerts via telegraf, feeding that data back into Grafana and using it as a condition for the "packet loss" alert, but this would be a lot of work which is currently not reasonable.
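To make the intended condition concrete, here is a minimal sketch of the suppression rule in plain Python. It is not Grafana or telegraf configuration. The names `PingResult`, `packet_loss_alerts` and `hosts_down`, the 50% threshold and the sample loss values are all assumptions made for illustration; only the host names are taken from this ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PingResult:
    source: str          # host running the ping check
    target: str          # host being pinged
    loss_percent: float  # measured packet loss, 0..100


def packet_loss_alerts(results, hosts_down, threshold=50.0):
    """Return the ping results that should raise a "packet loss" alert.

    A result is suppressed when its target already has an active
    "host up" alert (hypothetical input: the set `hosts_down`).
    The threshold is an assumed value, not the one used in production.
    """
    return [
        r for r in results
        if r.loss_percent >= threshold and r.target not in hosts_down
    ]


# Illustrative data: loss values are made up, host names come from the ticket.
results = [
    PingResult("qa-power8-4", "openqaw5-xen.qa.suse.de", 100.0),
    PingResult("openqa", "worker29.oqa.prg2.suse.org", 100.0),
    PingResult("openqa", "worker30.oqa.prg2.suse.org", 2.0),
]
hosts_down = {"openqaw5-xen.qa.suse.de"}  # assumed active "host up" alerts

firing = packet_loss_alerts(results, hosts_down)
for r in firing:
    print(f"packet loss alert: {r.source} -> {r.target} ({r.loss_percent}%)")
```

With this rule, a target at 100% loss still raises "packet loss" as long as no "host up" alert exists for it, which keeps the routing case (network A unreachable from network B) covered.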
- Status changed from Feedback to Resolved