action #132812

closed

[alert] openqaw5-xen host up alert + infrastructure ping size:M

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2023-07-16
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=4&orgId=1&from=now-6h&to=now showing 100% packet loss between qa-power8-4 and openqaw5-xen.

Acceptance criteria

  • AC1: Alert resolved
  • AC2: Alert about packet loss should only fire if we don't already have a related "host up" alert

Suggestions

  • Look into the individual alerts and fix the error source
  • Crosscheck the definitions of the "host up" and "packet loss" alerts: do we have a redundant alerting overlap? IIRC (okurz) packet loss was intended to fire only when we have significant packet loss, not when hosts are down completely
  • Ensure all rollback steps are conducted

Rollback steps

  • Remove related silences

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #132500: NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M - Resolved - nicksinger - 2023-07-27

Actions #1

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Priority changed from Normal to High
Actions #2

Updated by okurz over 1 year ago

  • Related to action #132500: NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M added
Actions #3

Updated by livdywan over 1 year ago

  • Subject changed from [alert] openqaw5-xen host up alert + infrastructure ping to [alert] openqaw5-xen host up alert + infrastructure ping size:M
  • Status changed from New to Workable
Actions #4

Updated by livdywan over 1 year ago

These are the hosts reporting 100% packet loss as of right now:

  • QA-Power8-4-kvm - openqaw5-xen.qa.suse.de
  • QA-Power8-5-kvm - openqaw5-xen.qa.suse.de
  • grenache-1 - openqaw9-hyperv.qa.suse.de
  • openqa - worker-arm1.oqa.prg2.suse.org
  • openqa - worker-arm2.oqa.prg2.suse.org
  • openqa - worker29.oqa.prg2.suse.org
  • openqa - worker30.oqa.prg2.suse.org
  • openqa - worker31.oqa.prg2.suse.org
  • openqa - worker32.oqa.prg2.suse.org
  • openqa - worker33.oqa.prg2.suse.org
  • openqa - worker34.oqa.prg2.suse.org
  • openqa - worker35.oqa.prg2.suse.org
  • openqa - worker36.oqa.prg2.suse.org
  • openqa - worker37.oqa.prg2.suse.org
  • openqa - worker38.oqa.prg2.suse.org
  • openqa - worker39.oqa.prg2.suse.org
  • openqa - worker40.oqa.prg2.suse.org
  • powerqaworker-qam-1 - openqaw9-hyperv.qa.suse.de
Actions #5

Updated by nicksinger about 1 year ago

  • Assignee set to nicksinger
Actions #6

Updated by nicksinger about 1 year ago

  • Status changed from Workable to Feedback

I think this issue is already resolved by the various networking fixes done in the past weeks. I don't see any problematic packet loss and also didn't find related silences I'd need to remove (also no expired ones).
I was also thinking about the relation between the "host up" alerts and the "packet loss" alerts. One idea would be to not fire the "packet loss" alert if the loss is 100%, to avoid overlapping with the "host up" alert. But if we do this we lose our monitoring for cases like network A not being properly routed/reachable from network B. So the only real option would be to not fire "packet loss" if a "host up" alert is already present for the same host (as a condition for the first one), which is IIUC not a feature supported by grafana.
We could work around it by scraping active alerts via telegraf, feeding that data back into grafana and using it as a condition for the "packet loss" alert, but this would be a lot of work which is currently not reasonable.
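
To illustrate the condition described above (AC2), here is a minimal sketch of the suppression logic only. It is not how grafana or telegraf evaluate alerts; the `Alert` type, the `kind` labels, the function name and the 50% threshold are all hypothetical and only serve to show the rule "skip packet loss for a target that already has a host up alert".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical, simplified view of one active alert."""
    kind: str                   # "host_up" or "packet_loss" (assumed labels)
    target: str                 # host the alert is about
    source: str = ""            # host sending the pings (packet_loss only)
    loss_percent: float = 0.0   # measured loss (packet_loss only)


def packet_loss_alerts_to_fire(alerts, threshold=50.0):
    """Return the packet loss alerts that should still fire.

    A packet loss alert is dropped when its target already has an
    active "host up" alert, or when the loss is below the (assumed)
    threshold. A 100% loss towards a host without a "host up" alert
    is kept, so routing problems between networks stay visible.
    """
    down_hosts = {a.target for a in alerts if a.kind == "host_up"}
    return [
        a for a in alerts
        if a.kind == "packet_loss"
        and a.loss_percent >= threshold
        and a.target not in down_hosts
    ]
```

The point of filtering on the target rather than on the loss value is the one made above: a plain "ignore 100% loss" rule would also hide the case where a host is up but unreachable from another network.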

Actions #7

Updated by nicksinger about 1 year ago

  • Status changed from Feedback to Resolved

Added that idea as a "various feature request" in https://progress.opensuse.org/issues/65271#note-118
