action #137600
closed
[alert] Packet loss between worker hosts and other hosts size:S
Added by jbaier_cz about 1 year ago.
Updated 9 months ago.
Description
Observation
We had multiple occurrences of the packet loss alert over the weekend:
alertname Packet loss between worker hosts and other hosts alert
grafana_folder Salt
rule_uid 2Z025iB4km
http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4
Currently, the problematic ones according to the panel are:
imagetester - walter1.qe.nue2.suse.org 100%
petrol-1 - walter1.qe.nue2.suse.org 100%
sapworker1 - walter1.qe.nue2.suse.org 100%
That is a little odd: I manually checked the first pair, and both hosts can reach each other fine:
walter1:~ # ping imagetester.qe.nue2.suse.org
PING imagetester.qe.nue2.suse.org (10.168.192.249) 56(84) bytes of data.
64 bytes from imagetester.qe.nue2.suse.org (10.168.192.249): icmp_seq=7 ttl=64 time=0.326 ms
jbaier@imagetester:~> ping walter1.qe.nue2.suse.org
PING walter1.qe.nue2.suse.org (10.168.192.1) 56(84) bytes of data.
64 bytes from walter1.qe.nue2.suse.org (10.168.192.1): icmp_seq=1 ttl=64 time=0.331 ms
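To recheck several host pairs from the panel without reading raw ping output, something like the following could help. This is only a minimal sketch: the helper names `parse_packet_loss` and `measure_packet_loss` are hypothetical, it assumes the summary line format printed by iputils `ping`, and it is not how the Grafana panel computes its values.

```python
import re
import subprocess

# Matches the summary line printed by iputils ping, which has the form
# "<N> packets transmitted, <M> received, <P>% packet loss, time <T>ms".
# Assumption: other ping implementations may word this line differently.
LOSS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received.*?([\d.]+)% packet loss"
)


def parse_packet_loss(ping_output: str) -> float:
    """Return the packet loss percentage found in ping's summary line."""
    match = LOSS_RE.search(ping_output)
    if match is None:
        # Happens e.g. when the host name does not resolve and ping
        # prints no statistics at all.
        raise ValueError("no ping summary line found")
    return float(match.group(3))


def measure_packet_loss(host: str, count: int = 5) -> float:
    """Ping `host` `count` times and return the reported loss in percent."""
    # -c: number of echo requests to send; -q: print only the summary.
    # check=False because ping exits non-zero when packets are lost.
    result = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_packet_loss(result.stdout)
```

Running `measure_packet_loss("walter1.qe.nue2.suse.org")` on each affected worker would give a number to compare directly against what the panel reports for that pair.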
Suggestions
- Confirm when this started happening or if it's no longer an issue
- There are no paused alerts
- Target version set to Ready
- Subject changed from [alert] Packet loss between worker hosts and other hosts to [alert] Packet loss between worker hosts and other hosts size:S
- Description updated (diff)
- Status changed from New to Feedback
- Assignee set to livdywan
Maybe it's already fine as-is. I'll monitor this a bit.
- Status changed from Feedback to Workable
- Assignee deleted (livdywan)
- Related to action #138044: Grouped seemingly unrelated alert emails are confusing size:M added
- Related to action #138005: grafana panel "Packet loss between worker hosts and other hosts" shows more than just ping to "other hosts" and hence becomes slow and triggers redundant alerts size:M added
Dominik and I were trying to investigate the current packet loss alert... or should I say alerts?
diesel-1 - walter1.qe.nue2.suse.org 100%
imagetester - walter1.qe.nue2.suse.org 100%
openqa - ada.qe.suse.de 100%
openqaworker1 - walter1.qe.nue2.suse.org 100%
petrol-1 - walter1.qe.nue2.suse.org 100%
sapworker2 - walter1.qe.nue2.suse.org 100%
This is what the graph showed at that specific point in time. We couldn't figure out which hosts had become problematic or what the state was before that, so without trying to be dramatic, I don't find this alert/graph very actionable in its current form.
- Related to action #138038: diesel+petrol missing network, IPMI still reachable added
- Status changed from Workable to Blocked
- Assignee set to okurz
- Priority changed from High to Low
- Target version changed from Ready to Tools - Next
Related to #138038 and #137993, as well as #138005, which should cover the rest. Waiting for those to be resolved first, then rechecking.
- Status changed from Blocked to Resolved
- Target version changed from Tools - Next to Ready