action #137600

closed

[alert] Packet loss between worker hosts and other hosts size:S

Added by jbaier_cz 7 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Target version:
Start date: 2023-10-09
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

We had multiple occurrences of the packet loss alert over the weekend:

alertname         Packet loss between worker hosts and other hosts alert
grafana_folder         Salt
rule_uid         2Z025iB4km

http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4

Currently, the problematic ones according to the panel are:

imagetester - walter1.qe.nue2.suse.org  100%
petrol-1 - walter1.qe.nue2.suse.org     100%
sapworker1 - walter1.qe.nue2.suse.org   100%

That is a little bit weird, as I manually checked the first pair and the two hosts can reach each other fine:

walter1:~ #   ping imagetester.qe.nue2.suse.org
PING imagetester.qe.nue2.suse.org (10.168.192.249) 56(84) bytes of data.
64 bytes from imagetester.qe.nue2.suse.org (10.168.192.249): icmp_seq=7 ttl=64 time=0.326 ms

jbaier@imagetester:~>  ping walter1.qe.nue2.suse.org
PING walter1.qe.nue2.suse.org (10.168.192.1) 56(84) bytes of data.
64 bytes from walter1.qe.nue2.suse.org (10.168.192.1): icmp_seq=1 ttl=64 time=0.331 ms
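The manual check above can be repeated as a measured loss percentage, for comparison with what the panel reports. The following is a minimal sketch, not the check the monitoring actually runs: it assumes an iputils-style `ping` whose summary line contains `N% packet loss`, and the function names `parse_packet_loss` and `measure_packet_loss` are hypothetical helpers introduced here for illustration.

```python
import re
import subprocess

# Assumption: iputils-style summary line, e.g.
# "5 packets transmitted, 5 received, 0% packet loss, time 4005ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output: str) -> float:
    """Return the packet loss percentage found in ping's summary output."""
    match = LOSS_RE.search(ping_output)
    if match is None:
        raise ValueError("no packet loss summary found in ping output")
    return float(match.group(1))


def measure_packet_loss(host: str, count: int = 5) -> float:
    """Run `ping -c COUNT HOST` and return the reported loss percentage."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero on loss; we still want the summary
    )
    return parse_packet_loss(result.stdout)


if __name__ == "__main__":
    # Host names taken from the ticket; run this from the host under test.
    for host in ("walter1.qe.nue2.suse.org", "imagetester.qe.nue2.suse.org"):
        try:
            print(f"{host}: {measure_packet_loss(host)}% packet loss")
        except ValueError as err:
            # For example, the name did not resolve, so ping printed no summary.
            print(f"{host}: {err}")
```

Running it on both sides of a pair that the panel shows at 100% would help tell real loss apart from a stale or misattributed measurement.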

Suggestions

  • Confirm when this started happening or if it's no longer an issue
  • There are no paused alerts

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #138044: Grouped seemingly unrelated alert emails are confusing size:M (Rejected, okurz, 2023-10-09)

Related to openQA Infrastructure - action #138005: grafana panel "Packet loss between worker hosts and other hosts" shows more than just ping to "other hosts" and hence becomes slow and triggers redundant alerts size:M (Resolved, nicksinger, 2023-10-14)

Related to openQA Infrastructure - action #138038: diesel+petrol missing network, IPMI still reachable (Resolved, okurz, 2023-10-16)

Actions #1

Updated by jbaier_cz 7 months ago

  • Target version set to Ready
Actions #2

Updated by livdywan 7 months ago

  • Subject changed from [alert] Packet loss between worker hosts and other hosts to [alert] Packet loss between worker hosts and other hosts size:S
  • Description updated (diff)
  • Status changed from New to Feedback
  • Assignee set to livdywan

Maybe it's already fine as-is. I'll monitor this a bit.

Actions #3

Updated by okurz 7 months ago

  • Status changed from Feedback to Workable
Actions #4

Updated by livdywan 7 months ago

  • Assignee deleted (livdywan)

okurz wrote in #note-3:

nope, not fine, see https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1697037757451&to=1697053106647 diesel<->walter

Thank you for taking over, that was the point of my taking the ticket ;-)

Actions #5

Updated by livdywan 6 months ago

  • Related to action #138044: Grouped seemingly unrelated alert emails are confusing size:M added
Actions #6

Updated by livdywan 6 months ago

  • Related to action #138005: grafana panel "Packet loss between worker hosts and other hosts" shows more than just ping to "other hosts" and hence becomes slow and triggers redundant alerts size:M added
Actions #7

Updated by livdywan 6 months ago

Dominik and I were trying to investigate the current packet loss alert... or should I say alerts?

diesel-1 - walter1.qe.nue2.suse.org        100%
imagetester - walter1.qe.nue2.suse.org     100%
openqa - ada.qe.suse.de                    100%
openqaworker1 - walter1.qe.nue2.suse.org   100%
petrol-1 - walter1.qe.nue2.suse.org        100%
sapworker2 - walter1.qe.nue2.suse.org      100%

This is what the graph was showing at that specific point in time. We couldn't figure out which hosts had become problematic or what the state was before that, so, without trying to be dramatic, I'm not finding this alert/graph very actionable in its current form.

Actions #8

Updated by okurz 6 months ago

  • Related to action #138038: diesel+petrol missing network, IPMI still reachable added
Actions #10

Updated by okurz 6 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz
  • Priority changed from High to Low
  • Target version changed from Ready to Tools - Next

Related to #138038 and #137993 as well as #138005, which should cover the rest. Waiting for those to be resolved first, then rechecking.

Actions #11

Updated by okurz about 1 month ago

  • Status changed from Blocked to Resolved
  • Target version changed from Tools - Next to Ready

All three referenced tickets are resolved, https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4 looks green, and no related alert silences are left.
