action #174322

closed

[alert][FIRING:1] (Packet loss between worker hosts and other hosts alert Salt 2Z025iB4km)

Added by okurz 7 days ago. Updated about 21 hours ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

According to https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=panel-4&from=2024-12-12T05:16:13.296Z&to=2024-12-12T14:30:56.398Z
diesel.qe.nue2.suse.org and other hosts can no longer reach download.opensuse.org.

At least one host listed under required_external_networks in workerconf.sls in the pillars repository is not pingable from at least one openQA worker host. Check the panel associated with the alert. The legend table on the right shows the problematic hosts on top.

Suggestions

  • Check manually from OSD with salt \* cmd.run 'ping -c1 download.opensuse.org' or similar
  • Look for related messages in mailing list posts or chat, ask experts, report a ticket, etc.
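
The manual check suggested above produces one ping summary line per host, and a small filter makes it easier to spot the hosts with loss. The following is only a sketch: the helper names packet_loss and all_reachable are hypothetical, and it assumes the iputils-style summary line ("... 100% packet loss ..."), which other ping implementations may format differently.

```shell
# Print the packet-loss percentage for every ping summary line on stdin.
# Assumption: iputils-style summary, e.g.
#   "1 packets transmitted, 0 received, 100% packet loss, time 0ms"
packet_loss() {
    sed -n 's/.*[ ,]\([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical helper: succeed only if every summary line reports 0% loss.
# Returns 2 if no summary line was found at all, 1 if any loss was seen.
all_reachable() {
    losses=$(packet_loss)
    [ -n "$losses" ] || return 2
    for loss in $losses; do
        [ "$loss" = "0" ] || return 1
    done
    return 0
}
```

With these functions loaded, the output of salt \* cmd.run 'ping -c1 download.opensuse.org' could be piped into all_reachable to get a single exit status for all workers.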

Rollback actions

Actions #1

Updated by okurz 7 days ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Added a silence and noted the corresponding rollback action.

Actions #2

Updated by okurz 6 days ago

  • Parent task set to #166598

Actions #3

Updated by okurz 2 days ago

  • Status changed from New to Resolved
  • Assignee set to okurz

The problem no longer occurs. There was no general problem in the time range; only hosts in QE NUE2 were affected, namely all QE NUE2 OSD workers, which all use wireguard: diesel, mania, petrol and sapworker1. See
https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=panel-4&from=2024-12-12T03:51:07.738Z&to=2024-12-14T00:33:47.420Z for details. We don't have related monitoring data from hosts without wireguard during that time. We are good.

Actions #4

Updated by okurz about 21 hours ago

  • Status changed from Resolved to New
  • Assignee deleted (okurz)

Actions #5

Updated by okurz about 21 hours ago

  • Status changed from New to Resolved
  • Assignee set to okurz

Silence removed after cross-checking that the current state is fine and no related alert is firing.
