action #174322
closed [alert][FIRING:1] (Packet loss between worker hosts and other hosts alert Salt 2Z025iB4km)
Description
Observation
According to https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=panel-4&from=2024-12-12T05:16:13.296Z&to=2024-12-12T14:30:56.398Z, diesel.qe.nue2.suse.org and other hosts cannot reach download.opensuse.org anymore.
At least one host listed under required_external_networks in workerconf.sls in the pillars repository is not pingable from at least one openQA worker host. Check the panel associated with the alert; the legend table on the right lists the problematic hosts at the top.
Suggestions
- Check manually from OSD with
  salt \* cmd.run 'ping -c1 download.opensuse.org'
  or similar
- Look for related messages on mailing lists or in chat, ask experts, report a ticket, etc.
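A minimal sketch for summarising such a fan-out check instead of reading every minion's output by eye. It assumes the Salt command is run with --static --out=json, so that all minion returns arrive as a single JSON object mapping minion ID to output, and it assumes the iputils ping summary format ("1 packets transmitted, 0 received, 100% packet loss, time 0ms"). The file name ping_loss.py and the function names are hypothetical; the script only parses output and does not run ping itself.

```python
#!/usr/bin/env python3
"""Hypothetical helper: list Salt minions whose ping shows packet loss.

Assumed input (not part of the original ticket):
  salt --static --out=json \\* cmd.run 'ping -c1 download.opensuse.org'
Assumed ping format: iputils summary line containing "<N>% packet loss".
"""
import json
import re
import sys
from typing import Dict, Optional

# Matches the "<N>% packet loss" fragment of the iputils summary line.
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss(ping_output: str) -> Optional[float]:
    """Return the packet loss in percent, or None if no summary line exists
    (e.g. name resolution failed, or the minion did not return at all)."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def failing_minions(salt_json: str) -> Dict[str, Optional[float]]:
    """Map every minion with loss > 0 or without a ping summary to its loss."""
    results = json.loads(salt_json)
    failing = {}
    for minion, output in sorted(results.items()):
        loss = packet_loss(str(output))
        if loss is None or loss > 0:
            failing[minion] = loss
    return failing


if __name__ == "__main__" and len(sys.argv) == 2:
    # Read Salt's JSON from a file, or from stdin when the argument is "-".
    if sys.argv[1] == "-":
        data = sys.stdin.read()
    else:
        with open(sys.argv[1], encoding="utf-8") as handle:
            data = handle.read()
    for minion, loss in failing_minions(data).items():
        status = "no ping summary" if loss is None else f"{loss:g}% packet loss"
        print(f"{minion}: {status}")
```

Usage, for example: salt --static --out=json \* cmd.run 'ping -c1 download.opensuse.org' | python3 ping_loss.py -
Minions without a summary line are reported as well, because an unresolvable or unreachable target is exactly the situation the alert is about.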
Rollback actions
- Remove the silence for alertname=Packet loss between worker hosts and other hosts alert from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana
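When several silences exist, it can help to find the one to roll back programmatically. The sketch below is an assumption-laden illustration, not the team's tooling: it presumes the silence list was fetched as JSON from Grafana's Alertmanager-compatible endpoint (GET /api/alertmanager/grafana/api/v2/silences) using the standard Alertmanager v2 silence schema (id, status.state, matchers with name/value/isRegex/isEqual). The function name is hypothetical. Each returned id could then be expired in the web UI or, assuming the same API is available, with DELETE /api/alertmanager/grafana/api/v2/silence/<id>.

```python
"""Hypothetical helper: find active silences for one Grafana alert rule.

Assumes input in the Alertmanager v2 "gettableSilences" JSON format.
"""
import json
from typing import List

ALERTNAME = "Packet loss between worker hosts and other hosts alert"


def matching_silence_ids(silences_json: str, alertname: str = ALERTNAME) -> List[str]:
    """Return ids of active silences with an exact alertname=<alertname> matcher."""
    ids = []
    for silence in json.loads(silences_json):
        # Expired or pending silences do not need to be rolled back.
        if silence.get("status", {}).get("state") != "active":
            continue
        for matcher in silence.get("matchers", []):
            if (
                matcher.get("name") == "alertname"
                and matcher.get("value") == alertname
                and not matcher.get("isRegex", False)
                and matcher.get("isEqual", True)
            ):
                ids.append(silence["id"])
                break  # one matching matcher is enough for this silence
    return ids
```

Restricting the match to exact, non-regex equality matchers keeps the sketch from picking up broader silences that merely happen to cover this alert.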
Updated by okurz 2 days ago
- Status changed from New to Resolved
- Assignee set to okurz
The problem no longer occurs. There was no general problem in that time range; only hosts in QE NUE2 were affected, in fact all QE NUE2 OSD workers, which all use WireGuard: diesel, mania, petrol and sapworker1. See
https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=panel-4&from=2024-12-12T03:51:07.738Z&to=2024-12-14T00:33:47.420Z for details. We have no related monitoring data from hosts without WireGuard during that time. We are good.
Updated by okurz about 21 hours ago
- Status changed from Resolved to New
- Assignee deleted (okurz)
Updated by okurz about 21 hours ago
- Status changed from New to Resolved
- Assignee set to okurz
Silence removed after cross-checking that the current state is fine; no related alert.