action #114802
Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M (closed)
Description
The alerts introduced by #113746 are alerting as not all hosts mentioned in that ticket's description are actually pingable.
Acceptance criteria
- AC1: All package/packet loss alerts are unpaused again and no longer alerting, as the problematic hosts are either recovered or ignored after all.
Suggestions
- Check whether problematic hosts should be online or offline. If they should be online, try recovering them. If they should be offline, remove them from the list of checked hosts.
- At this time, there's actually only one problematic host (s390zp14.suse.de). The alert only fires multiple times because it is triggered once for each worker that cannot reach that host.
- To check the problematic hosts, just check the panel of one of the package/packet loss alerts.
Updated by mkittler about 2 years ago
- Related to action #113746: monitoring: The grafana "ping time" panel does not list all hosts size:S added
Updated by mkittler about 2 years ago
- Description updated (diff)
- Target version set to Ready
Updated by mkittler about 2 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler about 2 years ago
I've asked about the worker on #eng-testing.
Updated by openqa_review about 2 years ago
- Due date set to 2022-08-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 2 years ago
It seems some confusion was caused because we suddenly received a large number of emails mentioning the workers, but the underlying problem is that only one target is down and that this is confirmed by multiple sending hosts. We should think of a solution to prevent the flood of alert messages when just one target is down.
How about adding the panel to be run from monitor.qa.suse.de itself and only enabling alerts there, removing the alerts from the workers and keeping only their monitoring? That way, if a host is down, we receive one and only one alert, but we still have additional supporting monitoring data from the worker hosts.
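The deduplication idea above, one alert per unreachable target instead of one per (worker, target) pair, can be sketched as follows. The hostnames and data are made up for illustration; the real setup aggregates this in a single Grafana panel rather than in code:

```python
# Sketch: collapse per-(source, target) packet-loss reports into one
# alert per unreachable target. Hostnames are hypothetical.

def collapse_alerts(reports):
    """reports: iterable of (source_host, target_host, loss_percent).
    Returns one entry per problematic target, listing which sources see it."""
    by_target = {}
    for source, target, loss in reports:
        if loss > 0:  # any packet loss counts as problematic
            by_target.setdefault(target, []).append(source)
    return by_target

reports = [
    ("worker2.oqa.suse.de", "s390zp14.suse.de", 100),
    ("worker3.oqa.suse.de", "s390zp14.suse.de", 100),
    ("worker2.oqa.suse.de", "openqa.suse.de", 0),
]
print(collapse_alerts(reports))
# one entry for s390zp14.suse.de, confirmed by two sources
```

With this shape, one down target produces one alert that still names every worker confirming the problem.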
Updated by mkittler about 2 years ago
How about adding the panel to be run from monitor.qa.suse.de itself and only enabling alerts there, removing the alerts from the workers and keeping only their monitoring? That way, if a host is down, we receive one and only one alert, but we still have additional supporting monitoring data from the worker hosts.
Sounds reasonable. Should I do that as part of this ticket?
Updated by mkittler about 2 years ago
As discussed, it would make sense to simply have one graph that is not filtered by source host. It would show all "source - target" ping combinations and alert if there's a problem with any of them. If we sort the legend nicely and add a good alert description, that should be sufficient to see quickly where the problem is once the alert triggers.
Updated by mkittler about 2 years ago
- SR for the change mentioned in the previous comment: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/721
- The host s390zp14.suse.de is meanwhile pingable again. It is used by Xuguang Guo. I still have to clarify whether his usage of the host corresponds to https://openqa.suse.de/admin/workers/357 (and if not remove that worker slot).
- As shown in the screenshot on https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/721, two power workers cannot ping some s390x hosts. I'll have to look into that as it supposedly should work (even though it is not a real problem).
Updated by mkittler about 2 years ago
s390zp14.suse.de is in fact used as openqaworker5:9. So nothing to do.
Updated by mkittler about 2 years ago
The SR has been deployed and it looks good (https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4). Of course the alert is still on (see the last point of #114802#note-9). I've paused it for now.
Updated by mkittler about 2 years ago
Asked on #eng-infra about the problematic connections. In case we need to ignore those connections after all, we'd just have to merge this change: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new/diffs?merge_request%5Bsource_branch%5D=packet-loss
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
Looks like Nick has already created an infra ticket for it: https://sd.suse.com/servicedesk/customer/portal/1/SD-92689
So let's wait and see what the outcome of that will be. In the meantime I could apply the SR to ignore those connections so that the alert can be enabled for everything else.
Updated by livdywan about 2 years ago
- Subject changed from Handle "QA network infrastructure Package loss alert" introduced by #113746 to Handle "QA network infrastructure Package loss alert" introduced by #113746 size:M
Updated by mkittler about 2 years ago
It is unlikely to be a problem on the infra side. It could be a problem with our switches or a problem on the hosts. Mikelis said he'll update the infra ticket accordingly.
Updated by mkittler about 2 years ago
He hasn't updated the ticket, so here are the most important findings from his investigation:
- I had tcpdump on grenache, weird thing is: it does receive icmp request from s390zp14 and sends out a reply
- both directions icmp is received and reply sent, but seems reply is not getting back
- icmp echo is being received on both sides and icmp reply is being sent out
- tcp session capture seems broken
- I checked all the stuff on my side and there are no issues
- there are few switches in both direction that are not managed by me and I cannot see their config
- I'm fairly confident that issue is not on our infra side
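The asymmetry described above, echo requests received and replies sent on both sides, yet the replies never arrive, can be illustrated with a small sketch over hypothetical capture records (the real diagnosis was done with tcpdump on the hosts themselves):

```python
# Sketch: given simplified capture records from both ends of a link,
# flag reply packets that were sent on one side but never seen on the
# other. Hostnames and sequence numbers are made up for illustration.

def replies_lost(sent, received):
    """sent/received: sets of (src, dst, icmp_seq) tuples.
    Returns the reply packets that were sent but never received."""
    return sorted(sent - received)

# grenache sends echo replies back towards s390zp14 ...
sent_by_grenache = {("grenache", "s390zp14", 1), ("grenache", "s390zp14", 2)}
# ... but a capture on s390zp14 never sees them arrive
seen_by_s390zp14 = set()

print(replies_lost(sent_by_grenache, seen_by_s390zp14))
```

A non-empty result with replies present in the sender's capture but absent from the receiver's points at loss somewhere in between, which is consistent with the suspicion about the unmanaged switches.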
Updated by mkittler about 2 years ago
- Due date changed from 2022-08-12 to 2022-10-12
- Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/725
- However, let's wait for a bit longer.
Updated by mkittler about 2 years ago
I've been updating https://sd.suse.com/servicedesk/customer/portal/1/SD-92689 with an explanation that it can be closed. (I cannot close it myself.)
The SR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/725 has the problem that it mentions OSD-specific host names in the salt states themselves (rather than in the pillars). We could move the patterns into the pillars, but further templating the Grafana config is not so nice (as it always makes updating harder). We could also try to filter at the telegraf level, ensuring that the alert won't fire even for the data that's already in InfluxDB.
Updated by okurz about 2 years ago
I paused the alert "Packet loss between worker hosts and other hosts alert" again as it was alerting.
Updated by mkittler about 2 years ago
This is how excluding the problematic connections at the telegraf level could look: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/443
Of course salt-states-openqa/monitoring/telegraf/telegraf-worker.conf needed to be adjusted for the altered data structure, and an if matching the regex needed to be added: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/741
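In principle, the telegraf-level filtering amounts to a regex over "source/target" combinations that should never raise the alert. A minimal sketch of that idea (the pattern and hostnames here are illustrative, not the ones from the actual MRs):

```python
import re

# Sketch of excluding known-problematic source/target combinations from
# the packet-loss alert. The concrete pattern lives in the salt pillars;
# this regex and these hostnames are only an illustration.
EXCLUDE = re.compile(r"^(grenache-1|qa-power8-4-kvm)->s390zp1[0-9]$")

def should_alert(source, target, loss_percent):
    """Alert on packet loss unless the connection is explicitly ignored."""
    if EXCLUDE.match(f"{source}->{target}"):
        return False  # known-problematic connection, ignored
    return loss_percent > 0

print(should_alert("grenache-1", "s390zp14", 100))  # excluded
print(should_alert("worker2", "s390zp14", 100))     # still alerts
```

Filtering this way keeps the exclusion list in data (the pillars) while the state template only needs a single conditional around the regex match.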
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
The SRs have been merged and I've checked a few workers and the config changes look good.
I've also just added qa-power8-4-kvm back to salt. The salt package wasn't installed on the machine anymore at all, likely due to an issue from quite a while ago; the machine was probably offline or booted into a temporary snapshot when that issue was fixed.
With that, the alert is not firing anymore, so I enabled it again.
Updated by okurz about 2 years ago
- Status changed from Resolved to Feedback
mkittler wrote:
With that the alert is not firing anymore so I enabled it again.
On https://monitor.qa.suse.de/alerting/list I find "openqaworker10: package loss alert, PAUSED for 2 months", so that should be handled. I am also confused because https://monitor.qa.suse.de/d/WDopenqaworker9/worker-dashboard-openqaworker9?orgId=1&viewPanel=65113 does not show any enabled alert. What am I missing? Or am I confusing something?
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
"openqaworker10: package loss alert, PAUSED for 2 months" is just part of "tinas-dashboard". That's likely something she created for testing purposes but it has nothing to do with the alert this ticket is about.
The alert is only defined on https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&editPanel=4&tab=alert and not on the individual worker dashboards, so we don't get tons of mails for the same problem.