action #109253
closed
openQA Tests (public) - action #107062: Multiple failures due to network issues
Add monitoring for SUSE QA network infrastructure size:M
Added by okurz over 2 years ago.
Updated over 2 years ago.
Description
Motivation
During our investigation of #108845, it was us who pointed EngInfra to network problems that were also affecting other teams and components within the SUSE Nue server rooms, yet nobody else had noticed them.
Acceptance criteria
- AC1: Alerting is defined for common SLE OSD test requirements regarding network
Suggestions
- Add monitoring, e.g. ping checks in telegraf from each openQA worker (or monitor.qa as source) to qanet.qa, dist.suse.de, download.opensuse.org, scc.suse.com, proxy.scc.suse.de
- Optional: Ping between switches (see https://gitlab.suse.de/nicksinger/network-scripts/-/blob/main/find_mac.py for an example of how to execute commands on switches directly)
- Optional: Add more HTTP response checks
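To illustrate the first and third suggestions, a minimal telegraf sketch could look like the following. The plugin and option names come from the upstream telegraf ping and http_response input plugins; the values for count and timeout and the URL chosen for the HTTP check are assumptions, not the configuration that was eventually deployed.

```toml
# Sketch only: ICMP reachability checks from one openQA worker (or monitor.qa).
[[inputs.ping]]
  # Hosts named in the suggestion above
  urls = ["qanet.qa", "dist.suse.de", "download.opensuse.org", "scc.suse.com", "proxy.scc.suse.de"]
  # Assumed values: three echo requests per collection, 1 s timeout per ping
  count = 3
  timeout = 1.0

# Sketch only: an HTTP response check, useful for hosts that do not answer ICMP.
[[inputs.http_response]]
  urls = ["https://scc.suse.com"]  # assumed target
  response_timeout = "5s"
  method = "GET"
  follow_redirects = false
```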
- Subject changed from Add monitoring for SUSE QA network infrastructure to Add monitoring for SUSE QA network infrastructure s:M
- Description updated (diff)
- Status changed from New to Workable
- Subject changed from Add monitoring for SUSE QA network infrastructure s:M to Add monitoring for SUSE QA network infrastructure size:M
- Assignee set to jbaier_cz
- Status changed from Workable to In Progress
- Due date set to 2022-04-22
Setting due date based on mean cycle time of SUSE QE Tools
For the ping checks, I created:
SNMP monitoring can also be added, but it needs (probably manual) configuration on every switch to enable sending traps and to select the information to send (there are a lot of options; it can be done via the web UI as well). On the telegraf side, this could be handled by [[inputs.snmp_trap]].
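On the telegraf side, a trap receiver could be sketched roughly as follows. Only the plugin name is taken from the comment above; the listen address is the plugin's documented default and is an assumption for this setup. The switches would still need to be configured to send their traps to this host.

```toml
# Sketch only: listen for SNMP traps sent by the switches.
[[inputs.snmp_trap]]
  # Standard SNMP trap port; binding to a port below 1024 usually needs extra privileges
  service_address = "udp://:162"
```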
As a next step, I will try to include the ping monitoring into Grafana worker dashboard.
- Status changed from In Progress to Feedback
okurz wrote:
As some of the hosts do not seem to be fully reachable (IPv4+IPv6), we can explicitly select which protocols to check, e.g. see the arguments option in https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/ping/README.md#configuration=. One could pass -4 or -6 for individual hosts.
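A per-protocol split along these lines might look like the sketch below. Which host belongs in which group is purely illustrative and not something this ticket establishes. Note that, according to the linked README, a non-empty arguments list makes the plugin ignore its other ping options such as count or timeout, so the packet count has to be passed explicitly.

```toml
# Sketch only: force IPv4 for hosts that are assumed to be IPv4-only.
[[inputs.ping]]
  urls = ["scc.suse.com"]            # illustrative host selection
  arguments = ["-4", "-c", "3"]      # overrides count, timeout, etc.

# Sketch only: force IPv6 for hosts that should also be reachable over IPv6.
[[inputs.ping]]
  urls = ["download.opensuse.org"]   # illustrative host selection
  arguments = ["-6", "-c", "3"]
```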
It seems to me that the cause is not a protocol issue but an (intentionally?) broken network:
# curl scc.suse.com
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
</body>
</html>
# ping -c3 scc.suse.com
PING scc.suse.com (54.93.98.193) 56(84) bytes of data.
--- scc.suse.com ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2025ms
# curl proxy.scc.suse.de
<html><body>You are being <a href="http://proxy.scc.suse.de/login">redirected</a>.</body></html>
# ping -c3 proxy.scc.suse.de
PING proxy.scc.suse.de (10.160.7.1) 56(84) bytes of data.
From caasp-w7.suse.de (10.160.1.153) icmp_seq=2 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=3 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=1 Destination Host Unreachable
--- proxy.scc.suse.de ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2011ms
Anyway, the panel should be available soon. We can adjust the monitoring anytime after that.
- Due date deleted (2022-04-22)
- Status changed from Feedback to Resolved