action #109253
openQA Tests - action #107062: Multiple failures due to network issues (closed)
Add monitoring for SUSE QA network infrastructure size:M
Description
Motivation
During our investigation of #108845 we found that we were the ones who pointed EngInfra to network problems that were also affecting other teams and components within the SUSE Nue server rooms, yet nobody else had noticed.
Acceptance criteria
- AC1: Alerting is defined for common SLE OSD test requirements regarding network
Suggestions
- Add monitoring, e.g. ping checks in telegraf from each openQA worker (or monitor.qa as source) to qanet.qa, dist.suse.de, download.opensuse.org, scc.suse.com, proxy.scc.suse.de
- Optional: Ping between switches (check out https://gitlab.suse.de/nicksinger/network-scripts/-/blob/main/find_mac.py for an example of how to execute commands on switches directly)
- Optional: Add more HTTP response checks
Updated by mkittler over 2 years ago
- Subject changed from Add monitoring for SUSE QA network infrastructure to Add monitoring for SUSE QA network infrastructure s:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
- Subject changed from Add monitoring for SUSE QA network infrastructure s:M to Add monitoring for SUSE QA network infrastructure size:M
Updated by okurz over 2 years ago
Regarding "ping between the switches": maybe we could instead use the "Remote Networking Monitoring" capability from Cisco, which works over SNMP, see https://www.cisco.com/c/dam/en/us/td/docs/switches/lan/csbms/sf30x_sg30x/administration_guide/Cisco_300Sx_v1_4_AG.pdf
Updated by jbaier_cz over 2 years ago
- Status changed from Workable to In Progress
Updated by openqa_review over 2 years ago
- Due date set to 2022-04-22
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz over 2 years ago
For the ping checks, I created:
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/403
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/667
SNMP monitoring can also be added, but it needs (probably manual) configuration on every switch to enable sending traps and to select the desired information to send (there are a lot of options; it can be done via the web UI as well). On the telegraf side, this could be handled by [[inputs.snmp_trap]].
As a next step, I will try to include the ping monitoring in the Grafana worker dashboard.
Updated by okurz over 2 years ago
As some of the hosts do not seem to be reachable fully (IPv4+IPv6) we can select explicitly which protocols are supported, e.g. see the arguments option in https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/ping/README.md#configuration=. One could pass -4 or -6 for individual hosts.
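To illustrate the per-host protocol selection, here is a minimal Python sketch that renders one `[[inputs.ping]]` stanza per host. The helper name `render_ping_inputs` and the choice of which host gets which flag are hypothetical and not taken from the merge requests. It assumes, as the linked README describes, that setting `arguments` makes telegraf call the system ping binary and ignore the other options, so the packet count is passed explicitly via `-c`.

```python
def render_ping_inputs(hosts, count=3):
    """Render one telegraf [[inputs.ping]] stanza per host.

    `hosts` maps a host name to "4", "6" or None (no protocol forced).
    The mapping itself is an assumption made for illustration.
    """
    stanzas = []
    for host, family in hosts.items():
        if family not in (None, "4", "6"):
            raise ValueError(f"unsupported address family for {host}: {family!r}")
        lines = ["[[inputs.ping]]", f'  urls = ["{host}"]']
        if family is None:
            # No protocol forced: rely on the plugin's own `count` option.
            lines.append(f"  count = {count}")
        else:
            # With `arguments` set, the other options are ignored,
            # so the count has to be handed to ping itself via -c.
            args = ", ".join(f'"{a}"' for a in ("-c", str(count), f"-{family}"))
            lines.append(f"  arguments = [{args}]")
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    # Hypothetical assignment of protocols to hosts from the ticket description.
    print(render_ping_inputs({"dist.suse.de": None, "scc.suse.com": "4"}))
```

Keeping one stanza per host means a flag such as `-4` only affects the host that needs it, at the cost of a longer generated configuration.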
Updated by jbaier_cz over 2 years ago
I created a dashboard panel for Grafana to actually show the ping monitoring: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/672
Updated by jbaier_cz over 2 years ago
- Status changed from In Progress to Feedback
okurz wrote:
As some of the hosts do not seem to be reachable fully (IPv4+IPv6) we can select explicitly which protocols are supported, e.g. see the arguments option in https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/ping/README.md#configuration=. One could pass -4 or -6 for individual hosts.
It seems to me that this is not caused by a protocol issue but by an (intentionally?) broken network: both hosts answer HTTP requests but not ICMP echo requests.
# curl scc.suse.com
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
</body>
</html>
# ping -c3 scc.suse.com
PING scc.suse.com (54.93.98.193) 56(84) bytes of data.
--- scc.suse.com ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2025ms
# curl proxy.scc.suse.de
<html><body>You are being <a href="http://proxy.scc.suse.de/login">redirected</a>.</body></html>
# ping -c3 proxy.scc.suse.de
PING proxy.scc.suse.de (10.160.7.1) 56(84) bytes of data.
From caasp-w7.suse.de (10.160.1.153) icmp_seq=2 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=3 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=1 Destination Host Unreachable
--- proxy.scc.suse.de ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2011ms
Anyway, the panel should be available soon. We can adjust the monitoring anytime after that.
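For reference, the packet loss shown in the ping output above can be extracted from the summary line with a few lines of Python. This is only a sketch: `parse_ping_summary` is a hypothetical helper, and the regular expression assumes the iputils summary format seen in this comment; other ping implementations print different summaries.

```python
import re

# Matches the iputils summary line, with or without the "+N errors" part.
SUMMARY = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received"
    r"(?:, \+(?P<errors>\d+) errors)?, (?P<loss>[\d.]+)% packet loss"
)


def parse_ping_summary(output):
    """Return sent/received/errors/loss_percent from ping output text."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    return {
        "sent": int(match["sent"]),
        "received": int(match["received"]),
        "errors": int(match["errors"] or 0),  # group is absent when no errors occurred
        "loss_percent": float(match["loss"]),
    }


if __name__ == "__main__":
    line = "3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2011ms"
    print(parse_ping_summary(line))
```

Applied to the two summaries above, both report 100% loss; only the proxy.scc.suse.de run additionally carries three errors.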
Updated by okurz over 2 years ago
- Due date deleted (2022-04-22)
- Status changed from Feedback to Resolved
You also asked in chat about that. I guess it's ok what we have for now.
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/675 to fix the duplicate ID in the Grafana panel, which made the web UI act weirdly. I verified that it works on multiple worker dashboards. With that merged we can resolve the ticket.