action #109253

openQA Tests - action #107062: Multiple failures due to network issues

Add monitoring for SUSE QA network infrastructure size:M

Added by okurz 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

During our investigation of #108845 we found that we were the ones who pointed EngInfra to network problems that also affected other teams and components in the SUSE Nue server rooms, yet nobody else had noticed them.

Acceptance criteria

  • AC1: Alerting is defined for common SLE OSD test requirements regarding network

Suggestions

  • Add monitoring, e.g. ping checks in telegraf from each openQA worker (or monitor.qa as source) to qanet.qa, dist.suse.de, download.opensuse.org, scc.suse.com, proxy.scc.suse.de
  • Optional: Ping between switches (see https://gitlab.suse.de/nicksinger/network-scripts/-/blob/main/find_mac.py for an example of how to execute commands on switches directly)
  • Optional: Add more HTTP response checks
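
The ping-check suggestion can be illustrated with a minimal Python sketch. This is not the telegraf configuration that ended up being used, only an illustration of the idea: it assumes the system `ping` binary accepts `-c` and exits with status 0 when at least one reply arrives. The names `TARGETS`, `ping_once` and `check_targets` are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable

# Targets taken from the suggestion list in the ticket description.
TARGETS = [
    "qanet.qa",
    "dist.suse.de",
    "download.opensuse.org",
    "scc.suse.com",
    "proxy.scc.suse.de",
]


def ping_once(host: str, count: int = 3, timeout_s: int = 10) -> bool:
    """Return True if the system `ping` exits with status 0 for `host`."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing ping binary is treated as "not reachable".
        return False
    return result.returncode == 0


def check_targets(
    hosts: Iterable[str], probe: Callable[[str], bool] = ping_once
) -> Dict[str, bool]:
    """Probe every host and map host name to reachability."""
    return {host: probe(host) for host in hosts}
```

Calling `check_targets(TARGETS)` on a worker would return one boolean per host. The `probe` parameter exists so the loop can be exercised without network access.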

History

#2 Updated by mkittler 3 months ago

  • Subject changed from Add monitoring for SUSE QA network infrastructure to Add monitoring for SUSE QA network infrastructure s:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler 3 months ago

  • Subject changed from Add monitoring for SUSE QA network infrastructure s:M to Add monitoring for SUSE QA network infrastructure size:M

#4 Updated by okurz 3 months ago

Regarding "ping between the switches" maybe instead the "Remote Networking Monitoring" capability from Cisco, see https://www.cisco.com/c/dam/en/us/td/docs/switches/lan/csbms/sf30x_sg30x/administration_guide/Cisco_300Sx_v1_4_AG.pdf , using SNMP.

#5 Updated by jbaier_cz 3 months ago

  • Assignee set to jbaier_cz

#6 Updated by jbaier_cz 3 months ago

  • Status changed from Workable to In Progress

#7 Updated by openqa_review 3 months ago

  • Due date set to 2022-04-22

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by jbaier_cz 3 months ago

For the ping checks, I created:

SNMP monitoring can also be added, but it needs (probably manual) configuration on every switch to enable sending traps and to select the information to send (there are a lot of options; it can be done via the web UI as well). On the telegraf side, this could be handled by [[inputs.snmp_trap]].

As a next step, I will try to include the ping monitoring in the Grafana worker dashboard.

#9 Updated by okurz 3 months ago

As some of the hosts do not seem to be reachable fully (IPv4+IPv6) we can select explicitly which protocols are supported, e.g. see the arguments option in https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/ping/README.md#configuration=. One could pass -4 or -6 for individual hosts.
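
As a rough illustration of passing `-4` or `-6` for individual hosts, the following Python sketch builds a per-host `ping` command line. The mapping `FAMILY_BY_HOST` is purely hypothetical (the ticket does not say which host needs which protocol), and `ping_command` is an illustrative helper, not part of telegraf.

```python
from typing import Dict, List, Optional

# Hypothetical per-host overrides. Which host needs which address family
# is an assumption made for this example only.
FAMILY_BY_HOST: Dict[str, Optional[str]] = {
    "scc.suse.com": "-4",
    "download.opensuse.org": None,  # None: let ping pick the family
}


def ping_command(host: str, family: Optional[str] = None, count: int = 3) -> List[str]:
    """Build a ping argument list, optionally forcing IPv4 (-4) or IPv6 (-6)."""
    if family not in (None, "-4", "-6"):
        raise ValueError(f"unsupported address family flag: {family!r}")
    cmd = ["ping", "-c", str(count)]
    if family is not None:
        cmd.append(family)
    cmd.append(host)
    return cmd


def commands_for(hosts: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Map each host to the ping command that would be run for it."""
    return {host: ping_command(host, family) for host, family in hosts.items()}
```

The same idea applies to the telegraf ping input: hosts that need different arguments would go into separate plugin instances, each with its own `arguments` setting.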

#10 Updated by jbaier_cz 3 months ago

I created a dashboard panel for Grafana to actually show the ping monitoring: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/672

#11 Updated by jbaier_cz 3 months ago

  • Status changed from In Progress to Feedback

okurz wrote:

As some of the hosts do not seem to be reachable fully (IPv4+IPv6) we can select explicitly which protocols are supported, e.g. see the arguments option in https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/ping/README.md#configuration=. One could pass -4 or -6 for individual hosts.

It seems to me that this is not a protocol issue but rather an (intentionally?) broken network: both hosts answer HTTP requests but do not reply to ping:

#  curl scc.suse.com
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
</body>
</html>
#  ping -c3 scc.suse.com
PING scc.suse.com (54.93.98.193) 56(84) bytes of data.

--- scc.suse.com ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2025ms

#  curl  proxy.scc.suse.de
<html><body>You are being <a href="http://proxy.scc.suse.de/login">redirected</a>.</body></html>
#  ping -c3 proxy.scc.suse.de
PING proxy.scc.suse.de (10.160.7.1) 56(84) bytes of data.
From caasp-w7.suse.de (10.160.1.153) icmp_seq=2 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=3 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=1 Destination Host Unreachable

--- proxy.scc.suse.de ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2011ms

Anyway, the panel should be available soon. We can adjust the monitoring anytime after that.
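
The difference between "answers HTTP" and "answers ping" seen in the transcripts above can be made explicit with a small Python sketch. It assumes the summary line format printed by the `ping` runs shown in this comment. The names `parse_ping_summary` and `classify` as well as the category labels are hypothetical.

```python
import re
from typing import Dict, Optional

# Matches summary lines such as
# "3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2011ms"
_SUMMARY = re.compile(
    r"(?P<tx>\d+) packets transmitted, (?P<rx>\d+) received"
    r"(?:, \+(?P<err>\d+) errors)?"
    r", (?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output: str) -> Optional[Dict[str, float]]:
    """Extract the counters from ping output, or None if no summary is found."""
    match = _SUMMARY.search(output)
    if match is None:
        return None
    return {
        "transmitted": int(match["tx"]),
        "received": int(match["rx"]),
        "errors": int(match["err"] or 0),  # the errors field is optional
        "loss_percent": float(match["loss"]),
    }


def classify(http_ok: bool, summary: Optional[Dict[str, float]]) -> str:
    """Label a host by which of the two checks succeeded."""
    icmp_ok = summary is not None and summary["received"] > 0
    if http_ok and not icmp_ok:
        return "http-only"
    if icmp_ok and not http_ok:
        return "icmp-only"
    return "reachable" if http_ok else "unreachable"
```

With the two summaries shown above, both scc.suse.com and proxy.scc.suse.de would be labelled `http-only`, which is a reason to monitor such hosts with HTTP response checks rather than ping.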

#12 Updated by okurz 2 months ago

  • Due date deleted (2022-04-22)
  • Status changed from Feedback to Resolved

You also asked about that in chat. I guess what we have is OK for now.

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/675 to fix the duplicate ID in the Grafana panel, which made the web UI act weirdly. I verified that it works on multiple worker dashboards. With that merged we can resolve the ticket.
