action #109253: Add monitoring for SUSE QA network infrastructure size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #109253

closed

openQA Tests (public) - action #107062: Multiple failures due to network issues

Add monitoring for SUSE QA network infrastructure size:M

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: jbaier_cz
Category:
Target version: openQA Project (public) - Ready
Start date:
Due date:
% Done:
Estimated time:

Description

Motivation

During our investigation of #108845 we found that we were the ones who pointed EngInfra to network problems that were also affecting other teams and components within the SUSE Nue server rooms, yet nobody else had noticed them.

Acceptance criteria

AC1: Alerting is defined for common SLE OSD test requirements regarding network

Suggestions

Add monitoring, e.g. ping checks in telegraf from each openQA worker (or monitor.qa as source) to qanet.qa, dist.suse.de, download.opensuse.org, scc.suse.com, proxy.scc.suse.de
Optional: Ping between switches (see https://gitlab.suse.de/nicksinger/network-scripts/-/blob/main/find_mac.py for an example of how to execute commands on switches directly)
Optional: Add more HTTP response checks
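
A minimal sketch of the ping check suggested above, written as a telegraf `[[inputs.ping]]` section. The host list is taken from the suggestion; the `count` and `timeout` values are illustrative assumptions and not necessarily what was deployed.

```toml
# Sketch only: ping the hosts named in the suggestion from each worker.
[[inputs.ping]]
  ## Hosts to ping, taken from the list above
  urls = [
    "qanet.qa",
    "dist.suse.de",
    "download.opensuse.org",
    "scc.suse.com",
    "proxy.scc.suse.de",
  ]
  ## Number of echo requests sent per host and interval (assumed value)
  count = 1
  ## Seconds to wait for each reply (assumed value)
  timeout = 1.0
```

The plugin reports fields such as `percent_packet_loss` and `average_response_ms` per target, tagged with `url`, which is what an alert on network reachability would typically be built on.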

History

Updated by mkittler about 3 years ago

Subject changed from Add monitoring for SUSE QA network infrastructure to Add monitoring for SUSE QA network infrastructure s:M
Description updated (diff)
Status changed from New to Workable

Updated by mkittler about 3 years ago

Subject changed from Add monitoring for SUSE QA network infrastructure s:M to Add monitoring for SUSE QA network infrastructure size:M

Updated by okurz about 3 years ago

Regarding "ping between the switches": maybe we could instead use the "Remote Networking Monitoring" capability from Cisco, which works over SNMP, see https://www.cisco.com/c/dam/en/us/td/docs/switches/lan/csbms/sf30x_sg30x/administration_guide/Cisco_300Sx_v1_4_AG.pdf

Updated by jbaier_cz about 3 years ago

Assignee set to jbaier_cz

Updated by jbaier_cz about 3 years ago

Status changed from Workable to In Progress

Updated by openqa_review about 3 years ago

Due date set to 2022-04-22

Setting due date based on mean cycle time of SUSE QE Tools

Updated by jbaier_cz about 3 years ago

For the ping checks, I created:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/403
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/667

SNMP monitoring can also be added, but it needs (probably manual) configuration on every switch to enable sending traps and to select the desired information to send (there are a lot of options; it can be done via the web UI as well). On the telegraf side, this could be handled by [[inputs.snmp_trap]].

As a next step, I will try to include the ping monitoring in the Grafana worker dashboard.
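
For the SNMP trap idea mentioned above, a minimal telegraf-side sketch could look like the following. It assumes the switches have already been configured to send traps to the host running telegraf; the listen address is the plugin's usual default and is shown only for illustration.

```toml
# Sketch only: receive SNMP traps sent by the switches.
[[inputs.snmp_trap]]
  ## Address and port to listen on for traps; 162/udp is the standard
  ## trap port and usually needs elevated privileges to bind.
  service_address = "udp://:162"
```

Which traps actually arrive depends entirely on the per-switch configuration, so this section alone does not produce any data.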

Updated by okurz about 3 years ago

Since some of the hosts do not seem to be fully reachable (IPv4+IPv6), we can explicitly select which protocols are supported, e.g. see the arguments option in https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/ping/README.md#configuration=. One could pass -4 or -6 for individual hosts.
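
A sketch of how the `arguments` option could be used for a single host, assuming the system `ping` binary accepts `-4`. The host is picked only as an example, and a separate `[[inputs.ping]]` section per protocol is one possible layout rather than the configuration that was applied.

```toml
# Sketch only: force IPv4 for a host that is not reachable over IPv6.
[[inputs.ping]]
  urls = ["scc.suse.com"]  # example host, chosen for illustration
  ## When "arguments" is set, the plugin's other ping options are ignored,
  ## so the packet count is passed here as well.
  arguments = ["-4", "-c", "3"]
```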

#10

Updated by jbaier_cz about 3 years ago

I created a panel for Grafana to actually show the ping monitoring: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/672

#11

Updated by jbaier_cz about 3 years ago

Status changed from In Progress to Feedback

okurz wrote:

Since some of the hosts do not seem to be fully reachable (IPv4+IPv6), we can explicitly select which protocols are supported, e.g. see the arguments option in https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/ping/README.md#configuration=. One could pass -4 or -6 for individual hosts.

It seems to me that this is not caused by a protocol issue but by an (intentionally?) broken network:

#  curl scc.suse.com
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
</body>
</html>
#  ping -c3 scc.suse.com
PING scc.suse.com (54.93.98.193) 56(84) bytes of data.

--- scc.suse.com ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2025ms

#  curl  proxy.scc.suse.de
<html><body>You are being <a href="http://proxy.scc.suse.de/login">redirected</a>.</body></html>
#  ping -c3 proxy.scc.suse.de
PING proxy.scc.suse.de (10.160.7.1) 56(84) bytes of data.
From caasp-w7.suse.de (10.160.1.153) icmp_seq=2 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=3 Redirect Host(New nexthop: 1.7.160.10 (1.7.160.10))
From caasp-w7.suse.de (10.160.1.153) icmp_seq=1 Destination Host Unreachable

--- proxy.scc.suse.de ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2011ms

Anyway, the panel should be available soon. We can adjust the monitoring anytime after that.

#12

Updated by okurz about 3 years ago

Due date deleted (~~2022-04-22~~)
Status changed from Feedback to Resolved

You also asked in chat about that. I guess it's ok what we have for now.

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/675 to fix the duplicate ID in the grafana panel which made the webUI act weirdly. I verified it working on multiple worker dashboards. With that merged we can resolve the ticket.
