action #109253

closed

openQA Tests (public) - action #107062: Multiple failures due to network issues

Add monitoring for SUSE QA network infrastructure size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

As we found out while investigating #108845, we were the ones who pointed EngInfra to network problems that were also affecting other teams and components within the SUSE Nue server rooms, yet nobody else had noticed them.

Acceptance criteria

  • AC1: Alerting is defined for common SLE OSD test requirements regarding network

Suggestions

  • Add monitoring, e.g. ping checks in telegraf from each openQA worker (or monitor.qa as source) to qanet.qa, dist.suse.de, download.opensuse.org, scc.suse.com, proxy.scc.suse.de
  • Optional: Ping between switches (see https://gitlab.suse.de/nicksinger/network-scripts/-/blob/main/find_mac.py for an example of how to execute commands on switches directly)
  • Optional: Add more HTTP response checks
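
The reachability checks suggested above could be prototyped roughly as follows. This is only a sketch under assumptions, not the monitoring setup that was actually deployed: it uses a TCP connect instead of an ICMP ping (so it needs no raw-socket privileges), the ports are guesses, and the measurement name `net_reach` as well as the helpers `probe`, `to_line` and `main` are made up for illustration. Output is InfluxDB line protocol so that a collector such as telegraf could ingest it.

```python
import socket
import time

# Hosts are taken from the suggestion list above; the ports are assumptions.
TARGETS = [
    ("qanet.qa", 22),
    ("dist.suse.de", 80),
    ("download.opensuse.org", 443),
    ("scc.suse.com", 443),
    ("proxy.scc.suse.de", 443),
]


def probe(host, port, timeout=3.0, connect=socket.create_connection):
    """Attempt one TCP connect and return (reachable, elapsed_ms).

    `connect` is injectable so the function can be exercised without a network.
    """
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False, (time.monotonic() - start) * 1000.0
    conn.close()
    return True, (time.monotonic() - start) * 1000.0


def to_line(host, port, reachable, elapsed_ms, measurement="net_reach"):
    """Format one result as InfluxDB line protocol (without timestamp)."""
    return (
        f"{measurement},target={host},port={port} "
        f"reachable={int(reachable)}i,connect_ms={elapsed_ms:.1f}"
    )


def main():
    """Probe every target once and print one line-protocol record each."""
    for host, port in TARGETS:
        ok, ms = probe(host, port)
        print(to_line(host, port, ok, ms))
```

Calling `main()` from a small wrapper script on each openQA worker (or on monitor.qa) would print one record per target. An alert could then fire whenever `reachable` stays at 0 for a target over several intervals. Where ICMP is preferred, a native ping check in telegraf, as the first suggestion proposes, would likely be the simpler choice.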