action #179302

open

coordination #161414: [epic] Improved salt based infrastructure management

Better monitoring for correct MTU size limits

Added by okurz about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2025-03-07
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Our infrastructure is diverse: hosts sit in multiple network segments, some of them connected through IPSec or wireguard tunnels, which can cause MTU size problems as seen recently in #178576. Within osado we already have an MTU size ping check that sends packets of increasing size. We can improve our salt-based monitoring by using https://github.com/influxdata/telegraf/tree/master/plugins/inputs/ping with different or larger ping sizes.

Acceptance criteria

  • AC1: OSD monitoring provides ping monitoring for host-specific maximum MTU size
  • AC2: We still have generic minimum-size ping checks in parallel
  • AC3: Additional monitoring does not have a negative performance or storage overhead on any of our salt controlled hosts

Suggestions

  • We could start with a simple maximum size ping based on what is applicable for certain network segments, e.g. based on wireguard and/or location. See https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/308922361e546d6dd1d99f991cce28089d39477a/top.sls#L21 where we use 'G@needs_wireguard:True or ( *.nue2.suse.org and not G@needs_wireguard:False )'
  • Add one or multiple ping checks that run less frequently than the default interval. Something like once an hour with multiple sizes should be enough
  • Create a corresponding grafana panel and alert definition
  • Ensure that the additional monitoring does not have a negative performance or storage overhead on any of our salt controlled hosts
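
As a rough illustration of the first two suggestions, the following Python sketch derives the largest ICMP payload that fits a given MTU without fragmentation (MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header) and builds the matching iputils `ping` call with fragmentation forbidden. The segment names and MTU values (1500 for plain ethernet, 1420 as the wg-quick default for wireguard) are assumptions for illustration and are not taken from salt-states-openqa; real values would have to come from the hosts themselves.

```python
from __future__ import annotations

import subprocess

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header

# Hypothetical per-segment MTUs; adjust to the real network setup.
ASSUMED_SEGMENT_MTU = {
    "ethernet": 1500,
    "wireguard": 1420,
}


def icmp_payload_for_mtu(mtu: int) -> int:
    """Return the largest ICMP echo payload fitting one IPv4 packet of size mtu."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo request")
    return payload


def ping_command(host: str, mtu: int, count: int = 3) -> list[str]:
    """Build an iputils ping call that must not be fragmented (-M do)."""
    size = icmp_payload_for_mtu(mtu)
    return ["ping", "-c", str(count), "-M", "do", "-s", str(size), host]


def mtu_ok(host: str, mtu: int) -> bool:
    """Run the ping; True if packets of the full MTU size get through."""
    result = subprocess.run(
        ping_command(host, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Only print the commands; "host.example.org" is a placeholder.
    for segment, mtu in ASSUMED_SEGMENT_MTU.items():
        print(segment, " ".join(ping_command("host.example.org", mtu)))
```

The computed payload sizes could then be used as the host-specific ping size in the monitoring configuration. How the size is passed to the telegraf ping input depends on the ping method configured there, so that part is left open here.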

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S (Resolved, nicksinger, 2025-03-07)

#1

Updated by okurz about 1 month ago

  • Copied from action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S added