action #155929

closed

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Try out rstp_enable=True in openqa/openvswitch.sls size:M

Added by okurz 2 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Motivation

We have the theory that our multi-machine setup with GRE tunnels and STP causes problems like those in #155716-8, possibly because STP is too slow to adapt, which makes openQA tests fail.

Acceptance criteria

  • AC1: Temporary multi-machine test issues are prevented when worker hosts are temporarily unavailable
  • AC2: RSTP does not break more than was broken before
  • AC3: Our documentation and salt states are up-to-date regarding STP+RSTP

Suggestions


Related issues 1 (1 open, 0 closed)

Copied to openQA Infrastructure - action #157738: Use rstp_enable=True on o3 as well (New)

Actions #2

Updated by dheidler 2 months ago

  • Assignee set to dheidler
Actions #3

Updated by dheidler 2 months ago

  • Status changed from Workable to In Progress
Actions #4

Updated by dheidler 2 months ago

RSTP is an improvement over STP (Spanning Tree Protocol) mainly due to its reduction in convergence time – that is, the time it takes all switches on a network to reach a state of convergence, or agreement, on the topology of the network. In STP, there is substantial convergence time whenever there is a topology change or failure in the network, which typically lasts for 40-50 seconds. In a modern, high-demand networking environment, there is a constant need for increased speed and reliability and a delay of 40-50 seconds is generally unacceptable. RSTP reduces the convergence time significantly down to around 5-10 seconds.

Fortunately, many modern switches on the market automatically enable RSTP by default. Further, for networking environments with a mix of older and newer equipment, it is important to note that RSTP is backward compatible with the older STP standard.

Actions #5

Updated by dheidler 2 months ago

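# Show the current STP and RSTP state
ovs-appctl stp/show
ovs-appctl rstp/show
# Switch bridge br1 from STP to RSTP
ovs-vsctl set bridge br1 rstp_enable=true
ovs-vsctl set bridge br1 stp_enable=false
# Check the state again after the switch
ovs-appctl stp/show
ovs-appctl rstp/show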
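
For repeating the switch on several hosts, the two ovs-vsctl calls above could be wrapped in a small helper. This is only a sketch: the function name switch_to_rstp and the DRY_RUN variable are made up for illustration and are not part of the salt states or of Open vSwitch.

```shell
# Hypothetical helper: switch a bridge from STP to RSTP.
# With DRY_RUN set to a non-empty value the commands are printed, not run.
switch_to_rstp() {
    bridge="${1:-br1}"            # default bridge name taken from the commands above
    run="${DRY_RUN:+echo}"        # expands to "echo" in dry-run mode, else to nothing
    $run ovs-vsctl set bridge "$bridge" rstp_enable=true
    $run ovs-vsctl set bridge "$bridge" stp_enable=false
}
```

Running it with DRY_RUN=1 first shows what would be changed before touching a production bridge.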
Actions #6

Updated by dheidler 2 months ago · Edited

I tested switching from STP to RSTP and back using test VMs as described here: https://open.qa/docs/#_start_test_vms_manually

Worker hosts used for testing: worker36.oqa.prg2.suse.org and worker37.oqa.prg2.suse.org

  • STP -> RSTP takes about 3 seconds.
  • RSTP -> STP takes about 30 seconds.

This is in line with what is expected and indicates that RSTP actually works.
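
The ticket does not say how these times were measured. One possible way, sketched here under that assumption, is to keep a timestamped ping running between the two test VMs while switching the protocol and then look for the longest gap between replies. The function name longest_gap is made up for illustration; it expects the output of `ping -D`, which prefixes each line with a `[unix-time]` stamp.

```shell
# Hypothetical helper: print the longest gap (in seconds) between two
# consecutive successful replies in `ping -D` output read from stdin.
longest_gap() {
    awk '/bytes from/ {
        ts = substr($1, 2, length($1) - 2)   # strip the [ ] around the timestamp
        if (seen && ts - prev > max) max = ts - prev
        prev = ts; seen = 1
    }
    END { printf "%.1f\n", max + 0 }'
}
```

For example, run `ping -D -i 0.2 PEER_IP > ping.log` inside one VM (PEER_IP stands for the address of the other VM), switch the bridge, stop the ping and run `longest_gap < ping.log`. The result includes the ping interval, so it is only an upper estimate of the outage.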

Actions #8

Updated by dheidler 2 months ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by okurz 2 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1114 merged. Please closely monitor the impact on the infrastructure.

Actions #10

Updated by okurz 2 months ago

  • Status changed from Feedback to In Progress
Actions #11

Updated by okurz 2 months ago

Not sure if the recent multi-machine failures are related. In any case, I retriggered them with

failed_since="2024-02-26 18:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo155929" openqa-advanced-retrigger-jobs

We can monitor https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-12h&to=now and https://openqa.suse.de/tests?resultfilter=parallel_failed&limit=2000

Actions #12

Updated by openqa_review 2 months ago

  • Due date set to 2024-03-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by dheidler 2 months ago

  • Status changed from In Progress to Resolved
Actions #17

Updated by okurz 2 months ago

  • Due date deleted (2024-03-12)

I see most steps done though I have not yet seen a verification of the fix for the original issue, e.g. in
https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_hawk_haproxy_node02&version=15-SP2
when some machine changes the network config or is rebooted. Without that we will have to see if users come back to us with feedback about failing tests.

Actions #18

Updated by okurz about 1 month ago
