action #155929: Try out rstp_enable=True in openqa/openvswitch.sls size:M - openQA Infrastructure (public) - openSUSE Project Management Tool


action #155929

closed

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Try out rstp_enable=True in openqa/openvswitch.sls size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: dheidler
Category:
Target version: openQA Project (public) - Ready
Start date:
Due date:
% Done:
Estimated time:
Tags: infra

Description

Motivation

Our theory is that the multi-machine setup with GRE tunnels and STP causes problems like the one in #155716-8, possibly because STP is too slow to adapt to topology changes, which causes openQA tests to fail.

Acceptance criteria

AC1: Temporary multi-machine test issues are prevented when worker hosts are temporarily unavailable
AC2: RSTP does not break more than what we had before
AC3: Our documentation and salt states are up-to-date regarding STP+RSTP

Suggestions

Read https://pve.proxmox.com/wiki/Open_vSwitch#Rapid_Spanning_Tree_.28RSTP.29 and enable the setting via Salt
Read https://www.accuenergy.com/support/reference-directory/rapid-spanning-tree-protocol-rstp/#:~:text=Rapid%20Spanning%20Tree%20Protocol%20(RSTP%3A%20IEEE%20802.1w)%20is,free%E2%80%9D%20topology%20within%20Ethernet%20networks.
Do a simple ping test between VMs (using a cluster of at least 3 machines connected via GRE) when one of the GRE nodes disconnects and connects (see http://open.qa/docs/#_start_test_vms_manually)
Try via the MM openQA-in-openQA test by simply changing https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50 and adapting the openQA-in-openQA test to use that os-autoinst version instead of the stable package
Try to reproduce the issue, e.g. using https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_hawk_haproxy_node02&version=15-SP2, by running this test near-continuously and then triggering a reboot of a machine that "ovs-appctl stp/show" shows to be crucial for the connection while the test is running
Then enable RSTP in the wicked hook scripts and possibly disable STP instead
Repeat the experiment and check whether the above significantly reduces related problems
If successful, ensure that https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50, the salt-states and our config in http://open.qa/docs/ are in sync

Related issues 1 (1 open — 0 closed)

Updated by dheidler over 1 year ago

Assignee set to dheidler

Updated by dheidler over 1 year ago

Status changed from Workable to In Progress

Updated by dheidler over 1 year ago

RSTP is an improvement over STP (Spanning Tree Protocol) mainly due to its reduction in convergence time – that is, the time it takes all switches on a network to reach a state of convergence, or agreement, on the topology of the network. In STP, there is substantial convergence time whenever there is a topology change or failure in the network, which typically lasts for 40-50 seconds. In a modern, high-demand networking environment, there is a constant need for increased speed and reliability and a delay of 40-50 seconds is generally unacceptable. RSTP reduces the convergence time significantly down to around 5-10 seconds.

Fortunately, many modern switches on the market automatically enable RSTP by default. Further, for networking environments with a mix of older and newer equipment, it is important to note that RSTP is backward compatible with the older STP standard.
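
The long STP convergence time quoted above can be traced back to the protocol timers. As a minimal sketch, assuming the commonly cited IEEE 802.1D defaults (max age 20 s, forward delay 15 s) rather than values read from our bridges, and with a hypothetical function name, the worst case is the max age plus two forward-delay periods (listening and learning):

```shell
#!/bin/sh
# Worst-case STP convergence: wait for max age to expire, then pass
# through the listening and learning states (one forward delay each).
stp_convergence() {
    max_age=$1
    forward_delay=$2
    echo $((max_age + 2 * forward_delay))
}

# IEEE 802.1D default timers (assumed, not read from the bridge):
stp_convergence 20 15   # indirect failure: prints 50
stp_convergence 0 15    # direct link failure, no max-age wait: prints 30
```

The 30-second case is roughly what one would expect to observe whenever a port has to walk through listening and learning again under classic STP.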

Updated by dheidler over 1 year ago

ovs-appctl stp/show
ovs-appctl rstp/show
ovs-vsctl set bridge br1 rstp_enable=true
ovs-vsctl set bridge br1 stp_enable=false
ovs-appctl stp/show
ovs-appctl rstp/show
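
The command sequence above could be wrapped in a small script so it can be previewed before touching a production bridge. This is only a sketch: the `run` and `switch_to_rstp` helpers and the `DRY_RUN` variable are hypothetical, and only the `ovs-vsctl`/`ovs-appctl` invocations are taken from the commands listed above.

```shell
#!/bin/sh
# Print the command instead of executing it when DRY_RUN=1
# (DRY_RUN is a hypothetical helper variable, not an OVS feature).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Switch one Open vSwitch bridge from STP to RSTP and show the state
# before and after.
switch_to_rstp() {
    bridge=$1
    run ovs-appctl stp/show
    # Same order as the commands above: enable RSTP, then disable STP.
    run ovs-vsctl set bridge "$bridge" rstp_enable=true
    run ovs-vsctl set bridge "$bridge" stp_enable=false
    run ovs-appctl rstp/show
}

# Example: preview the commands for br1 without changing anything.
DRY_RUN=1
switch_to_rstp br1
```

Unsetting `DRY_RUN` and calling `switch_to_rstp br1` as root would then run the same four commands for real.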

Updated by dheidler over 1 year ago · Edited

I tested switching from STP to RSTP and back using test VMs as described here: https://open.qa/docs/#_start_test_vms_manually

Worker hosts used for testing: worker36.oqa.prg2.suse.org and worker37.oqa.prg2.suse.org

STP -> RSTP takes about 3 seconds.
RSTP -> STP takes about 30 seconds.

This is in line with what is expected and indicates that RSTP actually works.
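
One way to put numbers on such a switch-over is to leave `ping` running between two test VMs at a one-second interval and look for the longest run of missing replies. The following is a rough sketch, assuming the Linux iputils `ping` output format with `bytes from` and `icmp_seq=N` fields; the function name and the peer address are hypothetical, and this is not necessarily how the figures above were measured.

```shell
#!/bin/sh
# Read ping output on stdin and print the largest number of consecutive
# missing replies. With "ping -i 1" this approximates the outage in seconds.
max_ping_gap() {
    awk '
        # Only successful replies count; "Destination Host Unreachable"
        # lines also carry icmp_seq= and must be ignored.
        /bytes from/ && match($0, /icmp_seq=[0-9]+/) {
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq - prev - 1 > gap) gap = seq - prev - 1
            prev = seq; seen = 1
        }
        END { print gap + 0 }
    '
}

# Usage inside a test VM (10.0.2.16 is a placeholder peer address):
#   ping -i 1 -c 120 10.0.2.16 | max_ping_gap
```

Toggling `rstp_enable` and `stp_enable` on a worker host while the ping is running, and then reading the printed gap, would give a per-run estimate of the interruption.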

Updated by dheidler over 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1114

Updated by dheidler over 1 year ago

Status changed from In Progress to Feedback

Updated by okurz over 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1114 merged. Please closely monitor the impact on the infrastructure.

#10

Updated by okurz over 1 year ago

Status changed from Feedback to In Progress

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2317610#L916 seems to be related

#11

Updated by okurz over 1 year ago

I am not sure whether the recent multi-machine failures are related. In any case, I retriggered them with

failed_since="2024-02-26 18:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo155929" openqa-advanced-retrigger-jobs

We can monitor https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-12h&to=now and https://openqa.suse.de/tests?resultfilter=parallel_failed&limit=2000

#12

Updated by openqa_review over 1 year ago

Due date set to 2024-03-12

Setting due date based on mean cycle time of SUSE QE Tools

#13

Updated by dheidler over 1 year ago

https://github.com/os-autoinst/os-autoinst/pull/2464

#14

Updated by dheidler over 1 year ago

https://github.com/os-autoinst/openQA/pull/5489

#15

Updated by okurz over 1 year ago

https://github.com/os-autoinst/os-autoinst/pull/2464 and https://github.com/os-autoinst/openQA/pull/5489 merged

#16

Updated by dheidler over 1 year ago

Status changed from In Progress to Resolved

#17

Updated by okurz over 1 year ago

Due date deleted (2024-03-12)

I see most steps done, though I have not yet seen a verification of the fix for the original issue, e.g. in
https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_hawk_haproxy_node02&version=15-SP2
when some machine changes its network config or is rebooted. Without that, we will have to see whether users come back to us with feedback about failing tests.

#18

Updated by okurz about 1 year ago

Copied to action #157738: Use rstp_enable=True on o3 as well added
