action #155929
closed
openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Try out rstp_enable=True in openqa/openvswitch.sls size:M
0%
Description
Motivation
Our theory is that our multi-machine setup with GRE tunnels and STP causes problems like those seen in #155716-8, possibly because STP is too slow to adapt, which causes openQA tests to fail.
Acceptance criteria
- AC1: Temporary multi-machine test issues are prevented when worker hosts are temporarily unavailable
- AC2: RSTP does not break more than was already broken before
- AC3: Our documentation and salt states are up-to-date regarding STP+RSTP
Suggestions
- Read https://pve.proxmox.com/wiki/Open_vSwitch#Rapid_Spanning_Tree_.28RSTP.29 and enable the setting via Salt
- Read https://www.accuenergy.com/support/reference-directory/rapid-spanning-tree-protocol-rstp/#:~:text=Rapid%20Spanning%20Tree%20Protocol%20(RSTP%3A%20IEEE%20802.1w)%20is,free%E2%80%9D%20topology%20within%20Ethernet%20networks.
- Do a simple ping test between VMs (using a cluster of at least 3 machines connected via GRE) when one of the GRE nodes disconnects and connects (see http://open.qa/docs/#_start_test_vms_manually)
- Try via the MM openQA-in-openQA test by simply changing https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50 and adapting the openQA-in-openQA test to use that os-autoinst version instead of the stable package
- Try to reproduce the problem e.g. using https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_hawk_haproxy_node02&version=15-SP2 by running this test near-continuously and then, while the test is running, triggering a reboot of a machine which "ovs-appctl stp/show" shows to be crucial for the connection
- Then enable RSTP in the wicked hook scripts and possibly disable STP instead
- Repeat the experiment and check if the above significantly prevents related problems
- If successful, ensure that https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50, salt-states and our config in http://open.qa/docs/ are in sync
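The STP to RSTP switch suggested above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the bridge name `br1` and the helper `rstp_switch_commands` are hypothetical and not taken from the salt states or the wicked hook scripts. It relies on the `stp_enable` and `rstp_enable` bridge columns of Open vSwitch and only builds the command lines; it does not execute them.

```python
import shlex
from typing import List


def rstp_switch_commands(bridge: str, use_rstp: bool = True) -> List[List[str]]:
    """Build (but do not run) the commands toggling a bridge between STP and RSTP.

    The bridge name is supplied by the caller; nothing here is read from
    the actual salt states or hook scripts.
    """
    stp = "false" if use_rstp else "true"
    rstp = "true" if use_rstp else "false"
    show = "rstp/show" if use_rstp else "stp/show"
    return [
        # Set the two protocol switches one after the other.
        ["ovs-vsctl", "set", "bridge", bridge, f"stp_enable={stp}"],
        ["ovs-vsctl", "set", "bridge", bridge, f"rstp_enable={rstp}"],
        # Show the resulting spanning tree state for manual inspection.
        ["ovs-appctl", show, bridge],
    ]


if __name__ == "__main__":
    # "br1" is an assumed bridge name used purely for illustration.
    for argv in rstp_switch_commands("br1"):
        print(shlex.join(argv))
```

Building the argument lists separately from running them keeps the sketch safe to try on any machine; the printed lines can be reviewed before being pasted into a root shell on a worker host.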
Updated by dheidler 10 months ago
RSTP is an improvement over STP (Spanning Tree Protocol) mainly due to its reduction in convergence time – that is, the time it takes all switches on a network to reach a state of convergence, or agreement, on the topology of the network. In STP, there is substantial convergence time whenever there is a topology change or failure in the network, which typically lasts for 40-50 seconds. In a modern, high-demand networking environment, there is a constant need for increased speed and reliability and a delay of 40-50 seconds is generally unacceptable. RSTP reduces the convergence time significantly down to around 5-10 seconds.
Fortunately, many modern switches on the market automatically enable RSTP by default. Further, for networking environments with a mix of older and newer equipment, it is important to note that RSTP is backward compatible with the older STP standard.
Updated by dheidler 10 months ago · Edited
I tested switching from STP to RSTP and back using test VMs as described here: https://open.qa/docs/#_start_test_vms_manually
Worker hosts used for testing: worker36.oqa.prg2.suse.org and worker37.oqa.prg2.suse.org
- STP -> RSTP takes about 3 seconds.
- RSTP -> STP takes about 30 seconds.
This is in line with what is expected and indicates that RSTP actually works.
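One possible way to quantify such switch-over times is to run a continuous ping between two test VMs and look for the longest stretch without replies. The following is a minimal sketch, assuming one ping per second and a list of reply timestamps collected by other means. The helper `longest_outage` and the sample timestamps are made up for illustration and are not the measurements from this ticket.

```python
from typing import List


def longest_outage(reply_times: List[float], interval: float = 1.0) -> float:
    """Return the longest time in seconds without a reply, beyond the ping interval.

    reply_times: timestamps (seconds) at which ping replies arrived, ascending.
    interval: assumed time between two pings when nothing is lost.
    """
    # Time between each pair of consecutive replies.
    gaps = [later - earlier for earlier, later in zip(reply_times, reply_times[1:])]
    longest = max(gaps, default=interval)
    # A gap equal to the normal interval means no outage at all.
    return max(0.0, longest - interval)


# Made-up timestamps: replies every second, then silence between t=3 and t=7.
replies = [0.0, 1.0, 2.0, 3.0, 7.0, 8.0]
print(longest_outage(replies))
```

Subtracting the regular interval avoids reporting the normal one-second spacing between pings as an outage.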
Updated by okurz 10 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1114 merged. Please closely monitor the impact on the infrastructure.
Updated by okurz 10 months ago
- Status changed from Feedback to In Progress
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2317610#L916 seems to be related
Updated by okurz 10 months ago
Not sure if the recent multi-machine failures are related. Anyway, I retriggered them with:
failed_since="2024-02-26 18:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo155929" openqa-advanced-retrigger-jobs
We can monitor https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-12h&to=now and https://openqa.suse.de/tests?resultfilter=parallel_failed&limit=2000
Updated by openqa_review 10 months ago
- Due date set to 2024-03-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 10 months ago
- Due date deleted (2024-03-12)
I see most steps done though I have not yet seen a verification of the fix for the original issue, e.g. in
https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_hawk_haproxy_node02&version=15-SP2
when some machine changes the network config or is rebooted. Without that we will have to see if users come back to us with feedback about failing tests.
Updated by okurz 9 months ago
- Copied to action #157738: Use rstp_enable=True on o3 as well added