action #160646
[closed] openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M
Description
Observation
Originally reported in https://suse.slack.com/archives/C02CANHLANP/p1716169544132569
(Richard Fan) Hello experts, many multi-machine tests have failed, e.g. the MM failed jobs on qe-core
As visible in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1716124476616&to=1716212983933&viewPanel=24
the number of parallel_failed jobs increases starting 2024-05-19 22:30.
E.g. the openQA test in scenario sle-15-SP3-Server-DVD-Updates-x86_64-qam_kernel_multipath@64bit fails in
multipath_iscsi
The error message "connect: Network is unreachable" shows that this is not a problem with the MTU.
ssh worker29.oqa.prg2.suse.org "cat /etc/wicked/scripts/gre_tunnel_preup.sh"
shows the problem:
#!/bin/sh
action="$1"
bridge="$2"
# enable STP for the multihost bridges
ovs-vsctl set bridge $bridge stp_enable=false
ovs-vsctl set bridge $bridge rstp_enable=true
for gre_port in $(ovs-vsctl list-ifaces $bridge | grep gre) ; do ovs-vsctl --if-exists del-port $bridge $gre_port ; done
After the last line there should be a list of GRE tunnel interface setup calls connecting those machines, but no GRE tunnels are set up at all.
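The observation above can be checked mechanically. The following is a minimal sketch, not the project's actual tooling: `count_gre_setup_calls` and `check_gre_setup` are hypothetical helper names, and the pattern assumes that tunnel setup calls use `ovs-vsctl ... add-port`, which the script output above does not show.

```shell
#!/bin/sh
# Hypothetical helpers to check a copy of gre_tunnel_preup.sh for GRE setup calls.
# Assumption: setup calls look like "ovs-vsctl ... add-port ..."; adjust the
# pattern if the real generated lines differ.

# Print the number of lines that look like tunnel setup calls.
count_gre_setup_calls() {
    script="$1"
    # grep -c prints the match count but exits 1 on zero matches, hence "|| true"
    grep -c -E 'ovs-vsctl .*add-port' "$script" || true
}

# Return non-zero and print a message on stderr if no setup call is found.
check_gre_setup() {
    script="$1"
    count=$(count_gre_setup_calls "$script")
    if [ "${count:-0}" -eq 0 ]; then
        echo "no GRE tunnel setup calls found in $script" >&2
        return 1
    fi
    echo "$count GRE tunnel setup call(s) found in $script"
}

# Possible usage against the worker mentioned above:
#   ssh worker29.oqa.prg2.suse.org "cat /etc/wicked/scripts/gre_tunnel_preup.sh" > /tmp/preup.sh
#   check_gre_setup /tmp/preup.sh
```

The `del-port` cleanup loop in the script is deliberately not counted, so a file in the state shown above would be reported as having no setup calls.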
Reproducible
Fails since
https://openqa.suse.de/tests/overview?result=parallel_failed&distri=sle&version=15-SP4&build=20240519-1
Expected result
Last good: 20240519-1 (or more recent)
Suggestions
- DONE Mitigate -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814
- Apply workarounds
- Retrigger affected jobs
- Investigate error source
- Fix the problem, or at least abort the generation of the script with an error if the GRE tunnel section would be completely empty?
- Prevent the same and similar problems in the future
- Apply rollback steps
- Monitor effect carefully
- Look into https://stats.openqa-monitor.qa.suse.de/alerting/grafana/0XohcmfVk/view?orgId=1
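The suggestion to abort the generation when the GRE tunnel section would be empty could look roughly like the sketch below. It assumes the section is produced from a list of remote worker addresses. The function name `generate_gre_section`, its arguments and the exact form of the emitted `ovs-vsctl` lines are illustrative assumptions; the ticket does not show the real generator.

```shell
#!/bin/sh
# Hypothetical generator for the GRE tunnel section of the pre-up script.
# Arguments: remote worker addresses (assumed input, not taken from the ticket).
generate_gre_section() {
    if [ "$#" -eq 0 ]; then
        # Fail loudly instead of silently emitting a script without tunnels
        echo "error: GRE tunnel section would be empty, aborting generation" >&2
        return 1
    fi
    i=1
    for remote_ip in "$@"; do
        # Assumed form of one tunnel setup call; $bridge stays literal because
        # it is expanded later, when the generated script itself runs
        printf 'ovs-vsctl --may-exist add-port $bridge gre%d -- set interface gre%d type=gre options:remote_ip=%s\n' "$i" "$i" "$remote_ip"
        i=$((i + 1))
    done
}

# Possible usage (documentation addresses, for illustration only):
#   generate_gre_section 192.0.2.1 192.0.2.2 >> gre_tunnel_preup.sh
```

Returning a non-zero status lets the calling deployment step stop before an incomplete script reaches the workers.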
Rollback steps
Further details
Always latest result in the originally mentioned scenario: latest