action #154552
closed coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de
Description
Observation
openQA test in scenario sle-15-SP6-Online-ppc64le-SAPHanaSR_ScaleUp_PerfOpt_WMP_node01@ppc64le-sap fails in iscsi_client
Other MM jobs in ppc64le in the job group also failed:
https://openqa.suse.de/tests/13381522#step/iscsi_client/9
But the failure seems to be limited to ppc64le, as the equivalent x86_64 jobs cleared this step:
https://openqa.suse.de/tests/13382300 & https://openqa.suse.de/tests/13382301
https://openqa.suse.de/tests/13382303 & https://openqa.suse.de/tests/13382304
(Those fail later in an unrelated bsc#)
The recommendation is to investigate whether something changed or whether something is wrong on the qemu_ppc64le-large-mem workers, as HA jobs in the same build on ppc64le were able to clear that test module and in some cases pass completely:
Alpha Cluster: https://openqa.suse.de/tests/13364670 & https://openqa.suse.de/tests/13364672 (passes)
Beta Cluster: https://openqa.suse.de/tests/13364675 & https://openqa.suse.de/tests/13364678 (fails later in filesystem module)
(There are other examples in https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=50.1&groupid=143)
Reproducible
Fails since (at least) Build 50.1
The same test with the same build 3 days earlier did not show this issue: https://openqa.suse.de/tests/13364664
Further details
Always latest result in this scenario: latest
Updated by acarvajal 8 months ago
- Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added
Updated by mkittler 8 months ago
The gre configuration on petrol and mania looks generally good. Maybe it still doesn't work in practice, though.
In all the failures I've seen, the support server ran on mania and the other jobs on petrol, so the gre tunnel setup may be relevant here. I suppose one could cross-check with a simple scenario or a VM as documented on https://open.qa/docs/#_verify_the_setup. Not sure whether e.g. the ping_… scenario works on ppc64le at all, though.
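A minimal sketch of such a manual cross-check, assuming two VMs on different workers that should reach each other through the gre tunnel. The helper name `build_mtu_ping` and the peer address are hypothetical placeholders; the payload size of 1350 mirrors the auto_review message of the related MTU ticket #152389. The helper only prints the command so it can be reviewed before being run inside a VM:

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): print a ping command that sends
# a fixed-size payload with the "don't fragment" flag set (-M do), so that
# MTU problems on the tunnel path show up as failures instead of being
# hidden by fragmentation.
build_mtu_ping() {
    peer="$1"          # assumed: address of a VM running on the other worker
    size="${2:-1350}"  # payload size in bytes; 1350 as in the #152389 message
    printf 'ping -c 3 -M do -s %s %s\n' "$size" "$peer"
}

# 192.0.2.1 is a documentation-only placeholder address.
build_mtu_ping 192.0.2.1
```

If the printed command fails inside a VM while a plain `ping` without `-s` succeeds, that would point at an MTU problem rather than a missing tunnel.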
Updated by mkittler 8 months ago · Edited
I assigned myself for some initial investigation so we have something to work with when estimating the ticket.
As a first step I created a test cluster for testing the gre connection:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/13366259 _GROUP=0 BUILD+=-gre-test-for-poo-154552 WORKER_CLASS:wicked_basic_ref+=,mania WORKER_CLASS:wicked_basic_sut+=,petrol
2 jobs have been created:
- sle-15-SP6-Online-ppc64le-Build50.1-wicked_basic_ref@ppc64le -> https://openqa.suse.de/tests/13382433
- sle-15-SP6-Online-ppc64le-Build50.1-wicked_basic_sut@ppc64le -> https://openqa.suse.de/tests/13382434
Updated by okurz 8 months ago
- Related to action #153769: Better handle changes in GRE tunnel configuration size:M added
Updated by okurz 8 months ago
- Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Updated by mkittler 8 months ago · Edited
It worked again after rebooting petrol and mania: https://openqa.suse.de/tests/13382492
Before the reboot, the wicked test scenario didn't work either, and just running the preup script again (on both workers) to delete and re-add the gre tunnel connection didn't help.
The scenario mentioned in the ticket description works again as well: https://openqa.suse.de/tests/13382503
(Although in this test run all jobs ran only on petrol.)
Updated by okurz 8 months ago
That leaves open the question of what caused this, as we didn't have any intended changes in the GRE structure lately, did we?
Also, the next time we see such a problem we can try the following alternatives before resorting to rebooting:
wicked ifup all
systemctl restart network
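The escalation above could be sketched roughly as follows. This is only an illustration, not an existing tool: `try_recovery` and its arguments are hypothetical names, and the check command is whatever verifies cross-worker connectivity in the concrete case. By default the steps are only printed; passing `"sh -c"` as the second argument would actually execute them (as root on the worker).

```shell
#!/bin/sh
# Recovery steps taken from the comment above, least intrusive first.
RECOVERY_STEPS='wicked ifup all
systemctl restart network'

# try_recovery CHECK_CMD [RUNNER]
#   CHECK_CMD: shell command that succeeds once connectivity works again
#              (assumption: supplied by the caller).
#   RUNNER:    "echo" (default, print only) or "sh -c" (really execute).
try_recovery() {
    check_cmd="$1"
    runner="${2:-echo}"
    while IFS= read -r step; do
        # $runner is intentionally unquoted so "sh -c" splits into two words.
        $runner "$step" </dev/null
        if sh -c "$check_cmd" </dev/null; then
            echo "recovered after: $step"
            return 0
        fi
    done <<EOF
$RECOVERY_STEPS
EOF
    echo "still failing; a reboot may be needed"
    return 1
}
```

For example, `try_recovery 'ping -c 1 192.0.2.1'` (with a placeholder address) prints each step and re-runs the check after it, returning non-zero if none of the steps helped.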
Updated by okurz 8 months ago
- Copied to action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added