action #154552 (closed)

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de

Added by acarvajal 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Support
Target version: Ready
Start date: 2024-01-30
% Done: 0%

Description

Observation

openQA test in scenario sle-15-SP6-Online-ppc64le-SAPHanaSR_ScaleUp_PerfOpt_WMP_node01@ppc64le-sap fails in iscsi_client

Other MM jobs on ppc64le in the job group also failed:

https://openqa.suse.de/tests/13381522#step/iscsi_client/9

But the failure seems to be limited to ppc64le, as the equivalent x86_64 jobs cleared this step:

https://openqa.suse.de/tests/13382300 & https://openqa.suse.de/tests/13382301
https://openqa.suse.de/tests/13382303 & https://openqa.suse.de/tests/13382304

(Those fail later due to an unrelated bsc#)

The recommendation is to investigate whether something changed or whether something is wrong on the qemu_ppc64le-large-mem workers, as HA jobs in the same build on ppc64le were able to clear that test module and in some cases pass completely:

Alpha Cluster: https://openqa.suse.de/tests/13364670 & https://openqa.suse.de/tests/13364672 (passes)
Beta Cluster: https://openqa.suse.de/tests/13364675 & https://openqa.suse.de/tests/13364678 (fails later in filesystem module)
(There are other examples in https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=50.1&groupid=143)

Reproducible

Fails since (at least) Build 50.1

Same test with same build but 3 days ago did not show this issue: https://openqa.suse.de/tests/13364664

Further details

Always latest result in this scenario: latest


Related issues: 4 (1 open, 3 closed)

Related to openQA Tests - action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network (Feedback, 2021-07-21)

Related to openQA Project - action #153769: Better handle changes in GRE tunnel configuration size:M (Resolved, okurz, 2024-01-17)

Related to openQA Project - action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)

Copied to openQA Infrastructure - action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M (Resolved, jbaier_cz, 2024-01-30)
Actions #1

Updated by acarvajal 10 months ago

  • Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added
Actions #2

Updated by mkittler 10 months ago

The GRE configuration on petrol and mania looks generally good. Maybe it still doesn't work in practice, though.

In all the failures I've seen, the support server ran on mania and the other jobs on petrol, so the GRE tunnel setup may be relevant here. I suppose one could cross-check with a simple scenario or a VM as documented on https://open.qa/docs/#_verify_the_setup. Not sure whether e.g. the ping_… scenario works on ppc64le at all, though.
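Such a manual cross-check could look roughly like the sketch below. This is not taken from the ticket: the hostnames and guest IP are placeholders, the `ovs-vsctl`/`ping` invocations are assumptions loosely based on the linked open.qa documentation, and the 1350-byte payload matches the auto_review string of the related ticket #152389. With the default `DRY_RUN=1` the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hedged sketch of a manual GRE sanity check between two MM workers.
# DRY_RUN=1 (default) only prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { [ -n "$DRY_RUN" ] && echo "$*" || "$@"; }

check_worker() {
  # Show the Open vSwitch bridge, including its GRE ports, on one worker.
  run ssh "root@$1" "ovs-vsctl show"
}

mtu_probe() {
  # Ping a guest on the peer worker with a 1350-byte payload and the
  # don't-fragment flag, to catch GRE-related MTU problems.
  run ping -M do -s 1350 -c 3 "$1"
}

# Worker names and the guest IP below are illustrative placeholders.
check_worker petrol
check_worker mania
mtu_probe 10.0.2.16
```

Unset `DRY_RUN` to actually execute the commands once the printed sequence looks right.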

Actions #3

Updated by mkittler 10 months ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler 10 months ago · Edited

I assigned myself for some initial investigation so we have something to work with when estimating the ticket.

As a first step I created a test cluster for testing the GRE connection:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/13366259 _GROUP=0 BUILD+=-gre-test-for-poo-154552 WORKER_CLASS:wicked_basic_ref+=,mania WORKER_CLASS:wicked_basic_sut+=,petrol

2 jobs have been created:

Actions #5

Updated by okurz 10 months ago

  • Tags set to infra, multi-machine
  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Support
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #6

Updated by okurz 10 months ago

  • Related to action #153769: Better handle changes in GRE tunnel configuration size:M added
Actions #7

Updated by okurz 10 months ago

  • Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Actions #8

Updated by okurz 10 months ago

  • Parent task set to #111929
Actions #9

Updated by okurz 10 months ago

  • Subject changed from test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de to [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de
Actions #10

Updated by mkittler 10 months ago · Edited

It worked again after rebooting petrol and mania: https://openqa.suse.de/tests/13382492

Before that, the wicked test scenario also didn't work, and just running the preup script again (on both workers) to delete and re-add the GRE tunnel connection didn't help either.
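For reference, deleting and re-adding a GRE port on an Open vSwitch bridge (which is roughly what the preup script does) can be sketched as below. The bridge name, port name, and remote IP are assumptions for illustration, not values from this ticket; with the default `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Hedged sketch: delete and re-create a GRE port with ovs-vsctl.
# Bridge/port names and the peer address are placeholders.
DRY_RUN=${DRY_RUN:-1}
run() { [ -n "$DRY_RUN" ] && echo "$*" || "$@"; }

recreate_gre_port() {
  bridge=$1 port=$2 remote=$3
  # Remove the port if present, then add it back as a GRE interface
  # pointing at the peer worker's address.
  run ovs-vsctl --if-exists del-port "$bridge" "$port"
  run ovs-vsctl add-port "$bridge" "$port" \
    -- set interface "$port" type=gre options:remote_ip="$remote"
}

recreate_gre_port br1 gre1 192.0.2.10   # placeholder peer address
```

As the comment above notes, in this case re-creating the tunnel was not sufficient and only a reboot helped.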


The scenario mentioned in the ticket description works again as well: https://openqa.suse.de/tests/13382503
(Although in this test run all jobs ran only on petrol.)

Actions #11

Updated by okurz 10 months ago

That leaves open the question of what caused this, as we didn't have any intended changes in the GRE setup lately, did we?

Also, next time we see such a problem, we can try the following alternatives before resorting to rebooting:

  • wicked ifup all
  • systemctl restart network
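The escalation order above can be sketched as a small script. The worker hostname is a placeholder; the two lighter steps are exactly the alternatives listed, with reboot only as the last resort, and in practice one would re-check connectivity between the steps and stop as soon as the tunnel works again. With the default `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Hedged sketch: network-recovery escalation on one MM worker,
# trying the lighter steps first and rebooting only as a last resort.
DRY_RUN=${DRY_RUN:-1}
run() { [ -n "$DRY_RUN" ] && echo "$*" || "$@"; }

recover_network() {
  host=$1   # worker hostname, placeholder value in the call below
  run ssh "root@$host" "wicked ifup all"
  run ssh "root@$host" "systemctl restart network"
  run ssh "root@$host" "systemctl reboot"
}

recover_network petrol
```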
Actions #12

Updated by okurz 10 months ago

  • Status changed from New to Resolved

Problem is gone. We will follow up in related tickets, e.g. #153769.

Actions #13

Updated by okurz 10 months ago

  • Copied to action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added