action #111908: Multimachine failures between multiple physical workers - openQA Project (public) - openSUSE Project Management Tool

action #111908

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Multimachine failures between multiple physical workers

Added by dzedro about 3 years ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: Feature requests
Target version: Ready
Start date: 2022-06-03
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

Multi-machine (MM) tests show "random unexpected" failures due to some issue between multiple workers.
Below is a list of support_server jobs from failed MM HA/SAP jobs in the last two weeks.
I restarted these jobs on the same openQA worker and they did not fail.

I have the same experience with a local HA/SAP instance: when I use one worker, there are nearly no "random unexpected" failures.
When I use two physical workers, the rate of "random unexpected" failures increases.

Steps to reproduce

The failures are random. I could reproduce them on a local instance with multiple physical workers.

Problem

I assume it is a network/openvswitch/GRE issue between the servers.

Workaround

Run all jobs of a multi-machine test on one physical worker by adding that worker's class to WORKER_CLASS, e.g. WORKER_CLASS=qemu_x86_64,tap,openqaworker8
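
As a rough illustration of the workaround, the sketch below appends a host-specific class to an existing WORKER_CLASS value. The helper name pin_to_host and the settings dictionary are hypothetical, and it assumes the target worker advertises a class named after its host (as openqaworker8 does in the example above).

```python
def pin_to_host(settings: dict[str, str], host: str) -> dict[str, str]:
    """Return a copy of the job settings with `host` appended to WORKER_CLASS.

    Hypothetical helper: assumes the worker exposes a class equal to its host name.
    """
    # Split the comma-separated class list and drop empty entries.
    classes = [c for c in settings.get("WORKER_CLASS", "").split(",") if c]
    if host not in classes:
        classes.append(host)
    pinned = dict(settings)  # leave the caller's dictionary untouched
    pinned["WORKER_CLASS"] = ",".join(classes)
    return pinned


print(pin_to_host({"WORKER_CLASS": "qemu_x86_64,tap"}, "openqaworker8")["WORKER_CLASS"])
# prints: qemu_x86_64,tap,openqaworker8
```

Applying the same host to every job of a multi-machine cluster keeps all of them on one physical machine, so no traffic has to cross the GRE tunnels.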

Subtasks 1 (0 open — 1 closed)

Related issues 1 (0 open — 1 closed)

Updated by dzedro about 3 years ago

Description updated (diff)

Updated by okurz about 3 years ago

Target version set to Ready
Parent task set to #103962

Hi dzedro, could you help us and extend the ticket description according to the template https://progress.opensuse.org/projects/openqav3/wiki/#Defects, especially the "steps to reproduce"? For example, include the openqa-clone-job commands or openqa-cli calls that you might have used to trigger tests.

Updated by okurz about 3 years ago

Parent task changed from #103962 to #111929

Updated by dzedro about 3 years ago

Description updated (diff)

Updated by okurz almost 3 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Target version changed from Ready to future

@dzedro thanks for the detailed extension of the ticket. I agree with the workaround. For now I suggest following it, as we don't have the capacity to look into the specific problem; all affected tests should be configured accordingly.

Updated by livdywan almost 2 years ago

Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added

Updated by livdywan over 1 year ago

Re-reading http://open.qa/docs/#_gre_tunnels just now:

As long as the SUT has access to external network, there should be a non-zero packet count in the forward chain between the br1 and external interface.

sudo salt -C 'worker*' cmd.run 'iptables --list --verbose | grep FORWARD'
worker29.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker39.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker40.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker30.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker37.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker38.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker-arm1.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker-arm2.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker3.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 34524 packets, 40M bytes)
worker2.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker5.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 7461 packets, 3820K bytes)
worker8.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 8591 packets, 12M bytes)
worker10.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)

To me it reads like GRE tunnels won't work on machines showing 0 packets. Maybe this is something that could be validated as part of deployments in salt?
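
A minimal sketch of such a validation, assuming the salt output has the shape shown above (a host line ending in a colon followed by an indented FORWARD policy line). The function name hosts_without_forwarded_packets is hypothetical, and the check only mirrors the heuristic from this comment: the policy counter covers packets that reached the chain's default policy, so a zero there is a hint rather than proof of a broken tunnel.

```python
import re

# Matches the FORWARD policy line printed by `iptables --list --verbose`.
# iptables may abbreviate large counters with a K/M/G suffix.
FORWARD_RE = re.compile(r"Chain FORWARD \(policy \w+ (\d+)[KMG]? packets")


def hosts_without_forwarded_packets(salt_output: str) -> list[str]:
    """Return the hosts whose FORWARD policy counter shows 0 packets."""
    idle = []
    host = None
    for line in salt_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Host lines are unindented and end with a colon.
        if not line.startswith(" ") and stripped.endswith(":"):
            host = stripped[:-1]
            continue
        match = FORWARD_RE.search(stripped)
        if match and host is not None and int(match.group(1)) == 0:
            idle.append(host)
    return idle


sample = """\
worker29.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker3.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 34524 packets, 40M bytes)
worker2.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
"""
print(hosts_without_forwarded_packets(sample))
# prints: ['worker29.oqa.prg2.suse.org', 'worker2.oqa.suse.de']
```

A deployment check could fail or warn when this list is non-empty for hosts that are expected to run multi-machine jobs.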

Updated by okurz 11 months ago

Category set to Feature requests
Status changed from New to Resolved
Assignee set to okurz
Target version changed from future to Ready

With #112001 and other related tickets we were able to improve the situation, prevent the original issues and apply mitigations. We are also able to quickly work around similar problems in the future, e.g. by running multi-machine tests on one host each in case of problems.
