action #111908

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Multimachine failures between multiple physical workers

Added by dzedro over 2 years ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2022-06-03
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Observation

There are "random unexpected" MM failures due to some issue between multiple workers.
Below is a list of the support_server jobs of failed MM HA/SAP clusters from the last two weeks.
When I restarted these jobs on the same openQA worker, they did not fail.

I have the same experience with a local HA/SAP instance: when I use one worker, there are nearly no "random unexpected" failures.
When I use two physical workers, the rate of "random unexpected" failures does increase.

https://openqa.suse.de/tests/8804890#dependencies
https://openqa.suse.de/tests/8804876#dependencies
https://openqa.suse.de/tests/8804944#dependencies
https://openqa.suse.de/tests/8796653#dependencies
https://openqa.suse.de/tests/8806626#dependencies
https://openqa.suse.de/tests/8813734#dependencies
https://openqa.suse.de/tests/8819834#dependencies
https://openqa.suse.de/tests/8818172#dependencies
https://openqa.suse.de/tests/8818165#dependencies
https://openqa.suse.de/tests/8825849#dependencies
https://openqa.suse.de/tests/8842164#dependencies
https://openqa.suse.de/tests/8844261#dependencies
https://openqa.suse.de/tests/8855774#dependencies
https://openqa.suse.de/tests/8856411#dependencies

Steps to reproduce

The failures are random; I could reproduce these failures on a local instance with multiple physical workers.

Problem

I assume it is a network/openvswitch/GRE issue between the servers.
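
If someone wants to check that suspicion, one possible first step (not from the ticket, just a suggestion) is to confirm on each involved worker that the Open vSwitch bridge and its GRE ports are actually present, for example:

    # list Open vSwitch bridges with their ports; the GRE tunnels to the other
    # workers should show up as interfaces of type gre with the peer's remote_ip
    sudo ovs-vsctl show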

Workaround

Run the jobs on one physical worker via WORKER_CLASS, e.g. WORKER_CLASS=qemu_x86_64,tap,openqaworker8
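
As an illustration only (the job ID and worker host below are placeholders taken from the links and the example above, not a verified reproducer), re-triggering a failed job pinned to a single worker host could look like this:

    # clone a failed job within OSD and override WORKER_CLASS so that it is only
    # scheduled on the given physical worker host
    openqa-clone-job --within-instance https://openqa.suse.de 8804890 \
        WORKER_CLASS=qemu_x86_64,tap,openqaworker8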


Subtasks 1 (0 open, 1 closed)

action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Actions #1

Updated by dzedro over 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz over 2 years ago

  • Target version set to Ready
  • Parent task set to #103962

Hi dzedro, could you help us and extend the ticket description according to the template https://progress.opensuse.org/projects/openqav3/wiki/#Defects, especially the "steps to reproduce" section, e.g. with openqa-clone-job commands or openqa-cli calls that you might have used to trigger the tests.
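
For reference, such a reproducer would typically be an openqa-clone-job call as sketched under "Workaround" above, or a scheduling call via openqa-cli, for example (all setting values below are placeholders, not the actual settings used for the affected jobs):

    # schedule a product on OSD; DISTRI/VERSION/FLAVOR/ARCH/BUILD are placeholders
    openqa-cli api --osd -X POST isos \
        DISTRI=sle VERSION=15-SP4 FLAVOR=Online ARCH=x86_64 BUILD=<build>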

Actions #3

Updated by okurz over 2 years ago

  • Parent task changed from #103962 to #111929
Actions #4

Updated by dzedro over 2 years ago

  • Description updated (diff)
Actions #5

Updated by okurz over 2 years ago

  • Project changed from openQA Infrastructure to openQA Project
  • Target version changed from Ready to future

@dzedro thanks for the detailed extension of the ticket. I agree with the workaround. For now I suggest following it, as we don't have the capacity to look into the specific problem, so all affected tests should be configured according to the mentioned workaround.

Actions #6

Updated by livdywan over 1 year ago

  • Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #7

Updated by livdywan about 1 year ago

Re-reading http://open.qa/docs/#_gre_tunnels just now:

As long as the SUT has access to external network, there should be a non-zero packet count in the forward chain between the br1 and external interface.

sudo salt -C 'worker*' cmd.run 'iptables --list --verbose | grep FORWARD'
worker29.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker39.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker40.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker30.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker37.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker38.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker-arm1.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker-arm2.oqa.prg2.suse.org:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker3.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 34524 packets, 40M bytes)
worker2.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker5.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 7461 packets, 3820K bytes)
worker8.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 8591 packets, 12M bytes)
worker10.oqa.suse.de:
    Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)

To me it reads like GRE tunnels won't work on machines showing 0 packets. Maybe this is something that could be validated as part of deployments in salt?
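
A minimal sketch of such a validation, purely as an assumption on my side and building on the command above: let salt warn about any worker whose FORWARD chain has not forwarded a single packet, e.g.

    # the packet counter is the 5th field of the chain header line; per the docs
    # quote above it should be non-zero on a working tap/GRE setup
    sudo salt -C 'worker*' cmd.run \
        "iptables --list FORWARD --verbose | awk 'NR==1 && \$5 == 0 { print \"WARNING: 0 forwarded packets\" }'"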

Actions #8

Updated by okurz 4 months ago

  • Category set to Feature requests
  • Status changed from New to Resolved
  • Assignee set to okurz
  • Target version changed from future to Ready

With #112001 and other related tickets we could improve the situation, prevent the original issues and apply mitigations, and we are now able to quickly work around similar problems in the future, e.g. by running multi-machine tests on a single worker host each in case of problems.
