action #111908

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Multimachine failures between multiple physical workers

Added by dzedro about 2 years ago. Updated 6 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2022-06-03
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

There are "random unexpected" multi-machine (MM) failures caused by some issue between multiple workers.
Below is a list of the support_server jobs of MM HA/SAP jobs that failed in the last two weeks.
I restarted these jobs on the same openQA worker and they did not fail.

I have the same experience with a local HA/SAP instance: when I use one worker, there are nearly no "random unexpected" failures.
When I use two physical workers, the rate of "random unexpected" failures increases.

https://openqa.suse.de/tests/8804890#dependencies
https://openqa.suse.de/tests/8804876#dependencies
https://openqa.suse.de/tests/8804944#dependencies
https://openqa.suse.de/tests/8796653#dependencies
https://openqa.suse.de/tests/8806626#dependencies
https://openqa.suse.de/tests/8813734#dependencies
https://openqa.suse.de/tests/8819834#dependencies
https://openqa.suse.de/tests/8818172#dependencies
https://openqa.suse.de/tests/8818165#dependencies
https://openqa.suse.de/tests/8825849#dependencies
https://openqa.suse.de/tests/8842164#dependencies
https://openqa.suse.de/tests/8844261#dependencies
https://openqa.suse.de/tests/8855774#dependencies
https://openqa.suse.de/tests/8856411#dependencies

Steps to reproduce

The failures are random. I could reproduce them on a local instance with multiple physical workers.

Problem

I assume it is a network/openvswitch/GRE issue between the servers.

Workaround

Run the jobs on one physical worker via WORKER_CLASS, e.g. WORKER_CLASS=qemu_x86_64,tap,openqaworker8
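
The workaround amounts to appending one worker host's class to the job's existing WORKER_CLASS, so that every job of the cluster can only be picked up by that host. A minimal Python sketch of building such a value follows. The helper name pin_worker_class is hypothetical, and it assumes the host exposes its hostname (here openqaworker8) as a worker class, as in the example above.

```python
def pin_worker_class(worker_class: str, host: str) -> str:
    """Return worker_class with the host class appended (hypothetical helper).

    Assumes the worker host advertises its hostname as a worker class.
    """
    # Split the comma-separated value and drop empty entries.
    classes = [c.strip() for c in worker_class.split(",") if c.strip()]
    # Append the host class only if it is not already present.
    if host not in classes:
        classes.append(host)
    return ",".join(classes)


if __name__ == "__main__":
    # Build the setting from the example in the workaround.
    print("WORKER_CLASS=" + pin_worker_class("qemu_x86_64,tap", "openqaworker8"))
```

The same value has to be set on all jobs of the MM cluster (support_server and nodes), otherwise they can still be scheduled on different physical workers.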


Subtasks 1 (0 open, 1 closed)

action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration - Resolved - okurz - 2022-06-03

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry - Resolved - nicksinger - 2023-08-15
