action #110467: Establish reliable tap setup on ow14 - openQA Infrastructure (public) - openSUSE Project Management Tool

open

openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

openQA Project (public) - coordination #109659: [epic] More remote workers

Establish reliable tap setup on ow14

Added by mkittler about 3 years ago. Updated over 2 years ago.

Status: New
Priority: Low
Assignee:
Category:
Target version: QA (public) - future
Start date: 2022-04-29
Due date:
% Done:
Estimated time:
Tags: infra

Description

After noticing connection problems with the two workers located in Prague (ow14 and ow15), the "tap" worker class was removed from them again via salt.

Acceptance criteria

AC1: ow14 can run multi-machine tests without running into connection issues (at least not more than usual)
AC2: The multi-machine setup on ow14 recovers automatically after a temporary connection loss

Suggestions

The tap setup has been reverted by salt because the "tap" worker class has been removed. For further tap testing, the easiest approach is to use local salt/pillar checkouts under /srv and apply them via salt-call --local state.apply after removing the worker from the master, e.g. via salt-key -d openqaworker14.qa.suse.cz. As far as I observed when working on #104970, salt configures everything needed for tap jobs to work, and the configuration persisted after a reboot.
To spawn a parallel job cluster where some jobs run on Nürnberg workers and some on Prague workers, one can modify openqa-clone-job to print the parameters instead of posting them by putting print STDERR join(' ', map { "'$_=$composed_params{$_}'" } sort keys %composed_params); exit 0; in post_jobs within CloneJob.pm. One can then modify the worker classes as needed and post the jobs manually.
One scenario affected by the so far unreliable tap setup is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_ALPHA_mpich_mpi_supportserver&version=15-SP1. It can be cloned as described in the previous point. (The scenario generally works when its jobs run on a mix of Nürnberg and Prague workers. I've tested that with ow3, ow10 and ow15. The issue can only be reproduced when the Prague workers are in some broken state.)
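The manual step of adjusting worker classes before posting could be scripted. The following is a minimal sketch in Python and not part of openqa-clone-job. It assumes that the modified post_jobs printed one line of single-quoted KEY=VALUE tokens as produced by the print statement above, that no value contains a single quote, and that keys carry a per-job suffix such as WORKER_CLASS:1. The suffix form, the example values and the function name override_worker_class are hypothetical.

```python
import shlex


def override_worker_class(printed_params, overrides):
    """Return the printed parameters as KEY=VALUE strings with some values replaced.

    printed_params: one line of single-quoted tokens, e.g.
        "'TEST:1=server' 'WORKER_CLASS:1=qemu_x86_64,tap'"
    overrides: maps a full key (hypothetical form "WORKER_CLASS:1")
        to the value that should be posted instead.
    """
    result = []
    # shlex.split strips the single quotes and keeps each KEY=VALUE together
    for token in shlex.split(printed_params):
        # split on the first "=" only, values may contain further "=" signs
        key, _, value = token.partition("=")
        if key in overrides:
            value = overrides[key]
        result.append(f"{key}={value}")
    return result
```

The returned strings can be reviewed and then passed to whichever client is used to post the jobs manually.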
I suppose one needs to research how gre tunnels generally behave on connection problems and how one can tweak the behavior.
Related threads on Slack: https://suse.slack.com/archives/C02D16TCP99/p1651131920442009 (and before that there was https://suse.slack.com/archives/C02CANHLANP/p1651052375642909)
- Note that the threads also stray into other issues, like broken asset downloads on arm workers, which are completely distinct.
- We have also seen connection issues on workers other than ow14 and ow15, but far fewer.
Try to create distinct "bubbles" of "tap"-enabled workers which are not interconnected. For example, the workers in Nürnberg and Prague would be in different "bubbles" and would therefore not be expected to reach each other via gre tunnels, as each bubble uses its own gre setup.
- The salt states should be able to establish distinct gre setups based on some configuration.
- The openQA scheduler would need to be aware that a certain set of jobs must only run within a certain "bubble". We could use a distinct WORKER_CLASS, but that setup would be rather static (we don't care in which specific bubble a parallel cluster runs, just that it runs entirely within one bubble). openQA already has the concept of "vlan"s, by which jobs can be grouped. Maybe that is helpful in this regard.
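To illustrate the constraint the scheduler would have to enforce, here is a minimal sketch in Python. It is not openQA scheduler code; the function name assign_cluster, the worker names and the bubble labels are all hypothetical. It only shows the idea that a parallel cluster may land in any bubble, as long as all of its jobs land in the same one.

```python
from collections import defaultdict


def assign_cluster(job_ids, free_workers):
    """Assign every job of one parallel cluster to workers of a single bubble.

    job_ids: list of job IDs forming one parallel cluster.
    free_workers: list of (worker_name, bubble) tuples for idle tap workers.
    Returns a dict mapping job ID to worker name, or None if no single
    bubble currently has enough free workers for the whole cluster.
    """
    by_bubble = defaultdict(list)
    for name, bubble in free_workers:
        by_bubble[bubble].append(name)
    # Any bubble is acceptable; iterate in sorted order only for determinism
    for bubble in sorted(by_bubble):
        workers = by_bubble[bubble]
        if len(workers) >= len(job_ids):
            # zip stops after the last job, surplus workers stay free
            return dict(zip(job_ids, workers))
    # Never split a cluster across bubbles; the caller has to retry later
    return None
```

In contrast to a fixed per-bubble WORKER_CLASS, the bubble is chosen at assignment time here, so a cluster is not tied to one location in advance.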