action #110467
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openQA Project - coordination #109659: [epic] More remote workers
Establish reliable tap setup on ow14
Start date: 2022-04-29
Due date:
% Done: 0%
Estimated time:
Tags:
Description
After noticing connection problems with the two workers located in Prague (ow14 and ow15), the tap worker class has been removed again via salt.
Acceptance criteria
- AC1: ow14 can run multi-machine tests without running into connection issues (at least not more than usual)
- AC2: The multi-machine setup on ow14 recovers automatically after a temporary connection loss
Suggestions
- The tap setup has been reverted by salt as the "tap" worker class has been removed. If one wanted to do further tap testing, the easiest solution is to use local salt/pillar checkouts under `/srv` and apply them via `salt-call --local state.apply` after removing the worker from the master via e.g. `salt-key -d openqaworker14.qa.suse.cz` (see the first sketch after this list). As far as I observed when working on #104970, salt configures everything needed for tap jobs to work and the configuration persisted after a reboot.
- To spawn a parallel job cluster where some jobs run on Nürnberg workers and some on Prague workers, one can modify openqa-clone-job to print the parameters instead of posting them by putting `print STDERR join(' ', map { "'$_=$composed_params{$_}'" } sort keys %composed_params); exit 0;` in `post_jobs` within `CloneJob.pm`. One can then modify the worker classes as needed and post the jobs manually (see the second sketch after this list).
- One scenario affected by the so far unreliable tap setup is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_ALPHA_mpich_mpi_supportserver&version=15-SP1, so it can be cloned as mentioned in the previous point. (The scenario generally works when the jobs run on a mix of Nürnberg and Prague workers; I have tested that with ow3, ow10 and ow15. The issue can only be reproduced if the Prague workers are in some broken state.)
- I suppose one needs to research how GRE tunnels generally behave on connection problems and how one can tweak that behavior (see the third sketch after this list for commands to inspect the current tunnel setup).
- Related threads on Slack: https://suse.slack.com/archives/C02D16TCP99/p1651131920442009 (and before that there was https://suse.slack.com/archives/C02CANHLANP/p1651052375642909)
- Note that the threads also stray into other issues, like broken asset downloads on arm workers, which are completely distinct.
- We have also seen connection issues on workers other than ow14 and ow15, but far less often.
- Try to create distinct "bubbles" of "tap"-enabled workers which are not interconnected. The workers in Nürnberg and Prague would then be in different "bubbles" and would therefore not be expected to reach each other via GRE tunnels as they use a distinct GRE setup (see the last sketch after this list).
  - The salt states should be able to establish distinct GRE setups based on some configuration.
  - The openQA scheduler would need to be aware that a certain set of jobs must only run within a certain "bubble". We could use distinct `WORKER_CLASS` values, but that setup would be rather static (we don't care in which specific bubble a parallel cluster runs, just that it only runs within one bubble). openQA already has the concept of "vlan"s by which jobs can be grouped; maybe that is helpful in this regard.
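
A minimal sketch of the local salt workflow from the first suggestion, assuming the states and pillars come from the usual os-autoinst repositories (the repository URLs and the exact checkout layout under /srv are assumptions; only the salt-key/salt-call commands and the /srv location are from the points above):

```
# On the salt master: stop managing the worker so salt does not revert the setup again
salt-key -d openqaworker14.qa.suse.cz

# On ow14: place local checkouts of the states/pillars under /srv
# (repository URLs and target paths are assumptions, adjust to the local layout)
git clone https://github.com/os-autoinst/salt-states-openqa /srv/salt
git clone https://github.com/os-autoinst/salt-pillars-openqa /srv/pillar

# Re-add the "tap" worker class in the local pillar data, then apply everything locally
salt-call --local state.apply
```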
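A rough usage sketch for posting a cloned cluster manually after applying the print/exit hack in `post_jobs`; the file name and the editing step are hypothetical, the hack's output may be mixed with other diagnostics on STDERR, and depending on the openqa-clone-job version the dumped parameters may need to be split per job:

```
# Dump the composed parameters of the affected scenario instead of posting them
# (the hack prints them to STDERR and exits, so redirect STDERR to a file)
openqa-clone-job --within-instance https://openqa.suse.de <job_id> 2> params.txt

# Edit params.txt to adjust the WORKER_CLASS values as needed, then post manually;
# xargs splits the quoted 'KEY=value' pairs back into individual arguments
xargs -a params.txt openqa-cli api --host https://openqa.suse.de -X POST jobs
```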
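To research the GRE tunnel behavior it helps to look at the Open vSwitch configuration on the tap workers. A sketch, assuming the standard openQA multi-machine setup with an Open vSwitch bridge, GRE ports and the os-autoinst-openvswitch service (service and placeholder names are assumptions based on that standard setup, not taken from this ticket):

```
# Show the bridge and its GRE ports
ovs-vsctl show

# List the tunnel interfaces with their types and remote endpoints
ovs-vsctl list interface | grep -E 'name|type|options'

# Check basic reachability of a peer worker and look for tunnel-related errors
ping -c 3 <peer_worker>
journalctl -u openvswitch -u os-autoinst-openvswitch --since today
```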
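A hypothetical sketch of how location-specific worker classes for the "bubble" idea could look; the class names tap_prg/tap_nue are invented for illustration, and in reality /etc/openqa/workers.ini is managed by the salt states:

```
# /etc/openqa/workers.ini fragment on a Prague worker such as ow14:
#
#   [global]
#   WORKER_CLASS = qemu_x86_64,tap,tap_prg
#
# ...and on a Nürnberg worker:
#
#   [global]
#   WORKER_CLASS = qemu_x86_64,tap,tap_nue
#
# After changing the classes, restart the worker slots (adjust the slot range):
sudo systemctl restart openqa-worker@{1..10}

# Multi-machine job templates would then have to request tap_prg or tap_nue
# explicitly, which is exactly the static aspect mentioned as a drawback above.
```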