action #110467

Updated by mkittler 3 months ago

After noticing connection problems with these two workers located in Prague, the tap worker class has been removed again via salt.

## Acceptance criteria
* **AC1**: ow14 can run multi-machine tests without running into connection issues (at least not more than usual)
* **AC2**: The multi-machine setup on ow14 recovers automatically after a temporary connection loss

## Suggestions
* The tap setup has been reverted by salt as the "tap" worker class has been removed. If one wanted to do further tap testing, the easiest solution is to use local salt/pillar checkouts under `/srv` and apply them via `salt-call --local state.apply` after removing the worker from the master via e.g. `salt-key -d openqaworker14.qa.suse.cz`. As far as I observed when working on #104970, salt will configure everything needed for tap jobs to work and the configuration persisted after a reboot.
* To spawn a parallel job cluster where some jobs run on Nürnberg workers and some on Prague workers, one can modify openqa-clone-job to print the parameters instead of posting them by putting `print STDERR join(' ', map { "'$_=$composed_params{$_ }'" } sort keys %composed_params); exit 0;` in `post_jobs` within `CloneJob.pm`. One can then modify the worker classes as needed and post the jobs manually.
* One scenario affected by the so far unreliable tap setup is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_ALPHA_mpich_mpi_supportserver&version=15-SP1. It can be cloned as described in the previous point. (The scenario generally works when its jobs run on a mix of Nürnberg and Prague workers; I've tested that with ow3, ow10 and ow15. The issue can only be reproduced if the Prague workers are in some broken state.)
* I suppose one needs to research how gre tunnels generally behave on connection problems and how one can tweak the behavior.
* Related threads on slack: https://suse.slack.com/archives/C02D16TCP99/p1651131920442009 (and before that there was https://suse.slack.com/archives/C02CANHLANP/p1651052375642909)
* Note that the threads also stray into other issues, such as broken asset downloads on arm workers, which are completely distinct issues.
* We have also seen connection issues on workers other than ow14 and ow15, but far fewer.
* Try to create distinct "bubbles" of "tap"-enabled workers which are *not* interconnected. So e.g. the workers in Nürnberg and Prague would be in different "bubbles" and would therefore not be expected to be able to reach each other via gre tunnels as they use a distinct gre setup.
* The salt states should be able to establish distinct gre setups based on *some* configuration.
* The openQA scheduler would need to be aware that a certain set of jobs must only run within a certain "bubble". We could use a distinct `WORKER_CLASS` but that setup would be rather static (we don't care in which specific bubble a parallel cluster runs, just that it only runs within *one* bubble). openQA already has the concept of "vlan"s which jobs can be grouped by. *Maybe* that is helpful in this regard.
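
The manual step of modifying the worker classes in the printed clone parameters, as suggested above, could be scripted roughly like this. This is only a sketch under assumptions: it assumes the printed line has the `'KEY=VALUE' 'KEY=VALUE' ...` shape produced by the `print STDERR` snippet, and that per-job keys carry a `:<job id>` suffix (e.g. `WORKER_CLASS:101`). The job ids, test names and worker class values below are made up for illustration, and the helper names are hypothetical, not part of openQA.

```python
import shlex


def parse_params(printed: str) -> dict[str, str]:
    """Parse a line of single-quoted 'KEY=VALUE' tokens into a dict."""
    params = {}
    for token in shlex.split(printed):
        # Split on the first '=' only; values may contain '=' themselves.
        key, _, value = token.partition("=")
        params[key] = value
    return params


def set_worker_class(params: dict[str, str], overrides: dict[str, str]) -> dict[str, str]:
    """Return a copy with WORKER_CLASS replaced for the given job suffixes.

    `overrides` maps the assumed ':<job id>' suffix to the new worker class.
    """
    result = dict(params)
    for key in params:
        name, _, suffix = key.partition(":")
        if name == "WORKER_CLASS" and suffix in overrides:
            result[key] = overrides[suffix]
    return result


def format_params(params: dict[str, str]) -> str:
    """Re-emit the parameters as shell-safe KEY=VALUE arguments."""
    return " ".join(shlex.quote(f"{k}={v}") for k, v in sorted(params.items()))


# Illustrative input only; real output of the modified clone script will differ.
printed = (
    "'TEST:101=supportserver' 'TEST:102=client' "
    "'WORKER_CLASS:101=qemu_x86_64,tap' 'WORKER_CLASS:102=qemu_x86_64,tap'"
)
params = parse_params(printed)
# Hypothetical worker classes pinning one job per location.
rewritten = set_worker_class(
    params, {"101": "nuernberg-worker-class", "102": "prague-worker-class"}
)
print(format_params(rewritten))
```

The rewritten arguments can then be pasted into whatever command is used to post the jobs manually; the sketch deliberately stops short of posting anything itself.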
