action #110467

open

openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

openQA Project (public) - coordination #109659: [epic] More remote workers

Establish reliable tap setup on ow14

Added by mkittler over 2 years ago. Updated almost 2 years ago.

Status: New
Priority: Low
Assignee: -
Category: -
Target version:
Start date: 2022-04-29
Due date:
% Done: 0%
Estimated time:
Tags:

Description

After noticing connection problems with ow14 and ow15, the two workers located in Prague, the tap worker class has been removed again via salt.

Acceptance criteria

  • AC1: ow14 can run multi-machine tests without running into connection issues (at least not more than usual)
  • AC2: The multi-machine setup on ow14 recovers automatically after a temporary connection loss

Suggestions

  • The tap setup has been reverted by salt because the "tap" worker class has been removed. For further tap testing, the easiest solution is to use local salt/pillar checkouts under /srv and apply them via salt-call --local state.apply, after removing the worker from the master via e.g. salt-key -d openqaworker14.qa.suse.cz. As far as I observed when working on #104970, salt configures everything needed for tap jobs to work, and the configuration persisted after a reboot.
  • To spawn a parallel job cluster where some jobs are running on Nürnberg workers and some on Prague workers one can modify openqa-clone-job to print the parameters instead of posting them by putting print STDERR join(' ', map { "'$_=$composed_params{$_ }'" } sort keys %composed_params); exit 0; in post_jobs within CloneJob.pm. One can then modify the worker classes as needed and post the jobs manually.
  • One scenario affected by the so far unreliable tap setup is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_ALPHA_mpich_mpi_supportserver&version=15-SP1. It can be cloned as mentioned in the previous point. (The scenario generally works when jobs run on a mix of Nürnberg and Prague workers. I've tested that with ow3, ow10 and ow15. The issue can only be reproduced if the Prague workers are in some broken state.)
  • I suppose one needs to research how gre tunnels generally behave on connection problems and how one can tweak the behavior.
  • Related threads on slack: https://suse.slack.com/archives/C02D16TCP99/p1651131920442009 (and before that there was https://suse.slack.com/archives/C02CANHLANP/p1651052375642909)
    • Note that the threads also stray into other issues, like broken asset downloads on arm workers, which are completely distinct issues.
    • We have also seen connection issues on workers other than ow14 and ow15, but far fewer.
  • Try to create distinct "bubbles" of "tap"-enabled workers which are not interconnected. So e.g. the workers in Nürnberg and Prague would be in different "bubbles" and would therefore not be expected to be able to reach each other via gre tunnels as they use a distinct gre setup.
    • The salt states should be able to establish distinct gre setups based on some configuration.
    • The openQA scheduler would need to be aware that a certain set of jobs must only run within a certain "bubble". We could use a distinct WORKER_CLASS, but that setup would be rather static (we don't care in which specific bubble a parallel cluster runs, just that it only runs within one bubble). openQA already has the concept of "vlan"s which jobs can be grouped by. Maybe that is helpful in this regard.
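
The manual step from the openqa-clone-job suggestion above (changing the worker classes in the printed parameters before posting) could be scripted roughly as follows. This is only a sketch: it assumes the patched openqa-clone-job prints the settings as space-separated, single-quoted KEY=VALUE items, as the print statement above does. The function name, the sample settings and the class names qemu_x86_64 and tap_prague are made up for illustration.

```python
import shlex


def override_worker_class(printed_params, worker_class):
    """Return KEY=VALUE settings with WORKER_CLASS replaced (or added)."""
    settings = {}
    # shlex.split strips the single quotes added by the print statement.
    for item in shlex.split(printed_params):
        # Split at the first '=' only, so values may contain '=' themselves.
        key, _, value = item.partition("=")
        settings[key] = value
    settings["WORKER_CLASS"] = worker_class
    return [f"{key}={value}" for key, value in sorted(settings.items())]


if __name__ == "__main__":
    # Hypothetical output of the patched openqa-clone-job (not real job data).
    printed = "'DISTRI=sle' 'TEST=hpc_example' 'WORKER_CLASS=qemu_x86_64'"
    for setting in override_worker_class(printed, "qemu_x86_64,tap_prague"):
        print(setting)
```

The returned items are the settings one would then post manually for each job of the cluster.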
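
To illustrate the "bubble" constraint from the last suggestion: a parallel cluster may run in any bubble, as long as all of its jobs end up in the same one. The sketch below is not openQA scheduler code. The mapping of free workers to bubbles, the bubble names and the function name are hypothetical and only serve to show the rule.

```python
def pick_bubble(cluster_size, free_workers_by_bubble):
    """Return (bubble, workers) for the first bubble that can host the
    whole cluster, or None if no single bubble has enough free workers."""
    for bubble, workers in sorted(free_workers_by_bubble.items()):
        if len(workers) >= cluster_size:
            # All jobs of the cluster go to workers of this one bubble.
            return bubble, sorted(workers)[:cluster_size]
    return None


if __name__ == "__main__":
    # Illustrative assignment of workers to bubbles (an assumption).
    free = {"nuernberg": ["ow3", "ow10"], "prague": ["ow14"]}
    print(pick_bubble(2, free))  # fits into a single bubble
    print(pick_bubble(3, free))  # no single bubble is large enough
```

The point of the design is that the cluster is never split across bubbles, even if enough workers would be free in total.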

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M | Resolved | mkittler | 2022-01-17

Actions #1

Updated by mkittler over 2 years ago

  • Tracker changed from coordination to action
Actions #2

Updated by mkittler over 2 years ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Category deleted (Feature requests)
Actions #3

Updated by mkittler over 2 years ago

  • Related to action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M added
Actions #4

Updated by okurz over 2 years ago

  • Target version set to Ready
Actions #5

Updated by okurz over 2 years ago

  • Copied to action #110515: Command export feature for openqa-clone-job size:M added
Actions #6

Updated by okurz over 2 years ago

  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #7

Updated by mkittler over 2 years ago

  • Copied to deleted (action #110515: Command export feature for openqa-clone-job size:M)
Actions #8

Updated by mkittler over 2 years ago

  • Status changed from Blocked to New

The ticket is no longer blocked by #110515.

Actions #9

Updated by okurz over 2 years ago

  • Subject changed from Establish reliable tap setup on ow14+15 to Establish reliable tap setup on ow14
  • Assignee deleted (okurz)
Actions #10

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #11

Updated by mkittler over 2 years ago

  • Description updated (diff)

Added suggestion about distinct "bubbles" of "tap"-enabled workers.

Actions #12

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #13

Updated by okurz over 2 years ago

  • Priority changed from Normal to Low
  • Target version changed from Ready to future
Actions #14

Updated by okurz almost 2 years ago

  • Tags set to infra