action #157606
open
openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Prevent missing gre tunnel connections in our salt states due to misconfiguration
0%
Description
Motivation
In #157534 multi-machine tests failed because a worker with the "tap" worker class ended up with no GRE tunnel connections to the other hosts participating in cluster tests. The cause was a mistake on my side: the worker was configured with a differing "location-" worker class. That has been fixed in the meantime, but the GRE peer computation in our salt states, https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/_modules/gre_peers.py?ref_type=heads, happily put worker40 into a separate "cluster" without any complaint. We should improve the computation so that it handles such misconfiguration better.
Acceptance criteria
- AC1: /etc/wicked/scripts/gre_tunnel_preup.sh on OSD workers is ensured to have N:N connections for all "tap" connected workers
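
One way to check AC1 is to collect, for every "tap" worker, the set of peers it has a tunnel to and report every pair that is missing. The following is only a minimal sketch: the `peers` mapping, the `missing_links` helper and the host names are hypothetical and not taken from `gre_peers.py` or from the generated `gre_tunnel_preup.sh`.

```python
def missing_links(peers):
    """Return sorted (host, other) pairs where `host` has no tunnel to `other`.

    `peers` maps each host name to the set of hosts it has a tunnel to.
    An empty result means every host is connected to every other host (N:N).
    """
    hosts = set(peers)
    missing = []
    for host in sorted(hosts):
        for other in sorted(hosts - {host}):
            if other not in peers.get(host, set()):
                missing.append((host, other))
    return missing


# Hypothetical example data: hostB lacks a tunnel to hostC.
example_peers = {
    "hostA": {"hostB", "hostC"},
    "hostB": {"hostA"},
    "hostC": {"hostA", "hostB"},
}

for host, other in missing_links(example_peers):
    print(f"{host} has no GRE tunnel to {other}")
```

How the peer sets are obtained (parsing the generated script or reusing the salt module output) is left open here.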
Suggestions
- Provide a summary when generating the files, i.e. do not rely on people checking the files by hand
- Issue errors or warnings in cases like a cluster with only one machine in it
- Take a look at worker29+30+31+32: based on https://netbox.suse.de/dcim/devices/6156/device-bays/ they are all the same and in one chassis, and our workerconf in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls lists all four with the worker class "tap", so they should all be inter-connected. At the time of writing, however, w29 only has a connection to w32+w36 and to no other worker
- Extend the unit tests and investigate how to improve them
- https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/52bdbd8ab4537db362a55ecd93f5bc97be171bf9
- Check how salt is configured; maybe we are relying on old data that has not been synced yet?
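
The first two suggestions could be sketched roughly as follows. This assumes that clusters are formed from workers that share a "location-" worker class and also carry "tap"; the actual grouping logic in `gre_peers.py` may differ. The function names and the sample data are hypothetical.

```python
def group_tap_workers(worker_classes):
    """Group hosts that carry the 'tap' class by their 'location-*' classes."""
    clusters = {}
    for host, classes in worker_classes.items():
        if "tap" not in classes:
            continue
        locations = [c for c in classes if c.startswith("location-")]
        # Hosts without any location class are collected under a placeholder.
        for loc in locations or ["<no location>"]:
            clusters.setdefault(loc, set()).add(host)
    return clusters


def summarize(clusters):
    """Return (summary lines, warnings) for the computed clusters."""
    lines, warnings = [], []
    for name in sorted(clusters):
        members = sorted(clusters[name])
        lines.append(f"{name}: {len(members)} host(s): {', '.join(members)}")
        if len(members) < 2:
            warnings.append(
                f"cluster {name!r} has only one tap worker ({members[0]}), "
                "so it gets no GRE peers"
            )
    return lines, warnings


# Hypothetical example data, not the real workerconf.sls content.
example_classes = {
    "hostA": ["tap", "location-x"],
    "hostB": ["tap", "location-x"],
    "hostC": ["tap", "location-y"],  # differing location class
    "hostD": ["location-x"],         # no "tap", ignored
}

summary, warnings = summarize(group_tap_workers(example_classes))
print("\n".join(summary))
for warning in warnings:
    print("WARNING:", warning)
```

Whether a single-machine cluster should be a warning or a hard error that fails the salt state is a design decision this sketch leaves open.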