action #157606
openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Prevent missing gre tunnel connections in our salt states due to misconfiguration
0%
Description
Motivation
In #157534 we encountered multi-machine tests failing because a worker with the "tap" worker class ended up with no GRE tunnel connections to the other hosts participating in cluster tests. The cause was a mistake on my side: I used a differing "location-" worker class, which has been fixed in the meantime. However, the GRE peer computation based on worker classes in our salt states, https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/_modules/gre_peers.py?ref_type=heads, happily put worker40 into a "cluster" of its own. We should improve the computation to handle such cases better.
Acceptance criteria
- AC1: /etc/wicked/scripts/gre_tunnel_preup.sh on OSD workers is ensured to have N:N connections for all "tap" connected workers
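One way AC1 could be verified is sketched below, under assumptions: the function name `missing_links` and the input format (a mapping from each "tap" worker to the set of peers found in its generated script) are hypothetical and not taken from gre_peers.py. The host names in the example are placeholders, not real OSD data.

```python
def missing_links(peers_by_host):
    """Return sorted (host, peer) pairs that are missing from a full N:N mesh.

    peers_by_host is a hypothetical mapping: host name -> set of peer host
    names that host has a tunnel to. How this mapping is obtained (e.g. by
    parsing the generated scripts) is out of scope for this sketch.
    """
    hosts = set(peers_by_host)
    missing = []
    for host in sorted(hosts):
        # In a full mesh every host is connected to every other host.
        expected = hosts - {host}
        for peer in sorted(expected - set(peers_by_host[host])):
            missing.append((host, peer))
    return missing


# Placeholder example: w1 lacks a connection to w3.
example = {
    "w1": {"w2"},
    "w2": {"w1", "w3"},
    "w3": {"w1", "w2"},
}
problems = missing_links(example)
```

An empty result would mean all listed workers are fully interconnected. Peers outside the listed set are ignored by this check.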
Suggestions
- Provide a summary when generating the files i.e. not relying on people to check files by hand
- Issue errors or warnings in cases like a cluster with only 1 machine in it
- Take a look at worker29+30+31+32 based on https://netbox.suse.de/dcim/devices/6156/device-bays/: they are all the same, in one chassis, and our workerconf in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls lists all four with worker class "tap", so they should all be interconnected. At the time of writing, w29 only has connections to w32+w36 and to no other worker
- Extend the unit tests and investigate how to improve them
- https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/52bdbd8ab4537db362a55ecd93f5bc97be171bf9
- Check how salt is configured; maybe we are relying on old data that was not synced yet?
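The first two suggestions could look roughly like the following sketch. It assumes that clusters are formed from the first "location-" worker class of each "tap" worker, which may not match what gre_peers.py actually does. The function name `summarize_clusters` and the input format (host name mapped to a list of worker classes) are hypothetical.

```python
from collections import defaultdict


def summarize_clusters(workerconf):
    """Group "tap" workers by location class and report suspicious clusters.

    workerconf is a hypothetical mapping: host name -> list of worker classes.
    Returns (summary_lines, warnings).
    """
    clusters = defaultdict(list)
    for host, classes in workerconf.items():
        if "tap" not in classes:
            continue  # only "tap" workers take part in GRE tunnels
        locations = [c for c in classes if c.startswith("location-")]
        # Assumption: the first location class decides the cluster.
        key = locations[0] if locations else "(no location class)"
        clusters[key].append(host)

    summary, warnings = [], []
    for key in sorted(clusters):
        members = sorted(clusters[key])
        summary.append(f"{key}: {len(members)} worker(s): {', '.join(members)}")
        if len(members) < 2:
            # A single-member cluster ends up without any GRE peers.
            warnings.append(f"cluster {key} only contains {members[0]}")
    return summary, warnings


# Placeholder data, not the real workerconf.sls content.
example_conf = {
    "a": ["tap", "location-x"],
    "b": ["tap", "location-x"],
    "c": ["tap", "location-y"],
    "d": ["location-x"],
}
summary, warnings = summarize_clusters(example_conf)
```

Printing the summary during file generation and failing the salt CI pipeline on any warning would avoid relying on people to check the generated files by hand.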
Updated by okurz about 1 year ago
- Copied from action #157534: Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines added
Updated by okurz 10 months ago
- Related to action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry added
Updated by okurz 8 months ago
- Related to action #160826: Optimize gre_tunnel_preup.sh generation jinja template size:S added
Updated by okurz 8 months ago
- Related to action #162734: Simple script detecting gre_tunnel_preup.sh with only empty remote_ip= statements during salt CI pipelines size:M added