coordination #161735
openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
[epic] Better error detection on GRE tunnel misconfiguration
Description
Motivation
Acceptance criteria
- AC1: The backend and/or test code can point better to likely causes of an error
- AC2: Similar future issues are prevented with better CI checks
Suggestions
- Monitor contents of the mine to better understand when it breaks and why
- Implement sanity checks on the worker to check for proper peer configuration (see the sketch after this list)
- Change the MTU-size check in the test distribution to make the error message clearer in case not even the smallest MTU size works (e.g. "The network connection within the SUT does not work at all." and maybe, for tap-based tests, "Check the MM setup, e.g. GRE tunnels")
- Get rid of the mine completely for the "workername" <-> IP lookup
  - Problem: Currently the pillar data does not contain the FQDN of the other workers.
  - We already have "## FQDN: …" in many cases, so it would be easy to make that a mandatory key for all workers, at least those where we expect the tap worker class to be usable
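As a rough illustration of the worker-side sanity check mentioned above, something along these lines could be run on each worker (a sketch only; path and the "empty" line format are taken from the comments below on this ticket, the check itself is not an agreed implementation):

# sketch: warn if the generated GRE preup script contains peers without a remote_ip
preup=/etc/wicked/scripts/gre_tunnel_preup.sh
if grep -q 'remote_ip= ' "$preup"; then
    echo "WARNING: $preup contains GRE peers without remote_ip, salt mine possibly incomplete" >&2
    grep 'remote_ip= ' "$preup" >&2
fi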
WARNING
- Do not touch the key of a worker in workerconf.sls - a lot of other states depend on it!
Updated by okurz 6 months ago
- Copied from action #161381: multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S added
Updated by okurz 6 months ago
- Tags changed from reactive work, infra to infra
- Tracker changed from action to coordination
- Subject changed from multi-machine test network issues reported 2024-06-03 - improvements to [epic] Better error detection on GRE tunnel misconfiguration
- Assignee deleted (okurz)
- Target version changed from Ready to future
- Parent task set to #112862
Updated by nicksinger 5 months ago
Regarding the mine contents: I caught our openQA test failing to check the configured GRE tunnels. Checking on worker33 I can see that /etc/wicked/scripts/gre_tunnel_preup.sh is indeed "empty" again, only containing "options:remote_ip= # worker36 (offline at point of file generation)" lines.
Checking the mine from OSD on that and other workers with salt 'worker33.oqa.prg2.suse.org' mine.get 'host:worker33' ip4_interfaces "grain" I can see that the general mine structure is currently populated:
worker33.oqa.prg2.suse.org:
----------
worker33.oqa.prg2.suse.org:
----------
br1:
eth0:
eth1:
lo:
- 127.0.0.1
ovs-system:
tap0:
tap1:
tap10:
tap100:
[…]
(The same output happens from different hosts querying other hosts).
Basically every entry except localhost contains no values. That means that the condition in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L122 is never entered, and therefore we generate a wrong config. I'm not sure yet why the mine currently looks like this. Calling salt 'worker33.oqa.prg2.suse.org' mine.update doesn't seem to change anything.
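For the "monitor contents of the mine" suggestion from the description, a very crude detection from OSD could simply query the mine for all workers and look for interface entries without any addresses (a sketch only; targeting via the roles grain as used further below in this ticket):

# ask every worker for the mine data of all minions
salt -C 'G@roles:worker' mine.get '*' ip4_interfaces
# if entries other than "lo" list interfaces like br1/eth0 without any address,
# the mine is stale again (as in the output above)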
Updated by nicksinger 5 months ago
Restarting the salt minion with systemctl restart salt-minion apparently fixed the mine contents, not only on worker33 but also on all other workers. That means that for some reason our minion stopped populating the correct values into the mine. I am trying with another host now to further narrow down the problem.
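A quick way to verify the effect of such a restart (sketch, reusing the query from the previous comment):

# on the affected worker
systemctl restart salt-minion
# from OSD: the mine entry for that worker should now contain addresses again
salt 'worker33.oqa.prg2.suse.org' mine.get 'host:worker33' ip4_interfaces "grain"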
Updated by nicksinger 5 months ago
While checking worker32 I found:
worker32:/etc/salt # journalctl -n 100 -u salt-minion | cat -
Jun 14 13:26:14 worker32 salt-minion[3804]: [WARNING ] cmd.run args: ['command -v auto-update 2>/dev/null && auto-update']
Jun 14 13:28:22 worker32 salt-minion[3804]: [WARNING ] remote: "worker-arm1" found in workerconf.sls but not in salt mine, host currently offline?
So I checked the package logs to see if anything salt-related was installed:
worker32:/var/log/zypp # cat history | grep 2024-06-14 | cut -d "|" -f 3 | grep -v "^#" | sort | uniq
atk-lang
chrony
chrony-pool-openSUSE
cups
cups-client
cups-config
gcr3-lang
gcr3-ssh-askpass
gio-branding-openSUSE
glib2-tools
iputils
kernel-default
kernel-default-extra
kernel-default-optional
libatk-1_0-0
libcups2
libcupscgi1
libcupsimage2
libcupsmime1
libcupsppdc1
libgck-1-0
libgcr-3-1
libgio-2_0-0
libjitterentropy3
liblept5
libpoppler-cpp0
libpoppler126
libsemanage-conf
libsemanage1
libsemanage2
libsepol2
libtiff5
libunbound8
libvpl2
nmap
openQA-client
openQA-common
openQA-worker
openSUSE-2024-154
openSUSE-SLE-15.5-2024-1980
openSUSE-SLE-15.5-2024-1991
openSUSE-SLE-15.5-2024-1994
openSUSE-SLE-15.5-2024-2003
openSUSE-SLE-15.5-2024-2007
openSUSE-SLE-15.5-2024-2022
openSUSE-SLE-15.5-2024-2024
openSUSE-SLE-15.5-2024-2028
openSUSE-SLE-15.6-2024-1950
openSUSE-SLE-15.6-2024-2022
openSUSE-SLE-15.6-2024-2024
os-autoinst
os-autoinst-devel
os-autoinst-distri-opensuse-deps
os-autoinst-openvswitch
os-autoinst-swtpm
poppler-tools
python3-Pillow
python3-Pillow-tk
root@worker32
shadow
unbound-anchor
wsdd
yast2-country
yast2-country-data
Nothing really related to salt, but iputils looks a little bit suspicious to me. This might just be a coincidence, though; I don't know.
As the mine is basically just a predefined grain published by each minion (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls?ref_type=heads#L3-6), I checked whether the minion grain itself contains the correct content:
worker32:/var/log/zypp # salt-call grains.item ip4_interfaces
local:
----------
ip4_interfaces:
----------
br1:
- 10.0.2.2
erspan0:
eth0:
- 10.145.10.5
eth1:
gre0:
gretap0:
lo:
- 127.0.0.1
ovs-system:
tap0:
tap1:
And it does. So for some reason the publishing of the grain into the mine goes wrong. Using salt-call mine.update on the worker itself fixes the mine content on all other machines. From OSD, salt '*' cmd.run 'salt-call mine.update' can be used to fix it on all connected machines.
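For context, the mine entry is defined via the usual Salt mine_functions pillar mechanism; such a definition looks roughly like this (generic Salt syntax for illustration only, not the actual contents of the linked mine.sls):

mine_functions:
  ip4_interfaces:
    mine_function: grains.get
    key: ip4_interfaces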
Afterwards the contents of gre_tunnel_preup.sh can be corrected with salt -C 'G@roles:worker' state.sls_id /etc/wicked/scripts/gre_tunnel_preup.sh openqa.openvswitch. That makes our openQA job pass again as well: https://openqa.suse.de/tests/14619330#dependencies
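For reference, the full recovery sequence from OSD in the order used above (same commands as in this comment):

salt '*' cmd.run 'salt-call mine.update'
salt -C 'G@roles:worker' state.sls_id /etc/wicked/scripts/gre_tunnel_preup.sh openqa.openvswitch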