coordination #161735: [epic] Better error detection on GRE tunnel misconfiguration - openQA Infrastructure (public) - openSUSE Project Management Tool

coordination #161735

open

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

[epic] Better error detection on GRE tunnel misconfiguration

Added by okurz 12 months ago. Updated 2 months ago.

Status:

Blocked

Priority:

Low

Assignee:

okurz

Category:

Feature requests

Target version:

QA (public) - future

Start date:

2024-06-21

Due date:

% Done:

50%

Estimated time:

(Total: 0.00 h)

Tags:

infra

Description

Motivation

See #160646 and #161381

Acceptance criteria

AC1: The backend and/or test code can point better to likely causes of an error
AC2: Similar future issues are prevented with better CI checks

Suggestions

Monitor contents of the mine to better understand when it breaks and why
Implement sanity checks on the worker to check for proper peer configuration
Change the MTU-size check in the test distribution to make the error message clearer in case not even the smallest MTU size works (e.g. "The network connection within the SUT does not work at all." and maybe for tap-based tests "Check the MM-setup, e.g. GRE tunnels")
Get rid of the mine completely for "workername" <-> IP lookup
- Problem: Currently the pillar-data does not contain the FQDN of the other workers.
- We already have "## FQDN: …" in many cases so it would be easy to make that a mandatory key for all, at least the ones where we expect that the tap class should be usable
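As an illustration of the worker-side sanity check suggested above, here is a minimal sketch that scans the text of the generated tunnel script for peers without a remote IP. It assumes that each tunnel line carries options:remote_ip=<address> with an optional trailing "# <worker>" comment, as in the lines quoted in this ticket; the function name find_empty_peers is hypothetical and not part of salt-states-openqa.

```python
import re

# Assumption: a tunnel line contains "options:remote_ip=<address>" and may end
# with a "# <worker name>" comment. An empty address means the peer is missing.
REMOTE_IP = re.compile(r"options:remote_ip=([^\s#]*)(?:\s*#\s*(.*))?")


def find_empty_peers(script_text):
    """Return (line_number, comment) for every tunnel line without a remote IP."""
    problems = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = REMOTE_IP.search(line)
        if match and not match.group(1):
            # Keep the trailing comment, it usually names the affected worker.
            problems.append((number, (match.group(2) or "").strip()))
    return problems
```

Such a check could run on the content of /etc/wicked/scripts/gre_tunnel_preup.sh after each state application and fail loudly if the returned list is not empty. Whether a single offline peer should count as an error is a policy decision left open here.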

WARNING

Do not touch the key of a worker in workerconf.sls - a lot of other states depend on it!

Subtasks 2 (1 open — 1 closed)

Related issues 1 (0 open — 1 closed)

Updated by okurz 12 months ago

Copied from action #161381: multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S added

Updated by mkittler 12 months ago

Description updated (diff)

Updated by okurz 12 months ago

Tags changed from reactive work, infra to infra
Tracker changed from action to coordination
Subject changed from multi-machine test network issues reported 2024-06-03 - improvements to [epic] Better error detection on GRE tunnel misconfiguration
Assignee deleted (~~okurz~~)
Target version changed from Ready to future
Parent task set to #112862

Updated by nicksinger 12 months ago

Regarding the mine contents: I caught our openQA test failing to check the configured GRE tunnels. Checking on worker33, I can indeed see that /etc/wicked/scripts/gre_tunnel_preup.sh is "empty" again (it only contains lines like options:remote_ip= # worker36 (offline at point of file generation)).
Checking the mine from OSD for that and other workers with salt 'worker33.oqa.prg2.suse.org' mine.get 'host:worker33' ip4_interfaces "grain", I can see that the general mine structure is currently populated:

worker33.oqa.prg2.suse.org:
    ----------
    worker33.oqa.prg2.suse.org:
        ----------
        br1:
        eth0:
        eth1:
        lo:
            - 127.0.0.1
        ovs-system:
        tap0:
        tap1:
        tap10:
        tap100:
[…]

(Different hosts querying other hosts produce the same output.)

Basically no interface except lo contains any values. That means that the branch at https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L122 is never entered, so nothing is set there and we generate a wrong config. I'm not sure yet why the mine currently looks like this. Calling salt 'worker33.oqa.prg2.suse.org' mine.update doesn't seem to change anything.
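To make this symptom detectable automatically, a check along the following lines might help. It is only a sketch: it assumes the ip4_interfaces mapping for one host has already been parsed into a dict (for example from salt's JSON output), and the helper names are hypothetical.

```python
def interfaces_with_addresses(ip4_interfaces, ignore=("lo",)):
    """Return the non-ignored interfaces that carry at least one IPv4 address."""
    return sorted(
        name
        for name, addresses in ip4_interfaces.items()
        if name not in ignore and addresses
    )


def mine_entry_looks_empty(ip4_interfaces):
    """True when no interface other than loopback has an address."""
    return not interfaces_with_addresses(ip4_interfaces)
```

With the output shown above (only lo has an address), mine_entry_looks_empty would return True, which could be turned into a monitoring alert or a failing CI check.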

Updated by nicksinger 12 months ago

Restarting the Salt minion with systemctl restart salt-minion apparently fixed the mine contents on worker33, but also on all other workers. That means that for some reason our minion stopped populating the correct values into the mine. I am now trying with another host to narrow down the problem further.

Updated by nicksinger 12 months ago

While checking worker32 I found:

worker32:/etc/salt # journalctl -n 100 -u salt-minion | cat -
Jun 14 13:26:14 worker32 salt-minion[3804]: [WARNING ] cmd.run args: ['command -v auto-update 2>/dev/null && auto-update']
Jun 14 13:28:22 worker32 salt-minion[3804]: [WARNING ] remote: "worker-arm1" found in workerconf.sls but not in salt mine, host currently offline?

So I checked the package logs to see whether anything Salt-related was installed:

worker32:/var/log/zypp # cat history | grep 2024-06-14 | cut -d "|" -f 3 | grep -v "^#" | sort | uniq
atk-lang
chrony
chrony-pool-openSUSE
cups
cups-client
cups-config
gcr3-lang
gcr3-ssh-askpass
gio-branding-openSUSE
glib2-tools
iputils
kernel-default
kernel-default-extra
kernel-default-optional
libatk-1_0-0
libcups2
libcupscgi1
libcupsimage2
libcupsmime1
libcupsppdc1
libgck-1-0
libgcr-3-1
libgio-2_0-0
libjitterentropy3
liblept5
libpoppler-cpp0
libpoppler126
libsemanage-conf
libsemanage1
libsemanage2
libsepol2
libtiff5
libunbound8
libvpl2
nmap
openQA-client
openQA-common
openQA-worker
openSUSE-2024-154
openSUSE-SLE-15.5-2024-1980
openSUSE-SLE-15.5-2024-1991
openSUSE-SLE-15.5-2024-1994
openSUSE-SLE-15.5-2024-2003
openSUSE-SLE-15.5-2024-2007
openSUSE-SLE-15.5-2024-2022
openSUSE-SLE-15.5-2024-2024
openSUSE-SLE-15.5-2024-2028
openSUSE-SLE-15.6-2024-1950
openSUSE-SLE-15.6-2024-2022
openSUSE-SLE-15.6-2024-2024
os-autoinst
os-autoinst-devel
os-autoinst-distri-opensuse-deps
os-autoinst-openvswitch
os-autoinst-swtpm
poppler-tools
python3-Pillow
python3-Pillow-tk
root@worker32
shadow
unbound-anchor
wsdd
yast2-country
yast2-country-data

Nothing is really related to Salt, but iputils looks a little suspicious to me. This might just be a coincidence, though. I don't know.
As the mine is basically just a predefined grain from another minion (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/salt/mine.sls?ref_type=heads#L3-6), I checked whether the minion grain itself contains the correct content:

worker32:/var/log/zypp # salt-call grains.item ip4_interfaces
local:
    ----------
    ip4_interfaces:
        ----------
        br1:
            - 10.0.2.2
        erspan0:
        eth0:
            - 10.145.10.5
        eth1:
        gre0:
        gretap0:
        lo:
            - 127.0.0.1
        ovs-system:
        tap0:
        tap1:

And it does. So for some reason the publishing of the grain goes wrong. Using salt-call mine.update on the worker itself fixes the mine content on all other machines. salt '*' cmd.run 'salt-call mine.update' can be used from OSD to fix it on all connected machines.
Afterwards the contents of gre_tunnel_preup.sh can be corrected with salt -C 'G@roles:worker' state.sls_id /etc/wicked/scripts/gre_tunnel_preup.sh openqa.openvswitch. That makes our openQA job pass again as well: https://openqa.suse.de/tests/14619330#dependencies
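The comparison done by hand above (local grain versus what the mine holds) could be sketched as follows. This assumes both ip4_interfaces mappings have already been loaded into dicts; stale_interfaces is a hypothetical helper, not an existing Salt function.

```python
def stale_interfaces(grain, mine):
    """Return interfaces whose addresses in the mine differ from the local grain."""
    stale = {}
    for name in sorted(set(grain) | set(mine)):
        # Treat a missing interface or a None value like an empty address list.
        local = sorted(grain.get(name) or [])
        published = sorted(mine.get(name) or [])
        if local != published:
            stale[name] = {"grain": local, "mine": published}
    return stale
```

A non-empty result would indicate that the grain was not published correctly and that a mine.update, followed by regenerating gre_tunnel_preup.sh, is needed.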

Updated by okurz 11 months ago

Target version changed from future to Ready

Updated by okurz 11 months ago

Subtask #162734 added

Updated by okurz 11 months ago

Subtask #162737 added

Updated by okurz 11 months ago

Status changed from New to Blocked
Assignee set to okurz

Updated by okurz 11 months ago

Target version changed from Ready to future
