tickets #168487
openasgard stops routing IPv6 randomly
Description
Twice in recent months we found that asgard1 would no longer forward IPv6 packets from/to the SUSE side gateway (= our path to the internet and to SUSE side hosts like the login proxies). Presumably only "new" sessions are affected, as I was still connected to the VPN with no issues at the time. Internal routing (between openSUSE VLANs) also kept working fine. No odd kernel messages or similar were found around the problematic time. Switching traffic over to asgard2 makes all traffic work again. I then use the opportunity to install updates on asgard1, reboot, and switch traffic back, after which it continues to work.
This happening randomly without any noteworthy log entries makes it rather difficult to debug and reproduce. At the problematic time my focus is usually also to get connectivity back fast which does not leave much room for debugging.
However I did find the second time it happened that tcpdump on the os-p2p-pub interface (the one facing the gateway) filtering for ICMP from the login proxies (which I used to test-ping from the SUSE side) did not record any packets arriving.
Updated by cboltz about 1 month ago
> Switching traffic over to asgard2 makes all traffic work again.
Is this because asgard2 is otherwise bored (wild guess: something like "overflow not reached yet"), or because something is broken on asgard1?
To find out, can you switch the traffic to asgard2 for the next two months?
Updated by crameleon about 1 month ago
I think it is indeed because asgard2 was idle before, since both machines are configured pretty much identically and it works again on asgard1 after switching back to it later. But we could indeed reconfigure asgard2 to be the master for some time to test.
Updated by crameleon 8 days ago
I have not yet gotten around to switching asgard2 to master as suggested. In the meantime, here is advice on how to deal with this situation:
- Assess it is the problem in question:
  - ping an IPv6 address on the internet, for example 2620:fe::fe, from both asgard1 and asgard2 - if it works on asgard2, but not on asgard1, then the problem is identified - proceed, otherwise it is a different issue, do not proceed here
- Switch all VRRP instances to asgard2 by executing the following on asgard1:
for x in $(ip -br l | awk '/^d-/{ printf $1 " " }'); do ip l s down $x; done
- Verify IPv6 routing is working correctly and services are fully reachable again
- Wait a bit until the ping to the test IPv6 host on the internet works from asgard1 again as well, then enable the VRRP instances on the master again (do not leave the master, asgard1, in a failed state for eternity):
for x in $(ip -br l | awk '/^d-/{ printf $1 " " }'); do ip l s up $x; done
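The assessment step above could be sketched roughly as follows. The function names can_ping6 and classify are made up for illustration, and the ping count and timeout are arbitrary assumptions; only the test address comes from the steps above.

```shell
#!/bin/sh
# Sketch only: helper names, ping count and timeout are assumptions.
TEST_HOST="2620:fe::fe"   # example address from the steps above

# can_ping6 HOST: succeeds if HOST answers at least one of three
# ICMPv6 echo requests (2 second timeout per reply).
can_ping6() {
    ping -6 -c 3 -W 2 "$1" >/dev/null 2>&1
}

# classify A1 A2: A1 and A2 are "ok" or "fail", the ping result
# seen on asgard1 and asgard2 respectively.
classify() {
    case "$1/$2" in
        fail/ok) echo "match: switch VRRP instances to asgard2" ;;
        *)       echo "different issue: do not proceed here" ;;
    esac
}

# Usage, run on each router and note the result:
#   can_ping6 "$TEST_HOST" && echo ok || echo fail
# then, with both results at hand:
#   classify fail ok
```

Only the combination "fails on asgard1, works on asgard2" is treated as this problem; anything else is reported as a different issue.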
Updated by crameleon 8 days ago
It is happening more frequently now, once yesterday and once today. Online research suggested checking for neighbor discovery issues, however the neighbor table looks fine at the time of the breakage:
asgard1 (Firewall, Router):~ # ip neigh sh|grep p2p
195.135.223.46 dev os-p2p-pub lladdr 00:10:db:ff:10:03 REACHABLE
2a07:de40:b27f:201:ffff:ffff:ffff:ffff dev os-p2p-pub lladdr 00:10:db:ff:10:03 router REACHABLE
Updated by bmwiedemann 8 days ago
On my private dedicated server, I run into networking problems when wicked gets updated. Maybe it can also be triggered with a rcnetwork restart.
The problem started around 01:20 UTC which might be when maintenance-updates were installed. Can you check /var/log/zypp/history ? (I can't ssh to asgard1/2 for some reason)
Updated by crameleon 8 days ago
Hi @bmwiedemann,
I checked, but no updates were installed at the time at all. Also worth mentioning is these machines only get patches with the "security" level installed automatically, so updates do not happen as often.
It also tends to start working again without restarting any services or similar - for example today I just switched traffic away to asgard2 for a bit, let asgard1 "cool down" briefly, and then routing through asgard1 magically worked again as well. There was no interaction with wicked.
Regarding SSH: https://gitlab.infra.opensuse.org/infra/ssh_config/-/commit/c793677810f77036948b1bd51179c01717d8915c.
Updated by bmwiedemann 7 days ago
We had another instance of this today starting around 02:48 UTC.
sudo ip -s -s neigh flush all
did not help, but a reboot of asgard1 did help.
I have captured the neigh table contents before+after the flush in
asgard1:/home/bmwiedemann/ip-neigh-sh*
Updated by crameleon 7 days ago · Edited
Upon implementing the firewalls in SLC1 I forked the Asgard ruleset and found a discrepancy in the ICMPv6 rules:
salt/files/nftables/asgard/zones/00_global.nft:
ip6 saddr fe80::/10 ip6 nexthdr icmpv6 icmpv6 type { mld2-listener-report, nd-router-solicit, nd-neighbor-advert } accept
Here we allow nd-neighbor-advert (ICMPv6 type 136), but not nd-neighbor-solicit (ICMPv6 type 135). Instead of nd-neighbor-solicit, nd-router-solicit is allowed, which we probably do not need as no router advertisements are in use and the P2P connection uses a static route as well - this might have been a mixup.
To assess whether this might really be the (or at least a) problem, I configured logging on the os-p2p-pub interface. We generally have it disabled for that interface, as there is tons of "internet noise", hence I added rules to the input and forward chains that only log drops of packets from link-local source addresses:
ip6 saddr fe80::/10 log prefix "[asgard from LL] Forward Dropped: " flags all
Indeed, at the timestamp last mentioned by @bmwiedemann, a relevant drop of an ICMPv6 type 135 packet can be found:
2024-11-12T02:48:00.499910+00:00 asgard1 kernel: [ C3] [asgard from LL] Inbound Dropped: IN=os-p2p-pub OUT= MACSRC=00:10:db:ff:10:03 MACDST=33:33:ff:00:00:01 MACPROTO=86dd SRC=fe80:0000:0000:0000:0210:db0c:81ff:1003 DST=ff02:0000:0000:0000:0000:0001:ff00:0001 LEN=72 TC=192 HOPLIMIT=255 FLOWLBL=0 PROTO=ICMPv6 TYPE=135 CODE=0
Now, there are many more such occurrences (grep -E 'asgard from LL.*TYPE=135' /var/log/firewall) which did not cause any outage, and the one at the given timestamp might be a coincidence, but regardless I am repairing this via:
https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2146
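To judge whether a drop at a given timestamp is a coincidence, the grep above could be extended to count the dropped neighbor solicitations per hour. This is only a sketch: the function name count_ns_drops_per_hour is made up, and it assumes every log line starts with an ISO 8601 timestamp as in the entry quoted above.

```shell
#!/bin/sh
# Sketch only: assumes each log line begins with a timestamp such as
# 2024-11-12T02:48:00.499910+00:00, so characters 1-13 identify the hour.

# count_ns_drops_per_hour: reads firewall log lines on stdin and prints
# "<count> <YYYY-MM-DDTHH>" for each hour that has dropped ICMPv6 type 135
# packets from link-local sources.
count_ns_drops_per_hour() {
    grep -E 'asgard from LL.*TYPE=135' \
        | cut -c1-13 \
        | sort \
        | uniq -c \
        | awk '{ print $1, $2 }'
}

# Usage:
#   count_ns_drops_per_hour < /var/log/firewall
```

Hours with an unusually high count compared to the baseline would be the ones worth correlating with outages.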
Updated by bmwiedemann 7 days ago
There is a series of these log entries from MACSRC=00:10:db:ff:10:03 that starts at 02:40:20 and ends at 05:18:02 (when I rebooted asgard1).
So this seems very related to the outage.
The output of ip neigh sh during the breakage had these entries for the MAC:
195.135.223.46 dev os-p2p-pub lladdr 00:10:db:ff:10:03 DELAY
2a07:de40:b27f:201:ffff:ffff:ffff:ffff dev os-p2p-pub lladdr 00:10:db:ff:10:03 router DELAY
And without an outage it has:
195.135.223.46 dev os-p2p-pub lladdr 00:10:db:ff:10:03 DELAY
2a07:de40:b27f:201:ffff:ffff:ffff:ffff dev os-p2p-pub lladdr 00:10:db:ff:10:03 router REACHABLE
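To catch the state change as it happens, the neighbor state of the gateway could be recorded periodically. A minimal sketch, assuming the state is always the last field of the ip neigh output as in the entries above; the function name neigh_state and the 10 second interval are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper name and polling interval are assumptions.
GATEWAY6="2a07:de40:b27f:201:ffff:ffff:ffff:ffff"   # gateway address from the output above

# neigh_state ADDR: reads "ip neigh show" output on stdin and prints the
# neighbor state (last field, e.g. REACHABLE or DELAY) of the entry for ADDR.
neigh_state() {
    awk -v addr="$1" '$1 == addr { print $NF }'
}

# Usage, logging the state with a UTC timestamp every 10 seconds:
#   while sleep 10; do
#       printf '%s %s\n' "$(date -u +%FT%TZ)" \
#           "$(ip neigh show dev os-p2p-pub | neigh_state "$GATEWAY6")"
#   done
```

Comparing such a log with the firewall log timestamps would show whether the IPv6 entry leaves REACHABLE at the moment the type 135 drops begin.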
Updated by crameleon 7 days ago
I also submitted the link-local logging change, as it does not hurt to have it enabled permanently:
https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2148