tickets #168487
asgard stops routing IPv6 randomly

Added by crameleon about 1 month ago. Updated 7 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
-
Category:
Network
Target version:
-
Start date:
2024-10-19
Due date:
% Done:

0%

Estimated time:

Description

In the last few months we have twice found that asgard1 would no longer forward IPv6 packets from/to the SUSE-side gateway (= our path to the internet and to SUSE-side hosts like the login proxies). Presumably only "new" sessions are affected, as I would still be connected to the VPN with no issues at the time. Internal routing (between openSUSE VLANs) would also still work fine. No odd kernel messages or similar are found at the problematic time. Switching traffic over to asgard2 makes all traffic work again. I then use the opportunity to install updates on asgard1 and reboot, which switches traffic back, and it continues to work.

This happening randomly, without any noteworthy log entries, makes it rather difficult to debug and reproduce. At the problematic time my focus is usually on getting connectivity back fast, which does not leave much room for debugging.

However, the second time it happened I did find that tcpdump on the os-p2p-pub interface (the one facing the gateway), filtering for ICMP from the login proxies (which I used to test-ping from the SUSE side), did not record any arriving packets.

Actions #1

Updated by crameleon about 1 month ago

  • Private changed from Yes to No
Actions #2

Updated by cboltz about 1 month ago

Switching traffic over to asgard2 makes all traffic work again.

Is this because asgard2 is otherwise bored (wild guess: something like "overflow not reached yet"), or because something is broken on asgard1?

To find out, can you switch the traffic to asgard2 for the next two months?

Actions #3

Updated by crameleon about 1 month ago

I think it is indeed that asgard2 was idle before, since both machines are configured pretty much identically and asgard1 works again after switching back to it later .. but we could indeed reconfigure asgard2 to be the master for some time to test.

Actions #4

Updated by crameleon 8 days ago

I have not yet gotten to switch asgard2 to master as suggested; in the meantime, here is advice on how to deal with this situation:

  1. Confirm it is the problem in question:
    • ping an IPv6 address on the internet, for example 2620:fe::fe, from both asgard1 and asgard2
    • if it works on asgard2 but not on asgard1, the problem is identified - proceed; otherwise it is a different issue, do not proceed here
  2. Switch all VRRP instances to asgard2 by executing the following on asgard1:
    • for x in $(ip -br l | awk '/^d-/{ printf $1 " " }'); do ip l s down $x; done
  3. Verify IPv6 routing is working correctly and services are fully reachable again
  4. Wait a bit until the ping to the test IPv6 host on the internet works from asgard1 again as well, then enable the VRRP instances on the master again (do not leave the master, asgard1, in a failed state for eternity):
    • for x in $(ip -br l | awk '/^d-/{ printf $1 " " }'); do ip l s up $x; done
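The procedure above can be sketched as a small script. This is only a sketch: it assumes the VRRP-carrying interfaces are named d-* as in the loops above, and with DRY_RUN=1 (the default here) it only prints the commands instead of executing them, using made-up d-example* interface names.

```shell
#!/bin/sh
# Sketch of the failover procedure above. DRY_RUN=1 (default) only prints
# commands; the d-example* interface names are placeholders for the sketch.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

vrrp_links() {
  if [ "$DRY_RUN" = "1" ]; then
    # placeholder names for the sketch
    printf 'd-example0\nd-example1\n'
  else
    # real interface list, as in the loops above
    ip -br l | awk '/^d-/{ print $1 }'
  fi
}

# Step 2: take all VRRP interfaces down on asgard1 to fail over to asgard2.
failover_to_backup() {
  for x in $(vrrp_links); do run ip l s down "$x"; done
}

# Step 4: bring them back up once asgard1 routes IPv6 again.
failback_to_master() {
  for x in $(vrrp_links); do run ip l s up "$x"; done
}

failover_to_backup
failback_to_master
```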
Actions #5

Updated by crameleon 8 days ago

It has happened more frequently now, once yesterday and once today. Online research suggested checking for neighbor discovery issues; however, that looks fine at the time of the breakage:

asgard1 (Firewall, Router):~ # ip neigh sh|grep p2p
195.135.223.46 dev os-p2p-pub lladdr 00:10:db:ff:10:03 REACHABLE
2a07:de40:b27f:201:ffff:ffff:ffff:ffff dev os-p2p-pub lladdr 00:10:db:ff:10:03 router REACHABLE
Actions #6

Updated by bmwiedemann 8 days ago

On my private dedicated server, I run into networking problems when wicked gets updated. Maybe it can also be triggered with a rcnetwork restart.

The problem started around 01:20 UTC which might be when maintenance-updates were installed. Can you check /var/log/zypp/history ? (I can't ssh to asgard1/2 for some reason)

Actions #7

Updated by crameleon 8 days ago

Hi @bmwiedemann,

I checked, but no updates were installed at the time at all. Also worth mentioning: these machines only get patches with the "security" level installed automatically, so updates do not happen very often.

It also tends to start working again without restarting any services or similar - for example, today I just switched traffic away to asgard2 for a bit, let asgard1 "cool down" briefly, and then routing through asgard1 magically worked again as well. There was no interaction with wicked ..

Regarding SSH: https://gitlab.infra.opensuse.org/infra/ssh_config/-/commit/c793677810f77036948b1bd51179c01717d8915c.

Actions #8

Updated by bmwiedemann 7 days ago

We had another instance of this today starting around 02:48 UTC.

sudo ip -s -s neigh flush all
did not help, but a reboot of asgard1 did help.

I have captured the neigh table contents before+after the flush in
asgard1:/home/bmwiedemann/ip-neigh-sh*

Actions #9

Updated by crameleon 7 days ago · Edited

Upon implementing the firewalls in SLC1 I forked the Asgard ruleset and found a discrepancy in the ICMPv6 rules:

salt/files/nftables/asgard/zones/00_global.nft:

ip6 saddr fe80::/10 ip6 nexthdr icmpv6 icmpv6 type { mld2-listener-report, nd-router-solicit, nd-neighbor-advert } accept

Here we allow nd-neighbor-advert (ICMPv6 type 136), but not nd-neighbor-solicit (ICMPv6 type 135). Instead of nd-neighbor-solicit, nd-router-solicit is allowed, which we probably do not need, as no router advertisements are in use and the P2P connection uses a static route - this might have been a mixup.
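A sketch of what the corrected rule might look like - the actual change is in the linked merge request; accepting nd-neighbor-solicit and dropping nd-router-solicit entirely are assumptions here:

```
# Sketch only: accept neighbour solicitations (type 135) alongside
# advertisements; nd-router-solicit removed on the assumption that no
# router advertisements are in use (see the linked merge request for
# the actual fix).
ip6 saddr fe80::/10 ip6 nexthdr icmpv6 icmpv6 type { mld2-listener-report, nd-neighbor-solicit, nd-neighbor-advert } accept
```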

To assess whether this might really be the (or at least a) problem, I configured logging on the os-p2p-pub interface. We generally have it disabled for that interface, as there is tons of "internet noise", hence I only added rules to the input and forward chains to capture drops of packets from link-local source addresses:

ip6 saddr fe80::/10 log prefix "[asgard from LL] Forward Dropped: " flags all
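For completeness, the pair of logging rules might look like this in the two chains - a sketch; the surrounding chain definitions are assumptions, only the forward-chain rule above is quoted from the ruleset:

```
# Sketch: matching rules in both chains, differing only in the log prefix
# (the "Inbound" prefix appears in the log excerpt quoted later in the ticket).
chain input {
    ...
    ip6 saddr fe80::/10 log prefix "[asgard from LL] Inbound Dropped: " flags all
}
chain forward {
    ...
    ip6 saddr fe80::/10 log prefix "[asgard from LL] Forward Dropped: " flags all
}
```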

Indeed, at the timestamp last mentioned by @bmwiedemann, a relevant drop of an ICMPv6 type 135 packet can be found:

2024-11-12T02:48:00.499910+00:00 asgard1 kernel: [    C3] [asgard from LL] Inbound Dropped: IN=os-p2p-pub OUT= MACSRC=00:10:db:ff:10:03 MACDST=33:33:ff:00:00:01 MACPROTO=86dd SRC=fe80:0000:0000:0000:0210:db0c:81ff:1003 DST=ff02:0000:0000:0000:0000:0001:ff00:0001 LEN=72 TC=192 HOPLIMIT=255 FLOWLBL=0 PROTO=ICMPv6 TYPE=135 CODE=0

Now, there are many more such occurrences (grep -E 'asgard from LL.*TYPE=135' /var/log/firewall) which did not cause any outage, and the one at the given timestamp might be a coincidence, but regardless I am repairing this via:

https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2146
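For correlating such drops with outage windows, counting hits per hour is a quick sketch; the sample log lines below are made up, abbreviated versions of the log format shown above:

```shell
# Count ICMPv6 type-135 drops per hour (timestamp prefix YYYY-MM-DDTHH).
# The sample lines are hypothetical, abbreviated to match the format above;
# on a live system one would grep /var/log/firewall directly.
log=$(mktemp)
cat >"$log" <<'EOF'
2024-11-12T02:48:00.499910+00:00 asgard1 kernel: [asgard from LL] Inbound Dropped: TYPE=135 CODE=0
2024-11-12T02:49:10.000000+00:00 asgard1 kernel: [asgard from LL] Inbound Dropped: TYPE=135 CODE=0
2024-11-12T05:10:00.000000+00:00 asgard1 kernel: [asgard from LL] Inbound Dropped: TYPE=135 CODE=0
EOF
# Keep the first 13 characters (date + hour) and count occurrences per hour.
counts=$(grep -E 'asgard from LL.*TYPE=135' "$log" | cut -c1-13 | sort | uniq -c)
printf '%s\n' "$counts"
rm -f "$log"
```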

Actions #10

Updated by bmwiedemann 7 days ago

There is a series of these log entries from MACSRC=00:10:db:ff:10:03 that starts at 02:40:20 and ends at 05:18:02 (when I rebooted asgard1), so this seems very related to the outage.

The output of ip neigh sh during the breakage had these entries for the MAC:

195.135.223.46 dev os-p2p-pub lladdr 00:10:db:ff:10:03 DELAY 
2a07:de40:b27f:201:ffff:ffff:ffff:ffff dev os-p2p-pub lladdr 00:10:db:ff:10:03 router DELAY 

And without an outage it has:

195.135.223.46 dev os-p2p-pub lladdr 00:10:db:ff:10:03 DELAY 
2a07:de40:b27f:201:ffff:ffff:ffff:ffff dev os-p2p-pub lladdr 00:10:db:ff:10:03 router REACHABLE 
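A quick check for this could parse the neighbour state from the last column of the ip neigh output. This is only a sketch: the sample line is the broken-state output quoted above, and treating a non-REACHABLE state as a failure signal is an assumption (DELAY is also a normal transient state).

```shell
# Warn when the IPv6 gateway neighbour entry is not REACHABLE.
# Sample line copied from the broken state above; on a live system one
# would feed the output of `ip -6 neigh show dev os-p2p-pub` instead.
neigh='2a07:de40:b27f:201:ffff:ffff:ffff:ffff dev os-p2p-pub lladdr 00:10:db:ff:10:03 router DELAY'
# The NUD state is the last whitespace-separated field of the line.
state=$(printf '%s\n' "$neigh" | awk '{ print $NF }')
if [ "$state" != "REACHABLE" ]; then
  echo "gateway neighbour state: $state (possible ND problem)"
fi
```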
Actions #11

Updated by crameleon 7 days ago

  • Status changed from New to In Progress

Great, thanks, I did not notice the time correlation of the other entries - then that might actually be it. Let's observe for a few days whether any more issues occur.

Actions #12

Updated by crameleon 7 days ago

I also submitted the link local logging change, as it does not hurt to have it permanently:

https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2148
