action #155824


QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA (public) - coordination #153685: [epic] Move from SUSE NUE1 (Maxtorhof) to PRG2e

Support IPv6 SLAAC in our infrastructure size:M

Added by okurz 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category: Feature requests
Start date: 2024-02-22
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

https://jira.suse.com/browse/ENGINFRA-3685?focusedId=1329421&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1329421
Can we enable SLAAC support on all our salt-managed machines regardless of whether the infrastructure needs/supports it?

Acceptance criteria

  • AC1: Machines in PRG2e configured by salt, like ada, get full IPv6
  • AC2: Machines in PRG2 configured by salt, like worker29, still have full IPv6

Acceptance tests

  • AT1-1: for i in $(sudo salt-key -l acc) ;do ping -6 -c1 $i; done returns a successful pong for each prg2-based machine

Suggestions

  • Check differences between openqaw5-xen, which already has IPv6, and ada, which does not. Maybe ada has IPv6 explicitly disabled via sysctl.

  • If ada is not just a single exception, then add to salt

# cat /etc/sysctl.d/10-slaac.conf
net.ipv6.conf.default.use_tempaddr=1
net.ipv6.conf.default.autoconf=1 

and see if that automagically fixes IPv6 on ada without breaking something on others. If that does not work, then set

BOOTPROTO="dhcp4+auto6"

but possibly only for machines in PRG2e, not PRG2, or based on a salt grain or something similar.
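As a rough sketch, this could be tried manually on a single PRG2e machine before putting anything into salt; eth0 is only a placeholder for the machine's uplink interface:

# on one PRG2e test machine only; eth0 is a placeholder
sed -i 's/^BOOTPROTO=.*/BOOTPROTO="dhcp4+auto6"/' /etc/sysconfig/network/ifcfg-eth0
wicked ifreload eth0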

Actions #1

Updated by okurz 10 months ago

  • Subject changed from Support IPv6 SLAAC in our infrastructure to Support IPv6 SLAAC in our infrastructure size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by nicksinger 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by nicksinger 10 months ago

One major difference I spotted was that e.g. xen5 uses OVS bridges while ada uses a native Linux bridge, but I am not sure if this is really related. I also saw that net.ipv6.conf.eth0.disable_ipv6 is set, but on both machines, so it does not explain the difference.

Actions #4

Updated by nicksinger 10 months ago

Alright, so the bridge causes the issue because it has forwarding enabled on the OS level. This makes Linux refuse router advertisements, because sysctl net.ipv6.conf.br0.accept_ra=1 while forwarding is enabled. If we set accept_ra=2, this overrides that behavior and the next RA is applied to the interface:

ada:/etc # ip a s dev br0
19: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:ec:ef:b9:b8:30 brd ff:ff:ff:ff:ff:ff
    inet 10.146.4.1/23 brd 10.146.5.255 scope global br0
       valid_lft forever preferred_lft forever
    inet6 2a07:de40:b230:1:a8d8:5bc6:d523:bab6/64 scope global temporary dynamic
       valid_lft 604172sec preferred_lft 85216sec
    inet6 2a07:de40:b230:1:3eec:efff:feb9:b830/64 scope global dynamic mngtmpaddr
       valid_lft 2591754sec preferred_lft 604554sec
    inet6 fe80::3eec:efff:feb9:b830/64 scope link
       valid_lft forever preferred_lft forever

I think we should be able to detect this and configure it accordingly with salt.
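For reference, a minimal manual check and fix along these lines, assuming br0 is the affected bridge (plain sysctl calls, not the salt implementation yet):

# show whether forwarding is on and how RAs are currently handled on the bridge
sysctl net.ipv6.conf.br0.forwarding net.ipv6.conf.br0.accept_ra
# accept RAs even with forwarding enabled (transient until made persistent)
sysctl -w net.ipv6.conf.br0.accept_ra=2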

Actions #5

Updated by nicksinger 10 months ago

salt '*' cmd.run 'ip l | grep -i master | grep -v tap | grep -v gre_sys | grep -v ovs-system' shows the following hosts potentially "affected" by this:

osiris-1.qe.nue2.suse.org:
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
openqaworker1.qe.nue2.suse.org:
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
qamaster.qe.nue2.suse.org:
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
    12: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    13: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    14: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    15: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    16: vnet4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    17: vnet5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    18: vnet6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    19: vnet7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
imagetester.qe.nue2.suse.org:
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
schort-server.qe.nue2.suse.org:
mania.qe.nue2.suse.org:
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
openqa-piworker.qe.nue2.suse.org:
    3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UP mode DEFAULT group default qlen 1000

Actions #6

Updated by openqa_review 10 months ago

  • Due date set to 2024-03-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz 10 months ago

Just add accept_ra=2 in salt states for all hosts in sysctl.sls
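For illustration, the manual equivalent on a single host would be roughly the following; the file name 98-accept-ra.conf is just an example, as the salt-managed entries actually end up in /etc/sysctl.d/99-salt.conf:

echo 'net.ipv6.conf.br0.accept_ra = 2' > /etc/sysctl.d/98-accept-ra.conf
sysctl --system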

Actions #8

Updated by nicksinger 10 months ago

  • Priority changed from High to Normal

I already prepared an MR for setting net.ipv6.conf.all.accept_ra = 1 but held off even before creating it. I don't think setting this on all interfaces in our infrastructure is a good idea. I'm going to build something conditional, similar to what was done in https://progress.opensuse.org/issues/155824#note-5

Given that ada was fixed in the meantime, I am lowering the priority here.

Actions #9

Updated by okurz 10 months ago

What fixed ada? Was it a transient change of a sysctl setting, or is it persistent?

Actions #10

Updated by nicksinger 10 months ago

Me setting net.ipv6.conf.br0.accept_ra = 2 manually, so it is transient.

Actions #11

Updated by nicksinger 10 months ago

I created /etc/sysctl.d/99-poo155824.conf manually to make it persistent on that host, at least for now.
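The file content is not quoted here, but presumably it just persists the setting from #note-10, i.e. something like:

# /etc/sysctl.d/99-poo155824.conf (assumed content)
net.ipv6.conf.br0.accept_ra = 2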

Actions #12

Updated by okurz 10 months ago

ok, I now created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1115

I think we should try this out and revert if we see problems with that.

Actions #13

Updated by okurz 10 months ago

We already have

net.ipv6.conf.default.use_tempaddr=1
net.ipv6.conf.default.autoconf=1

on all salt-controlled hosts, so we shouldn't need this. And accept_ra should also not matter here. So far it seems only ada needs special treatment, potentially with manual configuration. So I am retracting my MR.
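One way to double-check that across all salt-controlled hosts, as a sketch using the same cmd.run pattern as elsewhere in this ticket:

sudo salt '*' cmd.run 'sysctl net.ipv6.conf.default.use_tempaddr net.ipv6.conf.default.autoconf'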

Actions #14

Updated by nicksinger 10 months ago

  • Status changed from In Progress to Feedback

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1116 to handle this case for all other machines

Actions #16

Updated by nicksinger 10 months ago

  • Status changed from Feedback to Workable

Verified that it works with:

openqa:~ # salt '*' cmd.run 'ip -o -f link link show type bridge | cut -d : -f 2 | xargs -I{} sysctl net.ipv6.conf.{}.accept_ra'

However, by checking /etc/sysctl.d/99-salt.conf on ada, I realized that we have leftovers like:

net.ipv6.conf.b.accept_ra = 2
net.ipv6.conf.r.accept_ra = 2
net.ipv6.conf.0.accept_ra = 2

The same on mania (which was only touched by my merged MR), so I still have a bug in my code that needs to be fixed.

Actions #17

Updated by livdywan 10 months ago

  • Due date deleted (2024-03-08)

I am resetting the due date since nobody can look into it currently, as opposed to higher priority tasks (at least that is my understanding).

Actions #18

Updated by okurz 10 months ago

  • Description updated (diff)

On grenache-1, cat /etc/sysctl.d/99-salt.conf now shows:

#
# Kernel sysctl configuration
#
kernel.softlockup_panic = 1
net.ipv6.conf.eth0.accept_ra = 2
kernel.panic = 60
net.ipv4.ip_forward = 1
net.ipv4.conf.br1.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1

To clean up some other messed-up entries, we manually ran:

sudo salt \* cmd.run 'sed -i "/\.\w\.accept_ra/d" /etc/sysctl.d/99-salt.conf'
sudo salt \* cmd.run 'sed -i "/\. \w*\.accept_ra/d" /etc/sysctl.d/99-salt.conf'

Next stop: add corresponding entries in https://gitlab.suse.de/OPS-Service/salt/-/blob/production/salt/profile/dns/files/prg2_suse_org/dns-oqa.prg2.suse.org where they are not yet present.

Actions #19

Updated by nicksinger 10 months ago

  • Status changed from Workable to In Progress

In the daily we discovered that not every prg2-based machine has full IPv6 capabilities yet. To verify, we tried to ping the hostname of every salt-controlled machine from OSD via IPv6 only. I replicated that list of broken machines by using:

openqa:~ # (salt-key -l accepted | grep "\." | xargs -i{} ping -c 1 -6 {} > /dev/null) 2>&1 | cut -d ":" -f 2 | rev | sort | rev
 openqaworker1.qe.nue2.suse.org
 sapworker1.qe.nue2.suse.org
 osiris-1.qe.nue2.suse.org
 sapworker2.qe.nue2.suse.org
 sapworker3.qe.nue2.suse.org
 unreal6.qe.nue2.suse.org
 mania.qe.nue2.suse.org
 tumblesle.qe.nue2.suse.org
 diesel.qe.nue2.suse.org
 petrol.qe.nue2.suse.org
 backup-qam.qe.nue2.suse.org
 backup-vm.qe.nue2.suse.org
 openqa-piworker.qe.nue2.suse.org
 qamaster.qe.nue2.suse.org
 imagetester.qe.nue2.suse.org
 schort-server.qe.nue2.suse.org
 monitor.qe.nue2.suse.org
 jenkins.qe.nue2.suse.org
 baremetal-support.qe.nue2.suse.org
 grenache-1.oqa.prg2.suse.org
 s390zl12.oqa.prg2.suse.org
 s390zl13.oqa.prg2.suse.org
 openqaworker14.qa.suse.cz
 qesapworker-prg4.qa.suse.cz
 qesapworker-prg5.qa.suse.cz
 openqaworker16.qa.suse.cz
 qesapworker-prg6.qa.suse.cz
 openqaworker17.qa.suse.cz
 qesapworker-prg7.qa.suse.cz
 openqaworker18.qa.suse.cz

Strictly speaking, this leaves grenache, s390zl12 and s390zl13 in prg2 without a proper v6 connection. I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4890 to cover grenache for now; it is also our most recent example of a machine "newly" added to our infra.

Actions #20

Updated by nicksinger 10 months ago

I checked s390zl12 and s390zl13 and it should be easy enough to give them proper AAAA records as well.
Generating these v6 addresses with only the MAC address available (e.g. if a machine is not running yet) requires calculating the EUI-64 interface identifier (see https://www.kwtrain.com/blog/how-to-calculate-an-eui-64-address for an example). I wrote a quick-and-dirty script to help me with that:

#!/usr/bin/env python3
# Compute the EUI-64 interface identifier (the host part of a SLAAC address)
# from a MAC address given as first argument or read interactively.
import sys
if len(sys.argv) < 2:
    mac = input("Enter mac of host: ")
else:
    mac = sys.argv[1]
byte_strings = mac.split(":")
# insert ff:fe in the middle of the MAC ...
fffe_inserted = byte_strings[:3] + ["ff", "fe"] + byte_strings[3:]
# ... and flip the universal/local bit in the first byte
flipped_bit = hex(int(fffe_inserted[0], 16) ^ 0b00000010)
fffe_inserted[0] = flipped_bit.split("x", 1)[1]
# join the 8 bytes into 4 colon-separated 16-bit groups
v6_notation = list(zip(fffe_inserted[0::2], fffe_inserted[1::2]))
v6_notation = list(map(lambda x: "".join(x), v6_notation))
v6_notation = ":".join(v6_notation)
print(v6_notation)
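For example, with ada's MAC from #note-4 (assuming the script is saved as eui64.py):

$ python3 eui64.py 3c:ec:ef:b9:b8:30
3eec:efff:feb9:b830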

If the machine is running, we can use the following command to print the mac of the default interface:

ip -j a s dev $(ip -6 -j r s | jq -r "(.[] | select(.dst == \"default\")).dev") | jq -r .[].address

Maybe these two methods can be combined to automate the process for the many machines in nue2; a possible combination is sketched below.
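A minimal sketch of such a combination, assuming SSH access to the target host and the script above saved as eui64.py (both are assumptions); it prints only the interface identifier, not the full address:

#!/bin/bash
# print the EUI-64 interface identifier for a remote host, derived from the
# MAC of the interface that holds its default IPv6 route
host="$1"
dev=$(ssh "$host" ip -6 -j r s | jq -r '(.[] | select(.dst == "default")).dev')
mac=$(ssh "$host" ip -j a s dev "$dev" | jq -r '.[].address')
python3 eui64.py "$mac"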

Actions #21

Updated by openqa_review 10 months ago

  • Due date set to 2024-03-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #22

Updated by nicksinger 10 months ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4890 merged and validated with ping -6 grenache-1.oqa.prg2.suse.org (works).
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4909 created to cover zl12+zl13. I skipped other hosts in our zone file because we cannot access all of them to verify v6 is configured at all.

We agreed in the infra daily today that we will ignore every machine in .qa.suse.cz because we consider it "old" and not under our control.

Actions #23

Updated by okurz 10 months ago

  • Due date deleted (2024-03-27)
  • Status changed from Feedback to Resolved

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4909 was merged and applied.

$ ping s390zl12.oqa.prg2.suse.org 
PING s390zl12.oqa.prg2.suse.org(2a07:de40:b203:12:f495:72ff:fee6:b4f1 (2a07:de40:b203:12:f495:72ff:fee6:b4f1)) 56 data bytes
64 bytes from 2a07:de40:b203:12:f495:72ff:fee6:b4f1 (2a07:de40:b203:12:f495:72ff:fee6:b4f1): icmp_seq=1 ttl=62 time=23.4 ms
$ ping s390zl13.oqa.prg2.suse.org 
PING s390zl13.oqa.prg2.suse.org(2a07:de40:b203:12:7cc4:feff:fe8b:75f7 (2a07:de40:b203:12:7cc4:feff:fe8b:75f7)) 56 data bytes
64 bytes from 2a07:de40:b203:12:7cc4:feff:fe8b:75f7 (2a07:de40:b203:12:7cc4:feff:fe8b:75f7): icmp_seq=1 ttl=62 time=24.5 ms

AT1-1 successful, both AC1+AC2 covered.
