action #155824
QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA (public) - coordination #153685: [epic] Move from SUSE NUE1 (Maxtorhof) to PRG2e
Support IPv6 SLAAC in our infrastructure size:M
Description
Motivation
https://jira.suse.com/browse/ENGINFRA-3685?focusedId=1329421&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1329421
Can we enable SLAAC support on all our salt-managed machines, regardless of whether the infrastructure needs/supports this?
Acceptance criteria
- AC1: Machines in PRG2e configured by salt, like ada, get full IPv6
- AC2: Machines in PRG2 configured by salt, like worker29, still have full IPv6
Acceptance tests
- AT1-1:
for i in $(sudo salt-key -l acc); do ping -6 -c1 $i; done
returns a successful pong for each PRG2-based machine
Suggestions
Check differences between openqaw5-xen, which already has IPv6, and ada, which does not. Maybe ada has IPv6 explicitly disabled by sysctl (see the comparison sketch below).
If ada is not just a single exception, then add to salt
# cat /etc/sysctl.d/10-slaac.conf
net.ipv6.conf.default.use_tempaddr=1
net.ipv6.conf.default.autoconf=1
and see if that automagically fixes IPv6 on ada without breaking something on others. If that does not work, then set
BOOTPROTO="dhcp4+auto6"
but possibly only for machines in PRG2e, not PRG2, or based on a salt grain or something.
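A minimal sketch of the suggested comparison (hedged: assumes root SSH access to both hosts, and the selection of sysctls is only illustrative):
for h in openqaw5-xen ada; do
    echo "== $h =="
    ssh root@$h sysctl net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.default.autoconf net.ipv6.conf.default.use_tempaddr
done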
Updated by nicksinger 10 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger 10 months ago
One major difference I spotted was that e.g. xen5 is using OVS bridges while ada uses a native Linux bridge, but I'm not sure if this is really related. I also saw that net.ipv6.conf.eth0.disable_ipv6
is set, but on both machines, so it cannot explain the difference.
Updated by nicksinger 10 months ago
Alright, so the bridge causes the issue because it has forwarding enabled on the OS level. With sysctl net.ipv6.conf.br0.accept_ra=1
Linux refuses router advertisements as soon as forwarding is enabled. If we set accept_ra=2
instead, RAs are accepted even with forwarding enabled, and the next RA is applied to the interface:
ada:/etc # ip a s dev br0
19: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 3c:ec:ef:b9:b8:30 brd ff:ff:ff:ff:ff:ff
inet 10.146.4.1/23 brd 10.146.5.255 scope global br0
valid_lft forever preferred_lft forever
inet6 2a07:de40:b230:1:a8d8:5bc6:d523:bab6/64 scope global temporary dynamic
valid_lft 604172sec preferred_lft 85216sec
inet6 2a07:de40:b230:1:3eec:efff:feb9:b830/64 scope global dynamic mngtmpaddr
valid_lft 2591754sec preferred_lft 604554sec
inet6 fe80::3eec:efff:feb9:b830/64 scope link
valid_lft forever preferred_lft forever
I think we should be able to detect this and configure it accordingly with salt.
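As a rough sketch, the manual equivalent of such detection could look like this (hedged: the eventual salt state would have to render persistent sysctl config instead of setting values transiently):
# accept RAs on every bridge interface even though forwarding is enabled
for br in $(ip -o link show type bridge | cut -d : -f 2 | tr -d ' '); do
    sysctl -w "net.ipv6.conf.$br.accept_ra=2"
done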
Updated by nicksinger 10 months ago
salt '*' cmd.run 'ip l | grep -i master | grep -v tap | grep -v gre_sys | grep -v ovs-system'
shows the following hosts potentially "affected" by this:
osiris-1.qe.nue2.suse.org:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
openqaworker1.qe.nue2.suse.org:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
qamaster.qe.nue2.suse.org:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
12: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
13: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
14: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
15: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
16: vnet4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
17: vnet5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
18: vnet6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
19: vnet7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
imagetester.qe.nue2.suse.org:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
schort-server.qe.nue2.suse.org:
mania.qe.nue2.suse.org:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
openqa-piworker.qe.nue2.suse.org:
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UP mode DEFAULT group default qlen 1000
Updated by openqa_review 10 months ago
- Due date set to 2024-03-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 10 months ago
- Priority changed from High to Normal
I already prepared an MR for setting net.ipv6.conf.all.accept_ra = 1
but backed out even before creating it. I don't think setting this on all interfaces in our infrastructure is a good idea. I'm going to build something conditional, similar to what was done in https://progress.opensuse.org/issues/155824#note-5
Given that ada is fixed in the meantime, I'm lowering the priority here.
Updated by nicksinger 10 months ago
For now ada is only fixed by me setting net.ipv6.conf.br0.accept_ra = 2
manually, so the fix is transient.
Updated by nicksinger 10 months ago
I created /etc/sysctl.d/99-poo155824.conf
manually to make it persistent on that host, at least for now.
Updated by okurz 10 months ago
ok, I now created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1115
I think we should try this out and revert if we see problems with that.
Updated by okurz 10 months ago
We already have
net.ipv6.conf.default.use_tempaddr=1
net.ipv6.conf.default.autoconf=1
on all salt-controlled hosts, so we shouldn't need this. And the accept_ra should also not matter here. So far it seems only ada needs special treatment, potentially with manual configuration. So I'm retracting my MR.
Updated by nicksinger 10 months ago
- Status changed from In Progress to Feedback
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1116 to handle this case for all other machines
Updated by nicksinger 10 months ago
- Status changed from Feedback to Workable
verified that it works with:
openqa:~ # salt '*' cmd.run 'ip -o -f link link show type bridge | cut -d : -f 2 | xargs -I{} sysctl net.ipv6.conf.{}.accept_ra'
however, by checking /etc/sysctl.d/99-salt.conf on ada, I realized that we have leftovers like:
net.ipv6.conf.b.accept_ra = 2
net.ipv6.conf.r.accept_ra = 2
net.ipv6.conf.0.accept_ra = 2
The same on mania (only touched by my merged MR). These leftover entries look like the individual characters of "br0", suggesting my state iterated over the interface name string instead of over a list of bridge names. So I still have a bug in my code which needs to be fixed.
Updated by okurz 10 months ago
- Description updated (diff)
On grenache-1 we now have the following from cat /etc/sysctl.d/99-salt.conf:
#
# Kernel sysctl configuration
#
kernel.softlockup_panic = 1
net.ipv6.conf.eth0.accept_ra = 2
kernel.panic = 60
net.ipv4.ip_forward = 1
net.ipv4.conf.br1.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1
To clean up some other messed-up entries we manually ran:
sudo salt \* cmd.run 'sed -i "/\.\w\.accept_ra/d" /etc/sysctl.d/99-salt.conf'
sudo salt \* cmd.run 'sed -i "/\. \w*\.accept_ra/d" /etc/sysctl.d/99-salt.conf'
Next stop: Add according entries in https://gitlab.suse.de/OPS-Service/salt/-/blob/production/salt/profile/dns/files/prg2_suse_org/dns-oqa.prg2.suse.org where not present
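For illustration, such an entry would presumably look like this (hedged: the exact record layout in that zone file is assumed; the address is the one s390zl12 resolves to in the ping output further below):
s390zl12    IN    AAAA    2a07:de40:b203:12:f495:72ff:fee6:b4f1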
Updated by nicksinger 10 months ago
- Status changed from Workable to In Progress
In the daily we discovered that not every prg2-based machine has full IPv6 capabilities yet. To verify, we tried to ping the hostname of every salt-controlled machine from OSD via IPv6 only. I replicated that list of broken machines by using:
openqa:~ # (salt-key -l accepted | grep "\." | xargs -i{} ping -c 1 -6 {} > /dev/null) 2>&1 | cut -d ":" -f 2 | rev | sort | rev
openqaworker1.qe.nue2.suse.org
sapworker1.qe.nue2.suse.org
osiris-1.qe.nue2.suse.org
sapworker2.qe.nue2.suse.org
sapworker3.qe.nue2.suse.org
unreal6.qe.nue2.suse.org
mania.qe.nue2.suse.org
tumblesle.qe.nue2.suse.org
diesel.qe.nue2.suse.org
petrol.qe.nue2.suse.org
backup-qam.qe.nue2.suse.org
backup-vm.qe.nue2.suse.org
openqa-piworker.qe.nue2.suse.org
qamaster.qe.nue2.suse.org
imagetester.qe.nue2.suse.org
schort-server.qe.nue2.suse.org
monitor.qe.nue2.suse.org
jenkins.qe.nue2.suse.org
baremetal-support.qe.nue2.suse.org
grenache-1.oqa.prg2.suse.org
s390zl12.oqa.prg2.suse.org
s390zl13.oqa.prg2.suse.org
openqaworker14.qa.suse.cz
qesapworker-prg4.qa.suse.cz
qesapworker-prg5.qa.suse.cz
openqaworker16.qa.suse.cz
qesapworker-prg6.qa.suse.cz
openqaworker17.qa.suse.cz
qesapworker-prg7.qa.suse.cz
openqaworker18.qa.suse.cz
Strictly speaking this leaves grenache, s390zl12 and s390zl13 in PRG2 without a proper v6 connection. I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4890 to cover grenache for now. It is also our most recent example of a machine newly added to our infra.
Updated by nicksinger 10 months ago
I checked s390zl12 and 13; it should be easy enough to give them proper AAAA records as well.
Generating these v6 addresses with only the MAC address available (e.g. if a machine is not already running) requires one to calculate the EUI-64 (see https://www.kwtrain.com/blog/how-to-calculate-an-eui-64-address for an example). I wrote a dirty script to help me with that:
#!/usr/bin/env python3
# Calculate the EUI-64 interface identifier (the host part of a SLAAC
# address) from a MAC address given as argument or read interactively.
import sys

if len(sys.argv) < 2:
    mac = input("Enter mac of host: ")
else:
    mac = sys.argv[1]
byte_strings = mac.split(":")
# insert ff:fe in the middle of the MAC address
fffe_inserted = byte_strings[:3] + ["ff", "fe"] + byte_strings[3:]
# flip the universal/local bit of the first byte, keeping two hex digits
fffe_inserted[0] = format(int(fffe_inserted[0], 16) ^ 0b00000010, "02x")
# join pairs of bytes into the four 16-bit groups of the v6 notation
v6_notation = list(zip(fffe_inserted[0::2], fffe_inserted[1::2]))
v6_notation = ":".join("".join(pair) for pair in v6_notation)
print(v6_notation)
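For example, saved as eui64.py (filename made up), it reproduces the EUI-64 part of ada's br0 address shown above:
$ ./eui64.py 3c:ec:ef:b9:b8:30
3eec:efff:feb9:b830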
If the machine is running, we can use the following command to print the mac of the default interface:
ip -j a s dev $(ip -6 -j r s | jq -r "(.[] | select(.dst == \"default\")).dev") | jq -r .[].address
Maybe these two methods can be combined to automate the process for the many machines in nue2.
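Combined, that could look something like this when run on the target machine (a hedged sketch, again assuming the script is saved as eui64.py; prepending the correct subnet prefix is still manual):
./eui64.py "$(ip -j a s dev $(ip -6 -j r s | jq -r '(.[] | select(.dst == "default")).dev') | jq -r '.[].address')"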
Updated by openqa_review 10 months ago
- Due date set to 2024-03-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 10 months ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4890 merged and validated with ping -6 grenache-1.oqa.prg2.suse.org (works).
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4909 created to cover zl12+zl13. I skipped other hosts in our zone file because we cannot access all of them to verify v6 is configured at all.
We agreed in the infra daily today to ignore every machine in .qa.suse.cz because we consider it "old" and not under our control.
Updated by okurz 10 months ago
- Due date deleted (2024-03-27)
- Status changed from Feedback to Resolved
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4909 was merged and applied.
$ ping s390zl12.oqa.prg2.suse.org
PING s390zl12.oqa.prg2.suse.org(2a07:de40:b203:12:f495:72ff:fee6:b4f1 (2a07:de40:b203:12:f495:72ff:fee6:b4f1)) 56 data bytes
64 bytes from 2a07:de40:b203:12:f495:72ff:fee6:b4f1 (2a07:de40:b203:12:f495:72ff:fee6:b4f1): icmp_seq=1 ttl=62 time=23.4 ms
$ ping s390zl13.oqa.prg2.suse.org
PING s390zl13.oqa.prg2.suse.org(2a07:de40:b203:12:7cc4:feff:fe8b:75f7 (2a07:de40:b203:12:7cc4:feff:fe8b:75f7)) 56 data bytes
64 bytes from 2a07:de40:b203:12:7cc4:feff:fe8b:75f7 (2a07:de40:b203:12:7cc4:feff:fe8b:75f7): icmp_seq=1 ttl=62 time=24.5 ms
AT1-1 successful, both AC1+AC2 covered.