action #181862
closed
MTU connection issues on osiris-1 virtual machines size:M
Description
Observation
I have noticed strange connection issues when trying to set up a VM on osiris-1.qe.nue2.suse.org (NUE2 machine). Although the initial setup of a fresh Leap image, including network connectivity, seems to work fine (I can successfully ping and curl download.opensuse.org from within the VM), as soon as I run any zypper command the connection gets stuck, without any output indicating what happened.
@okurz assumed these issues might be MTU-related and suggested running ping commands with a larger packet size; this seems to be correct, as those ping commands also get stuck.
It seems only VMs on osiris-1 are affected.
Similar network-related MTU issues have been reported in previous SD tickets:
Steps to reproduce
- Create a fresh Leap 15.6 or Tumbleweed VM on osiris-1.
- Confirm that basic network connectivity works (ping, curl to download.opensuse.org).
- Run zypper ref and observe that it hangs without completing.
- Alternatively, run ping -Mdo -s1442 download.opensuse.org and observe that it also hangs (see the MTU probe sketch below).
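To narrow down the effective path MTU from inside an affected guest, a minimal probe sketch (assuming iputils ping; the payload sizes are only illustrative, and payload + 28 bytes of IP/ICMP headers gives the on-wire packet size):

# Probe a few payload sizes with the Don't-Fragment bit set
for size in 1472 1432 1392 1372 1332; do
    if ping -M do -c 1 -W 2 -s "$size" download.opensuse.org >/dev/null 2>&1; then
        echo "payload $size OK   -> path MTU >= $((size + 28))"
    else
        echo "payload $size FAIL -> path MTU <  $((size + 28))"
    fi
done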
Suggestions
- DONE Is only this one VM affected? Try another VM on osiris -> ok1 also affected
- DONE Is the host itself affected? Try on osiris-1 directly -> not affected
- DONE Are only VMs on osiris-1 affected? Try another VM on another host (e.g. ada.qe.suse.de) -> not affected
- Consider introducing a network diagnostic hook or health check script for VM post-boot validation (see the sketch after this list)
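A possible shape for such a post-boot health check (hypothetical sketch, not an existing script; the interface name, target host and timeout are assumptions):

#!/bin/bash
# Hypothetical post-boot network health check for VMs (sketch only)
set -u
TARGET=download.opensuse.org
IFACE=${1:-eth0}

echo "MTU on $IFACE: $(cat /sys/class/net/"$IFACE"/mtu)"

# Basic reachability
ping -c 1 -W 2 "$TARGET" >/dev/null 2>&1 && echo "ping: OK" || echo "ping: FAIL"

# Large-packet check: a 1442-byte payload with DF set reproduced the hang in this ticket
if ping -M do -c 1 -W 2 -s 1442 "$TARGET" >/dev/null 2>&1; then
    echo "large packets (DF, 1442 bytes): OK"
else
    echo "large packets (DF, 1442 bytes): FAIL - possible MTU/PMTUD problem"
fi

# Repository refresh with a timeout so a hang cannot block post-boot validation forever
timeout 120 zypper --non-interactive ref >/dev/null 2>&1 && echo "zypper ref: OK" || echo "zypper ref: FAIL"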
Full list of hypotheses / experiments and observations:
- REJECTED H1 All NUE2 QE machines have problems with MTU sizes
  - E1-1 Select any machine other than the original one, call zypper ref and observe if it times out
    - O1-1-1 osiris itself has no problem
- REJECTED H1.1 All NUE2 QE non-salt controlled machines have problems with MTU sizes
  - -> see O1.2-1
- ACCEPTED H1.2 Only VMs on osiris have problems
  - E1.2-1 Try a VM elsewhere, e.g. qamaster
    - O1.2-1 VM on ada.qe.suse.de not affected
  - E1.2-2 Try another VM on osiris
    - O1.2-2 ok1 also affected
- H1.3 Only VMs in NUE2 have problems
  - E1.3-1 Try a VM elsewhere within NUE2, e.g. qamaster, and compare VM settings
- REJECTED H1.3 Only the rrichardson VM has problems
  - E1.3-1 See O1.2-2
- ACCEPTED H2 The problem of zypper ref can be more easily reproduced with ping -Mdo -s1442 download.opensuse.org
  - E2-1 Try the ping; if it fails then we can assume this is a valid reproducer until that is fixed. Then verify with zypper ref again
    - O2-1-1 Confirmed to reproduce an error, so assumed to be a valid reproducer
- ACCEPTED H3 The MTU size problem only appeared recently
  - E3-1 Check logs
    - O3-1-1 From ok1.qe.nue2.suse.org, also running on osiris-1, okurz found that the automatic os-update stopped after 2025-04-12, showing timeouts in /var/log/zypper.log since 2025-04-12. So "last good" is 2025-04-12
- REJECTED H4 The problem started with the recent Tumbleweed 20250410, which is the last upgraded version on ok1
  - E4-1 Try to recreate the problem on a different version
    - O4-1 Leap 15.6 was also shown to be affected
Reference
https://suse.slack.com/archives/C02AJ1E568M/p1746444515137329 and https://suse.slack.com/archives/C02AJ1E568M/p1746544263871029 - discussion by nicksinger, robert.richardson and okurz
Workaround
Manually set the MTU within the affected VM to a lower value, like 1360:
ip link set dev eth0 mtu 1360
This allows zypper and other network operations to proceed without hanging.
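A quick way to confirm the workaround took effect, assuming eth0 is the guest interface (1332 = 1360 minus 28 bytes of IP/ICMP headers):

ip link show dev eth0 | grep -o 'mtu [0-9]*'    # expect: mtu 1360
ping -M do -c 3 -s 1332 download.opensuse.org   # should now succeed at the reduced size
zypper ref                                      # should no longer hang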
Updated by nicksinger 26 days ago
I've added <mtu size='1360'/> to both interface definitions of the domains/VMs called "okurz" and "rrichardson-leap15.6" according to https://libvirt.org/formatdomain.html#mtu-configuration.
virt-manager told me this will be effective after the next guest shutdown, so please reboot whenever suited and try it out. If it helps, we can think about why this is needed now but not before 2025-04-12.
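For reference, a hedged way to check on osiris-1 whether the setting is active once the guests have been power-cycled (domain names taken from the comment above; the grep pattern is only illustrative):

virsh dumpxml okurz | grep -A 3 '<interface'
virsh dumpxml rrichardson-leap15.6 | grep -A 3 '<interface'
# each interface definition should now contain a line like: <mtu size='1360'/>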
Updated by okurz 26 days ago
@robert.richardson why didn't you take over our notes from the etherpad document?
Updated by robert.richardson 19 days ago
- Subject changed from MTU connection issues on osiris-1 virtual machines to MTU connection issues on osiris-1 virtual machines size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by openqa_review 18 days ago
- Due date set to 2025-05-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 18 days ago
Strange - it changed itself back to 1360.
It seems to come from salt:
# MTU for this network is 1360 bytes
network_mtu:
  file.keyvalue:
    - name: /etc/sysconfig/network/ifcfg-{{ grains["default_interface"] }}
    - append_if_not_found: True
    - separator: '='
    - key_values:
        MTU: "1360"
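A hedged way to inspect what that state actually writes on a minion (the ifcfg file name depends on the default_interface grain, so eth0 here is an assumption):

salt-call grains.get default_interface
grep '^MTU' /etc/sysconfig/network/ifcfg-eth0    # expect MTU="1360" while the state is applied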
Updated by dheidler 18 days ago
- Related to action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S added
Updated by dheidler 16 days ago
dheidler@openqa:~> LANG=c ping -M do -4 -s 1472 osiris-1.qe.nue2.suse.org
PING (10.168.192.102) 1472(1500) bytes of data.
From gateway.oqa.prg2.suse.org (10.145.10.254) icmp_seq=1 Frag needed and DF set (mtu = 1400)
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
Interestingly enough, the MTU between NUE2 and PRG2 is 1400.
But what is really noteworthy is that the ICMP messages for path MTU discovery finally seem to be working.
This means we should be fine to remove the <1500 MTU config for NUE2 hosts from salt.
For wireguard links we could set a tunnel MTU of 1320 (given the link MTU of 1400 between sites - see https://lists.zx2c4.com/pipermail/wireguard/2017-December/002201.html).
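A sketch of what that could look like (1320 = 1400 link MTU minus 80 bytes of WireGuard-over-IPv6 overhead per the linked post; the wg0 device name and the use of wg-quick are assumptions):

# either set it at runtime on the tunnel device ...
ip link set dev wg0 mtu 1320
# ... or persist it in the wg-quick config (/etc/wireguard/wg0.conf):
#   [Interface]
#   MTU = 1320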
Updated by okurz 9 days ago
- Status changed from Resolved to Workable
dheidler wrote in #note-13:
This is because the host has an MTU of only 1360 on the bridge device.
Not sure why the VM can use 4 bytes more than the bridge device MTU, though.
In any case the bridge device should have an MTU of 1500.
Doesn't this need to be made persistent, e.g. in salt? And why don't we need to do changes on qamaster?
Updated by dheidler 6 days ago
- Status changed from Workable to Resolved
qamaster has 1500:
dheidler@qamaster:~> ip li | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
6: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
8: eth6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
10: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
12: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
13: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
14: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
15: vnet4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
16: vnet5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
17: vnet6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
18: vnet7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
Doesn't this need to be made persistent, e.g. in salt? And why don't we need to do changes on qamaster?
See comments 18 and 19.
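For completeness, a hedged check that could be run on osiris-1 to confirm its bridge and VM tap devices now match qamaster (the br0/vnet* device names are assumptions based on the qamaster output above):

ip li | grep -E 'br0|vnet' | grep -o 'mtu [0-9]*' | sort | uniq -c    # expect only "mtu 1500"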