action #181862
MTU connection issues on osiris-1 virtual machines
Description
Observation
I have noticed strange connection issues when trying to set up a VM on osiris-1.qe.nue2.suse.org (NUE-2 machine). The initial setup of a fresh Leap image, including network connectivity, seems to work fine (I can successfully ping and curl download.opensuse.org from within the VM), but as soon as I run any zypper command the connection gets stuck, without any output indicating what happened.
@okurz suspected an MTU problem and suggested running ping commands with a larger packet size. That appears to be correct, as those ping commands also get stuck.
It seems only VMs on osiris-1 are affected.
Similar network-related MTU issues have been reported in previous SD tickets:
Steps to reproduce
- Create a fresh Leap 15.6 or Tumbleweed VM on osiris-1.
- Confirm that basic network connectivity works (ping, curl to download.opensuse.org).
- Run zypper ref and observe that it hangs without completing.
- Alternatively, run:
ping -Mdo -s1442 download.opensuse.org
and observe that it also hangs.
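For reference, the size given to ping -s is only the ICMP payload. On IPv4 without IP options the packet on the wire is 28 bytes larger (20 bytes IP header plus 8 bytes ICMP header), so -s1442 produces a 1470-byte packet, which would normally fit a standard 1500-byte Ethernet MTU. A minimal sketch of that arithmetic follows; the helper names are hypothetical and not part of any existing tool:

```shell
#!/bin/sh
# Hypothetical helpers (not from the ticket): convert between the ICMP payload
# size passed to `ping -s` and the resulting IPv4 packet size.
# Assumption: IPv4 without IP options, i.e. 20 bytes IP + 8 bytes ICMP header.
ICMP_OVERHEAD=28

# Print the IPv4 packet size produced by `ping -s <payload>`.
packet_size_for_payload() {
    echo $(( $1 + ICMP_OVERHEAD ))
}

# Print the largest `ping -s` payload that fits into the given MTU.
max_payload_for_mtu() {
    echo $(( $1 - ICMP_OVERHEAD ))
}
```

For example, `packet_size_for_payload 1442` prints 1470, and `max_payload_for_mtu 1360` prints 1332.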
Suggestions
- DONE Is only this one VM affected? Try another VM on osiris -> ok1 also affected
- DONE Is the host itself affected? Try on osiris-1 directly -> not affected
- DONE Are only VMs on osiris-1 affected? Try another VM on another host (e.g. ada.qe.suse.de) -> not affected
- Consider introducing a network diagnostic hook or health check script for VM post-boot validation.
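A post-boot check along these lines could look roughly like the following sketch. It assumes iputils ping (for -M do and -W) and searches for the largest payload that still passes with "don't fragment" set. The function names and the PING_BIN override are illustrative assumptions, not an existing tool:

```shell
#!/bin/sh
# Sketch of a post-boot MTU check. PING_BIN and all function names are
# assumptions for illustration; iputils ping is assumed for -M do and -W.
PING_BIN=${PING_BIN:-ping}

# Succeeds if one non-fragmentable ICMP echo ($2 = payload bytes) to host $1
# gets a reply within 2 seconds. -c/-W turn a silent hang into a failure.
df_ping_ok() {
    "$PING_BIN" -c 1 -W 2 -M do -s "$2" "$1" >/dev/null 2>&1
}

# Binary search for the largest payload in [lo, hi] that still gets through.
# Prints that payload; fails without output if even `lo` does not pass.
largest_working_payload() {
    host=$1 lo=$2 hi=$3
    df_ping_ok "$host" "$lo" || return 1
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if df_ping_ok "$host" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}
```

A hook could call, for instance, `largest_working_payload download.opensuse.org 1000 1472` and raise an alert when the result is smaller than expected for the configured interface MTU.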
Entire list of hypotheses, experiments and observations:
- REJECTED H1 All NUE2 QE machines have problems with MTU sizes
  - E1-1 Select any machine other than the original one, call zypper ref and observe whether it times out
    - O1-1-1 osiris itself has no problem
- REJECTED H1.1 All NUE2 QE non-salt controlled machines have problems with MTU sizes
  - -> see O1.2-1
- ACCEPTED H1.2 Only VMs on osiris have problems
  - E1.2-1 Try a VM elsewhere, e.g. qamaster
    - O1.2-1 VM on ada.qe.suse.de not affected
  - E1.2-2 Try another VM on osiris
    - O1.2-2 ok1 also affected
- REJECTED H1.3 Only rrichardson VM has problems
  - E1.3-1 See O1.2-2
- ACCEPTED H2 The problem of zypper ref can be more easily reproduced with ping -Mdo -s1442 download.opensuse.org
  - E2-1 Try the ping and if it fails then we can assume this is a valid reproducer until that is fixed. Then verify with zypper ref again
    - O2-1-1 confirmed reproducing an error, so assumed to be a valid reproducer
- ACCEPTED H3 The MTU size problem only appeared recently
  - E3-1 Check logs
    - O3-1-1 On ok1.qe.nue2.suse.org, which also runs on osiris-1, okurz found that the automatic os-update stopped after 2025-04-12, with timeouts in /var/log/zypper.log since 2025-04-12. So "last good" is 2025-04-12
- REJECTED H4 The problem started with recent Tumbleweed 20250410, which is the last upgraded version on ok1
  - E4-1 Try to recreate the problem on a different version
    - O4-1 Leap 15.6 was also shown to be affected
Workaround
Manually set the MTU size within the affected VM to a lower value, like 1360:
ip link set dev eth0 mtu 1360
This allows zypper and other network operations to proceed without hanging.
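Note that a value set with ip link is a runtime setting and does not survive a reboot. To confirm which MTU the kernel is currently using, the value can be read back from sysfs. The following is a small sketch; the helper names and the SYSFS_NET override are assumptions made for illustration and testability:

```shell
#!/bin/sh
# Hypothetical helpers: read back the MTU the kernel currently uses.
# SYSFS_NET is overridable only so the functions can be tried without a real NIC.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Print the current MTU of interface $1 (e.g. eth0).
current_mtu() {
    cat "$SYSFS_NET/$1/mtu"
}

# Succeeds if the MTU of interface $1 is at most $2.
mtu_at_most() {
    [ "$(current_mtu "$1")" -le "$2" ]
}
```

After applying the workaround, `current_mtu eth0` should print 1360, and `mtu_at_most eth0 1360` can serve as a guard before running zypper in scripts.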
Updated by nicksinger 1 day ago
I've added <mtu size='1360'/> to both interface definitions of the domains/VMs called "okurz" and "rrichardson-leap15.6" according to https://libvirt.org/formatdomain.html#mtu-configuration
virt-manager told me this will be effective after the next guest shutdown, so please reboot whenever suited and try it out. If it works, we can think about why this is needed now but was not before 2025-04-12.
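To double-check that the persistent domain definition carries the setting before the next guest shutdown, the inactive XML can be inspected on the host. This is a sketch only: xml_has_mtu is a hypothetical helper, and it merely looks for the element anywhere in the XML rather than verifying each interface separately:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the XML on stdin contains an
# <mtu size='N'/> element with the size given as $1.
xml_has_mtu() {
    grep -Eq "<mtu size=['\"]$1['\"] */>"
}

# Example use on the libvirt host (domain name taken from the comment above):
#   virsh dumpxml --inactive okurz | xml_has_mtu 1360 && echo "mtu configured"
```

Using --inactive shows the stored definition that takes effect on the next start, which matches the note that the change only applies after a guest shutdown.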
Updated by okurz 1 day ago
@robert.richardson why didn't you take over our notes from the etherpad document?