action #181862

open

MTU connection issues on osiris-1 virtual machines

Added by robert.richardson 1 day ago. Updated about 11 hours ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

I have noticed strange connection issues when trying to set up a VM on osiris-1.qe.nue2.suse.org (NUE-2 machine). The initial setup of a fresh Leap image, including network connectivity, seems to work fine (I can successfully ping and curl download.opensuse.org from within the VM). However, as soon as I run any zypper command, the connection gets stuck without any output indicating what happened.

@okurz assumed these issues might be MTU-related and suggested running ping with a larger packet size. That assumption seems to be correct, as those ping commands also get stuck.

It seems only VMs on osiris-1 are affected.

Similar network-related MTU issues have been reported in previous SD tickets:

Steps to reproduce

  1. Create a fresh Leap 15.6 or Tumbleweed VM on osiris-1.

  2. Confirm that basic network connectivity works (ping, curl to download.opensuse.org).

  3. Run zypper ref and observe that it hangs without completing.

  4. Alternatively, run:

    ping -Mdo -s1442 download.opensuse.org

    and observe that it also hangs.
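
For reference, the value passed with -s is the ICMP payload size, not the packet size on the wire. Assuming IPv4 without IP options, ping adds an 8-byte ICMP header and a 20-byte IP header, so -s1442 produces a 1470-byte packet. The following is a minimal sketch of that arithmetic together with a probe that gives up after a timeout instead of hanging. The function names are made up for illustration, and the -4, -c 1 and -W 2 options are assumptions that are not part of the original reproducer.

```shell
#!/bin/sh
# Overhead of an ICMP echo request over IPv4 without IP options:
# 20 bytes IP header + 8 bytes ICMP header.
ICMP_IPV4_OVERHEAD=28

# packet_size PAYLOAD: on-wire IP packet size for a given "ping -s" payload.
packet_size() {
    echo $(( $1 + ICMP_IPV4_OVERHEAD ))
}

# max_payload MTU: largest "ping -s" payload that fits into MTU unfragmented.
max_payload() {
    echo $(( $1 - ICMP_IPV4_OVERHEAD ))
}

# probe HOST PAYLOAD: send one non-fragmentable IPv4 echo request and wait
# at most 2 seconds for the reply. Returns the exit status of ping.
probe() {
    ping -4 -M do -c 1 -W 2 -s "$2" "$1" >/dev/null 2>&1
}
```

With these helpers, `packet_size 1442` prints 1470 and `max_payload 1360` prints 1332, and `probe download.opensuse.org 1442 || echo "payload 1442 does not pass"` returns after about two seconds rather than blocking.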

Suggestions

  • DONE Is only this one VM affected? Try another VM on osiris-1
    -> ok1 also affected
  • DONE Is the host itself affected? Try on osiris-1 directly
    -> not affected
  • DONE Are only VMs on osiris-1 affected? Try another VM on another host (e.g. ada.qe.suse.de)
    -> not affected
  • Consider introducing a network diagnostic hook or health check script for VM post-boot validation.
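
A post-boot check along the lines of the last suggestion could look like the sketch below. It binary-searches for the largest ping payload that still gets a reply when fragmentation is forbidden. This assumes that success is monotonic in the payload size (everything up to some limit passes, everything above it fails). The function names, the target host and the search bounds are illustrative assumptions, not an existing hook.

```shell
#!/bin/sh
# find_max_payload LOW HIGH PROBE...: binary search for the largest size in
# [LOW, HIGH] for which "PROBE... SIZE" succeeds. Prints -1 if none succeeds.
find_max_payload() {
    low=$1
    high=$2
    shift 2
    best=-1
    while [ "$low" -le "$high" ]; do
        mid=$(( (low + high) / 2 ))
        if "$@" "$mid"; then
            best=$mid              # mid passes, look for something larger
            low=$(( mid + 1 ))
        else
            high=$(( mid - 1 ))    # mid fails, look for something smaller
        fi
    done
    echo "$best"
}

# icmp_probe SIZE: one non-fragmentable IPv4 echo request with a 2 s timeout.
# The target host is an assumption taken from the reproducer in this ticket.
icmp_probe() {
    ping -4 -M do -c 1 -W 2 -s "$1" download.opensuse.org >/dev/null 2>&1
}
```

Calling `find_max_payload 1200 1472 icmp_probe` inside a VM prints the largest payload that passed, or -1 if even the lower bound fails. Adding 28 bytes of IPv4 and ICMP headers to the result gives an estimate of the usable path MTU.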

Hypotheses, experiments and observations

  • REJECTED H1 All NUE2 QE machines have problems with MTU sizes
    • E1-1 Pick any machine other than the original one, run zypper ref and observe whether it times out
      • O1-1-1 osiris itself has no problem
  • REJECTED H1.1* All NUE2 QE non-salt controlled machines have problems with MTU sizes
    • -> see O1.2-1
  • ACCEPTED H1.2 Only VMs on osiris have problems
    • E1.2-1 Try a VM elsewhere e.g. qamaster
      • O1.2-1 VM on ada.qe.suse.de not affected
    • E1.2-2 Try another VM on osiris
      • O1.2-2 ok1 also affected
  • REJECTED H1.3 Only rrichardson VM has problems
    • E1.3-1 See O1.2-2
  • ACCEPTED H2 The problem of zypper ref can be more easily reproduced with ping -Mdo -s1442 download.opensuse.org
    • E2-1 Try the ping; if it fails, we can assume it is a valid reproducer until the problem is fixed. Then verify with zypper ref again
      • O2-1-1 The ping reproduces an error, so it is assumed to be a valid reproducer
  • ACCEPTED H3 The MTU size problem only appeared recently
    • E3-1 Check logs
      • O3-1-1 On ok1.qe.nue2.suse.org, which also runs on osiris-1, okurz found that the automatic os-update stopped after 2025-04-12, with /var/log/zypper.log showing timeouts since 2025-04-12. So "last good" is 2025-04-12
  • REJECTED H4 The problem started with recent Tumbleweed 20250410 which is the last upgraded version on ok1
    • E4-1 Try to recreate the problem on a different version
      • O4-1 Leap 15.6 was also shown to be affected
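
The log check behind O3-1-1 can be approximated as shown below. This is only a sketch: it assumes that each line in /var/log/zypper.log starts with a YYYY-MM-DD date and that the failing entries contain the word "timeout" in some capitalisation. The exact message text was not recorded in this ticket, and first_timeout_date is a hypothetical helper name.

```shell
#!/bin/sh
# first_timeout_date LOGFILE: print the earliest date (YYYY-MM-DD) on which a
# line mentioning a timeout was logged, or nothing if there is no such line.
first_timeout_date() {
    grep -i 'timeout' "$1" \
        | sed -n 's/^\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' \
        | sort -u \
        | head -n 1
}
```

Running `first_timeout_date /var/log/zypper.log` on ok1 would then give a candidate for the first bad day, to be compared against the "last good" date above.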

Workaround

Manually set the MTU size within the affected VM to a lower value, like 1360:

ip link set dev eth0 mtu 1360

This allows zypper and other network operations to proceed without hanging.
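
Note that a change made with ip link only lasts until the interface is reconfigured or the VM is rebooted. A small dry-run helper for reapplying the workaround only when needed could look like the sketch below. It reads the current MTU from sysfs and prints the command instead of running it. The helper name and the SYSFS_NET override are hypothetical, and eth0 is assumed to be the interface name as in the command above.

```shell
#!/bin/sh
# Directory with per-interface attributes; overridable for testing.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# lower_mtu_cmd DEV TARGET: print the "ip link" command needed to lower the
# MTU of DEV to TARGET, or nothing if the MTU is already at or below TARGET.
lower_mtu_cmd() {
    cur=$(cat "$SYSFS_NET/$1/mtu") || return 1
    if [ "$cur" -gt "$2" ]; then
        echo "ip link set dev $1 mtu $2"
    fi
}
```

On a VM with the default MTU of 1500, `lower_mtu_cmd eth0 1360` prints the workaround command, which can then be reviewed and run as root.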

Actions #1

Updated by okurz 1 day ago

From ok1.qe.nue2.suse.org also running on osiris-1 I found that the automatic os-update stopped after 2025-04-12 showing timeouts in /var/log/zypper.log since 2025-04-12. So "last good" 2025-04-12

Actions #2

Updated by robert.richardson 1 day ago

  • Description updated (diff)
Actions #3

Updated by nicksinger 1 day ago

I've added <mtu size='1360'/> to both interface definitions of the domains/VMs called "okurz" and "rrichardson-leap15.6" according to https://libvirt.org/formatdomain.html#mtu-configuration

virt-manager told me this will take effect after the next guest shutdown, so please reboot whenever convenient and try it out. If it helps, we can think about why this is needed now but was not needed before 2025-04-12.
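
Whether the element ended up in the persistent domain definition can be checked from the host. The following is a minimal sketch, assuming virsh is available on osiris-1 and the domain names are as given above. has_mtu is a hypothetical helper, and its pattern assumes that libvirt prints the element on a single line.

```shell
#!/bin/sh
# has_mtu SIZE: read domain XML on stdin and succeed if it contains an
# <mtu size='SIZE'/> element (single or double quotes accepted).
has_mtu() {
    grep -Eq "<mtu size=['\"]$1['\"] */>"
}

# Example (not run here): check the persistent definitions of both domains.
# for dom in okurz rrichardson-leap15.6; do
#     virsh dumpxml --inactive "$dom" | has_mtu 1360 || echo "$dom: no mtu 1360"
# done
```

After the guest has been rebooted, the effective value inside the VM can be read from /sys/class/net/eth0/mtu, again assuming eth0 is the interface name.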

Actions #4

Updated by okurz 1 day ago

@robert.richardson why didn't you take over our notes from the etherpad document?

Actions #5

Updated by okurz 1 day ago

  • Tags set to infra, reactive work, nue2, mtu, network, it, vm, osiris
  • Category set to Regressions/Crashes
  • Target version set to future
Actions #6

Updated by robert.richardson about 11 hours ago

  • Description updated (diff)