action #181862

closed

MTU connection issues on osiris-1 virtual machines size:M

Added by robert.richardson 26 days ago. Updated 6 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Observation

I have noticed strange connection issues when trying to set up a VM on osiris-1.qe.nue2.suse.org (NUE-2 machine). Although the initial setup of a fresh Leap image, including network connectivity, seems to work fine (I can successfully ping and curl download.opensuse.org from within the VM), as soon as I run any zypper command the connection gets stuck, without any output indicating what happened.

It was suggested that I run ping commands with a higher MTU value, as @okurz assumed these issues may be related to that. This seems to be correct, as those ping commands also get stuck.

It seems only VMs on osiris-1 are affected.

Similar network-related MTU issues have been reported in previous SD tickets.

Steps to reproduce

  1. Create a fresh Leap 15.6 or Tumbleweed VM on osiris-1.

  2. Confirm that basic network connectivity works (ping, curl to download.opensuse.org).

  3. Run zypper ref and observe that it hangs without completing.

  4. Alternatively, run:

    ping -Mdo -s1442 download.opensuse.org
    

    and observe that it also hangs.
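
For context (an illustrative note, not part of the original report): 1442 bytes of ICMP payload plus 28 bytes of IPv4/ICMP headers gives a 1470-byte packet, which is larger than what the broken path lets through. A rough way to bracket the usable size is to step the payload down, e.g.:

# step the ICMP payload down from 1472 (= 1500 - 28) until a probe passes;
# the last size that succeeds plus 28 is the effective path MTU
for size in 1472 1442 1400 1372 1336 1332; do
    if ping -M do -4 -c 1 -W 2 -s "$size" download.opensuse.org >/dev/null 2>&1; then
        echo "payload $size OK -> path MTU >= $((size + 28))"
        break
    fi
    echo "payload $size blocked or lost"
done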

Suggestions

  • DONE Is only this one VM affected? Try another VM on osiris
    -> ok1 also affected
  • DONE Is the host itself affected? Try on osiris-1 directly
    -> not affected
  • DONE Are only VMs on osiris-1 affected? Try another VM on another host (e.g. ada.qe.suse.de)
    -> not affected
  • Consider introducing a network diagnostic hook or health check script for VM post-boot validation (see the sketch after the hypothesis list below).
Hypotheses / experiments and observations:
  • REJECTED H1 All NUE2 QE machines have problems with MTU sizes
    • E1-1 Select any other than the original machine and call zypper ref
      and observe if this times out
      • O1-1-1 osiris itself has no problem
  • REJECTED H1.1* All NUE2 QE non-salt controlled machines have problems with MTU sizes
    • -> see O1.2-1
  • ACCEPTED H1.2 Only VMs on osiris have problems
    • E1.2-1 Try a VM elsewhere e.g. qamaster
      • O1.2-1 VM on ada.qe.suse.de not affected
    • E1.2-2 Try another VM on osiris
      • O1.2-2 ok1 also affected
  • H1.3 Only VMs in NUE2 have problems
    • E1.3-1 Try a VM elsewhere within NUE2, e.g. qamaster, and compare VM settings
  • REJECTED H1.4 Only the rrichardson VM has problems
    • E1.4-1 See O1.2-2
  • ACCEPTED H2 The problem of zypper ref can be more easily reproduced with ping -Mdo -s1442 download.opensuse.org
    • E2-1 Try the ping and if it fails then we can assume this is a valid
      reproducer until that is fixed. Then verify with zypper ref again
      • O2-1-1 confirmed to reproduce the error, so assumed to be a valid
        reproducer
  • ACCEPTED H3 The MTU size problem only appeared recently
    • E3-1 Check logs
      • O3-1-1 On ok1.qe.nue2.suse.org, also running on osiris-1, okurz found that the automatic os-update stopped after 2025-04-12, showing timeouts in /var/log/zypper.log since 2025-04-12. So "last good" is 2025-04-12
  • REJECTED H4 The problem started with recent Tumbleweed 20250410 which is the last upgraded version on ok1
    • E4-1 Try to recreate the problem on a different version
      • O4-1 Leap 15.6 was also shown to be affected
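
A minimal sketch of what such a post-boot health check could look like (a hypothetical script; interface and target names are placeholders and would need to be adapted):

#!/bin/sh
# hypothetical post-boot network health check for a freshly created VM
IFACE=${1:-eth0}
TARGET=${2:-download.opensuse.org}

echo "link MTU on $IFACE: $(cat /sys/class/net/"$IFACE"/mtu)"

# basic reachability
ping -c 1 -W 2 "$TARGET" >/dev/null || { echo "FAIL: $TARGET not reachable"; exit 1; }

# full-size frames must pass, otherwise zypper and friends will hang
if ping -M do -4 -c 1 -W 2 -s 1472 "$TARGET" >/dev/null 2>&1; then
    echo "OK: full 1500-byte path MTU works"
else
    echo "FAIL: full-size packets do not pass (possible MTU/PMTUD problem)"
    exit 1
fi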

Workaround

Manually set the MTU size within the affected VM to a lower value, like 1360:

ip link set dev eth0 mtu 1360

This allows zypper and other network operations to proceed without hanging.
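
Note that ip link changes are not persistent across reboots, so the value has to be reapplied (or configured in the guest's ifcfg file) after each boot. To verify that the workaround took effect (illustrative commands, assuming eth0 as in the workaround above):

ip link show dev eth0 | grep -o 'mtu [0-9]*'       # should now report mtu 1360
ping -M do -4 -c 1 -s 1332 download.opensuse.org   # 1332 + 28 = 1360, should pass
zypper ref                                         # should no longer hang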


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S (Resolved, nicksinger, 2025-03-07)

Actions #1

Updated by okurz 26 days ago

From ok1.qe.nue2.suse.org, also running on osiris-1, I found that the automatic os-update stopped after 2025-04-12, showing timeouts in /var/log/zypper.log since 2025-04-12. So "last good" is 2025-04-12

Actions #2

Updated by robert.richardson 26 days ago

  • Description updated (diff)
Actions #3

Updated by nicksinger 26 days ago

I've added <mtu size='1360'/> to both interface definitions of the domains/VMs called "okurz" and "rrichardson-leap15.6", following https://libvirt.org/formatdomain.html#mtu-configuration

virt-manager told me this will only take effect after the next guest shutdown, so please reboot whenever suited and try it out. If it helps, we can think about why this is needed now but not before 2025-04-12
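
A rough way to check whether the new MTU element is in effect (illustrative commands, run on the host and inside the guest respectively; the domain name is taken from the comment above, the guest interface name may differ):

# on the host: confirm the interface definition now carries the mtu element
virsh dumpxml rrichardson-leap15.6 | grep -B 2 -A 2 "<mtu"
# inside the guest, after the next shutdown/start:
ip link show dev eth0 | grep -o 'mtu [0-9]*'   # should report mtu 1360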

Actions #4

Updated by okurz 26 days ago

@robert.richardson why didn't you take over our notes from the etherpad document?

Actions #5

Updated by okurz 26 days ago

  • Tags set to infra, reactive work, nue2, mtu, network, it, vm, osiris
  • Category set to Regressions/Crashes
  • Target version set to future
Actions #6

Updated by robert.richardson 25 days ago

  • Description updated (diff)
Actions #7

Updated by okurz 23 days ago

  • Target version changed from future to Ready
Actions #8

Updated by robert.richardson 19 days ago

  • Subject changed from MTU connection issues on osiris-1 virtual machines to MTU connection issues on osiris-1 virtual machines size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by dheidler 19 days ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #10

Updated by openqa_review 18 days ago

  • Due date set to 2025-05-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan 18 days ago

Reportedly, setting up a test VM was proving tricky

Actions #12

Updated by dheidler 18 days ago

Created a VM on osiris-1 with its ethernet interface bridged to the host.

The maximum MTU that goes through when pinging hosts on the internet is 1364 (the ping payload parameter is 1336 due to 28 bytes of overhead for ICMP/IPv4):

ping -M do -4 -s 1336 1.1.1.1
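
For reference, the arithmetic behind those numbers (an illustrative note, not from the ticket itself):

# 1336 bytes ICMP payload + 8 bytes ICMP header + 20 bytes IPv4 header = 1364-byte packet,
# i.e. the largest frame that currently gets through; a clean 1500 path would allow -s 1472
ping -M do -4 -c 1 -s 1472 1.1.1.1   # expected to fail as long as the bridge MTU is capped
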
Actions #13

Updated by dheidler 18 days ago

This is because the host has an MTU of only 1360 on the bridge device.
Not sure why the VM can use 4 bytes more than the bridge device MTU, though.
In any case, the bridge device should have an MTU of 1500.
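
A quick host-side check of the bridge and its attached tap devices would look roughly like this (device name br0 assumed from the following comment):

ip -o link show br0          # bridge MTU (currently 1360)
ip -o link show master br0   # tap/vnet devices enslaved to br0 and their MTUs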

Actions #14

Updated by dheidler 18 days ago

I reconfigured the ifcfg entry for br0 on the host and increased the configured MTU to 1500.
But the VMs need to be rebooted to detect the new MTU for their TAP devices.
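
The host-side change presumably amounts to something like the following (a sketch under SUSE/wicked conventions; the exact file content was not posted in the ticket):

# raise the configured MTU in the bridge's ifcfg file (assumes an MTU= line already exists)
sed -i "s/^MTU=.*/MTU='1500'/" /etc/sysconfig/network/ifcfg-br0
wicked ifreload br0
# guests still need a reboot so their tap devices pick up the new MTU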

Actions #15

Updated by dheidler 18 days ago

Strange - it changed itself back to 1360.
This seems to come from salt:

  # MTU for this network is 1360 bytes
  network_mtu:
    file.keyvalue:
      - name: /etc/sysconfig/network/ifcfg-{{ grains["default_interface"] }}
      - append_if_not_found: True
      - separator: '='
      - key_values:
          MTU: "1360"
Actions #16

Updated by dheidler 18 days ago

  • Related to action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S added
Actions #17

Updated by dheidler 16 days ago

dheidler@openqa:~> LANG=c ping -M do -4 -s 1472 osiris-1.qe.nue2.suse.org
PING osiris-1.qe.nue2.suse.org (10.168.192.102) 1472(1500) bytes of data.
From gateway.oqa.prg2.suse.org (10.145.10.254) icmp_seq=1 Frag needed and DF set (mtu = 1400)
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400

Interestingly enough, the MTU between NUE2 and PRG2 is 1400.
But what is really noteworthy is that the ICMP messages for path MTU discovery finally seem to be working.
This means we should be fine to remove the <1500 MTU config for NUE2 hosts from salt.

For wireguard links we could set a tunnel MTU of 1320 (given the link MTU of 1400 between sites - see https://lists.zx2c4.com/pipermail/wireguard/2017-December/002201.html).
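
For the wireguard part, a minimal sketch of how that could be applied (the interface name wg0 is a placeholder; 1320 = 1400 link MTU minus 80 bytes of worst-case WireGuard/IPv6+UDP overhead):

ip link set dev wg0 mtu 1320
# or persistently via "MTU = 1320" in the [Interface] section of a wg-quick config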

Actions #19

Updated by dheidler 16 days ago

  • Status changed from In Progress to Feedback
Actions #20

Updated by okurz 13 days ago

Both merged. Please check for the effect.

Actions #21

Updated by dheidler 9 days ago

  • Status changed from Feedback to Resolved

Seems to work fine.

Also updated the VM config to use an MTU of 1500.
Path MTU discovery also works as expected in the VM.
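
An illustrative way to double-check this from inside a VM (commands are assumptions, not taken from the ticket):

ip link show dev eth0 | grep -o 'mtu [0-9]*'   # expect mtu 1500 after the config update
tracepath -4 download.opensuse.org             # should report the discovered pmtu along the path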

Actions #22

Updated by okurz 9 days ago

  • Status changed from Resolved to Workable

dheidler wrote in #note-13:

This is because the host has an MTU of only 1360 on the bridge device.
Not sure, why the VM can use 4 bytes more than the bridge device MTU, though.
In any case the bridge device should have an MTU of 1500.

Doesn't this need to be made persistent, e.g. in salt? And why don't we need to do changes on qamaster?

Actions #23

Updated by dheidler 6 days ago

  • Status changed from Workable to Resolved

qamaster has 1500:

dheidler@qamaster:~> ip li | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
6: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
8: eth6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
10: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
12: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
13: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
14: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
15: vnet4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
16: vnet5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
17: vnet6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
18: vnet7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000

Doesn't this need to be made persistent, e.g. in salt? And why don't we need to do changes on qamaster?

See comments #18 and #19.

Actions #24

Updated by okurz 6 days ago

  • Due date deleted (2025-05-28)
