action #181862

closed

MTU connection issues on osiris-1 virtual machines size:M

Added by robert.richardson 26 days ago. Updated 6 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Observation

I have noticed strange connection issues when trying to set up a VM on osiris-1.qe.nue2.suse.org (NUE-2 machine). Although the initial setup of a fresh Leap image, including network connectivity, seems to work fine (I can successfully ping and curl download.opensuse.org from within the VM), as soon as I run any zypper command the connection gets stuck, without any output indicating what happened.

It was suggested that I run ping commands with a higher MTU value, as @okurz assumed these issues may be related to that. This seems to be correct, as those ping commands also get stuck.

It seems only VMs on osiris-1 are affected.

Similar network-related MTU issues have been reported in previous SD tickets.

Steps to reproduce

  1. Create a fresh Leap 15.6 or Tumbleweed VM on osiris-1.

  2. Confirm that basic network connectivity works (ping, curl to download.opensuse.org).

  3. Run zypper ref and observe that it hangs without completing.

  4. Alternatively, run:

    ping -Mdo -s1442 download.opensuse.org
    

    and observe that it also hangs.
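
For context (an illustrative note, not part of the original report): 1442 bytes of ICMP payload plus 28 bytes of IPv4/ICMP headers gives a 1470-byte packet, which is larger than what the broken path lets through. A rough way to bracket the usable size is to step the payload down, e.g.:

# step the ICMP payload down from 1472 (= 1500 - 28) until a probe passes;
# the last size that succeeds plus 28 is the effective path MTU
for size in 1472 1442 1400 1372 1336 1332; do
    if ping -M do -4 -c 1 -W 2 -s "$size" download.opensuse.org >/dev/null 2>&1; then
        echo "payload $size OK -> path MTU >= $((size + 28))"
        break
    fi
    echo "payload $size blocked or lost"
done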

Suggestions

  • DONE Is only this one VM affected? Try another VM on osiris
    -> ok1 also affected
  • DONE Is the host itself affected? Try on osiris-1 directly
    -> not affected
  • DONE Are only VMs on osiris-1 affected? Try another VM on another host (e.g. ada.qe.suse.de)
    -> not affected
  • Consider introducing a network diagnostic hook or health check script for VM post-boot validation (see the sketch after the hypothesis list below).
Hypotheses / experiments and observations:
  • REJECTED H1 All NUE2 QE machines have problems with MTU sizes
    • E1-1 Select any other than the original machine and call zypper ref
      and observe if this times out
      • O1-1-1 osiris itself has no problem
  • REJECTED H1.1* All NUE2 QE non-salt controlled machines have problems with MTU sizes
    • -> see O1.2-1
  • ACCEPTED H1.2 Only VMs on osiris have problems
    • E1.2-1 Try a VM elsewhere e.g. qamaster
      • O1.2-1 VM on ada.qe.suse.de not affected
    • E1.2-2 Try another VM on osiris
      • O1.2-2 ok1 also affected
  • H1.3 Only VMs in NUE2 have problems
    • E1.3-1 Try a VM elsewhere within NUE2, e.g. qamaster, and compare VM settings
  • REJECTED H1.4 Only the rrichardson VM has problems
    • E1.4-1 See O1.2-2
  • ACCEPTED H2 The problem of zypper ref can be more easily reproduced with ping -Mdo -s1442 download.opensuse.org
    • E2-1 Try the ping and if it fails then we can assume this is a valid
      reproducer until that is fixed. Then verify with zypper ref again
      • O2-1-1 confirmed to reproduce the error, so assumed to be a valid
        reproducer
  • ACCEPTED H3 The MTU size problem only appeared recently
    • E3-1 Check logs
      • O3-1-1 On ok1.qe.nue2.suse.org, also running on osiris-1, okurz found that the automatic os-update stopped after 2025-04-12, showing timeouts in /var/log/zypper.log since 2025-04-12. So "last good" is 2025-04-12
  • REJECTED H4 The problem started with recent Tumbleweed 20250410 which is the last upgraded version on ok1
    • E4-1 Try to recreate the problem on a different version
      • O4-1 Leap 15.6 was also shown to be affected
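
A minimal sketch of what such a post-boot health check could look like (a hypothetical script; interface and target names are placeholders and would need to be adapted):

#!/bin/sh
# hypothetical post-boot network health check for a freshly created VM
IFACE=${1:-eth0}
TARGET=${2:-download.opensuse.org}

echo "link MTU on $IFACE: $(cat /sys/class/net/"$IFACE"/mtu)"

# basic reachability
ping -c 1 -W 2 "$TARGET" >/dev/null || { echo "FAIL: $TARGET not reachable"; exit 1; }

# full-size frames must pass, otherwise zypper and friends will hang
if ping -M do -4 -c 1 -W 2 -s 1472 "$TARGET" >/dev/null 2>&1; then
    echo "OK: full 1500-byte path MTU works"
else
    echo "FAIL: full-size packets do not pass (possible MTU/PMTUD problem)"
    exit 1
fi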

Workaround

Manually set the MTU size within the affected VM to a lower value, like 1360:

ip link set dev eth0 mtu 1360

This allows zypper and other network operations to proceed without hanging.
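
Note that ip link changes are not persistent across reboots, so the value has to be reapplied (or configured in the guest's ifcfg file) after each boot. To verify that the workaround took effect (illustrative commands, assuming eth0 as in the workaround above):

ip link show dev eth0 | grep -o 'mtu [0-9]*'       # should now report mtu 1360
ping -M do -4 -c 1 -s 1332 download.opensuse.org   # 1332 + 28 = 1360, should pass
zypper ref                                         # should no longer hang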


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S (Resolved, nicksinger, 2025-03-07)

Actions #1

Updated by okurz 26 days ago

From ok1.qe.nue2.suse.org, also running on osiris-1, I found that the automatic os-update stopped after 2025-04-12, showing timeouts in /var/log/zypper.log since 2025-04-12. So "last good" is 2025-04-12

Actions #2

Updated by robert.richardson 26 days ago

  • Description updated (diff)
Actions #3

Updated by nicksinger 26 days ago

I've added <mtu size='1360'/> to both interface definitions of the domains/VMs called "okurz" and "rrichardson-leap15.6", following https://libvirt.org/formatdomain.html#mtu-configuration

virt-manager told me this will only take effect after the next guest shutdown, so please reboot whenever suited and try it out. If it helps, we can think about why this is needed now but not before 2025-04-12
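
A rough way to check whether the new MTU element is in effect (illustrative commands, run on the host and inside the guest respectively; the domain name is taken from the comment above, the guest interface name may differ):

# on the host: confirm the interface definition now carries the mtu element
virsh dumpxml rrichardson-leap15.6 | grep -B 2 -A 2 "<mtu"
# inside the guest, after the next shutdown/start:
ip link show dev eth0 | grep -o 'mtu [0-9]*'   # should report mtu 1360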

Actions #4

Updated by okurz 26 days ago

@robert.richardson why didn't you take over our notes from the etherpad document?

Actions #5

Updated by okurz 26 days ago

  • Tags set to infra, reactive work, nue2, mtu, network, it, vm, osiris
  • Category set to Regressions/Crashes
  • Target version set to future
Actions #6

Updated by robert.richardson 25 days ago

  • Description updated (diff)
Actions #7

Updated by okurz 23 days ago

  • Target version changed from future to Ready
Actions #8

Updated by robert.richardson 19 days ago

  • Subject changed from MTU connection issues on osiris-1 virtual machines to MTU connection issues on osiris-1 virtual machines size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by dheidler 19 days ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #10

Updated by openqa_review 18 days ago

  • Due date set to 2025-05-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan 18 days ago

Reportedly, setting up a test VM was proving tricky

Actions #12

Updated by dheidler 18 days ago

Created a VM on osiris-1 with its ethernet interface bridged to the host.

The maximum MTU that goes through when pinging hosts on the internet is 1364 (the ping payload parameter is 1336 due to 28 bytes of overhead for ICMP/IPv4):

ping -M do -4 -s 1336 1.1.1.1
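
For reference, the arithmetic behind those numbers (an illustrative note, not from the ticket itself):

# 1336 bytes ICMP payload + 8 bytes ICMP header + 20 bytes IPv4 header = 1364-byte packet,
# i.e. the largest frame that currently gets through; a clean 1500 path would allow -s 1472
ping -M do -4 -c 1 -s 1472 1.1.1.1   # expected to fail as long as the bridge MTU is capped
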
Actions #13

Updated by dheidler 18 days ago

This is because the host has an MTU of only 1360 on the bridge device.
Not sure why the VM can use 4 bytes more than the bridge device MTU, though.
In any case, the bridge device should have an MTU of 1500.
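
A quick host-side check of the bridge and its attached tap devices would look roughly like this (device name br0 assumed from the following comment):

ip -o link show br0          # bridge MTU (currently 1360)
ip -o link show master br0   # tap/vnet devices enslaved to br0 and their MTUs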

Actions #14

Updated by dheidler 18 days ago

I reconfigured the ifcfg entry for br0 on the host and increased the configured MTU to 1500.
But the VMs need to be rebooted to detect the new MTU for their TAP devices.
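
The host-side change presumably amounts to something like the following (a sketch under SUSE/wicked conventions; the exact file content was not posted in the ticket):

# raise the configured MTU in the bridge's ifcfg file (assumes an MTU= line already exists)
sed -i "s/^MTU=.*/MTU='1500'/" /etc/sysconfig/network/ifcfg-br0
wicked ifreload br0
# guests still need a reboot so their tap devices pick up the new MTU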

Actions #15

Updated by dheidler 18 days ago

Strange - it changed itself back to 1360.
This seems to come from salt:

  # MTU for this network is 1360 bytes
  network_mtu:
    file.keyvalue:
      - name: /etc/sysconfig/network/ifcfg-{{ grains["default_interface"] }}
      - append_if_not_found: True
      - separator: '='
      - key_values:
          MTU: "1360"
Actions #16

Updated by dheidler 18 days ago

  • Related to action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S added
Actions #17

Updated by dheidler 16 days ago

dheidler@openqa:~> LANG=c ping -M do -4 -s 1472 osiris-1.qe.nue2.suse.org
PING osiris-1.qe.nue2.suse.org (10.168.192.102) 1472(1500) bytes of data.
From gateway.oqa.prg2.suse.org (10.145.10.254) icmp_seq=1 Frag needed and DF set (mtu = 1400)
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400

Interestingly enough, the MTU between NUE2 and PRG2 is 1400.
But what is really noteworthy is that the ICMP messages for path MTU discovery finally seem to be working.
This means we should be fine to remove the <1500 MTU config for NUE2 hosts from salt.

For wireguard links we could set a tunnel MTU of 1320 (given the link MTU of 1400 between sites - see https://lists.zx2c4.com/pipermail/wireguard/2017-December/002201.html).
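
For the wireguard part, a minimal sketch of how that could be applied (the interface name wg0 is a placeholder; 1320 = 1400 link MTU minus 80 bytes of worst-case WireGuard/IPv6+UDP overhead):

ip link set dev wg0 mtu 1320
# or persistently via "MTU = 1320" in the [Interface] section of a wg-quick config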

Actions #19

Updated by dheidler 16 days ago

  • Status changed from In Progress to Feedback
Actions #20

Updated by okurz 13 days ago

Both merged. Please check for the effect.

Actions #21

Updated by dheidler 9 days ago

  • Status changed from Feedback to Resolved

Seems to work fine.

Also updated the VM config to use an MTU of 1500.
Path MTU discovery also works as expected in the VM.
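
An illustrative way to double-check this from inside a VM (commands are assumptions, not taken from the ticket):

ip link show dev eth0 | grep -o 'mtu [0-9]*'   # expect mtu 1500 after the config update
tracepath -4 download.opensuse.org             # should report the discovered pmtu along the path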

Actions #22

Updated by okurz 9 days ago

  • Status changed from Resolved to Workable

dheidler wrote in #note-13:

This is because the host has an MTU of only 1360 on the bridge device.
Not sure, why the VM can use 4 bytes more than the bridge device MTU, though.
In any case the bridge device should have an MTU of 1500.

Doesn't this need to be made persistent, e.g. in salt? And why don't we need to do changes on qamaster?

Actions #23

Updated by dheidler 6 days ago

  • Status changed from Workable to Resolved

qamaster has 1500:

dheidler@qamaster:~> ip li | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP mode DEFAULT group default qlen 1000
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
6: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
8: eth6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
10: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
12: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
13: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
14: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
15: vnet4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
16: vnet5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
17: vnet6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
18: vnet7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000

Doesn't this need to be made persistent, e.g. in salt? And why don't we need to do changes on qamaster?

See comments #18 and #19.

Actions #24

Updated by okurz 6 days ago

  • Due date deleted (2025-05-28)
