action #181862
Updated by robert.richardson 6 days ago
# Observation

I have noticed strange connection issues when trying to set up a VM on osiris-1.qe.nue2.suse.org (NUE-2 machine). The initial setup of a fresh Leap image, including network connectivity, seems to work fine (I can successfully ping and curl download.opensuse.org from within the VM), but as soon as I run any zypper command the connection gets stuck without any output indicating what happened. @okurz suspected an MTU problem and suggested running ping with a larger packet size and fragmentation prohibited. That suspicion seems to be correct, as those ping commands also get stuck. It seems only VMs on osiris-1 are affected.

Similar network-related MTU issues have been reported in previous SD tickets:

* [SD-182364](https://sd.suse.com/servicedesk/customer/portal/1/SD-182364)
* [SD-142688](https://sd.suse.com/servicedesk/customer/portal/1/SD-142688)

## Steps to reproduce

1. Create a fresh Leap 15.6 or Tumbleweed VM on `osiris-1`.
2. Confirm that basic network connectivity works (`ping`, `curl` to `download.opensuse.org`).
3. Run `zypper ref` and observe that it hangs without completing.
4. Alternatively, run the following and observe that it also hangs:

   ```bash
   ping -Mdo -s1442 download.opensuse.org
   ```

## Suggestions

* *DONE* Is only this one VM affected? ~~Try another VM on osiris~~ -> *DONE* ok1 also affected
* *DONE* Is the host itself affected? ~~Try on osiris-1 directly~~ -> *DONE* not affected
* *DONE* Are only VMs on osiris-1 affected? ~~Try another VM on another host~~ (e.g. ada.qe.suse.de) -> Try to make sure other non-salt controlled machines are not affected
* Consider introducing a network diagnostic hook or health check script for VM post-boot validation.
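A health check along the lines of the last suggestion could look roughly like the sketch below. It is only an illustration, not an existing script: the function names `probe_payload` and `mtu_health` and the default payload sizes are assumptions, and it relies on iputils `ping` supporting `-M do`. It distinguishes "no connectivity at all" from "small packets pass but large non-fragmentable packets do not", which is the symptom described above.

```bash
#!/usr/bin/env bash
# Sketch of a post-boot path-MTU health check (hypothetical helper names).
# Assumes iputils ping, which supports "-M do" (prohibit fragmentation).

# Return 0 if a single non-fragmentable ICMP echo with the given
# payload size gets a reply within 2 seconds.
probe_payload() {
    local host=$1 payload=$2
    ping -M do -c 1 -W 2 -s "$payload" "$host" >/dev/null 2>&1
}

# Print "ok", "mtu-problem" or "no-connectivity" for the given host.
# The default payloads (56 and 1442 bytes) are assumptions; 1442 is the
# size used in the reproducer of this ticket.
mtu_health() {
    local host=$1 small=${2:-56} large=${3:-1442}
    if ! probe_payload "$host" "$small"; then
        echo "no-connectivity"
        return 2
    fi
    if ! probe_payload "$host" "$large"; then
        echo "mtu-problem"
        return 1
    fi
    echo "ok"
}
```

Calling `mtu_health download.opensuse.org` from a post-boot hook would then print one of the three states and return a non-zero exit status for anything other than `ok`.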
<details>
<summary><b>CLICK HERE</b> to see the entire list of hypotheses, experiments and observations</summary>

* **REJECTED** *H1* All NUE2 QE machines have problems with MTU sizes
  * *E1-1* Select any machine other than the original one, call `zypper ref` and observe whether it times out
    * *O1-1-1* osiris itself has no problem
  * **REJECTED** *H1.1* All NUE2 QE non-salt controlled machines have problems with MTU sizes
    * -> see *O1.2-1*
  * **ACCEPTED** *H1.2* Only VMs on osiris have problems
    * *E1.2-1* Try a VM elsewhere, e.g. qamaster
      * *O1.2-1* VM on ada.qe.suse.de **not** affected
    * *E1.2-2* Try another VM on osiris
      * *O1.2-2* ok1 also affected
  * **REJECTED** *H1.3* Only rrichardson VM has problems
    * *E1.3-1* See *O1.2-2*
* **ACCEPTED** *H2* The problem of `zypper ref` can be more easily reproduced with `ping -Mdo -s1442 download.opensuse.org`
  * *E2-1* Try the ping; if it fails, we can assume this is a valid reproducer until that is fixed. Then verify with `zypper ref` again
    * *O2-1-1* Confirmed to reproduce an error, so assumed to be a valid reproducer
* **ACCEPTED** *H3* The MTU size problem only appeared recently
  * *E3-1* Check logs
    * *O3-1-1* On ok1.qe.nue2.suse.org, also running on osiris-1, okurz found that the automatic os-update stopped after 2025-04-12, with timeouts in /var/log/zypper.log since 2025-04-12. So "last good" is 2025-04-12
* **REJECTED** *H4* The problem started with recent Tumbleweed 20250410, which is the last upgraded version on ok1
  * *E4-1* Try to recreate the problem on a different version
    * *O4-1* Leap 15.6 was also shown to be affected

</details>

## Workaround

Manually set the MTU within the affected VM to a lower value, such as 1360:

```bash
ip link set dev eth0 mtu 1360
```

This allows `zypper` and other network operations to proceed without hanging.
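To pick a workaround value less arbitrarily, the largest working payload can be searched for with the same kind of ping as in the reproducer. The following is a minimal sketch, not something that was run for this ticket: the function names are hypothetical, it assumes iputils `ping` and IPv4 (hence `-4`), and it assumes that every payload at or below the path limit passes while every larger one fails. For IPv4 the packet size is the ICMP payload plus 28 bytes (20 bytes IPv4 header without options, 8 bytes ICMP header), so the reproducer's `-s1442` corresponds to a 1470-byte packet and an MTU of 1360 allows payloads up to 1332 bytes.

```bash
#!/usr/bin/env bash
# Sketch: binary search for the largest ICMP payload that passes with
# fragmentation prohibited (hypothetical helper names, IPv4 only).

# Return 0 if one non-fragmentable IPv4 echo of the given payload is answered.
pmtu_probe() {
    local host=$1 payload=$2
    ping -4 -M do -c 1 -W 2 -s "$payload" "$host" >/dev/null 2>&1
}

# Convert an ICMP payload size to the IPv4 packet size:
# 20 bytes IPv4 header (no options) + 8 bytes ICMP header.
payload_to_mtu() {
    echo $(( $1 + 28 ))
}

# Print the largest payload in [lo, hi] that still gets a reply.
# Fails if even the lower bound does not pass.
find_max_payload() {
    local host=$1 lo=${2:-0} hi=${3:-1472} mid
    pmtu_probe "$host" "$lo" || return 1
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always progresses
        if pmtu_probe "$host" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}
```

For example, `payload_to_mtu "$(find_max_payload download.opensuse.org)"` would print the largest packet size that currently passes, which could then be used in the `ip link set` command instead of a guessed value. Each failing probe waits for the 2-second timeout, so a full search takes up to roughly half a minute.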