action #181862

Updated by robert.richardson 6 days ago

# Observation 

 I have noticed strange connection issues when trying to set up a VM on osiris-1.qe.nue2.suse.org (NUE-2 machine). The initial setup of a fresh Leap image, including network connectivity, seems to work fine (I can successfully ping and curl download.opensuse.org from within the VM). However, as soon as I run any zypper command, the connection gets stuck without any output indicating what happened. 

 @okurz suspected an MTU-related problem and suggested running ping commands with a larger packet size and fragmentation disabled. This suspicion seems to be correct, as those ping commands also get stuck. 

 It seems only VMs on osiris-1 are affected. 

 Similar network-related MTU issues have been reported in previous SD tickets: 

 * [SD-182364](https://sd.suse.com/servicedesk/customer/portal/1/SD-182364) 
 * [SD-142688](https://sd.suse.com/servicedesk/customer/portal/1/SD-142688) 

 ## Steps to reproduce 

 1. Create a fresh Leap 15.6 or Tumbleweed VM on `osiris-1`. 
 2. Confirm that basic network connectivity works (`ping`, `curl` to `download.opensuse.org`). 
 3. Run `zypper ref` and observe that it hangs without completing. 
 4. Alternatively, run: 

    ```bash 
    ping -Mdo -s1442 download.opensuse.org 
    ``` 

    and observe that it also hangs. 
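
The size given to `-s` is the ICMP payload only. On IPv4 the packet on the wire is 28 bytes larger (20 bytes IPv4 header without options plus 8 bytes ICMP header), so `-s1442` corresponds to a 1470-byte packet. To narrow down which packet sizes still pass, a sweep along the following lines might help. This is a minimal sketch: the helper names `payload_for_mtu` and `probe_mtu` are hypothetical, and it assumes IPv4 and the iputils `ping`.

```bash
# Sketch only: helper names are hypothetical; IPv4 and iputils ping are assumed.

# ICMP payload size for a given IPv4 packet size:
# 20 bytes IPv4 header (no options) + 8 bytes ICMP header = 28 bytes overhead.
payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send one non-fragmentable ping per candidate packet size and report the result.
# -c 1 and -W 2 bound each attempt so a silently dropped packet cannot block the loop.
probe_mtu() {
    local host=$1; shift
    local mtu
    for mtu in "$@"; do
        if ping -M do -c 1 -W 2 -s "$(payload_for_mtu "$mtu")" "$host" >/dev/null 2>&1; then
            echo "$mtu ok"
        else
            echo "$mtu FAILED"
        fi
    done
}
```

For example, `probe_mtu download.opensuse.org 1360 1400 1470 1500` prints one line per candidate size.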

 ## Suggestions 
 * *DONE* Is only this one VM affected? ~~Try another VM on osiris~~ -> *DONE* ok1 also affected 
 * *DONE* Is the host itself affected? ~~Try on osiris-1 directly~~ -> *DONE* not affected 
 * *DONE* Are only VMs on osiris-1 affected? ~~Try another VM on another host~~ (e.g. ada.qe.suse.de) 
   -> Make sure other non-salt-controlled machines are not affected 
 * Consider introducing a network diagnostic hook or health check script for VM post-boot validation. 
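
One possible shape for such a post-boot check is sketched below. It is an illustration only: the function names, the `PING` override, and the default host and payload size (taken from the reproducer in this ticket) are assumptions, not an existing hook.

```bash
# Sketch only: names and defaults are assumptions for illustration.
# PING can be overridden, e.g. to substitute a stub when testing the script.
PING=${PING:-ping}

# Return 0 if a single non-fragmentable ICMP echo of the given payload size is answered.
check_df_ping() {
    local host=${1:-download.opensuse.org}
    local payload=${2:-1442}
    "$PING" -M do -c 1 -W 2 -s "$payload" "$host" >/dev/null 2>&1
}

# Print a one-line verdict suitable for a boot log and return the check's status.
vm_network_health() {
    if check_df_ping "$@"; then
        echo "network-health: OK"
    else
        echo "network-health: FAILED (large non-fragmentable packets are dropped)"
        return 1
    fi
}
```

Calling `vm_network_health` without arguments uses the defaults. A non-zero exit status could be used to flag the VM before anyone runs into a hanging `zypper`.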

 <details> 
 <summary><b>CLICK HERE</b> To see the entire list of hypotheses / experiments and observations</summary> 

 * **REJECTED** *H1* All NUE2 QE machines have problems with MTU sizes 
   * *E1-1* Select any machine other than the original one, call `zypper ref` and observe if this times out 
     * *O1-1-1* osiris itself has no problem 
 * **REJECTED** *H1.1* All NUE2 QE non-salt controlled machines have problems with MTU sizes 
   * -> see *O1.2-1* 
 * **ACCEPTED** *H1.2* Only VMs on osiris have problems 
   * *E1.2-1* Try a VM elsewhere e.g. qamaster 
     * *O1.2-1* VM on ada.qe.suse.de **not** affected 
   * *E1.2-2* Try another VM on osiris 
     * *O1.2-2* ok1 also affected 
 * **REJECTED** *H1.3* Only rrichardson VM has problems 
   * *E1.3-1* See *O1.2-2* 
 * **ACCEPTED** *H2* The problem of `zypper ref` can be more easily reproduced with `ping -Mdo -s1442 download.opensuse.org` 
   * *E2-1* Try the ping and if it fails then we can assume this is a valid reproducer until that is fixed. Then verify with `zypper ref` again 
     * *O2-1-1* confirmed reproducing an error so assumed to be a valid reproducer 
 * **ACCEPTED** *H3* The MTU size problem only appeared recently 
   * *E3-1* Check logs 
     * *O3-1-1* On ok1.qe.nue2.suse.org, which also runs on osiris-1, okurz found that the automatic os-update stopped after 2025-04-12, with timeouts showing in /var/log/zypper.log since then. So "last good" is 2025-04-12 
 * **REJECTED** *H4* The problem started with recent Tumbleweed 20250410 which is the last upgraded version on ok1 
   * *E4-1* Try to recreate the problem on a different version 
     * *O4-1* Leap 15.6 was also shown to be affected 
 </details> 

 ## Workaround 

 Manually set the MTU size within the affected VM to a lower value, like 1360: 

 ```bash 
 ip link set dev eth0 mtu 1360 
 ``` 

 This allows `zypper` and other network operations to proceed without hanging.
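
To confirm that the lower MTU is actually in effect, the value can be read back from sysfs. The following is a minimal sketch: the helper names and the `SYSFS_NET` override are hypothetical, and `eth0` is assumed as the interface name, as in the command above.

```bash
# Sketch only: helper names and the SYSFS_NET override are assumptions for illustration.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Print the MTU currently configured on an interface (default: eth0).
current_mtu() {
    cat "$SYSFS_NET/${1:-eth0}/mtu"
}

# Succeed only if the interface MTU is at or below the given limit.
mtu_at_most() {
    local limit=$1 iface=${2:-eth0}
    [ "$(current_mtu "$iface")" -le "$limit" ]
}
```

For example, `mtu_at_most 1360 && zypper ref` would only refresh once the workaround is applied.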
