action #88299

[virtualization] Worker openqaw5-xen-1.qa.suse.de is not reachable (xen-hvm/xen-pv failing)

Added by nanzhang 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-01-28
Due date:
% Done: 0%
Estimated time:
Tags:

Description

The following OSD job failed because the worker openqaw5-xen-1.qa.suse.de is not reachable and cannot be booted.
sle-15-SP3-Online-x86_64-Build133.1-default_install_svirt@svirt-hyperv2012r2-uefi (https://openqa.nue.suse.com/tests/5357316#)

Host: openqaw5-xen.qa.suse.de
Guest VM: openqaw5-xen-1.qa.suse.de


Related issues

Related to openQA Tests - action #88217: [qe-core] test fails in bootloader_svirt - libxenlight failed to create new domain: leftover qemu process (Resolved, 2021-01-26)

Related to openQA Tests - action #88373: [xen-post-upgrade][qac-infra][investigation] post configuration leftovers (Workable, 2021-02-01)

History

#1 Updated by okurz 2 months ago

  • Subject changed from Worker openqaw5-xen-1.qa.suse.de is not reachable to [virtualization] Worker openqaw5-xen-1.qa.suse.de is not reachable
  • Target version set to future

#2 Updated by szarate 2 months ago

  • Assignee set to mloviska

#3 Updated by szarate 2 months ago

  • Related to action #88217: [qe-core] test fails in bootloader_svirt - libxenlight failed to create new domain: leftover qemu process added

#4 Updated by tjyrinki_suse 2 months ago

  • Subject changed from [virtualization] Worker openqaw5-xen-1.qa.suse.de is not reachable to [virtualization] Worker openqaw5-xen-1.qa.suse.de is not reachable (xen-hvm/xen-pv failing)

#5 Updated by mloviska 2 months ago

  • Status changed from New to In Progress

openqaw5-xen.qa.suse.de has been successfully migrated to sle15sp2.

# cat /etc/os-release 
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"

Unfortunately, there still seems to be a problem with libvirtd:

openqaw5-xen:~ # systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-01-28 19:08:13 CET; 15h ago
     Docs: man:libvirtd(8)
           https://libvirt.org
 Main PID: 2715 (libvirtd)
    Tasks: 28 (limit: 32768)
   CGroup: /system.slice/libvirtd.service
           ├─2715 /usr/sbin/libvirtd --timeout 120
           └─6205 /usr/bin/qemu-system-x86_64 -xen-domid 2 -chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-2,server,nowait -no-shutdown -mon chardev=libxl-cmd,mode=control -chardev >

Jan 28 19:55:47 openqaw5-xen root[28322]: /etc/xen/scripts/vif-bridge: ip link set vif9.0 nomaster failed
Jan 28 19:55:47 openqaw5-xen root[28326]: /etc/xen/scripts/vif-bridge: ip link set vif9.0 down failed
Jan 28 19:55:47 openqaw5-xen root[28327]: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif9.0, bridge br0.
Jan 28 19:55:49 openqaw5-xen libvirtd[2715]: 2731: error : virDomainSnapshotNum:344 : this function is not supported by the connection driver: virDomainSnapshotNum
Jan 28 23:02:08 openqaw5-xen libvirtd[2715]: 2730: warning : libxlDomainObjBeginJob:146 : Cannot start job (modify) for domain openQA-SUT-1; current job is (modify) owned by (2732)
Jan 28 23:02:08 openqaw5-xen libvirtd[2715]: 2730: error : libxlDomainObjBeginJob:150 : Timed out during operation: cannot acquire state change lock
Jan 28 23:02:08 openqaw5-xen libvirtd[2715]: 2715: error : virNetSocketReadWire:1832 : End of file while reading data: Input/output error
Jan 28 23:02:09 openqaw5-xen libvirtd[2715]: 2729: warning : libxlDomainObjBeginJob:146 : Cannot start job (modify) for domain openQA-SUT-3; current job is (modify) owned by (2733)
Jan 28 23:02:09 openqaw5-xen libvirtd[2715]: 2729: error : libxlDomainObjBeginJob:150 : Timed out during operation: cannot acquire state change lock
Jan 28 23:02:09 openqaw5-xen libvirtd[2715]: 2715: error : virNetSocketReadWire:1817 : Cannot recv data: Connection reset by peer
openqaw5-xen:~ # systemctl restart libvirtd
openqaw5-xen:~ # systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-01-29 10:22:30 CET; 8s ago
     Docs: man:libvirtd(8)
           https://libvirt.org
 Main PID: 25246 (libvirtd)
    Tasks: 29 (limit: 32768)
   CGroup: /system.slice/libvirtd.service
           ├─ 6205 /usr/bin/qemu-system-x86_64 -xen-domid 2 -chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-2,server,nowait -no-shutdown -mon chardev=libxl-cmd,mode=control -chardev>
           └─25246 /usr/sbin/libvirtd --timeout 120

Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 9
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 10
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 11
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 12
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 13
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 14
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 15
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 17
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 18
Jan 29 10:22:30 openqaw5-xen libvirtd[25246]: 2021-01-29 09:22:30.630+0000: 25266: debug : virFileClose:110 : Closed fd 20

As of now, I am not sure what the root cause is; it is still under investigation. Nevertheless, it seems to affect mostly xen jobs; hyperv (the RDP-VNC wrapper VM seems to work) and vmware should not be affected.
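
To see which domains are hit by the "cannot acquire state change lock" timeout, the warnings in the journal excerpt above can be summarized with a short script. This is only a minimal sketch: the function name `find_stuck_domains` and the regular expression are illustrative assumptions, matched against the exact warning format pasted above, and are not part of libvirt or openQA.

```python
import re

# Matches libvirtd warnings of the form seen in the journal above, e.g.
#   ... Cannot start job (modify) for domain openQA-SUT-1;
#       current job is (modify) owned by (2732)
JOB_LOCK_RE = re.compile(
    r"Cannot start job \((?P<wanted>\w+)\) for domain (?P<domain>[^;]+); "
    r"current job is \((?P<current>\w+)\) owned by \((?P<owner>\d+)\)"
)


def find_stuck_domains(journal_text):
    """Return {domain: owner_id} for every job-lock warning in the text.

    find_stuck_domains is a hypothetical helper, not a libvirt API.
    """
    stuck = {}
    for line in journal_text.splitlines():
        match = JOB_LOCK_RE.search(line)
        if match:
            stuck[match.group("domain")] = int(match.group("owner"))
    return stuck


# Sample lines copied from the journal output in this comment.
SAMPLE_JOURNAL = """\
Jan 28 23:02:08 openqaw5-xen libvirtd[2715]: 2730: warning : libxlDomainObjBeginJob:146 : Cannot start job (modify) for domain openQA-SUT-1; current job is (modify) owned by (2732)
Jan 28 23:02:08 openqaw5-xen libvirtd[2715]: 2730: error : libxlDomainObjBeginJob:150 : Timed out during operation: cannot acquire state change lock
Jan 28 23:02:09 openqaw5-xen libvirtd[2715]: 2729: warning : libxlDomainObjBeginJob:146 : Cannot start job (modify) for domain openQA-SUT-3; current job is (modify) owned by (2733)
"""

stuck_domains = find_stuck_domains(SAMPLE_JOURNAL)
print(stuck_domains)
```

For the excerpt above this reports openQA-SUT-1 and openQA-SUT-3 together with the id shown in "owned by", which may help when deciding which domains to inspect after a libvirtd restart.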

#6 Updated by mloviska 2 months ago

I had to install and configure xen to use openvswitch instead of bridge-utils, as bridge-utils has become deprecated and is now part of the legacy module.

openqaw5-xen:~ # xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2268    32     r-----     872.1
Xenstore                                     1    31     1     -b----       0.9
openQA_hyperv_intermediary                   2  4088     2     -b----      85.6
openQA-SUT-1                                 4  1016     1     -b----     108.6
openQA-SUT-2                                 5  1016     1     -b----      60.5
openqaw5-xen:~ # virsh list
 Id   Name                         State
--------------------------------------------
 0    Domain-0                     running
 2    openQA_hyperv_intermediary   running
 4    openQA-SUT-1                 running
 5    openQA-SUT-2                 running

openqaw5-xen:~ # 

For the time being, I have surely started more xen-related services than I should have; however, it is not clear to me which ones are necessary. To be clarified later.
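
A quick way to cross-check the two listings above is to parse both outputs and compare the domain names. The following is a rough sketch; the helper names `parse_xl_list` and `parse_virsh_list` are hypothetical, and the parsing assumes the column layout shown in the pasted output (domain names without spaces).

```python
def parse_xl_list(text):
    """Map domain name to id from `xl list` output (hypothetical helper)."""
    domains = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows: Name ID Mem VCPUs State Time(s); the header has no numeric ID.
        if len(fields) >= 6 and fields[1].isdigit():
            domains[fields[0]] = int(fields[1])
    return domains


def parse_virsh_list(text):
    """Map domain name to id from `virsh list` output (hypothetical helper)."""
    domains = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows: Id Name State; skip the header and the dashed separator.
        if len(fields) >= 3 and fields[0].isdigit():
            domains[fields[1]] = int(fields[0])
    return domains


# Output copied from the transcript above.
XL_LIST = """\
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2268    32     r-----     872.1
Xenstore                                     1    31     1     -b----       0.9
openQA_hyperv_intermediary                   2  4088     2     -b----      85.6
openQA-SUT-1                                 4  1016     1     -b----     108.6
openQA-SUT-2                                 5  1016     1     -b----      60.5
"""

VIRSH_LIST = """\
 Id   Name                         State
--------------------------------------------
 0    Domain-0                     running
 2    openQA_hyperv_intermediary   running
 4    openQA-SUT-1                 running
 5    openQA-SUT-2                 running
"""

xl_domains = parse_xl_list(XL_LIST)
virsh_domains = parse_virsh_list(VIRSH_LIST)
only_in_xl = sorted(set(xl_domains) - set(virsh_domains))
print(only_in_xl)
```

In the output pasted above, the only domain known to xl but not listed by virsh is Xenstore; all openQA domains appear in both listings with matching ids.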

#7 Updated by nanzhang 2 months ago

Thanks mloviska. I've re-run the job, and the issue is gone.
https://openqa.nue.suse.com/tests/5386318

#8 Updated by mloviska 2 months ago

  • Related to action #88373: [xen-post-upgrade][qac-infra][investigation] post configuration leftovers added

#9 Updated by mloviska 2 months ago

https://openqa.suse.de/tests/5399205/file/serial0.txt

After triggering a kernel crash, it seems that the bridge settings in xen aren't restored.
Locally I can see the error message: error: Disconnected from xen:///system due to end of file

Checking the libxl logs and ip settings, I can see the following:

2021-02-03 14:36:44.208+0000: libxl: libxl_event.c:676:libxl__ev_xswatch_deregister: watch w=0x7f61e0014e20 wpath=/local/domain/0/backend/vif/623/0/state token=2/6: deregister slotnum=2
2021-02-03 14:36:44.208+0000: libxl: libxl_device.c:1086:device_backend_callback: Domain 623:calling device_backend_cleanup
2021-02-03 14:36:44.208+0000: libxl: libxl_event.c:689:libxl__ev_xswatch_deregister: watch w=0x7f61e0014e20: deregister unregistered
2021-02-03 14:36:44.210+0000: libxl: libxl_device.c:1187:device_hotplug: Domain 623:calling hotplug script: /etc/xen/scripts/vif-bridge online
2021-02-03 14:36:44.210+0000: libxl: libxl_device.c:1188:device_hotplug: Domain 623:extra args:
2021-02-03 14:36:44.210+0000: libxl: libxl_device.c:1194:device_hotplug: Domain 623:    type_if=vif
2021-02-03 14:36:44.210+0000: libxl: libxl_device.c:1196:device_hotplug: Domain 623:env:
2021-02-03 14:36:44.210+0000: libxl: libxl_device.c:1203:device_hotplug: Domain 623:    script: /etc/xen/scripts/vif-bridge
2021-02-03 14:36:44.210+0000: libxl: libxl_device.c:1203:device_hotplug: Domain 623:    XENBUS_TYPE: vif
2021-02-03 14:36:44.211+0000: libxl: libxl_device.c:1203:device_hotplug: Domain 623:    XENBUS_PATH: backend/vif/623/0
2021-02-03 14:36:44.211+0000: libxl: libxl_device.c:1203:device_hotplug: Domain 623:    XENBUS_BASE_PATH: backend
2021-02-03 14:36:44.211+0000: libxl: libxl_device.c:1203:device_hotplug: Domain 623:    netdev:
2021-02-03 14:36:44.211+0000: libxl: libxl_device.c:1203:device_hotplug: Domain 623:    vif: vif623.0
2021-02-03 14:36:44.211+0000: libxl: libxl_internal.c:75:libxl__suse_domain_get_hotplug_timeout: Domain 623:Got from '' = 0 from /libxl/623/suse/nics-LIBXL_HOTPLUG_TIMEOUT for /local/domain/0/backend/vif/623/0: No such file or directory
2021-02-03 14:36:44.211+0000: libxl: libxl_aoutils.c:599:libxl__async_exec_start: forking to execute: /etc/xen/scripts/vif-bridge online for /local/domain/0/backend/vif/623/0
2021-02-03 14:36:44.491+0000: libxl: libxl_event.c:689:libxl__ev_xswatch_deregister: watch w=0x7f61e0014f30: deregister unregistered
2021-02-03 14:36:44.492+0000: libxl: libxl_device.c:1172:device_hotplug: Domain 623:No hotplug script to execute
2021-02-03 14:36:44.492+0000: libxl: libxl_event.c:689:libxl__ev_xswatch_deregister: watch w=0x7f61e0014f30: deregister unregistered
2021-02-03 14:36:44.492+0000: libxl: libxl_event.c:2228:libxl__ao_progress_report: ao 0x7f61e000e2c0: progress report: callback queued aop=0x7f61e005a1d0
2021-02-03 14:36:44.494+0000: libxl: libxl_event.c:1897:libxl__ao_complete: ao 0x7f61e000e2c0: complete, rc=0
2021-02-03 14:36:44.494+0000: libxl: libxl_event.c:1432:egc_run_callbacks: ao 0x7f61e000e2c0: progress report: callback aop=0x7f61e005a1d0
2021-02-03 14:36:44.494+0000: libxl: libxl_event.c:1866:libxl__ao__destroy: ao 0x7f61e000e2c0: destroy
2021-02-03 14:36:44.501+0000: libxl: libxl_event.c:689:libxl__ev_xswatch_deregister: watch w=0x7f61e000ebc8: deregister unregistered
2021-02-03 14:36:44.501+0000: xc: SUSEINFO: domid 623: xc_domain_unpause returned 0
2021-02-03 14:36:44.501+0000: libxl: libxl_event.c:1897:libxl__ao_complete: ao 0x7f61e000ef90: complete, rc=0
2021-02-03 14:36:44.501+0000: libxl: libxl_event.c:1866:libxl__ao__destroy: ao 0x7f61e000ef90: destroy
2021-02-03T15:38:33.050859+01:00 openqaw5-xen libvirtd[32011]: 2021-02-03 14:38:33.029+0000: 32061: debug : virFileClose:110 : Closed fd 45
2021-02-03T15:38:33.051368+01:00 openqaw5-xen libvirtd[32011]: 2021-02-03 14:38:33.029+0000: 32061: debug : virFileClose:110 : Closed fd 47
2021-02-03T15:39:23.752533+01:00 openqaw5-xen kernel: [438217.127392] vif vif-623-0 vif623.0: Guest Rx stalled
2021-02-03T15:39:23.752568+01:00 openqaw5-xen kernel: [438217.127757] br0: port 5(vif623.0) entered disabled state
624: vif623.0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master br0 state DOWN group default qlen 32
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
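
The state of the vif shown above (NO-CARRIER, state DOWN, still enslaved to br0) can be detected automatically from the `ip` output. This is a minimal sketch under the assumption that the first line of each interface entry has the format pasted above; `parse_link` and `is_stalled_vif` are hypothetical names introduced for illustration.

```python
import re

# First line of an interface entry, e.g.
#   624: vif623.0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 ... state DOWN ...
LINK_RE = re.compile(
    r"^\d+:\s+(?P<name>[^:@\s]+)(?:@\S+)?:\s+<(?P<flags>[^>]*)>"
    r".*?\bstate\s+(?P<state>\S+)"
)


def parse_link(line):
    """Parse one interface header line; return None if it does not match."""
    match = LINK_RE.match(line.strip())
    if match is None:
        return None
    master = re.search(r"\bmaster\s+(\S+)", line)
    return {
        "name": match.group("name"),
        "flags": match.group("flags").split(","),
        "state": match.group("state"),
        "master": master.group(1) if master else None,
    }


def is_stalled_vif(link):
    """True for a Xen vif that is administratively up but has no carrier."""
    return (
        link["name"].startswith("vif")
        and "NO-CARRIER" in link["flags"]
        and link["state"] == "DOWN"
    )


# Line copied from the ip output above.
SAMPLE = (
    "624: vif623.0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq "
    "master br0 state DOWN group default qlen 32"
)

link = parse_link(SAMPLE)
print(link["name"], link["master"], is_stalled_vif(link))
```

Continuation lines such as `link/ether ...` do not match the pattern and yield `None`, so the helper can be applied line by line to the full output.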

#10 Updated by nanzhang about 2 months ago

The job failed on build 150.1 (Snapshot10); it looks like the worker is not reachable again.
sle-15-SP3-Online-x86_64-Build150.1-default_install_svirt@svirt-hyperv2012r2-uefi (https://openqa.nue.suse.com/tests/5476478)
Error connecting to VNC server openqaw5-xen-1.qa.suse.de:5905: IO::Socket::INET: connect: Connection timed out

The host can't be reached either.
# ssh root@openqaw5-xen.qa.suse.de
ssh: connect to host openqaw5-xen.qa.suse.de port 22: Connection timed out
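
Both failures above are plain TCP connection timeouts (VNC on port 5905 of the guest, SSH on port 22 of the host), so they can be checked without going through openQA. The sketch below uses only the Python standard library; the function names `is_port_open` and `report` and the 5-second timeout are assumptions chosen for illustration.

```python
import socket

# Host/port pairs taken from the error messages above.
CHECKS = [
    ("openqaw5-xen.qa.suse.de", 22),      # SSH on the xen host
    ("openqaw5-xen-1.qa.suse.de", 5905),  # VNC on the guest VM
]


def is_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


def report(checks=CHECKS):
    """Print one line per endpoint; call this manually from the QA network."""
    for host, port in checks:
        status = "reachable" if is_port_open(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")
```

Calling `report()` from a machine inside the same network prints one status line per endpoint. A timeout on both ports at once would suggest that the host itself is down, rather than a single service.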

#11 Updated by xlai about 2 months ago

mloviska wrote:

https://openqa.suse.de/tests/5399205/file/serial0.txt

After triggering a kernel crash, it seems that the bridge settings in xen aren't restored.
Locally I can see the error message: error: Disconnected from xen:///system due to end of file

mloviska nanzhang
This seems to have the same root cause as https://bugzilla.suse.com/show_bug.cgi?id=1181989 (openQA job causes libvirtd to dump core when running kdump inside domain), which is P1 now, with a fix in progress.

#12 Updated by mloviska about 2 months ago

If we want to reboot a xen domain, we have to remove /etc/udev/rules.d/70-persistent-net.rules. Frankly, it is quite surprising to me that this file appears on the xen domU after the xen host upgrade.
Could it possibly be a side effect of replacing the linux bridge with openvswitch?

Also, the network configuration at the libvirt level has to contain a reference stating that the domU uses openvswitch. I will push the code change tomorrow morning (it has to be done in bootloader_svirt).

    <interface type='bridge'>
      <mac address='00:16:3e:09:6f:df'/>
      <source bridge='br0'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='bf0a6496-a421-41f1-926e-a593f96ce1bb'/>
      </virtualport>
      <target dev='vif19.0'/>
      <model type='netfront'/>
    </interface>
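
For illustration, an interface element like the one above can be generated programmatically. The sketch below is written in Python with the standard library only; it is not the actual bootloader_svirt change, and `build_ovs_interface` is a hypothetical helper. It deliberately leaves out the `interfaceid` parameter and the `target` device, on the assumption that libvirt fills these in when the domain is defined, as the values in the XML above appear to be generated.

```python
import xml.etree.ElementTree as ET


def build_ovs_interface(bridge, mac, model="netfront"):
    """Build a libvirt <interface> element attached to an Open vSwitch bridge.

    Hypothetical helper for illustration; element and attribute names are
    taken from the XML fragment above.
    """
    iface = ET.Element("interface", {"type": "bridge"})
    ET.SubElement(iface, "mac", {"address": mac})
    ET.SubElement(iface, "source", {"bridge": bridge})
    # This element marks the bridge as an openvswitch bridge.
    ET.SubElement(iface, "virtualport", {"type": "openvswitch"})
    ET.SubElement(iface, "model", {"type": model})
    return iface


# Values copied from the XML fragment above.
iface = build_ovs_interface("br0", "00:16:3e:09:6f:df")
if hasattr(ET, "indent"):  # ET.indent is available from Python 3.9 on
    ET.indent(iface)
iface_xml = ET.tostring(iface, encoding="unicode")
print(iface_xml)
```

The important difference from a plain linux bridge setup is the `virtualport` element with `type='openvswitch'`; the remaining elements are the same as for a regular bridged interface.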

#13 Updated by mloviska about 1 month ago

  • Tags set to qac
  • Status changed from In Progress to Resolved

Except for https://bugzilla.suse.com/show_bug.cgi?id=1181989 there should be no more leftovers. Feel free to reopen if anything shows up. Thanks for your patience!
