action #158419

closed

osiris-1.qe.nue2.suse.org not responsive over virt-manager and "virsh list" hangs

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-04-02
Due date: 2024-04-17
% Done: 0%
Estimated time:

Description

Observation

Trying to find out the state of the machines on osiris, I connected to osiris-1.qe.nue2.suse.org over virt-manager and also tried ssh followed by "virsh list", which seems to hang for a long time.

Acceptance criteria

  • AC1: osiris-1.qe.nue2.suse.org is consistently responsive over virt-manager again

Suggestions

  • Debug currently stuck processes
  • Optionally reboot
  • Check logs
  • Ensure the machine is fully responsive again
  • Crosscheck if our monitoring should have caught the issue
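
For the "debug currently stuck processes" step, one possible starting point is to look for processes in uninterruptible sleep (state "D"), which often points at blocked I/O. This is only a sketch: the helper name filter_d_state is made up for illustration, and nothing in this ticket confirms that D-state processes were involved here.

```shell
#!/bin/bash
# Hypothetical helper: keep the header line of `ps` output plus every
# process whose STAT column (2nd field) starts with "D".
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

Possible usage on the affected host: `ps -eo pid,stat,wchan:32,args | filter_d_state`. The wchan column can hint at what such a process is waiting on.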
Actions #1

Updated by nicksinger 3 months ago

  • Assignee set to nicksinger
Actions #2

Updated by mkittler 3 months ago

Looks like this is related to the failed systemd services alert we've just got:

martchus@osiris-1:~> sudo journalctl -u libvirtd.socket
…
-- Boot 21546e85f20c411cbd3798fb609533ce --
Mar 31 03:31:36 osiris-1 systemd[1]: Listening on Libvirt local socket.
Apr 02 13:34:06 osiris-1 systemd[1]: libvirtd.socket: Deactivated successfully.
Apr 02 13:34:06 osiris-1 systemd[1]: Closed Libvirt local socket.
Apr 02 13:34:06 osiris-1 systemd[1]: Stopping Libvirt local socket...
Apr 02 13:34:06 osiris-1 systemd[1]: Listening on Libvirt local socket.
Apr 02 14:00:03 osiris-1 systemd[1]: libvirtd.socket: Trigger limit hit, refusing further activation.
Apr 02 14:00:03 osiris-1 systemd[1]: libvirtd.socket: Failed with result 'trigger-limit-hit'.
Apr 02 14:28:22 osiris-1 systemd[1]: Listening on Libvirt local socket.
martchus@osiris-1:~> sudo systemctl status libvirtd.service
● libvirtd.service - Virtualization daemon
     Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/libvirtd.service.d
             └─override.conf
     Active: active (running) since Tue 2024-04-02 14:28:56 CEST; 7min ago
TriggeredBy: ● libvirtd-ro.socket
             ● libvirtd.socket
       Docs: man:libvirtd(8)
             https://libvirt.org
   Main PID: 24407 (libvirtd)
      Tasks: 21 (limit: 32768)
     CGroup: /system.slice/libvirtd.service
             └─ 24407 /usr/sbin/libvirtd --timeout 120

Apr 02 14:28:56 osiris-1 systemd[1]: Started Virtualization daemon.
Apr 02 14:28:57 osiris-1 libvirtd[24407]: libvirt version: 9.0.0
Apr 02 14:28:57 osiris-1 libvirtd[24407]: hostname: osiris-1
Apr 02 14:28:57 osiris-1 libvirtd[24407]: ignoring dangling symlink '/var/lib/libvirt/images/dist.suse.de/SLE-12-SP5-UNTESTED'
Apr 02 14:28:57 osiris-1 libvirtd[24407]: ignoring dangling symlink '/var/lib/libvirt/images/dist.suse.de/mounts'
Apr 02 14:29:01 osiris-1 libvirtd[24407]: operation failed: domain 'vHMC' already exists with uuid 21c11de1-5280-4bed-a598-05ebcb9800f5
Apr 02 14:29:01 osiris-1 libvirtd[24407]: Failed to load config for domain 'vHMC'
Apr 02 14:29:01 osiris-1 libvirtd[24573]: 2024-04-02 12:29:01.474+0000: 24573: info : libvirt version: 9.0.0
Apr 02 14:29:01 osiris-1 libvirtd[24573]: 2024-04-02 12:29:01.474+0000: 24573: info : hostname: osiris-1
Apr 02 14:29:01 osiris-1 libvirtd[24573]: 2024-04-02 12:29:01.474+0000: 24573: warning : virSecurityValidateTimestamp:205 : Invalid XATTR timestamp detected on /var/lib/libvirt/images/mmoese.qcow2 secdriver=dac

Right now no units are in the failed state anymore, though.
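
Since the socket recovered on its own, the journal is the only place where the 'trigger-limit-hit' failure is still visible. A rough way to list affected units after the fact could look like the sketch below. The function name find_trigger_limit_hits is an assumption for illustration, and the pattern is based only on the log lines quoted above.

```shell
#!/bin/bash
# Hypothetical helper: read journal lines on stdin and print each unit
# that systemd reported as failed with result 'trigger-limit-hit'.
find_trigger_limit_hits() {
    grep -F "Failed with result 'trigger-limit-hit'" \
        | sed -E 's/^.*systemd\[[0-9]+\]: ([^:]+): Failed with result.*$/\1/' \
        | sort -u
}
```

Possible usage: `sudo journalctl -b --no-pager | find_trigger_limit_hits`, which for the excerpt above should print libvirtd.socket.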

Actions #3

Updated by nicksinger 3 months ago

  • Status changed from New to In Progress

Yes, I had to restart drbd, which caused a lot of other services to fail. After all these services had been restarted, we see all required VMs again:

osiris-1:/var/lib/libvirt/images # virsh list
 Id   Name     State
------------------------
 1    mmoese   running
 2    okurz    running
Actions #4

Updated by openqa_review 3 months ago

  • Due date set to 2024-04-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Resolved

No alert would have covered this, as the only indicator I had were some timeout messages in the libvirtd log. We could introduce specific alerts for this host if it happens more often. For now I consider the recovery work done and I won't add any new alerts.
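
Should a host-specific check become necessary later, one option would be a probe that treats a hanging "virsh list" as a failure. The following is a minimal sketch, not an existing alert. The function name check_responsive and the CHECK_TIMEOUT variable are hypothetical, and the 10 second default is an arbitrary assumption.

```shell
#!/bin/bash
# Hypothetical probe: run a command under coreutils `timeout` and
# distinguish "hangs" from "fails" from "works".
check_responsive() {
    local limit="${CHECK_TIMEOUT:-10}"   # seconds, assumed default
    local rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then           # 124: timeout had to stop the command
        echo "HANG: '$*' did not finish within ${limit}s"
        return 2
    elif [ "$rc" -ne 0 ]; then
        echo "FAIL: '$*' exited with status $rc"
        return 1
    fi
    echo "OK: '$*'"
}
```

Possible usage: `check_responsive virsh list --all`. A return code of 2 would correspond to the hang described in this ticket.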
