action #158419
closed
osiris-1.qe.nue2.suse.org not responsive over virt-manager and "virsh list" hangs
Added by okurz 9 months ago.
Updated 9 months ago.
Category: Regressions/Crashes
Description
Observation
Trying to find out the state of the machines on osiris, I connected to osiris-1.qe.nue2.suse.org over virt-manager. I also logged in over ssh and ran "virsh list", which seems to hang for a long time.
Acceptance criteria
- AC1: osiris-1.qe.nue2.suse.org is consistently responsive over virt-manager again
Suggestions
- Debug currently stuck processes
- Optionally reboot
- Check logs
- Ensure the machine is fully responsive again
- Crosscheck whether our monitoring should have caught the issue
- Assignee set to nicksinger
Looks like this is related to the failed systemd services alert we've just got:
martchus@osiris-1:~> sudo journalctl -u libvirtd.socket
…
-- Boot 21546e85f20c411cbd3798fb609533ce --
Mar 31 03:31:36 osiris-1 systemd[1]: Listening on Libvirt local socket.
Apr 02 13:34:06 osiris-1 systemd[1]: libvirtd.socket: Deactivated successfully.
Apr 02 13:34:06 osiris-1 systemd[1]: Closed Libvirt local socket.
Apr 02 13:34:06 osiris-1 systemd[1]: Stopping Libvirt local socket...
Apr 02 13:34:06 osiris-1 systemd[1]: Listening on Libvirt local socket.
Apr 02 14:00:03 osiris-1 systemd[1]: libvirtd.socket: Trigger limit hit, refusing further activation.
Apr 02 14:00:03 osiris-1 systemd[1]: libvirtd.socket: Failed with result 'trigger-limit-hit'.
Apr 02 14:28:22 osiris-1 systemd[1]: Listening on Libvirt local socket.
martchus@osiris-1:~> sudo systemctl status libvirtd.service
● libvirtd.service - Virtualization daemon
Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/libvirtd.service.d
└─override.conf
Active: active (running) since Tue 2024-04-02 14:28:56 CEST; 7min ago
TriggeredBy: ● libvirtd-ro.socket
● libvirtd.socket
Docs: man:libvirtd(8)
https://libvirt.org
Main PID: 24407 (libvirtd)
Tasks: 21 (limit: 32768)
CGroup: /system.slice/libvirtd.service
└─ 24407 /usr/sbin/libvirtd --timeout 120
Apr 02 14:28:56 osiris-1 systemd[1]: Started Virtualization daemon.
Apr 02 14:28:57 osiris-1 libvirtd[24407]: libvirt version: 9.0.0
Apr 02 14:28:57 osiris-1 libvirtd[24407]: hostname: osiris-1
Apr 02 14:28:57 osiris-1 libvirtd[24407]: ignoring dangling symlink '/var/lib/libvirt/images/dist.suse.de/SLE-12-SP5-UNTESTED'
Apr 02 14:28:57 osiris-1 libvirtd[24407]: ignoring dangling symlink '/var/lib/libvirt/images/dist.suse.de/mounts'
Apr 02 14:29:01 osiris-1 libvirtd[24407]: operation failed: domain 'vHMC' already exists with uuid 21c11de1-5280-4bed-a598-05ebcb9800f5
Apr 02 14:29:01 osiris-1 libvirtd[24407]: Failed to load config for domain 'vHMC'
Apr 02 14:29:01 osiris-1 libvirtd[24573]: 2024-04-02 12:29:01.474+0000: 24573: info : libvirt version: 9.0.0
Apr 02 14:29:01 osiris-1 libvirtd[24573]: 2024-04-02 12:29:01.474+0000: 24573: info : hostname: osiris-1
Apr 02 14:29:01 osiris-1 libvirtd[24573]: 2024-04-02 12:29:01.474+0000: 24573: warning : virSecurityValidateTimestamp:205 : Invalid XATTR timestamp detected on /var/lib/libvirt/images/mmoese.qcow2 secdriver=dac
Right now no units are in the failed state anymore, though.
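For future crosschecks, a minimal sketch of how journal output like the above could be scanned for units that systemd marked as failed. The function name `find_unit_failures` and the regular expression are assumptions made for illustration, not existing tooling on osiris-1. The sample lines are copied from the journal excerpt above.

```python
import re

# Matches systemd journal messages of the form
# "<unit>: Failed with result '<reason>'." (pattern is an assumption
# based on the log lines quoted in this ticket).
FAILED_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+): Failed with result '(?P<result>[^']+)'"
)


def find_unit_failures(lines):
    """Return a list of (unit, result) pairs for every failure line found."""
    failures = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failures.append((match.group("unit"), match.group("result")))
    return failures


# Sample lines taken from the journalctl output in this ticket.
sample = [
    "Apr 02 13:34:06 osiris-1 systemd[1]: Listening on Libvirt local socket.",
    "Apr 02 14:00:03 osiris-1 systemd[1]: libvirtd.socket: Trigger limit hit, refusing further activation.",
    "Apr 02 14:00:03 osiris-1 systemd[1]: libvirtd.socket: Failed with result 'trigger-limit-hit'.",
]
print(find_unit_failures(sample))
```

Only the "Failed with result" line is reported. The preceding "Trigger limit hit" message is informational and is skipped on purpose.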
- Status changed from New to In Progress
Yes, I had to restart drbd, and this caused a lot of other services to fail. After all these services had been restarted, we also see all required VMs again:
osiris-1:/var/lib/libvirt/images # virsh list
 Id   Name     State
------------------------
 1    mmoese   running
 2    okurz    running
- Due date set to 2024-04-17
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Resolved
No alert would have covered this, as the only indicators I had were some timeout messages in the libvirtd log. We could introduce specific alerts for this host if it happens more often. For now I consider the recovery work done and I won't add any new alerts.
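Should a host-specific alert be introduced later, a minimal sketch of a probe that distinguishes a hanging command from a failing one might look like this. The function name `probe`, the 10 second default and the three status strings are assumptions for illustration. Nothing like this is deployed as part of this ticket.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome as 'ok', 'failed' or 'hung'.

    'hung' means the command did not finish within timeout_s seconds;
    subprocess.run() kills the child process in that case.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        # The command could not be started at all, e.g. binary not found.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"
```

On the hypervisor this could be called as `probe(["virsh", "list"])`. A result of "hung" would correspond to the behaviour described in the observation, and a monitoring check could alert on anything other than "ok".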