osd infrastructure: Many failed systemd services on various machines
Hi guys, the alert at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&editPanel=6&tab=alert has been disabled for some weeks because we had bigger problems, which we already handled in various tickets, e.g. the broken worker issues regarding the network. However, it currently shows 14 (!) failed systemd services on our hosts. I think the original ticket is still blocked, but by a new issue. I will create a new urgent issue to handle the plethora of failed services.
- AC1: Significantly reduced number of failed systemd services
- AC2: alert is again enabled
#2 Updated by nicksinger 9 months ago
starting out with qa-power8-5-kvm we have the most services failing:
● kdump-early.service loaded failed failed Load kdump kernel early on startup
● kdump.service loaded failed failed Load kdump kernel and initrd
● logrotate.service loaded failed failed Rotate log files
● rebootmgr.service loaded failed failed Reboot Manager
● snapper-cleanup.service loaded failed failed Daily Cleanup of Snapper Snapshots
● snapper-timeline.service loaded failed failed Timeline of Snapper Snapshots
● iscsid.socket loaded failed failed Open-iSCSI iscsid Socket
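A listing like the one above can be reduced to plain unit names for counting or for comparing hosts. The following is only a minimal sketch: the function name `failed_units` is made up for illustration, and it assumes the column layout shown in the listing (unit name, then "loaded failed failed").

```python
def failed_units(listing):
    """Return unit names whose LOAD/ACTIVE/SUB columns read 'loaded failed failed'."""
    units = []
    for line in listing.splitlines():
        # Drop the leading bullet that systemctl prints in front of failed units.
        fields = line.replace("●", " ").split()
        # Assumed columns: UNIT LOAD ACTIVE SUB DESCRIPTION...
        if len(fields) >= 4 and fields[1:4] == ["loaded", "failed", "failed"]:
            units.append(fields[0])
    return units


# Three lines copied from the listing above.
sample = """\
● kdump-early.service loaded failed failed Load kdump kernel early on startup
● kdump.service loaded failed failed Load kdump kernel and initrd
● iscsid.socket loaded failed failed Open-iSCSI iscsid Socket
"""
print(failed_units(sample))
# → ['kdump-early.service', 'kdump.service', 'iscsid.socket']
```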
- iscsid complained about "iscsid.socket: Failed to listen on sockets: Address already in use". After restarting the service it worked again. However, I wonder whether the whole iscsi setup is still necessary. From a first look I can't make out its use anymore (also see https://chat.suse.de/group/qa-tools?msg=CTKrp7eih4DxaBgBb).
- The snapper services hit a connection limit of dbus. To avoid this in the future I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/436
- /usr/sbin/rebootmgrd showed up 245 times in the process tree: some very old instances from before we disabled it (with https://progress.opensuse.org/issues/81058), but also at least one from 2021. Oli raised the very valid point that salt still starts the unit (which can't be confirmed by systemctl status rebootmgr). Therefore I will mask the unit on all PPC workers with:
salt -l error --no-color -C 'G@roles:worker and G@cpuarch:ppc64le' cmd.run "systemctl disable --now rebootmgr && systemctl mask rebootmgr"
- kdump-early (and therefore kdump) fails as the crashkernel cmdline parameter is missing. I assume this happened because somebody did not follow https://progress.opensuse.org/issues/81058 while rebooting the machine. At least there were no additional parameters in /proc/cmdline. Rebooting the machine with the correct kexec line fixed that as well. Both services are running now.
- logrotate failed with "failed to rename /var/log/openvswitch/ovs-vswitchd.log to /var/log/openvswitch/ovs-vswitchd.log-20210126: Permission denied". I've created https://bugzilla.opensuse.org/show_bug.cgi?id=1181418 and applied
salt -l error --no-color -C 'G@roles:worker' cmd.run "chown openvswitch:openvswitch /var/log/openvswitch/ && systemctl restart logrotate"
as a workaround for now. This got rid of almost every other failing systemd service on other servers too :)
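For the kdump item above, the check for the missing crashkernel parameter in /proc/cmdline could be scripted roughly as follows. This is a sketch under assumptions: the helper name `has_crashkernel` and the sample command lines are hypothetical and not taken from the affected machine.

```python
def has_crashkernel(cmdline):
    """True if any whitespace-separated kernel argument starts with 'crashkernel='."""
    return any(arg.startswith("crashkernel=") for arg in cmdline.split())


# On a real host one would pass the contents of /proc/cmdline instead, e.g.
#   has_crashkernel(open("/proc/cmdline").read())
# The strings below are made-up examples.
print(has_crashkernel("root=/dev/sda2 crashkernel=210M"))  # → True
print(has_crashkernel("root=/dev/sda2 quiet"))             # → False
```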
#3 Updated by nicksinger 9 months ago
- Priority changed from Urgent to Normal
Unfortunately my workaround for logrotate didn't work and I couldn't find a solution yet. Therefore I will leave the alert disabled. However, we now have just 1 failing service per worker, so I'm at least reducing the urgency :)
#4 Updated by nicksinger 9 months ago
- Status changed from Workable to Feedback
I've come up with a hopefully working workaround, which is implemented in salt: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/439
Once the workaround is merged and deployed I'd give enabling the alert another try. Setting to Feedback until then :)
#6 Updated by nicksinger 8 months ago
I extended the workaround state once more with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/448. It seems that logrotate no longer fails on systems where the directory was properly owned by openvswitch. It was still failing on grenache and two arm workers, where the directory was still owned by root. Unfortunately I didn't note down on which machines I chown'ed the log directory, hence the more persistent approach with salt now.
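To find out afterwards which machines still have the wrong owner on the log directory, a small ownership check could be run on each host. The snippet below is a sketch only: `owned_by` is a hypothetical helper, and the demo uses a temporary directory instead of /var/log/openvswitch/ so that it runs anywhere.

```python
import os
import tempfile


def owned_by(path, uid):
    """True if the file or directory at 'path' is owned by the numeric user id 'uid'."""
    return os.stat(path).st_uid == uid


# On a worker the call would look roughly like
#   owned_by("/var/log/openvswitch/", pwd.getpwnam("openvswitch").pw_uid)
# Here a throwaway directory owned by the current user serves as the demo.
with tempfile.TemporaryDirectory() as demo_dir:
    print(owned_by(demo_dir, os.getuid()))
```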
#7 Updated by nicksinger 8 months ago
postfix on openqaworker-arm-3 failed with:
Jan 29 01:42:20 openqaworker-arm-3 postfix: fatal: parameter inet_interfaces: no local interface found for ::1
I've enabled ipv6 on the loopback interface once again (on all 3 arm workers) by removing the corresponding line in /etc/sysctl.d/99-poo81198.conf while leaving the others disabled (see poo#81198).
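The contents of that sysctl file are not shown in this ticket. Assuming it holds per-interface `disable_ipv6` entries, with `net.ipv6.conf.lo.disable_ipv6` being the line for the loopback interface, the edit could be sketched like this (the function name and the sample content are hypothetical):

```python
def drop_lo_ipv6_disable(conf_text):
    """Remove the loopback disable_ipv6 setting and keep every other line."""
    kept = []
    for line in conf_text.splitlines():
        # The key is everything before the first '=', e.g. 'net.ipv6.conf.lo.disable_ipv6'.
        key = line.split("=", 1)[0].strip()
        if key == "net.ipv6.conf.lo.disable_ipv6":
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up file content; the real /etc/sysctl.d/99-poo81198.conf may look different.
sample = (
    "net.ipv6.conf.eth0.disable_ipv6 = 1\n"
    "net.ipv6.conf.lo.disable_ipv6 = 1\n"
)
print(drop_lo_ipv6_disable(sample), end="")
# → net.ipv6.conf.eth0.disable_ipv6 = 1
```

Settings changed in the file only take effect once they are reloaded, for example with `sysctl --system` or a reboot.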
#8 Updated by nicksinger 8 months ago
- Status changed from Feedback to Resolved
Over the weekend we only collected failures of the openqa-worker service, which I asked about in https://chat.suse.de/group/qa-tools?msg=HwEftS6jgna3QgJEk . The main issue (logrotate) and the items mentioned in https://progress.opensuse.org/issues/88225#note-2 all seem to be fixed, and the alert is enabled again. Therefore I'm finally closing this as resolved.
#10 Updated by nicksinger 8 months ago
- Status changed from Feedback to Resolved
Unfortunately I cannot reproduce the iscsi issue at the moment. But IMHO when it comes back it should be handled in a different ticket, as the measures to resolve it might involve removing the service altogether from our infrastructure (I don't see where it is needed anymore).