action #88225
closed
osd infrastructure: Many failed systemd services on various machines
Added by okurz almost 4 years ago.
Updated almost 4 years ago.
Description
Observation
hi guys, https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&editPanel=6&tab=alert has been disabled for some weeks because we had bigger problems which we already handled in various tickets, e.g. the broken worker issues regarding the network, but it currently shows 14 (!) failed systemd services on our hosts. I think the original ticket is still blocked, but by a new issue. I will create a new urgent issue to handle the plethora of failed services.
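For a quick fleet-wide overview independent of the Grafana panel, the failed units can be counted per host with salt (a minimal sketch; the broad '*' target is an assumption):
salt '*' cmd.run "systemctl --failed --no-legend | wc -l"  # count failed units per minion
salt '*' cmd.run "systemctl --failed --no-legend"          # list them with unit names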
Acceptance criteria
- AC1: Significantly reduced number of failed systemd services
- AC2: The alert is enabled again
- Status changed from New to Workable
- Assignee set to nicksinger
Assigning to nsinger as proposed by him. Thanks, you are awesome! :)
Starting out with qa-power8-5-kvm, which has the most failing services:
● kdump-early.service loaded failed failed Load kdump kernel early on startup
● kdump.service loaded failed failed Load kdump kernel and initrd
● logrotate.service loaded failed failed Rotate log files
● rebootmgr.service loaded failed failed Reboot Manager
● snapper-cleanup.service loaded failed failed Daily Cleanup of Snapper Snapshots
● snapper-timeline.service loaded failed failed Timeline of Snapper Snapshots
● iscsid.socket loaded failed failed Open-iSCSI iscsid Socket
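For reference, such a listing can be produced with systemd's failed-unit query:
systemctl --failed  # shorthand for: systemctl list-units --state=failed (all unit types)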
- iscsid complained about "iscsid.socket: Failed to listen on sockets: Address already in use". After restarting the service it worked again. However, I question whether the whole iscsi setup is still necessary; from a first look I can't make out its use anymore (also see https://chat.suse.de/group/qa-tools?msg=CTKrp7eih4DxaBgBb). A restart/verify sketch follows after this list.
- The snapper services hit a D-Bus connection limit. To avoid this in the future I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/436 (a hypothetical sketch of such a limit change follows after this list).
- /usr/sbin/rebootmgrd showed up 245 times in the process tree. Some are very old ones from before we disabled it (with https://progress.opensuse.org/issues/81058), but there is also at least one from 2021. Oli raised the very valid point that salt still starts the unit (which can't be confirmed by systemctl status rebootmgr). Therefore I will mask it on all PPC workers with: salt -l error --no-color -C 'G@roles:worker and G@cpuarch:ppc64le' cmd.run "systemctl disable --now rebootmgr && systemctl mask rebootmgr" (see the note on stray processes after this list).
- kdump-early (and therefore kdump) fails because the crashkernel cmdline parameter is missing. I assume this is because somebody did not follow https://progress.opensuse.org/issues/81058 while rebooting the machine; at least there were no additional parameters in /proc/cmdline. Rebooting the machine with the correct kexec line fixed that as well; both services are running now (a quick check sketch follows after this list).
- logrotate "failed to rename /var/log/openvswitch/ovs-vswitchd.log to /var/log/openvswitch/ovs-vswitchd.log-20210126: Permission denied". I've created https://bugzilla.opensuse.org/show_bug.cgi?id=1181418 and applied salt -l error --no-color -C 'G@roles:worker' cmd.run "chown openvswitch:openvswitch /var/log/openvswitch/ && systemctl restart logrotate" as a workaround for now. This got rid of almost every other failing systemd service on the other servers too :) (an ownership verification sketch follows after this list)
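Regarding the iscsid item above, a minimal sketch of the restart and a follow-up check (unit names as shown in the failed-units listing):
systemctl restart iscsid.socket         # re-bind the socket unit
systemctl list-sockets | grep -i iscsi  # confirm the socket is listening again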
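Regarding the snapper item, the actual fix is in MR 436; purely as a hypothetical sketch of how a D-Bus connection limit can be raised (file name and value are assumptions, not the MR's content):
cat <<'EOF' > /etc/dbus-1/system.d/99-raise-connection-limit.conf
<!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <!-- raise the per-user connection limit that the snapper services presumably hit -->
  <limit name="max_connections_per_user">4096</limit>
</busconfig>
EOF
systemctl reload dbus  # dbus re-reads its configuration on reload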
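Regarding the rebootmgrd item, note that systemctl disable --now only stops the unit systemd still tracks; the 245 stray processes in the tree would need to be killed separately, e.g. (a sketch, double-check the match pattern before running it):
salt -l error --no-color -C 'G@roles:worker and G@cpuarch:ppc64le' cmd.run "pkill -e -f /usr/sbin/rebootmgrd"  # -e echoes what was killed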
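Regarding the kdump item, a quick check for the parameter and the services afterwards (a sketch; the full kexec procedure is in poo#81058):
grep -o 'crashkernel=[^ ]*' /proc/cmdline || echo "crashkernel parameter missing"
systemctl status kdump-early.service kdump.service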
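Regarding the logrotate item, the ownership can be verified across all workers before re-enabling the alert (a sketch):
salt -l error --no-color -C 'G@roles:worker' cmd.run "stat -c '%U:%G' /var/log/openvswitch/"  # expect openvswitch:openvswitch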
- Priority changed from Urgent to Normal
Unfortunately my workaround for logrotate didn't work and I haven't found a solution yet, so I will leave the alert disabled. However, we're down to just one failing service per worker now, so I'm at least reducing the urgency :)
- Status changed from Workable to Feedback
- Related to action #88474: All workers on powerqaworker-qam-1 are offline added
I extended the workaround state once more with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/448 . It seems logrotate no longer fails on systems where the directory was properly owned by openvswitch; it was still failing on grenache and two arm workers where it was still owned by root. Unfortunately I didn't note down on which machines I chown'ed the log directory, hence the more persistent approach with salt now (see the sketch below).
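A hypothetical sketch of what such a persistent salt state could look like (the actual state is in MR 448 and is not quoted here; the file name is made up):
cat <<'EOF' > openvswitch-logdir.sls
# ensure the openvswitch log directory is owned by the service user
/var/log/openvswitch:
  file.directory:
    - user: openvswitch
    - group: openvswitch
EOF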
postfix on openqaworker-arm-3 failed with:
Jan 29 01:42:20 openqaworker-arm-3 postfix[3476]: fatal: parameter inet_interfaces: no local interface found for ::1
I've enabled ipv6 on the loopback interface once again (on all 3 arm workers) by removing the corresponding line in /etc/sysctl.d/99-poo81198.conf while leaving the others disabled (see poo#81198).
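For reference, the removed loopback line presumably looked like the commented one below (a sketch; the file's exact content is tracked in poo#81198), and the setting can be reapplied without a reboot:
# line removed from /etc/sysctl.d/99-poo81198.conf (assumed):
# net.ipv6.conf.lo.disable_ipv6 = 1
sysctl -w net.ipv6.conf.lo.disable_ipv6=0  # re-enable IPv6 on lo immediately
sysctl --system                            # reload all sysctl.d files
systemctl restart postfix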
- Status changed from Feedback to Resolved
- Status changed from Resolved to Feedback
hm, seems like you might have missed something in your queries: same as mentioned in #88225#note-2, there still is iscsid.socket failing on openqaworker8. Maybe you just looked for ".service" lately?
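The difference is easy to reproduce: filtering by service type hides failed sockets (a minimal illustration):
systemctl list-units --state=failed --type=service  # misses iscsid.socket
systemctl list-units --state=failed                 # includes sockets, timers, mounts, ...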
- Status changed from Feedback to Resolved
Unfortunately I cannot reproduce the iscsi issue at the moment. But IMHO, when it comes back it should be handled in a different ticket, as the measures to resolve it might involve removing the service altogether from our infrastructure (I don't see where it is needed anymore).