action #120783: [Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #120783

closed

[Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2022-11-20

Due date:

% Done:

Estimated time:

Tags:

infra, reactive work

Description

Observation¶

Received email "[Alerting] InfluxDB not reachable" at 2022-11-20 03:54. https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=1668898957390&to=1668947747978&viewPanel=6 shows the alert going to pending at 03:37, which is the time when we apply automatic reboots when necessary, e.g. for kernel and base library changes, and then back to ok at 06:38. That's 3h later. The system journal shows just:

Nov 20 03:37:13 worker11 50mounted-tests[10630]: debug: running subtest /usr/lib/os-probes/mounted/90linux-distro
Nov 20 03:37:13 worker11 sh[2094]: br1             setup-in-progress
Nov 20 03:37:13 worker11 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Nov 20 03:37:13 worker11 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Nov 20 03:37:13 worker11 systemd[1]: Failed to start os-autoinst openvswitch helper.

Acceptance criteria¶

AC1: os-autoinst-openvswitch is stable after repeated reboots on worker11

Suggestion¶

In the past there were problematic overrides of the systemd unit present, so check for that.

Related issues 1 (0 open — 1 closed)

Updated by okurz over 2 years ago

Copied from action #120780: [Alerting] InfluxDB not reachable, turned ok after some minutes added

Updated by livdywan over 2 years ago

Subject changed from [Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours to [Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours size:M
Description updated (diff)
Status changed from New to Workable

Updated by mkittler over 2 years ago

Status changed from Workable to Resolved
Assignee set to mkittler

I wasn't aware of this ticket anymore. So #123151#note-7 wasn't the first occurrence of the problem. It is even the same worker again.

I've checked again for overrides and there are actually two. The first is our normal bump of the timeout (deployed via salt) and the second seems to be a leftover (it is neither in salt nor on other workers):

# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
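A note on how the shell parses that ExecStartPre line, as a minimal sketch: in POSIX sh, && and || have equal precedence and associate left to right, so the fallback after || runs both when the bridge is not reported as up and when the wicked binary is missing. The fallback's exit status then becomes the status of the whole command, which would be consistent with the non-zero status seen in the journal. The function run_chain and its numeric arguments are hypothetical stand-ins for the three commands in the override; the value 162 is borrowed from the journal only for illustration.

```shell
# Mimic the override's `A && B || C` structure with stub exit statuses.
# $1: status of the "tool is installed" check (stand-in for `command -v wicked`)
# $2: status of the "bridge is up" check (stand-in for `wicked ifstatus br1 | grep -q up`)
# $3: status of the fallback (stand-in for `wicked ifup br1`)
run_chain() {
    ( exit "$1" ) && ( exit "$2" ) || { echo "fallback ran"; return "$3"; }
}
```

Calling run_chain 0 1 162 prints "fallback ran" and returns 162, while run_chain 0 0 162 prints nothing and returns 0. Calling run_chain 1 0 0 also prints "fallback ran", which shows that the fallback is not guarded by the first check.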

I have removed that file and reloaded daemons. That should fix the problem as it did in the past. I cleaned up such overrides a year ago (or maybe it was even longer?). Maybe this one wasn't covered because the machine wasn't online at the time. According to sudo salt -C 'G@roles:worker' cmd.run 'test -e /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf' the override is now gone on all salt-controlled workers. I've also checked o3 workers. I have also rebooted ow11.
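The check for leftover overrides can be sketched as a small helper that lists the drop-in files for a unit. The function name list_dropins and its optional base-directory argument are assumptions made for illustration; only the path layout /etc/systemd/system/<unit>.d/ is taken from this ticket.

```shell
# List drop-in override files for a systemd unit.
# Usage: list_dropins UNIT [BASE_DIR]
# BASE_DIR defaults to /etc/systemd/system (hypothetical parameter, useful for dry runs).
list_dropins() {
    unit=$1
    base=${2:-/etc/systemd/system}
    for f in "$base/${unit}.d"/*.conf; do
        # When no drop-in exists the glob stays unexpanded, so skip non-existent paths
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

On a worker this would be called as list_dropins os-autoinst-openvswitch.service. Running systemctl cat os-autoinst-openvswitch.service gives a similar view that also includes the unit file itself, and systemctl daemon-reload is needed after removing a drop-in so that systemd picks up the change.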
