action #120783

[Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours size:M

Added by okurz 3 months ago. Updated 16 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
2022-11-20
Due date:
% Done:

0%

Estimated time:

Description

Observation

Received email "[Alerting] InfluxDB not reachable" at 2022-11-20 03:54. https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=1668898957390&to=1668947747978&viewPanel=6 shows the alert going to pending at 03:37, which is the time when we apply automatic reboots when necessary, e.g. for kernel and base library changes, and then back to ok at 06:38. That's 3h later. The system journal shows just:

Nov 20 03:37:13 worker11 50mounted-tests[10630]: debug: running subtest /usr/lib/os-probes/mounted/90linux-distro
Nov 20 03:37:13 worker11 sh[2094]: br1             setup-in-progress
Nov 20 03:37:13 worker11 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Nov 20 03:37:13 worker11 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Nov 20 03:37:13 worker11 systemd[1]: Failed to start os-autoinst openvswitch helper.

Acceptance criteria

  • AC1: os-autoinst-openvswitch is stable after repeated reboots on worker11

Suggestion

  • In the past there were problematic overrides of the systemd unit present so check for that
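A minimal sketch of such a check, assuming the overrides live as drop-in files under /etc/systemd/system/<unit>.d/ (as the one found later in this ticket does). On a live system `systemctl cat os-autoinst-openvswitch.service` prints the unit together with all its drop-ins; the helper `list_dropins` below is hypothetical and only illustrates what to look for:

```shell
#!/bin/sh
# list_dropins UNIT [ROOT]: print every drop-in file for UNIT found under
# ROOT/etc/systemd/system/UNIT.d/ (ROOT defaults to the real filesystem root).
# Hypothetical helper; it only inspects /etc, not /run or /usr/lib.
list_dropins() {
    unit=$1
    root=${2:-}
    dir="$root/etc/systemd/system/$unit.d"
    [ -d "$dir" ] || return 0
    for f in "$dir"/*.conf; do
        # Skip the unexpanded pattern when the directory has no .conf files.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

list_dropins os-autoinst-openvswitch.service
```

Any file listed that is not deployed via salt is a candidate leftover.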

Related issues

Copied from openQA Infrastructure - action #120780: [Alerting] InfluxDB not reachable, turned ok after some minutes (Resolved, 2022-11-20)

History

#1 Updated by okurz 3 months ago

  • Copied from action #120780: [Alerting] InfluxDB not reachable, turned ok after some minutes added

#2 Updated by cdywan 2 months ago

  • Subject changed from [Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours to [Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler 16 days ago

  • Status changed from Workable to Resolved
  • Assignee set to mkittler

I wasn't aware of this ticket anymore. So #123151#note-7 wasn't the first occurrence of the problem. It is even the same worker again.

I've checked again for overrides and there are actually two. The first is our normal bump of the timeout (deployed via salt) and the second seems to be a leftover (not in salt and not present on other workers):

# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
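To make the control flow of that ExecStartPre line easier to see, here is a small sketch with stub functions in place of the real commands. The stubs and their return values are assumptions for illustration: in particular it assumes that the fallback `wicked ifup br1` is what printed "br1 setup-in-progress" and exited with 162, which matches the journal but is not confirmed in this ticket.

```shell
#!/bin/sh
# Stubs standing in for the real commands (hypothetical, illustration only).
tool_present() { return 0; }   # plays: command -v wicked >/dev/null
status_is_up() { return 1; }   # plays: wicked ifstatus br1 | grep -q up
bring_up() {                   # plays: wicked ifup br1
    echo "br1 setup-in-progress"
    return 162
}

# Same shape as the ExecStartPre line: A && B || C.
# C runs whenever A or B fails, and C's exit status becomes the result
# of the whole list, which is what systemd sees for the control process.
run_pre() {
    tool_present && status_is_up || bring_up
}

run_pre || echo "exit status: $?"
```

With these stubs the list ends with the status of the last command, so a non-zero return from the fallback alone is enough to fail the unit start.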

I have removed that file and reloaded daemons. That should fix the problem as it did in the past. I have cleaned up such overrides a year ago (or maybe it was even longer?). Maybe this one wasn't covered because the machine wasn't online at the time. According to sudo salt -C 'G@roles:worker' cmd.run 'test -e /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf' the override is now gone on all salt-controlled workers. I've also checked o3 workers. I have also rebooted ow11.
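Since `test -e` reports its result only through the exit status, a variant that prints an explicit word per host can be easier to read. This is a sketch only; the helper name `override_state` is hypothetical:

```shell
#!/bin/sh
# override_state PATH: print "present" or "absent" so the result shows up
# in the command output instead of only in the exit status of `test -e`.
override_state() {
    if test -e "$1"; then
        echo present
    else
        echo absent
    fi
}

override_state /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
```

The same idea as a one-liner would be `test -e FILE && echo present || echo absent`, which could be passed to the salt cmd.run call quoted above, assuming the command is run through a shell on the workers.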
