action #104350

closed

[alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-12-26
Due date:
% Done:

0%

Estimated time:

Description

Observation

From journalctl -b -u os-autoinst-openvswitch.service on grenache-1:

-- Logs begin at Mon 2021-12-20 21:50:34 CET, end at Sun 2021-12-26 14:26:34 CET. --
Dec 26 03:34:34 grenache-1 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 26 03:35:34 grenache-1 wicked[3515]: device br1: unable to apply configuration to nanny
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: start-pre operation timed out. Terminating.
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=killed, status=15/TERM
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'timeout'.
Dec 26 03:36:04 grenache-1 systemd[1]: Failed to start os-autoinst openvswitch helper.
Dec 26 04:38:08 grenache-1 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 26 04:38:08 grenache-1 systemd[1]: Started os-autoinst openvswitch helper.

This triggered an alert, visible at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1640478716612&to=1640495459276, but the service recovered on its own roughly one hour later (it failed at 03:36:04 and started successfully at 04:38:08).

Acceptance criteria

Suggestions

# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'

Not sure whether this override is manually maintained or where it comes from.

Actions #1

Updated by okurz over 2 years ago

  • Subject changed from [alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically to [alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically size:M
  • Description updated (diff)
  • Priority changed from Normal to Low
Actions #2

Updated by okurz over 2 years ago

  • Status changed from New to Workable
Actions #3

Updated by mkittler over 2 years ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress

I remember that strange "unable to apply configuration to nanny" error. So it isn't the first time we see this. However, I still cannot really make sense of it.

There's a systemd override which defines the operation that times out here:

martchus@grenache-1:~> systemctl cat os-autoinst-openvswitch.service
# /usr/lib/systemd/system/os-autoinst-openvswitch.service
…

# /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300"

# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'

The configured timeout is only used by os-autoinst-openvswitch itself and of course has no effect on the ExecStartPre command. I cannot find this override in our salt repos, so I assume it is a manual configuration or a leftover. The override is also present on other workers but not on all of them (e.g. it is present on many x86_64 workers and on arm-2, but on none of the other arm workers).
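
The journal timestamps from the description fit this reading. systemd's own start timeout (TimeoutStartSec=, normally 90 seconds unless DefaultTimeoutStartSec= has been changed) also covers ExecStartPre, independent of OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT. A minimal sketch that computes the elapsed time between the two log lines; the helper name to_seconds is made up for illustration:

```shell
# Hypothetical helper: convert HH:MM:SS into seconds since midnight.
# Only valid for two timestamps on the same day.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Timestamps copied from the journal excerpt in the description.
start=$(to_seconds 03:34:34)   # "Starting os-autoinst openvswitch helper..."
end=$(to_seconds 03:36:04)     # "start-pre operation timed out. Terminating."

elapsed=$((end - start))
echo "start-pre ran for ${elapsed}s before systemd terminated it"
```

If the result equals the unit's effective TimeoutStartSec, the ExecStartPre command was cut off by systemd rather than by the 300 second init timeout.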

To me this looks like an old workaround for ensuring the bridge is ready before starting the actual service. However, at this point the service handles the waiting itself. So I'm simply going to remove this override on OSD workers where it is still present.
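
As a side note on what the override actually did: in a shell AND-OR list, `&&` and `||` have equal precedence and are evaluated left to right, so `a && b || c` runs `c` whenever `a && b` fails. The sketch below uses stand-in functions (hypothetical, not the real wicked calls) to show which branch is taken:

```shell
# Stand-ins for "command -v wicked" and "wicked ifstatus br1 | grep -q up".
# Each returns the exit status passed as its argument (0 = success).
have_wicked() { return "$1"; }
bridge_up()   { return "$1"; }

# Same shape as the ExecStartPre line:
#   command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1
pre_start() {
    have_wicked "$1" && bridge_up "$2" || echo "ifup br1"
}

pre_start 0 0   # wicked present, bridge up: prints nothing
pre_start 0 1   # wicked present, bridge down: prints "ifup br1"
pre_start 1 0   # wicked missing: still prints "ifup br1"
```

So the `wicked ifup br1` step would be attempted even on a host without wicked, and on a host with wicked it is the step that can block until systemd's start timeout hits.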

Actions #5

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Resolved

Removed it on all OSD and o3 workers (on o3 only openqaworker7 had it). So now the timeout of 300 seconds should be used everywhere.
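
For anyone needing to repeat this on another worker, a rough sketch of removing only the override.conf drop-in while keeping 30-init-timeout.conf. This is not necessarily how the removal was done here; the function name remove_dropin and its optional root argument are assumptions made so the function can be tried against a scratch directory first:

```shell
# Hypothetical helper: remove a single drop-in file of a unit.
# $1 = unit name, $2 = drop-in file name, $3 = optional root prefix for dry runs.
remove_dropin() {
    unit=$1
    name=$2
    root=${3:-}
    file="$root/etc/systemd/system/$unit.d/$name"
    if [ -e "$file" ]; then
        rm -- "$file"
        echo "removed $file"
        # Reload unit files only when working on the live system.
        if [ -z "$root" ]; then
            systemctl daemon-reload
        fi
    else
        echo "no $name drop-in for $unit"
    fi
}

# Usage on a worker, as root:
#   remove_dropin os-autoinst-openvswitch.service override.conf
# Afterwards "systemctl cat os-autoinst-openvswitch.service" should no longer list override.conf.
```

`systemctl revert` is deliberately not used in this sketch, since it would also drop 30-init-timeout.conf.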
