action #104350: [alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically size:M - openQA Project (public) - openSUSE Project Management Tool

action #104350

closed

[alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically size:M

Added by okurz over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Low

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-12-26

Due date:

% Done:

Estimated time:

Description

Observation

From journalctl -b -u os-autoinst-openvswitch.service on grenache-1:

-- Logs begin at Mon 2021-12-20 21:50:34 CET, end at Sun 2021-12-26 14:26:34 CET. --
Dec 26 03:34:34 grenache-1 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 26 03:35:34 grenache-1 wicked[3515]: device br1: unable to apply configuration to nanny
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: start-pre operation timed out. Terminating.
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=killed, status=15/TERM
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'timeout'.
Dec 26 03:36:04 grenache-1 systemd[1]: Failed to start os-autoinst openvswitch helper.
Dec 26 04:38:08 grenache-1 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 26 04:38:08 grenache-1 systemd[1]: Started os-autoinst openvswitch helper.

This triggered an alert, visible at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1640478716612&to=1640495459276, but as the log shows the service recovered on its own roughly one hour later.
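
As a quick cross-check of the timestamps in the journal excerpt, the elapsed times can be computed with a small bash sketch. The helper names `to_seconds` and `elapsed` are made up for illustration, and the sketch assumes both timestamps fall on the same day.

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds since midnight.
# The 10# prefix forces base 10 so values like "08" are not parsed as octal.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Seconds elapsed between two same-day timestamps (start, end).
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# "Starting ..." until "start-pre operation timed out"
elapsed 03:34:34 03:36:04   # prints 90
# Failure until the successful start
elapsed 03:36:04 04:38:08   # prints 3724
```

So the start job was terminated 90 seconds after it began, and the service came back about 62 minutes after the failure.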

Acceptance criteria

AC1: At least the code flow in https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch and the corresponding systemd service has been reviewed once

Suggestions

Investigate why os-autoinst-openvswitch.service times out after roughly 90s (03:34:34 to 03:36:04 in the journal above), when the config file in salt-states says 300s and https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch#L30 reads as if the waiting time should be indefinite
Check whether any of the related systemd units has a retry configured. If not, add one or extend the timeout
On grenache-1.qa there is already:

# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'

Not sure whether this is manually maintained or where it comes from.
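
To check which start timeout and restart policy systemd actually applies to the unit, something like the following could be run on the worker. This is only a sketch: the function name `show_start_settings` is hypothetical, while `TimeoutStartUSec`, `Restart` and `RestartUSec` are standard properties reported by `systemctl show`.

```shell
#!/bin/bash
# Print the effective start timeout and restart policy of a unit.
# Defaults to the unit discussed in this ticket.
show_start_settings() {
    local unit="${1:-os-autoinst-openvswitch.service}"
    systemctl show "$unit" -p TimeoutStartUSec -p Restart -p RestartUSec
}

# Usage on a worker (needs a running systemd):
#   show_start_settings
#   show_start_settings some-other.service
```

The start timeout also covers `ExecStartPre=` commands, which matches the "start-pre operation timed out" line in the journal. Unless a unit or the manager configuration overrides it, systemd's default start timeout is 90s.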

Updated by okurz over 3 years ago

Subject changed from [alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically to [alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically size:M
Description updated (diff)
Priority changed from Normal to Low

Updated by okurz over 3 years ago

Status changed from New to Workable

Updated by mkittler over 3 years ago

Assignee set to mkittler

Updated by mkittler over 3 years ago

Status changed from Workable to In Progress

I remember that strange "unable to apply configuration to nanny" error, so this isn't the first time we've seen it. However, I still cannot really make sense of it.

There's a systemd override which defines the operation that times out here:

martchus@grenache-1:~> systemctl cat os-autoinst-openvswitch.service
# /usr/lib/systemd/system/os-autoinst-openvswitch.service
…

# /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300"

# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'

The configured timeout is only used by os-autoinst-openvswitch itself and of course has no effect on the ExecStartPre command. I cannot find this override in our salt repos, so I assume it is a manual configuration or a leftover. The override is also present on other workers, but not on all of them (e.g. it is present on many x86_64 workers and on arm-2, but on none of the other arm workers).

To me this looks like an old workaround for ensuring the bridge is ready before starting the actual service. However, at this point the service handles the waiting itself. So I'm simply going to remove this override on OSD workers where it is still present.
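
A minimal sketch of how the leftover drop-in could be detected and removed on a worker. The function names and the optional root-directory parameter are illustrative additions so the logic can be tried against a scratch directory. Only the drop-in path is taken from the `systemctl cat` output above.

```shell
#!/bin/bash
# Path of the drop-in quoted in this ticket, relative to a root directory.
OVERRIDE_REL="etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf"

# Succeeds if the drop-in exists below the given root (default: /).
has_override() {
    local root="${1:-/}"
    [ -f "${root%/}/$OVERRIDE_REL" ]
}

# Remove the drop-in if present and report what was done.
remove_override() {
    local root="${1:-/}"
    if has_override "$root"; then
        rm -- "${root%/}/$OVERRIDE_REL"
        echo "removed"
    else
        echo "absent"
    fi
}
```

On a real host this would be run as root and followed by `systemctl daemon-reload` so that systemd picks up the change.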

Updated by mkittler over 3 years ago

Status changed from In Progress to Resolved

Removed it on all OSD and o3 workers (on o3 only openqaworker7 had it). So now the timeout of 300 seconds should be used everywhere.
