action #75274: [osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable - openQA Infrastructure (public) - openSUSE Project Management Tool

action #75274

closed

Added by okurz over 4 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee: livdywan
Category:
Target version: openQA Project (public) - Ready
Start date:
Due date: 2020-12-04
% Done:
Estimated time:
Tags: multi-machine, osd, network, salt, learning, infra

Description

Observation

https://openqa.suse.de/tests/4885662 is incomplete due to

backend died: Open vSwitch command 'set_vlan' with arguments 'tap3 1' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files

as the service os-autoinst-openvswitch.service aborted after 60s without network on the host due to #73633
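The failure reason above has a recognizable shape, so affected jobs can be picked out by matching their reason text. A minimal sketch, assuming the reason text is available as a plain string; the function name is hypothetical and the pattern is derived only from the error message quoted in this ticket:

```shell
#!/bin/sh
# Sketch: decide whether a job failure reason is the
# "D-Bus switch service was not running" case described above.
# is_ovs_dbus_failure is a hypothetical helper name, not part of openQA.
is_ovs_dbus_failure() {
    # $1: the failure reason text of a job
    printf '%s\n' "$1" | grep -Eq \
        "Open vSwitch command '[a-z_]+' with arguments .*was not provided by any \.service files"
}
```

The function returns 0 for a matching reason and non-zero otherwise, so it can be used directly in an `if` condition.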

Acceptance criteria

AC1: DONE hpc_ALPHA_openmpi_mpi_supportserver passes
AC2: DONE os-autoinst-openvswitch timeout is configurable
AC3: A higher timeout OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT is set in salt on all osd workers

Suggestions

Read https://github.com/os-autoinst/os-autoinst/pull/1555/files
Override systemd service definitions setting OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT higher than 60, e.g. 300
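The override suggested above could look roughly like the following drop-in writer. This is a sketch under assumptions, not the change that was eventually merged: the helper name, the drop-in file name and the `root` prefix argument are made up for illustration, and which unit should carry the variable is not confirmed here (the ticket is about os-autoinst-openvswitch.service, while later comments inspect openqa-worker@.service).

```shell
#!/bin/sh
# Sketch only: write a systemd drop-in that raises the Open vSwitch init timeout.
# Assumptions (not from the ticket): the helper name, the drop-in file name
# "30-openvswitch-init-timeout.conf", and the "root" prefix argument, which
# lets the helper be tried against a scratch directory instead of /.
write_timeout_dropin() {
    root=$1      # filesystem prefix, "" for the real system
    unit=$2      # unit that should receive the variable
    timeout=$3   # timeout in seconds, e.g. 300
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nEnvironment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=%s"\n' \
        "$timeout" > "$dir/30-openvswitch-init-timeout.conf"
}
```

On a real host this would be called as `write_timeout_dropin "" os-autoinst-openvswitch.service 300`, followed by `systemctl daemon-reload` and a restart of the unit so that systemd picks up the new drop-in. In this infrastructure the file would normally be managed through salt rather than written by hand.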

Updated by okurz over 4 years ago

Copied from action #75016: [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service (and var-lib-openqa-share.mount) on openqaworker-arm-2 and others added

Updated by okurz over 4 years ago

Status changed from In Progress to Feedback

https://github.com/os-autoinst/os-autoinst/pull/1555

Updated by okurz over 4 years ago

Due date set to 2020-10-29

PR merged. After we deploy this we can override the env variable, e.g. within a systemd service override.

Updated by okurz over 4 years ago

Status changed from Feedback to Blocked

As long as #73633 is unresolved, os-autoinst is not deployed automatically, so this ticket is waiting for that.

Updated by okurz over 4 years ago

Tags set to osd, network, infrastructure, salt, multi-machine
Due date changed from 2020-10-29 to 2020-11-04
Status changed from Blocked to Workable
Assignee deleted (okurz)

The feature was deployed. We can now set the timeout value in salt.

Updated by okurz over 4 years ago

Tags changed from osd, network, infrastructure, salt, multi-machine to osd, network, infrastructure, salt, multi-machine, learning
Subject changed from [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable to [osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable

Added "[learning]" to the ticket. I would prefer not to do this task myself because I consider it a good learning opportunity for others who are less familiar with the current infrastructure management.

Updated by livdywan over 4 years ago

Description updated (diff)
Due date changed from 2020-11-04 to 2020-11-20

I think this would ideally have defined ACs so it's clear what the learning step is that's needed to resolve this ticket.

Updated by okurz over 4 years ago

Description updated (diff)

Updated by livdywan over 4 years ago

Due date changed from 2020-11-20 to 2020-11-27
Status changed from Workable to In Progress
Assignee set to livdywan

Since I didn't manage to tempt anyone and this has been sitting here for a while, I'll come up with a fix. Maybe it can still serve as a reference for the next opportunity.

#10

Updated by livdywan over 4 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/409/diffs

#11

Updated by livdywan over 4 years ago

Status changed from In Progress to Feedback

MR !409 got merged. The pipeline failed, I re-ran it and it failed again; however, the changes seem to have been applied.

To confirm that the variable was applied correctly, and not just on workers using the nvme mount overrides (which was the mistake in my previous change), I used this:

sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep 300'

#12

Updated by livdywan over 4 years ago

Well, I just might've found an actual error and proposed another follow-up, !410. The "Result: False" output points to a use of the old .conf filename. Previously I had only seen the "ERROR: Minions returned with non-zero exit code" line at the end.

#13

Updated by livdywan over 4 years ago

Due date changed from 2020-11-27 to 2020-12-04

Let's see if I can wrap this up this week. It's cleaner now, but of course I'm making silly mistakes along the way.

#14

Updated by livdywan over 4 years ago

!415 should address the last piece here, which is old .conf files being left behind after introducing new ones with specific names.

sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep -E "Environment|# /"' shows that OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300 gets specified twice on some of the machines.
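The duplicate check above can be made explicit by counting how often the variable is set in the `systemctl cat` output. A minimal sketch, assuming the output is piped in on stdin; the helper name is hypothetical:

```shell
#!/bin/sh
# Sketch: count how many Environment= lines set the timeout variable.
# Reads `systemctl cat <unit>` output on stdin and prints the count.
# count_timeout_settings is a hypothetical helper name.
count_timeout_settings() {
    # grep -c exits non-zero when the count is 0, so mask that exit status.
    grep -Ec '^[[:space:]]*Environment=.*OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=' || true
}
```

For example, `systemctl cat openqa-worker@.service | count_timeout_settings` printing a value greater than 1 would indicate that more than one drop-in sets the variable on that machine.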

#15

Updated by okurz over 4 years ago

Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'

#16

Updated by livdywan over 4 years ago

okurz wrote:

Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'

Isn't that what we want? If we started to rely on it, it wouldn't be temporary...

#17

Updated by livdywan over 4 years ago

Status changed from Feedback to Resolved

Well, I have now deleted the files manually (sudo salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/override.conf'). The question of rumpfushing can be revisited anyway and is out of scope here.
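Before a one-shot removal like the one above, it can help to list which drop-ins in a directory are not among the expected ones. A minimal sketch, assuming the expected file names are passed explicitly; the helper name and any file names used with it are hypothetical:

```shell
#!/bin/sh
# Sketch: print drop-in files in a directory that are not in the expected set.
# Usage: list_stale_dropins DIR EXPECTED_NAME...
# list_stale_dropins is a hypothetical helper name.
list_stale_dropins() {
    dir=$1
    shift
    for f in "$dir"/*.conf; do
        # Skip the unexpanded pattern when the directory has no .conf files.
        if [ ! -e "$f" ]; then
            continue
        fi
        name=${f##*/}
        keep=no
        for expected in "$@"; do
            if [ "$name" = "$expected" ]; then
                keep=yes
            fi
        done
        if [ "$keep" = no ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

The function only prints candidates and deletes nothing, so the output can be reviewed before running an `rm` across the workers.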

#18

Updated by okurz over 4 years ago

Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations a fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still commit what is done but skip waiting for CI and review, but this can make it very noisy when multiple iterations are needed.

#19

Updated by livdywan over 4 years ago

okurz wrote:

Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations a fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still commit what is done but skip waiting for CI and review, but this can make it very noisy when multiple iterations are needed.

Sorry if that came across as harsh. My point here was to eventually ensure consistency, so that the next time salt runs, or whenever a new machine is deployed, it has all the fixes. I don't mind manual intervention at all.

#20

Updated by okurz over 4 years ago

cdywan wrote:

My point here was to eventually ensure consistency. So next time salt runs, or whenever a new machine is deployed it has all the fixes. I don't mind manual intervention at all.

Currently the design goal is more like this: the next time a machine is (re-)installed, all the non-temporary configuration is applied correctly.

#21

Updated by okurz over 3 years ago

Related to action #98835: arm jobs failing (again?) with auto_review:"backend died: Open vSwitch command 'set_vlan' with arguments .*was not provided by any .service files":retry added

#22

Updated by okurz over 2 years ago

Tags changed from osd, network, infrastructure, salt, multi-machine, learning to osd, network, salt, multi-machine, learning, infra

#23

Updated by okurz over 1 year ago

Related to action #152365: os-autoinst-openvswitch.service fails on start-up size:S added
