action #75274
closed[osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable
Description
Observation
https://openqa.suse.de/tests/4885662 is incomplete due to
backend died: Open vSwitch command 'set_vlan' with arguments 'tap3 1' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files
as the service os-autoinst-openvswitch.service aborted after 60s without network on the host due to #73633
Acceptance criteria
- AC1: DONE hpc_ALPHA_openmpi_mpi_supportserver passes
- AC2: DONE os-autoinst-openvswitch timeout is configurable
- AC3: A higher timeout OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT is set in salt on all osd workers
Suggestions
- Read https://github.com/os-autoinst/os-autoinst/pull/1555/files
- Override systemd service definitions setting OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT higher than 60, e.g. 300
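For reference, a minimal sketch of what such an override could look like as a systemd drop-in. The helper name write_timeout_dropin and the file name 30-init-timeout.conf are hypothetical and not taken from the salt states. The drop-in directory is passed in by the caller, so the sketch makes no claim about which unit directory the actual salt change uses.

```shell
# Sketch only: helper name and drop-in file name are assumptions for illustration.
# Usage: write_timeout_dropin <drop-in directory> [timeout in seconds, default 300]
write_timeout_dropin() {
  dir="$1"
  timeout="${2:-300}"
  mkdir -p "$dir"
  # A drop-in with a [Service] section adds to the unit's environment
  printf '[Service]\nEnvironment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=%s"\n' \
    "$timeout" > "$dir/30-init-timeout.conf"
}
```

On a worker this would be called with the unit's drop-in directory under /etc/systemd/system/, followed by systemctl daemon-reload and a restart of the affected service so that the new environment is picked up.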
Updated by okurz about 4 years ago
- Copied from action #75016: [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service (and var-lib-openqa-share.mount) on openqaworker-arm-2 and others added
Updated by okurz about 4 years ago
- Status changed from In Progress to Feedback
Updated by okurz about 4 years ago
- Due date set to 2020-10-29
PR merged. After we deploy this we can override the env variable, e.g. within a systemd service override.
Updated by okurz about 4 years ago
- Status changed from Feedback to Blocked
As long as #73633 is unresolved we are not getting an automatic deployment of os-autoinst, so this is waiting for that.
Updated by okurz about 4 years ago
- Tags set to osd, network, infrastructure, salt, multi-machine
- Due date changed from 2020-10-29 to 2020-11-04
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
Feature was deployed. We can set the timeout value in salt.
Updated by okurz almost 4 years ago
- Tags changed from osd, network, infrastructure, salt, multi-machine to osd, network, infrastructure, salt, multi-machine, learning
- Subject changed from [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable to [osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable
Added "[learning]" to the ticket. I prefer not to do this task myself because I consider it a good learning opportunity for others who are less proficient with the current infrastructure management.
Updated by livdywan almost 4 years ago
- Description updated (diff)
- Due date changed from 2020-11-04 to 2020-11-20
I think this would ideally have defined ACs so it's clear what the learning step is that's needed to resolve this ticket.
Updated by livdywan almost 4 years ago
- Due date changed from 2020-11-20 to 2020-11-27
- Status changed from Workable to In Progress
- Assignee set to livdywan
Since I didn't manage to tempt anyone and it's been sitting here a while, I'll come up with a fix, and maybe it can still serve as a reference for the next opportunity.
Updated by livdywan almost 4 years ago
- Status changed from In Progress to Feedback
MR !409 got merged. The pipeline failed, I re-ran it and it failed again; however, the changes seem to have been applied.
Now, to confirm that the variable was applied correctly, and not just on workers using the nvme mount overrides (which was wrong with my previous change), I used this:
sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep 300'
Updated by livdywan almost 4 years ago
Well, I just might have found an actual error and proposed another follow-up, !410. Result: False points to a use of the old .conf filename. Previously I'd only seen the ERROR: Minions returned with non-zero exit code line at the end.
Updated by livdywan almost 4 years ago
- Due date changed from 2020-11-27 to 2020-12-04
Let's see if I can wrap this up this week. It's cleaner now, but of course I'm making silly mistakes along the way.
Updated by livdywan almost 4 years ago
!415 should address the last piece here, which is old .conf files being left behind after introducing new ones with specific names.
sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep -E "Environment|# /"'
shows that OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300 gets specified twice on some of the machines.
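A small sketch of how such duplicates could be spotted mechanically instead of by eye. It assumes the input is the output of systemctl cat for one unit, read from stdin; the helper name duplicate_env_lines is hypothetical.

```shell
# Sketch only: prints each Environment= line that occurs more than once in the input.
# Expects the output of `systemctl cat <unit>` on stdin.
duplicate_env_lines() {
  grep -E '^Environment=' | sort | uniq -d
}
```

For example, systemctl cat openqa-worker@.service | duplicate_env_lines prints nothing on a machine where every Environment= line is unique. This only catches exact duplicates; the same variable set to two different values would need a comparison on the variable name instead.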
Updated by okurz almost 4 years ago
Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'
Updated by livdywan almost 4 years ago
okurz wrote:
Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger
salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'
Isn't that what we want? If we started to rely on it, it wouldn't be temporary...
Updated by livdywan almost 4 years ago
- Status changed from Feedback to Resolved
Well, I deleted the files manually now (sudo salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/override.conf'). The question of rumpfushing can be revisited anyway and is out of scope here.
Updated by okurz almost 4 years ago
Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations a fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still commit what is done but skip waiting for CI and review, but this can become very noisy when multiple iterations are needed.
Updated by livdywan almost 4 years ago
okurz wrote:
Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations a fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still commit what is done but skip waiting for CI and review, but this can become very noisy when multiple iterations are needed.
Sorry if that came across as harsh. My point here was to eventually ensure consistency, so that the next time salt runs, or whenever a new machine is deployed, it has all the fixes. I don't mind manual intervention at all.
Updated by okurz almost 4 years ago
cdywan wrote:
My point here was to eventually ensure consistency. So next time salt runs, or whenever a new machine is deployed it has all the fixes. I don't mind manual intervention at all.
Currently the design goal is more like: Next time a machine is (re-)installed all the non-temporary configuration is applied correctly.
Updated by okurz about 3 years ago
- Related to action #98835: arm jobs failing (again?) with auto_review:"backend died: Open vSwitch command 'set_vlan' with arguments .*was not provided by any .service files":retry added
Updated by okurz almost 2 years ago
- Tags changed from osd, network, infrastructure, salt, multi-machine, learning to osd, network, salt, multi-machine, learning, infra
Updated by okurz 11 months ago
- Related to action #152365: os-autoinst-openvswitch.service fails on start-up size:S added