action #75274

[osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable

Added by okurz 11 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
Due date:
2020-12-04
% Done:

0%

Estimated time:

Description

Observation

https://openqa.suse.de/tests/4885662 is incomplete due to

backend died: Open vSwitch command 'set_vlan' with arguments 'tap3 1' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files 

as the service os-autoinst-openvswitch.service aborted after 60s without network on the host due to #73633

Acceptance criteria

  • AC1: DONE hpc_ALPHA_openmpi_mpi_supportserver passes
  • AC2: DONE os-autoinst-openvswitch timeout is configurable
  • AC3: A higher timeout OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT is set in salt on all osd workers
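
To illustrate what AC2 and AC3 ask for, here is a minimal sketch of a retry loop whose deadline comes from an environment variable. It is not the service's actual code. The function name wait_for is hypothetical, and the fallback of 60 only mirrors the 60s mentioned in the ticket subject.

```shell
# Hypothetical sketch: retry a command until it succeeds or the timeout expires.
# The timeout is read from OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT; the default
# of 60 seconds is an assumption based on the ticket subject.
wait_for() {
    timeout="${OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT:-60}"
    start=$(date +%s)
    until "$@"; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout" ]; then
            echo "gave up after ${timeout}s" >&2
            return 1
        fi
        sleep 1
    done
}
```

With a higher value exported in the environment, the same loop simply keeps retrying longer, which is the effect a larger timeout on the osd workers is meant to have.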

Suggestions


Related issues

Related to openQA Infrastructure - action #98835: arm jobs failing (again?) with auto_review:"backend died: Open vSwitch command 'set_vlan' with arguments .*was not provided by any .service files":retry (New, 2021-09-17)

Copied from openQA Infrastructure - action #75016: [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service (and var-lib-openqa-share.mount) on openqaworker-arm-2 and others (Resolved, 2020-10-21)

History

#1 Updated by okurz 11 months ago

  • Copied from action #75016: [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service (and var-lib-openqa-share.mount) on openqaworker-arm-2 and others added

#2 Updated by okurz 11 months ago

  • Status changed from In Progress to Feedback

#3 Updated by okurz 11 months ago

  • Due date set to 2020-10-29

PR merged. After we deploy this we can override the env variable, e.g. within a systemd service override.
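
A systemd service override of that kind could be written roughly as follows. This is only a sketch: the helper name write_timeout_dropin and the file name 30-openvswitch-init-timeout.conf are hypothetical. Which unit's drop-in directory has to carry the variable is left to the caller.

```shell
# Hypothetical helper: write a systemd drop-in that sets the init timeout.
# $1: drop-in directory of the unit, $2: timeout in seconds
write_timeout_dropin() {
    dir="$1"
    timeout="$2"
    mkdir -p "$dir" || return 1
    # The file name is an assumption; any *.conf name in the directory works.
    cat > "$dir/30-openvswitch-init-timeout.conf" <<EOF
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=$timeout"
EOF
}
```

For example, write_timeout_dropin /etc/systemd/system/openqa-worker@.service.d 300 would use the directory and value that show up later in this ticket. A systemctl daemon-reload and a restart of the affected service are needed before the new value takes effect.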

#4 Updated by okurz 11 months ago

  • Status changed from Feedback to Blocked

As long as #73633 is unresolved we are not getting an automatic deployment of os-autoinst, so this is waiting for that.

#5 Updated by okurz 11 months ago

  • Tags set to osd, network, infrastructure, salt, multi-machine
  • Due date changed from 2020-10-29 to 2020-11-04
  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

Feature was deployed. We can set the timeout value in salt

#6 Updated by okurz 11 months ago

  • Tags changed from osd, network, infrastructure, salt, multi-machine to osd, network, infrastructure, salt, multi-machine, learning
  • Subject changed from [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable to [osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable

Added "[learning]" to the ticket. I prefer to not do this task because I consider it a good learning opportunity for others that are not that proficient with the current infrastructure management.

#7 Updated by cdywan 11 months ago

  • Description updated (diff)
  • Due date changed from 2020-11-04 to 2020-11-20

I think this would ideally have defined ACs so it's clear what the learning step is that's needed to resolve this ticket.

#8 Updated by okurz 11 months ago

  • Description updated (diff)

#9 Updated by cdywan 10 months ago

  • Due date changed from 2020-11-20 to 2020-11-27
  • Status changed from Workable to In Progress
  • Assignee set to cdywan

Since I didn't manage to tempt anyone, and it's been sitting here a while I'll come up with a fix and maybe it can still serve as a reference for the next opportunity.

#11 Updated by cdywan 10 months ago

  • Status changed from In Progress to Feedback

MR !409 got merged. The pipeline failed, I re-ran it and it failed again; however, the changes seem to have been applied.

Now to confirm that the variable was applied correctly, and not just on workers using the nvme mount overrides (which was wrong with my previous change) I used this:

sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep 300'

#12 Updated by cdywan 10 months ago

Well, I might just have found an actual error and proposed another follow-up, !410. A "Result: False" entry points to a use of the old .conf filename. Previously I'd only seen the "ERROR: Minions returned with non-zero exit code" line at the end.

#13 Updated by cdywan 10 months ago

  • Due date changed from 2020-11-27 to 2020-12-04

Let's see if I can wrap this up this week. It's cleaner now but ofc I'm making silly mistakes along the way.

#14 Updated by cdywan 10 months ago

!415 should address the last piece here, which is old .conf files being left behind after introducing new ones with specific names.

sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep -E "Environment|# /"' shows that OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300 gets specified twice on some of the machines.
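
To spot such duplicates without reading the output by eye, a small filter can count the assignments. This is a rough sketch: the function name count_env_assignments is hypothetical, and it counts matching Environment= lines, so a variable repeated within one line is counted once.

```shell
# Hypothetical helper: count the Environment= lines on stdin that assign
# the variable named in $1 (e.g. fed from the output of `systemctl cat`).
count_env_assignments() {
    awk -v name="$1" '
        /^Environment=/ && index($0, name "=") { n++ }
        END { print n + 0 }
    '
}
```

On a worker this could be used as systemctl cat openqa-worker@.service | count_env_assignments OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT, where any result above 1 indicates that more than one file sets the variable.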

#15 Updated by okurz 10 months ago

Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'
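
Before a one-shot removal like this, it can help to see which override files are actually present on a machine. The following is a sketch under assumptions: the function name list_stale_dropins is hypothetical, and it only prints candidates without deleting anything.

```shell
# Hypothetical helper: print every *.conf in a drop-in directory except the
# one file that should stay. It only lists files; it deletes nothing.
# $1: drop-in directory, $2: file name to keep
list_stale_dropins() {
    dir="$1"
    keep="$2"
    for f in "$dir"/*.conf; do
        # Skip the unexpanded pattern when the directory has no .conf files.
        [ -e "$f" ] || continue
        [ "$(basename "$f")" = "$keep" ] && continue
        printf '%s\n' "$f"
    done
}
```

The printed paths can then be reviewed and passed to rm by hand, which keeps the decision about temporary overrides with the admin.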

#16 Updated by cdywan 10 months ago

okurz wrote:

Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'

Isn't that what we want? If we started to rely on it, it wouldn't be temporary...

#17 Updated by cdywan 10 months ago

  • Status changed from Feedback to Resolved

Well, I deleted the files manually now (sudo salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/override.conf'). The question of rumpfushing can be revisited anyway and is out of scope here.

#18 Updated by okurz 10 months ago

Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations a fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still committing what is done but skipping the wait for CI and review, but this can get very noisy when multiple iterations are needed.

#19 Updated by cdywan 10 months ago

okurz wrote:

Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations a fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still committing what is done but skipping the wait for CI and review, but this can get very noisy when multiple iterations are needed.

Sorry if that came across as harsh. My point here was to eventually ensure consistency. So next time salt runs, or whenever a new machine is deployed it has all the fixes. I don't mind manual intervention at all.

#20 Updated by okurz 10 months ago

cdywan wrote:

My point here was to eventually ensure consistency. So next time salt runs, or whenever a new machine is deployed it has all the fixes. I don't mind manual intervention at all.

Currently the design goal is more like: Next time a machine is (re-)installed all the non-temporary configuration is applied correctly.

#21 Updated by okurz 10 days ago

  • Related to action #98835: arm jobs failing (again?) with auto_review:"backend died: Open vSwitch command 'set_vlan' with arguments .*was not provided by any .service files":retry added
