openSUSE Project Management Tool
openQA Infrastructure - action #75274: [osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable
https://progress.opensuse.org/issues/75274?journal_id=342703 2020-10-25T20:57:14Z okurz okurz@suse.com
<ul><li><strong>Copied from</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/75016">action #75016</a>: [osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service (and var-lib-openqa-share.mount) on openqaworker-arm-2 and others</i> added</li></ul>
https://progress.opensuse.org/issues/75274?journal_id=342706 2020-10-25T21:20:21Z okurz okurz@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p><a href="https://github.com/os-autoinst/os-autoinst/pull/1555" class="external">https://github.com/os-autoinst/os-autoinst/pull/1555</a></p>
https://progress.opensuse.org/issues/75274?journal_id=342961 2020-10-26T11:05:58Z okurz okurz@suse.com
<ul><li><strong>Due date</strong> set to <i>2020-10-29</i></li></ul><p>PR merged. After we deploy this we can override the env variable, e.g. within a systemd service override.</p>
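<p>A systemd drop-in of the kind mentioned above could look roughly like the sketch below. The variable name and the value 300 are the ones quoted later in this ticket; the drop-in file name is hypothetical, and the unit directory simply mirrors the override directory used in the later commands, so treat both as assumptions rather than the deployed configuration.</p>

```ini
# Hypothetical drop-in file, e.g.
# /etc/systemd/system/openqa-worker@.service.d/30-openvswitch-timeout.conf
# (file name is an assumption; directory mirrors the one used later in this ticket)
[Service]
# Variable name and value as quoted later in this ticket
Environment=OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300
```

<p>After adding or changing a drop-in, <code>systemctl daemon-reload</code> is needed before systemd picks it up, and the affected service has to be restarted for the new environment to take effect.</p>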
https://progress.opensuse.org/issues/75274?journal_id=344530 2020-10-29T08:54:29Z okurz okurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Blocked</i></li></ul><p>As long as <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panel... (Resolved)" href="https://progress.opensuse.org/issues/73633">#73633</a> is unresolved, we are not getting an automatic deployment of os-autoinst, so this ticket is waiting for that.</p>
https://progress.opensuse.org/issues/75274?journal_id=346984 2020-11-05T05:51:50Z okurz okurz@suse.com
<ul><li><strong>Tags</strong> set to <i>osd, network, infrastructure, salt, multi-machine</i></li><li><strong>Due date</strong> changed from <i>2020-10-29</i> to <i>2020-11-04</i></li><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Workable</i></li><li><strong>Assignee</strong> deleted (<del><i>okurz</i></del>)</li></ul><p>Feature was deployed. We can now set the timeout value in salt.</p>
https://progress.opensuse.org/issues/75274?journal_id=350083 2020-11-12T13:12:20Z okurz okurz@suse.com
<ul><li><strong>Tags</strong> changed from <i>osd, network, infrastructure, salt, multi-machine</i> to <i>osd, network, infrastructure, salt, multi-machine, learning</i></li><li><strong>Subject</strong> changed from <i>[osd-admins][alert] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable</i> to <i>[osd-admins][alert][learning] Failed systemd services alert (workers): os-autoinst-openvswitch.service aborts retries after 60s and is not easily configurable</i></li></ul><p>Added "[learning]" to the ticket. I prefer to not do this task because I consider it a good learning opportunity for others that are not that proficient with the current infrastructure management.</p>
https://progress.opensuse.org/issues/75274?journal_id=350332 2020-11-12T18:18:45Z livdywan liv.dywan@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/350332/diff?detail_id=347887">diff</a>)</li><li><strong>Due date</strong> changed from <i>2020-11-04</i> to <i>2020-11-20</i></li></ul><p>I think this would ideally have defined ACs so it's clear what the learning step is that's needed to resolve this ticket.</p>
https://progress.opensuse.org/issues/75274?journal_id=350338 2020-11-12T19:31:07Z okurz okurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/350338/diff?detail_id=347893">diff</a>)</li></ul>
https://progress.opensuse.org/issues/75274?journal_id=354900 2020-11-26T16:40:46Z livdywan liv.dywan@suse.com
<ul><li><strong>Due date</strong> changed from <i>2020-11-20</i> to <i>2020-11-27</i></li><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>livdywan</i></li></ul><p>Since I didn't manage to tempt anyone, and it's been sitting here a while I'll come up with a fix and maybe it can still serve as a reference for the next opportunity.</p>
https://progress.opensuse.org/issues/75274?journal_id=355628 2020-11-30T17:06:17Z livdywan liv.dywan@suse.com
<ul></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/409/diffs" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/409/diffs</a></p>
https://progress.opensuse.org/issues/75274?journal_id=355990 2020-12-02T10:12:58Z livdywan liv.dywan@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/409/diffs" class="external">MR !409</a> got merged. The <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/89939" class="external">pipeline</a> failed, I re-ran it and it failed again; however, the changes seem to have been applied.</p>
<p>Now, to confirm that the variable was applied correctly, and not just on workers using the nvme mount overrides (which was wrong with my previous change), I used this:</p>
<pre><code>sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep 300'
</code></pre>
https://progress.opensuse.org/issues/75274?journal_id=356002 2020-12-02T10:41:57Z livdywan liv.dywan@suse.com
<ul></ul><p>Well, I just might've found an actual error and proposed another follow-up, <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/410" class="external">!410</a>. <code>Result: False</code> points to a use of the old .conf filename. Previously I'd only seen the <code>ERROR: Minions returned with non-zero exit code</code> line at the end.</p>
https://progress.opensuse.org/issues/75274?journal_id=356008 2020-12-02T10:46:46Z livdywan liv.dywan@suse.com
<ul><li><strong>Due date</strong> changed from <i>2020-11-27</i> to <i>2020-12-04</i></li></ul><p>Let's see if I can wrap this up this week. It's cleaner now, but of course I'm making silly mistakes along the way.</p>
https://progress.opensuse.org/issues/75274?journal_id=356704 2020-12-07T15:11:10Z livdywan liv.dywan@suse.com
<ul></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/415" class="external">!415</a> should address the last piece here, which is old <code>.conf</code> files being left behind after introducing new ones with specific names.</p>
<p><code>sudo salt -l error --no-color -C 'G@roles:worker' cmd.run 'systemctl cat openqa-worker@.service | grep -E "Environment|# /"'</code> shows that <code>OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300</code> gets specified twice on some of the machines.</p>
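<p>To spot such duplicates without reading the output by eye, the <code>systemctl cat</code> output can be filtered for repeated <code>Environment=</code> lines. This is a minimal sketch; the function name <code>find_duplicate_env</code> is made up for illustration and is not part of openQA or salt.</p>

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not an existing tool):
# reads unit file text on stdin, e.g. the output of `systemctl cat`,
# and prints each Environment= line that occurs more than once.
find_duplicate_env() {
    grep -E '^Environment=' | sort | uniq -d
}
```

<p>On a single worker this could be used as <code>systemctl cat openqa-worker@.service | find_duplicate_env</code>; empty output means no assignment is repeated.</p>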
https://progress.opensuse.org/issues/75274?journal_id=356716 2020-12-07T16:56:44Z okurz okurz@suse.com
<ul></ul><p>Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger <code>salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'</code>.</p>
https://progress.opensuse.org/issues/75274?journal_id=356726 2020-12-07T18:18:27Z livdywan liv.dywan@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>Overall I think going this far is not needed, and it prevents us from putting any temporary overrides into place without using salt, as any subsequent salt update could delete the files. In case there are still temporary overrides left I would simply delete them in a one-shot, e.g. manually trigger <code>salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/the_file_you_want_to_delete'</code>.</p>
</blockquote>
<p>Isn't that what we want? If we started to rely on it, it wouldn't be temporary...</p>
https://progress.opensuse.org/issues/75274?journal_id=356878 2020-12-08T10:37:25Z livdywan liv.dywan@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>Well, I deleted the files manually now (<code>sudo salt -C 'G@roles:worker' cmd.run 'rm /etc/systemd/system/openqa-worker@.service.d/override.conf'</code>). The question of rumpfushing can be revisited anyway and is out of scope here.</p>
https://progress.opensuse.org/issues/75274?journal_id=356968 2020-12-08T16:06:25Z okurz okurz@suse.com
<ul></ul><p>Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still commit what is done but skip waiting for CI and review, but this can get very noisy when multiple iterations are needed.</p>
https://progress.opensuse.org/issues/75274?journal_id=357802 2020-12-11T13:34:56Z livdywan liv.dywan@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>Well, certainly we can discuss it. So far I have seen the people most active during critical situations rely on temporary commands and override files. I would not be so harsh as to call it "rumpfushing" when it's still documented somewhere what's being done. In some situations fast reaction is important, and trying to achieve such quick turnaround times with salt+git+gitlab+CI+review does not work out. Shortcuts can be taken, e.g. still commit what is done but skip waiting for CI and review, but this can get very noisy when multiple iterations are needed.</p>
</blockquote>
<p>Sorry if that came across as harsh. My point here was to eventually ensure consistency. So next time salt runs, or whenever a new machine is deployed it has all the fixes. I don't mind manual intervention at all. </p>
https://progress.opensuse.org/issues/75274?journal_id=357856 2020-12-13T09:47:59Z okurz okurz@suse.com
<ul></ul><p>cdywan wrote:</p>
<blockquote>
<p>My point here was to eventually ensure consistency. So next time salt runs, or whenever a new machine is deployed it has all the fixes. I don't mind manual intervention at all.</p>
</blockquote>
<p>Currently the design goal is more like: Next time a machine is (re-)installed all the non-temporary configuration is applied correctly.</p>
https://progress.opensuse.org/issues/75274?journal_id=447135 2021-09-17T13:59:14Z okurz okurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-1 priority-4 priority-default" href="/issues/98835">action #98835</a>: arm jobs failing (again?) with auto_review:"backend died: Open vSwitch command 'set_vlan' with arguments .*was not provided by any .service files":retry</i> added</li></ul>
https://progress.opensuse.org/issues/75274?journal_id=583189 2022-12-09T13:44:14Z okurz okurz@suse.com
<ul><li><strong>Tags</strong> changed from <i>osd, network, infrastructure, salt, multi-machine, learning</i> to <i>osd, network, salt, multi-machine, learning, infra</i></li></ul>
https://progress.opensuse.org/issues/75274?journal_id=747557 2023-12-21T10:51:04Z okurz okurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed" href="/issues/152365">action #152365</a>: os-autoinst-openvswitch.service fails on start-up size:S</i> added</li></ul>