openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2024-03-01T17:06:37Z
openQA Infrastructure - action #156481 (Resolved): cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa...
https://progress.opensuse.org/issues/156481 | 2024-03-01T17:06:37Z | livdywan (liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Cron <a href="mailto:root@openqa-service">root@openqa-service</a> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log:</p>
<pre><code>openqa_client.exceptions.ConnectionError: HTTPSConnectionPool(host='openqa.suse.de', port=443): Max retries exceeded with url: /api/v1/bugs?refreshable=1&delta=86400 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f91f6877080>: Failed to establish a new connection: [Errno 113] No route to host',))
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
</ul>
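<p>The suggestions were never filled in; one obvious mitigation would be wrapping the cron command in a small retry helper. A sketch (the <code>retry</code> helper and its placement in the crontab are assumptions, not the actual cron setup):</p>

```shell
#!/bin/sh
# Hypothetical retry wrapper for flaky cron jobs like fetch_openqa_bugs:
# run the given command up to $1 times, pausing between attempts.
retry() {
    attempts=$1; shift
    i=1
    while ! "$@"; do
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep 1
    done
}

# Example usage (path and invocation are assumptions, not the real crontab):
# retry 3 fetch_openqa_bugs > /tmp/fetch_openqa_bugs_osd.log
```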
openQA Infrastructure - action #156460 (Resolved): Potential FS corruption on osd due to 2 VMs ac...
https://progress.opensuse.org/issues/156460 | 2024-03-01T13:51:21Z | jbaier_cz (jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Users noticed slowness of osd in <a href="https://suse.slack.com/archives/C02CANHLANP/p1709297645213609" class="external">https://suse.slack.com/archives/C02CANHLANP/p1709297645213609</a>; openqa-monitor.qa.suse.de also showed problems with availability.</p>
<p>Logs on osd show a potential problem with the FS:</p>
<pre><code>Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/26/4669e8a06e5502583ba67b138a9c30b97efbfff1f8af0b92f937ad8b70035d: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/08/96cf9ed4cc58d8c044fe257e5e977516e49383070eea5680e3f8d53fc31712: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/eb/8843afe01ce61b501612957cc3df3a3d8371a9c2694ebd800b47d514066853: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa openqa-websockets-daemon[15372]: [debug] [pid:15372] Updating seen of worker 1951 from worker_status (free)
</code></pre>
<p>There might be a situation where two VMs were running with the same backing device according to <a href="https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP" class="external">https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP</a></p>
<p>The server was rebooted to get it into a consistent state, but unfortunately, due to the FS corruption, osd is currently in maintenance mode and needs recovery.</p>
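<p>An illustrative recovery flow (the device name is taken from the kernel log; this must be run from a rescue system with the filesystem unmounted, and it is a sketch, not the exact procedure used on osd):</p>

```shell
fsck.ext4 -n /dev/vda1   # dry run: report inconsistencies, change nothing
fsck.ext4 -y /dev/vda1   # repair, answering yes to all proposed fixes
```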
openQA Infrastructure - action #156331 (Resolved): [gitlab] New pipeline schedules cannot be crea...
https://progress.opensuse.org/issues/156331 | 2024-02-29T12:50:10Z | jbaier_cz (jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>New pipeline schedules can’t be created.</p>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ol>
<li>Visit pipeline schedules of any project with CI/CD enabled.</li>
<li>Observe the message: “You have exceeded the maximum number of pipeline schedules for your plan. To create a new schedule, either increase your plan limit or delete an existing schedule.”</li>
<li>See disabled button “New schedule”.</li>
</ol>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>New pipeline schedules can be created.</p>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>Without the ability to create more schedules, the automation process might be hindered.</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>This issue can be easily solved by following the steps mentioned in <a href="https://gitlab.suse.de/help/administration/instance_limits#number-of-pipeline-schedules" class="external">https://gitlab.suse.de/help/administration/instance_limits#number-of-pipeline-schedules</a></p>
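<p>The linked help page boils down to raising the instance-wide limit from the Rails runner. A sketch (the value 100 is an example, not necessarily the value that was applied):</p>

```shell
# On the GitLab instance itself, as root:
sudo gitlab-rails runner \
  "Plan.default.actual_limits.update!(ci_pipeline_schedules: 100)"
```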
openQA Infrastructure - action #156226 (Resolved): [bot-ng] Pipeline failed / failed to pulled im...
https://progress.opensuse.org/issues/156226 | 2024-02-28T13:51:23Z | livdywan (liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2325569" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2325569</a></p>
<pre><code>WARNING: Failed to pull image with policy "always": failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)
WARNING: Failed to pull image with policy "always": failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>DONE</strong> Restart pipelines</li>
<li><strong>DONE</strong> Report an infra SD ticket</li>
<li><strong>DONE</strong> Add retries to the pipeline</li>
</ul>
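<p>The "add retries to the pipeline" item can be expressed in the job definition itself. A minimal sketch (the job name is a placeholder, and whether bot-ng uses exactly these <code>when</code> values is an assumption):</p>

```yaml
# .gitlab-ci.yml fragment: retry the job when the runner itself fails,
# e.g. because the image pull hit "no space left on device".
bot-ng-job:
  retry:
    max: 2
    when:
      - runner_system_failure
```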
openQA Infrastructure - action #155929 (Resolved): Try out rstp_enable=True in openqa/openvswitch...
https://progress.opensuse.org/issues/155929 | 2024-02-23T12:56:48Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>We have the theory that our multi-machine setup with GRE tunnels and STP causes problems like the one in <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database ha... (Resolved)" href="https://progress.opensuse.org/issues/155716#note-8">#155716-8</a>, possibly because STP is too slow to adapt, causing openQA tests to fail.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Temporary multi-machine test issues are prevented when worker hosts are temporarily unavailable</li>
<li><strong>AC2:</strong> RSTP does not break more than what we had before</li>
<li><strong>AC3:</strong> Our documentation and salt states are up-to-date regarding STP+RSTP</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Read <a href="https://pve.proxmox.com/wiki/Open_vSwitch#Rapid_Spanning_Tree_.28RSTP.29" class="external">https://pve.proxmox.com/wiki/Open_vSwitch#Rapid_Spanning_Tree_.28RSTP.29</a> and enable the setting via Salt</li>
<li>Read <a href="https://www.accuenergy.com/support/reference-directory/rapid-spanning-tree-protocol-rstp/#:~:text=Rapid%20Spanning%20Tree%20Protocol%20(RSTP%3A%20IEEE%20802.1w)%20is,free%E2%80%9D%20topology%20within%20Ethernet%20networks" class="external">https://www.accuenergy.com/support/reference-directory/rapid-spanning-tree-protocol-rstp/#:~:text=Rapid%20Spanning%20Tree%20Protocol%20(RSTP%3A%20IEEE%20802.1w)%20is,free%E2%80%9D%20topology%20within%20Ethernet%20networks</a>.</li>
<li>Do a simple ping test between VMs (using a cluster of at least 3 machines connected via GRE) when one of the GRE nodes disconnects and connects (see <a href="http://open.qa/docs/#_start_test_vms_manually" class="external">http://open.qa/docs/#_start_test_vms_manually</a>)</li>
<li>Try via the MM openQA-in-openQA test by simply changing <a href="https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50" class="external">https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50</a> and adapting the openQA-in-openQA test to use that os-autoinst version instead of the stable package</li>
<li>Try to reproduce the test e.g. using <a href="https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_hawk_haproxy_node02&version=15-SP2" class="external">https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_hawk_haproxy_node02&version=15-SP2</a> by running this test near-continuously and then triggering a reboot of a machine which "ovs-appctl stp/show" shows to be crucial for the connection while the test is running</li>
<li>Then enable rstp in the wicked hook scripts and possibly disable stp instead</li>
<li>Reconduct the experiment and check if the above significantly prevents related problems</li>
<li>If successful, ensure that <a href="https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50" class="external">https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine#L50</a> and the salt states are in sync, as well as our config in <a href="http://open.qa/docs/" class="external">http://open.qa/docs/</a></li>
</ul>
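<p>On an Open vSwitch bridge, the switch from STP to RSTP is a matter of two settings. A sketch (the bridge name <code>br1</code> follows os-autoinst-setup-multi-machine; the deployed salt state may differ):</p>

```shell
ovs-vsctl set bridge br1 rstp_enable=true
ovs-vsctl set bridge br1 stp_enable=false
ovs-appctl rstp/show br1    # verify per-port RSTP roles and states
```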
openQA Infrastructure - action #155848 (Resolved): Firewalld is logging many errors and sometimes...
https://progress.opensuse.org/issues/155848 | 2024-02-22T11:51:27Z | mkittler (marius.kittler@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code>martchus@worker29:~> sudo journalctl -u firewalld.service
Feb 18 03:31:57 worker29 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Feb 18 03:31:59 worker29 systemd[1]: firewalld.service: Deactivated successfully.
Feb 18 03:31:59 worker29 systemd[1]: Stopped firewalld - dynamic firewall daemon.
-- Boot 8c90f12e00d94891941a5b00e8d1124a --
Feb 18 03:34:44 worker29 systemd[1]: Starting firewalld - dynamic firewall daemon...
Feb 18 03:34:45 worker29 systemd[1]: Started firewalld - dynamic firewall daemon.
Feb 18 03:34:50 worker29 firewalld[2185]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 03:34:51 worker29 firewalld[2185]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 03:34:52 worker29 firewalld[2185]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 03:34:52 worker29 firewalld[2185]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
…
Feb 18 05:40:05 worker29 firewalld[96768]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 05:40:06 worker29 firewalld[96768]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 05:40:06 worker29 firewalld[96768]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 05:40:07 worker29 firewalld[96768]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 05:40:07 worker29 firewalld[96768]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 05:40:08 worker29 firewalld[96768]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:38:10 worker29 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Feb 18 06:38:12 worker29 systemd[1]: firewalld.service: Deactivated successfully.
Feb 18 06:38:12 worker29 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Feb 18 06:38:12 worker29 systemd[1]: Starting firewalld - dynamic firewall daemon...
Feb 18 06:38:13 worker29 systemd[1]: Started firewalld - dynamic firewall daemon.
Feb 18 06:38:21 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:38:22 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:38:23 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:38:23 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:38:24 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
…
Feb 18 06:40:05 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:40:06 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:40:06 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:40:07 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 06:40:08 worker29 firewalld[108896]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 18 07:38:11 worker29 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Feb 18 07:38:13 worker29 systemd[1]: firewalld.service: Deactivated successfully.
Feb 18 07:38:13 worker29 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Feb 18 07:38:13 worker29 systemd[1]: Starting firewalld - dynamic firewall daemon...
Feb 18 07:38:13 worker29 systemd[1]: Started firewalld - dynamic firewall daemon.
Feb 21 13:39:57 worker29 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Feb 21 13:39:58 worker29 systemd[1]: firewalld.service: Deactivated successfully.
Feb 21 13:39:58 worker29 systemd[1]: Stopped firewalld - dynamic firewall daemon.
-- Boot 1ca309edcd134e5195355a0904a6a196 --
Feb 21 13:43:20 worker29 systemd[1]: Starting firewalld - dynamic firewall daemon...
Feb 21 13:43:20 worker29 systemd[1]: Started firewalld - dynamic firewall daemon.
Feb 21 13:43:25 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:26 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:26 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:27 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
…
Feb 21 13:45:12 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:45:13 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:45:14 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 14:46:45 worker29 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Feb 21 14:46:47 worker29 systemd[1]: firewalld.service: Deactivated successfully.
Feb 21 14:46:47 worker29 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Feb 21 14:46:47 worker29 systemd[1]: Starting firewalld - dynamic firewall daemon...
Feb 21 14:46:48 worker29 systemd[1]: Started firewalld - dynamic firewall daemon.
</code></pre>
<p>We have also seen MM failures in the timeframe of the most recent <code>Stopping/Starting</code>-lines in the log above, see <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database ha... (Resolved)" href="https://progress.opensuse.org/issues/155716#note-8">#155716#note-8</a>. See also <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database ha... (Resolved)" href="https://progress.opensuse.org/issues/155716#note-9">#155716#note-9</a> for my initial investigation.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li>REJECTED: <strong>AC1</strong>: We know why firewalld is repeatedly restarted and whether restarting is related to the error messages.
<ul>
<li>I'm still not sure why it happens.</li>
<li>I'm still not sure whether it has an impact. I stopped it temporarily on one worker and settings like ip forwarding were still in place. However, settings on nft-level like masquerading were gone - at least <code>sudo nft list ruleset</code> showed an empty response instead of rules like <code>chain nat_POST_trusted_allow { oifname != "lo" masquerade }</code>. So I guess firewalld being restarted <em>might</em> interfere with running MM tests.</li>
<li>We decided not to investigate further because the impact is likely low.</li>
</ul></li>
<li>DONE: <strong>AC2</strong>: We know the meaning and impact of the error message.
<ul>
<li>The error is caused by this bug: <a href="https://bugzilla.opensuse.org/show_bug.cgi?id=1214160" class="external">https://bugzilla.opensuse.org/show_bug.cgi?id=1214160</a></li>
<li>I don't think this error is problematic for us as we don't use libvirt that way.</li>
</ul></li>
<li>DONE: <strong>AC3</strong>: We know whether it also happens on other workers. (Does it also happen on other OSD workers? Does it also happen on o3 workers?)
<ul>
<li>The mentioned firewalld error is occurring on other OSD and o3 workers as well (on o3 only on workers openqaworker23 and openqaworker-arm21). I have also seen the firewalld service restarting at some point in the middle on other OSD workers.</li>
</ul></li>
<li>DONE: <strong>AC4</strong>: The error is prevented or worked around.
<ul>
<li>We cannot just uninstall the broken package (see <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: Firewalld is logging many errors and sometimes restarting on worker29, possibly related to MM fai... (Resolved)" href="https://progress.opensuse.org/issues/155848#note-9">#155848#note-9</a>) so I guess it is best we ignore this error for now.</li>
</ul></li>
<li>DONE: <strong>AC5</strong>: We know why the MM failures were happening (<em>possibly</em> due to these problems with firewalld but it could also be a red herring).
<ul>
<li>As discussed, those failures are unlikely to happen because of the firewalld issues. Maybe enabling rstp helps, see <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Try out rstp_enable=True in openqa/openvswitch.sls size:M (Resolved)" href="https://progress.opensuse.org/issues/155929">#155929</a>.</li>
</ul></li>
<li>DONE: <strong>AC6</strong>: The MM failures are prevented if caused by a concrete issue.
<ul>
<li>We created <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Try out rstp_enable=True in openqa/openvswitch.sls size:M (Resolved)" href="https://progress.opensuse.org/issues/155929">#155929</a> as follow-up for the next best thing to try to improve the MM setup.</li>
</ul></li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Ensure that the error is in an upstream report, e.g. bugzilla and/or further upstream</li>
<li>Do what we can do to prevent the error</li>
</ul>
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>Multi-machine config rework or anything about STP (see new ticket about that)</li>
</ul>
openQA Infrastructure - action #155725 (Resolved): [openQA][infra][sut] Failed to establish connn...
https://progress.opensuse.org/issues/155725 | 2024-02-21T09:38:46Z | waynechen55 (wchen@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Cannot establish an ipmi sol connection to fozzie-sp and quinn-sp:</p>
<pre><code>localhost:~ # ipmitool -I lanplus -H fozzie-sp.qe.nue2.suse.org -U ADMIN -P xxxxx chassis power status
Address lookup for fozzie-sp.qe.nue2.suse.org failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
localhost:~ # ipmitool -I lanplus -H quinn-sp.qe.nue2.suse.org -U ADMIN -P xxxxx chassis power status
Address lookup for quinn-sp.qe.nue2.suse.org failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
localhost:~ # ping -c5 fozzie-sp.qe.nue2.suse.org
ping: fozzie-sp.qe.nue2.suse.org: Name or service not known
localhost:~ # ping -c5 quinn-sp.qe.nue2.suse.org
ping: quinn-sp.qe.nue2.suse.org: Name or service not known
</code></pre>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ul>
<li>Use ipmitool to perform an operation</li>
</ul>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>Test run keeps failing.</p>
<a name="Problem"></a>
<h2 >Problem<a href="#Problem" class="wiki-anchor">¶</a></h2>
<p>Looks like something is wrong with the management unit</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check management unit state</li>
<li>Check error/warning report from management unit</li>
<li>Check management unit configuration</li>
<li>Check that ipmi sol is enabled</li>
</ul>
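<p>Once the hostnames resolve again, the SOL checks from the suggestions could look like this (credentials redacted as in the observation; the exact <code>sol</code> options can vary per BMC vendor):</p>

```shell
ipmitool -I lanplus -H fozzie-sp.qe.nue2.suse.org -U ADMIN -P xxxxx sol info
ipmitool -I lanplus -H fozzie-sp.qe.nue2.suse.org -U ADMIN -P xxxxx sol set enabled true
ipmitool -I lanplus -H fozzie-sp.qe.nue2.suse.org -U ADMIN -P xxxxx mc info
```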
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>n/a</p>
openQA Infrastructure - action #152095 (Resolved): [spike solution][timeboxed:8h] Ping over GRE t...
https://progress.opensuse.org/issues/152095 | 2023-12-05T13:22:56Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See lessons learned meeting <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'hos... (Resolved)" href="https://progress.opensuse.org/issues/139136">#139136</a>. We would again benefit from an easier reproducer. Related to <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test ... (Resolved)" href="https://progress.opensuse.org/issues/135818">#135818</a>. Come up with a way to ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We know how to ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><p>Research upstream about pinging over specific interfaces, GRE tunnels, TAP devices, openvswitch, etc.</p>
<ul>
<li>Like <code>ping -I<interface></code> or <code>ping X.X.X.X%tap0</code>?</li>
<li>Checkout network namespaces and if they could be used</li>
</ul></li>
<li><p>Research about MTU size debugging, tracepath, traceroute, etc.</p></li>
<li><p>Experiment in an openQA-environment or openQA-like with the bridges, tap devices, etc.</p></li>
<li><p>Demonstrate to the team in written form or interactively</p></li>
<li><p>Look up how the existing check is done via a VM/VNC, and see how this could be simplified</p></li>
</ul>
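<p>The first two suggestions can be sketched as concrete commands (the address and interface names are placeholders for the actual GRE/TAP setup):</p>

```shell
ping -I tap0 -c 3 10.0.2.15          # send via a specific interface
ping -M do -s 1400 -c 3 10.0.2.15    # forbid fragmentation, 1400-byte payload
tracepath 10.0.2.15                  # show where along the path the MTU drops
```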
openQA Infrastructure - action #150938 (Resolved): [openQA][sut][ipmi] No ipmi sol output with ix...
https://progress.opensuse.org/issues/150938 | 2023-11-16T09:39:37Z | waynechen55 (wchen@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Test run starts failing with <code>imagetester:7</code> at ipxe_install, for example, <a href="https://openqa.suse.de/tests/12822901#step/ipxe_install/1" class="external">https://openqa.suse.de/tests/12822901#step/ipxe_install/1</a>. It looks like a needle matching failure, but actually nothing is printed on its ipmi sol console after reboot.</p>
<pre><code>ipmitool -I lanplus -C 3 -H ix64ph1075-sp.qe.nue2.suse.org -U admin -P xxxxxxxx sol activate
</code></pre>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ul>
<li>Connect to ix64ph1075 ipmi sol console</li>
<li>Reboot the machine</li>
<li>Wait for output on ipmi sol console</li>
</ul>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>No test run assigned to <code>imagetester:7</code> can proceed. Now <code>imagetester:6</code></p>
<a name="Problem"></a>
<h2 >Problem<a href="#Problem" class="wiki-anchor">¶</a></h2>
<ul>
<li>Looks like something is wrong with the ipmi sol console</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check ipmi sol config</li>
<li>Check warning/error in BMC</li>
<li>Factory-reset the BMC</li>
<li>Reinstall the firmware</li>
<li>Click every possible button</li>
<li>Check that the physical ethernet cable is not broken</li>
</ul>
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>n/a</p>
<a name="Rollback-actions"></a>
<h2 >Rollback actions<a href="#Rollback-actions" class="wiki-anchor">¶</a></h2>
<ul>
<li><code>sudo systemctl unmask openqa-worker-auto-restart@6 && sudo systemctl enable --now openqa-worker-auto-restart@6</code></li>
</ul>
openQA Infrastructure - action #137600 (Resolved): [alert] Packet loss between worker hosts and o...
https://progress.opensuse.org/issues/137600 | 2023-10-09T07:46:02Z | jbaier_cz (jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>We had multiple occurrences of the packet loss alert over the weekend:</p>
<pre><code>alertname Packet loss between worker hosts and other hosts alert
grafana_folder Salt
rule_uid 2Z025iB4km
</code></pre>
<p><a href="http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4" class="external">http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4</a></p>
<p>Currently, the problematic ones according to the panel are:</p>
<pre><code>imagetester - walter1.qe.nue2.suse.org 100%
petrol-1 - walter1.qe.nue2.suse.org 100%
sapworker1 - walter1.qe.nue2.suse.org 100%
</code></pre>
<p>That is a bit odd, as I manually checked the first pair and both hosts can reach each other just fine:</p>
<pre><code>walter1:~ # ping imagetester.qe.nue2.suse.org
PING imagetester.qe.nue2.suse.org (10.168.192.249) 56(84) bytes of data.
64 bytes from imagetester.qe.nue2.suse.org (10.168.192.249): icmp_seq=7 ttl=64 time=0.326 ms
jbaier@imagetester:~> ping walter1.qe.nue2.suse.org
PING walter1.qe.nue2.suse.org (10.168.192.1) 56(84) bytes of data.
64 bytes from walter1.qe.nue2.suse.org (10.168.192.1): icmp_seq=1 ttl=64 time=0.331 ms
</code></pre>
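<p>The panel's loss metric can be approximated from a plain ping run. A minimal sketch (the helper names and the 10-probe sample size are mine, not taken from the actual monitoring setup):</p>

```shell
# Extract the percentage from ping's summary line, e.g.
# "10 packets transmitted, 7 received, 30% packet loss, time 9012ms"
parse_loss() {
  sed -n 's/.*[ ,]\([0-9.]*\)% packet loss.*/\1/p'
}

# Print the ICMP packet-loss percentage for a target host.
# loss_pct is a hypothetical helper; 10 probes is an arbitrary sample size.
loss_pct() {
  ping -c 10 -q "$1" 2>/dev/null | parse_loss
}
```

<p>A manual <code>loss_pct walter1.qe.nue2.suse.org</code> from each affected worker would show whether the 100% readings reproduce outside the monitoring stack.</p>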
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Confirm <strong>when</strong> this started happening or if it's no longer an issue</li>
<li>There are no paused alerts</li>
</ul>
openQA Infrastructure - action #132860 (Resolved): openqa-piworker is unstable and needs regular ... (https://progress.opensuse.org/issues/132860, 2023-07-17T08:39:49Z, osukup)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765</a></p>
<p>The only thing found in the logs, in <code>salt_ping.log</code>:</p>
<pre><code>Currently the following minions are down:
8d7
< "openqa-piworker.qa.suse.de"
===================
</code></pre>
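<p>The diff above is how salt_ping.log reports unresponsive minions. The same comparison can be sketched in plain shell (<code>down_minions</code> is a name invented here; on the salt master, <code>salt-run manage.down</code> answers the question directly):</p>

```shell
# List minions that are expected but did not respond to test.ping,
# reproducing the salt_ping.log comparison. down_minions is hypothetical.
down_minions() {
  # $1 = file with expected minion ids, $2 = file with responding minion ids
  comm -23 <(sort "$1") <(sort "$2")
}
```

<p>Any line printed is a minion that should be up but is not, such as <code>openqa-piworker.qa.suse.de</code> above.</p>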
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> we are able to process openQA Raspberry Pi bare-metal jobs consistently over some days</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><p>Identify the cause for regression</p>
<ul>
<li>likely something related to the hardware RTC</li>
<li>check whether it just works with Leap 15.5, since we wanted to upgrade anyway</li>
<li>it could be a recent kernel update, so try downgrading</li>
</ul></li>
<li><p>If it is really necessary and you have exhausted all other remote-controllable options, go to the office, unplug the RTC, and reinstall the system on the assumption that the system or its storage was corrupted</p></li>
<li><p>As Plan Y (if options A to X failed), buy a Wi-Fi &amp; Bluetooth adapter for an IPMI-controllable server and use that instead to connect to the Raspberry Pi bare-metal test instances</p></li>
</ul>
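<p>To check the suspected hardware-RTC cause, one could compare the system clock against the RTC. A sketch under assumptions: <code>clock_drift</code> is a helper name invented here, and the real readings need access to <code>/dev/rtc</code> on the piworker:</p>

```shell
# Hypothetical helper: drift in seconds between two clock readings.
clock_drift() {
  # $1 = system clock epoch seconds, $2 = RTC epoch seconds
  echo $(( $1 - $2 ))
}
# On the piworker itself one would feed in real readings (requires /dev/rtc):
#   clock_drift "$(date +%s)" "$(date -d "$(hwclock --get)" +%s)"
```

<p>A large or growing drift would support the RTC theory; a stable near-zero drift would point back at the kernel or OS upgrade suspects.</p>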
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<ul>
<li>Add back salt key with <code>ssh osd "sudo salt-key -y -a openqa-piworker.qa.suse.de"</code></li>
</ul>
openQA Infrastructure - action #65178 (Resolved): Drop rsync.pl config from salt for osd and o3 (https://progress.opensuse.org/issues/65178, 2020-04-02T10:15:17Z, livdywan, liv.dywan@suse.com)
<p>okurz wrote:</p>
<blockquote>
<p>oops, found that we have the repo still installed and configured for both osd and o3 which we should remove before we can call this done. E.g. see <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L177" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L177</a> . IMHO as long as we have the repo checked out and covered in salt we are not done with the cleanup.</p>
</blockquote>
openQA Infrastructure - action #44612 (Resolved): Do we want to update http://tumblesle.qa.suse.d... (https://progress.opensuse.org/issues/44612, 2018-12-01T07:30:43Z, okurz, okurz@suse.com)
<p><a href="http://tumblesle.qa.suse.de/" class="external">http://tumblesle.qa.suse.de/</a></p>
openQA Infrastructure - action #37644 (Resolved): [tools] osd SSL certificate is only valid for o... (https://progress.opensuse.org/issues/37644, 2018-06-21T18:58:28Z, okurz, okurz@suse.com)
openQA Infrastructure - action #19190 (Resolved): make use of ix64ph1014, e.g. for proxymode (https://progress.opensuse.org/issues/19190, 2017-05-17T07:11:33Z, okurz, okurz@suse.com)
<p>[17 May 2017 08:52:46] coolo: do we still need <a href="https://openqa.suse.de/tests/latest?machine=ix64ph1014" class="external">https://openqa.suse.de/tests/latest?machine=ix64ph1014</a> ?<br>
[17 May 2017 08:54:40] okurz: well, I have no love left for this vnc thingie<br>
[17 May 2017 08:54:52] okurz: we better free this machine and use it for proxymode<br>
[17 May 2017 09:08:38] good morning<br>
[17 May 2017 09:09:53] coolo: so I will delete the machine in openQA and delete the schedule?<br>
[17 May 2017 09:10:51] okurz: leave the machine as documentation how to set this up. It might still be wanted in the future - for another machine<br>
[17 May 2017 09:11:00] but drop the job</p>
<p>okurz: I dropped the job from scheduling</p>