openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2024-03-28T19:30:27Z | Redmine
openQA Infrastructure - action #158242 (New): Prevent ssh access to test VMs on svirt hypervisor ... | https://progress.opensuse.org/issues/158242 | 2024-03-28T19:30:27Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-150437" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-150437</a> we are asked to handle "compromised root passwords in QA segments" including s390zl11…16</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> firewall on OSD svirt hosts prevents direct ssh+vnc access from outside, i.e. normal office networks</li>
<li><strong>AC2:</strong> openQA svirt jobs are still able to access ssh+vnc as necessary, e.g. from openQA workers in the same network OR openQA workers on the hypervisor hosts themselves</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Take openQA svirt worker instances related to one hypervisor host, e.g. s390zl12, out of production for testing</li>
<li>Configure a/the firewall on that host to block ssh+vnc to VMs running on that host</li>
<li>Allow traffic from other hosts in oqa.prg2.suse.org</li>
<li>Ensure that openQA tests still work</li>
<li>Ensure that the corresponding firewall config is boot-persistent and managed in salt</li>
<li>Crosscheck with at least one reboot</li>
<li>Apply the same solution to all other OSD svirt hosts</li>
</ul>
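<p>To crosscheck AC1 and AC2 after changing the firewall, a small reachability probe could be run once from an office network and once from an openQA worker in oqa.prg2.suse.org. The following is only a sketch: the function names are made up for illustration, and port 5900 for VNC is an assumption, as the VMs on the hypervisor may use other VNC displays.</p>

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_access(host: str, ports=(22, 5900), timeout: float = 3.0) -> dict:
    """Map each port to whether it is reachable from the machine running this.

    The default ports are assumptions: 22 for ssh, 5900 for the first VNC display.
    """
    return {port: is_port_open(host, port, timeout) for port in ports}
```

<p>From an office network every value returned by <code>check_access</code> for a test VM should be <code>False</code>, while from a worker inside the openQA network the ports that the svirt backend needs should still be reported as open.</p>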
openQA Infrastructure - action #158116 (New): typing issue on ppc64 worker - crosscheck performan... | https://progress.opensuse.org/issues/158116 | 2024-03-27T08:14:10Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In <a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker size:S (Feedback)" href="https://progress.opensuse.org/issues/158104">#158104</a> system overload on ppc64le machines was found which was likely triggered by <a class="issue tracker-4 status-1 priority-4 priority-default" title="action: remove NOVIDEO=1 from ppc64le workers (New)" href="https://progress.opensuse.org/issues/157636">#157636</a>. As a snapshot the current process list output from htop looks like this:</p>
<pre><code> PID USER PRI NI VIRT RES SHR S DISK R/W CPU% MEM% TIME+ ▽Command
1541 root 20 0 320M 194M 182M S 0.00 B/s 0.0 0.0 2h29:59 /usr/lib/systemd/systemd-j
96369 root 20 0 623M 98880 14336 S 0.00 B/s 0.0 0.0 54:05.86 /usr/bin/python3 /usr/bin/
1 root 20 0 178M 25024 11776 S 0.00 B/s 0.0 0.0 48:46.08 /usr/lib/systemd/systemd n
2000 root 20 0 9728 6208 2176 S 0.00 B/s 0.0 0.0 40:44.69 /usr/sbin/haveged -w 1024
157105 _openqa-wo 20 0 427M 189M 23808 R 0.00 B/s 68.4 0.0 32:22.39 ffmpeg -y -hide_banner -no
157062 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 42.1 0.0 32:07.83 ffmpeg -y -hide_banner -no
157107 _openqa-wo 20 0 427M 189M 23808 R 0.00 B/s 68.4 0.0 30:29.03 ffmpeg -y -hide_banner -no
157063 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 5.3 0.0 29:30.58 ffmpeg -y -hide_banner -no
6267 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 63.2 0.0 25:54.22 ffmpeg -y -hide_banner -no
157108 _openqa-wo 20 0 427M 189M 23808 R 0.00 B/s 63.2 0.0 25:03.79 ffmpeg -y -hide_banner -no
157064 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 2.6 0.0 23:50.53 ffmpeg -y -hide_banner -no
156485 _openqa-wo 20 0 427M 189M 23808 R 0.00 B/s 34.2 0.0 22:18.78 ffmpeg -y -hide_banner -no
6268 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 57.9 0.0 21:48.92 ffmpeg -y -hide_banner -no
156601 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 10.5 0.0 20:19.58 ffmpeg -y -hide_banner -no
6269 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 55.3 0.0 16:33.02 ffmpeg -y -hide_banner -no
5898 _openqa-wo 20 0 427M 193M 23808 R 0.00 B/s 86.8 0.0 14:48.15 ffmpeg -y -hide_banner -no
31080 _openqa-wo 20 0 5720M 758M 28416 R 0.00 B/s 57.9 0.1 12:58.63 /usr/bin/qemu-system-ppc64
15778 _openqa-wo 20 0 6767M 1779M 28480 R 0.00 B/s 81.6 0.2 12:50.94 /usr/bin/qemu-system-ppc64
15781 _openqa-wo 20 0 6767M 1779M 28480 S 0.00 B/s 0.0 0.2 10:13.25 /usr/bin/qemu-system-ppc64
156709 _openqa-wo 20 0 6762M 1766M 28288 S 0.00 B/s 13.2 0.2 10:08.67 /usr/bin/qemu-system-ppc64
33559 _openqa-wo 20 0 6756M 1724M 28416 R 0.00 B/s 86.8 0.2 10:05.56 /usr/bin/qemu-system-ppc64
35017 _openqa-wo 20 0 3946M 753M 28416 R 0.00 B/s 84.2 0.1 9:30.77 /usr/bin/qemu-system-ppc64
24085 _openqa-wo 20 0 6901M 1781M 28480 S 0.00 B/s 0.0 0.2 9:13.94 /usr/bin/qemu-system-ppc64
24092 _openqa-wo 20 0 6901M 1781M 28480 R 0.00 B/s 78.9 0.2 8:40.60 /usr/bin/qemu-system-ppc64
28718 _openqa-wo 20 0 7135M 1787M 28480 S 0.00 B/s 50.0 0.2 8:17.91 /usr/bin/qemu-system-ppc64
28720 _openqa-wo 20 0 7135M 1787M 28480 R 0.00 B/s 13.2 0.2 6:51.75 /usr/bin/qemu-system-ppc64
39280 _openqa-wo 20 0 5712M 755M 28416 R 0.00 B/s 65.8 0.1 6:41.38 /usr/bin/qemu-system-ppc64
39683 _openqa-wo 20 0 6731M 1549M 28416 R 0.00 B/s 65.8 0.2 6:24.06 /usr/bin/qemu-system-ppc64
3699 root 20 0 3968 3200 2368 S 0.00 B/s 0.0 0.0 6:04.21 /sbin/agetty -o -p -- \u -
34903 _openqa-wo 20 0 6334M 1483M 28416 R 0.00 B/s 50.0 0.2 5:29.90 /usr/bin/qemu-system-ppc64
34902 _openqa-wo 20 0 6334M 1483M 28416 S 0.00 B/s 0.0 0.2 4:40.00 /usr/bin/qemu-system-ppc64
38988 _openqa-wo 20 0 6790M 1376M 28480 R 0.00 B/s 107.9 0.2 3:52.33 /usr/bin/qemu-system-ppc64
38599 _openqa-wo 20 0 8040M 4187M 28480 R 0.00 B/s 47.4 0.5 3:41.13 /usr/bin/qemu-system-ppc64
45395 _openqa-wo 20 0 3732M 757M 28416 R 0.00 B/s 71.1 0.1 3:38.90 /usr/bin/qemu-system-ppc64
38600 _openqa-wo 20 0 8040M 4187M 28480 S 0.00 B/s 0.0 0.5 3:18.94 /usr/bin/qemu-system-ppc64
43853 _openqa-wo 20 0 5641M 1696M 28480 R 0.00 B/s 63.2 0.2 3:12.66 /usr/bin/qemu-system-ppc64
38456 _openqa-wo 20 0 9087M 4195M 28480 R 0.00 B/s 78.9 0.5 3:08.68 /usr/bin/qemu-system-ppc64
38986 _openqa-wo 20 0 6790M 1376M 28480 R 0.00 B/s 86.8 0.2 3:06.34 /usr/bin/qemu-system-ppc64
</code></pre>
<p>So ffmpeg shows significantly higher accumulated CPU time than the corresponding qemu processes. We should investigate whether ffmpeg has a "too high" impact on machine performance, whether it should run with a nice level to prevent typing issues, whether ffmpeg parameters can be tweaked, or whether ffmpeg should be avoided altogether on ppc64le.</p>
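<p>To put numbers on that comparison, the TIME+ column of the htop snapshot above can be converted to seconds and summed per command. The sketch below uses only four rows copied from the snapshot, and the helper names are made up for illustration. TIME+ is accumulated since each process started, so the totals are only indicative.</p>

```python
import re
from collections import defaultdict

# A few (TIME+, command) pairs copied from the htop snapshot above.
SAMPLE = [
    ("32:22.39", "ffmpeg"),
    ("32:07.83", "ffmpeg"),
    ("12:58.63", "qemu-system-ppc64"),
    ("12:50.94", "qemu-system-ppc64"),
]


def parse_htop_time(text: str) -> float:
    """Convert an htop TIME+ value such as '2h29:59' or '54:05.86' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(\d+):(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"unrecognized TIME+ value: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)


def cpu_seconds_by_command(rows):
    """Sum the accumulated CPU time in seconds for each command name."""
    totals = defaultdict(float)
    for time_text, command in rows:
        totals[command] += parse_htop_time(time_text)
    return dict(totals)


if __name__ == "__main__":
    for command, seconds in cpu_seconds_by_command(SAMPLE).items():
        print(f"{command}: {seconds:.0f} s")
```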
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> openQA test video compression does not significantly impact system performance in a way that causes typing issues</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check if ffmpeg CPU usage as visible in the above htop output is considered expected or something unusual</li>
<li>Consider introducing a nice-level for calling ffmpeg in os-autoinst</li>
<li>Crosscheck if ffmpeg can be tweaked, in particular for ppc64le qemu workers</li>
<li>Decide if ffmpeg, or even video recording as a whole, should be forbidden on ppc64le, see <a class="issue tracker-4 status-1 priority-4 priority-default" title="action: remove NOVIDEO=1 from ppc64le workers (New)" href="https://progress.opensuse.org/issues/157636">#157636</a> </li>
</ul>
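<p>The nice-level suggestion can be illustrated as follows. This is not the os-autoinst implementation (os-autoinst is written in Perl); it is only a Python sketch of the general technique of starting a child process with lowered scheduling priority. The function name and the increment of 10 are assumptions, and the child command is a stand-in for the real encoder call.</p>

```python
import os
import subprocess
import sys


def run_niced(command, increment=10):
    """Run command with its niceness raised by increment (lower priority)."""
    return subprocess.run(
        command,
        # Executed in the child between fork and exec (POSIX only).
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the encoder: the child only reports its own niceness.
    result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("child niceness:", result.stdout.strip())
```

<p>A lower priority only helps if the scheduler has other runnable work to prefer, so the effect on typing issues would still need to be measured on an affected worker.</p>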
openQA Infrastructure - action #158113 (Feedback): typing issue on ppc64 worker - make CPU load a... | https://progress.opensuse.org/issues/158113 | 2024-03-27T08:03:58Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker size:S (Feedback)" href="https://progress.opensuse.org/issues/158104">#158104</a> shows VNC typing issues. To catch exactly this kind of situation we deliberately added alerts for excessive CPU load in <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: CPU Load and usage alert for openQA workers size:S (Resolved)" href="https://progress.opensuse.org/issues/150983">#150983</a>. <a href="https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694" class="external">https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694</a> clearly shows a load consistently in the range of 50-70(!) for mania, but no alert was triggered. We should crosscheck <a href="https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert" class="external">https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert</a><br>
and make that alert stricter.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> CPU load alerts trigger for a CPU load15 consistently above 40 as originally planned</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Crosscheck <a href="https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert" class="external">https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert</a> or the implementation in code <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941</a></li>
</ul>
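<p>What "consistently above 40" could mean can be sketched as a simple rule over recent load15 readings. This does not reproduce the Grafana alert rule from salt-states-openqa. The function name and the <code>min_samples</code> window are assumptions chosen for illustration.</p>

```python
def load_alert_should_fire(load15_samples, threshold=40.0, min_samples=3):
    """Return True if the last min_samples load15 readings all exceed threshold.

    threshold follows AC1; min_samples is a hypothetical window size.
    """
    if len(load15_samples) < min_samples:
        # Not enough data yet to call the load "consistent".
        return False
    return all(value > threshold for value in load15_samples[-min_samples:])
```

<p>With readings in the range reported for mania, for example <code>[55, 62, 70]</code>, such a rule would fire, while a single dip below the threshold within the window would suppress it.</p>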
openQA Infrastructure - action #158104 (Feedback): typing issue on ppc64 worker size:S | https://progress.opensuse.org/issues/158104 | 2024-03-27T06:57:56Z | zcjia (zcjia@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario sle-15-SP6-Online-ppc64le-ha_beta_supportserver@ppc64le-2g fails in<br>
<a href="https://openqa.suse.de/tests/13885455/modules/setup/steps/84" class="external">setup</a></p>
<p><a href="https://openqa.suse.de/tests/13885455#step/setup/84" class="external">https://openqa.suse.de/tests/13885455#step/setup/84</a> (see attachment p1.png)</p>
<p><a href="https://openqa.suse.de/tests/13885471#step/setup/30" class="external">https://openqa.suse.de/tests/13885471#step/setup/30</a> (see attachment p2.png) The "$" before "?" is missing.</p>
<p><a href="https://openqa.suse.de/tests/13885404#step/setup/12" class="external">https://openqa.suse.de/tests/13885404#step/setup/12</a> (see attachment p3.png)</p>
<p><a href="https://openqa.suse.de/tests/13885407#step/setup/9" class="external">https://openqa.suse.de/tests/13885407#step/setup/9</a> (see attachment p4.png)</p>
<p>I think this may be related to the high workload of the underlying ppc64 worker.</p>
<p>All failures occurred on "mania".</p>
<a name="Test-suite-description"></a>
<h2 >Test suite description<a href="#Test-suite-description" class="wiki-anchor">¶</a></h2>
<p>The base test suite is used for job templates defined in YAML documents. It has no settings of its own.</p>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.suse.de/tests/13885455" class="external">73.1</a> (current job)</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: <a href="https://openqa.suse.de/tests/13829359" class="external">67.1</a> (or more recent)</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Identify the affected machines and workers, apply mitigations to prevent recurring typing issues, e.g. reducing CPU load</li>
<li>Restart related failed jobs</li>
<li>Identify follow-up tasks</li>
<li>Reduce the number of worker instances as a first mitigation measure. <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759</a> (merged)</li>
<li>Make the alert for CPU load more strict - <a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker - make CPU load alert more strict (Feedback)" href="https://progress.opensuse.org/issues/158113">#158113</a></li>
<li>Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) (New)" href="https://progress.opensuse.org/issues/158116">#158116</a></li>
<li>Check existing ffmpeg processes on mania which take a lot of CPU time - <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) (New)" href="https://progress.opensuse.org/issues/158116">#158116</a></li>
</ul>
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>ffmpeg impact investigation -> <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) (New)" href="https://progress.opensuse.org/issues/158116">#158116</a></li>
<li>code improvements -> <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured t... (New)" href="https://progress.opensuse.org/issues/158125">#158125</a></li>
<li>improving the alert -> <a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker - make CPU load alert more strict (Feedback)" href="https://progress.opensuse.org/issues/158113">#158113</a></li>
</ul>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-2g&test=ha_beta_supportserver&version=15-SP6" class="external">latest</a></p>
openQA Infrastructure - action #157615 (Feedback): [alert] osd-deployment failed in post-deploy ,... | https://progress.opensuse.org/issues/157615 | 2024-03-20T18:18:05Z | jbaier_cz (jbaier@suse.cz)
<p>See <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217</a></p>
<pre><code>schort-server.qe.nue2.suse.org:
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
telegraf errors
monitor.qe.nue2.suse.org:
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
++ grep ' E! ' salt_post_deploy_checks.log
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ol>
<li>Understand why and where <code>systemd_list_service_by_state_for_telegraf.sh</code> times out. It could be the general telegraf-timeout in the pipeline, in the execution of the script itself (from telegraf.conf) or another place. Adjust the timeout to match expected runtime or fix the script to complete faster -> schort-server only has 1 VM core, consider configuring the hypervisor to use at least 2 cores</li>
<li>"Error killing process: os: process already finished" might just be a consequence of the above</li>
<li>"Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout" could possibly be covered with some retrying. Investigate what the real error message means, ask <a href="https://www.ecosia.org/chat" class="external">https://www.ecosia.org/chat</a> (or if that does not work invest in coal-powered <a href="https://www.cat-gpt.com/chat" class="external">https://www.cat-gpt.com/chat</a> ) or something</li>
<li>If we cannot solve these problems, consider excluding them from CI execution to avoid false-positives. Consider the impact of doing this first however!</li>
</ol>
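<p>The retrying idea from the third suggestion could look roughly like the sketch below. The "lookup ... i/o timeout" part of the log suggests that name resolution timed out, so the example retries a hostname lookup. The helper names, the number of attempts and the delay are assumptions. This is not how telegraf's <code>inputs.x509_cert</code> plugin works internally.</p>

```python
import socket
import time


def retry(func, attempts=3, delay=1.0, exceptions=(OSError,)):
    """Call func until it succeeds or attempts are exhausted.

    The last error is re-raised if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # linear backoff between tries


def resolve(hostname):
    """Resolve hostname to an IPv4 address."""
    return socket.gethostbyname(hostname)


if __name__ == "__main__":
    print(retry(lambda: resolve("monitor.qa.suse.de")))
```

<p>Retries hide short outages but also delay the failure report, so the total retry time would have to stay below whatever timeout the CI pipeline applies.</p>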
openQA Project - action #157540 (Feedback): [sporadic] ci openQA: t/33-developer_mode.t fails size:M | https://progress.opensuse.org/issues/157540 | 2024-03-19T14:15:50Z | tinita (tina.mueller+trick-redmine@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://app.circleci.com/pipelines/github/os-autoinst/openQA/13196/workflows/ddb935c7-31dd-4beb-877c-25ef1e703b4d/jobs/123231" class="external">https://app.circleci.com/pipelines/github/os-autoinst/openQA/13196/workflows/ddb935c7-31dd-4beb-877c-25ef1e703b4d/jobs/123231</a></p>
<pre><code>[14:03:42] t/33-developer_mode.t .. 17/? # Unexpected Javascript console errors, waiting for connection opened: [
# {
# level => "SEVERE",
# message => "http://localhost:9526/asset/3906633cf0/ws_console.js 8 WebSocket connection to 'ws://localhost:9528/liveviewhandler/tests/1/developer/ws-proxy' failed: Error during WebSocket handshake: Unexpected response code: 302",
# source => "network",
# timestamp => 1710857067816,
# },
# ]
# Failed test 'No unexpected js warnings'
# at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 123.
# Looks like you failed 1 test of 9.
[14:03:42] t/33-developer_mode.t .. 20/?
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>While investigating the code in parallel, try to reproduce locally with coverage enabled and multiple runs to get a statistically significant result, e.g. <code>make test KEEP_DB=1 RETRY=500 TESTS=t/33-developer_mode.t</code> and go for lunch or continue coding :)</li>
<li>If it's not reproducible consider the same with coverage enabled and/or in circleCI, e.g. a temporary branch in your github repo fork</li>
<li>Identify where in <a href="https://github.com/os-autoinst/openQA/blob/master/t/33-developer_mode.t" class="external">https://github.com/os-autoinst/openQA/blob/master/t/33-developer_mode.t</a> the redirection "302" could happen</li>
<li>Even though the test is not technically a UI test in the t/ui/ folder it might still be necessary to apply UI test related synchronisation means to fix the sporadic failure as a selenium instance is used</li>
<li>Might be a similar issue: <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: [sporadic] t/full-stack.t Failed test 'Expected result for job 1 not found' size:M (Resolved)" href="https://progress.opensuse.org/issues/102578">#102578</a></li>
</ul>
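<p>How many runs count as "statistically significant" depends on how rare the failure is. Assuming runs fail independently with a fixed probability, the number of runs needed to see at least one failure with a given confidence follows from 1 - (1 - p)^n >= confidence. The failure rates used below are purely hypothetical, since the real rate of this sporadic failure is not known from the ticket.</p>

```python
import math


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n with 1 - (1 - failure_rate)**n >= confidence."""
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


def chance_of_failure(failure_rate: float, runs: int) -> float:
    """Probability of observing at least one failure in the given number of runs."""
    return 1 - (1 - failure_rate) ** runs


if __name__ == "__main__":
    # Hypothetical failure rates, not measured values.
    for rate in (0.05, 0.01, 0.002):
        print(rate, runs_needed(rate), round(chance_of_failure(rate, 500), 3))
```

<p>The same reasoning applies when verifying a fix: a clean series of runs only supports the fix if the series is long enough that the old failure rate would very likely have produced a failure.</p>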
openQA Project - coordination #157537 (Blocked): [epic] Secure setup of openQA test machines with... | https://progress.opensuse.org/issues/157537 | 2024-03-19T14:15:29Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-150437" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-150437</a> we are asked to handle "compromised root passwords in QA segments" including s390zl11…16 . We should secure our network and password handling better.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> No openQA test machines directly accessible by SUSE users use ssh root login with publicly known passwords</li>
</ul>
<a name="Ideas"></a>
<h2 >Ideas<a href="#Ideas" class="wiki-anchor">¶</a></h2>
<ol>
<li>Be able to set a different password valid for tests, in particular s390kvm…, e.g. be able to set password by test variable and follow through in the complete test platform -> <a class="issue tracker-4 status-15 priority-5 priority-high3 child" title="action: [spike][timeboxed:10h] Use a different ssh root password for s390x kvm installation openQA jobs (... (Blocked)" href="https://progress.opensuse.org/issues/157555">#157555</a></li>
<li>Key based authentication -> <a class="issue tracker-4 status-15 priority-4 priority-default child" title="action: [spike][timeboxed:10h] Use ssh key authentication in particular for s390x kvm installation openQA... (Blocked)" href="https://progress.opensuse.org/issues/157744">#157744</a></li>
<li>Rotating, automatic passwords saved as test variables connected to images, e.g. to be able to use a pre-installed image</li>
<li>Better secure the networks to have s390kvm… (and others) less accessible -> We have stated the requirement in <a href="https://confluence.suse.com/pages/viewpage.action?pageId=1006108843" class="external">https://confluence.suse.com/pages/viewpage.action?pageId=1006108843</a> that ssh 22/tcp needs to be reachable. We could try to replicate the setup we know from o3 to give OSD a second network interface which allows ssh 22/tcp and block ssh 22/tcp on .oqa.prg2.suse.org as usually we don't need ssh to workers, just from within the oqa network as well as for administrative purposes for which we could go over OSD which we also already normally do for salt. -> <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: Better secure the networks to have s390kvm… (and others) less accessible (New)" href="https://progress.opensuse.org/issues/157750">#157750</a></li>
</ol>
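<p>For the idea of automatic passwords passed as test variables, a minimal sketch could generate a random password per job and add it to the job settings. The variable name <code>ROOT_PASSWORD</code> and both function names are hypothetical and not taken from openQA. Whether such a setting stays confidential depends on where job settings are displayed and logged, which this sketch does not address.</p>

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 24) -> str:
    """Return a random password drawn from letters and digits."""
    if length < 12:
        raise ValueError("refusing to generate a password shorter than 12 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def job_settings_with_password(settings: dict, variable: str = "ROOT_PASSWORD") -> dict:
    """Return a copy of settings with a fresh password stored under variable.

    The variable name is a hypothetical placeholder.
    """
    updated = dict(settings)
    updated[variable] = generate_password()
    return updated
```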
openQA Tests - action #157414 (In Progress): Network broken with multimachine on multiple workers... | https://progress.opensuse.org/issues/157414 | 2024-03-18T07:37:57Z | ggardet_arm (guillaume.gardet@arm.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario microos-Tumbleweed-DVD-aarch64-remote_ssh_controller@aarch64 fails in<br>
<a href="https://openqa.opensuse.org/tests/4018286/modules/await_install/steps/5" class="external">await_install</a><br>
possibly similar to what had happened in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M (Resolved)" href="https://progress.opensuse.org/issues/150920">#150920</a> and <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org s... (Resolved)" href="https://progress.opensuse.org/issues/155278">#155278</a></p>
<a name="Test-suite-description"></a>
<h2 >Test suite description<a href="#Test-suite-description" class="wiki-anchor">¶</a></h2>
<p>Maintainer: jrivera Install remote server (parallel job) with ssh.</p>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.opensuse.org/tests/4018286" class="external">20240314</a> (current job)</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: <a href="https://openqa.opensuse.org/tests/4004882" class="external">20240310</a> (or more recent)</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Extend the existing test code as suggested in <a class="issue tracker-4 status-2 priority-5 priority-high3 behind-schedule" title="action: Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M (In Progress)" href="https://progress.opensuse.org/issues/157414#note-8">#157414-8</a> to have more explicit error messages</li>
<li>Look up the history of the related tickets <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M (Resolved)" href="https://progress.opensuse.org/issues/150920">#150920</a> and <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org s... (Resolved)" href="https://progress.opensuse.org/issues/155278">#155278</a></li>
<li>Consider extending our setup multimachine script and potentially call it periodically?</li>
<li>Consider more explicit error checks in our worker code to prevent even running into such problems in openQA tests</li>
<li>Find a persistent solution</li>
</ul>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.opensuse.org/tests/latest?arch=aarch64&distri=microos&flavor=DVD&machine=aarch64&test=remote_ssh_controller&version=Tumbleweed" class="external">latest</a></p>
openQA Project - action #157273 (Workable): Run os-autoinst-distri-openQA directly from git witho... | https://progress.opensuse.org/issues/157273 | 2024-03-14T16:38:04Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>With <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: [spike][timeboxed:10h] Run os-autoinst-distri-example directly from git and ensure candidate need... (Resolved)" href="https://progress.opensuse.org/issues/154783">#154783</a> we have proper git caching so we can run git based tests efficiently on our workers now. Now we should go the next step and migrate one "production" test distribution to use only git and not hold anything provided by admins on o3 in o3:/var/lib/openqa/share/tests for this test distribution.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> /var/lib/openqa/share/tests/open{qa,QA} do not exist</li>
<li><strong>AC2:</strong> openqa-in-openqa tests still pass consistently</li>
<li><strong>AC3:</strong> openqa-in-openqa test details, needle candidates and source code views still show content as expected</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Change test definitions in <a href="https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/master/scenario-definitions.yaml" class="external">https://github.com/os-autoinst/os-autoinst-distri-openQA/blob/master/scenario-definitions.yaml</a> in your branch to use <a href="https://github.com/os-autoinst/os-autoinst-distri-openQA" class="external">https://github.com/os-autoinst/os-autoinst-distri-openQA</a> for test code (and needles)</li>
<li>Check that tests can be triggered this way on a test instance</li>
<li>Do not put anything in /var/lib/openqa/share/tests and ensure tests still work as well as source code view and needle candidates in test details pages</li>
<li>To provide needle candidates there are multiple possibilities when and where the needle candidate data can be provided, try out one or multiple of the following:
<ol>
<li><em>Given</em> a test distribution/needledir does not yet exist in a local cache (like asset downloads work or GIT_CACHE_DIR in os-autoinst and/or worker implementation), <em>When</em> tests are triggered on the side of web UI, <em>Then</em> the relevant data is git cloned, e.g. in the same steps as or similar to *_URL asset download</li>
<li><em>Given</em> a test distribution/needledir does not yet exist in a local cache, <em>When</em> the worker uploads the general test structure, e.g. which modules will be executed, <em>Then</em> the relevant data is git cloned</li>
<li><em>Given</em> a test distribution/needledir does not yet exist in a local cache, <em>When</em> the worker uploads individual needle check results, <em>Then</em> it also uploads as part of the JSON result files and image uploads all the necessary information to display needle candidates <em>And</em> the webUI in the receiving upload handler handles that somewhat … but does not overload when 1k workers upload in parallel or something :)</li>
<li><em>Given</em> a test distribution/needledir does not yet exist in a local cache, <em>When</em> the worker uploads final results (or "finalizes" the job), <em>Then</em> the webUI triggers a download of test files and/or needle files to a local git cache dir as necessary</li>
<li><em>Given</em> a test distribution/needledir does not yet exist in a local cache, <em>When</em> the first person reviews test results and selects needle candidates, <em>Then</em> the webUI triggers a download of test files and/or needle files to a local git cache dir as necessary</li>
</ol></li>
<li>If you identify any bigger feature implementation in openQA or os-autoinst itself being necessary then ensure those requirements are covered in other tickets and block on those tickets accordingly</li>
</ul>
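<p>All five variants above share the same precondition, "does not yet exist in a local cache", and differ only in when the clone is triggered. That shared step can be sketched as below. This is not how openQA or os-autoinst implement <code>GIT_CACHE_DIR</code>. The function names, the hashed directory layout and the injected <code>clone</code> callable are assumptions for illustration.</p>

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir: Path, repo_url: str) -> Path:
    """Map a repository URL to a stable directory name inside cache_dir."""
    digest = hashlib.sha256(repo_url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / digest


def ensure_cached(cache_dir: Path, repo_url: str, clone) -> Path:
    """Clone repo_url into the cache only if it is not there yet.

    clone is any callable taking (repo_url, target_path), for example a
    wrapper around `git clone`; it is injected so each trigger point
    (job scheduling, result upload, first review) can reuse the same check.
    """
    target = cache_path(cache_dir, repo_url)
    if not target.exists():
        clone(repo_url, target)
    return target
```

<p>Keeping the trigger separate from the cache check makes it possible to try out several of the listed variants without changing how the cache itself is organized.</p>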
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>Any bigger feature implementation in openQA or os-autoinst itself.</li>
</ul>
QA - action #157204 (Workable): Sync openQA job removal events to qem-dashboard listening to AMQP... | https://progress.opensuse.org/issues/157204 | 2024-03-14T05:33:51Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a href="https://suse.slack.com/archives/C02CLB8TZP1/p1709892527534149?thread_ts=1709883106.021479&cid=C02CLB8TZP1" class="external">https://suse.slack.com/archives/C02CLB8TZP1/p1709892527534149?thread_ts=1709883106.021479&cid=C02CLB8TZP1</a><br>
When openQA jobs are deleted, the corresponding reference in qem-dashboard should also be removed. Listen to AMQP events to sync the removal accordingly.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> SLE maintenance openQA jobs previously blocking SLE maintenance updates on <a href="http://dashboard.qam.suse.de/blocked" class="external">http://dashboard.qam.suse.de/blocked</a> do not block approval after such openQA jobs are deleted from the openQA database</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Use TDD: Extend <a href="https://github.com/openSUSE/qem-dashboard/blob/main/t/amqp.t" class="external">https://github.com/openSUSE/qem-dashboard/blob/main/t/amqp.t</a> and ensure there is a failing test first</li>
<li>Extend <a href="https://github.com/openSUSE/qem-dashboard/blob/08cea810f936faeb6af35b645270d85f6569c6b9/lib/Dashboard/Model/AMQP.pm#L33" class="external">https://github.com/openSUSE/qem-dashboard/blob/08cea810f936faeb6af35b645270d85f6569c6b9/lib/Dashboard/Model/AMQP.pm#L33</a> to update or delete the database entry, whichever is applicable</li>
<li>For all current openQA job result entries in the dashboard database crosscheck if there are entries for jobs that do not exist anymore in the openQA database. Remove accordingly.</li>
<li>Verify operation in production: E.g. create an artificial, failed openQA job in OSD for a non-critical SLE maintenance update, wait until it shows up as blocking on <a href="http://dashboard.qam.suse.de/blocked" class="external">http://dashboard.qam.suse.de/blocked</a> or in log files of the qem-bot "approve" cycle, remove the job again with <code>openqa-cli -X delete jobs/$id</code> and verify that <a href="http://dashboard.qam.suse.de/blocked" class="external">http://dashboard.qam.suse.de/blocked</a> no longer shows the update as blocked on that job</li>
</ul>
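<p>A minimal sketch of the two suggested behaviours, removal on a deletion event and a one-time crosscheck for stale entries, could look like this. It is written in Python for illustration only (qem-dashboard itself is Perl) and uses a plain dictionary in place of the dashboard database. The routing key suffix <code>.job.delete</code>, the <code>id</code> field in the message body and both function names are assumptions to be verified against the events the openQA instance actually emits.</p>

```python
import json

# Assumption: deletion events carry a routing key ending in ".job.delete".
DELETE_SUFFIX = ".job.delete"


def handle_event(results: dict, routing_key: str, body: str) -> bool:
    """Drop the stored result of a deleted openQA job.

    `results` stands in for the dashboard database, keyed by openQA job id.
    Returns True if an entry was removed, False otherwise.
    """
    if not routing_key.endswith(DELETE_SUFFIX):
        return False
    # Assumption: the JSON body contains the job id under the key "id".
    job_id = json.loads(body).get("id")
    return results.pop(job_id, None) is not None


def stale_job_ids(dashboard_ids, openqa_ids):
    """Return job ids known to the dashboard but no longer present in openQA."""
    return sorted(set(dashboard_ids) - set(openqa_ids))
```

<p>Events for other job states are ignored by <code>handle_event</code>, so the existing handling of finished jobs would stay untouched; <code>stale_job_ids</code> covers the one-time cleanup of entries whose deletion happened before the listener existed.</p>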
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>Regular cleanup of results for which we missed or otherwise did not receive the corresponding AMQP events</li>
</ul>
openQA Project - coordination #152847 (Blocked): [epic] version control awareness within openQA f...https://progress.opensuse.org/issues/1528472023-12-21T12:48:46Zokurzokurz@suse.com
QA - action #139115 (Workable): Ensure o3 openQA PowerPC machine qa-power8-3 is operational from ...https://progress.opensuse.org/issues/1391152023-11-04T12:51:06Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>Most PowerPC machines are being set up in PRG2 within <a class="issue tracker-4 status-15 priority-3 priority-lowest child" title="action: Support move of PowerPC machines to PRG2 size:M (Blocked)" href="https://progress.opensuse.org/issues/132140">#132140</a> and most machines could be discovered from the HMC. qa-power8-3 is meant for o3 and likely needs more collaboration with SUSE-IT Eng-Infra to bring the machine back into operation for o3. As the machine is a bare-metal installation, we would rely on ASM+IPMI (HMC <strong>not</strong> needed) and system ethernet in the o3 network.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> qa-power8-3 openQA instances are able to pass o3 openQA jobs after the move to PRG2</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Read <a class="issue tracker-4 status-15 priority-3 priority-lowest child" title="action: Support move of PowerPC machines to PRG2 size:M (Blocked)" href="https://progress.opensuse.org/issues/132140">#132140</a> about the generic setup and in particular the HMC and understand that we work with the machine bare-metal in so called "OPAL" mode here, similar to kerosene</li>
<li>See current configuration and inventory management entry <a href="https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=2352" class="external">https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=2352</a> for the machine</li>
<li>Check if one of the specified interfaces shows up in o3 DHCP logs (dnsmasq)</li>
<li>Crosscheck MAC address entries on racktables against the entries in dnsmasq DHCP static lease configuration</li>
<li>Ensure we have access to qa-power8-3 manually again over ASM and IPMI as well as with verification openQA jobs on o3</li>
<li>Inform users about the result</li>
<li>Update racktables entry accordingly as well as <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls</a></li>
</ul>
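<p>The MAC address crosscheck suggested above could be sketched as follows. The sketch assumes the static leases are written as dnsmasq <code>dhcp-host=</code> lines and that the racktables MAC addresses have been copied out by hand into a list; the function names are hypothetical and any addresses fed into it in an example would be made up.</p>

```python
import re

# A MAC address in lower-case, colon-separated form.
MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}")


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def macs_from_dnsmasq(config_text: str) -> set[str]:
    """Collect all MAC addresses found in dhcp-host= lines, ignoring comments."""
    macs = set()
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line.startswith("dhcp-host="):
            continue
        for field in line[len("dhcp-host="):].split(","):
            candidate = normalize_mac(field)
            if MAC_RE.fullmatch(candidate):
                macs.add(candidate)
    return macs


def crosscheck(inventory_macs, config_text: str) -> dict:
    """Compare inventory MAC addresses against the dnsmasq static leases."""
    inventory = {normalize_mac(mac) for mac in inventory_macs}
    configured = macs_from_dnsmasq(config_text)
    return {
        "missing_in_dnsmasq": sorted(inventory - configured),
        "only_in_dnsmasq": sorted(configured - inventory),
    }
```

<p>When only the interfaces of a single machine are passed in, <code>only_in_dnsmasq</code> naturally lists every other host in the configuration, so <code>missing_in_dnsmasq</code> is the field of interest for this ticket.</p>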
QA - coordination #123800 (Blocked): [epic] Provide SUSE QE Tools services running in PRG2 aka. P...https://progress.opensuse.org/issues/1238002023-01-30T14:46:55Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>SUSE is deprecating NUE1 (Maxtorhof) and setting up a Prague Co-Location datacenter "Prg CoLo" or "DC7" as primary location, in particular for serving public services. This includes what we serve so far from VM clusters managed by EngInfra and in particular the openqa.opensuse.org infrastructure, likely also openqa.suse.de. We must participate in the planning and setup, and accordingly in a migration, until we can provide our services from Prg CoLo.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> SUSE QE Tools services are provided out of Prg CoLo</li>
</ul>
QA - coordination #121720 (Blocked): [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuri...https://progress.opensuse.org/issues/1217202022-12-08T19:30:27Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>SUSE is deprecating NUE1 (Maxtorhof) and setting up a Prague Co-Location datacenter "Prg CoLo" or "DC7" as primary location, in particular for serving public services. This includes what we serve so far from VM clusters managed by EngInfra and in particular the openqa.opensuse.org infrastructure, likely also openqa.suse.de. Or defined differently: Everything that is currently served from NUE1-SRV1. We must participate in the planning and setup, and accordingly in a migration, until we can provide our services from Prg CoLo and no longer rely on NUE1-SRV1 except for the purpose of an optional fail-over datacenter in Nbg.<br>
SUSE is deprecating NUE1 (Maxtorhof) and setting up replacement data centers. Additionally, a new datacenter is planned as fail-over location.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> SUSE QE Tools services are provided out of Prg CoLo <a class="issue tracker-6 status-15 priority-5 priority-high3 child parent behind-schedule" title="coordination: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo (Blocked)" href="https://progress.opensuse.org/issues/123800">#123800</a></li>
<li><strong>AC2:</strong> NUE1 (Maxtorhof) is not relied upon by SUSE QE Tools anymore and has been evacuated by us <a class="issue tracker-6 status-15 priority-4 priority-default child parent behind-schedule" title="coordination: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters (Blocked)" href="https://progress.opensuse.org/issues/129280">#129280</a></li>
<li><strong>AC3:</strong> Relevant SUSE QE Tools services are provided out of NUE3 <a class="issue tracker-6 status-3 priority-4 priority-default closed child parent" title="coordination: [epic] Migration out of SUSE NUE1 - QE setup in NUE3 (Resolved)" href="https://progress.opensuse.org/issues/130955">#130955</a></li>
</ul>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Coordination chat room <a href="https://suse.slack.com/archives/C04MDKHQE20" class="external">#dct-migration</a></p>
openQA Project - coordination #58184 (Blocked): [saga][epic][use case] full version control aware...https://progress.opensuse.org/issues/581842019-10-15T10:19:57Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>This is linked to <a href="https://progress.opensuse.org/projects/openqav3/wiki#Use-case-4" class="external">Use case 4</a> and motivated by a discussion by the QA tools team in the weekly meeting 2019-10-15. What we should have are, for example, user forks and branches as well as fully versioned test schedules and configuration settings.</p>
<a name="User-story"></a>
<h2 >User story<a href="#User-story" class="wiki-anchor">¶</a></h2>
<p>As a test case contributor during test case development I want to run tests on production instances with all necessary changes recorded in version control before merging to master so that my change will have minimal unexpected impact (test regressions) on existing tests</p>
<a name="Further-user-stories-from-httpsconfluencesusecompagesviewpageactionpageId365527173"></a>
<h2 >Further user stories (from <a href="https://confluence.suse.com/pages/viewpage.action?pageId=365527173" class="external">https://confluence.suse.com/pages/viewpage.action?pageId=365527173</a>)<a href="#Further-user-stories-from-httpsconfluencesusecompagesviewpageactionpageId365527173" class="wiki-anchor">¶</a></h2>
<ol>
<li>I want to start a job based on a modified test in production (In production tests can behave differently, for example because of the heavier load) -> see openqa-clone-job + CASEDIR</li>
<li>I want to edit needles and test if they work before proposing changes</li>
<li>I want to compare the results of a certain job group between two of my branches</li>
<li>I want to schedule a test 100 times without it showing up in the group overview -> see <a href="https://progress.opensuse.org/projects/openqatests/wiki#Statistical-investigation" class="external">statistical-investigation</a></li>
<li>I want to trigger multiple cloned jobs for each pull-request (Sometimes you want to trigger VR for different jobs against the same PR. It would be nice to do that in one command line)</li>
<li>I want to trigger the relevant tests automatically by creating a PR</li>
</ol>
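<p>User stories 1 and 4 can already be approximated with <code>openqa-clone-job</code>. The following sketch only assembles the command lines and does not run anything. It assumes the options <code>--skip-chained-deps</code> and <code>--within-instance</code> and the settings <code>CASEDIR</code> and <code>_GROUP=0</code> behave as described in the openQA documentation; check <code>openqa-clone-job --help</code> before relying on it. The helper names, the host and the job id are made up for illustration.</p>

```python
def clone_job_command(host, job_id, casedir=None, hide_from_group=False, extra=None):
    """Build an openqa-clone-job command line as a list of arguments.

    casedir: git URL, optionally with "#branch", of the modified test code.
    hide_from_group: add _GROUP=0 so the clone does not show up in a job group.
    extra: further KEY=VALUE settings to override in the cloned job.
    """
    cmd = ["openqa-clone-job", "--skip-chained-deps", "--within-instance",
           host, str(job_id)]
    settings = {}
    if casedir:
        settings["CASEDIR"] = casedir
    if hide_from_group:
        settings["_GROUP"] = "0"
    settings.update(extra or {})
    cmd += [f"{key}={value}" for key, value in settings.items()]
    return cmd


def repeated(cmd, times):
    """Return `times` independent copies of the same command, e.g. for 100 runs."""
    return [list(cmd) for _ in range(times)]
```

<p>For example, <code>repeated(clone_job_command("https://openqa.example.org", 1234, hide_from_group=True), 100)</code> yields 100 argument lists that a caller could pass to <code>subprocess.run</code> one by one.</p>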
<a name="Implications-and-suggestions"></a>
<h2 >Implications and suggestions<a href="#Implications-and-suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><p>The usual test contributor workflows should be supported and made easier by making openQA fully aware of tests triggered for development purposes without negatively impacting existing validation tests</p>
<ul>
<li>Potential impact on asset management</li>
<li>No pollution of validation test reports by development tests</li>
</ul></li>
<li><p>If there are new/modified needles involved, the existing workflow cannot handle that. The current practice is:</p>
<ul>
<li>Test your changes (and possibly needle changes) locally and create PR(s)</li>
<li>Edit needles online and save them (then they will be committed to master). Requires admin rights</li>
</ul></li>
<li><p>DONE: Cloning cancelled or incomplete jobs currently does not work as openqa-clone-custom-git-refspec requires the vars.json file from a completed job with this file uploaded -> <a href="https://github.com/os-autoinst/openQA/pull/3170" class="external">https://github.com/os-autoinst/openQA/pull/3170</a></p></li>
<li><p>Replace "fetchneedles" by inherent git support</p></li>
<li><p>Provide support for github pull request validation</p></li>
<li><p>DONE: Extend openqa-clone-custom-git-refspec to accept list of source tests to clone -> <a href="https://github.com/os-autoinst/openQA/pull/2577" class="external">https://github.com/os-autoinst/openQA/pull/2577</a></p></li>
<li><p>DONE: openqa-clone-custom-git-refspec: Output in markdown format for easy copy/pasting into git commit messages and github PR comments -> <a href="https://github.com/os-autoinst/openQA/pull/2577" class="external">https://github.com/os-autoinst/openQA/pull/2577</a></p></li>
<li><p>openqa-clone-custom-git-refspec: Provide link to /tests/overview page for the custom build when multiple tests have been cloned</p></li>
<li><p>Make the trigger source of test jobs apparent, e.g. the source git repositories</p></li>
<li><p><a class="issue tracker-6 status-3 priority-4 priority-default closed parent" title="coordination: [EPIC] Interactive mode is an usability disaster (Resolved)" href="https://progress.opensuse.org/issues/14818#note-18">#14818#note-18</a> : "Tim got a ticket from Ray that the docker test failed and wants openQA to reproduce the issue and pause at the beginning of the docker test. Afterwards he wants openQA to make a disk snapshot and step through the test execution to find out where the problem is. After he found out, he reloads the snapshot to tweak the execution. During this process, openQA records his steps and allows to add needles."</p></li>
</ul>