openSUSE Project Management Tool: Issues
https://progress.opensuse.org/
2024-03-29T09:06:46Z
openSUSE Project Management Tool
Redmine qe-yam - action #158269 (New): Need to add 'sshd' package to the autoyast profilehttps://progress.opensuse.org/issues/1582692024-03-29T09:06:46Ztinawang123yuwang@suse.com
<p><strong>Motivation</strong><br>
Failed job: <a href="https://openqa.suse.de/tests/13902800#step/installation/7" class="external">https://openqa.suse.de/tests/13902800#step/installation/7</a><br>
Service 'sshd' was not found because it is not added to the profile.</p>
<p><strong>Acceptance criteria</strong><br>
AC1: Add 'open-sshd' to the profile.</p>
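<p>For illustration, a minimal AutoYaST fragment that pulls in the SSH package and enables the service might look like the following; the package name ('openssh') and the section layout are assumptions to be checked against the real profile:</p>

```xml
<!-- Hypothetical AutoYaST excerpt; package and section names are assumptions
     to be verified against the actual profile in use. -->
<software>
  <packages config:type="list">
    <package>openssh</package>
  </packages>
</software>
<services-manager>
  <services>
    <enable config:type="list">
      <service>sshd</service>
    </enable>
  </services>
</services-manager>
```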
openQA Infrastructure - action #158266 (In Progress): openQA jobs on diesel ppc64le fail due to a...https://progress.opensuse.org/issues/1582662024-03-29T08:41:33Zokurzokurz@suse.com
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>From <a href="https://suse.slack.com/archives/C02CANHLANP/p1711700522125619" class="external">https://suse.slack.com/archives/C02CANHLANP/p1711700522125619</a></p>
<blockquote>
<p>Warning: tests have been failing on ppc64 worker host diesel since around 5 hours ago; it seems the qemu VMs can't start. <a href="https://openqa.suse.de/admin/workers/3393" class="external">https://openqa.suse.de/admin/workers/3393</a> <a href="https://openqa.suse.de/admin/workers/3388" class="external">https://openqa.suse.de/admin/workers/3388</a> <a href="https://openqa.suse.de/admin/workers/3390" class="external">https://openqa.suse.de/admin/workers/3390</a></p>
</blockquote>
<p>autoinst-log.txt says</p>
<pre><code>[2024-03-29T09:37:43.496499+01:00] [debug] [pid:18748] QEMU: error: kvm run failed Device or resource busy
[2024-03-29T09:37:43.496606+01:00] [debug] [pid:18748] QEMU: This is probably because your SMT is enabled.
[2024-03-29T09:37:43.496679+01:00] [debug] [pid:18748] QEMU: VCPU can only run on primary threads with all secondary threads offline.
</code></pre>
<p>There is the "smt_off" service <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls?ref_type=heads#L263" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls?ref_type=heads#L263</a> meant to fix the SMT problem. The service was running fine, but I restarted it anyway and retriggered <a href="https://openqa.suse.de/tests/13906928#live" class="external">https://openqa.suse.de/tests/13906928#live</a>. The problem still reproduces.</p>
<p>Only diesel is affected, mania and petrol seem fine.</p>
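<p>The failure signature above can be checked for mechanically; a minimal sketch (a hypothetical helper, not part of the existing tooling) that scans an autoinst-log.txt for it:</p>

```python
import re

# Signature taken from the autoinst-log.txt excerpt above.
SMT_FAILURE = re.compile(
    r"QEMU: (?:error: kvm run failed Device or resource busy"
    r"|This is probably because your SMT is enabled)"
)

def find_smt_failures(log_text):
    """Return all log lines matching the known SMT/kvm failure signature."""
    return [line for line in log_text.splitlines() if SMT_FAILURE.search(line)]
```

<p>Running this over the excerpt above returns the first two lines; the "VCPU can only run on primary threads" line is explanatory follow-up.</p>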
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><em>DONE</em> <code>ssh osd 'sudo salt-key -y -d diesel.qe.nue2.suse.org'</code></li>
<li><em>DONE</em> <code>ssh diesel.qe.nue2.suse.org 'sed -i "s/qemu_ppc64le,/qemu_ppc64le-poo158266,/" /etc/openqa/workers.ini && systemctl restart openqa-worker-auto-restart@{1..8} && systemctl disable --now salt-minion telegraf'</code></li>
<li><em>DONE</em> <code>host=openqa.suse.de WORKER=diesel result="result='failed'" comment="label:poo158266" ./openqa-advanced-retrigger-jobs</code></li>
<li>Investigate what is different on diesel vs. mania+petrol. Maybe mania+petrol are also affected but it has not been noticed yet, e.g. because they have not been rebooted yet</li>
<li>Fix the problem</li>
<li>verify</li>
<li>rollback</li>
</ul>
<a name="Rollback-actions"></a>
<h2 >Rollback actions<a href="#Rollback-actions" class="wiki-anchor">¶</a></h2>
<ul>
<li><code>ssh diesel.qe.nue2.suse.org 'sed -i "s/qemu_ppc64le-poo158266,/qemu_ppc64le,/" /etc/openqa/workers.ini && systemctl restart openqa-worker-auto-restart@{1..8} && systemctl enable --now salt-minion telegraf'</code></li>
<li><code>ssh osd 'sudo salt-key -y -a diesel.qe.nue2.suse.org'</code></li>
</ul>
openQA Tests - action #158245 (Feedback): test fails in openqa_workerhttps://progress.opensuse.org/issues/1582452024-03-28T21:42:42Zlivdywanliv.dywan@suse.com
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_nginx@64bit-2G fails in<br>
<a href="https://openqa.opensuse.org/tests/4049058/modules/openqa_worker/steps/8" class="external">openqa_worker</a></p>
<a name="Test-suite-description"></a>
<h2 >Test suite description<a href="#Test-suite-description" class="wiki-anchor">¶</a></h2>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.opensuse.org/tests/4049058" class="external">:TW.27506</a> (current job)</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: <a href="https://openqa.opensuse.org/tests/4049009" class="external">:TW.27505</a> (or more recent)</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_install_nginx&version=Tumbleweed" class="external">latest</a></p>
openQA Infrastructure - action #158242 (New): Prevent ssh access to test VMs on svirt hypervisor ...https://progress.opensuse.org/issues/1582422024-03-28T19:30:27Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-150437" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-150437</a> we are asked to handle "compromised root passwords in QA segments" including s390zl11…16</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> firewall on OSD svirt hosts prevents direct ssh+vnc access from outside, i.e. normal office networks</li>
<li><strong>AC2:</strong> openQA svirt jobs are still able to access ssh+vnc as necessary, e.g. from openQA workers in the same network OR openQA workers on the hypervisor hosts themselves</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Take openQA svirt worker instances related to one hypervisor host, e.g. s390zl12, out of production for testing</li>
<li>Configure a/the firewall on that host to block ssh+vnc to VMs running on that host</li>
<li>Allow traffic from other hosts in oqa.prg2.suse.org</li>
<li>Ensure that openQA tests still work</li>
<li>Ensure that the according firewall config is made boot-persistent and in salt</li>
<li>Crosscheck with at least one reboot</li>
<li>Apply the same solution to all other OSD svirt hosts</li>
</ul>
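<p>As a concrete starting point, the blocking could be expressed as a firewalld zone with rich rules; this is only a sketch, with placeholder source subnet and port ranges that would need to be replaced with the real oqa.prg2.suse.org values:</p>

```xml
<!-- Hypothetical firewalld zone excerpt; the source subnet and the VNC port
     range are placeholders, not verified values. -->
<zone target="DROP">
  <rule family="ipv4">
    <source address="10.0.0.0/24"/>
    <port port="22" protocol="tcp"/>
    <accept/>
  </rule>
  <rule family="ipv4">
    <source address="10.0.0.0/24"/>
    <port port="5900-5999" protocol="tcp"/>
    <accept/>
  </rule>
</zone>
```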
openQA Project - action #158236 (New): Backlog Limits Checker github workflow fails on pull reque...https://progress.opensuse.org/issues/1582362024-03-28T17:10:31Ztinitatina.mueller+trick-redmine@suse.com
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://github.com/openSUSE/backlogger/actions/runs/8468254805/job/23200822772" class="external">https://github.com/openSUSE/backlogger/actions/runs/8468254805/job/23200822772</a><br>
The workflow creates a preview of the HTML page in the origin gh-pages branch.<br>
For that, it needs the right permissions. A PR from a branch in origin works, but it fails for forks.</p>
<p>Maybe there are other options to make it work.</p>
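<p>A likely cause (an assumption, not verified against this workflow) is that the <code>GITHUB_TOKEN</code> granted to pull requests from forks is read-only, so pushing the preview to <code>gh-pages</code> is rejected regardless of what the workflow declares. For reference, an explicit permissions block looks like this:</p>

```yaml
# Hypothetical workflow excerpt; note that fork PRs still receive a
# read-only token even with this set.
permissions:
  contents: write   # required to push the preview to the gh-pages branch
```

<p>Alternatives such as triggering on <code>pull_request_target</code> do grant write permissions for fork PRs, but they run with access to repository secrets, so the workflow would need careful review of what it checks out and executes.</p>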
openQA Tests - action #158215 (New): windows images cleaned up - but referenced in jobshttps://progress.opensuse.org/issues/1582152024-03-28T09:11:29Zdimstardimstar@opensuse.org
<p>Recently, some old Windows 10 images were cleaned up from o3.<br>
The first batch was fixed in the job groups to reference the new images, but snapshot 0327 still has a few tests failing on assets:</p>
<p>Flavor: DVD <br>
gnome_dual_windows10@64bit_win <a href="https://openqa.opensuse.org/tests/4047942" class="external">https://openqa.opensuse.org/tests/4047942</a><br>
gnome_dual_windows10@uefi_win <a href="https://openqa.opensuse.org/tests/4047941" class="external">https://openqa.opensuse.org/tests/4047941</a><br>
kde_dual_windows10@64bit_win <a href="https://openqa.opensuse.org/tests/4047943" class="external">https://openqa.opensuse.org/tests/4047943</a><br>
kde_dual_windows10@uefi_win <a href="https://openqa.opensuse.org/tests/4047940" class="external">https://openqa.opensuse.org/tests/4047940</a></p>
<p>Flavor: NET<br>
kde_dual_windows10@uefi_win <a href="https://openqa.opensuse.org/tests/4047944" class="external">https://openqa.opensuse.org/tests/4047944</a></p>
<p>They all miss the relevant HDD_1 asset, e.g.<br>
Reason: asset failure: Failed to download <code>windows-10-x86_64-21H1@64bit_win.qcow2</code> to /var/lib/openqa/cache/openqa.opensuse.org/<code>windows-10-x86_64-21H1@64bit_win.qcow2</code></p>
<p>The setting comes from the test suite directly:</p>
<p>gnome_dual_windows10</p>
<pre><code>CDMODEL=ide-cd
DESKTOP=gnome
DUALBOOT=1
EXCLUDE_MODULES=system_prepare
HDDVERSION=Windows 10
HDD_1=windows-10-x86_64-21H1@%MACHINE%.qcow2
</code></pre>
<p>Maintainer: <a href="mailto:grace.wang@suse.com">grace.wang@suse.com</a></p>
<p>kde_dual_windows10</p>
<pre><code>CDMODEL=ide-cd
DESKTOP=kde
DUALBOOT=1
EXCLUDE_MODULES=system_prepare
HDDVERSION=Windows 10
HDD_1=windows-10-x86_64-1903@%MACHINE%.qcow2
</code></pre>
<p>Maintainer: <a href="mailto:grace.wang@suse.com">grace.wang@suse.com</a></p>
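<p>For reference, the asset name each job requests comes from expanding the <code>%MACHINE%</code> placeholder in <code>HDD_1</code>; a small sketch of that expansion:</p>

```python
def expand_setting(value, machine):
    """Expand the %MACHINE% placeholder used in openQA test suite settings."""
    return value.replace("%MACHINE%", machine)

# e.g. the DVD gnome_dual_windows10@64bit_win job ends up requesting
# windows-10-x86_64-21H1@64bit_win.qcow2, the asset reported missing above.
```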
qe-yam - action #158209 (New): [Research] Add service check test on migration path from 15SP3 to ...https://progress.opensuse.org/issues/1582092024-03-28T07:56:02Zlelileli@suse.com
<a name="Motivation"></a>
<h4 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h4>
<p>This idea comes from a customer mail titled 'named.service won't start (permission denied) after upgrade 15SP3-->15SP5'; I will paste the content of the mail in the comments.<br>
To test named after migration we need to add a service check, and to cover the migration path from 15SP3 to 15SP5 we need to add a continuous migration test with a service check.</p>
<p>Example: see the current service check for named in the regression test <a href="https://openqa.suse.de/tests/13887909#step/check_upgraded_service/16" class="external">online_sles15sp4_pscc_live-basesys-srv-desktop-dev-contm-lgm-tsm-wsm-pcm_all_full</a> for reference.</p>
<a name="Acceptance-criteria"></a>
<h4 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h4>
<p><strong>AC1</strong>: Add service check test on migration path from 15SP3 to 15SP5.</p>
qe-yam - action #158194 (New): If firewall is disabled, it should not match the tag 'nfs-firewall...https://progress.opensuse.org/issues/1581942024-03-28T05:54:09Ztinawang123yuwang@suse.com
<p><strong>Motivation</strong><br>
Failed job: <a href="https://openqa.suse.de/tests/13830874#step/install_service/123" class="external">https://openqa.suse.de/tests/13830874#step/install_service/123</a><br>
As the firewall is disabled, the 'open port in firewall' option cannot be chosen.</p>
<p><strong>Acceptance criteria</strong><br>
AC1: Update the code to check whether the key 'alt-f' needs to be sent to open the port in the firewall.</p>
qe-yam - action #158191 (New): ppc64le_regression_test_offline_textmode.yaml should not include d...https://progress.opensuse.org/issues/1581912024-03-28T03:18:09Ztinawang123yuwang@suse.com
<p><strong>Motivation</strong><br>
Failed job: <a href="https://openqa.suse.de/tests/13892873" class="external">https://openqa.suse.de/tests/13892873</a><br>
This job is textmode, but includes desktop x11 test modules.</p>
<p><strong>Acceptance criteria</strong><br>
AC1: Update the yaml profile so that a textmode job does not test desktop x11.</p>
openQA Infrastructure - action #158185 (Feedback): parallel job failed to get the vars from its p...https://progress.opensuse.org/issues/1581852024-03-28T00:56:46ZJulie_CAOjcao@suse.com
<a name="Observation"></a>
<h3 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h3>
<p>We have a parallel job that failed to get the vars from its pair; a rerun still failed. Is there something wrong with the worker service?</p>
<pre><code>sub get_var_from_parent {
my ($self, $var) = @_;
my $parents = get_parents();
# Query every parent to find the var
for my $job_id (@$parents) {
my $ref = get_job_autoinst_vars($job_id);
return $ref->{$var} if defined $ref->{$var};
}
return;
}
</code></pre>
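<p>The <code>get_var_from_parent</code> helper above ultimately fetches JSON from the parent worker's command server. A rough sketch (hypothetical Python, mirroring the URL shape from the log below) of that fetch, showing where a DNS failure like "Name or service not known" turns into a missing result:</p>

```python
import json
import urllib.request

def get_job_autoinst_vars(vars_url):
    """Fetch the autoinst vars JSON from a worker's command server.

    vars_url looks like http://<worker-host>:<port>/<token>/vars.  A failure
    such as 'Name or service not known' means the worker hostname did not
    resolve from this host, so None is returned instead of the vars.
    """
    try:
        with urllib.request.urlopen(vars_url, timeout=10) as resp:
            return json.load(resp)
    except OSError:  # URLError (DNS/connect failures) is a subclass of OSError
        return None
```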
<p><a href="https://openqa.suse.de/tests/13885165/logfile?filename=autoinst-log.txt" class="external">https://openqa.suse.de/tests/13885165/logfile?filename=autoinst-log.txt</a></p>
<pre><code>[2024-03-27T15:39:25.691962Z] [debug] [pid:4639] get_job_autoinst_vars: Connection error: Can't connect: Name or service not known; URL was http://worker35:20493/wS5wkxkWNNB9LK92/vars
</code></pre>
openQA Infrastructure - action #158170 (Feedback): Increase resources for s390x kvmhttps://progress.opensuse.org/issues/1581702024-03-27T14:54:10Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a href="https://suse.slack.com/archives/C02CANHLANP/p1711533706482229" class="external">https://suse.slack.com/archives/C02CANHLANP/p1711533706482229</a></p>
<blockquote>
<p>(Oliver Kurz) @Matthias Griessmeier would you be interested in trying to acquire more s390x kvm testing resources? Looking into <a href="https://suse.slack.com/archives/C02CLB8TZP1/p1711532709502039" class="external">https://suse.slack.com/archives/C02CLB8TZP1/p1711532709502039</a> I found that s390x kvm openQA jobs have a significant scheduling backlog due to the limit of available instances. We would be able to run more instances with more memory assigned to the hypervisor LPAR</p>
</blockquote>
openQA Project - coordination #158167 (New): [epic] Increase worker capacityhttps://progress.opensuse.org/issues/1581672024-03-27T14:53:49Zokurzokurz@suse.com
qe-yam - action #158158 (Workable): GTK glitch in yast2_lan_restart_vlanhttps://progress.opensuse.org/issues/1581582024-03-27T12:10:09Zrainerkoenig
<a name="Observation"></a>
<h4 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h4>
<p>openQA test in scenario sle-15-SP6-Online-ppc64le-yast2_gui@ppc64le-4g fails in<br>
<a href="https://openqa.suse.de/tests/13886245/modules/yast2_lan_restart_vlan/steps/39" class="external">yast2_lan_restart_vlan</a></p>
<p>The problem is the well-known <a href="https://progress.opensuse.org/issues/124652" class="external">screen refresh glitch</a> showing up again.<br>
The <code>workaround_poo124652</code> needs to be applied here.</p>
<a name="Additional-information"></a>
<h4 >Additional information<a href="#Additional-information" class="wiki-anchor">¶</a></h4>
<p>Screenshot from failed run<br>
<img src="https://progress.opensuse.org/attachments/download/17515/screenshot-glitch.png" alt="Screenshot from failure" loading="lazy" /></p>
<p>Screenshot from previous run<br>
<img src="https://progress.opensuse.org/attachments/download/17512/screenshot-ok.png" alt="Screenshot from previous run" loading="lazy" /><br>
<a href="https://openqa.suse.de/tests/13849100#step/yast2_lan_restart_vlan/38" class="external">Link to the passed test step</a></p>
<a name="Acceptance-criteria"></a>
<h4 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h4>
<ul>
<li><strong>AC1</strong>: <code>workaround_poo124652</code> from <code>lib/YaST/workarounds.pm</code> is applied for this situation.</li>
<li><strong>AC2</strong>: the problem no longer shows up.</li>
</ul>
openQA Project - action #158146 (New): Prevent scheduling across-host multimachine clusters to ho...https://progress.opensuse.org/issues/1581462024-03-27T11:06:56Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>Multi-machine jobs have been failing since 20230814 because of a misconfiguration of the MTU/GRE tunnels. A workaround has been found in forcing complete multi-machine tests to run on the same worker host. In <a class="issue tracker-4 status-3 priority-4 priority-default closed child behind-schedule" title="action: Optionally restrict multimachine jobs to a single worker (Resolved)" href="https://progress.opensuse.org/issues/135035">#135035</a> we added a feature flag to limit jobs to a single physical host, which can be used for debugging, as a temporary workaround, or if the network design prevents multiple hosts from being interconnected by GRE tunnels. But by default, when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts which might not be properly interconnected, there is no measure preventing workers from picking up such clusters, causing hard-to-investigate openQA job failures which we should try to prevent. Can we propagate test variables like the "limit to one host only" feature flag in worker properties so that the openQA scheduler can see that flag before assigning jobs to workers?</p>
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> the openQA scheduler does not schedule across-host multimachine clusters to any host that has the feature flag from <a class="issue tracker-4 status-3 priority-4 priority-default closed child behind-schedule" title="action: Optionally restrict multimachine jobs to a single worker (Resolved)" href="https://progress.opensuse.org/issues/135035">#135035</a> set</li>
<li><strong>AC2:</strong> By default jobs of a multi-machine parallel cluster can still be scheduled covering multiple different hosts</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Look into what was done in <a class="issue tracker-4 status-3 priority-4 priority-default closed child behind-schedule" title="action: Optionally restrict multimachine jobs to a single worker (Resolved)" href="https://progress.opensuse.org/issues/135035">#135035</a> but for the central openQA scheduler</li>
<li>Investigate if any worker properties are already available to read by the openQA scheduler when scheduling. At least it knows about the worker class already, right? Should we translate the feature flag from <a class="issue tracker-4 status-3 priority-4 priority-default closed child behind-schedule" title="action: Optionally restrict multimachine jobs to a single worker (Resolved)" href="https://progress.opensuse.org/issues/135035">#135035</a> as a "special worker class" to act as an exclusive class that is only implemented by one host at a time?</li>
<li>Ensure that the scheduler does not schedule across-host multimachine clusters to any host that has such special worker class or worker property</li>
</ul>
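<p>The "special worker class" idea from the suggestions above could be modeled roughly like this (a toy sketch, not openQA code; the per-host flag stands in for the one-host-only feature from #135035):</p>

```python
def may_assign_cluster(assignment, host_flags):
    """Toy scheduler check.

    ``assignment`` maps job id -> worker host; ``host_flags`` maps host ->
    whether it sets the one-host-only flag.  A parallel cluster may span
    multiple hosts only if none of the involved hosts sets the flag.
    """
    hosts = set(assignment.values())
    if len(hosts) <= 1:
        return True  # single-host assignment is always fine
    return not any(host_flags.get(h, False) for h in hosts)
```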
openQA Project - action #158143 (New): Make workers unassign/reject/incomplete jobs when across-h...https://progress.opensuse.org/issues/1581432024-03-27T11:01:42Zokurzokurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>Multi-machine jobs have been failing since 20230814 because of a misconfiguration of the MTU/GRE tunnels. A workaround has been found in forcing complete multi-machine tests to run on the same worker host. In <a class="issue tracker-4 status-3 priority-4 priority-default closed child behind-schedule" title="action: Optionally restrict multimachine jobs to a single worker (Resolved)" href="https://progress.opensuse.org/issues/135035">#135035</a> we added a feature flag to limit jobs to a single physical host, which can be used for debugging, as a temporary workaround, or if the network design prevents multiple hosts from being interconnected by GRE tunnels. But by default, when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts which might not be properly interconnected, there is no measure preventing workers from picking up such clusters, causing hard-to-investigate openQA job failures which we should try to prevent. We should make workers unassign/reject/incomplete jobs when an across-host multimachine setup is requested but not available, and optionally inform about the possibility to use the "limit to one host only" feature flag.</p>
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> openQA workers with "tap" class but not configured for across-host multimachine setup do not fail openQA jobs due to being spread over multiple hosts</li>
<li><strong>AC2:</strong> By default jobs of a multi-machine parallel cluster can still be scheduled covering multiple different hosts</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Look into what was done in <a class="issue tracker-4 status-3 priority-4 priority-default closed child behind-schedule" title="action: Optionally restrict multimachine jobs to a single worker (Resolved)" href="https://progress.opensuse.org/issues/135035">#135035</a> but for the central openQA scheduler</li>
<li>Investigate if a worker knows about other workers that it would need to communicate with in a multi-machine cluster job, possibly during the "assignment" step</li>
<li>Implement a pre-run check, possibly during the "assignment" step, where the worker would check if pre-requisites for across-host multimachine testing are fulfilled <em>if</em> the test cluster would need that, and fail early</li>
<li>Ensure that such early failure is fed back to the openQA scheduler, e.g. by unassigning the job, possibly with an explicit message visible by admins somewhere?</li>
<li>If not possible to unassign then somehow "reject" jobs or as last resort "incomplete" a job with an explicit "reason" which is still better than actually starting an openQA job and then causing fails</li>
<li>Optionally in the message/reason returned suggest to the admin/users to use the feature flag from <a class="issue tracker-4 status-3 priority-4 priority-default closed child behind-schedule" title="action: Optionally restrict multimachine jobs to a single worker (Resolved)" href="https://progress.opensuse.org/issues/135035">#135035</a></li>
</ul>
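<p>The pre-run check suggested above could look roughly like this (a toy sketch, under the assumption that a worker can learn which peer hosts it is GRE-connected to):</p>

```python
from typing import Optional

def prerun_check(cluster_hosts, this_host, gre_peers) -> Optional[str]:
    """Toy worker-side pre-check.

    Return a human-readable reason to reject the job if the cluster spans
    hosts this worker cannot reach via GRE, else None (job may start).
    """
    unreachable = (set(cluster_hosts) - {this_host}) - set(gre_peers)
    if unreachable:
        return ("across-host multimachine requested but peers unreachable: "
                f"{sorted(unreachable)}")
    return None
```

<p>The returned reason would then be fed back to the scheduler (unassign/reject) or, as a last resort, used as the job's incomplete "reason".</p>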