openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2024-03-27T08:03:58Z | openSUSE Project Management Tool | Redmine
openQA Infrastructure - action #158113 (Feedback): typing issue on ppc64 worker - make CPU load a... | https://progress.opensuse.org/issues/158113 | 2024-03-27T08:03:58Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker size:S (Feedback)" href="https://progress.opensuse.org/issues/158104">#158104</a> shows VNC typing issues. In <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: CPU Load and usage alert for openQA workers size:S (Resolved)" href="https://progress.opensuse.org/issues/150983">#150983</a> we deliberately added alerts for excessive CPU load to catch this. <a href="https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694" class="external">https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694</a> clearly shows a load consistently in the range of 50-70(!) for mania, but no alert was triggered. We should crosscheck <a href="https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert" class="external">https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert</a><br>
and make that alert stricter.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> CPU load alerts trigger for a CPU load15 consistently above 40 as originally planned</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Crosscheck <a href="https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert" class="external">https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert</a> or the implementation in code <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941</a></li>
</ul>
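<p>A minimal sketch of the "consistently above 40" condition from AC1, assuming the alert looks at a series of load15 samples. The function name, the number of samples and the example values are hypothetical and are not taken from the Grafana alert definition:</p>

```python
from typing import Sequence


def load_alert_fires(
    load15_samples: Sequence[float],
    threshold: float = 40.0,   # threshold from AC1
    min_samples: int = 3,      # hypothetical: how many recent samples must exceed it
) -> bool:
    """Return True if the most recent `min_samples` load15 values all exceed `threshold`."""
    if len(load15_samples) < min_samples:
        # Not enough data to call the load "consistently" high.
        return False
    recent = load15_samples[-min_samples:]
    return all(value > threshold for value in recent)


# Illustrative values in the 50-70 range reported for mania.
print(load_alert_fires([55.0, 62.3, 70.1]))
# A single dip below the threshold keeps the alert silent.
print(load_alert_fires([55.0, 12.0, 70.1]))
```

<p>Requiring every recent sample to exceed the threshold avoids alerting on short spikes; a real alert rule may instead use an average or a "for" duration.</p>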
openQA Infrastructure - action #158104 (Feedback): typing issue on ppc64 worker size:S | https://progress.opensuse.org/issues/158104 | 2024-03-27T06:57:56Z | zcjia (zcjia@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario sle-15-SP6-Online-ppc64le-ha_beta_supportserver@ppc64le-2g fails in<br>
<a href="https://openqa.suse.de/tests/13885455/modules/setup/steps/84" class="external">setup</a></p>
<p><a href="https://openqa.suse.de/tests/13885455#step/setup/84" class="external">https://openqa.suse.de/tests/13885455#step/setup/84</a> (see attachment p1.png)</p>
<p><a href="https://openqa.suse.de/tests/13885471#step/setup/30" class="external">https://openqa.suse.de/tests/13885471#step/setup/30</a> (see attachment p2.png) The "$" before "?" is missing.</p>
<p><a href="https://openqa.suse.de/tests/13885404#step/setup/12" class="external">https://openqa.suse.de/tests/13885404#step/setup/12</a> (see attachment p3.png)</p>
<p><a href="https://openqa.suse.de/tests/13885407#step/setup/9" class="external">https://openqa.suse.de/tests/13885407#step/setup/9</a> (see attachment p4.png)</p>
<p>I think this may be related to the high workload of the underlying ppc64 worker.</p>
<p>All failures occurred on "mania".</p>
<a name="Test-suite-description"></a>
<h2 >Test suite description<a href="#Test-suite-description" class="wiki-anchor">¶</a></h2>
<p>The base test suite is used for job templates defined in YAML documents. It has no settings of its own.</p>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.suse.de/tests/13885455" class="external">73.1</a> (current job)</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: <a href="https://openqa.suse.de/tests/13829359" class="external">67.1</a> (or more recent)</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Identify the affected machines and workers, apply mitigations to prevent recurring typing issues, e.g. reducing CPU load</li>
<li>Restart related failed jobs</li>
<li>Identify follow-up tasks</li>
<li>Reduce the number of worker instances as a first mitigation measure. <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759</a> (merged)</li>
<li>Make the alert for CPU load more strict - <a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker - make CPU load alert more strict (Feedback)" href="https://progress.opensuse.org/issues/158113">#158113</a></li>
<li>Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) (New)" href="https://progress.opensuse.org/issues/158116">#158116</a></li>
<li>Check existing ffmpeg processes on mania which take a lot of CPU time - <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) (New)" href="https://progress.opensuse.org/issues/158116">#158116</a></li>
</ul>
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>ffmpeg impact investigation -> <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) (New)" href="https://progress.opensuse.org/issues/158116">#158116</a></li>
<li>code improvements -> <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured t... (New)" href="https://progress.opensuse.org/issues/158125">#158125</a></li>
<li>improving the alert -> <a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker - make CPU load alert more strict (Feedback)" href="https://progress.opensuse.org/issues/158113">#158113</a></li>
</ul>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-2g&test=ha_beta_supportserver&version=15-SP6" class="external">latest</a></p>
QA - action #157858 (Feedback): Repeated reminder comments about SLO's for openqatests size:S | https://progress.opensuse.org/issues/157858 | 2024-03-25T08:37:52Z | livdywan (liv.dywan@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: No ticket reminder comments about SLO's for openqatests size:M (Resolved)" href="https://progress.opensuse.org/issues/157522">#157522</a> addressed a bug that prevented reminder comments from being sent. Unfortunately comments are added even if a comment was already present. This is especially visible in <em>immediate</em> tickets, for example #153115, which get daily reminders - as per <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Automated alerts and reminders about SLO's for openqatests (only one reminder) size:M (Resolved)" href="https://progress.opensuse.org/issues/116545">#116545</a> only one comment is supposed to be added. Maybe this is a regression or the check is not comprehensive enough.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Reminders are only added once</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>We already have the code that should handle that: Review the implementation from <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Automated alerts and reminders about SLO's for openqatests (only one reminder) size:M (Resolved)" href="https://progress.opensuse.org/issues/116545">#116545</a> for gaps in the current logic in <a href="https://github.com/openSUSE/backlogger/blob/main/backlogger.py" class="external">https://github.com/openSUSE/backlogger/blob/main/backlogger.py</a></li>
<li>Investigate if something changed with current comments, maybe the Redmine upgrade made a difference here (complete guess)?</li>
<li>Maybe the regex needs to be adapted and/or better covered with unit testing</li>
</ul>
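<p>A minimal sketch of the "only one reminder" check, assuming the existing comments of a ticket are available as plain strings. The marker text, the regex and the function name are hypothetical and do not reflect the actual code in backlogger.py:</p>

```python
import re
from typing import Iterable

# Hypothetical marker; the real reminder wording may differ.
# \s+ tolerates line breaks or extra whitespace introduced by comment rendering.
REMINDER_PATTERN = re.compile(
    r"not\s+updated\s+within\s+the\s+SLO\s+period", re.IGNORECASE
)


def needs_reminder(comments: Iterable[str], pattern: re.Pattern = REMINDER_PATTERN) -> bool:
    """Return True only if no existing comment already contains the reminder marker."""
    return not any(pattern.search(comment) for comment in comments)


existing = [
    "Looking into this.",
    "This ticket was not updated within\nthe SLO period. Please consider updating it.",
]
print(needs_reminder(existing))        # a reminder is already present
print(needs_reminder(["Looking into this."]))
```

<p>Unit tests with comments as they are actually returned by the current Redmine version would show whether a pattern like this still matches after the upgrade.</p>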
openQA Infrastructure - action #157615 (Feedback): [alert] osd-deployment failed in post-deploy ,... | https://progress.opensuse.org/issues/157615 | 2024-03-20T18:18:05Z | jbaier_cz (jbaier@suse.cz)
<p>See <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217</a></p>
<pre><code>schort-server.qe.nue2.suse.org:
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
telegraf errors
monitor.qe.nue2.suse.org:
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
++ grep ' E! ' salt_post_deploy_checks.log
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ol>
<li>Understand why and where <code>systemd_list_service_by_state_for_telegraf.sh</code> times out. The limit could be the general telegraf timeout in the pipeline, the timeout for executing the script itself (from telegraf.conf), or something else. Adjust the timeout to match the expected runtime or fix the script to complete faster -> schort-server only has 1 VM core, so consider configuring the hypervisor to use at least 2 cores</li>
<li>"Error killing process: os: process already finished" might just be a consequence of the above</li>
<li>"Error in plugin: cannot get SSL cert '<a href="https://monitor.qa.suse.de:443':" class="external">https://monitor.qa.suse.de:443':</a> dial tcp: lookup monitor.qa.suse.de: i/o timeout" possibly to be covered with some retrying? Investigate what the real error message means, ask <a href="https://www.ecosia.org/chat" class="external">https://www.ecosia.org/chat</a> (or if that does not work invest in coal-powered <a href="https://www.cat-gpt.com/chat" class="external">https://www.cat-gpt.com/chat</a> ) or something</li>
<li>If we cannot solve these problems, consider excluding them from CI execution to avoid false-positives. Consider the impact of doing this first however!</li>
</ol>
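<p>The retrying idea from item 3 could look roughly like the following sketch. It assumes the failing check can be wrapped in a Python callable; the helper name and its parameters are hypothetical and this is not how telegraf or the post-deploy checks are implemented:</p>

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    check: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 1.0,
    backoff: float = 2.0,
) -> T:
    """Call `check` until it succeeds or `attempts` is exhausted.

    Only OSError is retried; DNS lookup failures (socket.gaierror) and
    timeouts (TimeoutError) are subclasses of OSError.
    """
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except OSError:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay_s)
            delay_s *= backoff  # exponential backoff between attempts
    raise ValueError("attempts must be at least 1")


# Example use (performs a real DNS lookup, so it is left commented out):
# import socket
# retry(lambda: socket.getaddrinfo("monitor.qa.suse.de", 443))
```

<p>Retrying only hides transient lookup failures; a persistent "i/o timeout" would still fail after the last attempt and should then be investigated as a DNS or network problem.</p>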
openQA Project - action #157540 (Feedback): [sporadic] ci openQA: t/33-developer_mode.t fails size:M | https://progress.opensuse.org/issues/157540 | 2024-03-19T14:15:50Z | tinita (tina.mueller+trick-redmine@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://app.circleci.com/pipelines/github/os-autoinst/openQA/13196/workflows/ddb935c7-31dd-4beb-877c-25ef1e703b4d/jobs/123231" class="external">https://app.circleci.com/pipelines/github/os-autoinst/openQA/13196/workflows/ddb935c7-31dd-4beb-877c-25ef1e703b4d/jobs/123231</a></p>
<pre><code>[14:03:42] t/33-developer_mode.t .. 17/? # Unexpected Javascript console errors, waiting for connection opened: [
# {
# level => "SEVERE",
# message => "http://localhost:9526/asset/3906633cf0/ws_console.js 8 WebSocket connection to 'ws://localhost:9528/liveviewhandler/tests/1/developer/ws-proxy' failed: Error during WebSocket handshake: Unexpected response code: 302",
# source => "network",
# timestamp => 1710857067816,
# },
# ]
# Failed test 'No unexpected js warnings'
# at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 123.
# Looks like you failed 1 test of 9.
[14:03:42] t/33-developer_mode.t .. 20/?
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>While investigating the code in parallel, try to reproduce locally with coverage enabled and multiple runs to get a statistically significant result, e.g. <code>make test KEEP_DB=1 RETRY=500 TESTS=t/33-developer_mode.t</code> and go for lunch or continue coding :)</li>
<li>If it's not reproducible, consider the same with coverage enabled and/or in circleCI, e.g. a temporary branch in your github repo fork</li>
<li>Identify where in <a href="https://github.com/os-autoinst/openQA/blob/master/t/33-developer_mode.t" class="external">https://github.com/os-autoinst/openQA/blob/master/t/33-developer_mode.t</a> the redirection "302" could happen</li>
<li>Even though the test is not technically a UI test in the t/ui/ folder it might still be necessary to apply UI test related synchronisation means to fix the sporadic failure as a selenium instance is used</li>
<li>Might be a similar issue: <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: [sporadic] t/full-stack.t Failed test 'Expected result for job 1 not found' size:M (Resolved)" href="https://progress.opensuse.org/issues/102578">#102578</a></li>
</ul>
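<p>To judge how many runs make a result "statistically significant", the following sketch estimates the number of runs needed to observe at least one failure. It assumes independent runs with a fixed failure probability, which is a simplification; the function name and the example failure rates are hypothetical:</p>

```python
import math


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n for which at least one failure occurs in n independent runs
    with probability >= `confidence`, i.e. (1 - failure_rate) ** n <= 1 - confidence."""
    if not 0.0 < failure_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("failure_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Assumed failure rates of 1% and 5% per run.
for rate in (0.01, 0.05):
    print(f"failure rate {rate:.0%}: {runs_needed(rate)} runs")
```

<p>Under these assumptions, a test failing in 1% of runs needs about 300 runs before a clean result says much, so a few dozen passing runs are not enough to call a fix verified.</p>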
openQA Infrastructure - action #157468 (Feedback): Handle internal test machines with compromised... | https://progress.opensuse.org/issues/157468 | 2024-03-18T12:00:21Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-150437" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-150437</a> we are asked to handle "compromised root passwords in QA segments" including s390zl11…16</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> All steps asked in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-150437" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-150437</a> have been sufficiently handled</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><em>DONE</em> Change root password for s390zl11…16 and in sync update in <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls</a></li>
<li><em>DONE</em> Ensure the corresponding tests work</li>
<li><em>DONE</em> Ask wegao about their personal machine, which is also referenced</li>
<li><em>DONE</em> Find a working solution covering s390kvm080…099 -> see related tickets</li>
<li>Depending on response in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-150437" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-150437</a> either resolve this ticket or block on sibling tasks, e.g. new password for s390x openQA tests, rotating password, ssh key based authentication and/or more secured network</li>
</ul>