openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2024-03-27T08:03:58Z
openQA Infrastructure - action #158113 (Feedback): typing issue on ppc64 worker - make CPU load a... | https://progress.opensuse.org/issues/158113 | 2024-03-27T08:03:58Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a class="issue tracker-4 status-4 priority-5 priority-high3 child behind-schedule" title="action: typing issue on ppc64 worker size:S (Feedback)" href="https://progress.opensuse.org/issues/158104">#158104</a> shows VNC typing issues. Because of this, in <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: CPU Load and usage alert for openQA workers size:S (Resolved)" href="https://progress.opensuse.org/issues/150983">#150983</a> we deliberately added alerts for too high CPU load. <a href="https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694" class="external">https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694</a> clearly shows a load consistently in the range of 50-70(!) for mania, yet no alert triggered. We should crosscheck <a href="https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert" class="external">https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert</a><br>
and make that alert stricter.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> CPU load alerts trigger when the 15-minute load average (load15) is consistently above 40, as originally planned</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Crosscheck <a href="https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert" class="external">https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert</a> or the implementation in code <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941</a></li>
</ul>
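<p>The alerting condition from AC1 can be sketched as a local crosscheck on the worker itself; this is a hypothetical helper for manual verification, not the actual Grafana alert query:</p>

```shell
# Hypothetical crosscheck of the intended alert condition:
# read the 15-minute load average from /proc/loadavg and compare
# it against the planned threshold of 40.
threshold=40
load15=$(awk '{print $3}' /proc/loadavg)
if awk -v l="$load15" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
  echo "ALERT: load15=$load15 above $threshold"
else
  echo "OK: load15=$load15 within threshold"
fi
```

On mania, with the dashboard showing load in the 50-70 range, this should print the ALERT line, which is what the Grafana rule should also fire on.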
openQA Infrastructure - action #157615 (Feedback): [alert] osd-deployment failed in post-deploy,... | https://progress.opensuse.org/issues/157615 | 2024-03-20T18:18:05Z | jbaier_cz (jbaier@suse.cz)
<p>See <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217</a></p>
<pre><code>schort-server.qe.nue2.suse.org:
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
telegraf errors
monitor.qe.nue2.suse.org:
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
++ grep ' E! ' salt_post_deploy_checks.log
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ol>
<li>Understand why and where <code>systemd_list_service_by_state_for_telegraf.sh</code> times out. It could be the general telegraf timeout in the pipeline, the timeout for executing the script itself (from telegraf.conf), or another place. Adjust the timeout to match the expected runtime, or fix the script to complete faster. schort-server only has 1 VM core; consider configuring the hypervisor to use at least 2 cores</li>
<li>"Error killing process: os: process already finished" might just be a consequence of the above</li>
<li>"Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout" could possibly be covered with some retrying. Investigate what the real error message means, ask <a href="https://www.ecosia.org/chat" class="external">https://www.ecosia.org/chat</a> (or if that does not work, invest in the coal-powered <a href="https://www.cat-gpt.com/chat" class="external">https://www.cat-gpt.com/chat</a>) or something</li>
<li>If we cannot solve these problems, consider excluding them from CI execution to avoid false positives. However, consider the impact of doing this first!</li>
</ol>
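<p>If the timeout turns out to be the per-command timeout of the exec input, a hedged sketch of the relevant telegraf.conf section could look like this (the 30s value and the data format are assumptions; the default exec timeout is 5s):</p>

```toml
# Hypothetical telegraf.conf fragment: raise the per-command timeout
# for the exec input so the systemd listing script on the 1-core VM
# has enough headroom to finish.
[[inputs.exec]]
  commands = [
    "/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ''",
    "/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ''",
  ]
  timeout = "30s"   # assumption; default is 5s, which the script exceeded here
  data_format = "influx"
```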
openQA Infrastructure - action #156322 (Blocked): zabbix-proxy.dmz-prg2.suse.org not reachable fr... | https://progress.opensuse.org/issues/156322 | 2024-02-29T11:21:32Z | jbaier_cz (jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Zabbix proxy is not reachable from ariel, hence the monitoring of that host is not working at all.</p>
<p>Error message from zabbix frontend: <code>Received empty response from Zabbix Agent at [10.150.1.11]. Assuming that agent dropped connection because of access permissions.</code></p>
<pre><code>new-ariel # ping -c3 zabbix-proxy.dmz-prg2.suse.org
PING zabbix-proxy.dmz-prg2.suse.org (10.150.1.22) 56(84) bytes of data.
From ariel.suse-dmz.opensuse.org (10.150.1.11) icmp_seq=1 Destination Host Unreachable
From ariel.suse-dmz.opensuse.org (10.150.1.11) icmp_seq=2 Destination Host Unreachable
From ariel.suse-dmz.opensuse.org (10.150.1.11) icmp_seq=3 Destination Host Unreachable
--- zabbix-proxy.dmz-prg2.suse.org ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2045ms
</code></pre>
openQA Infrastructure - action #155743 (Blocked): OBSRSync fails to sync openSUSE:Factory:PowerPC... | https://progress.opensuse.org/issues/155743 | 2024-02-21T12:07:21Z | livdywan (liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Several emails with the subject <strong>Munin - minion Minion Jobs</strong> and content like this:</p>
<pre><code>opensuse.org :: openqa.opensuse.org :: Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed
WARNINGs: failed is 452.00 (outside range [:400]).
</code></pre>
<p>Looking at <a href="https://openqa.opensuse.org/minion/jobs?state=failed" class="external">https://openqa.opensuse.org/minion/jobs?state=failed</a>, a lot of <a href="https://openqa.opensuse.org/minion/jobs?id=3440404" class="external">obs_rsync_run jobs fail</a>, with failed jobs as recent as 2024-02-11T10:08:17.307669Z:</p>
<pre><code>---
args:
- project: openSUSE:Factory:PowerPC:ToTest
url: https://api.opensuse.org/public/build/openSUSE:Factory:PowerPC:ToTest/_result?package=000product
attempts: 1
children: []
created: 2024-02-11T10:06:07.856414Z
delayed: 2024-02-11T10:06:07.856414Z
expires: ~
finished: 2024-02-11T10:08:17.307669Z
id: 3412364
lax: 0
notes:
gru_id: 19905665
project_lock: 1
parents: []
priority: 100
queue: default
result:
code: 512
message: |-
openSUSE:Factory:PowerPC:ToTest/base/ exit code: 1 (1 failures total so far)
openSUSE:Factory:PowerPC:ToTest/microos/ exit code: 1 (2 failures total so far)
retried: ~
retries: 0
started: 2024-02-11T10:06:07.858866Z
state: failed
task: obs_rsync_run
time: 2024-02-21T12:07:01.731854Z
worker: 1952
</code></pre>
<p>and</p>
<pre><code>---
args:
- project: openSUSE:Factory:LegacyX86:ToTest
url: https://api.opensuse.org/public/build/openSUSE:Factory:LegacyX86:ToTest/_result?package=000product
attempts: 1
children: []
created: 2024-02-09T13:33:44.131117Z
delayed: 2024-02-09T13:33:44.131117Z
expires: ~
finished: 2024-02-09T13:35:39.515968Z
id: 3407299
lax: 0
notes:
gru_id: 19902081
project_lock: 1
parents: []
priority: 100
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2024-02-09T13:33:44.133221Z
state: failed
task: obs_rsync_run
time: 2024-02-21T12:07:01.731854Z
worker: 1950
</code></pre>
<p>as well as</p>
<pre><code>---
args:
- project: openSUSE:Leap:15.6:ToTest
url: https://api.opensuse.org/public/build/openSUSE:Leap:15.6:ToTest/_result?package=000product
attempts: 1
children: []
created: 2024-02-09T01:21:47.455035Z
delayed: 2024-02-09T01:21:47.455035Z
expires: ~
finished: 2024-02-09T01:25:08.329909Z
id: 3404260
lax: 0
notes:
gru_id: 19899816
project_lock: 1
parents: []
priority: 100
queue: default
result:
code: 256
message: No message
retried: ~
retries: 0
started: 2024-02-09T01:21:47.456660Z
state: failed
task: obs_rsync_run
time: 2024-02-21T12:07:01.731854Z
worker: 1950
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
openQA Infrastructure - action #134846 (New): Old NFS share mount is keeping processes stuck and ... | https://progress.opensuse.org/issues/134846 | 2023-08-30T13:17:42Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>On 2023-08-30 many openQA jobs were not picked up for a long time on OSD machines because the machines were still connected to the NFS share from the old OSD and eventually got stuck with some processes in "D" state (uninterruptible sleep).</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Hosts with processes stuck for a long time trigger alerts</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Try to reproduce the problem e.g. by manually making one process stuck in "D"</li>
<li>Add an alert triggering on the above condition</li>
</ul>
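<p>The condition to alert on can be sketched as a small check counting processes in uninterruptible sleep; the metric name is an assumption, and an alert rule could fire when it stays above 0 for several minutes:</p>

```shell
# Hypothetical check for AC1: count processes currently in
# uninterruptible sleep ("D" state, the symptom seen with the stale
# NFS mount) and emit it as a metric-style line.
stuck=$(ps -eo stat= | grep -c '^D') || true
echo "procs_state_blocked=${stuck:-0}"
```

Note that a single momentary "D" state is normal during heavy IO; the alert should require the count to stay non-zero over a sustained window.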
openQA Infrastructure - action #133907 (Workable): Improve monitoring for http(s?) reachable on j... | https://progress.opensuse.org/issues/133907 | 2023-08-07T10:21:52Z | tinita (tina.mueller+trick-redmine@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>There are a few issues with Jenkins:</p>
<ul>
<li>We seem to have been missing builds for at least a day at the time of this writing. See <a href="https://openqa.opensuse.org/group_overview/24" class="external">https://openqa.opensuse.org/group_overview/24</a> (but it may be outdated once you see it, it's not a permalink).</li>
<li><em>DONE</em> <del><a href="http://jenkins.qa.suse.de/view/openQA-in-openQA/" class="external">http://jenkins.qa.suse.de/view/openQA-in-openQA/</a> is refusing the connection.</del> okurz: Fixed the wiki reference and job group description in <a href="https://openqa.opensuse.org/admin/job_templates/24" class="external">https://openqa.opensuse.org/admin/job_templates/24</a></li>
<li>It's unclear if jenkins.qa.suse.de is responsive to pings</li>
</ul>
<p>It's unclear what's going on. We didn't get any alerts, and we don't know if we have proper monitoring for the service.</p>
<p>From the journal for service <code>jenkins.service</code> on the system:</p>
<pre><code>Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110] INFO org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71] INFO org.pircbotx.InputParser#handleLine: PING :irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122] INFO org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
-- Boot d29ffd414ee14afd9e930a7cddfc124b --
Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server...
Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> There's an alert for the Jenkins web interface (HTTP response, not just ping)</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Find out why we didn't get an alert about a failed systemd service</li>
<li>Maybe add a check for <code>systemctl is-system-running</code>? (Likely not very useful.)</li>
<li>Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution
<ul>
<li>At least add a local, not-versioned telegraf extension to look at port 80, e.g. in /etc/telegraf/</li>
</ul></li>
<li>Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)</li>
</ul>
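<p>AC1 asks for an HTTP-level check rather than a ping. A hypothetical probe, which a telegraf exec input or cron job could feed into Grafana, might look like this (the URL is taken from this ticket; the metric name is an assumption):</p>

```shell
# Hypothetical HTTP probe: check that the Jenkins web interface
# actually answers, not just that the host pings. curl prints 000
# when no HTTP response was received at all.
url=http://jenkins.qa.suse.de
code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url") || true
echo "jenkins_http_code=${code:-000}"
```

An alert rule would then fire on any value outside the 2xx/3xx range, which also covers the failed-systemd-service case as long as the web UI is down.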
openQA Project - action #133901 (New): [o3 logreport] DBD::Pg::st execute failed: ERROR: invali... | https://progress.opensuse.org/issues/133901 | 2023-08-07T09:51:27Z | tinita (tina.mueller+trick-redmine@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>From o3 /var/log/openqa:</p>
<pre><code>[2023-08-05T20:39:10.313025Z] [error] [wjDADFtweJVf] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input
syntax for type bigint: "1'"
CONTEXT: unnamed portal parameter $1 = '...' [for Statement "SELECT COUNT( * ) FROM scheduled_products me WHERE ( me.id = ? )" with ParamValues: 1='1''] at
/usr/share/openqa/script/../lib/OpenQA/WebAPI/ServerSideDataTable.pm line 33
[2023-08-05T20:40:04.268615Z] [error] [SXp2NHWv1rW-] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input
syntax for type bigint: "1<script>alert(1)</script>"
CONTEXT: unnamed portal parameter $1 = '...' [for Statement "SELECT COUNT( * ) FROM scheduled_products me WHERE ( me.id = ? )" with ParamValues:
1='1<script>alert(1)</script>'] at /usr/share/openqa/script/../lib/OpenQA/WebAPI/ServerSideDataTable.pm line 33
</code></pre>
<p>Happens with this for example: <a href="https://openqa.opensuse.org/admin/productlog?id=327913lala" class="external">https://openqa.opensuse.org/admin/productlog?id=327913lala</a></p>
<p>There are 4 places where OpenQA::WebAPI::ServerSideDataTable::render_response is used.</p>
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<p><strong>AC1</strong>: Parameters for the mentioned calls are validated</p>
openQA Infrastructure - action #133388 (New): Unavailable developer mode on ow18 | https://progress.opensuse.org/issues/133388 | 2023-07-26T12:59:17Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>From <a href="https://suse.slack.com/archives/C02CANHLANP/p1690375626721579" class="external">https://suse.slack.com/archives/C02CANHLANP/p1690375626721579</a></p>
<blockquote>
<p>(Felix Niederwanger) Also, is the developer mode on OSD currently unavailable?<br>
(Jozef Pupava) it's fw, I guess it's ow18 ?<br>
(Felix Niederwanger) Yep</p>
</blockquote>
<p>So I assume developer mode on that machine is not working. Also, we have not seen any alert about that.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Developer mode works on ow18</li>
<li><strong>AC2:</strong> Developer mode works on all production OSD workers</li>
<li><strong>AC3:</strong> There are alerts about unavailable developer mode prerequisites</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Follow <a href="http://open.qa/docs/#debugdevelmode" class="external">http://open.qa/docs/#debugdevelmode</a> for ow18</li>
<li>Crosscheck for other machines and make an alert about that</li>
</ul>
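<p>A hypothetical crosscheck for AC2/AC3: probe the developer-mode ports of a worker host from the webUI host. The port numbers below are assumptions for illustration; see http://open.qa/docs/#debugdevelmode for the authoritative list per worker instance:</p>

```shell
# Hypothetical port probe from the webUI host: developer mode needs
# the workers' command-server ports reachable, so a blocked firewall
# (the suspicion in this ticket) shows up as "NOT reachable".
host=ow18.qa.suse.de
for port in 20013 20023 20033; do   # assumed example ports, one per instance
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port NOT reachable"
  fi
done
```

Run over all production workers, this would also give the data for the alert asked for in AC3.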
openQA Infrastructure - action #132998 (Workable): [alert] [FIRING:1] openqaworker-arm-3: Memory ... | https://progress.opensuse.org/issues/132998 | 2023-07-19T06:03:04Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?orgId=1&viewPanel=12054&from=1689743130960&to=1689746327640" class="external">https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?orgId=1&viewPanel=12054&from=1689743130960&to=1689746327640</a> and the corresponding alert email.<br>
The graph shows that the system exhausted all available memory.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Measures have been applied to prevent memory exhaustion</li>
<li><strong>AC2</strong>: It's safe to schedule jobs with too high memory requirements</li>
</ul>
<a name="Acceptance-Tests"></a>
<h2 >Acceptance Tests<a href="#Acceptance-Tests" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AT1-1</strong>: A job with QEMURAM=999999999 aborts cleanly without alerts being raised</li>
<li><strong>AT1-2</strong>: A worker without the mitigation kills processes due to memory exhaustion</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Look into logs and the corresponding openQA jobs running on that host to find what exhausted the memory; likely too many, too big openQA jobs</li>
<li>Ask people to not do that!</li>
<li>As necessary, adapt the number of worker instances or introduce different worker classes like "big mem"</li>
<li>As necessary, adapt job scenarios to not overcommit memory</li>
<li>If it is not openQA jobs, look into what else it is</li>
</ul>
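<p>For the first suggestion, a quick sketch of how to spot what exhausted the memory on the host (standard ps usage, nothing openQA-specific):</p>

```shell
# List the biggest memory consumers (RSS in KiB, largest first) to
# identify whether openQA jobs (qemu processes) or something else
# exhausted the memory.
ps -eo pid,rss,comm --sort=-rss | head -n 6
```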
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>Preventing the over-commit in openQA worker, see <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: [spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions ... (Resolved)" href="https://progress.opensuse.org/issues/133511">#133511</a> for this</li>
</ul>
openQA Infrastructure - action #132926 (Workable): OSD cron -> (fetch_openqa_bugs)> /tmp/fetch_op... | https://progress.opensuse.org/issues/132926 | 2023-07-18T07:56:34Z | osukup
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>OSD cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed:</p>
<p>from traceback:</p>
<pre><code>requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/SUSE/ha-sap-terraform-deployments/issues/857 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f7439e43b38>, 'Connection to api.github.com timed out. (connect timeout=10)'))
</code></pre>
<p><code>fetch_openqa_bugs</code> failed when fetching issues from GitHub.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> It is understood why the error occurred</li>
<li><strong>AC2:</strong> The error does not persist</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Make sure you can log in, see <a href="https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/id/openqa-service_qe_suse_de.sls#L11" class="external">https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/id/openqa-service_qe_suse_de.sls#L11</a> or ask dheidler/mkittler to do that for you</li>
<li>Assuming "host unavailable", check how long the scripts retried
<ul>
<li>Re-try more often?</li>
<li>Wait longer between attempts?</li>
</ul></li>
<li><a href="https://github.com/os-autoinst/openqa_bugfetcher" class="external">https://github.com/os-autoinst/openqa_bugfetcher</a></li>
</ul>
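<p>The "re-try more often / wait longer" idea can be sketched as a retry wrapper with backoff around the GitHub API call that timed out; the attempt count and delays are assumptions, and the real fetcher lives in https://github.com/os-autoinst/openqa_bugfetcher:</p>

```shell
# Hypothetical retry-with-backoff loop around the call that hit
# ConnectTimeoutError; real code would wrap the actual issue request.
attempts=5
for i in $(seq 1 "$attempts"); do
  if curl -sf -m 10 https://api.github.com/ > /dev/null; then
    echo "ok after $i attempt(s)"
    break
  fi
  if [ "$i" -eq "$attempts" ]; then
    echo "giving up after $attempts attempts"
  else
    sleep "$i"  # linear backoff between attempts
  fi
done
```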
openQA Infrastructure - action #132380 (New): Multiple empty folders in grafana linked to alerts | https://progress.opensuse.org/issues/132380 | 2023-07-06T06:35:42Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>On monitor.qa.suse.de we seem to have all provisioned dashboards and panels in the "Salt" folder, but alerts are linked to the otherwise empty folders "Generic" and "openQA". See<br>
<img src="https://progress.opensuse.org/attachments/download/15656/Screenshot_20230706_083204_mixed_grafana_groups_generic_openqa_salt.png" alt="Screenshot_20230706_083204_mixed_grafana_groups_generic_openqa_salt.png" loading="lazy" /><br>
for an example.</p>
<p>We should decide whether to put everything provisioned, including alerts, into "Salt" or to sort everything from "Salt" into the other categories.</p>
openQA Infrastructure - action #125141 (Workable): Salt pillars deployment pipeline failed on "tu... | https://progress.opensuse.org/issues/125141 | 2023-02-28T11:17:44Z | mkittler (marius.kittler@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code> ID: security-sensor.repo
Function: pkgrepo.managed
Result: False
Comment: Failed to configure repo 'security-sensor.repo': Zypper command failure: Repository 'security-sensor.repo' is invalid.
[security-sensor.repo|https://download.opensuse.org/repositories/security:/sensor/15.4] Valid metadata not found at specified URL
History:
- Signature verification failed for repomd.xml
- Can't provide /repodata/repomd.xml
Please check if the URIs defined for this repository are pointing to a valid repository.
Skipping repository 'security-sensor.repo' because of the above error.
Could not refresh the repositories because of errors.Forcing raw metadata refresh
Retrieving repository 'security-sensor.repo' metadata [..........
Warning: File 'repomd.xml' from repository 'security-sensor.repo' is unsigned.
Note: Signing data enables the recipient to verify that no modifications occurred after the data
were signed. Accepting data with no, wrong or unknown signature can lead to a corrupted system
and in extreme cases even to a system compromise.
Note: File 'repomd.xml' is the repositories master index file. It ensures the integrity of the
whole repo.
Warning: We can't verify that no one meddled with this file, so it might not be trustworthy
anymore! You should not continue unless you know it's safe.
File 'repomd.xml' from repository 'security-sensor.repo' is unsigned, continue? [yes/no] (no): no
error]
Started: 09:39:50.917365
Duration: 9775.41 ms
Changes:
----------
ID: security-sensor.repo
Function: pkg.latest
Name: velociraptor-client
Result: False
Comment: One or more requisite failed: security_sensor.security-sensor.repo
Started: 09:40:00.699471
Duration: 0.011 ms
Changes:
…
Summary for tumblesle
--------------
Succeeded: 231 (changed=1)
Failed: 2
--------------
Total states run: 233
</code></pre>
<p>(<a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1427053/raw">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1427053/raw</a>)</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Find out what the host "tumblesle" is -> a VM on qamaster.qa.suse.de (according to <a href="https://racktables.suse.de/index.php?page=object&tab=default&object_id=1300">https://racktables.suse.de/index.php?page=object&tab=default&object_id=1300</a>), the full domain is tumblesle.qa.suse.de</li>
<li>Check whether the problem persists -> no, the repo can be refreshed (on tumblesle)</li>
<li>Check whether the error handling (retries) is in accordance with how other repos are configured -> we use <code>pkgrepo.managed: - retry: attempts: 5</code> for our own devel repos, maybe the same would make sense for <code>security:sensor</code> as well; we don't have a retry for all repos configured via <code>pkgrepo.managed</code> so far, though</li>
</ul>
<a name="Remarks"></a>
<h2 >Remarks<a href="#Remarks" class="wiki-anchor">¶</a></h2>
<ul>
<li>Likely not specific to "tumblesle".</li>
<li>Looks like a temporary signing problem of security-sensor.repo (and not like a network issue). <em>DONE</em> So maybe a one-time issue and we don't need to introduce a retry. -> It is reproducible on tumblesle.qa.suse.de with</li>
</ul>
<pre><code>for i in {001..100}; do echo "## $i" && zypper ref --force -r security-sensor.repo; done
</code></pre>
<p>after 23 runs. Directly afterwards, retrieving the file worked again.</p>
<ul>
<li><em>Optional</em> Try to reproduce the above problem in a clean container environment, at best for crosschecking both Leap and Tumbleweed</li>
<li>Based on the above, report an issue to zypper on <a href="https://github.com/openSUSE/zypper/">https://github.com/openSUSE/zypper/</a>, as zypper claims "File is unsigned" which is apparently not true; it's likely a temporary connection issue, so zypper should better retry</li>
<li><em>Optional:</em> Additionally report an issue with the openSUSE infrastructure with a cross-reference</li>
</ul>
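<p>The retry approach mentioned above (as already used for our own devel repos) could look like this as a Salt sketch; the attempts/interval values are assumptions, and the baseurl is taken from the error output:</p>

```yaml
# Hypothetical Salt state fragment: add a state-level retry to the
# security:sensor repo the same way as for the devel repos, so a
# transient metadata/signature hiccup does not fail the highstate.
security-sensor.repo:
  pkgrepo.managed:
    - baseurl: https://download.opensuse.org/repositories/security:/sensor/15.4
    - gpgcheck: 1
    - retry:
        attempts: 5   # assumption, matching the devel-repo setting
        interval: 30
```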
openQA Infrastructure - action #68633 (New): alert if there is no worker active for any existent ... | https://progress.opensuse.org/issues/68633 | 2020-07-04T07:55:24Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: [sle][s390x][infrastructure][hard] set up dedicated z/VM for (open)QA on our new storage system (Resolved)" href="https://progress.opensuse.org/issues/33127#note-28">#33127#note-28</a>. Every "machine" in openQA should have at least one worker instance with a matching <code>WORKER_CLASS</code> to be able to execute tests; otherwise tests are stuck in the scheduled state forever. We could have monitoring that alerts about this. Alternative: fail or incomplete such tests automatically after a configured time.</p>
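<p>A hypothetical SQL sketch of such a check against the openQA database; all table and column names are assumptions about the schema, and since WORKER_CLASS values can be comma-separated lists, a real check would have to split them before matching:</p>

```sql
-- Hypothetical query sketch: machines whose WORKER_CLASS has no
-- matching worker property at all, i.e. candidates for "tests will
-- stay scheduled forever". Schema names are assumptions.
SELECT m.name, ms.value AS worker_class
  FROM machines m
  JOIN machine_settings ms
    ON ms.machine_id = m.id AND ms.key = 'WORKER_CLASS'
 WHERE ms.value NOT IN (
       SELECT wp.value
         FROM worker_properties wp
        WHERE wp.key = 'WORKER_CLASS');
```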
openQA Infrastructure - action #59621 (New): osd: Sporadically high CPU and IO load (vdd), grafan... | https://progress.opensuse.org/issues/59621 | 2019-11-14T12:29:12Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000" class="external">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000</a><br>
shows alerting CPU usage and <a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000" class="external">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000</a> shows "Disk I/O time for /dev/vdd" alerting.</p>
<p>from chat:</p>
<p>All storage comes from netapp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO; that's why we got the alerts. apache was roughly writing at ~100MB/s which is not that fast… the highest I saw in htop was 10MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the netapp. If this is really our bottleneck we might need to invest into separate hardware (not strictly speaking about a separate server for OSD).</p>
openQA Infrastructure - action #55316 (New): monitoring alerts for too long running database queries | https://progress.opensuse.org/issues/55316 | 2019-08-09T13:12:53Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a class="issue tracker-4 status-6 priority-4 priority-default closed" title="action: long running (minutes) postgres SELECT calls on osd (Rejected)" href="https://progress.opensuse.org/issues/55313">#55313</a> showed us some time in the past that eventually we might be hit by database queries that run too long. As postgres <a href="https://www.cybertec-postgresql.com/en/3-ways-to-detect-slow-queries-in-postgresql/" class="external">can detect long running queries itself</a>, why not use that?</p>
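<p>A hedged sketch of what using postgres' own detection could look like; the 300s/5-minute thresholds are assumptions and would need tuning for OSD:</p>

```sql
-- Option 1: log every statement exceeding a threshold server-side,
-- so slow queries show up in the postgres log for alerting.
ALTER SYSTEM SET log_min_duration_statement = '300s';  -- threshold is an assumption
SELECT pg_reload_conf();

-- Option 2: poll currently running long queries via pg_stat_activity,
-- e.g. from a telegraf input feeding a Grafana alert.
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
  FROM pg_stat_activity
 WHERE state <> 'idle'
   AND now() - query_start > interval '5 minutes'
 ORDER BY runtime DESC;
```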