openSUSE Project Management Tool: Issues
https://progress.opensuse.org/ (2024-02-29T11:21:32Z)
Redmine: openQA Infrastructure - action #156322 (Blocked): zabbix-proxy.dmz-prg2.suse.org not reachable fr...
https://progress.opensuse.org/issues/156322 (2024-02-29T11:21:32Z, jbaier_cz &lt;jbaier@suse.cz&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Zabbix proxy is not reachable from ariel, hence the monitoring of that host is not working at all.</p>
<p>Error message from zabbix frontend: <code>Received empty response from Zabbix Agent at [10.150.1.11]. Assuming that agent dropped connection because of access permissions.</code></p>
<pre><code>new-ariel # ping -c3 zabbix-proxy.dmz-prg2.suse.org
PING zabbix-proxy.dmz-prg2.suse.org (10.150.1.22) 56(84) bytes of data.
From ariel.suse-dmz.opensuse.org (10.150.1.11) icmp_seq=1 Destination Host Unreachable
From ariel.suse-dmz.opensuse.org (10.150.1.11) icmp_seq=2 Destination Host Unreachable
From ariel.suse-dmz.opensuse.org (10.150.1.11) icmp_seq=3 Destination Host Unreachable
--- zabbix-proxy.dmz-prg2.suse.org ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2045ms
</code></pre>

openQA Infrastructure - action #155743 (Blocked): OBSRSync fails to sync openSUSE:Factory:PowerPC...
https://progress.opensuse.org/issues/155743 (2024-02-21T12:07:21Z, livdywan &lt;liv.dywan@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Several emails with the subject <strong>Munin - minion Minion Jobs</strong> and content like this:</p>
<pre><code>opensuse.org :: openqa.opensuse.org :: Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed
WARNINGs: failed is 452.00 (outside range [:400]).
</code></pre>
<p>Looking at <a href="https://openqa.opensuse.org/minion/jobs?state=failed" class="external">https://openqa.opensuse.org/minion/jobs?state=failed</a> reveals that a lot of <a href="https://openqa.opensuse.org/minion/jobs?id=3440404" class="external">obs_rsync_run jobs fail</a>, with failed jobs as recent as 2024-02-11T10:08:17.307669Z:</p>
<pre><code>---
args:
- project: openSUSE:Factory:PowerPC:ToTest
url: https://api.opensuse.org/public/build/openSUSE:Factory:PowerPC:ToTest/_result?package=000product
attempts: 1
children: []
created: 2024-02-11T10:06:07.856414Z
delayed: 2024-02-11T10:06:07.856414Z
expires: ~
finished: 2024-02-11T10:08:17.307669Z
id: 3412364
lax: 0
notes:
gru_id: 19905665
project_lock: 1
parents: []
priority: 100
queue: default
result:
code: 512
message: |-
openSUSE:Factory:PowerPC:ToTest/base/ exit code: 1 (1 failures total so far)
openSUSE:Factory:PowerPC:ToTest/microos/ exit code: 1 (2 failures total so far)
retried: ~
retries: 0
started: 2024-02-11T10:06:07.858866Z
state: failed
task: obs_rsync_run
time: 2024-02-21T12:07:01.731854Z
worker: 1952
</code></pre>
<p>and</p>
<pre><code>---
args:
- project: openSUSE:Factory:LegacyX86:ToTest
url: https://api.opensuse.org/public/build/openSUSE:Factory:LegacyX86:ToTest/_result?package=000product
attempts: 1
children: []
created: 2024-02-09T13:33:44.131117Z
delayed: 2024-02-09T13:33:44.131117Z
expires: ~
finished: 2024-02-09T13:35:39.515968Z
id: 3407299
lax: 0
notes:
gru_id: 19902081
project_lock: 1
parents: []
priority: 100
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2024-02-09T13:33:44.133221Z
state: failed
task: obs_rsync_run
time: 2024-02-21T12:07:01.731854Z
worker: 1950
</code></pre>
<p>as well as</p>
<pre><code>---
args:
- project: openSUSE:Leap:15.6:ToTest
url: https://api.opensuse.org/public/build/openSUSE:Leap:15.6:ToTest/_result?package=000product
attempts: 1
children: []
created: 2024-02-09T01:21:47.455035Z
delayed: 2024-02-09T01:21:47.455035Z
expires: ~
finished: 2024-02-09T01:25:08.329909Z
id: 3404260
lax: 0
notes:
gru_id: 19899816
project_lock: 1
parents: []
priority: 100
queue: default
result:
code: 256
message: No message
retried: ~
retries: 0
started: 2024-02-09T01:21:47.456660Z
state: failed
task: obs_rsync_run
time: 2024-02-21T12:07:01.731854Z
worker: 1950
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li></li>
</ul>
openQA Infrastructure - action #134846 (New): Old NFS share mount is keeping processes stuck and ...
https://progress.opensuse.org/issues/134846 (2023-08-30T13:17:42Z, okurz &lt;okurz@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>On 2023-08-30 many openQA jobs were not picked up for a long time on OSD machines because the machines were still connected to the NFS share of the old OSD and eventually got stuck with some processes in "D" state (uninterruptible sleep).</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Hosts with stuck processes for long trigger alerts</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Try to reproduce the problem e.g. by manually making one process stuck in "D"</li>
<li>Add an alert triggering on the above condition</li>
</ul>
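<p>The detection side of the suggested alert could be sketched by scanning <code>/proc</code> for processes in "D" state, e.g. fed into telegraf via an exec input; this is a hedged sketch, the alert wiring itself and any threshold are assumptions:</p>

```python
import os

def proc_state(stat_line: str) -> str:
    # In /proc/<pid>/stat the state is the field right after the
    # parenthesised command name; split on the *last* ')' because the
    # command name itself may contain parentheses.
    return stat_line.rsplit(')', 1)[1].split()[0]

def d_state_pids():
    """PIDs currently in 'D' (uninterruptible sleep) state."""
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % entry) as f:
                if proc_state(f.read()) == 'D':
                    pids.append(int(entry))
        except OSError:
            continue  # process exited in the meantime
    return pids
```

An alert could then fire when <code>len(d_state_pids())</code> stays above some threshold for several minutes, which also covers the manual reproduction suggested above.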
openQA Infrastructure - action #133907 (Workable): Improve monitoring for http(s?) reachable on j...
https://progress.opensuse.org/issues/133907 (2023-08-07T10:21:52Z, tinita &lt;tina.mueller+trick-redmine@suse.com&gt;)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>There are a few issues with Jenkins:</p>
<ul>
<li>We seem to have been missing builds for at least a day at the time of this writing. See <a href="https://openqa.opensuse.org/group_overview/24" class="external">https://openqa.opensuse.org/group_overview/24</a> (but it may be outdated once you see it, it's not a permalink).</li>
<li><em>DONE</em> <del><a href="http://jenkins.qa.suse.de/view/openQA-in-openQA/" class="external">http://jenkins.qa.suse.de/view/openQA-in-openQA/</a> is refusing the connection.</del> okurz: Fixed the wiki reference and job group description in <a href="https://openqa.opensuse.org/admin/job_templates/24" class="external">https://openqa.opensuse.org/admin/job_templates/24</a></li>
<li>It's unclear if jenkins.qa.suse.de is responsive to pings</li>
</ul>
<p>It's unclear what's going on. We didn't get any alerts, and we don't know if we have proper monitoring for the service.</p>
<p>From the journal for service <code>jenkins.service</code> on the system:</p>
<pre><code>Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110] INFO org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71] INFO org.pircbotx.InputParser#handleLine: PING :irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122] INFO org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
-- Boot d29ffd414ee14afd9e930a7cddfc124b --
Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server...
Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> There's an alert for the Jenkins web interface (HTTP response, not just ping)</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Find out why we didn't get an alert about a failed systemd service</li>
<li>Maybe add a check for <code>systemctl is-system-running</code>? (Likely not very useful.)</li>
<li>Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution
<ul>
<li>At least add a local, not-versioned telegraf extension to look at port 80, e.g. in /etc/telegraf/</li>
</ul></li>
<li>Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)</li>
</ul>
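<p>The suggested local telegraf extension could look roughly like the following drop-in. This is a sketch only: the file path is an assumption, and option names may differ between telegraf versions:</p>

```toml
# /etc/telegraf/telegraf.d/jenkins-http.conf (hypothetical path)
[[inputs.http_response]]
  urls = ["http://jenkins.qa.suse.de/"]
  response_timeout = "10s"
  method = "GET"
```

A Grafana alert would then fire on <code>http_response_code != 200</code> or on missing data points, covering both a down service and an unreachable host, which a plain ping check would not.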
openQA Project - action #133901 (New): [ o3 logreport] DBD::Pg::st execute failed: ERROR: invali...
https://progress.opensuse.org/issues/133901 (2023-08-07T09:51:27Z, tinita &lt;tina.mueller+trick-redmine@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>From o3 /var/log/openqa:</p>
<pre><code>[2023-08-05T20:39:10.313025Z] [error] [wjDADFtweJVf] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input
syntax for type bigint: "1'"
CONTEXT: unnamed portal parameter $1 = '...' [for Statement "SELECT COUNT( * ) FROM scheduled_products me WHERE ( me.id = ? )" with ParamValues: 1='1''] at
/usr/share/openqa/script/../lib/OpenQA/WebAPI/ServerSideDataTable.pm line 33
[2023-08-05T20:40:04.268615Z] [error] [SXp2NHWv1rW-] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input
syntax for type bigint: "1<script>alert(1)</script>"
CONTEXT: unnamed portal parameter $1 = '...' [for Statement "SELECT COUNT( * ) FROM scheduled_products me WHERE ( me.id = ? )" with ParamValues:
1='1<script>alert(1)</script>'] at /usr/share/openqa/script/../lib/OpenQA/WebAPI/ServerSideDataTable.pm line 33
</code></pre>
<p>This happens, for example, with <a href="https://openqa.opensuse.org/admin/productlog?id=327913lala" class="external">https://openqa.opensuse.org/admin/productlog?id=327913lala</a></p>
<p>There are 4 places where OpenQA::WebAPI::ServerSideDataTable::render_response is used.</p>
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<p><strong>AC1</strong>: Parameters for the mentioned calls are validated</p>
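<p>For illustration, the kind of validation AC1 asks for boils down to rejecting anything that is not a plain integer before it reaches the SQL layer. The actual code in OpenQA::WebAPI::ServerSideDataTable is Perl; the function below is a made-up Python sketch of the same check:</p>

```python
def parse_id(raw):
    """Return the id as int, or None if it is not a plain non-negative integer."""
    if isinstance(raw, str) and raw.isdigit():
        return int(raw)
    return None

# Inputs like "327913lala", "1'" or "1<script>alert(1)</script>" are
# rejected up front, so the DBD::Pg bigint error above can no longer occur.
```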
openQA Infrastructure - action #133388 (New): Unavailable developer mode on ow18
https://progress.opensuse.org/issues/133388 (2023-07-26T12:59:17Z, okurz &lt;okurz@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>From <a href="https://suse.slack.com/archives/C02CANHLANP/p1690375626721579" class="external">https://suse.slack.com/archives/C02CANHLANP/p1690375626721579</a></p>
<blockquote>
<p>(Felix Niederwanger) Also, is the developer mode on OSD currently unavailable?<br>
(Jozef Pupava) it's fw, I guess it's ow18 ?<br>
(Felix Niederwanger) Yep</p>
</blockquote>
<p>so I assume developer mode on that machine is not working. Also, we have not seen any alert about that.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Developer mode works on ow18</li>
<li><strong>AC2:</strong> Developer mode works on all production OSD workers</li>
<li><strong>AC3:</strong> There are alerts about unavailable developer mode prerequisites</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Follow <a href="http://open.qa/docs/#debugdevelmode" class="external">http://open.qa/docs/#debugdevelmode</a> for ow18</li>
<li>Crosscheck for other machines and make an alert about that</li>
</ul>
openQA Infrastructure - action #132998 (Workable): [alert] [FIRING:1] openqaworker-arm-3: Memory ...
https://progress.opensuse.org/issues/132998 (2023-07-19T06:03:04Z, okurz &lt;okurz@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?orgId=1&viewPanel=12054&from=1689743130960&to=1689746327640" class="external">https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?orgId=1&viewPanel=12054&from=1689743130960&to=1689746327640</a> and according email.<br>
The graph shows that the system exhausted all available memory.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Measures have been applied to prevent memory exhaustion</li>
<li><strong>AC2</strong>: It's safe to schedule jobs with excessive memory requirements, i.e. such jobs abort cleanly instead of exhausting the worker</li>
</ul>
<a name="Acceptance-Tests"></a>
<h2 >Acceptance Tests<a href="#Acceptance-Tests" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AT1-1</strong>: A job with QEMURAM=999999999 aborts cleanly without alerts being raised</li>
<li><strong>AT1-2</strong>: A worker without the mitigation kills processes due to memory exhaustion</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Look into the logs and the corresponding openQA jobs running on that host to find out what exhausted the memory, likely too many or too big openQA jobs</li>
<li>Ask people to not do that!</li>
<li>As necessary adapt number of worker instances or different worker classes like "big mem"</li>
<li>As necessary adapt job scenarios to not overcommit</li>
<li>If it is not openQA jobs then look into what else it is</li>
</ul>
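<p>One possible mitigation, sketched as a systemd drop-in that caps memory per worker instance so the kernel OOM-kills only the offending job's cgroup instead of the whole host. The unit name may differ depending on the worker packaging, and the limit value is an assumption that would need per-host tuning:</p>

```ini
# /etc/systemd/system/openqa-worker@.service.d/30-memory-limit.conf
# (hypothetical drop-in path; limit value is an assumption)
[Service]
MemoryMax=32G
```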
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>Preventing the over-commit in openQA worker, see <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: [spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions ... (Resolved)" href="https://progress.opensuse.org/issues/133511">#133511</a> for this</li>
</ul>
openQA Infrastructure - action #132926 (Workable): OSD cron -> (fetch_openqa_bugs)> /tmp/fetch_op...
https://progress.opensuse.org/issues/132926 (2023-07-18T07:56:34Z, osukup)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>OSD cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed:</p>
<p>from traceback:</p>
<pre><code>requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/SUSE/ha-sap-terraform-deployments/issues/857 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f7439e43b38>, 'Connection to api.github.com timed out. (connect timeout=10)'))
</code></pre>
<p>fetch_openqa_bugs failed when fetching issues from GitHub.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> It is understood why the error occurred</li>
<li><strong>AC2:</strong> The error does not persist</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Make sure you can log in, see <a href="https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/id/openqa-service_qe_suse_de.sls#L11" class="external">https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/id/openqa-service_qe_suse_de.sls#L11</a> or ask dheidler/mkittler to do that for you</li>
<li>Assuming "host unavailable", check how long the script retried
<ul>
<li>Re-try more often?</li>
<li>Wait longer between attempts?</li>
</ul></li>
<li><a href="https://github.com/os-autoinst/openqa_bugfetcher" class="external">https://github.com/os-autoinst/openqa_bugfetcher</a></li>
</ul>
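<p>The "retry more often" and "wait longer" suggestions can be combined into exponential backoff around the GitHub requests. A hedged sketch; <code>fetch</code> stands in for whatever requests call the script actually makes, and the attempt count and delays are assumptions:</p>

```python
import random
import time

def fetch_with_retries(fetch, attempts=5, base_delay=2.0):
    """Call fetch() until it succeeds; re-raise the last error after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise
            # exponential backoff with a little jitter: ~2s, ~4s, ~8s, ...
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random() * base_delay)
```

With a 10 s connect timeout per attempt, as in the traceback above, five attempts would tolerate transient unavailability of api.github.com for roughly a minute.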
openQA Infrastructure - action #132380 (New): Multiple empty folders in grafana linked to alerts
https://progress.opensuse.org/issues/132380 (2023-07-06T06:35:42Z, okurz &lt;okurz@suse.com&gt;)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In monitor.qa.suse.de we seem to have all provisioned dashboards and panels in the "Salt" folder but alerts are linked to otherwise empty folders "Generic" and "openQA". See<br>
<img src="https://progress.opensuse.org/attachments/download/15656/Screenshot_20230706_083204_mixed_grafana_groups_generic_openqa_salt.png" alt="Screenshot_20230706_083204_mixed_grafana_groups_generic_openqa_salt.png" loading="lazy" /><br>
for an example.</p>
<p>We should decide if we put everything provisioned including alerts into "salt" or sort everything from "salt" into the other categories.</p>
openQA Infrastructure - action #125141 (Workable): Salt pillars deployment pipeline failed on "tu...
https://progress.opensuse.org/issues/125141 (2023-02-28T11:17:44Z, mkittler &lt;marius.kittler@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code> ID: security-sensor.repo
Function: pkgrepo.managed
Result: False
Comment: Failed to configure repo 'security-sensor.repo': Zypper command failure: Repository 'security-sensor.repo' is invalid.
[security-sensor.repo|https://download.opensuse.org/repositories/security:/sensor/15.4] Valid metadata not found at specified URL
History:
- Signature verification failed for repomd.xml
- Can't provide /repodata/repomd.xml
Please check if the URIs defined for this repository are pointing to a valid repository.
Skipping repository 'security-sensor.repo' because of the above error.
Could not refresh the repositories because of errors.Forcing raw metadata refresh
Retrieving repository 'security-sensor.repo' metadata [..........
Warning: File 'repomd.xml' from repository 'security-sensor.repo' is unsigned.
Note: Signing data enables the recipient to verify that no modifications occurred after the data
were signed. Accepting data with no, wrong or unknown signature can lead to a corrupted system
and in extreme cases even to a system compromise.
Note: File 'repomd.xml' is the repositories master index file. It ensures the integrity of the
whole repo.
Warning: We can't verify that no one meddled with this file, so it might not be trustworthy
anymore! You should not continue unless you know it's safe.
File 'repomd.xml' from repository 'security-sensor.repo' is unsigned, continue? [yes/no] (no): no
error]
Started: 09:39:50.917365
Duration: 9775.41 ms
Changes:
----------
ID: security-sensor.repo
Function: pkg.latest
Name: velociraptor-client
Result: False
Comment: One or more requisite failed: security_sensor.security-sensor.repo
Started: 09:40:00.699471
Duration: 0.011 ms
Changes:
…
Summary for tumblesle
--------------
Succeeded: 231 (changed=1)
Failed: 2
--------------
Total states run: 233
</code></pre>
<p>(<a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1427053/raw">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1427053/raw</a>)</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Find out what the host "tumblesle" is -> a VM on qamaster.qa.suse.de (according to <a href="https://racktables.suse.de/index.php?page=object&tab=default&object_id=1300">https://racktables.suse.de/index.php?page=object&tab=default&object_id=1300</a>), the full domain is tumblesle.qa.suse.de</li>
<li>Check whether the problem persists -> no the repo can be refreshed (on tumblesle)</li>
<li>Check whether the error handling (retries) is in accordance with how other repos are configured -> we use <code>pkgrepo.managed: - retry: attempts: 5</code> for our own devel repos, maybe the same would make sense for <code>security:sensor</code> as well; we don't have a retry for all repos configured via <code>pkgrepo.managed</code> so far, though</li>
</ul>
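<p>Applying the same retry we already use for the devel repos to <code>security:sensor</code> would look roughly like the following salt state. The shape is a sketch only; the real state file, repo name and URL layout may differ:</p>

```yaml
security-sensor:
  pkgrepo.managed:
    - name: security-sensor.repo
    - humanname: security:sensor
    - baseurl: https://download.opensuse.org/repositories/security:/sensor/15.4
    - gpgcheck: 1
    - retry:          # global salt state option, as for our devel repos
        attempts: 5
        interval: 10
```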
<a name="Remarks"></a>
<h2 >Remarks<a href="#Remarks" class="wiki-anchor">¶</a></h2>
<ul>
<li>Likely not specific to "tumblesle".</li>
<li>Looks like a temporary signing problem of security-sensor.repo (and not like a network issue). <em>DONE</em> So maybe a one-time issue and we don't need to introduce a retry. -> It is reproducible on tumblesle.qa.suse.de with</li>
</ul>
<pre><code>for i in {001..100}; do echo "## $i" && zypper ref --force -r security-sensor.repo; done
</code></pre>
<p>after 23 runs. Directly afterwards, retrieving the file worked again.</p>
<ul>
<li><em>Optional</em> Try to reproduce the above problem in a clean container environment, at best for crosschecking both Leap and Tumbleweed</li>
<li>Based on the above, report an issue to zypper on <a href="https://github.com/openSUSE/zypper/">https://github.com/openSUSE/zypper/</a>, as zypper claims "File is unsigned", which is apparently not true; it's likely a temporary connection issue, so zypper should better retry</li>
<li><em>Optional:</em> Additionally report an issue with the openSUSE infrastructure with a cross-reference</li>
</ul>
openQA Project - action #89560 (Workable): Add alert for blocked gitlab account when users are un...
https://progress.opensuse.org/issues/89560 (2021-03-05T12:25:25Z, okurz &lt;okurz@suse.com&gt;)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: Failed to commit needles, gitlab account blocked 2021-02-24 (Resolved)" href="https://progress.opensuse.org/issues/89047">#89047</a>. As discussed in the retrospective of 2021-03-05, we had <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: Failed to commit needles, gitlab account blocked 2021-02-24 (Resolved)" href="https://progress.opensuse.org/issues/89047">#89047</a> based on a user report, not an automatic alert, which we should always have first.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Alert exists for "blocked openqa-pusher gitlab account"</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Look into existing minion related alerts</li>
<li>Crosscheck existing alert levels</li>
<li>Potentially extend the openQA influxdb API endpoint to more explicitly tell about the types of minion jobs to fail</li>
<li>Ensure there is a grafana alert working based on above data to alert us if the gitlab needle openqa-pusher account would be blocked again</li>
</ul>
openQA Project - action #70774 (New): save_needle Minion tasks fail frequently
https://progress.opensuse.org/issues/70774 (2020-09-01T12:17:04Z, mkittler &lt;marius.kittler@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>The <code>save_needle</code> Minion task fails frequently on OSD and also sometimes on o3.</p>
<p>This can be observed using the following query parameters: <a href="https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=save_needle" class="external">https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=save_needle</a><br>
I'm going to remove most of these jobs to calm down the alert, but right now 24 jobs have piled up over two months. However, the problem has actually existed for longer than two months; the failures have just been manually cleaned up so far.</p>
<p>The problem here is always that the Git working tree is in a state which can not be handled by the task:</p>
<p>1.</p>
<pre><code> "result" => {
"error" => "<strong>Failed to save addon_products-module-dev-tools-pvm-20200805.</strong><br><pre>Unable to commit via Git: On branch master\nYour branch is up to date with 'origin/master'.\n\nnothing to commit, working tree clean</pre>"
},
</code></pre>
<p>2.</p>
<pre><code> "result" => {
"error" => "<strong>Failed to save manually_add_profile-AppArmor-Chose-a-program-to-generate-a-profile-20200827.</strong><br><pre>Unable to reset repository to origin/master: error: cannot rebase: Your index contains uncommitted changes.\nerror: Please commit or stash them.</pre>"
},
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<p>It would be useful if the task could handle the problematic situations itself instead of requiring manual intervention. Note that the <code>delete_needle</code> task (which shares the same Git code) is also affected. We likely have fewer problems there because that task is not executed as often.</p>
<a name="Problematic-situations"></a>
<h2 >Problematic situations<a href="#Problematic-situations" class="wiki-anchor">¶</a></h2>
<ol>
<li>No diff has been produced which could be committed: Maybe that's simply when there's no actual change and we can simply return early in that case.</li>
<li>The Git directory contains uncommitted changes: We could save these changes on a new branch before rebasing.</li>
<li>We can not push the new commit because in the meantime new commits have been pushed to the remote from elsewhere: Just repeat the procedure.</li>
<li>The fetch needles script is interfering.</li>
</ol>
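<p>Situation 2 could be handled by parking the uncommitted changes on a branch before resetting. A hedged sketch only; the real task code is Perl, the helper and branch names below are made up, and a real implementation would need a unique branch name per rescue:</p>

```python
import subprocess

def git(args, cwd):
    # thin helper around the git CLI; raises on non-zero exit
    return subprocess.run(['git'] + args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

def rescue_dirty_tree(repo):
    """If the working tree is dirty, commit everything to a rescue branch
    and return to the previous branch with a clean tree."""
    if git(['status', '--porcelain'], repo).strip():
        git(['checkout', '-b', 'needle-rescue'], repo)
        git(['add', '-A'], repo)
        git(['commit', '-m', 'rescue uncommitted needle changes'], repo)
        git(['checkout', '-'], repo)
```

After this, the existing <code>reset --hard origin/master</code> step no longer hits "Your index contains uncommitted changes", and nothing is silently lost.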
openQA Infrastructure - action #68633 (New): alert if there is no worker active for any existant ...
https://progress.opensuse.org/issues/68633 (2020-07-04T07:55:24Z, okurz &lt;okurz@suse.com&gt;)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: [sle][s390x][infrastructure][hard] set up dedicated z/VM for (open)QA on our new storage system (Resolved)" href="https://progress.opensuse.org/issues/33127#note-28">#33127#note-28</a>. Every "machine" in openQA should have at least one worker instance with a matching <code>WORKER_CLASS</code> to be able to execute tests; otherwise tests are stuck in the scheduled state forever. We could have monitoring that alerts about this. Alternative: fail or incomplete such tests automatically after a configured time.</p>
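<p>The check itself boils down to a set comparison between the <code>WORKER_CLASS</code> values that machine definitions require and those that registered workers provide. Collecting that data from the openQA API is left out here; this hedged sketch shows only the comparison:</p>

```python
def machines_without_workers(machine_classes, worker_class_settings):
    """machine_classes: the WORKER_CLASS value per machine definition.
    worker_class_settings: one comma-separated WORKER_CLASS string per
    worker instance. Returns the classes no worker provides."""
    provided = set()
    for setting in worker_class_settings:
        provided.update(c.strip() for c in setting.split(','))
    return {m for m in machine_classes if m not in provided}
```

Anything returned by this function is a machine whose scheduled tests can never be picked up, which is exactly what the alert should report.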
openQA Infrastructure - action #59621 (New): osd: Sporadically high CPU and IO load (vdd), grafan...
https://progress.opensuse.org/issues/59621 (2019-11-14T12:29:12Z, okurz &lt;okurz@suse.com&gt;)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000" class="external">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000</a><br>
shows alerting CPU usage and <a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000" class="external">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000</a> shows "Disk I/O time for /dev/vdd" alerting.</p>
<p>from chat:</p>
<p>all storage comes from netapp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO. That's why we got the alerts. apache was roughly writing at ~100MB/s which is not that fast… The highest I saw in htop was 10MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the netapp. If this is really our bottleneck we might need to invest into separate hardware (not strictly speaking about a separate server for OSD).</p>
openQA Infrastructure - action #55316 (New): monitoring alerts for too long running database queries
https://progress.opensuse.org/issues/55316 (2019-08-09T13:12:53Z, okurz &lt;okurz@suse.com&gt;)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a class="issue tracker-4 status-6 priority-4 priority-default closed" title="action: long running (minutes) postgres SELECT calls on osd (Rejected)" href="https://progress.opensuse.org/issues/55313">#55313</a> showed us some time in the past that eventually we might be hit by database queries that run too long. As postgres <a href="https://www.cybertec-postgresql.com/en/3-ways-to-detect-slow-queries-in-postgresql/" class="external">can detect long running queries itself</a>, why not use that?</p>
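<p>The built-in detection referenced above is mostly a configuration matter; the threshold below is an assumption and would need tuning for OSD:</p>

```ini
# postgresql.conf: log every statement that runs longer than 60 s
# (value is in milliseconds); log-based monitoring can then alert on it
log_min_duration_statement = 60000
```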