openSUSE Project Management Tool: Issues
https://progress.opensuse.org/ | 2024-03-19T12:18:07Z
openQA Infrastructure - action #157528 (Workable): Remove redundant ASM connections for powerPC m...
https://progress.opensuse.org/issues/157528 | 2024-03-19T12:18:07Z | nicksinger (nsinger@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>Our current hypothesis is that the PPC HMC struggles with two simultaneous connections to the ASM. This causes the managed system to "flicker" in the web UI and to constantly abort any operation you execute. We should explore whether these connection issues can be resolved by keeping only a single connection between ASM and HMC.</p>
<p>Machines where this happens:</p>
<ul>
<li>soapberry</li>
<li>blackcurrant</li>
</ul>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> <a href="https://powerhmc1.oqa.prg2.suse.org/" class="external">https://powerhmc1.oqa.prg2.suse.org/</a> no longer shows machines flickering between "No connection" and "operating"</li>
<li><strong>AC2:</strong> racktables is up-to-date</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Research upstream what IBM suggests. We assume it is not an intended setup to connect more than one physical network connection to the same HMC</li>
<li>Create an infra ticket according to <a href="https://progress.opensuse.org/projects/qa/wiki/Tools#SUSE-IT-ticket-handling" class="external">https://progress.opensuse.org/projects/qa/wiki/Tools#SUSE-IT-ticket-handling</a> asking to remove the secondary, redundant network connection. Ideally the cable is physically removed and racktables updated rather than just disabling the switch port, so that nobody tries to "fix" a disabled switch port some months later</li>
<li>Ensure that machines are still controllable over the HMC after the cable removal (see the sketch after this list)</li>
<li>Ensure that racktables is up-to-date with the remaining connection</li>
</ul>
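<p>To verify AC1 and the last two suggestions after the cable has been pulled, the managed-system state can be polled from the HMC CLI. A minimal sketch; the SSH user and direct SSH access to powerhmc1 are assumptions:</p>
<pre><code># lssyscfg is the standard HMC CLI; hscroot and direct SSH access are assumptions.
ssh hscroot@powerhmc1.oqa.prg2.suse.org \
    "lssyscfg -r sys -F name,state" | grep -E 'soapberry|blackcurrant'
# Repeat a few times: a healthy setup should consistently report "Operating"
# instead of flapping to "No Connection".
</code></pre>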
openQA Infrastructure - action #155689 (Resolved): bot-ng pipelines fail to schedule incidents
https://progress.opensuse.org/issues/155689 | 2024-02-20T11:05:17Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Since Feb 19, 2024 6:03pm GMT+0100 our pipelines in bot-ng fail at the step "schedule incidents": <a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs</a>, e.g. <a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2292982" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2292982</a>, which is the first failing job I could find.</p>
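<p>To find the first failing run programmatically (instead of scrolling the job list), the GitLab jobs API can be queried. A rough sketch; the access token, the URL-encoded project path and the exact job name are assumptions:</p>
<pre><code># Needs a personal access token with read_api scope.
TOKEN=...                          # placeholder
PROJECT="qa-maintenance%2Fbot-ng"  # URL-encoded project path (assumption)
curl -sS --header "PRIVATE-TOKEN: $TOKEN" \
  "https://gitlab.suse.de/api/v4/projects/$PROJECT/jobs?scope[]=failed&per_page=100" \
  | jq -r '.[] | select(.name == "schedule incidents") | "\(.created_at) \(.web_url)"'
</code></pre>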
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1</strong>: Pipelines do work again (complete the job "schedule incidents")</li>
</ul>
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<ul>
<li>Re-activate the pipeline schedules: <a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules</a></li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check if something changed recently, either in bot-ng itself or in its environment</li>
<li>Read the logs and try to understand if the issues are the same and if/how we can fix them</li>
</ul>
openQA Infrastructure - action #153328 (Resolved): jenkins fails in submit-openQA-TW-to-oS_Fctry,...
https://progress.opensuse.org/issues/153328 | 2024-01-10T09:32:23Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="http://jenkins.qa.suse.de/job/submit-openQA-TW-to-oS_Fctry/975" class="external">http://jenkins.qa.suse.de/job/submit-openQA-TW-to-oS_Fctry/975</a> fails with:</p>
<pre><code>+ curl -sS 'https://api.opensuse.org/public/build/devel:openQA:tested/_result?repository=openSUSE_Factory&package=os-autoinst'
+ grep -e '\(unknown\|blocked\|scheduled\|dispatching\|building\|signing\|finished\)'
curl: (28) Failed to connect to api.opensuse.org port 443 after 129973 ms: Couldn't connect to server
…
+ osc service wait devel:openQA os-autoinst-distri-opensuse-deps
Server returned an error: HTTP Error 400: Bad Request
The service for project 'devel:openQA' package 'os-autoinst-distri-opensuse-deps' failed
service error: '
+ rm -rf /tmp/os-autoinst-obs-auto-submit-vn8K
</code></pre>
<p>The error message unfortunately doesn't tell us much; maybe this can be improved for the future. Error 400 indicates we're using the API wrong: did something change that we didn't notice?</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check the jenkins job to see whether the error output can be improved</li>
<li>Check whether something changed on the OBS side so that we need to update the client or adapt our script(s)</li>
<li>Run the failing command manually and find out whether it is reproducible or sporadic
<ul>
<li>It works locally on TW and Leap 15.5</li>
</ul></li>
<li>jenkins.qe.nue2.suse.org has the same osc package version as other hosts, so it's not an outdated package</li>
<li>Ask for help in OBS related channels</li>
<li>Improve the error handling in the pipeline script (see the sketch after this list)</li>
</ul>
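<p>For the last suggestion, a small retry wrapper around the flaky OBS calls would both soften sporadic API hiccups and make the final failure explicit. A sketch only, not the actual pipeline script; the retry count and sleep are arbitrary:</p>
<pre><code># Hypothetical helper for the pipeline script.
retry() {
    local attempt max=3
    for attempt in $(seq "$max"); do
        "$@" && return 0
        echo "attempt $attempt/$max failed: $*" >&2
        sleep 30
    done
    echo "giving up after $max attempts: $*" >&2
    return 1
}

retry osc service wait devel:openQA os-autoinst-distri-opensuse-deps
</code></pre>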
openQA Infrastructure - action #153325 (Resolved): osd-deployment | Failed pipeline, Digest verif...
https://progress.opensuse.org/issues/153325 | 2024-01-10T09:26:30Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2147143" class="external">https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2147143</a> fails with:</p>
<pre><code> Warning: Digest verification failed for file 'openQA-common-4.6.1704466891.4d4e5b7-lp155.6263.1.aarch64.rpm'
[/var/tmp/AP_0xu5QjCF/aarch64/openQA-common-4.6.1704466891.4d4e5b7-lp155.6263.1.aarch64.rpm]
expected 54c424e19b97104953e5c1e28b81a291690a3e73f4ccc7705312ff0eb5f53cb8
but got e3117a0c9ad9cbf0dbc00d70ba6bfa304e9592a69d158f1e5b8561a71ece6094
</code></pre>
<p>This has already happened 4 times. The only related thing I found on progress is <a href="https://progress.opensuse.org/issues/69334" class="external">https://progress.opensuse.org/issues/69334</a>, so maybe a broken mirror?</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check the openQA OBS project if everything is fine or needs fixing</li>
<li>Make sure our automatic deployments are working again</li>
<li>Try to reproduce by running zypper manually on the worker host, maybe within a container (see the sketch after this list)</li>
<li>Restart the deployment pipeline</li>
</ul>
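<p>For reproducing the digest mismatch in isolation, the package can be re-downloaded from the devel:openQA repository in a throwaway container and checksummed. A sketch; the container image and the repository sub-path are assumptions, check the project's published repositories:</p>
<pre><code>podman run --rm registry.opensuse.org/opensuse/leap:15.5 bash -c '
  zypper -n ar -f https://download.opensuse.org/repositories/devel:/openQA/15.5/ devel_openQA &&
  zypper -n --gpg-auto-import-keys ref &&
  zypper -n in --download-only openQA-common &&
  sha256sum /var/cache/zypp/packages/devel_openQA/*/openQA-common-*.rpm
'
</code></pre>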
openQA Infrastructure - action #151231 (Resolved): package loss between o3 machines and download....
https://progress.opensuse.org/issues/151231 | 2023-11-21T11:33:46Z | nicksinger (nsinger@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>After the move of download.opensuse.org we're facing a lot of packet loss between o3 machines and download.opensuse.org.<br>
mtr from new-ariel shows around 1/3 of packets getting lost on the first hop:</p>
<pre><code> My traceroute [v0.92]
new-ariel (10.150.2.10) 2023-11-21T11:00:24+0000
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 195.135.223.5 36.0% 390 2.9 2.6 0.2 39.8 6.7
2. 195.135.223.226 0.0% 389 0.3 2.4 0.2 33.6 5.3
</code></pre>
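<p>To quantify the loss over a longer window (and to compare before and after any fix), a non-interactive mtr report can be run from new-ariel:</p>
<pre><code># ~300 probes, summary output; hop-1 loss should drop well below 1% once fixed.
mtr --report --report-cycles 300 download.opensuse.org
</code></pre>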
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Packet loss is significantly lower than currently (below 1% is usually a good indicator of a "stable" connection)</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><del>Raise the issue in #dct-migration</del> <strong>DONE</strong>: <a href="https://suse.slack.com/archives/C04MDKHQE20/p1700564457230459" class="external">https://suse.slack.com/archives/C04MDKHQE20/p1700564457230459</a></li>
<li>Follow the slack discussion and help investigate the issue</li>
</ul>
openQA Project - action #138287 (Resolved): petrol sometimes takes a long time to respond/render h...
https://progress.opensuse.org/issues/138287 | 2023-10-19T13:33:37Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Sometimes pipelines (e.g. <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1915033">https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1915033</a>) fail with: </p>
<pre><code>2023-10-19T13:14:13Z E! [inputs.http] Error in plugin: [url=http://localhost:9530/influxdb/minion]: Get "http://localhost:9530/influxdb/minion": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
</code></pre>
<p>It seems like the endpoint on that host sometimes takes a long time to respond:</p>
<pre><code>petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=0i,delayed=0i,failed=19i,inactive=0i
openqa_minion_workers,url=http://localhost:9530 active=0i,inactive=1i,registered=1i
openqa_download_count,url=http://localhost:9530 count=0i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m0.008s
user 0m0.006s
sys 0m0.000s
petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=0i,delayed=0i,failed=19i,inactive=0i
openqa_minion_workers,url=http://localhost:9530 active=0i,inactive=1i,registered=1i
openqa_download_count,url=http://localhost:9530 count=0i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m0.008s
user 0m0.006s
sys 0m0.000s
petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=0i,delayed=0i,failed=19i,inactive=1i
openqa_minion_workers,url=http://localhost:9530 active=0i,inactive=1i,registered=1i
openqa_download_count,url=http://localhost:9530 count=0i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m6.242s
user 0m0.003s
sys 0m0.003s
petrol:~ # time curl http://localhost:9530/influxdb/minion
openqa_minion_jobs,url=http://localhost:9530 active=1i,delayed=0i,failed=19i,inactive=0i
openqa_minion_workers,url=http://localhost:9530 active=1i,inactive=0i,registered=1i
openqa_download_count,url=http://localhost:9530 count=1i
openqa_download_rate,url=http://localhost:9530 bytes=28359186i
real 0m11.547s
user 0m0.006s
sys 0m0.000s
</code></pre>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Not sure what causes the long response times but I could easily reproduce it by running <code>time curl http://localhost:9530/influxdb/minion</code> a couple of times.</p>
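<p>A simple way to quantify how often and how badly the endpoint stalls, without telegraf in the loop:</p>
<pre><code># Print only the total request time of 20 consecutive requests.
for i in $(seq 20); do
    curl -o /dev/null -s -w '%{time_total}\n' http://localhost:9530/influxdb/minion
done
</code></pre>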
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>The route should be quite snappy, not this slow. At the very least, if we cannot understand or fix the underlying problem, our pipelines should not fail because of it.</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Understand why that API endpoint takes so long to respond on this host only</li>
<li>Bump curl timeouts in our telegraf config</li>
</ul>
openQA Infrastructure - action #135944 (New): Implement a constantly running monitoring/debugging...
https://progress.opensuse.org/issues/135944 | 2023-09-18T22:23:33Z | nicksinger (nsinger@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>We struggle to understand the multi-machine setup and how to set it up properly for newly added openQA machines. In addition we also have problems debugging these setups in cases like <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candida... (Resolved)" href="https://progress.opensuse.org/issues/134282">#134282</a> because they consist of a quite complex networking stack (advanced Linux networking, openvswitch, GRE tunnels between workers, os-autoinst-openvswitch and KVM/QEMU on top of all of that, etc.).</p>
<p>Together with <a class="user active user-mention" href="https://progress.opensuse.org/users/25092">@pcervinka</a>, <a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> and <a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a> we discussed on 2023-09-18 in jitsi that it might be a good idea to have a qemu instance constantly running which is set up like a multi-machine job but with a very basic installation. These VMs could be used to run e.g. telegraf with basic checks on top of the whole stack (ping, curl to different required sources like scc.suse.de, etc.) and could be accessed via SSH for debugging in case something is not working.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> each multi-machine capable worker has a constantly running qemu instance connected to the multi-machine network-stack
<ul>
<li><strong>AC1.1:</strong> this setup is defined and configured via salt</li>
<li><strong>AC1.2:</strong> the VM starts on worker startup via a systemd unit</li>
</ul></li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check what openQA executes to spawn VMs (e.g. with <code>ps</code> while a multi-machine job is running; see the sketch after this list)</li>
<li>Use the <a href="https://progress.opensuse.org/issues/135818" class="external">minimal reproducer</a> as base</li>
<li>Keep the <a href="https://progress.opensuse.org/issues/135914" class="external">best practices</a> for multi-machine test debugging in mind</li>
<li>Understand what <a href="https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch" class="external">os-autoinst-openvswitch</a> does</li>
</ul>
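<p>For the first suggestion, the full QEMU invocation (including the tap/openvswitch networking arguments) can be captured on a worker while a multi-machine job is running. A minimal sketch:</p>
<pre><code># Full qemu command lines, one per running process:
pgrep -af qemu-system
# The same, with one argument per line for easier reading (binary name is an assumption):
ps -o args= -C qemu-system-x86_64 | tr ' ' '\n'
</code></pre>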
openQA Infrastructure - action #134051 (Resolved): Eng-Infra maintained DNS server for .qa.suse.d...
https://progress.opensuse.org/issues/134051 | 2023-08-09T17:43:03Z | nicksinger (nsinger@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>We need to clear out Maxtorhof where qanet currently sits and is running bind to serve the qa.suse.de domain. We agreed that we want to keep the domain as "common name" for all qe provided services pointing to domains in different locations (e.g. s.qa.suse.de should be a CNAME for s.qe.nue2.suse.org).</p>
<p>In the best case Eng-Infra can provide us with a zone on their already existing infrastructure. The next best option would be a VM provided by them. If neither is possible we might need to spin up qanet2 and serve the domain from there.</p>
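<p>Whichever option is chosen, migrated entries can be verified quickly once the zone is served; a minimal check for the example from the motivation:</p>
<pre><code># Should print s.qe.nue2.suse.org. once the CNAME is in place.
dig +short CNAME s.qa.suse.de
# Cross-check against the serving nameserver directly (placeholder, replace
# with the Eng-Infra nameserver):
dig +short CNAME s.qa.suse.de @NAMESERVER
</code></pre>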
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We can maintain CNAME entries within the .qa.suse.de domain in <a href="https://gitlab.suse.de/OPS-Service/salt/" class="external">https://gitlab.suse.de/OPS-Service/salt/</a></li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Follow-up with the discussion <a href="https://suse.slack.com/archives/C04MDKHQE20/p1691150107815519" class="external">https://suse.slack.com/archives/C04MDKHQE20/p1691150107815519</a></li>
<li>Ask Eng-Infra in Slack #dct-migration. If that yields no results, create an SD ticket or escalate as needed until this is provided</li>
<li>Extend our wiki with according instructions</li>
<li>Take over all current entries from <a href="https://gitlab.suse.de/qa-sle/qanet-configs/" class="external">https://gitlab.suse.de/qa-sle/qanet-configs/</a> so that we are sure we can decommission qanet</li>
<li>Hand out the Eng-Infra maintained DNS servers via DHCP+PXE on qanet</li>
</ul>
openQA Infrastructure - action #133991 (New): Cover same metric for different hosts with a single...
https://progress.opensuse.org/issues/133991 | 2023-08-08T17:10:18Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>In <a href="https://progress.opensuse.org/issues/133130" class="external">https://progress.opensuse.org/issues/133130</a> I explored different possibilities of grouping alerts (by hostname, by alert, etc.) and realized that unified alerting would allow us to greatly decrease our number of alert rules. Our current alert rules are automatically generated for each host by salt but could be generalized to cover every host without needing to create a new rule for them specifically.</p>
<p>To phrase it differently: We have n alert rule instances of the "host up" alert. One for each host. This could be reduced to one single alert by writing a query which groups by host. An example for a single alert instance covering all hosts can be found here: <a href="https://stats.openqa-monitor.qa.suse.de/alerting/grafana/b8b0597c-0aeb-4b0a-9337-6f225cd8c9d4/view" class="external">https://stats.openqa-monitor.qa.suse.de/alerting/grafana/b8b0597c-0aeb-4b0a-9337-6f225cd8c9d4/view</a></p>
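<p>The grouping idea can be prototyped directly against InfluxDB before touching salt. A sketch using the InfluxDB 1.x query API, assuming telegraf's default <code>system</code> measurement; field and measurement names may differ in our setup:</p>
<pre><code># One query instead of one rule per host: latest uptime grouped by the host tag.
curl -sG 'http://localhost:8086/query' \
    --data-urlencode 'db=telegraf' \
    --data-urlencode 'q=SELECT last("uptime") FROM "system" GROUP BY "host"'
</code></pre>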
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> A single alert rule exists which replaces all current alert rules (per host, covering the same metric)</li>
<li><strong>AC2:</strong> The single alert conveys the same amount of information as the per-host alert rules do</li>
<li><strong>AC3:</strong> All newly created alert rules are deployed via salt. Old ones are removed from salt/the templates</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check an example created by nsinger: <a href="https://stats.openqa-monitor.qa.suse.de/alerting/grafana/b8b0597c-0aeb-4b0a-9337-6f225cd8c9d4/view" class="external">https://stats.openqa-monitor.qa.suse.de/alerting/grafana/b8b0597c-0aeb-4b0a-9337-6f225cd8c9d4/view</a></li>
<li>Read Grafanas documentation regarding templating alert messages: <a href="https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/message-templating/" class="external">https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/message-templating/</a></li>
<li>Test with a manually created alert and a limited amount of recipients</li>
</ul>
openQA Infrastructure - action #132902 (Resolved): Check and document PDU connection of nibali.qe...
https://progress.opensuse.org/issues/132902 | 2023-07-17T19:32:23Z | nicksinger (nsinger@suse.com)
<p>Due to <a href="https://progress.opensuse.org/issues/132860" class="external">https://progress.opensuse.org/issues/132860</a> I realized that "openqa raspberry pi hw worker PDU" has a comment stating that it is connected to A20 of "PDU-FC-B2". However, according to racktables, nibali is connected to port 20. I just tried it and saw that our Pi setup is indeed connected to port 20. This means the documentation of nibali.qe.nue2.suse.org is <strong>wrong</strong> and needs to be checked the next time somebody is in the FC basement.</p>
openQA Infrastructure - action #132461 (Resolved): manage tls certificates on o3/ariel directly w...
https://progress.opensuse.org/issues/132461 | 2023-07-07T13:27:15Z | nicksinger (nsinger@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a href="https://app.slack.com/client/T02863RC2AC/C04MDKHQE20/thread/C04MDKHQE20-1688735468.778099" class="external">We got informed</a> that ariel/o3 will have no hydra/ha-proxy setup in the new location. Therefore we need to handle the TLS certificates for nginx on our own in the future.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> openqa.opensuse.org has a valid certificate requested by the webhost itself</li>
<li><strong>AC2:</strong> the process is fully automated and certificate renewal requires no human interaction</li>
<li><strong>AC3:</strong> Any generalizable config snippets are in github.com/os-autoinst/openQA/</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Install a Let's Encrypt compatible client on ariel (see <a href="https://wiki.archlinux.org/title/Transport_Layer_Security#ACME_clients" class="external">https://wiki.archlinux.org/title/Transport_Layer_Security#ACME_clients</a> for a list) - nsinger recommends <a href="https://github.com/dehydrated-io/dehydrated" class="external">dehydrated</a> (see the sketch after this list)</li>
<li>Adjust nginx to serve the ACME challenges and reconfigure existing entries to use that new certificate</li>
<li>Feel welcome to experiment on o3 as long as you monitor closely that everything still works as expected or is quickly reverted on problems</li>
<li>Submit any generalizable config snippets into github.com/os-autoinst/openQA/, e.g. as commented nginx config templates</li>
</ul>
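<p>If dehydrated is chosen, the initial setup on ariel could look roughly like the following. This is a sketch under assumptions: the openSUSE package name, the default config paths and the webroot location all need to be checked against the dehydrated documentation:</p>
<pre><code>zypper in dehydrated
echo "openqa.opensuse.org" > /etc/dehydrated/domains.txt
# WELLKNOWN in /etc/dehydrated/config must point to the directory nginx serves
# as /.well-known/acme-challenge/ (e.g. /srv/www/acme, an assumption).
dehydrated --register --accept-terms
dehydrated --cron     # request (and later renew) the certificate
# Automate renewal, e.g. with a systemd timer or cron job running "dehydrated --cron".
</code></pre>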
openQA Infrastructure - action #129065 (Resolved): [alert] HTTP Response alert fired, OSD loads s...
https://progress.opensuse.org/issues/129065 | 2023-05-10T13:33:13Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683722604024&to=1683725326412&viewPanel=78">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683722604024&to=1683725326412&viewPanel=78</a> alerted on 2023-05-10 15:07 CEST</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1</strong>: The alert is not firing anymore.</li>
<li><strong>AC2</strong>: Logs have been investigated.</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Look into the timeframe
<a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683723624920&to=1683724305517">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683723624920&to=1683724305517</a> and compare with other panels on OSD to see whether it's visible what made the system busy. <strong>DONE</strong>: nothing too unusual, maybe slightly high IO times but far from concerning</li>
<li><p><a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> suggested in <a href="https://suse.slack.com/archives/C02AJ1E568M/p1683724668733689?thread_ts=1683724103.321589&cid=C02AJ1E568M">https://suse.slack.com/archives/C02AJ1E568M/p1683724668733689?thread_ts=1683724103.321589&cid=C02AJ1E568M</a> that it might be caused by something we don't collect metrics from - brainstorm what these could be, implement metrics for them</p>
<ul>
<li>Open network connections - nsinger observed peaks of >2k, ~75% of them related to httpd-prefork, ~20% to openqa-websocket (see the <code>ss</code> sketch below)</li>
<li><blockquote>
<p>(Nick Singer) I'm currently logged into OSD. CPU utilization is quite high with a longterm load of 12 and shortterm of ~14 with only 12 cores on OSD. velociraptor goes up to 200% and is in general quite high in the process list but also telegraf and obviously openqa itself.</p>
<p>(Oliver Kurz) all of that sounds fine. When the HTTP response was high I just took a look and the CPU usage was near 0 same as we suspected in the past. Remember our debugging on why qanet is slow? Comparable to that but here it's likely apache, number of concurrent connections, something like that</p>
</blockquote></li>
</ul></li>
<li><p>Take <a href="https://suse.slack.com/archives/C02CANHLANP/p1683723956965209">https://suse.slack.com/archives/C02CANHLANP/p1683723956965209</a> into account - is there something we can do to improve this situation?</p></li>
</ul>
<blockquote>
<p>(Joaquin Rivera) is OSD also slow for someone else? (edited) <br>
(Fabian Vogt) That might be partially because of the yast2_nfs_server jobs for investigation. You might want to delete them now that they did their job. (e.g. <a href="https://openqa.suse.de/tests/11085729">https://openqa.suse.de/tests/11085729</a>. Don't open, might crash your browser...). those jobs are special. serial_terminal has some race condition so they hammer enter_cmd + assert_script_run in a loop until it fails</p>
</blockquote>
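<p>For the "open network connections" observation above, a quick breakdown of established TCP connections by owning process can be produced with <code>ss</code> (run as root so process names are visible); a minimal sketch:</p>
<pre><code># Count established TCP connections per process name.
ss -Htnp state established | grep -o '"[^"]*"' | sort | uniq -c | sort -rn
</code></pre>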
<a name="Out-of-scope"></a>
<h2 >Out of scope<a href="#Out-of-scope" class="wiki-anchor">¶</a></h2>
<ul>
<li>limiting the number of test result step uploads or handling the effect of test result step uploading -> <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Limit the number of uploadable test result steps size:M (Resolved)" href="https://progress.opensuse.org/issues/129068">#129068</a></li>
</ul>
openQA Infrastructure - action #128420 (Resolved): [alert][grafana] 100% packet loss from qa-powe...
https://progress.opensuse.org/issues/128420 | 2023-04-28T16:55:01Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Starting 2023-04-27 15:15:00, the machines mentioned in the title failed to access/ping s390 LPARs. Something between these hosts has changed or broken and needs to be fixed.<br>
We had similar issues in the past, see the following SD tickets:</p>
<ul>
<li><a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-92689" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-92689</a></li>
<li><a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-115963" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-115963</a></li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check what these machines have in common. A quick look showed that they are in the "old" qa network and located close to each other: <a href="https://racktables.suse.de/index.php?page=rack&rack_id=516" class="external">https://racktables.suse.de/index.php?page=rack&rack_id=516</a></li>
<li>Check if other machines in that location, network, room or switch have the same problems (see the sketch after this list)</li>
<li>Create a new SD ticket referencing the old ones. Robert mentioned in one of them that we might need to get rid of a second uplink </li>
</ul>
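<p>For the second suggestion, a quick reachability sweep from one of the affected machines towards the LPARs and their neighbours narrows the problem down. The host list is a placeholder and should be taken from racktables:</p>
<pre><code># Replace the list with the actual LPAR/neighbour hostnames.
for host in lpar-1.example.suse.de lpar-2.example.suse.de; do
    printf '%s: ' "$host"
    ping -c 5 -q "$host" | grep 'packet loss'
done
</code></pre>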
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<ol>
<li>Remove silence for rule_uid=2Z025iB4km </li>
</ol>
openQA Infrastructure - action #128417 (Resolved): [alert][grafana] openqaw5-xen: partitions usag...
https://progress.opensuse.org/issues/128417 | 2023-04-28T16:44:30Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>On 2023-04-28 16:30 the partition usage of w5-xen skyrocketed to >90% (<a href="https://stats.openqa-monitor.qa.suse.de/d/GDopenqaw5-xen/dashboard-for-openqaw5-xen?orgId=1&viewPanel=65090&from=1682657429086&to=1682699823248" class="external">https://stats.openqa-monitor.qa.suse.de/d/GDopenqaw5-xen/dashboard-for-openqaw5-xen?orgId=1&viewPanel=65090&from=1682657429086&to=1682699823248</a>) and shortly after an alert was fired. Someone or something cleaned up a short time later, bringing the usage back to a reasonable 40%.</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>DONE: Check with e.g. <a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> if this was maybe a one-time thing because somebody moved around stuff manually</li>
<li>DONE: Manual cleanup of files in /var/lib/libvirt/images, ask in #eng-testing what the stuff is needed for</li>
<li>Plug in more SSDs. Likely we have some spare in FC Basement shelves</li>
<li>Check virsh XMLs to crosscheck against openQA jobs before deleting anything for good (see the sketch after this list)</li>
<li><del>Adjust the alert to allow longer periods over the threshold</del> We decided that our thresholds are feasible</li>
</ul>
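<p>Before deleting anything in <code>/var/lib/libvirt/images</code> for good, the images still referenced by defined domains can be listed next to the per-file disk usage; a short sketch:</p>
<pre><code># Which images are still referenced by libvirt domains?
for dom in $(virsh list --all --name); do
    virsh domblklist "$dom"
done
# What actually consumes the space?
du -sh /var/lib/libvirt/images/* | sort -rh
</code></pre>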
openQA Infrastructure - action #126212 (Resolved): openqa.suse.de response times very slow. No al...
https://progress.opensuse.org/issues/126212 | 2023-03-20T10:01:11Z | nicksinger (nsinger@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://suse.slack.com/archives/C02AJ1E568M/p1679304824699299" class="external">Tina observed</a> very slow responses from the OSD webui at 10:33 CET. Shortly after we got asked in <a href="https://suse.slack.com/archives/C02CANHLANP/p1679305078007049" class="external">#eng-testing</a>.<br>
The higher load can be well seen in grafana too: <a href="https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314" class="external">https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314</a><br>
We received no apache response time alerts as far as I can tell.</p>
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1</strong>: It is known that our alert thresholds are sensible</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check what caused the high load, e.g. by analyzing the apache log in /var/log/apache2 (see the sketch after this list)</li>
<li>Remediate the offender (e.g. fixing a script, blocking an IP, etc)</li>
<li>Check why the apache response time alert was not firing and check if something needs to be fixed
<ul>
<li>Apache Response Time should have fired?</li>
<li>Maybe the alert was too relaxed and didn't trigger "yet"?</li>
<li>Should be 10s but even the index page w/o additional ajax took longer? We don't have numbers, though?</li>
</ul></li>
</ul>
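<p>For the first suggestion, a top-talkers summary from the apache access log usually points at the offender quickly. The log file name and the grep pattern for the timeframe are assumptions and need adjusting to the actual vhost log:</p>
<pre><code># Requests per client IP during the incident window (hours 10-11 local time).
grep '20/Mar/2023:1[01]:' /var/log/apache2/access_log \
    | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
</code></pre>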