openSUSE Project Management Tool: Issues (https://progress.opensuse.org/, updated 2024-02-29T12:50:10Z)
openQA Infrastructure - action #156331 (Resolved): [gitlab] New pipeline schedules cannot be created https://progress.opensuse.org/issues/156331 (2024-02-29T12:50:10Z, jbaier_cz, jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>New pipeline schedules can’t be created.</p>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ol>
<li>Visit pipeline schedules of any project with CI/CD enabled.</li>
<li>Observe the message: “You have exceeded the maximum number of pipeline schedules for your plan. To create a new schedule, either increase your plan limit or delete an existing schedule.”</li>
<li>See disabled button “New schedule”.</li>
</ol>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>New pipeline schedules can be created.</p>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>Without the ability to create new pipeline schedules, any automation that relies on scheduled pipelines might be hindered.</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>This issue can be solved by following the steps described in <a href="https://gitlab.suse.de/help/administration/instance_limits#number-of-pipeline-schedules" class="external">https://gitlab.suse.de/help/administration/instance_limits#number-of-pipeline-schedules</a>.</p>
openQA Infrastructure - action #156301 (Resolved): [bot-ng] Pipeline failed / KeyError: 'priority' https://progress.opensuse.org/issues/156301 (2024-02-29T08:54:46Z, livdywan, liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2327183" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2327183</a></p>
<pre><code>++ retry -r 30 -e -- ./qem-bot/bot-ng.py -c /etc/openqabot --token [MASKED] incidents-run
[...]
KeyError: 'priority'
Retrying up to 19 more times after sleeping 6144s …
2024-02-29 06:28:46 INFO Bot schedule starts now
Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 7, in <module>
    main()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 32, in main
    sys.exit(cfg.func(cfg))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 24, in do_incident_schedule
    bot = OpenQABot(args)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/openqabot.py", line 24, in __init__
    self.incidents = get_incidents(self.token)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 41, in get_incidents
    xs.append(Incident(i))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 23, in __init__
    self.priority = incident["priority"]
KeyError: 'priority'
Retrying up to 18 more times after sleeping 12288s …
ERROR: Job failed: execution took longer than 4h0m0s seconds
</code></pre>
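<p>The sleep intervals in the log double between attempts (6144 s, then 12288 s). Assuming the retry helper keeps doubling (an assumption based only on those two values), a quick back-of-the-envelope sketch shows why a reproducible error is guaranteed to hit the 4 h job limit long before the 30 retries are exhausted:</p>

```python
# Sketch under the assumption of plain exponential backoff, as suggested
# by the 6144 s -> 12288 s sleeps in the log above.
def cumulative_sleep(first_sleep, attempts):
    """Total seconds slept after the given number of doubling retries."""
    return sum(first_sleep * 2 ** i for i in range(attempts))

JOB_LIMIT = 4 * 60 * 60  # the GitLab job timeout from the log: 4h = 14400 s

# Already after the second sleep the job is past its limit:
assert cumulative_sleep(6144, 2) == 18432
assert cumulative_sleep(6144, 2) > JOB_LIMIT
```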
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>DONE</strong> Restart pipelines</li>
<li>Investigate if there is new data the bot is not handling correctly</li>
<li>Don't provoke timeouts by retrying on reproducible errors</li>
<li>Look into unit test coverage</li>
</ul>
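<p>A defensive variant of the incident loader could skip records that lack the "priority" field instead of aborting the whole run with a KeyError. A minimal sketch, assuming the payload shape implied by the traceback (the function name and record fields here are hypothetical):</p>

```python
# Hypothetical sketch: tolerate incidents without a "priority" field
# instead of letting a KeyError abort the whole schedule run.
def load_incidents(raw_incidents):
    """Return (parsed, skipped); records lacking "priority" are skipped."""
    parsed, skipped = [], []
    for inc in raw_incidents:
        if "priority" not in inc:
            skipped.append(inc)  # real code would log a warning here
            continue
        parsed.append({"number": inc.get("number"), "priority": inc["priority"]})
    return parsed, skipped

parsed, skipped = load_incidents([{"number": 1, "priority": 500}, {"number": 2}])
```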
openQA Infrastructure - action #156226 (Resolved): [bot-ng] Pipeline failed / failed to pull image... https://progress.opensuse.org/issues/156226 (2024-02-28T13:51:23Z, livdywan, liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2325569" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2325569</a></p>
<pre><code>WARNING: Failed to pull image with policy "always": failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)
WARNING: Failed to pull image with policy "always": failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>DONE</strong> Restart pipelines</li>
<li><strong>DONE</strong> Report an infra SD ticket</li>
<li><strong>DONE</strong> Add retries to the pipeline</li>
</ul>
openQA Infrastructure - action #155725 (Resolved): [openQA][infra][sut] Failed to establish connection... https://progress.opensuse.org/issues/155725 (2024-02-21T09:38:46Z, waynechen55, wchen@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Cannot establish an ipmi sol connection to fozzie-sp and quinn-sp:</p>
<pre><code>localhost:~ # ipmitool -I lanplus -H fozzie-sp.qe.nue2.suse.org -U ADMIN -P xxxxx chassis power status
Address lookup for fozzie-sp.qe.nue2.suse.org failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
localhost:~ # ipmitool -I lanplus -H quinn-sp.qe.nue2.suse.org -U ADMIN -P xxxxx chassis power status
Address lookup for quinn-sp.qe.nue2.suse.org failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
localhost:~ # ping -c5 fozzie-sp.qe.nue2.suse.org
ping: fozzie-sp.qe.nue2.suse.org: Name or service not known
localhost:~ # ping -c5 quinn-sp.qe.nue2.suse.org
ping: quinn-sp.qe.nue2.suse.org: Name or service not known
</code></pre>
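<p>The errors above ("Address lookup ... failed", "Name or service not known") show that the hostnames do not resolve at all, so the problem starts at DNS rather than at the BMCs themselves. A small pre-flight check (a sketch; the helper name is ours) can separate the two failure modes before blaming the management units:</p>

```python
import socket

def bmc_preflight(host):
    """Distinguish DNS failure from other IPMI problems before running ipmitool."""
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror:
        return f"{host}: DNS resolution failed - fix name resolution first"
    return f"{host}: resolves to {addr}"

# The reserved .invalid TLD never resolves, so this reports a DNS failure:
print(bmc_preflight("fozzie-sp.qe.nue2.suse.invalid"))
print(bmc_preflight("localhost"))
```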
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ul>
<li>Use ipmitool to run any operation against the affected hosts</li>
</ul>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>Test run keeps failing.</p>
<a name="Problem"></a>
<h2 >Problem<a href="#Problem" class="wiki-anchor">¶</a></h2>
<p>Looks like something is wrong with the management unit.</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check management unit state</li>
<li>Check error/warning report from management unit</li>
<li>Check management unit configuration</li>
<li>Check that ipmi sol is enabled</li>
</ul>
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>n/a</p>
openQA Infrastructure - action #155080 (Resolved): jenkins is no longer producing GNOME:Next test runs https://progress.opensuse.org/issues/155080 (2024-02-07T13:03:45Z, okurz, okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>From <a href="https://suse.slack.com/archives/C02CANHLANP/p1707310927769339" class="external">https://suse.slack.com/archives/C02CANHLANP/p1707310927769339</a></p>
<blockquote>
<p>(Dominique Leuenberger) seems jenkins is no longer producing GNOME:Next test runs: <a href="http://jenkins.qa.suse.de/job/gnome_next-openqa/8670/console" class="external">http://jenkins.qa.suse.de/job/gnome_next-openqa/8670/console</a></p>
</blockquote>
<pre><code>Caused: java.io.IOException: Cannot run program "/bin/sh" (in directory "/var/lib/jenkins/workspace/gnome_next-openqa"): error=0, Failed to exec spawn helper: pid: 2883, signal: 11
</code></pre>
openQA Infrastructure - action #154927 (Resolved): [alert] Broken workers alert was firing several... https://progress.opensuse.org/issues/154927 (2024-02-05T10:03:31Z, mkittler, marius.kittler@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>See <a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=96&editPanel=96&from=1706991565957&to=1707139468853" class="external">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=96&editPanel=96&from=1706991565957&to=1707139468853</a> for the panel/timeframe</p>
<p>Example of what the worker log looked like (it looked similar on all machines/services I checked):</p>
<pre><code>Feb 04 03:33:36 worker40 worker[3881]: [error] [pid:3881] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:46 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:56 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:56 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:06 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:06 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:16 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:16 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:26 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:26 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3087
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3087
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Accepting job 13420128 from openqa.suse.de.
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Setting job 13420128 from openqa.suse.de up
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] +++ setup notes +++
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] Running on worker40:1 (Linux 5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64)
</code></pre>
<p>Sometimes connecting to the cache service still fails even though the worker registration already works:</p>
<pre><code>Feb 04 03:35:12 worker34 worker[3937]: [error] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:32 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:42 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:42 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:52 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:52 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:02 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:02 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2687
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2687
Feb 04 03:36:33 worker34 worker[3937]: [warn] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused - checking again for web UI 'openqa.suse.de' in 80.81 s
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Accepting job 13423952 from openqa.suse.de.
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Setting job 13423952 from openqa.suse.de up
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Preparing Mojo::IOLoop::ReadWriteProcess::Session
</code></pre>
<p>The worker is of course designed to retry, and after a while all ~100 affected worker slots were good again, except for two slots where recovery apparently took several hours. Those two slots were triggering the alert. Unfortunately, the data in Grafana does not tell us which slots those were.</p>
<p>The cache service itself only took so long to start because it initially couldn't get a socket to listen on:</p>
<pre><code>…
Feb 04 03:34:27 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: [17249] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 11.
Feb 04 03:34:32 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:32 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: [17340] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 12.
Feb 04 03:34:38 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: [17453] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:39 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 13.
Feb 04 03:34:43 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:43 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://127.0.0.1:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://[::1]:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://127.0.0.1:9530"
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://[::1]:9530"
</code></pre>
<p>There were actually 13 failed startup attempts on that particular host, all within about a minute.</p>
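<p>The "Address family for hostname not supported" error suggests the cache service tried to bind its IPv6 listen socket ([::1]:9530) while the IPv6 stack was not yet usable. The failing bind can be reproduced in isolation (a sketch; port 0 lets the OS pick any free port):</p>

```python
import socket

def can_bind(family, addr):
    """Try to bind a TCP socket on the given address; False on any OSError."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.bind((addr, 0))  # port 0: let the OS choose a free port
        return True
    except OSError:
        return False

# IPv4 loopback virtually always works; IPv6 depends on the stack state,
# which is exactly the race the systemd log above shows during early boot.
print(can_bind(socket.AF_INET, "127.0.0.1"))
print(can_bind(socket.AF_INET6, "::1"))
```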
openQA Infrastructure - action #154627 (Resolved): [potential-regression] Ensure that our "host up"... https://progress.opensuse.org/issues/154627 (2024-01-31T12:43:45Z, okurz, okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: CPU Load and usage alert for openQA workers size:S (Resolved)" href="https://progress.opensuse.org/issues/150983">#150983</a> and <a class="issue tracker-4 status-6 priority-3 priority-lowest closed" title="action: [potential-regression] Our salt node up check in osd-deployment never fails size:M (Rejected)" href="https://progress.opensuse.org/issues/151588">#151588</a>. Currently our "host up" alert is likely showing "no data" for salt-controlled hosts that are temporarily down, but that needs to be cross-checked.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We are alerted if a host that is currently in salt is down</li>
<li><strong>AC2:</strong> There is only one firing alert at a time when a host that is currently in salt is down</li>
<li><strong>AC3:</strong> There is no firing alert after reasonable time if we have removed a host from salt control, i.e. removed from salt keys on OSD and potentially re-deploy a high state</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>On monitor.qa.suse.de select any host, show the "host up" panel, then shut down the machine and check how the ping behaves, e.g. select tumblesle on qamaster</li>
<li>Fix the alert, or if everything works fine, convince everybody who made big noise about nothing</li>
<li>Extend our documentation in salt-states repo or team wiki or openQA wiki as applicable for how to handle taking hosts down/up or something, e.g. review <a href="https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production" class="external">https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production</a></li>
</ul>
openQA Infrastructure - action #153958 (Resolved): [alert] s390zl12: Memory usage alert Generic m... https://progress.opensuse.org/issues/153958 (2024-01-19T11:57:59Z, tinita, tina.mueller+trick-redmine@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code>Date: Fri, 19 Jan 2024 11:55:37 +0100
1 firing alert instance
[IMAGE]
GROUPED BY
hostname=s390zl12
1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
s390zl12: Memory usage alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
A0=0.06117900738663373
Labels
alertname
s390zl12: Memory usage alert
grafana_folder
Generic
hostname
s390zl12
rule_uid
memory_usage_alert_s390zl12
</code></pre>
<p><a href="http://stats.openqa-monitor.qa.suse.de/alerting/grafana/memory_usage_alert_s390zl12/view?orgId=1" class="external">http://stats.openqa-monitor.qa.suse.de/alerting/grafana/memory_usage_alert_s390zl12/view?orgId=1</a></p>
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<p>Remove silence "alertname=s390zl12: Memory usage alert" from <a href="https://stats.openqa-monitor.qa.suse.de/alerting/silences" class="external">https://stats.openqa-monitor.qa.suse.de/alerting/silences</a></p>
openQA Infrastructure - action #152811 (Resolved): ada.qe.suse.de is not responding to salt commands https://progress.opensuse.org/issues/152811 (2023-12-20T13:43:19Z, livdywan, liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code>ada.qe.suse.de: Minion did not return. [Not connected]
</code></pre>
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<ul>
<li><code>ssh osd 'sudo salt-key -y -a ada.qe.suse.de'</code></li>
</ul>
openQA Infrastructure - action #152095 (Resolved): [spike solution][timeboxed:8h] Ping over GRE tunnels... https://progress.opensuse.org/issues/152095 (2023-12-05T13:22:56Z, okurz, okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See lessons learned meeting <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'hos... (Resolved)" href="https://progress.opensuse.org/issues/139136">#139136</a>. We would again benefit from an easier reproducer. Related to <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test ... (Resolved)" href="https://progress.opensuse.org/issues/135818">#135818</a> . Come up with a way to ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We know how to ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><p>Research upstream about pinging over specific interfaces, GRE tunnels, TAP devices, openvswitch, etc.</p>
<ul>
<li>Like <code>ping -I&lt;interface&gt;</code> or <code>ping X.X.X.X%tap0</code>?</li>
<li>Checkout network namespaces and if they could be used</li>
</ul></li>
<li><p>Research about MTU size debugging, tracepath, traceroute, etc.</p></li>
<li><p>Experiment in an openQA-environment or openQA-like with the bridges, tap devices, etc.</p></li>
<li><p>Demonstrate to the team in written form or interactively</p></li>
<li><p>Look up how the existing check is done via a VM/VNC, and see how it could be simplified</p></li>
</ul>
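<p>For the MTU-size part of the research, the relevant arithmetic can be sketched up front. The header sizes below are the IPv4/ICMP/GRE minimums; IP options or GRE keys would add more overhead:</p>

```python
# Largest ICMP payload (the `ping -s` value) that fits a given MTU.
IP_HDR, ICMP_HDR, GRE_HDR = 20, 8, 4  # minimal IPv4, ICMP echo, base GRE headers

def max_ping_payload(mtu, over_gre=False):
    overhead = IP_HDR + ICMP_HDR
    if over_gre:
        overhead += IP_HDR + GRE_HDR  # outer IPv4 header + GRE encapsulation
    return mtu - overhead

assert max_ping_payload(1500) == 1472                # the classic `ping -s 1472 -M do`
assert max_ping_payload(1500, over_gre=True) == 1448  # why GRE paths need smaller pings
```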
openQA Infrastructure - action #150938 (Resolved): [openQA][sut][ipmi] No ipmi sol output with ix64ph1075... https://progress.opensuse.org/issues/150938 (2023-11-16T09:39:37Z, waynechen55, wchen@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Test runs start failing with <code>imagetester:7</code> at ipxe_install, for example <a href="https://openqa.suse.de/tests/12822901#step/ipxe_install/1" class="external">https://openqa.suse.de/tests/12822901#step/ipxe_install/1</a>. It looks like a needle matching failure, but actually nothing is printed on the ipmi sol console after reboot.</p>
<pre><code>ipmitool -I lanplus -C 3 -H ix64ph1075-sp.qe.nue2.suse.org -U admin -P xxxxxxxx sol activate
</code></pre>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ul>
<li>Connect to ix64ph1075 ipmi sol console</li>
<li>Reboot the machine</li>
<li>Wait for output on ipmi sol console</li>
</ul>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>No test run assigned to <code>imagetester:7</code> can proceed (now <code>imagetester:6</code>).</p>
<a name="Problem"></a>
<h2 >Problem<a href="#Problem" class="wiki-anchor">¶</a></h2>
<ul>
<li>Looks like something is wrong with the ipmi sol console</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check ipmi sol config</li>
<li>Check warning/error in BMC</li>
<li>Factory-reset the BMC</li>
<li>Reinstall the firmware</li>
<li>Click every possible button</li>
<li>Check that the physical ethernet cable is not broken</li>
</ul>
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>n/a</p>
<a name="Rollback-actions"></a>
<h2 >Rollback actions<a href="#Rollback-actions" class="wiki-anchor">¶</a></h2>
<ul>
<li><code>sudo systemctl unmask openqa-worker-auto-restart@6 && sudo systemctl enable --now openqa-worker-auto-restart@6</code></li>
</ul>
openQA Infrastructure - action #137600 (Resolved): [alert] Packet loss between worker hosts and other hosts https://progress.opensuse.org/issues/137600 (2023-10-09T07:46:02Z, jbaier_cz, jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>We had multiple occurrences of the packet loss alert over the weekend:</p>
<pre><code>alertname Packet loss between worker hosts and other hosts alert
grafana_folder Salt
rule_uid 2Z025iB4km
</code></pre>
<p><a href="http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4" class="external">http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4</a></p>
<p>Currently, the problematic ones according to the panel are:</p>
<pre><code>imagetester - walter1.qe.nue2.suse.org 100%
petrol-1 - walter1.qe.nue2.suse.org 100%
sapworker1 - walter1.qe.nue2.suse.org 100%
</code></pre>
<p>That is a little bit odd, as I manually checked the first pair and the hosts can reach each other fine:</p>
<pre><code>walter1:~ # ping imagetester.qe.nue2.suse.org
PING imagetester.qe.nue2.suse.org (10.168.192.249) 56(84) bytes of data.
64 bytes from imagetester.qe.nue2.suse.org (10.168.192.249): icmp_seq=7 ttl=64 time=0.326 ms
jbaier@imagetester:~> ping walter1.qe.nue2.suse.org
PING walter1.qe.nue2.suse.org (10.168.192.1) 56(84) bytes of data.
64 bytes from walter1.qe.nue2.suse.org (10.168.192.1): icmp_seq=1 ttl=64 time=0.331 ms
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Confirm <strong>when</strong> this started happening or if it's no longer an issue</li>
<li>There are no paused alerts</li>
</ul>
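<p>To confirm whether the panel and reality agree, the loss percentage can be measured the same way the panel claims to report it, by parsing the system <code>ping</code> summary. A hedged sketch (the helper names are ours, and this is not necessarily how telegraf computes the metric):</p>

```python
import re
import subprocess

def parse_loss(ping_output):
    """Extract the packet-loss percentage from a `ping` summary line."""
    m = re.search(r"([\d.]+)% packet loss", ping_output)
    return float(m.group(1)) if m else None

def packet_loss(host, count=5):
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return parse_loss(out)

# The parser against a typical summary line:
sample = "5 packets transmitted, 5 received, 0% packet loss, time 4005ms"
assert parse_loss(sample) == 0.0
```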
openQA Infrastructure - action #135632 (Resolved): "Mojo::File::spurt is deprecated in favor of Mojo::File::spew"... https://progress.opensuse.org/issues/135632 (2023-09-13T06:03:30Z, livdywan, liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>See <a href="https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1825493">https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1825493</a>:</p>
<pre><code>++ echo 'Build status for https://build.opensuse.org/project/show/devel:openQA (openSUSE_Leap_15.4) arch x86_64) is not successful'
++ echo '<resultlist state="9c65a2d1b41fa9bf4c35ffd463cddc69">
<result project="devel:openQA" repository="openSUSE_Leap_15.4" arch="x86_64" code="published" state="published">
[...]
<status package="os-autoinst" code="failed"/>
</code></pre>
<p>And accordingly <a href="https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/openSUSE_Leap_15.4/x86_64">https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/openSUSE_Leap_15.4/x86_64</a> - note that there are no persistent logs, so I attached the log of the failure:</p>
<pre><code>[19:51:05] ./xt/01-style.t ......................................... fatal: not a git repository (or any of the parent directories): .git
[...]
#   Failed test 'no (unexpected) warnings (via done_testing)'
#   at ./t/03-testapi.t line 1105.
# Got the following unexpected warnings:
#   1: Mojo::File::spurt is deprecated in favor of Mojo::File::spew at /home/abuild/rpmbuild/BUILD/os-autoinst-4.6.1694444383.e6a5294/basetest.pm line 433.
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> os-autoinst builds fine in CI</li>
<li><strong>AC2:</strong> No "fatal" but ignored warning in the output</li>
</ul>
<a name="Added-by-okurz-after-estimation"></a>
<h3 >Added by okurz after estimation<a href="#Added-by-okurz-after-estimation" class="wiki-anchor">¶</a></h3>
<ul>
<li><strong>AC3:</strong> Open SRs for Tumbleweed and MRs for Leap are accepted</li>
<li><strong>AC4:</strong> <a href="https://build.opensuse.org/package/live_build_log/openSUSE:Factory/openQA/standard/x86_64">https://build.opensuse.org/package/live_build_log/openSUSE:Factory/openQA/standard/x86_64</a> passes</li>
<li><strong>AC5:</strong> <a href="https://build.opensuse.org/package/live_build_log/openSUSE:Factory/os-autoinst/standard/x86_64">https://build.opensuse.org/package/live_build_log/openSUSE:Factory/os-autoinst/standard/x86_64</a> passes</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Note that this is not specific to OBS/Leap15.4</li>
<li>Confirm the source of the <code>fatal: not a git repository (or any of the parent directories): .git</code> errors
<ul>
<li>This is probably not failing anything and not new</li>
</ul></li>
<li>Address or switch off the <code>Mojo::File::spurt is deprecated in favor of Mojo::File::spew</code> warnings, which upset our no-warnings checks, e.g. just replace all uses of spurt with spew (that's how rolling distributions work). As a fallback, dynamically check whether the new method exists and use it if available, falling back to "spurt" otherwise</li>
<li>Also see <a href="https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17748">https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17748</a> and <a href="https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17746">https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17746</a></li>
</ul>
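<p>The dynamic-lookup fallback suggested above lives in Perl (<code>Mojo::File</code>), but the pattern itself is language-independent: prefer the new method name, fall back to the deprecated one. Sketched in Python with hypothetical stand-in classes:</p>

```python
def write_file(file_obj, data):
    """Use the new API name if present, else fall back to the deprecated one."""
    method = getattr(file_obj, "spew", None) or getattr(file_obj, "spurt")
    return method(data)

class OldMojoFile:  # hypothetical stand-in exposing only the deprecated name
    def spurt(self, data):
        return ("spurt", data)

class NewMojoFile:  # hypothetical stand-in exposing only the new name
    def spew(self, data):
        return ("spew", data)

assert write_file(OldMojoFile(), "x") == ("spurt", "x")
assert write_file(NewMojoFile(), "x") == ("spew", "x")
```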
<a name="Rollback-actions"></a>
<h2 >Rollback actions<a href="#Rollback-actions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Remove perl-Mojo-IOLoop-ReadWriteProcess and perl-Mojolicious-Plugin-AssetPack from devel:openQA as soon as we have the new version in Tumbleweed and current Leap</li>
<li>Same but in devel:openQA:Leap:15.5</li>
</ul>
openQA Infrastructure - action #81198 (Resolved): [tracker-ticket] openqaworker-arm-{1..3} have n... https://progress.opensuse.org/issues/81198 (2020-12-18T13:36:54Z, nicksinger, nsinger@suse.com)
<p>As we face repeated network problems with our arm workers (e.g. <a href="https://progress.opensuse.org/issues/81026" class="external">https://progress.opensuse.org/issues/81026</a>), we decided to once again disable IPv6 completely on all our arm workers.<br>
This ticket tracks that change so we can revisit it after the Christmas holidays.</p>
openQA Infrastructure - action #37644 (Resolved): [tools] osd SSL certificate is only valid for o... https://progress.opensuse.org/issues/37644 (2018-06-21T18:58:28Z, okurz, okurz@suse.com)