openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2024-03-06T11:43:54Z
openQA Project - action #156754 (Resolved): "DBIx::Class::Row::update(): Can't update OpenQA::Sch... | https://progress.opensuse.org/issues/156754 | 2024-03-06T11:43:54Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>As seen in <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: [alert] &quot;HTTP Response&quot; alert fired shortly on 2024-02-12 and 2024-03-04 size:M (Resolved)" href="https://progress.opensuse.org/issues/155326">#155326</a></p>
<p>The OSD journal logs show a DBIx error:</p>
<pre><code>Feb 12 00:38:12 openqa openqa[11635]: [error] [ztQJ1_pAsMiS] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x55b77ea45e28): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 139
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Specifically look into the only error in the log excerpt: "[ztQJ1_pAsMiS] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x55b77ea45e28): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 139" (see the sketch after this list)</li>
</ul>
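<p>The error means the code tried to UPDATE a lock row that another process had already deleted, so the UPDATE matched zero rows. A minimal, hypothetical sqlite3 sketch of that race and of the zero-rowcount condition an ORM can only report as "row not found"; table and column names are made up:</p>
<pre><code>import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_locks (name TEXT PRIMARY KEY, locked_by INT)")
db.execute("INSERT INTO job_locks VALUES ('mutex', 1)")

# Simulate the race: another process removes the lock row ...
db.execute("DELETE FROM job_locks WHERE name = 'mutex'")

# ... before we try to update it. The UPDATE itself does not fail;
# it simply matches zero rows.
cur = db.execute("UPDATE job_locks SET locked_by = 2 WHERE name = 'mutex'")
if cur.rowcount == 0:
    print("row not found - lock vanished underneath us")
</code></pre>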
openQA Infrastructure - action #156331 (Resolved): [gitlab] New pipeline schedules cannot be crea... | https://progress.opensuse.org/issues/156331 | 2024-02-29T12:50:10Z | jbaier_cz (jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>New pipeline schedules can’t be created.</p>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ol>
<li>Visit pipeline schedules of any project with CI/CD enabled.</li>
<li>Observe the message: "You have exceeded the maximum number of pipeline schedules for your plan. To create a new schedule, either increase your plan limit or delete an existing schedule."</li>
<li>See the disabled “New schedule” button.</li>
</ol>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>New pipeline schedules can be created.</p>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>Without the ability to create more schedules, the automation process might be hindered.</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>This issue can easily be solved by following the steps described in <a href="https://gitlab.suse.de/help/administration/instance_limits#number-of-pipeline-schedules" class="external">https://gitlab.suse.de/help/administration/instance_limits#number-of-pipeline-schedules</a></p>
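<p>Besides raising the instance-wide limit as described there, stale schedules can also be audited and pruned over GitLab's REST API. A sketch using the documented pipeline_schedules endpoint; the project path and token are placeholders:</p>
<pre><code>import requests

GITLAB = "https://gitlab.suse.de"
PROJECT = "qa-maintenance%2Fbot-ng"  # URL-encoded "namespace/project", placeholder
TOKEN = "..."                        # personal access token with "api" scope

r = requests.get(f"{GITLAB}/api/v4/projects/{PROJECT}/pipeline_schedules",
                 headers={"PRIVATE-TOKEN": TOKEN})
r.raise_for_status()
for schedule in r.json():
    # Inactive or obsolete schedules are candidates for deletion, which
    # frees up room under the per-project limit.
    print(schedule["id"], schedule["active"], schedule["description"])
</code></pre>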
openQA Infrastructure - action #156301 (Resolved): [bot-ng] Pipeline failed / KeyError: 'priority... | https://progress.opensuse.org/issues/156301 | 2024-02-29T08:54:46Z | livdywan (liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2327183" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2327183</a></p>
<pre><code>++ retry -r 30 -e -- ./qem-bot/bot-ng.py -c /etc/openqabot --token [MASKED] incidents-run
[...]
KeyError: 'priority'
Retrying up to 19 more times after sleeping 6144s …
2024-02-29 06:28:46 INFO Bot schedule starts now
Traceback (most recent call last):
File "./qem-bot/bot-ng.py", line 7, in <module>
main()
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 32, in main
sys.exit(cfg.func(cfg))
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 24, in do_incident_schedule
bot = OpenQABot(args)
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/openqabot.py", line 24, in __init__
self.incidents = get_incidents(self.token)
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 41, in get_incidents
xs.append(Incident(i))
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 23, in __init__
self.priority = incident["priority"]
KeyError: 'priority'
Retrying up to 18 more times after sleeping 12288s …
ERROR: Job failed: execution took longer than 4h0m0s seconds
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>DONE</strong> Restart pipelines</li>
<li>Investigate if there is new data the bot is not handling correctly (see the sketch after this list)</li>
<li>Don't provoke timeouts with retrying on reproducible errors</li>
<li>Look into unit test coverage</li>
</ul>
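<p>A hedged sketch of what tolerant incident loading could look like: skip records without a "priority" field and log them, rather than letting the KeyError abort the run and trigger pointless retries. Names loosely mirror qem-bot's loader, but this is illustrative, not the actual patch:</p>
<pre><code>import logging

log = logging.getLogger("openqabot.loader.qem")

def load_incidents(raw_incidents):
    incidents = []
    for item in raw_incidents:
        if "priority" not in item:
            # Reproducible bad data: report and skip instead of retrying.
            log.error("Incident %s has no priority, skipping", item.get("number"))
            continue
        incidents.append(item)
    return incidents

print(load_incidents([{"number": 1, "priority": 500}, {"number": 2}]))
</code></pre>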
openQA Infrastructure - action #156226 (Resolved): [bot-ng] Pipeline failed / failed to pulled im... | https://progress.opensuse.org/issues/156226 | 2024-02-28T13:51:23Z | livdywan (liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2325569" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2325569</a></p>
<pre><code>WARNING: Failed to pull image with policy "always": failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)
WARNING: Failed to pull image with policy "always": failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>DONE</strong> Restart pipelines</li>
<li><strong>DONE</strong> Report an infra SD ticket</li>
<li><strong>DONE</strong> Add retries to the pipeline (see also the pre-flight sketch after this list)</li>
</ul>
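<p>Retries help against transient registry hiccups, but "no space left on device" is usually persistent. A hypothetical pre-flight step that fails fast with a clear message when the runner host is low on disk; the path and the 5 GiB threshold are assumptions:</p>
<pre><code>import shutil
import sys

def require_free_space(path, minimum=5 * 1024**3):
    free = shutil.disk_usage(path).free
    if free < minimum:
        sys.exit(f"only {free / 1024**3:.1f} GiB free on {path}, "
                 f"need at least {minimum / 1024**3:.0f} GiB")

# On the runner, point this at the filesystem where image layers are
# unpacked, e.g. /var/lib/docker.
require_free_space("/")
</code></pre>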
openQA Project - action #156052 (Resolved): [alert] Scripts CI pipeline failing after logging mu... | https://progress.opensuse.org/issues/156052 | 2024-02-26T10:26:59Z | livdywan (liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2315561" class="external">https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2315561</a></p>
<pre><code>2 jobs have been created:
- http://openqa.suse.de/tests/13603796
- http://openqa.suse.de/tests/13603797
{"blocked_by_id":null,"id":13603796,"result":"none","state":"scheduled"}
Job state of job ID 13603796: scheduled, waiting …
{"blocked_by_id":null,"id":13603796,"result":"none","state":"running"}
Job state of job ID 13603796: running, waiting …
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Investigate what is causing the pipeline to fail
<ul>
<li>The pipeline fails.</li>
<li>The two created jobs failed.</li>
<li>There are a lot of log messages mentioning "waiting", and the waiting is never shown to conclude successfully or unsuccessfully (see the polling sketch after this list).</li>
</ul></li>
</ul>
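<p>A sketch of polling job state with an overall deadline, so a job stuck in "waiting" fails the pipeline with a clear message instead of hanging until the CI timeout. It uses openQA's public job-details route; the deadline and poll interval are assumptions:</p>
<pre><code>import time
import requests

def wait_for_job(base, job_id, deadline_s=2 * 3600, poll_s=30):
    start = time.time()
    while time.time() - start < deadline_s:
        job = requests.get(f"{base}/api/v1/jobs/{job_id}").json()["job"]
        print(f"Job {job_id}: state={job['state']}, result={job['result']}")
        if job["state"] == "done":
            return job["result"]
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} still not done after {deadline_s}s")

print(wait_for_job("https://openqa.suse.de", 13603796))
</code></pre>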
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<p>Activate pipelines on <a href="https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules" class="external">https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules</a> again</p>
QA - action #155917 (Resolved): [backlogger] Count "Feedback" ticket state for cycle time as well... | https://progress.opensuse.org/issues/155917 | 2024-02-23T10:37:06Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>As discussed in the coordination call where we now look at our metrics, it would be a good idea to take Feedback into account for the cycle time.</p>
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<p><strong>AC1</strong>: The cycle time treats "Feedback" like "In Progress"<br>
<strong>AC2</strong>: We are aware that "Blocked" needs to be used (instead of "Feedback") when waiting on external progress</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Count "Feedback" the same as "In Progress". We include tickets in "Feedback" when just waiting for others within the team would be counted towards cycle time </li>
<li>Document in the wiki that waiting for external feedback should use Blocked</li>
</ul>
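<p>A minimal sketch of AC1: when computing cycle time from a ticket's status history, count "Feedback" like "In Progress" and leave "Blocked" out. The (status, entered_at) journal format is a simplification of what Redmine actually provides:</p>
<pre><code>from datetime import datetime

ACTIVE = {"In Progress", "Feedback"}  # AC1: Feedback counts as active

def cycle_time_hours(journal):
    """journal: (status, entered_at) tuples sorted by time, ending
    with the entry that closes the ticket."""
    total = 0.0
    for (status, start), (_, end) in zip(journal, journal[1:]):
        if status in ACTIVE:
            total += (end - start).total_seconds() / 3600
    return total

journal = [
    ("In Progress", datetime(2024, 2, 23, 9)),
    ("Feedback",    datetime(2024, 2, 23, 17)),  # counted (AC1)
    ("Blocked",     datetime(2024, 2, 26, 9)),   # external wait, not counted (AC2)
    ("In Progress", datetime(2024, 2, 27, 9)),
    ("Resolved",    datetime(2024, 2, 27, 12)),
]
print(cycle_time_hours(journal))  # 8 + 64 + 3 = 75.0
</code></pre>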
openQA Infrastructure - action #155725 (Resolved): [openQA][infra][sut] Failed to establish connn... | https://progress.opensuse.org/issues/155725 | 2024-02-21T09:38:46Z | waynechen55 (wchen@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Cannot establish an ipmi sol connection to fozzie-sp and quinn-sp:</p>
<pre><code>localhost:~ # ipmitool -I lanplus -H fozzie-sp.qe.nue2.suse.org -U ADMIN -P xxxxx chassis power status
Address lookup for fozzie-sp.qe.nue2.suse.org failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
localhost:~ # ipmitool -I lanplus -H quinn-sp.qe.nue2.suse.org -U ADMIN -P xxxxx chassis power status
Address lookup for quinn-sp.qe.nue2.suse.org failed
Could not open socket!
Error: Unable to establish IPMI v2 / RMCP+ session
localhost:~ # ping -c5 fozzie-sp.qe.nue2.suse.org
ping: fozzie-sp.qe.nue2.suse.org: Name or service not known
localhost:~ # ping -c5 quinn-sp.qe.nue2.suse.org
ping: quinn-sp.qe.nue2.suse.org: Name or service not known
</code></pre>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ul>
<li>Use ipmitool to perform any operation against these hosts</li>
</ul>
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>Test runs keep failing.</p>
<a name="Problem"></a>
<h2 >Problem<a href="#Problem" class="wiki-anchor">¶</a></h2>
<p>Looks like something is wrong with the management unit</p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check management unit state (see the triage sketch after this list)</li>
<li>Check error/warning report from management unit</li>
<li>Check management unit configuration</li>
<li>Check that ipmi sol is enabled</li>
</ul>
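<p>The ping output above ("Name or service not known") points at DNS resolution rather than the BMCs themselves. A small triage sketch that separates the two failure modes; the hostnames and ipmitool arguments are taken from the observation, the rest is illustrative:</p>
<pre><code>import socket
import subprocess

for host in ("fozzie-sp.qe.nue2.suse.org", "quinn-sp.qe.nue2.suse.org"):
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror as e:
        print(f"{host}: DNS lookup failed ({e}) - fix the DNS record first")
        continue
    # The name resolves, so check whether the BMC actually answers.
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", addr, "-U", "ADMIN",
         "-P", "xxxxx", "chassis", "power", "status"],
        capture_output=True, text=True)
    print(f"{host} ({addr}):", result.stdout.strip() or result.stderr.strip())
</code></pre>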
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>n/a</p>
openQA Infrastructure - action #155080 (Resolved): jenkins is no longer producing GNOME:Next tes... | https://progress.opensuse.org/issues/155080 | 2024-02-07T13:03:45Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>From <a href="https://suse.slack.com/archives/C02CANHLANP/p1707310927769339" class="external">https://suse.slack.com/archives/C02CANHLANP/p1707310927769339</a></p>
<blockquote>
<p>(Dominique Leuenberger) seems jenkins is no longer producing GNOME:Next test runs: <a href="http://jenkins.qa.suse.de/job/gnome_next-openqa/8670/console" class="external">http://jenkins.qa.suse.de/job/gnome_next-openqa/8670/console</a></p>
</blockquote>
<pre><code>Caused: java.io.IOException: Cannot run program "/bin/sh" (in directory "/var/lib/jenkins/workspace/gnome_next-openqa"): error=0, Failed to exec spawn helper: pid: 2883, signal: 11
</code></pre>
openQA Infrastructure - action #154927 (Resolved): [alert] Broken workers alert was firing severa... | https://progress.opensuse.org/issues/154927 | 2024-02-05T10:03:31Z | mkittler (marius.kittler@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>See <a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=96&editPanel=96&from=1706991565957&to=1707139468853" class="external">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=96&editPanel=96&from=1706991565957&to=1707139468853</a> for the panel/timeframe</p>
<p>Example of how the worker log looked (it looked similar on all machines/services I checked):</p>
<pre><code>Feb 04 03:33:36 worker40 worker[3881]: [error] [pid:3881] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:33:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:46 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:33:56 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:33:56 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:06 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:06 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:16 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:16 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:26 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:26 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:36 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:36 worker40 worker[3881]: [warn] [pid:3881] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registering with openQA openqa.suse.de
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3087
Feb 04 03:34:46 worker40 worker[3881]: [info] [pid:3881] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3087
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Accepting job 13420128 from openqa.suse.de.
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Setting job 13420128 from openqa.suse.de up
Feb 04 19:43:20 worker40 worker[3881]: [debug] [pid:3881] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] +++ setup notes +++
Feb 04 19:43:20 worker40 worker[3881]: [info] [pid:3881] Running on worker40:1 (Linux 5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64)
</code></pre>
<p>Sometimes connecting to the cache service is still not possible even though the worker registration already works:</p>
<pre><code>Feb 04 03:35:12 worker34 worker[3937]: [error] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Project dir for host openqa.suse.de is /var/lib/openqa/share
Feb 04 03:35:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:32 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:42 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:42 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:35:52 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:35:52 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:02 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:02 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:12 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:12 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:22 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:22 worker34 worker[3937]: [warn] [pid:3937] Failed to register at openqa.suse.de - connection error: Transport endpoint is not connected - trying again in 10 seconds
Feb 04 03:36:32 worker34 worker[3937]: [info] [pid:3937] Registering with openQA openqa.suse.de
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2687
Feb 04 03:36:33 worker34 worker[3937]: [info] [pid:3937] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2687
Feb 04 03:36:33 worker34 worker[3937]: [warn] [pid:3937] Worker cache not available via http://127.0.0.1:9530: Cache service info error: Connection refused - checking again for web UI 'openqa.suse.de' in 80.81 s
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Accepting job 13423952 from openqa.suse.de.
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Setting job 13423952 from openqa.suse.de up
Feb 04 19:18:04 worker34 worker[3937]: [debug] [pid:3937] Preparing Mojo::IOLoop::ReadWriteProcess::Session
</code></pre>
<p>Of course the worker is designed to try again, and after a while all ~100 affected worker slots were good again, with the exception of two worker slots where recovery apparently took several hours. Those two slots were triggering the alert. Unfortunately we don't know from the data in Grafana which slots those were.</p>
<p>The cache service itself took so long to start only because it couldn't get a socket to listen on at first:</p>
<pre><code>…
Feb 04 03:34:27 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: [17249] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:27 worker40 openqa-workercache-daemon[17249]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:27 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 11.
Feb 04 03:34:32 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:32 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: [17340] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:33 worker40 openqa-workercache-daemon[17340]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:33 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 12.
Feb 04 03:34:38 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: [17453] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:38 worker40 openqa-workercache-daemon[17453]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Feb 04 03:34:38 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Feb 04 03:34:39 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at 13.
Feb 04 03:34:43 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
Feb 04 03:34:43 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://127.0.0.1:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: Web application available at http://[::1]:9530
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://127.0.0.1:9530"
Feb 04 03:34:44 worker40 openqa-workercache-daemon[17834]: [17834] [i] Listening at "http://[::1]:9530"
</code></pre>
<p>There were actually 13 failed startup attempts on that particular host, all of which happened within the time frame of a minute.</p>
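<p>The "Address family for hostname not supported" failures suggest the IPv6 loopback address was not usable yet while the system was still coming up, so the [::1] listen socket could not be bound until networking settled. A small Python illustration (not the actual Perl service) of how binding both loopback addresses behaves when one address family is unavailable:</p>
<pre><code>import socket

def try_listen(host, port):
    """Try to open a listening socket and report bind errors such as
    'address family not supported' instead of crashing."""
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    try:
        s = socket.socket(family, socket.SOCK_STREAM)
        s.bind((host, port))
        s.listen()
        print(f"listening on {host}:{port}")
        return s
    except OSError as e:
        print(f"cannot listen on {host}:{port}: {e}")
        return None

# The cache service listens on both loopback addresses; if IPv6 is not
# up yet, the second call fails the way the journal above shows.
sockets = [try_listen("127.0.0.1", 9530), try_listen("::1", 9530)]
</code></pre>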
QA - coordination #154756 (Resolved): [epic] Decommission qa-maintenance/openQABot | https://progress.opensuse.org/issues/154756 | 2024-02-01T14:20:27Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/qa-maintenance/openQABot/" class="external">https://gitlab.suse.de/qa-maintenance/openQABot/</a> is a copy&rewrite of <a href="https://github.com/openSUSE/openSUSE-release-tools/blob/master/openqa-maintenance.py" class="external">https://github.com/openSUSE/openSUSE-release-tools/blob/master/openqa-maintenance.py</a> which was already partially replaced by <a href="https://github.com/openSUSE/qem-bot/" class="external">https://github.com/openSUSE/qem-bot/</a> . openQABot was still used for L3+MR testing, see <a href="https://gitlab.suse.de/qa-maintenance/openQABot/-/pipeline_schedules" class="external">https://gitlab.suse.de/qa-maintenance/openQABot/-/pipeline_schedules</a> but according to latest runs, e.g. <a href="https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/2230100" class="external">https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/2230100</a> and <a href="https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/2230810" class="external">https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/2230810</a> and also because I don't see any jobs in <a href="https://openqa.suse.de/parent_group_overview/30#grouped_by_build" class="external">https://openqa.suse.de/parent_group_overview/30#grouped_by_build</a> I assume nobody needs openQABot anymore. To reduce our maintenance effort we should fully decommission openQABot.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> <a href="https://gitlab.suse.de/qa-maintenance/openQABot" class="external">https://gitlab.suse.de/qa-maintenance/openQABot</a> is archived</li>
<li><strong>AC2:</strong> No openQABot invocations happen over <a href="https://gitlab.suse.de/qa-maintenance/openQABot/-/pipeline_schedules" class="external">https://gitlab.suse.de/qa-maintenance/openQABot/-/pipeline_schedules</a></li>
<li><strong>AC3:</strong> openQABot is not referenced as an active application anymore in common places like wiki pages</li>
<li><strong>AC4:</strong> Full decommissioning was announced over applicable communication channels</li>
</ul>
openQA Infrastructure - action #154627 (Resolved): [potential-regression] Ensure that our "host u... | https://progress.opensuse.org/issues/154627 | 2024-01-31T12:43:45Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: CPU Load and usage alert for openQA workers size:S (Resolved)" href="https://progress.opensuse.org/issues/150983">#150983</a> and <a class="issue tracker-4 status-6 priority-3 priority-lowest closed" title="action: [potential-regression] Our salt node up check in osd-deployment never fails size:M (Rejected)" href="https://progress.opensuse.org/issues/151588">#151588</a>. Currently our "host up" alert is likely showing "no data" for currently salt-controlled hosts that are temporarily down but that needs to be crosschecked.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We are alerted if a host that is currently in salt is down</li>
<li><strong>AC2:</strong> There is only one firing alert at a time when a host that is currently in salt is down</li>
<li><strong>AC3:</strong> There is no firing alert after a reasonable time if we have removed a host from salt control, i.e. removed it from the salt keys on OSD and potentially re-deployed a high state</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>On monitor.qa.suse.de select any host, show the "host up" panel, then shut down the machine and check how the ping behaves, e.g. select tumblesle on qamaster</li>
<li>Fix the alert, or if everything works fine, convince everybody who made big noise about nothing (see the toy evaluation sketch after this list)</li>
<li>Extend our documentation in salt-states repo or team wiki or openQA wiki as applicable for how to handle taking hosts down/up or something, e.g. review <a href="https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production" class="external">https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production</a></li>
</ul>
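<p>To make AC1/AC2 concrete, a toy model of the desired alert semantics: "no data" for a salt-controlled host counts the same as "down" and yields exactly one firing state per host. Entirely illustrative; the real logic lives in the Grafana alert rule:</p>
<pre><code>def evaluate(samples):
    """samples: recent ping results for one host, 1 = up, 0 = down,
    None = "no data". A host that is down or silent must fire."""
    recent = samples[-3:]
    if not recent or all(s in (0, None) for s in recent):
        return "firing"
    return "ok"

assert evaluate([1, 1, None, None, None]) == "firing"  # AC1: silence fires
assert evaluate([1, 0, 0, 0, 0]) == "firing"           # plain "down" fires too
assert evaluate([1, 1, 1, 1, 1]) == "ok"
</code></pre>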
openQA Infrastructure - action #153958 (Resolved): [alert] s390zl12: Memory usage alert Generic m... | https://progress.opensuse.org/issues/153958 | 2024-01-19T11:57:59Z | tinita (tina.mueller+trick-redmine@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code>Date: Fri, 19 Jan 2024 11:55:37 +0100
1 firing alert instance
[IMAGE]
GROUPED BY
hostname=s390zl12
1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
s390zl12: Memory usage alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
A0=0.06117900738663373
Labels
alertname
s390zl12: Memory usage alert
grafana_folder
Generic
hostname
s390zl12
rule_uid
memory_usage_alert_s390zl12
</code></pre>
<p><a href="http://stats.openqa-monitor.qa.suse.de/alerting/grafana/memory_usage_alert_s390zl12/view?orgId=1" class="external">http://stats.openqa-monitor.qa.suse.de/alerting/grafana/memory_usage_alert_s390zl12/view?orgId=1</a></p>
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<p>Remove silence "alertname=s390zl12: Memory usage alert" from <a href="https://stats.openqa-monitor.qa.suse.de/alerting/silences" class="external">https://stats.openqa-monitor.qa.suse.de/alerting/silences</a></p>
openQA Infrastructure - action #150938 (Resolved): [openQA][sut][ipmi] No ipmi sol output with ix... | https://progress.opensuse.org/issues/150938 | 2023-11-16T09:39:37Z | waynechen55 (wchen@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Test runs start failing with <code>imagetester:7</code> at ipxe_install, for example <a href="https://openqa.suse.de/tests/12822901#step/ipxe_install/1" class="external">https://openqa.suse.de/tests/12822901#step/ipxe_install/1</a>. It looks like a needle matching failure, but actually nothing is printed on the ipmi sol console after reboot.</p>
<pre><code>ipmitool -I lanplus -C 3 -H ix64ph1075-sp.qe.nue2.suse.org -U admin -P xxxxxxxx sol activate
</code></pre>
<a name="Steps-to-reproduce"></a>
<h2 >Steps to reproduce<a href="#Steps-to-reproduce" class="wiki-anchor">¶</a></h2>
<ul>
<li>Connect to ix64ph1075 ipmi sol console</li>
<li>Reboot the machine</li>
<li>Wait for output on the ipmi sol console (automated in the sketch after this list)</li>
</ul>
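<p>A hypothetical automation of these steps: open the SOL session with the exact ipmitool invocation from the observation, reboot the SUT within the window, and report whether anything arrives at all. The 120 s window is an assumption:</p>
<pre><code>import subprocess
import time

cmd = ["ipmitool", "-I", "lanplus", "-C", "3",
       "-H", "ix64ph1075-sp.qe.nue2.suse.org",
       "-U", "admin", "-P", "xxxxxxxx", "sol", "activate"]

# "sol activate" keeps the session open, so capture it for a while
# (reboot the machine during this window) and then check whether the
# console printed anything at all.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
time.sleep(120)
proc.terminate()
data = proc.stdout.read()
print("console output seen" if data.strip() else "no output within 120s")
</code></pre>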
<a name="Impact"></a>
<h2 >Impact<a href="#Impact" class="wiki-anchor">¶</a></h2>
<p>No test run assigned to <code>imagetester:7</code> can proceed. By now this concerns <code>imagetester:6</code>.</p>
<a name="Problem"></a>
<h2 >Problem<a href="#Problem" class="wiki-anchor">¶</a></h2>
<ul>
<li>Looks like something is wrong with the ipmi sol console</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check ipmi sol config</li>
<li>Check warning/error in BMC</li>
<li>Factory-reset the BMC</li>
<li>Reinstall the firmware</li>
<li>Click every possible button</li>
<li>Check that the physical ethernet cable is not broken</li>
</ul>
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>n/a</p>
<a name="Rollback-actions"></a>
<h2 >Rollback actions<a href="#Rollback-actions" class="wiki-anchor">¶</a></h2>
<ul>
<li><code>sudo systemctl unmask openqa-worker-auto-restart@6 && sudo systemctl enable --now openqa-worker-auto-restart@6</code></li>
</ul>
openQA Infrastructure - action #137600 (Resolved): [alert] Packet loss between worker hosts and o... | https://progress.opensuse.org/issues/137600 | 2023-10-09T07:46:02Z | jbaier_cz (jbaier@suse.cz)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>We had multiple occurrences of the packet loss alert over the weekend:</p>
<pre><code>alertname Packet loss between worker hosts and other hosts alert
grafana_folder Salt
rule_uid 2Z025iB4km
</code></pre>
<p><a href="http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4" class="external">http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=4</a></p>
<p>Currently, the problematic ones according to the panel are:</p>
<pre><code>imagetester - walter1.qe.nue2.suse.org 100%
petrol-1 - walter1.qe.nue2.suse.org 100%
sapworker1 - walter1.qe.nue2.suse.org 100%
</code></pre>
<p>That is a little bit weird as I manually checked the first pair and the hosts can reach each other just fine:</p>
<pre><code>walter1:~ # ping imagetester.qe.nue2.suse.org
PING imagetester.qe.nue2.suse.org (10.168.192.249) 56(84) bytes of data.
64 bytes from imagetester.qe.nue2.suse.org (10.168.192.249): icmp_seq=7 ttl=64 time=0.326 ms
jbaier@imagetester:~> ping walter1.qe.nue2.suse.org
PING walter1.qe.nue2.suse.org (10.168.192.1) 56(84) bytes of data.
64 bytes from walter1.qe.nue2.suse.org (10.168.192.1): icmp_seq=1 ttl=64 time=0.331 ms
</code></pre>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Confirm <strong>when</strong> this started happening or whether it's no longer an issue (see the cross-check sketch after this list)</li>
<li>There are no paused alerts</li>
</ul>
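<p>A hypothetical cross-check of the Grafana panel: ping each flagged destination and print the measured packet loss. The host pairs are from the panel above; the loss parsing and the ping count are assumptions, and to reproduce the panel exactly the script must run on each source host in turn:</p>
<pre><code>import re
import subprocess

PAIRS = [
    ("imagetester", "walter1.qe.nue2.suse.org"),
    ("petrol-1", "walter1.qe.nue2.suse.org"),
    ("sapworker1", "walter1.qe.nue2.suse.org"),
]

def packet_loss(target):
    out = subprocess.run(["ping", "-c", "10", "-q", target],
                         capture_output=True, text=True).stdout
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return m.group(1) + "%" if m else "unknown"

for src, dst in PAIRS:
    # This measures loss from the machine running the script to dst;
    # run it on imagetester, petrol-1 and sapworker1 in turn.
    print(f"{src} -> {dst}: {packet_loss(dst)}")
</code></pre>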
openQA Infrastructure - action #135632 (Resolved): "Mojo::File::spurt is deprecated in favor of M... | https://progress.opensuse.org/issues/135632 | 2023-09-13T06:03:30Z | livdywan (liv.dywan@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>See <a href="https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1825493:">https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1825493:</a></p>
<pre><code>++ echo 'Build status for https://build.opensuse.org/project/show/devel:openQA (openSUSE_Leap_15.4) arch x86_64) is not successful'
++ echo '<resultlist state="9c65a2d1b41fa9bf4c35ffd463cddc69">
<result project="devel:openQA" repository="openSUSE_Leap_15.4" arch="x86_64" code="published" state="published">
[...]
<status package="os-autoinst" code="failed"/>
</code></pre>
<p>And accordingly <a href="https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/openSUSE_Leap_15.4/x86_64">https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/openSUSE_Leap_15.4/x86_64</a> - note that there are no persistent logs, so I attached the log of the failure:</p>
<pre><code>[19:51:05] ./xt/01-style.t ......................................... fatal: not a git repository (or any of the parent directories): .git
[...]
# Failed test 'no (unexpected) warnings (via done_testing)'
at ./t/03-testapi.t line 1105.
# Got the following unexpected warnings:
# 1: Mojo::File::spurt is deprecated in favor of Mojo::File::spew at /home/abuild/rpmbuild/BUILD/os-autoinst-4.6.1694444383.e6a5294/basetest.pm line 433.
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> os-autoinst builds fine in CI</li>
<li><strong>AC2:</strong> No "fatal" message that is silently ignored in the output</li>
</ul>
<a name="Added-by-okurz-after-estimation"></a>
<h3 >Added by okurz after estimation<a href="#Added-by-okurz-after-estimation" class="wiki-anchor">¶</a></h3>
<ul>
<li><strong>AC3:</strong> Open SRs for Tumbleweed and MRs for Leap are accepted</li>
<li><strong>AC4:</strong> <a href="https://build.opensuse.org/package/live_build_log/openSUSE:Factory/openQA/standard/x86_64">https://build.opensuse.org/package/live_build_log/openSUSE:Factory/openQA/standard/x86_64</a> passes</li>
<li><strong>AC5:</strong> <a href="https://build.opensuse.org/package/live_build_log/openSUSE:Factory/os-autoinst/standard/x86_64">https://build.opensuse.org/package/live_build_log/openSUSE:Factory/os-autoinst/standard/x86_64</a> passes</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Note that this is not specific to OBS/Leap 15.4</li>
<li>Confirm the source of the <code>fatal: not a git repository (or any of the parent directories): .git</code> errors
<ul>
<li>This is probably not failing anything and not new</li>
</ul></li>
<li>Address or switch off the <code>Mojo::File::spurt is deprecated in favor of Mojo::File::spew</code> errors which seem to upset our checks for no warnings, e.g. just replace all uses of spurt with spew, because that's how rolling distributions work. As a fallback, try a dynamic lookup of the method's presence: use it if available, and fall back to "spurt" otherwise (see the sketch after this list)</li>
<li>Also see <a href="https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17748">https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17748</a> and <a href="https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17746">https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17746</a></li>
</ul>
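<p>The suggested dynamic lookup is a Perl idiom (checking with Mojo::File's can). Purely to illustrate the pattern, a Python sketch with a hypothetical file class; in os-autoinst the real change would be the Perl equivalent:</p>
<pre><code>class LegacyFile:
    """Hypothetical stand-in for an object that only has the old
    method name, like Mojo::File before the spew rename."""

    def __init__(self, path):
        self.path = path

    def spurt(self, data):  # deprecated spelling
        with open(self.path, "w") as f:
            f.write(data)

def write_file(f, data):
    # Prefer the newer "spew" if the object provides it, otherwise
    # fall back to the deprecated "spurt".
    writer = getattr(f, "spew", None) or f.spurt
    writer(data)

write_file(LegacyFile("/tmp/example.txt"), "hello\n")
</code></pre>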
<a name="Rollback-actions"></a>
<h2 >Rollback actions<a href="#Rollback-actions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Remove perl-Mojo-IOLoop-ReadWriteProcess and perl-Mojolicious-Plugin-AssetPack from devel:openQA as soon as we have the new version in Tumbleweed and current Leap</li>
<li>Same but in devel:openQA:Leap:15.5</li>
</ul>