openSUSE Project Management Tool (https://progress.opensuse.org/)
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-01-12T07:04:28Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=592576" class="external">journal 592576</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>[alert] openqaworker1</i> to <i>[alert] openqa/monitor-o3 failing because openqaworker1 is down</i></li></ul>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-01-12T07:28:48Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=592588" class="external">journal 592588</a>)</p>
<ul><li><strong>Tags</strong> set to <i>infra</i></li><li><strong>Priority</strong> changed from <i>High</i> to <i>Urgent</i></li></ul><p>As long as the monitoring pipeline is active it will bug about this, so this needs urgent handling, at best today</p>
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-01-12T10:19:35Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=592795" class="external">journal 592795</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>[alert] openqa/monitor-o3 failing because openqaworker1 is down</i> to <i>[alert] openqa/monitor-o3 failing because openqaworker1 is down size:M</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/592795/diff?detail_id=556630">diff</a>)</li><li><strong>Status</strong> changed from <i>New</i> to <i>Workable</i></li></ul>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-01-12T11:49:14Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=592885" class="external">journal 592885</a>)</p>
<ul></ul><p>w1 runs s390x instances so the impact is more than just x86_64. This was brought up in <a href="https://suse.slack.com/archives/C02CANHLANP/p1673523496304249" class="external">https://suse.slack.com/archives/C02CANHLANP/p1673523496304249</a></p>
<blockquote>
<p>(Sofia Syrianidou) what's wrong with o3 s390x? I scheduled a couple of test in the morning and they are still not assigned to a worker.</p>
</blockquote>
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-01-12T12:51:55Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=592912" class="external">journal 592912</a>)</p>
<ul><li><strong>Blocked by</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/123028">action #123028</a>: A/C broken in TAM lab size:M</i> added</li></ul>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-01-12T15:52:11Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=593011" class="external">journal 593011</a>)</p>
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/593011/diff?detail_id=556831">diff</a>)</li></ul><p>As part of <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: o3 worker rebel is down; was: inconsistent package database or filesystem corruption size:M (Resolved)" href="https://progress.opensuse.org/issues/122998">#122998</a> I've been enabling the s390x worker slots on rebel instead.</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-01-12T15:55:54Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=593020" class="external">journal 593020</a>)</p>
<ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-01-12T15:58:32Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=593023" class="external">journal 593023</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>Blocked</i></li></ul><p>The worker is currently explicitly offline, see blocker. IPMI access works at least (via <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/428870cab3957d4fc206164c236af20c340e7157" class="external">reverted command</a>).</p>
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-01-17T11:42:22Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=594163" class="external">journal 594163</a>)</p>
<ul></ul><p>I guess worker1 should be removed from salt? Since it's still <a href="https://progress.opensuse.org/issues/122983" class="external">failing our deployment monitoring</a>.</p>
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-01-17T12:53:24Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=594217" class="external">journal 594217</a>)</p>
<ul></ul><p>cdywan wrote:</p>
<blockquote>
<p>I guess worker1 should be removed from salt? Since it's still <a href="https://progress.opensuse.org/issues/122983" class="external">failing our deployment monitoring</a>.</p>
</blockquote>
<p><a href="https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83" class="external">https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83</a></p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-01-17T12:59:42Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=594223" class="external">journal 594223</a>)</p>
<ul></ul><p>cdywan wrote:</p>
<blockquote>
<p>I guess worker1 should be removed from salt?</p>
</blockquote>
<p>No, o3 workers are not in salt. The workers are listed in <a href="https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml" class="external">https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml</a></p>
<blockquote>
<p>Since it's still <a href="https://progress.opensuse.org/issues/122983" class="external">failing our deployment monitoring</a>.</p>
</blockquote>
<p>That's not deployment monitoring but explicit monitoring for o3 workers. I removed the openqaworker1 config for now with</p>
<p><a href="https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83" class="external">https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83</a></p>
<p>And also added the missing openqaworker19+20 in a subsequent commit.</p>
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-01-20T15:38:56Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=595744" class="external">journal 595744</a>)</p>
<ul><li><strong>Due date</strong> deleted (<del><i>2023-01-20</i></del>)</li></ul><p>This is blocking on a blocked ticket. Thus resetting the due date.</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-01-25T10:23:08Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=596638" class="external">journal 596638</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Feedback</i></li><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>Normal</i></li></ul><p>openqaworker1 monitoring was disabled with<br>
<a href="https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83" class="external">https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83</a><br>
and we don't need that machine critically so we can reduce priority.<br>
I created <a href="https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9" class="external">https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9</a> to follow Marius's suggestion to use <code>when: manual</code> instead of disabled code. And then eventually when openqaworker1 is usable in FC labs, see #119548, we can try to connect the machine again with o3 over routing over different locations.</p>
<p>@Marius I suggest to set this ticket to "Blocked" by #119548 as soon as <a href="https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9" class="external">https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9</a> is merged</p>
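<p>The <code>when: manual</code> approach looks roughly like this in GitLab CI (a sketch with placeholder job and script names, not the actual monitor-o3 pipeline):</p>

```yaml
# Keep the openqaworker1 check defined in the pipeline, but only
# runnable by hand, instead of deleting or commenting out the job.
check-openqaworker1:
  when: manual   # job is skipped unless triggered explicitly
  script:
    - ./check-worker.sh openqaworker1
```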
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-01-26T17:03:39Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=597241" class="external">journal 597241</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Blocked</i></li></ul>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-14T16:59:53Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=613370" class="external">journal 613370</a>)</p>
<ul></ul><p>Looks like the worker is now in FC: <a href="https://racktables.suse.de/index.php?page=object&object_id=1260" class="external">https://racktables.suse.de/index.php?page=object&object_id=1260</a></p>
<p>I couldn't reach it via SSH (from ariel) or IPMI, though. So I guess this ticket is still blocked.</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-14T17:01:38Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=613376" class="external">journal 613376</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Feedback</i></li></ul><p>I set this ticket to feedback because I'm not sure what other ticket I'm waiting for. Surely the AC problem in the TAM lab isn't relevant anymore and #119548 is resolved. So we need to talk about it in the unblock meeting.</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-03-14T17:16:28Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=613379" class="external">journal 613379</a>)</p>
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>I set this ticket to feedback because I'm not sure what other ticket I'm waiting for.</p>
</blockquote>
<p>Well, it was #119548 which is resolved so you can continue.</p>
<p>What we can do as a next step is one of the following:</p>
<ol>
<li>Find the dynamic DHCP lease, e.g. from <code>ip n</code> on a neighboring machine -> <em>DONE</em> from qa-jump, no match</li>
<li>Or wait for <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113959" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113959</a> so that we would be able to find the DHCP lease from the DHCP server directly</li>
<li>Add the machine into the <a href="https://gitlab.suse.de/OPS-Service/salt/" class="external">ops salt repo</a> with both the ipmi+prod Ethernet and use it as experimental OSD worker from FC Basement</li>
<li>Or skip step 3 and make it work as an o3 worker:
<ul>
<li>4a. either coordinate with Eng-Infra how to connect it into the o3 network</li>
<li>4b. just connect it over the public https interface <a href="https://openqa.opensuse.org" class="external">https://openqa.opensuse.org</a></li>
</ul></li>
</ol>
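<p>Step 1 above can be scripted: the neighbor table printed by <code>ip neigh</code> can be filtered for the worker's MAC address. A minimal sketch (the helper name is made up; the MAC in the usage comment is only illustrative):</p>

```shell
# Print the IP currently held by a given MAC, parsing "ip neigh" output
# from stdin (fields: IP dev IFACE lladdr MAC STATE).
lease_for_mac() {
    awk -v mac="$1" 'tolower($5) == tolower(mac) { print $1 }'
}

# On a neighboring machine in the same L2 segment one would run:
#   ip neigh show | lease_for_mac 2c:60:0c:73:03:d6
```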
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-15T15:40:03Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=613922" class="external">journal 613922</a>)</p>
<ul></ul><p>I've created <a href="https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275" class="external">https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275</a> for option 3.</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-21T10:48:20Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=615545" class="external">journal 615545</a>)</p>
<ul></ul><p>The MR is still pending.</p>
<p>When asking about option 4 on Slack I only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement right now.</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-03-21T11:02:00Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=615560" class="external">journal 615560</a>)</p>
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>The MR is still pending.</p>
<p>When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.</p>
</blockquote>
<p>Yes, of course it wasn't considered yet. That is why we do this exploration task here :) What about 4b? Just connect to <a href="https://openqa.opensuse.org" class="external">https://openqa.opensuse.org</a>?</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-21T12:16:33Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=615644" class="external">journal 615644</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Blocked</i></li></ul>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-23T12:26:46Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=616691" class="external">journal 616691</a>)</p>
<ul></ul><p>The MR has been merged but I cannot resolve openqaworker1-ipmi.qe.nue2.suse.org or openqaworker1.qe.nue2.suse.org. I'm using VPN and I can resolve e.g. thincsus.qe.nue2.suse.org so it is likely not a local problem.</p>
<p>I'm also unable to establish an IPMI or SSH connection using the IPs. Maybe this needs on-site investigation?</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-23T12:29:14Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=616694" class="external">journal 616694</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Feedback</i></li></ul>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-29T14:56:58Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=618401" class="external">journal 618401</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>In Progress</i></li></ul><p>I can now resolve both domains and establish an IPMI connection. So whatever the problem was, it is now solved. The machine was powered off so I've just powered it on. Let's see whether I can simply connect it to o3 like I would connect any public worker.</p>
<hr>
<p>The system boots and has a link via:</p>
<pre><code>2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 2c:60:0c:73:03:d6 brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
altname ens255f0
inet 192.168.112.6/24 brd 192.168.112.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2e60:cff:fe73:3d6/64 scope link
valid_lft forever preferred_lft forever
</code></pre>
<p>However, the IP doesn't match the one configured by <a href="https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275" class="external">https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275</a> and there's no IP connectivity.</p>
<hr>
<p>Since IPMI is at least working I've created <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515</a>.</p>
<p><strong>Updated by openqa_review (openqa-review@suse.de) on 2023-03-30T04:10:27Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=618476" class="external">journal 618476</a>)</p>
<ul><li><strong>Due date</strong> set to <i>2023-04-13</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-03-30T11:03:58Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=618755" class="external">journal 618755</a>)</p>
<ul></ul><p><a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515</a> merged.</p>
<p>/etc/sysconfig/network/ifcfg-eth0 shows</p>
<pre><code>BOOTPROTO='static'
STARTMODE='auto'
IPADDR='192.168.112.6/24'
ZONE=trusted
</code></pre>
<p>so configure that to DHCP and try again</p>
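<p>The DHCP variant of that file would look roughly like this (a sketch of the sysconfig ifcfg syntax as consumed by wicked, not necessarily the exact final file on the machine):</p>

```ini
# /etc/sysconfig/network/ifcfg-eth0 (illustrative)
BOOTPROTO='dhcp'
STARTMODE='auto'
ZONE=trusted
```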
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-30T14:19:58Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=618851" class="external">journal 618851</a>)</p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>I've also had to get rid of the static DNS server configured in <code>/etc/sysconfig/network/config</code>. With that, networking looks good and the machine can connect to o3. There are also no failed systemd services. Everything survived a reboot so I guess AC1 is fulfilled.</p>
<p>So I'm considering this ticket resolved for now. Let me know if I should still look into some of the other options.</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-03-30T14:57:28Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=618881" class="external">journal 618881</a>)</p>
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>So I'm considering this ticket resolved for now. Let me know if I should still look into some of the other options.</p>
</blockquote>
<p>No need for that but we should ensure our wiki describing the o3 infra covers openqaworker1 in the current state. And please check the racktables entry that it correctly describes the current use</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-03-30T14:57:42Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=618884" class="external">journal 618884</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>Feedback</i></li></ul>
<p><strong>Updated by favogt (fvogt@suse.com) on 2023-03-31T08:21:48Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619073" class="external">journal 619073</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Workable</i></li></ul><p>Apparently ow1 is alive again and attempted to run some jobs on o3.</p>
<p>However, they fail:</p>
<p><code>[2023-03-31T10:10:29.124064+02:00] [error] Unable to setup job 3202162: The source directory /var/lib/openqa/share/tests/opensuse does not exist</code></p>
<p>It appears like the IP also changed from 10.168.192.6 to 10.168.192.120. While the latter is pingable from o3, SSH does not work.</p>
<p>For the time being I just did <code>systemctl disable --now openqa-worker-auto-restart@{1..20}.service</code> as workaround.</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-03-31T08:36:27Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619136" class="external">journal 619136</a>)</p>
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-31T08:48:50Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619148" class="external">journal 619148</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li></ul><p>I've enabled the services again but set up a special worker class for testing. I suppose the main problem was simply that the test pool server hasn't been adapted yet.</p>
<p>Note that you can simply edit <code>/etc/openqa/workers.ini</code> changing the worker class. There's no need to deal with systemd services.</p>
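<p>For illustration, the relevant fragment of <code>/etc/openqa/workers.ini</code> would look roughly like this (a sketch; the class value is the testing-only one used later in this ticket):</p>

```ini
[global]
# Special class so that only explicitly scheduled jobs land on this host
WORKER_CLASS = openqaworker1
```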
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-31T10:49:10Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619205" class="external">journal 619205</a>)</p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>I'm afraid this setup is not going to work because the rsync server on ariel is not exposed, so it is not possible to sync tests. The same goes for NFS.</p>
<p>We could sync <code>/var/lib/openqa/share/tests</code> from OSD instead but it is likely a bad idea as the directory might contain internal files (e.g. SLE needles).</p>
<p>I keep the worker up, but only with <code>WORKER_CLASS=openqaworker1</code>, so it doesn't do any harm.</p>
<p>Note that AC1 is nevertheless fulfilled so I'm inclined to resolve this ticket, especially because I cannot do much about it anyway. I've also already attempted to connect with Infra as suggested in option 4a of <a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" title="action: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M (Resolved)" href="https://progress.opensuse.org/issues/122983#note-17">#122983#note-17</a> but haven't gotten a useful response. I could create an SD ticket if that's wanted, though. Otherwise we could use the worker as an OSD worker.</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-03-31T10:50:29Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619208" class="external">journal 619208</a>)</p>
<ul><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>High</i></li></ul>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-03-31T17:12:35Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619343" class="external">journal 619343</a>)</p>
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>Note that AC1 is nevertheless fulfilled so I'm inclined to resolve this ticket.</p>
</blockquote>
<p>That would bring the risk that ow1 might idle for years wasting power and nobody is making good use of the machine.</p>
<p>I think using the machine as part of OSD is also fine for the time being. Then at least it's put to good use</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-04-03T09:02:12Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619742" class="external">journal 619742</a>)</p>
<ul></ul><p>And another alternative that came up in the chat: set up fetchneedles on ow1 as is normally done on the web UI host.</p>
<p>Note that in case we don't use the machine I would always power it off. So we'd at least not waste any power :-)</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-04-03T15:53:20Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=619958" class="external">journal 619958</a>)</p>
<ul></ul><p>I've just set up fetchneedles the same way as on the o3 web UI host. It generally works, but there are still problems:</p>
<ul>
<li>The developer mode doesn't work and I don't think we can fix that. I suppose this is something we could live with, though.</li>
<li>The openQA-in-openQA test I've tried could not resolve <code>codecs.opensuse.org</code> from within the SUT: <a href="https://openqa.opensuse.org/tests/3207420#step/openqa_webui/9" class="external">https://openqa.opensuse.org/tests/3207420#step/openqa_webui/9</a>
<ul>
<li>I'm not yet sure why that is. The domain is resolvable on ow1 in general and curl returns data.</li>
<li>The problem persists after restarting.</li>
<li>Another test also runs into errors on <code>zypper in …</code>: <a href="https://openqa.opensuse.org/tests/3207418#step/prepare/11" class="external">https://openqa.opensuse.org/tests/3207418#step/prepare/11</a></li>
</ul></li>
</ul>
<p>Maybe it is better to just use it as OSD worker for now.</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-04-05T13:18:13Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=620897" class="external">journal 620897</a>)</p>
<ul></ul><p>Maybe the problems mentioned in my previous comment can be explained by <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M (Resolved)" href="https://progress.opensuse.org/issues/127256">#127256</a>. I've nevertheless configured the worker now to connect to OSD to cross-check. (Of course still using just <code>openqaworker1</code> as <code>WORKER_CLASS</code>.)</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-04-05T14:49:03Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=620957" class="external">journal 620957</a>)</p>
<ul></ul><p>I've cloned an OSD job and it ran into a random DNS error as well: <a href="https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6" class="external">https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6</a></p>
<p>So I suspect this ticket is really related to <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M (Resolved)" href="https://progress.opensuse.org/issues/127256">#127256</a>. I suppose that also means it is blocked by <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M (Resolved)" href="https://progress.opensuse.org/issues/127256">#127256</a> because without reliable DNS we cannot use the machine as a worker.</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-04-05T14:49:32Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=620963" class="external">journal 620963</a>)</p>
<ul><li><strong>Blocked by</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed" href="/issues/127256">action #127256</a>: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M</i> added</li></ul>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-04-12T11:33:45Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=622580" class="external">journal 622580</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" href="/issues/126188">action #126188</a>: [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that that leads to tangible test run failure size:M</i> added</li></ul>
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-04-13T08:12:53Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=622922" class="external">journal 622922</a>)</p>
<ul><li><strong>Due date</strong> changed from <i>2023-04-13</i> to <i>2023-04-28</i></li></ul><p>mkittler wrote:</p>
<blockquote>
<p>I've cloned an OSD job and it ran into a random DNS error as well: <a href="https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6" class="external">https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6</a></p>
<p>So is suspect this ticket is really related to <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M (Resolved)" href="https://progress.opensuse.org/issues/127256">#127256</a>. I suppose that also means it is blocked by <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M (Resolved)" href="https://progress.opensuse.org/issues/127256">#127256</a> because without reliable DNS we cannot use the machine as worker.</p>
</blockquote>
<p>Presumably still blocking on <a class="issue tracker-4 status-3 priority-4 priority-default closed" title="action: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M (Resolved)" href="https://progress.opensuse.org/issues/127256">#127256</a>, hence bumping the due date.</p>
<p><strong>Updated by okurz (okurz@suse.com) on 2023-04-18T10:28:54Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=624407" class="external">journal 624407</a>)</p>
<ul></ul><p>Due to progress within <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113959">https://sd.suse.com/servicedesk/customer/portal/1/SD-113959</a> we can now debug the DHCP server on walter1.qe.nue2.suse.org. mkittler and I ran <code>ifdown eth0 && ifup eth0</code> over an IPMI SoL on openqaworker1 and got a complete entry in /etc/resolv.conf, so that did not immediately reproduce the problem of /etc/resolv.conf being incomplete.</p>
<p>It seems that both walter1+walter2 can serve DHCP requests using a failover but with synchronized entries so we should be fine to just look at one journal at a time.</p>
<p>There is an error showing up in the dhcpd journal: "dns2.qe.nue2.suse.org: host unknown.". Apparently that host does not exist in any references under walter1:/etc/ or walter2:/etc/ except for the dhcpd configs trying to publish that nameserver.</p>
<p>We removed that entry for now on both walter1 and walter2.</p>
<p>I ran</p>
<pre><code>for i in {1..30}; do echo "### Run: $i -- $(date -Is)" && ifdown eth0 && ifup eth0 ; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; done
</code></pre>
<p>but couldn't reproduce any problems with nameserver config yet.</p>
<p>Maybe with restarting the complete network stack:</p>
<pre><code>for i in {1..30}; do echo "### Run: $i -- $(date -Is)" && systemctl restart network.service ; until ifstatus eth0 | grep -q not-running; do echo -n "." && sleep 1; done; ifstatus eth0; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; done
</code></pre>
<p>which never returned from the inner loop, likely because ifstatus still shows "device-not-running" since DHCPv6 is never fulfilled. So I changed to <code>ifstatus eth0 | grep -q not-running</code> instead of just evaluating the exit code.</p>
<p>This seems to work. Now let's try to break the loop as soon as nameserver entries are completely missing.</p>
<pre><code>for i in {1..100000}; do echo "### Run: $i -- $(date -Is)" && systemctl restart network.service ; until ifstatus eth0 | grep -q not-running; do echo -n "." && sleep 1; done; ifstatus eth0; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; grep -q nameserver /etc/resolv.conf || break; done
</code></pre>
<p>EDIT: Not reproduced after 333 runs. I guess we can't reproduce it like this. I suggest trying with actual reboots.</p>
<p><strong>Updated by livdywan (liv.dywan@suse.com) on 2023-04-26T09:29:27Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=627416" class="external">journal 627416</a>)</p>
<ul></ul><p>Discussed in the Unblock. Please try and reproduce using openQA tests, and if that doesn't reproduce it consider it solved.</p>
<p><strong>Updated by pcervinka (pcervinka@suse.com) on 2023-04-26T13:00:49Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=627500" class="external">journal 627500</a>)</p>
<ul></ul><p>Maybe you can check <a href="https://progress.opensuse.org/issues/127256#note-11" class="external">https://progress.opensuse.org/issues/127256#note-11</a> if it helps.</p>
<p><strong>Updated by mkittler (marius.kittler@suse.com) on 2023-04-26T14:06:36Z</strong> (<a href="https://progress.opensuse.org/issues/122983?journal_id=627518" class="external">journal 627518</a>)</p>
<ul></ul><p>The issue is not resolved. When I tried to run a job on openqaworker1 it was stuck in the setup state because the worker itself lacked the nameserver. So this does not only happen after a reboot but can also happen in between (openqaworker1 had been running for 5 days and could initially connect to the web UI).</p>
<p>Running the loop from above (which effectively restarts the network via <code>systemctl restart network.service</code>) changes nothing. The nameserver is still missing in <code>/etc/resolv.conf</code>. In the DHCP logs it looks like this:</p>
<pre><code>Apr 26 13:59:34 walter1 dhcpd[29309]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 13:59:34 walter1 dhcpd[29309]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 13:59:34 walter1 dhcpd[29309]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
</code></pre><pre><code>Apr 26 13:59:34 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 13:59:34 walter2 dhcpd[30886]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 13:59:34 walter2 dhcpd[30886]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
</code></pre>
<p>Not sure whether the message about <code>dns2.qe.nue2.suse.org</code> shown in the middle is relevant.</p>
<p>After restarting wicked a 3rd time it worked again. Now the logs look different:</p>
<pre><code>Apr 26 14:12:56 walter1 dhcpd[29309]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 14:12:56 walter1 dhcpd[29309]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
</code></pre><pre><code>Apr 26 14:12:56 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 14:12:56 walter2 dhcpd[30886]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
</code></pre>
<hr>
<p>Between the 2nd and 3rd attempt the following was logged:</p>
<pre><code>Apr 26 14:06:19 walter2 dhcpd[30886]: balancing pool 55ab0933cb60 10.168.192.0/22 total 201 free 88 backup 107 lts 9 max-own (+/-)20
Apr 26 14:06:19 walter2 dhcpd[30886]: balanced pool 55ab0933cb60 10.168.192.0/22 total 201 free 88 backup 107 lts 9 max-misbal 29
Apr 26 14:06:22 walter2 dhcpd[30886]: reuse_lease: lease age 5023 (secs) under 25% threshold, reply with unaltered, existing lease for 10.168.193.56
Apr 26 14:06:22 walter2 dhcpd[30886]: No hostname for 10.168.193.56
Apr 26 14:06:22 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.193.56 from 98:be:94:4b:8e:98 via eth0
Apr 26 14:06:22 walter2 dhcpd[30886]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 14:06:22 walter2 dhcpd[30886]: DHCPACK on 10.168.193.56 to 98:be:94:4b:8e:98 via eth0
Apr 26 14:07:00 walter2 dhcpd[30886]: DHCPDISCOVER from 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:00 walter2 dhcpd[30886]: DHCPOFFER on 10.168.192.93 to 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:04 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.93 (10.168.192.2) from 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:04 walter2 dhcpd[30886]: DHCPACK on 10.168.192.93 to 00:0a:f7:de:79:54 via eth0
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 0 deleted host decls to leases file.
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 0 new dynamic host decls to leases file.
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 201 leases to leases file.
</code></pre> openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6275302023-04-26T14:32:23Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've just tried with only one DHCP server (the one on walter1; I stopped the one on walter2). The problem was still reproducible. However, after removing <code>dns2.qe.nue2.suse.org</code> from <code>dhcpd.conf</code> it seems ok. Maybe it makes sense to remove that entry.</p>
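<p>For context, such an entry would sit in a line like the following in <code>dhcpd.conf</code> (a hypothetical excerpt; the actual option list may differ). dhcpd resolves hostnames given in option statements itself, which presumably explains the <code>host unknown.</code> lines in the logs above and why the DNS servers handed out to clients could end up missing:</p>
<pre><code># hypothetical /etc/dhcpd.conf excerpt; server names are illustrative.
# An unresolvable hostname here is logged as "host unknown." and the
# domain-name-servers option sent to clients can end up incomplete.
option domain-name-servers dns1.qe.nue2.suse.org, dns2.qe.nue2.suse.org;
</code></pre>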
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6275332023-04-26T14:47:35Zmkittlermarius.kittler@suse.com
<ul></ul><p>If we're lucky everything boils down to fixing a typo: <a href="https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456" class="external">https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456</a></p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6276682023-04-27T09:31:48Zokurzokurz@suse.com
<ul></ul><p><a href="https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456" class="external">https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456</a> was merged and is deployed to both of our DHCP servers, walter1.qe.nue2.suse.org and walter2.qe.nue2.suse.org. We assume this fixes the problem.</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6277312023-04-27T11:19:58Zmkittlermarius.kittler@suse.com
<ul></ul><p>I'm running a few more tests (have just restarted <a href="https://openqa.suse.de/tests/10992724" class="external">https://openqa.suse.de/tests/10992724</a>).</p>
<p>So, if everything looks good, how should I proceed?</p>
<ul>
<li>Add the worker as OSD worker. That would mean adding it to our salt infrastructure.</li>
<li>Add the worker as o3 worker. That would mean setting up fetchneedles in accordance with o3. I have already done that in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" title="action: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M (Resolved)" href="https://progress.opensuse.org/issues/122983#note-37">#122983#note-37</a>. The caveats of that approach:
<ul>
<li>This setup might become out-of-sync with o3 and would then need to be dealt with manually. While that is not a big deal, it means the worker might be in a state where it produces incompletes until we take care of it.</li>
<li>The mount <code>/var/lib/openqa/share</code> will not be available on that worker. We want to avoid relying on it anyway, but not having it makes this worker the odd one out and prone to producing incompletes when tests rely on it after all.</li>
</ul></li>
</ul>
<p>I would tend to use it as an OSD worker.</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6277432023-04-27T11:55:16Zmkittlermarius.kittler@suse.com
<ul></ul><p>For running some tests I'm keeping the worker as an OSD worker. I've cloned a few tests via <code>sudo openqa-clone-job --skip-chained-deps --skip-download --within-instance https://openqa.suse.de/tests/… _GROUP=0 BUILD+=-ow1-test TEST+=-ow1-test WORKER_CLASS=openqaworker1</code>:</p>
<ul>
<li>passed: <a href="https://openqa.suse.de/tests/10992724" class="external">https://openqa.suse.de/tests/10992724</a></li>
<li>softfailure: <a href="https://openqa.suse.de/tests/10992727" class="external">https://openqa.suse.de/tests/10992727</a></li>
<li>strange failure¹ but not related to specific worker: <a href="https://openqa.suse.de/tests/10992730" class="external">https://openqa.suse.de/tests/10992730</a></li>
<li>passed: <a href="https://openqa.suse.de/tests/10992732" class="external">https://openqa.suse.de/tests/10992732</a></li>
</ul>
<hr>
<p>¹ <code>Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: null value in column "name" of relation "job_modules" violates not-null constraint DETAIL: Failing row contains (2864833368, 10992730, null, tests/btrfs-progs/generate_report…</code> - Maybe a bug/race-condition in the code for uploading external results.</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6277732023-04-27T12:50:00Zokurzokurz@suse.com
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>I'm running a few more tests (have just restarted <a href="https://openqa.suse.de/tests/10992724" class="external">https://openqa.suse.de/tests/10992724</a>).</p>
<p>So, if everything looks good, how should I proceed?</p>
<ul>
<li>Add the worker as OSD worker. That would mean adding it to our salt infrastructure.</li>
<li>Add the worker as o3 worker. That would mean setting up fetchneedles in accordance with o3. I have already done that in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" title="action: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M (Resolved)" href="https://progress.opensuse.org/issues/122983#note-37">#122983#note-37</a>. The caveats of that approach:
<ul>
<li>This setup might become out-of-sync with o3 and would then need to be dealt with manually. While that is not a big deal, it means the worker might be in a state where it produces incompletes until we take care of it.</li>
<li>The mount <code>/var/lib/openqa/share</code> will not be available on that worker. We want to avoid relying on it anyway, but not having it makes this worker the odd one out and prone to producing incompletes when tests rely on it after all.</li>
</ul></li>
</ul>
<p>I would tend to use it as an OSD worker.</p>
</blockquote>
<p>I would say yes. We could theoretically think about a feature to make full asset+tests syncing possible over https but then again we plan to load and likely "cache" tests from git so I guess for that we better wait for <a class="issue tracker-6 status-15 priority-4 priority-default parent behind-schedule" title="coordination: [saga][epic][use case] full version control awareness within openQA (Blocked)" href="https://progress.opensuse.org/issues/58184">#58184</a></p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6280972023-04-28T11:50:18Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've been adding the worker to salt and created a MR for its configuration: <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528</a></p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6281842023-04-28T16:35:42Znicksingernsinger@suse.com
<ul></ul><p>Note that this worker triggered a "failed systemd service" alert on 2023-04-28 at 14:15 - not sure whether this was caused by you working on the machine or whether something failed unexpectedly. This is what was shown in the journal:</p>
<pre><code>Apr 28 13:58:01 openqaworker1 systemd[1]: Reloading openQA Worker #10...
Apr 28 13:58:01 openqaworker1 worker[26373]: [info] Received signal HUP
Apr 28 13:58:01 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Deactivated successfully.
Apr 28 13:58:01 openqaworker1 systemd[1]: Reloaded openQA Worker #10.
Apr 28 13:58:01 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 4.581s CPU time.
Apr 28 13:58:02 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Scheduled restart job, restart counter is at 11.
Apr 28 13:58:02 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:02 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 4.581s CPU time.
Apr 28 13:58:02 openqaworker1 systemd[1]: Starting openQA Worker #10...
Apr 28 13:58:02 openqaworker1 systemd[1]: Started openQA Worker #10.
Apr 28 13:58:03 openqaworker1 worker[24048]: [info] [pid:24048] worker 10:
Apr 28 13:58:03 openqaworker1 worker[24048]: - config file: /etc/openqa/workers.ini
Apr 28 13:58:03 openqaworker1 worker[24048]: - name used to register: openqaworker1
Apr 28 13:58:03 openqaworker1 worker[24048]: - worker address (WORKER_HOSTNAME): localhost
Apr 28 13:58:03 openqaworker1 worker[24048]: - isotovideo version: 38
Apr 28 13:58:03 openqaworker1 worker[24048]: - websocket API version: 1
Apr 28 13:58:03 openqaworker1 worker[24048]: - web UI hosts: localhost
Apr 28 13:58:03 openqaworker1 worker[24048]: - class: ?
Apr 28 13:58:03 openqaworker1 worker[24048]: - no cleanup: no
Apr 28 13:58:03 openqaworker1 worker[24048]: - pool directory: /var/lib/openqa/pool/10
Apr 28 13:58:03 openqaworker1 worker[24048]: API key and secret are needed for the worker connecting localhost
Apr 28 13:58:03 openqaworker1 worker[24048]: at /usr/share/openqa/script/../lib/OpenQA/Worker/WebUIConnection.pm line 50.
Apr 28 13:58:03 openqaworker1 worker[24048]: OpenQA::Worker::WebUIConnection::new("OpenQA::Worker::WebUIConnection", "localhost", HASH(0x55fdacf3cc60)) called at /usr/share/openqa/script/../l>
Apr 28 13:58:03 openqaworker1 worker[24048]: OpenQA::Worker::init(OpenQA::Worker=HASH(0x55fdb04186a8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 363
Apr 28 13:58:03 openqaworker1 worker[24048]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x55fdb04186a8)) called at /usr/share/openqa/script/worker line 125
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=exited, status=255/EXCEPTION
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'exit-code'.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.122s CPU time.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Scheduled restart job, restart counter is at 12.
Apr 28 13:58:03 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.122s CPU time.
Apr 28 13:58:03 openqaworker1 systemd[1]: Starting openQA Worker #10...
Apr 28 13:58:03 openqaworker1 systemd[1]: Started openQA Worker #10.
Apr 28 13:58:04 openqaworker1 worker[24114]: [info] [pid:24114] worker 10:
Apr 28 13:58:04 openqaworker1 worker[24114]: - config file: /etc/openqa/workers.ini
Apr 28 13:58:04 openqaworker1 worker[24114]: - name used to register: openqaworker1
Apr 28 13:58:04 openqaworker1 worker[24114]: - worker address (WORKER_HOSTNAME): localhost
Apr 28 13:58:04 openqaworker1 worker[24114]: - isotovideo version: 38
Apr 28 13:58:04 openqaworker1 worker[24114]: - websocket API version: 1
Apr 28 13:58:04 openqaworker1 worker[24114]: - web UI hosts: localhost
Apr 28 13:58:04 openqaworker1 worker[24114]: - class: ?
Apr 28 13:58:04 openqaworker1 worker[24114]: - no cleanup: no
Apr 28 13:58:04 openqaworker1 worker[24114]: - pool directory: /var/lib/openqa/pool/10
Apr 28 13:58:04 openqaworker1 worker[24114]: API key and secret are needed for the worker connecting localhost
Apr 28 13:58:04 openqaworker1 worker[24114]: at /usr/share/openqa/script/../lib/OpenQA/Worker/WebUIConnection.pm line 50.
Apr 28 13:58:04 openqaworker1 worker[24114]: OpenQA::Worker::WebUIConnection::new("OpenQA::Worker::WebUIConnection", "localhost", HASH(0x563fe6781c60)) called at /usr/share/openqa/script/../l>
Apr 28 13:58:04 openqaworker1 worker[24114]: OpenQA::Worker::init(OpenQA::Worker=HASH(0x563fe9c5d2f8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 363
Apr 28 13:58:04 openqaworker1 worker[24114]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x563fe9c5d2f8)) called at /usr/share/openqa/script/worker line 125
Apr 28 13:58:04 openqaworker1 systemd[1]: Stopping openQA Worker #10...
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=exited, status=255/EXCEPTION
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'exit-code'.
Apr 28 13:58:04 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.024s CPU time.
Apr 28 15:56:12 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Unit cannot be reloaded because it is inactive.
</code></pre> openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6284122023-05-02T07:25:25Znicksingernsinger@suse.com
<ul></ul><p>Found another issue with our deployment pipeline today. It complains about a missing folder while upgrading the openQA package:</p>
<pre><code> (1/5) Installing: openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64 [......
error: unpacking of archive failed on file /var/lib/openqa/share/factory: cpio: chown failed - No such file or directory
error: openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64: install failed
error: openQA-4.6.1682608278.68a0ff2-lp154.5738.1.x86_64: erase skipped
error]
Installation of openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64 failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
</code></pre>
<p>Is there a problem with our package, too, which this worker just happens to expose now? If so, feel free to split this off into another ticket.</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6284302023-05-02T07:34:05Zokurzokurz@suse.com
<ul></ul><p>Is this really on openqaworker1? We have observed the same problem on baremetal-supportserver. The problem only occurs if 1. the openQA web UI is installed, 2. an NFS share from another web UI server is mounted, and 3. the UIDs mismatch. On baremetal-supportserver we fixed this by syncing UIDs manually, but here I suggest removing the openQA package, as only the worker package should be necessary.</p>
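<p>A quick way to check for such a UID mismatch is to compare the numeric UID of the openQA service user (conventionally <code>geekotest</code>; an assumption here) with the owner UID seen on the NFS share, e.g. with a small helper:</p>

```shell
# uid_matches USER EXPECTED_UID: succeed when the local numeric uid of
# USER equals EXPECTED_UID (e.g. the owner uid reported by
# `stat -c %u /var/lib/openqa/share/factory` on the NFS client).
# If they differ, chown on the share fails as seen in the update log.
uid_matches() {
    local_uid=$(id -u "$1" 2>/dev/null) || return 2
    [ "$local_uid" = "$2" ]
}
```

<p>For example, <code>uid_matches geekotest "$(stat -c %u /var/lib/openqa/share/factory)"</code> failing would indicate exactly the mismatch described above.</p>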
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6285202023-05-02T11:04:04Zmkittlermarius.kittler@suse.com
<ul></ul><p><a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" title="action: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M (Resolved)" href="https://progress.opensuse.org/issues/122983#note-55">#122983#note-55</a> should be fixed by uninstalling the openQA web UI package. I had only installed it for fetchneedles to test the machine as an o3 worker.</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6285232023-05-02T11:12:15Zmkittlermarius.kittler@suse.com
<ul></ul><p><a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" title="action: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M (Resolved)" href="https://progress.opensuse.org/issues/122983#note-54">#122983#note-54</a> should be fixed by <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528</a> - at this point salt treated the machine as a generic host and wiped the worker config down to the bare minimum, so the host is an empty string and we don't have API credentials for that "host".</p>
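<p>For reference, a minimal worker configuration as salt should render it once the pillar entry exists (illustrative values; the real pillar data defines the actual worker classes):</p>
<pre><code># /etc/openqa/workers.ini (illustrative values)
[global]
HOST = https://openqa.suse.de
WORKER_CLASS = qemu_x86_64

# /etc/openqa/client.conf must contain matching API credentials:
# [openqa.suse.de]
# key = &lt;api key&gt;
# secret = &lt;api secret&gt;
</code></pre>
<p>Without that, the worker falls back to <code>localhost</code> as web UI host and aborts for lack of credentials, matching the journal excerpt above.</p>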
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6285682023-05-02T12:01:48Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> changed from <i>2023-04-28</i> to <i>2023-05-12</i></li></ul><p>discussed in daily, bumped due-date accordingly</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6285712023-05-02T12:04:13Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've cloned the last 100 successful tests on OSD (excluding ones with parallel dependencies):</p>
<pre><code>openqa=# \copy (select distinct jobs.id from jobs join job_settings on jobs.id = job_settings.job_id left join job_dependencies on (jobs.id = child_job_id or jobs.id = parent_job_id) where dependency != 2 and result = 'passed' and job_settings.key = 'WORKER_CLASS' and job_settings.value = 'qemu_x86_64' order by id desc limit 100) to '/tmp/jobs_to_clone_x86_64' csv;
COPY 100
</code></pre><pre><code>martchus@openqa:~> for job_id in $(cat /tmp/jobs_to_clone_x86_64 ) ; do openqa-clone-job --host openqa.suse.de --apikey … --apisecret … --skip-download --skip-chained-deps --clone-children --parental-inheritance "https://openqa.suse.de/tests/$job_id" _GROUP=0 TEST+=-ow1-test BUILD=test-ow1 WORKER_CLASS=openqaworker1 ; done
</code></pre>
<p>Apparently some of the jobs have chained children, so we actually got more than 100 jobs. Link to the overview: <a href="https://openqa.suse.de/tests/overview?build=test-ow1" class="external">https://openqa.suse.de/tests/overview?build=test-ow1</a></p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6289342023-05-03T09:11:15Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've just been reviewing the overview. The failures are due to:</p>
<ul>
<li>Jobs requiring a tap setup have accidentally been cloned.</li>
<li>Jobs requiring a private asset have accidentally been cloned.</li>
<li>Some jobs fail also sometimes on other workers in the same way.</li>
</ul>
<p>However, 120 jobs have passed or softfailed, so I guess that's good enough. I've created an MR to enable the worker for real: <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532</a></p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6289702023-05-03T09:32:59Zokurzokurz@suse.com
<ul></ul><p><a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532</a> merged, please monitor that real production jobs pass on this worker with the new worker class. Please make sure openqaworker1 shows up on <a href="https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1" class="external">https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1</a> .</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6295432023-05-04T11:22:43Zmkittlermarius.kittler@suse.com
<ul></ul><p>The fail+incomplete ratio is so far similar to that of other worker hosts:</p>
<pre><code>openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%worker%' and t_finished >= '2023-05-01' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqa-piworker | 100 | 12
worker12 | 22.22 | 18
worker11 | 20.43 | 186
worker2 | 18.64 | 1937
openqaworker-arm-3 | 16.72 | 1029
openqaworker1 | 14.35 | 418
worker10 | 13.96 | 523
openqaworker14 | 13.5 | 941
openqaworker17 | 13.42 | 1207
openqaworker-arm-2 | 13.03 | 1036
worker13 | 13.02 | 791
openqaworker18 | 11.91 | 1217
openqaworker16 | 11.76 | 1156
worker3 | 11.52 | 1137
worker5 | 11.51 | 2346
worker6 | 11.19 | 1779
powerqaworker-qam-1 | 10.96 | 374
worker9 | 10.96 | 785
worker8 | 10.61 | 886
openqaworker-arm-1 | 9.76 | 502
(20 rows)
</code></pre>
<p>With <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/852" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/852</a> merged the worker shows up on <a href="https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1" class="external">https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1</a>. I can also create another MR for adding otherwise forgotten hosts. However, I wouldn't consider this part of the ticket.</p>
openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:Mhttps://progress.opensuse.org/issues/122983?journal_id=6296842023-05-04T14:19:39Zmkittlermarius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>Opened a MR to update the dashboard: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/853" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/853</a></p>
<p>With that I'm resolving this ticket.</p>