openSUSE Project Management Tool
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5955612023-01-20T11:15:47Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/123028">action #123028</a>: A/C broken in TAM lab size:M</i> added</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5955672023-01-20T11:15:57Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>In Progress</i></li></ul><p>The plan is to disassemble all the equipment in NUE-2.2.14 and move it to the FC lab or SRV2 on Tuesday. This will be done by mgriessmeier with help from nsinger and mmoese. #119548 is not finished yet, but the plan was expedited due to <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: A/C broken in TAM lab size:M (Resolved)" href="https://progress.opensuse.org/issues/123028">#123028</a>.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5955762023-01-20T11:21:47Zjstehlik
<ul></ul><p>Good to know. That means we need to finish all tests until Tuesday 24.1. since then the machines will be offline for how long .. one or two days? That might impact Rado's plan to aim for Thursday release if there are critical bugs found. And the week after is hackweek. Feels like planning a walk through a mine field :)</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5956452023-01-20T12:32:27Zokurzokurz@suse.com
<ul></ul><p>jstehlik wrote:</p>
<blockquote>
<p>Good to know. That means we need to finish all tests until Tuesday 24.1. since then the machines will be offline for how long .. one or two days?</p>
</blockquote>
<p>We should keep in mind that the most critical machines are not affected as they are in server rooms and not in labs. Anyone critically relying on systems within labs should consider using machines in other locations, either additionally or as a replacement. However, the "more important" machines should be moved to SRV2 already on Monday, so in the best case there is only an outage of some hours. The machines which are currently offline anyway due to the A/C outage will be moved to FC on Tuesday and will be available as soon as EngInfra has set up the network in the FC labs. Realistically, this might take days to weeks.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5958192023-01-21T04:10:38Zopenqa_reviewopenqa-review@suse.de
<ul><li><strong>Due date</strong> set to <i>2023-02-04</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5963832023-01-24T11:15:44Zokurzokurz@suse.com
<ul></ul><p>The equipment in NUE-2.2.14-B was disassembled, also see <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: A/C broken in TAM lab size:M (Resolved)" href="https://progress.opensuse.org/issues/123028#note-14">#123028#note-14</a>. Some machines were put into NUE-SRV2, others are pending the move to FC, see the list in <a href="https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904" class="external">https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904</a>. The plan is to install the servers on Wednesday and continue with setting up the network. Also, I consider openqaworker1 as not critical. As done in the past with the move to NUE-2.2.14, we can experiment with connecting an o3 worker from the FC labs without sharing the VLAN. VLANs will not be shared across locations, so we can try to come up with a proper routing approach. Keep in mind that some workstations are still in <a href="https://racktables.nue.suse.com/index.php?page=row&row_id=16582" class="external">NUE-2.2.13</a>.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5970792023-01-26T12:25:33Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Blocked</i></li></ul><p><a href="https://racktables.nue.suse.com/index.php?page=location&location_id=11" class="external">NUE-2.2.14 (TAM)</a> was cleaned out and updated accordingly in racktables. All relevant equipment if not in NUE-SRV2 is now in <a href="https://racktables.nue.suse.com/index.php?page=location&location_id=18261" class="external">FC Basement</a>. Now back to #119548 waiting for DHCP+DNS.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5976652023-01-30T09:22:59Zlivdywanliv.dywan@suse.com
<ul><li><strong>Due date</strong> changed from <i>2023-02-04</i> to <i>2023-02-10</i></li></ul><p>Bumping due date due to hackweek.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5983502023-02-06T07:18:09Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/123933">action #123933</a>: [worker][ipmi][bmc] Some worker can not be reached via BMC</i> added</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5983712023-02-06T07:25:36Zxlaixlai@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>The equipment and NUE-2.2.14-B was disassembled, also see <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: A/C broken in TAM lab size:M (Resolved)" href="https://progress.opensuse.org/issues/123028#note-14">#123028#note-14</a> . Some machines were put into NUE-SRV2, others pending move to FC, see list in <a href="https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904">https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904</a>. </p>
<p>NUE-2.2.14 (TAM) was cleaned out and updated accordingly in racktables. All relevant equipment if not in NUE-SRV2 is now in FC Basement. Now back to #119548 waiting for DHCP+DNS.</p>
</blockquote>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a>, Hi Oliver, does "FC" here mean "Nbg Frankencampus", the new office building? What's the latest status of the machines in <a href="https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904">https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904</a>? Have they all been moved to the Frankencampus lab? What's the ETA for the infra setup there being fully ready? Besides, what's the plan for the machines in NUE-SRV2 (the lab in Maxtorhof)? Will they be moved to Frankencampus too? Any date/plan?</p>
<p>Let me also add more information so you better understand how this impacts our situation for VT tests. Before this change we had a total of ten IPMI x86 machines in the NUE lab at Maxtorhof. Based on the latest racktables records from this morning, the machines are now distributed as follows:</p>
<p>a) FC BASEMENT ->FC Inventory Storage : storage_qe2<br><br>
amd-zen3-gpu-sut1.qa.suse.de<br>
gonzo.qa.suse.de<br>
scooter.qa.suse.de<br>
kermit.qa.suse.de</p>
<p>b) NUE-SRV2-B:<br>
openqaw5-xen.qa.suse.de<br>
fozzie<br>
quinn<br>
amd-zen2-gpu-sut1.qa.suse.de<br>
openqaipmi5.qa.suse.de<br>
ix64ph1075.qa.suse.de</p>
<p>Here are the challenges we are facing at the moment due to this new distribution of hardware locations and the work-in-progress changes, together with some needs of the VT tests:</p>
<ul>
<li>the 4 SUT machines in FC BASEMENT (nearly half of the total of 9 x86 SUTs) are not usable now because the infra setup at FC is not fully ready. This will remain a major problem for 15sp5 tests until the infra setup there is done</li>
<li>we have 2 pairs of machines for the key virtualization migration tests, and they are better located in one lab. Now fozzie is in NUE-SRV2-B while the 3 other machines (kermit, gonzo, scooter) are in FC basement. If the network communication between the two labs (after the FC setup is done in days or weeks as you expect) is not good enough, the key migration test will lose one pair of machines, and the 15sp5 acceptance test will be impacted in a way that we can't finish testing within 1 day. Is there any chance that the 4 machines can stay together in one stable lab?</li>
<li>openqaw5-xen.qa.suse.de is a jump host used in vmware&hyperv VT tests. It is better for it to stay in the same lab/network as the vmware&hyperv machines (e.g. hyperv2016 (worker7-hyperv.oqa.suse.de) and vmware6.5 (worker8-vmware.oqa.suse.de)). See lessons learned from <a href="https://progress.opensuse.org/issues/122662#note-18">https://progress.opensuse.org/issues/122662#note-18</a>. Is it possible to take this into consideration in the infra setup?</li>
</ul>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/39517">@jstehlik</a> FYI. The impact of this lab move on virtualization tests for sle15sp5 and tumbleweed is huge. The VT test speed and the feasibility of some tests will be impacted a lot until the infra setup is fully done/fixed in both the new FC lab and the Maxtorhof lab. We are now debugging why all VT jobs on OSD fail at PXE boot. After that, we will run the planned 15sp5 beta3 milestone test. Very likely it will take much longer because we lose many test machines due to the lab move in this ticket. </p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5984372023-02-06T09:47:09Zokurzokurz@suse.com
<ul></ul><p>xlai wrote:</p>
<blockquote>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a>, Hi Oliver, does "FC" here means "Nbg Frankencampus" -- the new office building?</p>
</blockquote>
<p>Yes</p>
<blockquote>
<p>What's the latest status for the machines in <a href="https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904">https://racktables.nue.suse.com/index.php?page=rack&rack_id=19904</a>? Have they all been moved to Frankencampus lab? </p>
</blockquote>
<p>The status in racktables should be up-to-date. All machines that have not been moved to NUE1-SRV2 have been moved to "FC Basement", which is the new lab at the Frankencampus location.</p>
<blockquote>
<p>What's the ETA for the infra setup there being fully ready?</p>
</blockquote>
<p>We are waiting for Eng-Infra to do the setup and they provide us with neither an ETA nor status updates. My expectation is some days or, in the worst case, multiple weeks.</p>
<blockquote>
<p>Besides, what's the plan for those machines in NUE-SRV2(the lab in Maxtorhof)? Will they be moved to Frankencampus too? Any date/plan?</p>
</blockquote>
<p>Maybe we will move some machines to the FC Lab if we are happy with the quality and stability there, but most machines from NUE1, that is Maxtorhof, both SRV1&SRV2, will eventually go to a new datacenter location somewhere in the vicinity of Nuremberg, planned for this year.</p>
<blockquote>
<p>Let me also add more information to let you better know our situation for VT test as impact by this. We have totally ten ipmi x86 machines in NUE lab at Maxtorhof before this change. Based on the latest racktable records this morning, now the machines distribution is like below:</p>
<p>a) FC BASEMENT ->FC Inventory Storage : storage_qe2<br><br>
amd-zen3-gpu-sut1.qa.suse.de<br>
gonzo.qa.suse.de<br>
scooter.qa.suse.de<br>
kermit.qa.suse.de</p>
<p>b) NUE-SRV2-B:<br>
openqaw5-xen.qa.suse.de<br>
fozzie<br>
quinn<br>
amd-zen2-gpu-sut1.qa.suse.de<br>
openqaipmi5.qa.suse.de<br>
ix64ph1075.qa.suse.de</p>
<p>Here are the challenges we are facing atm by this new hardware location distribution and wip changes , in together with some needs from VT test:</p>
<ul>
<li>the 4 SUT machines in FC BASEMENT (nearly half of all total 9 x86 SUTs) are not usable now, given that infra setup at FC is not fully ready. And it will always be a major problem for 15sp5 test before infra setup there is done</li>
<li>we have 2 pair of machines for key test of virutalization migration and are better to locate in one lab. Now fozzie is in NUE-SRV2-B, while 3 other machines(kermit, gonzo, scooter) in FC basement. If the network communication between the two labs(after FC setup is done in days or weeks as you expected) is not good enough, the key migration test will loose one pair of machines and impact 15sp5 acceptance test in a way that we can't finish test within 1 day. Is there any chance that the 4 machines can stay together in one stable lab?</li>
<li>openqaw5-xen.qa.suse.de is one jump host used in vmware&hyperv VT test, it is better to stay in the same lab/network with the vmware&hyperv machines (eg hyperv2016(worker7-hyperv.oqa.suse.de) and vmware6.5(worker8-vmware.oqa.suse.de)). See lessons learned from <a href="https://progress.opensuse.org/issues/122662#note-18">https://progress.opensuse.org/issues/122662#note-18</a>. Is it possible to put it into consideration in infra setup?</li>
<li>amd-zen3-gpu-sut1.qa.suse.de needs to be used by O3, please help to consider this too</li>
</ul>
</blockquote>
<p>Right. Good that you bring this up. This is important to keep in mind. My intention is to provide geo-redundancy by spreading out services over locations where possible but also to put critical machines together due to the strong requirements on network performance as you stated. Regarding jump hosts, the best approach is likely to have them, possibly even as virtual machines, within the same server room as the target hosts. Can you elaborate how <br>
openqaw5-xen.qa.suse.de, which is a xen hypervisor host, is used as jump host?</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5984552023-02-06T10:36:28Zxlaixlai@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>xlai wrote:</p>
<p>Right. Good that you bring this up. This is important to keep in mind. My intention is to provide a geo-redundany by spreading out services over locations where possible but also put critical machines together due to the strong requirements in network performance as you stated. Regarding jump hosts the best approach is likely to have likely even virtual machines but within the same server room as target hosts. Can you elaborate how <br>
openqaw5-xen.qa.suse.de which is a xen hypervisor host is used as jump host?</p>
</blockquote>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a>, Hi Oliver, thanks for the quick reply. That's very helpful.</p>
<ul>
<li>yes, we also highly recommend putting the 4 paired machines together. Now fozzie is in NUE-SRV2-B while the 3 other machines (kermit, gonzo, scooter) are in FC basement. </li>
<li>about openqaw5-xen.qa.suse.de: it serves as the xen hypervisor. On top of it, multiple VMs are created (one per worker), which are used in automation either to translate RDP to VNC (svirt-vmware/hyperv workers) or to serve as test VMs (svirt-xen workers)</li>
</ul>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5991692023-02-08T02:18:03Zxlaixlai@suse.com
<ul></ul><p>Corrected one piece of information in <a href="https://progress.opensuse.org/issues/119551#note-12" class="external">https://progress.opensuse.org/issues/119551#note-12</a>: amd-zen3-gpu-sut1.qa.suse.de is used in OSD rather than O3, and amd-zen2-gpu-sut1 is used in O3. Sorry for any confusion this caused.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5993912023-02-08T09:48:14Zokurzokurz@suse.com
<ul></ul><p>got it. thx.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=5994422023-02-08T10:24:13Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> deleted (<del><i>2023-02-10</i></del>)</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6003832023-02-09T15:37:06Zokurzokurz@suse.com
<ul></ul><p>We progressed in the FC Basement lab. All machines and equipment have been sorted, racks and shelves have been labeled and everything relevant is updated accordingly in racktables. The biggest hurdle is the lack of suitable rack mounting rails. One machine was mounted using L-shapes and connected to power and switch in B1. Also the PDU in B1 is connected to the switch and marked accordingly in racktables. We are still blocked on #119548.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6021202023-02-15T07:39:30Zokurzokurz@suse.com
<ul><li><strong>Tags</strong> changed from <i>infra</i> to <i>infra, next-office-day, frankencampus</i></li><li><strong>Category</strong> set to <i>Infrastructure</i></li><li><strong>Status</strong> changed from <i>Blocked</i> to <i>In Progress</i></li><li><strong>Assignee</strong> changed from <i>okurz</i> to <i>nicksinger</i></li><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul><p>With #119548 resolved, see notes in #119548#note-21, we can progress here. Today nicksinger plans to go to FC Basement to mount and set up more machines. I will see if I can join to help.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6026482023-02-15T15:18:44Zokurzokurz@suse.com
<ul></ul><p>nicksinger and I installed machines into NUE-FC-B1 QE LSG. Specifically those are the machines migration-qe1, power8.openqanet.opensuse.org, openqaworker1.openqanet.opensuse.org, holmes.qa.suse.de, gonzo.qa.suse.de, kermit.qa.suse.de, scooter.qa.suse.de, amd-zen3-gpu-sut1.qa.suse.de, openqaworker-arm-5.qa.suse.de, openqaworker-arm-4.qa.suse.de, openqa-migration-qe1.qa.suse.de. We had to adjust the spacing of the vertical holders in the rack as they had been assembled in a tilted way with two L-shaped brackets that are about 5mm longer than all other L-shaped brackets. We disassembled those two L-shaped brackets and labeled them clearly as "too long" for our purposes. Then we put the above mentioned machines onto L-shaped brackets of the regular length as there are no rails fitting our machines. We connected all machines to power and network and documented everything accordingly in racktables. On the DHCP VM "qa-jump" as provided by Eng-Infra we could see that all mgmt interfaces show up and get an IPv4 address assigned by dhcpd. The next step is to assign static leases and adjust DNS entries on qanet accordingly.</p>
<p>I added all hosts to the dhcpd config:</p>
<pre><code># NUE-FC-B: Rack https://racktables.nue.suse.com/index.php?page=rack&rack_id=19174
host amd-zen3-gpu-sut1-sp { hardware ethernet ec:2a:72:0c:25:4c; fixed-address 10.168.192.83; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=16390"; option host-name "amd-zen3-gpu-sut1-sp"; }
host amd-zen3-gpu-sut1-1 { hardware ethernet ec:2a:72:02:84:20; fixed-address 10.168.192.84; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=16390"; option host-name "amd-zen3-gpu-sut1-1"; filename "pxelinux.0"; }
host amd-zen3-gpu-sut1-2 { hardware ethernet b4:96:91:9c:5a:d4; fixed-address 10.168.192.85; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=16390"; option host-name "amd-zen3-gpu-sut1-2"; filename "pxelinux.0"; }
host scooter-sp { hardware ethernet ac:1f:6b:4b:a7:d7; fixed-address 10.168.192.86; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10124"; option host-name "scooter-sp"; }
host scooter-1 { hardware ethernet ac:1f:6b:47:73:38; fixed-address 10.168.192.87; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10124"; option host-name "scooter-1"; filename "pxelinux.0"; }
host kermit-sp { hardware ethernet ac:1f:6b:4b:6c:af; fixed-address 10.168.192.88; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10102"; option host-name "kermit-sp"; }
host kermit-1 { hardware ethernet ac:1f:6b:47:03:26; fixed-address 10.168.192.89; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10102"; option host-name "kermit-1"; filename "pxelinux.0"; }
host gonzo-sp { hardware ethernet ac:1f:6b:4b:6b:03; fixed-address 10.168.192.90; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10104"; option host-name "gonzo-sp"; }
host gonzo-1 { hardware ethernet ac:1f:6b:47:06:86; fixed-address 10.168.192.91; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10104"; option host-name "gonzo-1"; filename "pxelinux.0"; }
host holmes-sp { hardware ethernet 58:8a:5a:f5:60:4a; fixed-address 10.168.192.92; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10699"; option host-name "holmes-sp"; }
host holmes-1 { hardware ethernet 00:0a:f7:de:79:54; fixed-address 10.168.192.93; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10699"; option host-name "holmes-1"; filename "pxelinux.0"; } # NVDIMM test host
host holmes-4 { hardware ethernet 00:0a:f7:de:79:53; fixed-address 10.168.192.94; option inventory-url "https://racktables.suse.de/index.php?page=object&object_id=10699"; option host-name "holmes-4"; filename "pxelinux.0"; } # NVDIMM test host
# openqaworker1 not included for now
# power8 not included for now
</code></pre>
<p>and updated openqa-migration-qe1. I restarted the DHCP server and the service started fine.</p>
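<p>When adding a batch of static leases like the one above, a quick consistency check before restarting the service can catch copy-and-paste mistakes. The following is only a minimal sketch: <code>duplicate_values</code> is a hypothetical helper that is not part of the setup described in this ticket, and the config path in the usage comment is an assumption.</p>

```shell
#!/bin/bash
# Hypothetical helper, not part of the original setup: list values that occur
# more than once for a given dhcpd keyword in a config file.
# $1: keyword, e.g. "fixed-address" or "hardware ethernet"
# $2: path to the dhcpd config file
duplicate_values() {
    local keyword=$1 config=$2
    # Drop comment-only lines, extract "<keyword> <value>" up to the semicolon,
    # keep only the value and print each value that appears more than once.
    grep -v '^[[:space:]]*#' "$config" \
        | grep -o "${keyword} [^;]*" \
        | sed "s/^${keyword} //" \
        | sort | uniq -d
}

# Example (the path is an assumption, adjust to the local setup):
#   duplicate_values "fixed-address" /etc/dhcpd.conf
#   duplicate_values "hardware ethernet" /etc/dhcpd.conf
#   dhcpd -t -cf /etc/dhcpd.conf   # syntax check only, does not start the server
```

<p>An empty result means no address or MAC is assigned twice. The helper only looks at the text of the file and does not replace the syntax check by dhcpd itself.</p>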
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6027472023-02-16T04:11:15Zopenqa_reviewopenqa-review@suse.de
<ul><li><strong>Due date</strong> set to <i>2023-03-02</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6028642023-02-16T09:19:32Zokurzokurz@suse.com
<ul><li><strong>File</strong> <a href="/attachments/14639">SUSE_FC_Basement_different_length_L_shaped_brackets.jpg</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/14639/SUSE_FC_Basement_different_length_L_shaped_brackets.jpg">SUSE_FC_Basement_different_length_L_shaped_brackets.jpg</a> added</li></ul><p>This was the biggest surprise of today:</p>
<p><img src="https://progress.opensuse.org/attachments/download/14639/SUSE_FC_Basement_different_length_L_shaped_brackets.jpg" alt="SUSE_FC_Basement_different_length_L_shaped_brackets.jpg" loading="lazy" /></p>
<p>The first rack was already mounted with L-shaped brackets on both sides. So we tried to mount more servers and found we couldn't fix the next brackets with screws due to the mismatch visible in the picture, which is a difference of about 5mm for a 70cm long bracket. It turned out somebody had managed to mount a 70cm bracket on the left side, of which we have about 50, together with a 70.5cm version, of which we have exactly two pieces. After realizing this we dismounted those two and used only 70cm pieces consistently.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6028672023-02-16T09:19:51Zokurzokurz@suse.com
<ul></ul><p><a href="https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/49">https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/49</a> to update DHCP/DNS entries</p>
<p>EDIT: merged and deployed</p>
<pre><code>qanet:~ # for i in scooter holmes gonzo kermit amd-zen3-gpu-sut1; do ping -c 1 $i-sp.qa.suse.de; done
PING scooter-sp.qa.suse.de (10.168.192.86) 56(84) bytes of data.
64 bytes from 10.168.192.86: icmp_seq=1 ttl=59 time=2.28 ms
--- scooter-sp.qa.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.281/2.281/2.281/0.000 ms
PING holmes-sp.qa.suse.de (10.168.192.92) 56(84) bytes of data.
64 bytes from 10.168.192.92: icmp_seq=1 ttl=59 time=2.62 ms
--- holmes-sp.qa.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.627/2.627/2.627/0.000 ms
PING gonzo-sp.qa.suse.de (10.168.192.90) 56(84) bytes of data.
64 bytes from 10.168.192.90: icmp_seq=1 ttl=59 time=2.63 ms
--- gonzo-sp.qa.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.638/2.638/2.638/0.000 ms
PING kermit-sp.qa.suse.de (10.168.192.88) 56(84) bytes of data.
64 bytes from 10.168.192.88: icmp_seq=1 ttl=59 time=8.20 ms
--- kermit-sp.qa.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 8.203/8.203/8.203/0.000 ms
PING amd-zen3-gpu-sut1-sp.qa.suse.de (10.168.192.83) 56(84) bytes of data.
64 bytes from 10.168.192.83: icmp_seq=1 ttl=59 time=2.59 ms
--- amd-zen3-gpu-sut1-sp.qa.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.597/2.597/2.597/0.000 ms
</code></pre>
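<p>For repeated checks, the same loop can be reduced to printing only the hosts that do not answer. This is a sketch under assumptions: <code>unreachable_sp_hosts</code> and the <code>PROBE</code> override are hypothetical names introduced here, and the default probe assumes the iputils <code>ping</code> with its <code>-W</code> timeout option.</p>

```shell
#!/bin/bash
# Hypothetical helper: probe the "-sp" (service processor) name of each host
# and print only the names that do not answer.
# $1: DNS domain, remaining arguments: short host names
# PROBE can be overridden, e.g. for a dry run; the default sends one ping.
unreachable_sp_hosts() {
    local domain=$1
    shift
    local host fqdn
    for host in "$@"; do
        fqdn="${host}-sp.${domain}"
        # Word splitting of $PROBE is intended so that it can carry options.
        ${PROBE:-ping -c 1 -W 2} "$fqdn" >/dev/null 2>&1 || echo "$fqdn"
    done
}

# Example:
#   unreachable_sp_hosts qa.suse.de scooter holmes gonzo kermit amd-zen3-gpu-sut1
```

<p>No output means all management interfaces answered, which matches the ping results shown above.</p>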
<p><a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/493">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/493</a> To be merged after verification in openQA</p>
<p>EDIT: Cloning a set of openQA jobs with</p>
<pre><code>openqa-clone-set https://openqa.suse.de/tests/10493728 okurz_investigation_ipmi_workers_poo119551 WORKER_CLASS=64bit-ipmi_disabled BUILD=okurz_poo119551 _GROUP=0
</code></pre>
<p>results on <a href="https://openqa.suse.de/tests/overview?build=okurz_poo119551&distri=sle&version=15-SP5">https://openqa.suse.de/tests/overview?build=okurz_poo119551&distri=sle&version=15-SP5</a></p>
<ul>
<li>gonzo: <a href="https://openqa.suse.de/tests/10514912">https://openqa.suse.de/tests/10514912</a></li>
<li>amd-zen3-gpu-sut1: <a href="https://openqa.suse.de/tests/10514913">https://openqa.suse.de/tests/10514913</a></li>
<li>kermit: <a href="https://openqa.suse.de/tests/10514914">https://openqa.suse.de/tests/10514914</a></li>
<li>scooter: <a href="https://openqa.suse.de/tests/10514915">https://openqa.suse.de/tests/10514915</a></li>
</ul>
<p>and for holmes:</p>
<pre><code>end=003 openqa-clone-set https://openqa.suse.de/tests/10493728 okurz_investigation_ipmi_workers_poo119551 WORKER_CLASS=64bit-ipmi-nvdimm_disabled BUILD=okurz_poo119551_holmes _GROUP=0 INCLUDE_MODULES=bootloader_start
</code></pre>
<p>Created job #10515077: sle-15-SP5-Online-x86_64-Build72.1-guided_btrfs@64bit-ipmi -> <a href="https://openqa.suse.de/t10515077">https://openqa.suse.de/t10515077</a></p>
<p>I guess the next step is to ensure that files are delivered over PXE</p>
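<p>One way to check file delivery independently of a booting machine is to fetch the boot file over TFTP from a host in the same network. The sketch below assumes a curl build with TFTP support. The helper names and the server name in the usage comment are hypothetical and not taken from the actual setup.</p>

```shell
#!/bin/bash
# Hypothetical helper: build a tftp:// URL for a boot file on a given server.
# $1: TFTP server, $2: file name relative to the TFTP root
tftp_url() {
    printf 'tftp://%s/%s\n' "$1" "$2"
}

# Hypothetical helper: try to download the boot file and report the result.
# Requires a curl build with TFTP support.
check_tftp_file() {
    local url
    url=$(tftp_url "$1" "$2")
    if curl --silent --max-time 10 --output /dev/null "$url"; then
        echo "OK $url"
    else
        echo "FAILED $url"
    fi
}

# Example (the server name is an assumption):
#   check_tftp_file qanet.qa.suse.de pxelinux.0
```

<p>A failure here only shows that the file could not be fetched from the machine running the check. It does not tell whether the request was dropped on the way or rejected by the server.</p>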
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6031702023-02-16T14:42:56Znicksingernsinger@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Blocked</i></li></ul><p>I tried several options to point to our existing TFTP server on qanet but, after resorting to tcpdump, realized that the (TFTP) packets never arrive at qanet. I created <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-112718" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-112718</a> to address this problem.</p>
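<p>The exact tcpdump invocation is not recorded in this ticket. A capture of this kind could look like the following sketch, which relies on TFTP requests being sent to UDP port 69. <code>tftp_capture_filter</code> is a hypothetical helper, and the client address in the usage comment is the one assigned to amd-zen3-gpu-sut1-1 in the dhcpd config.</p>

```shell
#!/bin/bash
# Hypothetical helper: build a capture filter for TFTP requests, optionally
# limited to one client. TFTP requests are sent to UDP port 69.
# $1 (optional): client address to restrict the capture to
tftp_capture_filter() {
    if [ -n "${1:-}" ]; then
        printf 'udp port 69 and host %s\n' "$1"
    else
        printf 'udp port 69\n'
    fi
}

# Example, to be run as root on the TFTP server while the client boots:
#   tcpdump -n -i any "$(tftp_capture_filter 10.168.192.84)"
```

<p>If nothing shows up on the server while the client is trying to boot, the requests are lost somewhere between the two networks, which is the situation described above.</p>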
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6032152023-02-16T16:08:13Zokurzokurz@suse.com
<ul></ul><p>ok, which tcpdump command did you use?</p>
<p>nicksinger wrote:</p>
<blockquote>
<p>I tried several options to point to our existing TFTP-server on qanet but realized after resorting to tcpdump that the (tftp) packages never arrive at qanet. I created <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-112718" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-112718</a> to address this problem.</p>
</blockquote>
<p>I guess the alternative could be to provide a TFTP server from qa-jump, which we will want in the future anyway. Ideally we find someone from Eng-Infra to get the "get into salt and provide DHCP+DNS+PXE" part done in one go. By the way, as we learned, it's "Georg" who is currently working on (re-)connecting qa-jump to Eng-Infra salt.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6041692023-02-20T10:16:06Zxlaixlai@suse.com
<ul></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> <a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> Hello guys, Jan just told me that this ticket was done. But based on the current ticket status, it is blocked. Would you please help clarify the real status? I saw that a lot has been done for this ticket. Can I assume that only a few TODOs are left? Besides, what do you expect the machine owners to do to get the machines ready to serve as openQA SUTs? We will prepare for that if needed.</p>
<p>Our situation is like this: the public beta is to be announced soon. If we can have the 4 affected machines back BEFORE THIS WEEKEND, we will wait for them and launch the tests next week via openQA. Otherwise, we will start manual testing immediately after the public beta is announced, which is not a minor effort. We hope to get some forecast for the ticket so that we can plan our next steps for VT. </p>
<p>Thanks for your efforts. It means a lot for us!</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6047482023-02-21T09:40:14Znicksingernsinger@suse.com
<ul></ul><p>xlai wrote:</p>
<blockquote>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> <a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> Hello guys, Jan just shared me that this ticket was done. But based on current ticket status , it is blocked. Would you please help clarify the real status? I saw that a lot had been done for this ticket, can I assume that there is only few TODO? Besides, what do you expect the machine owners to do to have the machines ready to serve as openqa SUT? We will prepare for that if needed.</p>
</blockquote>
<p>The main missing component here is PXE. We tried to set up a quick solution by just forwarding to our existing server but this unfortunately failed. We're in contact with Eng-Infra to get this resolved but I simply cannot estimate if and when they will be able to resolve this problem.</p>
<blockquote>
<p>Our situation is like this -- public beta is to be announced soon, if we can have the 4 affected machines back BEFORE THIS WEEKEND, we will wait for them and launch the tests next week via openqa. Otherwise, we will start manual test immediately after public beta is announced, for which the effort is not minor. Hope to have some forecast for the ticket , so that we can plan our next step for VT. </p>
</blockquote>
<p>We do our best to get the setup up and running but cannot guarantee a working and stable environment at the moment as this is a fairly new setup. If these machines are so very important for public beta I'd say you should prepare the manual tests. If everything is working in openQA you could stop manual testing when openQA tests are showing results, no?</p>
<blockquote>
<p>Thanks for your efforts. It means a lot for us!</p>
</blockquote>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a>, thanks for the reply. We appreciate your work to set it up. We will then plan our manual test in case it is needed. </p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6066202023-02-27T11:58:50Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Workable</i></li></ul><p>Robert Wawrig commented in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-112718" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-112718</a> with a change and a request to test again. If that is not successful please followup with <a class="issue tracker-4 status-3 priority-6 priority-high2 closed child" title="action: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M (Resolved)" href="https://progress.opensuse.org/issues/119551#note-26">#119551#note-26</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6067342023-02-27T15:31:01Zokurzokurz@suse.com
<ul></ul><p>With rrichardson changed <a href="https://racktables.nue.suse.com/index.php?page=rack&tab=default&rack_id=19190" class="external">NUE-FC-B:5</a> to match the shorter L-rails and put cloud4.qa, qanet2, seth+osiris there. Updated racktables to include the servers but couldn't yet finish the cabling.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6067642023-02-27T17:25:16Znicksingernsinger@suse.com
<ul></ul><p>I've set up a tftp server on qa-jump with some basic config required for <code>pxegen.sh</code>. The script and I populated some files in /srv/tftpboot required for PXE booting. What is left is to test the setup by adding the custom tftp-server-url in <a href="https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/qe_nue2_suse_org/hosts.yaml" class="external">https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/qe_nue2_suse_org/hosts.yaml</a> - Martin showed me that other domains do this already but I need to figure out what the correct syntax is for that</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6069052023-02-28T08:15:20Znicksingernsinger@suse.com
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>Feedback</i></li></ul><p>created <a href="https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3233" class="external">https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3233</a> which needs to be merged before I can further test if my setup works</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6070162023-02-28T10:15:45Znicksingernsinger@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>In Progress</i></li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6070942023-02-28T12:11:18Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/117043">action #117043</a>: Request DHCP+DNS services for new QE network zones, same as already provided for .qam.suse.de and .qa.suse.cz</i> added</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6071782023-02-28T13:48:49Znicksingernsinger@suse.com
<ul></ul><p>I tried with gonzo but the request didn't make it to our own TFTP/PXE. Apparently "dhcp_next_server" just should be "next_server" but this unfortunately already fails in the tests: <a href="https://gitlab.suse.de/nicksinger/salt/-/jobs/1427672#L33" class="external">https://gitlab.suse.de/nicksinger/salt/-/jobs/1427672#L33</a> - I asked Martin in private message if he can give me a hint</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6075382023-03-01T09:13:02Zokurzokurz@suse.com
<ul></ul><p>nicksinger wrote:</p>
<blockquote>
<p>I tried with gonzo but the request didn't make it to our own TFTP/PXE. Apparently "dhcp_next_server" just should be "next_server" but this unfortunately already fails in the tests: <a href="https://gitlab.suse.de/nicksinger/salt/-/jobs/1427672#L33" class="external">https://gitlab.suse.de/nicksinger/salt/-/jobs/1427672#L33</a></p>
</blockquote>
<p>I like that the error message is very specific. It's also pretty cool that you can test this in your own fork before even creating a merge request. I assume you didn't create a merge request yet, right?</p>
<blockquote>
<p>I asked Martin in <em>private message</em> if he can give me a hint</p>
</blockquote>
<p>why not in a public room? Did you make it personal? ;)</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6075922023-03-01T10:25:52Zokurzokurz@suse.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/125204">action #125204</a>: Move QA labs NUE-2.2.14-B to Frankencampus labs - non-bare-metal machines size:M</i> added</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6075982023-03-01T10:27:15Zokurzokurz@suse.com
<ul><li><strong>Subject</strong> changed from <i>Move QA labs NUE-2.2.14-B to Frankencampus labs</i> to <i>Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers</i></li><li><strong>Due date</strong> changed from <i>2023-03-02</i> to <i>2023-03-10</i></li></ul><p>I extracted a ticket <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Move QA labs NUE-2.2.14-B to Frankencampus labs - non-bare-metal machines size:M (Resolved)" href="https://progress.opensuse.org/issues/125204">#125204</a> for everything that goes beyond "just make bare-metal openQA tests using PXE work". <a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> will bring up the topic in #help-it-ama</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6077032023-03-01T12:18:54Znicksingernsinger@suse.com
<ul></ul><p>Created a SD ticket for a DNS entry for "qa-jump": <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113814" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113814</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6077062023-03-01T12:20:48Zokurzokurz@suse.com
<ul></ul><p><a href="https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/53" class="external">https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/53</a> created for our .qa.suse.de DNS entry.</p>
<p>EDIT: Merged</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6077902023-03-01T14:12:20Zokurzokurz@suse.com
<ul></ul><p>gpfuetzenreuter was nice and helpful in <a href="https://suse.slack.com/archives/C029APBKLGK/p1677671947741049">https://suse.slack.com/archives/C029APBKLGK/p1677671947741049</a> but eventually he asked to create (another) ticket so we did with <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113832">https://sd.suse.com/servicedesk/customer/portal/1/SD-113832</a></p>
<blockquote>
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/qe_nue2_suse_org/init.sls#L54">https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/qe_nue2_suse_org/init.sls#L54</a> defines a PXE server and we can see machines like gonzo-1.qe.nue2.suse.org seeing the PXE boot menu from icecream.nue2.suse.org on bootup. But openQA tests need either a custom PXE boot menu or a mountpoint serving current openQA builds for booting. We tried to fix this ourselves on the machine “qa-jump”, formerly 10.168.192.1, but this machine was replaced with walter1, denying us access, so we cannot investigate and fix this ourselves anymore. We tried to provide host-specific PXE config like in <a href="https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3233/diffs">https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3233/diffs</a> but this was also not effective. Please help to make sure that we end up with a working solution, either where Eng-Infra provides the service or we do it on our own, but we should not end up with a half-baked solution without access.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li>AC1: Machines in the new domain qe.nue2.suse.org can execute bare-metal openQA tests</li>
<li>AC2: QE employees can self-investigate issues with PXE booting</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<p>I think the best option is if experts from Eng-Infra like Georg Pfützenreuter and Martin Caj sit together in an online session with the SUSE QE Tools expert Nick Singer (of course others can join as well) to find the best solution, either on an Eng-Infra maintained VM where we have access to try out and debug on our own or (less preferred) a VM that we maintain or other solutions based on what you come up with.</p>
</blockquote>
<p>Besides working with Eng-Infra to get a custom QE PXE or our own PXE server working, here are different ideas to explore:</p>
<ol>
<li>Follow-up with <a href="https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3234">https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3234</a> , e.g. test how DHCP with an HTTP url behaves with container+VM without needing any custom rule matching</li>
<li>Find other tickets and add relations about "semi-automatic installation of openQA workers" because in the end we want the same for production hardware as well as bare-metal test hosts which is to have a common solution to deploy specific configurations of SLE/Leap/Tumbleweed, etc.</li>
<li>Reconsider how we install bare-metal from network for tests and get in contact with test squads about that, e.g. just find the correct tickets</li>
<li>An alternative that can be solved completely from os-autoinst-distri-opensuse perspective without needing any changes to infrastructure or backend would be to use the Eng-Infra supplied PXE boot menu and just boot an older version of the SLES installer (either older build or service pack) and conduct a remote installation of the current build from there. If that is not possible due to kernel mismatch between "linux" file and remote repo content then I suggest to boot an older version of SLES and update to the current build.</li>
</ol>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6081982023-03-02T09:06:36Zokurzokurz@suse.com
<ul><li><strong>Assignee</strong> changed from <i>nicksinger</i> to <i>okurz</i></li></ul><p>Eng-Infra changed the PXE server advertised by the DHCP server with <a href="https://gitlab.suse.de/OPS-Service/salt/-/commit/050e95ece73f2fc79a7195a15a5cd1877d1b9241" class="external">https://gitlab.suse.de/OPS-Service/salt/-/commit/050e95ece73f2fc79a7195a15a5cd1877d1b9241</a> to point to "qa-jump (new)". We will set up PXE on qa-jump (new) for now. Once the setup is done we can think about if/how we can integrate this into the Eng-Infra maintained salt.</p>
<p>as root on qa-jump.qe.nue2.suse.org</p>
<pre><code>ssh-keygen -t ed25519
</code></pre>
<p>copied over the public key to qanet:/root/.ssh/authorized_keys</p>
<p>Then with nicksinger mount points in /etc/fstab:</p>
<pre><code>dist.suse.de:/dist /mnt/dist nfs4 defaults 0 1
openqa.suse.de:/var/lib/openqa/share/factory /mnt/openqa nfs ro,defaults 0 0
/mounts /srv/tftpboot/mounts none defaults,bind 0 0
/mnt/openqa /srv/tftpboot/mnt/openqa none defaults,bind 0 0
</code></pre>
<p>and copied pxegen.sh from qanet:/srv/tftp/, executed that script within that folder and added what was necessary to make the script happy.</p>
<p>Trying to mount NFS seems to be blocked by firewall. We commented in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113832" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113832</a> and also in <a href="https://suse.slack.com/archives/C029APBKLGK/p1677749667949229" class="external">https://suse.slack.com/archives/C029APBKLGK/p1677749667949229</a></p>
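<p>A minimal sketch for checking whether the fstab entries above are actually mounted, for example once the firewall no longer blocks NFS. The function names <code>list_mount_points</code> and <code>report_unmounted</code> are illustrative only and not part of pxegen.sh or any existing tooling; <code>mountpoint</code> from util-linux is assumed to be installed.</p>

```shell
# Print the mount point (second field) of every non-comment fstab entry
# read from stdin. Blank lines and lines starting with '#' are skipped.
list_mount_points() {
    awk '$1 !~ /^#/ && NF >= 2 { print $2 }'
}

# Read mount points from stdin, one per line, and report those that are
# not currently mounted. Assumes mountpoint(1) from util-linux.
report_unmounted() {
    while IFS= read -r mp; do
        mountpoint -q "$mp" || echo "not mounted: $mp"
    done
}
```

<p>Possible usage on qa-jump: <code>list_mount_points &lt; /etc/fstab | report_unmounted</code>. Entries whose second field is not a directory, such as swap, would also be reported and may need to be filtered out.</p>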
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6082072023-03-02T09:09:13Zokurzokurz@suse.com
<ul><li><strong>Project</strong> changed from <i>46</i> to <i>QA</i></li><li><strong>Category</strong> deleted (<del><i>Infrastructure</i></del>)</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6087892023-03-03T09:50:28Zokurzokurz@suse.com
<ul><li><strong>Assignee</strong> changed from <i>okurz</i> to <i>nicksinger</i></li></ul><p>We provided what we could in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113832" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113832</a> and were asked to refrain from further communication in chat and rather use the ticket. That obviously makes it harder for others to follow, hence we must provide a status here. nicksinger is trying out some things regarding loading from tftp due to the urgency of the ticket but we are running out of options and basically need to wait for Eng-Infra personnel to help us with one of the many requests, e.g. either provide us more access like root access to walter1.qe.nue2.suse.org and switch access or fix the actual problems</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6093502023-03-06T12:49:56Zokurzokurz@suse.com
<ul><li><strong>Subject</strong> changed from <i>Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers</i> to <i>Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/609350/diff?detail_id=572096">diff</a>)</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6097852023-03-07T10:18:31Zokurzokurz@suse.com
<ul></ul><p>I created <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/503" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/503</a> to add the specific target machines for easier openQA job triggering.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6098302023-03-07T12:20:46Zokurzokurz@suse.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/125519">action #125519</a>: version control PXE stuff on qa-jump</i> added</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6100162023-03-07T14:22:15Zokurzokurz@suse.com
<ul></ul><pre><code>for i in kermit scooter gonzo amd-zen3-gpu-sut1; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10565133 TEST=okurz_investigation_ipmi_workers_poo119551_$i BUILD=okurz_investigation_ipmi_workers_poo119551 _GROUP=0 WORKER_CLASS=$i;done
</code></pre>
<p>Created job #10636248: sle-15-SP5-Online-x86_64-Build73.2-guided_btrfs@64bit-ipmi -> <a href="https://openqa.suse.de/t10636248" class="external">https://openqa.suse.de/t10636248</a><br>
Created job #10636249: sle-15-SP5-Online-x86_64-Build73.2-guided_btrfs@64bit-ipmi -> <a href="https://openqa.suse.de/t10636249" class="external">https://openqa.suse.de/t10636249</a><br>
Created job #10636250: sle-15-SP5-Online-x86_64-Build73.2-guided_btrfs@64bit-ipmi -> <a href="https://openqa.suse.de/t10636250" class="external">https://openqa.suse.de/t10636250</a><br>
Created job #10636251: sle-15-SP5-Online-x86_64-Build73.2-guided_btrfs@64bit-ipmi -> <a href="https://openqa.suse.de/t10636251" class="external">https://openqa.suse.de/t10636251</a></p>
<p>-> <a href="https://openqa.suse.de/tests/overview?build=okurz_investigation_ipmi_workers_poo119551&distri=sle&version=15-SP5" class="external">https://openqa.suse.de/tests/overview?build=okurz_investigation_ipmi_workers_poo119551&distri=sle&version=15-SP5</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6100942023-03-07T16:55:09Znicksingernsinger@suse.com
<ul></ul><p>We had to change the SUT_NETDEVICE variable for two hosts (<a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/compare/10134f09...master?from_project_id=746&straight=true" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/compare/10134f09...master?from_project_id=746&straight=true</a>) so the installer could find and access its files. Now we have reached common ground on all machines, where something (worker?) fails to connect to something else (yast in installer?), see <a href="https://openqa.suse.de/tests/overview?version=15-SP5&build=okurz_investigation_ipmi_workers_poo119551&distri=sle" class="external">https://openqa.suse.de/tests/overview?version=15-SP5&build=okurz_investigation_ipmi_workers_poo119551&distri=sle</a> . A first quick nmap from my personal workstation showed the port of the SUT (inside FC LAB) as "open" so I am not sure if this is some firewall blocking traffic. As a next step we should pause the test right after "setup_libyui" and maybe investigate manually if the connection is blocked from the worker. It might also make sense to involve the yast squad for additional information.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6101032023-03-07T17:09:53Zokurzokurz@suse.com
<ul></ul><p>Actually I have seen the same error in the production qemu tests so I would even go as far as saying that we reached the same level as other tests and we are good to enable the workers for production again, see my draft MR, and resolve</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6102292023-03-08T06:59:25Zmgriessmeiermgriessmeier@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>Actually I have seen the same error in the production qemu tests so I would even go as far as saying that we reached the same level as other tests and we are good to enable the workers for production again, see my draft MR, and resolve</p>
</blockquote>
<p>do you have a reference (ticket/job) for this? I couldn't find one - if we can link it to an open issue, I am fine with it - otherwise I'd really like to see a job that is either passing or not failing on a potential network issue - wdyt?</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6104512023-03-08T10:22:01Zokurzokurz@suse.com
<ul></ul><ul>
<li>One example is <a href="https://openqa.suse.de/t10562907" class="external">https://openqa.suse.de/t10562907</a> on ppc64le showing "Connection timed out" in the YaST installer trying to access the self-update repo from 13 days ago in SLE 15 SP5 build 73.2. Apparently nobody cares to review those tests</li>
<li>Please check on holmes, we have missed that yesterday in the for-loop</li>
</ul>
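<p>To make it easier to spot a machine missing from the for-loop, a sketch like the following could print the clone commands for review before running them. <code>clone_cmds</code> is a hypothetical helper, not an existing openQA tool; the flags and variables are the ones used earlier in this ticket, and the source job URL is passed in because it may differ per machine.</p>

```shell
# Print (do not run) one openqa-clone-job command per host so the host
# list can be reviewed before anything is triggered.
# Usage: clone_cmds <source-job-url> <host>...
clone_cmds() {
    job_url=$1
    shift
    prefix=okurz_investigation_ipmi_workers_poo119551
    for host in "$@"; do
        echo "openqa-clone-job --skip-chained-deps --within-instance $job_url TEST=${prefix}_$host BUILD=$prefix _GROUP=0 WORKER_CLASS=$host"
    done
}
```

<p>For example <code>clone_cmds https://openqa.suse.de/tests/10565133 kermit scooter gonzo amd-zen3-gpu-sut1 holmes</code> prints five commands; piping the output to <code>sh</code> would execute them.</p>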
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6107782023-03-08T16:05:43Znicksingernsinger@suse.com
<ul></ul><p><a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/506" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/506</a> to enable kermit, scooter and zen3.<br>
I manually tested holmes and the machine was able to display a PXE menu. While doing so I realized (and vaguely remembered) that this machine needs two interfaces connected (<a href="https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/qe_nue2_suse_org/hosts.yaml#L125-132" class="external">https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/qe_nue2_suse_org/hosts.yaml#L125-132</a>), which we didn't do, so for the sake of moving forward I already enabled 3/4 
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6109102023-03-08T19:27:14Zokurzokurz@suse.com
<ul></ul><p>Merged <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/506" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/506</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6114022023-03-09T11:51:14Zmkittlermarius.kittler@suse.com
<ul></ul><p>Looks like the corresponding reload unit couldn't be stopped cleanly triggering the systemd services alert:</p>
<pre><code>martchus@worker2:~> sudo systemctl status openqa-reload-worker-auto-restart@54
× openqa-reload-worker-auto-restart@54.service - Restarts openqa-worker-auto-restart@54.service as soon as possible without interrupting jobs
Loaded: loaded (/usr/lib/systemd/system/openqa-reload-worker-auto-restart@.service; static)
Active: failed (Result: exit-code) since Wed 2023-03-08 20:29:17 CET; 16h ago
Main PID: 10271 (code=exited, status=1/FAILURE)
Mar 08 20:29:16 worker2 systemd[1]: Starting Restarts openqa-worker-auto-restart@54.service as soon as possible without interrupting jobs...
Mar 08 20:29:17 worker2 systemctl[10271]: Job for openqa-worker-auto-restart@54.service canceled.
Mar 08 20:29:17 worker2 systemd[1]: openqa-reload-worker-auto-restart@54.service: Main process exited, code=exited, status=1/FAILURE
Mar 08 20:29:17 worker2 systemd[1]: openqa-reload-worker-auto-restart@54.service: Failed with result 'exit-code'.
Mar 08 20:29:17 worker2 systemd[1]: Failed to start Restarts openqa-worker-auto-restart@54.service as soon as possible without interrupting jobs.
martchus@worker2:~> sudo systemctl status openqa-worker-auto-restart@54
○ openqa-worker-auto-restart@54.service - openQA Worker #54
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
Active: inactive (dead) since Wed 2023-03-08 20:29:18 CET; 16h ago
Main PID: 30231 (code=exited, status=0/SUCCESS)
Mar 08 12:23:29 worker2 worker[30231]: [info] [pid:30231] Registering with openQA openqa.suse.de
Mar 08 12:23:29 worker2 worker[30231]: [info] [pid:30231] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2131
Mar 08 12:23:29 worker2 worker[30231]: [info] [pid:30231] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2131
Mar 08 17:15:22 worker2 worker[30231]: [warn] [pid:30231] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s
Mar 08 17:17:02 worker2 worker[30231]: [warn] [pid:30231] Worker cache not available via http://127.0.0.1:9530: Cache service queue already full (10) - checking again for web UI 'openqa.suse.de' in 100.00 s
Mar 08 20:29:18 worker2 worker[30231]: [info] [pid:30231] Received signal TERM
Mar 08 20:29:18 worker2 worker[30231]: [debug] [pid:30231] Informing openqa.suse.de that we are going offline
Mar 08 20:29:18 worker2 systemd[1]: Stopping openQA Worker #54...
Mar 08 20:29:18 worker2 systemd[1]: openqa-worker-auto-restart@54.service: Deactivated successfully.
Mar 08 20:29:18 worker2 systemd[1]: Stopped openQA Worker #54.
</code></pre>
<p>I've just reset the unit. Not sure whether this is a general problem we have when reducing the number of worker slots. (It seems more exceptional to me.)</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6116332023-03-10T01:48:35ZJulie_CAOjcao@suse.com
<ul></ul><p>Some workers failed to get PXE menu due to tftp error, such as grenache-1:12 & grenache-1:15</p>
<p><a href="https://openqa.suse.de/tests/10652815#step/boot_from_pxe/10" class="external">https://openqa.suse.de/tests/10652815#step/boot_from_pxe/10</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6116452023-03-10T01:59:57Zwaynechen55wchen@suse.com
<ul><li><strong>File</strong> <a href="/attachments/14780">osd-amd-zen3-pxe-boot.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/14780/osd-amd-zen3-pxe-boot.png">osd-amd-zen3-pxe-boot.png</a> added</li></ul><p>Julie_CAO wrote:</p>
<blockquote>
<p>Some workers failed to get PXE menu due to tftp error, such as grenache-1:12 & grenache-1:15</p>
<p><a href="https://openqa.suse.de/tests/10652815#step/boot_from_pxe/10" class="external">https://openqa.suse.de/tests/10652815#step/boot_from_pxe/10</a></p>
</blockquote>
<p>Also grenache-1:19<br>
<a href="https://openqa.suse.de/tests/10652807#step/boot_from_pxe/22" class="external">https://openqa.suse.de/tests/10652807#step/boot_from_pxe/22</a><br>
<img src="https://progress.opensuse.org/attachments/download/14780/osd-amd-zen3-pxe-boot.png" alt="" loading="lazy" /></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6116482023-03-10T02:09:19Zxlaixlai@suse.com
<ul></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> <a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> Hello guys, virt team checks all the 4 newly enabled workers in FC lab, namely <br>
amd-zen3-gpu-sut1.qa.suse.de<br>
gonzo.qa.suse.de<br>
scooter.qa.suse.de<br>
kermit.qa.suse.de</p>
<p>Based on all historical jobs triggered yesterday, no successful job and all fail at boot_from_pxe. Just as Julie and Wayne shared, likely root cause is on the tftp server used by pxe. This may need your further help.</p>
<p>In addition, for the other newly enabled machine, holmes, there is no job triggered there, so no reference at all.</p>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/39517">@jstehlik</a> FYI.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6116872023-03-10T04:11:48Zopenqa_reviewopenqa-review@suse.de
<ul></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6117652023-03-10T07:40:35Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-6 priority-6 priority-high2 closed" href="/issues/125735">action #125735</a>: [openQA][infra][pxe] Some machines can not boot from pxe due to "TFTP open timeout"</i> added</li></ul> QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6117712023-03-10T07:41:26Zokurzokurz@suse.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li><li><strong>% Done</strong> changed from <i>100</i> to <i>0</i></li></ul><p>back to urgent after changing <a class="issue tracker-4 status-6 priority-6 priority-high2 closed" title="action: [openQA][infra][pxe] Some machines can not boot from pxe due to "TFTP open timeout" (Rejected)" href="https://progress.opensuse.org/issues/125735">#125735</a> to not be a subtask</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6118132023-03-10T08:44:25Zokurzokurz@suse.com
<ul></ul><p>I looked into this briefly with mgriessmeier and it looks like the systemd unit <code>tftp.socket</code> wasn't activated on qa-jump so I called <code>systemctl enable --now tftp.socket</code> and also, for a persistent journal, <code>mkdir -p /var/log/journal</code>. Soon after, <code>journalctl -f</code> showed tftpd processes showing up and serving requests. I opened some openQA jobs on the corresponding worker instances and am monitoring them.</p>
<p>Jobs like the following look promising:</p>
<ul>
<li><a href="https://openqa.suse.de/tests/10653273" class="external">https://openqa.suse.de/tests/10653273</a> grenache:12 kermit</li>
<li><a href="https://openqa.suse.de/tests/10653281" class="external">https://openqa.suse.de/tests/10653281</a> grenache:13 gonzo</li>
<li><a href="https://openqa.suse.de/tests/10653225" class="external">https://openqa.suse.de/tests/10653225</a> grenache:14 fozzie</li>
<li><a href="https://openqa.suse.de/tests/10653276" class="external">https://openqa.suse.de/tests/10653276</a> grenache:15 scooter</li>
<li><p><a href="https://openqa.suse.de/tests/10652010" class="external">https://openqa.suse.de/tests/10652010</a> grenache:19 amd-zen3-gpu-sut1</p></li>
<li><p>TODO <a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> please find, label and retrigger all according affected tests</p></li>
</ul>
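<p>A small sketch of how the state of <code>tftp.socket</code> could be checked before blaming the network next time. <code>unit_ready</code> and the <code>SYSTEMCTL</code> override are assumptions made for illustration (the override only exists so the check can be dry-run); <code>is-enabled</code>, <code>is-active</code> and <code>--quiet</code> are regular systemctl features.</p>

```shell
# Succeed only if systemctl reports the unit as enabled (is-enabled also
# accepts states such as "static") and as currently active.
# SYSTEMCTL may be set to another command for dry runs; defaults to systemctl.
unit_ready() {
    unit=$1
    "${SYSTEMCTL:-systemctl}" is-enabled --quiet "$unit" &&
        "${SYSTEMCTL:-systemctl}" is-active --quiet "$unit"
}
```

<p>On qa-jump this could be used as <code>unit_ready tftp.socket || echo "tftp.socket is not enabled and active"</code>.</p>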
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6118222023-03-10T09:19:09ZJulie_CAOjcao@suse.com
<ul></ul><p>Thank you for the quick fix, Oliver. We will retrigger tests on our own.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6118642023-03-10T10:39:36Znicksingernsinger@suse.com
<ul><li><strong>Assignee</strong> changed from <i>nicksinger</i> to <i>okurz</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> please check that both of <a href="https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/qe_nue2_suse_org/hosts.yaml#L125-132" class="external">these interfaces</a> are connected to holmes when you visit the office next Monday.<br>
You can assign back so I can check if the rest of the setup works with this machine.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6121732023-03-11T04:10:47Zopenqa_reviewopenqa-review@suse.de
<ul><li><strong>Due date</strong> set to <i>2023-03-25</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6123112023-03-13T00:31:20Zwaynechen55wchen@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" href="/issues/125810">action #125810</a>: [openqa][infra] Some SUT machines can not upload logs to worker machine size:S</i> added</li></ul>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6125812023-03-13T11:17:37Znicksingernsinger@suse.com
<ul></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> connected the fourth interface of holmes. We were able to open a PXE menu and start into a leap15.4 installer. Triggered verification openQA job with:</p>
<pre><code>openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/10493728 TEST=okurz_investigation_ipmi_workers_poo119551_holmes BUILD=okurz_investigation_ipmi_workers_poo119551 _GROUP=0 WORKER_CLASS=holmes --apikey XXX --apisecret XXX
</code></pre>
<p>Created job #10679705: sle-15-SP5-Online-x86_64-Build72.1-guided_btrfs@64bit-ipmi -> <a href="https://openqa.suse.de/t10679705" class="external">https://openqa.suse.de/t10679705</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6128152023-03-13T15:53:57Zokurzokurz@suse.com
<ul><li><strong>Tags</strong> changed from <i>infra, next-office-day, frankencampus</i> to <i>infra, frankencampus</i></li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Blocked</i></li><li><strong>Assignee</strong> changed from <i>okurz</i> to <i>nicksinger</i></li></ul><p><a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-114864" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-114864</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6130942023-03-14T08:47:24Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>In Progress</i></li></ul><p>Firewall was unblocked, SD ticket closed.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6138592023-03-15T12:37:33Zokurzokurz@suse.com
<ul></ul><p>Please clone the latest ok jobs which were running on that specific worker instance <a href="https://openqa.suse.de/admin/workers/1264" class="external">https://openqa.suse.de/admin/workers/1264</a> and check if they work on holmes. It might be that the other generic scenarios can not run on holmes for whatever reason we do not need to care about.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6140632023-03-16T06:58:30Zokurzokurz@suse.com
<ul></ul><p>holmes seems to be the only worker reserved for "64bit-ipmi-nvdimm" and apparently no jobs were scheduled within the past two months that would match here. I looked around for longer but the best I could find is those 2 month old jobs so I cloned one of those overwriting the INCIDENT_REPO as otherwise we would get a warning because the incident repo is long gone. Anyway, it should at least show us how far the initial booting can go.</p>
<pre><code>openqa-clone-job --within-instance https://openqa.suse.de/tests/10297800 _GROUP=0 BUILD= TEST+=-okurz-poo119551 WORKER_CLASS=holmes INCIDENT_REPO=
</code></pre>
<p>Created job #10706998: sle-15-SP3-Server-DVD-SAP-Incidents-x86_64-Build:27344:php7-qam-sles4sap_online_dvd_gnome_hana_nvdimm@64bit-ipmi-nvdimm -> <a href="https://openqa.suse.de/t10706998" class="external">https://openqa.suse.de/t10706998</a></p>
<p>EDIT: this showed "Unable to locate configuration file", so the same as what we have already seen.</p>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> I suggest we follow the logs from tftpd and restart job boot attempts to follow what happens exactly.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6141592023-03-16T08:48:32Znicksingernsinger@suse.com
<ul></ul><p>Cross-referencing <a class="issue tracker-4 status-3 priority-6 priority-high2 closed behind-schedule" title="action: [openqa][infra] Some SUT machines can not upload logs to worker machine size:S (Resolved)" href="https://progress.opensuse.org/issues/125810">#125810</a> here as we saw issues with the PXE config generation script which got fixed with <a href="https://gitlab.suse.de/qa-sle/qa-jump-configs/-/merge_requests/3" class="external">https://gitlab.suse.de/qa-sle/qa-jump-configs/-/merge_requests/3</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6142402023-03-16T12:31:15Zokurzokurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/614240/diff?detail_id=576863">diff</a>)</li></ul><p>For now for investigation I masked a worker service so that we can check:</p>
<pre><code>systemctl mask --now openqa-worker-auto-restart@13
</code></pre>
<p><a href="https://openqa.suse.de/tests/10707348#step/reboot_and_wait_up_normal/14" class="external">https://openqa.suse.de/tests/10707348#step/reboot_and_wait_up_normal/14</a> shows that we login over ssh but name resolution in the curl command fails. We checked manually in a SoL session to gonzo-1, machine is still up from <a href="https://openqa.suse.de/tests/10707348" class="external">https://openqa.suse.de/tests/10707348</a> and <code>dig grenache-1.qa.suse.de</code> works fine. We assume that as soon as the openQA test logged in over ssh to gonzo-1 network was simply not fully up yet. The test is making wrong assumptions. This is something which should be changed within os-autoinst-distri-opensuse. <a class="user active user-mention" href="https://progress.opensuse.org/users/24624">@nicksinger</a> I suggest you create a specific ticket for that. What we can also do is check the system journal on gonzo-1 and compare network related log messages to what the autoinst-log.txt from <a href="https://openqa.suse.de/tests/10707348" class="external">https://openqa.suse.de/tests/10707348</a> says to check when openQA logged in and when the network was actually reported to be up in <code>journalctl</code>.</p>
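<p>The suggested comparison between the login time from autoinst-log.txt and the "network up" time from the journal can be sketched as follows. This is a minimal sketch that assumes GNU <code>date</code>. The function name <code>seconds_between</code> is hypothetical, and both timestamps have to be picked from the logs by hand.</p>

```shell
#!/bin/sh
# Print the number of seconds from timestamp $1 to timestamp $2.
# Assumes GNU date, which parses ISO 8601 timestamps via -d.
seconds_between() {
    start=$(date -d "$1" +%s) || return 1
    end=$(date -d "$2" +%s) || return 1
    echo $((end - start))
}
```

<p>With the ssh login time as the first argument and the time the network was reported up as the second, a positive result would mean that the test logged in before the network was fully up.</p>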
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6142432023-03-16T12:40:26Zwaynechen55wchen@suse.com
<ul></ul><p>One more thing is ipmi sol connection can not be established to grenache-1:16/ix64ph1075:<br>
[2023-03-16T13:10:45.311924+01:00] [info] [pid:491202] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:<br>
ipmitool -I lanplus -H xxx -U xxx -P [masked] mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.</p>
<p>All test run assigned to this worker failed due to the same reason as above, for example, <a href="https://openqa.suse.de/tests/10707853" class="external">https://openqa.suse.de/tests/10707853</a>.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6143662023-03-16T17:48:18Zxguoxguo@suse.com
<ul></ul><p>waynechen55 wrote:</p>
<blockquote>
<p>One more thing is ipmi sol connection can not be established to grenache-1:16/ix64ph1075:<br>
[2023-03-16T13:10:45.311924+01:00] [info] [pid:491202] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:<br>
ipmitool -I lanplus -H xxx -U xxx -P [masked] mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.</p>
<p>All test run assigned to this worker failed due to the same reason as above, for example, <a href="https://openqa.suse.de/tests/10707853" class="external">https://openqa.suse.de/tests/10707853</a>.</p>
</blockquote>
<p>Quick update, Assigned worker: grenache-1:16 still have boot_from_pxe test failure on our OSD with the latest 15-SP5 build80.5.<br>
Please refer to the following osd test url for getting more details:<br>
<a href="https://openqa.suse.de/tests/10709006#step/boot_from_pxe/22" class="external">https://openqa.suse.de/tests/10709006#step/boot_from_pxe/22</a><br>
<a href="https://openqa.suse.de/tests/10709149#step/boot_from_pxe/9" class="external">https://openqa.suse.de/tests/10709149#step/boot_from_pxe/9</a></p>
<p>Alternatively, refer to <a href="https://openqa.suse.de/admin/workers/1247" class="external">https://openqa.suse.de/admin/workers/1247</a></p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6143842023-03-17T01:01:09Zxlaixlai@suse.com
<ul></ul><p>Thanks for the effort on this, guys.</p>
<p>I observe that after yesterday's final change, for the 3 machines, namely amd-zen3-gpu-sut1, gonzo, scooter, the boot_from_pxe succeed at acceptable ratio. </p>
<p>But on kermit, the success ratio is not high enough, see <a href="https://openqa.suse.de/admin/workers/1243" class="external">https://openqa.suse.de/admin/workers/1243</a>. Recent 3 jobs all failed at <a href="https://openqa.suse.de/tests/10707383#step/boot_from_pxe/7" class="external">https://openqa.suse.de/tests/10707383#step/boot_from_pxe/7</a>, while the earlier 3 passed. Would you please have a look?</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6144022023-03-17T02:27:20Zcachencachen@suse.com
<ul></ul><p>xlai wrote:</p>
<blockquote>
<p>Thanks for the effort on this, guys.</p>
<p>I observe that after yesterday's final change, for the 3 machines, namely amd-zen3-gpu-sut1, gonzo, scooter, the boot_from_pxe succeed at acceptable ratio. </p>
<p>But on kermit, the success ratio is not high enough, see <a href="https://openqa.suse.de/admin/workers/1243" class="external">https://openqa.suse.de/admin/workers/1243</a>. Recent 3 jobs all failed at <a href="https://openqa.suse.de/tests/10707383#step/boot_from_pxe/7" class="external">https://openqa.suse.de/tests/10707383#step/boot_from_pxe/7</a>, while the earlier 3 passed. Would you please have a look?</p>
</blockquote>
<p>Many tests failed with 'could not find kernel image'. I checked that all the typed strings are correct. Could this still be caused by an unstable or otherwise problematic network connection from kermit to the pxe/tftp server?</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6144472023-03-17T06:35:19Zokurzokurz@suse.com
<ul></ul><p>Please try to separate concerns and provide more details in your messages. <br>
It's important to distinguish errors that happen in all cases, like 100% error rate and sporadic timeouts and such as you noted about.<br>
Also referencing openQA jobs is good but even better is to explain what jobs those are, where they ran, what problem they show and what you expect instead. Can be all in a simple sentence, does not have to be fancy.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6144592023-03-17T07:36:54Zwaynechen55wchen@suse.com
<ul><li><strong>File</strong> <a href="/attachments/14840">boot_from_pxe_do_not_download.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/14840/boot_from_pxe_do_not_download.png">boot_from_pxe_do_not_download.png</a> added</li></ul><p>okurz wrote:</p>
<blockquote>
<p>Please try to separate concerns and provide more details in your messages. <br>
It's important to distinguish errors that happen in all cases, like 100% error rate and sporadic timeouts and such as you noted about.<br>
Also referencing openQA jobs is good but even better is to explain what jobs those are, where they ran, what problem they show and what you expect instead. Can be all in a simple sentence, does not have to be fancy.</p>
</blockquote>
<p>Now the most obvious problem is these four machines:<br>
grenache-1:12/kermit<br>
grenache-1:13/gonzo<br>
grenache-1:15/scooter<br>
grenache-1:19/amd-zen3<br>
can not do host installation from pxe/tftp.</p>
<p>Steps to reproduce:</p>
<ul>
<li>Establish ipmi sol session to one of the above machines</li>
<li>Press 'esc' at pxe menu</li>
<li>"boot:" prompts</li>
<li>Enter the following to install 15-SP5 Build80.5:
/mnt/openqa/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1/boot/x86_64/loader/linux initrd=/mnt/openqa/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1/boot/x86_64/loader/initrd install=<a href="http://openqa.suse.de/assets/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1?device=eth0">http://openqa.suse.de/assets/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1?device=eth0</a> ifcfg=eth0=dhcp4 plymouth.enable=0 /mnt/openqa/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1/boot/x86_64/loader/linux initrd=/mnt/openqa/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1/boot/x86_64/loader/initrd install=<a href="http://openqa.suse.de/assets/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1?device=eth0">http://openqa.suse.de/assets/repo/SLE-15-SP5-Online-x86_64-Build80.5-Media1?device=eth0</a> ifcfg=eth0=dhcp4 plymouth.enable=0 ssh=1 sshpassword=xxxxxx regurl=<a href="http://all-80.5.proxy.scc.suse.de">http://all-80.5.proxy.scc.suse.de</a> kernel.softlockup_panic=1 vt.color=0x07</li>
<li>Press "enter" to start loading linux/initrd</li>
</ul>
<p>But unfortunately, linux/initrd downloading never started. The machine hangs there. Please refer to the following screenshot:<br>
<img src="https://progress.opensuse.org/attachments/download/14840/boot_from_pxe_do_not_download.png" alt="" loading="lazy" /></p>
<p>Please also refer to openQA jobs:<br>
kermit <a href="https://openqa.suse.de/tests/10707383#step/boot_from_pxe/7">https://openqa.suse.de/tests/10707383#step/boot_from_pxe/7</a><br>
scooter <a href="https://openqa.suse.de/tests/10713350#step/boot_from_pxe/7">https://openqa.suse.de/tests/10713350#step/boot_from_pxe/7</a><br>
amd-zen3 <a href="https://openqa.suse.de/tests/10713345#step/boot_from_pxe/7">https://openqa.suse.de/tests/10713345#step/boot_from_pxe/7</a><br>
I also reproduced this issue manually with gonzo. Looks like it has 100% reproducibility.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6144652023-03-17T08:04:13Zwaynechen55wchen@suse.com
<ul><li><strong>File</strong> <a href="/attachments/14843">boot_from_pxe_do_not_download_2.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/14843/boot_from_pxe_do_not_download_2.png">boot_from_pxe_do_not_download_2.png</a> added</li></ul><p>One more screenshot from video record of job 10707383 in <a class="issue tracker-4 status-3 priority-6 priority-high2 closed child" title="action: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M (Resolved)" href="https://progress.opensuse.org/issues/119551#note-81">#119551#note-81</a>. It reported explicitly that it can not find image:<br>
<img src="https://progress.opensuse.org/attachments/download/14843/boot_from_pxe_do_not_download_2.png" alt="" loading="lazy" /><br>
So I think my manual reproduction with gonzo should have the same issue.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6144742023-03-17T08:41:34Znicksingernsinger@suse.com
<ul></ul><p>I can confirm that the NFS share on the tftp-server pointing to openqa.suse.de hangs. Most likely an unstable connection. Will check how we can recover and how we can rectify the problem long term.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6144802023-03-17T08:46:51Znicksingernsinger@suse.com
<ul></ul><p>dmesg shows that the machine failed to reach OSD starting yesterday, 20:34 CET:</p>
<pre><code>[Mar16 20:34] nfs: server openqa.suse.de not responding, still trying
[ +0.001777] nfs: server openqa.suse.de not responding, still trying
[Mar16 21:16] nfs: server openqa.suse.de not responding, still trying
[ +0.155986] nfs: server openqa.suse.de not responding, still trying
[Mar16 21:17] nfs: server openqa.suse.de not responding, still trying
[ +0.001621] nfs: server openqa.suse.de not responding, still trying
[Mar16 21:19] nfs: server openqa.suse.de not responding, still trying
[ +0.001601] nfs: server openqa.suse.de not responding, still trying
[Mar16 21:22] nfs: server openqa.suse.de not responding, still trying
[ +0.001568] nfs: server openqa.suse.de not responding, still trying
[Mar16 21:24] nfs: server openqa.suse.de not responding, still trying
[ +0.001573] nfs: server openqa.suse.de not responding, still trying
</code></pre>
<p>The retries lasted until now despite OSD being reachable over ping as well as showing the NFS port(s) in nmap. <code>umount -l</code> followed by <code>mount -a</code> hangs again, so I might have a reproducer right now.</p>
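<p>A hanging mount like this can be probed without blocking the shell. The following is a minimal sketch, assuming GNU coreutils <code>timeout</code> is available. The function name <code>mount_responds</code> and the default limit of 5 seconds are made up for illustration.</p>

```shell
#!/bin/sh
# Return 0 if the path can be stat'ed within the time limit, non-zero otherwise.
# Assumes GNU coreutils "timeout"; SIGKILL is used because a process blocked on
# a hung NFS mount may not react to other signals.
mount_responds() {
    path=$1
    limit=${2:-5}
    timeout -s KILL "$limit" stat -- "$path" >/dev/null 2>&1
}
```

<p>For example, <code>mount_responds /mnt/openqa 5 || echo "NFS mount hangs or is missing"</code> could be run on qa-jump. A non-zero result does not distinguish a hang from a missing path.</p>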
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6144982023-03-17T08:57:12Znicksingernsinger@suse.com
<ul></ul><p>nfs-server logs on OSD show no noteworthy entries. Listing mounts from qa-jump works:</p>
<pre><code>qa-jump:~ # showmount --exports openqa.suse.de
Export list for openqa.suse.de:
/var/lib/openqa/share *
</code></pre>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6145192023-03-17T09:17:02Znicksingernsinger@suse.com
<ul></ul><p><a href="https://www.suse.com/support/kb/doc/?id=000019722" class="external">https://www.suse.com/support/kb/doc/?id=000019722</a> mentions: "The unique scenario described above happens because many firewalls and smart routers will detect and block TCP connection reuse, even though connection reuse is a valid practice and NFS has traditionally relied upon it." - this sounds like a realistic assumption. We also see the described phenomenon on qa-jump:</p>
<pre><code>qa-jump:~ # ss -nt | grep :2049
ESTAB 0 0 10.168.192.10:783 10.137.50.100:2049
SYN-SENT 0 1 10.168.192.10:765 10.160.0.207:2049
</code></pre>
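<p>The check above can be narrowed down to the suspicious connections only. This is a sketch that assumes the column layout of <code>ss -nt</code> shown above (state, two queue counters, local address, peer address). The function name <code>stuck_nfs_connections</code> is hypothetical.</p>

```shell
#!/bin/sh
# Read "ss -nt" output on stdin and print connections to the NFS port (2049)
# that are stuck in SYN-SENT, as "local -> peer".
stuck_nfs_connections() {
    awk '$1 == "SYN-SENT" && $5 ~ /:2049$/ { print $4 " -> " $5 }'
}
```

<p>Usage would be <code>ss -nt | stuck_nfs_connections</code>. With the output shown above, only the connection to 10.160.0.207 is reported.</p>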
<p>Checking <code>mount -v</code> I can see that apparently we fall back to NFSv3:</p>
<pre><code>qa-jump:~ # /sbin/mount.nfs4 -v openqa.suse.de:/var/lib/openqa/share/factory /mnt/openqa -o ro
mount.nfs4: timeout set for Fri Mar 17 09:04:19 2023
mount.nfs4: trying text-based options 'vers=4.2,addr=10.160.0.207,clientaddr=10.168.192.10'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options 'addr=10.160.0.207'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.160.0.207 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.160.0.207 prog 100005 vers 3 prot UDP port 20048
Terminated
</code></pre>
<p>Dist was mounted with v4 and survived for a longer time. I think a valid workaround could be to mount OSD with the following command:</p>
<pre><code>/sbin/mount.nfs4 -v openqa.suse.de:/ /mnt/openqa -o ro,nfsvers=4,minorversion=2
</code></pre>
<p>I just did this manually but need to find out how to persist it. Afterwards a ticket needs to be opened to inform eng-infra about this shortcoming and ask if they are aware. Next we have to check if it actually helps with our problem in the long run.</p>
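<p>To check later whether the share is still mounted with NFSv4 or has fallen back to v3, the negotiated version can be read from the mount options. This is a sketch that assumes the usual <code>/proc/mounts</code> format, where NFS mounts list a <code>vers=</code> option. The function name <code>nfs_mount_version</code> is made up for illustration.</p>

```shell
#!/bin/sh
# Read /proc/mounts-style lines on stdin and print the "vers=" value of the
# NFS mount at the given mount point. Prints nothing if no such mount exists.
nfs_mount_version() {
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs/ {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^vers=/) {
                    sub(/^vers=/, "", opts[i])
                    print opts[i]
                }
        }'
}
```

<p>Usage would be <code>nfs_mount_version /mnt/openqa &lt; /proc/mounts</code>. A result starting with 4 would indicate that the workaround is in effect.</p>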
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6146092023-03-17T10:46:03Zokurzokurz@suse.com
<ul></ul><p>Discussed in SUSE QE Tools weekly meeting 2023-03-17: The NFS mount on qa-jump was fixed. nicksinger retriggered openQA jobs and will monitor those. If no further problems are found then </p>
<p>Please handle in separate tickets any sporadic issues or any potential firewall related issues for anything <em>later</em> than the openQA test modules "installation/welcome".</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6146242023-03-17T11:14:13Zokurzokurz@suse.com
<ul></ul><p>xguo wrote:</p>
<blockquote>
<p>[…]<br>
Quick update, Assigned worker: grenache-1:16 still have boot_from_pxe test failure on our OSD with the latest 15-SP5 build80.5.</p>
</blockquote>
<p>Please be aware that grenache-1:16 is ix64ph1075 which is NUE1-SRV2 so not affected by move to FC Basement. Also the problem happened during the Eng-Infra maintenance window. We can't rule out that as an effect which is unfortunate but means if you can reproduce that problem then please bring it up in a separate progress ticket <em>and</em> an according linked Eng-Infra ticket as well.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6146272023-03-17T11:14:23Znicksingernsinger@suse.com
<ul></ul><p>I changed the mountpoint to the following in /etc/fstab:</p>
<pre><code>openqa.suse.de:/factory /mnt/openqa nfs4 ro,defaults 0 0 # this mounts /var/lib/openqa/share/factory from OSD
</code></pre>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6146302023-03-17T11:16:03Zokurzokurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/614630/diff?detail_id=577181">diff</a>)</li></ul>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6146332023-03-17T11:17:08Zokurzokurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/614633/diff?detail_id=577184">diff</a>)</li></ul><p>Unmasked and started grenache-1 openqa-worker-auto-restart@13 aka. gonzo again.</p>
<p>Created <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/513" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/513</a> to include qa-jump and others in our availability check monitoring.</p>
<p>After the latest change in NFS mount jobs look good again:</p>
<ul>
<li><a href="https://openqa.suse.de/admin/workers/1207" class="external">grenache-1:10 openqaipmi5</a> <a href="https://openqa.suse.de/tests/10716021" class="external">https://openqa.suse.de/tests/10716021</a></li>
<li><a href="https://openqa.suse.de/admin/workers/1243" class="external">grenache-1:12 kermit</a> <a href="https://openqa.suse.de/tests/10716022" class="external">https://openqa.suse.de/tests/10716022</a></li>
<li><a href="https://openqa.suse.de/admin/workers/1349" class="external">grenache-1:13 gonzo</a> (no jobs currently)</li>
<li><a href="https://openqa.suse.de/admin/workers/1245" class="external">grenache-1:14 fozzie</a> <a href="https://openqa.suse.de/tests/10715932" class="external">https://openqa.suse.de/tests/10715932</a></li>
<li><a href="https://openqa.suse.de/admin/workers/1274" class="external">grenache-1:15 scooter</a> </li>
<li><a href="https://openqa.suse.de/admin/workers/1262" class="external">grenache-1:19 amd-zen3-gpu-sut1</a> <a href="https://openqa.suse.de/tests/10713369" class="external">https://openqa.suse.de/tests/10713369</a></li>
</ul>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6146392023-03-17T11:28:16Zokurzokurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/614639/diff?detail_id=577193">diff</a>)</li></ul>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6150922023-03-20T12:07:18Znicksingernsinger@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>I checked all mentioned machines: all runs over the weekend passed the PXE menu and appear to be working. I think with that we can consider this task done.</p>
QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:Mhttps://progress.opensuse.org/issues/119551?journal_id=6184132023-03-29T15:17:58Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> deleted (<del><i>2023-03-25</i></del>)</li></ul>