action #134123
closed
QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Setup new PRG2 openQA worker for o3 - two new arm workers size:M
Description
Motivation
New hardware was ordered to serve as openQA workers for o3, both x86_64 and aarch64. We can connect those machines to the o3 webUI VM instance regardless of whether o3 is still running from NUE1 or already from PRG2.
Acceptance criteria
- AC1: o3 multi-machine aarch64 jobs run successfully on new PRG2 aarch64 openQA workers
- AC2: Both new PRG2 aarch64 openQA workers are monitored by https://gitlab.suse.de/openqa/monitor-o3/
- AC3: racktables.nue.suse.com is up-to-date with information about arm21+arm22
Suggestions
- See #132134 about the x86_64 workers and also #133490 for iPXE, which we likely need to adjust for aarch64
- Ensure both machines are up-to-date in racktables.nue.suse.com
- Optionally update netbox but we can trust the sync from hreinecke
- Use the iPXE or UEFI HTTP boot setup to boot an installer, follow the common installation instructions using AutoYaST, and then also run https://github.com/os-autoinst/os-autoinst/blob/master/script/os-autoinst-setup-multi-machine on both machines
- Verify stability with o3 openQA multi-machine jobs covering both machines
- Add the corresponding IPMI information in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls (a sanity-check sketch follows this list)
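Once the IPMI details are known, a quick sanity check could look like the following sketch (host, user and password are placeholders, not the real workerconf.sls values):
# check that the BMC is reachable and the credentials work (placeholders):
ipmitool -I lanplus -H openqaworker-arm21-ipmi.example.net -U ADMIN -P PASSWORD power status
# attach to the serial console via Serial-over-LAN:
ipmitool -I lanplus -H openqaworker-arm21-ipmi.example.net -U ADMIN -P PASSWORD sol activate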
Updated by okurz about 1 year ago
- Copied from action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Updated by mkittler about 1 year ago
- Status changed from New to In Progress
I was able to compile iPXE via https://github.com/os-autoinst/scripts/pull/253.
I was not able to boot via UEFI PXE or HTTP boot. No message about an unknown host appears in journalctl -fu dnsmasq.service on new-ariel.
It looks like https://racktables.suse.de/index.php?page=object&tab=default&object_id=22953 and https://racktables.suse.de/index.php?page=object&tab=default&object_id=22948 are only connected via eth2 and eth3, but in the setup menu I only see two ethernet devices, and in the UEFI shell eth0 and eth1 both show up as disconnected:
Shell> ifconfig -l
-----------------------------------------------------------------
name : eth0
Media State : Media disconnected
policy : dhcp
mac addr : D8:5E:D3:E7:AA:2C
ipv4 address : 0.0.0.0
subnet mask : 0.0.0.0
default gateway: 0.0.0.0
Routes (0 entries):
DNS server :
-----------------------------------------------------------------
name : eth1
Media State : Media disconnected
policy : dhcp
mac addr : D8:5E:D3:E7:AA:2D
ipv4 address : 0.0.0.0
subnet mask : 0.0.0.0
default gateway: 0.0.0.0
Routes (0 entries):
DNS server :
-----------------------------------------------------------------
This looks the same on both hosts. Could it be that the ethernet connection goes through ports that are not available in early boot (and thus don't allow me to boot the machine)?
Updated by mkittler about 1 year ago
I noticed that there is indeed a slot occupied with an additional network card. I enabled a BIOS initialization feature for that device in the hope that I could then boot from it. Unfortunately I now cannot even enter the setup menu anymore as the boot is stuck at:
Press <ESC> or <DEL> to enter setup, <F10> Display Boot Menu, <F12> to Network PXE Boot
Entering Setup...
BIOS Date: 09/29/2022 10:23:43 Ver: F31L
BMC Firmware Version: 12.60.38
BMC IP : 192.168.204.18
Processor CPU0: Ampere(R) Altra(R) Processor Q80-33
Processor Speed: 3200 MHz
Processor Max Speed: 3300 MHz
Total Memory: 512 GB
USB Devices total: 0 KBDs, 0 MICE, 0 MASS, 2 HUBs
NVMe Device0: SAMSUNG MZPLJ6T4HALA-00007 , 6.4TB
Detected ATA/ATAPI Devices...
SATA Hard Device Number : 0
92
Updated by openqa_review about 1 year ago
- Due date set to 2023-08-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 1 year ago
After a BIOS update (https://www.gigabyte.com/de/Enterprise/Rack-Server/R272-P30-rev-100#Support-Bios) arm21 works again. After enabling x86 emulation the other ethernet device shows up during early boot (as some Broadcom device) and can now in fact be used.
I've put the MAC address of that interface into the o3 dnsmasq config, and with the following config the iPXE boot works with my locally cross-compiled binary (which I have put on o3):
# UEFI tftp boot aarch64
dhcp-option=tag:efi-aarch64,option:bootfile-name,ipxe/ipxe-arm64.efi
# UEFI http boot aarch64
dhcp-option=tag:efi-aarch64-http,option:bootfile-name,http://10.150.1.11/tftpboot/ipxe/ipxe-arm64.efi
dhcp-option=tag:efi-aarch64-http,option:vendor-class,HTTPClient
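These dhcp-option lines rely on the tags efi-aarch64 and efi-aarch64-http being set elsewhere in the dnsmasq config. A minimal sketch of such tag definitions, assuming the standard DHCP client-architecture codes are used to distinguish the boot modes:
# set tags from the DHCP client architecture (option 93):
# 11 = ARM 64-bit UEFI via PXE, 19 = ARM 64-bit UEFI via HTTP boot
dhcp-match=set:efi-aarch64,option:client-arch,11
dhcp-match=set:efi-aarch64-http,option:client-arch,19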
I'll create a PR for further documentation of this. EDIT: https://github.com/os-autoinst/scripts/pull/255
For now I manually changed the command line to the aarch64 one documented under https://wiki.suse.net/index.php/OpenQA#Remote_management_with_IPMI_and_BMC_tools. Unfortunately I couldn't get past "Hardware detection". Likely the serial device was wrong.
Updated by mkittler about 1 year ago
I cannot boot via AutoYaST. The systems always hang on "Hardware detection". When enabling ssh one gets additional error messages, and it doesn't work either:
Loading basic drivers
Hardware detection
[ 90.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [wickedd-dhcp4:1243]
[ 114.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 48s! [wickedd-dhcp4:1243]
[ 138.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 71s! [wickedd-dhcp4:1243]
[ 162.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 93s! [wickedd-dhcp4:1243]
[ 186.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 116s! [wickedd-dhcp4:1243]
[ 210.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 138s! [wickedd-dhcp4:1243]
[ 234.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 160s! [wickedd-dhcp4:1243]
[ 258.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 183s! [wickedd-dhcp4:1243]
[ 282.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 205s! [wickedd-dhcp4:1243]
[ 306.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 227s! [wickedd-dhcp4:1243]
On arm22 I could at least boot without AutoYaST. On arm21 the system unfortunately still hangs after hardware initialization. Both systems are on the same firmware versions and I'm using the exact same way to boot, so I'm not sure what makes the difference.
I suppose the firmware update for the NIC "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller" should be available for download on https://www.broadcom.com/products/ethernet-connectivity/network-adapters/bcm57416-1gbase-t-ic (according to https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/software-installation/updating-the-firmware/updating-the-firmware-online.html), but I couldn't find any firmware downloads on that page. Maybe the firmware from this page is relevant: https://www.broadcom.com/products/ethernet-connectivity/network-adapters/p210tp
Updated by mkittler about 1 year ago
To summarize: I was able to set up arm22 (just installed the OS, not configured yet) but arm21 hangs as mentioned in the previous comment. The problem on arm21 must be more severe and is likely not caused by booting from network, because Antony said in the chat "I'm trying to install tumbleweed on the one with the issue, so that I can apply the fw update, but I can't even install tumbleweed from a usb...".
Updated by livdywan about 1 year ago
- Subject changed from Setup new PRG2 openQA worker for o3 - two new arm workers to Setup new PRG2 openQA worker for o3 - two new arm workers size:M
- Description updated (diff)
Updated by mkittler about 1 year ago
- Description updated (diff)
I was not able to boot any of the various Linux distributions I tried on arm21 via netboot.xyz. I tried Ubuntu twice, Fedora and Alpine. I also tried the Tumbleweed installation once more from USB. The system just hung at various stages (download, hardware initialization, service startup, download still within iPXE). I haven't conducted a memory test yet because we have MemTest only for x86 and netboot.xyz only provides MemTest for x86 in legacy boot mode.
I have configured arm22 similarly to the host aarch64 and will run some test jobs on it when I'm back from vacation.
Updated by mkittler about 1 year ago
I've created an infra ticket about the problematic arm worker: https://sd.suse.com/servicedesk/customer/portal/1/SD-130480
Updated by tinita about 1 year ago
- Due date changed from 2023-08-26 to 2023-09-01
- Status changed from In Progress to Blocked
Updated by okurz about 1 year ago
- Due date deleted (2023-09-01)
- Assignee changed from mkittler to nicksinger
As discussed in daily infra call handed over to nicksinger for tracking.
Updated by favogt about 1 year ago
I saw that openqaworker-arm22's 32 workers were sitting idle while the other aarch64 workers were rather busy with jobs.
A test job revealed that some packages were missing on the worker, hugepages had to be set up, and openqa-auto-update.timer as well as openqa-continuous-update.timer had to be enabled. I did all of that and ran a test job, which passed. After a reboot I added the production class (64bit only, Neoverse-N1 cannot run 32bit kernels) to see what's missing in practice.
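A minimal sketch of the timer and hugepage parts of that setup (the hugepage count is taken from a later comment in this ticket and may differ from what was actually used here):
# enable the openQA auto-update and continuous-update timers:
systemctl enable --now openqa-auto-update.timer openqa-continuous-update.timer
# allocate huge pages at runtime:
echo 128 > /proc/sys/vm/nr_hugepages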
Updated by favogt about 1 year ago
mkittler wrote in #note-8:
I cannot boot via autoyast. The systems always hang on "Hardware detection". When enabling ssh one gets additional error messages and it doesn't work:
Loading basic drivers
Hardware detection
[ 90.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [wickedd-dhcp4:1243]
[ 114.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 48s! [wickedd-dhcp4:1243]
[ 138.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 71s! [wickedd-dhcp4:1243]
[ 162.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 93s! [wickedd-dhcp4:1243]
[ 186.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 116s! [wickedd-dhcp4:1243]
[ 210.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 138s! [wickedd-dhcp4:1243]
[ 234.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 160s! [wickedd-dhcp4:1243]
[ 258.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 183s! [wickedd-dhcp4:1243]
[ 282.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 205s! [wickedd-dhcp4:1243]
[ 306.877330][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 227s! [wickedd-dhcp4:1243]
On arm22 I could at least boot without autoyast. On arm21 the system unfortunately still hangs after hardware initialization. Both systems are on the same firmware versions and I'm using the exact same way to boot so I'm not sure what makes the difference.
I guess CPU core #7 is broken in some way. You could confirm that by booting with maxcpus=6 and then getting cores 8-80 online with: for i in {8..80}; do echo 1 > /sys/devices/system/cpu/cpu${i}/online; done
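Spelled out, that check might look like the following sketch (assuming the kernel parameter is added via the grub boot menu; core numbering as suggested above):
# boot with "maxcpus=6" appended to the kernel command line, then:
for i in {8..80}; do
    echo 1 > /sys/devices/system/cpu/cpu${i}/online
done
# afterwards, check whether any soft lockups appeared:
dmesg | grep -i "soft lockup"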
Updated by okurz about 1 year ago
Now tracking https://jira.suse.com/browse/ENGINFRA-2676
Updated by okurz about 1 year ago
- Subject changed from Setup new PRG2 openQA worker for o3 - two new arm workers size:M to Setup new PRG2 openQA worker for o3 - two new arm workers
- Description updated (diff)
I suggest we re-estimate. The suggestions still have "TBD". Also I added a point regarding monitoring.
Updated by livdywan about 1 year ago
okurz wrote in #note-19:
I suggest to re-estimate. The suggestions still have "TBD". Also I added a point regarding monitoring.
Last comment on Jira is from 2023-10-11. Note that we're excluding blocked tickets in our go-to query, so this won't be estimated until that is resolved or we re-consider blocking on that.
Updated by okurz about 1 year ago
livdywan wrote in #note-20:
okurz wrote in #note-19:
I suggest to re-estimate. The suggestions still have "TBD". Also I added a point regarding monitoring.
Note that we're excluding blocked tickets in our go-to query so this won't be estimated until that is resolved or we re-consider blocking on that.
Yes, that's fine. We should re-estimate at the latest once the SD ticket is resolved.
Updated by livdywan about 1 year ago
- Due date set to 2023-11-11
Update from ENGINFRA-2676
FedEx lost the incoming server shipment from Happyware
So we'll need to wait for the replacement.
Updated by okurz about 1 year ago
- Due date changed from 2023-11-11 to 2023-11-17
special hackweek due-date bump
Updated by nicksinger 12 months ago
- Status changed from Workable to In Progress
I was able to find the IPMI login credentials after seeing in Jira that the vendor default was documented in Racktables. The oqadmin user documented in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2079 is able to access the machine. Continuing to try to set up the machine via PXE/network boot now.
Updated by nicksinger 12 months ago
I adjusted both entries in /etc/dnsmasq.d/openqa.conf according to the new IP addresses:
dhcp-host=14:23:f2:20:fe:c0,openqaworker-arm21
dhcp-host=14:23:f2:20:fe:c1,openqaworker-arm21-iface2
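For completeness, dnsmasq's dhcp-host syntax also accepts a fixed IPv4 address as an additional field; a sketch with a placeholder address (the real addresses live in the actual o3 config):
# the third field pins an IPv4 address; 10.150.1.100 is a placeholder
dhcp-host=14:23:f2:20:fe:c0,openqaworker-arm21,10.150.1.100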
After restarting dnsmasq (reload wasn't sufficient) the iPXE binary is served now and I could select the AY entry, but something seems wrong:
http://download.opensuse.org/distribution/leap/15.5/repo/oss/boot/x86_64/loader/linux.................. Connection timed out (https://ipxe.org/4c0a6092)
Updated by nicksinger 12 months ago
Something is odd. Now it can reach d.o.o but hangs at 70%
Updated by nicksinger 12 months ago
- Blocked by action #151231: package loss between o3 machines and download.opensuse.org size:M added
Updated by nicksinger 12 months ago · Edited
- Status changed from In Progress to Blocked
I don't think it makes sense to waste a lot of time just to get something booting; I'd rather try to fix the underlying issue in #151231
Updated by nicksinger 12 months ago
After I fixed ariel to properly resolve download.opensuse.org (by moving away a static config from /etc/dnsmasq.d/doo.conf) I ran 1000 wget downloads for the kernel and initrd each - a 100% success rate and really fast downloads. But for the love of god, I cannot get a system loaded because iPXE is always stuck at a different percentage of either the initrd or the kernel. Interestingly enough, iPXE itself seems to be working because Ctrl+C aborts the download and just cycles the boot process again. If somebody wants to try, feel free - I can take over once the installation is started.
Updated by okurz 11 months ago
- Related to action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Updated by nicksinger 11 months ago
- Status changed from Workable to In Progress
I've checked with worker22 now and face the same issue. Sometimes the kernel loads but the system hangs at loading the initrd; other times the kernel download itself is the first thing that hangs.
To reproduce the issue faster I dropped into an iPXE shell on worker22 and executed "kernel --timeout 10000 http://download.opensuse.org/distribution/leap/15.5/repo/oss/boot/aarch64/linux" multiple times. I realized that the first download worked while all following ones fail with a "Connection timed out".
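For reference, the whole exercise in the iPXE shell looks roughly like this (assuming net0 is the interface that gets a DHCP lease):
# enter the iPXE shell with Ctrl-B during network boot, then:
dhcp net0
kernel --timeout 10000 http://download.opensuse.org/distribution/leap/15.5/repo/oss/boot/aarch64/linux
# repeat the kernel command: the first fetch succeeds, later ones time out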
While poking around and reading documentation I found https://ipxe.org/cmd/ifstat, which shows me the following after multiple download attempts on worker22:
net0: 00:62:0b:32:09:10 using 14e4-16D8 on 0003:01:00.0 (Ethernet) [open]
[Link:up, TX:28492 TXE:80 RX:47079 RXE:39]
[TXE: 1 x "Network unreachable (https://ipxe.org/28086090)"]
[TXE: 79 x "Error 0x2a2c4089 (https://ipxe.org/2a2c4089)"]
[RXE: 32 x "The socket is not connected (https://ipxe.org/380f6093)"]
[RXE: 6 x "The socket is not connected (https://ipxe.org/380a6093)"]
[RXE: 1 x "Operation not supported (https://ipxe.org/3c046083)"]
net1: 00:62:0b:32:09:11 using 14e4-16D8 on 0003:01:00.1 (Ethernet) [closed]
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: Unknown (https://ipxe.org/1a086194)]
net2: d8:5e:d3:e7:aa:2c using i350 on 0003:02:00.0 (Ethernet) [closed]
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: Down (https://ipxe.org/38086193)]
net3: d8:5e:d3:e7:aa:2d using i350 on 0003:02:00.1 (Ethernet) [closed]
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: Down (https://ipxe.org/38086193)]
After running another download the metrics change:
net0: 00:62:0b:32:09:10 using 14e4-16D8 on 0003:01:00.0 (Ethernet) [open]
[Link:up, TX:28492 TXE:86 RX:47079 RXE:39]
[TXE: 1 x "Network unreachable (https://ipxe.org/28086090)"]
[TXE: 85 x "Error 0x2a2c4089 (https://ipxe.org/2a2c4089)"]
[RXE: 32 x "The socket is not connected (https://ipxe.org/380f6093)"]
[RXE: 6 x "The socket is not connected (https://ipxe.org/380a6093)"]
[RXE: 1 x "Operation not supported (https://ipxe.org/3c046083)"]
net1: 00:62:0b:32:09:11 using 14e4-16D8 on 0003:01:00.1 (Ethernet) [closed]
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: Unknown (https://ipxe.org/1a086194)]
net2: d8:5e:d3:e7:aa:2c using i350 on 0003:02:00.0 (Ethernet) [closed]
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: Down (https://ipxe.org/38086193)]
net3: d8:5e:d3:e7:aa:2d using i350 on 0003:02:00.1 (Ethernet) [closed]
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: Down (https://ipxe.org/38086193)]
So at least in the iPXE shell "Error 0x2a2c4089 (https://ipxe.org/2a2c4089)" seems to be causing the issues. Unfortunately the linked page does not help much in explaining how to fix it. It might also just be a red herring, because I would expect a full buffer to occur at roughly the same point of each download, while we see the failures all over the place.
Updated by nicksinger 11 months ago
A similar issue to ours is described here: https://github.com/ipxe/ipxe/issues/1023#issuecomment-1708614076 - the Broadcom NIC seems to cause problems and I will now try to make use of the NII fallback driver.
Updated by nicksinger 11 months ago
Compiling and using the "snponly" version of iPXE allowed me to boot reproducibly on both machines. As such, I added the corresponding binary to ariel and adjusted the pxeboot.conf of dnsmasq on it. I also created https://github.com/os-autoinst/scripts/pull/273 to at least somehow document my findings.
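The snponly build hands all NIC access to the firmware's SNP/NII driver instead of iPXE's native Broadcom driver, which sidesteps the driver problem. A sketch of such a cross-build (the toolchain prefix is an assumption; see the PR above for the documented steps):
# cross-compile the SNP-only iPXE binary for arm64 UEFI:
git clone https://github.com/ipxe/ipxe.git
cd ipxe/src
make CROSS_COMPILE=aarch64-linux-gnu- bin-arm64-efi/snponly.efi
# then copy bin-arm64-efi/snponly.efi to the boot file location on ariel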
Updated by nicksinger 11 months ago
- Due date changed from 2023-12-06 to 2023-12-08
As written previously I was able to continue the installation but it hung at "detecting hardware". Will have to try different serial settings as I observed the same on worker22
Updated by okurz 11 months ago
nicksinger wrote in #note-44:
As written previously I was able to continue the installation but it hung at "detecting hardware". Will have to try different serial settings as I observed the same on worker22
When you say "worker22" do you mean openqaworker22.openqanet.opensuse.org or openqaworker-arm22.openqanet.opensuse.org? w22 aka. openqaworker22.openqanet.opensuse.org has console=tty console=ttyS1,115200
, warm22 aka. openqaworker-arm22.openqanet.opensuse.org has no console setting in /proc/cmdline, probably should have /dev/ttyAMA0 or /dev/ttyAMA1
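If that turns out to be the fix, persisting the console setting would look roughly like this sketch (device and baud rate are assumptions based on the discussion above):
# in /etc/default/grub, extend the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="... console=tty console=ttyAMA0,115200"
# then regenerate the grub config:
grub2-mkconfig -o /boot/grub2/grub.cfg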
Updated by nicksinger 11 months ago
When you say "worker22" do you mean openqaworker22.openqanet.opensuse.org or openqaworker-arm22.openqanet.opensuse.org? w22 aka. openqaworker22.openqanet.opensuse.org has
console=tty console=ttyS1,115200
, warm22 aka. openqaworker-arm22.openqanet.opensuse.org has no console setting in /proc/cmdline, probably should have /dev/ttyAMA0 or /dev/ttyAMA1
I mean arm22. I tried it with console=ttyAMA0,115200 and can now see the YaST installer via SOL. It seems like the linked AY profile cannot handle the disk situation on that machine:
Important issues were detected:

 * Unable to propose a partitioning layout
 * partitioning,0:
   + Disk '/dev/nvme1n1' was not found
 * partitioning,1:
   + Disk '/dev/nvme2n1' was not found

Please, correct these problems and try again.
Updated by nicksinger 11 months ago
- Priority changed from Urgent to High
According to https://suse.slack.com/archives/C02AJ1E568M/p1702033595493679 the prio can be reduced if it is not linked to network issues. I've concluded the tests in https://progress.opensuse.org/issues/134123#note-35 and found a fix for iPXE with https://progress.opensuse.org/issues/134123#note-43, so reducing now.
I followed https://progress.opensuse.org/projects/openqav3/wiki#Manual-worker-setup but the machine does not connect to o3 yet; I have to figure out what is wrong and then schedule some test runs for that machine
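For context, the wiki setup essentially boils down to pointing the worker at the webUI and registering API credentials; a minimal sketch (host and keys below are placeholders, not the real o3 values):
# /etc/openqa/workers.ini
[global]
HOST = https://openqa.opensuse.org

# /etc/openqa/client.conf
[openqa.opensuse.org]
key = 0123456789ABCDEF
secret = FEDCBA9876543210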
Updated by nicksinger 11 months ago
- Due date changed from 2023-12-08 to 2023-12-12
- Status changed from In Progress to Workable
Several jobs ran in the meantime. After fixing the rsync server URL and installing perl-SemVer manually I still face incompletes on arm21. They look like this: https://openqa.opensuse.org/tests/3792840 - not sure what is causing this. So I will shut down the machine for now and have another look at it on Monday.
Updated by nicksinger 11 months ago
- Status changed from Workable to In Progress
I missed the first sentence on the setup page and didn't add the openQA development repository; fixed now. I also adjusted /etc/fstab to not rely on "openqa1-opensuse" but rather on a proper domain. Rerunning some verification now.
Updated by nicksinger 11 months ago
Now facing problems with starting qemu because of the huge pages setup. Found some hints in https://progress.opensuse.org/issues/53234#note-12 which I followed, and also issued echo 128 > /proc/sys/vm/nr_hugepages, which was present on arm22 (not sure what sets it and how), but qemu still refuses to start up properly: https://openqa.opensuse.org/tests/3800270#details
Updated by okurz 11 months ago
- Related to action #152446: openqaworker-arm21 is broken and produces lots of incomplete jobs added
Updated by nicksinger 11 months ago
I had to adjust /etc/default/grub for the correct huge pages config. Now I managed to validate the worker with https://openqa.opensuse.org/tests/3805306
I will now enable it for production and monitor it a little more before considering this ticket resolved.
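A sketch of what such a grub adjustment could look like (page size and count are assumptions, not necessarily the exact values used here):
# in /etc/default/grub, reserve huge pages at boot time:
GRUB_CMDLINE_LINUX_DEFAULT="... default_hugepagesz=2M hugepagesz=2M hugepages=128"
# then regenerate the grub config as usual and reboot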
Updated by okurz 11 months ago
- Status changed from Blocked to Resolved
- Assignee changed from nicksinger to okurz
With #150920-25 both warm21+warm22 and also a combination of both have been verified, so AC1 is covered. With https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/14 both machines are monitored as part of monitor-o3, so AC2 is covered.
I reviewed and updated the racktables entries for not only warm21+warm22 but all of openqaworker21..28. I updated contact person, description, common names, FQDNs, interface names, L2 addresses and IPs, so AC3 is covered. I crosschecked all suggestions and I am convinced that we are now good \o/