action #134123

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Setup new PRG2 openQA worker for o3 - two new arm workers size:M

Added by okurz 10 months ago. Updated 6 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

New hardware was ordered to serve as openQA workers for o3, both x86_64 and aarch64. We can connect those machines to the o3 webUI VM instance regardless of whether o3 is still running from NUE1 or from PRG2.

Acceptance criteria

  • AC1: o3 multi-machine aarch64 jobs run successfully on new PRG2 aarch64 openQA workers
  • AC2: Both new PRG2 aarch64 openQA workers are monitored by https://gitlab.suse.de/openqa/monitor-o3/
  • AC3: racktables.nue.suse.com is up-to-date with information about arm21+arm22

Suggestions


Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure - action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M (Resolved, nicksinger, 2023-11-15)

Related to openQA Infrastructure - action #152446: openqaworker-arm21 is broken and produces lots of incomplete jobs (Resolved, nicksinger, 2023-12-12)

Blocked by openQA Infrastructure - action #151231: package loss between o3 machines and download.opensuse.org size:M (Resolved, nicksinger, 2023-11-21)

Copied from openQA Infrastructure - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M (Resolved, dheidler, 2023-06-29)

Actions #1

Updated by okurz 10 months ago

  • Copied from action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Actions #2

Updated by mkittler 10 months ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler 10 months ago

  • Status changed from New to In Progress

I was able to compile iPXE via https://github.com/os-autoinst/scripts/pull/253.

I was not able to boot via UEFI PXE or HTTP boot. No message appears in journalctl -fu dnsmasq.service on new-ariel about an unknown host.

It looks like https://racktables.suse.de/index.php?page=object&tab=default&object_id=22953 and https://racktables.suse.de/index.php?page=object&tab=default&object_id=22948 are only connected via eth2 and eth3 but in the setup menu I only see two ethernet devices and in the UEFI shell I also only see eth0 and eth1 as disconnected:

Shell> ifconfig -l

-----------------------------------------------------------------

name         : eth0
Media State  : Media disconnected
policy       : dhcp
mac addr     : D8:5E:D3:E7:AA:2C

ipv4 address : 0.0.0.0

subnet mask  : 0.0.0.0

default gateway: 0.0.0.0

  Routes (0 entries):

DNS server   : 

-----------------------------------------------------------------

name         : eth1
Media State  : Media disconnected
policy       : dhcp
mac addr     : D8:5E:D3:E7:AA:2D

ipv4 address : 0.0.0.0

subnet mask  : 0.0.0.0

default gateway: 0.0.0.0

  Routes (0 entries):

DNS server   : 

-----------------------------------------------------------------

This looks the same on both hosts. Could it be that the ethernet connection is done via some ethernet ports that are not available in early boot (and thus don't allow me to boot the machine)?

Actions #4

Updated by mkittler 10 months ago

I noticed that there's indeed a slot occupied with an additional network card. I enabled some BIOS initialization feature for that device in the hope I could then boot from it. Unfortunately I now cannot even enter the setup menu anymore as the boot is stuck at:

Press <ESC> or <DEL> to enter setup, <F10> Display Boot Menu, <F12> to Network PXE Boot
Entering Setup...                                                               


BIOS Date: 09/29/2022 10:23:43 Ver: F31L                                        
BMC Firmware Version: 12.60.38                                                  

BMC IP : 192.168.204.18                                                         

Processor CPU0: Ampere(R) Altra(R) Processor Q80-33                             
Processor Speed: 3200 MHz                                                       
Processor Max Speed: 3300 MHz                                                   
Total Memory: 512 GB                                                            
USB Devices total: 0 KBDs, 0 MICE, 0 MASS, 2 HUBs                               

NVMe Device0:  SAMSUNG MZPLJ6T4HALA-00007 , 6.4TB                               

Detected ATA/ATAPI Devices...                                                   
SATA Hard Device Number : 0                                                     




                                                                             92
Actions #6

Updated by openqa_review 10 months ago

  • Due date set to 2023-08-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by mkittler 10 months ago

After a BIOS update (https://www.gigabyte.com/de/Enterprise/Rack-Server/R272-P30-rev-100#Support-Bios) arm21 works again. After enabling x86 emulation the other ethernet device shows up during early boot (as some Broadcom device) and can now in fact be used.

I've put the MAC address of that interface into the o3 dnsmasq config and with the following config the iPXE boot works with my locally cross-compiled binary (that I have put on o3):

# UEFI tftp boot aarch64
dhcp-option=tag:efi-aarch64,option:bootfile-name,ipxe/ipxe-arm64.efi

# UEFI http boot aarch64
dhcp-option=tag:efi-aarch64-http,option:bootfile-name,http://10.150.1.11/tftpboot/ipxe/ipxe-arm64.efi
dhcp-option=tag:efi-aarch64-http,option:vendor-class,HTTPClient

I'll create a PR for further documentation of this. EDIT: https://github.com/os-autoinst/scripts/pull/255

For now I manually changed the command line to the aarch64 one documented under https://wiki.suse.net/index.php/OpenQA#Remote_management_with_IPMI_and_BMC_tools. Unfortunately I couldn't get past "Hardware detection". Likely the serial device was wrong.

Actions #8

Updated by mkittler 10 months ago

I cannot boot via autoyast. The systems always hang on "Hardware detection". When enabling ssh one gets additional error messages and it doesn't work:

Loading basic drivers
Hardware detection
[   90.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [wickedd-dhcp4:1243]
[  114.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 48s! [wickedd-dhcp4:1243]
[  138.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 71s! [wickedd-dhcp4:1243]
[  162.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 93s! [wickedd-dhcp4:1243]
[  186.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 116s! [wickedd-dhcp4:1243]
[  210.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 138s! [wickedd-dhcp4:1243]
[  234.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 160s! [wickedd-dhcp4:1243]
[  258.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 183s! [wickedd-dhcp4:1243]
[  282.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 205s! [wickedd-dhcp4:1243]
[  306.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 227s! [wickedd-dhcp4:1243]

On arm22 I could at least boot without autoyast. On arm21 the system unfortunately still hangs after hardware initialization. Both systems are on the same firmware versions and I'm using the exact same way to boot so I'm not sure what makes the difference.


I suppose the firmware update for the nic "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller" would be available to download on https://www.broadcom.com/products/ethernet-connectivity/network-adapters/bcm57416-1gbase-t-ic (according to https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/software-installation/updating-the-firmware/updating-the-firmware-online.html) but I couldn't find any firmware downloads on that page. Maybe the firmware from this page is relevant: https://www.broadcom.com/products/ethernet-connectivity/network-adapters/p210tp

Actions #9

Updated by mkittler 10 months ago

To summarize: I was able to set up arm22 (just installed the OS, not configured yet) but arm21 hangs as mentioned in the previous comment. The problem on arm21 must be more severe and is likely not caused by booting from network because Anthony said in the chat "I'm trying to install tumbleweed on the one with the issue, so that I can apply the fw update, but I can't even install tumbleweed from a usb...".

Actions #10

Updated by livdywan 10 months ago

  • Subject changed from Setup new PRG2 openQA worker for o3 - two new arm workers to Setup new PRG2 openQA worker for o3 - two new arm workers size:M
  • Description updated (diff)
Actions #11

Updated by mkittler 10 months ago

  • Description updated (diff)

I was not able to boot any of the various Linux distributions I tried on arm21 via netboot.xyz. I tried Ubuntu twice, Fedora and Alpine. I also tried the Tumbleweed installation once more from USB. The system just hung at various stages (download, hardware initialization, service startup, download still within iPXE). I haven't conducted a memory test yet because we have MemTest only for x86 and netboot.xyz only provides MemTest for x86 in legacy boot mode.

I have configured arm22 similar to the host aarch64 and will run some test jobs on it when I'm back from vacation.

Actions #12

Updated by mkittler 10 months ago

I've created an infra ticket about the problematic arm worker: https://sd.suse.com/servicedesk/customer/portal/1/SD-130480

Actions #13

Updated by tinita 10 months ago

  • Due date changed from 2023-08-26 to 2023-09-01
  • Status changed from In Progress to Blocked
Actions #14

Updated by okurz 10 months ago

  • Due date deleted (2023-09-01)
  • Assignee changed from mkittler to nicksinger

As discussed in daily infra call handed over to nicksinger for tracking.

Actions #15

Updated by favogt 10 months ago

I saw that openqaworker-arm22's 32 workers were sitting idle while the other aarch64 workers were rather busy with jobs.
A test job revealed that some packages were missing on the worker, hugepages had to be set up and openqa-auto-update.timer as well as openqa-continuous-update.timer had to be enabled. I did all of that and ran a test job, which passed. After a reboot I added the production class (64bit only, Neoverse-N1 cannot run 32bit kernels) to see what's missing in practice.

Actions #16

Updated by favogt 10 months ago

mkittler wrote in #note-8:

I cannot boot via autoyast. The systems always hang on "Hardware detection". When enabling ssh one gets additional error messages and it doesn't work:

Loading basic drivers
Hardware detection
[   90.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [wickedd-dhcp4:1243]
[  114.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 48s! [wickedd-dhcp4:1243]
[  138.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 71s! [wickedd-dhcp4:1243]
[  162.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 93s! [wickedd-dhcp4:1243]
[  186.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 116s! [wickedd-dhcp4:1243]
[  210.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 138s! [wickedd-dhcp4:1243]
[  234.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 160s! [wickedd-dhcp4:1243]
[  258.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 183s! [wickedd-dhcp4:1243]
[  282.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 205s! [wickedd-dhcp4:1243]
[  306.877330][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 227s! [wickedd-dhcp4:1243]

On arm22 I could at least boot without autoyast. On arm21 the system unfortunately still hangs after hardware initialization. Both systems are on the same firmware versions and I'm using the exact same way to boot so I'm not sure what makes the difference.

I guess CPU core #7 is broken in some way. You could confirm that by booting with maxcpus=6 and then getting cores 8-80 online with for i in {8..80}; do echo 1 > /sys/devices/system/cpu/cpu${i}/online; done.
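The suggestion above could be scripted roughly as follows. This is only a sketch: the helper name `online_cpus_except` and the configurable sysfs directory are assumptions made for illustration (the second parameter lets the loop be tried against a scratch directory first). Instead of a fixed index range it walks whatever `cpuN/online` files exist and skips the one suspect core. Against the real /sys/devices/system/cpu it needs root.

```shell
#!/bin/sh
# Sketch: bring every offline CPU online except one suspect core.
# online_cpus_except is a hypothetical helper, not an existing tool.
# $1 = CPU index to leave offline
# $2 = sysfs CPU directory (assumption: defaults to the real one)
online_cpus_except() {
    skip=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue              # glob matched nothing
        dir=${f%/online}
        n=${dir##*/cpu}                      # numeric CPU index
        [ "$n" = "$skip" ] && continue       # leave the suspect core offline
        [ "$(cat "$f")" = "1" ] && continue  # already online
        echo 1 > "$f"
    done
}
```

After booting with a reduced maxcpus value, `online_cpus_except 7` would then enable all remaining cores. Note that with maxcpus=6 core 6 also starts offline and would be brought online by this loop.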

Actions #18

Updated by okurz 9 months ago

okurz wrote in #note-17:

Now tracking https://jira.suse.com/browse/ENGINFRA-2676

still valid block

Actions #19

Updated by okurz 8 months ago

  • Subject changed from Setup new PRG2 openQA worker for o3 - two new arm workers size:M to Setup new PRG2 openQA worker for o3 - two new arm workers
  • Description updated (diff)

I suggest to re-estimate. The suggestions still have "TBD". Also I added a point regarding monitoring.

Actions #20

Updated by livdywan 8 months ago

okurz wrote in #note-19:

I suggest to re-estimate. The suggestions still have "TBD". Also I added a point regarding monitoring.

Last comment on Jira from 2023-10-11. Note that we're excluding blocked tickets in our go-to query so this won't be estimated until that is resolved or we re-consider blocking on that.

Actions #21

Updated by okurz 8 months ago

livdywan wrote in #note-20:

okurz wrote in #note-19:

I suggest to re-estimate. The suggestions still have "TBD". Also I added a point regarding monitoring.

Note that we're excluding blocked tickets in our go-to query so this won't be estimated until that is resolved or we re-consider blocking on that.

Yes, that's fine. We should re-estimate at the latest once the SD ticket is resolved.

Actions #22

Updated by livdywan 8 months ago

  • Due date set to 2023-11-11

Update from ENGINFRA-2676

Fedex lost the incoming server package from Happyware

So we'll need to wait for the replacement.

Actions #23

Updated by okurz 8 months ago

  • Due date changed from 2023-11-11 to 2023-11-17

special hackweek due-date bump

Actions #24

Updated by livdywan 7 months ago

livdywan wrote in #note-22:

Update from ENGINFRA-2676

Fedex lost the incoming server package from Happyware

No news yet. Checking on the ticket.

Actions #25

Updated by livdywan 7 months ago

  • Status changed from Blocked to Workable

Big news! According to Anthony the new server arrived, and it's already installed in the rack! It should have the same IPMI static address and credentials. BMC should work, too.

Actions #26

Updated by okurz 7 months ago

  • Subject changed from Setup new PRG2 openQA worker for o3 - two new arm workers to Setup new PRG2 openQA worker for o3 - two new arm workers size:M
  • Description updated (diff)
Actions #27

Updated by livdywan 7 months ago

  • Due date changed from 2023-11-17 to 2023-11-29

Let's aim to at least confirm the ETA in the next two weeks. Anthony says he is available on Slack in case that helps ;-)

Actions #28

Updated by nicksinger 7 months ago

  • Status changed from Workable to In Progress

I was able to find the IPMI login credentials after seeing in Jira that the vendor default was documented in Racktables. The documented oqadmin user in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2079 is able to access the machine. Continuing to try to setup the machine via PXE/network boot now.

Actions #29

Updated by nicksinger 7 months ago

I adjusted both entries in /etc/dnsmasq.d/openqa.conf according to the new IP addresses:

dhcp-host=14:23:f2:20:fe:c0,openqaworker-arm21
dhcp-host=14:23:f2:20:fe:c1,openqaworker-arm21-iface2

After restarting dnsmasq (reload wasn't sufficient) I get the iPXE binary served now and could select the AY entry but something seems wrong:

http://download.opensuse.org/distribution/leap/15.5/repo/oss/boot/x86_64/loader/linux.................. Connection timed out (https://ipxe.org/4c0a6092)
Actions #30

Updated by nicksinger 7 months ago

something is odd. Now it can reach d.o.o but hangs at 70%

Actions #31

Updated by nicksinger 7 months ago

  • Blocked by action #151231: package loss between o3 machines and download.opensuse.org size:M added
Actions #32

Updated by nicksinger 7 months ago · Edited

  • Status changed from In Progress to Blocked

I don't think it makes sense to waste a lot of time just to get something booting, so I'd rather try to fix the underlying issue in #151231

Actions #33

Updated by okurz 7 months ago

  • Status changed from Blocked to Workable

#151231 resolved, back to here

Actions #34

Updated by okurz 7 months ago

nicksinger tried the installation again. It progressed a little bit but the initrd download then still failed or timed out. Please try out just wget http:/… or tftp on another machine, e.g. openqaworker-arm22

Actions #35

Updated by nicksinger 7 months ago

After I fixed ariel to properly resolve download.opensuse.org (by moving away a static config from /etc/dnsmasq.d/doo.conf) I ran 1000 wget downloads each for the kernel and the initrd: a 100% success rate and really fast downloads. But for the love of god, I cannot get a system loaded because iPXE always gets stuck at a different percentage of either the initrd or the kernel. Interestingly enough, iPXE itself seems to be working because ctrl+c aborts the download and just cycles the boot process again. If somebody wants to try, feel free. I can take over once the installation is started.
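A repeated-download check like the one described above can be sketched as a small loop. The helper name `count_successes` is hypothetical, and the exact wget invocation used on ariel is not recorded in this ticket, so the usage line below is an assumption.

```shell
#!/bin/sh
# Sketch: run a command N times and report how often it succeeded.
# count_successes is a hypothetical helper name.
# $1 = number of attempts, remaining arguments = command to run
count_successes() {
    attempts=$1
    shift
    ok=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Count the attempt as a success if the command exits with 0
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok/$attempts"
}
```

Assumed usage, with the kernel URL taken from a later comment in this ticket: `count_successes 1000 wget -q -O /dev/null http://download.opensuse.org/distribution/leap/15.5/repo/oss/boot/aarch64/linux`. A result of 1000/1000 from a regular host, combined with hangs in iPXE, points away from the mirror and towards the boot environment.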

Actions #36

Updated by mkittler 7 months ago

Maybe it makes sense to cross check whether iPXE works on another host like the other arm worker. We don't need to re-install it but just check whether we could boot an installation. If that works then maybe the new arm worker itself is still to blame.

Actions #37

Updated by tinita 7 months ago

  • Due date changed from 2023-11-29 to 2023-12-06
Actions #38

Updated by okurz 7 months ago

  • Related to action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Actions #39

Updated by okurz 7 months ago

  • Priority changed from High to Urgent

Becoming urgent as more tickets blocked

Actions #40

Updated by nicksinger 6 months ago

  • Status changed from Workable to In Progress

I've checked with worker22 now and face the same issue. Sometimes the kernel loads but the boot hangs while loading the initrd; another time the kernel download is already the first thing that hangs.
To reproduce the issue faster, I dropped into an iPXE shell on worker22 and executed "kernel --timeout 10000 http://download.opensuse.org/distribution/leap/15.5/repo/oss/boot/aarch64/linux" multiple times. I realized that the first download worked while all following ones fail with a "Connection timed out".

While poking around and reading documentation I found https://ipxe.org/cmd/ifstat and this shows me the following after multiple download attempts on worker22:

net0: 00:62:0b:32:09:10 using 14e4-16D8 on 0003:01:00.0 (Ethernet) [open]
  [Link:up, TX:28492 TXE:80 RX:47079 RXE:39]
  [TXE: 1 x "Network unreachable (https://ipxe.org/28086090)"]
  [TXE: 79 x "Error 0x2a2c4089 (https://ipxe.org/2a2c4089)"]
  [RXE: 32 x "The socket is not connected (https://ipxe.org/380f6093)"]
  [RXE: 6 x "The socket is not connected (https://ipxe.org/380a6093)"]
  [RXE: 1 x "Operation not supported (https://ipxe.org/3c046083)"]
net1: 00:62:0b:32:09:11 using 14e4-16D8 on 0003:01:00.1 (Ethernet) [closed]
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Unknown (https://ipxe.org/1a086194)]
net2: d8:5e:d3:e7:aa:2c using i350 on 0003:02:00.0 (Ethernet) [closed]
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Down (https://ipxe.org/38086193)]
net3: d8:5e:d3:e7:aa:2d using i350 on 0003:02:00.1 (Ethernet) [closed]
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Down (https://ipxe.org/38086193)]

after running another download the metrics change:

net0: 00:62:0b:32:09:10 using 14e4-16D8 on 0003:01:00.0 (Ethernet) [open]
  [Link:up, TX:28492 TXE:86 RX:47079 RXE:39]
  [TXE: 1 x "Network unreachable (https://ipxe.org/28086090)"]
  [TXE: 85 x "Error 0x2a2c4089 (https://ipxe.org/2a2c4089)"]
  [RXE: 32 x "The socket is not connected (https://ipxe.org/380f6093)"]
  [RXE: 6 x "The socket is not connected (https://ipxe.org/380a6093)"]
  [RXE: 1 x "Operation not supported (https://ipxe.org/3c046083)"]
net1: 00:62:0b:32:09:11 using 14e4-16D8 on 0003:01:00.1 (Ethernet) [closed]
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Unknown (https://ipxe.org/1a086194)]
net2: d8:5e:d3:e7:aa:2c using i350 on 0003:02:00.0 (Ethernet) [closed]
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Down (https://ipxe.org/38086193)]
net3: d8:5e:d3:e7:aa:2d using i350 on 0003:02:00.1 (Ethernet) [closed]
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Down (https://ipxe.org/38086193)]

So at least in the iPXE shell, "Error 0x2a2c4089 (https://ipxe.org/2a2c4089)" seems to be causing the issues. Unfortunately the linked page does not explain much about how to fix it. It might also just be a red herring: I would expect a full buffer to show up at roughly the same point of every download, whereas we see the hang all over the place.
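To make the change between the two ifstat outputs explicit, the TXE counters from the net0 summary lines can be compared. This is a minimal sketch; `txe_count` is a hypothetical helper and the two input lines are copied from the outputs above.

```shell
#!/bin/sh
# Sketch: extract the TXE (transmit error) counter from an iPXE ifstat
# summary line. txe_count is a hypothetical helper name.
txe_count() {
    printf '%s\n' "$1" | sed -n 's/.*TXE:\([0-9][0-9]*\).*/\1/p'
}

# net0 summary lines before and after one more download attempt
before='[Link:up, TX:28492 TXE:80 RX:47079 RXE:39]'
after='[Link:up, TX:28492 TXE:86 RX:47079 RXE:39]'

delta=$(( $(txe_count "$after") - $(txe_count "$before") ))
echo "new transmit errors: $delta"
```

This prints "new transmit errors: 6", which matches the rise of the 0x2a2c4089 count from 79 to 85, while the TX counter stays at 28492.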

Actions #41

Updated by nicksinger 6 months ago

An issue similar to ours is described here: https://github.com/ipxe/ipxe/issues/1023#issuecomment-1708614076 - the Broadcom NIC seems to cause problems and I will now try to make use of the NII-fallback driver.

Actions #42

Updated by favogt 6 months ago

You could also try the EFI shell with HTTP boot or HTTP downloads onto a FAT partition.

Actions #43

Updated by nicksinger 6 months ago

Compiling and using the "snponly" version of iPXE allowed me to boot reproducibly on both machines. As such, I added the corresponding binary to ariel and adjusted the pxeboot.conf of dnsmasq on it. I also created https://github.com/os-autoinst/scripts/pull/273 to at least somehow document my findings.

Actions #44

Updated by nicksinger 6 months ago

  • Due date changed from 2023-12-06 to 2023-12-08

As written previously I was able to continue the installation but it hung at "detecting hardware". Will have to try different serial settings as I observed the same on worker22

Actions #45

Updated by okurz 6 months ago

nicksinger wrote in #note-44:

As written previously I was able to continue the installation but it hung at "detecting hardware". Will have to try different serial settings as I observed the same on worker22

When you say "worker22" do you mean openqaworker22.openqanet.opensuse.org or openqaworker-arm22.openqanet.opensuse.org? w22 aka. openqaworker22.openqanet.opensuse.org has console=tty console=ttyS1,115200, warm22 aka. openqaworker-arm22.openqanet.opensuse.org has no console setting in /proc/cmdline, probably should have /dev/ttyAMA0 or /dev/ttyAMA1
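The console settings compared above can be listed per host with a short helper. This is a sketch only; `console_args` is a hypothetical name, and the optional file argument exists just so the function can be tried on a saved copy of a command line.

```shell
#!/bin/sh
# Sketch: print every console= argument of a kernel command line, one per line.
# console_args is a hypothetical helper name.
# $1 = file holding the command line (assumption: defaults to /proc/cmdline)
console_args() {
    # Split on spaces, keep only console= entries; print nothing if none are set
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep '^console=' || true
}
```

Running `console_args` on a host with console=tty console=ttyS1,115200 prints both entries, while a host without any console setting prints nothing.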

Actions #46

Updated by nicksinger 6 months ago

When you say "worker22" do you mean openqaworker22.openqanet.opensuse.org or openqaworker-arm22.openqanet.opensuse.org? w22 aka. openqaworker22.openqanet.opensuse.org has console=tty console=ttyS1,115200, warm22 aka. openqaworker-arm22.openqanet.opensuse.org has no console setting in /proc/cmdline, probably should have /dev/ttyAMA0 or /dev/ttyAMA1

I mean arm22. I tried it with console=ttyAMA0,115200 and can now see the yast installer via SOL. It seems like the linked AY profile cannot handle the disk situation on that machine:

   -││Important issues were detected:                                     ││
   -││                                                                    ││
   -││ *  Unable to propose a partitioning layout                         ││
   -││ *  partitioning,0:                                                 ││
   -││     +  Disk '/dev/nvme1n1' was not found                           ││
   -││ *  partitioning,1:                                                 ││
   -││     +  Disk '/dev/nvme2n1' was not found                           ││
   -││                                                                    ││
   -││                                                                    ││
   -││Please, correct these problems and try again.
Actions #47

Updated by okurz 6 months ago

in all ticket comments please consider our best practices from progress.opensuse.org/projects/qa/wiki/tools to make it clear what you plan to do as next step or also when you don't plan anything

Actions #48

Updated by nicksinger 6 months ago

  • Priority changed from Urgent to High

According to https://suse.slack.com/archives/C02AJ1E568M/p1702033595493679 the prio can be reduced if it is not linked to network issues. I've concluded tests in https://progress.opensuse.org/issues/134123#note-35 and found a fix for iPXE with https://progress.opensuse.org/issues/134123#note-43 so reducing now.

I followed https://progress.opensuse.org/projects/openqav3/wiki#Manual-worker-setup but the machine does not connect to o3 yet. I have to figure out what is wrong and then schedule some test runs for that machine.

Actions #49

Updated by nicksinger 6 months ago

  • Due date changed from 2023-12-08 to 2023-12-12
  • Status changed from In Progress to Workable

Several jobs ran in the meantime. After fixing the rsync-server-url and installing perl-SemVer manually I still face incompletes on arm21. They look like this: https://openqa.opensuse.org/tests/3792840 - not sure what is causing this. So I will shut down the machine for now and have another look at it on Monday.

Actions #50

Updated by nicksinger 6 months ago

  • Status changed from Workable to In Progress

I missed the first sentence in the setup page and didn't add the openQA development repository, fixed now. I also adjusted /etc/fstab to not rely on "openqa1-opensuse" but rather a proper domain. Rerunning some verification now.

Actions #51

Updated by nicksinger 6 months ago

Now facing problems with starting qemu because of the huge pages setup. I found some hints on https://progress.opensuse.org/issues/53234#note-12 which I followed, and I also issued echo 128 > /proc/sys/vm/nr_hugepages, a value which was present on arm22 (not sure what sets it and how), but qemu still refuses to start up properly: https://openqa.opensuse.org/tests/3800270#details
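After writing to /proc/sys/vm/nr_hugepages it can help to check how many huge pages the kernel actually reserved, since it may allocate fewer than requested when memory is fragmented. A minimal sketch, with `hugepages_total` as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: report the number of huge pages the kernel currently has reserved.
# hugepages_total is a hypothetical helper name.
# $1 = meminfo file (assumption: defaults to /proc/meminfo)
hugepages_total() {
    awk '/^HugePages_Total:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

Comparing the output of `hugepages_total` on arm21 and arm22 shows whether both hosts ended up with the same pool size. The value written via /proc does not survive a reboot.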

Actions #52

Updated by okurz 6 months ago

  • Related to action #152446: openqaworker-arm21 is broken and produces lots of incomplete jobs added
Actions #53

Updated by nicksinger 6 months ago

I had to adjust /etc/default/grub for the correct huge pages table config. Now I managed to validate the worker with https://openqa.opensuse.org/tests/3805306
I will now enable it for production and monitor it a little bit more before considering this ticket here as resolved.

Actions #54

Updated by nicksinger 6 months ago

  • Description updated (diff)
Actions #55

Updated by livdywan 6 months ago

  • Description updated (diff)
  • Status changed from In Progress to Blocked

AC1: o3 multi-machine aarch64 jobs run successfully on new PRG2 aarch64 openQA workers

Let's block this on #150920

Actions #56

Updated by livdywan 6 months ago

  • Due date deleted (2023-12-12)
Actions #57

Updated by okurz 6 months ago

  • Status changed from Blocked to Resolved
  • Assignee changed from nicksinger to okurz

With #150920-25 both warm21+warm22 and also a combination of both have been verified. So AC1 covered. With https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/14 both machines are monitored as part of monitor-o3. So AC2 covered.
I reviewed and updated racktables entries for not only warm21+warm22 but all openqaworker21..28. I updated contact person, description, common names, FQDNs, interface names, L2 addresses and IPs. So AC3 covered. I crosschecked all suggestions and I am convinced that now we are good \o/

Actions #58

Updated by okurz 6 months ago

  • Assignee changed from okurz to nicksinger