



action #111473


action #97862: More openQA worker hardware for OSD size:M

Get replacements for imagetester and openqaworker1 size:M

Added by okurz over 2 years ago. Updated over 2 years ago.




Move to co-location is delayed, we need to keep O3+OSD infrastructure within NUE SRV1 up-to-date.

Acceptance criteria

  • AC1: Updated the two oldest workers in NUE SRV1 (imagetester+openqaworker1)


  • We need to make physical space for new machines (might be possible to remove uno+rebel or openqaworker2+openqaworker3 to make room)
  • Get the quote from Nick or Oli

  • DONE: Get a quote, e.g. from for supermicro machines to replace at least imagetester+openqaworker1, potentially also uno+rebel or openqaworker2+openqaworker3 ->

  • Bring the order forward to Lee Martin, Nick Singer and Matthias Grießmeier to crosslink with the QE budget

  • Coordinate with SUSE-IT EngInfra to prepare for new machines replacing existing ones

  • Make sure machines are ordered to SUSE Nbg Maxtorhof

Further details

Suggested by nsinger:

  • Chassis 1x SuperMicro 825BTQC-R1K23LPB
  • RAM 8x Micron MTA36ASF8G72PZ-3G2: 512 GB
  • Disk 2x Samsung PM1643a 3.8 TB SSD
  • Controller 1x Broadcom 9500-8i
  • Network 1x Intel X710-DA2, 2 Ports, 10GbE, SFP+
  • M.2 NVMe, 2x Micron 7400 MAX, 400 GB, SSD
  • CPU 1x AMD EPYC 7763, 64 cores per CPU, 2.45 GHz


delta_d10z-m2-zm_9430.pdf (549 KB) okurz, 2022-06-03 12:32
Image from iOS.jpg (1.25 MB) mgriessmeier, 2022-07-20 11:28

Related issues 5 (0 open, 5 closed)

Related to openQA Infrastructure (public) - action #115418: Setup ow19+20 to be able to run MM tests size:M (Resolved, favogt, 2022-08-17)

Related to openQA Infrastructure (public) - action #115547: openqaworker20 fails to boot, broken hardware size:M (Resolved, favogt, 2022-08-19)

Related to openQA Infrastructure (public) - action #135137: Bring back imagetester size:M (Resolved, okurz, 2023-09-04)

Copied to openQA Infrastructure (public) - action #111986: Ensure is properly used (Resolved, okurz)

Copied to openQA Infrastructure (public) - action #113477: Get replacements for o3+osd top of rack switch (Rejected, okurz)

Actions #1

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz over 2 years ago

  • Tracker changed from action to coordination
Actions #3

Updated by okurz over 2 years ago

  • Tracker changed from coordination to action
  • Subject changed from [epic] Get replacement for existing machines, e.g. imagetester, openqaworker1 (potentially more) to Get replacement for existing machines, e.g. imagetester, openqaworker1 (potentially more)
  • Priority changed from Normal to High
Actions #4

Updated by livdywan over 2 years ago

  • Subject changed from Get replacement for existing machines, e.g. imagetester, openqaworker1 (potentially more) to Get replacements for imagetester and openqaworker1 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #6

Updated by okurz over 2 years ago

  • Copied to action #111986: Ensure is properly used added
Actions #7

Updated by okurz over 2 years ago

  • Priority changed from High to Urgent

To ensure we can make use of assigned budget we should expedite the ordering process, raising prio.

Actions #8

Updated by okurz over 2 years ago

Seems like I missed something in the previous PDF, e.g. missing NVMe etc.
Configured a new one together with nsinger to be sure we come up with the right stuff. is the link to the order, not sure if it's really persistent. So also attaching.

We decided that we don't need to ask for any "extended support", so it is not included in the PDF. Cost comes to roughly 9600 EUR * 1.19 = 11424 EUR including tax, or, using the SUSE-internal EUR-to-USD conversion rate of 1.21, about 13823 USD.
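For anyone wanting to re-check the figures above, here is a minimal sketch of the arithmetic. The three constants are taken from this comment; the function names are just illustrative.

```python
# Rough cost check, using only the numbers quoted in the comment above.
NET_EUR = 9600      # approximate net price
TAX_FACTOR = 1.19   # factor for "including tax"
EUR_TO_USD = 1.21   # internal conversion rate mentioned above


def gross_eur(net: float, tax_factor: float = TAX_FACTOR) -> float:
    """Return the price including tax."""
    return net * tax_factor


def to_usd(eur: float, rate: float = EUR_TO_USD) -> float:
    """Convert an EUR amount to USD with the given rate."""
    return eur * rate


gross = gross_eur(NET_EUR)
usd = to_usd(gross)
print(f"gross: {gross:.2f} EUR, {usd:.2f} USD")
```

Rounded to whole numbers this gives 11424 EUR and 13823 USD, matching the estimate.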

Actions #9

Updated by okurz over 2 years ago

  • Description updated (diff)

Moved "uno" related task to #111986

Actions #10

Updated by livdywan over 2 years ago

  • Status changed from Workable to Feedback

Email with "Anfrage" in the subject sent today

Actions #11

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #13

Updated by okurz over 2 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from nicksinger to okurz
Actions #14

Updated by okurz over 2 years ago

email sent to vendor, awaiting response.

Actions #15

Updated by okurz over 2 years ago

  • Due date set to 2022-06-24
  • Status changed from In Progress to Feedback

two quotes received, created ticket to receive approved POs. Approved by mgriessmeier as sponsor representative, waiting for PO approval.

Actions #16

Updated by okurz over 2 years ago

  • Due date changed from 2022-06-24 to 2022-07-01

Talked with mbach. She is aware of the urgency. Waiting for approval.

Actions #17

Updated by okurz over 2 years ago

  • Due date changed from 2022-07-01 to 2022-09-02

Approval was received. I forwarded the order to the vendor. Received a confirmation for both

Actions #18

Updated by okurz over 2 years ago

  • Priority changed from Urgent to Low

Waiting for delivery

Actions #19

Updated by okurz over 2 years ago

  • Due date deleted (2022-09-02)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Low to High

According to the Nbg machines have arrived at Frankencampus.

Next steps can be coordinated, e.g. create SUSE-IT EngInfra ticket, get the server hardware moved to Nbg Maxtorhof SRV1, have it connected replacing imagetester and openqaworker1, have imagetester and openqaworker1 moved to qa cold storage or put somewhere in SRV2 where there is space.

Actions #20

Updated by okurz over 2 years ago

  • Copied to action #113477: Get replacements for o3+osd top of rack switch added
Actions #21

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #23

Updated by mkittler over 2 years ago

  • Status changed from Workable to Feedback
  • Assignee set to mkittler
Actions #24

Updated by mgriessmeier over 2 years ago

Discussed today in person with Moroni. I will follow up on it next week when Oliver from facilities is back from holiday

Actions #25

Updated by mkittler over 2 years ago

@mgriessmeier Thanks for helping out.

And just for the record: The EngInfra ticket was not the way to go and we still need to figure out the correct process.

Actions #26

Updated by okurz over 2 years ago

mkittler wrote:

@mgriessmeier Thanks for helping out.

And just for the record: The EngInfra ticket was not the way to go and we still need to figure out the correct process.

I think the EngInfra ticket is certainly still the way to go. That it was closed as "Can't Do" is IMHO an unprofessional reaction and needs escalation, as already handled by mgriessmeier.

Actions #27

Updated by mgriessmeier over 2 years ago

While I'm still in discussions about the future progress, I have an update for this particular case. So facilities has taken the servers to Maxtorhof, they are residing in the old allhands area (see picture attached).
Moroni wants to have a new ticket for racking them, could you please take care of that?

Actions #29

Updated by okurz over 2 years ago

  • Status changed from Feedback to Blocked

blocked by #114379

Actions #30

Updated by mkittler over 2 years ago

  • Status changed from Blocked to In Progress

The machines have been installed. I suppose this is no longer blocked by #114379. I'll check whether I can reach the BMC mentioned in and update our salt pillars (draft:

Actions #31

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

So I suppose we should wait until both BMCs are responding. I suppose we also need a follow-up ticket for setting up both machines (I suppose using either Leap 15.4 or Tumbleweed as OS and replicating the config of the openqaworker1 and imagetester).

Actions #33

Updated by okurz over 2 years ago

mkittler wrote:

So I suppose we should wait until both BMCs are responding. I suppose we also need a follow-up ticket for setting up both machines (I suppose using either Leap 15.4 or Tumbleweed as OS and replicating the config of the openqaworker1 and imagetester).

Installation is part of this ticket. Does not make sense to plan a ticket at a separate time for two uninstalled machines. Please use openSUSE Leap 15.4 and consider using or a customized "non-salt" version. Alternative to autoyast installation is JeOS+cloudinit (or ignition/combustion) to do the initial configuration.

Actions #34

Updated by mkittler over 2 years ago

  • Status changed from Feedback to In Progress

Ok, I'll try installing openqaworker19 (BMC of openqaworker20 is still not responding).

There are already a few issues:

  • It looks like the BMC's web UI is only supporting Chromium (at least here Firefox doesn't work).
  • When trying to add a virtual media within the remote control window, the web UI says "This function requires SFT-DCMS-SINGLE license!".
  • When trying to add a virtual media from a file under "Configuration -> Virtual Media", the web UI says it doesn't work (tried both / and \ in the path).
  • PXEboot doesn't find anything (via IPv4 and IPv6), also not when using legacy boot

So I currently don't know how to do the installation. I've mentioned it in the infra ticket, maybe someone can help.

Actions #35

Updated by openqa_review over 2 years ago

  • Due date set to 2022-08-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #36

Updated by mkittler over 2 years ago

The BMC of openqaworker20 is now working as well, but it doesn't find anything over PXE boot either.

Actions #37

Updated by okurz over 2 years ago

Updated racktable entries with PO numbers and details, updated

Regarding PXE boot: Both machines should be o3 workers so they need to be added to the corresponding VLAN. I don't see corresponding entries in racktables yet so maybe that is missing. As soon as that is done I would assume that the PXE boot menu managed on o3 would show up.

Actions #38

Updated by mkittler over 2 years ago

Looks like dnsmasq on ariel isn't handing out an IP address automatically (but the host is already in o3 VLAN 662). So I'll have to configure it manually. (The MAC address of ow19 (3c:ec:ef:93:aa:fe) already shows up in the dnsmasq log with "no address available".)

EDIT: Added


to the dnsmasq config and   openqaworker19   openqaworker20

to /etc/hosts.

It still doesn't work, see
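The exact lines added are not preserved in this comment. As a rough illustration only, static entries of this kind usually take the form of a dnsmasq `dhcp-host=` line plus an `/etc/hosts` line per machine. In the sketch below only the ow19 MAC address comes from this ticket; the IP addresses (from the 192.0.2.0/24 documentation range) and the ow20 MAC address are placeholders, and the helper names are hypothetical.

```python
# Sketch: render static dnsmasq and /etc/hosts entries for the two workers.
# Only the ow19 MAC is taken from the ticket; everything else is a placeholder.
WORKERS = [
    ("3c:ec:ef:93:aa:fe", "openqaworker19", "192.0.2.19"),  # placeholder IP
    ("00:00:5e:00:53:14", "openqaworker20", "192.0.2.20"),  # placeholder MAC and IP
]


def dhcp_host_line(mac: str, hostname: str, ip: str) -> str:
    """Return a dnsmasq line pinning a MAC address to a hostname and IP."""
    return f"dhcp-host={mac},{hostname},{ip}"


def hosts_line(mac: str, hostname: str, ip: str) -> str:
    """Return the matching /etc/hosts line (the MAC is unused here)."""
    return f"{ip}\t{hostname}"


for worker in WORKERS:
    print(dhcp_host_line(*worker))
for worker in WORKERS:
    print(hosts_line(*worker))
```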

Actions #39

Updated by mkittler over 2 years ago

With the image from the ipxe-bootimgs package and the config dhcp-boot=tag:efi-x86_64,ipxe/ipxe-x86_64.efi we got a little bit further. The machine boots but the PXE-boot setup still lacks some configuration to be useful. Not sure why booting from pxelinux.0 as suggested in some places doesn't work. It also seems that the machine needs to be reset when changing the image path (just selecting the entry from the boot menu again will not make it load the other image).

I also found out why my attempts to mount a virtual media didn't work: Apparently this feature only supports Windows shares (and not sshfs). I suppose an SMB server on Linux would work as well. So alternatively I could try setting up an SMB server on some machine and try attaching a media from there.

Actions #41

Updated by mkittler over 2 years ago

To check out the SMB approach I've started an SMB server on my machine that is connected via VPN. It generally seems to work just fine:

martchus@QA-Power8-5-kvm:~> smbget --user martchus 'smb://'
Password for [martchus] connecting to // 
Using workgroup WORKGROUP, user martchus
Downloaded 173,00MB in 76 seconds

However, when trying to use it on it doesn't work. (There's no good error message.)

I've also read the help available in the UI and you can also put an HTTP-URL in. However, when doing so it again complains about a missing license.

I saw but it doesn't use dnsmasq as DHCP server. So it would mean re-configuring dnsmasq to do only DNS and use dhcpd for DHCP. I'm also not sure whether the HTTP server setup will work out of the box with our existing apache2 configuration. The step "Enroll the server certificate into the client firmware" is also not very clear to me.

Actions #42

Updated by mkittler over 2 years ago

Just out of curiosity I've tried the "netboot" image provided by other distributions (openSUSE does not provide such an image). The one from Ubuntu seems to be legacy-boot-only. The UEFI one from Arch works. It seems to be based on the ipxe image we've tried before in the mob session (the one from the ipxe-bootimgs package). Maybe I can create such an EFI binary for openSUSE as well.

From the Arch image one can enter a shell and boot openSUSE via a few commands. However, it ended up stuck on the rescue shell prompt. Maybe I can follow instructions from

Actions #43

Updated by mkittler over 2 years ago

Note that if the BMC website doesn't load anymore one can either use private mode or remove data via e.g. chrome://settings/cookies/detail?

Actions #44

Updated by mkittler over 2 years ago

EDIT: I have updated this comment. The following commands show the full procedure that allowed me to install a bootable system.

I managed to boot without having to enter commands manually in the ipxe shell. It'll just load the commands from a file on some HTTP server (to avoid having to re-create the image every time I want to change commands). Those are my notes so far:

# make a file containing the iPXE boot commands available via some HTTP server; file contents for installing Leap 15.4 with AutoYaST:
kernel initrd=initrd console=tty0 console=ttyS1,115200 install= autoyast= rootpassword=opensuse

# setup build of ipxe UEFI image like explained on
git clone
cd ipxe
# single quotes keep an interactive bash from treating "!ipxe" as a history expansion
echo '#!ipxe
chain' > myscript.ipxe
cd src

# conduct build similar to
make EMBED=../myscript.ipxe NO_WERROR=1 bin/ipxe.lkrn bin/ipxe.pxe bin-i386-efi/ipxe.efi bin-x86_64-efi/ipxe.efi

# copy image to ariel
rsync bin-x86_64-efi/ipxe.efi

# use image on ariel (via `dhcp-boot=tag:efi-x86_64,ipxe-own-build/ipxe.efi` in `/etc/dnsmasq.d/pxeboot.conf`)
sudo cp /home/martchus/ipxe.efi /srv/tftpboot/ipxe-own-build/ipxe.efi

# boot from the image, enter boot menu via F11 for quicker retries (but when amending the ipxe image one needs a full reboot!)

# amend file on server to test different parameters

# boot again


The autoyast profile is simply what we have in our repository with some adjustments. I'll create a PR for that later.
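The two files involved in the notes above (the script embedded into the iPXE image and the boot script fetched over HTTP) can be sketched as follows. This is only an illustration: all URLs are placeholders under `example.invalid`, the function names are made up, and the `initrd` and `boot` lines are an assumption about what a complete iPXE script would contain, since the notes only preserve the `kernel` line.

```python
# Sketch: render the embedded chain script and the remote iPXE boot script.
# URLs are placeholders; the kernel arguments mirror the notes above.


def embedded_chain_script(script_url: str) -> str:
    """Script compiled into ipxe.efi via EMBED=...; it only chains to the server."""
    return f"#!ipxe\nchain {script_url}\n"


def boot_script(kernel_url: str, initrd_url: str,
                install_url: str, autoyast_url: str) -> str:
    """Script served over HTTP; it can be edited without rebuilding the image."""
    args = " ".join([
        "initrd=initrd",
        "console=tty0",
        "console=ttyS1,115200",
        f"install={install_url}",
        f"autoyast={autoyast_url}",
        "rootpassword=opensuse",
    ])
    lines = [
        "#!ipxe",
        f"kernel {kernel_url} {args}",
        f"initrd {initrd_url}",  # assumption: not preserved in the notes
        "boot",                  # assumption: not preserved in the notes
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    base = "http://example.invalid"  # placeholder host
    print(embedded_chain_script(f"{base}/boot.ipxe"))
    print(boot_script(f"{base}/linux", f"{base}/initrd",
                      f"{base}/repo", f"{base}/autoyast.xml"))
```

Keeping the embedded script down to a single `chain` command is what makes quick retries possible: only the file on the HTTP server changes between boots.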

Actions #45

Updated by mkittler over 2 years ago

Both workers are now reachable via SSH from ariel. Login is possible using usual credentials. So far nothing has been set up yet.

Actions #46

Updated by mkittler over 2 years ago

I restored the PXE configuration on o3, cleaned up the tftp root again and added a few further comments in the dnsmasq config. It looks like the previous configuration has been restored correctly:

I created a PR for the AutoYaST-profile modifications:

Tomorrow I will set up openQA worker slots on the two machines.

Actions #47

Updated by livdywan over 2 years ago

mkittler wrote:

Tomorrow I will set up openQA worker slots on the two machines.

I guess they're live? But there's a dependency issue

Actions #48

Updated by mkittler over 2 years ago

Both workers are now configured:

  • The config is following what we do on other o3 workers (and mainly keeping openSUSE's defaults).
  • SSH keys from openqaworker1 are copied over. So everyone's login should work as before.
  • Currently a special worker class is still in place as I'm still running some test jobs (see and
    • So far the setup looks good (after a few initial problems).
  • I started 30 slots per host and enabled the VP9 video encoder as the systems look fast enough. (We can likely even increase the number of slots further.)
  • I configured the firewall so the developer mode works (see For now the ports for up to 50 worker slots are covered. (When setting up MM tests we likely go to the usual trusted zone config anyways so this point becomes irrelevant.)
  • The ovs-setup is still missing. However, this would be a good point to split further setup work into a separate ticket so I'd do the MM setup separately (preferably coming up with a script to do things automatically).
  • I rebooted both systems to see whether everything comes up automatically.
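To make the "ports for up to 50 worker slots" point concrete, here is a small sketch of which ports that would cover. It assumes openQA's usual defaults as I understand them (QEMUPORT = 20002 + 10 * instance, with the command server used by developer mode on QEMUPORT + 1); if the workers override these, the numbers differ.

```python
# Sketch: command server ports that developer mode needs per worker slot.
# Assumes the default port scheme; adjust QEMUPORT_BASE if it was overridden.
QEMUPORT_BASE = 20002


def command_server_port(instance: int) -> int:
    """Port of the os-autoinst command server for a worker instance (1-based)."""
    if instance < 1:
        raise ValueError("worker instances are numbered from 1")
    return QEMUPORT_BASE + 10 * instance + 1


def port_span(slots: int) -> tuple:
    """Lowest and highest command server port for the given number of slots."""
    return command_server_port(1), command_server_port(slots)


if __name__ == "__main__":
    low, high = port_span(50)
    print(f"open every 10th port from {low} to {high}")
```

Under these assumptions, 50 slots span ports 20013 to 20503 in steps of 10.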
Actions #49

Updated by mkittler over 2 years ago

Failures I've observed so far:

  1. and
  2. The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
  3. The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.
  4. The test module virt_install failed on both workers. Not sure yet why.

Otherwise both build result overviews look quite green.

Actions #50

Updated by favogt over 2 years ago

mkittler wrote:

Failures I've observed so far:

  1. and

Looks like it's doing run_in_powershell while the Tumbleweed window opens. It should probably close the serial port before starting that in the background. I'll try that out.

  1. The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
  2. The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.

It uses autoinst_url in test code, but isn't reachable on the worker itself. Setting AUTOINST_URL_HOSTNAME in workers.ini (or configuring MM, so br1 gets might help:

  1. The test module virt_install failed on both workers. Not sure yet why.

Looks like that's expected, the test fails everywhere constantly.

Otherwise both build result overviews look quite green.

Actions #51

Updated by favogt over 2 years ago

favogt wrote:

mkittler wrote:

Failures I've observed so far:

  1. and

Looks like it's doing run_in_powershell while the Tumbleweed window opens. It should probably close the serial port before starting that in the background. I'll try that out.


  1. The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
  2. The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.

It uses autoinst_url in test code, but isn't reachable on the worker itself. Setting AUTOINST_URL_HOSTNAME in workers.ini (or configuring MM, so br1 gets might help:

Works. I added it to workers.ini on both.

  1. The test module virt_install failed on both workers. Not sure yet why.

Looks like that's expected, the test fails everywhere constantly.

Otherwise both build result overviews look quite green.

Actions #52

Updated by favogt over 2 years ago

  • % Done changed from 0 to 80

favogt wrote:

favogt wrote:

mkittler wrote:

Failures I've observed so far:

  1. and

Looks like it's doing run_in_powershell while the Tumbleweed window opens. It should probably close the serial port before starting that in the background. I'll try that out.


Got merged.

  1. The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
  2. The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.

It uses autoinst_url in test code, but isn't reachable on the worker itself. Setting AUTOINST_URL_HOSTNAME in workers.ini (or configuring MM, so br1 gets might help:

Works. I added it to workers.ini on both.

  1. The test module virt_install failed on both workers. Not sure yet why.

Looks like that's expected, the test fails everywhere constantly.

Otherwise both build result overviews look quite green.

No (known) blockers left, so I enabled ow19 and ow20 for production, maybe just to uncover more issues. Fingers crossed.

Actions #53

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 80 to 0

@fvogt Thanks for looking into these. I'd have had a closer look today but it seems like you've already done the job :-)

Let's see how the new workers will perform.

Actions #54

Updated by mkittler over 2 years ago

I've created #115418 for the MM setup.

Actions #55

Updated by mkittler over 2 years ago

  • Related to action #115418: Setup ow19+20 to be able to run MM tests size:M added
Actions #56

Updated by favogt over 2 years ago

No (known) blockers left, so I enabled ow19 and ow20 for production, maybe just to uncover more issues. Fingers crossed.

One blocker uncovered and fixed: OVMF was too new, so I downgraded it (poo#111992).

Actions #58

Updated by okurz over 2 years ago

I reviewed instructions, looks good.

Actions #59

Updated by mkittler over 2 years ago

I've changed the IPMI password and updated accordingly.

Actions #60

Updated by favogt over 2 years ago

One issue uncovered: The firewall setup made it harder to access VNC for "intensive" debugging sessions. On ow20 I had to move eth2 from the public to the trusted zone to be able to access it via an SSH tunnel through ariel. So maybe that should be the default config even without MM networking set up.

Actions #61

Updated by favogt over 2 years ago

Some more issues: The staging OVMF images were missing, I copied them over. swtpm was also not installed, did that as well.
Some libvirt tests failed because they couldn't download a HDD, which worked on other workers. The reason is that the new workers didn't have the NFS mount set up.
Ideally the tests get changed such that the asset is properly tracked and NFS not needed, but that's something for later.

I added those three items to the worker setup documentation.

The WSL tests for some reason don't have proper DNS set up, so couldn't resolve openqaworker19 which was set as AUTOINST_URL_HOSTNAME.
I changed it to use the worker IPs instead. However, it looks like the reason that doesn't work is a bug in test code:
After that is fixed, we can drop the AUTOINST_URL_HOSTNAME assignment again.
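For reference, a sketch of how to check what a worker currently has configured for AUTOINST_URL_HOSTNAME. It assumes the setting lives in the `[global]` section of workers.ini; the example content and the IP address (from the 192.0.2.0/24 documentation range) are placeholders, not the real config of ow19 or ow20.

```python
# Sketch: read AUTOINST_URL_HOSTNAME from workers.ini-style content.
# The example text below is a placeholder, not the actual worker config.
import configparser
from typing import Optional

EXAMPLE_WORKERS_INI = """\
[global]
WORKER_CLASS = qemu_x86_64
AUTOINST_URL_HOSTNAME = 192.0.2.19
"""


def read_autoinst_hostname(ini_text: str) -> Optional[str]:
    """Return the configured AUTOINST_URL_HOSTNAME, or None if it is unset."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the upper-case keys as written
    parser.read_string(ini_text)
    return parser.get("global", "AUTOINST_URL_HOSTNAME", fallback=None)


if __name__ == "__main__":
    print(read_autoinst_hostname(EXAMPLE_WORKERS_INI))
```

Once the test code is fixed and the assignment is dropped again, the function would simply return None for that worker.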

Actions #62

Updated by favogt over 2 years ago

  • Related to action #115547: openqaworker20 fails to boot, broken hardware size:M added
Actions #63

Updated by livdywan over 2 years ago

  • Due date changed from 2022-08-20 to 2022-09-02

I refrained from suggesting to mark this ticket as blocked for now, but let's realistically give ourselves some more time to sort out the setup

Actions #64

Updated by mkittler over 2 years ago

Since @favogt already took care of the problems he mentioned himself, there's nothing left to do for this ticket (except maybe removing AUTOINST_URL_HOSTNAME later).

However, the fact that ow20 is broken is indeed a blocker, although I'm not sure whether fixing ow20 (after it was running as expected for a little while) should be considered part of this ticket. If not, I'd close this ticket once AUTOINST_URL_HOSTNAME has been removed on ow19. (The PR was only merged 2 hours ago. I suppose removing it tomorrow should be ok.)

Actions #65

Updated by mkittler over 2 years ago

  • Status changed from Feedback to Resolved

The due date for this issue has been exceeded. Considering what I wrote in the previous comment I'm resolving the ticket now. Note that the change needed to drop AUTOINST_URL_HOSTNAME is not ready yet, but I suppose we can close this issue nevertheless, as that is better done as part of #115478 (where I've also mentioned it).

Actions #66

Updated by okurz over 2 years ago

  • Due date deleted (2022-09-02)
Actions #67

Updated by okurz over 1 year ago

