action #111473
action #97862: More openQA worker hardware for OSD size:M
Get replacements for imagetester and openqaworker1 size:M
0%
Description
Motivation
Move to co-location is delayed, we need to keep O3+OSD infrastructure within NUE SRV1 up-to-date.
Acceptance criteria
- AC1: Updated the two oldest workers in NUE SRV1 (imagetester+openqaworker1)
Suggestions
- We need to make physical space for new machines (might be possible to remove uno+rebel or openqaworker2+openqaworker3 to make room)
- Get the quote from Nick or Oli
- DONE: Get a quote, e.g. from https://www.deltacomputer.com/ for supermicro machines to replace at least imagetester+openqaworker1, potentially also uno+rebel or openqaworker2+openqaworker3 -> https://www.deltacomputer.com/media/pdf/delta_d10z-m2-zm_0465.pdf
- Bring the order forward to Lee Martin, Nick Singer and Matthias Grießmeier to crosslink with the QE budget
- Coordinate with SUSE-IT EngInfra to prepare for new machines replacing existing ones
- Make sure machines are ordered to SUSE Nbg Maxtorhof
Further details
Suggested by nsinger:
- Chassis 1x SuperMicro 825BTQC-R1K23LPB
- RAM 8x Micron MTA36ASF8G72PZ-3G2: 512 GB
- Disk 2x Samsung PM1643a 3,8TB SSD
- Controller 1x Broadcom 9500-8i
- Network 1x Intel X710-DA2, 2 Ports, 10GbE, SFP+
- M.2 NVMe, 2x Micron 7400 MAX, 400 GB, SSD
- CPU 1x AMD EPYC 7763, 64 Cores pro CPU, 2,45 GHz
https://confluence.suse.com/display/qasle/2022-05-23+Meeting+notes
Related issues
History
#1
Updated by okurz about 1 year ago
- Description updated (diff)
#2
Updated by okurz about 1 year ago
- Tracker changed from action to coordination
#3
Updated by okurz about 1 year ago
- Tracker changed from coordination to action
- Subject changed from [epic] Get replacement for existing machines, e.g. imagetester, openqaworker1 (potentially more) to Get replacement for existing machines, e.g. imagetester, openqaworker1 (potentially more)
- Priority changed from Normal to High
#4
Updated by cdywan about 1 year ago
- Subject changed from Get replacement for existing machines, e.g. imagetester, openqaworker1 (potentially more) to Get replacements for imagetester and openqaworker1 size:M
- Description updated (diff)
- Status changed from New to Workable
#5
Updated by okurz about 1 year ago
- Description updated (diff)
#6
Updated by okurz about 1 year ago
- Copied to action #111986: Ensure uno.openqanet.opensuse.org is properly used added
#7
Updated by okurz about 1 year ago
- Priority changed from High to Urgent
To ensure we can make use of assigned budget we should expedite the ordering process, raising prio.
#8
Updated by okurz about 1 year ago
- File delta_d10z-m2-zm_9430.pdf delta_d10z-m2-zm_9430.pdf added
- Description updated (diff)
- Assignee set to nicksinger
Seems like I missed something in the previous PDF, e.g. missing NVMe etc.
Configured a new one together with nsinger to be sure we come up with the right stuff. https://www.deltacomputer.com/media/pdf/delta_d10z-m2-zm_9430.pdf is the link to the order, not sure if it's really persistent. So also attaching.
We decided that we don't need to ask for any "extended support", so the configuration is as included in the PDF. The cost comes to roughly 9600 * 1.19 = 11424 EUR including tax, or, using the SUSE-internal EUR-to-USD conversion rate of 1.21, about 13823 USD.
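For the record, the cost estimate above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic: the three constants are the figures quoted in this comment, while the function and variable names are made up for illustration.

```python
from decimal import Decimal

# Figures taken from the comment above; names are illustrative only.
NET_EUR = Decimal("9600")      # rough net price from the quote
VAT_FACTOR = Decimal("1.19")   # 19 % tax
EUR_TO_USD = Decimal("1.21")   # SUSE-internal conversion rate


def gross_eur(net=NET_EUR):
    """Price in EUR including tax."""
    return net * VAT_FACTOR


def gross_usd(net=NET_EUR):
    """Price including tax, converted to USD."""
    return gross_eur(net) * EUR_TO_USD


print(f"{gross_eur()} EUR including tax, about {gross_usd():.0f} USD")
```

Decimal is used instead of float so that the intermediate values stay exact.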
#10
Updated by cdywan about 1 year ago
- Status changed from Workable to Feedback
Email with "Anfrage" in the subject sent today
#15
Updated by okurz 12 months ago
- Due date set to 2022-06-24
- Status changed from In Progress to Feedback
two quotes received, created ticket to receive approved POs. Approved by mgriessmeier as sponsor representative, waiting for PO approval.
#19
Updated by okurz 11 months ago
- Due date deleted (2022-09-02)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
- Priority changed from Low to High
According to https://www.ids-logistik.de/de/sendungsverfolgung?tracking=90461_2009043476268005 the Nbg machines have arrived at Frankencampus.
Next steps can be coordinated, e.g. create SUSE-IT EngInfra ticket, get the server hardware moved to Nbg Maxtorhof SRV1, have it connected replacing imagetester and openqaworker1, have imagetester and openqaworker1 moved to qa cold storage or put somewhere in SRV2 where there is space.
#20
Updated by okurz 11 months ago
- Copied to action #113477: Get replacements for o3+osd top of rack switch added
#22
Updated by mkittler 11 months ago
Created an Infra ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-92255
#24
Updated by mgriessmeier 11 months ago
Discussed today in person with Moroni. I will follow up on it next week when Oliver from facilities is back from holiday
#25
Updated by mkittler 11 months ago
mgriessmeier Thanks for helping out.
And just for the record: The EngInfra ticket was not the way to go and we still need to figure out the correct process.
#26
Updated by okurz 11 months ago
mkittler wrote:
mgriessmeier Thanks for helping out.
And just for the record: The EngInfra ticket was not the way to go and we still need to figure out the correct process.
I think the EngInfra ticket is certainly still the way to go. That it was closed as "Can't Do" is IMHO an unprofessional reaction and needs escalation, as already handled by mgriessmeier.
#27
Updated by mgriessmeier 11 months ago
- File Image from iOS.jpg Image from iOS.jpg added
While I'm still in discussions about the future progress, I have an update for this particular case. So facilities has taken the servers to Maxtorhof, they are residing in the old allhands area (see picture attached).
Moroni wants to have a new ticket for racking them, could you please take care of that?
#30
Updated by mkittler 10 months ago
- Status changed from Blocked to In Progress
The machines have been installed. I suppose this is no longer blocked by #114379. I'll check whether I can reach the BMC mentioned in https://sd.suse.com/servicedesk/customer/portal/1/SD-93312 and update our salt pillars (draft: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/430).
#31
Updated by mkittler 10 months ago
- Status changed from In Progress to Feedback
So I suppose we should wait until both BMCs are responding. I suppose we also need a follow-up ticket for setting up both machines (I suppose using either Leap 15.4 or Tumbleweed as OS and replicating the config of the openqaworker1 and imagetester).
#33
Updated by okurz 10 months ago
mkittler wrote:
So I suppose we should wait until both BMCs are responding. I suppose we also need a follow-up ticket for setting up both machines (I suppose using either Leap 15.4 or Tumbleweed as OS and replicating the config of the openqaworker1 and imagetester).
Installation is part of this ticket. It does not make sense to plan a ticket at a separate time for two uninstalled machines. Please use openSUSE Leap 15.4 and consider using https://github.com/os-autoinst/openQA/blob/master/contrib/ay-openqa-worker.xml or a customized "non-salt" version. An alternative to the autoyast installation is JeOS+cloudinit (or ignition/combustion) to do the initial configuration.
#34
Updated by mkittler 10 months ago
- Status changed from Feedback to In Progress
Ok, I'll try installing openqaworker19 (BMC of openqaworker20 is still not responding).
There are already a few issues:
- It looks like the BMC's web UI is only supporting Chromium (at least here Firefox doesn't work).
- When trying to add a virtual media within the remote control window, the web UI says "This function requires SFT-DCMS-SINGLE license!".
- When trying to add a virtual media under "Configuration -> Virtual Media" from a file from wotan.suse.de the web UI says it doesn't work (tried both / and \ in the path).
- PXE boot doesn't find anything (via IPv4 and IPv6), also not when using legacy boot.
So I currently don't know how to do the installation. I've mentioned it in the infra ticket, maybe someone can help.
#35
Updated by openqa_review 10 months ago
- Due date set to 2022-08-20
Setting due date based on mean cycle time of SUSE QE Tools
#37
Updated by okurz 10 months ago
Updated racktable entries with PO numbers and details, updated https://sd.suse.com/servicedesk/customer/portal/1/SD-89423
Regarding PXEBoot: Both machines should be o3 workers so they need to be added to the according VLAN. I don't see according entries in racktables yet so maybe that is missing. As soon as that is done I would assume that the PXE boot menu managed on o3 would show up.
#38
Updated by mkittler 10 months ago
Looks like dnsmasq on ariel isn't handing out an IP address automatically (but the host is already in o3 VLAN 662). So I'll have to configure it manually. (The MAC address of ow19 (3c:ec:ef:93:aa:fe) already shows up in the dnsmasq log with "no address available".)
EDIT: Added
    dhcp-host=3C:EC:EF:93:AA:FE,openqaworker19
    dhcp-host=3C:EC:EF:93:AA:22,openqaworker20
to the dnsmasq config and
    192.168.112.17 openqaworker19
    192.168.112.18 openqaworker20
to /etc/hosts.
It still doesn't work, see https://suse.slack.com/archives/C02AJ1E568M/p1660129409009959?thread_ts=1660126725.943279&cid=C02AJ1E568M.
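Since the dnsmasq entries and the /etc/hosts entries have to stay in sync, both can be derived from one table. The following is a minimal sketch, not what was actually used on ariel: the MAC addresses, IPs and host names are the ones from this comment, and the helper names are hypothetical.

```python
# MAC/IP values are copied from the comment above; the helpers are illustrative.
HOSTS = {
    "openqaworker19": ("3C:EC:EF:93:AA:FE", "192.168.112.17"),
    "openqaworker20": ("3C:EC:EF:93:AA:22", "192.168.112.18"),
}


def dnsmasq_lines(hosts):
    """One dhcp-host line per machine, mapping MAC address to host name."""
    return [f"dhcp-host={mac},{name}" for name, (mac, _ip) in hosts.items()]


def etc_hosts_lines(hosts):
    """One /etc/hosts line per machine, mapping IP address to host name."""
    return [f"{ip} {name}" for name, (_mac, ip) in hosts.items()]


print("\n".join(dnsmasq_lines(HOSTS)))
print("\n".join(etc_hosts_lines(HOSTS)))
```

The output still has to be pasted into the dnsmasq config and /etc/hosts by hand; the sketch only avoids typing each host name twice.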
#39
Updated by mkittler 10 months ago
With the image from the ipxe-bootimgs package and the config dhcp-boot=tag:efi-x86_64,ipxe/ipxe-x86_64.efi we got a little bit further. The machine boots but the PXE-boot setup still lacks some configuration to be useful. Not sure why booting from pxelinux.0 as suggested in some places doesn't work. It also seems that the machine needs to be reset when changing the image path (just selecting the entry from the boot menu again will not make it load the other image).
I also found out why my attempts to mount a virtual media didn't work: Apparently this feature only supports Windows shares (and not sshfs). I suppose an SMB server on Linux would work as well. So alternatively I could try setting up an SMB server on some machine and try attaching a media from there.
#41
Updated by mkittler 10 months ago
To check out the SMB approach I've started an SMB server on my machine that is connected via VPN. It generally seems to work just fine:
martchus@QA-Power8-5-kvm:~> smbget --user martchus 'smb://10.163.28.162/iso/openSUSE-Leap-15.4-NET-x86_64-Build243.2-Media.iso'
Password for [martchus] connecting to //10.163.28.162/iso:
Using workgroup WORKGROUP, user martchus
smb://10.163.28.162/iso/openSUSE-Leap-15.4-NET-x86_64-Build243.2-Media.iso
Downloaded 173,00MB in 76 seconds
However, when trying to use it on https://openqaworker19-ipmi.suse.de it doesn't work. (There's no good error message.)
I've also read the help available in the UI and you can also put an HTTP-URL in. However, when doing so it again complains about a missing license.
I saw https://documentation.suse.com/sles/15-SP4/html/SLES-all/cha-deployment-prep-uefi-httpboot.html but it doesn't use dnsmasq as DHCP server. So it would mean re-configuring dnsmasq to do only DNS and use dhcpd for DHCP. I'm also not sure whether the HTTP server setup will work out of the box with our existing apache2 configuration. The step "Enroll the server certificate into the client firmware" is also not very clear to me.
#42
Updated by mkittler 10 months ago
Just out of curiosity I've tried the "netboot" image provided by other distributions (openSUSE does not provide such an image). The one from Ubuntu seems to be legacy-boot-only. The UEFI one from Arch (https://archlinux.org/releng/netboot, https://github.com/archlinux/svntogit-community/blob/packages/ipxe/trunk/PKGBUILD#L77) works. It seems to be based on the ipxe image we've tried before in the mob session (the one from the ipxe-bootimgs package). Maybe I can create such an efi binary for openSUSE as well.
From the Arch image one can enter a shell and boot openSUSE via a few commands. However, it ended up stuck on the rescue shell prompt. Maybe I can follow instructions from https://en.opensuse.org/SDB:IPXE_booting#Setup.
#44
Updated by mkittler 10 months ago
EDIT: I have updated this comment. The following commands show the full procedure that allowed me to install a bootable system.
I managed to boot without having to enter commands manually in the ipxe shell. It'll just load the commands from a file on some http server (to avoid having to re-create the image every time I want to change commands). Those are my notes so far:
# make file that contains the boot ipxe commands to boot available via some http server, file contents for installing Leap 15.4 with autoyast:
#!ipxe
kernel http://download.opensuse.org/distribution/leap/15.4/repo/oss/boot/x86_64/loader/linux initrd=initrd console=tty0 console=ttyS1,115200 install=http://download.opensuse.org/distribution/leap/15.4/repo/oss/ autoyast=http://martchus.no-ip.biz/ipxe/ay-openqa-worker.xml rootpassword=opensuse
initrd http://download.opensuse.org/factory/repo/oss/boot/x86_64/loader/initrd
boot

# setup build of ipxe UEFI image like explained on https://en.opensuse.org/SDB:IPXE_booting#Setup
git clone https://github.com/ipxe/ipxe.git
cd ipxe
echo "#!ipxe
dhcp
chain http://martchus.no-ip.biz/ipxe/leap-15.4" > myscript.ipxe
cd src

# conduct build similar to https://github.com/archlinux/svntogit-community/blob/packages/ipxe/trunk/PKGBUILD#L58
make EMBED=../myscript.ipxe NO_WERROR=1 bin/ipxe.lkrn bin/ipxe.pxe bin-i386-efi/ipxe.efi bin-x86_64-efi/ipxe.efi

# copy image to ariel
rsync bin-x86_64-efi/ipxe.efi openqa.opensuse.org:/home/martchus/ipxe.efi

# use image on ariel (via `dhcp-boot=tag:efi-x86_64,ipxe-own-build/ipxe.efi` in `/etc/dnsmasq.d/pxeboot.conf`)
sudo cp /home/martchus/ipxe.efi /srv/tftpboot/ipxe-own-build/ipxe.efi

# boot from the image, enter boot menu via F11 for quicker retries (but when amending the ipxe image one needs a full reboot!)
# amend file on server to test different parameters
# boot again
…
The autoyast profile is simply what we have in our repository with some adjustments. I'll create a PR for that later.
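The boot script served via HTTP in the notes above has a fixed shape (kernel line with parameters, initrd line, boot), so it can be rendered from a few parameters when trying different settings. This is a rough sketch only: the function name and its arguments are hypothetical, and it assumes kernel and initrd are both taken from the same repository, which differs from the notes above where the initrd URL points at the factory repo.

```python
def ipxe_script(repo, autoyast_url, root_password,
                consoles=("tty0", "ttyS1,115200")):
    """Render an iPXE script that boots the installer found under `repo`.

    Assumption: kernel and initrd both live below <repo>/boot/x86_64/loader.
    """
    loader = f"{repo}/boot/x86_64/loader"
    args = ["initrd=initrd"]
    args += [f"console={console}" for console in consoles]
    args += [
        f"install={repo}/",
        f"autoyast={autoyast_url}",
        f"rootpassword={root_password}",
    ]
    lines = [
        "#!ipxe",
        f"kernel {loader}/linux {' '.join(args)}",
        f"initrd {loader}/initrd",
        "boot",
    ]
    return "\n".join(lines) + "\n"


print(ipxe_script(
    "http://download.opensuse.org/distribution/leap/15.4/repo/oss",
    "http://martchus.no-ip.biz/ipxe/ay-openqa-worker.xml",
    "opensuse",
), end="")
```

The rendered text would replace the file that the embedded script chain-loads, so the iPXE image itself does not need to be rebuilt.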
#46
Updated by mkittler 10 months ago
I restored the PXE configuration on o3, cleaned up the tftp root again and added a few further comments in the dnsmasq config. It looks like the previous configuration has been restored correctly: https://openqa.opensuse.org/tests/2516355#live
I created a PR for the AutoYaST-profile modifications: https://github.com/os-autoinst/openQA/pull/4778
Tomorrow I will setup openQA worker slots on the two machines.
#47
Updated by cdywan 10 months ago
mkittler wrote:
Tomorrow I will setup openQA worker slots on the two machines.
I guess they're live? But there's a dependency issue https://openqa.opensuse.org/tests/2517895
#48
Updated by mkittler 10 months ago
Both workers are now configured:
- The config is following what we do on other o3 workers (and mainly keeping openSUSE's defaults).
- SSH keys from openqaworker1 are copied over. So everyone's login should work as before.
- Currently a special worker class is still in place as I'm still running some test jobs (see https://openqa.opensuse.org/tests/overview?build=20220815-test-ow19 and https://openqa.opensuse.org/tests/overview?build=20220815-test-ow20).
- So far the setup looks good (after a few initial problems).
- I started 30 slots per host and enabled the VP9 video encoder as the systems look fast enough. (We can likely even increase the number of slots further.)
- I configured the firewall so the developer mode works (see https://github.com/os-autoinst/openQA/pull/4780). For now the ports for up to 50 worker slots are covered. (When setting up MM tests we likely go to the usual trusted zone config anyways so this point becomes irrelevant.)
- The ovs-setup is still missing. However, this would be a good point to split further setup work into a separate ticket so I'd do the MM setup separately (preferably coming up with a script to do things automatically).
- I rebooted both systems to see whether everything comes up automatically.
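Regarding the firewall item in the list above: which ports "up to 50 worker slots" means can be computed per slot. The sketch below assumes the numbering scheme as I understand it from the openQA documentation (QEMUPORT defaults to 20002 plus ten times the worker instance number, and the developer mode talks to the command server on QEMUPORT + 1). Please treat the formula and the function names as assumptions and check them against the linked PR before relying on them.

```python
def qemu_port(instance):
    """Assumed default QEMUPORT of a worker instance (instances start at 1)."""
    return 20002 + instance * 10


def command_server_port(instance):
    """Assumed port used by the developer mode for a worker instance."""
    return qemu_port(instance) + 1


def developer_mode_ports(max_slots):
    """All assumed command server ports for slots 1..max_slots."""
    return [command_server_port(i) for i in range(1, max_slots + 1)]


ports = developer_mode_ports(50)
print(f"{len(ports)} ports from {ports[0]} to {ports[-1]}, step 10")
```

With 30 slots started per host only the first 30 of these ports are in use, so covering 50 leaves headroom for adding slots later.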
#49
Updated by mkittler 10 months ago
Failures I've observed so far:
- https://openqa.opensuse.org/tests/2518592#step/firstrun/3 and https://openqa.opensuse.org/tests/2518593#step/firstrun/3
- For some reason the window lacks focus. (In the passed runs on other workers it has focus: https://openqa.opensuse.org/tests/2517671#step/firstrun/2)
- Not sure yet why that behaves differently on the new workers.
- The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
- The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.
- The test module virt_install failed on both workers. Not sure yet why.
Otherwise both build result overviews look quite green.
#50
Updated by favogt 10 months ago
mkittler wrote:
Failures I've observed so far:
- https://openqa.opensuse.org/tests/2518592#step/firstrun/3 and https://openqa.opensuse.org/tests/2518593#step/firstrun/3
- For some reason the window lacks focus. (In the passed runs on other workers it has focus: https://openqa.opensuse.org/tests/2517671#step/firstrun/2)
- Not sure yet why that behaves different on the new workers.
Looks like it's doing run_in_powershell while the Tumbleweed window opens. It should probably close the serial port before starting that in the background. I'll try that out.
- The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
- The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.
It uses autoinst_url in test code, but 10.0.2.2 isn't reachable on the worker itself. Setting AUTOINST_URL_HOSTNAME in workers.ini (or configuring MM, so br1 gets 10.0.2.2) might help: https://openqa.opensuse.org/tests/2520422
- The test module virt_install failed on both workers. Not sure yet why.
Looks like that's expected, the test fails everywhere constantly.
Otherwise both build result overviews look quite green.
#51
Updated by favogt 10 months ago
favogt wrote:
mkittler wrote:
Failures I've observed so far:
- https://openqa.opensuse.org/tests/2518592#step/firstrun/3 and https://openqa.opensuse.org/tests/2518593#step/firstrun/3
- For some reason the window lacks focus. (In the passed runs on other workers it has focus: https://openqa.opensuse.org/tests/2517671#step/firstrun/2)
- Not sure yet why that behaves different on the new workers.
Looks like it's doing run_in_powershell while the Tumbleweed window opens. It should probably close the serial port before starting that in the background. I'll try that out.
Works: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/15376
- The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
- The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.
It uses autoinst_url in test code, but 10.0.2.2 isn't reachable on the worker itself. Setting AUTOINST_URL_HOSTNAME in workers.ini (or configuring MM, so br1 gets 10.0.2.2) might help: https://openqa.opensuse.org/tests/2520422
Works. I added it to workers.ini on both.
- The test module virt_install failed on both workers. Not sure yet why.
Looks like that's expected, the test fails everywhere constantly.
Otherwise both build result overviews look quite green.
#52
Updated by favogt 10 months ago
- % Done changed from 0 to 80
favogt wrote:
favogt wrote:
mkittler wrote:
Failures I've observed so far:
- https://openqa.opensuse.org/tests/2518592#step/firstrun/3 and https://openqa.opensuse.org/tests/2518593#step/firstrun/3
- For some reason the window lacks focus. (In the passed runs on other workers it has focus: https://openqa.opensuse.org/tests/2517671#step/firstrun/2)
- Not sure yet why that behaves different on the new workers.
Looks like it's doing run_in_powershell while the Tumbleweed window opens. It should probably close the serial port before starting that in the background. I'll try that out.
Works: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/15376
Got merged.
- The test jeos-extra@64bit_virtio-2G only failed because the build variable was changed. So nothing to worry about I suppose.
- The test module zypper_log_packages failed on multiple jobs on both workers. Not sure yet why.
It uses autoinst_url in test code, but 10.0.2.2 isn't reachable on the worker itself. Setting AUTOINST_URL_HOSTNAME in workers.ini (or configuring MM, so br1 gets 10.0.2.2) might help: https://openqa.opensuse.org/tests/2520422
Works. I added it to workers.ini on both.
- The test module virt_install failed on both workers. Not sure yet why.
Looks like that's expected, the test fails everywhere constantly.
Otherwise both build result overviews look quite green.
No (known) blockers left, so I enabled ow19 and ow20 for production, maybe just to uncover more issues. Fingers crossed.
#55
Updated by mkittler 10 months ago
- Related to action #115418: Setup ow19+20 to be able to run MM tests size:M added
#57
Updated by mkittler 10 months ago
Ok.
I have updated our documentation in the Wiki: https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Setup-guide-for-new-machines
#59
Updated by mkittler 10 months ago
I've changed the IPMI password and updated https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/430 accordingly.
#60
Updated by favogt 10 months ago
One issue uncovered: The firewall setup made it harder to access VNC for "intensive" debugging sessions. On ow20 I had to move eth2 from the public to the trusted zone to be able to access it via an SSH tunnel through ariel. So maybe that should be the default config even without MM networking set up.
#61
Updated by favogt 10 months ago
Some more issues: The staging OVMF images were missing, I copied them over. swtpm was also not installed, did that as well.
Some libvirt tests failed because they couldn't download a HDD, which worked on other workers. The reason is that the new workers didn't have the NFS mount set up.
Ideally the tests get changed such that the asset is properly tracked and NFS not needed, but that's something for later.
I added those three items to the worker setup documentation.
The WSL tests for some reason don't have proper DNS set up, so couldn't resolve openqaworker19 which was set as AUTOINST_URL_HOSTNAME.
I changed it to use the worker IPs instead. However, it looks like the reason that 10.0.2.2 doesn't work is a bug in test code: https://progress.opensuse.org/issues/115478
After that is fixed, we can drop the AUTOINST_URL_HOSTNAME assignment again.
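For anyone repeating this on another worker, setting and later dropping the variable in workers.ini can be sketched like this. Note the assumptions: the setting is placed in the [global] section, the IP is the one listed for openqaworker19 in /etc/hosts earlier in this ticket, and the function names and the WORKER_CLASS sample value are made up. Also, configparser drops comments when rewriting the file, so on a real workers.ini editing by hand may be preferable.

```python
import configparser
import io


def _load(ini_text):
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep upper-case option names as they are
    parser.read_string(ini_text)
    return parser


def _dump(parser):
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


def set_autoinst_url_hostname(ini_text, value, section="global"):
    """Return the ini text with AUTOINST_URL_HOSTNAME set in `section`."""
    parser = _load(ini_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser[section]["AUTOINST_URL_HOSTNAME"] = value
    return _dump(parser)


def drop_autoinst_url_hostname(ini_text, section="global"):
    """Return the ini text without AUTOINST_URL_HOSTNAME in `section`."""
    parser = _load(ini_text)
    if parser.has_section(section):
        parser.remove_option(section, "AUTOINST_URL_HOSTNAME")
    return _dump(parser)


sample = "[global]\nWORKER_CLASS = qemu_x86_64\n"  # made-up sample content
with_ip = set_autoinst_url_hostname(sample, "192.168.112.17")
print(with_ip)
print(drop_autoinst_url_hostname(with_ip))
```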
#62
Updated by favogt 10 months ago
- Related to action #115547: openqaworker20 fails to boot, broken hardware size:M added
#64
Updated by mkittler 9 months ago
Since favogt already took care of the problems he mentioned himself, there's still nothing left to do for this ticket (except maybe removing AUTOINST_URL_HOSTNAME later).
However, the fact that ow20 is broken is indeed a blocker, although I'm not sure whether fixing ow20 (after it was running as expected for a little while) should be considered part of this ticket. If not, I'd close this ticket once AUTOINST_URL_HOSTNAME has been removed on ow19. (The PR was only merged 2 hours ago. I suppose removing it tomorrow should be ok.)
#65
Updated by mkittler 9 months ago
- Status changed from Feedback to Resolved
The due date for this issue has been exceeded. Considering what I wrote in the previous comment I'm resolving the ticket now. Note that the change to be able to drop AUTOINST_URL_HOSTNAME is not ready yet but I suppose we can close this issue nevertheless as it should better be done as part of #115478 (where I've also mentioned it).