coordination #131519

closed

[epic] Additional redundancy for OSD virtualization testing

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2023-02-09
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Tags:

Description

Motivation

As discussed in #131144 and various chats, the additional challenge for machines like "xen/hyperv/vmware" is that there are no clear instructions available on how to deploy a new machine to fill that purpose. This is why we have little or no redundancy. That means: we can keep any qemu worker offline for weeks and nobody notices, because we have redundancy and can build up new machines quickly. If we apply the "same level of care" to openqaw5-xen, that means no Xen testing for weeks. We have free hardware resources available, e.g. in FC Basement. How about we try to deploy machines there to be usable as Xen workers? Same for hyperv/vmware?

Suggestions


Subtasks 5 (0 open, 5 closed)

action #131549: [spike][timeboxed:20h] Additional redundancy for OSD virtualization testing - Hyperv 2016 worker host size:M (Resolved, nanzhang, 2023-06-28)
action #133247: Additional redundancy for OSD virtualization testing - Hyperv 2019 and 2022 (or 2012r2) worker host size:M (Resolved, rcai, 2023-07-25)
action #133367: Evaluate if we have hardware alternatives for Windows Server 2016+ testing (Resolved, okurz, 2023-07-26)
action #137306: Check unreal6 cabling, SP and system not reachable over network size:M (Resolved, okurz, 2023-10-02)
action #138350: worker31 and likely more OSD machines get stuck on boot in grub command line (Resolved, dheidler, 2023-06-28)

Related issues 7 (5 open, 2 closed)

Related to openQA Infrastructure (public) - action #108872: Outdated information on openqaw5-xen https://racktables.suse.de/index.php?page=object&tab=default&object_id=3468 (Resolved, okurz)
Related to openQA Tests (public) - action #109085: [qe-core] Ensure openqaw5-xen.qa.suse.de and potentially other hypervisor hosts OSs are updated to prevent NFS or other problems (New, 2022-03-28)
Related to Containers and images - coordination #95422: [MinimalVM][epic] separate hyperv from svirt backend (New, 2021-07-13)
Related to openQA Project (public) - action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" (New, 2020-10-30)
Related to openQA Tests (public) - action #46394: [sle][s390x][spvm][kvm][sporadic] test fails in various modules to login to svirt console (or system is not up yet) (Workable)
Related to openQA Infrastructure (public) - action #128222: [virtualization] The Xen specific host configuration on openqaw5-xen can be re-created from salt size:M (New, 2023-04-24)
Related to openQA Infrastructure (public) - action #134912: Gradually phase out NUE1 based openQA workers size:M (Resolved, okurz)

Actions #2

Updated by okurz over 1 year ago

  • Tracker changed from action to coordination
  • Subject changed from Additional redundancy for OSD virtualization testing to [epic] Additional redundancy for OSD virtualization testing
Actions #3

Updated by okurz over 1 year ago

  • Related to action #108872: Outdated information on openqaw5-xen https://racktables.suse.de/index.php?page=object&tab=default&object_id=3468 added
Actions #4

Updated by okurz over 1 year ago

  • Related to action #109085: [qe-core] Ensure openqaw5-xen.qa.suse.de and potentially other hypervisor hosts OSs are updated to prevent NFS or other problems added
Actions #7

Updated by okurz over 1 year ago

Actions #8

Updated by okurz over 1 year ago

  • Related to action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Actions #9

Updated by okurz over 1 year ago

  • Related to action #46394: [sle][s390x][spvm][kvm][sporadic] test fails in various modules to login to svirt console (or system is not up yet) added
Actions #13

Updated by xlai over 1 year ago

Hi Oliver,
The main reason for no redundancy for vmware/hyperv/xen is the lack of additional hardware and licenses. If there is free hardware, why not? I support building redundancy. That's definitely an improvement. Just please be aware that the redundancy will mainly help in cases such as an infra move or a disaster, which won't be frequent.

Actions #14

Updated by okurz over 1 year ago

xlai wrote:

Hi Oliver,
The main reason for no redundancy for vmware/hyperv/xen is the lack of additional hardware and licenses.

What licenses are needed for hyperv or xen? An evaluation copy of Windows should be good enough to prove the concept, and for Xen there should not be any problem. For VMware, license restrictions are more likely, but maybe an evaluation copy is enough to prove the concept there as well?

Actions #16

Updated by okurz over 1 year ago

  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #17

Updated by cachen over 1 year ago

I think this is about building replacements for the 4 systems below on free machines in FC Basement. @okurz, do we have 4 free machines there? Is it possible to give the VT team remote access to the free machines so they can evaluate the machines' capability first? If this can be achieved as a redundancy/backup system with old machines, that will benefit test run times.
Setting up new machines and their future maintenance in FC will still need help from you and your team on fundamental infrastructure (qa-net/pxe/salt/dhcp...). Is that fine with you?

openqaw9-hyperv.qa.suse.de (old name: flexo.qa.suse.cz/flexo.qa.suse.de) - Hyper-V 2012 R2 host
worker7-hyperv.oqa.suse.de - Hyper-V 2016 host
worker8-vmware.oqa.suse.de - VMware ESXi 6.5 host, now used by qac, purchased by VT
openqaw5-xen.qa.suse.de

Actions #18

Updated by okurz over 1 year ago

cachen wrote:

I think this is about building replacements for the 4 systems below on free machines in FC Basement. @okurz, do we have 4 free machines there?

Yes, see #131552

Is it possible to give the VT team remote access to the free machines so they can evaluate the machines' capability first?

Yes, see the wiki linked from #131552

Setting up new machines and their future maintenance in FC will still need help from you and your team on fundamental infrastructure (qa-net/pxe/salt/dhcp...). Is that fine with you?

Yes. We would be happy to support anyone picking up that task.

Actions #19

Updated by cachen over 1 year ago

Cool. @xlai, I leave it to you and your team to pick machines there and evaluate their capability for setting up new systems to extend the redundancy.

Actions #20

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #21

Updated by xlai over 1 year ago

What licenses are needed for hyperv or xen? An evaluation copy of Windows should be good enough to prove the concept, and for Xen there should not be any problem. For VMware, license restrictions are more likely, but maybe an evaluation copy is enough to prove the concept there as well?

@okurz, Hi Oliver, regarding licenses: they are needed for vmware and hyperv. For a proof of concept, a free evaluation/experimental license should be enough. For official testing, we have contacts in SUSE from whom we can request licenses.

Actions #22

Updated by xlai over 1 year ago

  • Description updated (diff)
Actions #23

Updated by xlai over 1 year ago

unreal2, unreal3, unreal4, unreal5 as bare-metal test hosts #131552
unreal6 for pure Xen #131546
unreal7 for VMWare 7 #132590
unreal8 for hyperv 2016 #131549

@okurz, Hi Oliver, if these unreal machines are OK to serve as redundant vmware & hyperv hosts, can we ask for 2 more machines on which to install hyperv 2012r2 and hyperv 2019? Would you please share the hardware info if that is okay?

Actions #24

Updated by okurz over 1 year ago

xlai wrote:

unreal2, unreal3, unreal4, unreal5 as bare-metal test hosts #131552
unreal6 for pure Xen #131546
unreal7 for VMWare 7 #132590
unreal8 for hyperv 2016 #131549

@okurz, Hi Oliver, if these unreal machines are OK to serve as redundant vmware & hyperv hosts, can we ask for 2 more machines on which to install hyperv 2012r2 and hyperv 2019?

ok, created #131549 for this. I assume you want to test on Windows Server 2022 as Windows Server 2012r2 goes EOL 2023-10-10 https://learn.microsoft.com/en-us/lifecycle/announcements/windows-server-2012-r2-end-of-support

Would you please share the hardware info if okay?

Each server is a Supermicro X10SLD-F https://www.supermicro.com/en/products/motherboard/X10SLD-F with 2xSSD. Those SSDs are likely rather small but ordering bigger storage devices is likely easy to do. The exact details could be found over the BMC, e.g. https://unreal4-sp.qe.nue2.suse.org/

Actions #25

Updated by xlai over 1 year ago

okurz wrote:

xlai wrote:

unreal2, unreal3, unreal4, unreal5 as bare-metal test hosts #131552
unreal6 for pure Xen #131546
unreal7 for VMWare 7 #132590
unreal8 for hyperv 2016 #131549

@okurz, Hi Oliver, if these unreal machines are OK to serve as redundant vmware & hyperv hosts, can we ask for 2 more machines on which to install hyperv 2012r2 and hyperv 2019?

ok, created #131549 for this. I assume you want to test on Windows Server 2022 as Windows Server 2012r2 goes EOL 2023-10-10 https://learn.microsoft.com/en-us/lifecycle/announcements/windows-server-2012-r2-end-of-support

@okurz Hi Oliver, #131549 was for hyperv 2016. Shall we create a new one for hyperv 2022? From https://progress.opensuse.org/issues/131549#note-12, it seems you directly modified #131549 to fit 2022?

Besides, I hope you realize from https://progress.opensuse.org/issues/131549#note-5 and the comments after it that the qe-virt squad is willing to take over the further setup work, since the tools team has been out of capacity recently. So we assigned nanzhang and bumped the priority, and it is WIP now, but you reset all of those settings in https://progress.opensuse.org/issues/131549#note-12. Would you please explain why? We know that the ticket is in the openQA infrastructure backlog. If we are not supposed to directly own and edit the tickets, would you please share how you'd suggest we continue?

Actions #26

Updated by okurz over 1 year ago

  • Description updated (diff)

xlai wrote:

okurz wrote:

xlai wrote:

unreal2, unreal3, unreal4, unreal5 as bare-metal test hosts #131552
unreal6 for pure Xen #131546
unreal7 for VMWare 7 #132590
unreal8 for hyperv 2016 #131549

@okurz, Hi Oliver, if these unreal machines are OK to serve as redundant vmware & hyperv hosts, can we ask for 2 more machines on which to install hyperv 2012r2 and hyperv 2019?

ok, created #131549 for this. I assume you want to test on Windows Server 2022 as Windows Server 2012r2 goes EOL 2023-10-10 https://learn.microsoft.com/en-us/lifecycle/announcements/windows-server-2012-r2-end-of-support

@okurz Hi Oliver, #131549 was for hyperv 2016. Shall we create a new one for hyperv 2022? From https://progress.opensuse.org/issues/131549#note-12, it seems you directly modified #131549 to fit 2022?

Sorry, I was updating the wrong ticket by mistake. I thought I was working on a copy for hyper2019 instead. That new ticket will be #133247 and I reverted #131549

Besides, I hope you realize from https://progress.opensuse.org/issues/131549#note-5 and the comments after it that the qe-virt squad is willing to take over the further setup work, since the tools team has been out of capacity recently. So we assigned nanzhang and bumped the priority, and it is WIP now, but you reset all of those settings in https://progress.opensuse.org/issues/131549#note-12.
Would you please explain why? We know that the ticket is in the openQA infrastructure backlog. If we are not supposed to directly own and edit the tickets, would you please share how you'd suggest we continue?

All is good, I am sorry. I reverted the changes I did in #131549. Yes, I am aware and appreciate your work. Of course you can continue in #131549

Actions #27

Updated by xlai over 1 year ago

No worries, we will continue :).

Actions #28

Updated by xlai over 1 year ago

  • Description updated (diff)
Actions #29

Updated by okurz over 1 year ago

  • Related to action #128222: [virtualization] The Xen specific host configuration on openqaw5-xen can be re-created from salt size:M added
Actions #30

Updated by okurz about 1 year ago

  • Parent task changed from #130955 to #129280
Actions #31

Updated by okurz about 1 year ago

  • Target version changed from Ready to Tools - Next
Actions #32

Updated by xlai about 1 year ago

  • Description updated (diff)
Actions #33

Updated by xlai about 1 year ago

Let me summarize the redundancy building status from the virtualization squad so that everyone is on the same page.

@nanzhang and @rcai helped build 6 redundancy machines. All are done: added to OSD and verified with openQA jobs. Nan and Roy will continue to follow the stability and performance of these machines during 15sp6 testing, but that is outside the scope of these tickets and will be treated as a separate task in our own backlog. Thanks a lot to Nan and Roy. In detail, the redundancy machines built are:

  • unreal2, unreal3 for kvm and xen baremetal test machine #131552
  • unreal4, unreal5 for hyperv 2022 and 2019 #133247
  • unreal7 for VMWare 7 #132590
  • unreal8 for hyperv 2016 #131549

We did not work on unreal6 for pure Xen (#131546), and we do not plan to work on it. The virtualization squad's automation infrastructure has been decoupled from the original Xen server, and there is no need for any Xen server in the future either.

I think we are basically done here.

@okurz FYI. Thanks for the support during the process. Now we will let the tools team fully decide on the remaining tickets. Good luck!

Actions #34

Updated by okurz about 1 year ago

Thank you for the update

Actions #36

Updated by okurz about 1 year ago

  • Target version changed from Tools - Next to Ready

Still blocked on #131552, but we need to switch off NUE1 machines unconditionally in the next few days. Expect disruptions if the newly built machines are not yet ready to fully take over.

Actions #37

Updated by okurz about 1 year ago

  • Related to action #134912: Gradually phase out NUE1 based openQA workers size:M added
Actions #38

Updated by okurz about 1 year ago

  • Subtask #137306 added
Actions #39

Updated by okurz about 1 year ago

As commented in #132617#note-17 I prepared the move of worker7-hyperv and worker8-vmware and powered off both machines.

Actions #40

Updated by okurz about 1 year ago

  • Subtask #138350 added
Actions #41

Updated by okurz about 1 year ago

  • Subtask #138374 added
Actions #42

Updated by okurz about 1 year ago

  • Subtask deleted (#138374)
Actions #43

Updated by okurz about 1 year ago

  • Status changed from Blocked to Resolved

All subtasks are resolved. I see that we now have all relevant testing resources covered, at least in new locations that no longer critically rely on NUE1. Thanks to everyone contributing.
