action #113366

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

openQA Project - coordination #109659: [epic] More remote workers

Add three more Prague located OSD workers size:M

Added by okurz over 1 year ago. Updated 12 months ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
Start date: -
Due date: -
% Done: 0%
Estimated time: -
Tags: -

Description

Motivation

Three more OSD machines have been ordered and have arrived in Prague, to be used within the OSD infrastructure. This was discussed as part of https://confluence.suse.com/display/qasle/2022-05-16+SUSE+IT+and+Networking

Acceptance criteria

  • AC1: The three new machines are used by OSD production jobs
  • AC2: All machines are maintained the same way as our other x86_64 OSD workers

Suggestions

  • Follow https://sd.suse.com/servicedesk/customer/portal/1/SD-89423 for the ordering process
  • Wait for details from EngInfra regarding the machines
  • Ensure that a valid OS is installed
  • Ensure that machines can be remote controlled
  • Deploy machines as part of generic salt-controlled OSD infrastructure
  • Configure openQA worker instances on these machines for testing purposes
  • Test out their operation and take special care due to the remoteness between NBG (OSD webUI) and PRG (location of these machines)
  • If all goes well, include them as generic workers as well as with the special sap worker class settings

Further details

The previous two machines were set up in #104970.
Martin Caj and Lee Martin know more about the details. The plan was to put them next to https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=16124 or in neighboring racks and connect them the same way. For that, some other machines first need to be reshuffled.

Out of scope

  • Multi-machine tests, just disable "tap" worker class for now

Related issues 4 (2 open, 2 closed)

  • Related to openQA Project - action #80382: Provide installation recipes for automatic installations of openQA worker machines (Workable, 2020-11-25)
  • Related to openQA Infrastructure - action #125798: Visual differences in GRUB menu on different x86_64 UEFI workers (Resolved, osukup, 2023-03-10)
  • Copied from openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M (Resolved, mkittler, 2022-01-17)
  • Copied to openQA Infrastructure - action #124562: Re-install at least one of the new OSD workers located in Prague (New)
Actions #1

Updated by okurz over 1 year ago

  • Copied from action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M added
Actions #2

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #3

Updated by okurz over 1 year ago

  • Status changed from New to Feedback

Waiting for response from Viktor Karpovych who contacted me by DM

Actions #4

Updated by okurz over 1 year ago

  • Due date set to 2022-07-22
Actions #5

Updated by mkittler over 1 year ago

  • Subject changed from Add three more Prague located OSD workers to Add three more Prague located OSD workers size:M
Actions #6

Updated by mkittler over 1 year ago

  • Due date deleted (2022-07-22)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

There has been no response so far.

Actions #7

Updated by okurz over 1 year ago

  • Due date set to 2022-09-19
  • Status changed from Workable to Feedback
  • Assignee set to okurz

vkarpovych answered, I responded:

Oliver Kurz: https://confluence.suse.com/display/qasle/2022-05-16+SUSE+IT+and+Networking has the corresponding entry: "Prague status: Discuss where to locate new QE-SAP servers: Could ask about Rack B4 or B5 - the servers are old and could maybe go to make space for SAP? (they were cloud/SES, all out of warranty)". I don't have more details on the current status, that's on mcaj and lmartin.
Viktor Karpovych: Those 3 servers occupy some space in the storage room; I want to mount them in Rack PRG-SRV1-B5 just to free up space in the storage room.
Could I just put them at the bottom of that rack? How should we name those servers for Racktables? After the PDUs and network switches arrive and we move everything from Rack 5, we can decide which units those servers will be moved to.
Oliver Kurz: Sounds ok. Please name them openqaworker16, openqaworker17, openqaworker18 with configuration and network setup same as openqaworker14+15. Those machines were set up by mcaj last year IIRC.

racktable entries:

Actions #8

Updated by okurz over 1 year ago

Checked with vkarpovych. Still waiting for network switches. Updated racktable entries with PO numbers and details, updated https://sd.suse.com/servicedesk/customer/portal/1/SD-89423

Actions #9

Updated by livdywan over 1 year ago

Any update here? The SLO query is surfacing this ticket, although the manually set due date has not been exceeded yet.

Actions #10

Updated by okurz over 1 year ago

  • Due date deleted (2022-09-19)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

I have received no response in a month, so I assume we are still waiting for network switches, or somebody forgot to continue. I suggest reporting a ticket with SUSE-IT to request the further setup steps.

Actions #11

Updated by mkittler over 1 year ago

  • Assignee set to nicksinger
Actions #12

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Feedback

Request for network connection done in https://sd.suse.com/servicedesk/customer/portal/1/SD-99844

Actions #13

Updated by okurz over 1 year ago

  • Due date set to 2022-11-18
  • Status changed from Feedback to Blocked

Cool. We have access to the SD ticket so we can treat this as "Blocked".

Actions #14

Updated by nicksinger over 1 year ago

Viktor answered that there is currently no power available in the racks of these machines. He hopes to get new PDUs this week and recable all machines. Waiting for further updates (I think the due date is still reasonable).

Actions #15

Updated by okurz over 1 year ago

ok, good to know. But where did you get this information from? https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 is still empty.

Actions #16

Updated by nicksinger over 1 year ago

okurz wrote:

ok, good to know. But where did you get this information from? https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 is still empty.

Apparently the ticket is empty again and the message from Viktor was deleted. I only found a local copy in my inbox:


Viktor Karpovych commented:

Dear Nick Singer,

Those servers are in Rack PRG-SRV1-B5, only mounted but not connected to power and network.
I hope a new power source will be available next week.
After that, we plan to reconnect all equipment in that rack to the new power source,
replace the ToR switch, move all servers from the PRG-SRV1-B5 rack and
start to install the openqaworker16-18 servers.
--
Best regards,
Viktor Karpovych


This is shared with OSD Admins, Nick Singer, and Eng-Infra.

I asked again what the status on this is and if there is any ETA

Actions #17

Updated by nicksinger over 1 year ago

Asked again in https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 if there is any further estimate

Actions #18

Updated by nicksinger over 1 year ago

  • Due date changed from 2022-11-18 to 2022-11-29

No further update until today. Asked again and moved the due date.

Actions #19

Updated by okurz over 1 year ago

  • Due date deleted (2022-11-29)
  • Status changed from Blocked to Workable

I have received IPMI credentials and can confirm that I have access.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/466 is the merge request for the IPMI credentials.
https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 is still open. I commented there:

Could you please add L2 address entries on racktables as well? E.g. on https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=17946 . I think that’s all that is left to do for the work in this ticket.

However, now we can already continue the work to set up the machines as OSD workers.

Actions #20

Updated by okurz over 1 year ago

Please decide for yourself whether you want to experiment with automatic installations as described in #80382 or install by hand, your choice :)

Actions #21

Updated by okurz over 1 year ago

  • Related to action #80382: Provide installation recipes for automatic installations of openQA worker machines added
Actions #22

Updated by okurz about 1 year ago

nicksinger looked into Cobbler and Yomi. okurz looked a bit into various other tools, e.g. metal3, pixiesomething and more. All those tools seem more suitable for customer-facing MaaS, which we don't need. So maybe we are fine relying on a standard PXE server supplied by Eng-Infra PRG: we adapt the boot command line, e.g. replace the outdated Leap 15.0 with an up-to-date 15.4, pass the AutoYaST profile https://github.com/os-autoinst/openQA/blob/master/contrib/ay-openqa-worker.xml on the command line, and add whatever else is necessary.
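A minimal sketch of what such an adapted PXE entry could look like; the kernel/initrd paths, repository URL, raw-profile URL and serial console parameters below are illustrative assumptions, not the verified values on the Eng-Infra PXE server:

```
# Hypothetical PXELINUX entry for installing a Leap 15.4 openQA worker via AutoYaST.
# All paths and the console port are assumptions for illustration.
LABEL openqa-worker-15.4
  KERNEL boot/leap-15.4/linux
  APPEND initrd=boot/leap-15.4/initrd \
    install=http://download.opensuse.org/distribution/leap/15.4/repo/oss/ \
    autoyast=https://raw.githubusercontent.com/os-autoinst/openQA/master/contrib/ay-openqa-worker.xml \
    console=ttyS1,115200
```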

Actions #23

Updated by okurz about 1 year ago

As discussed in daily 2022-12-21 please

Actions #25

Updated by okurz about 1 year ago

https://sd.suse.com/servicedesk/customer/portal/1/SD-108265 resolved, PXE was added to all three machines

Actions #26

Updated by nicksinger about 1 year ago

  • Status changed from Workable to In Progress

I had to disable legacy boot again since we need to boot from NVMes, which is only possible with UEFI. I chose "hybrid mode", which might still allow PXE boot, but an OS booted that way would not be able to install the system so that it is accessible later on. Therefore I had to use Supermicro's IPMIView to mount an installer disk and boot from it in UEFI mode. Workers 16 and 17 are accessible now; salt-minion is installed but not yet connected to our infrastructure. Worker 18 still needs an OS.

Actions #27

Updated by openqa_review about 1 year ago

  • Due date set to 2023-01-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #28

Updated by livdywan about 1 year ago

  • Due date changed from 2023-01-24 to 2023-01-27

I think an update got lost here. We discussed it in the infra daily yesterday and confirmed that Nick is still looking into setting up worker 18 with salt, delayed mainly due to #123028.

Actions #29

Updated by okurz about 1 year ago

  • Due date deleted (2023-01-27)
  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)

We decided that, due to the update regarding new QE labs in NBG FC, we did not actually conclude this. It is free to be picked up by anybody again. Next steps:

  • Come up with fitting worker config in salt pillars (likely prerequisite for successful high state application) with "test" worker class
  • Accept salt key of one machine on osd, apply high state
  • After successful high state trigger openQA jobs on that worker for testing
  • Use production worker class
  • Repeat above steps for 2nd and 3rd machine
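
The per-machine steps above could be sketched as the following command sequence, run on OSD; the hostname and the exact pillar workflow are illustrative assumptions:

```shell
# Sketch of the per-machine rollout (hostname is an assumption):
salt-key -a openqaworker16.suse.de        # accept the minion key on the salt master
salt 'openqaworker16*' state.apply        # apply the high state
# After test jobs pass, switch the worker class from the "test" class to the
# production classes in the salt pillars and re-apply:
salt 'openqaworker16*' state.apply
```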
Actions #30

Updated by okurz about 1 year ago

  • Priority changed from High to Urgent
Actions #31

Updated by osukup about 1 year ago

  • Assignee set to osukup

1) The serial console is directed to the wrong port -> IPMI stops working during initrd initialization.
2) The web UI works correctly only in Firefox.
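
Fixing the serial redirection usually means matching the GRUB and kernel console parameters to the port the BMC's SOL uses; the unit number and baud rate below are assumptions, not the verified values for these machines:

```
# /etc/default/grub sketch; ttyS1 and 115200 baud are assumed SOL settings
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200"
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS1,115200"
# then regenerate the config: grub2-mkconfig -o /boot/grub2/grub.cfg
```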

Actions #32

Updated by livdywan about 1 year ago

Workers 16 and 17 should be set up and ready to go tomorrow; 18 will need to be looked into after that.

Actions #33

Updated by livdywan about 1 year ago

  • Status changed from Workable to In Progress
Actions #34

Updated by osukup about 1 year ago

All three workers were set up with `salt --local ...` and are visible among the OSD workers with altered worker classes: a '-test' suffix was added, which will be removed when we add the workers to OSD salt.

https://progress.opensuse.org/issues/110467 ... so tap is now out of the question :D
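
In the worker configuration the temporary class could look roughly like this; the class names are illustrative assumptions, not the actual values used:

```
# /etc/openqa/workers.ini sketch; class names are assumptions for illustration
[global]
HOST = http://openqa.suse.de
WORKER_CLASS = qemu_x86_64-test,openqaworker16-test
```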

Actions #35

Updated by openqa_review about 1 year ago

  • Due date set to 2023-02-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #36

Updated by osukup about 1 year ago

Looks like all three workers ended up in emergency mode after reboot, thanks to the openqa-establish-nvme-setup service.

All three workers have two NVMes: the system is on nvme0n1 and /dev/md/openqa is on nvme1n1, which is not a configuration expected by that script.

Fix -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/793
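
A minimal sketch of the layout the service has to produce on these machines (device names taken from the comment above; the exact mdadm invocation is an assumption about what the fixed script effectively does, not a quote from it):

```shell
# Sketch: single-device RAID-0 "array" over the spare NVMe as /dev/md/openqa
# (the system lives on nvme0n1; only nvme1n1 is available for openQA data).
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 /dev/nvme1n1
mkfs.ext4 /dev/md/openqa
mount /dev/md/openqa /var/lib/openqa
```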

Actions #37

Updated by osukup about 1 year ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/491 merged (the deploy CI failed, but we don't know why because GitLab truncated the log). salt openqa... state.apply passed and the workers seem to be working as intended --> setting to Feedback.

PS: The Supermicro web UI on these workers usually hangs in any browser except in a private window (a clean, new browser session).

Actions #38

Updated by okurz about 1 year ago

Very nice. Now one of the next steps (in a follow-up ticket) will be to actually destroy one or multiple of those machines again and re-install them, to become faster at that. As soon as some jobs have passed successfully on w16..w18 you can resolve.

Actions #39

Updated by livdywan about 1 year ago

  • Status changed from Feedback to Resolved

Examples of successfully executed jobs:

Looks like things are working well.

Actions #40

Updated by livdywan about 1 year ago

  • Copied to action #124562: Re-install at least one of the new OSD workers located in Prague added
Actions #41

Updated by okurz 12 months ago

  • Due date deleted (2023-02-23)
Actions #42

Updated by okurz 12 months ago

  • Related to action #125798: Visual differences in GRUB menu on different x86_64 UEFI workers added