action #113366
openQA Project - coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openQA Project - coordination #109659: [epic] More remote workers
Add three more Prague located OSD workers size:M
Description
Motivation
Three more OSD machines were ordered and have arrived in Prague, to be used within the OSD infrastructure. This was discussed as part of https://confluence.suse.com/display/qasle/2022-05-16+SUSE+IT+and+Networking
Acceptance criteria
- AC1: The three new machines are used by OSD production jobs
- AC2: All machines are maintained the same way as our other x86_64 OSD workers
Suggestions
- Follow https://sd.suse.com/servicedesk/customer/portal/1/SD-89423 for the ordering process
- Wait for details from EngInfra regarding the machines
- Ensure that a valid OS is installed
- Ensure that machines can be remote controlled
- Deploy machines as part of generic salt-controlled OSD infrastructure
- Configure openQA worker instances on these machines for testing purposes
- Test out their operation and take special care due to the remoteness between NBG (osd webUI) and PRG (location of these machines)
- If all is good, include them as generic workers as well as with special SAP worker class settings
Further details
The previous two machines were set up in #104970.
Martin Caj and Lee Martin know more about the details. The plan was to put them next to https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=16124 or in neighboring racks and connect them the same way. For that, some other machines first need to be reshuffled.
Out of scope
- Multi-machine tests, just disable "tap" worker class for now
Updated by okurz over 2 years ago
- Copied from action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M added
Updated by okurz over 2 years ago
- Status changed from New to Feedback
Waiting for response from Viktor Karpovych who contacted me by DM
Updated by mkittler over 2 years ago
- Subject changed from Add three more Prague located OSD workers to Add three more Prague located OSD workers size:M
Updated by mkittler over 2 years ago
- Due date deleted (2022-07-22)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
There was no response so far.
Updated by okurz over 2 years ago
- Due date set to 2022-09-19
- Status changed from Workable to Feedback
- Assignee set to okurz
vkarpovych answered, I responded:
Oliver Kurz: https://confluence.suse.com/display/qasle/2022-05-16+SUSE+IT+and+Networking has the corresponding entry: "Prague status: Discuss where to locate new QE-SAP servers: Could ask about Rack B4 or B5 - the servers are old and could maybe go to make space for SAP ? (they were cloud/SES, all out of warranty)". I don't have more details on the current status, that's on mcaj and lmartin
Viktor Karpovych: Those 3 servers occupy some space in the storage room. I want to mount them in rack PRG-SRV1-B5 just to free up space in the storage room.
Could I just put them at the bottom of that rack? How should those servers be named so that I can put them into Racktables? After the PDUs and network switches have arrived and we have moved everything out of rack 5, we can decide which units those servers will be moved to.
Oliver Kurz: Sounds ok. Please name them openqaworker16, openqaworker17, openqaworker18 with configuration and network setup same as openqaworker14+15. Those machines were set up by mcaj last year IIRC
racktable entries:
Updated by okurz over 2 years ago
Checked with vkarpovych. Still waiting for network switches. Updated racktable entries with PO numbers and details, updated https://sd.suse.com/servicedesk/customer/portal/1/SD-89423
Updated by livdywan about 2 years ago
Any update here? The SLO query is surfacing this ticket, although the manually set Due Date is not being exceeded yet.
Updated by okurz about 2 years ago
- Due date deleted (2022-09-19)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
I have received no response for a month, so I assume we are still waiting for network switches or somebody forgot to continue. I suggest reporting a ticket with SUSE-IT to request the further setup steps.
Updated by nicksinger about 2 years ago
- Status changed from Workable to Feedback
Request for network connection done in https://sd.suse.com/servicedesk/customer/portal/1/SD-99844
Updated by okurz about 2 years ago
- Due date set to 2022-11-18
- Status changed from Feedback to Blocked
cool. We have access to the SD ticket so we can treat this as "Blocked"
Updated by nicksinger about 2 years ago
Viktor gave the answer that there is currently no power available in the racks of these machines. He hopes to get some new PDU this week and recable all machines. Waiting for further updates (I think the due date is still reasonable)
Updated by okurz about 2 years ago
ok, good to know. But where did you get this information from? https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 is still empty.
Updated by nicksinger about 2 years ago
okurz wrote:
ok, good to know. But where did you get this information from? https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 is still empty.
Apparently the ticket is empty again and the message from Viktor was deleted. I only found a local copy in my inbox:
Viktor Karpovych commented:
Dear Nick Singer,
Those servers are in Rack PRG-SRV1-B5 only mounted but not connected to power and network.
I hope a new power source will be available next week.
After that, we plan to reconnect all equipment in that rack to a new power source.
And replace the ToR switch, move all servers from PRG-SRV1-B5 Rack and
start to install openqaworker16-18 servers.
__
Best regards,
Viktor Karpovych
This is shared with OSD Admins, Nick Singer, and Eng-Infra.
I asked again what the status on this is and if there is any ETA
Updated by nicksinger about 2 years ago
Asked again in https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 if there is any further estimate
Updated by nicksinger about 2 years ago
- Due date changed from 2022-11-18 to 2022-11-29
No further update as of today. Asked again and moved the due date.
Updated by okurz almost 2 years ago
- Due date deleted (2022-11-29)
- Status changed from Blocked to Workable
I have received IPMI credentials and can confirm that I can access the machines.
See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/466 for the IPMI credentials.
https://sd.suse.com/servicedesk/customer/portal/1/SD-99844 is still open. I commented there:
Could you please add L2 address entries on racktables as well? E.g. on https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=17946 . I think that’s all that is left to do for the work in this ticket.
However, we can already continue the work to set up the machines as OSD workers now.
Updated by okurz almost 2 years ago
please decide yourself if you want to experiment with automatic installations as described in #80382 or by hand, your choice :)
Updated by okurz almost 2 years ago
- Related to action #80382: Provide installation recipes for automatic installations of openQA worker machines added
Updated by okurz almost 2 years ago
nicksinger looked into cobbler and yomi. okurz looked a bit into various other tools, e.g. metal3, pixiesomething and more. All those tools seem to be more suitable for customer-facing MaaS, which we don't need. So maybe we are fine with relying on a standard PXE server supplied by Eng-Infra Prg. Then we adapt the boot command line of an existing entry, e.g. replace an outdated Leap 15.0 with an up-to-date 15.4, and pass the AutoYaST profile https://github.com/os-autoinst/openQA/blob/master/contrib/ay-openqa-worker.xml on the command line along with everything else that is necessary.
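A minimal sketch of how such a boot command line could be assembled, assuming the installer is pointed at a repository with `install=` and at the AutoYaST profile with `autoyast=`. The function name and both example.org URLs are hypothetical and only for illustration; the actual PXE entries are maintained by Eng-Infra and may look different.

```python
from urllib.parse import urlparse


def build_install_cmdline(install_url, autoyast_url, extra=()):
    """Return a boot command line for an unattended installation.

    install_url:  repository the installer should use (hypothetical value below)
    autoyast_url: location of the AutoYaST profile (hypothetical value below)
    extra:        any further parameters, passed through unchanged
    """
    for url in (install_url, autoyast_url):
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"expected an http(s) URL, got {url!r}")
    # install= selects the installation source, autoyast= the profile
    return " ".join([f"install={install_url}", f"autoyast={autoyast_url}", *extra])


if __name__ == "__main__":
    # Both URLs are placeholders, not real servers
    print(build_install_cmdline(
        "http://pxe.example.org/leap/15.4/repo/oss/",
        "http://pxe.example.org/ay-openqa-worker.xml",
        extra=("console=tty0",),
    ))
```

Moving from one Leap version to the next would then only mean changing the repository URL while the profile stays the same.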
Updated by okurz almost 2 years ago
As discussed in daily 2022-12-21 please
- document instructions regarding SMB in https://progress.opensuse.org/projects/openqav3/wiki/#Infrastructure-setup-for-o3-openqaopensuseorg-and-osd-openqasusede
- create ticket over sd.suse.com why PXE does not work
Updated by nicksinger almost 2 years ago
okurz wrote:
As discussed in daily 2022-12-21 please
- document instructions regarding SMB in https://progress.opensuse.org/projects/openqav3/wiki/#Infrastructure-setup-for-o3-openqaopensuseorg-and-osd-openqasusede
- create ticket over sd.suse.com why PXE does not work
Updated by okurz almost 2 years ago
https://sd.suse.com/servicedesk/customer/portal/1/SD-108265 resolved, PXE was added to all three machines
Updated by nicksinger almost 2 years ago
- Status changed from Workable to In Progress
I had to disable legacy boot again since we need to boot from NVMe drives, which is only possible with UEFI. I chose "hybrid mode", which might still allow PXE boot, but an installer booted that way would not be able to install the OS in a way that the system can access it afterwards. Therefore I had to use IPMIView from Supermicro to mount an installer disk and boot from it in UEFI mode. Worker 16 and 17 are accessible now; salt-minion is installed but not yet connected to our infrastructure. Worker 18 still needs an OS.
Updated by openqa_review almost 2 years ago
- Due date set to 2023-01-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 2 years ago
- Due date changed from 2023-01-24 to 2023-01-27
I think an update got lost here. We discussed it in the infra daily yesterday and confirmed that Nick is still looking into setting up worker 18 with salt, delayed mainly due to #123028.
Updated by okurz almost 2 years ago
- Due date deleted (2023-01-27)
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
We decided that due to the update regarding new QE labs in Nbg FC we did not actually conclude with that. This is free to be picked up by anybody again. Next steps:
- Come up with fitting worker config in salt pillars (likely prerequisite for successful high state application) with "test" worker class
- Accept salt key of one machine on osd, apply high state
- After successful high state trigger openQA jobs on that worker for testing
- Use production worker class
- Repeat above steps for 2nd and 3rd machine
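The salt part of these steps can be sketched as a small helper that prints the commands to run on OSD for one machine at a time. This is only an illustration: it assumes the salt minion IDs equal the host names openqaworker16..18, which the ticket does not state, and the helper name is made up. Nothing is executed, the commands are only printed.

```python
import shlex
from typing import List


def rollout_commands(minion_id: str) -> List[List[str]]:
    """Return the salt commands for bringing one new worker under salt control."""
    return [
        # accept the pending key of this minion on the salt master
        ["salt-key", "-y", "-a", minion_id],
        # apply the high state to this minion only
        ["salt", minion_id, "state.apply"],
    ]


if __name__ == "__main__":
    # Assumed minion IDs; start with one machine, then repeat for the others
    for worker in ("openqaworker16", "openqaworker17", "openqaworker18"):
        for cmd in rollout_commands(worker):
            print(shlex.join(cmd))
```

Triggering openQA test jobs and switching to the production worker class remain manual steps between the machines.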
Updated by osukup almost 2 years ago
- Assignee set to osukup
1) serial is directed to wrong port -> ipmi stops working during initrd initialization.
2) WebUI works correctly only in Firefox.
Updated by livdywan almost 2 years ago
Worker16 and 17 should be set up and ready to go tomorrow; 18 will need to be looked into after that.
Updated by livdywan almost 2 years ago
- Status changed from Workable to In Progress
Updated by osukup almost 2 years ago
All three workers are set up with salt --local ...
and visible in the OSD workers list with altered worker classes: '-test' was added to them, which will be removed when we add the workers to OSD salt.
https://progress.opensuse.org/issues/110467 ... so tap is out of the question for now :D
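As an illustration of the altered worker classes, the following sketch derives a test-only class list from a comma-separated WORKER_CLASS value by dropping "tap" and marking every remaining class with '-test'. It assumes the marker is appended as a suffix, which the comment above does not spell out; the function name and the example class names are illustrative only.

```python
def staging_worker_class(worker_class, marker="-test", drop=("tap",)):
    """Return a WORKER_CLASS value suitable for a test phase.

    Classes listed in `drop` are removed; `marker` is appended to all others
    (assumption: suffix rather than prefix).
    """
    classes = [c.strip() for c in worker_class.split(",") if c.strip()]
    kept = [c for c in classes if c not in drop]
    return ",".join(f"{c}{marker}" for c in kept)


if __name__ == "__main__":
    # Example class names are placeholders
    print(staging_worker_class("qemu_x86_64,tap,sap"))
```

Production jobs never request the marked classes, so they cannot land on a machine that is still being tested.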
Updated by openqa_review almost 2 years ago
- Due date set to 2023-02-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by osukup almost 2 years ago
Looks like all three workers ended up in emergency mode after reboot thanks to the service -> openqa-establish-nvme-setup ...
All three workers have 2 NVMe drives: the system is on nvme0n1, and /dev/md/openqa is on nvme1n1. This isn't the configuration expected by this script.
fix -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/793
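To illustrate the layout problem, here is a rough sketch that separates NVMe disks carrying the root filesystem from those that are free for openQA, based on parsed `lsblk --json` output. This is not what openqa-establish-nvme-setup or the linked fix actually does; the function name and the sample data are made up for this example.

```python
def split_nvme_disks(lsblk):
    """Return (system_disks, spare_disks) for the NVMe disks in lsblk JSON data."""

    def hosts_root(node):
        # Newer lsblk reports "mountpoints" (list), older "mountpoint" (string)
        mounts = node.get("mountpoints") or [node.get("mountpoint")]
        if "/" in mounts:
            return True
        return any(hosts_root(child) for child in node.get("children", []))

    system, spare = [], []
    for dev in lsblk.get("blockdevices", []):
        if dev.get("type") != "disk" or not dev.get("name", "").startswith("nvme"):
            continue
        (system if hosts_root(dev) else spare).append(dev["name"])
    return system, spare


if __name__ == "__main__":
    # Sample data resembling the layout described above (values are invented)
    sample = {"blockdevices": [
        {"name": "nvme0n1", "type": "disk", "mountpoint": None, "children": [
            {"name": "nvme0n1p2", "type": "part", "mountpoint": "/"}]},
        {"name": "nvme1n1", "type": "disk", "mountpoint": None, "children": [
            {"name": "md127", "type": "raid0", "mountpoint": "/var/lib/openqa"}]},
    ]}
    print(split_nvme_disks(sample))
```

On real hardware the input could come from `json.loads` applied to the output of `lsblk --json`.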
Updated by osukup almost 2 years ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/491 - merged (deploy CI failed, but we don't know why because GitLab truncated the log), salt openqa.... state.apply passed and the workers seem to be working as intended --> setting to Feedback
PS: The Supermicro WebUI on these workers usually hangs in any browser except in anonymous mode (a clean new run of the browser)
Updated by okurz almost 2 years ago
very nice. Now one of the next steps (in a follow-up ticket) will be to actually destroy one or multiple of those machines again and re-install to be faster with that. As soon as we have some jobs successfully passed on w16..w18 you can resolve.
Updated by livdywan almost 2 years ago
- Status changed from Feedback to Resolved
Examples of successfully executed jobs:
- worker16 / https://openqa.suse.de/tests/10503349
- worker17 / https://openqa.suse.de/tests/10505095
- worker18 / https://openqa.suse.de/tests/10503829
Looks like things are working well.
Updated by livdywan almost 2 years ago
- Copied to action #124562: Re-install at least one of the new OSD workers located in Prague added
Updated by okurz over 1 year ago
- Related to action #125798: Visual differences in GRUB menu on different x86_64 UEFI workers added