action #132647
openQA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Migration of o3 VM to PRG2 - bare-metal tests size:M
0%
Description
Motivation
The openQA webUI VM for o3 will move to PRG2. After the move we must ensure that bare-metal tests still work.
Acceptance criteria
- AC1: bare metal production tests work on o3 after move
Suggestions
- Wait for the move
- Fix o3 bare metal hosts iPXE booting, see https://openqa.opensuse.org/tests/3446336#step/ipxe_install/2
- After move connect NUE1 based workers over https to PRG2 based o3 i.e. https://openqa.opensuse.org/admin/workers looking for nue1, see #132134 for workers setup in PRG2
- Consider using at least one of the new PRG2 o3 machines in https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 as bare metal machine. For this create Eng-Infra ticket to switch VLAN of the IPMI interface to the o3 one so that openQA tests can use the IPMI interface and consider renaming machines to make the distinction clear
- Ensure that production openQA jobs, e.g. in https://openqa.opensuse.org/tests/overview?distri=opensuse&groupid=1&version=Tumbleweed, can run successfully
Rollback steps
Updated by okurz 10 months ago
- Copied from action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Updated by okurz 9 months ago
- Copied to action #133490: Migration of o3 VM to PRG2 - Fix o3 bare metal hosts iPXE booting size:M added
Updated by livdywan 7 months ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests size:M to Migration of o3 VM to PRG2 - bare-metal tests
- After move connect NUE1 based workers over https to PRG2 based o3 i.e. https://openqa.opensuse.org/admin/workers looking for nue1
Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline. How do I answer the question in the AC, which is "bare metal tests can work on o3 after move"?
Updated by okurz 7 months ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests to Migration of o3 VM to PRG2 - bare-metal tests size:M
livdywan wrote in #note-9:
- After move connect NUE1 based workers over https to PRG2 based o3 i.e. https://openqa.opensuse.org/admin/workers looking for nue1
Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline.
w6+7+19+20 are all NUE1 based so not PRG2.
How do I answer the question in the AC which is "bare metal tests can work on o3 after move"?
As an acceptance test one can go to https://openqa.opensuse.org/admin/workers, select "All", search for "ipmi" and look for workers that are PRG2 based (currently none), then open the jobs of those workers and look for passed tests.
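The manual check above could be sketched roughly as follows. This is only an illustration, not an existing tool: the function name `find_ipmi_workers` is made up, and the input shape (a list of dicts with a `host` string and a `properties` dict holding a comma-separated `WORKER_CLASS`) is an assumption about how worker data might look once exported. It also assumes that the location tag ("prg2") shows up either in the host name or in the worker class, which may not hold for every worker.

```python
def find_ipmi_workers(workers, location="prg2"):
    """Return workers that look like IPMI workers at the given location.

    Assumption: each worker is a dict with a "host" string and a
    "properties" dict containing a comma-separated "WORKER_CLASS".
    """
    matches = []
    for worker in workers:
        worker_class = worker.get("properties", {}).get("WORKER_CLASS", "")
        classes = [c.strip().lower() for c in worker_class.split(",") if c.strip()]
        host = worker.get("host", "").lower()
        # Same idea as searching for "ipmi" in the webUI worker table
        is_ipmi = any("ipmi" in c for c in classes)
        # Assumption: the location is part of the host name or a class tag
        at_location = location in host or any(location in c for c in classes)
        if is_ipmi and at_location:
            matches.append(worker)
    return matches


# Hypothetical sample data, not real o3 workers
sample_workers = [
    {"host": "worker7", "properties": {"WORKER_CLASS": "qemu_x86_64,nue1"}},
    {"host": "example-sut", "properties": {"WORKER_CLASS": "64bit-ipmi,prg2"}},
    {"host": "rebel", "properties": {"WORKER_CLASS": "64bit-ipmi,nue1"}},
]
prg2_ipmi = find_ipmi_workers(sample_workers)
```

An empty result corresponds to the "currently none" state described above. Passed jobs on the matching workers would still need to be checked separately.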
Updated by okurz 7 months ago
- Related to action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M added
Updated by xlai 7 months ago
dheidler wrote in #note-14:
@dheidler Hi, I do not have access to it. Would you please explain a little bit what's the root cause for this being blocked?
Besides, the virtualization team used to have two IPMI workers configured on "rebel.openqanet.opensuse.org", and @okurz shared that the above SD ticket is not likely to be resolved until the end of this Nov (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of Nov, so that O3 virt tests can continue?
Updated by okurz 7 months ago
xlai wrote in #note-16:
dheidler wrote in #note-14:
@dheidler Hi, I do not have access to it. Would you please explain a little bit what's the root cause for this being blocked?
The idea is to use two of the new x86_64 machines in PRG2 as bare metal test machines. For that we need the IPMI interfaces in a network controllable by other openQA workers. The above SD ticket is a request to change the VLAN of the corresponding switch ports.
All that is a workaround planned by us, as the room PRG2e, where the other existing bare-metal test machines should move to, is not yet ready for the move.
Besides, the virtualization team used to have two IPMI workers configured on "rebel.openqanet.opensuse.org", and @okurz shared that the above SD ticket is not likely to be resolved until the end of this Nov (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of Nov, so that O3 virt tests can continue?
That is not feasible. rebel is already at the planned target location of NUE3 along with all the other older openqaworker machines, transported in coordination with an external logistics company and BuildOps, SUSE-IT, Facilities. I do not see a reasonable RoI nor available people capacity to move the machine back.
In theory we could provide access to the IPMI interfaces in the current network configuration through tunnels, but that would effectively give every openQA test on the public o3 instance access to a major part of the SUSE internal network, which we must not allow.
I think the best course of action would be to provide SUSE QE Tools Team members administrative access to the involved network switches so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes to do. So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
Updated by xlai 7 months ago
@okurz I see, thanks for the explanations. With that said, agree with the current handling and plan with rebel.
So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
I believe so, will try to, but I guess the reasons may not be from trust/capability, but maybe others which I do not know and have complex background...
Thank you, Oliver, for your constant support.
Updated by xlai 7 months ago
@okurz FYI, we escalated it to @cachen. Calen will help discuss it with the higher management team to see if it can be unblocked. Thank you, Calen!
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
Updated by okurz 7 months ago
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets and no possibility to share with bigger groups or complete SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions
Updated by xlai 7 months ago
okurz wrote in #note-20:
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets and no possibility to share with bigger groups or complete SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions
@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make some difference on the decision with even more and higher escalation ;-)
Updated by okurz 7 months ago
xlai wrote in #note-21:
okurz wrote in #note-20:
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets and no possibility to share with bigger groups or complete SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions
@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make some difference on the decision with even more and higher escalation ;-)
The ticket just has the following content, not even following our proper ticket template:
These two machines are planned to be used as bare metal SUT hosts. This requires other workers within the dmz network to be able to reach them via IPMI.
So please connect the IPMI interfaces of those two machines to the same network as the machine itself is connected to.
See: https://progress.opensuse.org/issues/132647
I already stated what I think should be done:
I think the best course of action would be to provide SUSE QE Tools Team members administrative access to the involved network switches so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes to do. So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
Please keep in mind that I am in full alignment with SUSE-IT that evacuation of NUE1 and the migration of SUSE wide services like imap.suse.de must have higher priority than QE specific tasks which unfortunately also includes this ticket here. So this is why we have the estimate from mflores, team lead of Eng-Infra, to not expect any bigger help from Eng-Infra before 2023-12. So I suggest to not ask to prioritize the work on https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 itself but rather give more people access to the switch administration so that we can solve problems ourselves.
Updated by cachen 7 months ago
Hello guys, I just wrote an email to Moroni with 2 escalations (delegation of higher administrative access; Jira SD ticket permissions). It is hard to say if that will help speed things up, but at least I hope the IT team is listening to our needs. Let's see what I get back from Moroni.
@mgriessmeier
Updated by okurz 7 months ago
@dheidler and I talked about the topic yesterday and dheidler has added a comment in https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 asking to just give us administrative access to the switches. I don't expect that to happen within a reasonable timespan nor without further motivation. We can either escalate this further or just wait with lower priority. If the opportunity arises I will bring the topic up in related meetings.
Updated by okurz 6 months ago
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Updated by livdywan 6 months ago
okurz wrote in #note-28:
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.
Updated by livdywan 6 months ago
livdywan wrote in #note-29:
okurz wrote in #note-28:
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.
Asking in Slack now.
Updated by dheidler 6 months ago
Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The Jira ticket is still in the backlog and might be worked on in their next sprint, from the 21st on.
Updated by livdywan 6 months ago
dheidler wrote in #note-31:
Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The Jira ticket is still in the backlog and might be worked on in their next sprint, from the 21st on.
November 21, this week? I can't open the Slack chat.
Updated by okurz 5 months ago
- Priority changed from High to Normal
https://jira.suse.com/browse/ENGINFRA-3208 is still unchanged and our questions are not answered. Realistically that means we can't expect an improvement there soon, hence reducing prio.
Updated by livdywan 3 months ago
okurz wrote in #note-33:
https://jira.suse.com/browse/ENGINFRA-3208 still unchanged and our questions are not answered. Realistically that means that we can't expect an improvement there soon hence reducing prio
Response today is that read-only access is a possibility. Long-term there will be a salt deployment for it.
Updated by okurz 3 months ago
- Related to action #153706: Move of selected LSG QE machines NUE1 to PRG2 - amd-zen2-gpu-sut1 size:M added
Updated by okurz 3 months ago
- Assignee changed from dheidler to okurz
- Target version changed from Tools - Next to future
https://jira.suse.com/browse/ENGINFRA-3207 was rejected with https://jira.suse.com/browse/ENGINFRA-3207?focusedId=1327731&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1327731
(Moroni Flores) Hello - administrative access to network devices in PRG2 will not be granted to any team.
As previously discussed, we will deploy and have a way for each team to be able to make changes via salt, but not directly on the device.
If you are interested in read-only access for certain troubleshooting, that can be granted.
Thanks.
Now that IT has successfully stalled our original plans as well as multiple levels of mitigation, and as amd-zen2-gpu-sut1 is in the meantime at least pingable, I will wait for #153706 first and see if that is enough for our bare-metal testing. It's unfortunate to see that openqaworker27+openqaworker28 were actually powered on during all this time. I have now powered off both to save energy.