action #132647
Migration of o3 VM to PRG2 - bare-metal tests size:M (closed)
Description
Motivation
The openQA webUI VM for o3 will move to PRG2. After the move we must ensure that bare-metal tests can work.
Acceptance criteria
- AC1: bare-metal production tests work on o3 after the move
Suggestions
- Wait for the move
- Fix iPXE booting on the o3 bare-metal hosts, see https://openqa.opensuse.org/tests/3446336#step/ipxe_install/2
- After the move, connect NUE1-based workers over https to the PRG2-based o3, i.e. check https://openqa.opensuse.org/admin/workers for nue1 workers; see #132134 for the worker setup in PRG2 and the config sketch right after this list
- Consider using at least one of the new PRG2 o3 machines in https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 as a bare-metal machine. For this, create an Eng-Infra ticket to switch the VLAN of the IPMI interface to the o3 one so that openQA tests can use the IPMI interface, and consider renaming the machines to make the distinction clear
- Ensure that production openQA jobs, e.g. in https://openqa.opensuse.org/tests/overview?distri=opensuse&groupid=1&version=Tumbleweed, can run successfully
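A minimal sketch of the worker connection mentioned above, assuming the usual openQA worker config layout; the hostname and worker class values are placeholders only:

```ini
# /etc/openqa/workers.ini on a NUE1-based worker host (sketch, values are placeholders)
[global]
# point the worker at the PRG2-hosted o3 webUI over https
HOST = https://openqa.opensuse.org
WORKER_HOSTNAME = openqaworker7.openqanet.opensuse.org

[1]
# example worker class for one worker instance
WORKER_CLASS = qemu_x86_64,openqaworker7
```

The matching API key/secret for openqa.opensuse.org would go into /etc/openqa/client.conf as usual.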
Rollback steps¶
Updated by okurz over 1 year ago
- Copied from action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Updated by livdywan over 1 year ago
We decided to skip this ticket and not estimate it yet because this is a moving target (no pun intended) and we don't know exactly what tests this is about. @okurz Feel free to provide more details if you can in the meantime.
Updated by livdywan over 1 year ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests to Migration of o3 VM to PRG2 - bare-metal tests size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Copied to action #133490: Migration of o3 VM to PRG2 - Fix o3 bare metal hosts iPXE booting size:M added
Updated by okurz over 1 year ago
- Project changed from 115 to openQA Infrastructure (public)
Updated by livdywan about 1 year ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests size:M to Migration of o3 VM to PRG2 - bare-metal tests
- After the move connect NUE1-based workers over https to the PRG2-based o3, i.e. https://openqa.opensuse.org/admin/workers looking for nue1
Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline. How do I answer the question in the AC, which is "bare metal tests can work on o3 after move"?
Updated by okurz about 1 year ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests to Migration of o3 VM to PRG2 - bare-metal tests size:M
livdywan wrote in #note-9:
- After the move connect NUE1-based workers over https to the PRG2-based o3, i.e. https://openqa.opensuse.org/admin/workers looking for nue1
Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline.
w6+7+19+20 are all NUE1-based, so not PRG2.
How do I answer the question in the AC, which is "bare metal tests can work on o3 after move"?
As an acceptance test one can go to https://openqa.opensuse.org/admin/workers, select "All", search for "ipmi", find workers that are prg2-based (currently none), then open their tests and look for passed results.
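A scripted sketch of this acceptance check, assuming the public openQA route /api/v1/workers; the field names follow the usual API output and may need adjusting:

```python
#!/usr/bin/env python3
"""Sketch: list IPMI-capable, prg2-based workers registered on o3."""
import json
import urllib.request

O3 = "https://openqa.opensuse.org"

with urllib.request.urlopen(f"{O3}/api/v1/workers") as response:
    workers = json.load(response).get("workers", [])

for worker in workers:
    worker_class = worker.get("properties", {}).get("WORKER_CLASS", "")
    host = worker.get("host", "")
    # we are looking for ipmi worker classes on prg2-based worker hosts
    if "ipmi" in worker_class and "prg2" in host.lower():
        print(f"{host}:{worker.get('instance')} status={worker.get('status')} class={worker_class}")
```

Passed jobs on any worker found this way can then be checked via the worker's page in the webUI.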
Updated by okurz about 1 year ago
- Related to action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M added
Updated by dheidler about 1 year ago
- Status changed from Workable to Blocked
Updated by xlai about 1 year ago
dheidler wrote in #note-14:
@dheidler Hi, I do not have access to it. Would you please explain a bit what the root cause is for this being blocked?
Besides, the virtualization team used to have two ipmi workers configured on "rebel.openqanet.opensuse.org" and @okurz shared that the above SD ticket is not likely to be resolved until the end of this Nov (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of Nov, so that O3 virt tests can continue?
Updated by okurz about 1 year ago
xlai wrote in #note-16:
dheidler wrote in #note-14:
@dheidler Hi, I do not have access to it. Would you please explain a bit what the root cause is for this being blocked?
The idea is to use two of the new x86_64 machines in PRG2 as bare-metal test machines. For that we need the IPMI interfaces in a network controllable by other openQA workers. The above SD ticket is a request to change the VLAN assignment of the corresponding switch ports.
All of that is a workaround planned by us because the room PRG2e, where the other existing bare-metal test machines are supposed to move, is not yet ready for the move.
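To illustrate what "controllable by other openQA workers" means in practice, here is a minimal reachability check one could run from any o3 worker once the VLAN change is in place; the BMC hostname and credentials are placeholders only and ipmitool needs to be installed:

```python
#!/usr/bin/env python3
"""Sketch: check that a bare-metal SUT's IPMI/BMC interface is reachable
from an openQA worker. Hostname and credentials are placeholders."""
import subprocess

IPMI_HOST = "bare-metal-sut1-ipmi.openqanet.opensuse.org"  # placeholder BMC address
IPMI_USER = "ADMIN"                                        # placeholder credentials
IPMI_PASSWORD = "CHANGEME"

# the same kind of ipmitool call that the IPMI backend of os-autoinst issues
result = subprocess.run(
    ["ipmitool", "-I", "lanplus", "-H", IPMI_HOST,
     "-U", IPMI_USER, "-P", IPMI_PASSWORD, "chassis", "power", "status"],
    capture_output=True, text=True, timeout=30,
)
print(result.stdout.strip() or result.stderr.strip())
```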
Besides, the virtualization team used to have two ipmi workers configured on "rebel.openqanet.opensuse.org" and @okurz shared that the above SD ticket is not likely to be resolved until the end of this Nov (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of Nov, so that O3 virt tests can continue?
That is not feasible. rebel is already at the planned target location of NUE3 along with all the other older openqaworker machines, transported in coordination with an external logistics company and BuildOps, SUSE-IT and Facilities. I do not see a reasonable RoI nor available people capacity to move the machine back.
In theory we could build solutions providing access to the IPMI interfaces in the current network configuration via tunnels, but that would effectively give every openQA test on the public o3 instance access to a major part of the SUSE internal network, which we must not allow.
I think the best course of action would be to provide SUSE QE Tools Team members administrative access to the involved network switches so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes to do. So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
Updated by xlai about 1 year ago
@okurz I see, thanks for the explanations. With that said, I agree with the current handling and the plan for rebel.
So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
I believe so and will try to, but I guess the reasons may not be about trust or capability, but rather other things which I do not know about and which have a complex background...
Thank you, Oliver, for the continued support.
Updated by xlai about 1 year ago
@okurz FYI, we escalated it to @cachen. Calen will help discuss it with the higher management team to see if it can be unblocked. Thank you, Calen!
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
Updated by okurz about 1 year ago
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I have already escalated, over multiple channels, the multiple shortcomings of having private-by-default SD tickets and no possibility to share them with bigger groups or with all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.
Updated by xlai about 1 year ago
okurz wrote in #note-20:
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I have already escalated, over multiple channels, the multiple shortcomings of having private-by-default SD tickets and no possibility to share them with bigger groups or with all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.
@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make a difference to the decision with even more and higher escalation ;-)
Updated by okurz about 1 year ago
xlai wrote in #note-21:
okurz wrote in #note-20:
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I have already escalated, over multiple channels, the multiple shortcomings of having private-by-default SD tickets and no possibility to share them with bigger groups or with all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.
@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make a difference to the decision with even more and higher escalation ;-)
The ticket just has the following content, not even following our proper ticket template:
These two machines are planned to be used as bare metal SUT hosts. This requires other workers within the dmz network to be able to reach them via IPMI.
So please connect the IPMI interfaces of those two machines to the same network as the machine itself is connected to.
See: https://progress.opensuse.org/issues/132647
I already stated what I think should be done:
I think the best course of action would be to provide SUSE QE Tools Team members administrative access to the involved network switches so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes to do. So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
Please keep in mind that I am in full alignment with SUSE-IT that the evacuation of NUE1 and the migration of SUSE-wide services like imap.suse.de must have a higher priority than QE-specific tasks, which unfortunately also includes this ticket here. This is why we have the estimate from mflores, team lead of Eng-Infra, to not expect any bigger help from Eng-Infra before 2023-12. So I suggest not asking to prioritize the work on https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 itself but rather giving more people access to the switch administration so that we can solve problems ourselves.
Updated by cachen about 1 year ago
Hello guys, I just wrote an email to Moroni with 2 escalations (higher administration access delegation; jira-sd ticket authority). It's hard to say if that will help speed things up, but at least I hope the IT team is listening to our needs. Let's see what I get back from Moroni.
@mgriessmeier
Updated by okurz about 1 year ago
@dheidler and I talked about the topic yesterday and dheidler has added a comment in https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 asking to just give us administrative access to the switches. I don't expect that to happen within a reasonable timespan nor without further motivation. We can either escalate this further or just wait with lower priority. If the opportunity arises I will bring the topic up in related meetings or on other occasions.
Updated by okurz about 1 year ago
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Updated by livdywan about 1 year ago
okurz wrote in #note-28:
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.
Updated by livdywan about 1 year ago
livdywan wrote in #note-29:
okurz wrote in #note-28:
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.
Asking in Slack now.
Updated by dheidler about 1 year ago
Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The jira ticket is still in the backlog and might be worked on in their next sprint, starting from the 21st.
Updated by livdywan about 1 year ago
dheidler wrote in #note-31:
Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The jira ticket is still in the backlog and might be worked on in their next sprint, starting from the 21st.
November 21, this week? I can't open the Slack chat.
Updated by okurz about 1 year ago
- Priority changed from High to Normal
https://jira.suse.com/browse/ENGINFRA-3208 is still unchanged and our questions are not answered. Realistically that means we can't expect an improvement there soon, hence I am reducing the prio.
Updated by livdywan 10 months ago
okurz wrote in #note-33:
https://jira.suse.com/browse/ENGINFRA-3208 is still unchanged and our questions are not answered. Realistically that means we can't expect an improvement there soon, hence I am reducing the prio.
Response today is that read-only access is a possibility. Long-term there will be a salt deployment for it.
Updated by okurz 10 months ago
- Assignee changed from dheidler to okurz
- Target version changed from Tools - Next to future
https://jira.suse.com/browse/ENGINFRA-3207 was rejected with https://jira.suse.com/browse/ENGINFRA-3207?focusedId=1327731&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1327731
(Moroni Flores) Hello - administrative access to network devices in PRG2 will not be granted to any team.
As previously discussed, we will deploy and have a way for each team to be able to make changes via salt, but not directly on the device.
If you are interested in read-only access for certain troubleshooting, that can be granted.
Thanks.
As IT has now successfully stalled our original plans as well as multiple levels of mitigation, and in the meantime amd-zen2-gpu-sut1 is at least pingable, I will wait for #153706 first and see if that is enough for our bare-metal testing. It's unfortunate to see that openqaworker27+openqaworker28 were actually also powered on this whole time. I have now set both to powered off to save energy.
Updated by okurz 7 months ago
- Category set to Feature requests
- Status changed from Workable to Feedback
Asked in https://suse.slack.com/archives/C02CANHLANP/p1716974536520799
@Xiaoli Ai what's the current status regarding bare-metal tests on o3? Context: I'm asking as part of https://progress.opensuse.org/issues/132647, as amd-zen2-gpu-sut1 has been there for some months already but no tests have been running on it, right?