action #132647
Migration of o3 VM to PRG2 - bare-metal tests size:M (closed)
Description
Motivation
The openQA webUI VM for o3 will move to PRG2. After the move we must ensure that bare-metal tests can work.
Acceptance criteria
- AC1: bare-metal production tests work on o3 after the move
Suggestions
- Wait for the move
- Fix o3 bare metal hosts iPXE booting, see https://openqa.opensuse.org/tests/3446336#step/ipxe_install/2
- After the move connect the NUE1-based workers over https to the PRG2-based o3, i.e. check https://openqa.opensuse.org/admin/workers looking for nue1; see #132134 for the worker setup in PRG2 and the config sketch after this list
- Consider using at least one of the new PRG2 o3 machines in https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 as a bare-metal machine. For this create an Eng-Infra ticket to switch the VLAN of the IPMI interface to the o3 one so that openQA tests can use the IPMI interface, and consider renaming the machines to make the distinction clear
- Ensure that production openQA jobs, e.g. in https://openqa.opensuse.org/tests/overview?distri=opensuse&groupid=1&version=Tumbleweed, can run successfully
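A minimal sketch, assuming the standard openQA worker setup, of what pointing a NUE1-based worker at the PRG2-based o3 over https could look like (the API key/secret are placeholders, not real credentials):

```ini
# /etc/openqa/workers.ini on a NUE1-based worker (sketch)
[global]
# register against the PRG2-based o3 webUI over https
HOST = https://openqa.opensuse.org

# /etc/openqa/client.conf on the same worker (sketch)
[openqa.opensuse.org]
key = <API key generated in the o3 webUI>
secret = <API secret generated in the o3 webUI>
```

After changing the configuration the openqa-worker services need to be restarted so that the workers re-register against the configured host.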
Rollback steps
Updated by okurz over 1 year ago
- Copied from action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Updated by livdywan over 1 year ago
We decided to skip this ticket and not estimate it yet because this is a moving target (no pun intended) and we don't know exactly what tests this is about. @okurz Feel free to provide more details if you can in the meantime.
Updated by livdywan over 1 year ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests to Migration of o3 VM to PRG2 - bare-metal tests size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Copied to action #133490: Migration of o3 VM to PRG2 - Fix o3 bare metal hosts iPXE booting size:M added
Updated by okurz over 1 year ago
- Project changed from 115 to openQA Infrastructure (public)
Updated by livdywan about 1 year ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests size:M to Migration of o3 VM to PRG2 - bare-metal tests
- After the move connect the NUE1-based workers over https to the PRG2-based o3, i.e. https://openqa.opensuse.org/admin/workers looking for nue1
Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline. How do I answer the question in the AC which is "bare metal tests can work on o3 after move"?
Updated by okurz about 1 year ago
- Subject changed from Migration of o3 VM to PRG2 - bare-metal tests to Migration of o3 VM to PRG2 - bare-metal tests size:M
livdywan wrote in #note-9:
- After the move connect the NUE1-based workers over https to the PRG2-based o3, i.e. https://openqa.opensuse.org/admin/workers looking for nue1
Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline.
w6+7+19+20 are all NUE1-based, so not PRG2.
How do I answer the question in the AC which is "bare metal tests can work on o3 after move"?
As an acceptance test one can go to https://openqa.opensuse.org/admin/workers, select "All", search for "ipmi", find workers that are PRG2-based (currently none), then find their tests and look for passed ones.
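One way to script that check against the public API is sketched below; it assumes the /api/v1/workers route that backs the admin table and its usual JSON field names (host, status, properties/WORKER_CLASS), which may need adjusting:

```python
# Sketch: list IPMI-capable workers registered on o3.
# Assumes a response shape of
# {"workers": [{"host": ..., "status": ..., "properties": {"WORKER_CLASS": ...}}, ...]};
# adjust the field names if the API differs.
import requests

resp = requests.get("https://openqa.opensuse.org/api/v1/workers", timeout=30)
resp.raise_for_status()

for worker in resp.json().get("workers", []):
    worker_class = worker.get("properties", {}).get("WORKER_CLASS", "")
    if "ipmi" in worker_class:
        print(worker.get("host"), worker.get("status"), worker_class)
```

A PRG2-based ipmi worker showing up there together with passed bare-metal jobs would then fulfill AC1.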
Updated by okurz about 1 year ago
- Related to action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M added
Updated by dheidler about 1 year ago
- Status changed from Workable to Blocked
Updated by xlai about 1 year ago
dheidler wrote in #note-14:
@dheidler Hi, I do not have access to it. Would you please explain a little bit what the root cause is for this being blocked?
Besides, the virtualization team used to have two ipmi workers configured on "rebel.openqanet.opensuse.org" and @okurz shared that the above SD ticket is not likely to be resolved until the end of this November (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of November, so that o3 virt tests can continue?
Updated by okurz about 1 year ago
xlai wrote in #note-16:
dheidler wrote in #note-14:
@dheidler Hi, I do not have access to it. Would you please explain a little bit what the root cause is for this being blocked?
The idea is to use two of the new x86_64 machines in PRG2 as bare-metal test machines. For that we need the IPMI interfaces in a network controllable by other openQA workers. The above SD ticket is a request to change the VLAN configuration of the corresponding switch ports.
All that is a workaround planned by us because the room PRG2e, where the other existing bare-metal test machines are supposed to move, is not ready for the move yet.
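As a sketch of the kind of check that becomes possible once the VLAN of those switch ports is changed: any o3 worker should then be able to reach the SUT's BMC via IPMI, e.g. with ipmitool called from a small script like the following (hostname and credentials are hypothetical placeholders, not real machine data):

```python
# Sketch: verify from an o3 worker that a bare-metal SUT's IPMI interface is reachable.
import subprocess

BMC_HOST = "bare-metal-sut1-ipmi.openqanet.opensuse.org"  # hypothetical hostname
BMC_USER = "ADMIN"                                        # hypothetical credentials
BMC_PASSWORD = "secret"

result = subprocess.run(
    ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
     "-U", BMC_USER, "-P", BMC_PASSWORD, "chassis", "power", "status"],
    capture_output=True, text=True, timeout=60,
)
# On success ipmitool prints e.g. "Chassis Power is on"
print(result.stdout or result.stderr)
```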
Besides, the virtualization team used to have two ipmi workers configured on "rebel.openqanet.opensuse.org" and @okurz shared that the above SD ticket is not likely to be resolved until the end of this November (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of November, so that o3 virt tests can continue?
That is not feasible. rebel is already at the planned target location of NUE3 along with all the other older openqaworker machines, transported in coordination with an external logistics company and BuildOps, SUSE-IT and Facilities. I do not see a reasonable RoI nor available people capacity to move the machine back.
In theory we could build solutions providing access to the IPMI interfaces in the current network configuration via tunnels, but that would effectively give every openQA test on the public o3 instance access to a major part of the SUSE internal network, which we must not allow.
I think the best course of action would be to provide SUSE QE Tools Team members administrative access to the involved network switches so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes to do. So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
Updated by xlai about 1 year ago
@okurz I see, thanks for the explanations. With that said, I agree with the current handling and the plan for rebel.
So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
I believe so and will try to, but I guess the reasons may not be about trust or capability, but other ones which I do not know and which have a complex background...
Thank you, Oliver, for your constant support.
Updated by xlai about 1 year ago
@okurz FYI, we escalated it to @cachen. Calen will help discuss with the higher management team to see if it can be unblocked. Thank you, Calen!
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
Updated by okurz about 1 year ago
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets and no possibility to share them with bigger groups or all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.
Updated by xlai about 1 year ago
okurz wrote in #note-20:
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets and no possibility to share them with bigger groups or all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.
@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make some difference on the decision with even more and higher escalation ;-)
Updated by okurz about 1 year ago
xlai wrote in #note-21:
okurz wrote in #note-20:
xlai wrote in #note-19:
BTW, Oliver, would you please grant view permission to Calen and me, for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.
I already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets and no possibility to share them with bigger groups or all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.
@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make some difference on the decision with even more and higher escalation ;-)
The ticket just has the following content, not even following our proper ticket template:
These two machines are planned to be used as bare metal SUT hosts. This requires other workers within the dmz network to be able to reach them via IPMI.
So please connect the IPMI interfaces of those two machines to the same network as the machine itself is connected to.
See: https://progress.opensuse.org/issues/132647
I already stated what I think should be done:
I think the best course of action would be to provide SUSE QE Tools Team members administrative access to the involved network switches so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes to do. So if you want to help, consider escalating to management that the tools team could do more work if we are trusted with more administrative access :)
Please keep in mind that I am in full alignment with SUSE-IT that the evacuation of NUE1 and the migration of SUSE-wide services like imap.suse.de must have higher priority than QE-specific tasks, which unfortunately also includes this ticket here. This is why we have the estimate from mflores, team lead of Eng-Infra, to not expect any bigger help from Eng-Infra before 2023-12. So I suggest not asking to prioritize the work on https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 itself but rather giving more people access to the switch administration so that we can solve problems ourselves.
Updated by cachen about 1 year ago
Hello guys, I just wrote an email to Moroni with 2 escalations (higher administration access delegation; jira-sd ticket authority). Hard to say if that helps speed things up, but at least I hope the IT team is listening to our needs; let's see what I get back from Moroni.
@mgriessmeier
Updated by okurz about 1 year ago
@dheidler and I talked about the topic yesterday and dheidler has added a comment in https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 asking to just give us administrative access to the switches. I don't expect that to happen within a reasonable timespan nor without further motivation. We can either escalate this further or just wait with lower priority. If the opportunity arises I will bring the topic up in related meetings or at other opportunities.
Updated by okurz about 1 year ago
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Updated by livdywan about 1 year ago
okurz wrote in #note-28:
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.
Updated by livdywan about 1 year ago
livdywan wrote in #note-29:
okurz wrote in #note-28:
I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.
Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.
Asking in Slack now.
Updated by dheidler about 1 year ago
Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The Jira ticket is still in the backlog and might be worked on in their next sprint starting from the 21st.
Updated by livdywan about 1 year ago
dheidler wrote in #note-31:
Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The Jira ticket is still in the backlog and might be worked on in their next sprint starting from the 21st.
November 21, this week? I can't open the Slack chat.
Updated by okurz about 1 year ago
- Priority changed from High to Normal
https://jira.suse.com/browse/ENGINFRA-3208 is still unchanged and our questions are not answered. Realistically that means that we can't expect an improvement there soon, hence reducing the priority.
Updated by livdywan 10 months ago
okurz wrote in #note-33:
https://jira.suse.com/browse/ENGINFRA-3208 is still unchanged and our questions are not answered. Realistically that means that we can't expect an improvement there soon, hence reducing the priority.
Response today is that read-only access is a possibility. Long-term there will be a salt deployment for it.
Updated by okurz 10 months ago
- Assignee changed from dheidler to okurz
- Target version changed from Tools - Next to future
https://jira.suse.com/browse/ENGINFRA-3207 was rejected with https://jira.suse.com/browse/ENGINFRA-3207?focusedId=1327731&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1327731
(Moroni Flores) Hello - administrative access to network devices in PRG2 will not be granted to any team.
As previously discussed, we will deploy and have a way for each team to be able to make changes via salt, but not directly on the device.
If you are interested in read-only access for certain troubleshooting, that can be granted.
Thanks.
As IT has now successfully stalled our original plans as well as multiple levels of mitigation, and in the meantime amd-zen2-gpu-sut1 is at least pingable, I will wait for #153706 first and see if that is enough for our bare-metal testing. It's unfortunate to see that openqaworker27+openqaworker28 were actually also powered on all this time. I now set both to powered off to save energy.
Updated by okurz 6 months ago
- Category set to Feature requests
- Status changed from Workable to Feedback
Asked in https://suse.slack.com/archives/C02CANHLANP/p1716974536520799
@Xiaoli Ai what's the current status regarding bare-metal tests on o3? Asking for context as part of https://progress.opensuse.org/issues/132647 as amd-zen2-gpu-sut1 has been available for some months already but no tests have been running there, right?