action #132647

Migration of o3 VM to PRG2 - bare-metal tests size:M

Added by okurz over 1 year ago. Updated 6 months ago.

Status: Resolved
Priority: Low
Assignee: okurz
Category: Feature requests
% Done: 0%
Description

Motivation

The openQA webUI VM for o3 will move to PRG2. After the move we must ensure that bare-metal tests still work.

Acceptance criteria

  • AC1: bare-metal production tests work on o3 after the move

Suggestions

Rollback steps


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M (Resolved, assignee: okurz)

Copied from openQA Infrastructure (public) - action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M (Resolved, assignee: nicksinger, 2023-06-29)

Copied to openQA Infrastructure (public) - action #133490: Migration of o3 VM to PRG2 - Fix o3 bare metal hosts iPXE booting size:M (Resolved, assignee: dheidler)

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Actions #2

Updated by livdywan over 1 year ago

We decided to skip this ticket and not estimate it yet because this is a moving target (no pun intended) and we don't know exactly what tests this is about. @okurz Feel free to provide more details if you can in the meantime.

Actions #3

Updated by livdywan over 1 year ago

  • Subject changed from Migration of o3 VM to PRG2 - bare-metal tests to Migration of o3 VM to PRG2 - bare-metal tests size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #6

Updated by okurz over 1 year ago

  • Copied to action #133490: Migration of o3 VM to PRG2 - Fix o3 bare metal hosts iPXE booting size:M added
Actions #7

Updated by okurz over 1 year ago

  • Project changed from 115 to openQA Infrastructure (public)
Actions #8

Updated by livdywan over 1 year ago

  • Description updated (diff)
Actions #9

Updated by livdywan about 1 year ago

  • Subject changed from Migration of o3 VM to PRG2 - bare-metal tests size:M to Migration of o3 VM to PRG2 - bare-metal tests

Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline. How do I answer the question in the AC, which is "bare metal tests can work on o3 after move"?

Actions #10

Updated by okurz about 1 year ago

  • Subject changed from Migration of o3 VM to PRG2 - bare-metal tests to Migration of o3 VM to PRG2 - bare-metal tests size:M

livdywan wrote in #note-9:

Currently this shows worker7 as online/working, and worker6, 19 and 20 as offline.

w6+7+19+20 are all NUE1-based, so not PRG2.

How do I answer the question in the AC, which is "bare metal tests can work on o3 after move"?

As an acceptance test one can go to https://openqa.opensuse.org/admin/workers, select "All", search for "ipmi", find workers that are PRG2-based (currently none), then look up tests on those workers and check for passed results.
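
The same check can be scripted against the openQA REST API; a minimal sketch, assuming the public /api/v1/workers route and a response that carries each worker's WORKER_CLASS in its properties (field names may differ):

```python
# Sketch: list o3 workers whose WORKER_CLASS mentions "ipmi" so PRG2-based
# bare-metal workers and their status can be spotted without the web UI.
# Assumes a "workers" list with "host", "instance", "status" and
# "properties" -> "WORKER_CLASS" fields in the JSON response.
import requests

resp = requests.get("https://openqa.opensuse.org/api/v1/workers", timeout=30)
resp.raise_for_status()

for worker in resp.json().get("workers", []):
    worker_class = (worker.get("properties") or {}).get("WORKER_CLASS", "")
    if "ipmi" in worker_class:
        print(f"{worker['host']}:{worker['instance']}"
              f" status={worker.get('status')} class={worker_class}")
```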

Actions #11

Updated by dheidler about 1 year ago

  • Assignee set to dheidler
Actions #12

Updated by okurz about 1 year ago

  • Description updated (diff)

Clarified and extended last suggestion

Actions #13

Updated by okurz about 1 year ago

  • Related to action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M added
Actions #14

Updated by dheidler about 1 year ago

  • Status changed from Workable to Blocked
Actions #15

Updated by okurz about 1 year ago

  • Description updated (diff)
Actions #16

Updated by xlai about 1 year ago

dheidler wrote in #note-14:

https://sd.suse.com/servicedesk/customer/portal/1/SD-134097

@dheidler Hi, I do not have access to it. Would you please explain a bit what the root cause is for this being blocked?

Besides, the virtualization team used to have two IPMI workers configured on "rebel.openqanet.opensuse.org", and @okurz shared that the above SD ticket will likely not be resolved until the end of this November (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of November, so that o3 virt tests can continue?

Actions #17

Updated by okurz about 1 year ago

xlai wrote in #note-16:

dheidler wrote in #note-14:

https://sd.suse.com/servicedesk/customer/portal/1/SD-134097

@dheidler Hi, I do not have access to it. Would you please explain a bit what the root cause is for this being blocked?

The idea is to use two of the new x86_64 machines in PRG2 as bare-metal test machines. For that we need the IPMI interfaces in a network controllable by other openQA workers. The above SD ticket is a request to change the VLAN assignment of the corresponding switch ports.

All of that is a workaround planned by us, as the room PRG2e, where the other existing bare-metal test machines are supposed to move, is not ready for the move yet.

Besides, the virtualization team used to have two IPMI workers configured on "rebel.openqanet.opensuse.org", and @okurz shared that the above SD ticket will likely not be resolved until the end of this November (if my understanding is correct). Is it possible to recover the original "rebel.openqanet.opensuse.org" and make it work until some date closer to the end of November, so that o3 virt tests can continue?

That is not feasible. rebel is already at the planned target location, NUE3, along with all the other older openqaworker machines, transported in coordination with an external logistics company and BuildOps, SUSE-IT and Facilities. I do not see a reasonable RoI nor available people capacity to move the machine back.

In theory we could build solutions providing access to the IPMI interfaces in the current network configuration via tunnels, but that would effectively give every openQA test on the public o3 instance access to a major part of the SUSE internal network, which we must not allow.

I think the best course of action would be to provide SUSE QE Tools team members administrative access to the involved network switches, so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes. So if you want to help, consider escalating to management that the tools team could do more work if we were trusted with more administrative access :)

Actions #18

Updated by xlai about 1 year ago

@okurz I see, thanks for the explanations. With that said, I agree with the current handling and the plan for rebel.

So if you want to help, consider escalating to management that the tools team could do more work if we were trusted with more administrative access :)

I believe so and will try to, but I guess the reasons may not be about trust/capability but other factors which I do not know and which have a complex background...

Thank you, Oliver, for the constant support.

Actions #19

Updated by xlai about 1 year ago

@okurz FYI, we escalated it to @cachen. Calen will help discuss with the higher management team to see if it can be unblocked. Thank you, Calen!

BTW, Oliver, would you please grant view permission to Calen and me for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.

Actions #20

Updated by okurz about 1 year ago

xlai wrote in #note-19:

BTW, Oliver, would you please grant view permission to Calen and me for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.

I have already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets with no possibility to share them with bigger groups or all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.

Actions #21

Updated by xlai about 1 year ago

okurz wrote in #note-20:

xlai wrote in #note-19:

BTW, Oliver, would you please grant view permission to Calen and me for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.

I have already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets with no possibility to share them with bigger groups or all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.

@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make a difference on the decision with even more and higher escalation ;-)

Actions #22

Updated by okurz about 1 year ago

xlai wrote in #note-21:

okurz wrote in #note-20:

xlai wrote in #note-19:

BTW, Oliver, would you please grant view permission to Calen and me for https://sd.suse.com/servicedesk/customer/portal/1/SD-134097.

I have already escalated over multiple channels the multiple shortcomings of having private-by-default SD tickets with no possibility to share them with bigger groups or all of SUSE. I don't think it's worth our time to share SD tickets while SUSE IT refuses our suggestions.

@okurz I understand the pain. And now we are asking team leads for help. How can they help if they cannot view it? Maybe they can make a difference on the decision with even more and higher escalation ;-)

The ticket just has the following content, not even following our proper ticket template:

These two machines are planned to be used as bare metal SUT hosts. This requires other workers within the dmz network to be able to reach them via IPMI.

So please connect the IPMI interfaces of those two machines to the same network as the machine itself is connected to.

See: https://progress.opensuse.org/issues/132647

I already stated what I think should be done:

I think the best course of action would be to provide SUSE QE Tools team members administrative access to the involved network switches, so that we don't need to rely on Eng-Infra's limited resources to switch the VLAN config on the switches, which by itself would take about 10 minutes. So if you want to help, consider escalating to management that the tools team could do more work if we were trusted with more administrative access :)

Please keep in mind that I am in full alignment with SUSE-IT that the evacuation of NUE1 and the migration of SUSE-wide services like imap.suse.de must have higher priority than QE-specific tasks, which unfortunately also includes this ticket here. This is why we have the estimate from mflores, team lead of Eng-Infra, to not expect any bigger help from Eng-Infra before 2023-12. So I suggest not asking to prioritize the work on https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 itself, but rather giving more people access to the switch administration so that we can solve problems ourselves.
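
Once the requested VLAN change is in place, the reachability that the SD ticket asks for can be checked from a dmz worker with a short ipmitool call; a minimal sketch, with hostname and credentials as placeholders (not taken from this ticket):

```python
# Sketch: verify that a SUT's IPMI interface answers from an o3 worker in the
# dmz network once the switch ports are on the right VLAN.
# Hostname and credentials below are placeholders, not from this ticket.
import subprocess

IPMI_HOST = "sut-ipmi.example.org"  # placeholder IPMI address
IPMI_USER = "ADMIN"                 # placeholder credentials
IPMI_PASS = "secret"

result = subprocess.run(
    ["ipmitool", "-I", "lanplus", "-H", IPMI_HOST,
     "-U", IPMI_USER, "-P", IPMI_PASS, "chassis", "power", "status"],
    capture_output=True, text=True, timeout=30)
# On success ipmitool prints e.g. "Chassis Power is on"
print(result.stdout.strip() or result.stderr.strip())
```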

Actions #23

Updated by cachen about 1 year ago

Hello guys, I just wrote an email to Moroni with 2 escalations (higher administrative access delegation; Jira SD ticket authority). Hard to say if that helps to speed things up, but at least I hope the IT team is listening to our needs. Let's see what I get back from Moroni.
@mgriessmeier

Actions #25

Updated by okurz about 1 year ago

  • Description updated (diff)
Actions #26

Updated by okurz about 1 year ago

@dheidler and I talked about the topic yesterday, and dheidler has added a comment in https://sd.suse.com/servicedesk/customer/portal/1/SD-134097 asking to just give us administrative access to the switches. I don't expect that to happen within a reasonable timespan nor without further motivation. We can either escalate this further or just wait with lower priority. If the opportunity arises I will bring the topic up in related meetings or at other opportunities.

Actions #28

Updated by okurz about 1 year ago

I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.

Actions #29

Updated by livdywan about 1 year ago

okurz wrote in #note-28:

I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.

Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.

Actions #30

Updated by livdywan about 1 year ago

livdywan wrote in #note-29:

okurz wrote in #note-28:

I created https://jira.suse.com/browse/ENGINFRA-3208 "Provide administrative access to LSG QE Tools for ToR switches in J12" to help us.

Ticket hasn't been picked up yet. I added a comment there to confirm we're on the right track.

Asking in Slack now.

Actions #31

Updated by dheidler about 1 year ago

Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The Jira ticket is still in the backlog and might be worked on in their next sprint starting from the 21st.

Actions #32

Updated by livdywan about 1 year ago

dheidler wrote in #note-31:

Last status from https://suse.slack.com/archives/C04MDKHQE20/p169989293880018
The Jira ticket is still in the backlog and might be worked on in their next sprint starting from the 21st.

November 21, this week? I can't open the Slack chat.

Actions #33

Updated by okurz about 1 year ago

  • Priority changed from High to Normal

https://jira.suse.com/browse/ENGINFRA-3208 is still unchanged and our questions are not answered. Realistically that means we can't expect an improvement there soon, hence I am reducing the priority.

Actions #34

Updated by dheidler 11 months ago

Asked again in the ENGINFRA ticket.

Actions #35

Updated by okurz 10 months ago

  • Priority changed from Normal to Low
Actions #36

Updated by livdywan 10 months ago

okurz wrote in #note-33:

https://jira.suse.com/browse/ENGINFRA-3208 is still unchanged and our questions are not answered. Realistically that means we can't expect an improvement there soon, hence I am reducing the priority.

Response today is that read-only access is a possibility. Long-term there will be a salt deployment for it.

Actions #37

Updated by okurz 10 months ago

  • Target version changed from Ready to Tools - Next
Actions #39

Updated by okurz 10 months ago

  • Assignee changed from dheidler to okurz
  • Target version changed from Tools - Next to future

https://jira.suse.com/browse/ENGINFRA-3207 was rejected with https://jira.suse.com/browse/ENGINFRA-3207?focusedId=1327731&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1327731

(Moroni Flores) Hello - administrative access to network devices in PRG2 will not be granted to any team.
As previously discussed, we will deploy and have a way for each team to be able to make changes via salt, but not directly on the device.
If you are interested in read-only access for certain troubleshooting, that can be granted.
Thanks.

As IT has now successfully stalled our original plans as well as multiple levels of mitigation, and in the meantime amd-zen2-gpu-sut1 is at least pingable, I will wait for #153706 first and see if that is enough for our bare-metal testing. It's unfortunate to see that openqaworker27+openqaworker28 were actually also powered on this whole time. I have now set both to powered off to save energy.
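
As an aside, powering such a host off over IPMI is a single ipmitool call; a minimal sketch, again with placeholder host and credentials:

```python
# Sketch: power off an idle bare-metal machine over IPMI to save energy.
# Host and credentials are placeholders, not from this ticket.
import subprocess

subprocess.run(
    ["ipmitool", "-I", "lanplus", "-H", "worker-ipmi.example.org",
     "-U", "ADMIN", "-P", "secret", "chassis", "power", "off"],
    check=True)
```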

Actions #40

Updated by okurz 7 months ago

  • Status changed from Blocked to Workable
  • Target version changed from future to Ready

Next step: Crosscheck the current status of o3 bare-metal tests with #153706 resolved

Actions #41

Updated by okurz 7 months ago

  • Category set to Feature requests
  • Status changed from Workable to Feedback

Asked in https://suse.slack.com/archives/C02CANHLANP/p1716974536520799

@Xiaoli Ai what's the current status regarding bare-metal tests on o3? Context: asking as part of https://progress.opensuse.org/issues/132647, as amd-zen2-gpu-sut1 has been there for some months already but no tests have been running on it, right?

Actions #42

Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

OK, QE Virt does not currently invest any effort in o3 bare-metal tests, but the workers are ready to be used. The bare-metal machine is powered off while not in use, so no significant electrical power is lost either.
