action #168811

closed

baremetal-support in PRG2 size:M

Added by okurz 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Start date: 2024-02-15
Due date:
% Done: 0%
Estimated time:

Description

Motivation

When we offer a PXE server in PRG2 with #155524, we should also consider either setting up another instance of baremetal-support, which is needed for openQA bare-metal tests, or moving the existing one https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=28331 from NUE2 to PRG2. Another reason to consider this is the upcoming CC-related network changes due by 2024-W46, see #165282

Acceptance criteria

  • AC1: The team has a general understanding of what "baremetal-support" is doing
  • AC2: bare-metal openQA tests in PRG2 can run without relying on baremetal-support.qe.nue2.suse.org

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #174610: [alert] salt-states-openqa deploy pipeline failed: data failed to compile (Resolved, dheidler, 2024-12-19)

Related to openQA Infrastructure (public) - action #174652: Ensure uniqueness of nodenames for generating configs on monitor size:M (Resolved, ybonatakis, 2024-12-20, 2025-02-04)

Actions #2

Updated by okurz 3 months ago

  • Tags changed from infra, network, qe, suse to infra, network, qe, suse, cc
Actions #3

Updated by livdywan 3 months ago

  • Subject changed from baremetal-support in PRG2 to baremetal-support in PRG2 size:M
  • Status changed from New to Workable
Actions #4

Updated by okurz 3 months ago

  • Category set to Feature requests
Actions #5

Updated by dheidler 3 months ago

  • Assignee set to dheidler
Actions #6

Updated by dheidler 3 months ago

  • Status changed from Workable to In Progress

Added https://gitlab.suse.de/qa-sle/baremetal-support-configs/

Currently waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5803

Reverse engineered the existing iPXE boot script from http://netboot.qe.prg2.suse.org/kernelqa/ipxe.efi by booting it in QEMU:

qemu-system-x86_64 -bios /usr/share/qemu/ovmf-x86_64.bin -kernel ipxe.efi

then dumping the VM memory and using strings to extract the script.
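
The extraction step could look roughly like the sketch below. It is an illustration only, not the commands that were actually used: it assumes the memory dump is available as raw bytes and that the embedded script starts with the usual `#!ipxe` shebang. The function name `extract_ipxe_scripts` and the sample dump are hypothetical.

```python
import re

# iPXE scripts start with this shebang line.
IPXE_MARKER = b"#!ipxe"

# Printable ASCII plus tab, newline and carriage return, similar in spirit to
# what `strings` treats as text.
_TEXT_RUN = re.compile(rb"[\t\n\r\x20-\x7e]+")


def extract_ipxe_scripts(dump: bytes) -> list[str]:
    """Return every printable text run in `dump` that starts at an iPXE shebang."""
    scripts = []
    start = dump.find(IPXE_MARKER)
    while start != -1:
        # The marker itself is printable, so this match always succeeds.
        match = _TEXT_RUN.match(dump, start)
        scripts.append(match.group().decode("ascii").strip())
        start = dump.find(IPXE_MARKER, match.end())
    return scripts


# Hypothetical dump: binary noise around a tiny script.
sample_dump = b"\x00\x01junk\x00#!ipxe\ndhcp\nchain http://example.invalid/boot\n\x00\xff"
print(extract_ipxe_scripts(sample_dump))
```

Searching for the shebang first keeps the output small compared to reading through everything `strings` prints for a full VM memory dump.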

Actions #7

Updated by dheidler 3 months ago · Edited

  • Status changed from In Progress to Blocked

Blocked until DNS is set up: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5803
Also waiting for salt firewall rules via https://sd.suse.com/servicedesk/customer/portal/1/SD-172306

2do afterwards:

Actions #8

Updated by dheidler 2 months ago

  • Status changed from Blocked to In Progress

It took infra only 14 days to merge my MR.
Of course that time didn't prevent them from merging other things, so I had to rebase 4 times.

Now we have DNS.

Actions #9

Updated by openqa_review 2 months ago

  • Due date set to 2024-12-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by dheidler 2 months ago

  • Status changed from In Progress to Blocked

Switched over the PRG2 workers and the openQA instance to point to baremetal-support.qe.prg2.

Asked Matze for some escalation for the firewall rules.

Actions #12

Updated by dheidler about 2 months ago · Edited

Got gitlab config deployment working by adding a gitlab runner on netboot.qe.prg2.suse.org.
(analogous to https://progress.opensuse.org/issues/125519#note-3)

Actions #14

Updated by okurz about 2 months ago

  • Due date changed from 2024-12-11 to 2024-12-27

Multiple delays accumulated, also due to #173662 and related issues impacting us negatively.
Now blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-175317

Actions #15

Updated by okurz about 2 months ago

  • Status changed from Blocked to Workable

https://sd.suse.com/servicedesk/customer/portal/1/SD-175317 was resolved, you verified. What's next?

Actions #16

Updated by dheidler about 2 months ago

I could try adding IPv6, but it wouldn't do much - I'm not sure if iPXE even supports it.

Actions #18

Updated by dheidler about 2 months ago

  • Status changed from Workable to Blocked

Only the IPv6 DNS entry is still missing.

Actions #19

Updated by okurz about 2 months ago

  • Status changed from Blocked to Workable
  • Priority changed from Normal to Urgent

@dheidler salt minion IDs need to be unique. Now with

baremetal-support.qe.nue2.suse.org
baremetal-support.qe.prg2.suse.org

we have problems like in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1328#note_702464 which need urgent handling. How about renaming the second machine to "baremetal-support2" or "baremetal-support-prg2"?

Actions #20

Updated by dheidler about 2 months ago

  • Status changed from Workable to In Progress

I guess our minion handling could be improved here.
This should not break things as the minion ids are already unique.

But as a workaround I will rename the minion IDs of both baremetal-support instances to "baremetal-support-nue2" and "baremetal-support-prg2" respectively.
This is not in line with our documentation at https://gitlab.suse.de/openqa/salt-states-openqa though.

Actions #21

Updated by mkittler about 2 months ago

The generic dashboards use the "nodename" instead of the minion ID / FQDN. We could easily change that to the FQDN, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1329. As mentioned in the MR, it would cause problems with the InfluxDB queries.
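
To illustrate why the two hosts clash, here is a minimal sketch. It assumes the nodename is the part of the FQDN before the first dot; the helper names `short_nodename` and `find_nodename_collisions` are hypothetical and not part of salt-states-openqa.

```python
from collections import defaultdict


def short_nodename(fqdn: str) -> str:
    """Assumed derivation: the part of the FQDN before the first dot."""
    return fqdn.split(".", 1)[0]


def find_nodename_collisions(fqdns: list[str]) -> dict[str, list[str]]:
    """Group FQDNs by short nodename and keep only names used more than once."""
    by_nodename = defaultdict(list)
    for fqdn in fqdns:
        by_nodename[short_nodename(fqdn)].append(fqdn)
    return {name: hosts for name, hosts in by_nodename.items() if len(hosts) > 1}


hosts = [
    "baremetal-support.qe.nue2.suse.org",
    "baremetal-support.qe.prg2.suse.org",
    "monitor.qe.nue2.suse.org",
]
print(find_nodename_collisions(hosts))
```

Keyed by FQDN, the same three hosts would all be distinct, so any file name or state ID derived from the nodename alone is where the clash appears.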

Actions #22

Updated by mkittler about 2 months ago

But as a workaround I will rename the minion IDs of both baremetal-support instances to "baremetal-support-nue2" and "baremetal-support-prg2" respectively.

It would probably be better to keep the name of the existing/old NUE2 host. Otherwise the Grafana dashboards will look empty.

Actions #23

Updated by dheidler about 2 months ago

  • Status changed from In Progress to Blocked

done.

Actions #24

Updated by dheidler about 2 months ago

  • Priority changed from Urgent to Normal
Actions #25

Updated by dheidler about 2 months ago

  • Status changed from Blocked to Resolved

IPv6 DNS done.

Actions #26

Updated by okurz about 1 month ago

  • Status changed from Resolved to Workable

okurz wrote:

  • AC2: bare-metal openQA tests in PRG2 can run without relying on baremetal-support.qe.nue2.suse.org

Please reference at least one verification job showing that this works.

Actions #27

Updated by dheidler about 1 month ago

  • Status changed from Workable to Resolved
Actions #28

Updated by okurz about 1 month ago

  • Due date deleted (2024-12-27)
Actions #29

Updated by jbaier_cz about 1 month ago

  • Status changed from Resolved to Feedback

Did the rename happen? I still saw failures in the salt pipeline yesterday: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3559987

monitor.qe.nue2.suse.org:
    Data failed to compile:
----------
    Rendering SLS 'base:monitoring.grafana' failed: while constructing a mapping
  in "<unicode string>", line 15, column 1:
    grafana:
    ^
found conflicting ID '/var/lib/grafana/dashboards/generic-baremetal-support.json'
  in "<unicode string>", line 921, column 1:
    /var/lib/grafana/dashboards/gene ... 
    ^
Actions #30

Updated by dheidler about 1 month ago

  • Status changed from Feedback to Resolved
baremetal-support.qe.prg2.suse.org:~ # cat /etc/salt/minion_id ; echo
baremetal-support.qe.prg2.suse.org
Actions #31

Updated by jbaier_cz about 1 month ago

  • Related to action #174610: [alert] salt-states-openqa deploy pipeline failed: data failed to compile added
Actions #32

Updated by okurz about 1 month ago

  • Status changed from Resolved to Workable
  • Priority changed from Normal to Urgent

According to #174610, that's still a problem. As suggested in #168811-22, please rename the hostname of the PRG2 instance and ensure that the deployment works again.

Actions #33

Updated by dheidler about 1 month ago

That comment was about renaming the nue2 instance back, which I already did.

Actions #34

Updated by dheidler about 1 month ago

Maybe salt does some string operations like removing everything after the first dot in the node name, so let's use a minus here.
Even though I really dislike using a node name that doesn't equal the FQDN.

Actions #35

Updated by okurz about 1 month ago

dheidler wrote in #note-34:

Maybe salt does some string operations like removing everything after the first dot in the node name, so let's use a minus here.
Even though I really dislike using a node name that doesn't equal the FQDN.

Exactly. That is why you should update the FQDN. Please call it baremetal-support-prg2.qe.prg2.suse.org.

Actions #36

Updated by dheidler about 1 month ago

  • Status changed from Workable to Resolved

baremetal-support-prg2.qe.prg2.suse.org sounds 🤮. Also, DNS changes would involve infra, so it wouldn't be fixed before mid-January (if they are fast), and the iPXE images would need to be rebuilt as well.

So I added a grain on the machine to set the nodename.
See https://progress.opensuse.org/issues/174610
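
As a rough model of the idea (the actual change is documented in the linked ticket, not here): a statically configured grain takes precedence over the value detected on the host. The sketch below only mimics that precedence with plain dictionaries. The function name `effective_grains` and the override value shown are hypothetical.

```python
def effective_grains(*sources: dict) -> dict:
    """Merge grain sources in order; later sources override earlier ones."""
    merged = {}
    for source in sources:
        merged.update(source)
    return merged


# Hypothetical values for illustration only.
detected = {
    "nodename": "baremetal-support",
    "fqdn": "baremetal-support.qe.prg2.suse.org",
}
static_override = {"nodename": "baremetal-support-prg2"}

grains = effective_grains(detected, static_override)
print(grains["nodename"])
```

Only the nodename changes in this model; the FQDN, and therefore DNS and the iPXE images, stay untouched.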

Actions #37

Updated by okurz about 1 month ago

  • Status changed from Resolved to Workable

Please stop messing around with workarounds and calling the ticket "Resolved". How about you create a ticket to have our salt states work properly with the FQDN instead, and record the necessary steps to revert your undocumented workaround?

Actions #38

Updated by dheidler about 1 month ago

That is not a workaround; the ticket is resolved.
Especially THIS ticket (see ACs).

In https://progress.opensuse.org/issues/174610?issue_count=6&issue_position=2&next_issue_id=162296&prev_issue_id=168811#note-4 it is documented what was done.

I created https://progress.opensuse.org/issues/174652 as a followup.

Actions #39

Updated by dheidler about 1 month ago

  • Status changed from Workable to Resolved
Actions #40

Updated by okurz about 1 month ago

  • Parent task changed from #159852 to #166598
Actions #41

Updated by okurz about 1 month ago

  • Related to action #174652: Ensure uniqueness of nodenames for generating configs on monitor size:M added