action #168811
closed
baremetal-support in PRG2 size:M
0%
Description
Motivation
When we offer a PXE server in PRG2 with #155524, we should also consider either setting up another instance of baremetal-support, which is needed for openQA bare-metal tests, or moving the existing one https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=28331 from NUE2 to PRG2. Another reason to consider this is the upcoming CC-related network changes until 2024-W46, see #165282.
Acceptance criteria
- AC1: The team has a general understanding of what "baremetal-support" is doing
- AC2: bare-metal openQA tests in PRG2 can run without relying on baremetal-support.qe.nue2.suse.org
Suggestions
- Wait for #155524 to sort out where to host the VM then pick the same approach for baremetal-support
- Get an overview of the machine from https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=28331
- Read as needed https://github.com/frankenmichl/baremetal_support/ or consult with domain expert and creator mmoese
- Set up another instance or migrate and verify with bare-metal tests on OSD
- Present your understanding to the team
Updated by dheidler 3 months ago
- Status changed from Workable to In Progress
Added https://gitlab.suse.de/qa-sle/baremetal-support-configs/
Currently waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5803
Reverse-engineered the existing iPXE boot script from http://netboot.qe.prg2.suse.org/kernelqa/ipxe.efi by booting it in QEMU with
qemu-system-x86_64 -bios /usr/share/qemu/ovmf-x86_64.bin -kernel ipxe.efi
then dumping the VM memory and running strings on the dump to extract the script.
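For anyone repeating this, the "dump memory and run strings" step can be sketched roughly as below. This is only an illustration, not what was actually run: the function names are made up, and it assumes the script sits in the dump as one contiguous block of ASCII text starting with the usual "#!ipxe" first line.

```python
import re
from typing import List, Optional

# iPXE scripts start with this shebang-like first line.
IPXE_MAGIC = b"#!ipxe"

# Printable ASCII plus tab, CR and LF; anything else ends the script text.
_TEXT = re.compile(rb"[\x20-\x7e\t\r\n]+")


def printable_runs(data: bytes, min_len: int = 4) -> List[bytes]:
    """Rough equivalent of strings(1): runs of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return pattern.findall(data)


def extract_ipxe_script(dump: bytes) -> Optional[str]:
    """Return the text starting at the first '#!ipxe' up to the first non-text byte.

    Hypothetical helper: assumes the script is stored contiguously in the dump.
    """
    start = dump.find(IPXE_MAGIC)
    if start < 0:
        return None
    match = _TEXT.match(dump, start)
    return match.group(0).decode("ascii").strip()


if __name__ == "__main__":
    # Synthetic stand-in for a memory dump; the URL is a placeholder.
    fake_dump = (
        b"\x00\x01junk\x00"
        b"#!ipxe\nchain http://example.invalid/boot\n"
        b"\x00\xff"
    )
    print(extract_ipxe_script(fake_dump))
```

On a real dump the first "#!ipxe" hit is not necessarily the script of interest, so the result still needs a manual look.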
Updated by dheidler 3 months ago · Edited
- Status changed from In Progress to Blocked
Blocked until DNS is set up: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5803
Also waiting for salt firewall rules via https://sd.suse.com/servicedesk/customer/portal/1/SD-172306
2do afterwards:
- Regenerate ipxe images on kernelqa tftp tree for new baremetal-support url
- Add to salt (Ensure velociraptor still working afterwards)
- Update worker.ini on all prg2 bare metal workers for new url (via https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/937)
Also still 2do:
- add to racktables
- IPv6: https://docs.google.com/spreadsheets/d/1M1-uAsNzawPsup-_VFa1qi2NBO1_ItCuxU1kfAe2WGw/edit?gid=775374594#gid=775374594
Updated by openqa_review 2 months ago
- Due date set to 2024-12-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler about 2 months ago
There is now https://sd.suse.com/servicedesk/customer/portal/1/SD-174583 as well.
Updated by dheidler about 2 months ago · Edited
Got gitlab config deployment working by adding a gitlab runner on netboot.qe.prg2.suse.org.
(analogous to https://progress.opensuse.org/issues/125519#note-3)
Updated by okurz about 2 months ago
- Due date changed from 2024-12-11 to 2024-12-27
Multiple delays accumulated, also due to #173662 and related issues impacting us negatively.
Now blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-175317
Updated by okurz about 2 months ago
- Status changed from Blocked to Workable
https://sd.suse.com/servicedesk/customer/portal/1/SD-175317 was resolved and you verified it. What's next?
Updated by dheidler about 2 months ago
I could try adding IPv6, but it wouldn't do much - I'm not sure if iPXE even supports it.
Updated by dheidler about 2 months ago
- Status changed from Workable to Blocked
Just the IPv6 DNS entry is missing.
Updated by okurz about 2 months ago
- Status changed from Blocked to Workable
- Priority changed from Normal to Urgent
@dheidler salt minion IDs need to be unique. Now with
baremetal-support.qe.nue2.suse.org
baremetal-support.qe.prg2.suse.org
we have problems like in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1328#note_702464 which need urgent handling. How about renaming the second machine to "baremetal-support2" or "baremetal-support-prg2"?
Updated by dheidler about 2 months ago
- Status changed from Workable to In Progress
I guess our minion handling could be improved here.
This should not break things as the minion ids are already unique.
But as a workaround I will rename both the prg2 and nue2 baremetal-support minion IDs to "baremetal-support-nue2" and respectively for prg2.
This is not in line with our documentation at https://gitlab.suse.de/openqa/salt-states-openqa though.
Updated by mkittler about 2 months ago
The generic dashboards use the "nodename" instead of the minion ID / FQDN. We could easily change that to the FQDN, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1329. As mentioned in the MR, it would cause problems with the InfluxDB queries.
Updated by mkittler about 2 months ago
But as a workaround I will rename both the prg2 and nue2 baremetal-support minion IDs to "baremetal-support-nue2" and respectively for prg2.
It would probably be better to keep the name of the existing/old NUE2 host. Otherwise the Grafana dashboards will look empty.
Updated by okurz about 1 month ago
- Status changed from Resolved to Workable
okurz wrote:
- AC2: bare-metal openQA tests in PRG2 can run without relying on baremetal-support.qe.nue2.suse.org
Please reference at least one verification job showing that this works.
Updated by dheidler about 1 month ago
- Status changed from Workable to Resolved
Updated by jbaier_cz about 1 month ago
- Status changed from Resolved to Feedback
Did the rename happen? I still see failures in the salt pipeline yesterday: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3559987
monitor.qe.nue2.suse.org:
Data failed to compile:
----------
Rendering SLS 'base:monitoring.grafana' failed: while constructing a mapping
in "<unicode string>", line 15, column 1:
grafana:
^
found conflicting ID '/var/lib/grafana/dashboards/generic-baremetal-support.json'
in "<unicode string>", line 921, column 1:
/var/lib/grafana/dashboards/gene ...
^
Updated by dheidler about 1 month ago
- Status changed from Feedback to Resolved
baremetal-support.qe.prg2.suse.org:~ # cat /etc/salt/minion_id ; echo
baremetal-support.qe.prg2.suse.org
Updated by jbaier_cz about 1 month ago
- Related to action #174610: [alert] salt-states-openqa deploy pipeline failed: data failed to compile added
Updated by okurz about 1 month ago
- Status changed from Resolved to Workable
- Priority changed from Normal to Urgent
According to #174610 that's still a problem. As suggested in #168811-22, please rename the hostname of the PRG2 instance and ensure that the deployment works again.
Updated by dheidler about 1 month ago
That comment was about renaming the nue2 instance back, which I already did.
Updated by dheidler about 1 month ago
Maybe salt does some string operations like removing everything after the first dot in the node name, so let's use a minus here.
Even though I really dislike using a node name that doesn't equal the FQDN.
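To illustrate that guess: if the node name were simply the part of the FQDN before the first dot (an assumption, not confirmed from the salt states), both hosts would map to the same dashboard file, which matches the "conflicting ID" error from the pipeline. The helper names below are hypothetical; only the path pattern is taken from the error message.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def nodename(fqdn: str) -> str:
    # Assumption: node name = everything before the first dot of the FQDN.
    return fqdn.split(".", 1)[0]


def dashboard_path(fqdn: str) -> str:
    # Path pattern as seen in the salt error for monitoring.grafana.
    return f"/var/lib/grafana/dashboards/generic-{nodename(fqdn)}.json"


def conflicting_paths(fqdns: Iterable[str]) -> Dict[str, List[str]]:
    """Return dashboard paths that more than one host would generate."""
    by_path: Dict[str, List[str]] = defaultdict(list)
    for fqdn in fqdns:
        by_path[dashboard_path(fqdn)].append(fqdn)
    return {path: hosts for path, hosts in by_path.items() if len(hosts) > 1}


if __name__ == "__main__":
    hosts = [
        "baremetal-support.qe.nue2.suse.org",
        "baremetal-support.qe.prg2.suse.org",
    ]
    for path, owners in conflicting_paths(hosts).items():
        print(path, "<-", ", ".join(owners))
```

With a node name that differs before the first dot (for example a "-prg2" suffix joined with a minus), the two paths no longer collide under this assumption.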
Updated by okurz about 1 month ago
dheidler wrote in #note-34:
Maybe salt does some string operations like removing everything after the first dot in the node name, so let's use a minus here.
Even though I really dislike using a node name that doesn't equal the FQDN.
Exactly. That is why you should update the FQDN. Please call it baremetal-support-prg2.qe.prg2.suse.org.
Updated by dheidler about 1 month ago
- Status changed from Workable to Resolved
baremetal-support-prg2.qe.prg2.suse.org sounds 🤮. Also, DNS changes would involve infra, so it wouldn't be fixed before mid-January (if they are fast), and the ipxe images would need rebuilding as well.
So I added a grain on the machine to set the nodename.
See https://progress.opensuse.org/issues/174610
Updated by okurz about 1 month ago
- Status changed from Resolved to Workable
Please stop messing around with workarounds and calling the ticket "Resolved". How about you create a ticket to have our salt states work properly with the FQDN instead and record the necessary steps to revert your undocumented workaround?
Updated by dheidler about 1 month ago
That is not a workaround; the ticket is resolved.
Especially THIS ticket (see ACs).
In https://progress.opensuse.org/issues/174610?issue_count=6&issue_position=2&next_issue_id=162296&prev_issue_id=168811#note-4 it is documented what was done.
I created https://progress.opensuse.org/issues/174652 as a followup.
Updated by okurz about 1 month ago
- Related to action #174652: Ensure uniqueness of nodenames for generating configs on monitor size:M added