action #133748
coordination #121720 (closed): [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters
Move of openqaworker-arm-1 to FC Basement size:M
0%
Description
Motivation
In #132614 openqaworker-arm-1 was moved to FC Basement so that we have one hot-redundant aarch64 OSD machine outside of PRG2. To set that up we also need to accommodate the automatic recovery feature.
Acceptance criteria
- AC1: openqaworker-arm-1 runs OSD production jobs again
- AC2: The automatic recovery of openqaworker-arm-1 on crashes works
Suggestions
- Disable the automatic recovery for openqaworker-arm-1 from the old location
- Mount the machine and connect it back into the network including DHCP/DNS in https://gitlab.suse.de/OPS-Service/salt/
- Remove old DHCP/DNS entries in https://gitlab.suse.de/OPS-Service/salt/
- Update https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
- Find on https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs how the new PDU can be used
- Integrate the new PDU in https://gitlab.suse.de/openqa/grafana-webhook-actions
Rollback steps
- Add back openqaworker-arm-1 to salt on OSD
- After openqaworker-arm-1 is back, remove silences in https://monitor.qa.suse.de/alerting/silences
- Remove the "Mute All times" in https://monitor.qa.suse.de/alerting/routes for
__contacts__ =~ .*"Trigger reboot of openqaworker-arm-1".*
Updated by okurz 9 months ago
I crosschecked the machine connections and updated racktables. The fibre channel connection is correct as documented in racktables, eth0 in OS is for the physical port SFP+-1. On the DHCP server no request shows up for this network interface. I assume the switch is not configured correctly yet for the SFP port.
Updated by okurz 4 months ago
livdywan wrote in #note-14:
This is the only workable subtask of #121720 which is a High ticket on the backlog. We should consider having this on the backlog.
You are reading the ticket dependencies incorrectly. #121720 is High because of other subtasks, so this ticket does not have a direct impact. In fact, adding this to the backlog would increase the risk that the High blocking sibling subtasks are not worked on.
Updated by ybonatakis about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by okurz about 1 month ago
I enabled IPv6 on openqaworker-arm-1 and created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 to add the AAAA record
Updated by ybonatakis about 1 month ago
I followed the guide https://progress.opensuse.org/projects/openqav3/wiki/#Distribution-upgrades, after checking with Marius that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and https://gitlab.suse.de/search?search=openqaworker-arm-1%20&nav_source=navbar&project_id=419&group_id=8&search_code=true&repository_ref=production are set up and in place.
I did the minimal modifications which vimdiff shows, accepting mostly the old configs.
systemctl --failed shows no errors.
I had to update /etc/hosts. After running
systemctl unmask openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.{service,path}
systemctl start openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.path
I cloned a job[0] which ran successfully.
the file /etc/sysctl.d/99-poo81198.conf shows that IPv6 is disabled. I recommend to read the referenced ticket 81198 and consider removing this file and getting IPv6 fixed first.
I didn't find that file. However, I updated /etc/hosts with an IPv6 address I found on another worker.
However, Oli had already enabled IPv6 and submitted https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936/diffs
I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the PR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf
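The unmask and start steps from this comment could be wrapped in a small shell helper, sketched below. The function names `worker_units` and `restore_worker` and the `DRY_RUN` variable are hypothetical and only for illustration. The unit names and the instance number 11 are taken from the comment.

```shell
#!/bin/sh
# Hypothetical helper: print the three unit names for one worker instance.
worker_units() {
    instance="$1"
    printf '%s\n' \
        "openqa-worker-auto-restart@${instance}.service" \
        "openqa-reload-worker-auto-restart@${instance}.service" \
        "openqa-reload-worker-auto-restart@${instance}.path"
}

# Unmask all three units, then start the worker service and the reload path unit.
# DRY_RUN=1 is an assumption of this sketch: the commands are printed, not executed.
restore_worker() {
    instance="$1"
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    # Word splitting of the command substitution is intended here.
    $run systemctl unmask $(worker_units "$instance")
    $run systemctl start \
        "openqa-worker-auto-restart@${instance}.service" \
        "openqa-reload-worker-auto-restart@${instance}.path"
}
```

Running `DRY_RUN=1 restore_worker 11` prints the two systemctl invocations so they can be reviewed first. Without `DRY_RUN` the commands need root privileges.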
Updated by okurz about 1 month ago
ybonatakis wrote in #note-22:
However, Oli had already enabled IPv6 and submitted https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936/diffs
I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the PR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf
7c2b is osd, 7cef is openqaworker-arm-1 itself
Updated by okurz about 1 month ago
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 is merged and effective. ping openqaworker-arm-1.qe.nue2.suse.org yields 2a07:de40:a102:5:1e1b:dff:fe68:7ec7
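Since several similar-looking addresses came up in this thread, the lookup could also be compared against an expected value by a script. This is a minimal sketch, assuming glibc `getent` is available on the machine. The function names `first_ipv6` and `check_aaaa` are hypothetical. The comparison is a plain string match, so the expected address has to be written in the same compressed form that `getent` prints.

```shell
#!/bin/sh
# Print the address column of the first line of `getent ahostsv6`-style
# output read from stdin.
first_ipv6() {
    awk 'NR == 1 { print $1 }'
}

# Hypothetical check: does HOST resolve to EXPECTED over IPv6?
check_aaaa() {
    host="$1"
    expected="$2"
    actual="$(getent ahostsv6 "$host" | first_ipv6)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $host resolves to $actual"
    else
        echo "MISMATCH: $host resolves to '${actual}', expected '${expected}'"
        return 1
    fi
}
```

For example, `check_aaaa openqaworker-arm-1.qe.nue2.suse.org 2a07:de40:a102:5:1e1b:dff:fe68:7ec7` uses the host name and address from the comment above and returns a non-zero status on a mismatch.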
Updated by okurz about 1 month ago
- Copied to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Updated by okurz about 1 month ago
- Due date set to 2024-03-29
As suggested, I moved the auto-recovery part out into #157753.
What is left to be done is removing the suffix in the worker class setting in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and following
https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
Updated by ybonatakis about 1 month ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/750
This updates the openQA keys for the worker. CLASS should be fine.
Updated by ybonatakis about 1 month ago
- Status changed from In Progress to Resolved
With instructions and help from Oli and Nick, the worker is up and running.
https://openqa.suse.de/tests/13846104
Updated by okurz about 1 month ago
- Related to action #158020: salt-states-openqa pipeline times out added