action #133748
coordination #121720 (closed): [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters
Move of openqaworker-arm-1 to FC Basement size:M
Description
Motivation
In #132614 openqaworker-arm-1 was moved to FC Basement so that we have one hot-redundant aarch64 OSD machine outside of PRG2. For that setup we also need to accommodate the automatic recovery feature.
Acceptance criteria
- AC1: openqaworker-arm-1 runs OSD production jobs again
- AC2: The automatic recovery of openqaworker-arm-1 on crashes works
Suggestions
- Disable the automatic recovery for openqaworker-arm-1 from the old location
- Mount the machine and connect it back into the network including DHCP/DNS in https://gitlab.suse.de/OPS-Service/salt/ (see the verification sketch after this list)
- Remove old DHCP/DNS entries in https://gitlab.suse.de/OPS-Service/salt/
- Update https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
- Find on https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs how the new PDU can be used
- Integrate the new PDU in https://gitlab.suse.de/openqa/grafana-webhook-actions
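A minimal verification sketch for the DHCP/DNS and worker configuration points above, assuming the machine ends up with the FQDN openqaworker-arm-1.qe.nue2.suse.org (as used later in this ticket) and that OSD is the salt master; exact hostnames and record names remain assumptions until the OPS-Service/salt changes are merged:

```
# Check that forward DNS resolves after the OPS-Service/salt change is deployed
dig +short A openqaworker-arm-1.qe.nue2.suse.org

# On the worker itself: confirm the interface actually obtained a DHCP lease
ip addr show eth0

# On OSD (salt master): confirm the minion responds once its key is accepted
sudo salt 'openqaworker-arm-1*' test.ping
```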
Rollback steps
- Add back openqaworker-arm-1 to salt on OSD (see the sketch after this list)
- After openqaworker-arm-1 is back, remove silences in https://monitor.qa.suse.de/alerting/silences
- Remove the "Mute All times" in https://monitor.qa.suse.de/alerting/routes for
__contacts__ =~ .*"Trigger reboot of openqaworker-arm-1".*
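A minimal sketch of the first rollback step, assuming OSD is the salt master and that the minion ID starts with the hostname:

```
# On OSD: list pending minion keys, accept the worker's key, then apply the high state
sudo salt-key -L
sudo salt-key -y -a 'openqaworker-arm-1*'
sudo salt 'openqaworker-arm-1*' state.apply
```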
Updated by okurz over 1 year ago
- Description updated (diff)
I now tried to mute the "notification" which triggers the GitLab webhook for openqaworker-arm-1 and added that to the rollback steps
Updated by mkittler over 1 year ago
- Subject changed from Move of openqaworker-arm-1 to FC Basement to Move of openqaworker-arm-1 to FC Basement size:M
Updated by okurz over 1 year ago
I cross-checked the machine connections and updated racktables. The fibre channel connection is correct as documented in racktables; eth0 in the OS corresponds to the physical port SFP+-1. No request from this network interface shows up on the DHCP server. I assume the switch is not yet configured correctly for the SFP port.
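A sketch of commands that could help narrow down whether the missing DHCP request is a host-side or a switch-side problem, assuming eth0 is the interface on the SFP+-1 port:

```
# On the worker: is the link on the SFP+ port up at all?
ip -s link show eth0
ethtool eth0 | grep -E 'Speed|Link detected'

# Watch whether DHCP discover/request packets leave the interface at all
sudo tcpdump -ni eth0 port 67 or port 68
```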
Updated by okurz about 1 year ago
- Priority changed from Urgent to High
We don't have the capacity to work on that many infra tasks with urgent priority, reducing to "High"
Updated by okurz about 1 year ago
- Priority changed from High to Normal
- Target version changed from Ready to future
We will just have to trust the PRG2 workers for now
Updated by okurz 10 months ago
livdywan wrote in #note-14:
This is the only workable subtask of #121720 which is a High ticket on the backlog. We should consider having this on the backlog.
You are reading the ticket dependencies incorrectly. #121720 is High because of other subtasks, so this ticket does not have a direct impact. In fact, adding this to the backlog would actually increase the risk that the High blocking sibling subtasks are less likely to be worked on.
Updated by ybonatakis 8 months ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by okurz 8 months ago
I enabled IPv6 on openqaworker-arm-1 and created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 to add the AAAA record
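Once the MR is merged, a quick check could look like this (a sketch; the FQDN openqaworker-arm-1.qe.nue2.suse.org is taken from a later comment in this ticket):

```
# Verify the AAAA record is served and the host answers over IPv6
dig +short AAAA openqaworker-arm-1.qe.nue2.suse.org
ping -6 -c 3 openqaworker-arm-1.qe.nue2.suse.org
```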
Updated by ybonatakis 8 months ago
I followed the guide https://progress.opensuse.org/projects/openqav3/wiki/#Distribution-upgrades after checking with Marius that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and https://gitlab.suse.de/search?search=openqaworker-arm-1%20&nav_source=navbar&project_id=419&group_id=8&search_code=true&repository_ref=production are set up properly.
I only made the minimal modifications that vimdiff showed, mostly accepting the old configs.
systemctl --failed shows no errors
I had to update /etc/hosts and afterwards run systemctl unmask openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.{service,path} followed by
systemctl start openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.path. I cloned a job[0] which ran successfully.
the file /etc/sysctl.d/99-poo81198.conf shows that IPv6 is disabled. I recommend reading the referenced ticket 81198 and considering removing this file and getting IPv6 fixed first.
I didn't find that file. However, I updated /etc/hosts with the IPv6 address I found on another worker.
However, Oli had already enabled IPv6 and submitted https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936/diffs
I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the MR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf
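A sketch of how one could check on the worker itself whether IPv6 is still disabled by a leftover sysctl override; the file name is taken from the hint quoted above and may simply no longer exist:

```
# Look for the override mentioned in poo#81198 and show the effective sysctl values
ls /etc/sysctl.d/ | grep -i poo81198 || echo "no poo81198 override present"
sysctl net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.eth0.disable_ipv6

# Show which IPv6 addresses are actually configured on the interface
ip -6 addr show dev eth0
```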
Updated by okurz 8 months ago
ybonatakis wrote in #note-22:
[...] I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the MR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf
7c2b is osd, 7cef is openqaworker-arm-1 itself
Updated by okurz 8 months ago
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 merged and effective. ping openqaworker-arm-1.qe.nue2.suse.org
yields 2a07:de40:a102:5:1e1b:dff:fe68:7ec7
Updated by okurz 8 months ago
- Copied to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Updated by okurz 8 months ago
- Due date set to 2024-03-29
As suggested I moved the auto-recovery part out into #157753
So what is left to be done is removing the suffix in the worker class setting in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and following
https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
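A sketch of how the remaining steps could be double-checked, assuming a local checkout of salt-pillars-openqa and that OSD is the salt master; the exact suffix on the worker class is not reproduced here:

```
# In a checkout of salt-pillars-openqa: show the worker class values for the machine
grep -n -A 20 'openqaworker-arm-1' openqa/workerconf.sls | grep -i class

# On OSD, after the pillar change is merged and the key is accepted: apply the state
sudo salt 'openqaworker-arm-1*' state.apply
sudo salt 'openqaworker-arm-1*' cmd.run "systemctl list-units 'openqa-worker*' --no-legend"
```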
Updated by ybonatakis 8 months ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/750
This updates the openQA keys for the worker. The worker class should be fine
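A sketch for checking that the key update from the MR above actually reached the worker, assuming the keys end up in /etc/openqa/client.conf as usual for openQA workers and reusing worker instance 11 from an earlier comment:

```
# On the worker: confirm the API key/secret for openqa.suse.de got deployed
sudo grep -A 2 '\[openqa.suse.de\]' /etc/openqa/client.conf

# Confirm a worker instance is running and registering against OSD
sudo systemctl status openqa-worker-auto-restart@11.service --no-pager
sudo journalctl -u openqa-worker-auto-restart@11.service -n 20 --no-pager
```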
Updated by ybonatakis 8 months ago
- Status changed from In Progress to Resolved
Following the instructions and with help from Oli and Nick, the worker is up and running.
https://openqa.suse.de/tests/13846104
Updated by okurz 8 months ago
- Related to action #158020: salt-states-openqa pipeline times out added