action #133748

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters

Move of openqaworker-arm-1 to FC Basement size:M

Added by okurz 9 months ago. Updated 14 days ago.

Status: Resolved
Priority: Low
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #132614 openqaworker-arm-1 was moved to FC Basement so that we have one hot-redundant aarch64 OSD machine outside of PRG2. For that to be set up we also need to accommodate the automatic recovery feature.

Acceptance criteria

  • AC1: openqaworker-arm-1 runs OSD production jobs again
  • AC2: The automatic recovery of openqaworker-arm-1 on crashes works

Suggestions

Rollback steps


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #158020: salt-states-openqa pipeline times out (Resolved, okurz, 2024-03-26)

Copied to QA - action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M (Resolved, ybonatakis)

Actions #2

Updated by okurz 9 months ago

  • Description updated (diff)

I have now tried to mute the "notification" which triggers the GitLab webhook for openqaworker-arm-1 and added this to the rollback steps.

Actions #3

Updated by mkittler 9 months ago

  • Subject changed from Move of openqaworker-arm-1 to FC Basement to Move of openqaworker-arm-1 to FC Basement size:M
Actions #4

Updated by mkittler 9 months ago

  • Status changed from New to Workable
Actions #5

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 9 months ago

I crosschecked the machine connections and updated racktables. The fibre channel connection is correct as documented in racktables, eth0 in OS is for the physical port SFP+-1. On the DHCP server no request shows up for this network interface. I assume the switch is not configured correctly yet for the SFP port.

Actions #7

Updated by okurz 9 months ago

  • Priority changed from High to Normal
Actions #8

Updated by okurz 8 months ago

  • Priority changed from Normal to Urgent
Actions #9

Updated by okurz 8 months ago

  • Priority changed from Urgent to High

We don't have capacity to work on that many infra tasks with urgent prio, reducing to "High"

Actions #10

Updated by okurz 8 months ago

  • Parent task changed from #130955 to #129280
Actions #11

Updated by okurz 7 months ago

  • Priority changed from High to Normal
  • Target version changed from Ready to future

We will just have to trust the prg2 workers for now

Actions #12

Updated by okurz 5 months ago

  • Target version changed from future to Ready
Actions #13

Updated by okurz 5 months ago

  • Target version changed from Ready to Tools - Next
Actions #14

Updated by livdywan 4 months ago

This is the only workable subtask of #121720 which is a High ticket on the backlog. We should consider having this on the backlog.

Actions #15

Updated by okurz 4 months ago

livdywan wrote in #note-14:

This is the only workable subtask of #121720 which is a High ticket on the backlog. We should consider having this on the backlog.

You are incorrectly reading the ticket dependencies. #121720 is High because of other subtasks, so this ticket does not have a direct impact. In fact, adding this to the backlog would make it less likely that the High blocking sibling subtasks are worked on.

Actions #16

Updated by okurz 3 months ago

  • Target version changed from Tools - Next to Ready
Actions #17

Updated by okurz 3 months ago

  • Priority changed from Normal to Low
Actions #18

Updated by okurz 3 months ago

  • Target version changed from Ready to Tools - Next
Actions #19

Updated by okurz 3 months ago

  • Target version changed from Tools - Next to Ready
Actions #20

Updated by ybonatakis about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #21

Updated by okurz about 1 month ago

I enabled IPv6 on openqaworker-arm-1 and created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 to add the AAAA record
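
A minimal sketch of how the IPv6 state on a worker could be checked from a shell, assuming the standard Linux procfs flag for the "all" interface scope. The function name ipv6_state is hypothetical and not part of any openQA tooling.

```shell
#!/bin/sh
# Sketch: report whether IPv6 is disabled according to a kernel flag file.
# The default path is the standard procfs location; passing another file
# as the first argument is only meant for trying the function out.
ipv6_state() {
    flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
    if [ ! -r "$flag_file" ]; then
        # Flag not readable, for example when IPv6 support is not loaded.
        echo "unknown"
        return 2
    fi
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo "disabled"
        return 1
    fi
    echo "enabled"
}
```

Calling ipv6_state without arguments on the worker prints enabled, disabled or unknown and sets a matching exit status.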

Actions #22

Updated by ybonatakis about 1 month ago

I followed the guide https://progress.opensuse.org/projects/openqav3/wiki/#Distribution-upgrades after checking with Marius that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and https://gitlab.suse.de/search?search=openqaworker-arm-1%20&nav_source=navbar&project_id=419&group_id=8&search_code=true&repository_ref=production are set up.
I made the minimal modifications that vimdiff showed, mostly accepting the old configs.
systemctl --failed shows no errors.
I had to update /etc/hosts. After running
systemctl unmask openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.{service,path}
systemctl start openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.path
I cloned a job [0] which ran successfully.

the file /etc/sysctl.d/99-poo81198.conf shows that IPv6 is disabled. I recommend to read the referenced ticket 81198 and consider removing this file and getting IPv6 fixed first.

I didn't find that file. However, I updated /etc/hosts with an IPv6 address I found on another worker.
Oli had already enabled IPv6, though, and submitted https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936/diffs

I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the PR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf

[0] https://openqa.suse.de/tests/13839219
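
The unmask and start steps from this comment can be sketched as a small script. This is only an illustration under assumptions: the INSTANCE and DRY_RUN variables and the run and unmask_and_start helpers are hypothetical, and the script defaults to printing the systemctl commands instead of executing them.

```shell
#!/bin/sh
# Sketch: re-enable the units of one openQA worker instance after they were masked.
# INSTANCE and DRY_RUN are illustrative knobs, not something defined in the ticket.
INSTANCE="${INSTANCE:-11}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Unmask the worker service and its reload service and path unit,
# then start the worker service and the reload path unit.
unmask_and_start() {
    n="$1"
    run systemctl unmask \
        "openqa-worker-auto-restart@${n}.service" \
        "openqa-reload-worker-auto-restart@${n}.service" \
        "openqa-reload-worker-auto-restart@${n}.path"
    run systemctl start \
        "openqa-worker-auto-restart@${n}.service" \
        "openqa-reload-worker-auto-restart@${n}.path"
}

unmask_and_start "$INSTANCE"
```

Running it with DRY_RUN=0 as root would execute the two systemctl calls for the chosen instance. The brace expansion from the original command is written out because it is not available in plain sh.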

Actions #23

Updated by okurz about 1 month ago

ybonatakis wrote in #note-22:

I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the PR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf

7c2b is osd, 7cef is openqaworker-arm-1 itself

Actions #24

Updated by okurz about 1 month ago

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 merged and effective. ping openqaworker-arm-1.qe.nue2.suse.org yields 2a07:de40:a102:5:1e1b:dff:fe68:7ec7
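
A check like the one above could also be scripted instead of reading ping output by eye. The following is a rough sketch, assuming glibc's getent is available on the machine. The helper names resolve_v6 and has_addr are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether a hostname resolves to an expected IPv6 address.

# Print the unique IPv6 addresses the system resolver returns for a hostname.
# Note: getent ahostsv6 may print IPv4-mapped addresses if no AAAA record exists.
resolve_v6() {
    getent ahostsv6 "$1" | awk '{print $1}' | sort -u
}

# Succeed if the expected address appears as a whole line on stdin.
# The comparison is textual and case-insensitive, so both sides need
# the same notation (for example both in compressed form).
has_addr() {
    grep -qiFx -- "$1"
}
```

A possible use is `resolve_v6 openqaworker-arm-1.qe.nue2.suse.org | has_addr "$expected"`, with expected set to the address from the merge request.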

Actions #25

Updated by okurz about 1 month ago

  • Copied to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Actions #26

Updated by okurz about 1 month ago

  • Due date set to 2024-03-29

As suggested I moved out the auto-recovery part into #157753

So left to be done is removing the suffix in the worker class setting in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and
https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production

Actions #27

Updated by ybonatakis about 1 month ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/750

Update the openQA keys for the worker. CLASS should be fine.

Actions #28

Updated by ybonatakis about 1 month ago

  • Status changed from In Progress to Resolved

With instructions and help from Oli and Nick, the worker is up and running.
https://openqa.suse.de/tests/13846104

Actions #30

Updated by okurz about 1 month ago

  • Related to action #158020: salt-states-openqa pipeline times out added
Actions #31

Updated by okurz 14 days ago

  • Due date deleted (2024-03-29)