action #133748
coordination #121720 (closed): [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters
Move of openqaworker-arm-1 to FC Basement size:M
Description
Motivation
In #132614 openqaworker-arm-1 was moved to FC Basement so that we have one hot-redundant aarch64 OSD machine outside of PRG2. For that setup we also need to accommodate the automatic recovery feature.
Acceptance criteria
- AC1: openqaworker-arm-1 runs OSD production jobs again
- AC2: The automatic recovery of openqaworker-arm-1 on crashes works
Suggestions
- Disable the automatic recovery for openqaworker-arm-1 from the old location
- Mount the machine and connect it back into the network including DHCP/DNS in https://gitlab.suse.de/OPS-Service/salt/ (see the verification sketch after this list)
- Remove old DHCP/DNS entries in https://gitlab.suse.de/OPS-Service/salt/
- Update https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
- Find on https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs how the new PDU can be used
- Integrate the new PDU in https://gitlab.suse.de/openqa/grafana-webhook-actions
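A minimal verification sketch for the DHCP/DNS and worker configuration points above, assuming the machine ends up with the FQDN openqaworker-arm-1.qe.nue2.suse.org (as used later in this ticket) and that OSD is the salt master; exact hostnames and record names remain assumptions until the OPS-Service/salt changes are merged:

```
# Check that forward DNS resolves after the OPS-Service/salt change is deployed
dig +short A openqaworker-arm-1.qe.nue2.suse.org

# On the worker itself: confirm the interface actually obtained a DHCP lease
ip addr show eth0

# On OSD (salt master): confirm the minion responds once its key is accepted
sudo salt 'openqaworker-arm-1*' test.ping
```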
Rollback steps
- Add back openqaworker-arm-1 to salt on OSD (see the sketch after this list)
- After openqaworker-arm-1 is back, remove silences in https://monitor.qa.suse.de/alerting/silences
- Remove the "Mute All times" in https://monitor.qa.suse.de/alerting/routes for
__contacts__ =~ .*"Trigger reboot of openqaworker-arm-1".*
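A minimal sketch of the first rollback step, assuming OSD is the salt master and that the minion ID starts with the hostname:

```
# On OSD: list pending minion keys, accept the worker's key, then apply the high state
sudo salt-key -L
sudo salt-key -y -a 'openqaworker-arm-1*'
sudo salt 'openqaworker-arm-1*' state.apply
```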
Updated by okurz over 1 year ago
- Description updated (diff)
I now tried to mute the "notification" which triggers the GitLab webhook for openqaworker-arm-1 and added that to the rollback steps
Updated by mkittler over 1 year ago
- Subject changed from Move of openqaworker-arm-1 to FC Basement to Move of openqaworker-arm-1 to FC Basement size:M
Updated by okurz over 1 year ago
I cross-checked the machine connections and updated racktables. The fibre channel connection is correct as documented in racktables; eth0 in the OS corresponds to the physical port SFP+-1. No request from this network interface shows up on the DHCP server. I assume the switch is not yet configured correctly for the SFP port.
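A sketch of commands that could help narrow down whether the missing DHCP request is a host-side or a switch-side problem, assuming eth0 is the interface on the SFP+-1 port:

```
# On the worker: is the link on the SFP+ port up at all?
ip -s link show eth0
ethtool eth0 | grep -E 'Speed|Link detected'

# Watch whether DHCP discover/request packets leave the interface at all
sudo tcpdump -ni eth0 port 67 or port 68
```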
Updated by okurz about 1 year ago
- Priority changed from Urgent to High
We don't have the capacity to work on that many infra tasks with urgent priority, reducing to "High"
Updated by okurz about 1 year ago
- Priority changed from High to Normal
- Target version changed from Ready to future
We will just have to trust the PRG2 workers for now
Updated by okurz 10 months ago
livdywan wrote in #note-14:
This is the only workable subtask of #121720 which is a High ticket on the backlog. We should consider having this on the backlog.
You are reading the ticket dependencies incorrectly. #121720 is High because of other subtasks, so this ticket does not have a direct impact. In fact, adding this to the backlog would actually increase the risk that the High blocking sibling subtasks are less likely to be worked on.
Updated by ybonatakis 8 months ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by okurz 8 months ago
I enabled IPv6 on openqaworker-arm-1 and created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 to add the AAAA record
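Once the MR is merged, a quick check could look like this (a sketch; the FQDN openqaworker-arm-1.qe.nue2.suse.org is taken from a later comment in this ticket):

```
# Verify the AAAA record is served and the host answers over IPv6
dig +short AAAA openqaworker-arm-1.qe.nue2.suse.org
ping -6 -c 3 openqaworker-arm-1.qe.nue2.suse.org
```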
Updated by ybonatakis 8 months ago
I followed the guide https://progress.opensuse.org/projects/openqav3/wiki/#Distribution-upgrades after checking with Marius that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and https://gitlab.suse.de/search?search=openqaworker-arm-1%20&nav_source=navbar&project_id=419&group_id=8&search_code=true&repository_ref=production are set up properly.
I only made the minimal modifications that vimdiff showed, mostly accepting the old configs.
systemctl --failed shows no errors
I had to update /etc/hosts and afterwards run systemctl unmask openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.{service,path} followed by
systemctl start openqa-worker-auto-restart@11.service openqa-reload-worker-auto-restart@11.path. I cloned a job[0] which ran successfully.
the file /etc/sysctl.d/99-poo81198.conf shows that IPv6 is disabled. I recommend reading the referenced ticket 81198 and considering removing this file and getting IPv6 fixed first.
I didn't find that file. However, I updated /etc/hosts with the IPv6 address I found on another worker.
However, Oli had already enabled IPv6 and submitted https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936/diffs
I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the MR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf
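A sketch of how one could check on the worker itself whether IPv6 is still disabled by a leftover sysctl override; the file name is taken from the hint quoted above and may simply no longer exist:

```
# Look for the override mentioned in poo#81198 and show the effective sysctl values
ls /etc/sysctl.d/ | grep -i poo81198 || echo "no poo81198 override present"
sysctl net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.eth0.disable_ipv6

# Show which IPv6 addresses are actually configured on the interface
ip -6 addr show dev eth0
```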
Updated by okurz 8 months ago
ybonatakis wrote in #note-22:
[...] I used 2a07:de40:b203:12:0:ff:fe4f:7c2b but the MR uses 2a07:de40:a102:5:1e1b:dff:fe68:7ecf
7c2b is osd, 7cef is openqaworker-arm-1 itself
Updated by okurz 8 months ago
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4936 merged and effective. ping openqaworker-arm-1.qe.nue2.suse.org
yields 2a07:de40:a102:5:1e1b:dff:fe68:7ec7
Updated by okurz 8 months ago
- Copied to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Updated by okurz 8 months ago
- Due date set to 2024-03-29
As suggested I moved the auto-recovery part out into #157753
So what is left to be done is removing the suffix in the worker class setting in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and following
https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
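A sketch of how the remaining steps could be double-checked, assuming a local checkout of salt-pillars-openqa and that OSD is the salt master; the exact suffix on the worker class is not reproduced here:

```
# In a checkout of salt-pillars-openqa: show the worker class values for the machine
grep -n -A 20 'openqaworker-arm-1' openqa/workerconf.sls | grep -i class

# On OSD, after the pillar change is merged and the key is accepted: apply the state
sudo salt 'openqaworker-arm-1*' state.apply
sudo salt 'openqaworker-arm-1*' cmd.run "systemctl list-units 'openqa-worker*' --no-legend"
```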
Updated by ybonatakis 8 months ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/750
This updates the openQA keys for the worker. The worker class should be fine
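A sketch for checking that the key update from the MR above actually reached the worker, assuming the keys end up in /etc/openqa/client.conf as usual for openQA workers and reusing worker instance 11 from an earlier comment:

```
# On the worker: confirm the API key/secret for openqa.suse.de got deployed
sudo grep -A 2 '\[openqa.suse.de\]' /etc/openqa/client.conf

# Confirm a worker instance is running and registering against OSD
sudo systemctl status openqa-worker-auto-restart@11.service --no-pager
sudo journalctl -u openqa-worker-auto-restart@11.service -n 20 --no-pager
```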
Updated by ybonatakis 8 months ago
- Status changed from In Progress to Resolved
Following the instructions and with help from Oli and Nick, the worker is up and running.
https://openqa.suse.de/tests/13846104
Updated by okurz 8 months ago
- Related to action #158020: salt-states-openqa pipeline times out added