action #132137: Setup new PRG2 openQA worker for osd size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #132137

closed

QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA (public) - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Setup new PRG2 openQA worker for osd size:M

Added by okurz almost 2 years ago. Updated over 1 year ago.

Status:

Resolved

Priority:

Urgent

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-06-29

Due date:

% Done:

Estimated time:

Tags:

openQA, worker, osd, infra, dct migration, prg2

Description

Motivation¶

New hardware was ordered to serve as openQA workers for osd. We can connect those machines to the osd webUI VM instance regardless if it's running still from NUE1 or from PRG2.

Acceptance criteria¶

AC1: osd multi-machine jobs run successfully on new PRG2 openQA workers

Suggestions¶

DONE (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/a27f3d50872a8f7aff127b20a409dcdc1c28d91b): Track https://jira.suse.com/browse/ENGINFRA-2379 "PRG2 IPMI for QA" to be able to remote control
Track https://jira.suse.com/browse/ENGINFRA-1742 "Build OpenQA Environment" which is the neighboring story of the osd VM being migrated
Wait for Eng-Infra to inform us about the availability of the network and machines
Ensure we can connect over IPMI
Include IPMI contact details and workerconf in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
Follow https://gitlab.suse.de/openqa/salt-states-openqa

Rollback steps¶

Moved to #134948* Remove silence for "host up" alerts for worker33…39 alertname=~(d[0-9]*|worker[0-9]*): host up alert in https://stats.openqa-monitor.qa.suse.de/alerting/silences
DONE Add back arm-worker2.oqa.prg2.suse.org (or worker-arm2 correspondingly)

Out of scope¶

Ensure that osd can work without relying on any physical machine in NUE1

Related issues 7 (0 open — 7 closed)

Related to openQA Infrastructure (public) - action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M

Resolved

okurz

2023-06-22

Actions

Related to openQA Infrastructure (public) - action #132827: [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M

Resolved

ybonatakis

2023-07-17

Actions

Related to openQA Infrastructure (public) - action #133892: [alert] arm-worker2 (arm-worker2: host up alert openQA host_up_alert_arm-worker2 worker size:M

Resolved

mkittler

2023-08-07

2023-08-25

Actions

Related to openQA Project (public) - action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M

Resolved

dheidler

2023-07-19

2023-10-31

Actions

Copied from openQA Infrastructure (public) - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M

Resolved

dheidler

2023-06-29

Actions

Copied to QA (public) - action #132158: Ensure that osd can work without relying on any physical machine in NUE1 size:M

Resolved

okurz

2023-06-29

Actions

Copied to openQA Infrastructure (public) - action #134912: Gradually phase out NUE1 based openQA workers size:M

Resolved

okurz

Actions

Copy link

Updated by okurz almost 2 years ago

Copied from action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added

Actions

Copy link

Updated by okurz almost 2 years ago

Subject changed from Setup new PRG2 openQA worker for osd to Setup new PRG2 openQA worker for osd size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by okurz almost 2 years ago

Copied to action #132158: Ensure that osd can work without relying on any physical machine in NUE1 size:M added

Actions

Copy link

Updated by okurz almost 2 years ago

Status changed from Workable to Blocked
Assignee set to okurz

Waiting for IPMI access, see #132134

Actions

Copy link

Updated by okurz over 1 year ago

Status changed from Blocked to In Progress

From https://suse.slack.com/archives/C04MDKHQE20/p1689317691714229

(Martin Caj) @Oliver Kurz I send you and email with encrypted file. There you find how to get into new workers in PRG2 please test it and let me know.
(Oliver Kurz) @Martin Caj according to https://racktables.nue.suse.com/index.php?page=rack&rack_id=21278 those machines are openqaworker1…12 which are intended for openqa.suse.de but our priority should be https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 for openqa.opensuse.org aka openQA.DMZ
(Martin Caj) I know we are working on DMZ now
(Oliver Kurz)

We can login as "jumpy" and ping all hosts from /etc/hosts and reach them over IPMI

according to https://racktables.nue.suse.com/index.php?page=rack&rack_id=21278 those machines are openqaworker1…12 which are intended for openqa.suse.de but our priority should be https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 for openqa.opensuse.org aka openQA.DMZ

ssh over IPv6 takes very long. Likely the machine is only connected over IPv4 but has a AAAA record

The decision was to start numbering machines starting with 21 to prevent confusion with the already existing machines at other locations so this needs to be changed for both o3+osd machines

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/560 for adding credentials in the current state.

Actions

Copy link

Updated by okurz over 1 year ago

Trying

linuxefi (http,195.135.221.134)/distribution/openSUSE-current/repo/oss/boot/x86_64/loader/linux install=http://download.opensuse.org/distribution/openSUSE-current/repo/oss console=ttyS0,115200 autoyast=http://s.qa.suse.de/oqa-ay-lp rootpassword=susetesting
initrdefi (http,195.135.221.134)/distribution/openSUSE-current/repo/oss/boot/x86_64/loader/initrd
boot

takes about 1-2 seconds after linuxefi and about 60-90s after initrd. After boot nothing shows up maybe because ttyS0 is not correct. I would need to try that with ttyS1 and ttyS2 as well. So ttyS1 works \o/

Conducted the autoyast installation and then logged in over SOL and patched missing repos:

for i in update/leap/\$releasever/non-oss distribution/leap/\$releasever/repo/oss update/leap/\$releasever/oss distribution/leap/\$releasever/repo/non-oss; do zypper ar https://download.opensuse.org/$i $i; done
zypper mr -f 8
zypper mr -f 9

hostname -f tells that the machine knows itself as d105.oqa.prg2.suse.org. Discussed with mcaj in https://suse.slack.com/archives/C04MDKHQE20/p1689317691714229:

(Oliver Kurz) @Martin Caj I see https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 with new names, that's nice. Will you then name the OSD (openqa.suse.de) ones in J11 worker29 and above?
(Martin Caj) well I do not know how to add them into OSD ... but FQDN:
for internal workers should be oqa.prg2.suse.org
for DMZ workres than oqa.opensuse.org
The intertnal one need to be register at suttner1.oqa.prg2.suse.org DNS/DHCP this I can do it.
for DMZ I do not know but I assume that you had dhcp running on ariel server, but If you need we can build separate par for DHCP/DNS servers
(Oliver Kurz) We can handle the registration against the DHCP server if it's the same salt repo. For DMZ the decision (I think your suggestion?) was to have a separate DHCP/DNS server same as you maintain for the other networks. However we can also still just try dnsmasq from ariel
(Martin Caj) internal one: we have it https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml
but hosts there are just dummy one we need to replace them
DNS its under https://gitlab.suse.de/OPS-Service/salt/-/tree/production/salt/profile/dns/files/oqa_prg2_suse_org
(Oliver Kurz) got it. We will prepare an MR so that you can focus on getting the DMZ up, ok?

Then I followed https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines and https://gitlab.suse.de/openqa/salt-states-openqa#setup-production-machine . So I put worker29.oqa.prg2.suse.org into /etc/hostname and such and will create a merge request for the DHCP/DNS entry.

So from our salt pillars:

sed -n '/prg2.suse.org/s/^.*serial: `\([^`]*\).*$/\1/p' openqa/workerconf.sls | sed 's/sol activate/raw 0x30 0x21 | tail -c 18 | sed \\\"s@ @:@g\\\"/' | (while read -r cmd; do echo $cmd && eval $cmd; done)

which unfortunately stops after the first eval so what we did is create a list of commands we can execute on qe-jumpy directly then:

sed -n '/prg2.suse.org/s/^.*serial: `\([^`]*\).*$/\1/p' openqa/workerconf.sls | sed -e 's/^[^"]*"//' -e 's/"$//' -e 's/sol activate/raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"/' | (while read -r cmd; do echo $cmd; done)

and copy-paste that into the ssh which yields:

jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker1.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:9c
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker2.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:dc:34
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker3.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:70
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker4.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
3c:ec:ef:fe:0a:c4
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker5.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:2a
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker6.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:dd:e8
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker7.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:cc
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker8.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:de
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker9.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:dc:4c
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker10.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:d2
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker11.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:ae
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker12.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:ce
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker-arm1.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
Unable to send RAW command (channel=0x0 netfn=0x30 lun=0x0 cmd=0x21 rsp=0xc1): Invalid command
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker-arm2.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
Unable to send RAW command (channel=0x0 netfn=0x30 lun=0x0 cmd=0x21 rsp=0xc1): Invalid command

so works fine for x86_64, it's a start. Created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3761

Actions

Copy link

Updated by openqa_review over 1 year ago

Due date set to 2023-07-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

Updated by okurz over 1 year ago

Project changed from 46 to openQA Infrastructure (public)
Category deleted (~~Infrastructure~~)

Actions

Copy link

Updated by livdywan over 1 year ago

Some of those machines already established connection to the salt master and there is a valid workerconf to start with.
Anyone of you can continue by accepting the salt key, applying the salt high state and scheduling specific openQA test jobs on those including multi-machine

I'm quoting Oli here to highlight that everyone is welcome to pick this up

Actions

Copy link

#10

Updated by okurz over 1 year ago

Status changed from In Progress to Workable

Actions

Copy link

#11

Updated by okurz over 1 year ago

Due date deleted (~~2023-07-29~~)
Priority changed from Normal to Urgent

Blocked by #133250, tracker for https://sd.suse.com/servicedesk/customer/portal/1/SD-128313 (resolved)

Actions

Copy link

#12

Updated by okurz over 1 year ago

Status changed from Workable to Blocked

gitlab is back but https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3761 still not merged

Actions

Copy link

#13

Updated by okurz over 1 year ago

Status changed from Blocked to Workable
Assignee deleted (~~okurz~~)

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3761 now merged.

host worker37.oqa.prg2.suse.org
worker37.oqa.prg2.suse.org has address 10.145.10.10
worker37.oqa.prg2.suse.org has IPv6 address 2a07:de40:b203:12:10:145:10:10

so good to continue.

Actions

Copy link

#14

Updated by mkittler over 1 year ago

Description updated (diff)

Actions

Copy link

#15

Updated by mkittler over 1 year ago

Status changed from Workable to In Progress
Assignee set to mkittler

Actions

Copy link

#16

Updated by mkittler over 1 year ago

I suppose all relevant machines are mentioned in commit https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/a27f3d50872a8f7aff127b20a409dcdc1c28d91b. They are part of rack https://racktables.suse.de/index.php?page=object&object_id=23028 or https://racktables.suse.de/index.php?page=rack&rack_id=21282.

Looks like Leap 15.5 has already been installed on hosts worker29.oqa.prg2.suse.org to worker32.oqa.prg2.suse.org so worker33.oqa.prg2.suse.org to worker40.oqa.prg2.suse.org and worker-arm1.oqa.prg2.suse.org and worker-arm2.oqa.prg2.suse.org are missing.

However, I cannot establish an IPMI connection to the remaining hosts except 36, 37, 39 and arm1. It seems that on 36, 37 and 39 there is something running (all one gets via sol is a timer) and on arm1 one is prompted "pingubox login:". So supposedly it makes sense to wait with the OS installation on those machines. Maybe I should ask about those hosts in Slack as mentioned in a Jira comment.

Judging by the output of salt-key -L on OSD the workers 29 to 32 have even already Salt running. Not sure whether it is wanted to accept those connections at this point.

By the way, the IPMI password of some hosts contains exclamation marks. So before running the commands from workerconf.sls it makes sense to use set +H to avoid having to escape those. Some passwords also contain $ which requires extra escaping that is not present in workerconf.sls. However, even then I couldn't connect via IPMP to e.g. worker33.

Actions

Copy link

#17

Updated by okurz over 1 year ago

mkittler wrote:

Judging by the output of salt-key -L on OSD the workers 29 to 32 have even already Salt running. Not sure whether it is wanted to accept those connections at this point.

Yes, but only while being closely monitored :) so whenever one of us is available for some time, accept the salt key, apply the high state and monitor openQA jobs that the machines accept - assuming the workers have the connection to osd. And, you know, gre tunnels and multi machine and such :)

Actions

Copy link

#18

Updated by openqa_review over 1 year ago

Due date set to 2023-08-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#19

Updated by mkittler over 1 year ago

Looks like it wasn't the escaping but just a bad connection. One has to retry multiple times to get past errors like "no authcode provided". We're nevertheless currently changing the passwords to avoid this confusion. The "timers" are actually the system time shown by the BIOS screen. One can see more by pressing the arrow keys.

I still cannot connect to arm2 via IPMI. I created an SD ticket for that: https://sd.suse.com/servicedesk/customer/portal/1/SD-128708

Actions

Copy link

#20

Updated by okurz over 1 year ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/573 for password changes as discussed. And https://sd.suse.com/servicedesk/customer/portal/1/SD-128710 for alignment of IPMI hostnames on jumpy. Added installation instructions over https://wiki.suse.net/index.php/OpenQA#Remote_management_with_IPMI_and_BMC_tools

Actions

Copy link

#21

Updated by mkittler over 1 year ago

I was able to setup all machines up to 39. (~~Bare installation. Salt is still missing.~~)

I was not able to setup 40 because it cannot boot via UEFI HTTP boot. Maybe the plug is not connected? I'm aware that this machine has two ethernet port but I've tried both and it times out on both. EDIT: I created an SD ticket for that: https://sd.suse.com/servicedesk/customer/portal/1/SD-128721

I was not able to setup arm1 because I still have to figure out how to proceed from that "pingubox login" prompt.

Actions

Copy link

#22

Updated by mkittler over 1 year ago

I was now able to setup everything except worker 40 and arm worker 2. I will deal with possible failing systemd units and failing salt states tomorrow.

I updated the Wiki with the commands used on the arm worker (which differed a little bit). We also had to manually amend the filesystem layout on the arm worker.

Actions

Copy link

#23

Updated by okurz over 1 year ago

Description updated (diff)
Assignee deleted (~~mkittler~~)

Soon after as expected the hosts ran into an unresponsive salt-minion due to https://bugzilla.opensuse.org/show_bug.cgi?id=1212816 . It looks like you haven't applied the workaround from #131249 so I am doing that now over ssh as salt is not responsive:

for i in {29..40}; do echo "## $i" && ssh -4 -o StrictHostKeyChecking=no worker$i.oqa.prg2.suse.org 'sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-3004-150400.8.25.1.x86_64.rpm http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-minion-3004-150400.8.25.1.x86_64.rpm http://download.opensuse.org/update/leap/15.4/sle/x86_64/python3-salt-3004-150400.8.25.1.x86_64.rpm && sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt'; done

Multiple alerts received, among them:

[FIRING:1] d105: host up alert openQA (d105 host_up_alert_d105 worker)
[FIRING:1] worker33: host up alert openQA (worker33 host_up_alert_worker33 worker)
…
[FIRING:1] worker39: host up alert openQA (worker39 host_up_alert_worker39 worker)

at least consistently for all :)

Added silences and according rollback steps in ticket.

Actions

Copy link

#24

Updated by okurz over 1 year ago

Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added

Actions

Copy link

#25

Updated by okurz over 1 year ago

Assignee set to mkittler

deleted assignee by accident

Actions

Copy link

#26

Updated by mkittler over 1 year ago

I have also applied the workaround on the arm worker now:

sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/aarch64/salt-3004-150400.8.25.1.aarch64.rpm http://download.opensuse.org/update/leap/15.4/sle/aarch64/salt-minion-3004-150400.8.25.1.aarch64.rpm http://download.opensuse.org/update/leap/15.4/sle/aarch64/python3-salt-3004-150400.8.25.1.aarch64.rpm && sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion python3-salt

Actions

Copy link

#27

Updated by mkittler over 1 year ago

Description updated (diff)

https://sd.suse.com/servicedesk/customer/portal/1/SD-128708 has been resolved now so I'm currently installing arm2.

Actions

Copy link

#28

Updated by okurz over 1 year ago

Description updated (diff)

Actions

Copy link

#29

Updated by mkittler over 1 year ago

The problem that is causing the host up alert is that the new workers are in fact not pingable from OSD (not via IPv4 and not via IPv6). I'm not sure how big of a problem we consider this because the OSD VM is going to be moved soon anyways. Supposedly we can for now just keep the silence and sort this out once the OSD VM has been moved.

There's another problem: Our NVMe setup script cannot cope with the available disks/partitions on the new workers. For arm1 we mitigated this by changing the partitioning on the single available NVMe. The other workers actually have multiple SSDs so I'm currently trying to improve our script to be able to make use of them. If that doesn't work we still can configure it manually via the grain-approach I have introduced for the sap workers.

EDIT: MR for the NVMe problem: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/933

Actions

Copy link

#30

Updated by mkittler over 1 year ago

Status changed from In Progress to Feedback

Now arm-2 is setup as well. This leaves only worker40 which is still blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-128721.

Actions

Copy link

#31

Updated by okurz over 1 year ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/574 for an update of IPMI hostnames

Actions

Copy link

#32

Updated by mkittler over 1 year ago

Status changed from Feedback to In Progress

Thanks, it looks good.

Meanwhile https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/933 was merged and applied. That fixed the NVMe setup on the workers as expected.

I created some test jobs but they fail with download errors (e.g. https://openqa.suse.de/tests/11724074). I still have to figure out why that's the case. Downloading manually via wget works just fine.

Actions

Copy link

#33

Updated by mkittler over 1 year ago

Looks like the downloading works after rebooting. After rebooting all machines I'm getting mixed results but some tests are passing at least.

For some reason arm2 booted into GRUB provided via PXE. I'll have to find out why that happened. The boot prios look fine in the setup menu and the system generally boots the installed system.

All jobs have passed without further problems except on worker32 which had problems determining its own hostname. After a reboot it seemed better but the corresponding tests failed: https://openqa.suse.de/tests/11733211

I tried to run tests on arm workers as well but it seems https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3801 is not effective yet.

Actions

Copy link

#34

Updated by okurz over 1 year ago

I just checked

okurz@worker38:~> for i in $(sudo ovs-vsctl show | sed -n 's/^.*remote_ip="\([^"]*\)"}/\1/p'); do ping -c1 $i; done

which looks good. So at least this worker, assumingly all PR2 workers then, can reach all other OSD workers. Time to test some more multi-machine then :)

Actions

Copy link

#35

Updated by mkittler over 1 year ago

I have run a job across 29 and 30. That worked: https://openqa.suse.de/tests/11732858 - I'll do more tests between more machines.

The developer mode doesn't work. First I thought it is due to a firewall issue but it is the same problem that causes the ping alert - which is that we just cannot reach those hosts from OSD at all. I guess that problem will resolve itself once OSD is migrated to Prague as well.

Actions

Copy link

#36

Updated by mkittler over 1 year ago

Some more MM tests covering all workers and at least one connection to another worker:

(Original job is https://openqa.suse.de/tests/11162867.)

Actions

Copy link

#37

Updated by mkittler over 1 year ago

2nd attempt for DNS/host setup of arm workers: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3812

We'll also need this change for MM tests: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/575

Actions

Copy link

#38

Updated by okurz over 1 year ago

Description updated (diff)

arm-worker2.oqa.prg2.suse.org was not responsive in salt so I removed that salt key to unblock salt state application. Added according rollback step in ticket description.

Actions

Copy link

#39

Updated by mkittler over 1 year ago

Looks like that 2nd attempt worked. So only a few things left to look into before I can enable the workers in production:

I've now run MM jobs on the arm workers: https://openqa.suse.de/tests/11739952 - The tests haven't finished yet but it already looks like they're failing. Maybe the mentioned unresponsive salt is the problem here. (So tap was not fully/correctly setup due to it.)
Then there's also this MM failure on worker32 I still have to look into: https://openqa.suse.de/tests/11733211#step/before_test/22
One cannot reach the workers via ssh or http from OSD breaking e.g. the developer mode. Alerts like https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_worker33/view are also still failing. I mentioned it on the dct channel.

Actions

Copy link

#40

Updated by okurz over 1 year ago

Related to action #132827: [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M added

Actions

Copy link

#41

Updated by okurz over 1 year ago

Related to action #133892: [alert] arm-worker2 (arm-worker2: host up alert openQA host_up_alert_arm-worker2 worker size:M added

Actions

Copy link

#42

Updated by mkittler over 1 year ago

I've been restarting MM jobs on arm workers: https://openqa.suse.de/tests/11768364
- Maybe it works now after ensuring that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/575 is properly applied on arm2. Maybe also another restart is required.
Test MM job between worker33 and 34: https://openqa.suse.de/tests/11768461 - I suppose it'll pass and thus worker32 was the culprit causing the MM job between 32 and 33 to fail.
- EDIT: It has just passed. So worker32 is the culprit. Restarted the test on worker32 again to check whether it was just a temporary issue: https://openqa.suse.de/tests/11768465
- It is strange that worker32 has an additional IP via eth1 (10.145.10.109/24). None of the other workers have that. Maybe that's interfering? The bridge device and firewall config seems to be correct, though (using eth0).
~~The problem of OSD reaching workers persists. I think we'll eventually handle that as part of #132146, see #132146#note-12.~~ It is not a problem anymore on the new OSD VM, see #132146#note-12. That should be good enough.

Actions

Copy link

#43

Updated by mkittler over 1 year ago

Looks like the MM problem on arm is reproducible. Likely the problem is on the side of arm2 or wicked_basic_sut. The networking within the SUT on arm1 seems good but the SUT on arm2 (which is executing wicked_basic_sut) doesn't get an IP, also not after restarting wicked and restarting the SUT (the eth0 interface is at least up). After rebooting the SUT one gets the message "Unable to locate ITS domain handle". I'm now rebooting arm2 and see whether it helps. If not I'll try scheduling the jobs so the job/worker assignment is swapped to see whether it makes a difference.

EDIT: It still fails. Test with swapped assignment: https://openqa.suse.de/tests/11768869
EDIT: Looks like that swapped the failing side. So I guess it isn't the worker but one "side" of the scenario.
EDIT: Looks like the problem on worker32 persists: https://openqa.suse.de/tests/11768465

Actions

Copy link

#44

Updated by okurz over 1 year ago

Related to action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M added

Actions

Copy link

#45

Updated by mkittler over 1 year ago

Maybe it makes a difference to disable eth1 on worker32 so I did that via sudo wicked ifstatus eth1 and restarted the test: https://openqa.suse.de/tests/11784175

EDIT: The test https://openqa.suse.de/tests/11784175 is now already passed the point where it previously failed. So apparently having another ethX device up with an IP address causes the MM setup to break - even if the trusted zone only contains the correct interface and a reboot has been done since the zone configuration was corrected. I don't know exactly why that additional IP interfered but maybe it is worth documenting that it may be problematic. Interestingly sudo wicked ifstatus eth1 worked without a reboot.

Considering

martchus@worker32:~> sudo wicked show-config eth1
<interface origin="compat:suse:/etc/sysconfig/network/ifcfg-eth1">
  <name>eth1</name>
  <control>
    <mode>boot</mode>
  </control>
  <firewall>
    <zone>trusted</zone>
  </firewall>
…

the eth1 dev is actually still part of the trusted zone. But maybe that's not really the case because firewall-cmd tells otherwise.

To make the change persistent, I changed the config via Yast. It now looks like this:

martchus@worker32:~> sudo cat /etc/sysconfig/network/ifcfg-eth1

EDIT: The re-triggered test passed as well: https://openqa.suse.de/tests/11784797
BOOTPROTO='none'
STARTMODE='off'
ZONE=public

EDIT: The re-triggered test passed as well: https://openqa.suse.de/tests/11784797

Actions

Copy link

#46

Updated by mkittler over 1 year ago

I asked about arm1 and arm2 in the chat yesterday but we couldn't find out what the problem is. I also rebooted both machines one more time and gave it a try but ran into the same error (see https://openqa.suse.de/tests/11768868).

I suppose I'll create a MR to enable only the x86_64 workers in production for now. EDIT: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581

Actions

Copy link

#48

Updated by okurz over 1 year ago

Due date changed from 2023-08-12 to 2023-08-18

We discussed this ticket in the weekly QE Tools meeting. Enabling multi-machine tests across sites might introduce a higher risk so we suggest that next week we disable "tap" on NUE1+2 based workers and only enable on PRG1+2 based workers and verify stability again. About the ARM workers please ensure that multi-machine capabilities are handled similarly within this ticket or another new one to be created. That should all be feasible to achieve until end of next week.

Actions

Copy link

#49

Updated by mkittler over 1 year ago

Ok, I thought we kept the approach open (so I scheduled a cross-site test run: https://openqa.suse.de/tests/11802207).

But yes, it is likely the safest to disable "tap" in NUE workers at the same time we enable PRG workers.

Actions

Copy link

#50

Updated by mkittler over 1 year ago

I have totally forgotten about worker40 which I couldn't setup due to https://sd.suse.com/servicedesk/customer/portal/1/SD-128721. Now IPMI works so I'll continue with that.

Actions

Copy link

#51

Updated by mkittler over 1 year ago

worker40 should be fine now, let's see whether it works: https://openqa.suse.de/tests/11850962

EDIT: It works, see https://openqa.suse.de/tests/11850961.

Actions

Copy link

#52

Updated by mkittler over 1 year ago

I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3853 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/583 to change our configuration according to an interface change on worker-arm2 that has unexpectedly happened.

Actions

Copy link

#53

Updated by tinita over 1 year ago

mkittler wrote:

I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3853 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/583 to change our configuration according to an interface change on worker-arm2 that has unexpectedly happened.

I merged the salt-pillars-openqa MR because the OPS-Service MR was merged.

Actions

Copy link

#54

Updated by tinita over 1 year ago

Due date changed from 2023-08-18 to 2023-08-25

Actions

Copy link

#55

Updated by mkittler over 1 year ago

Good. I've created some test jobs to see whether the switch of the ethernet device has maybe changed something for the better. However, the tests still fail as before: https://openqa.suse.de/tests/11897393

Actions

Copy link

#56

Updated by tinita over 1 year ago

Due date changed from 2023-08-25 to 2023-09-01

Actions

Copy link

#57

Updated by mkittler over 1 year ago

Status changed from In Progress to Feedback

The MM setup also works on the arm workers now. It has been resolved by #133736#change-663881. See https://openqa.suse.de/tests/11897961 for a test cluster that ran across arm1 and arm2.

I guess that leaves only enabling the workers in production (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 for a draft).

Actions

Copy link

#58

Updated by mkittler over 1 year ago

I've updated https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 to include the new arm workers as well. I've also removed the tap worker class from Nürnberg located workers so all tap workers are on the same site.

Actions

Copy link

#59

Updated by okurz over 1 year ago

mkittler wrote in #note-51:

worker40 should be fine now, let's see whether it works: https://openqa.suse.de/tests/11850962

EDIT: It works, see https://openqa.suse.de/tests/11850961.

worker40 seems to be missing from salt though and https://openqa.suse.de/admin/workers/3095 shows no jobs.

Actions

Copy link

#60

Updated by okurz over 1 year ago

Copied to action #134912: Gradually phase out NUE1 based openQA workers size:M added

Actions

Copy link

#61

Updated by okurz over 1 year ago

worker40 is in salt now so that's good. The older comments reference successful multi-machine scenarios on various worker combinations.

That leaves the two rollback steps

Remove silence for "host up" alerts for worker33…39 alertname=~(d[0-9]|worker[0-9]): host up alert in https://stats.openqa-monitor.qa.suse.de/alerting/silences
Add back arm-worker2.oqa.prg2.suse.org (or worker-arm2 correspondingly)

Consider creating a separate ticket about IPv6 in future and reference that ticket in the alert silences. Then we can close here.

Actions

Copy link

#62

Updated by mkittler over 1 year ago

Description updated (diff)
Status changed from Feedback to Resolved

I've created #134948 and updated the silences.

Note that worker-arm2.oqa.prg2.suse.org has been brought back a while ago so that rollback step is actually quite outdated.

Actions

Copy link

#63

Updated by okurz over 1 year ago

Due date deleted (~~2023-09-01~~)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #132137

Setup new PRG2 openQA worker for osd size:M

Motivation¶

Acceptance criteria¶

Suggestions¶

Rollback steps¶

Out of scope¶

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by openqa_review over 1 year ago

Updated by okurz over 1 year ago

Updated by livdywan over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by openqa_review over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by tinita over 1 year ago

Updated by tinita over 1 year ago

Updated by mkittler over 1 year ago

Updated by tinita over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago