action #132137
Status: closed
QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA (public) - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Setup new PRG2 openQA worker for osd size:M
Added by okurz over 1 year ago. Updated over 1 year ago.
Description
Motivation
New hardware was ordered to serve as openQA workers for osd. We can connect those machines to the osd webUI VM instance regardless of whether it is still running from NUE1 or already from PRG2.
Acceptance criteria
- AC1: osd multi-machine jobs run successfully on new PRG2 openQA workers
Suggestions
- DONE (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/a27f3d50872a8f7aff127b20a409dcdc1c28d91b): Track https://jira.suse.com/browse/ENGINFRA-2379 "PRG2 IPMI for QA" to be able to remote control
- Track https://jira.suse.com/browse/ENGINFRA-1742 "Build OpenQA Environment" which is the neighboring story of the osd VM being migrated
- Wait for Eng-Infra to inform us about the availability of the network and machines
- Ensure we can connect over IPMI (see the sketch below this list)
- Include IPMI contact details and workerconf in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
- Follow https://gitlab.suse.de/openqa/salt-states-openqa
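A minimal sketch of such an IPMI connectivity check from the jump host; the host, user and password below are placeholders taken from examples later in this ticket, not verified values:
ipmitool -I lanplus -H openqaworker1.qe-ipmi-ur -U qadmin -P 'PASSWORD' mc info        # basic BMC sanity check
ipmitool -I lanplus -H openqaworker1.qe-ipmi-ur -U qadmin -P 'PASSWORD' sol activate   # serial-over-LAN console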
Rollback steps
- Moved to #134948: Remove silence for "host up" alerts for worker33…39 (alertname=~(d[0-9]*|worker[0-9]*): host up alert) in https://stats.openqa-monitor.qa.suse.de/alerting/silences
- DONE: Add back arm-worker2.oqa.prg2.suse.org (or worker-arm2 correspondingly)
Out of scope
- Ensure that osd can work without relying on any physical machine in NUE1
Updated by okurz over 1 year ago
- Copied from action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Updated by okurz over 1 year ago
- Subject changed from Setup new PRG2 openQA worker for osd to Setup new PRG2 openQA worker for osd size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Copied to action #132158: Ensure that osd can work without relying on any physical machine in NUE1 size:M added
Updated by okurz over 1 year ago
- Status changed from Workable to Blocked
- Assignee set to okurz
Waiting for IPMI access, see #132134
Updated by okurz over 1 year ago
- Status changed from Blocked to In Progress
From https://suse.slack.com/archives/C04MDKHQE20/p1689317691714229
(Martin Caj) @Oliver Kurz I sent you an email with an encrypted file. There you find how to get into the new workers in PRG2, please test it and let me know.
(Oliver Kurz) @Martin Caj according to https://racktables.nue.suse.com/index.php?page=rack&rack_id=21278 those machines are openqaworker1…12 which are intended for openqa.suse.de but our priority should be https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 for openqa.opensuse.org aka openQA.DMZ
(Martin Caj) I know we are working on DMZ now
(Oliver Kurz)
- We can log in as "jumpy" and ping all hosts from /etc/hosts and reach them over IPMI
- ssh over IPv6 takes very long. Likely the machine is only connected over IPv4 but has an AAAA record
- The decision was to start numbering machines starting with 21 to prevent confusion with the already existing machines at other locations so this needs to be changed for both o3+osd machines
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/560 for adding credentials in the current state.
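As a side note on the slow ssh over IPv6: a quick way to confirm the suspicion that a host only answers over IPv4 despite having an AAAA record could look like this (hostname is just an example):
host -t AAAA worker29.oqa.prg2.suse.org   # does the name have an AAAA record?
ping -6 -c1 worker29.oqa.prg2.suse.org    # is the host actually reachable over IPv6?
ssh -4 worker29.oqa.prg2.suse.org         # force IPv4 to avoid the long timeout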
Updated by okurz over 1 year ago
Trying
linuxefi (http,195.135.221.134)/distribution/openSUSE-current/repo/oss/boot/x86_64/loader/linux install=http://download.opensuse.org/distribution/openSUSE-current/repo/oss console=ttyS0,115200 autoyast=http://s.qa.suse.de/oqa-ay-lp rootpassword=susetesting
initrdefi (http,195.135.221.134)/distribution/openSUSE-current/repo/oss/boot/x86_64/loader/initrd
boot
Loading takes about 1-2 seconds after linuxefi and about 60-90 s after initrdefi. After boot nothing shows up, maybe because ttyS0 is not the correct console. I would need to try ttyS1 and ttyS2 as well. Update: ttyS1 works \o/
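For reference, a sketch of the working variant, i.e. the same commands as above with the console switched to ttyS1:
linuxefi (http,195.135.221.134)/distribution/openSUSE-current/repo/oss/boot/x86_64/loader/linux install=http://download.opensuse.org/distribution/openSUSE-current/repo/oss console=ttyS1,115200 autoyast=http://s.qa.suse.de/oqa-ay-lp rootpassword=susetesting
initrdefi (http,195.135.221.134)/distribution/openSUSE-current/repo/oss/boot/x86_64/loader/initrd
boot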
Conducted the autoyast installation and then logged in over SOL and patched missing repos:
for i in update/leap/\$releasever/non-oss distribution/leap/\$releasever/repo/oss update/leap/\$releasever/oss distribution/leap/\$releasever/repo/non-oss; do zypper ar https://download.opensuse.org/$i $i; done
zypper mr -f 8
zypper mr -f 9
hostname -f
tells that the machine knows itself as d105.oqa.prg2.suse.org. Discussed with mcaj in https://suse.slack.com/archives/C04MDKHQE20/p1689317691714229:
(Oliver Kurz) @Martin Caj I see https://racktables.nue.suse.com/index.php?page=rack&rack_id=21282 with new names, that's nice. Will you then name the OSD (openqa.suse.de) ones in J11 worker29 and above?
(Martin Caj) well, I do not know how to add them into OSD ... but FQDN:
for internal workers it should be oqa.prg2.suse.org
for DMZ workers it should be oqa.opensuse.org
The internal ones need to be registered at the suttner1.oqa.prg2.suse.org DNS/DHCP, this I can do.
For the DMZ I do not know, but I assume that you had DHCP running on the ariel server; if you need it we can build a separate pair of DHCP/DNS servers.
(Oliver Kurz) We can handle the registration against the DHCP server if it's the same salt repo. For the DMZ the decision (I think your suggestion?) was to have a separate DHCP/DNS server, the same as you maintain for the other networks. However, we can also still just try dnsmasq from ariel.
(Martin Caj) internal ones: we have it at https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml
but the hosts there are just dummy ones, we need to replace them
DNS is under https://gitlab.suse.de/OPS-Service/salt/-/tree/production/salt/profile/dns/files/oqa_prg2_suse_org
(Oliver Kurz) got it. We will prepare an MR so that you can focus on getting the DMZ up, ok?
Then I followed https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines and https://gitlab.suse.de/openqa/salt-states-openqa#setup-production-machine . So I put worker29.oqa.prg2.suse.org into /etc/hostname and such and will create a merge request for the DHCP/DNS entry.
So from our salt pillars:
sed -n '/prg2.suse.org/s/^.*serial: `\([^`]*\).*$/\1/p' openqa/workerconf.sls | sed 's/sol activate/raw 0x30 0x21 | tail -c 18 | sed \\\"s@ @:@g\\\"/' | (while read -r cmd; do echo $cmd && eval $cmd; done)
which unfortunately stops after the first eval, so instead we created a list of commands that we can execute on qe-jumpy directly:
sed -n '/prg2.suse.org/s/^.*serial: `\([^`]*\).*$/\1/p' openqa/workerconf.sls | sed -e 's/^[^"]*"//' -e 's/"$//' -e 's/sol activate/raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"/' | (while read -r cmd; do echo $cmd; done)
and copy-pasted that into the ssh session, which yields:
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker1.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:9c
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker2.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:dc:34
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker3.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:70
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker4.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
3c:ec:ef:fe:0a:c4
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker5.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:2a
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker6.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:dd:e8
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker7.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:cc
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker8.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:de
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker9.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:dc:4c
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker10.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:d2
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker11.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:ae
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker12.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
7c:c2:55:24:de:ce
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker-arm1.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
Unable to send RAW command (channel=0x0 netfn=0x30 lun=0x0 cmd=0x21 rsp=0xc1): Invalid command
jumpy@qe-jumpy:~> ipmitool -I lanplus -H openqaworker-arm2.qe-ipmi-ur -U qadmin -P 'X' raw 0x30 0x21 | tail -c 18 | sed "s@ @:@g"
Unable to send RAW command (channel=0x0 netfn=0x30 lun=0x0 cmd=0x21 rsp=0xc1): Invalid command
so this works fine for x86_64, it's a start. Created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3761
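A note on the loop further up that stopped after the first eval: this is typically what happens when the eval'd command (here ssh/ipmitool) swallows the rest of the loop's stdin. A sketch of a variant that should avoid this, reusing the same pipeline with stdin of the eval'd command redirected:
sed -n '/prg2.suse.org/s/^.*serial: `\([^`]*\).*$/\1/p' openqa/workerconf.sls | sed 's/sol activate/raw 0x30 0x21 | tail -c 18 | sed \\\"s@ @:@g\\\"/' | while read -r cmd; do echo "$cmd" && eval "$cmd" < /dev/null; done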
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Project changed from 46 to openQA Infrastructure (public)
- Category deleted (Infrastructure)
Updated by livdywan over 1 year ago
Some of those machines have already established a connection to the salt master and there is a valid workerconf to start with.
Any of you can continue by accepting the salt key, applying the salt high state and scheduling specific openQA test jobs on those, including multi-machine
I'm quoting Oli here to highlight that everyone is welcome to pick this up
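A rough sketch of those steps as commands, run on OSD as the salt master; worker name, worker class and job id are placeholders/examples, not prescribed values:
sudo salt-key -y -a worker29.oqa.prg2.suse.org       # accept the pending minion key
sudo salt 'worker29.oqa.prg2.suse.org' state.apply   # apply the high state
openqa-clone-job --within-instance https://openqa.suse.de 11162867 WORKER_CLASS=worker29   # pin a cloned test job to the new worker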
Updated by okurz over 1 year ago
- Due date deleted (2023-07-29)
- Priority changed from Normal to Urgent
Blocked by #133250, tracker for https://sd.suse.com/servicedesk/customer/portal/1/SD-128313 (resolved)
Updated by okurz over 1 year ago
- Status changed from Workable to Blocked
GitLab is back but https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3761 is still not merged
Updated by okurz over 1 year ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3761 now merged.
host worker37.oqa.prg2.suse.org
worker37.oqa.prg2.suse.org has address 10.145.10.10
worker37.oqa.prg2.suse.org has IPv6 address 2a07:de40:b203:12:10:145:10:10
so good to continue.
Updated by mkittler over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler over 1 year ago
I suppose all relevant machines are mentioned in commit https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/a27f3d50872a8f7aff127b20a409dcdc1c28d91b. They are part of rack https://racktables.suse.de/index.php?page=object&object_id=23028 or https://racktables.suse.de/index.php?page=rack&rack_id=21282.
Looks like Leap 15.5 has already been installed on hosts worker29.oqa.prg2.suse.org to worker32.oqa.prg2.suse.org, so worker33.oqa.prg2.suse.org to worker40.oqa.prg2.suse.org as well as worker-arm1.oqa.prg2.suse.org and worker-arm2.oqa.prg2.suse.org are still missing.
However, I cannot establish an IPMI connection to the remaining hosts except 36, 37, 39 and arm1. It seems that on 36, 37 and 39 there is something running (all one gets via SOL is a timer) and on arm1 one is prompted with "pingubox login:". So it supposedly makes sense to hold off on the OS installation on those machines. Maybe I should ask about those hosts in Slack as mentioned in a Jira comment.
Judging by the output of salt-key -L on OSD, workers 29 to 32 even have Salt running already. Not sure whether it is wanted to accept those connections at this point.
By the way, the IPMI password of some hosts contains exclamation marks. So before running the commands from workerconf.sls it makes sense to use set +H to avoid having to escape those. Some passwords also contain $ which requires extra escaping that is not present in workerconf.sls. However, even then I couldn't connect via IPMI to e.g. worker33.
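A minimal sketch of that, with placeholder host and password:
set +H   # disable bash history expansion so '!' inside double quotes is not interpreted
ipmitool -I lanplus -H worker33-ipmi.example -U ADMIN -P "secret!password" power status
For passwords containing $ one additionally needs single quotes or \$ escaping so the shell does not try to expand them.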
Updated by okurz over 1 year ago
mkittler wrote:
Judging by the output of salt-key -L on OSD, workers 29 to 32 even have Salt running already. Not sure whether it is wanted to accept those connections at this point.
Yes, but only while being closely monitored :) so whenever one of us is available for some time, accept the salt key, apply the high state and monitor the openQA jobs that the machines accept - assuming the workers have the connection to osd. And, you know, GRE tunnels and multi-machine and such :)
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
Looks like it wasn't the escaping but just a bad connection. One has to retry multiple times to get past errors like "no authcode provided". We're nevertheless currently changing the passwords to avoid this confusion. The "timers" are actually the system time shown by the BIOS screen. One can see more by pressing the arrow keys.
I still cannot connect to arm2 via IPMI. I created an SD ticket for that: https://sd.suse.com/servicedesk/customer/portal/1/SD-128708
Updated by okurz over 1 year ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/573 for password changes as discussed. And https://sd.suse.com/servicedesk/customer/portal/1/SD-128710 for alignment of IPMI hostnames on jumpy. Added installation instructions over https://wiki.suse.net/index.php/OpenQA#Remote_management_with_IPMI_and_BMC_tools
Updated by mkittler over 1 year ago
I was able to set up all machines up to 39. (Bare installation. Salt is still missing.)
I was not able to set up 40 because it cannot boot via UEFI HTTP boot. Maybe the plug is not connected? I'm aware that this machine has two ethernet ports but I've tried both and it times out on both. EDIT: I created an SD ticket for that: https://sd.suse.com/servicedesk/customer/portal/1/SD-128721
I was not able to set up arm1 because I still have to figure out how to proceed from that "pingubox login" prompt.
Updated by mkittler over 1 year ago
I was now able to set up everything except worker 40 and arm worker 2. I will deal with possible failing systemd units and failing salt states tomorrow.
I updated the Wiki with the commands used on the arm worker (which differed a little bit). We also had to manually amend the filesystem layout on the arm worker.
Updated by okurz over 1 year ago
- Description updated (diff)
- Assignee deleted (mkittler)
Soon after, as expected, the hosts ran into an unresponsive salt-minion due to https://bugzilla.opensuse.org/show_bug.cgi?id=1212816 . It looks like you haven't applied the workaround from #131249 so I am doing that now over ssh as salt is not responsive:
for i in {29..40}; do echo "## $i" && ssh -4 -o StrictHostKeyChecking=no worker$i.oqa.prg2.suse.org 'sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-3004-150400.8.25.1.x86_64.rpm http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-minion-3004-150400.8.25.1.x86_64.rpm http://download.opensuse.org/update/leap/15.4/sle/x86_64/python3-salt-3004-150400.8.25.1.x86_64.rpm && sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt'; done
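To verify afterwards that the downgrade happened and the package lock is in place, something along these lines can be used (a sketch):
for i in {29..40}; do echo "## $i" && ssh -4 worker$i.oqa.prg2.suse.org 'rpm -q salt-minion && zypper ll'; done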
Multiple alerts received, among them:
- [FIRING:1] d105: host up alert openQA (d105 host_up_alert_d105 worker)
- [FIRING:1] worker33: host up alert openQA (worker33 host_up_alert_worker33 worker)
- …
- [FIRING:1] worker39: host up alert openQA (worker39 host_up_alert_worker39 worker)
at least consistently for all :)
Added silences and corresponding rollback steps in the ticket.
Updated by okurz over 1 year ago
- Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
Updated by mkittler over 1 year ago
I have also applied the workaround on the arm worker now:
sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/aarch64/salt-3004-150400.8.25.1.aarch64.rpm http://download.opensuse.org/update/leap/15.4/sle/aarch64/salt-minion-3004-150400.8.25.1.aarch64.rpm http://download.opensuse.org/update/leap/15.4/sle/aarch64/python3-salt-3004-150400.8.25.1.aarch64.rpm && sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion python3-salt
Updated by mkittler over 1 year ago
- Description updated (diff)
https://sd.suse.com/servicedesk/customer/portal/1/SD-128708 has been resolved now so I'm currently installing arm2.
Updated by mkittler over 1 year ago
The problem that is causing the host up alert is that the new workers are in fact not pingable from OSD (neither via IPv4 nor via IPv6). I'm not sure how big of a problem we consider this because the OSD VM is going to be moved soon anyway. Supposedly we can for now just keep the silence and sort this out once the OSD VM has been moved.
There's another problem: Our NVMe setup script cannot cope with the available disks/partitions on the new workers. For arm1 we mitigated this by changing the partitioning on the single available NVMe. The other workers actually have multiple SSDs so I'm currently trying to improve our script to be able to make use of them. If that doesn't work we can still configure it manually via the grain approach I introduced for the SAP workers.
EDIT: MR for the NVMe problem: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/933
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
Now arm2 is set up as well. This leaves only worker40 which is still blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-128721.
Updated by okurz over 1 year ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/574 for an update of IPMI hostnames
Updated by mkittler over 1 year ago
- Status changed from Feedback to In Progress
Thanks, it looks good.
Meanwhile https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/933 was merged and applied. That fixed the NVMe setup on the workers as expected.
I created some test jobs but they fail with download errors (e.g. https://openqa.suse.de/tests/11724074). I still have to figure out why that's the case. Downloading manually via wget works just fine.
Updated by mkittler over 1 year ago
Looks like the downloading works after rebooting. After rebooting all machines I'm getting mixed results but some tests are passing at least.
For some reason arm2 booted into GRUB provided via PXE. I'll have to find out why that happened. The boot priorities look fine in the setup menu and the system generally boots the installed system.
All jobs have passed without further problems except on worker32 which had problems determining its own hostname. After a reboot it seemed better but the corresponding tests failed: https://openqa.suse.de/tests/11733211
I tried to run tests on arm workers as well but it seems https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3801 is not effective yet.
Updated by okurz over 1 year ago
I just checked
okurz@worker38:~> for i in $(sudo ovs-vsctl show | sed -n 's/^.*remote_ip="\([^"]*\)"}/\1/p'); do ping -c1 $i; done
which looks good. So at least this worker, and presumably all PRG2 workers, can reach all other OSD workers. Time to test some more multi-machine then :)
Updated by mkittler over 1 year ago
I have run a job across 29 and 30. That worked: https://openqa.suse.de/tests/11732858 - I'll do more tests between more machines.
The developer mode doesn't work. First I thought it was due to a firewall issue but it is the same problem that causes the ping alert: we just cannot reach those hosts from OSD at all. I guess that problem will resolve itself once OSD is migrated to Prague as well.
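A quick reachability check from OSD, which also covers a port the developer mode needs; the port number is an assumption for worker instance 1, not taken from this ticket:
ping -c1 worker33.oqa.prg2.suse.org
nc -zv worker33.oqa.prg2.suse.org 20013   # os-autoinst command server of worker instance 1 (assumed)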
Updated by mkittler over 1 year ago
Some more MM tests covering all workers and at least one connection to another worker:
- https://openqa.suse.de/tests/11733019
- https://openqa.suse.de/tests/11733124
- https://openqa.suse.de/tests/11733126
- https://openqa.suse.de/tests/11733127
- https://openqa.suse.de/tests/11733130
- https://openqa.suse.de/tests/11733131
(Original job is https://openqa.suse.de/tests/11162867.)
Updated by mkittler over 1 year ago
2nd attempt for DNS/host setup of arm workers: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3812
We'll also need this change for MM tests: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/575
Updated by okurz over 1 year ago
- Description updated (diff)
arm-worker2.oqa.prg2.suse.org was not responsive in salt so I removed that salt key to unblock salt state application. Added a corresponding rollback step in the ticket description.
Updated by mkittler over 1 year ago
Looks like that 2nd attempt worked. So only a few things left to look into before I can enable the workers in production:
- I've now run MM jobs on the arm workers: https://openqa.suse.de/tests/11739952 - The tests haven't finished yet but it already looks like they're failing. Maybe the mentioned unresponsive salt is the problem here. (So tap was not fully/correctly set up due to it.)
- Then there's also this MM failure on worker32 I still have to look into: https://openqa.suse.de/tests/11733211#step/before_test/22
- One cannot reach the workers via ssh or http from OSD breaking e.g. the developer mode. Alerts like https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_worker33/view are also still failing. I mentioned it on the dct channel.
Updated by okurz over 1 year ago
- Related to action #132827: [tools][qe-core]test fails in rsync_client/salt-master, DNS resolve issue with workers "sapworker*" on multi-machine tests size:M added
Updated by okurz over 1 year ago
- Related to action #133892: [alert] arm-worker2 (arm-worker2: host up alert openQA host_up_alert_arm-worker2 worker size:M added
Updated by mkittler over 1 year ago
- I've been restarting MM jobs on arm workers: https://openqa.suse.de/tests/11768364
- Maybe it works now after ensuring that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/575 is properly applied on arm2. Maybe also another restart is required.
- Test MM job between worker33 and 34: https://openqa.suse.de/tests/11768461 - I suppose it'll pass and thus worker32 was the culprit causing the MM job between 32 and 33 to fail.
- EDIT: It has just passed. So worker32 is the culprit. Restarted the test on worker32 again to check whether it was just a temporary issue: https://openqa.suse.de/tests/11768465
- It is strange that worker32 has an additional IP via eth1 (10.145.10.109/24). None of the other workers have that. Maybe that's interfering? The bridge device and firewall config seem to be correct, though (using eth0).
The problem of OSD reaching the workers persists. I originally thought we would eventually handle that as part of #132146, but it is not a problem anymore on the new OSD VM, see #132146#note-12. That should be good enough.
Updated by mkittler over 1 year ago
Looks like the MM problem on arm is reproducible. Likely the problem is on the side of arm2 or wicked_basic_sut. The networking within the SUT on arm1 seems good but the SUT on arm2 (which is executing wicked_basic_sut) doesn't get an IP, also not after restarting wicked and restarting the SUT (the eth0 interface is at least up). After rebooting the SUT one gets the message "Unable to locate ITS domain handle". I'm now rebooting arm2 to see whether it helps. If not I'll try scheduling the jobs so that the job/worker assignment is swapped to see whether it makes a difference.
EDIT: It still fails. Test with swapped assignment: https://openqa.suse.de/tests/11768869
EDIT: Looks like that swapped the failing side. So I guess it isn't the worker but one "side" of the scenario.
EDIT: Looks like the problem on worker32 persists: https://openqa.suse.de/tests/11768465
Updated by okurz over 1 year ago
- Related to action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M added
Updated by mkittler over 1 year ago
Maybe it makes a difference to disable eth1 on worker32, so I did that via sudo wicked ifdown eth1 and restarted the test: https://openqa.suse.de/tests/11784175
EDIT: The test https://openqa.suse.de/tests/11784175 has now passed the point where it previously failed. So apparently having another ethX device up with an IP address causes the MM setup to break, even if the trusted zone only contains the correct interface and a reboot has been done since the zone configuration was corrected. I don't know exactly why that additional IP interfered but maybe it is worth documenting that it may be problematic. Interestingly, sudo wicked ifdown eth1 worked without a reboot.
Considering
martchus@worker32:~> sudo wicked show-config eth1
<interface origin="compat:suse:/etc/sysconfig/network/ifcfg-eth1">
<name>eth1</name>
<control>
<mode>boot</mode>
</control>
<firewall>
<zone>trusted</zone>
</firewall>
…
the eth1 dev is actually still part of the trusted zone. But maybe that's not really the case because firewall-cmd tells otherwise.
To make the change persistent, I changed the config via Yast. It now looks like this:
martchus@worker32:~> sudo cat /etc/sysconfig/network/ifcfg-eth1
BOOTPROTO='none'
STARTMODE='off'
ZONE=public
EDIT: The re-triggered test passed as well: https://openqa.suse.de/tests/11784797
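For completeness, the firewalld side of this can be cross-checked like this (a sketch):
sudo firewall-cmd --get-zone-of-interface=eth1      # which zone firewalld actually assigns to eth1
sudo firewall-cmd --zone=trusted --list-interfaces  # which interfaces are really in the trusted zone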
Updated by mkittler over 1 year ago
I asked about arm1 and arm2 in the chat yesterday but we couldn't find out what the problem is. I also rebooted both machines one more time and gave it a try but ran into the same error (see https://openqa.suse.de/tests/11768868).
I suppose I'll create an MR to enable only the x86_64 workers in production for now. EDIT: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581
Updated by okurz over 1 year ago
- Due date changed from 2023-08-12 to 2023-08-18
We discussed this ticket in the weekly QE Tools meeting. Enabling multi-machine tests across sites might introduce a higher risk, so we suggest that next week we disable "tap" on NUE1+2 based workers and only enable it on PRG1+2 based workers and then verify stability again. Regarding the ARM workers, please ensure that multi-machine capabilities are handled similarly, within this ticket or in another ticket to be created. That should all be feasible to achieve by the end of next week.
Updated by mkittler over 1 year ago
Ok, I thought we kept the approach open (so I scheduled a cross-site test run: https://openqa.suse.de/tests/11802207).
But yes, it is likely the safest to disable "tap" in NUE workers at the same time we enable PRG workers.
Updated by mkittler over 1 year ago
I have totally forgotten about worker40 which I couldn't set up due to https://sd.suse.com/servicedesk/customer/portal/1/SD-128721. Now IPMI works so I'll continue with that.
Updated by mkittler over 1 year ago
worker40 should be fine now, let's see whether it works: https://openqa.suse.de/tests/11850962
EDIT: It works, see https://openqa.suse.de/tests/11850961.
Updated by mkittler over 1 year ago
I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3853 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/583 to change our configuration according to an interface change on worker-arm2 that has unexpectedly happened.
Updated by tinita over 1 year ago
mkittler wrote:
I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3853 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/583 to change our configuration according to an interface change on worker-arm2 that has unexpectedly happened.
I merged the salt-pillars-openqa MR because the OPS-Service MR was merged.
Updated by tinita over 1 year ago
- Due date changed from 2023-08-18 to 2023-08-25
Updated by mkittler over 1 year ago
Good. I've created some test jobs to see whether the switch of the ethernet device has maybe changed something for the better. However, the tests still fail as before: https://openqa.suse.de/tests/11897393
Updated by tinita over 1 year ago
- Due date changed from 2023-08-25 to 2023-09-01
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
The MM setup also works on the arm workers now. It has been resolved by #133736#change-663881. See https://openqa.suse.de/tests/11897961 for a test cluster that ran across arm1 and arm2.
I guess that leaves only enabling the workers in production (see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 for a draft).
Updated by mkittler over 1 year ago
I've updated https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 to include the new arm workers as well. I've also removed the tap worker class from the Nürnberg-located workers so that all tap workers are on the same site.
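Once that MR is merged and the high state applied, one way to double-check which hosts still carry the tap worker class could be (a sketch, hostnames are examples):
for i in worker29 worker33 worker-arm1; do echo "## $i" && ssh $i.oqa.prg2.suse.org 'grep WORKER_CLASS /etc/openqa/workers.ini'; done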
Updated by okurz over 1 year ago
mkittler wrote in #note-51:
worker40 should be fine now, let's see whether it works: https://openqa.suse.de/tests/11850962
EDIT: It works, see https://openqa.suse.de/tests/11850961.
worker40 seems to be missing from salt though and https://openqa.suse.de/admin/workers/3095 shows no jobs.
Updated by okurz over 1 year ago
- Copied to action #134912: Gradually phase out NUE1 based openQA workers size:M added
Updated by okurz over 1 year ago
worker40 is in salt now so that's good. The older comments reference successful multi-machine scenarios on various worker combinations.
That leaves the two rollback steps
- Remove silence for "host up" alerts for worker33…39 alertname=~(d[0-9]|worker[0-9]): host up alert in https://stats.openqa-monitor.qa.suse.de/alerting/silences
- Add back arm-worker2.oqa.prg2.suse.org (or worker-arm2 correspondingly)
Consider creating a separate ticket about IPv6 in future and reference that ticket in the alert silences. Then we can close here.