action #132134
closedQA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA (public) - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Setup new PRG2 multi-machine openQA worker for o3 size:M
0%
Description
Motivation
New hardware was ordered to serve as openQA workers for o3, both x86_64 and aarch64. We can connect those machines to the o3 webUI VM instance regardless of whether o3 is still running from NUE1 or already from PRG2.
Acceptance criteria
- AC1: o3 multi-machine jobs run successfully on new PRG2 openQA workers
- AC2: All o3 workers have the same relevant downgrades and package locks as the existing ones, e.g. see w19+w21 as reference for kernel-default, linked to https://bugzilla.suse.com/show_bug.cgi?id=1214537
Suggestions
- Track https://jira.suse.com/browse/ENGINFRA-2347 "DMZ-OpenQA implementation" so that the o3 network is available
- Track https://jira.suse.com/browse/ENGINFRA-2379 "PRG2 IPMI for QA" to be able to remote control
- Track https://jira.suse.com/browse/ENGINFRA-2155 "Install Additional links to DMZ-CORE from J12 - openQA-DMZ", something about cabling
- Track https://jira.suse.com/browse/ENGINFRA-1742 "Build OpenQA Environment" which is the neighboring story of the o3 VM being migrated
- Wait for Eng-Infra to inform us about the availability of the network and machines
- Ensure we can connect over IPMI
- Include IPMI contact details in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
- Follow https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines for o3 to install OS and openQA worker, connect to o3
- Ensure configuration is equivalent to what existing o3 machines can do
- Ensure that o3 can work without relying on any physical machine in NUE1 (if problems found then feel free to delegate into a new, specific ticket)
Out of scope
Updated by okurz over 1 year ago
- Copied to action #132137: Setup new PRG2 openQA worker for osd size:M added
Updated by okurz over 1 year ago
- Copied to action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Updated by okurz over 1 year ago
- Subject changed from Setup new PRG2 openQA worker for o3 to Setup new PRG2 openQA worker for o3 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Assignee set to okurz
Nikolay Dachev contacted me yesterday; he is apparently preparing qe-jumpy.qe.prg2.suse.org or something similar. We expect it to be fully available to us within the next days.
As long as ariel is still in NUE1 we should set up the new machines and try to connect them to ariel, e.g. over https on the public internet, with the workaround of running fetchneedles locally on each worker in a cron job every minute and not using the worker cache for tests. While this is running we can look into providing rsync, either directly by allowing rsync access over the public internet, or over a dedicated SUSE-internal connection between the old NUE1-o3 network and the new PRG2-o3 network, or with some fancy ssh config for rsync, like rsync -e "ssh -p 4242 special.host" …
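The fetchneedles-in-cron part of that workaround could look like the following /etc/cron.d entry (a sketch; it assumes fetchneedles is installed under /usr/share/openqa/script, and the leading "-" is the cronie convention to suppress syslog logging for the job):

```shell
# Hypothetical /etc/cron.d/openqa-update-git on each new PRG2 worker:
# refresh test code and needles locally every minute as user geekotest
-*/1 * * * * geekotest env updateall=1 force=1 /usr/share/openqa/script/fetchneedles
```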
Updated by okurz over 1 year ago
I received credentials from mcaj, tried them out and could not access the jump host over ssh:
Hi, I got your email about the DMZ openQA worker test. Same IPv6 problem to access. Over IPv4 I get an immediate ssh connection but my ssh key is not accepted and I don't know any password for ssh authentication
Updated by okurz over 1 year ago
- Status changed from Workable to In Progress
I put the IPMI credentials into https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1754 but IPMI access over the SSH jump host still does not work.
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-31
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Project changed from 46 to openQA Infrastructure (public)
- Category deleted (Infrastructure)
Updated by okurz over 1 year ago
Apparently the wrong jump host config was copy/pasted, fixing in salt pillars
Updated by okurz over 1 year ago
Fixed in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/565 (merged). Now all x86_64 hosts are accessible.
With
sed -n -e '/prg2/s/^[^`]*\([^`]*\)`/\1/p' ~/local/openqa/salt-pillars-openqa/openqa/workerconf.sls | sed 's/sol activate/power status/' | tr -d '`' | parallel -k
I could check all relevant osd+o3 prg2 hosts for their connection which works nicely.
Updated by livdywan over 1 year ago
okurz wrote:
sed -n -e '/prg2/s/^[^`]*\([^`]*\)`/\1/p' ~/local/openqa/salt-pillars-openqa/openqa/workerconf.sls | sed 's/sol activate/power status/' | tr -d '`' | parallel -k
I could check all relevant osd+o3 prg2 hosts for their connection which works nicely.
Would you mind clarifying how this works? I don't see any output and the command is not super easy to read for me.
Updated by okurz over 1 year ago
cdywan wrote:
Would you mind clarifying how this works? I don't see any output and the command is not super easy to read for me.
well, the workerconf includes all the commands we need within backquotes so with the above command I extracted them and executed them in parallel. The output looks like this
$ sed -n -e '/prg2/s/^[^`]*\([^`]*\)`/\1/p' ~/local/openqa/salt-pillars-openqa/openqa/workerconf.sls | sed 's/sol activate/power status/' | tr -d '`' | parallel -k
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Error: Unable to establish IPMI v2 / RMCP+ session
with the last entry corresponding to the ARM machines, which are so far not accessible.
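For reference, what the extraction pipeline does can be illustrated on a miniature sample of the file (host names, user and password below are made up; the real workerconf.sls carries one backquoted ipmitool invocation per host):

```shell
# Miniature stand-in for workerconf.sls: each host line ends in a
# backquoted ipmitool command, exactly what the sed expression extracts
cat > /tmp/workerconf-sample.sls <<'EOF'
openqaworker21.dmz-prg2.suse.org: `ipmitool -I lanplus -H openqaworker21-ipmi -U ADMIN -P secret sol activate`
openqaworker22.dmz-prg2.suse.org: `ipmitool -I lanplus -H openqaworker22-ipmi -U ADMIN -P secret sol activate`
EOF
# Same pipeline as above, only printed instead of piped into parallel:
# keep prg2 lines, drop everything up to the first backquote, swap the
# interactive "sol activate" for a harmless "power status", strip backquotes
sed -n -e '/prg2/s/^[^`]*\([^`]*\)`/\1/p' /tmp/workerconf-sample.sls \
  | sed 's/sol activate/power status/' | tr -d '`'
```

Piping the result into `parallel -k` then runs one power-status query per host, keeping the output order stable.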
Updated by okurz over 1 year ago
From https://suse.slack.com/archives/C04MDKHQE20/p1689688712037979
(Martin Caj) OpenQA workers in DMZ: Update!
As you can see in the picture, PXE boot finally works for the openQA workers in the DMZ.
The screenshot is from openqaworker21. I did not continue with the installation - that is up to you.
The OS is the Leap 15.5 NET ISO.
I had some issues with UEFI boot there, so in the end I set up boot over HTTP.
When you are testing the rest of the workers please make sure that HTTP boot is enabled in the BIOS (I think it is in Advanced -> Network Stack settings).
TODO ATM is the ARM boot; for those I need to set up a different boot.efi file, I will try that as well.
Meanwhile please test it and let me know.
But first I realized what I had already observed over the past days and now finally need to report: IPv6 ssh to the jump host is not properly routed over OpenVPN, reported as
https://sd.suse.com/servicedesk/customer/portal/1/SD-127745
I connected with port forwarding for worker21 using ssh -4 -L 1443:openqaworker21-ipmi:443 jumpy@oqa-jumpy.dmz-prg2.suse.org
and, within that session, enabled forwarding for the worker22 BMC on the ssh command line ("~C", then "-L 2443:openqaworker22-ipmi:443"). On worker21 I continued with the manual interactive installation that mcaj started, to have at least something usable.
I am unsure how to best use the storage devices. There is nvme0n1 with 6 TB and nvme{1,2}n1 with 500 GB each. In the install shell (alt-f2) I executed
for i in 0 1 2; do fdisk -l /dev/nvme${i}n1 && hdparm -tT /dev/nvme${i}n1; done
and got
- nvme0n1: 20+2GB/s
- nvme1n1: 20+4GB/s
- nvme2n1: 20+4GB/s
so nvme1n1+nvme2n1 seem to be faster but limited in size. I am tending towards using nvme1n1+nvme2n1 as RAID1 for / and nvme0n1 for /var/lib/openqa. So in the expert partitioner I configured nvme1n1+nvme2n1 as a RAID1 with a 512 MiB FAT /boot/efi and the rest as btrfs /, and nvme0n1 as ext4 /var/lib/openqa, based on what I saw on openqaworker19; timezone UTC, default root password for o3 machines, static hostname "openqaworker21", ipv4 and ipv6 forwarding enabled.
While this is installing: on openqaworker22 HTTP and network boot seem already enabled in the BIOS, but not "LEGACY to EFI Support", so this should boot in EFI, let's see. The system showed "Boot HTTP over IPv4" or something and then quickly a grub menu saying "openSUSE Leap 15.5" with two entries "Installation" and "Upgrade", hm, is this a local device? Trying with worker23. In the meantime, whenever I open the BMC for a system I also add the corresponding MAC address entries for ipmi+eth0+eth1 in racktables.
On worker21 the installation went just fine and the host is (directly) reachable for me over ssh as "10.150.1.110". Following https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines I already added worker21+22+23 on o3 in /etc/hosts and /etc/dnsmasq.d/openqa.conf.
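Such entries on ariel might look like the following (a sketch: dnsmasq reads /etc/hosts by default, and 10.150.1.110 is the address worker21 reported above, used here as a placeholder for whatever address each worker should resolve to in the o3 network):

```shell
# Hypothetical addition on ariel; dnsmasq then serves the name to the o3 network
cat >> /etc/hosts <<'EOF'
10.150.1.110 openqaworker21
EOF
```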
Then http://open.qa/docs/#_configure_virtual_interfaces
And then
firewall-cmd --permanent --new-service isotovideo
firewall-cmd --permanent --zone=trusted --add-service=isotovideo
for i in {1..50}; do firewall-cmd --permanent --service=isotovideo --add-port=$((i * 10 + 20003))/tcp ; done
firewall-cmd --permanent --zone=trusted --add-interface=eth0
zypper -n in openQA-worker openQA-continuous-update openQA-auto-update os-autoinst-distri-opensuse-deps swtpm ffmpeg nfs-client bash-completion vim htop strace systemd-coredump iputils tcpdump
systemctl enable --now openqa-worker-auto-restart@{1..30}.service openqa-reload-worker-auto-restart@{1..30}.path openqa-auto-update.timer openqa-continuous-update.timer openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service rebootmgr
and took over the config from worker19 for /etc/openqa/{client.conf,workers.ini}, without the special external video encoder as our new default should be good for VP9 now. I used https://openqa.opensuse.org as the full URL for registering, i.e. putting [openqa.opensuse.org] in client.conf and workers.ini. After starting a worker instance it connected just fine as https://openqa.opensuse.org/admin/workers/754
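Reading the firewall loop above: it opens one TCP port per worker slot, slot i mapping to 20003 + 10*i (my reading of the loop, matching os-autoinst's per-instance port scheme). A quick arithmetic check of the resulting range:

```shell
# First and last ports opened by the "for i in {1..50}" firewalld loop above
for i in 1 50; do echo "slot $i -> port $((i * 10 + 20003))"; done
# prints: slot 1 -> port 20013
#         slot 50 -> port 20503
```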
Hm, what should create br0?
Now for tests:
wget https://raw.githubusercontent.com/os-autoinst/openQA/master/script/fetchneedles -O /usr/local/bin/fetchneedles
chmod +x /usr/local/bin/fetchneedles
groupadd www
useradd --no-create-home --home-dir /var/lib/openqa -s /bin/false geekotest
(cd /var/lib/openqa/share/tests; for i in alp d-installer kubic leap-micro microos windows; do ln -s opensuse $i; done && sudo -u geekotest git clone https://github.com/os-autoinst/os-autoinst-distri-obs.git obs && sudo -u geekotest git clone https://github.com/os-autoinst/os-autoinst-distri-openQA openQA && ln -s openQA openqa && cd openqa && sudo -u geekotest git clone https://github.com/os-autoinst/os-autoinst-needles-openQA.git needles)
/usr/local/bin/fetchneedles
echo '-*/1 * * * * geekotest env updateall=1 force=1 /usr/share/openqa/script/fetchneedles' > /etc/cron.d/openqa-update-git
and then locally
rsync -aHP o3:"~/ovmf-x86*staging*" .
rsync -aHP ovmf-x86*staging* root@10.150.1.110:/usr/share/qemu/
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3438766 TEST+=worker21-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker21
1 job has been created:
- openqa-Tumbleweed-dev-x86_64-Build:TW.21646-openqa_install+publish@64bit-2G -> https://openqa.opensuse.org/tests/3438904
which looks pretty good.
I updated our team and DCT migration team about the status and that we are good to go ahead.
Updated by okurz over 1 year ago
Created a customized AutoYaST profile to make use of those storage devices correctly and am trying to boot that. In the network-booted grub, after "linux /boot/x86_64/loader/linux", we append
console=tty console=ttyS1,115200 autoyast=https://is.gd/oqaay rootpassword=opensuse
with https://is.gd/oqaay pointing to https://raw.githubusercontent.com/okurz/openQA/feature/ay/contrib/ay-openqa-worker.xml.erb . We tried "ttyS0" and "ttyS2" beforehand in vain but ttyS1 is good.
jing /usr/share/YaST2/schema/autoyast/rng/profile.rng autoinst.xml
xmllint --noout --relaxng /usr/share/YaST2/schema/autoyast/rng/profile.rng autoinst.xml
Maybe the erb template does not work. Trying with the non-templated version in https://raw.githubusercontent.com/okurz/openQA/feature/ay/contrib/ay-openqa-worker.xml shortened to https://is.gd/oqaay3 . That seems to work fine.
I logged in with ssh root@10.150.1.114 and installed packages as above. The nice part: with the pattern kvm_server we already get a br0 by default. So we should consider reinstalling worker21 that way later, but for now we have enough new free machines available.
Regarding the test pool server maybe we can use some sophisticated ssh bridging hacks to get rsync to work instead of manual fetchneedles.
mkdir -p /var/lib/openqa/ssh
chown _openqa-worker.nogroup /var/lib/openqa/ssh
sudo -u _openqa-worker ssh-keygen -t ed25519 -f /var/lib/openqa/ssh/id_ed25519 -N ''
and copied the generated public key into o3:/var/lib/openqa/.ssh/authorized_keys, so we should be able to log in remotely as the geekotest user now. That works nicely. I created a draft pull request with some documentation notes for that approach, see https://github.com/os-autoinst/openQA/pull/5251
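With the key deployed, the rsync-over-ssh idea boils down to a TESTPOOLSERVER override in workers.ini; a sketch (the host gate.opensuse.org and port 2214 are assumptions for illustration, to be adjusted to the actual jump setup):

```shell
# Hypothetical: point the worker's test pool sync at ariel over ssh+rsync
# instead of the public rsync daemon
cat >> /etc/openqa/workers.ini <<'EOF'
[https://openqa.opensuse.org]
TESTPOOLSERVER = -e 'ssh -p 2214 -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@gate.opensuse.org:/var/lib/openqa/tests
EOF
```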
For multi-machine I don't want to do everything by hand, following my own semi-automated instructions from #119008-17
updated instructions:
zypper -n in openvswitch os-autoinst-openvswitch firewalld
ovs-vsctl add-br br1
echo 'OS_AUTOINST_USE_BRIDGE=br1' > /etc/sysconfig/os-autoinst-openvswitch
cat > /etc/sysconfig/network/ifcfg-br1 <<EOF
BOOTPROTO='static'
IPADDR='10.0.2.2/15'
STARTMODE='auto'
ZONE=trusted
OVS_BRIDGE='yes'
PRE_UP_SCRIPT="wicked:gre_tunnel_preup.sh"
OVS_BRIDGE_PORT_DEVICE_0='tap0'
EOF
cat > /etc/sysconfig/network/ifcfg-tap0 <<EOF
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='auto'
TUNNEL='tap'
TUNNEL_SET_GROUP='nogroup'
TUNNEL_SET_OWNER='_openqa-worker'
EOF
for i in $(seq 1 50; seq 64 $((64+50)); seq 128 $((128+50))); do ln -sf ifcfg-tap0 /etc/sysconfig/network/ifcfg-tap$i && echo "OVS_BRIDGE_PORT_DEVICE_$i='tap$i'" >> /etc/sysconfig/network/ifcfg-br1; done
cat > /etc/firewalld/zones/trusted.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<zone target="ACCEPT">
<short>Trusted</short>
<description>All network connections are accepted.</description>
<service name="isotovideo"/>
<interface name="br1"/>
<interface name="ovs-system"/>
<interface name="eth0"/>
<masquerade/>
</zone>
EOF
systemctl reload firewalld
zypper -n in libcap-progs
setcap CAP_NET_ADMIN=ep /usr/bin/qemu-system-x86_64
cat > /etc/wicked/scripts/gre_tunnel_preup.sh <<EOF
#!/bin/sh
action="\$1"
bridge="\$2"
ovs-vsctl set bridge \$bridge stp_enable=true
#ovs-vsctl --may-exist add-port \$bridge gre1 -- set interface gre1 type=gre options:remote_ip=<IP address of other host>
EOF
systemctl enable --now openvswitch os-autoinst-openvswitch firewalld
and then adjusting <IP address of other host>
in /etc/wicked/scripts/gre_tunnel_preup.sh . For now we don't know the final IP addresses that we want to give out from dnsmasq on ariel so using dummy values:
# openqaworker21
ovs-vsctl --may-exist add-port $bridge gre1 -- set interface gre1 type=gre options:remote_ip=10.150.1.110
# openqaworker22
#ovs-vsctl --may-exist add-port $bridge gre2 -- set interface gre2 type=gre options:remote_ip=10.150.1.114
# openqaworker23
#ovs-vsctl --may-exist add-port $bridge gre3 -- set interface gre3 type=gre options:remote_ip=10.150.1.115
# openqaworker24
#ovs-vsctl --may-exist add-port $bridge gre4 -- set interface gre4 type=gre options:remote_ip=10.150.1.116
# openqaworker25
#ovs-vsctl --may-exist add-port $bridge gre5 -- set interface gre5 type=gre options:remote_ip=10.150.1.117
# openqaworker26
#ovs-vsctl --may-exist add-port $bridge gre6 -- set interface gre6 type=gre options:remote_ip=10.150.1.118
# openqaworker27
#ovs-vsctl --may-exist add-port $bridge gre7 -- set interface gre7 type=gre options:remote_ip=10.150.1.119
# openqaworker28
#ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.150.1.120
This likely needs a full network stack reinit. We can try a reboot later. But for now trying a multi-machine test within a single worker machine:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3439908 TEST+=worker22-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker22
That unfortunately did not set a customized name for the controller test, and I could not find a way to do it, not even with "TEST:0" and such, nor with something like openqa-cli api --o3 -X put -d '{"test": "controller_po132134"}' jobs/3440383
So instead I will manually delete those test jobs like https://openqa.opensuse.org/tests/3440384 which failed with
[2023-07-19T10:01:43.078582Z] [warn] [pid:113907] !!! : qemu-system-x86_64: -netdev tap,id=qanet0,ifname=tap16,script=no,downscript=no: could not configure /dev/net/tun (tap16): Operation not permitted
Likely a reboot is needed so that all network devices are properly initialized. Triggered that. https://openqa.opensuse.org/t3440396 is currently running and looks very good.
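Side note: the ifcfg-tap symlink loop from the setup steps above can be dry-run in a scratch directory to double-check what it generates (no system files touched); it should produce 152 tap symlinks (tap1..tap50, tap64..tap114, tap128..tap178) next to ifcfg-tap0:

```shell
# Dry run of the ifcfg-tap symlink loop in a temporary directory
tmp=$(mktemp -d)
touch "$tmp/ifcfg-tap0" "$tmp/ifcfg-br1"
for i in $(seq 1 50; seq 64 $((64+50)); seq 128 $((128+50))); do
  ln -sf ifcfg-tap0 "$tmp/ifcfg-tap$i" \
    && echo "OVS_BRIDGE_PORT_DEVICE_$i='tap$i'" >> "$tmp/ifcfg-br1"
done
ls "$tmp" | grep -c '^ifcfg-tap'   # prints 153 (ifcfg-tap0 plus 152 symlinks)
```

On the real system the same loop writes into /etc/sysconfig/network instead.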
Updated by okurz over 1 year ago
https://github.com/os-autoinst/openQA/pull/5254 for multi-machine docs improvements.
https://openqa.opensuse.org/t3440396 failed eventually due to https://openqa.opensuse.org/tests/3440396#step/await_install/6 showing that http://download.opensuse.org/update/tumbleweed could not be reached. This might be due to bigger problems in the network, as the gitlab CI runners were offline as well. Also see https://suse.slack.com/archives/C029APBKLGK/p1689771644435939
But a retriggered openQA job https://openqa.opensuse.org/tests/3440637 shows the same problem now
OK, fixed it partially with firewalld changes. It seems the interfaces now showed up in "public", not "trusted", and masquerading wasn't enabled there. So I did
firewall-cmd --zone=public --add-masquerade
but IMHO the devices should just be in trusted.
I did now
for i in br0 eth0 ; do firewall-cmd --permanent --remove-interface=$i --zone=public ; firewall-cmd --permanent --add-interface=$i --zone=trusted; done
firewall-cmd --permanent --set-default-zone=trusted
but I had already made such a change for eth0 beforehand.
And https://openqa.opensuse.org/tests/3440637 succeeded now. Will try again after a reboot.
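To confirm the final zone layout after such changes, the standard firewall-cmd query options can be used (run as root on the worker; expected results are my reading of the changes above):

```shell
firewall-cmd --get-default-zone               # should now report "trusted"
firewall-cmd --get-active-zones               # br0/eth0 expected under trusted
firewall-cmd --zone=public --query-masquerade # masquerading state in "public"
```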
Updated by okurz over 1 year ago
After the reboot also good. nsinger continued with worker23, I continued with worker24. I will need to pick up creating an ssh key, adding it to o3, and the firewall and multi-machine configuration. I continued with worker24 but can't fully verify. Same as done on worker22, and likely needed on all workers, is a higher timeout for os-autoinst-openvswitch: due to the tap device initialization the complete network stack bootup takes considerably more time:
mkdir -p /etc/systemd/system/os-autoinst-openvswitch.service.d
cat > /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf <<EOF
[Service]
# okurz: 2023-07-19: with more tap devices init takes longer
Environment=OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300
EOF
systemctl daemon-reload
systemctl restart os-autoinst-openvswitch openqa-worker-auto-restart@*
Updated by livdywan over 1 year ago
As discussed in the unblocked meeting, new tests will be run later today now that the virt issues have been sorted out
Updated by okurz over 1 year ago
In #132143#note-46 I commented that we set up dnsmasq on new-ariel and have scheduled new multi-machine tests. fvogt suggested the firewalld-policy scenario so let's use that:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker24-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker24
all passed, see https://openqa.opensuse.org/tests/3456751
and the same for worker21 and worker22:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker21-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker21
3 jobs have been created:
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-firewall@64bit -> https://openqa.opensuse.org/tests/3456766
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-server@64bit -> https://openqa.opensuse.org/tests/3456768
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-client@64bit -> https://openqa.opensuse.org/tests/3456767
That needed more manual cleanup, but we already know that w21 is our "dirty" one that should be reinstalled from AutoYaST with the auto-configured br0.
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker22-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker22
3 jobs have been created:
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-firewall@64bit -> https://openqa.opensuse.org/tests/3456770
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-server@64bit -> https://openqa.opensuse.org/tests/3456769
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-client@64bit -> https://openqa.opensuse.org/tests/3456771
all good as well.
Triggering more tests for validation:
for i in 21 22 24; do end=400 openqa-clone-set https://openqa.opensuse.org/tests/3457252 okurz_load_test_w$i WORKER_CLASS=openqaworker$i BUILD=okurz_poo132134 _GROUP_ID=0; done
-> https://openqa.opensuse.org/tests/overview?build=okurz_poo132134&version=Tumbleweed&distri=opensuse
OK, I think that was not a good choice for a load test as that zdup scenario has a custom max job time and hence no video. So instead I did
for i in 21 22 24; do end=400 openqa-clone-set https://openqa.opensuse.org/tests/3457738 kde-okurz_load_test_w$i WORKER_CLASS=openqaworker$i BUILD=okurz_poo132134 _GROUP_ID=0; done
-> https://openqa.opensuse.org/tests/overview?build=okurz_poo132134&version=Tumbleweed&distri=opensuse
EDIT: The above, and links like https://openqa.opensuse.org/admin/workers/754 for individual workers, show a sufficiently high pass rate for all of w21+w22+w24, so I am going to enable all of them for production, but not for multi-machine tests.
Next steps:
- Enable production worker classes for w22+w24
- Install more workers, e.g. w25, using the above code snippets and https://github.com/os-autoinst/os-autoinst/pull/2347 to see what is missing
- Verify multi-machine and enable production
- Re-install w21
We verified the mm-setup on w21+22+24, but we should only put them into production with "tap" after establishing GRE tunnels to the NUE1 workers, or after disabling all "tap" in NUE1; that is not that urgent though.
Updated by okurz over 1 year ago
- Related to action #133496: [openqa-in-openQA] test fails due to not loaded install test modules added
Updated by okurz over 1 year ago
- Due date changed from 2023-07-31 to 2023-08-14
The test history from workers w21:1 https://openqa.opensuse.org/admin/workers/754 , w22:1 https://openqa.opensuse.org/admin/workers/828 , w24:1 https://openqa.opensuse.org/admin/workers/892 looks good. I will await at least the next Tumbleweed x86_64 snapshot to be published, as visible on https://openqa.opensuse.org/group_overview/1 , before disabling NUE1-based openQA multi-machine tests and enabling them on the new PRG2 workers only.
Adding a comment in /etc/openqa/workers.ini of the NUE1 machines:
# okurz: 2023-07-31: disabled "tap" as prg2 should take over, see
# https://progress.opensuse.org/issues/132134#note-20
Adding ,nue,maxtorhof,nue1,nue1-srv1
to the machines' worker classes. Working on this together with #133580 as I am changing worker classes anyway.
For now I have only enabled the "tap" class on w22, as ovs-vsctl show does not show a GRE tunnel connection to the others.
Updated by okurz over 1 year ago
- Due date deleted (2023-08-14)
- Status changed from In Progress to Blocked
https://openqa.opensuse.org/tests/?resultfilter=Failed&resultfilter=Incomplete shows no problematic entries. With this, blocking on #133160 to be able to install more workers.
Updated by okurz over 1 year ago
I realized that /etc/wicked/scripts/gre_tunnel_preup.sh was not marked as executable, so wicked ifup br1 did not start the script. Now tested on w21 with wicked ifup br1 && journalctl -e --since="1h ago" | grep -i gre
and found mentions. Now ovs-vsctl show also shows
Port gre2
Interface gre2
type: gre
options: {remote_ip="10.150.1.114"}
and the like. I can proceed with the same on the others.
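The fix implied here is a one-liner per affected worker, after which the bridge can be re-initialized and checked with the same commands as above (sketch, to be run as root):

```shell
# Mark the wicked pre-up hook executable, re-apply the bridge config and
# check that the GRE ports now show up in the openvswitch configuration
chmod +x /etc/wicked/scripts/gre_tunnel_preup.sh
wicked ifup br1
ovs-vsctl show | grep -A3 'Port gre'
```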
Updated by okurz over 1 year ago
Enabled worker instances 3..50 on w21 now. Setting up w23 based on https://raw.githubusercontent.com/os-autoinst/os-autoinst/228c7a0921ffb9f0a33a19f70bd4004f872ff4bf/script/os-autoinst-setup-multi-machine. Observed that (obviously) we again need the os-autoinst-openvswitch timeout increase, so I am now proposing https://github.com/os-autoinst/os-autoinst/pull/2349
Updated by okurz over 1 year ago
Because I don't know how to clone openQA jobs such that one ends up on one worker and another on a different worker, I just reran the following command until I ended up with jobs on separate machines:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=-worker21-23-poo132134 _GROUP=0 BUILD= WORKER_CLASS=prg2
I found that two machines still had the old GRE tunnel targets with old 100-range IPs, so I ran wicked ifup all on those, confirmed with ovs-vsctl show that the GRE tunnels point to the correct other machines, and then retriggered the jobs.
So tests that are executed within a single machine are fine; tests spanning multiple workers fail, e.g. https://openqa.opensuse.org/tests/3473385#step/firewalld_policy_objects/138 . I don't know right now what else to explore.
EDIT: I used the above clone command with openqa-clone-job --export-command … but without the WORKER_CLASS, and then added WORKER_CLASS:3450618=openqaworker22 WORKER_CLASS:3450619=openqaworker22 WORKER_CLASS:3450620=openqaworker24 to schedule on specific workers:
- w22+w24: https://openqa.opensuse.org/t3478671 PASSED
- w21+w22: https://openqa.opensuse.org/t3478730 PASSED
- w21+w23: https://openqa.opensuse.org/t3478734 PASSED
- w21+w24: https://openqa.opensuse.org/t3478737 PASSED
- w22+w24: https://openqa.opensuse.org/t3478740 PASSED
- w23+w24: https://openqa.opensuse.org/t3478743 https://openqa.opensuse.org/tests/3478742#step/firewalld_policy_objects/138 FAILED. Hm, maybe it matters whether client or server reside on one or the other.
- w23+w24: Triggered once more: https://openqa.opensuse.org/t3478823, failed again consistently in https://openqa.opensuse.org/t3478824
- w24+w23: And one more where the first job is on w24: https://openqa.opensuse.org/t3478827 PASSED, with the server on w23; maybe w24 is the problematic one?
- w24+w22+w22: https://openqa.opensuse.org/t3478962 FAILED, so is w24 bad? I should recheck after more workers are installed, for cross-checking
Updated by livdywan over 1 year ago
- Related to action #132167: asset uploading failed with http status 502 size:M added
Updated by dheidler over 1 year ago
- Status changed from Blocked to In Progress
Working on openqaworker28
Updated by dheidler over 1 year ago
Installed Leap 15.5 on openqaworker25,26,27 via AutoYaST.
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Assignee changed from okurz to dheidler
dheidler will look into setting up openQA. Instructions:
- Follow https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines
- Call head $(which os-autoinst-setup-multi-machine) to find out which settings you need for each machine, then call e.g. instances=50 os-autoinst-setup-multi-machine
- Copy over from an existing worker /etc/openqa/client.conf, /etc/openqa/workers.ini and /etc/wicked/scripts/gre_tunnel_preup.sh and adjust as necessary
- Enable openQA workers with non-production worker classes first
- Test out verification jobs, e.g. call
openqa-clone-job --skip-chained-deps --export-command --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=-worker21-23-poo132134 _GROUP=0 BUILD= WORKER_CLASS=prg2
and then adjust the worker classes so that jobs end up on one of the existing and one of the new machines
- Then try out boot+install for aarch64
Updated by okurz over 1 year ago
- Related to action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M added
Updated by okurz over 1 year ago
- Copied to action #134123: Setup new PRG2 openQA worker for o3 - two new arm workers size:M added
Updated by okurz over 1 year ago
- Copied to action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M added
Updated by okurz over 1 year ago
- Description updated (diff)
- Due date deleted (2023-08-19)
- Status changed from In Progress to Workable
- Priority changed from Urgent to High
Discussed in the weekly SUSE QE Tools meeting. By now we have stable productive o3 workers running, hence reducing priority. The additional machines can be set up now, but with "High" priority. We carved out two separate tickets, for aarch64 (#134123) and for w26,w27,w28 (#134126), so that leaves only w25 to be set up completely as an openQA worker.
Updated by dheidler over 1 year ago
- Status changed from Workable to In Progress
Updated by dheidler over 1 year ago
The script that can be used to set up a worker after the AutoYaST installation (what I have so far):
zypper ar -p 95 -f 'http://download.opensuse.org/repositories/devel:openQA/$releasever' devel_openQA
zypper ar -p 90 -f 'http://download.opensuse.org/repositories/devel:openQA:Leap:$releasever/$releasever' devel_openQA_Leap
zypper --gpg-auto-import-keys ref
echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
zypper -n in openQA-worker openQA-auto-update openQA-continuous-update os-autoinst-distri-opensuse-deps swtpm # openQA worker services plus dependencies for openSUSE distri or development repo if added previously
zypper -n in ffmpeg-4 # for using external video encoder as it is already configured on some machines like ow19, ow20 and power8
zypper -n in nfs-client # For /var/lib/openqa/share
zypper -n in bash-completion vim htop strace systemd-coredump iputils tcpdump bind-utils # for general tinkering
# for use of nfs
# echo "openqa1-opensuse:/ /var/lib/openqa/share nfs4 noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m 0 0" >> /etc/fstab
# for use of cacheservice
# if connecting via ssh-rsync to o3
mkdir -p /var/lib/openqa/ssh/
chown _openqa-worker: /var/lib/openqa/ssh/
sudo -u _openqa-worker ssh-keygen -t ed25519 -N '' -f /var/lib/openqa/ssh/id_ed25519
# append /var/lib/openqa/ssh/id_ed25519.pub to ariel:/var/lib/openqa/.ssh/authorized_keys
sudo -u _openqa-worker ssh -i /var/lib/openqa/ssh/id_ed25519 -p 2214 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts" geekotest@gate.opensuse.org whoami
# note that ariel can be reached directly via ariel.dmz-prg2.suse.org from suse network
sed -i 's/\(solver.dupAllowVendorChange = \)false/\1true/' /etc/zypp/zypp.conf
instances=30 bash -x $(which os-autoinst-setup-multi-machine)
#[global]
#HOST = https://openqa.opensuse.org
#CACHEDIRECTORY = /var/lib/openqa/cache
##WORKER_CLASS = foo
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
#[https://openqa.opensuse.org]
##TESTPOOLSERVER = -e 'ssh -p 2214 -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@gate.opensuse.org:/var/lib/openqa/tests
##TESTPOOLSERVER = -e 'ssh -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@ariel.dmz-prg2.suse.org:/var/lib/openqa/tests
#TESTPOOLSERVER = rsync://openqa.opensuse.org/tests
zypper -n in rsync sshpass
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/openqa/ /etc/openqa/
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/wicked/scripts/gre_tunnel_preup.sh /etc/wicked/scripts/
sed -i "s/openqaworker24/$HOSTNAME/g" /etc/openqa/workers.ini
# comment out the worker that we are on and enable all other entries
vi /etc/wicked/scripts/gre_tunnel_preup.sh
# use only hostname as worker class for testing
vi /etc/openqa/workers.ini
chown _openqa-worker /etc/openqa/client.conf
chmod 640 /etc/openqa/client.conf
# configure /etc/openqa/client.conf and /etc/openqa/workers.ini, then enable the desired number of worker slots, e.g.:
systemctl enable --now openqa-worker-auto-restart@{1..30}.service openqa-reload-worker-auto-restart@{1..30}.path openqa-auto-update.timer openqa-continuous-update.timer openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service
# on aarch64 do something about hugepages permissions (see https://progress.opensuse.org/issues/53234)
echo "hugetlbfs /dev/hugepages hugetlbfs defaults,gid=kvm,mode=1775,pagesize=1G" >> /etc/fstab
# add this to kernel cmdline via /etc/default/grub (& update-bootloader) on aarch64:
# showopts nospec spectre_v2=off pti=off kpti=off mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 crashkernel=167M enforcing=0
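The cmdline change above could be scripted roughly like this (a sketch: it assumes the stock openSUSE /etc/default/grub layout with a GRUB_CMDLINE_LINUX_DEFAULT line, and demonstrates the edit on a scratch copy; on the real machine, edit /etc/default/grub and run update-bootloader):

```shell
# Sketch: append the aarch64 hugepages/mitigation options to the default kernel cmdline.
# /tmp/grub.demo stands in for /etc/default/grub here.
opts='showopts nospec spectre_v2=off pti=off kpti=off mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 crashkernel=167M enforcing=0'
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > /tmp/grub.demo
sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $opts\"|" /tmp/grub.demo
grep GRUB_CMDLINE /tmp/grub.demo
# on the worker: apply the same sed to /etc/default/grub, then run update-bootloader
```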
Updated by dheidler over 1 year ago

- Status changed from In Progress to Workable
Updated by okurz over 1 year ago
- Assignee deleted (dheidler)
- Target version set to Ready
oops, removed target version by mistake, should have been assignee
Updated by nicksinger over 1 year ago
- Assignee deleted (nicksinger)
assigned the wrong ticket to me, oops
Updated by okurz over 1 year ago
- Priority changed from Urgent to High
We don't have capacity to work on that many infra tasks with urgent prio, reducing to "High"
Updated by livdywan over 1 year ago
We took a look during the mob session. It's not very clear what was done and what's still to be done. It would be great if one of the previous assignees could add this.
- It seems that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/565/diffs may be the list of new workers?
- Apparently they're missing users/ssh keys?
- Was anything done to enable multi-machine tests?
- o3 doesn't seem to list all workers - were any removed for some reason?
Updated by okurz over 1 year ago
Seems you overlooked some comments then. See #132134#note-35. Only part of the workers should be direct o3 openQA workers; some are reserved for bare-metal testing. No users are set up, only root ssh key authentication, the same as we always had for o3 workers. Yes, something was done for multi-machine testing, see the comments.
Updated by livdywan over 1 year ago
- Subject changed from Setup new PRG2 openQA worker for o3 size:M to Setup new PRG2 multi-machine openQA worker for o3 size:M
Since the AC specifically ask for multi-machine it should be reflected in the title
Updated by okurz over 1 year ago
livdywan wrote in #note-48:
Since the AC specifically ask for multi-machine it should be reflected in the title
Actually there should be no need to mention multi-machine explicitly at all; it should be implied now that we finished our saga about treating multi-machine tests as first-class citizens. Maybe we need to clarify which parts of that were effectively not true.
Updated by livdywan over 1 year ago
okurz wrote in #note-47:
Seems you overlooked some comments then. See #132134#note-35 . Only parts of the workers should be direct o3 openQA workers, some reserved for bare-metal testing. No users are setup but only root ssh key authentication same as we always had for o3 workers. Yes, something was done for multi-machine testing, see comments
Okay, so that leaves only w25 to be set up completely as an openQA worker; that is what's left to do here then.
Updated by okurz about 1 year ago
- Related to action #130585: Upgrade o3 workers to openSUSE Leap 15.5 added
Updated by okurz about 1 year ago
- Related to action #134912: Gradually phase out NUE1 based openQA workers size:M added
Updated by livdywan about 1 year ago
https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=microos&flavor=DVD&machine=64bit&test=remote_ssh_controller&version=Tumbleweed#next_previous This might be a good scenario to test with (the other ones referenced here were old and/or failing)
Updated by dheidler about 1 year ago
- File jobs.html jobs.html added
- File jobs.py jobs.py added
- Status changed from Workable to In Progress
Discussed with okurz that openqaworker21-26 should be used as worker hosts for now.
Fixed the kernel issue on the hosts: a kernel downgrade and lock had already been applied on some workers due to a kernel bug, but the lock only covered kernel-default-*, so an update pulled in kernel-base instead of kernel-default, which resulted in failed boots due to missing drivers.
I fixed those machines and added a kernel lock for kernel*.
This will prevent kernel updates via zypper patch for now.
The kernel locks can be removed once the bsc#1214537 fix is in the update repo.
Working locked kernel versions for reference:
kernel-default-5.14.21-150400.24.74.1.x86_64
kernel-default-5.14.21-150500.55.12.1.x86_64
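Widening the lock from kernel-default-* to all kernel packages could look like this (a sketch of the lock management as system configuration; the zypper subcommands removelock/addlock/locks are standard, run as root on the workers):

```shell
# Sketch: replace the too-narrow kernel-default-* lock with a lock on all kernel packages,
# so that zypper patch cannot pull in kernel-base either.
zypper removelock 'kernel-default*'   # drop the old, too-narrow lock
zypper addlock 'kernel*'              # lock kernel-base, kernel-default, ...
zypper locks                          # verify the resulting lock list
```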
What was done:
openqaworker21: io issues? -> reinstall -> ready
openqaworker22: needs fix -> ready -> active!
openqaworker23: lock added -> ready
openqaworker24: fixed -> ready
openqaworker25: lock added -> ready
openqaworker26: lock added -> ready
Verification:
for i in {001..300} ; do openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3600589 TEST+=-dheidler-poo132134-$i _GROUP=0 BUILD=dheidler-poo132134-1 WORKER_CLASS=qemu_x86_64,tap_poo132134 ; done
See attachment for results. Multi-machine jobs worked fine on those workers and across workers.
Updated by okurz about 1 year ago
dheidler wrote in #note-56:
Discussed with okurz that openqaworker21-26 should be used as worker hosts for now.
Fixed the kernel issue on the hosts: a kernel downgrade and lock had already been applied on some workers due to a kernel bug, but the lock only covered kernel-default-*, so an update pulled in kernel-base instead of kernel-default, which resulted in failed boots due to missing drivers.
I fixed those machines and added a kernel lock for kernel*.
This will prevent kernel updates via zypper patch for now.
The kernel locks can be removed once the bsc#1214537 fix is in the update repo.
Working locked kernel versions for reference:
kernel-default-5.14.21-150400.24.74.1.x86_64
kernel-default-5.14.21-150500.55.12.1.x86_64
What was done:
openqaworker21: io issues? -> reinstall -> ready
openqaworker22: needs fix -> ready -> active!
openqaworker23: lock added -> ready
openqaworker24: fixed -> ready
openqaworker25: lock added -> ready
openqaworker26: lock added -> ready
Verification:
for i in {001..300} ; do openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3600589 TEST+=-dheidler-poo132134-$i _GROUP=0 BUILD=dheidler-poo132134-1 WORKER_CLASS=qemu_x86_64,tap_poo132134 ; done
See attachment for results. Multi-machine jobs worked fine on those workers and across workers.
I would say from the infobox on https://openqa.opensuse.org/tests/overview?version=Tumbleweed&build=dheidler-poo132134-1&distri=microos it's already obvious: 602/602 passed, 100.00% pass rate. According to https://en.wikipedia.org/wiki/Rule_of_three_(statistics) that means we have a computed failure probability of less than 0.5% (3/602 ≈ 0.498%), which I would say is pretty good :)
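The rule-of-three upper bound (approximately 3/n for n passing trials with zero failures) can be recomputed quickly:

```shell
# Rule of three: with n=602 passed jobs and 0 failures, the 95% upper bound
# on the failure probability is approximately 3/n.
awk 'BEGIN { n = 602; printf "upper bound: %.3f%%\n", 100 * 3 / n }'
```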
So, what are the next steps you plan?
Updated by dheidler about 1 year ago
- Status changed from In Progress to Resolved
Enabled the workers for production use of multi-machine jobs by removing the _poo132134 suffix from the worker class entries.
Ran 300 tests again: https://openqa.opensuse.org/tests/overview?version=Tumbleweed&distri=microos&build=dheidler-poo132134-2
Checked that multi-machine jobs work fine across all workers (21-26).
Subscribed to https://bugzilla.suse.com/show_bug.cgi?id=1214537 for further updates on when we can remove the kernel package locks.
Updated by dheidler about 1 year ago
Commented out the GRE entries for ow27 and ow28 on all workers in /etc/wicked/scripts/gre_tunnel_preup.sh, as those machines are planned to be used as generalhw SUTs.
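For reference, the comment-out could be done mechanically like this (a sketch against a stand-in file with made-up remote IPs and an assumed one-ovs-vsctl-line-per-peer layout; the real target is /etc/wicked/scripts/gre_tunnel_preup.sh as synced earlier in this ticket):

```shell
# Sketch: comment out GRE tunnel entries for openqaworker27/28 in the preup script.
# /tmp/gre_tunnel_preup.demo stands in for /etc/wicked/scripts/gre_tunnel_preup.sh;
# the remote IPs below are hypothetical.
cat > /tmp/gre_tunnel_preup.demo <<'EOF'
ovs-vsctl --may-exist add-port $bridge gre25 -- set interface gre25 type=gre options:remote_ip=192.168.0.25
ovs-vsctl --may-exist add-port $bridge gre27 -- set interface gre27 type=gre options:remote_ip=192.168.0.27
ovs-vsctl --may-exist add-port $bridge gre28 -- set interface gre28 type=gre options:remote_ip=192.168.0.28
EOF
sed -i '/gre2[78]/s/^/#/' /tmp/gre_tunnel_preup.demo   # prefix matching lines with '#'
grep '^#' /tmp/gre_tunnel_preup.demo                   # show what was disabled
```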