action #132134

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Setup new PRG2 multi-machine openQA worker for o3 size:M

Added by okurz 11 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-06-29
Due date:
% Done: 0%
Estimated time:

Description

Motivation

New hardware was ordered to serve as openQA workers for o3, both x86_64 and aarch64. We can connect those machines to the o3 webUI VM instance regardless of whether o3 is still running from NUE1 or already from PRG2.

Acceptance criteria

  • AC1: o3 multi-machine jobs run successfully on new PRG2 openQA workers
  • AC2: All o3 workers have the same relevant downgrades and package locks as others, e.g. see w19+w21 as reference for kernel-default, linked to https://bugzilla.suse.com/show_bug.cgi?id=1214537

Suggestions

Out of scope


Files

jobs.html (41.5 KB) jobs.html dheidler, 2023-10-02 09:50
jobs.py (1.25 KB) jobs.py dheidler, 2023-10-02 09:50

Related issues 9 (1 open, 8 closed)

Related to openQA Project - action #133496: [openqa-in-openQA] test fails due to not loaded install test modules (Resolved, okurz, start 2023-07-28)
Related to openQA Project - action #132167: asset uploading failed with http status 502 size:M (Resolved, kraih, start 2023-06-29, due 2023-08-30)
Related to openQA Project - action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M (Resolved, dheidler, start 2023-07-19, due 2023-10-31)
Related to openQA Project - action #130585: Upgrade o3 workers to openSUSE Leap 15.5 (Resolved, okurz)
Related to openQA Infrastructure - action #134912: Gradually phase out NUE1 based openQA workers size:M (Resolved, okurz)
Copied to openQA Infrastructure - action #132137: Setup new PRG2 openQA worker for osd size:M (Resolved, mkittler, start 2023-06-29)
Copied to openQA Infrastructure - action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M (Resolved, nicksinger, start 2023-06-29)
Copied to openQA Infrastructure - action #134123: Setup new PRG2 openQA worker for o3 - two new arm workers size:M (Resolved, nicksinger)
Copied to openQA Infrastructure - action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M (Blocked, okurz)

Actions #1

Updated by okurz 11 months ago

  • Copied to action #132137: Setup new PRG2 openQA worker for osd size:M added
Actions #2

Updated by okurz 11 months ago

  • Copied to action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Actions #3

Updated by okurz 11 months ago

  • Subject changed from Setup new PRG2 openQA worker for o3 to Setup new PRG2 openQA worker for o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 11 months ago

  • Assignee set to okurz

Nikolay Dachev contacted me yesterday and he is apparently preparing qe-jumpy.qe.prg2.suse.org or something similar. We expect that to be fully available to us in the next few days.

As long as ariel is still in NUE1 we should set up the new machines and try to connect to ariel, e.g. over https on the public internet, plus a workaround of running fetchneedles locally on each worker in a cron job every minute and not using the worker cache for tests. While this is running we can look into providing rsync, either directly by allowing rsync access over the public internet, or over a dedicated SUSE-internal connection between the old NUE1-o3 network and the new PRG2-o3 network, or with some fancy ssh config for rsync, like rsync -e "ssh -p 4242 special.host" …

Actions #5

Updated by okurz 11 months ago

  • Priority changed from Normal to High
Actions #6

Updated by okurz 11 months ago

I received credentials by mcaj, tried them out and could not access the jump host over ssh:

Hi, I got your email about DMZ openqa worker test. Same IPv6 problem to access. Over IPv4 I got immediate ssh access but my ssh key is not accepted and I don't know any password for ssh authentication

Actions #7

Updated by okurz 10 months ago

  • Status changed from Workable to In Progress

I put the IPMI credentials into https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1754 but still IPMI access over the SSH jump host does not yet work.

Actions #8

Updated by okurz 10 months ago

  • Priority changed from High to Urgent
Actions #9

Updated by openqa_review 10 months ago

  • Due date set to 2023-07-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 10 months ago

  • Project changed from 46 to openQA Infrastructure
  • Category deleted (Infrastructure)
Actions #11

Updated by okurz 10 months ago

Apparently the wrong jump host config was copy/pasted, fixing in salt pillars

Actions #12

Updated by okurz 10 months ago

Fixed in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/565 (merged). Now all x86_64 hosts are accessible.

With

sed -n -e '/prg2/s/^[^`]*\([^`]*\)`/\1/p' ~/local/openqa/salt-pillars-openqa/openqa/workerconf.sls | sed 's/sol activate/power status/' | tr -d '`' | parallel -k

I could check all relevant osd+o3 prg2 hosts for their connection which works nicely.

Actions #13

Updated by livdywan 10 months ago

okurz wrote:

sed -n -e '/prg2/s/^[^`]*\([^`]*\)`/\1/p' ~/local/openqa/salt-pillars-openqa/openqa/workerconf.sls | sed 's/sol activate/power status/' | tr -d '`' | parallel -k

I could check all relevant osd+o3 prg2 hosts for their connection which works nicely.

Would you mind clarifying how this works? I don't see any output and the command is not super easy to read for me.

Actions #14

Updated by okurz 10 months ago

cdywan wrote:

Would you mind clarifying how this works? I don't see any output and the command is not super easy to read for me.

well, the workerconf includes all the commands we need within backquotes so with the above command I extracted them and executed them in parallel. The output looks like this

$ sed -n -e '/prg2/s/^[^`]*\([^`]*\)`/\1/p' ~/local/openqa/salt-pillars-openqa/openqa/workerconf.sls | sed 's/sol activate/power status/' | tr -d '`' | parallel -k
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Chassis Power is on
Error: Unable to establish IPMI v2 / RMCP+ session

with the last entry corresponding to the ARM machines, which are not accessible so far.
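For readers who find the one-liner hard to parse, the extraction step can be sketched as a small function. This is only a sketch, assuming each relevant line of workerconf.sls carries exactly one command wrapped in backquotes; the function name extract_ipmi_commands is made up for illustration and is not part of any tooling. It only prints the commands, so they can be reviewed before being piped to parallel.

```bash
# Print the backquoted command from every line mentioning "prg2",
# with "sol activate" swapped for the harmless "power status".
# Assumption: one `command` in backquotes per matching line.
extract_ipmi_commands() {
    conf="$1"
    sed -n '/prg2/s/^[^`]*`\([^`]*\)`.*$/\1/p' "$conf" \
        | sed 's/sol activate/power status/'
}
```

Something like extract_ipmi_commands ~/local/openqa/salt-pillars-openqa/openqa/workerconf.sls | parallel -k would then run the printed commands in order.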

Actions #15

Updated by okurz 10 months ago

From https://suse.slack.com/archives/C04MDKHQE20/p1689688712037979

(Martin Caj) OpenQA workers in DMZ: Update !
As you can see in the picture the PXE boot finally works for openqa workers in dmz.
The screenshot is from openqaworker21. I did not continue with installation - that up to you.
The OS is Leap 15.5 NET ISO.
I had some issues with uefi boot there, so in the end I setup there boot over http.
When you are testing the rest of workers please make sure that http boot is enable in Bios (I think its in advance -> network stack settings).
TODO ATM its the ARM boot, for them I need to setup different boot.efi file, I will try it as well.
meanwhile please test it and let me know

But first I realized what I had already observed the past days and now I finally need to report: IPv6 ssh to the jump host is not properly routed over openVPN, reported
https://sd.suse.com/servicedesk/customer/portal/1/SD-127745

I connected with forwarding for worker21 using ssh -4 -L 1443:openqaworker21-ipmi:443 jumpy@oqa-jumpy.dmz-prg2.suse.org and, within that session, used the ssh command line to enable forwarding for the worker22 BMC as well ("~C" and then "-L 2443:openqaworker22-ipmi:443"). On worker21 I continued with the manual interactive installation that mcaj started, to have at least something usable.
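The forwarding for several BMCs can also be given in one ssh call instead of adding it interactively. A minimal sketch, assuming the port scheme from above (first worker on local port 1443, second on 2443, and so on); the helper name bmc_forward_cmd is hypothetical and it only prints the command rather than running it.

```bash
# Print (not run) an ssh command that forwards local ports to the BMC web
# interfaces of the given worker numbers through the jump host.
# Port scheme taken from the comment above: 1st worker -> 1443, 2nd -> 2443.
bmc_forward_cmd() {
    jump="jumpy@oqa-jumpy.dmz-prg2.suse.org"
    cmd="ssh -4"
    n=1
    for w in "$@"; do
        cmd="$cmd -L ${n}443:openqaworker${w}-ipmi:443"
        n=$((n + 1))
    done
    echo "$cmd $jump"
}
```

For example, bmc_forward_cmd 21 22 prints a command after which the two BMC web interfaces should be reachable on https://localhost:1443 and https://localhost:2443.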

I am unsure about how to best use the storage devices. There is nvme0n1 with 6TB and nvme{1,2}n1 with 500GB. In the install shell (alt-f2) I executed

for i in 0 1 2; do fdisk -l /dev/nvme${i}n1 && hdparm -tT /dev/nvme${i}n1; done

and got

  • nvme0n1: 20+2GB/s
  • nvme1n1: 20+4GB/s
  • nvme2n1: 20+4GB/s

so nvme1n1+nvme2n1 seem to be faster but with limited size. I tend towards using nvme1n1+nvme2n1 as RAID1 for / and nvme0n1 for /var/lib/openqa. So in the expert partitioner I configured nvme1n1+nvme2n1 as part of a RAID1 with 512 MiB /boot/efi FAT and the rest as / btrfs, and nvme0n1 as /var/lib/openqa ext4, based on what I saw on openqaworker19. Further settings: timezone UTC, default root password for o3 machines, static hostname "openqaworker21", ipv4 and ipv6 forwarding enabled.

While this is installing: on openqaworker22 http and network boot already seem to be enabled in the BIOS but not "LEGACY to EFI Support", so this should boot in EFI, let's see. The system showed "Boot HTTP over IPv4" or something and then quickly a grub menu that says "openSUSE Leap 15.5" with two entries "Installation" and "Upgrade", hm, is this a local device? Trying with worker23.

In the meantime, whenever I open the BMC for a system I also add the according MAC address entries in racktables for ipmi+eth0+eth1.

On worker21 the installation went just fine and the host is (directly) reachable for me over ssh as "10.150.1.110". So following https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines I added worker21+22+23 on o3 in /etc/hosts and /etc/dnsmasq.d/openqa.conf already.

Then http://open.qa/docs/#_configure_virtual_interfaces

And then

firewall-cmd --permanent --new-service isotovideo
firewall-cmd --permanent --zone=trusted --add-service=isotovideo
for i in {1..50}; do firewall-cmd --permanent --service=isotovideo --add-port=$((i * 10 + 20003))/tcp ; done
firewall-cmd --permanent --zone=trusted --add-interface=eth0
zypper -n in openQA-worker openQA-continuous-update openQA-auto-update os-autoinst-distri-opensuse-deps swtpm ffmpeg nfs-client bash-completion vim htop strace systemd-coredump iputils tcpdump
systemctl enable --now openqa-worker-auto-restart@{1..30}.service openqa-reload-worker-auto-restart@{1..30}.path openqa-auto-update.timer openqa-continuous-update.timer openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service rebootmgr

and taking over config from worker19 for /etc/openqa/{client.conf,workers.ini} without the special external video encoder as our new default should be good for VP9 now. And using https://openqa.opensuse.org as the full URL for registering. So putting [openqa.opensuse.org] in client.conf and workers.ini. After starting a worker instance that one connected just fine as https://openqa.opensuse.org/admin/workers/754
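To see which ports the firewall-cmd loop above actually opens, the arithmetic can be spelled out. This is only a sketch of the formula from that loop (instance number times 10 plus 20003); the function names are made up for illustration. Note that the loop opens ports for 50 instances while only 30 worker instances are enabled in the systemctl call.

```bash
# Port opened for worker instance $1 by the loop above: instance * 10 + 20003.
isotovideo_port() {
    echo $(( $1 * 10 + 20003 ))
}

# List the ports for instances 1..N (N defaults to 50, as in the loop above).
isotovideo_ports() {
    i=1
    while [ "$i" -le "${1:-50}" ]; do
        isotovideo_port "$i"
        i=$((i + 1))
    done
}
```

Calling isotovideo_ports 3 lists the ports for the first three instances, which can be compared against firewall-cmd --permanent --service=isotovideo --get-ports.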

Hm, what should create br0?

Now for tests:

wget https://raw.githubusercontent.com/os-autoinst/openQA/master/script/fetchneedles -O /usr/local/bin/fetchneedles
chmod +x /usr/local/bin/fetchneedles
groupadd www
useradd --no-create-home --home-dir /var/lib/openqa -s /bin/false geekotest
(cd /var/lib/openqa/share/tests; for i in alp d-installer kubic leap-micro microos windows; do ln -s opensuse $i; done && sudo -u geekotest git clone https://github.com/os-autoinst/os-autoinst-distri-obs.git obs && sudo -u geekotest git clone https://github.com/os-autoinst/os-autoinst-distri-openQA openQA && ln -s openQA openqa && cd openqa && sudo -u geekotest git clone https://github.com/os-autoinst/os-autoinst-needles-openQA.git needles)
/usr/local/bin/fetchneedles
echo '-*/1    * * * *  geekotest     env updateall=1 force=1 /usr/share/openqa/script/fetchneedles' > /etc/cron.d/openqa-update-git

and then locally

rsync -aHP o3:"~/ovmf-x86*staging*" .
rsync -aHP ovmf-x86*staging* root@10.150.1.110:/usr/share/qemu/
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3438766 TEST+=worker21-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker21

1 job has been created:

which looks pretty good.
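The same kind of verification job is cloned for several workers in this ticket, so the call can be wrapped in a helper. A minimal sketch that reuses exactly the options from the command above; the function name clone_cmd_for_worker and its argument order are assumptions, and it prints the command instead of executing it.

```bash
# Print (not run) the openqa-clone-job call for a verification job pinned to
# one worker. $1: job URL, $2: worker host name, $3: tag for the test name.
clone_cmd_for_worker() {
    job_url="$1"; worker="$2"; tag="$3"
    # numeric suffix of the host name, e.g. "21" from "openqaworker21"
    num="${worker#openqaworker}"
    echo "openqa-clone-job --skip-chained-deps --within-instance $job_url" \
         "TEST+=worker${num}-${tag} _GROUP=0 BUILD= WORKER_CLASS=$worker"
}
```

For example: clone_cmd_for_worker https://openqa.opensuse.org/tests/3438766 openqaworker21 poo132134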

I updated our team and DCT migration team about the status and that we are good to go ahead.

Actions #16

Updated by okurz 10 months ago

Created a customized autoyast to make use of those storage devices correctly and trying to boot that. From the network booted grub after "linux /boot/x86_64/loader/linux" we put

console=tty console=ttyS1,115200 autoyast=https://is.gd/oqaay rootpassword=opensuse

with https://is.gd/oqaay pointing to https://raw.githubusercontent.com/okurz/openQA/feature/ay/contrib/ay-openqa-worker.xml.erb . We tried "ttyS0" and "ttyS2" before in vain but ttyS1 works.

jing /usr/share/YaST2/schema/autoyast/rng/profile.rng autoinst.xml
xmllint --noout --relaxng /usr/share/YaST2/schema/autoyast/rng/profile.rng autoinst.xml

maybe the erb template does not work. Trying with non-templated version in https://raw.githubusercontent.com/okurz/openQA/feature/ay/contrib/ay-openqa-worker.xml shortened to https://is.gd/oqaay3 . That seems to work fine.

ssh root@10.150.1.114 and installed packages as above. The good point is with the pattern kvm_server we now have by default already a br0. So we should reconsider installing worker21 with that again afterwards but for now we have enough new free machines available.

Regarding the test pool server maybe we can use some sophisticated ssh bridging hacks to get rsync to work instead of manual fetchneedles.

mkdir -p /var/lib/openqa/ssh
chown _openqa-worker:nogroup /var/lib/openqa/ssh
sudo -u _openqa-worker ssh-keygen -t ed25519 -f /var/lib/openqa/ssh/id_ed25519 -N ''

and copied that generated public key into o3:/var/lib/openqa/.ssh/authorized_keys so we should be able to login as geekotest user remotely now. That works nicely. I created a draft pull request with some notes extensions for that approach, see https://github.com/os-autoinst/openQA/pull/5251

For multi-machine I don't want to do everything by hand, following my own semi-automated instructions from #119008-17

updated instructions:

zypper -n in openvswitch os-autoinst-openvswitch firewalld
ovs-vsctl add-br br1
echo 'OS_AUTOINST_USE_BRIDGE=br1' > /etc/sysconfig/os-autoinst-openvswitch
cat > /etc/sysconfig/network/ifcfg-br1 <<EOF
BOOTPROTO='static'
IPADDR='10.0.2.2/15'
STARTMODE='auto'
ZONE=trusted
OVS_BRIDGE='yes'
PRE_UP_SCRIPT="wicked:gre_tunnel_preup.sh"
OVS_BRIDGE_PORT_DEVICE_0='tap0'
EOF
cat > /etc/sysconfig/network/ifcfg-tap0 <<EOF
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='auto'
TUNNEL='tap'
TUNNEL_SET_GROUP='nogroup'
TUNNEL_SET_OWNER='_openqa-worker'
EOF
for i in $(seq 1 50; seq 64 $((64+50)); seq 128 $((128+50))); do ln -sf ifcfg-tap0 /etc/sysconfig/network/ifcfg-tap$i && echo "OVS_BRIDGE_PORT_DEVICE_$i='tap$i'" >> /etc/sysconfig/network/ifcfg-br1; done
cat > /etc/firewalld/zones/trusted.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<zone target="ACCEPT">
  <short>Trusted</short>
  <description>All network connections are accepted.</description>
  <service name="isotovideo"/>
  <interface name="br1"/>
  <interface name="ovs-system"/>
  <interface name="eth0"/>
  <masquerade/>
</zone>
EOF
systemctl reload firewalld
zypper -n in libcap-progs
setcap CAP_NET_ADMIN=ep /usr/bin/qemu-system-x86_64
cat > /etc/wicked/scripts/gre_tunnel_preup.sh <<EOF
#!/bin/sh
action="\$1"
bridge="\$2"
ovs-vsctl set bridge \$bridge stp_enable=true
#ovs-vsctl --may-exist add-port \$bridge gre1 -- set interface gre1 type=gre options:remote_ip=<IP address of other host>
EOF
systemctl enable --now openvswitch os-autoinst-openvswitch firewalld

and then adjusting <IP address of other host> in /etc/wicked/scripts/gre_tunnel_preup.sh . For now we don't know the final IP addresses that we want to give out from dnsmasq on ariel so using dummy values:

# openqaworker21
ovs-vsctl --may-exist add-port $bridge gre1 -- set interface gre1 type=gre options:remote_ip=10.150.1.110
# openqaworker22
#ovs-vsctl --may-exist add-port $bridge gre2 -- set interface gre2 type=gre options:remote_ip=10.150.1.114
# openqaworker23
#ovs-vsctl --may-exist add-port $bridge gre3 -- set interface gre3 type=gre options:remote_ip=10.150.1.115
# openqaworker24
#ovs-vsctl --may-exist add-port $bridge gre4 -- set interface gre4 type=gre options:remote_ip=10.150.1.116
# openqaworker25
#ovs-vsctl --may-exist add-port $bridge gre5 -- set interface gre5 type=gre options:remote_ip=10.150.1.117
# openqaworker26
#ovs-vsctl --may-exist add-port $bridge gre6 -- set interface gre6 type=gre options:remote_ip=10.150.1.118
# openqaworker27
#ovs-vsctl --may-exist add-port $bridge gre7 -- set interface gre7 type=gre options:remote_ip=10.150.1.119
# openqaworker28
#ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.150.1.120
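Instead of editing these lines by hand on every machine, the list can be generated from one shared list of worker IPs, skipping the local one. A minimal sketch under the assumption that all hosts use the same IP order so that greN means the same peer everywhere; the function name gre_tunnel_lines is hypothetical.

```bash
# Print the "ovs-vsctl ... add-port" lines for gre_tunnel_preup.sh.
# $1: IP of the local machine (skipped); remaining args: IPs of all workers
# in a fixed order, so greN keeps the same number on every host.
gre_tunnel_lines() {
    self="$1"; shift
    n=0
    for ip in "$@"; do
        n=$((n + 1))
        [ "$ip" = "$self" ] && continue
        echo "ovs-vsctl --may-exist add-port \$bridge gre$n -- set interface gre$n type=gre options:remote_ip=$ip"
    done
}
```

On openqaworker22, for example, gre_tunnel_lines 10.150.1.114 10.150.1.110 10.150.1.114 10.150.1.115 would print the gre1 and gre3 lines and leave out gre2. The output still has to be appended to the script and reviewed manually.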

This likely needs a full network stack reinit. We can try a reboot later. But for now trying a multi-machine test within the one worker machine:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3439908 TEST+=worker22-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker22

That unfortunately did not set a customized name for the controller test and I could not find a way to do it, neither with "TEST:0" and such nor with something like openqa-cli api --o3 -X put -d '{"test": "controller_po132134"}' jobs/3440383 . So instead I will manually delete those test jobs, like https://openqa.opensuse.org/tests/3440384 which failed with

[2023-07-19T10:01:43.078582Z] [warn] [pid:113907] !!! : qemu-system-x86_64: -netdev tap,id=qanet0,ifname=tap16,script=no,downscript=no: could not configure /dev/net/tun (tap16): Operation not permitted

likely a reboot is needed so that all network devices are properly initialized. Triggered that. https://openqa.opensuse.org/t3440396 is currently running and looks very good.

Actions #17

Updated by okurz 10 months ago

https://github.com/os-autoinst/openQA/pull/5254 for multi-machine docs improvements.

https://openqa.opensuse.org/t3440396 failed eventually due to https://openqa.opensuse.org/tests/3440396#step/await_install/6 showing that http://download.opensuse.org/update/tumbleweed could not be reached. Well, this might be due to bigger problems in the network as also gitlab CI runners were offline. Also see https://suse.slack.com/archives/C029APBKLGK/p1689771644435939

But a retriggered openQA job https://openqa.opensuse.org/tests/3440637 shows the same problem now

ok, fixed it partially with firewalld changes. seems like interfaces now showed up in "public", not "trusted" and masquerading wasn't enabled there. So I did

firewall-cmd --zone=public --add-masquerade

but IMHO the devices should just be in trusted.

I did now

for i in br0 eth0 ; do firewall-cmd --permanent --remove-interface=$i --zone=public ; firewall-cmd --permanent --add-interface=$i --zone=trusted; done
firewall-cmd --permanent --set-default-zone=trusted

but I had already done such a change for eth0 before.

And https://openqa.opensuse.org/tests/3440637 succeeded now. Will try again after a reboot.
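To spot interfaces that ended up in the wrong zone again, e.g. after a reboot, the zone assignment can be checked with a small filter. This is a sketch only, assuming the usual output layout of firewall-cmd --get-active-zones (a zone name line followed by indented "interfaces: ..." lines); the function name interfaces_outside_trusted is made up.

```bash
# Read "firewall-cmd --get-active-zones"-style text on stdin and print every
# interface that is assigned to a zone other than "trusted".
# Assumed input format: a zone name line, followed by indented
# "interfaces: ..." (and possibly "sources: ...") lines.
interfaces_outside_trusted() {
    awk '
        /^[^ \t]/ { zone = $1; next }   # unindented line: zone name
        $1 == "interfaces:" && zone != "trusted" {
            for (i = 2; i <= NF; i++) print $i " (" zone ")"
        }
    '
}
```

Usage would be firewall-cmd --get-active-zones | interfaces_outside_trusted, where empty output means all interfaces are in trusted.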

Actions #18

Updated by okurz 10 months ago

After reboot also good. nsinger continued with worker23, I continued with worker24. I will need to pick up creating an ssh key, adding that to o3, and the firewall and multi-machine configuration. I continued with worker24 but can't fully verify. Same as on worker22, and likely needed on all machines, is a higher timeout for os-autoinst-openvswitch, as with the tap device initialization the complete network stack bootup takes considerably more time:

mkdir -p /etc/systemd/system/os-autoinst-openvswitch.service.d
cat > /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf <<EOF
[Service]
# okurz: 2023-07-19: with more tap devices init takes longer
Environment=OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300
EOF
systemctl daemon-reload
systemctl restart os-autoinst-openvswitch 'openqa-worker-auto-restart@*'
Actions #19

Updated by livdywan 10 months ago

As discussed in the unblocked, new tests will be run later today with the virt issues having been sorted out

Actions #20

Updated by okurz 10 months ago

In #132143#note-46 I commented that we set up dnsmasq on new-ariel and then scheduled new multi-machine tests. fvogt suggested firewalld-policy so let's do that:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker24-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker24

all passed, see https://openqa.opensuse.org/tests/3456751

and the same for worker21 and worker22:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker21-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker21

3 jobs have been created:

that needed more manual cleanup but we already know that w21 is our "dirty" one that should be reinstalled from autoyast with the auto-configured br0.

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker22-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker22

3 jobs have been created:

all good as well.

Triggering more tests for validation:

for i in 21 22 24; do end=400 openqa-clone-set https://openqa.opensuse.org/tests/3457252 okurz_load_test_w$i WORKER_CLASS=openqaworker$i BUILD=okurz_poo132134 _GROUP_ID=0; done

-> https://openqa.opensuse.org/tests/overview?build=okurz_poo132134&version=Tumbleweed&distri=opensuse

ok, I think that was not a good choice for a load test as that zdup scenario has a custom max job time, hence no video. So instead I did

for i in 21 22 24; do end=400 openqa-clone-set https://openqa.opensuse.org/tests/3457738 kde-okurz_load_test_w$i WORKER_CLASS=openqaworker$i BUILD=okurz_poo132134 _GROUP_ID=0; done

-> https://openqa.opensuse.org/tests/overview?build=okurz_poo132134&version=Tumbleweed&distri=opensuse

EDIT: The above and links like https://openqa.opensuse.org/admin/workers/754 for individual workers show a sufficiently high pass-rate for all w21+w22+w24 so I am going to enable all for production but not for multi-machine tests.

Next steps:

  1. Enable production worker classes for w22+w24
  2. Install more workers, e.g. w25, using the above code snippets https://github.com/os-autoinst/os-autoinst/pull/2347 to see what is missing
  3. Verify multi-machine and enable production
  4. Re-install w21

We verified the mm-setup on w21+22+24 but we should only put them in production with tap after establishing gre tunnels to the nue1 workers or we disable all tap in nue1 but that's not that urgent.

Actions #21

Updated by okurz 10 months ago

  • Related to action #133496: [openqa-in-openQA] test fails due to not loaded install test modules added
Actions #22

Updated by okurz 10 months ago

  • Due date changed from 2023-07-31 to 2023-08-14

The test history from workers w21:1 https://openqa.opensuse.org/admin/workers/754 , w22:1 https://openqa.opensuse.org/admin/workers/828 , w24:1 https://openqa.opensuse.org/admin/workers/892 looks good. I will await at least the next Tumbleweed x86_64 snapshot to be published as visible on https://openqa.opensuse.org/group_overview/1 before disabling NUE1 based openQA multi-machine tests and enable them on the new PRG2 workers only.

Adding comment in /etc/openqa/workers.ini of NUE1 machines:

# okurz: 2023-07-31: disabled "tap" as prg2 should take over, see
# https://progress.opensuse.org/issues/132134#note-20

Adding ,nue,maxtorhof,nue1,nue1-srv1 to machines. Working on this together with #133580 as I am changing worker classes anyway.

For now I have only enabled the "tap" class on w22 as ovs-vsctl show does not show a GRE tunnel connection to the others

Actions #23

Updated by okurz 10 months ago

  • Due date deleted (2023-08-14)
  • Status changed from In Progress to Blocked

https://openqa.opensuse.org/tests/?resultfilter=Failed&resultfilter=Incomplete shows no problematic entries. With this blocking on #133160 to be able to install more workers.

Actions #24

Updated by okurz 10 months ago

I realized that /etc/wicked/scripts/gre_tunnel_preup.sh was not marked as executable so wicked ifup br1 did not start the script. Now tested on w21 with wicked ifup br1 && journalctl -e --since="1h ago" | grep -i gre and found mentions. Now ovs-vsctl show also shows

        Port gre2
            Interface gre2
                type: gre
                options: {remote_ip="10.150.1.114"}

and alike. Can progress with the same on others.
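Since a missing executable bit silently disables the pre-up script, a quick check on each worker may help. A minimal sketch; the function name non_executable_scripts is hypothetical and the default directory is the one used in this ticket.

```bash
# Print regular files in a directory that lack the owner execute bit.
# $1: directory to check (defaults to the wicked scripts directory).
non_executable_scripts() {
    # "! -perm -100" matches files without the owner execute bit (octal 0100)
    find "${1:-/etc/wicked/scripts}" -maxdepth 1 -type f ! -perm -100 | sort
}
```

Any path printed by non_executable_scripts would be a candidate for chmod +x, followed by wicked ifup br1 as described above.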

Actions #25

Updated by okurz 10 months ago

Enabled now worker instances 3..50 on w21. Setting up w23 based on https://raw.githubusercontent.com/os-autoinst/os-autoinst/228c7a0921ffb9f0a33a19f70bd4004f872ff4bf/script/os-autoinst-setup-multi-machine. Observed that (obviously) again we need the os-autoinst-openvswitch timeout increase so I am proposing https://github.com/os-autoinst/os-autoinst/pull/2349 now

Actions #26

Updated by okurz 10 months ago

Because I don't know how to clone openQA jobs so that one ends up on one worker and another on a different worker I just ran the following command until I ended up with jobs on separate machines:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=-worker21-23-poo132134 _GROUP=0 BUILD= WORKER_CLASS=prg2

I found that two machines still had the old GRE tunnel targets with old 100-range IPs so I did wicked ifup all on those and confirmed that the GRE tunnels point to the correct other machines with ovs-vsctl show and then retriggered jobs.

So tests that are executed within a single machine are fine, tests spanning multiple workers fail, e.g. https://openqa.opensuse.org/tests/3473385#step/firewalld_policy_objects/138 . I don't know right now what else to explore

EDIT: I used the above clone command but with openqa-clone-job --export-command … without the WORKER_CLASS and then added WORKER_CLASS:3450618=openqaworker22 WORKER_CLASS:3450619=openqaworker22 WORKER_CLASS:3450620=openqaworker24 to schedule on specific workers
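The per-job overrides from the EDIT can be assembled from a list of job/worker pairs. A minimal sketch; the helper name per_job_worker_classes and the JOBID=CLASS input format are assumptions made for illustration, only the resulting WORKER_CLASS:JOBID=CLASS form is taken from the comment above.

```bash
# Turn "JOBID=CLASS" pairs into per-job overrides of the form
# WORKER_CLASS:JOBID=CLASS, as used in the EDIT above.
per_job_worker_classes() {
    out=""
    for pair in "$@"; do
        # ${pair%%=*} is the job id, ${pair#*=} the worker class
        out="$out${out:+ }WORKER_CLASS:${pair%%=*}=${pair#*=}"
    done
    echo "$out"
}
```

The printed settings can then be appended to the command emitted by openqa-clone-job --export-command … in place of a single WORKER_CLASS setting.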

Actions #27

Updated by livdywan 10 months ago

  • Related to action #132167: asset uploading failed with http status 502 size:M added
Actions #28

Updated by dheidler 10 months ago

  • Status changed from Blocked to In Progress

Working on openqaworker28

Actions #29

Updated by dheidler 10 months ago

Installed Leap 15.5 on openqaworker25, 26 and 27 via autoyast.

Actions #30

Updated by openqa_review 10 months ago

  • Due date set to 2023-08-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #31

Updated by okurz 10 months ago

  • Assignee changed from okurz to dheidler

dheidler will look into setting up openQA. Instructions:

  • Follow https://progress.opensuse.org/projects/openqav3/wiki/#Setup-guide-for-new-machines
  • Call head $(which os-autoinst-setup-multi-machine) to find out which settings you need to call for each machine, then call like instances=50 os-autoinst-setup-multi-machine
  • Copy over from an existing worker /etc/openqa/client.conf, /etc/openqa/workers.ini and /etc/wicked/scripts/gre_tunnel_preup.sh and adjust as necessary
  • Enable openQA workers with non-production worker classes first
  • Test out verification jobs, e.g. call openqa-clone-job --skip-chained-deps --export-command --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=-worker21-23-poo132134 _GROUP=0 BUILD= WORKER_CLASS=prg2 and then adjust the worker classes so that jobs end up on one of the existing and one of the new machines

  • then try out boot+install for aarch64

Actions #32

Updated by okurz 10 months ago

  • Related to action #133025: Configure Virtual Interfaces instructions do not work on Leap 15.5 size:M added
Actions #33

Updated by okurz 10 months ago

  • Copied to action #134123: Setup new PRG2 openQA worker for o3 - two new arm workers size:M added
Actions #34

Updated by okurz 10 months ago

  • Copied to action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M added
Actions #35

Updated by okurz 10 months ago

  • Description updated (diff)
  • Due date deleted (2023-08-19)
  • Status changed from In Progress to Workable
  • Priority changed from Urgent to High

Discussed in the weekly SUSE QE Tools meeting. We now have stable productive o3 workers running, hence reducing priority. The other additional machines can be set up now but with "High" priority. We carved out two separate tickets, #134123 for aarch64 and #134126 for w26, w27, w28, so that leaves only w25 to be set up completely as openQA worker.

Actions #36

Updated by okurz 10 months ago

  • Priority changed from High to Normal
Actions #37

Updated by dheidler 9 months ago

  • Status changed from Workable to In Progress
Actions #38

Updated by dheidler 9 months ago

The script that can be used to setup a worker after autoyast installation (what I have so far):

zypper ar -p 95 -f 'http://download.opensuse.org/repositories/devel:openQA/$releasever' devel_openQA
zypper ar -p 90 -f 'http://download.opensuse.org/repositories/devel:openQA:Leap:$releasever/$releasever' devel_openQA_Leap
zypper --gpg-auto-import-keys ref
echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
zypper -n in openQA-worker openQA-auto-update openQA-continuous-update os-autoinst-distri-opensuse-deps swtpm # openQA worker services plus dependencies for openSUSE distri or development repo if added previously
zypper -n in ffmpeg-4  # for using external video encoder as it is already configured on some machines like ow19, ow20 and power8
zypper -n in nfs-client  # For /var/lib/openqa/share
zypper -n in bash-completion vim htop strace systemd-coredump iputils tcpdump bind-utils  # for general tinkering

# for use of nfs
# echo "openqa1-opensuse:/ /var/lib/openqa/share nfs4 noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m 0 0" >> /etc/fstab

# for use of cacheservice
# if connecting via ssh-rsync to o3
mkdir -p /var/lib/openqa/ssh/
chown _openqa-worker: /var/lib/openqa/ssh/
sudo -u _openqa-worker ssh-keygen -t ed25519 -N '' -f /var/lib/openqa/ssh/id_ed25519
# append /var/lib/openqa/ssh/id_ed25519.pub to ariel:/var/lib/openqa/.ssh/authorized_keys
sudo -u _openqa-worker ssh -i /var/lib/openqa/ssh/id_ed25519 -p 2214 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts" geekotest@gate.opensuse.org whoami
# note that ariel can be reached directly via ariel.dmz-prg2.suse.org from suse network

sed -i 's/\(solver.dupAllowVendorChange = \)false/\1true/' /etc/zypp/zypp.conf

instances=30 bash -x $(which os-autoinst-setup-multi-machine)

#[global]
#HOST = https://openqa.opensuse.org
#CACHEDIRECTORY = /var/lib/openqa/cache
##WORKER_CLASS = foo
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
#[https://openqa.opensuse.org]
##TESTPOOLSERVER = -e 'ssh -p 2214 -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@gate.opensuse.org:/var/lib/openqa/tests
##TESTPOOLSERVER = -e 'ssh -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@ariel.dmz-prg2.suse.org:/var/lib/openqa/tests
#TESTPOOLSERVER = rsync://openqa.opensuse.org/tests

zypper -n in rsync sshpass
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/openqa/ /etc/openqa/
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/wicked/scripts/gre_tunnel_preup.sh /etc/wicked/scripts/
sed -i "s/openqaworker24/$HOSTNAME/g" /etc/openqa/workers.ini
# comment out the worker that we are on and enable all other entries
vi /etc/wicked/scripts/gre_tunnel_preup.sh
# use only hostname as worker class for testing
vi /etc/openqa/workers.ini

chown _openqa-worker /etc/openqa/client.conf
chmod 640 /etc/openqa/client.conf

# configure /etc/openqa/client.conf and /etc/openqa/workers.ini, then enable the desired number of worker slots, e.g.:
systemctl enable --now openqa-worker-auto-restart@{1..30}.service openqa-reload-worker-auto-restart@{1..30}.path openqa-auto-update.timer openqa-continuous-update.timer openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service

# on aarch64 do something about hugepages permissions (see https://progress.opensuse.org/issues/53234)
echo "hugetlbfs /dev/hugepages hugetlbfs defaults,gid=kvm,mode=1775,pagesize=1G" >> /etc/fstab

# add this to kernel cmdline via /etc/default/grub (& update-bootloader) on aarch64:
# showopts nospec spectre_v2=off pti=off kpti=off mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 crashkernel=167M enforcing=0
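
The script above appends to /etc/fstab with a plain `>>`, so running it a second time would add a duplicate entry. A minimal sketch of an idempotent append follows; `append_once` is a hypothetical helper that is not part of the original script, and the demonstration writes to a scratch file rather than the real /etc/fstab.

```shell
#!/bin/bash
# append_once FILE LINE: append LINE to FILE only if an identical line is
# not already present. Hypothetical helper, not part of the original script.
append_once() {
    local file=$1 line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file instead of the real /etc/fstab
demo_fstab=$(mktemp)
entry='hugetlbfs /dev/hugepages hugetlbfs defaults,gid=kvm,mode=1775,pagesize=1G'
append_once "$demo_fstab" "$entry"
append_once "$demo_fstab" "$entry"   # second call changes nothing
```

On a worker the call would presumably be `append_once /etc/fstab "$entry"`, run as root.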
Actions #39

Updated by dheidler 9 months ago

  • Status changed from In Progress to Workable
Actions #40

Updated by okurz 9 months ago

  • Priority changed from Normal to Urgent
Actions #41

Updated by okurz 9 months ago

  • Target version deleted (Ready)
Actions #42

Updated by okurz 9 months ago

  • Assignee deleted (dheidler)
  • Target version set to Ready

oops, removed target version by mistake, should have been assignee

Actions #43

Updated by nicksinger 9 months ago

  • Assignee set to nicksinger
Actions #44

Updated by nicksinger 9 months ago

  • Assignee deleted (nicksinger)

assigned the wrong ticket to me, oops

Actions #45

Updated by okurz 9 months ago

  • Priority changed from Urgent to High

We don't have capacity to work on that many infra tasks with urgent prio, reducing to "High"

Actions #46

Updated by livdywan 9 months ago

We took a look during the mob session. It's not very clear what was done and what's still to be done. It would be great if one of the previous assignees could add this.

Actions #47

Updated by okurz 9 months ago

Seems you overlooked some comments then. See #132134#note-35. Only part of the workers should be direct o3 openQA workers, some are reserved for bare-metal testing. No users are set up, only root ssh key authentication, same as we always had for o3 workers. Yes, something was done for multi-machine testing, see comments.

Actions #48

Updated by livdywan 9 months ago

  • Subject changed from Setup new PRG2 openQA worker for o3 size:M to Setup new PRG2 multi-machine openQA worker for o3 size:M

Since the AC specifically asks for multi-machine, it should be reflected in the title

Actions #49

Updated by okurz 9 months ago

livdywan wrote in #note-48:

Since the AC specifically asks for multi-machine, it should be reflected in the title

Actually there should be no need to mention multi-machine explicitly at all; it should be implied after we finished our saga about treating multi-machine tests as first-class citizens. Maybe we need to clarify which parts of that were effectively not true.

Actions #50

Updated by livdywan 9 months ago

okurz wrote in #note-47:

Seems you overlooked some comments then. See #132134#note-35. Only part of the workers should be direct o3 openQA workers, some are reserved for bare-metal testing. No users are set up, only root ssh key authentication, same as we always had for o3 workers. Yes, something was done for multi-machine testing, see comments.

Okay, so "that leaves only w25 to be set up completely as openQA worker" is what's left to do here then.

Actions #51

Updated by okurz 8 months ago

  • Related to action #130585: Upgrade o3 workers to openSUSE Leap 15.5 added
Actions #52

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #53

Updated by okurz 8 months ago

  • Related to action #134912: Gradually phase out NUE1 based openQA workers size:M added
Actions #54

Updated by livdywan 8 months ago

Actions #55

Updated by dheidler 8 months ago

  • Assignee set to dheidler
Actions #56

Updated by dheidler 8 months ago

Discussed with okurz that openqaworker21-26 should be used as worker hosts for now.
Fixed the kernel issue on the hosts: a kernel downgrade and lock had already been applied on some workers due to a kernel bug. But the lock only covered kernel-default-*, so an update pulled in kernel-base instead of kernel-default, which resulted in failed boots due to missing drivers.

I fixed those machines and added kernel lock for kernel*.
This will prevent updates via zypper patch for now.
Kernel locks can be removed after bsc#1214537 fix is in the update repo.
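
To illustrate why the narrower lock pattern was not enough, here is a minimal sketch that checks a package name against both patterns. It assumes that zypper lock wildcards behave like shell globs on package names; `lock_matches` is a hypothetical helper and nothing here touches the real lock list.

```shell
#!/bin/bash
# lock_matches PATTERN NAME: succeed if NAME matches the glob PATTERN.
# Hypothetical helper; assumes zypper lock wildcards act like shell globs.
lock_matches() {
    local pattern=$1 name=$2
    # An unquoted right-hand side of == inside [[ ]] is treated as a glob
    [[ $name == $pattern ]]
}

pkg=kernel-base   # the package that slipped past the old lock
for pattern in 'kernel-default-*' 'kernel*'; do
    if lock_matches "$pattern" "$pkg"; then
        echo "lock '$pattern' covers $pkg"
    else
        echo "lock '$pattern' does NOT cover $pkg"
    fi
done
# On a real host the broader lock would be added with: zypper addlock 'kernel*'
```

The current locks on a host can be listed with `zypper locks`.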

Working locked kernel versions for reference:

kernel-default-5.14.21-150400.24.74.1.x86_64
kernel-default-5.14.21-150500.55.12.1.x86_64

What was done:

openqaworker21: io issues? -> reinstall -> ready
openqaworker22: needs fix -> ready -> active!
openqaworker23: lock added -> ready
openqaworker24: fixed -> ready
openqaworker25: lock added -> ready
openqaworker26: lock added -> ready

Verification:

for i in {001..300} ; do openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3600589 TEST+=-dheidler-poo132134-$i _GROUP=0 BUILD=dheidler-poo132134-1 WORKER_CLASS=qemu_x86_64,tap_poo132134 ; done
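
Before scheduling that many jobs it can help to look at the generated commands first. The following is a minimal dry-run sketch that only prints the clone commands from the loop above; `print_clone_cmds` is a hypothetical helper and no job is actually cloned.

```shell
#!/bin/bash
# print_clone_cmds COUNT: print COUNT clone commands without running them.
# Hypothetical dry-run helper mirroring the loop above.
print_clone_cmds() {
    local count=$1 n i
    for ((n = 1; n <= count; n++)); do
        printf -v i '%03d' "$n"   # zero-pad to three digits: 001, 002, ...
        echo openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance \
            https://openqa.opensuse.org/tests/3600589 \
            "TEST+=-dheidler-poo132134-$i" _GROUP=0 BUILD=dheidler-poo132134-1 \
            WORKER_CLASS=qemu_x86_64,tap_poo132134
    done
}

print_clone_cmds 2   # show the first two commands only
```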

https://openqa.opensuse.org/tests/overview?version=Tumbleweed&build=dheidler-poo132134-1&distri=microos

See attachment for results. Multi-machine jobs worked fine on those workers and across workers.

Actions #57

Updated by okurz 8 months ago

dheidler wrote in #note-56:

Discussed with okurz that openqaworker21-26 should be used as worker hosts for now.
Fixed the kernel issue on the hosts: a kernel downgrade and lock had already been applied on some workers due to a kernel bug. But the lock only covered kernel-default-*, so an update pulled in kernel-base instead of kernel-default, which resulted in failed boots due to missing drivers.

I fixed those machines and added kernel lock for kernel*.
This will prevent updates via zypper patch for now.
Kernel locks can be removed after bsc#1214537 fix is in the update repo.

Working locked kernel versions for reference:

kernel-default-5.14.21-150400.24.74.1.x86_64
kernel-default-5.14.21-150500.55.12.1.x86_64

What was done:

openqaworker21: io issues? -> reinstall -> ready
openqaworker22: needs fix -> ready -> active!
openqaworker23: lock added -> ready
openqaworker24: fixed -> ready
openqaworker25: lock added -> ready
openqaworker26: lock added -> ready

Verification:

for i in {001..300} ; do openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3600589 TEST+=-dheidler-poo132134-$i _GROUP=0 BUILD=dheidler-poo132134-1 WORKER_CLASS=qemu_x86_64,tap_poo132134 ; done

https://openqa.opensuse.org/tests/overview?version=Tumbleweed&build=dheidler-poo132134-1&distri=microos

See attachment for results. Multi-machine jobs worked fine on those workers and across workers.

I would say from the infobox on https://openqa.opensuse.org/tests/overview?version=Tumbleweed&build=dheidler-poo132134-1&distri=microos it's already obvious: 602/602 passed, 100.00% pass rate. According to https://en.wikipedia.org/wiki/Rule_of_three_(statistics) that means we have a computed upper bound on the failure probability of about 0.5% (3/602 ≈ 0.498%, at 95% confidence), which I would say is pretty good :)
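
For reference, the rule of three bound is simply 3/n for n runs without a failure. A minimal sketch of the arithmetic, with `rule_of_three_pct` as a hypothetical helper name:

```shell
#!/bin/bash
# rule_of_three_pct N: approximate upper 95% bound, in percent, on the
# failure rate after N runs without a single failure (rule of three: 3/N).
rule_of_three_pct() {
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.3f\n", 3 / n * 100 }'
}

rule_of_three_pct 602   # prints 0.498
```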

So, what are the next steps you plan?

Actions #58

Updated by dheidler 8 months ago

  • Status changed from In Progress to Resolved

Enabled the workers for production use of multi machine jobs by removing _poo132134 suffix from worker class entries.
Ran 300 tests again: https://openqa.opensuse.org/tests/overview?version=Tumbleweed&distri=microos&build=dheidler-poo132134-2
Checked that multi machine jobs work fine across all workers (21-26).
Subscribed to https://bugzilla.suse.com/show_bug.cgi?id=1214537 for further updates on when we can remove the kernel package locks.
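
The suffix removal could be scripted roughly as follows. This is only a sketch assuming GNU sed and that the suffix appears on `WORKER_CLASS` lines; `strip_test_suffix` is a hypothetical helper and the demo file content is made up for illustration, not copied from a real workers.ini.

```shell
#!/bin/bash
# strip_test_suffix FILE: remove the temporary "_poo132134" suffix from
# WORKER_CLASS lines in FILE. Hypothetical helper; relies on GNU sed -i.
strip_test_suffix() {
    sed -i '/^WORKER_CLASS/ s/_poo132134//g' "$1"
}

# Demonstration on a scratch file instead of /etc/openqa/workers.ini
demo_ini=$(mktemp)
printf '%s\n' '[global]' 'WORKER_CLASS = qemu_x86_64,tap_poo132134' > "$demo_ini"
strip_test_suffix "$demo_ini"
cat "$demo_ini"
```

After changing the real file the worker services would still need to pick up the new configuration.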

Actions #59

Updated by dheidler 8 months ago

Commented out the GRE entries for ow27 and ow28 on all workers in /etc/wicked/scripts/gre_tunnel_preup.sh as they are planned to be used as generalhw SUTs.

Actions #60

Updated by okurz 8 months ago

Very nice and good work
