action #64700

setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S

Added by okurz over 2 years ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-03-20
Due date: 2022-08-23
% Done: 50%
Estimated time:

Description

Observation

#52499

Acceptance criteria

  • AC1: MM tests are known to work on openqaworker4

Suggestions


Related issues

Related to openQA Tests - action #64970: [desktop][opensuse][multi-machine] test fails in xrdp_client to connect to server (Resolved, 2020-03-29)

Related to openQA Infrastructure - action #114923: We lost multi-machine capabilities within o3 due to openqaworker1 being replaced (Resolved, 2022-08-03)

Copied to openQA Infrastructure - action #114992: Broken MM machines don't appear as available workers (New, 2020-03-20)

History

#1 Updated by okurz over 2 years ago

for testing on w7 which was previously already configured as MM worker for osd I added ",tap" to the worker class for :3 and :4 with

vim /etc/openqa/workers.ini
firewall-cmd --zone=trusted --add-masquerade
systemctl restart openqa-worker@{3..4}

successful tests:

so I restarted the other workers as well:

systemctl restart openqa-worker@{1..2} openqa-worker@{5..14}

Based on http://open.qa/docs/#_multi_machine_tests_setup and the history on aarch64, I assume this is what we would need to do for w4, which seems to have never been configured for MM:

zypper -n --no-refresh in firewalld openvswitch os-autoinst-openvswitch libcap-progs
systemctl enable --now firewalld openvswitch os-autoinst-openvswitch
echo 'OS_AUTOINST_USE_BRIDGE=br1' > /etc/sysconfig/os-autoinst-openvswitch
ovs-vsctl add-br br1
cat > /etc/sysconfig/network/ifcfg-tap0 <<EOF
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='auto'
TUNNEL='tap'
TUNNEL_SET_GROUP='nogroup'
TUNNEL_SET_OWNER='_openqa-worker'
EOF
for i in {1..14} {64..77} {128..141}; do echo OVS_BRIDGE_PORT_DEVICE_$i=\'tap$i\' ; done >> /etc/sysconfig/network/ifcfg-br1
for i in {1..14} {64..77} {128..141}; do ln -s /etc/sysconfig/network/ifcfg-tap{0,$i} ; done
firewall-cmd --set-default-zone=trusted
firewall-cmd --zone=trusted --add-masquerade
for i in br1 eth0 ovs-system; do firewall-cmd --zone=trusted --add-interface=$i; done
firewall-cmd --runtime-to-permanent
setcap CAP_NET_ADMIN=ep /usr/bin/qemu-system-x86_64

#2 Updated by okurz over 2 years ago

  • Description updated (diff)

#3 Updated by okurz over 2 years ago

  • Related to action #64970: [desktop][opensuse][multi-machine] test fails in xrdp_client to connect to server added

#4 Updated by okurz over 2 years ago

apparently openqaworker7 is producing some problematic job results. E.g.

[28/03/2020 17:11:19] <DimStar> okurz: https://openqa.opensuse.org/tests/1216485#next_previous is more painful :)
[28/03/2020 17:11:36] <DimStar> success/failure ratio is far off
[28/03/2020 17:11:59] <DimStar> I think 10 days ago is when we removed OW1, right?

It seems that desktopapps-remote-desktop-xrdp-client1 consistently fails on openqaworker7, so test reviewers retrigger failed tests until they happen to run on openqaworker1, which seems to be stable. DimStar also mentioned other problems on openqaworker7, like https://openqa.opensuse.org/tests/1217710#step/kubeadm/1 . It could be something special about the firewall: "https://openqa.opensuse.org/tests/1217727#step/yast2_nfs4_server/37 - firewall might be something..or dns config", also on w7. I have removed "tap" from the worker class on openqaworker7 and restarted the worker instances. Let's see if this helps. https://openqa.opensuse.org/tests/1217710# is an interesting example because it is not a multi-machine test. Maybe we can look into this one first, as it should be easier to crosscheck.

Also, what I saw as differences in configuration: On w1 only "br1" is in "trusted" zone, on w7 it's "br1 eth0 tap…", same on aarch64. Also the config differs in "STARTMODE" and the explicit "ZONE" in /etc/sysconfig/network/ifcfg-tap*

So now on w7 I did:

cat > /etc/sysconfig/network/ifcfg-tap0 <<EOF
> BOOTPROTO='none'
> IPADDR=''
> NETMASK=''
> PREFIXLEN=''
> STARTMODE='auto'
> TUNNEL='tap'
> TUNNEL_SET_GROUP='nogroup'
> TUNNEL_SET_OWNER='_openqa-worker'
> ZONE='public'
> EOF
for i in {1..20} {64..83} {128..147}; do ln -sf /etc/sysconfig/network/ifcfg-tap{0,$i} ; done
for i in {0..20} {64..83} {128..147}; do firewall-cmd --zone=trusted --remove-interface=eth0; done
firewall-cmd --runtime-to-permanent

and looking into the "kubeadm" failure:

$ build=okurz_investigation_poo64700; for i in 1 7 ; do build=$build openqa-clone-set https://openqa.opensuse.org/tests/1217710 ${build}_kubeadm_w$i WORKER_CLASS=openqaworker$i; done

https://openqa.opensuse.org/tests/overview?build=okurz_investigation_poo64700

shows that 10/10 jobs on openqaworker1 and 10/10 jobs on openqaworker7 fail the same so I reject the hypothesis that it's something specific to the MM setup on openqaworker7.

After the above changes I triggered some jobs again:

$ openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/1218529 WORKER_CLASS=openqaworker7 BUILD=X _GROUP=0 TEST=okurz_poo64700_yast2_nfs_v4_server

Created job #1219043: opensuse-Tumbleweed-DVD-x86_64-Build20200329-yast2_nfs_v4_server@64bit -> https://openqa.opensuse.org/t1219043

as a single test out of a mm-pair which works fine on its own.

$ openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/1217787 WORKER_CLASS=openqaworker7 BUILD=X _GROUP=0 TEST=okurz_poo64700_yast2_nfs_v4_client

Created job #1219049: opensuse-Tumbleweed-DVD-x86_64-Build20200327-yast2_nfs_v4_server@64bit -> https://openqa.opensuse.org/t1219049
Created job #1219050: opensuse-Tumbleweed-DVD-x86_64-Build20200327-yast2_nfs_v4_client@64bit -> https://openqa.opensuse.org/t1219050

which fail in https://openqa.opensuse.org/tests/1219050#step/yast2_nfs4_client/28

But we check again the basics with wicked_basic:

$ openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/1218584 WORKER_CLASS=openqaworker7 BUILD=X _GROUP=0 TEST=okurz_poo64700_wicked_basic_sut

Created job #1219103: opensuse-Tumbleweed-DVD-x86_64-Build20200329-wicked_basic_ref@64bit -> https://openqa.opensuse.org/t1219103
Created job #1219104: opensuse-Tumbleweed-DVD-x86_64-Build20200329-wicked_basic_sut@64bit -> https://openqa.opensuse.org/t1219104

failed. https://openqa.opensuse.org/tests/1219104/file/serial_terminal.txt shows

# ping -c 1 10.0.2.2|| journalctl -b --no-pager > /dev/ttyS0; echo MWhDi-$?-
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data.
From 10.0.2.11 icmp_seq=1 Destination Host Unreachable

TODO read older tickets to remind myself, e.g. #30892 , #52499 , #55043 , #31978

#5 Updated by okurz over 2 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

Unfortunately I did not progress beyond #64700#note-4. I didn't find time to refresh my memory of the old setup.

#6 Updated by okurz over 2 years ago

  • Priority changed from Normal to Low

#7 Updated by okurz about 2 years ago

  • Target version set to future

#8 Updated by favogt 4 months ago

  • Priority changed from Low to Normal
  • Target version changed from future to Ready

It's not entirely clear what the issue is/was and currently we're in need of a MM worker (https://progress.opensuse.org/issues/114923), so I went ahead and enabled tap on ow7 again.

#9 Updated by cdywan 4 months ago

  • Related to action #114923: We lost multi-machine capabilities within o3 due to openqaworker1 being replaced added

#10 Updated by favogt 4 months ago

For some reason, ow1 had tap0-tap19 configured, but only tap0-tap9 assigned to br1. Worker instances > 10 failed because of that, so I disabled the tap class for those.
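
A mismatch like this could be spotted before jobs fail. The following is only a sketch, assuming the tap devices are configured through ifcfg-tap* files and assigned to the bridge through OVS_BRIDGE_PORT_DEVICE_* entries in ifcfg-br1, as in the setup commands from #note-1. The function name unassigned_taps is made up for illustration.

```shell
#!/bin/sh
# List tap devices that have an ifcfg-tapN file but no matching
# OVS_BRIDGE_PORT_DEVICE_* entry in the bridge config.
# $1: directory with the ifcfg files (default: /etc/sysconfig/network)
# $2: bridge name (default: br1)
unassigned_taps() {
    dir=${1:-/etc/sysconfig/network}
    bridge=${2:-br1}
    for f in "$dir"/ifcfg-tap*; do
        [ -e "$f" ] || continue
        tap=${f##*/ifcfg-}
        # Matches e.g. OVS_BRIDGE_PORT_DEVICE_3='tap3' (quotes optional)
        if ! grep -Eq "^OVS_BRIDGE_PORT_DEVICE_[0-9]+=['\"]?${tap}['\"]?\$" \
            "$dir/ifcfg-$bridge" 2>/dev/null; then
            echo "$tap"
        fi
    done
}
```

Calling unassigned_taps without arguments checks /etc/sysconfig/network and br1. Any name it prints is a tap device that worker instances may try to use without it being attached to the bridge.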

#11 Updated by favogt 4 months ago

  • % Done changed from 0 to 50

The problem on ow7 actually came back and hit the firewalld tests. I did some debugging there and the cause is that ARP requests from the worker were not forwarded from br1 to tap*.
This was easy to reproduce as e.g. ping 10.0.2.2 from the SUT to the worker stopped responding after an ip neigh flush all on the worker. (I wonder why though, because the ICMP echo request should already fill the ARP table FWICT...).

okurz wrote:

Also, what I saw as differences in configuration: On w1 only "br1" is in "trusted" zone, on w7 it's "br1 eth0 tap…", same on aarch64. Also the config differs in "STARTMODE" and the explicit "ZONE" in /etc/sysconfig/network/ifcfg-tap*

I changed that in the config files and also made the change during runtime with

for i in tap0 tap1 tap10 tap11 tap12 tap128 tap129 tap13 tap130 tap131 tap132 tap133 tap134 tap135 tap136 tap137 tap138 tap139 tap14 tap140 tap141 tap142 tap143 tap144 tap145 tap146 tap147 tap15 tap16 tap17 tap18 tap19 tap2 tap20 tap3 tap4 tap5 tap6 tap64 tap65 tap66 tap67 tap68 tap69 tap7 tap70 tap71 tap72 tap73 tap74 tap75 tap76 tap77 tap78 tap79 tap8 tap80 tap81 tap82 tap83 tap9; do firewall-cmd --zone=trusted --change-interface=$i; done

and did a test job: Passed! https://openqa.opensuse.org/tests/2494156
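
The interface list in that command covers tap0-tap20, tap64-tap83 and tap128-tap147. As a sketch only, the same list can be generated instead of typed out. The helper names tap_interfaces and move_taps_to_trusted and the DRY_RUN variable are made up for illustration; only the firewall-cmd call itself is taken from the comment above.

```shell
#!/bin/sh
# Print the tap interface names for the three ranges used on ow7:
# tap0-tap20, tap64-tap83 and tap128-tap147.
tap_interfaces() {
    for range in "0 20" "64 83" "128 147"; do
        set -- $range            # $1 = first index, $2 = last index
        i=$1
        while [ "$i" -le "$2" ]; do
            echo "tap$i"
            i=$((i + 1))
        done
    done
}

# Print (default) or run the zone change for every tap interface.
# Set DRY_RUN=0 to really call firewall-cmd (needs root and firewalld).
move_taps_to_trusted() {
    for tap in $(tap_interfaces); do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "firewall-cmd --zone=trusted --change-interface=$tap"
        else
            firewall-cmd --zone=trusted --change-interface="$tap"
        fi
    done
}
```

Running move_taps_to_trusted as is only prints the commands, which makes it easy to review the list before applying it with DRY_RUN=0. A change made this way is a runtime change, so it still has to be made permanent separately, as was done in the config files here.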

So setting to 50% complete as ow7 should be fully MM capable now. Question is whether we want to extend the tap ability to other workers on ow7 and whether we want to enable tap on ow4 as well.

#12 Updated by okurz 4 months ago

Well, ideally all worker instances on all hosts should be multi-machine capable. They just aren't, because we don't understand enough to know exactly what is needed to make multi-machine tests work without trying out openQA tests and having them fail until we find a working config.

#13 Updated by cdywan 4 months ago

  • Subject changed from setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests to setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S
  • Description updated (diff)

#14 Updated by favogt 4 months ago

  • Subject changed from setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S to setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests
  • Description updated (diff)

There were some issues after ow1 was brought back online, as openQA started to schedule MM tests across both hosts. I disabled MM on ow7 temporarily yesterday to get the MM tests working again.

Today I debugged that a bit and found that the GRE tunnel on ow7 was still configured for the osd network and fixed that. On ow1, GRE was not set up at all.
I fixed that, but then encountered that some VM traffic didn't make it, which was caused by missing MTU setup (https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/15321/files). With that applied, MM tests across ow1 and ow7 work.

After that, I expanded the tap configuration to all (max 20) worker instances on both ow1 and ow7 and did a successful test run with those: https://openqa.opensuse.org/tests/2496091

Let's wait a bit to see how ow1 and ow7 play together, then we could also implement tap on ow4.

#15 Updated by favogt 4 months ago

  • Subject changed from setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests to setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S
  • Description updated (diff)

#16 Updated by cdywan 4 months ago

  • Copied to action #114992: Broken MM machines don't appear as available workers added

#17 Updated by mkittler 4 months ago

  • Assignee set to mkittler

At least judging by the history of https://openqa.opensuse.org/tests/2496091 and its parallel job it looks good. I suppose I could now go ahead and do the MM setup on ow4 as well.

#18 Updated by mkittler 4 months ago

  • Status changed from Workable to In Progress

I have configured MM tests on openqaworker4 and it survived the reboot. Not sure whether it actually worked. The test seemed to run into some problems (https://openqa.opensuse.org/tests/2503652#step/yast2_nfs_server/111).

EDIT: It also doesn't work after retrying. So not sure what I did wrong. I was following the documentation on https://open.qa/docs/#_gre_tunnels and regarding the firewall I was following what we have in salt (and compared everything to the other o3/OSD workers).

#19 Updated by favogt 4 months ago

mkittler wrote:

I have configured MM tests on openqaworker4 and it survived the reboot. Not sure whether it actually worked. The test seemed to run into some problems (https://openqa.opensuse.org/tests/2503652#step/yast2_nfs_server/111).

EDIT: It also doesn't work after retrying. So not sure what I did wrong. I was following the documentation on https://open.qa/docs/#_gre_tunnels and regarding the firewall I was following what we have in salt (and compared everything to the other o3/OSD workers).

I had a quick look and did a clone with WORKER_CLASS=openqaworker4,tap to have them on ow4 only. That failed the same way. The kind of error and the (massive!) repeating output of ovs-dpctl dump-flows indicated a switching loop:

recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:60,dst=33:33:00:00:00:02),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:6322, bytes:417252, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.6,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=ff:ff:ff:ff:ff:ff),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x0800),ipv4(tos=0/0x3,frag=no)), packets:17419, bytes:5887622, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.12,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=33:33:00:00:00:01),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:1467263, bytes:132053670, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),in_port(3),eth(src=52:54:00:12:00:60,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no), packets:0, bytes:0, used:never, actions:push_vlan(tpid=0x88a8,vid=9,pcp=0),1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62,set(tunnel(dst=192.168.112.12,ttl=64,flags(df))),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:60,dst=33:33:00:00:00:01),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:167482, bytes:15073380, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=33:33:00:00:00:16),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:17424, bytes:1637856, used:0.009s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=ff:ff:ff:ff:ff:ff),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x0800),ipv4(tos=0/0x3,frag=no)), packets:22386, bytes:7566468, used:0.001s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),in_port(61),eth(src=52:54:00:12:00:69,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no), packets:0, bytes:0, used:never, actions:push_vlan(tpid=0x88a8,vid=9,pcp=0),1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),62,set(tunnel(dst=192.168.112.12,ttl=64,flags(df))),62
...

So I checked ovs-vsctl list bridge whether STP was enabled everywhere and indeed, on ow4 it was disabled. So I ran ovs-vsctl set bridge br1 stp_enable=true to enable it and now the test passes: https://openqa.opensuse.org/tests/2503732
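
The STP check can be scripted so that it is easy to repeat on every worker host. This is a rough sketch, assuming the column layout of ovs-vsctl list bridge ("name : br1", "stp_enable : false", records separated by blank lines). The function name stp_disabled_bridges is made up for illustration.

```shell
#!/bin/sh
# Read `ovs-vsctl list bridge` output on stdin and print the name of
# every bridge whose stp_enable column is false.
stp_disabled_bridges() {
    awk -F' *: *' '
        $1 == "name"       { name = $2; gsub(/"/, "", name) }
        $1 == "stp_enable" { stp = $2 }
        # A blank line ends one bridge record.
        /^[[:space:]]*$/   { if (name != "" && stp == "false") print name
                             name = ""; stp = "" }
        END                { if (name != "" && stp == "false") print name }
    '
}
```

On a worker this would be used as ovs-vsctl list bridge | stp_disabled_bridges. No output means STP is enabled on all bridges. Each name that is printed can then be fixed with ovs-vsctl set bridge <name> stp_enable=true as described above.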

#20 Updated by openqa_review 4 months ago

  • Due date set to 2022-08-23

Setting due date based on mean cycle time of SUSE QE Tools

#21 Updated by mkittler 4 months ago

favogt Thanks for having a look.

I thought putting

ovs-vsctl set bridge $bridge stp_enable=true

in

/etc/wicked/scripts/gre_tunnel_preup.sh

and then rebooting would be enough.

I'll re-run my tests again (where different hosts are used) to check whether that now works.

#22 Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

It works across different hosts as well (see https://openqa.opensuse.org/tests/2505394). So I suppose the ticket could be resolved. However, I want to reboot openqaworker4 one more time to see whether the setting is persistent. (I'll wait with that until the worker is not completely busy anymore.)

@fvogt Thanks for your help again and also explaining what command you've used!

#23 Updated by mkittler 4 months ago

I've rebooted the machine. Unfortunately ovs-vsctl list bridge now shows stp_enable : false again. Using ovs-vsctl set bridge br1 stp_enable=true fixes it again but I'm not sure why it isn't persistent (as it is actually configured like I mentioned in #64700#note-21).

#24 Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

The problem was that /etc/wicked/scripts/gre_tunnel_preup.sh was lacking the executable permission. It now works, verified via reboot.
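
A small check like the following could catch this kind of problem before a reboot. It is only a sketch: the function name check_hook_executable and its "fix" argument are made up for illustration, and the path in the usage note is the one from this ticket.

```shell
#!/bin/sh
# Report whether a hook script is executable and optionally fix it.
# $1: path to the script
# $2: pass "fix" to add the executable bit if it is missing
check_hook_executable() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 2
    fi
    if [ -x "$script" ]; then
        echo "ok: $script"
        return 0
    fi
    if [ "${2:-}" = fix ]; then
        chmod +x "$script" && echo "fixed: $script"
        return
    fi
    echo "not executable: $script"
    return 1
}
```

For example, check_hook_executable /etc/wicked/scripts/gre_tunnel_preup.sh reports the state, and adding fix as second argument sets the executable bit. A reboot is still the only real proof that the STP setting persists.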

#25 Updated by okurz 4 months ago

  • Status changed from Resolved to Feedback

https://github.com/os-autoinst/openQA/pull/4773

and mkittler had another idea what to put into the documentation. Please also include how to test that, e.g. ovs-vsctl debug commands as explained in some comments here as well as "jobs post" with the export command of an existing cluster scenario and how to adapt for testing.

#26 Updated by mkittler 4 months ago

PR for the remaining documentation update: https://github.com/os-autoinst/openQA/pull/4774

#27 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

The PR has been merged.
