action #64700
Status: closed
setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S
50% done
Description
Observation
Acceptance criteria
- AC1: MM tests are known to work on openqaworker4
Suggestions
- See #64700#note-11
Updated by okurz over 4 years ago
For testing on w7, which was previously already configured as an MM worker for osd, I added ",tap" to the worker class of instances :3 and :4 with
vim /etc/openqa/workers.ini                   # append ",tap" to WORKER_CLASS in the [3] and [4] sections
firewall-cmd --zone=trusted --add-masquerade  # NAT so the SUTs can reach the outside world
systemctl restart openqa-worker@{3..4}
successful tests:
so I restarted the other workers as well:
systemctl restart openqa-worker@{1..2} openqa-worker@{5..14}
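For illustration, a minimal sketch of what the edited sections in /etc/openqa/workers.ini might look like; the base worker class "qemu_x86_64" is an assumption, only the ",tap" suffix comes from this comment:

[3]
WORKER_CLASS = qemu_x86_64,tap

[4]
WORKER_CLASS = qemu_x86_64,tap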
Based on http://open.qa/docs/#_multi_machine_tests_setup and the history on aarch64, I assume this is what we would need to do for w4, which seems to have never been configured for MM:
# install and enable the required services
zypper -n --no-refresh in firewalld openvswitch os-autoinst-openvswitch libcap-progs
systemctl enable --now firewalld openvswitch os-autoinst-openvswitch
# point os-autoinst-openvswitch at the bridge and create it
echo 'OS_AUTOINST_USE_BRIDGE=br1' > /etc/sysconfig/os-autoinst-openvswitch
ovs-vsctl add-br br1
cat > /etc/sysconfig/network/ifcfg-tap0 <<EOF
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='auto'
TUNNEL='tap'
TUNNEL_SET_GROUP='nogroup'
TUNNEL_SET_OWNER='_openqa-worker'
EOF
for i in {1..14} {64..77} {128..141}; do echo OVS_BRIDGE_PORT_DEVICE_$i=\'tap$i\' ; done >> /etc/sysconfig/network/ifcfg-br1
for i in {1..14} {64..77} {128..141}; do ln -s /etc/sysconfig/network/ifcfg-tap{0,$i} ; done
firewall-cmd --set-default-zone=trusted
firewall-cmd --zone=trusted --add-masquerade
for i in br1 eth0 ovs-system; do firewall-cmd --zone=trusted --add-interface=$i; done
firewall-cmd --runtime-to-permanent
setcap CAP_NET_ADMIN=ep /usr/bin/qemu-system-x86_64   # allow qemu to configure tap devices without running as root
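A few read-only checks to verify such a setup took effect (standard firewalld/Open vSwitch tooling; my suggestion, not part of the original list):

ovs-vsctl show                                  # br1 should exist with the tap ports attached
firewall-cmd --get-default-zone                 # expected: trusted
firewall-cmd --zone=trusted --list-interfaces   # expected: br1 eth0 ovs-system
getcap /usr/bin/qemu-system-x86_64              # should report cap_net_admin=ep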
Updated by okurz over 4 years ago
- Related to action #64970: [desktop][opensuse][multi-machine] test fails in xrdp_client to connect to server added
Updated by okurz over 4 years ago
Apparently openqaworker7 is producing some problematic job results, e.g.:
[28/03/2020 17:11:19] <DimStar> okurz: https://openqa.opensuse.org/tests/1216485#next_previous is more painful :)
[28/03/2020 17:11:36] <DimStar> success/failure ratio is far off
[28/03/2020 17:11:59] <DimStar> I think 10 days ago is when we removed OW1, right?
Seems like desktopapps-remote-desktop-xrdp-client1 consistently does not work on openqaworker7, so test reviewers retrigger failed tests until they happen to run on openqaworker1, which seems to be stable. DimStar also mentioned other problems, like https://openqa.opensuse.org/tests/1217710#step/kubeadm/1 , also on openqaworker7. Could be something special about the firewall maybe: "https://openqa.opensuse.org/tests/1217727#step/yast2_nfs4_server/37 - firewall might be something..or dns config", also on w7. I have disabled "tap" in the worker class on openqaworker7 and restarted the worker instances. Let's see if this helps. https://openqa.opensuse.org/tests/1217710 is an interesting example because it is not a multi-machine test. Maybe we can look into this one first, it should be easier to crosscheck.
Also, what I saw as differences in configuration: On w1 only "br1" is in "trusted" zone, on w7 it's "br1 eth0 tap…", same on aarch64. Also the config differs in "STARTMODE" and the explicit "ZONE" in /etc/sysconfig/network/ifcfg-tap*
So now on w7 I did:
cat > /etc/sysconfig/network/ifcfg-tap0 <<EOF
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='auto'
TUNNEL='tap'
TUNNEL_SET_GROUP='nogroup'
TUNNEL_SET_OWNER='_openqa-worker'
ZONE='public'
EOF
for i in {1..20} {64..83} {128..147}; do ln -sf /etc/sysconfig/network/ifcfg-tap{0,$i} ; done
firewall-cmd --zone=trusted --remove-interface=eth0
for i in {0..20} {64..83} {128..147}; do firewall-cmd --zone=trusted --remove-interface=tap$i; done
firewall-cmd --runtime-to-permanent
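A quick way to confirm the rezoning (standard firewall-cmd queries; my addition):

firewall-cmd --zone=trusted --list-interfaces   # should now list only br1, matching w1
firewall-cmd --zone=public --list-interfaces    # the tap devices should end up here via ZONE='public'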
Then, looking into the "kubeadm" failure:
$ build=okurz_investigation_poo64700; for i in 1 7 ; do build=$build openqa-clone-set https://openqa.opensuse.org/tests/1217710 ${build}_kubeadm_w$i WORKER_CLASS=openqaworker$i; done
https://openqa.opensuse.org/tests/overview?build=okurz_investigation_poo64700
shows that 10/10 jobs on openqaworker1 and 10/10 jobs on openqaworker7 fail the same way, so I reject the hypothesis that it's something specific to the MM setup on openqaworker7.
After the above changes I triggered some jobs again:
$ openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/1218529 WORKER_CLASS=openqaworker7 BUILD=X _GROUP=0 TEST=okurz_poo64700_yast2_nfs_v4_server
Created job #1219043: opensuse-Tumbleweed-DVD-x86_64-Build20200329-yast2_nfs_v4_server@64bit -> https://openqa.opensuse.org/t1219043
as a single test out of a mm-pair which works fine on its own.
$ openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/1217787 WORKER_CLASS=openqaworker7 BUILD=X _GROUP=0 TEST=okurz_poo64700_yast2_nfs_v4_client
Created job #1219049: opensuse-Tumbleweed-DVD-x86_64-Build20200327-yast2_nfs_v4_server@64bit -> https://openqa.opensuse.org/t1219049
Created job #1219050: opensuse-Tumbleweed-DVD-x86_64-Build20200327-yast2_nfs_v4_client@64bit -> https://openqa.opensuse.org/t1219050
which fail in https://openqa.opensuse.org/tests/1219050#step/yast2_nfs4_client/28
But let's check the basics again with wicked_basic:
$ openqa-clone-job --parental-inheritance --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/1218584 WORKER_CLASS=openqaworker7 BUILD=X _GROUP=0 TEST=okurz_poo64700_wicked_basic_sut
Created job #1219103: opensuse-Tumbleweed-DVD-x86_64-Build20200329-wicked_basic_ref@64bit -> https://openqa.opensuse.org/t1219103
Created job #1219104: opensuse-Tumbleweed-DVD-x86_64-Build20200329-wicked_basic_sut@64bit -> https://openqa.opensuse.org/t1219104
This failed. https://openqa.opensuse.org/tests/1219104/file/serial_terminal.txt shows:
# ping -c 1 10.0.2.2|| journalctl -b --no-pager > /dev/ttyS0; echo MWhDi-$?-
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data.
From 10.0.2.11 icmp_seq=1 Destination Host Unreachable
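From the worker side, one could watch whether the SUT's ARP requests make it to the bridge at all (generic tcpdump/iproute2 usage; my suggestion, not what was run):

tcpdump -ni br1 arp or icmp   # do the who-has requests for 10.0.2.2 show up here?
ip neigh show                 # any FAILED/incomplete entries for 10.0.2.x?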
TODO read older tickets to remind myself, e.g. #30892 , #52499 , #55043 , #31978
Updated by okurz over 4 years ago
- Status changed from In Progress to Workable
- Assignee deleted (okurz)
Unfortunately I did not progress beyond #64700#note-4. I didn't find time to refresh my memory of the old setup.
Updated by favogt over 2 years ago
- Priority changed from Low to Normal
- Target version changed from future to Ready
It's not entirely clear what the issue is/was, and currently we're in need of an MM worker (https://progress.opensuse.org/issues/114923), so I went ahead and enabled tap on ow7 again.
Updated by livdywan over 2 years ago
- Related to action #114923: We lost multi-machine capabilities within o3 due to openqaworker1 being replaced added
Updated by favogt over 2 years ago
For some reason, ow1 had tap0-tap19 configured but only tap0-tap9 assigned to br1. Worker instances above 10 failed due to that, so I disabled the tap class for those.
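One way to spot such a mismatch between configured tap devices and actual bridge ports (standard tooling; my suggestion):

ls /etc/sysconfig/network/ifcfg-tap*    # tap devices configured on the wicked side
ovs-vsctl list-ports br1 | grep tap     # tap devices actually attached to br1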
Updated by favogt over 2 years ago
- % Done changed from 0 to 50
The problem on ow7 actually came back and affected the firewalld tests. I did some debugging there; the cause is that ARP requests from the worker were not forwarded from br1 to tap*.
This was easy to reproduce: e.g. ping 10.0.2.2 from the SUT to the worker stopped responding after an ip neigh flush all on the worker. (I wonder why though, because the ICMP echo request should already fill the ARP table, as far as I can tell...)
okurz wrote:
Also, what I saw as differences in configuration: On w1 only "br1" is in "trusted" zone, on w7 it's "br1 eth0 tap…", same on aarch64. Also the config differs in "STARTMODE" and the explicit "ZONE" in /etc/sysconfig/network/ifcfg-tap*
I changed that in the config files and also made the change at runtime with:
for i in tap0 tap1 tap10 tap11 tap12 tap128 tap129 tap13 tap130 tap131 tap132 tap133 tap134 tap135 tap136 tap137 tap138 tap139 tap14 tap140 tap141 tap142 tap143 tap144 tap145 tap146 tap147 tap15 tap16 tap17 tap18 tap19 tap2 tap20 tap3 tap4 tap5 tap6 tap64 tap65 tap66 tap67 tap68 tap69 tap7 tap70 tap71 tap72 tap73 tap74 tap75 tap76 tap77 tap78 tap79 tap8 tap80 tap81 tap82 tap83 tap9; do firewall-cmd --zone=trusted --change-interface=$i; done
Then I did a test job: Passed! https://openqa.opensuse.org/tests/2494156
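Equivalently, the long interface list above can be written with brace expansion (my shorthand for the same set, not what was literally run):

for i in tap{0..20} tap{64..83} tap{128..147}; do firewall-cmd --zone=trusted --change-interface=$i; done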
So setting to 50% complete, as ow7 should be fully MM capable now. The question is whether we want to extend the tap ability to the other worker instances on ow7 and whether we want to enable tap on ow4 as well.
Updated by okurz over 2 years ago
Well, ideally all worker instances on all hosts should be multi-machine capable. They just aren't, because we don't understand enough to know exactly what is needed to make multi-machine tests work without trying out openQA tests and having them fail until we find a working config.
Updated by livdywan over 2 years ago
- Subject changed from setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests to setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S
- Description updated (diff)
Updated by favogt over 2 years ago
- Subject changed from setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S to setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests
- Description updated (diff)
There were some issues after ow1 was brought back online, as openQA started to schedule MM tests across both hosts. I disabled MM on ow7 temporarily yesterday to get the MM tests working again.
Today I debugged that a bit and found that the GRE tunnel on ow7 was still configured for the osd network, and fixed that. On ow1, GRE was not set up at all.
I fixed that, but then encountered that some VM traffic didn't make it through, which was caused by missing MTU setup (https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/15321/files). With that applied, MM tests across ow1 and ow7 work.
After that, I expanded the tap configuration to all (max 20) worker instances on both ow1 and ow7 and did a successful test run with those: https://openqa.opensuse.org/tests/2496091
Let's wait a bit to see how ow1 and ow7 play together, then we could also implement tap on ow4.
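Regarding the MTU issue mentioned above, a generic way to probe for it across the tunnel (my suggestion, not what was run here; the peer address is a placeholder):

# 1472 = 1500 minus 20 bytes IPv4 header minus 8 bytes ICMP header; GRE adds
# further overhead, so with a too-small tunnel MTU this ping fails
ping -M do -s 1472 10.0.2.15
# lower -s step by step until it passes to find the effective MTU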
Updated by favogt over 2 years ago
- Subject changed from setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests to setup o3 workers openqaworker4 and openqaworker7 for multi-machine tests size:S
- Description updated (diff)
Updated by livdywan over 2 years ago
- Copied to action #114992: Broken MM machines don't appear as available workers added
Updated by mkittler over 2 years ago
- Assignee set to mkittler
At least judging by the history of https://openqa.opensuse.org/tests/2496091 and its parallel job it looks good. I suppose I could now go ahead and do the MM setup on ow4 as well.
Updated by mkittler over 2 years ago
- Status changed from Workable to In Progress
I have configured MM tests on openqaworker4 and it survived the reboot. Not sure whether it actually worked. The test seemed to run into some problems (https://openqa.opensuse.org/tests/2503652#step/yast2_nfs_server/111).
EDIT: It also doesn't work after retrying. So not sure what I did wrong. I was following the documentation on https://open.qa/docs/#_gre_tunnels and regarding the firewall I was following what we have in salt (and compared everything to the other o3/OSD workers).
Updated by favogt over 2 years ago
mkittler wrote:
I have configured MM tests on openqaworker4 and it survived the reboot. Not sure whether it actually worked. The test seemed to run into some problems (https://openqa.opensuse.org/tests/2503652#step/yast2_nfs_server/111).
EDIT: It also doesn't work after retrying. So not sure what I did wrong. I was following the documentation on https://open.qa/docs/#_gre_tunnels and regarding the firewall I was following what we have in salt (and compared everything to the other o3/OSD workers).
I had a quick look and did a clone with WORKER_CLASS=openqaworker4,tap to have them on ow4 only. That failed the same way. The kind of error and the (massive!) repeating output of ovs-dpctl dump-flows indicated a switching loop:
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:60,dst=33:33:00:00:00:02),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:6322, bytes:417252, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.6,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=ff:ff:ff:ff:ff:ff),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x0800),ipv4(tos=0/0x3,frag=no)), packets:17419, bytes:5887622, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.12,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=33:33:00:00:00:01),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:1467263, bytes:132053670, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),in_port(3),eth(src=52:54:00:12:00:60,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no), packets:0, bytes:0, used:never, actions:push_vlan(tpid=0x88a8,vid=9,pcp=0),1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62,set(tunnel(dst=192.168.112.12,ttl=64,flags(df))),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:60,dst=33:33:00:00:00:01),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:167482, bytes:15073380, used:0.000s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=33:33:00:00:00:16),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no)), packets:17424, bytes:1637856, used:0.009s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),tunnel(src=192.168.112.12,dst=192.168.112.7,flags(-df-csum)),in_port(62),eth(src=52:54:00:12:00:69,dst=ff:ff:ff:ff:ff:ff),eth_type(0x88a8),vlan(vid=9,pcp=0),encap(eth_type(0x0800),ipv4(tos=0/0x3,frag=no)), packets:22386, bytes:7566468, used:0.001s, actions:1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,pop_vlan,61,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),push_vlan(tpid=0x88a8,vid=9,pcp=0),62
recirc_id(0),in_port(61),eth(src=52:54:00:12:00:69,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(tclass=0/0x3,frag=no), packets:0, bytes:0, used:never, actions:push_vlan(tpid=0x88a8,vid=9,pcp=0),1,2,pop_vlan,3,push_vlan(tpid=0x88a8,vid=9,pcp=0),4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,set(tunnel(dst=192.168.112.6,ttl=64,flags(df))),62,set(tunnel(dst=192.168.112.12,ttl=64,flags(df))),62
...
So I checked with ovs-vsctl list bridge whether STP was enabled everywhere, and indeed on ow4 it was disabled. So I ran ovs-vsctl set bridge br1 stp_enable=true to enable it, and now the test passes: https://openqa.opensuse.org/tests/2503732
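A quick way to check this on each host (standard ovs-vsctl usage; my addition):

ovs-vsctl get bridge br1 stp_enable   # expected: true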
Updated by openqa_review over 2 years ago
- Due date set to 2022-08-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 2 years ago
@favogt Thanks for having a look.
I thought putting ovs-vsctl set bridge $bridge stp_enable=true into /etc/wicked/scripts/gre_tunnel_preup.sh and then rebooting would be enough.
I'll re-run my tests (across different hosts) to check whether that works now.
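For context, a sketch of such a pre-up script along the lines of the GRE tunnel documentation linked above; the remote IP is a placeholder and the actual script content on ow4 is not shown in this ticket:

#!/bin/sh
action="$1"
bridge="$2"
ovs-vsctl set bridge $bridge stp_enable=true
# one GRE port per remote MM worker host
ovs-vsctl --may-exist add-port $bridge gre1 -- set interface gre1 type=gre options:remote_ip=192.168.112.6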
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
It works across different hosts as well (see https://openqa.opensuse.org/tests/2505394). So I suppose the ticket could be resolved. However, I want to reboot openqaworker4 one more time to see whether the setting is persistent. (I'll hold off on that until the worker is no longer completely busy.)
@fvogt Thanks for your help again and also explaining what command you've used!
Updated by mkittler over 2 years ago
I've rebooted the machine. Unfortunately, ovs-vsctl list bridge now shows stp_enable: false again. Using ovs-vsctl set bridge br1 stp_enable=true fixes it again, but I'm not sure why it isn't persistent (as it is actually configured like I mentioned in #64700#note-21).
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
The problem was that /etc/wicked/scripts/gre_tunnel_preup.sh was lacking the executable permission. It now works, verified via reboot.
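In shell form (paraphrasing the fix described above):

chmod +x /etc/wicked/scripts/gre_tunnel_preup.sh
ls -l /etc/wicked/scripts/gre_tunnel_preup.sh   # confirm the executable bit is set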
Updated by okurz over 2 years ago
- Status changed from Resolved to Feedback
See https://github.com/os-autoinst/openQA/pull/4773; mkittler had another idea what to put into the documentation. Please also include how to test that, e.g. the ovs-vsctl debug commands as explained in some comments here, as well as "jobs post" with the export command of an existing cluster scenario and how to adapt it for testing.
Updated by mkittler over 2 years ago
PR for the remaining documentation update: https://github.com/os-autoinst/openQA/pull/4774
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
The PR has been merged.