action #134282 (closed)

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry

Added by emiura 9 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-08-15
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observations

  • Multi-machine jobs can't download artifacts from OBS/pip

Theory

(Fill this section with our current understanding of how the world works based on observations as written in the next section)

Problem

  • H1 REJECT The product has changed
    • -> E1-1 Compare tests on multiple product versions -> O1-1-1 We observed the problem in multiple products with different states of maintenance updates, and the support server is an old SLE12SP3 with no change in maintenance updates for months. It is unlikely that the iscsi client changed recently, but that has to be verified
  • H2 Fails because of changes in test setup
    • H2.1 Our test hardware equipment behaves differently
    • H2.2 The network behaves differently
  • H3 Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
    • -> E3-1 TODO compare package versions installed on machines from "last good" with "first bad", e.g. from /var/log/zypp/history
    • -> E3-2 It is probably not the Open vSwitch version, see comment #134282#note-98
  • H4 Fails because of changes in test management configuration, e.g. openQA database settings
    • -> wait for E5-1
  • H5 Fails because of changes in the test software itself (the test plan in source code as well as needles)
    • -> E5-1 TODO Compare vars.json from "last good" with "first bad" and in particular look into changes to needles and job templates
  • H6 REJECT Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
    • -> O6-1 #134282#note-71 but there is no 100% fail ratio
    • -> E6-2 Increase timeout in the initial step of firewall configuration to check if we have non-reliable test results due to timeouts
    • -> TODO Investigate the timeout in the initial step of firewall configuration
    • -> TODO Add TIMEOUT_SCALE=3 on non-HanaSR cluster tests' support servers
  • H7 Multi-machine jobs don't work across workers anymore since 2023-08 -> also see #111908 and #135773
    • H7.1 REJECT Multi-machine jobs generally work fine when executed on a single physical machine -> E7.1-1 Run multi-machine jobs only on a single physical machine -> O7.1-1-1 See #134282-80
    • We could pin jobs to a worker but that will need to be implemented properly, see #135035
    • We otherwise need to understand the infra setup better

Suggestions

  • Test case improvements
    • support_server/setup
    • firewall services add zone=EXT service=service:target
    • MTU check for packet size - covered in #135200 (a minimal ping sketch follows this list)
  • MTU size configuration
    • By default the MTU is 1500; however, for the openQA TORs MTU 9216 is configured on each port, and the future network automation service will apply this setting by default throughout PRG2 as well. Lowering the MTU would then have to be requested via SD ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-130143
  • Come up with a better reproducer, e.g. run an openQA test scenario as a single-machine test with the support_server still on a tap worker -> see #134282-104
  • Verify stability on one or multiple workers e.g. #135773-9
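
A minimal version of such a check, as it could be run from a SUT against its support server (the payload size and the 10.0.2.1 address are only examples taken from the observations in the comments; #135200 tracks the real implementation):

# 1350 bytes of ICMP payload + 28 bytes of ICMP/IP headers = a 1378-byte packet, just
# below the 1380 bytes that were still observed to pass; -M do forbids fragmentation
ping -M do -c 3 -s 1350 10.0.2.1 || echo "MTU/GRE problem between SUT and support server"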

Rollback steps

Out of scope

  • Improving openQA upstream documentation -> #135914
  • ovs-server+client scenario and MTU related fixes -> #135773
  • lessons learned -> #136007
  • SAP NFS server related issues qesap-nfs.qa.suse.cz -> #135938
  • Problems to reach machines in external network in multi-machine tests -> #135056
  • Ensure IP forwarding is persistent for good -> #136013

Related issues: 13 (2 open, 11 closed)

  • Related to openQA Project - action #111908: Multimachine failures between multiple physical workers (New, 2022-06-03)
  • Related to openQA Tests - action #133787: [qe-core] not hardcode a single worker to run 'autofs_server/client' and 'ovs-server/client' tests (Closed, rfan1, 2023-08-04)
  • Related to openQA Project - action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)
  • Related to openQA Infrastructure - action #135056: MM Test fails in a connection to an address outside of the worker (Resolved, mkittler, 2023-09-01)
  • Related to openQA Infrastructure - action #134042: auto-update on OSD does not install updates due to "Problem: nothing provides 'libwebkit2gtk3 ..." but service does not fail and we do not get an alert size:M (Resolved, livdywan, 2023-08-09 – 2023-09-12)
  • Related to openQA Infrastructure - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)
  • Related to openQA Infrastructure - action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network (New, 2023-09-18)
  • Copied to openQA Tests - action #135200: [qe-core] Implement a ping check with custom MTU packet size (Rejected, dvenkatachala, 2023-08-15)
  • Copied to openQA Infrastructure - action #135773: [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M (Resolved, livdywan, 2023-08-15 – 2023-10-07)
  • Copied to openQA Tests - action #135818: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers (Resolved, pcervinka, 2023-08-15)
  • Copied to openQA Project - action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M (Resolved, mkittler)
  • Copied to openQA Infrastructure - action #136007: Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S (Resolved, tinita)
  • Copied to openQA Project - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)

Actions #1

Updated by pcervinka 9 months ago

  • Priority changed from Normal to Urgent

There is something wrong with the multi-machine network when tests are run across different workers. If a multi-machine job is forced to run on the same worker, it is fine.

There are fails in core group: https://openqa.suse.de/tests/11843205#next_previous
Kernel group: https://openqa.suse.de/tests/11846943#next_previous
HPC: https://openqa.suse.de/tests/11845897#next_previous

Actions #2

Updated by pcervinka 9 months ago

I tried to debug the issue in a paused test. Ping worked, but other communication between the SUT and the support server did not, for example ssh. Could the GRE tunnel be in some bad state so that bigger packets just don't pass? Unfortunately, I can't get more info (jobs have been scheduled for a couple of hours already).

Actions #3

Updated by pcervinka 9 months ago

I was able to confirm the above statement: there is definitely a packet size issue, big packets just don't pass to the support server.

susetest:~ # ping -s 1352 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1352(1380) bytes of data.
1360 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=24.5 ms
1360 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=23.8 ms

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 23.893/24.221/24.550/0.363 ms
susetest:~ # ping -s 1353 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1353(1381) bytes of data.

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

susetest:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:08:1a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe12:81a/64 scope link 
       valid_lft forever preferred_lft forever
susetest:~ # ssh root@10.0.2.1
^C
susetest:~ # ifconfig eth0 mtu 1350
susetest:~ # ssh root@10.0.2.1
The authenticity of host '10.0.2.1 (10.0.2.1)' can't be established.
ECDSA key fingerprint is SHA256:tQO13Ix/i0kNGPNMTEn9o7WXaEC7YNPkAufs7rJk5Iw.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.2.1' (ECDSA) to the list of known hosts.
Password: 
Last login: Thu Aug 17 02:31:13 2023 from ::1

It is visible that the default MTU size is 1458 and ssh doesn't work. If the MTU is set to something smaller, it works.
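
To narrow the threshold down without guessing sizes, something along these lines can be run on the SUT (a sketch; 10.0.2.1 is the support server in this scenario):

# -M do sets the DF bit so oversized packets fail instead of being fragmented;
# add 28 bytes (20 IP + 8 ICMP) to translate the payload size into the packet size
for size in 1472 1430 1400 1380 1360 1353 1352; do
    if ping -M do -c 1 -W 2 -s "$size" 10.0.2.1 >/dev/null 2>&1; then
        echo "largest passing payload: $size bytes"
        break
    fi
done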

Actions #4

Updated by pcervinka 9 months ago

It impacts all multi-machine jobs between different workers, across all SLE versions and different tests. It is not a test issue or a product issue.

Actions #5

Updated by osukup 9 months ago

  • Subject changed from iscsi failures on multimachine tests on HA/SAP. to [tools] network protocols failures on multimachine tests on HA/SAP.
Actions #6

Updated by livdywan 9 months ago

  • Target version set to Ready

Thank you for your thorough investigation! Discussing it in Slack now

Actions #7

Updated by dzedro 9 months ago

Interesting, well done @pcervinka! 👍
Should we just decrease the MTU in the support server setup?
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L116

Actions #8

Updated by pcervinka 9 months ago

I also ran tcpdump on the support server to see what comes in.

Ping size 1352:

susetest:~ # ping -s 1352 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1352(1380) bytes of data.
1360 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=24.5 ms
1360 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=23.8 ms

Dump:

02:32:17.133307 52:54:00:12:08:1a > 52:54:00:12:07:f7, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 42398, offset 0, flags [DF], proto ICMP (1), length 1380)
    10.0.2.15 > 10.0.2.1: ICMP echo request, id 2100, seq 1, length 1360
02:32:17.133426 52:54:00:12:07:f7 > 52:54:00:12:08:1a, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 1951, offset 0, flags [none], proto ICMP (1), length 1380)
    10.0.2.1 > 10.0.2.15: ICMP echo reply, id 2100, seq 1, length 1360
02:32:18.135134 52:54:00:12:08:1a > 52:54:00:12:07:f7, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 42561, offset 0, flags [DF], proto ICMP (1), length 1380)
    10.0.2.15 > 10.0.2.1: ICMP echo request, id 2100, seq 2, length 1360
02:32:18.135220 52:54:00:12:07:f7 > 52:54:00:12:08:1a, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 2141, offset 0, flags [none], proto ICMP (1), length 1380)

We can see that the maximum Ethernet frame size on top of the ping payload is 1394 bytes: 1352 bytes of ICMP payload + 8 bytes of ICMP header = 1360 bytes, + 20 bytes of IPv4 header = 1380 bytes (the IP length in the dump), + 14 bytes of Ethernet header = 1394 bytes.

Ping size 1353:

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 23.893/24.221/24.550/0.363 ms
susetest:~ # ping -s 1353 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1353(1381) bytes of data.

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

There were no packets on the support server.

I would recommend taking a tcpdump on each worker to see what is leaving and what is arriving, and checking the logs. (I don't have access to the workers right now, my workstation with the keys died.)
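
For reference, a capture along these lines on both workers should show whether the big packets leave one host and never arrive on the other (the interface names eth0/br1 are assumptions, adjust to the actual uplink and OVS bridge):

# on the sending worker: GRE-encapsulated frames (IP protocol 47) leaving the uplink
tcpdump -ni eth0 'ip proto 47 and greater 1400'
# on the receiving worker: large inner ICMP packets arriving on the bridge
tcpdump -ni br1 'icmp and greater 1300'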

Actions #9

Updated by pcervinka 9 months ago

dzedro wrote:

Should we just decrease the MTU in the support server setup?
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L116

No, we would have to set the MTU on each client as well. This would just hide the underlying problem.

Actions #10

Updated by livdywan 9 months ago

  • Related to action #111908: Multimachine failures between multiple physical workers added
Actions #11

Updated by livdywan 9 months ago

  • Tags set to infra
  • Project changed from openQA Tests to openQA Infrastructure
  • Subject changed from [tools] network protocols failures on multimachine tests on HA/SAP. to [tools] network protocols failures on multimachine tests on HA/SAP size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #12

Updated by rfan1 9 months ago

  • Related to action #133787: [qe-core] not hardcode a single worker to run autofs_server/client' and 'ovs-server/client' tests added
Actions #13

Updated by livdywan 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm filing an SD ticket now as discussed, and we'll see from there what we can do.

Actions #14

Updated by livdywan 9 months ago

  • Status changed from In Progress to Blocked
Actions #15

Updated by livdywan 9 months ago

based on feedback we got, neither Networking nor Eng-Infra is looking after what you mentioned.

Not sure yet what to make of that.

Actions #16

Updated by mgrifalconi 8 months ago

Hello, does it mean nothing is happening from either side? Any way to escalate this or shall we continue to force green all multimachine tests for the foreseeable future?

Actions #17

Updated by livdywan 8 months ago

  • Status changed from Blocked to Feedback

mgrifalconi wrote in #note-16:

Hello, does it mean nothing is happening from either side? Any way to escalate this or shall we continue to force green all multimachine tests for the foreseeable future?

I realize I didn't save my comment. We're discussing it and will get back to you as soon as we know more.

Actions #18

Updated by livdywan 8 months ago

Notes from our debugging session:

  • salt-states etc/firewalld/zones/trusted.xml with a wrong bridge_iface results in a missing eth0 on worker3; it seems we don't currently have a check for the case that when tap is set, bridge_iface also needs to be set
  • this should be checked by e.g. a pipeline
  • "if workerclass contains "tap" then bridge_iface needs to be set"
  • using ethtool to check the interface - maybe there's a disconnected cable here as well? we should be okay since we have another uplink, though
  • testing with the suspected missing interface: firewall-cmd --zone=trusted --add-interface=eth0
  • https://openqa.suse.de/tests/11897800 let's see if this works with the fix
  • we don't have a good way to confirm if gre devices actually work? (see the sketch after these notes)
    • ip a s dev gre29 says the device doesn't exist; gre interfaces are handled by ovs and therefore the usual Linux tools don't work as expected
    • the remote_ip, i.e. of worker9, is correct
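
A sketch of quick checks on a tap worker host that would cover both points above (br1 as the bridge name is an assumption, adjust to the actual setup):

firewall-cmd --zone=trusted --list-interfaces   # should list the OVS bridge, e.g. br1
ovs-vsctl show | grep -A2 'type: gre'           # lists the GRE ports together with their remote_ip options
ovs-ofctl dump-ports br1                        # per-port rx/tx counters; they should increase while an MM job runs
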
Actions #19

Updated by mkittler 8 months ago

Looks like none of the jobs in the restarted cluster (https://openqa.suse.de/tests/11897800#dependencies) have been scheduled to run on worker3, which was the most likely culprit. That's not the worst because this way we can check whether the hypothesis of worker3 actually being the culprit is true: if the jobs now pass, then worker3 is likely the culprit; otherwise there's more to it.

EDIT: Now actually two jobs within the cluster have failed, both on worker8 (https://openqa.suse.de/tests/11897800,https://openqa.suse.de/tests/11897799). So it is definitely not just a problem of worker3.

Actions #20

Updated by pcervinka 8 months ago

It is really not related to a specific worker. Here is an example of an HPC job on aarch64; it passed fine when I cloned it with a defined worker: https://openqa.suse.de/tests/11902812
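
For reference, pinning the whole cluster for such an experiment can presumably be done by cloning with an overridden WORKER_CLASS, assuming the target host exposes its own hostname as a worker class (as the OSD salt setup usually does); the job id and worker class below are just examples, and --parental-inheritance is meant to pass the override on to the parallel parent (the support server) as well:

openqa-clone-job --within-instance https://openqa.suse.de \
    --parental-inheritance 11902812 WORKER_CLASS=openqaworker-arm3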

Actions #21

Updated by acarvajal 8 months ago

mkittler wrote in #note-19:

EDIT: Now actually two jobs within the cluster have failed, both on worker8 (https://openqa.suse.de/tests/11897800,https://openqa.suse.de/tests/11897799). So it is definitely not just a problem of worker3.

Yesterday I saw MM jobs failing with network-related issues on different workers. The ones I remember seeing were worker3, worker5, worker8 and worker9. Restarting the jobs while forcing all of them to run on the same physical worker seems to allow the jobs to complete, but this is of course a workaround and not something I would like to set for these jobs permanently.

We will continue to monitor the situation and paste here any upcoming failures we see during our review.

Actions #22

Updated by livdywan 8 months ago

  • Assignee changed from livdywan to mkittler
  • Priority changed from Urgent to High

I guess we can consider it High while we're verifying the actual impact of it.

Actions #23

Updated by mkittler 8 months ago

When https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 has been merged those workers won't be tap workers anymore anyways and Prague-located workers will be used instead. On those new workers I have verified that the MM setup works with jobs that were explicitly scheduled to run across multiple workers.

Note that it is totally possible that this problem is specific to the HA/SAP test scenario. I've checked some other MM tests on OSD like https://openqa.suse.de/tests/11932373 and this job and its parallel parent ran across different workers and passed. The whole Wicked Maintenance Updates looks in fact very good with no failures in the last two weeks (I haven't looked further into the past).

Actions #24

Updated by acarvajal 8 months ago

mkittler wrote in #note-23:

When https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 has been merged those workers won't be tap workers anymore anyways and Prague-located workers will be used instead. On those new workers I have verified that the MM setup works with jobs that were explicitly scheduled to run across multiple workers.

Thanks a lot for the investigation. I'll keep an eye to this MR.

Note that it is totally possible that this problem is specific to the HA/SAP test scenario. I've checked some other MM tests on OSD like https://openqa.suse.de/tests/11932373 and this job and its parallel parent ran across different workers and passed. The whole Wicked Maintenance Updates looks in fact very good with no failures in the last two weeks (I haven't looked further into the past).

By chance I saw tests running today in the 15-SP5 QR Job Group (https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=115.1&groupid=518) and something is definitely going on.

Keep in mind that the support server code and the iscsi_client module are the same in all scenarios.

Today in that job group we can see:

  1. 2-Node Cluster (alpha) failing on x86_64 (https://openqa.suse.de/tests/11935907) and ppc64le (https://openqa.suse.de/tests/11935771). It ran on malbec and QA-Power8-5-kvm in ppc64le, and in worker10, worker3 & worker8.
  2. 2-Node Cluster (beta) failing on aarch64 (https://openqa.suse.de/tests/11937572) and ppc64le (https://openqa.suse.de/tests/11937524), but passing on x86_64: https://openqa.suse.de/tests/11936615, https://openqa.suse.de/tests/11936619 and https://openqa.suse.de/tests/11936618#. All 3 jobs ran on worker5.
  3. CTDB Cluster (to test samba resources) failing on aarch64 (https://openqa.suse.de/tests/11937584) and x86_64 (https://openqa.suse.de/tests/11936677). The x86_64 jobs ran in worker10, worker8, worker3 & worker5.
  4. 3-Node Cluster with Diskless SBD (delta) passing in aarch64 (https://openqa.suse.de/tests/11935686) and x86_64 (https://openqa.suse.de/tests/11937503). The aarch64 jobs ran in openqaworker-arm3 and the x86_64 jobs ran in worker5.

Other passing jobs were Priority Fencing Cluster (all jobs ran in worker3), and QNetd Cluster (all jobs ran in worker9).

I think there is a clear pattern: in these MM jobs the iscsi client fails to connect to the support server when the jobs are picked up by different workers, and passes when all MM jobs are picked up by the same worker.

More importantly, this was working before, as we can see in the results of the 15-SP5 GMC here: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=102.1&groupid=143

Pick any x86_64 cluster (alpha, beta, gamma, delta, CTDB, etc.): not only did they pass, they also ran on more than one physical worker.

Actions #25

Updated by pstivanin 8 months ago

  • Blocks action #134495: [security][maintenance] all multi machines tests are failing added
Actions #26

Updated by okurz 8 months ago

  • Priority changed from High to Urgent

What I understood is that SLE maintenance update tests are affected by the issue on a daily basis, so setting back to "Urgent". Related discussion https://suse.slack.com/archives/C02CANHLANP/p1693379586958799

(Oliver Kurz) @Antonios Pappas @Alvaro Carvajal (CC @qa-tools ) as this topic was brought up in the weekly QE sync meeting regarding https://progress.opensuse.org/issues/134282 . 1. How can we make a workaround more permanent to only schedule multi-machine tests on individual machines? 2. With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 we are proposing to only run multi-machine x86_64 tests on PRG2 based workers which might also help. Should we go ahead with that first? 3. The ticket was reduced to "High" priority. Do I understand correctly that every day SLE maintenance tests keep failing related to that ticket? Because then we should increase prio to "Urgent" again
(Fabian Vogt) If the GRE tunnel is used, it's required that every SUT has its MTU adjusted. This is documented. http://open.qa/docs/#_gre_tunnels, the "NOTE"
(Oliver Kurz) oh, I was not aware. Where does this have to be set?
(Fabian Vogt) It's usually done by whatever sets up the mm networking inside the clients, let me find the place...
(Antonios Pappas) How has the scheduling changed that this started failing in August?
(Antonios Pappas) Was there a constraint that was relaxed?
(Antonios Pappas) If this was a true limitation why did everything start to fail mid august?
(Fabian Vogt) The code is in lib/mm_network.pm and tests/support_server/setup.pm
(Fabian Vogt) It's possible the infra was configured for jumbo packets before the migration but this no longer works?
(Alvaro Carvajal) Scheduling MM tests on individual machines would require setting the hostname in the WORKER_CLASS setting, right? Isn't this bad/introduces a single point of failure?
Yes, I think we should go ahead with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 and re-assess
Yes, in HA & SAP job groups, tests fail every day
(Antonios Pappas) Kernel incidents were also affected. pcervinka is on vacation so he is not on the call but his backup should know
(Alvaro Carvajal) @Oliver Kurz ran into some MM failures from a saptune MU. These jobs I cannot force label to softfail. I need them to run (and pass). Should I restart them on a fixed worker or should I wait for the merge of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581?
(Oliver Kurz) Well, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 could break more stuff. If you say you can closely monitor the effect today then we can just merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 now and trigger tests accordingly and see what it brings. That you cannot force the result is a different issue that we can consider as well. I recommend you open another thread or ticket as well
(Jozef Pupava) All nodes have MTU 1458, AFAIK they get that from DHCP/support_server https://openqa.suse.de/tests/11923577#step/hostname/27 node1 https://openqa.suse.de/tests/11923578#step/hostname/25 node2 https://openqa.suse.de/tests/11923576#step/setup/29 support_server
(Alvaro Carvajal) That you cannot force the result is a different issue that we can consider as well. I recommend you open another thread or ticket as well. Oh, I can technically do it. I just shouldn't. This is an important package to test
(Alvaro Carvajal) Well, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 could break more stuff. If you say you can closely monitor the effect today then we can just merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 now and trigger tests accordingly and see what it brings.
Better wait. We have Sprint Review today
(Oliver Kurz) As soon as someone from tools team can closely monitor the impact we would merge
(Marius Kittler) I could merge and monitor after our daily meeting.

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=26&from=now-90d&to=now looks like a regression of more multi-machine tests failing since about 2023-08-10

Actions #27

Updated by acarvajal 8 months ago

Seems the move to the Multi-Machine workers in PRG did not solve the issue.

The following is a collection of HA & SAP jobs which have finished either on August 30th or 31st. Most of the failures are still in the iscsi_client test module (cluster nodes connecting to iSCSI server in Support Server), but now there are also failures in the Support Server setup module.

iscsi_client failures:

support_server/setup failures:

This is only a sample. There are more.

Besides these, we are also seeing some jobs timing out when attempting connection outside of osd, for example: https://openqa.suse.de/tests/11950389#step/suseconnect_scc/20

Actions #28

Updated by acarvajal 8 months ago

I found some jobs that ran in worker38 (so they should've worked), but which failed in support_server/setup module:

https://openqa.suse.de/tests/11963289
https://openqa.suse.de/tests/11963291
https://openqa.suse.de/tests/11963290

This is new, and I guess it's related to the move to PRG.

Actions #29

Updated by acarvajal 8 months ago

I think the support_server/setup issue is present only on worker34, worker37, worker38. I've seen support servers passing this step in worker39, worker31 and worker30. Not sure if this helps.

Actions #30

Updated by mkittler 8 months ago

I would assume that the exact lists of workers where it is passing or failing might be misleading. Maybe it is happening rather randomly. All mentioned workers have been set up in the exact same way, so it would be strange if there were any differences between them in general.

I definitely also tested the basic wicked test scenario across the workers where the issue is present.


It would be helpful to understand what this test does that the basic wicked test doesn't.


In the failure mentioned in #134282#note-28 the support server job didn't unlock the mutex and thus the other jobs couldn't continue. So that's a problem that might not even have anything to do with the GRE tunnels but simply with the support server job not being able to reach the point where it would unlock the mutex. At which point is the support server supposed to unlock the mutex? I've been searching in the openSUSE test distribution for support_server_ready but couldn't find an occurrence that is applicable to this test scenario. I'm really wondering whether this failure is not just an issue with the test itself.
(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

Actions #31

Updated by okurz 8 months ago

Further from the referenced Slack conversation https://suse.slack.com/archives/C02CANHLANP/p1693379586958799

(Oliver Kurz) @Fabian Vogt @Jozef Pupava can we make failures more explicit if MTU is not set properly? Like, let tests crash if WORKER_CLASS=tap and not set MTU?
(Fabian Vogt) Not easily. This would have to be run on the SUT as test module, but after networking is set up. This means it's equally likely that if the MTU is wrong, the check isn't scheduled either
(Oliver Kurz) how about something that is called unconditionally in all tests? Maybe in main.pm? Would have to run at the right point in jobs, but is more likely to work
(Oliver Kurz) who can volunteer do implement this?

No response to this. I would say that without a check for a proper MTU there is no point in continuing, so IMHO we should focus on that, but I don't know where to add such a check.

I also added a response in https://sd.suse.com/servicedesk/customer/portal/1/SD-130143 with some questions to Eng-Infra experts.

Actions #32

Updated by ybonatakis 8 months ago

I believe the failures on HPC jobs are also because of it.
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=16.1&groupid=130
However, none of the ones I checked ran on any of the workers acarvajal points to.

Actions #33

Updated by acarvajal 8 months ago

mkittler wrote in #note-30:

In the failure mentioned in #134282#note-28 the support server job didn't unlock the mutex and thus the other jobs couldn't continue.

It didn't unlock the mutex, because the support server job failed before the unlock action.

A normal run looks like this: https://openqa.suse.de/tests/11163631

The mutex causing the other jobs to wait is created in the last line of the support_server/setup module: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

As the module failed before that, the other jobs were never unlocked.

So that's a problem that might not even have anything to do with the GRE tunnels but simply with the support server job not being able to reach the point where it would unlock the mutex.

Agree, but IMHO something with the MM setup in these workers (worker34, worker37, worker38) is causing a failure in the support_server/setup module before it can create the mutex.

At which point is the support server supposed to unlock the mutex? I've been searching in the openSUSE test distribution for support_server_ready but couldn't find an occurrence that is applicable to this test scenario. I'm really wondering whether this failure is not just an issue with the test itself.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

A week ago we saw a similar error in our development openQA instance.

See: https://openqaworker15.qa.suse.cz/tests/217494#step/setup/33

We had to check https://open.qa/docs/#_multi_machine_test_setup and after messing around with the firewalld configuration, installing libcap, and restarting the server, we managed to get it working: https://openqaworker15.qa.suse.cz/tests/217498

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739
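
For anyone hitting the same thing, these are roughly the kind of checks the linked documentation describes; a paraphrased sketch from memory (not a substitute for the docs, and br1 as bridge name is an assumption):

systemctl status openvswitch os-autoinst-openvswitch   # both services need to be running on the worker
ovs-vsctl show                                          # bridge and GRE ports as set up by os-autoinst-openvswitch
firewall-cmd --zone=trusted --list-interfaces           # the bridge has to be in the trusted zone
sysctl net.ipv4.ip_forward                              # must be 1 so SUTs can reach external networks
firewall-cmd --reload                                   # apply zone/service changes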

Actions #34

Updated by acarvajal 8 months ago

ybonatakis wrote in #note-32:

I believe the failures on HPC jobs are also because of it.
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=16.1&groupid=130
However none of the ones i checked run in any of the workers acarvajal points to.

Seems very similar to what we're seeing, with the exception that you don't have issues in support_server/setup.

Most of your tests that passed did so when the jobs ran on the same worker. I see some of these working on worker36 and worker35. I don't see any of your support servers running on worker34, worker37 or worker38, where the setup module is failing for us.

I did see one of your MM jobs passing when running on multiple workers:

https://openqa.suse.de/tests/11959550 (worker30)
https://openqa.suse.de/tests/11959754 (worker32)
https://openqa.suse.de/tests/11959755 (worker39)
https://openqa.suse.de/tests/11959756 (worker29)

I do wonder what the difference is between these and your other jobs which failed in cpuid and ours that fail in iscsi_client.

Actions #35

Updated by acarvajal 8 months ago

Here's another example of a support_server/setup failure in worker38: https://openqa.suse.de/tests/11972258#step/setup/33

This time it failed when attempting to configure the firewall. This is the exact same step where our tests were failing in our Staging openQA instance last Friday, before we had to fix the MM configuration there.

Actions #36

Updated by okurz 8 months ago

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

And in the failed job examples I see multiple points for improvement. E.g. in https://openqa.suse.de/tests/11962805#step/setup/45 I see that no post_fail_hook is executed at all. At least the system journal might be able to help here, possibly also YaST module logs. I suggest to create separate tickets about those.

And to the ticket assignee: please make sure to update the ticket description based on https://progress.opensuse.org/projects/openqav3/wiki/#Further-decision-steps-working-on-test-issues to keep track of the current state, open hypotheses, experiments to conduct, etc.

Actions #37

Updated by livdywan 8 months ago

  • Assignee changed from mkittler to livdywan
Actions #38

Updated by apappas 8 months ago

  • Related to action #135035: Optionally restrict multimachine jobs to a single worker added
Actions #39

Updated by apappas 8 months ago

okurz wrote in #note-36:

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

This will only reroll the dice with the hope that the retries will land on the same tap worker, while also increasing the workload on the limited tap pool.

Actions #40

Updated by okurz 8 months ago

Yes, exactly

Actions #41

Updated by mkittler 8 months ago

I didn't have the earlier hypothesis that it is MTU-related on my mind anymore. That would actually explain why not all scenarios are affected.

It didn't unlock the mutex, because the support server job failed before the unlock action.

Right, and I was wondering about exactly that. Normally one doesn't need a network connection to start a server so it is strange to blame network connectivity here. If the network connection was the problem then only reaching that server should fail, right?

The mutex causing the other jobs to wait is created in the last line of the support_server/setup module: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

Ah, thanks. I only considered mutex_unlock calls (and not mutex_create).

Agree, but IMHO something with the MM setup in these workers (worker34, worker37, worker38) is causing a failure in the support_server/setup module before it can create the mutex.

If you think that really only those workers are problematic, then create an MR to remove the tap worker class only from those, as a temporary workaround and also as a means of checking that hypothesis. I guess the others would accept such an MR.


By the way, I'm on squad rotation as of next week. Hence I handed the ticket over to @livdywan.

Actions #42

Updated by okurz 8 months ago

  • Related to action #135056: MM Test fails in a connection to an address outside of the worker added
Actions #43

Updated by livdywan 8 months ago

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

This will only reroll the dice with the hope that the retries will land on the same tap worker, while also increasing the workload on the limited tap pool.

This would have the benefit of still giving us logs at the cost of < 15 minutes for each bad job (I don't know the failure rate, though), which might make it more useful than pinning, which also comes with delays (see #135035#note-6, which also applies to manual pinning). And more importantly, if we can identify an expression that all failures have in common, we're probably a step closer to a fix.

Agree, but IMHO something with the MM setup in these workers (worker34, worker37, worker38) is causing a failure in the support_server/setup module before it can create the mutex.

At which point is the support server supposed to unlock the mutex? I've been searching in the openSUSE test distribution for support_server_ready but couldn't find an occurrence that is applicable to this test scenario. I'm really wondering whether this failure is not just an issue with the test itself.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

On the above note I wonder if we can make step 41 reveal the issue... it looks totally fine whether it fails or passes. https://openqa.suse.de/tests/11963289#step/setup/41

A week ago we saw a similar error in our development openQA instance.

See: https://openqaworker15.qa.suse.cz/tests/217494#step/setup/33

This seems to fail in yast2 firewall services add zone=EXT service=service:target instead of a needle. So that's a little better. I wonder if there's a more verbose mode we can use here to examine why this fails?

Actions #44

Updated by livdywan 8 months ago

  • Subject changed from [tools] network protocols failures on multimachine tests on HA/SAP size:S to [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab":retry

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

This will only reroll the dice with the hope that the retries will land on the same tap worker, while also increasing the workload on the limited tap pool.

This would have the benefit of still giving us logs at the cost of < 15 minutes for each bad job (I don't know the failure rate, though), which might make it more useful than pinning, which also comes with delays (see #135035#note-6, which also applies to manual pinning). And more importantly, if we can identify an expression that all failures have in common, we're probably a step closer to a fix.

For now the best we have is probably the needle mismatch no candidate needle with tag(s) 'iscsi-target-overview-service-tab' matched, so I'm starting with that.

Actions #45

Updated by livdywan 8 months ago

  • Priority changed from Urgent to High

If we want to go the hard-coding route we could for example use only worker40: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596/diffs

Opinions welcome. I'm mainly looking into mitigation here. This is not a fix by any stretch.

Actions #46

Updated by apappas 8 months ago

I do think that having a single point of failure again is not good, especially since there are too many jobs that will pass through it.

What I am currently doing right now is editing the jobgroups manually to pin one set of mm jobs to one worker but distribute the sets among the known good workers.

Actions #47

Updated by acarvajal 8 months ago

livdywan wrote in #note-43:

On the above note I wonder if we can make step 41 reveal the issue... it looks totally fine whether it fails or passes. https://openqa.suse.de/tests/11963289#step/setup/41

I don't think so. AFAIK that step is local to the support server and is not doing anything network-related.


On a more positive note, after some days I've finally seen some MM passing in multiple workers. No idea if this is a sign of workers stabilizing after the migration, or if Tools Team did something to these workers which fixed this issue. Jobs are:

I will continue monitoring and updating this ticket with what I find. Would not consider above jobs as proof that everything's back to normal until I see more.

Actions #48

Updated by acarvajal 8 months ago

Things are looking better. I'm seeing fewer jobs failing in iscsi_client and in support_server/setup.

I even found some jobs passing in the workers I had seen failing in support_server/setup (worker34, worker37 and worker38) last week:

Not sure what was done, but thank you very much.

Another instance of a MM job finishing successfully in the new workers. This one is a HAWK test, which does a docker pull which was failing on connection-related issues last Thursday; they ran successfully on Sunday on worker37, worker38, worker35 and worker29:

We are still seeing some sporadic issues where the sles4sap/hana_install test module takes over 4 hours to run, and ends up failing. If you see https://openqa.suse.de/tests/11994946, in that test the module ran in under 13 minutes.

Actions #49

Updated by livdywan 8 months ago

On a more positive note, after some days I've finally seen some MM passing in multiple workers. No idea if this is a sign of workers stabilizing after the migration, or if Tools Team did something to these workers which fixed this issue. Jobs are:

No. I suggested we remove mm workers but that hasn't been merged. More likely this is "What I am currently doing right now is editing the jobgroups manually to pin one set of mm jobs to one worker but distribute the sets among the known good workers", as @apappas mentioned above. I think we should try to coordinate better to avoid drawing the wrong conclusions. Maybe Slack is going to work better for that? Let's see.

Actions #50

Updated by livdywan 8 months ago

  • Copied to action #135200: [qe-core] Implement a ping check with custom MTU packet size added
Actions #51

Updated by livdywan 8 months ago

  • Description updated (diff)
Actions #52

Updated by acarvajal 8 months ago

livdywan wrote in #note-49:

No. I suggested we remove mm workers but that hasn't been merged. More likely this is What I am currently doing right now is editing the jobgroups manually to pin one set of mm jobs to one worker but distribute the sets among the known good workers. as @apappas mentioned above.

Nope. That's not it. As I said:

  1. Job https://openqa.suse.de/tests/11996380 ran in worker39
  2. Job https://openqa.suse.de/tests/11996383 ran in worker31
  3. Job https://openqa.suse.de/tests/11996382 ran in worker33

These jobs were not cloned. They were not manipulated. Their WORKER_CLASS setting is qemu_x86_64-large-mem,tap

I think we should try to coordinate better to avoid drawing the wrong conclusions. Maybe slack is going to work better for that? Let's see.

Ack.

Actions #53

Updated by livdywan 8 months ago

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

A week ago we saw a similar error in our development openQA instance.

See: https://openqaworker15.qa.suse.cz/tests/217494#step/setup/33

We had to check https://open.qa/docs/#_multi_machine_test_setup and after messing around with the firewalld configuration, installing libcap, and restarting the server, we managed to get it working: https://openqaworker15.qa.suse.cz/tests/217498

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739

Is there some way I can get access to this conversation. Apparently I can't open it. I was wondering if whatever you found would help come up with ideas to diagnose failures better.

Actions #54

Updated by acarvajal 8 months ago

livdywan wrote in #note-53:

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739

Is there some way I can get access to this conversation. Apparently I can't open it. I was wondering if whatever you found would help come up with ideas to diagnose failures better.

Sorry. Didn't realize at the time that this was in a closed channel.

I've added you and @okurz

Actions #55

Updated by livdywan 8 months ago

acarvajal wrote in #note-54:

livdywan wrote in #note-53:

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739

Is there some way I can get access to this conversation. Apparently I can't open it. I was wondering if whatever you found would help come up with ideas to diagnose failures better.

Sorry. Didn't realize at the time that this was in a closed channel.

I've added you and @okurz

Thanks! I have suspicion that #133469#note-14 is contributing to this. In particular missing packages. Because it looks to me like worker.sls is not really missing anything.

Actions #56

Updated by nicksinger 8 months ago

livdywan wrote in #note-55:

Thanks! I have suspicion that #133469#note-14 is contributing to this. In particular missing packages. Because it looks to me like worker.sls is not really missing anything.

I don't think this is related here. We had several successful highstates in the past few days (at least no such issue as described in #133469 the whole week).

Actions #57

Updated by livdywan 8 months ago

nicksinger wrote in #note-56:

livdywan wrote in #note-55:

Thanks! I have suspicion that #133469#note-14 is contributing to this. In particular missing packages. Because it looks to me like worker.sls is not really missing anything.

I don't think this is related here. We had several successful highstates in the past few days (at least no such issue as described in #133469 the whole week).

Maybe that particular issue with the openvswitch states is new... but in that case it must be #134042 without packages having been re-installed because Alvaro had to reinstall missing packages that are specified in salt.

Actions #58

Updated by livdywan 8 months ago

  • Related to action #134042: auto-update on OSD does not install updates due to "Problem: nothing provides 'libwebkit2gtk3 ..." but service does not fail and we do not get an alert size:M added
Actions #61

Updated by acarvajal 8 months ago

Seems worker29 & worker30 are also impacting tests in another way:

https://openqa.suse.de/tests/12030835#step/iscsi_client/47
https://openqa.suse.de/tests/12030870#step/iscsi_client/22

This is a failure in iscsi_client, but earlier than in the cases reported in https://progress.opensuse.org/issues/134282#note-27. This time it fails on a zypper in yast-iscsi-client call, timing out while attempting a connection to updates.suse.com.
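
A quick way to tell this kind of failure apart from the iSCSI-specific ones would be a plain connectivity probe from the SUT before zypper runs (a sketch; the 30-second cap is arbitrary):

# prints the HTTP status if the SUT can reach the repo host at all, otherwise reports the failure
curl -sS -o /dev/null --max-time 30 -w 'updates.suse.com: HTTP %{http_code}\n' https://updates.suse.com \
    || echo "no route from this SUT to updates.suse.com"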

Actions #63

Updated by livdywan 8 months ago

Probing for what might've changed, i.e. things mentioned in https://open.qa/docs/#_multi_machine_test_setup, I'm not spotting any obvious changes. I'm also not aware of relevant changes. Unfortunately, again, I don't get what happened.

Actions #65

Updated by acarvajal 8 months ago

We're collecting results from the weekend, but things look broken from our end.

It seems like at least 9/10 hours ago, support server jobs running in worker37 (https://openqa.suse.de/tests/12071208 and https://openqa.suse.de/tests/12071225) were able to pass the support_server/setup test module, while those in worker30 (https://openqa.suse.de/tests/12071212) and worker29 (https://openqa.suse.de/tests/12071216) could not.

Regrettably, even in those cases where there were no issues with the support server, the parallel jobs ran into issues:

  1. The first one running from worker38 could not connect to scc.suse.com https://openqa.suse.de/tests/12071210#step/iscsi_client/57
  2. The second one running from worker29 could not connect to scc.suse.com https://openqa.suse.de/tests/12071228#step/qnetd/53

Due to the different nature of those failures (though I suspect the root cause could be the same), I'm reporting those 2 issues in #135056

Actions #66

Updated by acarvajal 8 months ago

srinidhir wrote in #note-64:

There are many more failures in support_server/setup and also in iscsi_client,

https://openqa.suse.de/tests/12076961#step/setup/35
https://openqa.suse.de/tests/12076964#step/setup/35
https://openqa.suse.de/tests/12076967#step/setup/35
https://openqa.suse.de/tests/12076986#step/setup/35
https://openqa.suse.de/tests/12076983#step/setup/35
https://openqa.suse.de/tests/12076978#step/setup/35
https://openqa.suse.de/tests/12076990#step/setup/35
https://openqa.suse.de/tests/12071154#step/iscsi_client/37
https://openqa.suse.de/tests/12071127#step/iscsi_client/37

The ones failing in iscsi_client are doing so while attempting connections to addresses outside of the worker (either scc.suse.com or updates.suse.com). This is different than previous iscsi_client failures which were during setup of the iscsi devices, i.e., while attempting to connect the cluster node to the iscsi server located in the support server.

Trying to identify some pattern, failures in support_server/setup were on worker30, worker39, worker38, worker29.

Both failures in iscsi_client were on worker29. Interestingly, the support servers related to these iscsi_client failures ran on worker37 in both cases (https://openqa.suse.de/tests/12071152 and https://openqa.suse.de/tests/12071126) and were able to clear the support_server/setup test module, so it appears worker37 is behaving better now.

I am restarting https://openqa.suse.de/tests/12071126 and forcing all jobs to run in worker37 to see if tests pass.

See:

 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_supportserver@64bit -> http://openqa.suse.de/tests/12079184
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_01@64bit -> http://openqa.suse.de/tests/12079183
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_02@64bit -> http://openqa.suse.de/tests/12079185
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_client@64bit -> http://openqa.suse.de/tests/12079186
Actions #67

Updated by okurz 8 months ago

Given that there are still many problems I suggest to go ahead with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596 and only run x86_64 multi-machine tests from a single physical machine at least until we have better ideas and test improvements.

Actions #68

Updated by livdywan 8 months ago

okurz wrote in #note-67:

Given that there are still many problems I suggest to go ahead with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596 and only run x86_64 multi-machine tests from a single physical machine at least until we have better ideas and test improvements.

Merged

Actions #70

Updated by livdywan 8 months ago

  • Subject changed from [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab":retry to [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry

There are more failures in the support_server/setup,

https://openqa.suse.de/tests/12081543#step/setup/35

Thanks. I'm adding this to the regex as well. Interesting to see this on osd in production now.

I'm once again wondering if yast2 could help us debug the issue? If it's consistently failing in the firewall config?

Actions #71

Updated by acarvajal 8 months ago

Hello,

Besides the failures reported by @srinidhir, I also saw failures like this: https://openqa.suse.de/tests/12080260#step/setup/45

I saw a total of 231 failures (I can share the full list if necessary), all of them in support_server/setup test module, either on step 35 or on step 45.

Jobs ran in worker40 (expected due to https://progress.opensuse.org/issues/134282#note-68) but also in worker30. Seems like there were actually 2 workers handling Multi-Machine last night.

As commented, the issue is present in the support server. The parallel jobs start but get blocked while the support server is starting; however, the support server fails in support_server/setup and kills the whole MM job.

There are 2 types of errors:

  1. The first one is like https://openqa.suse.de/tests/12080239#step/setup/35 and it's a script_run timeout. This happens in HA jobs
  2. Second one is like https://openqa.suse.de/tests/12080260#step/setup/45 and it's a needle match failure. It happens in HanaSR jobs.

Even though they fail in different steps, I believe the root cause is the same. Looking at the command which causes the failure in the HA jobs (yast2 firewall services add zone=EXT service=service:target) when it runs in the HanaSR job, we can see the following:

HanaSR job:

[2023-09-12T02:15:57.536570+02:00] [debug] [pid:13581] <<< testapi::script_run(cmd="yast2 firewall services add zone=EXT service=service:target", die_on_timeout=1, timeout=200, output="", quiet=undef)
[2023-09-12T02:15:57.536672+02:00] [debug] [pid:13581] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T02:15:57.536779+02:00] [debug] [pid:13581] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T02:15:59.600493+02:00] [debug] [pid:13581] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T02:15:59.600776+02:00] [debug] [pid:13581] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T02:16:00.716267+02:00] [debug] [pid:13581] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T02:16:00.716494+02:00] [debug] [pid:13581] <<< testapi::wait_serial(timeout=200, no_regex=0, record_output=undef, regexp=qr/nOU2g-\d+-/u, expect_not_found=0, buffer_size=undef, quiet=undef)
[2023-09-12T02:22:07.016683+02:00] [debug] [pid:13581] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): ok

While in the HA job is:

[2023-09-12T01:53:34.668942+02:00] [debug] [pid:17155] <<< testapi::script_run(cmd="yast2 firewall services add zone=EXT service=service:target", output="", die_on_timeout=1, timeout=200, quiet=undef)
[2023-09-12T01:53:34.669046+02:00] [debug] [pid:17155] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T01:53:34.669152+02:00] [debug] [pid:17155] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T01:53:36.730800+02:00] [debug] [pid:17155] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T01:53:36.731133+02:00] [debug] [pid:17155] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T01:53:37.846062+02:00] [debug] [pid:17155] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T01:53:37.846380+02:00] [debug] [pid:17155] <<< testapi::wait_serial(quiet=undef, buffer_size=undef, regexp=qr/nOU2g-\d+-/u, no_regex=0, expect_not_found=0, record_output=undef, timeout=200)
[2023-09-12T01:56:58.961522+02:00] [debug] [pid:17155] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): fail

As you can see, in the HanaSR support server the command eventually worked after 6+ minutes, while in the HA job it failed at the 3 minutes 21 seconds mark. HanaSR jobs run with TIMEOUT_SCALE=3, which explains why they wait longer: the 200-second wait_serial timeout visible in the logs is presumably scaled to 600 seconds, so roughly 6 minutes still passes, while 3 minutes 21 seconds already exceeds the unscaled 200 seconds.

Even though the command is successful in HanaSR tests after 6 minutes, I still think this shows that there is a problem, as that yast2 firewall services add zone=EXT service=service:target command should finish faster. Looking at one of the successful tests from last week (see https://progress.opensuse.org/issues/134282#note-48 & https://openqa.suse.de/tests/11987887):

[2023-09-03T04:37:19.641796+02:00] [debug] [pid:38709] <<< testapi::script_run(cmd="yast2 firewall services add zone=EXT service=service:target", output="", timeout=200, quiet=undef, die_on_timeout=1)
[2023-09-03T04:37:19.641902+02:00] [debug] [pid:38709] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-03T04:37:19.642010+02:00] [debug] [pid:38709] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-03T04:37:21.700529+02:00] [debug] [pid:38709] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-03T04:37:21.700795+02:00] [debug] [pid:38709] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-03T04:37:22.816323+02:00] [debug] [pid:38709] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-03T04:37:22.816573+02:00] [debug] [pid:38709] <<< testapi::wait_serial(timeout=200, buffer_size=undef, expect_not_found=0, record_output=undef, no_regex=0, regexp=qr/nOU2g-\d+-/u, quiet=undef)
[2023-09-03T04:37:25.876126+02:00] [debug] [pid:38709] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): ok

As you can see, in this working test the command finished in under 6 seconds.
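
For comparing such timings quickly, a minimal sketch (assuming a locally downloaded autoinst-log.txt and the nOU2g marker string from the logs above; both are just examples and need adjusting per job):

# print the timestamps of the matching wait_serial call and of its result;
# the difference is the time the yast2 firewall command actually took
awk -F'[][]' '/wait_serial.*nOU2g/ {print $2}' autoinst-log.txt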

Hope this helps with the debugging.

Actions #72

Updated by okurz 8 months ago

acarvajal wrote in #note-71:

[…]
Jobs ran in worker40 (expected due to https://progress.opensuse.org/issues/134282#note-68) but also in worker30. Seems like there were actually 2 workers handling Multi-Machine last night.

@livdywan please make sure that only one machine is used here.

Actions #73

Updated by livdywan 8 months ago

okurz wrote in #note-72:

acarvajal wrote in #note-71:

[…]
Jobs ran in worker40 (expected due to https://progress.opensuse.org/issues/134282#note-68) but also in worker30. Seems like there were actually 2 workers handling Multi-Machine last night.

@livdywan please make sure that only one machine is used here.

Apparently I couldn't see what was missing yesterday. Looks like I missed it when resolving conflicts, so here's a follow-up: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/609

Actions #75

Updated by livdywan 8 months ago

As you can see, in this working test the command finished in under 6 seconds.

Hope this helps with the debugging.

@acarvajal To avoid any misconceptions please note that I'm mainly looking after mitigations here. Somebody else will need to debug this further and narrow down the actual issue or come up with test improvements such as #135200.

livdywan wrote in #note-14:

https://sd.suse.com/servicedesk/customer/portal/1/SD-130143

Just FYI the latest feedback from infra on the SD ticket:

By default MTU runs at MTU 1500, however for openQA TORs we have MTU 9216 configured for each port and the future network automation service will apply this setting as well by default throughout PRG2, lowering the MTU will then be request via SD-Ticket.
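
For reference, a quick way to check the effective path MTU from a worker towards a given host (host name and packet sizes are just examples; 1472 bytes of ICMP payload correspond to a 1500-byte MTU once the 28 bytes of IP/ICMP headers are added):

ping -c 3 -M do -s 1472 openqa.suse.de   # passes if the path supports MTU 1500 without fragmentation
ping -c 3 -M do -s 8972 openqa.suse.de   # only passes if jumbo frames (MTU >= 9000) work end to end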

Actions #76

Updated by livdywan 8 months ago

  • Description updated (diff)
Actions #77

Updated by livdywan 8 months ago

I went through the comments again just to be sure we've covered the raised points in the current summary. We should be good.

Actions #78

Updated by okurz 8 months ago

Judging from https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=26&from=1692855397995&to=1694689544873 I see no improvement in the results of multi-machine tests, so there is no clear indication regarding E7-1. However, these were not specially scheduled test runs, so other factors could also play a role here.

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

Actions #79

Updated by okurz 8 months ago

  • Related to action #135578: Long job age and jobs not executed for long size:M added
Actions #80

Updated by okurz 8 months ago

  • Priority changed from High to Urgent

I realized that there is a strong relation to #135578. Due to many multi-machine tests failing we have 1. longer runtimes due to timeouts and execution of post-fail-hooks, 2. multiple retries for recurring failures for jobs with the setting RETRY=N, 3. unreviewed investigation jobs for failing multi-machine tests. All three issues lead to a longer job schedule queue as observed in #135578, hence bumping prio again.

EDIT: I conducted quick SQL queries to ensure that no x86_64 multi-machine tests have been executed on any other machine than worker40

openqa=> select jobs.id from jobs join workers on jobs.assigned_worker_id = workers.id join job_settings on jobs.id = job_settings.job_id where t_finished >= '2023-09-13' and host != 'worker40' and key = 'WORKER_CLASS' and value = 'tap' and jobs.arch = 'x86_64' limit 3;
    id    
----------
 12087785
 12087858
 12087788
(3 rows)

openqa=> select jobs.id from jobs join workers on jobs.assigned_worker_id = workers.id join job_settings on jobs.id = job_settings.job_id where t_finished >= '2023-09-14' and host != 'worker40' and key = 'WORKER_CLASS' and value = 'tap' and jobs.arch = 'x86_64' limit 3;                                            
 id 
----
(0 rows)

The first query is a crosscheck showing that we still had tests on worker30 yesterday; the second shows that today, i.e. within the first 15 hours of the day, no multi-machine tests ran on any worker other than worker40. To find all multi-machine test results from worker40:

openqa=> select result,count(jobs.id) from jobs join workers on jobs.assigned_worker_id = workers.id join job_settings on jobs.id = job_settings.job_id where t_finished >= '2023-09-14' and host = 'worker40' and key = 'WORKER_CLASS' and value = 'tap' and jobs.arch = 'x86_64' group by result order by count DESC;
       result       | count 
--------------------+-------
 parallel_failed    |  1173
 failed             |   735
 incomplete         |   107
 parallel_restarted |    27
 passed             |    15
 timeout_exceeded   |    14
 skipped            |     2
 softfailed         |     1
(8 rows)

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

Actions #81

Updated by livdywan 8 months ago

I couldn't find an explicit mention of it, and apparently that was not clear in retrospect: My expectation is that somebody from @acarvajal's side makes time for experiments and we provide help from the Tools side.

Actions #82

Updated by livdywan 8 months ago

  • Description updated (diff)

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

Ack.

One example I took a look at is https://openqa.suse.de/tests/12096958/logfile?filename=autoinst-log.txt and it seems to run into the network delay we've seen before:

[2023-09-13T13:55:37.165578+02:00] [debug] [pid:12551] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-13T13:55:39.220794+02:00] [debug] [pid:12551] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-13T13:55:39.221004+02:00] [debug] [pid:12551] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-13T13:55:40.333170+02:00] [debug] [pid:12551] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-13T13:55:40.333380+02:00] [debug] [pid:12551] <<< testapi::wait_serial(quiet=undef, regexp=qr/nOU2g-\d+-/u, record_output=undef, no_regex=0, timeout=200, expect_not_found=0, buffer_size=undef)
[2023-09-13T14:01:46.551431+02:00] [debug] [pid:12551] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): ok
Actions #83

Updated by livdywan 8 months ago

Presumably we can re-enable all x86-64 workers previously used for multi-machine cases. Do note that nothing changed here. I'm proposing this based on the findings in #note-80 in general and #note-82 by example:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/613

Actions #84

Updated by apappas 8 months ago

The statement:

Multimachine jobs do not work across multiple workers
is not disproved when multimachine jobs also fail on a single worker.

The failures may or may not be related to cluster misconfiguration.

After a cursory glance at worker40, most of the failures happen in zypper. https://openqa.suse.de/tests/12108426#step/fips_setup/31 (fails at zypper in -t pattern fips)

I fail to understand how that invalidates the hypothesis. On the contrary, we have yet again a test that cannot communicate with outside networks.

Actions #85

Updated by livdywan 8 months ago

After a cursory glance at worker40, most of the failures happen in zypper. https://openqa.suse.de/tests/12108426#step/fips_setup/31 (fails at zypper in -t pattern fips)

I fail to understand how that invalidates the hypothesis. On the contrary, we have yet again a test that cannot communicate with outside networks.

2023-09-13 16:27:45 <5> server(2215) [zypp-core] Exception.cc(log):186 Error message: Could not resolve host: updates.suse.com

That's not one of the cases we've been looking at before, though? This looks like it fails because a host outside of the openQA production infra is not reachable. 🤔
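
For what it's worth, a few standard checks that could be run from the affected worker (or from within the SUT) to narrow this down; the host name is simply taken from the error above:

dig +short updates.suse.com                     # does name resolution work at all?
ping -c 3 updates.suse.com                      # basic reachability
curl -sI https://updates.suse.com | head -n 1   # HTTPS reachability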

Actions #86

Updated by acarvajal 8 months ago

okurz wrote in #note-80:

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

How does the fact that Multi-Machine jobs have been running only on worker40 for the past 2 days disprove that Multi-Machine jobs don't work across workers?

IMHO, only seeing passing MM jobs across multiple workers would disprove H7.

okurz wrote in #note-78:

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

livdywan wrote in #note-83:

Presumably we can re-enable all x86-64 workers previously used for multi-machine cases. Do note that nothing changed here. I'm proposing this based on the findings in #note-80 in general and #note-82 by example:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/613

I am not against this, but I don't see how it will improve things when, per #note-80, we are seeing a massive amount of failures while using only one worker. Yes, enabling the other workers may help with the job queue, and we may be lucky (see my comment regarding worker37 in #note-66) and some MM jobs could be picked up by workers where the support server setup works, but then again they could also be picked up by worker40 or other workers behaving like worker40, causing more failures.

IMHO, we should focus on making sure that MM jobs fully work on a single worker first, and then dig into any issues that may be GRE-related.

Actions #87

Updated by acarvajal 8 months ago

livdywan wrote in #note-85:

2023-09-13 16:27:45 <5> server(2215) [zypp-core] Exception.cc(log):186 Error message: Could not resolve host: updates.suse.com

That's not one of the cases we've been looking at before, though? This looks like it fails because a host outside of the openQA production infra is not reachable. 🤔

Yes, this is like the scenario described in #135056.

I do believe both issues are related, i.e. the same workers where the support server is failing to finish its setup are the ones unable to connect to addresses outside of OSD.

Actions #88

Updated by livdywan 8 months ago

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

How does the fact that Multi-Machine jobs have been running only on worker40 for the past 2 days disprove that Multi-Machine jobs don't work across workers?

IMHO, only seeing passing MM jobs across multiple workers would disprove H7.

The same jobs fail regardless of whether they run on multiple or a single physical machine. To me that suggests the physical machine part is a red herring.

That's not one of the cases we've been looking at before, though? This looks like it fails because a host outside of the openQA production infra is not reachable. 🤔

Yes, this is like the scenario described in #135056.

I do believe both issues are related, i.e. the same workers where the support server is failing to finish its setup are the ones unable to connect to addresses outside of OSD.

Right. What's mainly confusing me right now is that failures occur on a single physical host and also when accessing download servers. As if the worker instance has no access to the network.

Actions #89

Updated by okurz 8 months ago

  • Copied to action #135773: [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M added
Actions #90

Updated by okurz 8 months ago

  • Description updated (diff)

acarvajal wrote in #note-86:

okurz wrote in #note-80:

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

How does the fact that Multi-Machine jobs have been running only on worker40 for the past 2 days disprove that Multi-Machine jobs don't work across workers?

IMHO, only seeing passing MM jobs across multiple workers would disprove H7.

Ok, sorry, I wasn't clear. I have now created #135773 as a clone of this ticket for the specific problem already observed and stated by pcervinka regarding multi-machine jobs, which seems to be part of the problem domain. There is also the long-standing #111908. With that I updated the description and the hypotheses to keep H7 open but added H7.1 "Multi-machine jobs generally work fine when executed on a single physical machine" and rejected only that one. I also updated the description for the field TBD that wasn't filled in by livdywan.

livdywan wrote in #note-83:

Presumably we can re-enable all x86-64 workers previously used for multi-machine cases. Do note that nothing changed here. I'm proposing this based on the findings in #note-80 in general and #note-82 by example:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/613

I am not against this, but I don't see how it will improve things when, per #note-80, we are seeing a massive amount of failures while using only one worker. Yes, enabling the other workers may help with the job queue, and we may be lucky (see my comment regarding worker37 in #note-66) and some MM jobs could be picked up by workers where the support server setup works, but then again they could also be picked up by worker40 or other workers behaving like worker40, causing more failures.

IMHO, we should focus on making sure that MM jobs fully work on a single worker first, and then dig into any issues that may be GRE-related.

Yes, I agree. But apparently a single machine does not make a difference. And keeping only a single machine for all production multi-machine tests conflicts with the long job schedule queue. For investigation and trying to fix things one can still select worker classes freely to choose on which machines something runs.
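
As an illustration, an investigation run can be pinned to a specific machine by overriding the worker class when cloning a job; the job ID and the class value below are placeholders, and for multi-machine scenarios the whole parallel cluster needs to be cloned together:

openqa-clone-job --within-instance https://openqa.suse.de 12345678 WORKER_CLASS=worker40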

Actions #91

Updated by livdywan 8 months ago

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

Note that as of #134912#note-4 that machine hasn't been running.

I tried to power it on again but it doesn't seem responsive. chassis status says System Power: off and power cycle says Set Chassis Power Control to Cycle failed: Command not supported in present state even after repeated attempts.

Edit: power on followed by another power cycle seems to have worked. I can get in via SSH. The webUI hasn't "seen" it yet.

Actions #92

Updated by mkittler 8 months ago

About H3: We're currently using Open vSwitch 3.1.1 (or 3.1.0, the package version is a bit unclear to me). Maybe the update to 3.1.0 introduced a regression? It was released in February, which would be too old, but maybe it only landed in Leap a few months later. Maybe this could be easily cross-checked by downgrading to a previous version of Open vSwitch on all workers and then re-triggering problematic tests like https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_BETA_mpich_mpi_cplusplus_master&version=15-SP5. (That scenario is currently passing but the history doesn't look very good. Presumably one would have to run a few successful tests before drawing the conclusion that downgrading helped.)
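
A rough sketch of how such a downgrade experiment could look on one worker; the version string is a placeholder that would have to be looked up first, and the service names are assumed to be the usual ones on our workers:

zypper se -s openvswitch                                             # list versions available in the configured repos
sudo zypper install --oldpackage 'openvswitch3=<previous-version>'   # downgrade to a specific (hypothetical) version
sudo systemctl restart openvswitch.service os-autoinst-openvswitch.service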

Actions #93

Updated by livdywan 8 months ago

mkittler wrote in #note-92:

About H3: We're currently using Open vSwitch 3.1.1 (or 3.1.0, the package version is a bit unclear to me). Maybe the update to 3.1.0 introduced a regression? It was released in February, which would be too old, but maybe it only landed in Leap a few months later. Maybe this could be easily cross-checked by downgrading to a previous version of Open vSwitch on all workers and then re-triggering problematic tests like https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_BETA_mpich_mpi_cplusplus_master&version=15-SP5. (That scenario is currently passing but the history doesn't look very good. Presumably one would have to run a few successful tests before drawing the conclusion that downgrading helped.)

Where are you getting that version? What I see on worker10 for example is this:

zypper pa -i | grep openvswitch
i  | Update repository with updates from SUSE Linux Enterprise 15 | openvswitch                                          | 2.14.2-150400.24.3.1                        | x86_64
v  | openSUSE-Leap-15.4-Oss                                       | openvswitch                                          | 2.14.2-150400.22.23                         | x86_64
i+ | devel_openQA                                                 | os-autoinst-openvswitch                              | 4.6.1694444383.e6a5294-lp154.1635.1         | x86_64
v  | Update repository of openSUSE Backports                      | os-autoinst-openvswitch                              | 4.6.1639403953.ae94c4bd-bp154.2.3.1         | x86_64
v  | openSUSE-Leap-15.4-Oss                                       | os-autoinst-openvswitch                              | 4.6.1639403953.ae94c4bd-bp154.1.137         | x86_64
Actions #94

Updated by acarvajal 8 months ago

livdywan wrote in #note-91:

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

Note that as of #134912#note-4 that machine hasn't been running.

I tried to power it on again but it doesn't seem responsive. chassis status says System Power: off and power cycle says Set Chassis Power Control to Cycle failed: Command not supported in present state even after repeated attempts.

Edit: power on followed by another power cycle seems to have worked. I can get in via SSH. The webUI hasn't "seen" it yet.

Yes. Just saw that. I'm cancelling those jobs and starting new ones in worker8.

Edit: https://openqa.suse.de/tests/overview?build=poo_134282&groupid=300&distri=sle&version=15-SP1

Actions #95

Updated by livdywan 8 months ago

acarvajal wrote in #note-94:

livdywan wrote in #note-91:

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

Note that as of #134912#note-4 that machine hasn't been running.

I tried to power it on again but it doesn't seem responsive. chassis status says System Power: off and power cycle says Set Chassis Power Control to Cycle failed: Command not supported in present state even after repeated attempts.

Edit: power on followed by another power cycle seems to have worked. I can get in via SSH. The webUI hasn't "seen" it yet.

Yes. Just saw that. I'm cancelling those jobs and starting new ones in worker8.

Edit: https://openqa.suse.de/tests/overview?build=poo_134282&groupid=300&distri=sle&version=15-SP1

Okay! Meanwhile I realized I got confused by the naming of the running services... if you still want to re-run those jobs on worker9:

grep numofworkers /etc/openqa/workers.ini
# numofworkers: 16
sudo systemctl enable --now openqa-worker-auto-restart@{1..16}.service
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service → /usr/lib/systemd/system/openqa-worker-auto-restart@.service
[...]
Actions #96

Updated by okurz 8 months ago

  • Copied to action #135818: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers added
Actions #97

Updated by acarvajal 8 months ago

acarvajal wrote in #note-94:

Yes. Just saw that. I'm cancelling those jobs and starting new ones in worker8.

Edit: https://openqa.suse.de/tests/overview?build=poo_134282&groupid=300&distri=sle&version=15-SP1

FYI:

  1. The support server from these jobs passed the support_server/setup step. See: https://openqa.suse.de/tests/12138679#step/barrier_init/1
  2. The iscsi_client module also passed. See: https://openqa.suse.de/tests/12138678#step/watchdog/1 & https://openqa.suse.de/tests/12138676#step/watchdog/1

Since all these jobs ran on worker8, this is not a good test case to confirm or deny whether the same setup would work when running across multiple workers.

Edit: tests passed.

Actions #98

Updated by mkittler 8 months ago

  • Description updated (diff)

@livdywan

Where are you getting that version? What I see on worker10 for example is this:

From

martchus@worker40:~> zypper se -i -v vswitch
Loading repository data...
Reading installed packages...

S  | Name                    | Type    | Version                             | Arch   | Repository
---+-------------------------+---------+-------------------------------------+--------+-------------------------------------------------------------
i  | libopenvswitch-3_1-0    | package | 3.1.0-150500.3.3.1                  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: libopenvswitch-3_1-0
i  | openvswitch3            | package | 3.1.0-150500.3.3.1                  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: openvswitch3
i+ | os-autoinst-openvswitch | package | 4.6.1694444383.e6a5294-lp155.1635.1 | x86_64 | devel_openQA
    name: os-autoinst-openvswitch

but it looks like on some workers an older version (2.14) is used, e.g.

martchus@worker10:~> zypper se -i -v vswitch
Loading repository data...
Reading installed packages...

S  | Name                    | Type    | Version                             | Arch   | Repository
---+-------------------------+---------+-------------------------------------+--------+-------------------------------------------------------------
i  | libopenvswitch-2_14-0   | package | 2.14.2-150400.24.9.1                | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: libopenvswitch-2_14-0
i  | openvswitch             | package | 2.14.2-150400.24.9.1                | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: openvswitch
i+ | os-autoinst-openvswitch | package | 4.6.1694444383.e6a5294-lp154.1635.1 | x86_64 | devel_openQA

Note that worker10 is generally not the most relevant worker to check, though (as it doesn't have the tap worker class enabled anymore).

On the other hand, this actually tells us something: We saw this problem before the dct move when we still used the Nürnberg-located workers. Those workers seem to still use the old version (2.14). So it is probably not due to updating Open vSwitch. (I say probably because I haven't checked whether the 2.x package has received any updates in the relevant time frame. Possibly 2.x and 3.x both received a minor update introducing the same bug. This is unlikely, though.)
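
One cheap way to verify that on a given worker is to check when the package last changed, assuming the standard zypper history log location:

grep -E '\|(lib)?openvswitch' /var/log/zypp/history   # each install/upgrade entry carries a timestamp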

Actions #99

Updated by livdywan 8 months ago

Just had a call with Ralf, Anton, Alvaro and José to check where we're at:

  • Stop all openQA deployments for now DONE (https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules)
  • Let's have a daily standup
  • Can we have a rollback to the previous state? Probably not?
  • Is there anyone we can pull in who's more versed in debugging network setups?
  • Let's pull in Marius temporarily
  • There were similar symptoms in Walldorf. Can we check this as a reference?

I also sent an email to qa-team to ensure there's general visibility of what's being done.

Actions #100

Updated by livdywan 8 months ago

  • Description updated (diff)
Actions #101

Updated by okurz 8 months ago

livdywan wrote in #note-99:

Just had a call with Ralf, Anton, Alvaro and José to check where we're at:

Please make sure you have a corresponding "rollback action". By the way, I consider that a bad idea. We must not forget that the majority of tests within OSD still work fine, and we also need to apply changes for other tasks.

  • Let's have a daily standup

  • Can we have a rollback to the previous state? Probably not?

We don't know what the "previous state" was, but we do know that there were changes that are effectively impossible to revert, e.g. moving physical machines back to the NUE1 datacenter.

  • Is there anyone we can pull in who's more versed in debugging network setups?
  • Let's pull in Marius temporarily
  • There were similar symptoms in Walldorf. Can we check this as a reference?

I also sent an email to qa-team to ensure there's general visibility of what's being done.

Actions #102

Updated by acarvajal 8 months ago

livdywan wrote in #note-99:

  • There were similar symptoms in Walldorf. Can we check this as a reference?

Regarding this:

  1. openqa.wdf.sap.corp setup originally consisted of 2 servers:
    1.1. srv1 had the webUI, 9 x86_64 qemu workers and 8 pvm_hmc workers.
    1.2. srv2 had 9 x86_64 qemu workers.
    1.3. There was a GRE tunnel from srv1->srv2, and another from srv2->srv1

  2. We got new HW to replace the old servers (newsrv1 & newsrv2). Both were installed and configured as openQA workers.

  3. We enabled 15 qemu workers in each of newsrv1 and newsrv2

  4. We disabled qemu workers in srv1 and srv2

  5. After this, we noticed that MM jobs which ran across both newsrv1 and newsrv2 failed to connect to the support server.

  6. At the same time, MM jobs which ran wholly in either of the new servers would work.

  7. We suspected there was an issue with the GRE tunnels. After looking at the configuration we noticed that, due to a copy & paste error, the GRE tunnels on the new servers were established as newsrv1->srv1 and newsrv2->srv2 only.

  8. After fixing the GRE tunnels and restarting network services and openQA workers, the issue was still present; however, rebooting both servers fixed it.

My impression is that some of the related services (wicked, firewalld, nftables, openvswitch, os-autoinst-openvswitch, etc.) had to be started in a certain order, which would explain why the issue was gone after a clean reboot.

I don't expect osd workers to have misconfigured GRE tunnels.
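
For reference, a minimal sketch of how such a GRE tunnel between two tap-capable workers is usually set up and inspected with Open vSwitch; br1 matches the usual openQA worker bridge name, and the peer IP is a placeholder:

ovs-vsctl show                                        # list bridges and any existing gre* ports
sudo ovs-vsctl add-port br1 gre1 -- set interface gre1 type=gre options:remote_ip=<peer-worker-ip>
sudo ovs-vsctl get interface gre1 options             # verify remote_ip really points at the intended peer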

Actions #104

Updated by acarvajal 8 months ago

Following up on one of the action items raised during the QE-SAP/Tools Team sync from Wednesday, we ran 10 support server jobs in tap networks on each of the workers.

In order to have quick tests, these support server jobs ran with a reduced schedule (dropping the modules which would block waiting for other jobs, see https://github.com/alvarocarvajald/os-autoinst-distri-opensuse/commit/dd4e04fd1b95f73c6582e4c1c2268f4509ca2669) and they were run without parallel jobs.

Settings were taken from the passing Multi-Machine support server job from earlier in the day: https://openqa.suse.de/tests/12138679/file/vars.json

The following settings were removed from the JSON file: JOBTOKEN, NAME, NEEDLES_GIT_HASH, NICMAC, NICMODEL, NICVLAN, OPENQA_HOSTNAME, OPENQA_URL, PRODUCTDIR, START_AFTER_TEST, TAPDEV, TAPDOWNSCRIPT, TAPSCRIPT, VNC, WORKER_HOSTNAME, WORKER_ID, WORKER_INSTANCE

The following setting was updated:

WORKER_CLASS was changed after the jobs for worker29 had been scheduled, in order to schedule jobs on worker30, worker37, worker38, worker39 and worker40.

Jobs were posted with the command:

openqa-cli api --osd -X POST jobs $(cat vars.json | perl -MJSON -e 'my $j = ""; while (<>) { $j .= $_ } my $r = decode_json($j); foreach (keys %$r) { print "$_=$r->{$_} "}')
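
(Note on the command above: the unquoted command substitution relies on shell word splitting, so it only works as long as none of the remaining settings contain whitespace; settings with spaces would need to be passed as individually quoted key=value arguments.)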

These were the results:

worker29: (100% passing rate)
https://openqa.suse.de/tests/12141451
https://openqa.suse.de/tests/12141510
https://openqa.suse.de/tests/12141511
https://openqa.suse.de/tests/12141512
https://openqa.suse.de/tests/12141513
https://openqa.suse.de/tests/12141514
https://openqa.suse.de/tests/12141515
https://openqa.suse.de/tests/12141516
https://openqa.suse.de/tests/12141517
https://openqa.suse.de/tests/12141518

worker30: (100% passing rate)
https://openqa.suse.de/tests/12141526
https://openqa.suse.de/tests/12141527
https://openqa.suse.de/tests/12141528
https://openqa.suse.de/tests/12141529
https://openqa.suse.de/tests/12141530
https://openqa.suse.de/tests/12141531
https://openqa.suse.de/tests/12141532
https://openqa.suse.de/tests/12141533
https://openqa.suse.de/tests/12141534
https://openqa.suse.de/tests/12141632

worker37: (100% failure)
https://openqa.suse.de/tests/12141589
https://openqa.suse.de/tests/12141590
https://openqa.suse.de/tests/12141591
https://openqa.suse.de/tests/12141592
https://openqa.suse.de/tests/12141593
https://openqa.suse.de/tests/12141594
https://openqa.suse.de/tests/12141595
https://openqa.suse.de/tests/12141596
https://openqa.suse.de/tests/12141597
https://openqa.suse.de/tests/12141633

worker38: (100% passing rate)
https://openqa.suse.de/tests/12141598
https://openqa.suse.de/tests/12141599
https://openqa.suse.de/tests/12141600
https://openqa.suse.de/tests/12141601
https://openqa.suse.de/tests/12141602
https://openqa.suse.de/tests/12141603
https://openqa.suse.de/tests/12141604
https://openqa.suse.de/tests/12141605
https://openqa.suse.de/tests/12141606
https://openqa.suse.de/tests/12141634

worker39: (100% passing rate)
https://openqa.suse.de/tests/12141607
https://openqa.suse.de/tests/12141608
https://openqa.suse.de/tests/12141609
https://openqa.suse.de/tests/12141610
https://openqa.suse.de/tests/12141611
https://openqa.suse.de/tests/12141612
https://openqa.suse.de/tests/12141613
https://openqa.suse.de/tests/12141614
https://openqa.suse.de/tests/12141615
https://openqa.suse.de/tests/12141636

worker40: (100% passing rate)
https://openqa.suse.de/tests/12141616
https://openqa.suse.de/tests/12141617
https://openqa.suse.de/tests/12141618
https://openqa.suse.de/tests/12141619
https://openqa.suse.de/tests/12141620
https://openqa.suse.de/tests/12141621
https://openqa.suse.de/tests/12141622
https://openqa.suse.de/tests/12141623
https://openqa.suse.de/tests/12141624
https://openqa.suse.de/tests/12141631

While things look much improved, I think we still have an issue with worker37.

Actions #105

Updated by acarvajal 8 months ago

Some actual issues observed related to worker37 this past afternoon:

  1. support server running in worker37, fails in setup: https://openqa.suse.de/tests/12140139#step/setup/35
  2. Multi-Machine jobs running in worker37 & worker38, node 2 running in worker37 is unable to reach updates.suse.com: https://openqa.suse.de/tests/12140151#step/iscsi_client/57
  3. Multi-Machine jobs running in worker37, worker38 & worker40, node 2 running in worker37 is unable to reach updates.suse.com: https://openqa.suse.de/tests/12140149#step/iscsi_client/57
  4. Multi-Machine jobs running in worker37 & worker38, node 2 running in worker37 is unable to reach scc.suse.com: https://openqa.suse.de/tests/12140167#step/suseconnect_scc/20
  5. Multi-Machine jobs running in worker29, worker40 & worker37, client job running in worker37 is unable to reach download.docker.com: https://openqa.suse.de/tests/12140197#step/hawk_gui/6

I went there, and net.ipv4.conf.br1.forwarding was set to 0, so I added the following to /etc/sysctl.conf:

net.ipv4.ip_forward = 1
net.ipv4.conf.br1.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1

And then ran sysctl -p /etc/sysctl.conf as documented in https://progress.opensuse.org/issues/135524#note-15.

After that, I checked with:

worker37:/proc/sys/net/ipv4/conf # cat {br1,eth0}/forwarding
1
1

Hopefully worker37 is fixed too.
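
As a side note, a minimal sketch of how the same settings could be made persistent via a sysctl.d drop-in instead of editing /etc/sysctl.conf; the file name is just an example and the proper fix belongs into salt:

cat <<'EOF' | sudo tee /etc/sysctl.d/90-openqa-ip-forward.conf
net.ipv4.ip_forward = 1
net.ipv4.conf.br1.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1
EOF
sudo sysctl --system   # reload settings from all sysctl configuration files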

Actions #106

Updated by okurz 8 months ago

Ok, thank you for trying to fix it. Just be aware that this is inconsistent with what mkittler did in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987. But anyway, to do it properly I guess we need to follow the original plan I have had for years: reinstall machines more often, including those just freshly installed, to ensure our configuration management contains all the needed changes.

Actions #107

Updated by okurz 8 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

Unassigning due to absence

Actions #108

Updated by nicksinger 8 months ago

  • Assignee set to nicksinger
Actions #109

Updated by okurz 8 months ago

  • Description updated (diff)
  • Status changed from Workable to In Progress

Also met with pcervinka, mkittler, nicksinger. pcervinka will work on #135818. Only after that is done should we consider enabling more machines for multi-machine jobs again.

Actions #110

Updated by okurz 8 months ago

  • Copied to action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M added
Actions #111

Updated by nicksinger 8 months ago

okurz wrote in #note-109:

Also met with pcervinka, mkittler, nicksinger. pcervinka will work on #135818 . Only after that is done we should consider enabling more machines for multi-machine jobs again.

To add from the meeting: the situation got way better after forwarding was enabled in salt/firewalld on each bridge with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987. net.ipv4.ip_forward = 1 might still need to be covered in salt, but first we need to understand what the <forwarding/> directive in firewalld does. Oli and I discussed in the infra daily that this can be done by e.g. reading the firewalld documentation, or by simply setting it back to 0, running salt and seeing whether that changes it back to 1.
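
A sketch of that experiment on a single worker (whether a local highstate is the right way to re-apply the states here is an assumption):

sudo sysctl -w net.ipv4.conf.br1.forwarding=0   # undo the manual fix on one bridge
sudo salt-call state.highstate                  # re-apply the configured salt states locally
cat /proc/sys/net/ipv4/conf/br1/forwarding      # back to 1 would mean salt/firewalld re-enables forwarding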

Actions #112

Updated by acarvajal 8 months ago

Actions #113

Updated by openqa_review 8 months ago

  • Due date set to 2023-10-03

Setting due date based on mean cycle time of SUSE QE Tools

Actions #114

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #115

Updated by okurz 8 months ago

  • Copied to action #136007: Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S added
Actions #116

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #117

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #118

Updated by okurz 8 months ago

  • Copied to action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M added
Actions #119

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #120

Updated by livdywan 8 months ago

  • Related to action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network added
Actions #121

Updated by okurz 8 months ago

Discussed better alert definitions with nicksinger and livdywan. I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/992 to prevent the very jaggy multi-machine ratio graphs. And nicksinger will tweak the existing alert by lowering the alert threshold on failed mm-tests from 60 to 30 and introduce a second, longer-term alert with a threshold of 20 over 6h.

Actions #122

Updated by livdywan 8 months ago

Note that follow-up tickets have been filed, see the Out of Scope section in the description.

Specifically for this ticket open action items are:

  • The title still carries the auto_review regex no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone. No failing tests should match this ticket.
  • We re-enabled deployments.
  • Adjust multi-machine result alerts to have a better measure of whether the situation has improved.
Actions #123

Updated by nicksinger 8 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/993 adjusts the old alert to make it trigger on short spikes, introduces the newer long-time alert and also adjusts the panel itself.

Actions #124

Updated by nicksinger 8 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback
Actions #125

Updated by okurz 8 months ago

@livdywan we were about to miss that you switched on worker9; it wasn't mentioned in the rollback steps. I will power it off again for #134912

Actions #126

Updated by livdywan 8 months ago

  • Description updated (diff)

okurz wrote in #note-125:

@livdywan we were about to miss that you switched on worker9, wasn't mentioned in rollback steps. I will power it off again for #134912

Ah! Sorry, I thought I mentioned it in Jitsi but apparently didn't add it here!

Actions #127

Updated by pstivanin 7 months ago

  • Blocks deleted (action #134495: [security][maintenance] all multi machines tests are failing)
Actions #128

Updated by okurz 7 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

We closely monitored the situation over the past days and will continue to do so at least over the next days and into next week, in particular:

  1. job queue on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-2d&to=now&viewPanel=9
  2. scheduled jobs on https://openqa.suse.de/tests/
  3. Ratio of multi-machine tests by result https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-24h&to=now&viewPanel=24

All other known remaining tasks are tracked in separate tickets.

Actions #129

Updated by livdywan 7 months ago

  1. Ratio of multi-machine tests by result https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-24h&to=now&viewPanel=24

We're now slightly above 6%. Still within sensible limits? No queued jobs older than 5 days and no impossible jobs queued forever.

Actions #130

Updated by nicksinger 7 months ago

We're still in an acceptable range at around 5% failed jobs.

Actions #131

Updated by okurz 7 months ago

  • Due date deleted (2023-10-03)
  • Status changed from Feedback to Resolved

So we are good. There are follow-up tasks like a "lessons learned" task so look out for that :)

Actions #132

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #133

Updated by okurz 5 months ago

  • Parent task set to #111929