action #134282 (closed)

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry

Added by emiura 9 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-08-15
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observations

  • Multi-machine jobs can't download artifacts from OBS/pip

Theory

(Fill this section with our current understanding of how the world works based on observations as written in the next section)

Problem

  • H1 REJECT The product has changed
    • -> E1-1 Compare tests on multiple product versions -> O1-1-1 We observed the problem in multiple products with different states of maintenance updates, and the support server is an old SLE12SP3 with no change in maintenance updates for months. It is unlikely that the iscsi client changed recently, but that has to be verified
  • H2 Fails because of changes in test setup
    • H2.1 Our test hardware equipment behaves differently
    • H2.2 The network behaves differently
  • H3 Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
    • -> E3-1 TODO compare package versions installed on machines from "last good" with "first bad", e.g. from /var/log/zypp/history
    • -> E3-2 It is probably not the Open vSwitch version, see comment #134282#note-98
  • H4 Fails because of changes in test management configuration, e.g. openQA database settings
    • -> wait for E5-1
  • H5 Fails because of changes in the test software itself (the test plan in source code as well as needles)
    • -> E5-1 TODO Compare vars.json from "last good" with "first bad" and in particular look into changes to needles and job templates
  • H6 REJECT Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
    • -> O6-1 #134282#note-71 but there is no 100% fail ratio
    • -> E6-2 Increase timeout in the initial step of firewall configuration to check if we have non-reliable test results due to timeouts
    • -> TODO Investigate the timeout in the initial step of firewall configuration
    • -> TODO Add TIMEOUT_SCALE=3 on non-HanaSR cluster tests' support servers
  • H7 Multi-machine jobs don't work across workers anymore since 2023-08 -> also see #111908 and #135773
    • H7.1 REJECT Multi-machine jobs generally work fine when executed on a single physical machine -> E7.1-1 Run multi-machine jobs only on a single physical machine -> O7.1-1-1 See #134282-80
    • We could pin jobs to a worker but that will need to be implemented properly, see #135035
    • We otherwise need to understand the infra setup better

Suggestions

  • Test case improvements
    • support_server/setup
    • firewall services add zone=EXT service=service:target
    • MTU check for packet size - covered in #135200 (a minimal ping sketch follows this list)
  • MTU size configuration
    • By default the MTU is 1500; however, for the openQA TORs MTU 9216 is configured on each port, and the future network automation service will apply this setting by default throughout PRG2 as well. Lowering the MTU would then have to be requested via SD ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-130143
  • Come up with a better reproducer, e.g. run an openQA test scenario as a single-machine test with the support_server still on a tap worker -> see #134282-104
  • Verify stability on one or multiple workers e.g. #135773-9
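
A minimal version of such a check, as it could be run from a SUT against its support server (the payload size and the 10.0.2.1 address are only examples taken from the observations in the comments; #135200 tracks the real implementation):

# 1350 bytes of ICMP payload + 28 bytes of ICMP/IP headers = a 1378-byte packet, just
# below the 1380 bytes that were still observed to pass; -M do forbids fragmentation
ping -M do -c 3 -s 1350 10.0.2.1 || echo "MTU/GRE problem between SUT and support server"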

Rollback steps

Out of scope

  • Improving openQA upstream documentation -> #135914
  • ovs-server+client scenario and MTU related fixes -> #135773
  • lessons learned -> #136007
  • SAP NFS server related issues qesap-nfs.qa.suse.cz -> #135938
  • Problems to reach machines in external network in multi-machine tests -> #135056
  • Ensure IP forwarding is persistent for good -> #136013

Related issues: 13 (2 open, 11 closed)

  • Related to openQA Project - action #111908: Multimachine failures between multiple physical workers (New, 2022-06-03)
  • Related to openQA Tests - action #133787: [qe-core] not hardcode a single worker to run 'autofs_server/client' and 'ovs-server/client' tests (Closed, rfan1, 2023-08-04)
  • Related to openQA Project - action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)
  • Related to openQA Infrastructure - action #135056: MM Test fails in a connection to an address outside of the worker (Resolved, mkittler, 2023-09-01)
  • Related to openQA Infrastructure - action #134042: auto-update on OSD does not install updates due to "Problem: nothing provides 'libwebkit2gtk3 ..." but service does not fail and we do not get an alert size:M (Resolved, livdywan, 2023-08-09 – 2023-09-12)
  • Related to openQA Infrastructure - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)
  • Related to openQA Infrastructure - action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network (New, 2023-09-18)
  • Copied to openQA Tests - action #135200: [qe-core] Implement a ping check with custom MTU packet size (Rejected, dvenkatachala, 2023-08-15)
  • Copied to openQA Infrastructure - action #135773: [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M (Resolved, livdywan, 2023-08-15 – 2023-10-07)
  • Copied to openQA Tests - action #135818: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers (Resolved, pcervinka, 2023-08-15)
  • Copied to openQA Project - action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M (Resolved, mkittler)
  • Copied to openQA Infrastructure - action #136007: Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S (Resolved, tinita)
  • Copied to openQA Project - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)

Actions #1

Updated by pcervinka 9 months ago

  • Priority changed from Normal to Urgent

There is something wrong with the multi-machine network when tests are run across different workers. If a multi-machine job is forced to run on the same worker, it is fine.

There are fails in core group: https://openqa.suse.de/tests/11843205#next_previous
Kernel group: https://openqa.suse.de/tests/11846943#next_previous
HPC: https://openqa.suse.de/tests/11845897#next_previous

Actions #2

Updated by pcervinka 9 months ago

I tried to debug the issue in a paused test. Ping worked, but other communication between the SUT and the support server did not, for example ssh. Could the GRE tunnel be in some bad state so that bigger packets just don't pass? Unfortunately, I can't get more info (jobs have been scheduled for a couple of hours already).

Actions #3

Updated by pcervinka 9 months ago

I was able to confirm the above statement: there is definitely a packet size issue, big packets just don't pass to the support server.

susetest:~ # ping -s 1352 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1352(1380) bytes of data.
1360 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=24.5 ms
1360 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=23.8 ms

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 23.893/24.221/24.550/0.363 ms
susetest:~ # ping -s 1353 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1353(1381) bytes of data.

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

susetest:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:08:1a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe12:81a/64 scope link 
       valid_lft forever preferred_lft forever
susetest:~ # ssh root@10.0.2.1
^C
susetest:~ # ifconfig eth0 mtu 1350
susetest:~ # ssh root@10.0.2.1
The authenticity of host '10.0.2.1 (10.0.2.1)' can't be established.
ECDSA key fingerprint is SHA256:tQO13Ix/i0kNGPNMTEn9o7WXaEC7YNPkAufs7rJk5Iw.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.2.1' (ECDSA) to the list of known hosts.
Password: 
Last login: Thu Aug 17 02:31:13 2023 from ::1

It is visible that the default MTU size is 1458 and ssh doesn't work. If the MTU is set to something smaller, it works.
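
To narrow the threshold down without guessing sizes, something along these lines can be run on the SUT (a sketch; 10.0.2.1 is the support server in this scenario):

# -M do sets the DF bit so oversized packets fail instead of being fragmented;
# add 28 bytes (20 IP + 8 ICMP) to translate the payload size into the packet size
for size in 1472 1430 1400 1380 1360 1353 1352; do
    if ping -M do -c 1 -W 2 -s "$size" 10.0.2.1 >/dev/null 2>&1; then
        echo "largest passing payload: $size bytes"
        break
    fi
done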

Actions #4

Updated by pcervinka 9 months ago

It impacts all multi-machine jobs between different workers, across all SLE versions and different tests. It is not a test issue or a product issue.

Actions #5

Updated by osukup 9 months ago

  • Subject changed from iscsi failures on multimachine tests on HA/SAP. to [tools] network protocols failures on multimachine tests on HA/SAP.
Actions #6

Updated by livdywan 9 months ago

  • Target version set to Ready

Thank you for your thorough investigation! Discussing it in Slack now

Actions #7

Updated by dzedro 9 months ago

Interesting, well done @pcervinka! 👍
Should we just decrease the MTU in the support server setup?
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L116

Actions #8

Updated by pcervinka 9 months ago

I also ran tcpdump on the support server to see what comes in.

Ping size 1352:

susetest:~ # ping -s 1352 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1352(1380) bytes of data.
1360 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=24.5 ms
1360 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=23.8 ms

Dump:

02:32:17.133307 52:54:00:12:08:1a > 52:54:00:12:07:f7, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 42398, offset 0, flags [DF], proto ICMP (1), length 1380)
    10.0.2.15 > 10.0.2.1: ICMP echo request, id 2100, seq 1, length 1360
02:32:17.133426 52:54:00:12:07:f7 > 52:54:00:12:08:1a, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 1951, offset 0, flags [none], proto ICMP (1), length 1380)
    10.0.2.1 > 10.0.2.15: ICMP echo reply, id 2100, seq 1, length 1360
02:32:18.135134 52:54:00:12:08:1a > 52:54:00:12:07:f7, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 42561, offset 0, flags [DF], proto ICMP (1), length 1380)
    10.0.2.15 > 10.0.2.1: ICMP echo request, id 2100, seq 2, length 1360
02:32:18.135220 52:54:00:12:07:f7 > 52:54:00:12:08:1a, ethertype IPv4 (0x0800), length 1394: (tos 0x0, ttl 64, id 2141, offset 0, flags [none], proto ICMP (1), length 1380)

We can see that the maximum Ethernet frame size on top of the ping payload is 1394 bytes: 1352 bytes of ICMP payload + 8 bytes of ICMP header = 1360 bytes, + 20 bytes of IPv4 header = 1380 bytes (the IP length in the dump), + 14 bytes of Ethernet header = 1394 bytes.

Ping size 1353:

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 23.893/24.221/24.550/0.363 ms
susetest:~ # ping -s 1353 -c 2  10.0.2.1
PING 10.0.2.1 (10.0.2.1) 1353(1381) bytes of data.

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

There were no packets on the support server.

I would recommend taking a tcpdump on each worker to see what is leaving and what is arriving, and checking the logs. (I don't have access to the workers right now, my workstation with the keys died.)
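
For reference, a capture along these lines on both workers should show whether the big packets leave one host and never arrive on the other (the interface names eth0/br1 are assumptions, adjust to the actual uplink and OVS bridge):

# on the sending worker: GRE-encapsulated frames (IP protocol 47) leaving the uplink
tcpdump -ni eth0 'ip proto 47 and greater 1400'
# on the receiving worker: large inner ICMP packets arriving on the bridge
tcpdump -ni br1 'icmp and greater 1300'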

Actions #9

Updated by pcervinka 9 months ago

dzedro wrote:

Should we just decrease the MTU in the support server setup?
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L116

No, we would have to set the MTU on each client as well. This would just hide the underlying problem.

Actions #10

Updated by livdywan 9 months ago

  • Related to action #111908: Multimachine failures between multiple physical workers added
Actions #11

Updated by livdywan 9 months ago

  • Tags set to infra
  • Project changed from openQA Tests to openQA Infrastructure
  • Subject changed from [tools] network protocols failures on multimachine tests on HA/SAP. to [tools] network protocols failures on multimachine tests on HA/SAP size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #12

Updated by rfan1 9 months ago

  • Related to action #133787: [qe-core] not hardcode a single worker to run autofs_server/client' and 'ovs-server/client' tests added
Actions #13

Updated by livdywan 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm filing an SD ticket now as discussed, and we'll see from there what we can do.

Actions #14

Updated by livdywan 9 months ago

  • Status changed from In Progress to Blocked
Actions #15

Updated by livdywan 9 months ago

based on feedback we got, neither Networking nor Eng-Infra is looking after what you mentioned.

Not sure yet what to make of that.

Actions #16

Updated by mgrifalconi 8 months ago

Hello, does it mean nothing is happening from either side? Any way to escalate this or shall we continue to force green all multimachine tests for the foreseeable future?

Actions #17

Updated by livdywan 8 months ago

  • Status changed from Blocked to Feedback

mgrifalconi wrote in #note-16:

Hello, does it mean nothing is happening from either side? Any way to escalate this or shall we continue to force green all multimachine tests for the foreseeable future?

I realize I didn't save my comment. We're discussing it and will get back to you as soon as we know more.

Actions #18

Updated by livdywan 8 months ago

Notes from our debugging session:

  • salt-states etc/firewalld/zones/trusted.xml with a wrong bridge_iface results in a missing eth0 on worker3; it seems we don't currently have a check for the case that when tap is set, bridge_iface also needs to be set
  • this should be checked by e.g. a pipeline
  • "if workerclass contains "tap" then bridge_iface needs to be set"
  • using ethtool to check the interface - maybe there's a disconnected cable here as well? we should be okay since we have another uplink, though
  • testing with the suspected missing interface: firewall-cmd --zone=trusted --add-interface=eth0
  • https://openqa.suse.de/tests/11897800 let's see if this works with the fix
  • we don't have a good way to confirm if gre devices actually work? (see the sketch after these notes)
    • ip a s dev gre29 says the device doesn't exist; gre interfaces are handled by ovs and therefore the usual Linux tools don't work as expected
    • the remote_ip, i.e. of worker9, is correct
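
A sketch of quick checks on a tap worker host that would cover both points above (br1 as the bridge name is an assumption, adjust to the actual setup):

firewall-cmd --zone=trusted --list-interfaces   # should list the OVS bridge, e.g. br1
ovs-vsctl show | grep -A2 'type: gre'           # lists the GRE ports together with their remote_ip options
ovs-ofctl dump-ports br1                        # per-port rx/tx counters; they should increase while an MM job runs
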
Actions #19

Updated by mkittler 8 months ago

Looks like none of the jobs in the restarted cluster (https://openqa.suse.de/tests/11897800#dependencies) have been scheduled to run on worker3, which was the most likely culprit. That's not the worst because this way we can check whether the hypothesis of worker3 actually being the culprit is true: if the jobs now pass, then worker3 is likely the culprit; otherwise there's more to it.

EDIT: Now actually two jobs within the cluster have failed, both on worker8 (https://openqa.suse.de/tests/11897800,https://openqa.suse.de/tests/11897799). So it is definitely not just a problem of worker3.

Actions #20

Updated by pcervinka 8 months ago

It is really not related to a specific worker. Here is an example of an HPC job on aarch64; it passed fine when I cloned it with a defined worker: https://openqa.suse.de/tests/11902812
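
For reference, pinning the whole cluster for such an experiment can presumably be done by cloning with an overridden WORKER_CLASS, assuming the target host exposes its own hostname as a worker class (as the OSD salt setup usually does); the job id and worker class below are just examples, and --parental-inheritance is meant to pass the override on to the parallel parent (the support server) as well:

openqa-clone-job --within-instance https://openqa.suse.de \
    --parental-inheritance 11902812 WORKER_CLASS=openqaworker-arm3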

Actions #21

Updated by acarvajal 8 months ago

mkittler wrote in #note-19:

EDIT: Now actually two jobs within the cluster have failed, both on worker8 (https://openqa.suse.de/tests/11897800,https://openqa.suse.de/tests/11897799). So it is definitely not just a problem of worker3.

Yesterday I saw MM jobs failing with network-related issues on different workers. The ones I remember seeing were worker3, worker5, worker8 and worker9. Restarting the jobs while forcing all of them to run on the same physical worker seems to allow the jobs to complete, but this is of course a workaround and not something I would like to set for these jobs permanently.

We will continue to monitor the situation and paste here any upcoming failures we see during our review.

Actions #22

Updated by livdywan 8 months ago

  • Assignee changed from livdywan to mkittler
  • Priority changed from Urgent to High

I guess we can consider it High while we're verifying the actual impact of it.

Actions #23

Updated by mkittler 8 months ago

When https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 has been merged those workers won't be tap workers anymore anyways and Prague-located workers will be used instead. On those new workers I have verified that the MM setup works with jobs that were explicitly scheduled to run across multiple workers.

Note that it is totally possible that this problem is specific to the HA/SAP test scenario. I've checked some other MM tests on OSD like https://openqa.suse.de/tests/11932373 and this job and its parallel parent ran across different workers and passed. The whole Wicked Maintenance Updates looks in fact very good with no failures in the last two weeks (I haven't looked further into the past).

Actions #24

Updated by acarvajal 8 months ago

mkittler wrote in #note-23:

When https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 has been merged those workers won't be tap workers anymore anyways and Prague-located workers will be used instead. On those new workers I have verified that the MM setup works with jobs that were explicitly scheduled to run across multiple workers.

Thanks a lot for the investigation. I'll keep an eye to this MR.

Note that it is totally possible that this problem is specific to the HA/SAP test scenario. I've checked some other MM tests on OSD like https://openqa.suse.de/tests/11932373 and this job and its parallel parent ran across different workers and passed. The whole Wicked Maintenance Updates looks in fact very good with no failures in the last two weeks (I haven't looked further into the past).

By chance I saw tests running today in the 15-SP5 QR Job Group (https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=115.1&groupid=518) and something is definitely going on.

Keep in mind that the support server code and the iscsi_client module are the same in all scenarios.

Today in that job group we can see:

  1. 2-Node Cluster (alpha) failing on x86_64 (https://openqa.suse.de/tests/11935907) and ppc64le (https://openqa.suse.de/tests/11935771). It ran on malbec and QA-Power8-5-kvm in ppc64le, and in worker10, worker3 & worker8.
  2. 2-Node Cluster (beta) failing on aarch64 (https://openqa.suse.de/tests/11937572) and ppc64le (https://openqa.suse.de/tests/11937524), but passing on x86_64: https://openqa.suse.de/tests/11936615, https://openqa.suse.de/tests/11936619 and https://openqa.suse.de/tests/11936618#. All 3 jobs ran on worker5.
  3. CTDB Cluster (to test samba resources) failing on aarch64 (https://openqa.suse.de/tests/11937584) and x86_64 (https://openqa.suse.de/tests/11936677). The x86_64 jobs ran in worker10, worker8, worker3 & worker5.
  4. 3-Node Cluster with Diskless SBD (delta) passing in aarch64 (https://openqa.suse.de/tests/11935686) and x86_64 (https://openqa.suse.de/tests/11937503). The aarch64 jobs ran in openqaworker-arm3 and the x86_64 jobs ran in worker5.

Other passing jobs were Priority Fencing Cluster (all jobs ran in worker3), and QNetd Cluster (all jobs ran in worker9).

I think there is a clear pattern: in these MM jobs the iscsi client fails to connect to the support server when the jobs are picked up by different workers, and passes when all MM jobs are picked up by the same worker.

More importantly, this was working before, as we can see in the results of the 15-SP5 GMC here: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=102.1&groupid=143

Pick any x86_64 cluster (alpha, beta, gamma, delta, CTDB, etc.): not only did they pass, they also ran on more than one physical worker.

Actions #25

Updated by pstivanin 8 months ago

  • Blocks action #134495: [security][maintenance] all multi machines tests are failing added
Actions #26

Updated by okurz 8 months ago

  • Priority changed from High to Urgent

What I understood is that SLE maintenance update tests are affected by the issue on a daily basis, so setting back to "Urgent". Related discussion https://suse.slack.com/archives/C02CANHLANP/p1693379586958799

(Oliver Kurz) @Antonios Pappas @Alvaro Carvajal (CC @qa-tools ) as this topic was brought up in the weekly QE sync meeting regarding https://progress.opensuse.org/issues/134282 . 1. How can we make a workaround more permanent to only schedule multi-machine tests on individual machines? 2. With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 we are proposing to only run multi-machine x86_64 tests on PRG2 based workers which might also help. Should we go ahead with that first? 3. The ticket was reduced to "High" priority. Do I understand correctly that every day SLE maintenance tests keep failing related to that ticket? Because then we should increase prio to "Urgent" again
(Fabian Vogt) If the GRE tunnel is used, it's required that every SUT has its MTU adjusted. This is documented. http://open.qa/docs/#_gre_tunnels, the "NOTE"
(Oliver Kurz) oh, I was not aware. Where does this have to be set?
(Fabian Vogt) It's usually done by whatever sets up the mm networking inside the clients, let me find the place...
(Antonios Pappas) How has the scheduling changed that this started failing in August?
(Antonios Pappas) Was there a constraint that was relaxed?
(Antonios Pappas) If this was a true limitation why did everything start to fail mid august?
(Fabian Vogt) The code is in lib/mm_network.pm and tests/support_server/setup.pm
(Fabian Vogt) It's possible the infra was configured for jumbo packets before the migration but this no longer works?
(Alvaro Carvajal) Scheduling MM tests on individual machines would require setting the hostname in the WORKER_CLASS setting, right? Isn't this bad/introduces a single point of failure?
Yes, I think we should go ahead with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 and re-assess
Yes, in HA & SAP job groups, tests fail every day
(Antonios Pappas) Kernel incidents were also affected. pcervinka is on vacation so he is not on the call but his backup should know
(Alvaro Carvajal) @Oliver Kurz ran into some MM failures from a saptune MU. These jobs I cannot force label to softfail. I need them to run (and pass). Should I restart them on a fixed worker or should I wait for the merge of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581?
(Oliver Kurz) Well, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 could break more stuff. If you say you can closely monitor the effect today then we can just merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 now and trigger tests accordingly and see what it brings. That you cannot force the result is a different issue that we can consider as well. I recommend you open another thread or ticket as well
(Jozef Pupava) All nodes have MTU 1458, AFAIK they get that from DHCP/support_server https://openqa.suse.de/tests/11923577#step/hostname/27 node1 https://openqa.suse.de/tests/11923578#step/hostname/25 node2 https://openqa.suse.de/tests/11923576#step/setup/29 support_server
(Alvaro Carvajal) That you cannot force the result is a different issue that we can consider as well. I recommend you open another thread or ticket as well. Oh, I can technically do it. I just shouldn't. This is an important package to test
(Alvaro Carvajal) Well, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 could break more stuff. If you say you can closely monitor the effect today then we can just merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/581 now and trigger tests accordingly and see what it brings.
Better wait. We have Sprint Review today
(Oliver Kurz) As soon as someone from tools team can closely monitor the impact we would merge
(Marius Kittler) I could merge and monitor after our daily meeting.

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=26&from=now-90d&to=now looks like a regression of more multi-machine tests failing since about 2023-08-10

Actions #27

Updated by acarvajal 8 months ago

Seems the move to the Multi-Machine workers in PRG did not solve the issue.

The following is a collection of HA & SAP jobs which have finished either on August 30th or 31st. Most of the failures are still in the iscsi_client test module (cluster nodes connecting to iSCSI server in Support Server), but now there are also failures in the Support Server setup module.

iscsi_client failures:

support_server/setup failures:

This is only a sample. There are more.

Besides these, we are also seeing some jobs timing out when attempting connection outside of osd, for example: https://openqa.suse.de/tests/11950389#step/suseconnect_scc/20

Actions #28

Updated by acarvajal 8 months ago

I found some jobs that ran in worker38 (so they should've worked), but which failed in support_server/setup module:

https://openqa.suse.de/tests/11963289
https://openqa.suse.de/tests/11963291
https://openqa.suse.de/tests/11963290

This is new, and I guess it's related to the move to PRG.

Actions #29

Updated by acarvajal 8 months ago

I think the support_server/setup issue is present only on worker34, worker37, worker38. I've seen support servers passing this step in worker39, worker31 and worker30. Not sure if this helps.

Actions #30

Updated by mkittler 8 months ago

I would assume that the exact lists of workers where it is passing or failing might be misleading. Maybe it is happening rather randomly. All mentioned workers have been set up in the exact same way, so it would be strange if there were any differences between them in general.

I definitely also tested the basic wicked test scenario across the workers where the issue is present.


It would be helpful to understand what this test does that the basic wicked test doesn't.


In the failure mentioned in #134282#note-28 the support server job didn't unlock the mutex and thus the other jobs couldn't continue. So that's a problem that might not even have anything to do with the GRE tunnels but simply with the support server job not being able to reach the point where it would unlock the mutex. At which point is the support server supposed to unlock the mutex? I've been searching in the openSUSE test distribution for support_server_ready but couldn't find an occurrence that is applicable to this test scenario. I'm really wondering whether this failure is not just an issue with the test itself.
(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

Actions #31

Updated by okurz 8 months ago

Further from the referenced Slack conversation https://suse.slack.com/archives/C02CANHLANP/p1693379586958799

(Oliver Kurz) @Fabian Vogt @Jozef Pupava can we make failures more explicit if MTU is not set properly? Like, let tests crash if WORKER_CLASS=tap and not set MTU?
(Fabian Vogt) Not easily. This would have to be run on the SUT as test module, but after networking is set up. This means it's equally likely that if the MTU is wrong, the check isn't scheduled either
(Oliver Kurz) how about something that is called unconditionally in all tests? Maybe in main.pm? Would have to run at the right point in jobs, but is more likely to work
(Oliver Kurz) who can volunteer do implement this?

No response to this. I would say that without a check for a proper MTU there is no point in continuing, so IMHO we should focus on that, but I don't know where to add such a check.

I also added a response in https://sd.suse.com/servicedesk/customer/portal/1/SD-130143 with some questions to Eng-Infra experts.

Actions #32

Updated by ybonatakis 8 months ago

I believe the failures on HPC jobs are also because of it.
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=16.1&groupid=130
However, none of the ones I checked ran on any of the workers acarvajal points to.

Actions #33

Updated by acarvajal 8 months ago

mkittler wrote in #note-30:

In the failure mentioned in #134282#note-28 the support server job didn't unlock the mutex and thus the other jobs couldn't continue.

It didn't unlock the mutex, because the support server job failed before the unlock action.

A normal run looks like this: https://openqa.suse.de/tests/11163631

The mutex causing the other jobs to wait is created in the last line of the support_server/setup module: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

As the module failed before that, the other jobs were never unlocked.

So that's a problem that might not even have anything to do with the GRE tunnels but simply with the support server job not being able to reach the point where it would unlock the mutex.

Agree, but IMHO something with the MM setup in these workers (worker34, worker37, worker38) is causing a failure in the support_server/setup module before it can create the mutex.

At which point is the support server supposed to unlock the mutex? I've been searching in the openSUSE test distribution for support_server_ready but couldn't find an occurrence that is applicable to this test scenario. I'm really wondering whether this failure is not just an issue with the test itself.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

A week ago we saw a similar error in our development openQA instance.

See: https://openqaworker15.qa.suse.cz/tests/217494#step/setup/33

We had to check https://open.qa/docs/#_multi_machine_test_setup and after messing around with the firewalld configuration, installing libcap, and restarting the server, we managed to get it working: https://openqaworker15.qa.suse.cz/tests/217498

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739
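
For anyone hitting the same thing, these are roughly the kind of checks the linked documentation describes; a paraphrased sketch from memory (not a substitute for the docs, and br1 as bridge name is an assumption):

systemctl status openvswitch os-autoinst-openvswitch   # both services need to be running on the worker
ovs-vsctl show                                          # bridge and GRE ports as set up by os-autoinst-openvswitch
firewall-cmd --zone=trusted --list-interfaces           # the bridge has to be in the trusted zone
sysctl net.ipv4.ip_forward                              # must be 1 so SUTs can reach external networks
firewall-cmd --reload                                   # apply zone/service changes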

Actions #34

Updated by acarvajal 8 months ago

ybonatakis wrote in #note-32:

I believe the failures on HPC jobs are also because of it.
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=16.1&groupid=130
However none of the ones i checked run in any of the workers acarvajal points to.

Seems very similar to what we're seeing, with the exception that you don't have issues in support_server/setup.

Most of your tests that passed did so when the jobs ran on the same worker. I see some of these working on worker36 and worker35. I don't see any of your support servers running on worker34, worker37 or worker38, where the setup module is failing for us.

I did see one of your MM jobs passing when running on multiple workers:

https://openqa.suse.de/tests/11959550 (worker30)
https://openqa.suse.de/tests/11959754 (worker32)
https://openqa.suse.de/tests/11959755 (worker39)
https://openqa.suse.de/tests/11959756 (worker29)

I do wonder what the difference is between these and your other jobs which failed in cpuid and ours that fail in iscsi_client.

Actions #35

Updated by acarvajal 8 months ago

Here's another example of a support_server/setup failure in worker38: https://openqa.suse.de/tests/11972258#step/setup/33

This time it failed when attempting to configure the firewall. This is the exact same step where our tests were failing in our Staging openQA instance last Friday, before we had to fix the MM configuration there.

Actions #36

Updated by okurz 8 months ago

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

And in the failed job examples I see multiple points for improvement. E.g. in https://openqa.suse.de/tests/11962805#step/setup/45 I see that no post_fail_hook is executed at all. At least the system journal might be able to help here, possibly also YaST module logs. I suggest to create separate tickets about those.

And to the ticket assignee: please make sure to update the ticket description based on https://progress.opensuse.org/projects/openqav3/wiki/#Further-decision-steps-working-on-test-issues to keep track of the current state, open hypotheses, experiments to conduct, etc.

Actions #37

Updated by livdywan 8 months ago

  • Assignee changed from mkittler to livdywan
Actions #38

Updated by apappas 8 months ago

  • Related to action #135035: Optionally restrict multimachine jobs to a single worker added
Actions #39

Updated by apappas 8 months ago

okurz wrote in #note-36:

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

This will only reroll the dice with the hope that the retries will land on the same tap worker, while also increasing the workload on the limited tap pool.

Actions #40

Updated by okurz 8 months ago

Yes, exactly

Actions #41

Updated by mkittler 8 months ago

I didn't have the earlier hypothesis that it is MTU-related on my mind anymore. That would actually explain why not all scenarios are affected.

It didn't unlock the mutex, because the support server job failed before the unlock action.

Right, and I was wondering about exactly that. Normally one doesn't need a network connection to start a server so it is strange to blame network connectivity here. If the network connection was the problem then only reaching that server should fail, right?

The mutex causing the other jobs to wait is created in the last line of the support_server/setup module: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

Ah, thanks. I only considered mutex_unlock calls (and not mutex_create).

Agree, but IMHO something with the MM setup in these workers (worker34, worker37, worker38) is causing a failure in the support_server/setup module before it can create the mutex.

If you think that really only those workers are problematic, then create an MR to remove the tap worker class only from those, as a temporary workaround and also as a means of checking that hypothesis. I guess the others would accept such an MR.


By the way, I'm on squad rotation as of next week. Hence I handed the ticket over to @livdywan.

Actions #42

Updated by okurz 8 months ago

  • Related to action #135056: MM Test fails in a connection to an address outside of the worker added
Actions #43

Updated by livdywan 8 months ago

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

This will only reroll the dice with the hope that the retries will land on the same tap worker, while also increasing the workload on the limited tap pool.

This would have the benefit of still giving us logs at the cost of < 15 minutes for each bad job (I don't know the failure rate, though), which might make it more useful than pinning, which also comes with delays (see #135035#note-6, which also applies to manual pinning). And more importantly, if we can identify an expression that all failures have in common, we're probably a step closer to a fix.

Agree, but IMHO something with the MM setup in these workers (worker34, worker37, worker38) is causing a failure in the support_server/setup module before it can create the mutex.

At which point is the support server supposed to unlock the mutex? I've been searching in the openSUSE test distribution for support_server_ready but couldn't find an occurrence that is applicable to this test scenario. I'm really wondering whether this failure is not just an issue with the test itself.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

On the above note I wonder if we can make step 41 reveal the issue... it looks totally fine whether it fails or passes. https://openqa.suse.de/tests/11963289#step/setup/41

A week ago we saw a similar error in our development openQA instance.

See: https://openqaworker15.qa.suse.cz/tests/217494#step/setup/33

This seems to fail in yast2 firewall services add zone=EXT service=service:target instead of a needle. So that's a little better. I wonder if there's a more verbose mode we can use here to examine why this fails?

Actions #44

Updated by livdywan 8 months ago

  • Subject changed from [tools] network protocols failures on multimachine tests on HA/SAP size:S to [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab":retry

To mitigate the urgency I recommend to apply https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger

This will only reroll the dice with the hope that the retries will land on the same tap worker, while also increasing the workload on the limited tap pool.

This would have the benefit of still giving us logs at the cost of < 15 minutes for each bad job (I don't know the failure rate, though), which might make it more useful than pinning, which also comes with delays (see #135035#note-6, which also applies to manual pinning). And more importantly, if we can identify an expression that all failures have in common, we're probably a step closer to a fix.

For now the best we have is probably the needle mismatch no candidate needle with tag(s) 'iscsi-target-overview-service-tab' matched, so I'm starting with that.

Actions #45

Updated by livdywan 8 months ago

  • Priority changed from Urgent to High

If we want to go the hard-coding route we could for example use only worker40: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596/diffs

Opinions welcome. I'm mainly looking into mitigation here. This is not a fix by any stretch.

Actions #46

Updated by apappas 8 months ago

I do think that having a single point of failure again is not good, especially since there are too many jobs that will pass through it.

What I am currently doing right now is editing the jobgroups manually to pin one set of mm jobs to one worker but distribute the sets among the known good workers.

Actions #47

Updated by acarvajal 8 months ago

livdywan wrote in #note-43:

On the above note I wonder if we can make step 41 reveal the issue... it looks totally fine whether it fails or passes. https://openqa.suse.de/tests/11963289#step/setup/41

I don't think so. AFAIK that step is local to the support server and is not doing anything network-related.


On a more positive note, after some days I've finally seen some MM passing in multiple workers. No idea if this is a sign of workers stabilizing after the migration, or if Tools Team did something to these workers which fixed this issue. Jobs are:

I will continue monitoring and updating this ticket with what I find. Would not consider above jobs as proof that everything's back to normal until I see more.

Actions #48

Updated by acarvajal 8 months ago

Things are looking better. I'm seeing fewer jobs failing in iscsi_client and in support_server/setup.

I even found some jobs passing in the workers I had seen failing in support_server/setup (worker34, worker37 and worker38) last week:

Not sure what was done, but thank you very much.

Another instance of a MM job finishing successfully in the new workers. This one is a HAWK test, which does a docker pull which was failing on connection-related issues last Thursday; they ran successfully on Sunday on worker37, worker38, worker35 and worker29:

We are still seeing some sporadic issues where the sles4sap/hana_install test module takes over 4 hours to run, and ends up failing. If you see https://openqa.suse.de/tests/11994946, in that test the module ran in under 13 minutes.

Actions #49

Updated by livdywan 8 months ago

On a more positive note, after some days I've finally seen some MM passing in multiple workers. No idea if this is a sign of workers stabilizing after the migration, or if Tools Team did something to these workers which fixed this issue. Jobs are:

No. I suggested we remove mm workers but that hasn't been merged. More likely this is "What I am currently doing right now is editing the jobgroups manually to pin one set of mm jobs to one worker but distribute the sets among the known good workers", as @apappas mentioned above. I think we should try to coordinate better to avoid drawing the wrong conclusions. Maybe Slack is going to work better for that? Let's see.

Actions #50

Updated by livdywan 8 months ago

  • Copied to action #135200: [qe-core] Implement a ping check with custom MTU packet size added
Actions #51

Updated by livdywan 8 months ago

  • Description updated (diff)
Actions #52

Updated by acarvajal 8 months ago

livdywan wrote in #note-49:

No. I suggested we remove mm workers but that hasn't been merged. More likely this is What I am currently doing right now is editing the jobgroups manually to pin one set of mm jobs to one worker but distribute the sets among the known good workers. as @apappas mentioned above.

Nope. That's not it. As I said:

  1. Job https://openqa.suse.de/tests/11996380 ran in worker39
  2. Job https://openqa.suse.de/tests/11996383 ran in worker31
  3. Job https://openqa.suse.de/tests/11996382 ran in worker33

These jobs were not cloned. They were not manipulated. Their WORKER_CLASS setting is qemu_x86_64-large-mem,tap

I think we should try to coordinate better to avoid drawing the wrong conclusions. Maybe slack is going to work better for that? Let's see.

Ack.

Actions #53

Updated by livdywan 8 months ago

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm#L655

(Just for my own reference: good run: https://openqa.suse.de/tests/11934844#step/setup/41, bad run: https://openqa.suse.de/tests/11963289)

A week ago we saw a similar error in our development openQA instance.

See: https://openqaworker15.qa.suse.cz/tests/217494#step/setup/33

We had to check https://open.qa/docs/#_multi_machine_test_setup and after messing around with the firewalld configuration, installing libcap, and restarting the server, we managed to get it working: https://openqaworker15.qa.suse.cz/tests/217498

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739

Is there some way I can get access to this conversation. Apparently I can't open it. I was wondering if whatever you found would help come up with ideas to diagnose failures better.

Actions #54

Updated by acarvajal 8 months ago

livdywan wrote in #note-53:

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739

Is there some way I can get access to this conversation. Apparently I can't open it. I was wondering if whatever you found would help come up with ideas to diagnose failures better.

Sorry. Didn't realize at the time that this was in a closed channel.

I've added you and @okurz

Actions #55

Updated by livdywan 8 months ago

acarvajal wrote in #note-54:

livdywan wrote in #note-53:

You can read our thread on the debug https://suse.slack.com/archives/C0369JZFBKK/p1692886259753739

Is there some way I can get access to this conversation. Apparently I can't open it. I was wondering if whatever you found would help come up with ideas to diagnose failures better.

Sorry. Didn't realize at the time that this was in a closed channel.

I've added you and @okurz

Thanks! I have suspicion that #133469#note-14 is contributing to this. In particular missing packages. Because it looks to me like worker.sls is not really missing anything.

Actions #56

Updated by nicksinger 8 months ago

livdywan wrote in #note-55:

Thanks! I have suspicion that #133469#note-14 is contributing to this. In particular missing packages. Because it looks to me like worker.sls is not really missing anything.

I don't think this is related here. We had several successful highstates in the past few days (at least no such issue as described in #133469 the whole week).

Actions #57

Updated by livdywan 8 months ago

nicksinger wrote in #note-56:

livdywan wrote in #note-55:

Thanks! I have suspicion that #133469#note-14 is contributing to this. In particular missing packages. Because it looks to me like worker.sls is not really missing anything.

I don't think this is related here. We had several successful highstates in the past few days (at least no such issue as described in #133469 the whole week).

Maybe that particular issue with the openvswitch states is new... but in that case it must be #134042 without packages having been re-installed because Alvaro had to reinstall missing packages that are specified in salt.

Actions #58

Updated by livdywan 8 months ago

  • Related to action #134042: auto-update on OSD does not install updates due to "Problem: nothing provides 'libwebkit2gtk3 ..." but service does not fail and we do not get an alert size:M added
Actions #61

Updated by acarvajal 8 months ago

Seems worker29 & worker30 are also impacting tests in another way:

https://openqa.suse.de/tests/12030835#step/iscsi_client/47
https://openqa.suse.de/tests/12030870#step/iscsi_client/22

This is a failure in iscsi_client, but earlier than in the cases reported in https://progress.opensuse.org/issues/134282#note-27. This time it fails on a zypper in yast-iscsi-client call, timing out while attempting a connection to updates.suse.com.
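
A quick way to tell this kind of failure apart from the iSCSI-specific ones would be a plain connectivity probe from the SUT before zypper runs (a sketch; the 30-second cap is arbitrary):

# prints the HTTP status if the SUT can reach the repo host at all, otherwise reports the failure
curl -sS -o /dev/null --max-time 30 -w 'updates.suse.com: HTTP %{http_code}\n' https://updates.suse.com \
    || echo "no route from this SUT to updates.suse.com"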

Actions #63

Updated by livdywan 8 months ago

Probing for what might've changed, i.e. things mentioned in https://open.qa/docs/#_multi_machine_test_setup, I'm not spotting any obvious changes. I'm also not aware of relevant changes. Unfortunately, again, I don't get what happened.

Actions #65

Updated by acarvajal 8 months ago

We're collecting results from the weekend, but things look broken from our end.

It seems like at least 9/10 hours ago, support server jobs running in worker37 (https://openqa.suse.de/tests/12071208 and https://openqa.suse.de/tests/12071225) were able to pass the support_server/setup test module, while those in worker30 (https://openqa.suse.de/tests/12071212) and worker29 (https://openqa.suse.de/tests/12071216) could not.

Regrettably, even in those cases where there were no issues with the support server, the parallel jobs ran into issues:

  1. The first one running from worker38 could not connect to scc.suse.com https://openqa.suse.de/tests/12071210#step/iscsi_client/57
  2. The second one running from worker29 could not connect to scc.suse.com https://openqa.suse.de/tests/12071228#step/qnetd/53

Due to the different nature of those failures (though I suspect the root cause could be the same), I'm reporting those 2 issues in #135056

Actions #66

Updated by acarvajal 8 months ago

srinidhir wrote in #note-64:

There are many more failures in support_server/setup and also in iscsi_client,

https://openqa.suse.de/tests/12076961#step/setup/35
https://openqa.suse.de/tests/12076964#step/setup/35
https://openqa.suse.de/tests/12076967#step/setup/35
https://openqa.suse.de/tests/12076986#step/setup/35
https://openqa.suse.de/tests/12076983#step/setup/35
https://openqa.suse.de/tests/12076978#step/setup/35
https://openqa.suse.de/tests/12076990#step/setup/35
https://openqa.suse.de/tests/12071154#step/iscsi_client/37
https://openqa.suse.de/tests/12071127#step/iscsi_client/37

The ones failing in iscsi_client are doing so while attempting connections to addresses outside of the worker (either scc.suse.com or updates.suse.com). This is different than previous iscsi_client failures which were during setup of the iscsi devices, i.e., while attempting to connect the cluster node to the iscsi server located in the support server.

Trying to identify some pattern, failures in support_server/setup were on worker30, worker39, worker38, worker29.

Both failures in iscsi_client were on worker29. Interestingly, the support servers related to these iscsi_client failures ran on worker37 in both cases (https://openqa.suse.de/tests/12071152 and https://openqa.suse.de/tests/12071126) and were able to clear the support_server/setup test module, so it appears worker37 is behaving better now.

I am restarting https://openqa.suse.de/tests/12071126 and forcing all jobs to run in worker37 to see if tests pass.

See:

 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_supportserver@64bit -> http://openqa.suse.de/tests/12079184
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_01@64bit -> http://openqa.suse.de/tests/12079183
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_02@64bit -> http://openqa.suse.de/tests/12079185
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:30601:shadow-qam_2nodes_client@64bit -> http://openqa.suse.de/tests/12079186
Actions #67

Updated by okurz 8 months ago

Given that there are still many problems I suggest to go ahead with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596 and only run x86_64 multi-machine tests from a single physical machine at least until we have better ideas and test improvements.

Actions #68

Updated by livdywan 8 months ago

okurz wrote in #note-67:

Given that there are still many problems I suggest to go ahead with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/596 and only run x86_64 multi-machine tests from a single physical machine at least until we have better ideas and test improvements.

Merged

Actions #70

Updated by livdywan 8 months ago

  • Subject changed from [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab":retry to [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry

There are more failures in the support_server/setup,

https://openqa.suse.de/tests/12081543#step/setup/35

Thanks. I'm adding this to the regex as well. Interesting to see this on osd in production now.

I'm once again wondering if yast2 could help us debug the issue? If it's consistently failing in the firewall config?

Actions #71

Updated by acarvajal 8 months ago

Hello,

Besides the failures reported by @srinidhir, I also saw failures like this: https://openqa.suse.de/tests/12080260#step/setup/45

I saw a total of 231 failures (I can share the full list if necessary), all of them in support_server/setup test module, either on step 35 or on step 45.

Jobs ran in worker40 (expected due to https://progress.opensuse.org/issues/134282#note-68) but also in worker30. Seems like there were actually 2 workers handling Multi-Machine last night.

As commented, the issue is present in the support server. The parallel jobs start but get blocked while the support server is starting; however, the support server fails in support_server/setup and kills the whole MM job.

There are 2 types of errors:

  1. The first one is like https://openqa.suse.de/tests/12080239#step/setup/35 and it's a script_run timeout. This happens in HA jobs
  2. Second one is like https://openqa.suse.de/tests/12080260#step/setup/45 and it's a needle match failure. It happens in HanaSR jobs.

Even though they fail in different steps, I believe the root cause is the same. Looking at the command which causes the failure in the HA jobs (yast2 firewall services add zone=EXT service=service:target) when it runs in the HanaSR job, we can see the following:

HanaSR job:

[2023-09-12T02:15:57.536570+02:00] [debug] [pid:13581] <<< testapi::script_run(cmd="yast2 firewall services add zone=EXT service=service:target", die_on_timeout=1, timeout=200, output="", quiet=undef)
[2023-09-12T02:15:57.536672+02:00] [debug] [pid:13581] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T02:15:57.536779+02:00] [debug] [pid:13581] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T02:15:59.600493+02:00] [debug] [pid:13581] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T02:15:59.600776+02:00] [debug] [pid:13581] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T02:16:00.716267+02:00] [debug] [pid:13581] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T02:16:00.716494+02:00] [debug] [pid:13581] <<< testapi::wait_serial(timeout=200, no_regex=0, record_output=undef, regexp=qr/nOU2g-\d+-/u, expect_not_found=0, buffer_size=undef, quiet=undef)
[2023-09-12T02:22:07.016683+02:00] [debug] [pid:13581] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): ok

While in the HA job is:

[2023-09-12T01:53:34.668942+02:00] [debug] [pid:17155] <<< testapi::script_run(cmd="yast2 firewall services add zone=EXT service=service:target", output="", die_on_timeout=1, timeout=200, quiet=undef)
[2023-09-12T01:53:34.669046+02:00] [debug] [pid:17155] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T01:53:34.669152+02:00] [debug] [pid:17155] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T01:53:36.730800+02:00] [debug] [pid:17155] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T01:53:36.731133+02:00] [debug] [pid:17155] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-12T01:53:37.846062+02:00] [debug] [pid:17155] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-12T01:53:37.846380+02:00] [debug] [pid:17155] <<< testapi::wait_serial(quiet=undef, buffer_size=undef, regexp=qr/nOU2g-\d+-/u, no_regex=0, expect_not_found=0, record_output=undef, timeout=200)
[2023-09-12T01:56:58.961522+02:00] [debug] [pid:17155] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): fail

As you can see, in the HanaSR support server the command eventually worked after 6+ minutes, while in the HA job it failed at the 3 minutes 21 seconds mark. HanaSR jobs run with TIMEOUT_SCALE=3, which explains why they wait longer: the 200-second wait_serial timeout visible in the logs is presumably scaled to 600 seconds, so roughly 6 minutes still passes, while 3 minutes 21 seconds already exceeds the unscaled 200 seconds.

Even though the command is successful in HanaSR tests after 6 minutes, I still think this shows that there is a problem, as that yast2 firewall services add zone=EXT service=service:target command should finish faster. Looking at one of the successful tests from last week (see https://progress.opensuse.org/issues/134282#note-48 & https://openqa.suse.de/tests/11987887):

[2023-09-03T04:37:19.641796+02:00] [debug] [pid:38709] <<< testapi::script_run(cmd="yast2 firewall services add zone=EXT service=service:target", output="", timeout=200, quiet=undef, die_on_timeout=1)
[2023-09-03T04:37:19.641902+02:00] [debug] [pid:38709] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-03T04:37:19.642010+02:00] [debug] [pid:38709] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-03T04:37:21.700529+02:00] [debug] [pid:38709] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-03T04:37:21.700795+02:00] [debug] [pid:38709] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-03T04:37:22.816323+02:00] [debug] [pid:38709] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-03T04:37:22.816573+02:00] [debug] [pid:38709] <<< testapi::wait_serial(timeout=200, buffer_size=undef, expect_not_found=0, record_output=undef, no_regex=0, regexp=qr/nOU2g-\d+-/u, quiet=undef)
[2023-09-03T04:37:25.876126+02:00] [debug] [pid:38709] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): ok

As you can see, in this working test the command finished in under 6 seconds.
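
For comparing such timings quickly, a minimal sketch (assuming a locally downloaded autoinst-log.txt and the nOU2g marker string from the logs above; both are just examples and need adjusting per job):

# print the timestamps of the matching wait_serial call and of its result;
# the difference is the time the yast2 firewall command actually took
awk -F'[][]' '/wait_serial.*nOU2g/ {print $2}' autoinst-log.txt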

Hope this helps with the debugging.

Actions #72

Updated by okurz 8 months ago

acarvajal wrote in #note-71:

[…]
Jobs ran in worker40 (expected due to https://progress.opensuse.org/issues/134282#note-68) but also in worker30. Seems like there were actually 2 workers handling Multi-Machine last night.

@livdywan please make sure that only one machine is used here.

Actions #73

Updated by livdywan 8 months ago

okurz wrote in #note-72:

acarvajal wrote in #note-71:

[…]
Jobs ran in worker40 (expected due to https://progress.opensuse.org/issues/134282#note-68) but also in worker30. Seems like there were actually 2 workers handling Multi-Machine last night.

@livdywan please make sure that only one machine is used here.

Apparently I couldn't see what was missing yesterday. Looks like I missed it when resolving conflicts, so here's a follow-up: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/609

Actions #75

Updated by livdywan 8 months ago

As you can see, in this working test the command finished in under 6 seconds.

Hope this helps with the debugging.

@acarvajal To avoid any misconceptions please note that I'm mainly looking after mitigations here. Somebody else will need to debug this further and narrow down the actual issue or come up with test improvements such as #135200.

livdywan wrote in #note-14:

https://sd.suse.com/servicedesk/customer/portal/1/SD-130143

Just FYI the latest feedback from infra on the SD ticket:

By default MTU runs at MTU 1500, however for openQA TORs we have MTU 9216 configured for each port and the future network automation service will apply this setting as well by default throughout PRG2, lowering the MTU will then be request via SD-Ticket.
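
For reference, a quick way to check the effective path MTU from a worker towards a given host (host name and packet sizes are just examples; 1472 bytes of ICMP payload correspond to a 1500-byte MTU once the 28 bytes of IP/ICMP headers are added):

ping -c 3 -M do -s 1472 openqa.suse.de   # passes if the path supports MTU 1500 without fragmentation
ping -c 3 -M do -s 8972 openqa.suse.de   # only passes if jumbo frames (MTU >= 9000) work end to end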

Actions #76

Updated by livdywan 8 months ago

  • Description updated (diff)
Actions #77

Updated by livdywan 8 months ago

I went through the comments again just to be sure we've covered the raised points in the current summary. We should be good.

Actions #78

Updated by okurz 8 months ago

Judging from https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=26&from=1692855397995&to=1694689544873 I see no improvement in the results of multi-machine tests, so there is no clear indication regarding E7-1. However, these were not specially scheduled test runs, so other factors could also play a role here.

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

Actions #79

Updated by okurz 8 months ago

  • Related to action #135578: Long job age and jobs not executed for long size:M added
Actions #80

Updated by okurz 8 months ago

  • Priority changed from High to Urgent

I realized that there is a strong relation to #135578. Due to many multi-machine tests failing we have 1. longer runtimes due to timeouts and execution of post-fail-hooks, 2. multiple retries for recurring failures for jobs with the setting RETRY=N, 3. unreviewed investigation jobs for failing multi-machine tests. All three issues lead to a longer job schedule queue as observed in #135578, hence bumping prio again.

EDIT: I conducted quick SQL queries to ensure that no x86_64 multi-machine tests have been executed on any other machine than worker40

openqa=> select jobs.id from jobs join workers on jobs.assigned_worker_id = workers.id join job_settings on jobs.id = job_settings.job_id where t_finished >= '2023-09-13' and host != 'worker40' and key = 'WORKER_CLASS' and value = 'tap' and jobs.arch = 'x86_64' limit 3;
    id    
----------
 12087785
 12087858
 12087788
(3 rows)

openqa=> select jobs.id from jobs join workers on jobs.assigned_worker_id = workers.id join job_settings on jobs.id = job_settings.job_id where t_finished >= '2023-09-14' and host != 'worker40' and key = 'WORKER_CLASS' and value = 'tap' and jobs.arch = 'x86_64' limit 3;                                            
 id 
----
(0 rows)

The first query is a crosscheck showing that we still had tests on worker30 yesterday; the second shows that today, i.e. within the first 15 hours of the day, no multi-machine tests ran on any worker other than worker40. To find all multi-machine test results from worker40:

openqa=> select result,count(jobs.id) from jobs join workers on jobs.assigned_worker_id = workers.id join job_settings on jobs.id = job_settings.job_id where t_finished >= '2023-09-14' and host = 'worker40' and key = 'WORKER_CLASS' and value = 'tap' and jobs.arch = 'x86_64' group by result order by count DESC;
       result       | count 
--------------------+-------
 parallel_failed    |  1173
 failed             |   735
 incomplete         |   107
 parallel_restarted |    27
 passed             |    15
 timeout_exceeded   |    14
 skipped            |     2
 softfailed         |     1
(8 rows)

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

Actions #81

Updated by livdywan 8 months ago

I couldn't find an explicit mention of it, and apparently that was not clear in retrospect: My expectation is that somebody from @acarvajal's side makes time for experiments and we provide help from the Tools side.

Actions #82

Updated by livdywan 8 months ago

  • Description updated (diff)

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

Ack.

One example I took a look at is https://openqa.suse.de/tests/12096958/logfile?filename=autoinst-log.txt and it seems to run into the network delay we've seen before:

[2023-09-13T13:55:37.165578+02:00] [debug] [pid:12551] <<< testapi::type_string(string="yast2 firewall services add zone=EXT service=service:target", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-13T13:55:39.220794+02:00] [debug] [pid:12551] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-13T13:55:39.221004+02:00] [debug] [pid:12551] <<< testapi::type_string(string="; echo nOU2g-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-09-13T13:55:40.333170+02:00] [debug] [pid:12551] tests/support_server/setup.pm:628 called setup::setup_iscsi_server -> tests/support_server/setup.pm:354 called testapi::script_run
[2023-09-13T13:55:40.333380+02:00] [debug] [pid:12551] <<< testapi::wait_serial(quiet=undef, regexp=qr/nOU2g-\d+-/u, record_output=undef, no_regex=0, timeout=200, expect_not_found=0, buffer_size=undef)
[2023-09-13T14:01:46.551431+02:00] [debug] [pid:12551] >>> testapi::wait_serial: (?^u:nOU2g-\d+-): ok
Actions #83

Updated by livdywan 8 months ago

Presumably we can re-enable all x86-64 workers previously used for multi-machine cases. Do note that nothing changed here. I'm proposing this based on the findings in #note-80 in general and #note-82 by example:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/613

Actions #84

Updated by apappas 8 months ago

The statement:

Multimachine jobs do not work across multiple workers
is not disproved when multimachine jobs also fail on a single worker.

The failures may or may not be related to cluster misconfiguration.

After a cursory glance at worker40, most of the failures happen in zypper. https://openqa.suse.de/tests/12108426#step/fips_setup/31 (fails at zypper in -t pattern fips)

I fail to understand how that invalidates the hypothesis. On the contrary, we have yet again a test that cannot communicate with outside networks.

Actions #85

Updated by livdywan 8 months ago

After a cursory glance at worker40, most of the failures happen in zypper. https://openqa.suse.de/tests/12108426#step/fips_setup/31 (fails at zypper in -t pattern fips)

I fail to understand how that invalidates the hypothesis. On the contrary, we have yet again a test that cannot communicate with outside networks.

2023-09-13 16:27:45 <5> server(2215) [zypp-core] Exception.cc(log):186 Error message: Could not resolve host: updates.suse.com

That's not one of the cases we've been looking at before, though? This looks like it fails because a host outside of the openQA production infra is not reachable. 🤔
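
For what it's worth, a few standard checks that could be run from the affected worker (or from within the SUT) to narrow this down; the host name is simply taken from the error above:

dig +short updates.suse.com                     # does name resolution work at all?
ping -c 3 updates.suse.com                      # basic reachability
curl -sI https://updates.suse.com | head -n 1   # HTTPS reachability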

Actions #86

Updated by acarvajal 8 months ago

okurz wrote in #note-80:

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

How does the fact that Multi-Machine jobs have been running only on worker40 for the past 2 days disprove that Multi-Machine jobs don't work across workers?

IMHO, only seeing passing MM jobs across multiple workers would disprove H7.

okurz wrote in #note-78:

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

livdywan wrote in #note-83:

Presumably we can re-enable all x86-64 workers previously used for multi-machine cases. Do note that nothing changed here. I'm proposing this based on the findings in #note-80 in general and #note-82 by example:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/613

I am not against this, but I don't see how it will improve things when, per #note-80, we are seeing a massive amount of failures while using only one worker. Yes, enabling the other workers may help with the job queue, and we may be lucky (see my comment regarding worker37 in #note-66) and some MM jobs could be picked up by workers where the support server setup works, but then again they could also be picked up by worker40 or other workers behaving like worker40, causing more failures.

IMHO, we should focus on making sure that MM jobs fully work on a single worker first, and then dig into any issues that may be GRE-related.

Actions #87

Updated by acarvajal 8 months ago

livdywan wrote in #note-85:

2023-09-13 16:27:45 <5> server(2215) [zypp-core] Exception.cc(log):186 Error message: Could not resolve host: updates.suse.com

That's not one of the cases we've been looking at before, though? This looks like it fails because a host outside of the openQA production infra is not reachable. 🤔

Yes, this is like the scenario described in #135056.

I do believe both issues are related, i.e. the same workers where the support server is failing to finish its setup are the ones unable to connect to addresses outside of OSD.

Actions #88

Updated by livdywan 8 months ago

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

How does the fact that Multi-Machine jobs have been running only on worker40 for the past 2 days disprove that Multi-Machine jobs don't work across workers?

IMHO, only seeing passing MM jobs across multiple workers would disprove H7.

The same jobs fail regardless of whether they run on multiple or a single physical machine. To me that suggests the physical machine part is a red herring.

That's not one of the cases we've been looking at before, though? This looks like it fails because a host outside of the openQA production infra is not reachable. 🤔

Yes, this is like the scenario described in #135056.

I do believe both issues are related, i.e. the same workers where the support server is failing to finish its setup are the ones unable to connect to addresses outside of OSD.

Right. What's mainly confusing me right now is that failures occur on a single physical host and also when accessing download servers. As if the worker instance has no access to the network.

Actions #89

Updated by okurz 8 months ago

  • Copied to action #135773: [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M added
Actions #90

Updated by okurz 8 months ago

  • Description updated (diff)

acarvajal wrote in #note-86:

okurz wrote in #note-80:

This clearly disproves H7, so please REJECT H7 as the cause of the problem at hand: while there might be a problem with cross-worker tests, restricting jobs to a single worker also does not really help to make more tests pass. That should be enough for you to re-enable the other machines.

How does the fact that Multi-Machine jobs have been running only on worker40 for the past 2 days disprove that Multi-Machine jobs don't work across workers?

IMHO, only seeing passing MM jobs across multiple workers would disprove H7.

Ok, sorry, I wasn't clear. I have now created #135773 as a clone of this ticket for the specific problem already observed and stated by pcervinka regarding multi-machine jobs, which seems to be part of the problem domain. There is also the long-standing #111908. With that I updated the description and the hypotheses to keep H7 open but added H7.1 "Multi-machine jobs generally work fine when executed on a single physical machine" and rejected only that one. I also updated the description for the field TBD that wasn't filled in by livdywan.

livdywan wrote in #note-83:

Presumably we can re-enable all x86-64 workers previously used for multi-machine cases. Do note that nothing changed here. I'm proposing this based on the findings in #note-80 in general and #note-82 by example:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/613

I am not against this, but I don't see how it will improve things when, per #note-80, we are seeing a massive amount of failures while using only one worker. Yes, enabling the other workers may help with the job queue, and we may be lucky (see my comment regarding worker37 in #note-66) and some MM jobs could be picked up by workers where the support server setup works, but then again they could also be picked up by worker40 or other workers behaving like worker40, causing more failures.

IMHO, we should focus on making sure that MM jobs fully work on a single worker first, and then dig into any issues that may be GRE-related.

Yes, I agree. But apparently a single machine does not make a difference. And keeping only a single machine for all production multi-machine tests conflicts with the long job schedule queue. For investigation and trying to fix things one can still select worker classes freely to choose on which machines something runs.
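
As an illustration, an investigation run can be pinned to a specific machine by overriding the worker class when cloning a job; the job ID and the class value below are placeholders, and for multi-machine scenarios the whole parallel cluster needs to be cloned together:

openqa-clone-job --within-instance https://openqa.suse.de 12345678 WORKER_CLASS=worker40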

Actions #91

Updated by livdywan 8 months ago

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

Note that as of #134912#note-4 that machine hasn't been running.

I tried to power it on again but it doesn't seem responsive. chassis status says System Power: off and power cycle says Set Chassis Power Control to Cycle failed: Command not supported in present state even after repeated attempts.

Edit: power on followed by another power cycle seems to have worked. I can get in via SSH. The webUI hasn't "seen" it yet.

Actions #92

Updated by mkittler 8 months ago

About H3: We're currently using Open vSwitch 3.1.1 (or 3.1.0, the package version is a bit unclear to me). Maybe the update to 3.1.0 introduced a regression? It was released in February, which would be too old, but maybe it only landed in Leap a few months later. Maybe this could be easily cross-checked by downgrading to a previous version of Open vSwitch on all workers and then re-triggering problematic tests like https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_BETA_mpich_mpi_cplusplus_master&version=15-SP5. (That scenario is currently passing but the history doesn't look very good. Presumably one would have to run a few successful tests before drawing the conclusion that downgrading helped.)
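
A rough sketch of how such a downgrade experiment could look on one worker; the version string is a placeholder that would have to be looked up first, and the service names are assumed to be the usual ones on our workers:

zypper se -s openvswitch                                             # list versions available in the configured repos
sudo zypper install --oldpackage 'openvswitch3=<previous-version>'   # downgrade to a specific (hypothetical) version
sudo systemctl restart openvswitch.service os-autoinst-openvswitch.service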

Actions #93

Updated by livdywan 8 months ago

mkittler wrote in #note-92:

About H3: We're currently using Open vSwitch 3.1.1 (or 3.1.0, the package version is a bit unclear to me). Maybe the update to 3.1.0 introduced a regression? It was released in February, which would be too old, but maybe it only landed in Leap a few months later. Maybe this could be easily cross-checked by downgrading to a previous version of Open vSwitch on all workers and then re-triggering problematic tests like https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HPC-Incidents&machine=64bit&test=hpc_BETA_mpich_mpi_cplusplus_master&version=15-SP5. (That scenario is currently passing but the history doesn't look very good. Presumably one would have to run a few successful tests before drawing the conclusion that downgrading helped.)

Where are you getting that version? What I see on worker10 for example is this:

zypper pa -i | grep openvswitch
i  | Update repository with updates from SUSE Linux Enterprise 15 | openvswitch                                          | 2.14.2-150400.24.3.1                        | x86_64
v  | openSUSE-Leap-15.4-Oss                                       | openvswitch                                          | 2.14.2-150400.22.23                         | x86_64
i+ | devel_openQA                                                 | os-autoinst-openvswitch                              | 4.6.1694444383.e6a5294-lp154.1635.1         | x86_64
v  | Update repository of openSUSE Backports                      | os-autoinst-openvswitch                              | 4.6.1639403953.ae94c4bd-bp154.2.3.1         | x86_64
v  | openSUSE-Leap-15.4-Oss                                       | os-autoinst-openvswitch                              | 4.6.1639403953.ae94c4bd-bp154.1.137         | x86_64
Actions #94

Updated by acarvajal 8 months ago

livdywan wrote in #note-91:

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

Note that as of #134912#note-4 that machine hasn't been running.

I tried to power it on again but it doesn't seem responsive. chassis status says System Power: off and power cycle says Set Chassis Power Control to Cycle failed: Command not supported in present state even after repeated attempts.

Edit: power on followed by another power cycle seems to have worked. I can get in via SSH. The webUI hasn't "seen" it yet.

Yes. Just saw that. I'm cancelling those jobs and starting new ones in worker8.

Edit: https://openqa.suse.de/tests/overview?build=poo_134282&groupid=300&distri=sle&version=15-SP1

Actions #95

Updated by livdywan 8 months ago

acarvajal wrote in #note-94:

livdywan wrote in #note-91:

One more idea for an experiment: Run multi-machine tests specifically triggered on an older NUE1 based worker to see if that one is affected the same.

I scheduled some in worker9: https://openqa.suse.de/tests/overview?groupid=300&distri=sle&build=poo_134282&version=15-SP1

Note that as of #134912#note-4 that machine hasn't been running.

I tried to power it on again but it doesn't seem responsive. chassis status says System Power: off and power cycle says Set Chassis Power Control to Cycle failed: Command not supported in present state even after repeated attempts.

Edit: power on followed by another power cycle seems to have worked. I can get in via SSH. The webUI hasn't "seen" it yet.

Yes. Just saw that. I'm cancelling those jobs and starting new ones in worker8.

Edit: https://openqa.suse.de/tests/overview?build=poo_134282&groupid=300&distri=sle&version=15-SP1

Okay! Meanwhile I realized I got confused by the naming of the running services... if you still want to re-run those jobs on worker9:

grep numofworkers /etc/openqa/workers.ini
# numofworkers: 16
sudo systemctl enable --now openqa-worker-auto-restart@{1..16}.service
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service → /usr/lib/systemd/system/openqa-worker-auto-restart@.service
[...]
Actions #96

Updated by okurz 8 months ago

  • Copied to action #135818: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers added
Actions #97

Updated by acarvajal 8 months ago

acarvajal wrote in #note-94:

Yes. Just saw that. I'm cancelling those jobs and starting new ones in worker8.

Edit: https://openqa.suse.de/tests/overview?build=poo_134282&groupid=300&distri=sle&version=15-SP1

FYI:

  1. The support server from these jobs passed the support_server/setup step. See: https://openqa.suse.de/tests/12138679#step/barrier_init/1
  2. The iscsi_client module also passed. See: https://openqa.suse.de/tests/12138678#step/watchdog/1 & https://openqa.suse.de/tests/12138676#step/watchdog/1

Since all these jobs ran on worker8, this is not a good test case to confirm or deny whether the same setup would work when running across multiple workers.

Edit: tests passed.

Actions #98

Updated by mkittler 8 months ago

  • Description updated (diff)

@livdywan

Where are you getting that version? What I see on worker10 for example is this:

From

martchus@worker40:~> zypper se -i -v vswitch
Loading repository data...
Reading installed packages...

S  | Name                    | Type    | Version                             | Arch   | Repository
---+-------------------------+---------+-------------------------------------+--------+-------------------------------------------------------------
i  | libopenvswitch-3_1-0    | package | 3.1.0-150500.3.3.1                  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: libopenvswitch-3_1-0
i  | openvswitch3            | package | 3.1.0-150500.3.3.1                  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: openvswitch3
i+ | os-autoinst-openvswitch | package | 4.6.1694444383.e6a5294-lp155.1635.1 | x86_64 | devel_openQA
    name: os-autoinst-openvswitch

but it looks like on some workers an older version (2.14) is used, e.g.

martchus@worker10:~> zypper se -i -v vswitch
Loading repository data...
Reading installed packages...

S  | Name                    | Type    | Version                             | Arch   | Repository
---+-------------------------+---------+-------------------------------------+--------+-------------------------------------------------------------
i  | libopenvswitch-2_14-0   | package | 2.14.2-150400.24.9.1                | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: libopenvswitch-2_14-0
i  | openvswitch             | package | 2.14.2-150400.24.9.1                | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
    name: openvswitch
i+ | os-autoinst-openvswitch | package | 4.6.1694444383.e6a5294-lp154.1635.1 | x86_64 | devel_openQA

Note that worker10 is generally not the most relevant worker to check, though (as it doesn't have the tap worker class enabled anymore).

On the other hand, this actually tells us something: We saw this problem before the dct move when we still used the Nürnberg-located workers. Those workers seem to still use the old version (2.14). So it is probably not due to updating Open vSwitch. (I say probably because I haven't checked whether the 2.x package has received any updates in the relevant time frame. Possibly 2.x and 3.x both received a minor update introducing the same bug. This is unlikely, though.)
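
One cheap way to verify that on a given worker is to check when the package last changed, assuming the standard zypper history log location:

grep -E '\|(lib)?openvswitch' /var/log/zypp/history   # each install/upgrade entry carries a timestamp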

Actions #99

Updated by livdywan 8 months ago

Just had a call with Ralf, Anton, Alvaro and José to check where we're at:

  • Stop all openQA deployments for now DONE (https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules)
  • Let's have a daily standup
  • Can we have a rollback to the previous state? Probably not?
  • Is there anyone we can pull in who's more versed in debugging network setups?
  • Let's pull in Marius temporarily
  • There were similar symptoms in Walldorf. Can we check this as a reference?

I also sent an email to qa-team to ensure there's general visibility of what's being done.

Actions #100

Updated by livdywan 8 months ago

  • Description updated (diff)
Actions #101

Updated by okurz 8 months ago

livdywan wrote in #note-99:

Just had a call with Ralf, Anton, Alvaro and José to check where we're at:

Please make sure you have a corresponding "rollback action". By the way, I consider that a bad idea. We must not forget that the majority of tests within OSD still work fine, and we also need to apply changes for other tasks.

  • Let's have a daily standup

  • Can we have a rollback to the previous state? Probably not?

We don't know what the "previous state" was, but we do know that there were changes that are effectively impossible to revert, e.g. moving physical machines back to the NUE1 datacenter.

  • Is there anyone we can pull in who's more versed in debugging network setups?
  • Let's pull in Marius temporarily
  • There were similar symptoms in Walldorf. Can we check this as a reference?

I also sent an email to qa-team to ensure there's general visibility of what's being done.

Actions #102

Updated by acarvajal 8 months ago

livdywan wrote in #note-99:

  • There were similar symptoms in Walldorf. Can we check this as a reference?

Regarding this:

  1. openqa.wdf.sap.corp setup originally consisted of 2 servers:
    1.1. srv1 had the webUI, 9 x86_64 qemu workers and 8 pvm_hmc workers.
    1.2. srv2 had 9 x86_64 qemu workers.
    1.3. There was a GRE tunnel from srv1->srv2, and another from srv2->srv1

  2. We got new HW to replace the old servers (newsrv1 & newsrv2). Both were installed and configured as openQA workers.

  3. We enabled 15 qemu workers in each of newsrv1 and newsrv2

  4. We disabled qemu workers in srv1 and srv2

  5. After this, we noticed that MM jobs which ran across both newsrv1 and newsrv2 failed to connect to the support server.

  6. At the same time, MM jobs which ran wholly in either of the new servers would work.

  7. We suspected there was an issue with the GRE tunnels. After looking at the configuration we noticed that, due to a copy & paste error, the GRE tunnels on the new servers were established as newsrv1->srv1 and newsrv2->srv2 only.

  8. After fixing the GRE tunnels and restarting network services and openQA workers, the issue was still present; however, rebooting both servers fixed it.

My impression is that some of the related services (wicked, firewalld, nftables, openvswitch, os-autoinst-openvswitch, etc.) had to be started in a certain order, which would explain why the issue was gone after a clean reboot.

I don't expect osd workers to have misconfigured GRE tunnels.
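
For reference, a minimal sketch of how such a GRE tunnel between two tap-capable workers is usually set up and inspected with Open vSwitch; br1 matches the usual openQA worker bridge name, and the peer IP is a placeholder:

ovs-vsctl show                                        # list bridges and any existing gre* ports
sudo ovs-vsctl add-port br1 gre1 -- set interface gre1 type=gre options:remote_ip=<peer-worker-ip>
sudo ovs-vsctl get interface gre1 options             # verify remote_ip really points at the intended peer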

Actions #104

Updated by acarvajal 8 months ago

Following up on one of the action items raised during the QE-SAP/Tools Team sync from Wednesday, we ran 10 support server jobs in tap networks on each of the workers.

In order to have quick tests, these support server jobs ran with a reduced schedule (dropping the modules which would block waiting for other jobs, see https://github.com/alvarocarvajald/os-autoinst-distri-opensuse/commit/dd4e04fd1b95f73c6582e4c1c2268f4509ca2669) and they were run without parallel jobs.

Settings were taken from the passing Multi-Machine support server job from earlier in the day: https://openqa.suse.de/tests/12138679/file/vars.json

The following settings were removed from the JSON file: JOBTOKEN, NAME, NEEDLES_GIT_HASH, NICMAC, NICMODEL, NICVLAN, OPENQA_HOSTNAME, OPENQA_URL, PRODUCTDIR, START_AFTER_TEST, TAPDEV, TAPDOWNSCRIPT, TAPSCRIPT, VNC, WORKER_HOSTNAME, WORKER_ID, WORKER_INSTANCE

The following setting was updated:

WORKER_CLASS was changed after the jobs for worker29 had been scheduled, in order to schedule jobs on worker30, worker37, worker38, worker39 and worker40.

Jobs were posted with the command:

openqa-cli api --osd -X POST jobs $(cat vars.json | perl -MJSON -e 'my $j = ""; while (<>) { $j .= $_ } my $r = decode_json($j); foreach (keys %$r) { print "$_=$r->{$_} "}')
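
(Note on the command above: the unquoted command substitution relies on shell word splitting, so it only works as long as none of the remaining settings contain whitespace; settings with spaces would need to be passed as individually quoted key=value arguments.)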

These were the results:

worker29: (100% passing rate)
https://openqa.suse.de/tests/12141451
https://openqa.suse.de/tests/12141510
https://openqa.suse.de/tests/12141511
https://openqa.suse.de/tests/12141512
https://openqa.suse.de/tests/12141513
https://openqa.suse.de/tests/12141514
https://openqa.suse.de/tests/12141515
https://openqa.suse.de/tests/12141516
https://openqa.suse.de/tests/12141517
https://openqa.suse.de/tests/12141518

worker30: (100% passing rate)
https://openqa.suse.de/tests/12141526
https://openqa.suse.de/tests/12141527
https://openqa.suse.de/tests/12141528
https://openqa.suse.de/tests/12141529
https://openqa.suse.de/tests/12141530
https://openqa.suse.de/tests/12141531
https://openqa.suse.de/tests/12141532
https://openqa.suse.de/tests/12141533
https://openqa.suse.de/tests/12141534
https://openqa.suse.de/tests/12141632

worker37: (100% failure)
https://openqa.suse.de/tests/12141589
https://openqa.suse.de/tests/12141590
https://openqa.suse.de/tests/12141591
https://openqa.suse.de/tests/12141592
https://openqa.suse.de/tests/12141593
https://openqa.suse.de/tests/12141594
https://openqa.suse.de/tests/12141595
https://openqa.suse.de/tests/12141596
https://openqa.suse.de/tests/12141597
https://openqa.suse.de/tests/12141633

worker38: (100% passing rate)
https://openqa.suse.de/tests/12141598
https://openqa.suse.de/tests/12141599
https://openqa.suse.de/tests/12141600
https://openqa.suse.de/tests/12141601
https://openqa.suse.de/tests/12141602
https://openqa.suse.de/tests/12141603
https://openqa.suse.de/tests/12141604
https://openqa.suse.de/tests/12141605
https://openqa.suse.de/tests/12141606
https://openqa.suse.de/tests/12141634

worker39: (100% passing rate)
https://openqa.suse.de/tests/12141607
https://openqa.suse.de/tests/12141608
https://openqa.suse.de/tests/12141609
https://openqa.suse.de/tests/12141610
https://openqa.suse.de/tests/12141611
https://openqa.suse.de/tests/12141612
https://openqa.suse.de/tests/12141613
https://openqa.suse.de/tests/12141614
https://openqa.suse.de/tests/12141615
https://openqa.suse.de/tests/12141636

worker40: (100% passing rate)
https://openqa.suse.de/tests/12141616
https://openqa.suse.de/tests/12141617
https://openqa.suse.de/tests/12141618
https://openqa.suse.de/tests/12141619
https://openqa.suse.de/tests/12141620
https://openqa.suse.de/tests/12141621
https://openqa.suse.de/tests/12141622
https://openqa.suse.de/tests/12141623
https://openqa.suse.de/tests/12141624
https://openqa.suse.de/tests/12141631

While things look much improved, I think we still have an issue with worker37.

Actions #105

Updated by acarvajal 8 months ago

Some actual issues observed related to worker37 this past afternoon:

  1. support server running in worker37, fails in setup: https://openqa.suse.de/tests/12140139#step/setup/35
  2. Multi-Machine jobs running in worker37 & worker38, node 2 running in worker37 is unable to reach updates.suse.com: https://openqa.suse.de/tests/12140151#step/iscsi_client/57
  3. Multi-Machine jobs running in worker37, worker38 & worker40, node 2 running in worker37 is unable to reach updates.suse.com: https://openqa.suse.de/tests/12140149#step/iscsi_client/57
  4. Multi-Machine jobs running in worker37 & worker38, node 2 running in worker37 is unable to reach scc.suse.com: https://openqa.suse.de/tests/12140167#step/suseconnect_scc/20
  5. Multi-Machine jobs running in worker29, worker40 & worker37, client job running in worker37 is unable to reach download.docker.com: https://openqa.suse.de/tests/12140197#step/hawk_gui/6

I went there, and net.ipv4.conf.br1.forwarding was set to 0, so I added the following to /etc/sysctl.conf:

net.ipv4.ip_forward = 1
net.ipv4.conf.br1.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1

And then ran sysctl -p /etc/sysctl.conf as documented in https://progress.opensuse.org/issues/135524#note-15.

After that, I checked with:

worker37:/proc/sys/net/ipv4/conf # cat {br1,eth0}/forwarding
1
1

Hopefully worker37 is fixed too.
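
As a side note, a minimal sketch of how the same settings could be made persistent via a sysctl.d drop-in instead of editing /etc/sysctl.conf; the file name is just an example and the proper fix belongs into salt:

cat <<'EOF' | sudo tee /etc/sysctl.d/90-openqa-ip-forward.conf
net.ipv4.ip_forward = 1
net.ipv4.conf.br1.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1
EOF
sudo sysctl --system   # reload settings from all sysctl configuration files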

Actions #106

Updated by okurz 8 months ago

Ok, thank you for trying to fix it. Just be aware that this is inconsistent with what mkittler did in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987. But anyway, to do it properly I guess we need to follow the original plan I have had for years: reinstall machines more often, including those just freshly installed, to ensure our configuration management contains all the needed changes.

Actions #107

Updated by okurz 8 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

Unassigning due to absence

Actions #108

Updated by nicksinger 8 months ago

  • Assignee set to nicksinger
Actions #109

Updated by okurz 8 months ago

  • Description updated (diff)
  • Status changed from Workable to In Progress

Also met with pcervinka, mkittler, nicksinger. pcervinka will work on #135818. Only after that is done should we consider enabling more machines for multi-machine jobs again.

Actions #110

Updated by okurz 8 months ago

  • Copied to action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M added
Actions #111

Updated by nicksinger 8 months ago

okurz wrote in #note-109:

Also met with pcervinka, mkittler, nicksinger. pcervinka will work on #135818 . Only after that is done we should consider enabling more machines for multi-machine jobs again.

To add from the meeting: the situation got way better after forwarding was enabled in salt/firewalld on each bridge with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987. net.ipv4.ip_forward = 1 might still need to be covered in salt, but first we need to understand what the <forwarding/> directive in firewalld does. Oli and I discussed in the infra daily that this can be done by e.g. reading the firewalld documentation, or by simply setting it back to 0, running salt and seeing whether that changes it back to 1.
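
A sketch of that experiment on a single worker (whether a local highstate is the right way to re-apply the states here is an assumption):

sudo sysctl -w net.ipv4.conf.br1.forwarding=0   # undo the manual fix on one bridge
sudo salt-call state.highstate                  # re-apply the configured salt states locally
cat /proc/sys/net/ipv4/conf/br1/forwarding      # back to 1 would mean salt/firewalld re-enables forwarding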

Actions #112

Updated by acarvajal 8 months ago

Actions #113

Updated by openqa_review 8 months ago

  • Due date set to 2023-10-03

Setting due date based on mean cycle time of SUSE QE Tools

Actions #114

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #115

Updated by okurz 8 months ago

  • Copied to action #136007: Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S added
Actions #116

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #117

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #118

Updated by okurz 8 months ago

  • Copied to action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M added
Actions #119

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #120

Updated by livdywan 8 months ago

  • Related to action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network added
Actions #121

Updated by okurz 8 months ago

Discussed better alert definitions with nicksinger and livdywan. I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/992 to prevent the very jaggy multi-machine ratio graphs. And nicksinger will tweak the existing alert by lowering the alert threshold on failed mm-tests from 60 to 30 and introduce a second, longer-term alert with a threshold of 20 over 6h.

Actions #122

Updated by livdywan 8 months ago

Note that follow-up tickets have been filed, see the Out of Scope section in the description.

Specifically for this ticket open action items are:

  • The title still carries the auto_review regex no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone. No failing tests should match this ticket.
  • We re-enabled deployments.
  • Adjust multi-machine result alerts to have a better measure of whether the situation has improved.
Actions #123

Updated by nicksinger 8 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/993 adjusts the old alert to make it trigger on short spikes, introduces the newer long-time alert and also adjusts the panel itself.

Actions #124

Updated by nicksinger 8 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback
Actions #125

Updated by okurz 8 months ago

@livdywan we were about to miss that you switched on worker9; it wasn't mentioned in the rollback steps. I will power it off again for #134912

Actions #126

Updated by livdywan 8 months ago

  • Description updated (diff)

okurz wrote in #note-125:

@livdywan we were about to miss that you switched on worker9, wasn't mentioned in rollback steps. I will power it off again for #134912

Ah! Sorry, I thought I mentioned it in Jitsi but apparently didn't add it here!

Actions #127

Updated by pstivanin 7 months ago

  • Blocks deleted (action #134495: [security][maintenance] all multi machines tests are failing)
Actions #128

Updated by okurz 7 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

We closely monitored the situation over the past days and will continue to do so at least over the next days and into next week, in particular:

  1. job queue on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-2d&to=now&viewPanel=9
  2. scheduled jobs on https://openqa.suse.de/tests/
  3. Ratio of multi-machine tests by result https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-24h&to=now&viewPanel=24

All other known remaining tasks are tracked in separate tickets.

Actions #129

Updated by livdywan 7 months ago

  1. Ratio of multi-machine tests by result https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-24h&to=now&viewPanel=24

We're now slightly above 6%. Still within sensible limits? No queued jobs older than 5 days and no impossible jobs queued forever.

Actions #130

Updated by nicksinger 7 months ago

We're still in an acceptable range at around 5% failed jobs.

Actions #131

Updated by okurz 7 months ago

  • Due date deleted (2023-10-03)
  • Status changed from Feedback to Resolved

So we are good. There are follow-up tasks like a "lessons learned" task so look out for that :)

Actions #132

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #133

Updated by okurz 5 months ago

  • Parent task set to #111929