action #151612
[kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com
Status: closed
Added by acarvajal about 1 year ago. Updated about 1 year ago.
Description
Observation
openQA test in scenario sle-15-SP4-Server-DVD-HA-Updates-x86_64-qam_ha_rolling_update_node01@64bit fails in
suseconnect_scc
The SUT attempts SCC registration using SUSEConnect -r $regcode, but this fails with a timeout when connecting to https://scc.suse.com.
The issue was found on multi-machine (MM) jobs, but as of this writing I'm not sure it is limited to MM jobs.
The issue has been seen on several workers: worker29, worker30, worker38, worker39.
Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
Reproducible
Fails since (at least) Build 20231127-1
Steps to reproduce manually
Log in on e.g. worker38.oqa.prg2.suse.org and start a VM using a TAP device from a worker slot:
sudo systemctl stop openqa-reload-worker-auto-restart@40.path
sudo systemctl stop openqa-worker-auto-restart@40.service
wget https://download.suse.de/install/SLE-Micro-5.5-GM/SLE-Micro.x86_64-5.5.0-Default-qcow-GM.qcow2
qemu-system-x86_64 -m 2048 -enable-kvm -vnc :42 -snapshot -netdev tap,id=qanet0,ifname=tap39,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:0b:ff SLE-Micro.x86_64-5.5.0-Default-qcow-GM.qcow2
Configure the network within the VM, e.g.:
ip a add 10.0.2.15/24 dev eth0
ip r add default via 10.0.2.2
echo 'nameserver 8.8.8.8' > /etc/resolv.conf
curl https://www.google.de # should work (you'll get tons of HTML)
Reproduce the issue:
curl https://scc.suse.com # does *not* work (timeout)
SUSEConnect --url https://scc.suse.com -r 1234 # does *not* work (TLS handshake timeout)
curl http://scc.suse.com # interestingly works (returns a 301 as expected)
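When done with manual testing, the worker slot can presumably be brought back by mirroring the stop commands above (not verified as part of this ticket):
sudo systemctl start openqa-worker-auto-restart@40.service
sudo systemctl start openqa-reload-worker-auto-restart@40.path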
Expected result
Last good: 20231126-1 (or more recent)
Further details
Always latest result in this scenario: latest
Files
Screenshot 2023-12-05 at 12.37.05.png (135 KB) - jlausuch, 2023-12-05 11:58
clipboard-202312051304-xrf5b.png (84.2 KB) - jlausuch, 2023-12-05 12:04
clipboard-202312051304-rev0g.png (71 KB) - jlausuch, 2023-12-05 12:04
Updated by acarvajal about 1 year ago · Edited
I was able to pause 2 jobs (https://openqa.suse.de/tests/12918352 & https://openqa.suse.de/tests/12918351) running in worker38 & worker39 while they were failing, and could find out the following by testing directly from one of the SUTs:
- SUT can resolve scc.suse.com
- scc.suse.com is not reachable via ICMP, but that's expected
- curl scc.suse.com works
- curl https://scc.suse.com times out
- curl -k https://openqa.suse.de works
- SUT can reach the support server
- SUT can reach the other node
- SUT can reach updates.suse.com via HTTP and HTTPS
- SUT can reach download.suse.de via HTTP and HTTPS
- SUT can not reach registry.suse.com via HTTPS
I also checked the following on worker39 where that SUT was running:
- curl https://scc.suse.com works
- sysctl -a shows that IPv4 forwarding is enabled for the eth0, br1 and tap interfaces
- Not sure if this is a problem, but nft list table firewalld on w39 does not show tap interfaces in nat_PREROUTING_ZONES:
chain nat_PREROUTING_ZONES {
iifname "eth0" goto nat_PRE_trusted
iifname "ovs-system" goto nat_PRE_trusted
iifname "br1" goto nat_PRE_trusted
iifname "docker0" goto nat_PRE_docker
goto nat_PRE_trusted
}
- Nor in nat_POSTROUTING_ZONES:
chain nat_POSTROUTING_ZONES {
oifname "eth0" goto nat_POST_trusted
oifname "ovs-system" goto nat_POST_trusted
oifname "br1" goto nat_POST_trusted
oifname "docker0" goto nat_POST_docker
goto nat_POST_trusted
}
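For anyone retracing this, a rough sketch of how the zone assignment of the tap devices could be cross-checked on the worker (these exact commands are an assumption, not what was run here):
sudo firewall-cmd --get-active-zones # which zone holds br1 and the tap devices?
sudo firewall-cmd --zone=trusted --list-interfaces # are the tap devices listed explicitly?
sudo nft list chain inet firewalld nat_POSTROUTING_ZONES # dump just the dispatch chain shown above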
I added some screenshots from the SUT's tests in https://suse.slack.com/archives/C02CANHLANP/p1701181092050239
Updated by okurz about 1 year ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Due date set to 2023-12-12
- Category set to Support
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Ready
Oh, nice investigation. That looks very specific about the host and port. So I suggest you open an SD ticket mentioning that likely the firewall is blocking traffic there.
Updated by szarate about 1 year ago
okurz wrote in #note-2:
Oh, nice investigation. That looks very specific about the host and port. So I suggest you open an SD ticket mentioning that likely the firewall is blocking traffic there.
Looks like it could have been a network hiccup: https://openqa.suse.de/tests/12925142 https://openqa.suse.de/tests/12925143
See: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18221
Updated by acarvajal about 1 year ago
szarate wrote in #note-3:
Looks like it could have been a network hiccup: https://openqa.suse.de/tests/12925142 https://openqa.suse.de/tests/12925143
We saw many failures overnight. Just in case it was a network hiccup, I'm restarting the jobs. Will comment here later.
Updated by acarvajal about 1 year ago
Issue seems to be still present:
- https://openqa.suse.de/tests/12925775#step/register_system/59
- https://openqa.suse.de/tests/12925771#step/register_system/59
The error is not exactly the same as in the jobs I used to create the poo# (those were doing registration via SUSEConnect, while these use yast2), but I figure the root cause is the same.
Updated by acarvajal about 1 year ago
More failures, these ones similar to the jobs reported yesterday:
- https://openqa.suse.de/tests/12925741#step/suseconnect_scc/20
- https://openqa.suse.de/tests/12925739#step/suseconnect_scc/20
- https://openqa.suse.de/tests/12925735#step/suseconnect_scc/20
- https://openqa.suse.de/tests/12925732#step/suseconnect_scc/20
- https://openqa.suse.de/tests/12925750#step/suseconnect_scc/20
Updated by acarvajal about 1 year ago
okurz wrote in #note-2:
Oh, nice investigation. That looks very specific about the host and port. So I suggest you open an SD ticket mentioning that likely the firewall is blocking traffic there.
Would an SD ticket actually help?
As reported, https://scc.suse.com is reachable from the worker itself, but not from the VM running on the worker. As I said in the Slack thread, I suspect there is something wrong with the NAT rules on the workers.
Updated by MMoese about 1 year ago
Updated by dzedro about 1 year ago · Edited
IMO it's a broken openvswitch.
I downgraded the openvswitch packages to the previous version (with a reboot) on worker18 and SUSEConnect/SSL works.
It was probably just a lucky run, maybe not related at all; the next runs failed.
openqaworker18:~ # zypper se -s -x libopenvswitch-2_14-0 openvswitch
Loading repository data...
Reading installed packages...
S | Name | Type | Version | Arch | Repository
---+-----------------------+------------+-----------------------+--------+-------------------------------------------------------------
v | libopenvswitch-2_14-0 | package | 2.14.2-150400.24.14.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
i+ | libopenvswitch-2_14-0 | package | 2.14.2-150400.24.9.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v | libopenvswitch-2_14-0 | package | 2.14.2-150400.24.6.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v | libopenvswitch-2_14-0 | package | 2.14.2-150400.24.3.1 | x86_64 | Main Repository
v | openvswitch | package | 2.14.2-150400.24.14.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
i+ | openvswitch | package | 2.14.2-150400.24.9.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v | openvswitch | package | 2.14.2-150400.24.6.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v | openvswitch | package | 2.14.2-150400.24.3.1 | x86_64 | Main Repository
| openvswitch | srcpackage | 2.14.2-150400.24.14.2 | noarch | Update repository with updates from SUSE Linux Enterprise 15
| openvswitch | srcpackage | 2.14.2-150400.24.9.1 | noarch | Update repository with updates from SUSE Linux Enterprise 15
| openvswitch | srcpackage | 2.14.2-150400.24.6.1 | noarch | Update repository with updates from SUSE Linux Enterprise 15
openqaworker18:~ #
https://openqa.suse.de/tests/12941822#step/smt_server_install/21 smt
https://openqa.suse.de/tests/12941843#step/nfs_server/12 nfs
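For the record, the downgrade itself would presumably look roughly like this, with the versions taken from the listing above (the exact invocation used on worker18 is an assumption):
sudo zypper install --oldpackage openvswitch=2.14.2-150400.24.9.1 libopenvswitch-2_14-0=2.14.2-150400.24.9.1
sudo reboot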
Updated by jkohoutek about 1 year ago
dzedro wrote in #note-9:
IMO it's a broken openvswitch.
I downgraded the openvswitch packages to the previous version (with a reboot) on worker18 and SUSEConnect/SSL works.
Agreed, it's not a generic timeout but a "TLS handshake timeout", so something is fishy in the SSL layer, at least in its implementation.
Updated by okurz about 1 year ago
- Project changed from openQA Project (public) to openQA Tests (public)
- Subject changed from test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com to [kernel] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com
- Category changed from Support to Bugs in existing tests
- Status changed from Feedback to Blocked
- Assignee changed from okurz to MMoese
- Target version deleted (Ready)
@MMoese assigning to you in the kernel's area as I don't have access to the SD ticket and I don't see where we could help from the tools team right now. Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-140538
Updated by MMoese about 1 year ago
I added you to that ticket as well, but nothing has happened there so far.
Updated by okurz about 1 year ago
- Related to action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M added
Updated by mkittler about 1 year ago
To exclude problems with the system/product we run within the VM, I tried a statically linked curl binary from https://github.com/moparisthebest/static-curl. It is also statically linked against OpenSSL and works outside of the VM. Within the VM it also reproduces the TLS handshake timeout. So it is very unlikely that a regression of the product we're testing is responsible for the problem. (It could of course still be a kernel regression, but that's not very likely.)
By the way, with --verbose one gets:
…
Connected to scc.suse.com (18.194.119.137) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
and then nothing.
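An equivalent low-level check, not used above but a common alternative that avoids curl entirely, is openssl s_client; if the handshake packets are dropped it should hang at the same point, right after the Client Hello:
openssl s_client -connect scc.suse.com:443 -servername scc.suse.com </dev/null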
Updated by jkohoutek about 1 year ago
For the record, SCC registration seems to work from YaST on sapworker2:14:
https://openqa.suse.de/tests/12961803#step/scc_registration/43
but not from the SUSEConnect CLI on worker29:39: https://openqa.suse.de/tests/12961827#step/suseconnect_scc/22
where it ends with net/http: TLS handshake timeout.
It is also broken from the SUSEConnect CLI on worker29:39: https://openqa.suse.de/tests/12961829#step/suseconnect_scc/20
but there it throws a different error: Net::OpenTimeout.
Updated by jlausuch about 1 year ago · Edited
- File Screenshot 2023-12-05 at 12.37.05.png Screenshot 2023-12-05 at 12.37.05.png added
- File clipboard-202312051304-xrf5b.png clipboard-202312051304-xrf5b.png added
- File clipboard-202312051304-rev0g.png clipboard-202312051304-rev0g.png added
So, nailing down the problem a bit (but not solving it yet), in case you have some hints...
1) The client (openQA VM) sends the first TLS handshake packet, Client Hello (SNI=scc.suse.com), to SCC.
2) SCC answers with 2 messages:
   Server Hello
   Certificate, Server Key Exchange, Server Hello Done
3) In working conditions, the client should answer with Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message.
In this picture, I ran sudo tcpdump -i eth0 and curl -k -v --interface eth0 https://scc.suse.com on the worker, just to see the expected exchange of packets. Here 10.145.10.11 is the worker and 18.198.163.39 is scc.suse.com.
Actually there are 3 possible IPs for SCC:
3.64.219.7
18.198.163.39
18.194.119.137
When running curl -k -v --interface eth0 https://scc.suse.com on the openQA VM, it hangs. The reason is that it does not get the Server Hello message, so it waits until it times out.
I have also sniffed the packets on br1 and on the respective tap device attached to the VM and observed the following:
- br1 receives the Server Hello and Certificate, Server Key Exchange, Server Hello Done packets,
- BUT the tap device does not receive them.
Looking at the TCP Previous segment not captured message on the tap device inclines me to think that there is a segmentation problem. Maybe a wrong MTU configuration, but I'm not 100% sure.
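A rough sketch of the capture comparison described above (interface names match the worker slot from the reproduction steps; the exact filters used are an assumption):
sudo tcpdump -ni br1 'tcp port 443' # bridge side: Server Hello / Certificate segments show up here
sudo tcpdump -ni tap39 'tcp port 443' # tap side: the large segments never arrive
# The missing large segments plus "TCP Previous segment not captured" point at an MTU/fragmentation problem.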
Updated by jlausuch about 1 year ago
Indeed... MTU issue.
jlausuch@worker38:~> ip a|grep br1
5: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
inet 10.0.2.2/15 brd 10.1.255.255 scope global br1
After setting the MTU to 1500 on br1, everything works again:
susetest:~ # curl -k https://scc.suse.com|head -10
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0<!DOCTYPE html>
<html lang='en' ng-app='' ng-strict-di='false'>
<head>
<title>SUSE Customer Center</title>
<meta content='initial-scale=1' name='viewport'>
<meta content='all' name='robots'>
<meta content='nWPQ59EF614zzwjOAiG7b1SXCUZKcu7ajpinvshy0xs' name='google-site-verification'>
<link rel="icon" type="image/x-icon" href="https://static.scc.suse.com/assets/favicon-b308f78dd95ea6e03778476d824701284771b296ffd48bfc677e3acc6a2f4db1.ico" />
<link rel="stylesheet" href="https://static.scc.suse.com/assets/application-15bf7ebc1f4a9032f945e051e46e95d63d06ac87d4cff543babe9e2ea4e1592e.css" media="all" data-turbo-track="reload" />
<link href='/humans.txt' rel='author' type='text/plain'>
91 27229 91 24878 0 0 283k 0 --:--:-- --:--:-- --:--:-- 285k
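For completeness, a minimal sketch of the check and the temporary fix described above (bridge name and values as in this comment; this change does not persist across reboots or salt runs):
ip link show br1 | grep -o 'mtu [0-9]*' # showed 1450 in the broken state
sudo ip link set dev br1 mtu 1500 # temporary workaround applied here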
Updated by jlausuch about 1 year ago
I know this PR is not the culprit as the issue was spotted before this change, but maybe we can force it to 1500 instead of 1450?
The tap devices are set to 1500, so I would assume br1 should be the same.
Updated by mkittler about 1 year ago · Edited
We just did a similar test.
The problem with setting 1500 on the bridge is that this will cause problems with GRE tunnels. At least I believe this is the case because:
- We have seen connectivity problems with the default setting. VM hosts were not getting an IP at all. This only happened sporadically so it is hard to tell anything for sure but it seemed that lowering the MTU to 1450 helped.
- The FAQ also suggests to lower the MTU like this for our use case (search for 1450 on https://docs.openvswitch.org/en/latest/faq/issues).
When setting the MTU to 1450 in the VM via e.g. ip link set dev eth0 mtu 1450 it works again. Normally the MTU within the VMs is set to 1458, which is a bit too high. Maybe we can set the MTU on the bridge to 1460 as Dirk Müller also suggested. This would be big enough for an MTU of 1458 within the VMs and will hopefully not break traffic via GRE tunnels.
EDIT: MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061
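As a rough back-of-the-envelope check of why the bridge MTU has to stay below 1500 (numbers are approximate and depend on the tunnel options in use):
# 1500 (physical MTU) - 20 (outer IPv4 header) - 4..8 (GRE header) leaves roughly 1472..1476 usable,
# so 1460 on the bridge keeps some headroom and still fits the 1458 used inside the VMs.
# A candidate value can be verified between two workers with DF-bit pings, e.g. (peer hostname is a placeholder):
ping -M do -s $((1460 - 28)) <other-worker> # 28 = IPv4 + ICMP headers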
Updated by jlausuch about 1 year ago
mkittler wrote in #note-23:
We just did a similar test.
The problem with setting 1500 on the bridge is that this will cause problems with GRE tunnels. At least I believe this is the case because:
- We have seen connectivity problems with the default setting. VM hosts were not getting an IP at all. This only happened sporadically so it is hard to tell anything for sure but it seemed that lowering the MTU to 1450 helped.
- The FAQ also suggests to lower the MTU like this for our use case (search for 1450 on https://docs.openvswitch.org/en/latest/faq/issues).
When setting the MTU to 1450 in the VM via e.g. ip link set dev eth0 mtu 1450 it works again. Normally the MTU within the VMs is set to 1458, which is a bit too high. Maybe we can set the MTU on the bridge to 1460 as Dirk Müller also suggested. This would be big enough for an MTU of 1458 within the VMs and will hopefully not break traffic via GRE tunnels.
EDIT: MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061
Ok, I missed the context and all that you have tried as part of #151310. Thanks for the clarification, Marius.
Updated by mkittler about 1 year ago
- Assignee changed from MMoese to mkittler
The MR has been applied. I restarted one of the production jobs: https://openqa.suse.de/tests/12973571
I also created a bunch of other jobs that were sporadically failing before we reduced the MTU: https://openqa.suse.de/tests/overview?version=15-SP2&distri=sle&build=20231204-1-mtu-1460
Hopefully these jobs will succeed also with 1460.
Updated by mkittler about 1 year ago · Edited
All test jobs look good so far:
- The restarted production job cluster has most likely passed the problematic point.
- There are also no sporadic failures among the other MM jobs. (The cluster failing due to an incomplete job is because I didn't stop the one worker slot it ran on correctly when using its tap device for a test VM.)
I also created a PR for openqa-clone-job to make cloning test jobs easier in the future: https://github.com/os-autoinst/openQA/pull/5385
I also restarted https://openqa.suse.de/tests/12974495. Let's see whether it works as well.
Updated by acarvajal about 1 year ago
Updated by okurz about 1 year ago
https://openqa.suse.de/tests/12980422, which ran this morning on worker30, is still showing "SUSEConnect error: Get "https://scc.suse.com/connect/systems/activations": net/http: TLS handshake timeout". All OSD salt-controlled workers have an MTU of 1460, so we need to think further.
Updated by okurz about 1 year ago
- Subject changed from [kernel] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com to [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com
- Priority changed from Normal to High
- Target version set to Ready
Updated by okurz about 1 year ago
- Priority changed from High to Urgent
@mkittler I added this ticket to our backlog as you assigned yourself to it. But given the impact visible on http://dashboard.qam.suse.de/blocked I suggest applying improvements on multiple levels and, most importantly, mitigating the urgency by applying workarounds on the higher levels where SLE maintenance tests are affected. Hence my specific suggestion is that you focus on fixing the actual underlying issue(s) as part of #151310, reassign here, and leave this ticket for the more superficial handling of symptoms affecting SLE maintenance tests.
Updated by mkittler about 1 year ago · Edited
The MTU setting on worker30 is ok and the other production jobs I restarted succeeded. The problem with this particular scenario (https://openqa.suse.de/tests/12980422 / https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=qam-wicked_basic_ref&version=15-SP1) is that the MTU within the SUT is not set to something <= 1460:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:12:09:f6 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.10/15 brd 10.1.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe12:9f6/64 scope link
valid_lft forever preferred_lft forever
(from step https://openqa.suse.de/tests/12980422#step/before_test/152 and I could not find any further step that would change the MTU before the SUSEConnect invocation)
I see multiple options:
1. Not lowering the MTU on the bridge anymore and either
   - live with sporadic connectivity issues of the MM setup, leaving #151310 unresolved, or
   - try to increase the maximum physical MTU using jumbo frames.
2. Ensure that all MM SUTs have an MTU of <= 1460 configured (see the sketch below).
Considering we already do 2., it most likely makes sense to adapt the remaining test scenarios. The good thing is that these problems are always reproducible, and considering the manual tinkering with the network these tests already do, it should not be hard to set the MTU as well.
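A minimal sketch of what option 2 amounts to inside an affected test (interface name and value are assumptions; the actual change for the wicked scenario is linked in the next comment):
ip link set dev eth0 mtu 1458 # keep the SUT below the 1460 configured on the bridge
ip link show dev eth0 | grep -o 'mtu [0-9]*' # confirm the new value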
Updated by mkittler about 1 year ago · Edited
This PR will hopefully fix the mentioned wicked test scenario: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18264
EDIT: With the PR merged it now works in production, e.g. https://openqa.suse.de/tests/12982770.
PR to document the steps to reproduce this issue: https://github.com/os-autoinst/openQA/pull/5387
Updated by okurz about 1 year ago
- Due date deleted (2023-12-12)
- Status changed from In Progress to Resolved
Both https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18264 and https://github.com/os-autoinst/openQA/pull/5387 are merged. The scenarios that mkittler mentioned are stable again and https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now also looks good again, so we resolve here and continue in #151310. I asked yesterday for somebody else to take over and handle the fallout, including the impact on SLE maintenance tests. This hasn't happened, so there might still be SLE maintenance tests blocked by this issue that have not been handled yet, but considering that people didn't see that as a priority I am ok to live with the state as is.
Updated by okurz about 1 year ago
- Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Updated by okurz almost 1 year ago
- Related to action #152755: [tools] test fails in scc_registration - SCC not reachable despite not running multi-machine tests? size:M added