action #151612

closed

[kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com

Added by acarvajal about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Bugs in existing tests
Start date: 2023-11-28
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP4-Server-DVD-HA-Updates-x86_64-qam_ha_rolling_update_node01@64bit fails in suseconnect_scc

SUT attempts SCC registration using SUSEConnect -r $regcode, but this fails with a timeout while connecting to https://scc.suse.com.

The issue was found on Multi-Machine (MM) jobs, but as of this writing I'm not sure whether it is limited to MM jobs only.

The issue has been seen on several workers: worker29, worker30, worker38 and worker39.

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20231127-1

Steps to reproduce manually

Log in on e.g. worker38.oqa.prg2.suse.org and start a VM using a TAP device from a worker slot:

sudo systemctl stop openqa-reload-worker-auto-restart@40.path
sudo systemctl stop openqa-worker-auto-restart@40.service
wget https://download.suse.de/install/SLE-Micro-5.5-GM/SLE-Micro.x86_64-5.5.0-Default-qcow-GM.qcow2
qemu-system-x86_64 -m 2048 -enable-kvm -vnc :42 -snapshot -netdev tap,id=qanet0,ifname=tap39,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:0b:ff SLE-Micro.x86_64-5.5.0-Default-qcow-GM.qcow2

Configure the network within the VM, e.g.:

ip a add 10.0.2.15/24 dev eth0
ip r add default via 10.0.2.2
echo 'nameserver 8.8.8.8' > /etc/resolv.conf
curl https://www.google.de # should work (you'll get tons of HTML)

Reproduce the issue:

curl https://scc.suse.com # does *not* work (timeout)
SUSEConnect --url https://scc.suse.com -r 1234 # does *not* work (TLS handshake timeout)
curl http://scc.suse.com # interestingly works (showing 301 as expected)

Expected result

Last good: 20231126-1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 3 (0 open, 3 closed)

Related to openQA Project (public) - action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M (Resolved, mkittler, 2023-11-23)

Related to openQA Project (public) - action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)

Related to openQA Tests (public) - action #152755: [tools] test fails in scc_registration - SCC not reachable despite not running multi-machine tests? size:M (Resolved, mkittler, 2023-12-19)

Actions #1

Updated by acarvajal about 1 year ago · Edited

I was able to pause 2 jobs (https://openqa.suse.de/tests/12918352 & https://openqa.suse.de/tests/12918351) running on worker38 & worker39 while they were failing, and could find out the following by testing directly from one of the SUTs (roughly the checks sketched after this list):

  • SUT can resolve scc.suse.com
  • scc.suse.com is not reachable via ICMP, but that's expected
  • curl scc.suse.com works
  • curl https://scc.suse.com times out
  • curl -k https://openqa.suse.de works
  • SUT can reach support server
  • SUT can reach other node
  • SUT can reach updates.suse.com via HTTP and HTTPS
  • SUT can reach download.suse.de via HTTP and HTTPS
  • SUT can not reach registry.suse.com via HTTPS
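
For anyone redoing this triage by hand, the checks boil down to something like the following (a sketch only, run inside the SUT; hosts as in the list above):

# name resolution
getent hosts scc.suse.com
# plain HTTP to SCC (worked) vs. HTTPS to SCC (timed out)
curl -sI http://scc.suse.com | head -1
curl -skI --max-time 30 https://scc.suse.com | head -1
# other HTTPS targets for comparison
curl -skI https://openqa.suse.de | head -1
curl -sI --max-time 30 https://updates.suse.com | head -1
curl -sI --max-time 30 https://download.suse.de | head -1
curl -sI --max-time 30 https://registry.suse.com | head -1   # also timed out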

I also checked the following on worker39, where that SUT was running:

  • curl https://scc.suse.com works
  • sysctl -a shows that IPv4 forwarding is enabled for eth0, br1 and the tap interfaces
  • Not sure if this is a problem, but nft list table firewalld on worker39 does not show the tap interfaces in nat_PREROUTING_ZONES:
        chain nat_PREROUTING_ZONES {
                iifname "eth0" goto nat_PRE_trusted
                iifname "ovs-system" goto nat_PRE_trusted
                iifname "br1" goto nat_PRE_trusted
                iifname "docker0" goto nat_PRE_docker
                goto nat_PRE_trusted
        }
  • Nor in nat_POSTROUTING_ZONES:
    chain nat_POSTROUTING_ZONES {
        oifname "eth0" goto nat_POST_trusted
        oifname "ovs-system" goto nat_POST_trusted
        oifname "br1" goto nat_POST_trusted
        oifname "docker0" goto nat_POST_docker
        goto nat_POST_trusted
    }
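
For completeness, a way to cross-check how firewalld classifies the tap devices (a sketch, assuming the firewalld nftables backend; tap39 is just an example interface):

# dump the zone dispatch chain from the nftables ruleset firewalld installed
sudo nft list table inet firewalld | grep -A 10 'chain nat_PREROUTING_ZONES'
# ask firewalld which zone a given tap device belongs to (no output = no explicit binding)
sudo firewall-cmd --get-zone-of-interface=tap39
# list the interfaces explicitly bound to the trusted zone
sudo firewall-cmd --zone=trusted --list-interfaces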

I added some screenshots from the SUT's tests in https://suse.slack.com/archives/C02CANHLANP/p1701181092050239

Actions #2

Updated by okurz about 1 year ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Due date set to 2023-12-12
  • Category set to Support
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Ready

Oh, nice investigation. That looks very specific about the host and port. So I suggest you open an SD ticket mentioning that likely the firewall is blocking traffic there.

Actions #3

Updated by szarate about 1 year ago

okurz wrote in #note-2:

Oh, nice investigation. That looks very specific about the host and port. So I suggest you open an SD ticket mentioning that likely the firewall is blocking traffic there.

Looks like it could have been a network hiccup: https://openqa.suse.de/tests/12925142 https://openqa.suse.de/tests/12925143

See: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18221

Actions #4

Updated by acarvajal about 1 year ago

szarate wrote in #note-3:

Looks like it could have been a network hiccup: https://openqa.suse.de/tests/12925142 https://openqa.suse.de/tests/12925143

We saw many failures overnight. Just in case it was a network hiccup, I'm restarting the jobs. Will comment here later.

Actions #5

Updated by acarvajal about 1 year ago

The issue still seems to be present:

The error is not exactly the same as in the jobs I used to create this poo# (those were doing registration via SUSEConnect, these are using yast2), but I figure the root cause is the same.

Actions #7

Updated by acarvajal about 1 year ago

okurz wrote in #note-2:

Oh, nice investigation. That looks very specific about the host and port. So I suggest you open an SD ticket mentioning that likely the firewall is blocking traffic there.

Would an SD ticket actually help?

As reported, https://scc.suse.com is reachable from the worker itself, but not from the VM running on the worker. As I said in the Slack thread, I suspect there is something wrong with the NAT rules on the workers.

Actions #9

Updated by dzedro about 1 year ago · Edited

IMO it's a broken openvswitch

I downgraded the openvswitch packages to the previous version, rebooted worker18, and SUSEConnect/SSL works

It was probably just a lucky run, maybe not related at all; the next runs failed.

openqaworker18:~ # zypper se -s -x libopenvswitch-2_14-0 openvswitch
Loading repository data...
Reading installed packages...

S  | Name                  | Type       | Version               | Arch   | Repository
---+-----------------------+------------+-----------------------+--------+-------------------------------------------------------------
v  | libopenvswitch-2_14-0 | package    | 2.14.2-150400.24.14.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
i+ | libopenvswitch-2_14-0 | package    | 2.14.2-150400.24.9.1  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v  | libopenvswitch-2_14-0 | package    | 2.14.2-150400.24.6.1  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v  | libopenvswitch-2_14-0 | package    | 2.14.2-150400.24.3.1  | x86_64 | Main Repository
v  | openvswitch           | package    | 2.14.2-150400.24.14.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
i+ | openvswitch           | package    | 2.14.2-150400.24.9.1  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v  | openvswitch           | package    | 2.14.2-150400.24.6.1  | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
v  | openvswitch           | package    | 2.14.2-150400.24.3.1  | x86_64 | Main Repository
   | openvswitch           | srcpackage | 2.14.2-150400.24.14.2 | noarch | Update repository with updates from SUSE Linux Enterprise 15
   | openvswitch           | srcpackage | 2.14.2-150400.24.9.1  | noarch | Update repository with updates from SUSE Linux Enterprise 15
   | openvswitch           | srcpackage | 2.14.2-150400.24.6.1  | noarch | Update repository with updates from SUSE Linux Enterprise 15
openqaworker18:~ #

https://openqa.suse.de/tests/12941822#step/smt_server_install/21 smt
https://openqa.suse.de/tests/12941843#step/nfs_server/12 nfs

Actions #10

Updated by jkohoutek about 1 year ago

dzedro wrote in #note-9:

IMO it's a broken openvswitch

I downgraded the openvswitch packages to the previous version, rebooted worker18, and SUSEConnect/SSL works

Agree, it's not a generic timeout but a "TLS handshake timeout", so there's something fishy in the SSL, or at least in the implementation of it.

Actions #11

Updated by okurz about 1 year ago

  • Project changed from openQA Project (public) to openQA Tests (public)
  • Subject changed from test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com to [kernel] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com
  • Category changed from Support to Bugs in existing tests
  • Status changed from Feedback to Blocked
  • Assignee changed from okurz to MMoese
  • Target version deleted (Ready)

@MMoese assigning to you in the kernel's area as I don't have access to the SD ticket and I don't see where we could help from the tools team right now. Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-140538

Actions #12

Updated by MMoese about 1 year ago

I added you to that ticket as well, but nothing happened there so far

Actions #13

Updated by okurz about 1 year ago

  • Related to action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M added
Actions #14

Updated by mkittler about 1 year ago

  • Description updated (diff)
Actions #15

Updated by mkittler about 1 year ago

  • Description updated (diff)
Actions #16

Updated by mkittler about 1 year ago

  • Description updated (diff)
Actions #17

Updated by mkittler about 1 year ago

To exclude problems with the system/product we run within the VM, I tried a statically linked curl binary from https://github.com/moparisthebest/static-curl. It is also statically linked against OpenSSL and works outside of the VM. Within the VM it also reproduces the TLS handshake timeout. So it is very unlikely that a regression of the product we're testing is responsible for the problem. (It could of course still be a kernel regression, but that's not very likely.)

By the way, with --verbose one gets

…
Connected to scc.suse.com (18.194.119.137) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):

and then nothing.
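
The same hang can be reproduced without curl, which helps rule out the HTTP layer; a quick sketch with openssl (in the broken state it stalls right after the ClientHello, consistent with the large server-side handshake records being dropped on the way back):

# normally prints the certificate chain; here it hangs and timeout(1) kills it after 30s (exit code 124)
timeout 30 openssl s_client -connect scc.suse.com:443 -servername scc.suse.com </dev/null
echo "exit code: $?"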

Actions #18

Updated by MMoese about 1 year ago

  • Status changed from Blocked to In Progress
Actions #19

Updated by jkohoutek about 1 year ago

I would like to add for the record that SCC registration seems to work from YaST on sapworker2:14:
https://openqa.suse.de/tests/12961803#step/scc_registration/43

but not on worker29:39 from the SUSEConnect CLI: https://openqa.suse.de/tests/12961827#step/suseconnect_scc/22
where it ends with net/http: TLS handshake timeout

It is also broken from the SUSEConnect CLI on worker29:39: https://openqa.suse.de/tests/12961829#step/suseconnect_scc/20
but there it throws a different error: Net::OpenTimeout

Actions #20

Updated by jlausuch about 1 year ago · Edited

So, nailing down the problem a bit (but not solving it yet), in case you have some hints...

1) Client (openQA VM) sends the first TLS handshake packet, Client Hello (SNI=scc.suse.com), to SCC
2) SCC answers with 2 messages:
   Server Hello
   Certificate, Server Key Exchange, Server Hello Done
3) In working conditions, the client should answer with Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message.

In the attached picture, I ran sudo tcpdump -i eth0 and curl -k -v --interface eth0 https://scc.suse.com on the worker, just to see the expected exchange of packets. Here 10.145.10.11 is the worker and 18.198.163.39 is scc.suse.com.
Actually there are 3 possible IPs for SCC:

  • 3.64.219.7
  • 18.198.163.39
  • 18.194.119.137

When running curl -k -v --interface eth0 https://scc.suse.com on the openQA VM, it hangs. The reason is that it does not get the Server Hello message, so it waits until it times out.

I have also sniffed the packets on br1 and on the respective tap device attached to the VM and observed the following:

  • br1 receives the Server Hello and Certificate, Server Key Exchange, Server Hello Done packets,
  • BUT the tap device does not receive them

Looking at the TCP Previous segment not captured message on the tap device inclines me to think that there is a segmentation problem. Maybe a wrong MTU configuration, but I'm not 100% sure.
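
The capture described above can be repeated with something like this on the worker (a sketch; tap39 stands in for whichever tap device the VM is attached to):

# watch the TLS handshake on the bridge and on the VM's tap device
sudo tcpdump -ni br1 'tcp port 443' &
sudo tcpdump -ni tap39 'tcp port 443'
# in the broken state the large ServerHello/Certificate segments show up on br1 but never on tap39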

Actions #21

Updated by jlausuch about 1 year ago

Indeed... MTU issue.

jlausuch@worker38:~> ip a|grep br1
5: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 10.0.2.2/15 brd 10.1.255.255 scope global br1

After setting the MTU to 1500 on br1, everything works again:

susetest:~ # curl -k https://scc.suse.com|head -10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0<!DOCTYPE html>
<html lang='en' ng-app='' ng-strict-di='false'>
<head>
<title>SUSE Customer Center</title>
<meta content='initial-scale=1' name='viewport'>
<meta content='all' name='robots'>
<meta content='nWPQ59EF614zzwjOAiG7b1SXCUZKcu7ajpinvshy0xs' name='google-site-verification'>
<link rel="icon" type="image/x-icon" href="https://static.scc.suse.com/assets/favicon-b308f78dd95ea6e03778476d824701284771b296ffd48bfc677e3acc6a2f4db1.ico" />
<link rel="stylesheet" href="https://static.scc.suse.com/assets/application-15bf7ebc1f4a9032f945e051e46e95d63d06ac87d4cff543babe9e2ea4e1592e.css" media="all" data-turbo-track="reload" />
<link href='/humans.txt' rel='author' type='text/plain'>
 91 27229   91 24878    0     0   283k      0 --:--:-- --:--:-- --:--:--  285k
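
For reference, checking and adjusting the bridge MTU by hand looks like this (a sketch; this is a temporary, non-persistent change on the worker and may be reverted by salt):

# current MTU of the bridge
ip link show br1 | grep -o 'mtu [0-9]*'
# raise it to 1500 for testing
sudo ip link set dev br1 mtu 1500
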
Actions #22

Updated by jlausuch about 1 year ago

I know this PR is not the culprit as the issue was spotted before this change, but maybe we can force it to 1500 instead of 1450?
The tap devices are set to 1500, so I would assume br1 should be the same.

Actions #23

Updated by mkittler about 1 year ago · Edited

We just did a similar test.

The problem with setting 1500 on the bridge is that this will cause problems with GRE tunnels. At least I believe this is the case because:

  • We have seen connectivity problems with the default setting. VM hosts were not getting an IP at all. This only happened sporadically, so it is hard to tell anything for sure, but it seemed that lowering the MTU to 1450 helped.
  • The FAQ also suggests lowering the MTU like this for our use case (search for 1450 on https://docs.openvswitch.org/en/latest/faq/issues).

When setting the MTU to 1450 in the VM via e.g. ip link set dev eth0 mtu 1450, it works again. Normally the MTU within the VMs is set to 1458, which is a bit too high. Maybe we can set the MTU on the bridge to 1460, as Dirk Müller also suggested. This would be big enough for an MTU of 1458 within the VMs and will hopefully not break traffic via GRE tunnels.

EDIT: MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061
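
For completeness, a quick way to probe the effective path MTU from inside a SUT, independent of TLS (a sketch; the packet sizes are simply the MTU minus 28 bytes of IP+ICMP headers):

# with the DF bit set, this should succeed if the path carries an MTU of 1450 (1422 = 1450 - 28) ...
ping -c 1 -M do -s 1422 download.suse.de
# ... and should fail once the payload exceeds what the bridge/tunnel path can carry (1472 = 1500 - 28)
ping -c 1 -M do -s 1472 download.suse.de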

Actions #24

Updated by jlausuch about 1 year ago

mkittler wrote in #note-23:

We just did a similar test.

The problem with setting 1500 on the bridge is that this will cause problems with GRE tunnels. At least I believe this is the case because:

  • We have seen connectivity problems with the default setting. VM hosts were not getting an IP at all. This only happened sporadically, so it is hard to tell anything for sure, but it seemed that lowering the MTU to 1450 helped.
  • The FAQ also suggests lowering the MTU like this for our use case (search for 1450 on https://docs.openvswitch.org/en/latest/faq/issues).

When setting the MTU to 1450 in the VM via e.g. ip link set dev eth0 mtu 1450, it works again. Normally the MTU within the VMs is set to 1458, which is a bit too high. Maybe we can set the MTU on the bridge to 1460, as Dirk Müller also suggested. This would be big enough for an MTU of 1458 within the VMs and will hopefully not break traffic via GRE tunnels.

EDIT: MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061

Ok, I missed the context and all that you have tried as part of #151310. Thanks for the clarification, Marius.

Actions #25

Updated by mkittler about 1 year ago

  • Assignee changed from MMoese to mkittler

The MR has been applied. I restarted one of the production jobs: https://openqa.suse.de/tests/12973571

I also created a bunch of other jobs that were sporadically failing before we reduced the MTU: https://openqa.suse.de/tests/overview?version=15-SP2&distri=sle&build=20231204-1-mtu-1460
Hopefully these jobs will also succeed with 1460.

Actions #26

Updated by mkittler about 1 year ago · Edited

All test jobs look good so far:

  • The restarted production job cluster has most likely passed the problematic point.
  • There are also no sporadic failures among the other MM jobs. (The cluster failing due to an incomplete job is because I didn't correctly stop the one worker slot it ran on when using its tap device for a test VM.)

I also created a PR for openqa-clone-job to make cloning test jobs easier in the future: https://github.com/os-autoinst/openQA/pull/5385

I also restarted https://openqa.suse.de/tests/12974495. Let's see whether it works as well.

Actions #27

Updated by acarvajal about 1 year ago

Thanks a lot @mkittler and @jlausuch, very informative. I didn't suspect MTU issues due to the nature of the failure, but it is good to know that they can cause failures that look like a firewall blocking packets. I'm keeping an eye on HA jobs and will report back if I see issues.

Actions #28

Updated by okurz about 1 year ago

https://openqa.suse.de/tests/12980422, which ran this morning on worker30, is still showing "SUSEConnect error: Get "https://scc.suse.com/connect/systems/activations": net/http: TLS handshake timeout". All OSD salt-controlled workers have an MTU of 1460, so we need to think further.

Actions #29

Updated by okurz about 1 year ago

  • Subject changed from [kernel] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com to [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #30

Updated by okurz about 1 year ago

  • Priority changed from High to Urgent

@mkittler I added this ticket to our backlog as you assigned yourself to it. But given the impact as visible on http://dashboard.qam.suse.de/blocked I suggest applying improvements on multiple levels and, most importantly, mitigating the urgency by applying workarounds on the higher levels where SLE maintenance tests are affected. Hence my specific suggestion is that you focus on fixing the actual underlying issue(s) as part of #151310, reassign here, and leave this ticket for the more superficial handling of symptoms affecting SLE maintenance tests.

Actions #31

Updated by mkittler about 1 year ago · Edited

The MTU setting on worker30 is ok and the other production jobs I restarted succeeded. The problem with this particular scenario (https://openqa.suse.de/tests/12980422 / https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=qam-wicked_basic_ref&version=15-SP1) is that the MTU within the SUT is not set to something <= 1460:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:09:f6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.10/15 brd 10.1.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe12:9f6/64 scope link 
       valid_lft forever preferred_lft forever

(from step https://openqa.suse.de/tests/12980422#step/before_test/152 and I could not find any further step that would change the MTU before the SUSEConnect invocation)

I see multiple options:

  1. Not lowering the MTU on the bridge anymore and
    1. live with sporadic connectivity issues of the MM setup, leaving #151310 unresolved.
    2. try to increase the maximum physical MTU using jumbo frames.
  2. Ensure that all MM SUTs have an MTU of <= 1460 configured.

Considering we already do 2. it likely makes most sense to adapt the remaining test scenarios. The good thing is that these problems are always reproducible, and considering the manual tinkering with the network these tests already do, it should not be hard to set the MTU as well (a sketch follows below).
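
A minimal sketch of what option 2 means inside a SUT, before the registration step (illustration only; the concrete tests would do this in their own network setup code):

# lower the MTU of the SUT's interface so packets fit through the bridge/GRE path
ip link set dev eth0 mtu 1460
ip link show eth0 | grep -o 'mtu [0-9]*'   # verify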

Actions #32

Updated by mkittler about 1 year ago · Edited

This PR will hopefully fix the mentioned wicked test scenario: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18264
EDIT: With the PR merged it now works in production, e.g. https://openqa.suse.de/tests/12982770.

PR to document the steps to reproduce this issue: https://github.com/os-autoinst/openQA/pull/5387

Actions #33

Updated by okurz about 1 year ago

  • Due date deleted (2023-12-12)
  • Status changed from In Progress to Resolved

Both https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18264 and https://github.com/os-autoinst/openQA/pull/5387 are merged. The scenarios that mkittler mentioned are stable again and https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now also looks good again, so we resolve here and continue in #151310. I asked yesterday for somebody else to take over and handle the fallout, including the impact on SLE maintenance tests. This hasn't happened, so there might still be SLE maintenance tests blocked by this issue that have not been handled yet, but considering that people didn't see that as a priority I am ok to live with the state as is.

Actions #34

Updated by okurz about 1 year ago

  • Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Actions #35

Updated by okurz almost 1 year ago

  • Related to action #152755: [tools] test fails in scc_registration - SCC not reachable despite not running multi-machine tests? size:M added