action #135056
closed
MM Test fails in a connection to an address outside of the worker
Description
Observation
openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_upgrade_migration_node01@64bit fails in suseconnect_scc.
The test times out in a SUSEConnect command while attempting connections to https://scc.suse.com.
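For context, a rough sketch of what the failing step boils down to, assuming a plain registration call against SCC (REGCODE is a placeholder, not taken from the test code):

# sketch only; the exact invocation in the test code may differ
SUSEConnect --url https://scc.suse.com -r "$REGCODE"
# the reported timeout means this command gets no answer from scc.suse.com in time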
So far I have seen the same issue in the following jobs; I'm also adding the workers where these jobs ran to see if there is a pattern:
https://openqa.suse.de/tests/11977452#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11968267#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968328#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11968329#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968336#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968418#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968417#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11965548#step/suseconnect_scc/20 / worker38
https://openqa.suse.de/tests/11965600#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11975401#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11975400#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11981360#step/suseconnect_scc/20 / worker38
https://openqa.suse.de/tests/11975460#step/suseconnect_scc/20 / worker34
Test suite description
The base test suite is used for job templates defined in YAML documents. It has no settings of its own.
Expected result
Last good: :30365:python-iniconfig (or more recent)
Further details
Always latest result in this scenario: latest
Updated by acarvajal over 1 year ago
Looking at the list of workers, this seems related to the support_server/setup issue described in #134282#note-27
Updated by okurz over 1 year ago
- Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Updated by acarvajal over 1 year ago
- Priority changed from Normal to Urgent
This issue was gone between Sunday, Sept. 3rd and Wednesday, Sept. 6th, but it started happening again on different workers on the 7th, for example these jobs failing in iscsi_client while running zypper:
https://openqa.suse.de/tests/12038078#step/iscsi_client/22
https://openqa.suse.de/tests/12030835#step/iscsi_client/47
https://openqa.suse.de/tests/12030870#step/iscsi_client/22
So far I've only seen this in worker29 & worker30. This time worker37 & worker38 seem fine and worker34 is offline.
I'm increasing the priority.
Updated by okurz over 1 year ago
- Target version changed from Ready to future
Sorry, need to reconsider. We can't handle that in the team right now with urgent priority. You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.
Updated by acarvajal over 1 year ago
okurz wrote in #note-6:
Sorry, need to reconsider. We can't handle that in the team right now with urgent priority.
Seriously?
You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.
This message:
2023-09-07 13:16:29 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):939 THROW: Timeout exceeded when accessing 'https://scc.suse.com/access/services/2383/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP5_x86_64'.
From https://openqa.suse.de/tests/12030870#step/iscsi_client/37 needs to be improved? How?
Or this one:
2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):960 THROW: Download (curl) error for 'https://scc.suse.com/access/services/1931/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP2_x86_64':
2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 Error code: Connection failed
2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 Error message: Failed to connect to scc.suse.com port 443: Connection timed out
From https://openqa.suse.de/tests/12071228#step/qnetd/53
I'm open to suggestions.
Updated by srinidhir over 1 year ago
More failures when trying to reach the network outside of osd:
https://openqa.suse.de/tests/12071228#step/qnetd/53
https://openqa.suse.de/tests/12070935#step/iscsi_client/57
https://openqa.suse.de/tests/12071082#step/iscsi_client/57
https://openqa.suse.de/tests/12071094#step/iscsi_client/57
https://openqa.suse.de/tests/12070893#step/qnetd/33
https://openqa.suse.de/tests/12071179#step/suseconnect_scc/21
https://openqa.suse.de/tests/12071045#step/suseconnect_scc/21
https://openqa.suse.de/tests/12070686#step/iscsi_client/57
https://openqa.suse.de/tests/12070559#step/iscsi_client/37
https://openqa.suse.de/tests/12070390#step/iscsi_client/37
https://openqa.suse.de/tests/12076977#step/suseconnect_scc/21
https://openqa.suse.de/tests/12076971#step/register_system/46
https://openqa.suse.de/tests/12077037#step/iscsi_client/57
Updated by okurz over 1 year ago
acarvajal wrote in #note-7:
okurz wrote in #note-6:
Sorry, need to reconsider. We can't handle that in the team right now with urgent priority.
Seriously?
Yes, seriously. https://os-autoinst.github.io/qa-tools-backlog-assistant/ shows our complete backlog status. In particular regarding the infrastructure maintenance, our focus needs to be on conducting the dataserver migration with all related parts. If you want to know more details and provide your feedback on the plans we have, you can join our monthly roadmap discussion in the SUSE QE Tools workshop sessions, see https://progress.opensuse.org/projects/qa/wiki/tools#Workshop-Topics
You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.
This message:
2023-09-07 13:16:29 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):939 THROW: Timeout exceeded when accessing 'https://scc.suse.com/access/services/2383/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP5_x86_64'.
From https://openqa.suse.de/tests/12030870#step/iscsi_client/37 needs to be improved? How?
That error message itself is quite clear, but then it's not clear why there is a timeout. Is the server not resolvable at all? Can it be pinged? Can it be pinged with a low packet size, but is there an MTU-related problem which would be apparent with a high-packet-size ping? Also, to debug further, the test should (if not already done) show the output of ip a and ip r and such.
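A minimal sketch of such checks, assuming they are run inside the SUT when the timeout occurs (scc.suse.com is just the host from the error message above):

getent hosts scc.suse.com              # is the name resolvable at all?
ping -c 3 scc.suse.com                 # basic reachability with the default packet size
ping -c 3 -M do -s 1400 scc.suse.com   # non-fragmenting large packets to reveal MTU problems
ip a                                   # interface and address configuration
ip r                                   # routing table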
Updated by acarvajal over 1 year ago
okurz wrote in #note-9:
That error message itself is quite clear, but then it's not clear why there is a timeout. Is the server not resolvable at all? Can it be pinged? Can it be pinged with a low packet size, but is there an MTU-related problem which would be apparent with a high-packet-size ping? Also, to debug further, the test should (if not already done) show the output of ip a and ip r and such.
Updated by mkittler over 1 year ago
I've been briefly looking at https://openqa.suse.de/tests/12070893. It looks like the SUT got an IP in https://openqa.suse.de/tests/12070893#step/hostname/27. This indeed seems similar to what I observed when previously working on #134282 (although those tests could refresh repositories, e.g. https://openqa.suse.de/tests/11821882#step/iscsi_client/5).
Note that further down in the logs there are clearer error messages like:
2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):960 THROW: Download (curl) error for 'https://updates.suse.com/SUSE/Updates/SLE-Module-Basesystem/15-SP5/x86_64/update/repodata/repomd.xml?sFGn5uvJ5S4_i56DYilSWqkEzPpwt0b39EZhCAAW033WhiiwUwKvex5kavICK-LmJUTLiVEuqiy53d5NPP9-msAwVT9gZQJkWlgIxOqfZg0v3FPL2PisUGil0q23CaWn4Kwocks3ykg62ik9-gaoF_irCNb6aA':
2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 Error code: Connection failed
2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 Error message: Failed to connect to updates.suse.com port 443 after 18361 ms: Couldn't connect to server
Updated by livdywan over 1 year ago
It may be worth noting that there were jobs failing to access SCC servers as far back as 2 months ago: https://openqa.suse.de/tests/11614943#step/suseconnect_scc/20 and here's a version failing on dsc: https://openqa.suse.de/tests/11560650#step/iscsi_client/5 - unfortunately there are no bug comments on those old jobs.
Updated by jlausuch over 1 year ago
I think (but I'm not sure) that there is some missing nftables configuration on the workers (see #135524), but I would need help troubleshooting this as I don't have much experience with it.
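A hedged sketch of how the NAT/forwarding setup on a worker could be inspected (purely illustrative; the actual table, chain and rule names are whatever the salt states define):

nft list ruleset                                  # dump all nftables tables, chains and rules
nft list ruleset | grep -i -B2 -A2 masquerade     # is masquerading configured for traffic leaving br1?
iptables-save 2>/dev/null | grep -i MASQUERADE    # in case legacy iptables rules are (still) in use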
Updated by acarvajal over 1 year ago
livdywan wrote in #note-12:
It may be worth noting that there were jobs failing to access SCC servers as far back as 2 months ago: https://openqa.suse.de/tests/11614943#step/suseconnect_scc/20 and here's a version failing on dsc: https://openqa.suse.de/tests/11560650#step/iscsi_client/5 - unfortunately there are no bug comments on those old jobs.
That job history is interesting as it has:
- Failures to access SCC servers 2 months ago (same root cause? different? I don't think it's possible to know now)
- Consecutive passing results for 2 weeks following that
- Then some sporadic failures (check_logs, cluster_md, ha_cluster_join ... of these, only ha_cluster_join could be related to the current issues)
- Then, starting on August 13th, multi-machine failures in iscsi_client. FYI, #134282 was opened on August 15th
- Then, again passing jobs starting on the 17th; however, all of these ran on the same worker. This was probably the result of a workaround by QE-SAP
- Then, 9 days ago, the first one of these running on multiple workers: https://openqa.suse.de/tests/12031268#dependencies (worker37 & worker39). Remember #134282#note-47, #134282#note-48 and #134282#note-59? 9 days ago is smack in the middle of those comments, which are from 8 to 11 days ago, when I thought the issue had been fixed.
- Finally, the last one there is from 7 days ago, failing while attempting a connection to SCC. There are no results after that, more than likely because the parallel support server setup is failing (tracked in #134282)
Updated by mkittler over 1 year ago
The scenario https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_rolling_upgrade_migration_node01&version=15-SP4#next_previous looks quite good since https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987 was merged. So maybe it really helped.
Updated by mkittler over 1 year ago
I've just re-triggered MM tests on https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=20.1&groupid=158 that failed 4 days ago similarly. Let's see whether they'll pass after https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987.
EDIT: It looks very good so far. The only failing cluster is https://openqa.suse.de/tests/12165551, but it didn't fail due to a network connection issue with an address outside of the worker. Some tests are still running/scheduled.
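For reference, a failed job can also be re-triggered from the command line via the openQA API; a sketch with a placeholder job ID:

openqa-cli api --host https://openqa.suse.de -X POST jobs/<JOB_ID>/restart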
Updated by mkittler over 1 year ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
There were a few more failures, but again unrelated ones. All tests are now past the point where we would see this problem if it still happened, and everything looks good. So I would say this issue can be closed.
Updated by acarvajal about 1 year ago
Not sure if it's related, but I found some fresh failures today of jobs trying to reach addresses outside of osd:
https://openqa.suse.de/tests/12205876#step/iscsi_client/32 (worker31)
https://openqa.suse.de/tests/12205879#step/iscsi_client/32 (worker29)
Checked both workers and they seem to have the br1<->eth0 IP forwarding enabled:
worker31:~ # sysctl -a |grep net.ipv4.conf.br1.forwarding
net.ipv4.conf.br1.forwarding = 1
worker31:~ # sysctl -a |grep net.ipv4.conf.eth0.forwarding
net.ipv4.conf.eth0.forwarding = 1
worker31:~ # cat /proc/sys/net/ipv4/conf/{br1,eth0}/forwarding
1
1
And:
worker29:~ # sysctl -a |grep net.ipv4.conf.br1.forwarding
net.ipv4.conf.br1.forwarding = 1
worker29:~ # sysctl -a |grep net.ipv4.conf.eth0.forwarding
net.ipv4.conf.eth0.forwarding = 1
worker29:~ # cat /proc/sys/net/ipv4/conf/{br1,eth0}/forwarding
1
1
Haven't seen the other issue (connection issues in MM job to MM job) so far. Will update the other ticket if I find some.
Edit: it seems like the issue is with connecting to 10.0.2.2 and not with external addresses specifically.
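A short sketch of checks that could be run inside the SUT to narrow down the 10.0.2.2 case (purely illustrative, not taken from the test code):

ip route get 10.0.2.2     # which interface and route would be used to reach the gateway
ping -c 3 10.0.2.2        # is the gateway reachable at all?
ip neigh show 10.0.2.2    # does neighbour (ARP) resolution for it work?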
Updated by mkittler about 1 year ago
The cluster also contained worker31 which had been disabled (see https://progress.opensuse.org/issues/135407#note-12). So maybe the worker was in a bad state at this time (worker services were running but gre/tap setup was incomplete). That the job which ran on worker29 is similarly affected is a bit more worrying. It looks like it was able to run other MM tests (e.g. https://openqa.suse.de/tests/12222248) so I guess the worker is not totally broken, though.
Edit: it seems like the issue is with connecting to 10.0.2.2 and not with external addresses specifically.
Yes, but that also shouldn't be the case.
Updated by mkittler about 1 year ago
Considering https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/622, worker31 is currently generally problematic.
Updated by livdywan about 1 year ago
- Subject changed from MM Test fails in a connection to an address outside of the worker to MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode"
Updated by mgrifalconi about 1 year ago
Got here from the autoreview comment on this test: https://openqa.suse.de/tests/12386091#
This seems to be an unrelated issue. @livdywan, could you please check and, if that is the case, fix/remove the autoreview? Thanks!
Updated by livdywan about 1 year ago
- Subject changed from MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode" to MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded"
mgrifalconi wrote in #note-24:
Got here from the autoreview comment on this test: https://openqa.suse.de/tests/12386091#
This seems to be an unrelated issue. @livdywan, could you please check and, if that is the case, fix/remove the autoreview? Thanks!
It's not a timeout, so I assume it's unrelated; hence I'm making the expression more specific.
Updated by mkittler about 1 year ago
- Status changed from Feedback to Resolved
Considering that no actually related issues have come up anymore, I suppose this ticket can be resolved.
Updated by mkittler about 1 year ago
- Subject changed from MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded" to MM Test fails in a connection to an address outside of the worker
The auto-review regex is still too generic, e.g. this ticket was wrongly referenced in https://openqa.suse.de/tests/12440238. I'm removing the regex completely because, if we see a similar symptom again, it likely makes the most sense to investigate the cause from scratch (as this kind of issue can be caused by various problems).
Updated by okurz about 1 year ago
- Status changed from Resolved to Feedback
Keep in mind that, regardless of auto-review, the same ticket reference can still be carried over if a test in the same test scenario fails in the same module as before. This is why it is prudent for every auto-review ticket, if not every ticket used as a ticket reference in openQA tests, to follow the suggestions from the template in https://github.com/os-autoinst/scripts/#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger and call openqa-query-for-job-label poo#135056,
which I did now and found:
3631705|2023-10-09 14:20:22|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||openqaworker-arm22
3631704|2023-10-09 14:06:21|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||openqaworker-arm22
3629458|2023-10-07 10:31:52|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||ip-10-252-32-28
3629459|2023-10-07 10:22:11|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||openqaworker-arm22
3626142|2023-10-06 03:00:50|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||ip-10-252-32-28
3626141|2023-10-06 02:46:45|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||openqaworker-arm22
3622659|2023-10-05 14:22:26|done|failed|extra_tests_textmode_podman_containers||openqaworker26
3622432|2023-10-05 12:57:10|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||ip-10-252-32-28
3622431|2023-10-05 12:45:33|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||ip-10-252-32-28
12429891|2023-10-09 12:04:38|done|failed|ha_autoyast_create_hdd_15sp5||imagetester
12429808|2023-10-09 11:24:33|done|failed|ha_autoyast_create_hdd_15sp5||sapworker1
12429750|2023-10-09 11:16:44|done|failed|ha_autoyast_create_hdd_15sp5||sapworker1
12404086|2023-10-06 21:46:02|done|failed|qam-sles4sap_hana_node01||worker29
12401282|2023-10-06 11:59:46|done|parallel_failed|hpc_ALPHA_openmpi_mpi_slave00||worker29
12401274|2023-10-06 11:56:32|done|parallel_failed|hpc_BETA_openmpi_mpi_master||worker-arm2
12401271|2023-10-06 11:56:31|done|failed|hpc_ALPHA_openmpi_mpi_slave00||worker-arm1
12395883|2023-10-06 00:35:20|done|failed|extratests_fips_kernelmode||worker38
12394503|2023-10-06 00:06:05|done|failed|extratests_fips_kernelmode||worker33
12395075|2023-10-05 23:13:16|done|failed|extratests_fips_kernelmode||sapworker3
I found multiple tests for which there is already a more recent rerun, but at least https://openqa.suse.de/tests/12401271 still references this ticket as the latest job in its scenario.
Updated by mkittler about 1 year ago
I removed the ticket reference from many jobs. Unfortunately, after going through the list, ./openqa-query-for-job-label poo#135056 will show more tests. I did a few rounds, but it seems like a never-ending story.
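A hypothetical sketch of that clean-up loop (comment IDs have to be looked up per job; this is not necessarily the exact procedure used):

./openqa-query-for-job-label poo#135056                                                       # list jobs still carrying the label
openqa-cli api --host https://openqa.suse.de -X DELETE jobs/<JOB_ID>/comments/<COMMENT_ID>    # remove the label comment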
Updated by mkittler about 1 year ago
- Status changed from Feedback to Resolved
I did another round of ./openqa-query-for-job-label poo#135056. That must be enough.