action #135056: MM Test fails in a connection to an address outside of the worker - openQA Infrastructure (public) - openSUSE Project Management Tool

action #135056

closed

MM Test fails in a connection to an address outside of the worker

Added by acarvajal over 1 year ago. Updated over 1 year ago.

Status:

Resolved

Priority:

Urgent

Assignee:

mkittler

Category:

Target version:

QA (public) - future

Start date:

2023-09-01

Due date:

% Done:

Estimated time:

Description

Observation¶

openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_upgrade_migration_node01@64bit fails in suseconnect_scc

The test times out in a SUSEConnect command while attempting to connect to https://scc.suse.com.

So far I have seen the same issue in the following jobs; I'm also adding the workers where these jobs ran to see if there is a pattern:

Test suite description¶

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Expected result¶

Last good: :30365:python-iniconfig (or more recent)

Further details¶

Always latest result in this scenario: latest

Related issues 1 (0 open — 1 closed)

Updated by acarvajal over 1 year ago

Looking at the list of workers, this seems related to support_server/setup issue described in #134282#note-27

Updated by okurz over 1 year ago

Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added

Updated by okurz over 1 year ago

Target version set to future

Updated by acarvajal over 1 year ago

Priority changed from Normal to Urgent

This issue was gone between Sunday, Sept. 3rd and Wednesday, Sept. 6th, but it started happening again on different workers on the 7th. For example, these jobs fail in iscsi_client while running zypper:

https://openqa.suse.de/tests/12038078#step/iscsi_client/22
https://openqa.suse.de/tests/12030835#step/iscsi_client/47
https://openqa.suse.de/tests/12030870#step/iscsi_client/22

So far I've only seen this in worker29 & worker30. This time worker37 & worker38 seem fine and worker34 is offline.

I'm increasing the priority.

Updated by okurz over 1 year ago

Target version changed from future to Ready

Updated by okurz over 1 year ago

Target version changed from Ready to future

Sorry, need to reconsider. We can't handle that in the team right now with urgent priority. You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.

Updated by acarvajal over 1 year ago

okurz wrote in #note-6:

Sorry, need to reconsider. We can't handle that in the team right now with urgent priority.

Seriously?

You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.

This message:

2023-09-07 13:16:29 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):939 THROW:    Timeout exceeded when accessing 'https://scc.suse.com/access/services/2383/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP5_x86_64'.

From https://openqa.suse.de/tests/12030870#step/iscsi_client/37 needs to be improved? How?

Or this one:

2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):960 THROW:    Download (curl) error for 'https://scc.suse.com/access/services/1931/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP2_x86_64':
2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 Error code: Connection failed
2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 Error message: Failed to connect to scc.suse.com port 443: Connection timed out

From https://openqa.suse.de/tests/12071228#step/qnetd/53

I'm open to suggestions.

Updated by srinidhir over 1 year ago

More failures when trying to reach the network outside osd,

Updated by okurz over 1 year ago

acarvajal wrote in #note-7:

okurz wrote in #note-6:

Sorry, need to reconsider. We can't handle that in the team right now with urgent priority.

Seriously?

Yes, seriously. https://os-autoinst.github.io/qa-tools-backlog-assistant/ shows our complete backlog status. In particular, regarding infrastructure maintenance, our focus needs to be on conducting the dataserver migration with all related parts. If you want to know more details and provide feedback on our plans, you can join our monthly roadmap discussion in the SUSE QE Tools workshop sessions, see https://progress.opensuse.org/projects/qa/wiki/tools#Workshop-Topics

You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.

This message:
2023-09-07 13:16:29 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):939 THROW:    Timeout exceeded when accessing 'https://scc.suse.com/access/services/2383/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP5_x86_64'.
From https://openqa.suse.de/tests/12030870#step/iscsi_client/37 needs to be improved? How?

That error message itself is quite clear, but it is not clear why there is a timeout. Is the server not resolvable at all? Can it be pinged? Can it be pinged with a small packet size while an MTU-related problem only becomes apparent with a large-packet ping? To debug further, the test should also (if it does not already) show the output of ip a, ip r and similar commands.
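The checks asked for above could be collected in a small shell helper that runs inside the SUT. This is only a sketch under assumptions, not code from the test suite: the function names `icmp_payload_for_mtu` and `diagnose_outbound` are hypothetical, the default MTU of 1500 is assumed, and `ping -M do` and `-W` require the iputils version of ping.

```shell
# Hypothetical helpers; names and the default MTU of 1500 are assumptions.

# Largest ICMP payload that fits into one IPv4 packet of the given MTU:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

diagnose_outbound() {
    host=$1
    mtu=${2:-1500}
    # Is the name resolvable at all?
    getent hosts "$host" || echo "cannot resolve $host"
    # Does a small ping get through?
    ping -c 3 -W 2 "$host" || echo "small ping to $host failed"
    # Does a full-size ping with "don't fragment" set get through?
    # A failure here while the small ping works hints at an MTU problem.
    ping -c 3 -W 2 -M do -s "$(icmp_payload_for_mtu "$mtu")" "$host" \
        || echo "full-size ping to $host failed"
    # Addresses and routes as seen by the SUT.
    ip a
    ip r
}
```

Calling `diagnose_outbound scc.suse.com` right before the failing zypper or SUSEConnect step would put the answers to these questions into the job log.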

#10

Updated by acarvajal over 1 year ago

okurz wrote in #note-9:

That error message itself is quite clear, but it is not clear why there is a timeout. Is the server not resolvable at all? Can it be pinged? Can it be pinged with a small packet size while an MTU-related problem only becomes apparent with a large-packet ping? To debug further, the test should also (if it does not already) show the output of ip a, ip r and similar commands.

https://openqa.suse.de/tests/12030870#step/hostname/27

#11

Updated by mkittler over 1 year ago

I've briefly looked at https://openqa.suse.de/tests/12070893. It looks like the SUT got an IP in https://openqa.suse.de/tests/12070893#step/hostname/27. This indeed seems similar to what I observed when previously working on #134282 (although these tests could refresh repositories, e.g. https://openqa.suse.de/tests/11821882#step/iscsi_client/5).

Note that further down in the logs there are clearer error messages like:

2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):960 THROW:    Download (curl) error for 'https://updates.suse.com/SUSE/Updates/SLE-Module-Basesystem/15-SP5/x86_64/update/repodata/repomd.xml?sFGn5uvJ5S4_i56DYilSWqkEzPpwt0b39EZhCAAW033WhiiwUwKvex5kavICK-LmJUTLiVEuqiy53d5NPP9-msAwVT9gZQJkWlgIxOqfZg0v3FPL2PisUGil0q23CaWn4Kwocks3ykg62ik9-gaoF_irCNb6aA':
2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 Error code: Connection failed
2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 Error message: Failed to connect to updates.suse.com port 443 after 18361 ms: Couldn't connect to server

#12

Updated by livdywan over 1 year ago

It may be worth noting that there were jobs failing to access scc servers as far back as 2 months ago: https://openqa.suse.de/tests/11614943#step/suseconnect_scc/20 and here's a version failing on dsc: https://openqa.suse.de/tests/11560650#step/iscsi_client/5. Unfortunately, there are no bug comments on those old jobs.

#14

Updated by jlausuch over 1 year ago

I think (but I'm not sure) that some configuration is missing in nftables on the workers, see #135524. I would need help troubleshooting this, though, as I don't have much experience with it.

#15

Updated by acarvajal over 1 year ago

livdywan wrote in #note-12:

It may be worth noting that there were jobs failing to access scc servers as far back as 2 months ago: https://openqa.suse.de/tests/11614943#step/suseconnect_scc/20 and here's a version failing on dsc: https://openqa.suse.de/tests/11560650#step/iscsi_client/5. Unfortunately, there are no bug comments on those old jobs.

That job history is interesting as it has:

Failures to access SCC servers 2 months ago (same root cause? different? I don't think it's possible to know now)
Consecutive passing results for 2 weeks following that
Then some sporadic failures (check_logs, cluster_md, ha_cluster_join ... out of these, only ha_cluster_join could be related to the current issues)
Then, starting on August 13th, multi-machine failures in iscsi_client. FYI, #134282 was opened on August 15th
Then, again passing jobs starting on the 17th, although all of these ran on the same worker. This was probably the result of a workaround by QE-SAP
Then, 9 days ago, the first of these running on multiple workers: https://openqa.suse.de/tests/12031268#dependencies (worker37 & worker39). Remember #134282#note-47, #134282#note-48 and #134282#note-59? 9 days ago is smack in the middle of those comments, which are from 8 to 11 days ago, when I thought the issue had been fixed.
Finally, the last one there is from 7 days ago and failed while attempting a connection to SCC. There are no results after that, more than likely because the parallel support server setup is failing (tracked in #134282)

#16

Updated by mkittler over 1 year ago

The scenarios https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_rolling_upgrade_migration_node01&version=15-SP4#next_previous and https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_rolling_upgrade_migration_node01&version=15-SP4#next_previous look quite good since https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987 was merged. So maybe it really helped.

#17

Updated by mkittler over 1 year ago

I've just re-triggered MM tests on https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=20.1&groupid=158 that failed 4 days ago similarly. Let's see whether they'll pass after https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987.

EDIT: It looks very good so far. The only failing cluster so far is https://openqa.suse.de/tests/12165551 but it didn't fail due to a network connection issue with an address outside of the worker. Some tests are still running/scheduled.

#18

Updated by mkittler over 1 year ago

Status changed from New to In Progress
Assignee set to mkittler

#19

Updated by mkittler over 1 year ago

Status changed from In Progress to Feedback

There were a few more failures but again unrelated ones. Now all tests are past the point where we'd see this problem if it still happened but it looks like all is good now. So I would say this issue can be closed.

#20

Updated by acarvajal over 1 year ago

Not sure if related, but found some fresh failures today of jobs trying to reach addresses outside of osd:

https://openqa.suse.de/tests/12205876#step/iscsi_client/32 (worker31)
https://openqa.suse.de/tests/12205879#step/iscsi_client/32 (worker29)

Checked both workers and they seem to have the br1<->eth0 IP forwarding enabled:

worker31:~ # sysctl -a |grep net.ipv4.conf.br1.forwarding
net.ipv4.conf.br1.forwarding = 1
worker31:~ # sysctl -a |grep net.ipv4.conf.eth0.forwarding
net.ipv4.conf.eth0.forwarding = 1
worker31:~ # cat /proc/sys/net/ipv4/conf/{br1,eth0}/forwarding
1
1

And:

worker29:~ # sysctl -a |grep net.ipv4.conf.br1.forwarding
net.ipv4.conf.br1.forwarding = 1
worker29:~ # sysctl -a |grep net.ipv4.conf.eth0.forwarding
net.ipv4.conf.eth0.forwarding = 1
worker29:~ # cat /proc/sys/net/ipv4/conf/{br1,eth0}/forwarding
1
1
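The same check can be repeated for any set of interfaces with a small loop. This is a sketch only: the function name `forwarding_state` is hypothetical, and the `PROC_ROOT` override is an assumption added so the function can be tried against a copy of the tree instead of the live `/proc`.

```shell
# Hypothetical helper: print the IPv4 forwarding flag of each given interface.
# PROC_ROOT is an assumed override; by default the live /proc tree is read.
forwarding_state() {
    root=${PROC_ROOT:-/proc/sys/net/ipv4/conf}
    for iface in "$@"; do
        file=$root/$iface/forwarding
        if [ -r "$file" ]; then
            echo "$iface=$(cat "$file")"
        else
            # The interface does not exist on this host.
            echo "$iface=missing"
        fi
    done
}
```

On the two workers above, `forwarding_state br1 eth0` would be expected to print `br1=1` and `eth0=1`, matching the sysctl output.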

Haven't seen the other issue (connection issues in MM job to MM job) so far. Will update the other ticket if I find some.

Edit: seems like issue is connecting to 10.0.2.2 and not with external addresses specifically.

#21

Updated by mkittler over 1 year ago

The cluster also contained worker31 which had been disabled (see https://progress.opensuse.org/issues/135407#note-12). So maybe the worker was in a bad state at this time (worker services were running but gre/tap setup was incomplete). That the job which ran on worker29 is similarly affected is a bit more worrying. It looks like it was able to run other MM tests (e.g. https://openqa.suse.de/tests/12222248) so I guess the worker is not totally broken, though.

Edit: seems like issue is connecting to 10.0.2.2 and not with external addresses specifically.

Yes, but that also shouldn't be the case.

#22

Updated by mkittler over 1 year ago

Considering https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/622 worker31 is currently generally problematic.

#23

Updated by livdywan over 1 year ago

Subject changed from MM Test fails in a connection to an address outside of the worker to MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode"

https://openqa.suse.de/tests/12372946

#24

Updated by mgrifalconi over 1 year ago

Got here from the autoreview comment on this test: https://openqa.suse.de/tests/12386091#
This seems an unrelated issue, @livdywan could you please check and if that is the case, fix/remove the autoreview? Thanks!

#25

Updated by livdywan over 1 year ago

Subject changed from MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode" to MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded"

mgrifalconi wrote in #note-24:

Got here from the autoreview comment on this test: https://openqa.suse.de/tests/12386091#
This seems an unrelated issue, @livdywan could you please check and if that is the case, fix/remove the autoreview? Thanks!

It's not a timeout so I assume it's unrelated, hence making the expression more specific.
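The effect of narrowing the expression can be tried against two log lines quoted in #note-7. This is a sketch assuming extended-regex semantics as in `grep -E`; the auto-review scripts may match differently, and the second line stands in for a non-timeout MediaCurl error rather than being the log of the job mentioned above. The helper name `matches` is hypothetical.

```shell
# Log lines copied from #note-7 of this ticket.
timeout_line="2023-09-07 13:16:29 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):939 THROW:    Timeout exceeded when accessing 'https://scc.suse.com/access/services/2383/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP5_x86_64'."
connect_line="2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):960 THROW:    Download (curl) error for 'https://scc.suse.com/access/services/1931/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP2_x86_64':"

# Hypothetical helper: $1 is the regex, $2 the log line.
matches() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

broad='MediaCurl.cc.+evaluateCurlCode'
narrow='MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded'

matches "$broad" "$timeout_line"
matches "$broad" "$connect_line"
matches "$narrow" "$timeout_line"
matches "$narrow" "$connect_line"
```

The broad expression matches both lines, while the narrower one matches only the timeout line, so jobs failing with other curl errors are no longer labelled with this ticket.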

#26

Updated by mkittler over 1 year ago

Status changed from Feedback to Resolved

Since no actually related issues have come up anymore, I suppose this ticket can be resolved.

#27

Updated by mkittler over 1 year ago

Subject changed from MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded" to MM Test fails in a connection to an address outside of the worker

The auto review regex is still too generic, e.g. this ticket was wrongly referenced in https://openqa.suse.de/tests/12440238. I'm removing the regex completely because if we see a similar symptom again it likely makes most sense to investigate the cause from scratch (as this kind of issue can be caused by various problems).

#28

Updated by okurz over 1 year ago

Status changed from Resolved to Feedback

Keep in mind that, regardless of auto-review, the same ticket reference can still be carried over if a test in the same scenario fails in the same module as before. This is why it is prudent for every auto-review ticket, if not every ticket used as a ticket reference in openQA tests, to follow the suggestions from the template in https://github.com/os-autoinst/scripts/#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger and call openqa-query-for-job-label poo#135056, which I did now and found:

3631705|2023-10-09 14:20:22|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||openqaworker-arm22
3631704|2023-10-09 14:06:21|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||openqaworker-arm22
3629458|2023-10-07 10:31:52|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||ip-10-252-32-28
3629459|2023-10-07 10:22:11|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||openqaworker-arm22
3626142|2023-10-06 03:00:50|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||ip-10-252-32-28
3626141|2023-10-06 02:46:45|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||openqaworker-arm22
3622659|2023-10-05 14:22:26|done|failed|extra_tests_textmode_podman_containers||openqaworker26
3622432|2023-10-05 12:57:10|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||ip-10-252-32-28
3622431|2023-10-05 12:45:33|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||ip-10-252-32-28
12429891|2023-10-09 12:04:38|done|failed|ha_autoyast_create_hdd_15sp5||imagetester
12429808|2023-10-09 11:24:33|done|failed|ha_autoyast_create_hdd_15sp5||sapworker1
12429750|2023-10-09 11:16:44|done|failed|ha_autoyast_create_hdd_15sp5||sapworker1
12404086|2023-10-06 21:46:02|done|failed|qam-sles4sap_hana_node01||worker29
12401282|2023-10-06 11:59:46|done|parallel_failed|hpc_ALPHA_openmpi_mpi_slave00||worker29
12401274|2023-10-06 11:56:32|done|parallel_failed|hpc_BETA_openmpi_mpi_master||worker-arm2
12401271|2023-10-06 11:56:31|done|failed|hpc_ALPHA_openmpi_mpi_slave00||worker-arm1
12395883|2023-10-06 00:35:20|done|failed|extratests_fips_kernelmode||worker38
12394503|2023-10-06 00:06:05|done|failed|extratests_fips_kernelmode||worker33
12395075|2023-10-05 23:13:16|done|failed|extratests_fips_kernelmode||sapworker3

I found multiple tests where there is already a more recent rerun but at least https://openqa.suse.de/tests/12401271 still references this ticket as the latest job in this scenario.
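A listing like the one above can be summarised per worker to see where the stale references cluster. This is a sketch only: the function name `jobs_per_worker` is hypothetical, and the assumption that the seventh `|`-separated field is the worker is inferred from the listing, not taken from the script's documentation.

```shell
# Hypothetical helper: count jobs per worker from the listing on stdin.
# Assumes the 7th '|'-separated field is the worker name.
jobs_per_worker() {
    awk -F'|' 'NF >= 7 { count[$7]++ }
               END { for (w in count) print count[w], w }' \
        | sort -k1,1nr -k2,2
}
```

It would be fed from the query, e.g. `./openqa-query-for-job-label poo#135056 | jobs_per_worker`, and prints one line per worker with the highest count first.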

#29

Updated by mkittler over 1 year ago

I removed the ticket reference from many jobs. Unfortunately, after going through the list, ./openqa-query-for-job-label poo#135056 shows more tests. I did a few rounds, but it seems like a never-ending story.

#30

Updated by mkittler over 1 year ago

Status changed from Feedback to Resolved

I did another round of ./openqa-query-for-job-label poo#135056. That must be enough.
