action #135056 (closed)

MM Test fails in a connection to an address outside of the worker

Added by acarvajal over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2023-09-01
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_upgrade_migration_node01@64bit fails in suseconnect_scc

The test times out in a SUSEConnect command while attempting connections to https://scc.suse.com.

So far I have seen the same issue in the following jobs; I'm also adding the workers where these jobs ran to see if there is a pattern:

https://openqa.suse.de/tests/11977452#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11968267#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968328#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11968329#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968336#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968418#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11968417#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11965548#step/suseconnect_scc/20 / worker38
https://openqa.suse.de/tests/11965600#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11975401#step/suseconnect_scc/20 / worker37
https://openqa.suse.de/tests/11975400#step/suseconnect_scc/20 / worker34
https://openqa.suse.de/tests/11981360#step/suseconnect_scc/20 / worker38
https://openqa.suse.de/tests/11975460#step/suseconnect_scc/20 / worker34

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Expected result

Last good: :30365:python-iniconfig (or more recent)

Further details

Always latest result in this scenario: latest


Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Actions #1

Updated by acarvajal over 1 year ago

Looking at the list of workers, this seems related to the support_server/setup issue described in #134282#note-27

Actions #2

Updated by okurz over 1 year ago

  • Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #3

Updated by okurz over 1 year ago

  • Target version set to future
Actions #4

Updated by acarvajal over 1 year ago

  • Priority changed from Normal to Urgent

This issue was gone between Sunday, Sept. 3rd and Wednesday, Sept. 6th, but it started happening again on different workers on the 7th, for example these jobs failing in iscsi_client while running zypper:

https://openqa.suse.de/tests/12038078#step/iscsi_client/22
https://openqa.suse.de/tests/12030835#step/iscsi_client/47
https://openqa.suse.de/tests/12030870#step/iscsi_client/22

So far I've only seen this in worker29 & worker30. This time worker37 & worker38 seem fine and worker34 is offline.

I'm increasing the priority.

Actions #5

Updated by okurz over 1 year ago

  • Target version changed from future to Ready
Actions #6

Updated by okurz over 1 year ago

  • Target version changed from Ready to future

Sorry, need to reconsider. We can't handle that in the team right now with urgent priority. You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.

Actions #7

Updated by acarvajal over 1 year ago

okurz wrote in #note-6:

Sorry, need to reconsider. We can't handle that in the team right now with urgent priority.

Seriously?

You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.

This message:

2023-09-07 13:16:29 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):939 THROW:    Timeout exceeded when accessing 'https://scc.suse.com/access/services/2383/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP5_x86_64'.

From https://openqa.suse.de/tests/12030870#step/iscsi_client/37 needs to be improved? How?

Or this one:

2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):960 THROW:    Download (curl) error for 'https://scc.suse.com/access/services/1931/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP2_x86_64':
2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 Error code: Connection failed
2023-09-10 20:30:33 <5> qdevice-node03(2977) [zypp-core] Exception.cc(log):186 Error message: Failed to connect to scc.suse.com port 443: Connection timed out

From https://openqa.suse.de/tests/12071228#step/qnetd/53

I'm open to suggestions.

Actions #9

Updated by okurz over 1 year ago

acarvajal wrote in #note-7:

okurz wrote in #note-6:

Sorry, need to reconsider. We can't handle that in the team right now with urgent priority.

Seriously?

Yes, seriously. https://os-autoinst.github.io/qa-tools-backlog-assistant/ shows our complete backlog status. In particular, regarding the infrastructure maintenance, our focus needs to be on conducting the dataserver migration with all related parts. If you want to know more details and provide your feedback on the plans we have, you can join our monthly roadmap discussion in the SUSE QE Tools workshop sessions, see https://progress.opensuse.org/projects/qa/wiki/tools#Workshop-Topics

You need to find somebody else to work on this. In particular, I suggest improving the error reporting from the tests.

This message:

2023-09-07 13:16:29 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):939 THROW:    Timeout exceeded when accessing 'https://scc.suse.com/access/services/2383/repo/repoindex.xml?cookies=0&credentials=Basesystem_Module_15_SP5_x86_64'.

From https://openqa.suse.de/tests/12030870#step/iscsi_client/37 needs to be improved? How?

That error message itself is quite clear, but then it's not clear why there is a timeout. Is the server not resolvable at all? Can it be pinged? Can it be pinged with a low packet size, but is there an MTU-related problem which would be apparent with a high-packet-size ping? Also, to debug further, the test should (if not already done) show the output of ip a, ip r and such.
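
For illustration, the kind of additional checks meant here could look like the following. This is only a sketch using standard Linux tools, not an existing test module:

getent hosts scc.suse.com               # is the name resolvable at all?
ping -c 3 scc.suse.com                  # basic reachability
ping -c 3 -M do -s 1400 scc.suse.com    # large non-fragmenting ping to spot MTU-related problems
ip a                                    # interface and address state of the SUT
ip r                                    # routing table and default gateway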

Actions #10

Updated by acarvajal over 1 year ago

okurz wrote in #note-9:

That error message itself is quite clear, but then it's not clear why there is a timeout. Is the server not resolvable at all? Can it be pinged? Can it be pinged with a low packet size, but is there an MTU-related problem which would be apparent with a high-packet-size ping? Also, to debug further, the test should (if not already done) show the output of ip a, ip r and such.

https://openqa.suse.de/tests/12030870#step/hostname/27

Actions #11

Updated by mkittler over 1 year ago

I've been briefly looking at https://openqa.suse.de/tests/12070893. It looks like the SUT got an IP in https://openqa.suse.de/tests/12070893#step/hostname/27. This indeed seems similar to what I observed previously when working on #134282 (although these tests could refresh repositories, e.g. https://openqa.suse.de/tests/11821882#step/iscsi_client/5).

Note that further down in the logs there are clearer error messages like:

2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 MediaCurl.cc(evaluateCurlCode):960 THROW:    Download (curl) error for 'https://updates.suse.com/SUSE/Updates/SLE-Module-Basesystem/15-SP5/x86_64/update/repodata/repomd.xml?sFGn5uvJ5S4_i56DYilSWqkEzPpwt0b39EZhCAAW033WhiiwUwKvex5kavICK-LmJUTLiVEuqiy53d5NPP9-msAwVT9gZQJkWlgIxOqfZg0v3FPL2PisUGil0q23CaWn4Kwocks3ykg62ik9-gaoF_irCNb6aA':
2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 Error code: Connection failed
2023-09-07 13:22:47 <5> hana-node02(3466) [zypp-core] Exception.cc(log):186 Error message: Failed to connect to updates.suse.com port 443 after 18361 ms: Couldn't connect to server
Actions #12

Updated by livdywan over 1 year ago

It may be worth noting that there were jobs failing to access the SCC servers as far back as 2 months ago: https://openqa.suse.de/tests/11614943#step/suseconnect_scc/20, and here's a version failing on dsc: https://openqa.suse.de/tests/11560650#step/iscsi_client/5 - unfortunately, there are no bug comments on those old jobs.

Actions #14

Updated by jlausuch over 1 year ago

I think (but I'm not sure) there is some missing nftables configuration on the workers (see #135524), but I would need help troubleshooting this as I don't have much experience with it.
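
For reference, inspecting the firewall state on a worker could look like the following. This is only a sketch; the zone names below are assumptions, not confirmed configuration from the workers:

worker:~ # nft list ruleset | grep -B2 -A2 masquerade      # is traffic from the SUT network NATed at all?
worker:~ # firewall-cmd --list-all --zone=trusted          # interfaces and forwarding settings of the zone assumed to hold br1
worker:~ # firewall-cmd --query-masquerade --zone=public   # is masquerading enabled on the outbound zone?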

Actions #15

Updated by acarvajal over 1 year ago

livdywan wrote in #note-12:

It may be worth noting that there were jobs failing to access the SCC servers as far back as 2 months ago: https://openqa.suse.de/tests/11614943#step/suseconnect_scc/20, and here's a version failing on dsc: https://openqa.suse.de/tests/11560650#step/iscsi_client/5 - unfortunately, there are no bug comments on those old jobs.

That job history is interesting as it has:

  • Failures to access the SCC servers 2 months ago (same root cause? different? I don't think it's possible to know now)
  • Consecutive passing results for the 2 weeks following that
  • Then some sporadic failures (check_logs, cluster_md, ha_cluster_join ... of these, only ha_cluster_join could be related to the current issues)
  • Then, starting on August 13th, multi-machine failures in iscsi_client. FYI, #134282 was opened on August 15th
  • Then passing jobs again starting on the 17th; however, all of these ran on the same worker. This was probably the result of a workaround by QE-SAP
  • Then, 9 days ago, the first of these running on multiple workers: https://openqa.suse.de/tests/12031268#dependencies (worker37 & worker39). Remember #134282#note-47, #134282#note-48 and #134282#note-59? 9 days ago is smack in the middle of those comments, which are from 8 to 11 days ago, when I thought the issue had been fixed.
  • Finally, the last one there is from 7 days ago, failing while attempting a connection to SCC. There are no results after that, more than likely because the parallel support server setup is failing (tracked in #134282)
Actions #17

Updated by mkittler over 1 year ago

I've just re-triggered the MM tests on https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=20.1&groupid=158 that similarly failed 4 days ago. Let's see whether they'll pass after https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987.

EDIT: It looks very good so far. The only failing cluster is https://openqa.suse.de/tests/12165551, but it didn't fail due to a network connection issue with an address outside of the worker. Some tests are still running/scheduled.

Actions #18

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #19

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

There were a few more failures, but again unrelated ones. Now all tests are past the point where we'd see this problem if it still happened, and it looks like all is good. So I would say this issue can be closed.

Actions #20

Updated by acarvajal about 1 year ago

Not sure if related, but found some fresh failures today of jobs trying to reach addresses outside of osd:

https://openqa.suse.de/tests/12205876#step/iscsi_client/32 (worker31)
https://openqa.suse.de/tests/12205879#step/iscsi_client/32 (worker29)

Checked both workers and they seem to have the br1<->eth0 IP forwarding enabled:

worker31:~ # sysctl -a |grep net.ipv4.conf.br1.forwarding
net.ipv4.conf.br1.forwarding = 1
worker31:~ # sysctl -a |grep net.ipv4.conf.eth0.forwarding
net.ipv4.conf.eth0.forwarding = 1
worker31:~ # cat /proc/sys/net/ipv4/conf/{br1,eth0}/forwarding
1
1

And:

worker29:~ # sysctl -a |grep net.ipv4.conf.br1.forwarding
net.ipv4.conf.br1.forwarding = 1
worker29:~ # sysctl -a |grep net.ipv4.conf.eth0.forwarding
net.ipv4.conf.eth0.forwarding = 1
worker29:~ # cat /proc/sys/net/ipv4/conf/{br1,eth0}/forwarding
1
1
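
A possible follow-up check (only a sketch, not something that was run here) would be to confirm with tcpdump whether the SUTs' packets for port 443 actually show up on br1 and leave the worker via eth0:

worker:~ # tcpdump -ni br1 tcp port 443     # traffic arriving from the SUTs on the bridge
worker:~ # tcpdump -ni eth0 tcp port 443    # the same traffic (presumably NATed) leaving towards the outside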

Haven't seen the other issue (connection issues in MM job to MM job) so far. Will update the other ticket if I find some.

Edit: it seems the issue is with connecting to 10.0.2.2 and not with external addresses specifically.

Actions #21

Updated by mkittler about 1 year ago

The cluster also contained worker31, which had been disabled (see https://progress.opensuse.org/issues/135407#note-12). So maybe the worker was in a bad state at this time (worker services were running, but the gre/tap setup was incomplete). That the job which ran on worker29 is similarly affected is a bit more worrying. It looks like it was able to run other MM tests (e.g. https://openqa.suse.de/tests/12222248), so I guess the worker is not totally broken, though.

Edit: it seems the issue is with connecting to 10.0.2.2 and not with external addresses specifically.

Yes, but that also shouldn't be the case.
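
For context: in the usual openQA multi-machine setup, 10.0.2.2 is the address of the worker's br1 bridge and serves as the SUTs' default gateway. A quick check from inside a SUT could look like this (sketch only):

ping -c 3 10.0.2.2     # is the worker-side gateway reachable from the SUT?
ip r | grep default    # does the SUT's default route actually point at 10.0.2.2?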

Actions #22

Updated by mkittler about 1 year ago

Considering https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/622, worker31 is currently generally problematic.

Actions #23

Updated by livdywan about 1 year ago

  • Subject changed from MM Test fails in a connection to an address outside of the worker to MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode"
Actions #24

Updated by mgrifalconi about 1 year ago

Got here from the autoreview comment on this test: https://openqa.suse.de/tests/12386091#
This seems to be an unrelated issue; @livdywan, could you please check and, if that is the case, fix/remove the autoreview? Thanks!

Actions #25

Updated by livdywan about 1 year ago

  • Subject changed from MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode" to MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded"

mgrifalconi wrote in #note-24:

Got here from the autoreview comment on this test: https://openqa.suse.de/tests/12386091#
This seems to be an unrelated issue; @livdywan, could you please check and, if that is the case, fix/remove the autoreview? Thanks!

It's not a timeout, so I assume it's unrelated; hence I'm making the expression more specific.
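
To illustrate the difference (a sketch only; the actual matching is done by the auto-review tooling against the job logs, e.g. autoinst-log.txt):

grep -E 'MediaCurl.cc.+evaluateCurlCode' autoinst-log.txt                      # old pattern: matches any curl error evaluated in MediaCurl.cc, including plain "Connection failed"
grep -E 'MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded' autoinst-log.txt    # new pattern: only matches the timeout case tracked in this ticket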

Actions #26

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

Considering that no actually related issues have come up anymore, I suppose this ticket can be resolved.

Actions #27

Updated by mkittler about 1 year ago

  • Subject changed from MM Test fails in a connection to an address outside of the worker auto_review:"MediaCurl.cc.+evaluateCurlCode.+Timeout exceeded" to MM Test fails in a connection to an address outside of the worker

The auto-review regex is still too generic; e.g. this ticket was wrongly referenced in https://openqa.suse.de/tests/12440238. I'm removing the regex completely because, if we see a similar symptom again, it likely makes the most sense to investigate the cause from scratch (as this kind of issue can be caused by various problems).

Actions #28

Updated by okurz about 1 year ago

  • Status changed from Resolved to Feedback

Keep in mind that, regardless of auto-review, the same ticket reference can still be carried over if a test in the same test scenario fails in the same module as before. This is why it is prudent for every auto-review ticket, if not every ticket used as a ticket reference in openQA tests, to follow the suggestions from the template in https://github.com/os-autoinst/scripts/#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger and call openqa-query-for-job-label poo#135056, which I did now and found:

3631705|2023-10-09 14:20:22|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||openqaworker-arm22
3631704|2023-10-09 14:06:21|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||openqaworker-arm22
3629458|2023-10-07 10:31:52|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||ip-10-252-32-28
3629459|2023-10-07 10:22:11|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||openqaworker-arm22
3626142|2023-10-06 03:00:50|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||ip-10-252-32-28
3626141|2023-10-06 02:46:45|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||openqaworker-arm22
3622659|2023-10-05 14:22:26|done|failed|extra_tests_textmode_podman_containers||openqaworker26
3622432|2023-10-05 12:57:10|done|failed|extra_tests_kernel:investigate:last_good_tests_and_build:b2240c20072a4b93f57e5bad3ac54f95af9d0a99+20230910||ip-10-252-32-28
3622431|2023-10-05 12:45:33|done|failed|extra_tests_kernel:investigate:last_good_build:20230910||ip-10-252-32-28
12429891|2023-10-09 12:04:38|done|failed|ha_autoyast_create_hdd_15sp5||imagetester
12429808|2023-10-09 11:24:33|done|failed|ha_autoyast_create_hdd_15sp5||sapworker1
12429750|2023-10-09 11:16:44|done|failed|ha_autoyast_create_hdd_15sp5||sapworker1
12404086|2023-10-06 21:46:02|done|failed|qam-sles4sap_hana_node01||worker29
12401282|2023-10-06 11:59:46|done|parallel_failed|hpc_ALPHA_openmpi_mpi_slave00||worker29
12401274|2023-10-06 11:56:32|done|parallel_failed|hpc_BETA_openmpi_mpi_master||worker-arm2
12401271|2023-10-06 11:56:31|done|failed|hpc_ALPHA_openmpi_mpi_slave00||worker-arm1
12395883|2023-10-06 00:35:20|done|failed|extratests_fips_kernelmode||worker38
12394503|2023-10-06 00:06:05|done|failed|extratests_fips_kernelmode||worker33
12395075|2023-10-05 23:13:16|done|failed|extratests_fips_kernelmode||sapworker3

I found multiple tests where there is already a more recent rerun, but at least https://openqa.suse.de/tests/12401271 still references this ticket as the latest job in this scenario.

Actions #29

Updated by mkittler about 1 year ago

I removed the ticket reference from many jobs. Unfortunately, after going through the list, ./openqa-query-for-job-label poo#135056 will show more tests. I did a few rounds, but it seems like a never-ending story.

Actions #30

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

I did another round of ./openqa-query-for-job-label poo#135056. That must be enough.
