action #107062


Multiple failures due to network issues

Added by jlausuch about 2 years ago. Updated about 1 year ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2021-09-27
Due date:
% Done:

77%

Estimated time:
(Total: 0.00 h)
Difficulty:
Tags:

Description

Observation

I will use this ticket to collect the different errors I observe in our tests (at least for the QE-C squad) that fail due to network issues.
Normally a restart helps to get the job green again (in case we need it green), but that is not an ideal solution.

The idea of this ticket is to collect more potential issues caught by reviewers and to propose solutions for some of them, either in the code (retrying the same command several times might help) or on the infra side.
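As a rough illustration of the in-code retry idea, here is a minimal sketch of a test step wrapped with the script_retry helper from utils; the concrete command, retry count and timeouts below are only examples, not taken from any of the failing tests:

# Sketch only: retry a network-dependent command instead of dying on the first hiccup.
use base 'consoletest';
use strict;
use warnings;
use testapi;
use utils 'script_retry';

sub run {
    # Try up to 3 times with a delay between attempts; the module only fails
    # if every attempt times out or returns a non-zero exit code.
    script_retry('zypper -n ref', retry => 3, delay => 60, timeout => 300);
}

1;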

There is an example below for each error I found, but from my experience reviewing jobs every day, these failures happen multiple times a day and at random (difficult to predict).

1) SUSEConnect timeouts -> https://openqa.suse.de/tests/8189768#step/docker/34
Test died: command 'SUSEConnect -p sle-module-containers/${VERSION_ID}/${CPU} ' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.

Or https://openqa.suse.de/tests/8193554#step/suseconnect_scc/8
Test died: command 'SUSEConnect -r $regcode' timed out at /usr/lib/os-autoinst/testapi.pm line 950.

2) updates.suse.com not reachable -> https://openqa.suse.de/tests/8189697#step/image_docker/1110

Retrieving: kmod-25-6.10.1.aarch64.rpm [.........error]
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
Download (curl) error for 'https://updates.suse.com/SUSE/Updates/SLE-Module-Basesystem/15-SP2/aarch64/update/aarch64/kmod-25-6.10.1.aarch64.rpm?nE0jiYdfiOdLYjH0o-llNN2xIDXncon0vYw8z1aBPGx00H9S1eN413vUsfSJnzFrVz-CoZoGtSdsPKIDRAOQy3Xw2Tac3Yx5_1i8TPomSNiqhDJ0Ayxro23n46NHHB-XHq669RlHs17wiUFSJiSMCSh-YzdGdFw':
Error code: Connection failed
Error message: Could not resolve host: updates.suse.com

Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
Please see the above error message for a hint.

3) SCC timeouts -> https://openqa.suse.de/tests/8189613#step/image_docker/316

docker run --entrypoint /usr/lib/zypp/plugins/services/container-suseconnect-zypp -i zypper_docker_derived lp
...
2022/02/18 07:16:19 Installed product: SLES-12.3-x86_64
2022/02/18 07:16:19 Registration server set to https://scc.suse.com
2022/02/18 07:16:30 Get https://scc.suse.com/connect/subscriptions/products?arch=x86_64&identifier=SLES&version=12.3: dial tcp: lookup scc.suse.com on 10.0.2.3:53: read udp 172.17.0.2:37151->10.0.2.3:53: i/o timeout

4) zypper ref timeout or error -> https://openqa.opensuse.org/tests/2193730#step/image_podman/124

podman run -i --name 'refreshed' --entrypoint '' registry.opensuse.org/opensuse/leap/15.3/images/totest/containers/opensuse/leap:15.3 zypper -nv ref
...
Retrieving: cb71cb070e8aac79327e6f1b6edc5317122ca1f72970299c3cb2cf505e18b27f-deltainfo.xml.gz [........................done (82.3 KiB/s)]
Retrieving: 832729371fe20bc1a4d27e59d76c10ffe2c0b5a1ff71c4e934e7a11baa24a74b-primary.xml.gz [............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................error (87.0 KiB/s)]
WJdDM-124-

Acceptance criteria

  • AC1: All existing subtasks are resolved, no additional work needed on top

Files

Screenshot 2022-02-21 at 11.03.33.png (31.4 KB) jlausuch, 2022-02-21 10:03
canvas.png (25.5 KB) pdostal, 2022-03-01 14:04
expert.jpg (60.9 KB) jstehlik, 2022-03-22 13:11
scc_timeout.png (11.7 KB) jlausuch, 2022-03-23 09:35
Screenshot_2022-04-12_22-16-57.png (80.2 KB) "no response, no log, no connection?" dzedro, 2022-04-13 09:01

Subtasks 11 (3 open, 8 closed)

action #99345: [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M (Resolved, mkittler, 2021-09-27)

openQA Infrastructure - action #108266: grenache: script_run() commands randomly time out since server room move (New, 2022-03-14)

openQA Infrastructure - action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:M (Resolved, nicksinger, 2022-03-24)

openQA Infrastructure - action #108872: Outdated information on openqaw5-xen https://racktables.suse.de/index.php?page=object&tab=default&object_id=3468 (New, cachen)

openQA Infrastructure - action #108896: [ppc64le] auto_review:"(?s)Size of.*differs, expected.*but downloaded.*Download.*failed: 521 Connect timeout":retry (Resolved, okurz, 2022-03-24)

action #108953: [tools] Performance issues in some s390 workers (Resolved, okurz, 2022-03-25)

openQA Infrastructure - action #109241: Prefer to use domain names rather than IPv4 in salt pillars size:M (Resolved, okurz)

openQA Infrastructure - action #109253: Add monitoring for SUSE QA network infrastructure size:M (Resolved, jbaier_cz)

openQA Infrastructure - action #120169: Make s390x kvm workers also use FQDN instead of IPv4 in salt pillars for VIRSH_GUEST (New, 2022-11-09)

openQA Infrastructure - action #120261: tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+:.*failed at.*":retry size:meow (Resolved, mkittler, 2022-11-10)

openQA Infrastructure - action #121672: [virtualization] Connectivity issues on worker8-vmware.oqa.suse.de (Resolved, okurz, 2022-12-07)

Related issues 4 (1 open, 3 closed)

Related to openQA Tests - action #107635: [qem][y] test fails in installation (New, 2022-02-25)

Related to openQA Infrastructure - action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min (Rejected, mkittler, 2022-03-21)

Related to openQA Tests - action #113528: [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly) (Resolved, szarate, 2022-07-13 to 2022-07-18)

Related to openQA Infrastructure - action #113716: [qe-core] proxy-scc is down (Resolved, szarate, 2022-07-18 to 2022-07-19)

Actions #1

Updated by jlausuch about 2 years ago

Actions #2

Updated by jlausuch about 2 years ago

I have created this proposal for adding a product/module with SUSEConnect:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14283

Actions #4

Updated by jlausuch about 2 years ago

  • Description updated (diff)
Actions #8

Updated by jlausuch about 2 years ago

New PR to create a retry wrapper for the validate_script_output method:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14295

This will help with validate_script_output calls, which are prone to time out when the network is slow.
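For context, such a retry wrapper could look roughly like the sketch below; this is only an illustration of the approach, not the code from the PR (the function name and the retry/delay defaults are made up):

# Hypothetical wrapper: re-run validate_script_output a few times before giving up.
use strict;
use warnings;
use testapi;    # provides validate_script_output and record_info

sub validate_script_output_with_retry {
    my ($script, $check, %args) = @_;
    my $retry   = delete $args{retry}   // 3;
    my $delay   = delete $args{delay}   // 30;
    my $timeout = delete $args{timeout} // 90;

    for my $attempt (1 .. $retry) {
        # validate_script_output dies on mismatch or timeout, so trap it with eval.
        my $ok = eval { validate_script_output($script, $check, $timeout); 1 };
        return if $ok;
        die "validate_script_output still failing after $retry attempts: $@" if $attempt == $retry;
        record_info('Retry', "Attempt $attempt/$retry failed, retrying in ${delay}s: $@");
        sleep $delay;
    }
}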

Actions #10

Updated by jlausuch about 2 years ago

dzedro wrote:

A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20

For this, we can use script_retry, similar to what I did here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14327

https://openqa.suse.de/tests/8198732#step/scc_registration/16

This one is trickier: since it's the installation UI, we can't use any in-code retry here (it would need needle handling for the error cases).

Actions #11

Updated by okurz about 2 years ago

  • Related to action #107635: [qem][y] test fails in installation added
Actions #12

Updated by okurz about 2 years ago

As my suggestion over other channels might have gotten lost: I suggest getting in contact with the SCC team and the EngInfra team, e.g. creating bugs in their respective issue trackers and linking them here, or pinging them over chat or email and inviting them to brainstorm and investigate this issue. We don't know where the problem is, but I am sure that together with experts from other teams we have enough brain power to solve this riddle. It could be that there is a problem within the product that we or you test, and then of course we should handle it as a product problem or regression that should be fixed, because then customers will also be affected. It could be a problem in the user-facing infrastructure, especially for updates.suse.com on a CDN, where customers would or could also be affected. Then as well we should not just accept this issue but look into it and try to solve it together.

Actions #13

Updated by jlausuch about 2 years ago

okurz wrote:

As my suggestion over other channels might have gotten lost: I suggest getting in contact with the SCC team and the EngInfra team, e.g. creating bugs in their respective issue trackers and linking them here, or pinging them over chat or email and inviting them to brainstorm and investigate this issue. We don't know where the problem is, but I am sure that together with experts from other teams we have enough brain power to solve this riddle. It could be that there is a problem within the product that we or you test, and then of course we should handle it as a product problem or regression that should be fixed, because then customers will also be affected. It could be a problem in the user-facing infrastructure, especially for updates.suse.com on a CDN, where customers would or could also be affected. Then as well we should not just accept this issue but look into it and try to solve it together.

Yes, that would be ideal, but first we would need to collect proof and really have specific tests for SCC/updates.suse.de/etc. that check the connectivity (I think there is a ticket about that) and collect some statistics to see how it behaves over time. Maybe we can recognize a pattern in when these hiccups happen, or maybe we can't. I think this is quite a complex scenario to define. I personally don't have the time to drive this initiative forward.

Actions #14

Updated by jlausuch about 2 years ago

Again: https://openqa.suse.de/tests/8242917#step/image_docker/705

> docker run --entrypoint /usr/lib/zypp/plugins/services/container-suseconnect-zypp -i zypper_docker_derived lm
2022/02/28 13:53:28 Installed product: SLES-12.4-x86_64
2022/02/28 13:53:28 Registration server set to https://scc.suse.com
2022/02/28 13:53:42 Get "https://scc.suse.com/connect/subscriptions/products?arch=x86_64&identifier=SLES&version=12.4": dial tcp: lookup scc.suse.com on 10.0.2.3:53: read udp 172.17.0.2:39374->10.0.2.3:53: i/o timeout

Hopefully, these types of failures will be worked around by https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14295

Actions #15

Updated by jlausuch about 2 years ago

And more:
https://openqa.suse.de/tests/8243068#step/registry/156

> SUSEConnect --status-text
> EOT_L_NVX
L_NVX-0-
# echo L_NVX; bash -oe pipefail /tmp/scriptL_NVX.sh ; echo SCRIPT_FINISHEDL_NVX-$?-
L_NVX
Error: Cannot parse response from server
Actions #16

Updated by jlausuch about 2 years ago

An interesting one, which failed even after 3 retries... and this one has nothing to do with our SCC or any suse.de domain:
https://openqa.suse.de/tests/8245957#step/docker_3rd_party_images/825

# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-

I guess this time we can blame RH :)

Actions #17

Updated by pdostal about 2 years ago

jlausuch wrote:

dzedro wrote:

A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20

For this, we can use script_retry, similar to what I did here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14327

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14383

Actions #18

Updated by pdostal about 2 years ago

jlausuch wrote:

dzedro wrote:

https://openqa.suse.de/tests/8198732#step/scc_registration/16

This one is trickier: since it's the installation UI, we can't use any in-code retry here (it would need needle handling for the error cases).

YaST2 Installation - Connection to registration server failed. (I'm just documenting this so it's not cleaned by openQA).

Actions #20

Updated by jlausuch about 2 years ago

jlausuch wrote:

I have created this proposal for adding a product/module with SUSEConnect:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14283

Follow-up: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14402

Actions #21

Updated by jlausuch about 2 years ago

A new one today, in a parent job that many other jobs depend on...
Installation: https://openqa.suse.de/tests/8261955#step/scc_registration/31
Cannot parse the data from the server.

Actions #22

Updated by jlausuch about 2 years ago

For the record:
https://openqa.suse.de/tests/8261596#step/image_docker/1491

# timeout 600 docker exec refreshed zypper -nv ref; echo Uyq3l-$?-
Entering non-interactive mode.
Verbosity: 2
Initializing Target
Refreshing service 'container-suseconnect-zypp'.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp] 
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Specified repositories: 
Warning: There are no enabled repositories defined.
Use 'zypper addrepo' or 'zypper modifyrepo' commands to add or enable repositories.

This failed even after 3 retries, and the subsequent modules also failed with similar connectivity issues:

> SUSEConnect --status-text
> EOT_L_NVX
L_NVX-0-
# echo L_NVX; bash -oe pipefail /tmp/scriptL_NVX.sh ; echo SCRIPT_FINISHEDL_NVX-$?-
L_NVX
SUSEConnect error: SocketError: getaddrinfo: Temporary failure in name resolution

which could be partially avoided with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14409

Actions #23

Updated by dzedro about 2 years ago

I guess it's time to fix the real network or SCC issue.

Actions #24

Updated by jlausuch about 2 years ago

dzedro wrote:

I guess it's time to fix the real network or SCC issue.

+1

We just need a driver for that :)

Actions #26

Updated by jlausuch about 2 years ago

https://openqa.suse.de/tests/8262854#step/image_docker/921

# docker run --rm -ti registry.suse.com/suse/sle15:15.2 zypper ls | grep 'Usage:'; echo GcdHR-$?-
GcdHR-1-
# cat > /tmp/scriptTBhqL.sh << 'EOT_TBhqL'; echo TBhqL-$?-
> docker run -i --entrypoint '' registry.suse.com/suse/sle15:15.2 zypper lr -s
> EOT_TBhqL
TBhqL-0-
# echo TBhqL; bash -oe pipefail /tmp/scriptTBhqL.sh ; echo SCRIPT_FINISHEDTBhqL-$?-
TBhqL
Refreshing service 'container-suseconnect-zypp'.
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Warning: No repositories defined.
Use the 'zypper addrepo' command to add one or more repositories.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp] 
SCRIPT_FINISHEDTBhqL-6-
Actions #28

Updated by jlausuch about 2 years ago

SLE Micro timeout in transactional-update migration:
https://openqa.suse.de/tests/8268653#step/zypper_migration/9

Calling zypper migration --no-snapshots --no-selfupdate
2022-03-04 01:25:26 tukit 3.4.0 started
2022-03-04 01:25:26 Options: call 5 zypper migration --no-snapshots --no-selfupdate 
2022-03-04 01:25:26 Executing `zypper migration --no-snapshots --no-selfupdate`:


Executing 'zypper  refresh'

Repository 'SUSE-MicroOS-5.0-Pool' is up to date.
Repository 'SUSE-MicroOS-5.0-Updates' is up to date.
Repository 'TEST_0' is up to date.
Repository 'TEST_1' is up to date.
Repository 'TEST_10' is up to date.
Repository 'TEST_11' is up to date.
Repository 'TEST_12' is up to date.
Repository 'TEST_2' is up to date.
Repository 'TEST_3' is up to date.
Repository 'TEST_4' is up to date.
Repository 'TEST_5' is up to date.
Repository 'TEST_6' is up to date.
Repository 'TEST_7' is up to date.
Repository 'TEST_8' is up to date.
Repository 'TEST_9' is up to date.
All repositories have been refreshed.
Can't determine the list of installed products: JSON::ParserError: 765: unexpected token at '<html>

<head><title>504 Gateway Time-out</title></head>

<body>

<center><h1>504 Gateway Time-out</h1></center>

</body>

</html>

'
'/usr/lib/zypper/commands/zypper-migration' exited with status 1
2022-03-04 01:25:42 Application returned with exit status 1.
Actions #29

Updated by jlausuch about 2 years ago

Timeouts in some container commands running zypper lr:
https://openqa.suse.de/tests/8269339#step/image_docker/919
https://openqa.suse.de/tests/8269313#step/image_docker/162
https://openqa.suse.de/tests/8269116#step/image_docker/929

> docker run -i --entrypoint '' registry.suse.com/suse/sles12sp5 zypper lr -s
> EOT_sed1h
sed1h-0-
# echo sed1h; bash -oe pipefail /tmp/scriptsed1h.sh ; echo SCRIPT_FINISHEDsed1h-$?-
sed1h
Refreshing service 'container-suseconnect-zypp'.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp] 
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Warning: No repositories defined.
Use the 'zypper addrepo' command to add one or more repositories.
SCRIPT_FINISHEDsed1h-6-

This can be partially stabilized with retries: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14420

Actions #30

Updated by jlausuch about 2 years ago

btrfs_autocompletion:
Timeout exceeded when accessing 'https://scc.suse.com/access/services/1924/repo/repoindex.xml?credentials=SUSE_Linux_Enterprise_Server_15_SP2_x86_64'.

https://openqa.suse.de/tests/8268555#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8268561#step/btrfs_autocompletion/11

Actions #31

Updated by jlausuch about 2 years ago

Today's failures:

Updating system details on https://scc.suse.com ...
Activating PackageHub 15.3 x86_64 ...
Error: Cannot parse response from server
Actions #32

Updated by jlausuch about 2 years ago

Today's failures:

Timeouts in installation jobs:

SUSEConnect --list-extensions:

Zypper timeouts:

Other connectivity issues:

Actions #33

Updated by jlausuch about 2 years ago

jlausuch wrote:

SUSEConnect --list-extensions:

This PR will help with this specific issue:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14437

Actions #35

Updated by jlausuch about 2 years ago

2 more parent installation jobs today:

Actions #37

Updated by jlausuch about 2 years ago

SCC issues from today:

Non-SCC related:

Actions #38

Updated by jlausuch almost 2 years ago

https://openqa.suse.de/tests/8354268#step/scc_registration/17 (Connection timeout. Make sure that the registration server is reachable and the connection is reliable.)
https://openqa.suse.de/tests/8354381#step/scc_registration/52 (Timeout exceeded accessing scc.suse.com)
https://openqa.suse.de/tests/8353928#step/scc_registration/20 (504 Gateway Time-out)
https://openqa.suse.de/tests/8354861#step/scc_registration/8 (Error Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP5_s390x' failed)

https://openqa.suse.de/tests/8352068#step/docker/68 (docker pull timeout) (not scc related)

Actions #39

Updated by jstehlik almost 2 years ago

  • Assignee set to jstehlik

I propose using Nagios to monitor the availability of the mentioned services (if it is not already used and I just don't know about it) for a week or two to collect good statistics. This topic also seems related to https://progress.opensuse.org/issues/103656

Actions #41

Updated by jlausuch almost 2 years ago

jstehlik wrote:

I propose using Nagios to monitor the availability of the mentioned services (if it is not already used and I just don't know about it) for a week or two to collect good statistics. This topic also seems related to https://progress.opensuse.org/issues/103656

Yeah, why not; continuous monitoring could give us a hint about those random hiccups.

Actions #42

Updated by dzedro almost 2 years ago

  • Assignee deleted (jstehlik)

I have prepared a PR to work around the SCC error pop-ups, e.g. https://openqa.suse.de/tests/8368303#step/scc_registration/11
The question is whether it's safe to just press OK and continue, so I created https://bugzilla.suse.com/show_bug.cgi?id=1197380
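For reference, a needle-based workaround in the installation flow would look roughly like this sketch; the needle tags and key shortcuts are placeholders, not the ones from the actual PR:

# Sketch only: acknowledge a transient SCC error pop-up and retry registration.
use strict;
use warnings;
use testapi;

for my $attempt (1 .. 3) {
    # Wait for either the next expected screen or the known error pop-up.
    assert_screen([qw(module-selection scc-error-popup)], 300);
    last unless match_has_tag('scc-error-popup');
    die 'SCC registration still failing after retries' if $attempt == 3;
    record_soft_failure('bsc#1197380 - SCC connection error pop-up, retrying');
    send_key 'alt-o';    # placeholder shortcut for the pop-up's OK button
    send_key 'alt-n';    # placeholder shortcut to trigger the registration again
}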

We need statistics for what? To confirm that there is a network/SCC issue?

Actions #43

Updated by okurz almost 2 years ago

  • Assignee set to jstehlik

@jpupava you might have removed jstehlik as assignee by mistake. I added him back, as we just discussed this ticket together, also in light of #102945

To get some statistics you can look into the data that is available within openQA.
We also have
https://maintenance-statistics.dyn.cloud.suse.de/question/319
which parses the openQA database.
The graph shows that "scc_registration" is now by far the most frequently failing test module. That was certainly different some months ago.

Actions #44

Updated by okurz almost 2 years ago

  • Description updated (diff)
Actions #45

Updated by dzedro almost 2 years ago

Sorry, the assignee was removed because I added my comment from a stale session.
Great, we have statistics proving a serious network/SCC issue; time to investigate and fix it.

Actions #46

Updated by jstehlik almost 2 years ago

Since SCC is not the only service randomly failing with timeouts, I would like to compare the results of SCC monitoring with the availability seen from the openQA side. That would clearly show whether the problem is in the connection from the openQA network to the outside. Might it be correlated with the recent move of servers in Nuremberg from one floor to another? Maybe we will find a long CAT4 cable with many loops around a switching power supply, or some other computer black magic. In the attachment you can see an expert who could help :)

Actions #47

Updated by jlausuch almost 2 years ago

https://openqa.suse.de/tests/8371303#step/suseconnect_scc/16 (temporary failure in name resolution) -> maybe this is an issue on our end?
https://openqa.suse.de/tests/8371270#step/curl_ipv6/4 (could not resolve host: www3.zq1.de)

Actions #48

Updated by okurz almost 2 years ago

jstehlik wrote:

Since SCC is not the only service randomly failing with timeouts, I would like to compare the results of SCC monitoring with the availability seen from the openQA side. That would clearly show whether the problem is in the connection from the openQA network to the outside.

Yes, we will look into a bit more monitoring on that side. We should be aware that "from the openQA side" should be distinguished by which machine specifically tries to reach scc.suse.com or any CDN component, even outside SUSE. I am thinking of connectivity monitoring from each openQA worker machine to scc.suse.com.

Might it be correlated with the recent move of servers in Nuremberg from one floor to another? […]

This seems rather unlikely. For example, there is https://openqa.suse.de/tests/8368303#step/scc_registration/11 running on openqaworker13. openqaworker13 and most production openQA workers are located in SUSE Nbg SRV1. The move of QA machines from the QA labs was to SUSE Nbg SRV2, which should not significantly affect network capabilities in SRV1.

We discussed this ticket in the weekly SUSE QE sync meeting. okurz suggested in particular to use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger , see #108773 as an example of that.

EDIT:

  • https://openqa.suse.de/tests/8379417#step/journal_check/4 is a test on "svirt-xen-hvm", which runs on openqaw5-xen.qa.suse.de, so the QA network is definitely involved. If DHCP fails, that might be on openqaw5-xen.qa.suse.de (one could take a look at the logs there), but for a DNS resolution problem trying to resolve scc.suse.com, the DNS server on qanet.qa.suse.de is certainly involved. So we could take a look at the DNS server logs on qanet.qa.suse.de (named service).
  • https://openqa.suse.de/tests/8380623#step/podman/180 is a failure in podman within an internal Tumbleweed container that fails to get valid data from download.opensuse.org from within a QEMU VM running on openqaworker8, so neither scc.suse.com nor the SUSE QA net is involved.

We could separate this into three different areas and follow up with improvements and better investigation for each of them:

  • Check DNS resolution and DHCP stability within QA net, e.g. more monitoring, mtr, etc. -> #108845
  • Check accessibility of SCC: that should be covered by #102945, but maybe we can progress faster than waiting for a "management level" follow-up by implementing some monitoring on our side, e.g. telegraf ping checks against components like download.opensuse.org, scc.suse.com, proxy-scc (a minimal probe sketch follows below)
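Until such checks are in place, even a trivial periodic probe from each worker would give comparable data points. A minimal standalone sketch, in plain Perl rather than an actual telegraf configuration, with the host list and port chosen only as examples:

#!/usr/bin/perl
# Sketch only: log TCP reachability and connect latency for a few endpoints,
# i.e. the kind of data a telegraf ping/TCP check would collect continuously.
use strict;
use warnings;
use IO::Socket::INET;
use Time::HiRes qw(time);

my @hosts = qw(scc.suse.com updates.suse.com download.opensuse.org);

for my $host (@hosts) {
    my $start = time;
    my $sock  = IO::Socket::INET->new(PeerAddr => $host, PeerPort => 443, Timeout => 5);
    my $ms    = int((time - $start) * 1000);
    printf "%s %s %s %dms\n", scalar(localtime), $host, ($sock ? 'ok' : "FAIL: $@"), $ms;
}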
Actions #49

Updated by jlausuch almost 2 years ago

  • File deleted (scc_timeout.png)
Actions #51

Updated by lpalovsky almost 2 years ago

HA MM jobs have been affected by this since the last build, 113.1, as well.
What I am seeing are SCC connection timeouts:
https://openqa.suse.de/tests/8373145#step/welcome/10

But also hostname resolution problems, not only related to SCC, but also affecting the connection between the iSCSI client/server:
https://openqa.suse.de/tests/8373191#step/iscsi_client/13
https://openqa.suse.de/tests/8368368#step/patch_sle/163
https://openqa.suse.de/tests/8363004#step/ha_cluster_init/18
https://openqa.suse.de/tests/8373197#step/iscsi_client/13

Actions #52

Updated by jlausuch almost 2 years ago

And a non-SCC-related one:
https://openqa.suse.de/tests/8380471#step/rootless_podman/103
Timeout exceeded when accessing 'http://download.opensuse.org/tumbleweed/repo/non-oss/content'
and also https://openqa.suse.de/tests/8380623#step/podman/180

Actions #53

Updated by jlausuch almost 2 years ago

This is definitely something related to our infra network: https://openqa.suse.de/tests/8379417#step/journal_check/4
localhost wicked[1059]: eth0: DHCP4 discovery failed
and the next module fails when adding a repo, even after a few retries: https://openqa.suse.de/tests/8379417#step/pam/51
and also the curl command that uploads the logs to the worker: https://openqa.suse.de/tests/8379417#step/pam/54

And DNS resolution issue:
https://openqa.suse.de/tests/8379419#step/suseconnect_scc/16 (getaddrinfo: Temporary failure in name resolution)

Actions #55

Updated by lpalovsky almost 2 years ago

The latest build, 116.4, is again very bad for HA. The issues are mostly related to either lost connections or DNS resolution failures. This happens between cluster nodes, on SSH connections to the SUT, or towards external sites like openqa.suse.de or SCC.

Various SSH disconnections:
https://openqa.suse.de/tests/8375657#step/patch_sle/150
https://openqa.suse.de/tests/8374846#step/boot_to_desktop/16

Node being unreachable/network resolution issue:
https://openqa.suse.de/tests/8376302#step/ha_cluster_init/15
https://openqa.suse.de/tests/8376293#step/iscsi_client/11
https://openqa.suse.de/tests/8376330#step/iscsi_client/13

The examples above are only a portion of the failed tests.

Actions #56

Updated by okurz almost 2 years ago

So in the above examples I can see

Actions #57

Updated by okurz almost 2 years ago

  • Related to action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min added
Actions #60

Updated by lpalovsky almost 2 years ago

okurz wrote:

Thanks, I have created a separate ticket for that issue: #108962

Actions #61

Updated by okurz almost 2 years ago

A change in network cabling was made around 2022-03-28 08:15Z. We have since observed a stable connection between the QA switches and the previously affected machines. We will continue to monitor the situation. See #108845#note-25 for details. Please report any related issues still occurring for openQA tests that were started after the above time.

Actions #62

Updated by jlausuch almost 2 years ago

I have noticed fewer network issues lately, but still some DNS resolution failures here and there (not so many):

Actions #63

Updated by jstehlik almost 2 years ago

I consider this issue resolved by the cabling fix. The topic of infrastructure reliability shall be followed up further in https://progress.opensuse.org/issues/109250

Actions #64

Updated by jstehlik almost 2 years ago

  • Status changed from Workable to Feedback
Actions #65

Updated by okurz almost 2 years ago

jstehlik wrote:

I consider this issue resolved by the cabling fix. The topic of infrastructure reliability shall be followed up further in https://progress.opensuse.org/issues/109250

I think we shouldn't declare this ticket "resolved" just because one problem was fixed. The problem might have been small, but the impact was huge. I suggest we discuss further improvement ideas on multiple levels so that any new or upcoming problem has a less severe impact. Also, multiple subtasks are still open. If you would like to await their results before reviewing this ticket, I suggest using the "Blocked" status.

Actions #66

Updated by dzedro almost 2 years ago

Now there are serious network problems on s390x. This kind of issue happened before as well, but now it's very bad, I guess since yesterday afternoon.
If I look at the live log or serial output of a running job, at some point it freezes: no boot update or worker debug output.
Then just a failure shows up, see the attachment.

Three variations of test unable to boot.
https://openqa.suse.de/tests/8538657#step/bootloader_zkvm/18
https://openqa.suse.de/tests/8538656#step/boot_to_desktop/16
https://openqa.suse.de/tests/8538654#step/bootloader_start/19

Actions #68

Updated by dzedro almost 2 years ago

Looks like the s390x failures were not related to the network but to disk space: https://suse.slack.com/archives/C02CANHLANP/p1649923742408719

Actions #69

Updated by okurz almost 2 years ago

We discussed within the weekly QE sync meeting 2022-05-04 that some problems have been fixed (operational work). We should raise to LSG management, e.g. in the "Open Doors" meeting, how we suffer from the impact of bad infrastructure, and that we push for improvements, e.g. with the planned datacenter move. szarate mentions that we are already occupying multiple engineers in their daily work with investigating network-related problems, e.g. effectively losing a full day due to a non-executed milestone build validation.

Actions #70

Updated by okurz almost 2 years ago

Asked Jose Lausuch and Jozef Pupava in https://suse.slack.com/archives/C02CANHLANP/p1652176711212269

as the main commenters on https://progress.opensuse.org/issues/107062 , in your opinions, what do you see as necessary to resolve this ticket?

EDIT: I got confirmation from jpupava and jlausuch that both consider the original issue(s) resolved. We agree that there is still a lot of room for improvement, e.g. base-level infrastructure monitoring, but that has likely been stated sufficiently. I think it makes sense to track the work in the still-open subtasks before resolving this ticket, but after all subtasks are resolved nothing more would be needed. Added acceptance criteria accordingly.

Actions #71

Updated by okurz almost 2 years ago

  • Description updated (diff)
Actions #72

Updated by szarate almost 2 years ago

On the topic of how much in terms of hours this took:

While we know what the root cause of the problem was (the network cable), there's a second level of root cause, which is the lack of manpower and monitoring on the infrastructure side (I guess here it's SUSE IT and QE).

Now in terms of costs:

So, it's around 528 hours, counting 3 engineers from QE tools, 3 from openQA maintenance review, ~59 QE engineers and 1 RM, 66 people in total, assuming that:

  • Some of them had to do the review twice (on the day of the failure and the day after).
  • Some of them were sitting idle during manual validation, due to some resources not being available.
  • Some of them could not verify or run verification runs that depended on resources in OSD or qanet, due to the network being degraded/down.

This doesn't account for the hours of automated testing, which could be calculated and directly converted to € if we take into account the power consumption of the servers, nor for the work that had to be done as fallout (your time, the meeting last Wednesday, plus any other meetings).

NOTE: My numbers might be off; if somebody wants to cross-check, be my guest :). Also, having data on how many hours of openQA testing were lost would be a good idea, for this and future incidents.

Actions #73

Updated by okurz almost 2 years ago

@szarate your estimation looks good to me.

@jstehlik from today's discussion it sounds like some people still see some issues; you mentioned "zypper problems". I think the current subtasks of this ticket do not cover this. We need to ensure that there are tickets handling each issue explicitly, so that no one is waiting for a miracle solution.

Actions #74

Updated by jstehlik over 1 year ago

Further evaluation is needed once the new test rack is set up in Prague.

Also, this issue seems related: "QA network infrastructure Ping time alert for several hosts"
https://progress.opensuse.org/issues/113498

Actions #75

Updated by szarate over 1 year ago

  • Related to action #113528: [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly) added
Actions #76

Updated by szarate over 1 year ago
