action #107062
Multiple failures due to network issues (open)
Added by jlausuch almost 3 years ago. Updated 2 months ago.
78% done
Description
Observation
I will use this ticket to collect the different errors I observe in our tests (at least for the QE-C squad) that fail due to network issues.
Normally a restart helps to get the job green again (in case we need it green), but this is not the ideal solution.
The idea of this ticket is to collect more potential issues caught by reviewers and propose solutions for some of them, either in the code (retrying the same command several times might help) or on the infra side.
There is an example for each error I found, but from my experience reviewing jobs every day, these failures happen multiple times a day and randomly (difficult to predict).
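To make the in-code retry idea concrete, here is a minimal sketch in the style of our Perl test code. This is illustrative only: the helper name and defaults are made up, the test distribution already has similar helpers (e.g. script_retry in utils.pm), and the exact return-value handling of script_run may differ between os-autoinst versions. The concrete error classes collected so far follow right after the sketch.

use strict;
use warnings;
use testapi;

# Minimal sketch of a generic retry helper (illustrative only).
sub retry_cmd {
    my ($cmd, %args) = @_;
    my $retries = $args{retry}   // 3;     # number of attempts
    my $delay   = $args{delay}   // 30;    # seconds between attempts
    my $timeout = $args{timeout} // 90;    # per-attempt timeout

    for my $attempt (1 .. $retries) {
        my $ret = script_run($cmd, timeout => $timeout);
        return if defined($ret) && $ret == 0;
        record_info('retry', "'$cmd' failed (attempt $attempt/$retries), waiting ${delay}s");
        sleep $delay;
    }
    die "'$cmd' still failing after $retries attempts";
}

A persistent outage will of course still fail after the last attempt, which is exactly what we want to surface in review.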
1) SUSEConnect timeouts -> https://openqa.suse.de/tests/8189768#step/docker/34
Test died: command 'SUSEConnect -p sle-module-containers/${VERSION_ID}/${CPU} ' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.
Or https://openqa.suse.de/tests/8193554#step/suseconnect_scc/8
Test died: command 'SUSEConnect -r $regcode' timed out at /usr/lib/os-autoinst/testapi.pm line 950.
2) updates.suse.com not reachable -> https://openqa.suse.de/tests/8189697#step/image_docker/1110
Retrieving: kmod-25-6.10.1.aarch64.rpm [.........error]
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
Download (curl) error for 'https://updates.suse.com/SUSE/Updates/SLE-Module-Basesystem/15-SP2/aarch64/update/aarch64/kmod-25-6.10.1.aarch64.rpm?nE0jiYdfiOdLYjH0o-llNN2xIDXncon0vYw8z1aBPGx00H9S1eN413vUsfSJnzFrVz-CoZoGtSdsPKIDRAOQy3Xw2Tac3Yx5_1i8TPomSNiqhDJ0Ayxro23n46NHHB-XHq669RlHs17wiUFSJiSMCSh-YzdGdFw':
Error code: Connection failed
Error message: Could not resolve host: updates.suse.com
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
Please see the above error message for a hint.
3) SCC timeouts -> https://openqa.suse.de/tests/8189613#step/image_docker/316
docker run --entrypoint /usr/lib/zypp/plugins/services/container-suseconnect-zypp -i zypper_docker_derived lp
...
2022/02/18 07:16:19 Installed product: SLES-12.3-x86_64
2022/02/18 07:16:19 Registration server set to https://scc.suse.com
2022/02/18 07:16:30 Get https://scc.suse.com/connect/subscriptions/products?arch=x86_64&identifier=SLES&version=12.3: dial tcp: lookup scc.suse.com on 10.0.2.3:53: read udp 172.17.0.2:37151->10.0.2.3:53: i/o timeout
4) zypper ref timeout or error -> https://openqa.opensuse.org/tests/2193730#step/image_podman/124
podman run -i --name 'refreshed' --entrypoint '' registry.opensuse.org/opensuse/leap/15.3/images/totest/containers/opensuse/leap:15.3 zypper -nv ref
...
Retrieving: cb71cb070e8aac79327e6f1b6edc5317122ca1f72970299c3cb2cf505e18b27f-deltainfo.xml.gz [........................done (82.3 KiB/s)]
Retrieving: 832729371fe20bc1a4d27e59d76c10ffe2c0b5a1ff71c4e934e7a11baa24a74b-primary.xml.gz [...error (87.0 KiB/s)]
WJdDM-124-
Acceptance criteria
- AC1: All existing subtasks are resolved, no additional work needed on top
Files
- Screenshot 2022-02-21 at 11.03.33.png (31.4 KB), jlausuch, 2022-02-21 10:03
- canvas.png (25.5 KB), pdostal, 2022-03-01 14:04
- expert.jpg (60.9 KB), jstehlik, 2022-03-22 13:11
- scc_timeout.png (11.7 KB), jlausuch, 2022-03-23 09:35
- Screenshot_2022-04-12_22-16-57.png (80.2 KB), "no response, no log, no connection?", dzedro, 2022-04-13 09:01
Updated by jlausuch almost 3 years ago
More examples of failures/timeouts activating a module via SUSEConnect (see the attached screenshot):
Updated by jlausuch almost 3 years ago
I have created this proposal for adding a product/module with suseconnect.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14283
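This is not the content of the PR, just the rough shape of a call site once module activation goes through a retrying helper (reusing the hypothetical retry_cmd sketch from the description; retry counts and timeouts are arbitrary):

# Hypothetical call site (sketch only): activate a module with retries.
my $version = script_output('. /etc/os-release && echo $VERSION_ID');   # e.g. 15.3
my $arch    = script_output('uname -m');                                # e.g. x86_64
retry_cmd("SUSEConnect -p sle-module-containers/$version/$arch",
    retry => 5, delay => 60, timeout => 180);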
Updated by jlausuch almost 3 years ago
- File scc_timeout.png added
2 more issues due to SCC not reachable:
https://openqa.suse.de/tests/8201102#step/suseconnect_scc/6
https://openqa.suse.de/tests/8201100#step/suseconnect_scc/6
Updated by jlausuch almost 3 years ago
And another one with updates.suse.com not reachable:
https://openqa.suse.de/tests/8201099#step/glibc_locale/47
Updated by jlausuch almost 3 years ago
New PR to create a retry wrapper for the validate_script_output method:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14295
This will help with validate_script_output calls that are prone to time out when the network is slow.
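For reference, a sketch of what such a wrapper could look like (name, signature and defaults are assumptions here, the PR defines the actual interface). Since validate_script_output dies on a mismatch or timeout, each attempt is wrapped in eval:

# Sketch only, not the merged code: retrying validation wrapper.
sub validate_script_output_retry_sketch {
    my ($cmd, $check, %args) = @_;
    my $retries = delete $args{retry} // 3;
    my $delay   = delete $args{delay} // 30;

    for my $attempt (1 .. $retries) {
        # validate_script_output dies if the check fails or the command times out
        my $ok = eval { validate_script_output($cmd, $check, %args); 1 };
        return if $ok;
        record_info('retry', "validation of '$cmd' failed (attempt $attempt/$retries): $@");
        sleep $delay;
    }
    die "validation of '$cmd' still failing after $retries attempts";
}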
Updated by dzedro almost 3 years ago
A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20
https://openqa.suse.de/tests/8198732#step/scc_registration/16
Updated by jlausuch almost 3 years ago
dzedro wrote:
A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20
For this, we can make a script_retry, similar to what I did here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14327
https://openqa.suse.de/tests/8198732#step/scc_registration/16
This is more tricky, as it's installation UI, we can't use any in-code retry here... (it would need needle handling for error cases)
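For the installation UI case the only realistic option is indeed needle handling. A very rough sketch of that idea follows; the needle tags and keyboard shortcuts are hypothetical and would have to match the real dialogs:

# Sketch only: wait for the registration to either succeed or show an error
# popup, and retry a few times before giving up.
sub wait_for_scc_registration_sketch {
    my $retries = 3;
    for my $attempt (1 .. $retries) {
        assert_screen([qw(module-selection registration-error-popup)], 300);
        return if match_has_tag('module-selection');
        record_info('SCC retry', "registration error popup, attempt $attempt/$retries");
        send_key 'alt-o';    # dismiss the popup (OK)
        send_key 'alt-r';    # hypothetical shortcut to trigger registration again
    }
    die 'SCC registration kept failing with server errors';
}

Whether it is safe to just dismiss such popups and retry would still need to be confirmed.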
Updated by okurz almost 3 years ago
- Related to action #107635: [qem][y] test fails in installation added
Updated by okurz almost 3 years ago
As suggested over other channels (which might have gotten lost): I suggest getting in contact with the SCC team and the EngInfra team, e.g. creating bugs in their respective issue trackers and linking them here, or pinging them over chat or email and inviting them to brainstorm and investigate this issue. We don't know where the problem is, but I am sure that together with experts from other teams we have enough brain power to solve this riddle. It could be that there is a problem within the product that we or you test, and then of course we should handle it as an appropriate product problem or regression that should be fixed, because customers will also be affected. It could also be a problem in the user-facing infrastructure, especially for updates.suse.com on a CDN, where customers would or could also be affected. Then as well we should not just accept this issue but look into it and try to solve it together.
Updated by jlausuch almost 3 years ago
okurz wrote:
As suggested over other channels (which might have gotten lost): I suggest getting in contact with the SCC team and the EngInfra team, e.g. creating bugs in their respective issue trackers and linking them here, or pinging them over chat or email and inviting them to brainstorm and investigate this issue. We don't know where the problem is, but I am sure that together with experts from other teams we have enough brain power to solve this riddle. It could be that there is a problem within the product that we or you test, and then of course we should handle it as an appropriate product problem or regression that should be fixed, because customers will also be affected. It could also be a problem in the user-facing infrastructure, especially for updates.suse.com on a CDN, where customers would or could also be affected. Then as well we should not just accept this issue but look into it and try to solve it together.
Yes, that would be ideal, but first we would need to collect evidence and have dedicated tests for SCC/updates.suse.com/etc. that check the connectivity (I think there is a ticket about it) and collect some statistics to see how that behaves over time. Maybe we can recognize a pattern in when these hiccups happen, or maybe we can't. I think this is a quite complex scenario to define, and I personally don't have the time to drive this initiative forward.
Updated by jlausuch almost 3 years ago
Again: https://openqa.suse.de/tests/8242917#step/image_docker/705
> docker run --entrypoint /usr/lib/zypp/plugins/services/container-suseconnect-zypp -i zypper_docker_derived lm
2022/02/28 13:53:28 Installed product: SLES-12.4-x86_64
2022/02/28 13:53:28 Registration server set to https://scc.suse.com
2022/02/28 13:53:42 Get "https://scc.suse.com/connect/subscriptions/products?arch=x86_64&identifier=SLES&version=12.4": dial tcp: lookup scc.suse.com on 10.0.2.3:53: read udp 172.17.0.2:39374->10.0.2.3:53: i/o timeout
Hopefully, this type of failure will be worked around by https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14295
Updated by jlausuch almost 3 years ago
And more:
https://openqa.suse.de/tests/8243068#step/registry/156
> SUSEConnect --status-text
> EOT_L_NVX
L_NVX-0-
# echo L_NVX; bash -oe pipefail /tmp/scriptL_NVX.sh ; echo SCRIPT_FINISHEDL_NVX-$?-
L_NVX
Error: Cannot parse response from server
Updated by jlausuch almost 3 years ago
An interesting one that failed even after 3 retries... and this has nothing to do with SCC or any suse.de domain:
https://openqa.suse.de/tests/8245957#step/docker_3rd_party_images/825
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
I guess this time we can blame RH :)
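This one also shows the limit of plain retries: a 404 from the registry is not transient, so no number of attempts will help. A sketch of a pull retry that bails out early on clearly permanent errors (helper name and error patterns are assumptions):

# Sketch only: retry docker pull, but stop immediately on non-transient errors.
sub retry_pull_sketch {
    my ($image, %args) = @_;
    my $retries = $args{retry} // 3;
    for my $attempt (1 .. $retries) {
        my $out = script_output("timeout 420 docker pull $image 2>&1; echo RC=\$?",
            480, proceed_on_failure => 1);
        return if $out =~ /RC=0\b/;
        # A 404 / missing manifest will not fix itself, so do not keep retrying.
        die "image $image not found on the registry" if $out =~ /404|manifest unknown|not found/i;
        record_info('retry', "pull of $image failed (attempt $attempt/$retries)");
        sleep 30;
    }
    die "pull of $image still failing after $retries attempts";
}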
Updated by pdostal almost 3 years ago
jlausuch wrote:
dzedro wrote:
A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20
For this, we can make a script_retry, similar to what I did here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14327
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14383
Updated by pdostal almost 3 years ago
- File canvas.png added
jlausuch wrote:
dzedro wrote:
https://openqa.suse.de/tests/8198732#step/scc_registration/16
This is more tricky, as it's installation UI, we can't use any in-code retry here... (it would need needle handling for error cases)
YaST2 Installation - Connection to registration server failed. (I'm just documenting this so it's not cleaned by openQA).
Updated by jlausuch almost 3 years ago
jlausuch wrote:
I have created this proposal for adding a product/module with suseconnect.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14283
Follow-up: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14402
Updated by jlausuch almost 3 years ago
A new one today in a parent job that many other jobs depend on...
Installation: https://openqa.suse.de/tests/8261955#step/scc_registration/31
Cannot parse the data from the server
Updated by jlausuch almost 3 years ago
For the record:
https://openqa.suse.de/tests/8261596#step/image_docker/1491
# timeout 600 docker exec refreshed zypper -nv ref; echo Uyq3l-$?-
Entering non-interactive mode.
Verbosity: 2
Initializing Target
Refreshing service 'container-suseconnect-zypp'.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp]
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Specified repositories:
Warning: There are no enabled repositories defined.
Use 'zypper addrepo' or 'zypper modifyrepo' commands to add or enable repositories.
This failed even after some 3 retries, and the subsequent modules also failed with similar connectivity issues:
> SUSEConnect --status-text
> EOT_L_NVX
L_NVX-0-
# echo L_NVX; bash -oe pipefail /tmp/scriptL_NVX.sh ; echo SCRIPT_FINISHEDL_NVX-$?-
L_NVX
SUSEConnect error: SocketError: getaddrinfo: Temporary failure in name resolution
which could be partially avoided with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14409
Updated by dzedro almost 3 years ago
I guess it's time to fix the real network or SCC issue.
Updated by jlausuch almost 3 years ago
dzedro wrote:
I guess it's time to fix the real network or SCC issue.
+1
We just need a driver for that :)
Updated by jlausuch almost 3 years ago
- Related to action #107806: docker build fails - No provider of 'apache2' found due to container-suseconnect-zypp failure. added
Updated by jlausuch almost 3 years ago
https://openqa.suse.de/tests/8262854#step/image_docker/921
# docker run --rm -ti registry.suse.com/suse/sle15:15.2 zypper ls | grep 'Usage:'; echo GcdHR-$?-
GcdHR-1-
# cat > /tmp/scriptTBhqL.sh << 'EOT_TBhqL'; echo TBhqL-$?-
> docker run -i --entrypoint '' registry.suse.com/suse/sle15:15.2 zypper lr -s
> EOT_TBhqL
TBhqL-0-
# echo TBhqL; bash -oe pipefail /tmp/scriptTBhqL.sh ; echo SCRIPT_FINISHEDTBhqL-$?-
TBhqL
Refreshing service 'container-suseconnect-zypp'.
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Warning: No repositories defined.
Use the 'zypper addrepo' command to add one or more repositories.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp]
SCRIPT_FINISHEDTBhqL-6-
Updated by jlausuch almost 3 years ago
Today again, parent installation jobs failing in the SCC registration step:
https://openqa.suse.de/tests/8269572#step/scc_registration/29
https://openqa.suse.de/tests/8269587#step/scc_registration/36
Updated by jlausuch almost 3 years ago
SLE Micro timeout in transactional-update migration:
https://openqa.suse.de/tests/8268653#step/zypper_migration/9
Calling zypper migration --no-snapshots --no-selfupdate
2022-03-04 01:25:26 tukit 3.4.0 started
2022-03-04 01:25:26 Options: call 5 zypper migration --no-snapshots --no-selfupdate
2022-03-04 01:25:26 Executing `zypper migration --no-snapshots --no-selfupdate`:
Executing 'zypper refresh'
Repository 'SUSE-MicroOS-5.0-Pool' is up to date.
Repository 'SUSE-MicroOS-5.0-Updates' is up to date.
Repository 'TEST_0' is up to date.
Repository 'TEST_1' is up to date.
Repository 'TEST_10' is up to date.
Repository 'TEST_11' is up to date.
Repository 'TEST_12' is up to date.
Repository 'TEST_2' is up to date.
Repository 'TEST_3' is up to date.
Repository 'TEST_4' is up to date.
Repository 'TEST_5' is up to date.
Repository 'TEST_6' is up to date.
Repository 'TEST_7' is up to date.
Repository 'TEST_8' is up to date.
Repository 'TEST_9' is up to date.
All repositories have been refreshed.
Can't determine the list of installed products: JSON::ParserError: 765: unexpected token at '<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
'
'/usr/lib/zypper/commands/zypper-migration' exited with status 1
2022-03-04 01:25:42 Application returned with exit status 1.
Updated by jlausuch almost 3 years ago
Timeouts in some container commands running zypper lr:
https://openqa.suse.de/tests/8269339#step/image_docker/919
https://openqa.suse.de/tests/8269313#step/image_docker/162
https://openqa.suse.de/tests/8269116#step/image_docker/929
> docker run -i --entrypoint '' registry.suse.com/suse/sles12sp5 zypper lr -s
> EOT_sed1h
sed1h-0-
# echo sed1h; bash -oe pipefail /tmp/scriptsed1h.sh ; echo SCRIPT_FINISHEDsed1h-$?-
sed1h
Refreshing service 'container-suseconnect-zypp'.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp]
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Warning: No repositories defined.
Use the 'zypper addrepo' command to add one or more repositories.
SCRIPT_FINISHEDsed1h-6-
This can be partially stabilized with retries: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14420
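Roughly what that retry looks like (sketch only, reusing the hypothetical retry_cmd from the description; the actual PR uses the distribution's own helper). The whole docker invocation has to be wrapped, because every run re-queries SCC through container-suseconnect-zypp, and zypper exits non-zero (6, "no repositories") when that fails:

# Sketch only: retry the whole in-container refresh, not just parts of it.
retry_cmd('docker exec refreshed zypper -nv ref', retry => 3, delay => 60, timeout => 600);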
Updated by jlausuch almost 3 years ago
btrfs_autocompletion:
Timeout exceeded when accessing 'https://scc.suse.com/access/services/1924/repo/repoindex.xml?credentials=SUSE_Linux_Enterprise_Server_15_SP2_x86_64'.
https://openqa.suse.de/tests/8268555#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8268561#step/btrfs_autocompletion/11
Updated by jlausuch almost 3 years ago
Today's failures:
installation job, scc timeout: https://openqa.suse.de/tests/8280236#step/scc_registration/15
btrfs_autocompletion:
File '/media.1/media' not found on medium 'http://dist.suse.de/ibs/SUSE:/Maintenance:/22627/SUSE_Updates_SLE-SERVER_12-SP5_x86_64/'
https://openqa.suse.de/tests/8279343#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8279361#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8279358#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8279355#step/btrfs_autocompletion/10
docker|podman build timeouts due to container-suseconnect-zypp:
https://openqa.suse.de/tests/8280575#step/image_docker/771
https://openqa.suse.de/tests/8280667#step/image_docker/1932
https://openqa.suse.de/tests/8280515#step/image_docker/2319
https://openqa.suse.de/tests/8280373#step/image_docker/771
https://openqa.suse.de/tests/8280357#step/image_podman/175
Module Activation via SUSEConnect:
https://openqa.suse.de/tests/8279393#step/registry/7
Updating system details on https://scc.suse.com ...
Activating PackageHub 15.3 x86_64 ...
Error: Cannot parse response from server
Updated by jlausuch almost 3 years ago
Today's failures:
Timeouts in installation jobs:
- https://openqa.suse.de/tests/8287090#step/scc_registration/8 (504 Gateway Time-out)
- https://openqa.suse.de/tests/8286159#step/scc_registration/17 (504 Gateway Time-out)
- https://openqa.suse.de/tests/8286272#step/scc_registration/17 (cannot parse data from server)
SUSEConnect --list-extensions:
- https://openqa.suse.de/tests/8285389#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285390#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285402#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285445#step/suseconnect_scc/12 (Error: Cannot parse response from server)
Zypper timeouts:
- https://openqa.suse.de/tests/8285493#step/toolbox/87 ('toolbox run -c devel -- zypper lr' timed out)
Other connectivity issues:
- https://openqa.suse.de/tests/8285630#step/prepare_instance/45 (https://compute.googleapis.com - network is unreachable)
Updated by jlausuch almost 3 years ago
jlausuch wrote:
SUSEConnect --list-extensions:
- https://openqa.suse.de/tests/8285389#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285390#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285402#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285445#step/suseconnect_scc/12 (Error: Cannot parse response from server)
This PR will help with this specific issue:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14437
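Roughly the kind of call the affected tests would end up with (sketch only, reusing the hypothetical wrapper from the earlier comment; the merged PR defines the real helper and the expected output):

# Sketch only: re-run the command until its output validates.
validate_script_output_retry_sketch('SUSEConnect --list-extensions',
    sub { m/Containers Module/ }, retry => 3, delay => 60, timeout => 120);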
Updated by jlausuch almost 3 years ago
- Related to action #108064: Test fails in btrfs_autocompletion - System management is locked by the application with pid 1658 (zypper). added
Updated by jlausuch almost 3 years ago
2 more parent installation jobs today:
- https://openqa.suse.de/tests/8305172#step/scc_registration/20 (504 Gateway Time-out)
- https://openqa.suse.de/tests/8305850#step/scc_registration/16 (Cannot parse data from server)
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8327428#step/scc_registration/33 (Cannot parse data from server)
https://openqa.suse.de/tests/8326890#step/scc_registration/53 (Refreshing service 'Web_and_Scripting_Module_12_x86_64' failed)
https://openqa.suse.de/tests/8326342#step/installation/22 (Timeout)
https://openqa.suse.de/tests/8326325#step/installation/6 (Internal server error)
https://openqa.suse.de/tests/8326322#step/scc_registration/8 (Cannot parse data from server)
https://openqa.suse.de/tests/8326362#step/enable_lp_module/46 (Registration server returned '' 502)
Updated by jlausuch over 2 years ago
SCC issues from today:
- https://openqa.suse.de/tests/8338397#step/scc_registration/7 (Timeout exceeded accessing scc.suse.com)
- https://openqa.suse.de/tests/8336864#step/scc_registration/16 (Cannot parse data from server)
Non-SCC related:
- https://openqa.suse.de/tests/8336767#step/curl_ipv6/4 (could not resolve host www.zq1.de)
- https://openqa.suse.de/tests/8336791#step/wget_ipv6/9 ('wget -O- -q www3.zq1.de/test.txt' timed out)
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8354268#step/scc_registration/17 (Connection timeout. Make sure that the registration server is reachable and the connection is reliable.)
https://openqa.suse.de/tests/8354381#step/scc_registration/52 (Timeout exceeded accessing scc.suse.com)
https://openqa.suse.de/tests/8353928#step/scc_registration/20 (504 Gateway Time-out)
https://openqa.suse.de/tests/8354861#step/scc_registration/8 (Error Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP5_s390x' failed)
https://openqa.suse.de/tests/8352068#step/docker/68 (docker pull timeout) (not scc related)
Updated by jstehlik over 2 years ago
- Assignee set to jstehlik
I propose to use Nagios to monitor the availability of the mentioned services (if it is not used already and I just don't know about it) for a week or two to collect good statistics. This topic also seems related to #103656 (https://progress.opensuse.org/issues/103656).
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8366841#step/scc_registration/39 (Timeout exceeded when accessing 'https://scc.suse.com/access/services/.....)
https://openqa.suse.de/tests/8366970#step/scc_registration/31 (Timeout exceeded when accessing 'https://scc.suse.com/access/services/.....)
https://openqa.suse.de/tests/8367831#step/scc_registration/32 (Timeout exceeded when accessing 'https://scc.suse.com/access/services/.....)
Updated by jlausuch over 2 years ago
jstehlik wrote:
I propose to use Nagios to monitor the availability of the mentioned services (if it is not used already and I just don't know about it) for a week or two to collect good statistics. This topic also seems related to https://progress.opensuse.org/issues/103656
Yeah, why not; continuous monitoring could give us a hint about those random hiccups.
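Even before a proper Nagios or telegraf setup, a simple cron-driven probe on a worker would already give us data points to correlate with failed jobs. A quick sketch (endpoints, interval and output format are arbitrary):

#!/usr/bin/perl
# Quick-and-dirty connectivity probe (sketch only, not the proposed Nagios check).
# Run it from cron on an openQA worker and collect the output for later graphing.
use strict;
use warnings;
use POSIX qw(strftime);

my @endpoints = qw(scc.suse.com updates.suse.com download.opensuse.org);
my $ts = strftime('%Y-%m-%dT%H:%M:%S', localtime);

for my $host (@endpoints) {
    # Measure DNS lookup time, total time and HTTP status via curl.
    my $out = `curl -sS -o /dev/null --max-time 30 -w '%{time_namelookup} %{time_total} %{http_code}' https://$host/ 2>&1`;
    my $rc  = $? >> 8;
    print "$ts $host rc=$rc $out\n";
}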
Updated by dzedro over 2 years ago
- Assignee deleted (jstehlik)
I have prepared a PR to work around the SCC error pop-ups, e.g. https://openqa.suse.de/tests/8368303#step/scc_registration/11
The question is whether it's safe to just press OK and continue, so I created https://bugzilla.suse.com/show_bug.cgi?id=1197380
We need statistics for what? To confirm there is a network/SCC issue?
Updated by okurz over 2 years ago
- Assignee set to jstehlik
@jpupava you might have removed jstehlik as assignee by mistake. I added him back as we just discussed this ticket together also in light of #102945
To get some statistics you can look into the data that is available within openQA.
We also have https://maintenance-statistics.dyn.cloud.suse.de/question/319, which parses the openQA database.
The graph shows that by far "scc_registration" is now the most failing test module. That was certainly different some months ago.
Updated by dzedro over 2 years ago
Sorry, the assignee was removed because I added a comment from a stale session.
Great, we have statistics proving a serious network/SCC issue; time to investigate and fix it.
Updated by jstehlik over 2 years ago
- File expert.jpg added
Since SCC is not the only service randomly failing with timeouts, I would like to compare the results of SCC monitoring with the availability seen from the openQA side. That would clearly show whether the problem is in the connection from the openQA network to the outside. Might it be correlated with the recent move of servers in Nuremberg from one floor to another? Maybe we will find a long CAT4 cable with many loops around a switching power source, or some other computer black magic. In the attachment you can see an expert who could help :)
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8371303#step/suseconnect_scc/16 (temporary failure in name resolution) -> maybe this is an issue on our end?
https://openqa.suse.de/tests/8371270#step/curl_ipv6/4 (could not resolve host: www3.zq1.de)
Updated by okurz over 2 years ago
jstehlik wrote:
Since SCC is not the only service randomly failing with timeouts, I would like to compare the results of SCC monitoring with the availability seen from the openQA side. That would clearly show whether the problem is in the connection from the openQA network to the outside.
Yes, we will look into a bit more monitoring on that side. We should be aware that "from the openQA side" needs to be broken down by which specific machine we try to reach scc.suse.com and any CDN component (even outside SUSE) from. I am thinking of connectivity monitoring from each openQA worker machine to scc.suse.com.
Might it be correlated with the recent move of servers in Nuremberg from one floor to another? […]
This seems rather unlikely. For example, https://openqa.suse.de/tests/8368303#step/scc_registration/11 ran on openqaworker13. openqaworker13 and most production openQA workers are located in SUSE Nbg SRV1. The move of QA machines from the QA labs was to SUSE Nbg SRV2, which should not significantly affect network capabilities in SRV1.
We discussed this ticket in the weekly SUSE QE sync meeting. okurz suggested in particular to use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger; see #108773 as an example of that.
EDIT:
- https://openqa.suse.de/tests/8379417#step/journal_check/4 is a test on "svirt-xen-hvm" which is running on openqaw5-xen.qa.suse.de, so the QA network is definitely involved. If DHCP fails, that might be on openqaw5-xen.qa.suse.de (one could take a look into the logs there), but for a DNS resolution problem when trying to resolve scc.suse.com, the DNS server on qanet.qa.suse.de is certainly involved. So we could take a look into the DNS server logs on qanet.qa.suse.de (named service).
- https://openqa.suse.de/tests/8380623#step/podman/180 is a failure in podman within an internal Tumbleweed container failing to get valid data from download.opensuse.org from within a qemu VM running on openqaworker8, so neither scc.suse.com nor the SUSE QA net is involved.
We could separate into three different areas and follow up with improvements and better investigation for each of these:
- Check DNS resolution and DHCP stability within QA net, e.g. more monitoring, mtr, etc. -> #108845
- Check accessibility to SCC: that should be covered by #102945, but maybe we can progress faster than waiting for the "management level" follow-up by implementing some monitoring on our side, e.g. telegraf ping checks against components like download.opensuse.org, scc.suse.com and proxy-scc
Updated by jlausuch over 2 years ago
- File scc_timeout.png added
Updated by lpalovsky over 2 years ago
HA multi-machine (MM) jobs have been affected by this since the last build 113.1 as well.
What I am seeing are SCC connection timeouts:
https://openqa.suse.de/tests/8373145#step/welcome/10
But also hostname resolution problems, not only related to SCC but also affecting the connection between the iSCSI client and server:
https://openqa.suse.de/tests/8373191#step/iscsi_client/13
https://openqa.suse.de/tests/8368368#step/patch_sle/163
https://openqa.suse.de/tests/8363004#step/ha_cluster_init/18
https://openqa.suse.de/tests/8373197#step/iscsi_client/13
Updated by jlausuch over 2 years ago
And a non-SCC-related one:
https://openqa.suse.de/tests/8380471#step/rootless_podman/103
Timeout exceeded when accessing 'http://download.opensuse.org/tumbleweed/repo/non-oss/content'
and also https://openqa.suse.de/tests/8380623#step/podman/180
Updated by jlausuch over 2 years ago
This is definitely something related to our infra network: https://openqa.suse.de/tests/8379417#step/journal_check/4
localhost wicked[1059]: eth0: DHCP4 discovery failed
and the next module fails at adding a repo, even after a few retries: https://openqa.suse.de/tests/8379417#step/pam/51
and also the curl command to upload the logs to the worker: https://openqa.suse.de/tests/8379417#step/pam/54
And DNS resolution issue:
https://openqa.suse.de/tests/8379419#step/suseconnect_scc/16 (getaddrinfo: Temporary failure in name resolution)
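When this happens it would help to capture the resolver state right away, so the failure can be attributed either to the VM-internal resolver (10.0.2.3 with qemu user networking) or to whatever is behind it. A sketch of the diagnostics worth uploading (the command list is an assumption, not an existing helper):

# Sketch only: gather DNS/network state from inside the SUT at failure time.
sub log_dns_state_sketch {
    script_run('cat /etc/resolv.conf');
    script_run('host scc.suse.com || nslookup scc.suse.com');      # whichever tool is installed
    script_run('ip -br addr; ip route');
    script_run('journalctl --no-pager -t wicked | tail -n 50');    # the "DHCP4 discovery failed" lines
    save_screenshot;
}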
Updated by jlausuch over 2 years ago
Today is just a mess:
https://openqa.suse.de/tests/8379452#step/suseconnect_scc/16
https://openqa.suse.de/tests/8379449#step/suseconnect_scc/16
https://openqa.suse.de/tests/8381692#step/docker_runc/116
https://openqa.suse.de/tests/8382172#step/docker/89
https://openqa.suse.de/tests/8382012#step/buildah/78
https://openqa.suse.de/tests/8381902#step/zypper_docker/96
https://openqa.suse.de/tests/8379451#step/patch_and_reboot/148
https://openqa.suse.de/tests/8381868#step/zypper_docker/96
and this is just a portion of all the failures that I've seen today...
Updated by lpalovsky over 2 years ago
The latest build 116.4 is again very bad for HA. The issues are mostly related to either lost connections or DNS resolution failures. This happens between cluster nodes, on the SSH connection to the SUT, or towards external sites like openqa.suse.de or SCC.
Various SSH disconnections:
https://openqa.suse.de/tests/8375657#step/patch_sle/150
https://openqa.suse.de/tests/8374846#step/boot_to_desktop/16
Node being unreachable/network resolution issue:
https://openqa.suse.de/tests/8376302#step/ha_cluster_init/15
https://openqa.suse.de/tests/8376293#step/iscsi_client/11
https://openqa.suse.de/tests/8376330#step/iscsi_client/13
The examples above are only a portion of the failed tests.
Updated by okurz over 2 years ago
So in the above examples I can see
- qemu jobs on machines within SRV1 having problems accessing dist.suse.de, e.g. https://openqa.suse.de/tests/8381692#step/docker_runc/116 (aarch64) and https://openqa.suse.de/tests/8382172#step/docker/91 (x86_64) -> please create a specific subticket about that and investigate the network performance between the physical machines and dist.suse.de, or create an EngInfra ticket
- XEN+svirt jobs on openqaw5-xen.qa.suse.de, e.g. https://openqa.suse.de/tests/8379451#step/patch_and_reboot/148 , that is covered in #108845 , feel welcome to add an according auto-review regex for the error messages. We are looking into the problem
- s390x jobs like https://openqa.suse.de/tests/8375657#step/patch_sle/150 losing connections just in the middle -> please create a specific subticket about that and investigate the network performance between the physical machines and dist.suse.de, or create an EngInfra ticket. Could be related to #99345, although the mentioned job failed rather than becoming incomplete with "Connection timed out"
- ppc qemu jobs like https://openqa.suse.de/tests/8376302#step/ha_cluster_init/15 -> that again looks to be something we could treat as separate. Please report a ticket as such.
Updated by okurz over 2 years ago
- Related to action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min added
Updated by jlausuch over 2 years ago
More name resolution ones:
https://openqa.suse.de/tests/8387038#step/suseconnect_scc/16 - openqaworker2:9
https://openqa.suse.de/tests/8387053#step/suseconnect_scc/16 - openqaworker2:9
https://openqa.suse.de/tests/8387089#step/suseconnect_scc/16 - openqaworker2:10
https://openqa.suse.de/tests/8387088#step/suseconnect_scc/16 - openqaworker2:16
And DHCP discovery failures:
https://openqa.suse.de/tests/8387055#step/journal_check/4 - openqaworker2:16
https://openqa.suse.de/tests/8387085#step/journal_check/4 - openqaworker2:16
Updated by lpalovsky over 2 years ago
okurz wrote:
- ppc qemu jobs like https://openqa.suse.de/tests/8376302#step/ha_cluster_init/15 -> that again looks to be something we could treat as separate. Please report a ticket as such.
Thanks, I have created a separate ticket for that issue: #108962
Updated by okurz over 2 years ago
A change in network cabling was made around 2022-03-28 08:15Z. We now observe a stable connection between the QA switches and the previously affected machines. We will continue to monitor the situation; see #108845#note-25 for details. Please report any related issues still occurring for openQA tests that were started after the above time.
Updated by jlausuch over 2 years ago
I have noticed fewer network issues lately, but still some DNS resolution failures here and there (not that many):
- https://openqa.suse.de/tests/8426618#step/cifs/63
could not resolve address for currywurst.qam.suse.de
- https://openqa.suse.de/tests/8426620#step/podman/96
could not resolve host google.de
Updated by jstehlik over 2 years ago
I consider this issue resolved by the cabling fix. The topic of infrastructure reliability shall be followed up in https://progress.opensuse.org/issues/109250
Updated by okurz over 2 years ago
jstehlik wrote:
I consider this issue resolved by the cabling fix. The topic of infrastructure reliability shall be followed up in https://progress.opensuse.org/issues/109250
I think we shouldn't declare this ticket "resolved" just because one problem was fixed. The problem might have been small, but the impact was huge. I suggest we discuss further improvement ideas on multiple levels so that any new or next upcoming problem has a less severe impact. Also, multiple subtasks are still open. If you would like to await their results before reviewing this ticket, I suggest using the "Blocked" status.
Updated by dzedro over 2 years ago
Now there are serious network problems on s390x; this kind of issue happened before as well, but now it's very bad, I guess since yesterday afternoon.
If I look at the live log or serial output of a running job, at some point it freezes: no boot progress updates or worker debug output anymore.
Then it just shows up as failed; see the attachment.
Three variations of test unable to boot.
https://openqa.suse.de/tests/8538657#step/bootloader_zkvm/18
https://openqa.suse.de/tests/8538656#step/boot_to_desktop/16
https://openqa.suse.de/tests/8538654#step/bootloader_start/19
Updated by dzedro over 2 years ago
Also today there are many failed s390x tests at boot, e.g.
https://openqa.suse.de/tests/8550263#step/bootloader_start/19
https://openqa.suse.de/tests/8550253#step/bootloader_zkvm/18
Updated by dzedro over 2 years ago
Looks like the s390x failures were not related to the network but to disk space. https://suse.slack.com/archives/C02CANHLANP/p1649923742408719
Updated by okurz over 2 years ago
We discussed in the weekly QE Sync meeting 2022-05-04 that some problems have been fixed (operational work). We should raise to LSG mgmt, e.g. in the "Open Doors" meeting, how we suffer from the impact of bad infrastructure, and push for improvements, e.g. with the planned datacenter move. szarate mentions that we are already occupying multiple engineers in their daily work with investigating network-related problems, e.g. effectively losing a full day due to a non-executed milestone build validation.
Updated by okurz over 2 years ago
Asked Jose Lausuch and Jozef Pupava in https://suse.slack.com/archives/C02CANHLANP/p1652176711212269
as the main commenters on https://progress.opensuse.org/issues/107062 , in your opinions, what do you see as necessary to resolve this ticket?
EDIT: I got confirmation from jpupava and jlausuch that both see the original issue(s) resolved. We agree that there is still a lot of room for improvement, e.g. base-level infrastructure monitoring but that has been stated likely sufficiently. I think it makes sense to track the work in the still open subtasks before resolving the ticket but after all subtasks are resolved then nothing more would be needed. Added acceptance criteria as such.
Updated by szarate over 2 years ago
On the topic of how much in terms of hours this took:
While we know what the root cause of the problem was (the network cable), there's a second level to the root cause, which is the lack of manpower and monitoring on the infrastructure side (I guess here it's SUSE IT and QE).
Now in terms of costs:
So it's around 528 hours (66 persons times roughly one eight-hour day), only counting 3 engineers from QE tools and 3 from openQA maintenance review, and accounting for ~59 QE engineers and 1 RM, 66 persons in total, assuming that:
- Some of them had to do the review twice (the day of the failure and the day after).
- Some of them were sitting idle during manual validation, due to some resources not being available.
- Some of them could not verify or run verification runs that depended on resources in OSD or qanet due to the network being degraded/down.
This doesn't account for the hours of automated testing, which can be calculated and directly converted to € if we take the power consumption of the servers into account, nor for the work that had to be done as fallout (your time, the meeting last Wednesday, plus any other meetings happening).
NOTE: My numbers might be off; if somebody wants to cross-check, be my guest :). Also, having data on how many hours of openQA testing were lost would be a good idea, for this and future incidents.
Updated by okurz over 2 years ago
@szarate your estimation looks good to me.
@jstehlik from today's discussion it sounds like some people still see some issues; you mentioned "zypper problems". I think the current subtasks of this ticket do not cover this. We need to ensure that there are tickets which handle each issue explicitly, so that no one is waiting for a miracle solution to their problems.
Updated by jstehlik over 2 years ago
Further evaluation is needed when the new test rack is set up in Prague.
Also, this issue seems related: "QA network infrastructure: Ping time alert for several hosts"
https://progress.opensuse.org/issues/113498
Updated by szarate over 2 years ago
- Related to action #113528: [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly) added
Updated by szarate over 2 years ago
- Related to action #113716: [qe-core] proxy-scc is down added