action #107062

open

Multiple failures due to network issues

Added by jlausuch about 2 years ago. Updated over 1 year ago.

Status: Feedback
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version: -
Start date: 2021-09-27
Due date:
% Done: 77%
Estimated time: (Total: 0.00 h)
Difficulty:
Tags:

Description

Observation

I will use this ticket to collect the different errors I observe in tests that fail due to network issues (at least those of the QE-C squad).
Restarting the job usually gets it green again (when we need it green), but that is not an ideal solution.

The idea of this ticket is to collect more potential issues caught by reviewers and to propose solutions for some of them, either in the code (retrying the same command several times might help) or on the infrastructure side.
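
To make the "retry the same command several times" idea concrete, here is a minimal sketch in Python. It is only an illustration, not the os-autoinst test API: the helper name `run_with_retry` and all of its parameters and default values are assumptions made for this example. It re-runs a command with an increasing delay between attempts and gives up after a fixed number of tries.

```python
import subprocess
import time


def run_with_retry(cmd, retries=3, delay=5.0, backoff=2.0, timeout=90,
                   sleep=time.sleep):
    """Run cmd (a list of arguments) until it exits 0 or attempts run out.

    Hypothetical helper for illustration only. Returns the number of
    attempts used on success and raises RuntimeError if every attempt
    fails or times out.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout)
            if result.returncode == 0:
                return attempt
            last_error = f"exit code {result.returncode}"
        except subprocess.TimeoutExpired:
            last_error = f"timed out after {timeout}s"
        if attempt < retries:
            # Wait before the next attempt and lengthen the wait each time.
            sleep(delay)
            delay *= backoff
    raise RuntimeError(f"{cmd!r} failed after {retries} attempts: {last_error}")
```

A call such as `run_with_retry(["SUSEConnect", "-r", regcode])` would then tolerate a short outage instead of failing the job on the first timeout. A retry only hides the symptom, so the failed attempts should still be logged to keep the infrastructure problem visible.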

There is one example below for each error I found, but in my experience reviewing jobs every day, these failures happen multiple times a day and at random, so they are difficult to predict.

1) SUSEConnect timeouts -> https://openqa.suse.de/tests/8189768#step/docker/34
Test died: command 'SUSEConnect -p sle-module-containers/${VERSION_ID}/${CPU} ' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.

Or https://openqa.suse.de/tests/8193554#step/suseconnect_scc/8
Test died: command 'SUSEConnect -r $regcode' timed out at /usr/lib/os-autoinst/testapi.pm line 950.

2) updates.suse.com not reachable -> https://openqa.suse.de/tests/8189697#step/image_docker/1110

Retrieving: kmod-25-6.10.1.aarch64.rpm [.........error]
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
Download (curl) error for 'https://updates.suse.com/SUSE/Updates/SLE-Module-Basesystem/15-SP2/aarch64/update/aarch64/kmod-25-6.10.1.aarch64.rpm?nE0jiYdfiOdLYjH0o-llNN2xIDXncon0vYw8z1aBPGx00H9S1eN413vUsfSJnzFrVz-CoZoGtSdsPKIDRAOQy3Xw2Tac3Yx5_1i8TPomSNiqhDJ0Ayxro23n46NHHB-XHq669RlHs17wiUFSJiSMCSh-YzdGdFw':
Error code: Connection failed
Error message: Could not resolve host: updates.suse.com

Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
Please see the above error message for a hint.

3) SCC timeouts -> https://openqa.suse.de/tests/8189613#step/image_docker/316

docker run --entrypoint /usr/lib/zypp/plugins/services/container-suseconnect-zypp -i zypper_docker_derived lp
...
2022/02/18 07:16:19 Installed product: SLES-12.3-x86_64
2022/02/18 07:16:19 Registration server set to https://scc.suse.com
2022/02/18 07:16:30 Get https://scc.suse.com/connect/subscriptions/products?arch=x86_64&identifier=SLES&version=12.3: dial tcp: lookup scc.suse.com on 10.0.2.3:53: read udp 172.17.0.2:37151->10.0.2.3:53: i/o timeout

4) zypper ref timeout or error -> https://openqa.opensuse.org/tests/2193730#step/image_podman/124

podman run -i --name 'refreshed' --entrypoint '' registry.opensuse.org/opensuse/leap/15.3/images/totest/containers/opensuse/leap:15.3 zypper -nv ref
...
Retrieving: cb71cb070e8aac79327e6f1b6edc5317122ca1f72970299c3cb2cf505e18b27f-deltainfo.xml.gz [........................done (82.3 KiB/s)]
Retrieving: 832729371fe20bc1a4d27e59d76c10ffe2c0b5a1ff71c4e934e7a11baa24a74b-primary.xml.gz [............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................error (87.0 KiB/s)]
WJdDM-124-
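
To help reviewers sort new failures into the four groups above, the messages could be matched with a few regular expressions. The following is a rough Python sketch under stated assumptions: the patterns are derived only from the sample logs in this ticket, and the category names and the `classify` function are made up for this illustration.

```python
import re

# Patterns taken from the sample logs above; category names are illustrative.
PATTERNS = [
    ("suseconnect_timeout", re.compile(r"command 'SUSEConnect .*' timed out")),
    ("dns_failure", re.compile(r"Could not resolve host: \S+")),
    ("scc_dns_timeout", re.compile(r"lookup scc\.suse\.com on \S+: .*i/o timeout")),
    ("repo_download_error", re.compile(r"Retrieving: \S+ \[.*error")),
]


def classify(log_text):
    """Return the name of the first matching category, or None."""
    for name, pattern in PATTERNS:
        if pattern.search(log_text):
            return name
    return None
```

The order matters: a log that contains both a failed download and "Could not resolve host" is reported as a DNS failure, because that is the more specific cause. Real logs will likely need more patterns than these four.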

Acceptance criteria

  • AC1: All existing subtasks are resolved, no additional work needed on top

Files

Screenshot 2022-02-21 at 11.03.33.png (31.4 KB), jlausuch, 2022-02-21 10:03
canvas.png (25.5 KB), pdostal, 2022-03-01 14:04
expert.jpg (60.9 KB), jstehlik, 2022-03-22 13:11
scc_timeout.png (11.7 KB), jlausuch, 2022-03-23 09:35
Screenshot_2022-04-12_22-16-57.png (80.2 KB), "no response, no log, no connection ?", dzedro, 2022-04-13 09:01

Subtasks 11 (3 open, 8 closed)

action #99345: [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M (Resolved, mkittler, 2021-09-27)
openQA Infrastructure - action #108266: grenache: script_run() commands randomly time out since server room move (New, 2022-03-14)
openQA Infrastructure - action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:M (Resolved, nicksinger, 2022-03-24)
openQA Infrastructure - action #108872: Outdated information on openqaw5-xen https://racktables.suse.de/index.php?page=object&tab=default&object_id=3468 (New, cachen)
openQA Infrastructure - action #108896: [ppc64le] auto_review:"(?s)Size of.*differs, expected.*but downloaded.*Download.*failed: 521 Connect timeout":retry (Resolved, okurz, 2022-03-24)
action #108953: [tools] Performance issues in some s390 workers (Resolved, okurz, 2022-03-25)
openQA Infrastructure - action #109241: Prefer to use domain names rather than IPv4 in salt pillars size:M (Resolved, okurz)
openQA Infrastructure - action #109253: Add monitoring for SUSE QA network infrastructure size:M (Resolved, jbaier_cz)
openQA Infrastructure - action #120169: Make s390x kvm workers also use FQDN instead of IPv4 in salt pillars for VIRSH_GUEST (New, 2022-11-09)
openQA Infrastructure - action #120261: tests should try to access worker by WORKER_HOSTNAME FQDN but sometimes get 'worker2' or something auto_review:".*curl.*worker\d+:.*failed at.*":retry size:meow (Resolved, mkittler, 2022-11-10)
openQA Infrastructure - action #121672: [virtualization] Connectivity issues on worker8-vmware.oqa.suse.de (Resolved, okurz, 2022-12-07)

Related issues 4 (1 open, 3 closed)

Related to openQA Tests - action #107635: [qem][y] test fails in installation (New, 2022-02-25)
Related to openQA Infrastructure - action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min (Rejected, mkittler, 2022-03-21)
Related to openQA Tests - action #113528: [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly) (Resolved, szarate, 2022-07-13, 2022-07-18)
Related to openQA Infrastructure - action #113716: [qe-core] proxy-scc is down (Resolved, szarate, 2022-07-18, 2022-07-19)
