action #107062
Multiple failures due to network issues (open)
Added by jlausuch almost 3 years ago. Updated 2 months ago.
78% done
Description
Observation
I will use this ticket to collect the different errors I observe in our tests (at least for the QE-C squad) that fail due to network issues.
Normally a restart helps to get the job green again (in case we need it green), but this is not the ideal solution.
The idea of this ticket is to collect more potential issues caught by reviewers and propose solutions for some of them, either in the code (retrying the same command several times might help) or on the infra side.
There is an example for each error I found, but from my experience reviewing jobs every day, these failures happen multiple times a day and randomly (difficult to predict).
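To make the in-code retry idea concrete, here is a minimal sketch in the style of our Perl test code. This is illustrative only: the helper name and defaults are made up, the test distribution already has similar helpers (e.g. script_retry in utils.pm), and the exact return-value handling of script_run may differ between os-autoinst versions. The concrete error classes collected so far follow right after the sketch.

use strict;
use warnings;
use testapi;

# Minimal sketch of a generic retry helper (illustrative only).
sub retry_cmd {
    my ($cmd, %args) = @_;
    my $retries = $args{retry}   // 3;     # number of attempts
    my $delay   = $args{delay}   // 30;    # seconds between attempts
    my $timeout = $args{timeout} // 90;    # per-attempt timeout

    for my $attempt (1 .. $retries) {
        my $ret = script_run($cmd, timeout => $timeout);
        return if defined($ret) && $ret == 0;
        record_info('retry', "'$cmd' failed (attempt $attempt/$retries), waiting ${delay}s");
        sleep $delay;
    }
    die "'$cmd' still failing after $retries attempts";
}

A persistent outage will of course still fail after the last attempt, which is exactly what we want to surface in review.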
1) SUSEConnect timeouts -> https://openqa.suse.de/tests/8189768#step/docker/34
Test died: command 'SUSEConnect -p sle-module-containers/${VERSION_ID}/${CPU} ' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.
Or https://openqa.suse.de/tests/8193554#step/suseconnect_scc/8
Test died: command 'SUSEConnect -r $regcode' timed out at /usr/lib/os-autoinst/testapi.pm line 950.
2) updates.suse.com not reachable -> https://openqa.suse.de/tests/8189697#step/image_docker/1110
Retrieving: kmod-25-6.10.1.aarch64.rpm [.........error]
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
Download (curl) error for 'https://updates.suse.com/SUSE/Updates/SLE-Module-Basesystem/15-SP2/aarch64/update/aarch64/kmod-25-6.10.1.aarch64.rpm?nE0jiYdfiOdLYjH0o-llNN2xIDXncon0vYw8z1aBPGx00H9S1eN413vUsfSJnzFrVz-CoZoGtSdsPKIDRAOQy3Xw2Tac3Yx5_1i8TPomSNiqhDJ0Ayxro23n46NHHB-XHq669RlHs17wiUFSJiSMCSh-YzdGdFw':
Error code: Connection failed
Error message: Could not resolve host: updates.suse.com
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
Please see the above error message for a hint.
3) SCC timeouts -> https://openqa.suse.de/tests/8189613#step/image_docker/316
docker run --entrypoint /usr/lib/zypp/plugins/services/container-suseconnect-zypp -i zypper_docker_derived lp
...
2022/02/18 07:16:19 Installed product: SLES-12.3-x86_64
2022/02/18 07:16:19 Registration server set to https://scc.suse.com
2022/02/18 07:16:30 Get https://scc.suse.com/connect/subscriptions/products?arch=x86_64&identifier=SLES&version=12.3: dial tcp: lookup scc.suse.com on 10.0.2.3:53: read udp 172.17.0.2:37151->10.0.2.3:53: i/o timeout
4) zypper ref timeout or error -> https://openqa.opensuse.org/tests/2193730#step/image_podman/124
podman run -i --name 'refreshed' --entrypoint '' registry.opensuse.org/opensuse/leap/15.3/images/totest/containers/opensuse/leap:15.3 zypper -nv ref
...
Retrieving: cb71cb070e8aac79327e6f1b6edc5317122ca1f72970299c3cb2cf505e18b27f-deltainfo.xml.gz [........................done (82.3 KiB/s)]
Retrieving: 832729371fe20bc1a4d27e59d76c10ffe2c0b5a1ff71c4e934e7a11baa24a74b-primary.xml.gz [...error (87.0 KiB/s)]
WJdDM-124-
Acceptance criteria
- AC1: All existing subtasks are resolved, no additional work needed on top
Files
- Screenshot 2022-02-21 at 11.03.33.png (31.4 KB), jlausuch, 2022-02-21 10:03
- canvas.png (25.5 KB), pdostal, 2022-03-01 14:04
- expert.jpg (60.9 KB), jstehlik, 2022-03-22 13:11
- scc_timeout.png (11.7 KB), jlausuch, 2022-03-23 09:35
- Screenshot_2022-04-12_22-16-57.png (80.2 KB), "no response, no log, no connection?", dzedro, 2022-04-13 09:01
Updated by jlausuch almost 3 years ago
More examples of failures/timeouts activating a module via SUSEConnect (see the attached screenshot):
Updated by jlausuch almost 3 years ago
I have created this proposal for adding a product/module with suseconnect.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14283
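This is not the content of the PR, just the rough shape of a call site once module activation goes through a retrying helper (reusing the hypothetical retry_cmd sketch from the description; retry counts and timeouts are arbitrary):

# Hypothetical call site (sketch only): activate a module with retries.
my $version = script_output('. /etc/os-release && echo $VERSION_ID');   # e.g. 15.3
my $arch    = script_output('uname -m');                                # e.g. x86_64
retry_cmd("SUSEConnect -p sle-module-containers/$version/$arch",
    retry => 5, delay => 60, timeout => 180);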
Updated by jlausuch almost 3 years ago
- File scc_timeout.png added
2 more issues due to SCC not reachable:
https://openqa.suse.de/tests/8201102#step/suseconnect_scc/6
https://openqa.suse.de/tests/8201100#step/suseconnect_scc/6
Updated by jlausuch almost 3 years ago
And another one with updates.suse.com not reachable:
https://openqa.suse.de/tests/8201099#step/glibc_locale/47
Updated by jlausuch almost 3 years ago
New PR to create a retry wrapper for the validate_script_output method:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14295
This will help with validate_script_output calls that are prone to time out when the network is slow.
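For reference, a sketch of what such a wrapper could look like (name, signature and defaults are assumptions here, the PR defines the actual interface). Since validate_script_output dies on a mismatch or timeout, each attempt is wrapped in eval:

# Sketch only, not the merged code: retrying validation wrapper.
sub validate_script_output_retry_sketch {
    my ($cmd, $check, %args) = @_;
    my $retries = delete $args{retry} // 3;
    my $delay   = delete $args{delay} // 30;

    for my $attempt (1 .. $retries) {
        # validate_script_output dies if the check fails or the command times out
        my $ok = eval { validate_script_output($cmd, $check, %args); 1 };
        return if $ok;
        record_info('retry', "validation of '$cmd' failed (attempt $attempt/$retries): $@");
        sleep $delay;
    }
    die "validation of '$cmd' still failing after $retries attempts";
}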
Updated by dzedro almost 3 years ago
A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20
https://openqa.suse.de/tests/8198732#step/scc_registration/16
Updated by jlausuch almost 3 years ago
dzedro wrote:
A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20
For this, we can make a script_retry, similar to what I did here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14327
https://openqa.suse.de/tests/8198732#step/scc_registration/16
This is more tricky, as it's installation UI, we can't use any in-code retry here... (it would need needle handling for error cases)
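For the installation UI case the only realistic option is indeed needle handling. A very rough sketch of that idea follows; the needle tags and keyboard shortcuts are hypothetical and would have to match the real dialogs:

# Sketch only: wait for the registration to either succeed or show an error
# popup, and retry a few times before giving up.
sub wait_for_scc_registration_sketch {
    my $retries = 3;
    for my $attempt (1 .. $retries) {
        assert_screen([qw(module-selection registration-error-popup)], 300);
        return if match_has_tag('module-selection');
        record_info('SCC retry', "registration error popup, attempt $attempt/$retries");
        send_key 'alt-o';    # dismiss the popup (OK)
        send_key 'alt-r';    # hypothetical shortcut to trigger registration again
    }
    die 'SCC registration kept failing with server errors';
}

Whether it is safe to just dismiss such popups and retry would still need to be confirmed.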
Updated by okurz almost 3 years ago
- Related to action #107635: [qem][y] test fails in installation added
Updated by okurz almost 3 years ago
As suggested over other channels (which might have gotten lost): I suggest getting in contact with the SCC team and the EngInfra team, e.g. creating bugs in their respective issue trackers and linking them here, or pinging them over chat or email and inviting them to brainstorm and investigate this issue. We don't know where the problem is, but I am sure that together with experts from other teams we have enough brain power to solve this riddle. It could be that there is a problem within the product that we or you test, and then of course we should handle it as an appropriate product problem or regression that should be fixed, because customers will also be affected. It could also be a problem in the user-facing infrastructure, especially for updates.suse.com on a CDN, where customers would or could also be affected. Then as well we should not just accept this issue but look into it and try to solve it together.
Updated by jlausuch almost 3 years ago
okurz wrote:
As suggested over other channels (which might have gotten lost): I suggest getting in contact with the SCC team and the EngInfra team, e.g. creating bugs in their respective issue trackers and linking them here, or pinging them over chat or email and inviting them to brainstorm and investigate this issue. We don't know where the problem is, but I am sure that together with experts from other teams we have enough brain power to solve this riddle. It could be that there is a problem within the product that we or you test, and then of course we should handle it as an appropriate product problem or regression that should be fixed, because customers will also be affected. It could also be a problem in the user-facing infrastructure, especially for updates.suse.com on a CDN, where customers would or could also be affected. Then as well we should not just accept this issue but look into it and try to solve it together.
Yes, that would be ideal, but first we would need to collect evidence and have dedicated tests for SCC/updates.suse.com/etc. that check the connectivity (I think there is a ticket about it) and collect some statistics to see how that behaves over time. Maybe we can recognize a pattern in when these hiccups happen, or maybe we can't. I think this is a quite complex scenario to define, and I personally don't have the time to drive this initiative forward.
Updated by jlausuch almost 3 years ago
Again: https://openqa.suse.de/tests/8242917#step/image_docker/705
> docker run --entrypoint /usr/lib/zypp/plugins/services/container-suseconnect-zypp -i zypper_docker_derived lm
2022/02/28 13:53:28 Installed product: SLES-12.4-x86_64
2022/02/28 13:53:28 Registration server set to https://scc.suse.com
2022/02/28 13:53:42 Get "https://scc.suse.com/connect/subscriptions/products?arch=x86_64&identifier=SLES&version=12.4": dial tcp: lookup scc.suse.com on 10.0.2.3:53: read udp 172.17.0.2:39374->10.0.2.3:53: i/o timeout
Hopefully, this type of failure will be worked around by https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14295
Updated by jlausuch almost 3 years ago
And more:
https://openqa.suse.de/tests/8243068#step/registry/156
> SUSEConnect --status-text
> EOT_L_NVX
L_NVX-0-
# echo L_NVX; bash -oe pipefail /tmp/scriptL_NVX.sh ; echo SCRIPT_FINISHEDL_NVX-$?-
L_NVX
Error: Cannot parse response from server
Updated by jlausuch almost 3 years ago
An interesting one that failed even after 3 retries... and this has nothing to do with SCC or any suse.de domain:
https://openqa.suse.de/tests/8245957#step/docker_3rd_party_images/825
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
# timeout 420 docker pull registry.access.redhat.com/ubi7/ubi-init; echo 2n4wD-$?-
Using default tag: latest
Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not found\n"
2n4wD-1-
I guess this time we can blame RH :)
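This one also shows the limit of plain retries: a 404 from the registry is not transient, so no number of attempts will help. A sketch of a pull retry that bails out early on clearly permanent errors (helper name and error patterns are assumptions):

# Sketch only: retry docker pull, but stop immediately on non-transient errors.
sub retry_pull_sketch {
    my ($image, %args) = @_;
    my $retries = $args{retry} // 3;
    for my $attempt (1 .. $retries) {
        my $out = script_output("timeout 420 docker pull $image 2>&1; echo RC=\$?",
            480, proceed_on_failure => 1);
        return if $out =~ /RC=0\b/;
        # A 404 / missing manifest will not fix itself, so do not keep retrying.
        die "image $image not found on the registry" if $out =~ /404|manifest unknown|not found/i;
        record_info('retry', "pull of $image failed (attempt $attempt/$retries)");
        sleep 30;
    }
    die "pull of $image still failing after $retries attempts";
}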
Updated by pdostal almost 3 years ago
jlausuch wrote:
dzedro wrote:
A few examples of SCC/network failures; it's nothing rare.
https://openqa.suse.de/tests/8219630#step/zypper_patch/20
For this, we can make a script_retry, similar to what I did here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14327
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14383
Updated by pdostal almost 3 years ago
- File canvas.png added
jlausuch wrote:
dzedro wrote:
https://openqa.suse.de/tests/8198732#step/scc_registration/16
This is more tricky, as it's installation UI, we can't use any in-code retry here... (it would need needle handling for error cases)
YaST2 Installation - Connection to registration server failed. (I'm just documenting this so it's not cleaned by openQA).
Updated by jlausuch almost 3 years ago
jlausuch wrote:
I have created this proposal for adding a product/module with suseconnect.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14283
Follow-up: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14402
Updated by jlausuch almost 3 years ago
A new one today in a parent job that many other jobs depend on...
Installation: https://openqa.suse.de/tests/8261955#step/scc_registration/31
Cannot parse the data from the server
Updated by jlausuch almost 3 years ago
For the record:
https://openqa.suse.de/tests/8261596#step/image_docker/1491
# timeout 600 docker exec refreshed zypper -nv ref; echo Uyq3l-$?-
Entering non-interactive mode.
Verbosity: 2
Initializing Target
Refreshing service 'container-suseconnect-zypp'.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp]
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Specified repositories:
Warning: There are no enabled repositories defined.
Use 'zypper addrepo' or 'zypper modifyrepo' commands to add or enable repositories.
This failed even after some 3 retries, and the subsequent modules also failed with similar connectivity issues:
> SUSEConnect --status-text
> EOT_L_NVX
L_NVX-0-
# echo L_NVX; bash -oe pipefail /tmp/scriptL_NVX.sh ; echo SCRIPT_FINISHEDL_NVX-$?-
L_NVX
SUSEConnect error: SocketError: getaddrinfo: Temporary failure in name resolution
which could be partially avoided with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14409
Updated by dzedro almost 3 years ago
I guess it's time to fix the real network or SCC issue.
Updated by jlausuch almost 3 years ago
dzedro wrote:
I guess it's time to fix the real network or SCC issue.
+1
We just need a driver for that :)
Updated by jlausuch almost 3 years ago
- Related to action #107806: docker build fails - No provider of 'apache2' found due to container-suseconnect-zypp failure. added
Updated by jlausuch almost 3 years ago
https://openqa.suse.de/tests/8262854#step/image_docker/921
# docker run --rm -ti registry.suse.com/suse/sle15:15.2 zypper ls | grep 'Usage:'; echo GcdHR-$?-
GcdHR-1-
# cat > /tmp/scriptTBhqL.sh << 'EOT_TBhqL'; echo TBhqL-$?-
> docker run -i --entrypoint '' registry.suse.com/suse/sle15:15.2 zypper lr -s
> EOT_TBhqL
TBhqL-0-
# echo TBhqL; bash -oe pipefail /tmp/scriptTBhqL.sh ; echo SCRIPT_FINISHEDTBhqL-$?-
TBhqL
Refreshing service 'container-suseconnect-zypp'.
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Warning: No repositories defined.
Use the 'zypper addrepo' command to add one or more repositories.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp]
SCRIPT_FINISHEDTBhqL-6-
Updated by jlausuch almost 3 years ago
Today again, parent installation jobs failing in the SCC registration step:
https://openqa.suse.de/tests/8269572#step/scc_registration/29
https://openqa.suse.de/tests/8269587#step/scc_registration/36
Updated by jlausuch almost 3 years ago
SLE Micro timeout in transactional-update migration:
https://openqa.suse.de/tests/8268653#step/zypper_migration/9
Calling zypper migration --no-snapshots --no-selfupdate
2022-03-04 01:25:26 tukit 3.4.0 started
2022-03-04 01:25:26 Options: call 5 zypper migration --no-snapshots --no-selfupdate
2022-03-04 01:25:26 Executing `zypper migration --no-snapshots --no-selfupdate`:
Executing 'zypper refresh'
Repository 'SUSE-MicroOS-5.0-Pool' is up to date.
Repository 'SUSE-MicroOS-5.0-Updates' is up to date.
Repository 'TEST_0' is up to date.
Repository 'TEST_1' is up to date.
Repository 'TEST_10' is up to date.
Repository 'TEST_11' is up to date.
Repository 'TEST_12' is up to date.
Repository 'TEST_2' is up to date.
Repository 'TEST_3' is up to date.
Repository 'TEST_4' is up to date.
Repository 'TEST_5' is up to date.
Repository 'TEST_6' is up to date.
Repository 'TEST_7' is up to date.
Repository 'TEST_8' is up to date.
Repository 'TEST_9' is up to date.
All repositories have been refreshed.
Can't determine the list of installed products: JSON::ParserError: 765: unexpected token at '<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
'
'/usr/lib/zypper/commands/zypper-migration' exited with status 1
2022-03-04 01:25:42 Application returned with exit status 1.
Updated by jlausuch almost 3 years ago
Timeouts in some container commands running zypper lr:
https://openqa.suse.de/tests/8269339#step/image_docker/919
https://openqa.suse.de/tests/8269313#step/image_docker/162
https://openqa.suse.de/tests/8269116#step/image_docker/929
> docker run -i --entrypoint '' registry.suse.com/suse/sles12sp5 zypper lr -s
> EOT_sed1h
sed1h-0-
# echo sed1h; bash -oe pipefail /tmp/scriptsed1h.sh ; echo SCRIPT_FINISHEDsed1h-$?-
sed1h
Refreshing service 'container-suseconnect-zypp'.
Problem retrieving the repository index file for service 'container-suseconnect-zypp':
[container-suseconnect-zypp|file:/usr/lib/zypp/plugins/services/container-suseconnect-zypp]
Warning: Skipping service 'container-suseconnect-zypp' because of the above error.
Warning: No repositories defined.
Use the 'zypper addrepo' command to add one or more repositories.
SCRIPT_FINISHEDsed1h-6-
This can be partially stabilized with retries: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14420
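Roughly what that retry looks like (sketch only, reusing the hypothetical retry_cmd from the description; the actual PR uses the distribution's own helper). The whole docker invocation has to be wrapped, because every run re-queries SCC through container-suseconnect-zypp, and zypper exits non-zero (6, "no repositories") when that fails:

# Sketch only: retry the whole in-container refresh, not just parts of it.
retry_cmd('docker exec refreshed zypper -nv ref', retry => 3, delay => 60, timeout => 600);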
Updated by jlausuch almost 3 years ago
btrfs_autocompletion:
Timeout exceeded when accessing 'https://scc.suse.com/access/services/1924/repo/repoindex.xml?credentials=SUSE_Linux_Enterprise_Server_15_SP2_x86_64'.
https://openqa.suse.de/tests/8268555#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8268561#step/btrfs_autocompletion/11
Updated by jlausuch almost 3 years ago
Today's failures:
installation job, scc timeout: https://openqa.suse.de/tests/8280236#step/scc_registration/15
btrfs_autocompletion:
File '/media.1/media' not found on medium 'http://dist.suse.de/ibs/SUSE:/Maintenance:/22627/SUSE_Updates_SLE-SERVER_12-SP5_x86_64/'
https://openqa.suse.de/tests/8279343#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8279361#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8279358#step/btrfs_autocompletion/11
https://openqa.suse.de/tests/8279355#step/btrfs_autocompletion/10
docker|podman build timeouts due to container-suseconnect-zypp:
https://openqa.suse.de/tests/8280575#step/image_docker/771
https://openqa.suse.de/tests/8280667#step/image_docker/1932
https://openqa.suse.de/tests/8280515#step/image_docker/2319
https://openqa.suse.de/tests/8280373#step/image_docker/771
https://openqa.suse.de/tests/8280357#step/image_podman/175
Module Activation via SUSEConnect:
https://openqa.suse.de/tests/8279393#step/registry/7
Updating system details on https://scc.suse.com ...
Activating PackageHub 15.3 x86_64 ...
Error: Cannot parse response from server
Updated by jlausuch almost 3 years ago
Today's failures:
Timeouts in installation jobs:
- https://openqa.suse.de/tests/8287090#step/scc_registration/8 (504 Gateway Time-out)
- https://openqa.suse.de/tests/8286159#step/scc_registration/17 (504 Gateway Time-out)
- https://openqa.suse.de/tests/8286272#step/scc_registration/17 (cannot parse data from server)
SUSEConnect --list-extensions:
- https://openqa.suse.de/tests/8285389#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285390#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285402#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285445#step/suseconnect_scc/12 (Error: Cannot parse response from server)
Zypper timeouts:
- https://openqa.suse.de/tests/8285493#step/toolbox/87 ('toolbox run -c devel -- zypper lr' timed out)
Other connectivity issues:
- https://openqa.suse.de/tests/8285630#step/prepare_instance/45 (https://compute.googleapis.com - network is unreachable)
Updated by jlausuch almost 3 years ago
jlausuch wrote:
SUSEConnect --list-extensions:
- https://openqa.suse.de/tests/8285389#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285390#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285402#step/suseconnect_scc/12 (Error: Cannot parse response from server)
- https://openqa.suse.de/tests/8285445#step/suseconnect_scc/12 (Error: Cannot parse response from server)
This PR will help with this specific issue:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14437
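Roughly the kind of call the affected tests would end up with (sketch only, reusing the hypothetical wrapper from the earlier comment; the merged PR defines the real helper and the expected output):

# Sketch only: re-run the command until its output validates.
validate_script_output_retry_sketch('SUSEConnect --list-extensions',
    sub { m/Containers Module/ }, retry => 3, delay => 60, timeout => 120);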
Updated by jlausuch almost 3 years ago
- Related to action #108064: Test fails in btrfs_autocompletion - System management is locked by the application with pid 1658 (zypper). added
Updated by jlausuch almost 3 years ago
2 more parent installation jobs today:
- https://openqa.suse.de/tests/8305172#step/scc_registration/20 (504 Gateway Time-out)
- https://openqa.suse.de/tests/8305850#step/scc_registration/16 (Cannot parse data from server)
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8327428#step/scc_registration/33 (Cannot parse data from server)
https://openqa.suse.de/tests/8326890#step/scc_registration/53 (Refreshing service 'Web_and_Scripting_Module_12_x86_64' failed)
https://openqa.suse.de/tests/8326342#step/installation/22 (Timeout)
https://openqa.suse.de/tests/8326325#step/installation/6 (Internal server error)
https://openqa.suse.de/tests/8326322#step/scc_registration/8 (Cannot parse data from server)
https://openqa.suse.de/tests/8326362#step/enable_lp_module/46 (Registration server returned '' 502)
Updated by jlausuch over 2 years ago
SCC issues from today:
- https://openqa.suse.de/tests/8338397#step/scc_registration/7 (Timeout exceeded accessing scc.suse.com)
- https://openqa.suse.de/tests/8336864#step/scc_registration/16 (Cannot parse data from server)
Non-SCC related:
- https://openqa.suse.de/tests/8336767#step/curl_ipv6/4 (could not resolve host www.zq1.de)
- https://openqa.suse.de/tests/8336791#step/wget_ipv6/9 ('wget -O- -q www3.zq1.de/test.txt' timed out)
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8354268#step/scc_registration/17 (Connection timeout. Make sure that the registration server is reachable and the connection is reliable.)
https://openqa.suse.de/tests/8354381#step/scc_registration/52 (Timeout exceeded accessing scc.suse.com)
https://openqa.suse.de/tests/8353928#step/scc_registration/20 (504 Gateway Time-out)
https://openqa.suse.de/tests/8354861#step/scc_registration/8 (Error Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP5_s390x' failed)
https://openqa.suse.de/tests/8352068#step/docker/68 (docker pull timeout) (not scc related)
Updated by jstehlik over 2 years ago
- Assignee set to jstehlik
I propose to use Nagios to monitor the availability of the mentioned services (if it is not used already and I just don't know about it) for a week or two to collect good statistics. This topic also seems related to #103656 (https://progress.opensuse.org/issues/103656).
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8366841#step/scc_registration/39 (Timeout exceeded when accessing 'https://scc.suse.com/access/services/.....)
https://openqa.suse.de/tests/8366970#step/scc_registration/31 (Timeout exceeded when accessing 'https://scc.suse.com/access/services/.....)
https://openqa.suse.de/tests/8367831#step/scc_registration/32 (Timeout exceeded when accessing 'https://scc.suse.com/access/services/.....)
Updated by jlausuch over 2 years ago
jstehlik wrote:
I propose to use Nagios to monitor the availability of the mentioned services (if it is not used already and I just don't know about it) for a week or two to collect good statistics. This topic also seems related to https://progress.opensuse.org/issues/103656
Yeah, why not; continuous monitoring could give us a hint about those random hiccups.
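Even before a proper Nagios or telegraf setup, a simple cron-driven probe on a worker would already give us data points to correlate with failed jobs. A quick sketch (endpoints, interval and output format are arbitrary):

#!/usr/bin/perl
# Quick-and-dirty connectivity probe (sketch only, not the proposed Nagios check).
# Run it from cron on an openQA worker and collect the output for later graphing.
use strict;
use warnings;
use POSIX qw(strftime);

my @endpoints = qw(scc.suse.com updates.suse.com download.opensuse.org);
my $ts = strftime('%Y-%m-%dT%H:%M:%S', localtime);

for my $host (@endpoints) {
    # Measure DNS lookup time, total time and HTTP status via curl.
    my $out = `curl -sS -o /dev/null --max-time 30 -w '%{time_namelookup} %{time_total} %{http_code}' https://$host/ 2>&1`;
    my $rc  = $? >> 8;
    print "$ts $host rc=$rc $out\n";
}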
Updated by dzedro over 2 years ago
- Assignee deleted (jstehlik)
I have prepared a PR to work around the SCC error pop-ups, e.g. https://openqa.suse.de/tests/8368303#step/scc_registration/11
The question is whether it's safe to just press OK and continue, so I created https://bugzilla.suse.com/show_bug.cgi?id=1197380
We need statistics for what? To confirm there is a network/SCC issue?
Updated by okurz over 2 years ago
- Assignee set to jstehlik
@jpupava you might have removed jstehlik as assignee by mistake. I added him back as we just discussed this ticket together also in light of #102945
To get some statistics you can look into the data that is available within openQA.
We also have https://maintenance-statistics.dyn.cloud.suse.de/question/319, which parses the openQA database.
The graph shows that by far "scc_registration" is now the most failing test module. That was certainly different some months ago.
Updated by dzedro over 2 years ago
Sorry, the assignee was removed because I added a comment from a stale session.
Great, we have statistics proving a serious network/SCC issue; time to investigate and fix it.
Updated by jstehlik over 2 years ago
- File expert.jpg added
Since SCC is not the only service randomly failing with timeouts, I would like to compare the results of SCC monitoring with the availability seen from the openQA side. That would clearly show whether the problem is in the connection from the openQA network to the outside. Might it be correlated with the recent move of servers in Nuremberg from one floor to another? Maybe we will find a long CAT4 cable with many loops around a switching power source, or some other computer black magic. In the attachment you can see an expert who could help :)
Updated by jlausuch over 2 years ago
https://openqa.suse.de/tests/8371303#step/suseconnect_scc/16 (temporary failure in name resolution) -> maybe this is an issue on our end?
https://openqa.suse.de/tests/8371270#step/curl_ipv6/4 (could not resolve host: www3.zq1.de)
Updated by okurz over 2 years ago
jstehlik wrote:
Since SCC is not the only service randomly failing with timeouts, I would like to compare the results of SCC monitoring with the availability seen from the openQA side. That would clearly show whether the problem is in the connection from the openQA network to the outside.
Yes, we will look into a bit more monitoring on that side. We should be aware that "from the openQA side" needs to be broken down by which specific machine we try to reach scc.suse.com and any CDN component (even outside SUSE) from. I am thinking of connectivity monitoring from each openQA worker machine to scc.suse.com.
Might it be correlated with the recent move of servers in Nuremberg from one floor to another? […]
This seems rather unlikely. For example, https://openqa.suse.de/tests/8368303#step/scc_registration/11 ran on openqaworker13. openqaworker13 and most production openQA workers are located in SUSE Nbg SRV1. The move of QA machines from the QA labs was to SUSE Nbg SRV2, which should not significantly affect network capabilities in SRV1.
We discussed this ticket in the weekly SUSE QE sync meeting. okurz suggested in particular to use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger; see #108773 as an example of that.
EDIT:
- https://openqa.suse.de/tests/8379417#step/journal_check/4 is a test on "svirt-xen-hvm" which is running on openqaw5-xen.qa.suse.de, so the QA network is definitely involved. If DHCP fails, that might be on openqaw5-xen.qa.suse.de (one could take a look into the logs there), but for a DNS resolution problem when trying to resolve scc.suse.com, the DNS server on qanet.qa.suse.de is certainly involved. So we could take a look into the DNS server logs on qanet.qa.suse.de (named service).
- https://openqa.suse.de/tests/8380623#step/podman/180 is a failure in podman within an internal Tumbleweed container failing to get valid data from download.opensuse.org from within a qemu VM running on openqaworker8, so neither scc.suse.com nor the SUSE QA net is involved.
We could separate into three different areas and follow up with improvements and better investigation for each of these:
- Check DNS resolution and DHCP stability within QA net, e.g. more monitoring, mtr, etc. -> #108845
- Check accessibility to SCC: that should be covered by #102945, but maybe we can progress faster than waiting for the "management level" follow-up by implementing some monitoring on our side, e.g. telegraf ping checks against components like download.opensuse.org, scc.suse.com and proxy-scc
Updated by jlausuch over 2 years ago
- File scc_timeout.png added
Updated by lpalovsky over 2 years ago
HA multi-machine (MM) jobs have been affected by this since the last build 113.1 as well.
What I am seeing are SCC connection timeouts:
https://openqa.suse.de/tests/8373145#step/welcome/10
But also hostname resolution problems, not only related to SCC but also affecting the connection between the iSCSI client and server:
https://openqa.suse.de/tests/8373191#step/iscsi_client/13
https://openqa.suse.de/tests/8368368#step/patch_sle/163
https://openqa.suse.de/tests/8363004#step/ha_cluster_init/18
https://openqa.suse.de/tests/8373197#step/iscsi_client/13
Updated by jlausuch over 2 years ago
And a non-SCC-related one:
https://openqa.suse.de/tests/8380471#step/rootless_podman/103
Timeout exceeded when accessing 'http://download.opensuse.org/tumbleweed/repo/non-oss/content'
and also https://openqa.suse.de/tests/8380623#step/podman/180
Updated by jlausuch over 2 years ago
This is definitely something related to our infra network: https://openqa.suse.de/tests/8379417#step/journal_check/4
localhost wicked[1059]: eth0: DHCP4 discovery failed
and the next module fails at adding a repo, even after a few retries: https://openqa.suse.de/tests/8379417#step/pam/51
and also the curl command to upload the logs to the worker: https://openqa.suse.de/tests/8379417#step/pam/54
And DNS resolution issue:
https://openqa.suse.de/tests/8379419#step/suseconnect_scc/16 (getaddrinfo: Temporary failure in name resolution)
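When this happens it would help to capture the resolver state right away, so the failure can be attributed either to the VM-internal resolver (10.0.2.3 with qemu user networking) or to whatever is behind it. A sketch of the diagnostics worth uploading (the command list is an assumption, not an existing helper):

# Sketch only: gather DNS/network state from inside the SUT at failure time.
sub log_dns_state_sketch {
    script_run('cat /etc/resolv.conf');
    script_run('host scc.suse.com || nslookup scc.suse.com');      # whichever tool is installed
    script_run('ip -br addr; ip route');
    script_run('journalctl --no-pager -t wicked | tail -n 50');    # the "DHCP4 discovery failed" lines
    save_screenshot;
}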
Updated by jlausuch over 2 years ago
Today is just a mess:
https://openqa.suse.de/tests/8379452#step/suseconnect_scc/16
https://openqa.suse.de/tests/8379449#step/suseconnect_scc/16
https://openqa.suse.de/tests/8381692#step/docker_runc/116
https://openqa.suse.de/tests/8382172#step/docker/89
https://openqa.suse.de/tests/8382012#step/buildah/78
https://openqa.suse.de/tests/8381902#step/zypper_docker/96
https://openqa.suse.de/tests/8379451#step/patch_and_reboot/148
https://openqa.suse.de/tests/8381868#step/zypper_docker/96
and this is just a portion of all the failures that I've seen today...
Updated by lpalovsky over 2 years ago
The latest build 116.4 is again very bad for HA. The issues are mostly related to either lost connections or DNS resolution failures. This happens between cluster nodes, on the SSH connection to the SUT, or towards external sites like openqa.suse.de or SCC.
Various SSH disconnections:
https://openqa.suse.de/tests/8375657#step/patch_sle/150
https://openqa.suse.de/tests/8374846#step/boot_to_desktop/16
Node being unreachable/network resolution issue:
https://openqa.suse.de/tests/8376302#step/ha_cluster_init/15
https://openqa.suse.de/tests/8376293#step/iscsi_client/11
https://openqa.suse.de/tests/8376330#step/iscsi_client/13
The examples above are only a portion of the failed tests.
Updated by okurz over 2 years ago
So in the above examples I can see
- qemu jobs on machines within SRV1 having problems accessing dist.suse.de, e.g. https://openqa.suse.de/tests/8381692#step/docker_runc/116 (aarch64) and https://openqa.suse.de/tests/8382172#step/docker/91 (x86_64) -> please create a specific subticket about that and investigate the network performance between the physical machines and dist.suse.de, or create an EngInfra ticket
- XEN+svirt jobs on openqaw5-xen.qa.suse.de, e.g. https://openqa.suse.de/tests/8379451#step/patch_and_reboot/148 , that is covered in #108845 , feel welcome to add an according auto-review regex for the error messages. We are looking into the problem
- s390x jobs like https://openqa.suse.de/tests/8375657#step/patch_sle/150 losing connections just in the middle -> please create a specific subticket about that and investigate the network performance between the physical machines and dist.suse.de, or create an EngInfra ticket. Could be related to #99345, although the mentioned job failed rather than becoming incomplete with "Connection timed out"
- ppc qemu jobs like https://openqa.suse.de/tests/8376302#step/ha_cluster_init/15 -> that again looks to be something we could treat as separate. Please report a ticket as such.
Updated by okurz over 2 years ago
- Related to action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min added
Updated by jlausuch over 2 years ago
More name resolution ones:
https://openqa.suse.de/tests/8387038#step/suseconnect_scc/16 - openqaworker2:9
https://openqa.suse.de/tests/8387053#step/suseconnect_scc/16 - openqaworker2:9
https://openqa.suse.de/tests/8387089#step/suseconnect_scc/16 - openqaworker2:10
https://openqa.suse.de/tests/8387088#step/suseconnect_scc/16 - openqaworker2:16
And DHCP discovery failures:
https://openqa.suse.de/tests/8387055#step/journal_check/4 - openqaworker2:16
https://openqa.suse.de/tests/8387085#step/journal_check/4 - openqaworker2:16
Updated by lpalovsky over 2 years ago
okurz wrote:
- ppc qemu jobs like https://openqa.suse.de/tests/8376302#step/ha_cluster_init/15 -> that again looks to be something we could treat as separate. Please report a ticket as such.
Thanks, I have created a separate ticket for that issue: #108962
Updated by okurz over 2 years ago
A change in network cabling was made around 2022-03-28 08:15Z. We now observe a stable connection between the QA switches and the previously affected machines. We will continue to monitor the situation; see #108845#note-25 for details. Please report any related issues still occurring for openQA tests that were started after the above time.
Updated by jlausuch over 2 years ago
I have noticed fewer network issues lately, but still some DNS resolution failures here and there (not that many):
- https://openqa.suse.de/tests/8426618#step/cifs/63
could not resolve address for currywurst.qam.suse.de
- https://openqa.suse.de/tests/8426620#step/podman/96
could not resolve host google.de
Updated by jstehlik over 2 years ago
I consider this issue resolved by the cabling fix. The topic of infrastructure reliability shall be followed up in https://progress.opensuse.org/issues/109250
Updated by okurz over 2 years ago
jstehlik wrote:
I consider this issue resolved by the cabling fix. The topic of infrastructure reliability shall be followed up in https://progress.opensuse.org/issues/109250
I think we shouldn't declare this ticket "resolved" just because one problem was fixed. The problem might have been small, but the impact was huge. I suggest we discuss further improvement ideas on multiple levels so that any new or next upcoming problem has a less severe impact. Also, multiple subtasks are still open. If you would like to await their results before reviewing this ticket, I suggest using the "Blocked" status.
Updated by dzedro over 2 years ago
Now there are serious network problems on s390x; this kind of issue happened before as well, but now it's very bad, I guess since yesterday afternoon.
If I look at the live log or serial output of a running job, at some point it freezes: no boot progress updates or worker debug output anymore.
Then it just shows up as failed; see the attachment.
Three variations of test unable to boot.
https://openqa.suse.de/tests/8538657#step/bootloader_zkvm/18
https://openqa.suse.de/tests/8538656#step/boot_to_desktop/16
https://openqa.suse.de/tests/8538654#step/bootloader_start/19
Updated by dzedro over 2 years ago
Also today there are many failed s390x tests at boot, e.g.
https://openqa.suse.de/tests/8550263#step/bootloader_start/19
https://openqa.suse.de/tests/8550253#step/bootloader_zkvm/18
Updated by dzedro over 2 years ago
Looks like the s390x failures were not related to the network but to disk space. https://suse.slack.com/archives/C02CANHLANP/p1649923742408719
Updated by okurz over 2 years ago
We discussed in the weekly QE Sync meeting 2022-05-04 that some problems have been fixed (operational work). We should raise to LSG mgmt, e.g. in the "Open Doors" meeting, how we suffer from the impact of bad infrastructure, and push for improvements, e.g. with the planned datacenter move. szarate mentions that we are already occupying multiple engineers in their daily work with investigating network-related problems, e.g. effectively losing a full day due to a non-executed milestone build validation.
Updated by okurz over 2 years ago
Asked Jose Lausuch and Jozef Pupava in https://suse.slack.com/archives/C02CANHLANP/p1652176711212269
as the main commenters on https://progress.opensuse.org/issues/107062 , in your opinions, what do you see as necessary to resolve this ticket?
EDIT: I got confirmation from jpupava and jlausuch that both see the original issue(s) resolved. We agree that there is still a lot of room for improvement, e.g. base-level infrastructure monitoring but that has been stated likely sufficiently. I think it makes sense to track the work in the still open subtasks before resolving the ticket but after all subtasks are resolved then nothing more would be needed. Added acceptance criteria as such.
Updated by szarate over 2 years ago
On the topic of how much in terms of hours this took:
While we know what the root cause of the problem was (the network cable), there's a second level to the root cause, which is the lack of manpower and monitoring on the infrastructure side (I guess here it's SUSE IT and QE).
Now in terms of costs:
So it's around 528 hours (66 persons times roughly one eight-hour day), only counting 3 engineers from QE tools and 3 from openQA maintenance review, and accounting for ~59 QE engineers and 1 RM, 66 persons in total, assuming that:
- Some of them had to do the review twice (the day of the failure and the day after).
- Some of them were sitting idle during manual validation, due to some resources not being available.
- Some of them could not verify or run verification runs that depended on resources in OSD or qanet due to the network being degraded/down.
This doesn't account for the hours of automated testing, which can be calculated and directly converted to € if we take the power consumption of the servers into account, nor for the work that had to be done as fallout (your time, the meeting last Wednesday, plus any other meetings happening).
NOTE: My numbers might be off; if somebody wants to cross-check, be my guest :). Also, having data on how many hours of openQA testing were lost would be a good idea, for this and future incidents.
Updated by okurz over 2 years ago
@szarate your estimation looks good to me.
@jstehlik from today's discussion it sounds like some people still see some issues; you mentioned "zypper problems". I think the current subtasks of this ticket do not cover this. We need to ensure that there are tickets which handle each issue explicitly, so that no one is waiting for a miracle solution to their problems.
Updated by jstehlik over 2 years ago
Further evaluation is needed when the new test rack is set up in Prague.
Also, this issue seems related: "QA network infrastructure: Ping time alert for several hosts"
https://progress.opensuse.org/issues/113498
Updated by szarate over 2 years ago
- Related to action #113528: [qe-core] test fails in bootloader_zkvm - performance degradation in the s390 network is causing serial console to be unreliable (and killing jobs slowly) added
Updated by szarate over 2 years ago
- Related to action #113716: [qe-core] proxy-scc is down added