action #162101

closed

[s390x] timeouts on s390x openQA Workers size:M

Added by AdaLovelace 18 days ago. Updated 1 day ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 2024-06-11
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

All tests are failing on s390x because of timeouts. I restarted a job to narrow down the issue and had to wait more than 6 minutes for the openQA job to start.
What is going wrong here? It looks like there may be a network problem between the mainframe and openQA.

openQA test in scenario opensuse-Tumbleweed-DVD-s390x-autoyast_zvm@s390x-zVM-vswitch-l2 fails in
bootloader_start
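
A quick way to narrow this down is to measure reachability and latency from the openQA worker host towards the z/VM guest directly. This is only a sketch; the hostname is taken from the job settings quoted later in this ticket and may not be the guest actually affected:

# From the openQA worker host: check basic reachability and latency to the z/VM guest
ping -c 5 o3zvm003.openqanet.opensuse.org
# Trace the route to see where packets get delayed or dropped
traceroute o3zvm003.openqanet.opensuse.org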

Test suite description

Create HDD for s390x textmode

Reproducible

  • Fails since (at least) Build 20240606
  • Clearly reproducible

Expected result

Last good: 20240531 (or more recent)
The openQA tests also work on s390x.

Suggestions

Further details

Always latest result in this scenario: latest


Related issues 2 (1 open, 1 closed)

Related to openQA Tests - action #153057: [tools] test fails in bootloader_start because openQA can not boot for s390x size:M (Resolved, mkittler, 2024-01-03)

Related to openQA Tests - action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command (Blocked, okurz, 2024-06-13)

Actions #1

Updated by okurz 18 days ago

  • Tags set to infra, reactive work
  • Target version set to Ready
Actions #2

Updated by okurz 17 days ago

  • Subject changed from [s390x] timeouts on s390x openQA Workers to [s390x] timeouts on s390x openQA Workers size:M
  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #3

Updated by okurz 17 days ago

Looking at https://openqa.opensuse.org/tests/4255937#investigation, comparing the "first bad" with the "last good", I see nothing obvious in the diff of settings. The most likely problematic settings, if any at all:

-   "BUILD" : "20240531",
+   "BUILD" : "20240606",
-   "S390_NETWORK_PARAMS" : "OSAMedium=eth OSAInterface=qdio InstNetDev=osa Gateway=10.150.1.254 Nameserver=10.150.1.11 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr= HostIP=10.150.1.153/24 Hostname=o3zvm004",
+   "S390_NETWORK_PARAMS" : "OSAMedium=eth OSAInterface=qdio InstNetDev=osa Gateway=10.150.1.254 Nameserver=10.150.1.11 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr= HostIP=10.150.1.152/24 Hostname=o3zvm003",
-   "WORKER_ID" : 1147,
-   "WORKER_INSTANCE" : 104,
+   "WORKER_ID" : 1146,
+   "WORKER_INSTANCE" : 103,
-   "ZVM_GUEST" : "o3zvm004.openqanet.opensuse.org",
+   "ZVM_GUEST" : "o3zvm003.openqanet.opensuse.org",

https://openqa.opensuse.org/tests/4255937#comment-601601 shows that the issue seems reproducible. https://openqa.opensuse.org/tests/4255948 ("last_good_build") fails as well, showing that this is unlikely to be a product regression. I wonder why no "last good tests" job was triggered, as there are test distribution changes listed:

test_log    

+ 1e20aa3e9 Fix cloud-init schedule for sl-micro
+ bbad8c698 Reduce code duplication in peering list
+ 7307fda93 containers: Increate subuid range
+ 2649ce516 Quick fix for SUSEConnect -d
+ b189afa0a Change the command in test_pids_max function into bash script
+ 904522158 OpenStack:Remove boot time measurements
+ 936a9d005 Add nat gateway setup
+ bbd75b551 Create git clone function
+ 065888283 README: Clarify with a link to test distribution definition
+ 1af1954d9 Enable --replacefiles via variable
+ 05f78fd8f Revert "Remove release package from SCC_ADDONS_DROP"
+ f93785946 Use snapper rollback to get to state before patch
+ 67048eb56 Refresh repos if zypper gets return 106
+ 4f3ea9693 Move saptune_on_pvm yaml schedule file to correct place
+ 77fad2896 Fix Elemental boot on aarch64
+ 25cf13db7 Make cleanup non-fatal (#19472)
+ c2590ef53 Use podman instead of docker in hawk_gui test module
+ 37b875599 Revert "Run ansible remediation in background with a countdown timer"
+ 7d9d2e88a Wsl-systemd check
+ d25ae9946 Add soft-failures in jornal for SL Micro 6.1
+ 713d798f4 Revert "Make cleanup non-fatal"
+ 64dae2922 containers/seccomp.pm: Harden test a bit
+ df75d86d5 Reapply "containers: Use busybox instead of tumbleweed image"
+ 13ae5148f Add support for Elemental based on SLE Micro 5.5
+ cf07e5e36 Follow upstream changes in package name
+ f8e746eb0 Add "--gpg-auto-import-keys" option to install openSUSE-repos-NVIDIA
+ 6f2d81b2f containers: Avoid registry mirror in 3rd party module
+ ec254c4e3 Skip version check in Tumbleweed when BUILD setting is non-numeric
+ 92c38b656 Make cleanup non-fatal
+ d6d7b2749 Fix the sporadic issue in fltk test
+ 3f0acad48 P.C. check Larry server up retries
+ 6773f9924 Adjust the saptune test on power9
+ eb88ce39e Run ansible remediation in background
+ 1a990b8fe Add 'EXTRA_CUSTOMER_REPOS' for qam dracut install tests
+ 3028dca30 console/ping: Add more tests
+ 10f13dbef console: Add simple arping tests
+ 04fe4f480 Fix Elemental test
+ 647a73f9a Add hanasr setup module
+ a94984844 Pass registration code in for guest instead of using plain ones
+ cba83a507 Ay profile for skip_registration_functional
+ 97373a762 Remove release package from SCC_ADDONS_DROP
+ 7c8abcc8a Update schedule/security/selinux_jeos.yaml
+ a7c8ebad9 Enable SELinux on JeOS Tumbleweed
+ ba39ad73c Revert "containers: Use busybox instead of tumbleweed image"
+ e7a4cabfa Define new name for the project *osado*
+ c8139c865 containers: Use httpd from registry mirror to fix netavark
+ c4a321062 Upload audit.log in post_fail_hook
+ 71506605a aaa_base: drop get_kernel_version as it is not used anymore
+ 465a0c813 Switch 15sp6 guests to GM

None of those sound related.

Actions #4

Updated by okurz 17 days ago

(Jozef Pupava) To me it looks like the installation didn't reach the point where VNC is ready; maybe try to increase the timeout? Not sure why it is so slow.
It works on OSD: https://openqa.suse.de/tests/14578010#step/bootloader_start/32
(Joaquin Rivera) I was looking at that code the other day for other stuff and it is quite difficult to debug. I've just tried that job with 400 seconds or double the RAM and got the same result. It could be that the AutoYaST profile is for some reason no longer valid, because the textmode test suite boots. 400 seconds and 2 GB
(Jozef Pupava) I would make sure the AutoYaST profile is valid, since the textmode test suite above is fine
(Oliver Kurz) but there is serial output from the machine which doesn't say anything about an invalid profile. Let me crosscheck an older job from before the recent test code changes then

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4243631 _GROUP=0 {TEST,BUILD}+=-okurz-poo162101

1 job has been created:
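
As a note on the clone command above: the shell brace expansion {TEST,BUILD}+=-okurz-poo162101 appends the suffix to both the TEST and BUILD settings so the clone is easy to find, and _GROUP=0 detaches the clone from any job group so it does not affect build statistics. Expanded, the command is roughly equivalent to:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4243631 \
    _GROUP=0 TEST+=-okurz-poo162101 BUILD+=-okurz-poo162101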

Actions #5

Updated by okurz 17 days ago

  • Related to action #153057: [tools] test fails in bootloader_start because openQA can not boot for s390x size:M added
Actions #6

Updated by okurz 17 days ago

The problem looks very much the same as described in #153057-18. Asking mgriessmeier whether this is maybe something to be solved by resetting the network stack on the s390x z/VM hypervisor, or what was done last time. We should have specific instructions on how to do that ourselves in the future.

Actions #7

Updated by okurz 16 days ago

  • Due date set to 2024-06-27
  • Status changed from In Progress to Feedback

Waiting for mgriessmeier or other domain specialists to respond to my question, which I also posted in https://suse.slack.com/archives/C02CANHLANP/p1718268035867699

(Oliver Kurz) who can help me to understand the s390x boot problem in https://openqa.opensuse.org/tests/4263651#step/bootloader_start/30 related to https://progress.opensuse.org/issues/162101 ?

Actions #8

Updated by okurz 15 days ago

  • Related to action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command added
Actions #9

Updated by okurz 15 days ago

This might be related to #162239, in the sense that the hypervisor host being very slow to react due to network problems could also influence this. I retriggered a job, which ended with https://openqa.opensuse.org/tests/4273219#step/bootloader_start/30, so the same result as before; back to waiting for a response in the above-mentioned thread.

Actions #10

Updated by okurz 9 days ago · Edited

(Matthias Griessmeier) hmmm... the good news is... it's working again: https://openqa.opensuse.org/tests/4286182# . The bad news is... I don't know what exactly fixed it. I suspect restarting the container that's running the worker though
(Oliver Kurz) that's actually a good and simple enough idea which we hadn't retried before. So did you restart the container?
(Matthias Griessmeier) Yes, it was last started 3 weeks ago… I restarted it and tried manually with curl, and got an answer immediately
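
For future reference, a minimal sketch of the check and remediation described above, assuming the s390x worker runs in a podman container; the container name and the URL are placeholders, not taken from this ticket:

# Check whether the endpoint answers and how long it takes (URL is a placeholder)
curl -sS --connect-timeout 30 -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' http://<zvm-or-worker-endpoint>/
# Restart the container running the openQA worker (container name is a placeholder)
podman restart <openqa-worker-container>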

Actions #11

Updated by livdywan 3 days ago

Are you planning to do more here? Asking as this ticket is due tomorrow.

Actions #12

Updated by AdaLovelace 2 days ago

Thank you for your openQA support! The timeouts seem to be fixed.
It is a little bit sad that the SUSE admins don't do any analysis of this problem; then I have to ask again every time...

Actions #13

Updated by okurz 1 day ago

  • Due date deleted (2024-06-27)
  • Status changed from Feedback to Resolved

It appears that the relevant part was the fix mentioned in #162101-10.
