action #162101

closed

[s390x] timeouts on s390x openQA Workers size:M

Added by AdaLovelace 18 days ago. Updated 1 day ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 2024-06-11
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

All tests are failing on s390x because of timeouts. I restarted a job to narrow down the issue and had to wait more than 6 minutes for the openQA job to start.
What is going wrong here? It looks like there may be a network problem between the mainframe and openQA.

openQA test in scenario opensuse-Tumbleweed-DVD-s390x-autoyast_zvm@s390x-zVM-vswitch-l2 fails in
bootloader_start
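
A quick way to narrow this down is to measure reachability and latency from the openQA worker host towards the z/VM guest directly. This is only a sketch; the hostname is taken from the job settings quoted later in this ticket and may not be the guest actually affected:

# From the openQA worker host: check basic reachability and latency to the z/VM guest
ping -c 5 o3zvm003.openqanet.opensuse.org
# Trace the route to see where packets get delayed or dropped
traceroute o3zvm003.openqanet.opensuse.org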

Test suite description

Create HDD for s390x textmode

Reproducible

  • Fails since (at least) Build 20240606
  • Clearly reproducible

Expected result

Last good: 20240531 (or more recent)
The openQA tests also work on s390x.

Suggestions

Further details

Always latest result in this scenario: latest


Related issues 2 (1 open, 1 closed)

Related to openQA Tests - action #153057: [tools] test fails in bootloader_start because openQA can not boot for s390x size:M (Resolved, mkittler, 2024-01-03)

Related to openQA Tests - action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command (Blocked, okurz, 2024-06-13)

Actions #1

Updated by okurz 18 days ago

  • Tags set to infra, reactive work
  • Target version set to Ready
Actions #2

Updated by okurz 17 days ago

  • Subject changed from [s390x] timeouts on s390x openQA Workers to [s390x] timeouts on s390x openQA Workers size:M
  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #3

Updated by okurz 17 days ago

Looking at https://openqa.opensuse.org/tests/4255937#investigation, comparing the "first bad" with the "last good", I see nothing obvious in the diff of settings. The most likely problematic settings, if any at all:

-   "BUILD" : "20240531",
+   "BUILD" : "20240606",
-   "S390_NETWORK_PARAMS" : "OSAMedium=eth OSAInterface=qdio InstNetDev=osa Gateway=10.150.1.254 Nameserver=10.150.1.11 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr= HostIP=10.150.1.153/24 Hostname=o3zvm004",
+   "S390_NETWORK_PARAMS" : "OSAMedium=eth OSAInterface=qdio InstNetDev=osa Gateway=10.150.1.254 Nameserver=10.150.1.11 Domain=openqanet.opensuse.org PortNo= Layer2=0 ReadChannel=0.0.0A00 WriteChannel=0.0.0A01 DataChannel=0.0.0A02 OSAHWAddr= HostIP=10.150.1.152/24 Hostname=o3zvm003",
-   "WORKER_ID" : 1147,
-   "WORKER_INSTANCE" : 104,
+   "WORKER_ID" : 1146,
+   "WORKER_INSTANCE" : 103,
-   "ZVM_GUEST" : "o3zvm004.openqanet.opensuse.org",
+   "ZVM_GUEST" : "o3zvm003.openqanet.opensuse.org",

https://openqa.opensuse.org/tests/4255937#comment-601601 shows that the issue seems reproducible. https://openqa.opensuse.org/tests/4255948 ("last_good_build") fails as well, showing that this is unlikely to be a product regression. I wonder why no "last good tests" job was triggered, as there are test distribution changes listed:

test_log    

+ 1e20aa3e9 Fix cloud-init schedule for sl-micro
+ bbad8c698 Reduce code duplication in peering list
+ 7307fda93 containers: Increate subuid range
+ 2649ce516 Quick fix for SUSEConnect -d
+ b189afa0a Change the command in test_pids_max function into bash script
+ 904522158 OpenStack:Remove boot time measurements
+ 936a9d005 Add nat gateway setup
+ bbd75b551 Create git clone function
+ 065888283 README: Clarify with a link to test distribution definition
+ 1af1954d9 Enable --replacefiles via variable
+ 05f78fd8f Revert "Remove release package from SCC_ADDONS_DROP"
+ f93785946 Use snapper rollback to get to state before patch
+ 67048eb56 Refresh repos if zypper gets return 106
+ 4f3ea9693 Move saptune_on_pvm yaml schedule file to correct place
+ 77fad2896 Fix Elemental boot on aarch64
+ 25cf13db7 Make cleanup non-fatal (#19472)
+ c2590ef53 Use podman instead of docker in hawk_gui test module
+ 37b875599 Revert "Run ansible remediation in background with a countdown timer"
+ 7d9d2e88a Wsl-systemd check
+ d25ae9946 Add soft-failures in jornal for SL Micro 6.1
+ 713d798f4 Revert "Make cleanup non-fatal"
+ 64dae2922 containers/seccomp.pm: Harden test a bit
+ df75d86d5 Reapply "containers: Use busybox instead of tumbleweed image"
+ 13ae5148f Add support for Elemental based on SLE Micro 5.5
+ cf07e5e36 Follow upstream changes in package name
+ f8e746eb0 Add "--gpg-auto-import-keys" option to install openSUSE-repos-NVIDIA
+ 6f2d81b2f containers: Avoid registry mirror in 3rd party module
+ ec254c4e3 Skip version check in Tumbleweed when BUILD setting is non-numeric
+ 92c38b656 Make cleanup non-fatal
+ d6d7b2749 Fix the sporadic issue in fltk test
+ 3f0acad48 P.C. check Larry server up retries
+ 6773f9924 Adjust the saptune test on power9
+ eb88ce39e Run ansible remediation in background
+ 1a990b8fe Add 'EXTRA_CUSTOMER_REPOS' for qam dracut install tests
+ 3028dca30 console/ping: Add more tests
+ 10f13dbef console: Add simple arping tests
+ 04fe4f480 Fix Elemental test
+ 647a73f9a Add hanasr setup module
+ a94984844 Pass registration code in for guest instead of using plain ones
+ cba83a507 Ay profile for skip_registration_functional
+ 97373a762 Remove release package from SCC_ADDONS_DROP
+ 7c8abcc8a Update schedule/security/selinux_jeos.yaml
+ a7c8ebad9 Enable SELinux on JeOS Tumbleweed
+ ba39ad73c Revert "containers: Use busybox instead of tumbleweed image"
+ e7a4cabfa Define new name for the project *osado*
+ c8139c865 containers: Use httpd from registry mirror to fix netavark
+ c4a321062 Upload audit.log in post_fail_hook
+ 71506605a aaa_base: drop get_kernel_version as it is not used anymore
+ 465a0c813 Switch 15sp6 guests to GM

None of those sound related.

Actions #4

Updated by okurz 17 days ago

(Jozef Pupava) To me it looks like the installation didn't reach the point where VNC is ready; maybe try to increase the timeout? Not sure why it is so slow.
It works on OSD: https://openqa.suse.de/tests/14578010#step/bootloader_start/32
(Joaquin Rivera) I was looking at that code the other day for other stuff and it is quite difficult to debug. I've just tried that job with 400 seconds or double the RAM and got the same result. It could be that the AutoYaST profile is for some reason no longer valid, because the textmode test suite boots. 400 seconds and 2 GB
(Jozef Pupava) I would make sure the AutoYaST profile is valid, since the textmode test suite above is fine
(Oliver Kurz) but there is serial output from the machine which doesn't say anything about an invalid profile. Let me crosscheck an older job from before the recent test code changes then

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4243631 _GROUP=0 {TEST,BUILD}+=-okurz-poo162101

1 job has been created:
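
As a note on the clone command above: the shell brace expansion {TEST,BUILD}+=-okurz-poo162101 appends the suffix to both the TEST and BUILD settings so the clone is easy to find, and _GROUP=0 detaches the clone from any job group so it does not affect build statistics. Expanded, the command is roughly equivalent to:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4243631 \
    _GROUP=0 TEST+=-okurz-poo162101 BUILD+=-okurz-poo162101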

Actions #5

Updated by okurz 17 days ago

  • Related to action #153057: [tools] test fails in bootloader_start because openQA can not boot for s390x size:M added
Actions #6

Updated by okurz 17 days ago

The problem looks very much the same as described in #153057-18. Asking mgriessmeier whether this is maybe something to be solved by resetting the network stack on the s390x z/VM hypervisor, or what was done last time. We should have specific instructions on how to do that ourselves in the future.

Actions #7

Updated by okurz 16 days ago

  • Due date set to 2024-06-27
  • Status changed from In Progress to Feedback

Waiting for mgriessmeier or other domain specialists to respond to my question, which I also posted in https://suse.slack.com/archives/C02CANHLANP/p1718268035867699

(Oliver Kurz) who can help me to understand the s390x boot problem in https://openqa.opensuse.org/tests/4263651#step/bootloader_start/30 related to https://progress.opensuse.org/issues/162101 ?

Actions #8

Updated by okurz 15 days ago

  • Related to action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command added
Actions #9

Updated by okurz 15 days ago

This might be related to #162239, in the sense that the hypervisor host being very slow to react due to network problems could also influence this. I retriggered a job, which ended with https://openqa.opensuse.org/tests/4273219#step/bootloader_start/30, so the same result as before; back to waiting for a response in the above-mentioned thread.

Actions #10

Updated by okurz 9 days ago · Edited

(Matthias Griessmeier) hmmm... the good news is... it's working again: https://openqa.opensuse.org/tests/4286182# . The bad news is... I don't know what exactly fixed it. I suspect restarting the container that's running the worker though
(Oliver Kurz) that's actually a good and simple enough idea which we hadn't retried before. So did you restart the container?
(Matthias Griessmeier) Yes, it was last started 3 weeks ago… I restarted it and tried manually with curl, and got an answer immediately
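
For future reference, a minimal sketch of the check and remediation described above, assuming the s390x worker runs in a podman container; the container name and the URL are placeholders, not taken from this ticket:

# Check whether the endpoint answers and how long it takes (URL is a placeholder)
curl -sS --connect-timeout 30 -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' http://<zvm-or-worker-endpoint>/
# Restart the container running the openQA worker (container name is a placeholder)
podman restart <openqa-worker-container>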

Actions #11

Updated by livdywan 3 days ago

Are you planning to do more here? Asking as this ticket is due tomorrow.

Actions #12

Updated by AdaLovelace 2 days ago

Thank you for your openQA support! The timeouts seem to be fixed.
It is a little bit sad that the SUSE admins don't do any analysis of this problem; then I have to ask again every time...

Actions #13

Updated by okurz 1 day ago

  • Due date deleted (2024-06-27)
  • Status changed from Feedback to Resolved

It appears that the relevant part was the fix mentioned in #162101-10.
