[sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"
Target version: Milestone 14
openQA test in scenario sle-15-Installer-DVD-s390x-allpatterns@s390x-kvm-sle12 fails in
- all tests failing are s390x-kvm-sle12 (s390p8 LPAR), though not all jobs on this LPAR are failing
- all those jobs are running on openqaw2
- there were no recent changes in os-autoinst or the tests regarding this (dasantiago confirmed in irc)
- Gather statistics about how often this happens
- Check if we can handle the root cause of these "debug-messages"
one of those should be done:
- AC1: Turn the incomplete into a fail with a proper message, understandable by everyone
- AC2: Find the root-cause of this and come up with a fix
- see Michal's suggestion: https://github.com/os-autoinst/os-autoinst/pull/906/files#diff-333bbcc7c9ce8c440b7c87218c426f42R15
Fails since (at least) Build 408.1
Last good: (unknown) (or more recent)
Always latest result in this scenario: latest
#2 Updated by michalnowak about 2 years ago
@mgriessmeier: You can fine-tune the polling mechanism to suit that particular host: https://github.com/os-autoinst/os-autoinst/pull/906/files#diff-333bbcc7c9ce8c440b7c87218c426f42R15
#3 Updated by dasantiago about 2 years ago
Sometimes the channels die, that's why we implemented a quick failure mechanism, otherwise it would get stuck for 2 hours.
This requires an investigation at the machine/qemu level to determine why the channels fail.
As a workaround, please follow Michal's advice and try increasing the value, say to two minutes, so we can tell whether the channel is really dead or the machine is just slow.
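The trade-off described above (fail quickly instead of hanging for two hours, but wait long enough that a slow machine is not mistaken for a dead channel) can be illustrated with a small generic sketch. This is not the os-autoinst code, and the real check may work differently; `wait_for_console`, `ConsoleNotResponding`, `is_responding` and the default values are hypothetical names chosen only for this example.

```python
import time
from typing import Callable


class ConsoleNotResponding(RuntimeError):
    """Raised when the console stays silent longer than the allowed wait time."""


def wait_for_console(is_responding: Callable[[], bool],
                     wait_time: float = 120.0,
                     poll_interval: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll until the console responds; fail fast once wait_time has passed.

    Returns the number of seconds waited before the console responded.
    All parameters are illustrative assumptions, not os-autoinst settings.
    """
    start = clock()
    while True:
        if is_responding():
            return clock() - start
        if clock() - start >= wait_time:
            # Give up early with a readable message instead of hanging for hours.
            raise ConsoleNotResponding(
                f"console silent for {wait_time:.0f}s: dead channel or slow machine?")
        sleep(poll_interval)
```

With a larger `wait_time`, a slow but healthy machine gets the chance to answer, while a dead channel still ends the job after a bounded delay.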
#10 Updated by mgriessmeier almost 2 years ago
- Status changed from In Progress to Feedback
Latest example: https://openqa.suse.de/tests/1513907
You did not try out the workaround suggestion, did you?
Added _CHKSEL_RATE_WAIT_TIME=120 to the MACHINE s390x-kvm-sle12 now
setting to feedback to track it over the next week
please use this ticket if this issue occurs again
#11 Updated by mgriessmeier almost 2 years ago
still happening with 120s :(
but it seems to occur less (just a feeling)
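To replace the feeling with a number, the share of jobs hitting this error could be compared before and after the setting change. A minimal sketch, assuming the job results have already been exported as a list of dicts; the `result` and `reason` keys and the function name `half_open_rate` are hypothetical and not taken from the openQA API.

```python
from typing import Iterable, Mapping

# Text fragment of the error this ticket is about.
MESSAGE = "Maybe half-open socket?"


def half_open_rate(jobs: Iterable[Mapping[str, str]]) -> float:
    """Return the fraction of jobs that ended incomplete with the half-open message.

    Each job is assumed to be a mapping with hypothetical 'result' and
    'reason' keys; an empty input yields 0.0.
    """
    jobs = list(jobs)
    if not jobs:
        return 0.0
    hits = sum(
        1 for job in jobs
        if job.get("result") == "incomplete" and MESSAGE in job.get("reason", "")
    )
    return hits / len(jobs)
```

Calling it once on the jobs run before the change and once on those run after gives two rates that can be compared directly.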
#13 Updated by mgriessmeier almost 2 years ago
I see that there are 8 black/blank screens. Isn't this an indication that the channel is dead?
that's a 'not so nice' thing in the s390x implementation, we also see this on passing tests, e.g. https://openqa.suse.de/tests/1521080, for some reason
but the half-open-socket issue also appears in cases like this https://openqa.suse.de/tests/1521150# where we don't see any black screens
I wonder if it could help to increase the value of this variable even more?
#15 Updated by michalnowak almost 2 years ago
Looking at https://openqa.suse.de/tests/1521150 I noticed that we start serial console grab
power_action() when the VM is, I suppose, down and then in redefine_svirt_domain.pm we are starting it again via
$svirt->define_and_start. The latter place seems to be the right one to connect to serial console.
For Xen I moved the logic you have in redefine_svirt_domain.pm to utils.pm's
assert_shutdown_and_restore_system() called from
#17 Updated by mgriessmeier almost 2 years ago
- Subject changed from [sles][functional][tools][s390x][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?" to [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"
#21 Updated by xlai almost 2 years ago
The virtualization tests, which rely on the ipmi and ssh consoles, also fail with the message "The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241", as reported in progress ticket #32746.
I am not sure whether the root cause of the svirt console failure is the same as the one reported in ticket #32746, but that ticket has been marked as a duplicate. So when pushing a solution, please double-check that the failures on the ipmi virtualization tests no longer happen. Thanks!
#27 Updated by mgriessmeier almost 2 years ago
- Status changed from Workable to In Progress
with this one, I got 10 tests in a row working right now:
will conduct more runs to get more statistics
#28 Updated by mgriessmeier almost 2 years ago
- Status changed from In Progress to Resolved
PR was merged - includes the fix for s390x
for ipmi, nothing has been done yet; better to handle this in a separate ticket from now on -> reopening https://progress.opensuse.org/issues/32746
please reopen if issue occurs again on s390x