action #31543

[sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"

Added by mgriessmeier over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2018-02-08
Due date: 2018-03-13
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-Installer-DVD-s390x-allpatterns@s390x-kvm-sle12 fails in install_and_reboot

  • all failing tests run on s390x-kvm-sle12 (s390p8 LPAR), though not all jobs on this LPAR fail
  • all of those jobs run on openqaw2
  • there were no recent changes in os-autoinst or the tests regarding this (dasantiago confirmed on IRC)

(maybe) related PRs:
https://github.com/os-autoinst/os-autoinst/pull/906
https://github.com/os-autoinst/os-autoinst/pull/902

Tasks

  • Gather statistics about how often this happens
  • Check if we can handle the root cause of these "debug-messages"
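
For the first task, a minimal sketch of how such statistics could be gathered, assuming the autoinst-log.txt contents of the jobs in question have already been downloaded. The function name and the input mapping are hypothetical and not part of openQA or os-autoinst; only the error string is taken from this ticket.

```python
import re

# Error string as it appears in the title of this ticket.
DIE_PATTERN = re.compile(
    r"The console isn't responding correctly\. Maybe half-open socket\?"
)


def count_half_open_incompletes(logs):
    """Count jobs whose log contains the half-open-socket DIE message.

    `logs` is a hypothetical mapping of job id -> log text.
    Returns (sorted affected job ids, total number of jobs, affected ratio).
    """
    affected = sorted(job for job, text in logs.items() if DIE_PATTERN.search(text))
    total = len(logs)
    ratio = len(affected) / total if total else 0.0
    return affected, total, ratio
```

Comparing the ratio before and after a change would give something more solid than a feeling for whether the issue occurs less often.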

Acceptance Criteria

One of the following should be done:

  • AC1: Turn the incomplete into a fail with a proper message, understandable by everyone
  • AC2: Find the root-cause of this and come up with a fix
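
As an illustration of AC1 only: a minimal Python sketch of catching the console error and reporting a regular failure with a readable message instead of letting the job die. All names here (ConsoleNotResponding, run_step, the result dictionary) are hypothetical; os-autoinst is written in Perl and handles results differently.

```python
class ConsoleNotResponding(Exception):
    """Hypothetical error type standing in for the DIE message of this ticket."""


def run_step(step):
    """Run a test step and turn a dead console into a readable failure.

    `step` is any callable. Other exceptions are deliberately not caught,
    so unrelated errors are not hidden behind this message.
    """
    try:
        step()
    except ConsoleNotResponding as err:
        reason = (
            "The connection to the system under test stopped responding "
            f"({err}). This is most likely an infrastructure problem, "
            "not a product bug."
        )
        return {"result": "failed", "reason": reason}
    return {"result": "passed", "reason": ""}
```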

Workaround

Reproducible

Fails since (at least) Build 408.1

Expected result

Last good: (unknown) (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #30216: [sles][virtualization][xen] svirt-xen-hvm tests are incomplete with "DIE The console isn't responding correctly. Maybe half-open socket?" (Resolved, 2018-01-12)

Related to openQA Tests - action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-open socket? (Resolved, 2018-06-11)

Related to openQA Tests - action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 (Rejected, 2018-09-06)

Is duplicate of openQA Tests - action #31534: [sle][functional][medium][s390x] test fails in install_and_reboot - vm stucks after installation process (Rejected, 2018-02-08, 2018-03-13)

Is duplicate of openQA Tests - action #33001: [functional][sle][s390x] test fails in reboot_after_installation - Multiple tests failing to reconnect (Rejected, 2018-03-09)

History

#1 Updated by mgriessmeier over 3 years ago

  • Related to action #30216: [sles][virtualization][xen] svirt-xen-hvm tests are incomplete with "DIE The console isn't responding correctly. Maybe half-open socket?" added

#2 Updated by michalnowak over 3 years ago

mgriessmeier: You can fine-tune the polling mechanism to suit that particular host: https://github.com/os-autoinst/os-autoinst/pull/906/files#diff-333bbcc7c9ce8c440b7c87218c426f42R15.

#3 Updated by dasantiago over 3 years ago

Sometimes the channels die, that's why we implemented a quick failure mechanism, otherwise it would get stuck for 2 hours.
This requires an investigation at the machine/qemu level to determine why the channels fail.

As a workaround, please follow Michal's advice and try increasing the value, say to two minutes, to find out whether the channel is really dead or the machine is just slow.
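
To illustrate the idea of failing quickly instead of hanging for hours: a minimal Python sketch of an inactivity check with a configurable wait time. This is not the os-autoinst implementation (which is written in Perl); wait_for_activity, ConsoleNotResponding and poll_interval are made-up names, and wait_time only plays the role of the tunable value discussed here.

```python
import select
import socket
import time


class ConsoleNotResponding(Exception):
    """Raised when the socket stays silent for the whole wait time."""


def wait_for_activity(sock, wait_time, poll_interval=1.0):
    """Return True once `sock` is readable; raise after `wait_time` seconds of silence.

    Note: a socket whose peer has closed the connection also counts as
    readable (a read would return EOF), so this detects silence only.
    """
    deadline = time.monotonic() + wait_time
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise ConsoleNotResponding(
                f"no data for {wait_time} s, maybe half-open socket?"
            )
        # Poll in short slices so the caller never blocks longer than wait_time.
        readable, _, _ = select.select([sock], [], [], min(poll_interval, remaining))
        if readable:
            return True
```

A larger wait_time makes the check more tolerant of a slow machine, at the cost of taking longer to notice a channel that is really dead.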

#4 Updated by okurz over 3 years ago

  • Due date set to 2018-03-13

#5 Updated by mgriessmeier over 3 years ago

  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by mgriessmeier over 3 years ago

  • Is duplicate of action #31534: [sle][functional][medium][s390x] test fails in install_and_reboot - vm stucks after installation process added

#7 Updated by mgriessmeier over 3 years ago

  • Status changed from Workable to Rejected

#8 Updated by mgriessmeier over 3 years ago

  • Status changed from Rejected to In Progress
  • Assignee set to mgriessmeier

I've introduced a circular dependency here...
reopening - trying out Michal's workaround suggestion

#9 Updated by okurz over 3 years ago

Latest example: https://openqa.suse.de/tests/1513907

You did not try out the workaround suggestion, did you?

#10 Updated by mgriessmeier over 3 years ago

  • Status changed from In Progress to Feedback

okurz wrote:

Latest example: https://openqa.suse.de/tests/1513907

You did not try out the workaround suggestion, did you?

I've now added _CHKSEL_RATE_WAIT_TIME=120 to the MACHINE s390x-kvm-sle12
setting to feedback to track it over the next week
please use this ticket if this issue occurs again

#11 Updated by mgriessmeier over 3 years ago

http://openqa.suse.de/tests/1521075/file/autoinst-log.txt
still happening with 120s :(

but it seems to occur less (just a feeling)

#12 Updated by dasantiago over 3 years ago

I see that there are 8 black/blank screens. Isn't this an indication that the channel is dead?

#13 Updated by mgriessmeier over 3 years ago

dasantiago wrote:

I see that there are 8 black/blank screens. Isn't this an indication that the channel is dead?

that's a 'not so nice' thing in the s390x implementation; for some reason we also see this on passing tests, e.g. https://openqa.suse.de/tests/1521080
but the half-open-socket issue also appears in cases like this one, https://openqa.suse.de/tests/1521150# , where we don't see any black screens

I wonder if it could help to increase the value of this variable even more?

#14 Updated by coolo over 3 years ago

If there is no activity on this socket for 2 minutes, increasing the value further will only hide some other problem.

#15 Updated by michalnowak over 3 years ago

Looking at https://openqa.suse.de/tests/1521150 I noticed that we start the serial console grab (console('svirt')->start_serial_grab) in power_action() when the VM is, I suppose, down, and then start it again in redefine_svirt_domain.pm via $svirt->define_and_start. The latter seems to be the right place to connect to the serial console.

For Xen I moved the logic you have in redefine_svirt_domain.pm to utils.pm's assert_shutdown_and_restore_system() called from power_action().

#16 Updated by mgriessmeier over 3 years ago

  • Has duplicate action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added

#17 Updated by mgriessmeier over 3 years ago

  • Subject changed from [sles][functional][tools][s390x][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?" to [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"

Thanks Michal, I'll follow up on this

also happens on ipmi, see https://openqa.suse.de/tests/1514142 and https://openqa.suse.de/tests/1516150

#18 Updated by mgriessmeier over 3 years ago

  • Has duplicate deleted (action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI)

#19 Updated by mgriessmeier over 3 years ago

  • Is duplicate of action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added

#20 Updated by mgriessmeier over 3 years ago

  • Status changed from Feedback to Rejected

#21 Updated by xlai over 3 years ago

The virtualization tests, which rely on the IPMI and SSH consoles, also fail with the message "The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241", as reported in progress ticket #32746.

I am not sure whether the root cause of the svirt console failure is the same as in ticket #32746, but that ticket has been marked as a duplicate. So when pushing a solution, please double-check that the failures in the IPMI virtualization tests no longer occur. Thanks!

#22 Updated by mgriessmeier over 3 years ago

  • Status changed from Rejected to In Progress

#23 Updated by mgriessmeier over 3 years ago

  • Status changed from In Progress to Workable

setting back to workable
will revisit on monday

#24 Updated by xlai over 3 years ago

mgriessmeier wrote:

setting back to workable
will revisit on monday

Would you please share the PR link with the fixes?

#25 Updated by mgriessmeier over 3 years ago

xlai wrote:

mgriessmeier wrote:

setting back to workable
will revisit on monday

Would you please share the PR link with the fixes?

as soon as I have one, sure

#26 Updated by mgriessmeier over 3 years ago

  • Is duplicate of action #33001: [functional][sle][s390x] test fails in reboot_after_installation - Multiple tests failing to reconnect added

#27 Updated by mgriessmeier over 3 years ago

  • Status changed from Workable to In Progress

with this one, I got 10 tests in a row working right now:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4585

will conduct more runs to get more statistics

#28 Updated by mgriessmeier over 3 years ago

  • Status changed from In Progress to Resolved

PR was merged - includes the fix for s390x
for ipmi, nothing was done yet, better handle this in a separate ticket from now on -> reopening https://progress.opensuse.org/issues/32746
please reopen if issue occurs again on s390x

#29 Updated by mgriessmeier over 3 years ago

  • Is duplicate of deleted (action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI)

#30 Updated by mgriessmeier over 3 years ago

  • Related to action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-open socket? added

#31 Updated by okurz about 3 years ago

  • Related to action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 added
