action #31543

[sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"

Added by mgriessmeier about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Start date: 08/02/2018
Priority: High
Due date: 13/03/2018
Assignee: mgriessmeier
% Done: 0%
Category: Feature requests
Target version: Milestone 14
Difficulty:
Duration: 24

Description

Observation

openQA test in scenario sle-15-Installer-DVD-s390x-allpatterns@s390x-kvm-sle12 fails in
install_and_reboot

  • all tests failing are s390x-kvm-sle12 (s390p8 LPAR), though not all jobs on this LPAR are failing
  • all those jobs are running on openqaw2
  • there were no recent changes in os-autoinst or the tests regarding this (dasantiago confirmed in irc)

(maybe) related PRs:
https://github.com/os-autoinst/os-autoinst/pull/906
https://github.com/os-autoinst/os-autoinst/pull/902

Tasks

  • Gather statistics about how often this happens
  • Check if we can handle the root cause of these "debug-messages"
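
For the statistics task, a minimal sketch of how the occurrences could be counted, assuming the autoinst-log.txt contents of the jobs in question have already been collected into a mapping of job id to log text. The function name and the input shape are hypothetical and not part of openQA or os-autoinst.

```python
# Error text as quoted in the subject of this ticket.
MARKER = "The console isn't responding correctly. Maybe half-open socket?"


def half_open_stats(logs):
    """Return (sorted affected job ids, affected ratio).

    `logs` is a hypothetical mapping of job id -> autoinst-log.txt content.
    """
    affected = sorted(job for job, text in logs.items() if MARKER in text)
    ratio = len(affected) / len(logs) if logs else 0.0
    return affected, ratio
```

Running this over the jobs of one machine per build would give a rough failure rate to compare before and after a change.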

Acceptance Criteria

One of these should be done:

  • AC1: Turn the incomplete into a fail with a proper message, understandable by everyone
  • AC2: Find the root cause of this and come up with a fix

Workaround

Reproducible

Fails since (at least) Build 408.1

Expected result

Last good: (unknown) (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #30216: [sles][virtualization][xen] svirt-xen-hvm tests are incom... Resolved 12/01/2018
Related to openQA Tests - action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-op... Resolved 11/06/2018
Related to openQA Tests - action #40655: [tools][ipmi] DIE The console isn't responding correctly.... Rejected 06/09/2018
Duplicates openQA Tests - action #31534: [sle][functional][medium][s390x] test fails in install_an... Rejected 08/02/2018 13/03/2018
Duplicates openQA Tests - action #33001: [functional][sle][s390x] test fails in reboot_after_insta... Rejected 09/03/2018

History

#1 Updated by mgriessmeier about 2 years ago

  • Related to action #30216: [sles][virtualization][xen] svirt-xen-hvm tests are incomplete with "DIE The console isn't responding correctly. Maybe half-open socket?" added

#2 Updated by michalnowak about 2 years ago

@mgriessmeier: You can fine-tune the polling mechanism to suit that particular host: https://github.com/os-autoinst/os-autoinst/pull/906/files#diff-333bbcc7c9ce8c440b7c87218c426f42R15.

#3 Updated by dasantiago about 2 years ago

Sometimes the channels die; that's why we implemented a quick-failure mechanism, otherwise the job would get stuck for 2 hours.
This requires an investigation at the machine/qemu level to determine why the channels fail.

As a workaround, please follow Michal's advice and try increasing the value to, say, two minutes, just to find out whether the channel is dead or the machine is merely slow.
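
To illustrate the trade-off between failing fast and tolerating a slow machine, here is a minimal sketch of an idle-time watchdog. It is not the os-autoinst implementation. It only assumes, as described in this thread, that a job should abort once a channel has shown no activity for longer than a configurable wait time. All names (ActivityWatchdog, ConsoleNotResponding, wait_time) are hypothetical.

```python
import time


class ConsoleNotResponding(Exception):
    """Raised when the channel stayed silent longer than the allowed wait time."""


class ActivityWatchdog:
    def __init__(self, wait_time, clock=time.monotonic):
        self.wait_time = wait_time      # seconds of silence tolerated
        self.clock = clock              # injectable so the logic can be tested
        self.last_activity = clock()

    def record_activity(self):
        """Call whenever data was actually read from the channel."""
        self.last_activity = self.clock()

    def check(self):
        """Return the current idle time, or fail fast if it exceeds the limit."""
        idle = self.clock() - self.last_activity
        if idle > self.wait_time:
            raise ConsoleNotResponding(
                f"no activity for {idle:.0f}s (limit {self.wait_time}s)"
            )
        return idle
```

A larger wait_time gives a slow host more room but also delays the failure of a truly dead channel by the same amount.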

#4 Updated by okurz about 2 years ago

  • Due date set to 13/03/2018

#5 Updated by mgriessmeier about 2 years ago

  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by mgriessmeier almost 2 years ago

  • Duplicates action #31534: [sle][functional][medium][s390x] test fails in install_and_reboot - vm stucks after installation process added

#7 Updated by mgriessmeier almost 2 years ago

  • Status changed from Workable to Rejected

#8 Updated by mgriessmeier almost 2 years ago

  • Status changed from Rejected to In Progress
  • Assignee set to mgriessmeier

I've introduced a circular dependency here...
Reopening and trying out Michal's workaround suggestion.

#9 Updated by okurz almost 2 years ago

Latest example: https://openqa.suse.de/tests/1513907

You did not try out the workaround suggestion, did you?

#10 Updated by mgriessmeier almost 2 years ago

  • Status changed from In Progress to Feedback

okurz wrote:

Latest example: https://openqa.suse.de/tests/1513907


You did not try out the workaround suggestion, did you?

I've added _CHKSEL_RATE_WAIT_TIME=120 to the MACHINE s390x-kvm-sle12 now.
Setting to Feedback to track it over the next week.
Please use this ticket if this issue occurs again.

#11 Updated by mgriessmeier almost 2 years ago

http://openqa.suse.de/tests/1521075/file/autoinst-log.txt
still happening with 120s :(

but it seems to occur less (just a feeling)

#12 Updated by dasantiago almost 2 years ago

I see that there are 8 black/blank screens. Isn't this an indication that the channel is dead?

#13 Updated by mgriessmeier almost 2 years ago

dasantiago wrote:

I see that there are 8 black/blank screens. Isn't this an indication that the channel is dead?

That's a 'not so nice' thing in the s390x implementation; for some reason we also have this on passing tests, e.g. https://openqa.suse.de/tests/1521080.
But the half-open-socket issue also appears in cases like https://openqa.suse.de/tests/1521150# where we don't see any black screens.

I wonder if it could help to increase the value of this variable even more?

#14 Updated by coolo almost 2 years ago

If there is no activity on this socket for 2 minutes, increasing the value further will only hide some other problem.

#15 Updated by michalnowak almost 2 years ago

Looking at https://openqa.suse.de/tests/1521150 I noticed that we start the serial console grab (console('svirt')->start_serial_grab) in power_action() when the VM is, I suppose, down, and then start it again in redefine_svirt_domain.pm via $svirt->define_and_start. The latter seems to be the right place to connect to the serial console.

For Xen I moved the logic you have in redefine_svirt_domain.pm to utils.pm's assert_shutdown_and_restore_system() called from power_action().

#16 Updated by mgriessmeier almost 2 years ago

  • Duplicated by action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added

#17 Updated by mgriessmeier almost 2 years ago

  • Subject changed from [sles][functional][tools][s390x][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?" to [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"

Thanks Michal, I'll follow up on this.

This also happens on ipmi, see https://openqa.suse.de/tests/1514142 and https://openqa.suse.de/tests/1516150

#18 Updated by mgriessmeier almost 2 years ago

  • Duplicated by deleted (action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI)

#19 Updated by mgriessmeier almost 2 years ago

  • Duplicates action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added

#20 Updated by mgriessmeier almost 2 years ago

  • Status changed from Feedback to Rejected

#21 Updated by xlai almost 2 years ago

The virtualization tests, which rely on ipmi and ssh console, also fail with the message "The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241", as reported in progress ticket #32746.

I am not sure whether the root cause of the svirt console failure is the same as in ticket #32746, but it has been marked as a duplicate. So please double-check that the failures on ipmi virtualization tests no longer happen when pushing the solution. Thanks!

#22 Updated by mgriessmeier almost 2 years ago

  • Status changed from Rejected to In Progress

#23 Updated by mgriessmeier almost 2 years ago

  • Status changed from In Progress to Workable

setting back to workable
will revisit on monday

#24 Updated by xlai almost 2 years ago

mgriessmeier wrote:

setting back to workable

will revisit on monday

Would you please share the PR link with the fixes?

#25 Updated by mgriessmeier almost 2 years ago

xlai wrote:

mgriessmeier wrote:

setting back to workable

will revisit on monday


Would you please share the PR link with the fixes?

as soon as I have one, sure

#26 Updated by mgriessmeier almost 2 years ago

  • Duplicates action #33001: [functional][sle][s390x] test fails in reboot_after_installation - Multiple tests failing to reconnect added

#27 Updated by mgriessmeier almost 2 years ago

  • Status changed from Workable to In Progress

with this one, I got 10 tests in a row working right now:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4585

will conduct more runs to get more statistics

#28 Updated by mgriessmeier almost 2 years ago

  • Status changed from In Progress to Resolved

The PR was merged; it includes the fix for s390x.
For ipmi nothing has been done yet. Better to handle that in a separate ticket from now on -> reopening https://progress.opensuse.org/issues/32746
Please reopen if the issue occurs again on s390x.

#29 Updated by mgriessmeier almost 2 years ago

  • Duplicates deleted (action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI)

#30 Updated by mgriessmeier over 1 year ago

  • Related to action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-open socket? added

#31 Updated by okurz over 1 year ago

  • Related to action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 added
