action #31543

closed

[sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"

Added by mgriessmeier about 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2018-02-08
Due date: 2018-03-13
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-Installer-DVD-s390x-allpatterns@s390x-kvm-sle12 fails in install_and_reboot

  • all failing tests run on s390x-kvm-sle12 (s390p8 LPAR), though not all jobs on this LPAR fail
  • all those jobs run on openqaw2
  • there were no recent changes in os-autoinst or the tests regarding this (dasantiago confirmed in IRC)

(maybe) related PRs:
https://github.com/os-autoinst/os-autoinst/pull/906
https://github.com/os-autoinst/os-autoinst/pull/902

Tasks

  • Gather statistics about how often this happens
  • Check if we can handle the root cause of these "debug-messages"
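
For the statistics task, a minimal sketch of counting how many jobs hit this message, assuming the autoinst-log.txt of each job has already been downloaded into a dict. The helper names `count_half_open` and `failure_ratio` are hypothetical and not part of openQA or os-autoinst.

```python
import re
from collections import Counter

# The message quoted in the subject of this ticket.
HALF_OPEN_RE = re.compile(
    r"The console isn't responding correctly\. Maybe half-open socket\?"
)


def count_half_open(logs):
    """Count jobs whose log contains the half-open-socket message.

    `logs` maps a job id to the text of its autoinst-log.txt.
    Returns a Counter with the keys 'affected' and 'clean'.
    """
    stats = Counter(affected=0, clean=0)
    for text in logs.values():
        key = "affected" if HALF_OPEN_RE.search(text) else "clean"
        stats[key] += 1
    return stats


def failure_ratio(stats):
    """Share of affected jobs, or 0.0 when no jobs were counted."""
    total = stats["affected"] + stats["clean"]
    return stats["affected"] / total if total else 0.0
```

Running this per worker or per MACHINE setting would show whether the issue really is limited to one host.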

Acceptance Criteria

At least one of the following should be done:

  • AC1: Turn the incomplete into a fail with a proper message, understandable by everyone
  • AC2: Find the root-cause of this and come up with a fix

Workaround

Reproducible

Fails since (at least) Build 408.1

Expected result

Last good: (unknown) (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 5 (0 open, 5 closed)

Related to openQA Tests - action #30216: [sles][virtualization][xen] svirt-xen-hvm tests are incomplete with "DIE The console isn't responding correctly. Maybe half-open socket?" (Resolved, dasantiago, 2018-01-12)
Related to openQA Tests - action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-open socket? (Resolved, mgriessmeier, 2018-06-11)
Related to openQA Tests - action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 (Rejected, 2018-09-06)
Is duplicate of openQA Tests - action #31534: [sle][functional][medium][s390x] test fails in install_and_reboot - vm stucks after installation process (Rejected, 2018-02-08, 2018-03-13)
Is duplicate of openQA Tests - action #33001: [functional][sle][s390x] test fails in reboot_after_installation - Multiple tests failing to reconnect (Rejected, 2018-03-09)

Actions #1

Updated by mgriessmeier about 6 years ago

  • Related to action #30216: [sles][virtualization][xen] svirt-xen-hvm tests are incomplete with "DIE The console isn't responding correctly. Maybe half-open socket?" added
Actions #2

Updated by michalnowak about 6 years ago

@mgriessmeier: You can fine-tune the polling mechanism to suit that particular host: https://github.com/os-autoinst/os-autoinst/pull/906/files#diff-333bbcc7c9ce8c440b7c87218c426f42R15.

Actions #3

Updated by dasantiago about 6 years ago

Sometimes the channels die; that's why we implemented a quick failure mechanism. Otherwise the job would be stuck for 2 hours.
This requires an investigation at the machine/qemu level to determine why the channels fail.

As a workaround, please follow Michal's advice and try increasing the value to, say, two minutes, to find out whether the channel is really dead or the machine is just slow.
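
To illustrate the idea of failing quickly instead of hanging, here is a minimal sketch of a watchdog that waits for activity on a socket and gives up after a configurable time. It follows the description in this thread only; the actual check in os-autoinst may work differently, and the names `wait_for_activity` and `ConsoleNotResponding` are hypothetical.

```python
import select
import socket
import time


class ConsoleNotResponding(RuntimeError):
    """Raised when the console channel shows no activity in time."""


def wait_for_activity(sock, wait_time, poll_interval=0.05):
    """Return the next chunk read from `sock`.

    Raises ConsoleNotResponding if `wait_time` seconds pass without
    data, or if the peer closed the connection.
    """
    deadline = time.monotonic() + wait_time
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise ConsoleNotResponding(
                f"no activity on the console for {wait_time} s"
            )
        # Block at most until the next poll tick or the deadline.
        timeout = min(poll_interval, remaining)
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable:
            data = sock.recv(4096)
            if not data:
                raise ConsoleNotResponding("peer closed the connection")
            return data
```

A larger `wait_time` makes the check more tolerant of a slow machine, at the cost of reporting a truly dead channel later.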

Actions #4

Updated by okurz about 6 years ago

  • Due date set to 2018-03-13
Actions #5

Updated by mgriessmeier about 6 years ago

  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by mgriessmeier about 6 years ago

  • Is duplicate of action #31534: [sle][functional][medium][s390x] test fails in install_and_reboot - vm stucks after installation process added
Actions #7

Updated by mgriessmeier about 6 years ago

  • Status changed from Workable to Rejected
Actions #8

Updated by mgriessmeier about 6 years ago

  • Status changed from Rejected to In Progress
  • Assignee set to mgriessmeier

I've introduced a circular dependency here...
Reopening and trying out Michal's workaround suggestion.

Actions #9

Updated by okurz about 6 years ago

Latest example: https://openqa.suse.de/tests/1513907

You did not try out the workaround suggestion, did you?

Actions #10

Updated by mgriessmeier about 6 years ago

  • Status changed from In Progress to Feedback

okurz wrote:

Latest example: https://openqa.suse.de/tests/1513907

You did not try out the workaround suggestion, did you?

I've added _CHKSEL_RATE_WAIT_TIME=120 to the MACHINE s390x-kvm-sle12 now.
Setting to Feedback to track it over the next week.
Please use this ticket if this issue occurs again.
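
As an illustration of how such a machine setting could be consumed, a minimal sketch that reads _CHKSEL_RATE_WAIT_TIME from a settings dict and falls back to a default. The function name and the fallback value of 30 are assumptions made for this example and are not taken from os-autoinst.

```python
# Hypothetical fallback, chosen for this example only.
DEFAULT_WAIT_TIME = 30


def wait_time_from_settings(settings, default=DEFAULT_WAIT_TIME):
    """Return the wait time in seconds from a job/machine settings dict.

    Settings values usually arrive as strings, so the value is parsed
    as an integer; a missing or empty value yields `default`.
    """
    raw = settings.get("_CHKSEL_RATE_WAIT_TIME")
    if raw is None or str(raw).strip() == "":
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError("_CHKSEL_RATE_WAIT_TIME must be positive")
    return value
```

With the setting from this comment, `wait_time_from_settings({"_CHKSEL_RATE_WAIT_TIME": "120"})` yields 120.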

Actions #11

Updated by mgriessmeier about 6 years ago

http://openqa.suse.de/tests/1521075/file/autoinst-log.txt
still happening with 120s :(

but it seems to occur less often (just a feeling)

Actions #12

Updated by dasantiago about 6 years ago

I see that there are 8 black/blank screens. Isn't this an indication that the channel is dead?

Actions #13

Updated by mgriessmeier about 6 years ago

dasantiago wrote:

I see that there are 8 black/blank screens. Isn't this an indication that the channel is dead?

that's a 'not so nice' thing in the s390x implementation; for some reason we also see this on passing tests, e.g. https://openqa.suse.de/tests/1521080
but the half-open-socket issue also appears in cases like https://openqa.suse.de/tests/1521150# where we don't see any black screens

I wonder if it could help to increase the value of this variable even more?

Actions #14

Updated by coolo about 6 years ago

If there is no activity on this socket for 2 minutes, increasing the value further will only hide some other problem.

Actions #15

Updated by michalnowak about 6 years ago

Looking at https://openqa.suse.de/tests/1521150 I noticed that we start the serial console grab (console('svirt')->start_serial_grab) in power_action() when the VM is, I suppose, down, and then start it again in redefine_svirt_domain.pm via $svirt->define_and_start. The latter seems to be the right place to connect to the serial console.

For Xen I moved the logic you have in redefine_svirt_domain.pm to utils.pm's assert_shutdown_and_restore_system(), which is called from power_action().
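
The ordering problem can be shown with a toy model, sketched here in Python rather than in the Perl of the test code. The class `SerialGrabGuard` is hypothetical; it only borrows the names start_serial_grab and define_and_start from the description above and does not reflect how os-autoinst implements them.

```python
class SerialGrabGuard:
    """Toy model of a VM whose serial console may only be grabbed
    while the domain is running."""

    def __init__(self):
        self.running = False
        self.grabbing = False
        self.grab_starts = 0

    def shutdown(self):
        # The grab ends together with the domain.
        self.running = False
        self.grabbing = False

    def define_and_start(self):
        # Starting the domain is the one place that attaches the grab.
        self.running = True
        self.start_serial_grab()

    def start_serial_grab(self):
        if not self.running:
            raise RuntimeError("domain is down, nothing to attach to")
        if self.grabbing:
            return  # already attached, do not start a second grab
        self.grabbing = True
        self.grab_starts += 1
```

In this model, grabbing while the VM is down fails loudly and a second grab on a running VM is a no-op, so the serial console is attached exactly once per boot.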

Actions #16

Updated by mgriessmeier about 6 years ago

  • Has duplicate action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added
Actions #17

Updated by mgriessmeier about 6 years ago

  • Subject changed from [sles][functional][tools][s390x][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?" to [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?"

Thanks Michal, I'll follow up on this.

This also happens on ipmi, see https://openqa.suse.de/tests/1514142 and https://openqa.suse.de/tests/1516150

Actions #18

Updated by mgriessmeier about 6 years ago

  • Has duplicate deleted (action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI)
Actions #19

Updated by mgriessmeier about 6 years ago

  • Is duplicate of action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added
Actions #20

Updated by mgriessmeier about 6 years ago

  • Status changed from Feedback to Rejected
Actions #21

Updated by xlai about 6 years ago

The virtualization tests, which rely on ipmi and the ssh console, also fail with the message "The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241", as reported in progress ticket #32746.

I am not sure whether the root cause of the svirt console failure is the same as in ticket #32746, but it has been marked as a duplicate. So when pushing the solution, please double-check that the failures on the ipmi virtualization tests no longer happen. Thanks!

Actions #22

Updated by mgriessmeier about 6 years ago

  • Status changed from Rejected to In Progress
Actions #23

Updated by mgriessmeier about 6 years ago

  • Status changed from In Progress to Workable

setting back to workable
will revisit on monday

Actions #24

Updated by xlai about 6 years ago

mgriessmeier wrote:

setting back to workable
will revisit on monday

Would you please share the PR link with the fixes?

Actions #25

Updated by mgriessmeier about 6 years ago

xlai wrote:

mgriessmeier wrote:

setting back to workable
will revisit on monday

Would you please share the PR link with the fixes?

as soon as I have one, sure

Actions #26

Updated by mgriessmeier about 6 years ago

  • Is duplicate of action #33001: [functional][sle][s390x] test fails in reboot_after_installation - Multiple tests failing to reconnect added
Actions #27

Updated by mgriessmeier about 6 years ago

  • Status changed from Workable to In Progress

with this one, I got 10 tests in a row working right now:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4585

will conduct more runs to get more statistics
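
A rough way to judge how many runs are enough: for a sporadic failure with rate p, the chance that n unfixed runs all pass by luck is (1 - p)^n. The sketch below computes that, assuming independent runs; the failure rate of 20% in the usage note is a made-up example, as this ticket does not report a measured rate.

```python
import math


def prob_all_pass(failure_rate, runs):
    """Chance that `runs` independent runs all pass by luck alone."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, max_chance=0.05):
    """Smallest n with (1 - failure_rate)**n <= max_chance."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return math.ceil(math.log(max_chance) / math.log(1.0 - failure_rate))
```

With a hypothetical failure rate of 20%, ten passes in a row would still happen by luck in about 11% of cases, and `runs_needed(0.2)` gives 14 runs to get that below 5%.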

Actions #28

Updated by mgriessmeier about 6 years ago

  • Status changed from In Progress to Resolved

The PR was merged; it includes the fix for s390x.
Nothing has been done for ipmi yet. It is better to handle that in a separate ticket from now on, so I am reopening https://progress.opensuse.org/issues/32746
Please reopen this ticket if the issue occurs again on s390x.

Actions #29

Updated by mgriessmeier about 6 years ago

  • Is duplicate of deleted (action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI)
Actions #30

Updated by mgriessmeier almost 6 years ago

  • Related to action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-open socket? added
Actions #31

Updated by okurz over 5 years ago

  • Related to action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 added