action #32746

closed

[sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI

Added by xlai about 6 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version: -
Start date: 2018-03-05
Due date:
% Done: 0%
Estimated time:
Difficulty:
Description

Failure root cause from autoinst log:

The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
    backend::baseclass::die_handler('The console isn\'t responding correctly. Maybe half-open sock...') called at /usr/lib/os-autoinst/backend/baseclass.pm line 241
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 156
    backend::baseclass::run_capture_loop('backend::ipmi=HASH(0x6205910)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 129
    backend::baseclass::run('backend::ipmi=HASH(0x6205910)', 5, 8) called at /usr/lib/os-autoinst/backend/driver.pm line 85
    backend::driver::start('backend::driver=HASH(0x5d76870)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
    backend::driver::new('backend::driver', 'ipmi') called at /usr/bin/isotovideo line 211
    main::init_backend() called at /usr/bin/isotovideo line 280
[2018-03-02T12:58:49.0368 CET] [debug] IPMI: Chassis Power Control: Down/Off
last frame

Failure job link:
https://openqa.suse.de/tests/1514142
https://openqa.suse.de/tests/1516150


Related issues 10 (2 open, 8 closed)

Related to openQA Tests - action #31375: [sle][functional][ipmi][u][hard] test fails in first_boot - VNC installation on SLE 15 failed because of various issues (ipmi worker, first_boot, boot_from_pxe, await_install) | Rejected | SLindoMansilla | 2018-02-05

Related to openQA Tests - action #32089: [sle][functional][u][ipmi][easy] test fails in first_boot - abort the test early so that we at least test the installation | Resolved | SLindoMansilla | 2018-02-05

Related to openQA Tests - action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-open socket? | Resolved | mgriessmeier | 2018-06-11

Related to openQA Infrastructure - action #40544: [OpenQA][IPMI backend] IPMI worker can not survive reboot on dell SUT | Resolved | XGWang0 | 2018-09-03

Related to openQA Tests - action #41330: [functional][y][s390x][investigation][timebox:4h] test fails in welcome - half-open socket in post_fail_hook causing incomplete job | Resolved | riafarov | 2018-09-19 | 2019-01-29

Related to openQA Tests - action #46964: [functional][u][s390x] test fails in the middle of execution (not installation) as incomplete with "half-open socket?" – connection to machine vanished? | Resolved | okurz | 2019-02-01

Related to openQA Tests - action #48482: [ipmi][functional][u] test fails in reboot_after_installation; The console isn't responding correctly. Maybe half-open socket | Resolved | okurz | 2019-02-27

Related to openQA Tests - action #60161: [network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link | Workable | cfconrad

Has duplicate openQA Tests - action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 | Rejected | 2018-09-06

Blocks openQA Tests - action #34471: [qe-core][functional][opensuse][medium] too early matching in too generic needle text-login-20160812 | New | 2018-04-08

Actions #1

Updated by mgriessmeier about 6 years ago

  • Is duplicate of action #31543: [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?" added
Actions #2

Updated by mgriessmeier about 6 years ago

  • Is duplicate of deleted (action #31543: [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?")
Actions #3

Updated by mgriessmeier about 6 years ago

  • Has duplicate action #31543: [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?" added
Actions #4

Updated by mgriessmeier about 6 years ago

  • Status changed from New to Rejected
  • Assignee deleted (szarate)

Rejected as a duplicate of https://progress.opensuse.org/issues/31543

Actions #5

Updated by mgriessmeier about 6 years ago

  • Subject changed from [tools] Incomplete job because console isn't responding correctly. to [sles][functional][tools][ipmi] Incomplete job because console isn't responding correctly. Half-open socket on IPMI
  • Category set to Bugs in existing tests
  • Status changed from Rejected to Workable
  • Target version set to Milestone 15
Actions #6

Updated by mgriessmeier about 6 years ago

  • Has duplicate deleted (action #31543: [sles][functional][tools][s390x][ipmi][hard][sporadic] test incompletes - "DIE The console isn't responding correctly. Maybe half-open socket?")
Actions #7

Updated by xlai about 6 years ago

  • Category changed from Bugs in existing tests to Infrastructure
  • Assignee set to szarate
Actions #8

Updated by xlai about 6 years ago

  • Category changed from Infrastructure to Bugs in existing tests
Actions #9

Updated by mitiao about 6 years ago

  • Assignee changed from szarate to mitiao
Actions #10

Updated by mgriessmeier about 6 years ago

  • Due date set to 2018-03-27

planned for next sprint

Actions #11

Updated by nicksinger about 6 years ago

  • Subject changed from [sles][functional][tools][ipmi] Incomplete job because console isn't responding correctly. Half-open socket on IPMI to [sles][functional][tools][ipmi][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI
Actions #12

Updated by mitiao almost 6 years ago

@xlai, have you set this variable?
_CHKSEL_RATE_WAIT_TIME=120
It may not solve the issue completely, but append it to your test to see whether it reduces the frequency of the issue.

Actions #13

Updated by xlai almost 6 years ago

  • Status changed from Workable to In Progress

mitiao wrote:

@xlai, have you set this variable?
_CHKSEL_RATE_WAIT_TIME=120
It may not solve the issue completely, but append it to your test to see whether it reduces the frequency of the issue.

Will try. Thanks for the advice.

Actions #14

Updated by mgriessmeier almost 6 years ago

  • Due date changed from 2018-03-27 to 2018-04-24
Actions #15

Updated by xlai almost 6 years ago

xlai wrote:

mitiao wrote:

@xlai, have you set this variable?
_CHKSEL_RATE_WAIT_TIME=120
It may not solve the issue completely, but append it to your test to see whether it reduces the frequency of the issue.

Will try. Thanks for the advice.

With the parameter set, the issue happened again on the latest build 550.2.

Job list (3 in total, failure ratio 3/34, nearly 10%):
https://openqa.suse.de/tests/1599762#
https://openqa.suse.de/tests/1599951/file/autoinst-log.txt
https://openqa.suse.de/tests/1599952/file/autoinst-log.txt

Actions #16

Updated by okurz almost 6 years ago

  • Subject changed from [sles][functional][tools][ipmi][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI to [sle][functional][u][tools][ipmi][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI
Actions #17

Updated by xlai almost 6 years ago

xlai wrote:

xlai wrote:

mitiao wrote:

@xlai, have you set this variable?
_CHKSEL_RATE_WAIT_TIME=120
It may not solve the issue completely, but append it to your test to see whether it reduces the frequency of the issue.

Will try. Thanks for the advice.

With the parameter set, the issue happened again on the latest build 550.2.

Job list (3 in total, failure ratio 3/34, nearly 10%):
https://openqa.suse.de/tests/1599762#
https://openqa.suse.de/tests/1599951/file/autoinst-log.txt
https://openqa.suse.de/tests/1599952/file/autoinst-log.txt

On build 555.1, the failure ratio doubled: 6 jobs are incomplete due to this issue, so the ratio is now 20%.

Really appreciate your efforts on it. Look forward to the fix! Thanks.

Actions #18

Updated by okurz almost 6 years ago

  • Description updated (diff)

@mitiao do you have an idea what to do next?

Actions #19

Updated by mitiao almost 6 years ago

okurz wrote:

@mitiao do you have an idea what to do next?

No idea yet; I am currently working on other things.
@dasantiago may be able to help.

Actions #20

Updated by xlai almost 6 years ago

On SLE15 RC3 build 567.1, four tests failed because of this issue; the failure ratio is 4/34 ≈ 12%.

Actions #21

Updated by mgriessmeier almost 6 years ago

  • Due date changed from 2018-04-24 to 2018-05-08
  • Target version changed from Milestone 15 to Milestone 16
Actions #22

Updated by SLindoMansilla almost 6 years ago

  • Related to action #31375: [sle][functional][ipmi][u][hard] test fails in first_boot - VNC installation on SLE 15 failed because of various issues (ipmi worker, first_boot, boot_from_pxe, await_install) added
Actions #23

Updated by SLindoMansilla almost 6 years ago

  • Related to action #32089: [sle][functional][u][ipmi][easy] test fails in first_boot - abort the test early so that we at least test the installation added
Actions #24

Updated by mgriessmeier almost 6 years ago

  • Due date changed from 2018-05-08 to 2018-05-22

@mitiao: could you please give us an update on the state here?
Are you working on this actively? Do you need any help from the QSF team?

Actions #25

Updated by mitiao almost 6 years ago

mgriessmeier wrote:

@mitiao: could you please give us an update on the state here?
Are you working on this actively? Do you need any help from the QSF team?

No update yet; I will put this in my schedule later.
Any help is welcome, and if anyone has an idea or is able to fix it, please take it :)

Actions #26

Updated by mgriessmeier almost 6 years ago

  • Subject changed from [sle][functional][u][tools][ipmi][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI to [sle][tools][ipmi][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI
  • Due date deleted (2018-05-22)

We haven't seen this for a while now on jobs covered by the QSF team,
so I am unassigning it from our backlog for now.
If you need any further assistance here, feel free to ask.

Actions #27

Updated by xlai almost 6 years ago

mgriessmeier wrote:

We haven't seen this for a while now on jobs covered by the QSF team,
so I am unassigning it from our backlog for now.
If you need any further assistance here, feel free to ask.

The Virtualization job group still keeps hitting the issue, e.g. on build 635.1; see https://openqa.suse.de/tests/1711157/file/autoinst-log.txt.

This issue has been marked as a serious blocking openQA backend issue for the Virtualization job group, since we hit it often and are frequently blocked by it.

We hope this can be fixed ASAP.

Actions #28

Updated by mgriessmeier almost 6 years ago

  • Related to action #37087: [kernel][s390x] test incompletes in shutdown_ltp: half-open socket? added
Actions #29

Updated by mgriessmeier almost 6 years ago

I might have an idea here... let's see...

Actions #30

Updated by mgriessmeier almost 6 years ago

I'm not able to get an ipmi job running on my openQA instance...

If anyone wants to pick up the idea: my guess is that we are missing a disable_vnc_stalls call for the ipmi backend before we reboot, so I've modified prepare_system_shutdown in lib/utils:

@@ -239,7 +239,7 @@ sub prepare_system_shutdown {
     # kill the ssh connection before triggering reboot
     console('root-ssh')->kill_ssh if check_var('BACKEND', 'ipmi');

-    if (check_var('ARCH', 's390x')) {
+    if (check_var('ARCH', 's390x') || check_var('BACKEND', 'ipmi')) {
         if (check_var('BACKEND', 's390x')) {
             # kill serial ssh connection (if it exists)
             eval { console('iucvconn')->kill_ssh unless get_var('BOOT_EXISTING_S390', ''); };
         }
         console('installation')->disable_vnc_stalls;

Actions #31

Updated by cachen almost 6 years ago

mgriessmeier wrote:

I'm not able to get an ipmi job running on my openQA instance...

If anyone wants to pick up the idea: my guess is that we are missing a disable_vnc_stalls call for the ipmi backend before we reboot, so I've modified prepare_system_shutdown in lib/utils:

@@ -239,7 +239,7 @@ sub prepare_system_shutdown {
     # kill the ssh connection before triggering reboot
     console('root-ssh')->kill_ssh if check_var('BACKEND', 'ipmi');

-    if (check_var('ARCH', 's390x')) {
+    if (check_var('ARCH', 's390x') || check_var('BACKEND', 'ipmi')) {
         if (check_var('BACKEND', 's390x')) {
             # kill serial ssh connection (if it exists)
             eval { console('iucvconn')->kill_ssh unless get_var('BOOT_EXISTING_S390', ''); };
         }
         console('installation')->disable_vnc_stalls;

Nice, thank you for the solution idea!

@Alice, @mitiao, any ideas? Can we pick up this fix and try it at least on the Beijing ipmi server first? If the fix does not harm testing, then maybe we can have it merged and try it on openqa.suse.de. What do you think?

Actions #32

Updated by xlai almost 6 years ago

cachen wrote:

mgriessmeier wrote:

I'm not able to get an ipmi job running on my openQA instance...

If anyone wants to pick up the idea: my guess is that we are missing a disable_vnc_stalls call for the ipmi backend before we reboot, so I've modified prepare_system_shutdown in lib/utils:

@@ -239,7 +239,7 @@ sub prepare_system_shutdown {
     # kill the ssh connection before triggering reboot
     console('root-ssh')->kill_ssh if check_var('BACKEND', 'ipmi');

-    if (check_var('ARCH', 's390x')) {
+    if (check_var('ARCH', 's390x') || check_var('BACKEND', 'ipmi')) {
         if (check_var('BACKEND', 's390x')) {
             # kill serial ssh connection (if it exists)
             eval { console('iucvconn')->kill_ssh unless get_var('BOOT_EXISTING_S390', ''); };
         }
         console('installation')->disable_vnc_stalls;

Nice, thank you for the solution idea!

@Alice, @mitiao, any ideas? Can we pick up this fix and try it at least on the Beijing ipmi server first? If the fix does not harm testing, then maybe we can have it merged and try it on openqa.suse.de. What do you think?

Thanks to mattias for the suggestion!
I will try it locally. Our situation is just more complex here: besides this step, other parts of our code also handle reboot, so we may need to extend this idea. I will try locally first to see whether it works well. Then we will push to openQA to see whether it reduces the number of incomplete jobs.

Actions #33

Updated by xlai almost 6 years ago

A PR is proposed in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5236.

This change will not affect the workflow. Let's see whether it eliminates the incomplete jobs caused by the half-open socket.

Actions #34

Updated by mgriessmeier over 5 years ago

  • Blocks action #34471: [qe-core][functional][opensuse][medium] too early matching in too generic needle text-login-20160812 added
Actions #35

Updated by xlai over 5 years ago

xlai wrote:

A PR is proposed in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5236.

This change will not affect the workflow. Let's see whether it eliminates the incomplete jobs caused by the half-open socket.

According to zaoliang's comment on the PR, the issue reproduced easily in his test with the changes applied, so the change proved not to work.

The ticket is now open for new suggestions or for a bug fix on the backend side.

Please comment if you disagree with this conclusion about the suggestion.

Actions #36

Updated by szarate over 5 years ago

  • Assignee changed from mitiao to szarate

I will pick it up, will follow up with Zaoliang to see if I can give a hand on this.

Actions #37

Updated by xlai over 5 years ago

This issue still reproduces a lot in sle12sp4 openqa testing.

Failed jobs examples:
http://openqa.suse.de/tests/1795850
http://openqa.suse.de/tests/1795947

Both failed at the reboot_after_installation step, which is the final common OS installation step, rather than in virtualization-specific code.

Actions #38

Updated by cachen over 5 years ago

  • Related to action #40544: [OpenQA][IPMI backend] IPMI worker can not survive reboot on dell SUT added
Actions #39

Updated by cachen over 5 years ago

  • Assignee changed from szarate to jerrytang

Hello Santi,
PR#1021 & PR#5722 can reduce the frequency of the half-open issue but don't fix the root cause in the ipmi backend. As the issue has a big impact on the stability of ipmi-based tests, let me add Jerry to support this ticket. I hope we can get it fixed ASAP.

Actions #40

Updated by okurz over 5 years ago

  • Related to action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 added
Actions #41

Updated by jerrytang over 5 years ago

Update on this issue:
An ssh socket connection broken by a shutdown triggers the half-open check.

In the virtualization tests:
select_console root-ssh creates the ssh socket and adds it to the monitoring code; a reboot without disconnecting first causes the half-open issue.

So this can be fixed by calling prepare_system_shutdown before every reboot/shutdown step in the test case.
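
The mechanism described in this note can be illustrated with a toy model. It is written in Python rather than the backend's Perl and is not the os-autoinst implementation: ConsoleMonitor, its methods and reboot_scenario are hypothetical names, and a local socket pair stands in for the ssh connection to the SUT. It only shows that a socket still being watched when the SUT drops its end gets flagged, while one removed beforehand does not.

```python
import selectors
import socket


class ConsoleMonitor:
    """Hypothetical stand-in for the set of console sockets a backend watches."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()
        self._sockets = {}

    def add(self, name, sock):
        # Comparable to select_console adding the ssh socket to the watch list.
        self._selector.register(sock, selectors.EVENT_READ, data=name)
        self._sockets[name] = sock

    def remove(self, name):
        # Comparable to killing the ssh connection before the reboot.
        sock = self._sockets.pop(name)
        self._selector.unregister(sock)
        sock.close()

    def dead_consoles(self, timeout=0.1):
        """Return names of watched consoles whose peer has gone away."""
        dead = []
        for key, _events in self._selector.select(timeout):
            if key.fileobj.recv(4096) == b"":  # readable but empty: peer closed
                dead.append(key.data)
        return dead


def reboot_scenario(disconnect_first):
    """Simulate an SUT reboot; return the consoles the monitor finds dead."""
    monitor = ConsoleMonitor()
    worker_end, sut_end = socket.socketpair()
    monitor.add("root-ssh", worker_end)
    if disconnect_first:
        monitor.remove("root-ssh")
    sut_end.close()  # the SUT goes down and drops its end of the connection
    result = monitor.dead_consoles()
    if not disconnect_first:
        monitor.remove("root-ssh")  # clean up the socket that was left behind
    return result
```

Calling reboot_scenario(disconnect_first=False) reports root-ssh as dead, whereas reboot_scenario(disconnect_first=True) reports nothing, which is the effect the note expects from disconnecting before the reboot.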

Actions #42

Updated by jerrytang over 5 years ago

I submitted 2 PRs to fix this.

It would be great if the developers could review them and comment.

PR for test :
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5778

PR for backend :
https://github.com/os-autoinst/os-autoinst/pull/1026

Thanks

Jerry tang

Actions #43

Updated by nicksinger over 5 years ago

  • Subject changed from [sle][tools][ipmi][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI to [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI

Changing the tag because this doesn't only affect ipmi jobs but rather all remote backends.

Actions #44

Updated by nicksinger over 5 years ago

jerrytang wrote:

Update on this issue:
An ssh socket connection broken by a shutdown triggers the half-open check.

In the virtualization tests:
select_console root-ssh creates the ssh socket and adds it to the monitoring code; a reboot without disconnecting first causes the half-open issue.

So this can be fixed by calling prepare_system_shutdown before every reboot/shutdown step in the test case.

Jerry, I don't understand your argument here when comparing it with your code change. If I understand you correctly, calling prepare_system_shutdown before every reboot fixes this. However, your change doesn't call prepare_system_shutdown but changes much, much more in the backend and test code. So a simple question:

  1. Does calling prepare_system_shutdown before reboot fix this issue?
Actions #45

Updated by jerrytang over 5 years ago

Jerry, I don't understand your argument here when comparing it with your code change. If I understand you correctly, calling prepare_system_shutdown before every reboot fixes this. However, your change doesn't call prepare_system_shutdown but changes much, much more in the backend and test code. So a simple question:

  1. Does calling prepare_system_shutdown before reboot fix this issue?

Theoretically, calling prepare_system_shutdown before reboot fixes this issue.

While fixing the test case side, I found that reboot is a special case here.
The current scheme:

  1. Reboot requires an ssh connection (root-ssh console: xvnc+xterm+ssh).
  2. prepare_system_shutdown disconnects ssh.

So you can see the problem:
if you call prepare_system_shutdown, you never get the chance to reboot.

My PR is just one way to handle this; a better way is welcome.

Actions #46

Updated by michalnowak over 5 years ago

  • Related to deleted (action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241)
Actions #47

Updated by michalnowak over 5 years ago

  • Has duplicate action #40655: [tools][ipmi] DIE The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 241 added
Actions #48

Updated by michalnowak over 5 years ago

Did you try disable_vnc_stalls on the active console before restart is triggered? I used to have problems with VNC stalls on the svirt backend, this helped. See: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/ecf740a0999cc8cb29c93880b7d43b080933be3c/lib/power_action_utils.pm#L47. Do you use power_action()? disable_vnc_stalls is used from there.

Actions #49

Updated by jerrytang over 5 years ago

michalnowak wrote:

Did you try disable_vnc_stalls on the active console before restart is triggered? I used to have problems with VNC stalls on the svirt backend, this helped. See: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/ecf740a0999cc8cb29c93880b7d43b080933be3c/lib/power_action_utils.pm#L47. Do you use power_action()? disable_vnc_stalls is used from there.

Could you please explain how this works?
I'm not sure the VNC socket is added to the monitored sockets ($self->{select}).

Also, this issue does not reproduce 100% of the time.
(I think that is because of a race condition between

1. send_key 'alt-o', which shuts down the host and thereby kills ssh
2. power_action('reboot', observe => 1, keepconsole => 1, first_reboot => 1);, which removes and kills ssh.

If 2 is faster than 1, then it's fine.)

Actions #50

Updated by okurz over 5 years ago

https://openqa.suse.de/tests/2095884/file/autoinst-log.txt seems to be one of the last examples, just to reference a more recent job :)

I think you did not mention the important step, which IMHO can destroy everything:

https://github.com/os-autoinst/os-autoinst-distri-opensuse/commit/ae9cf2e2c51d24ae3505fa5fedcdb8b3528d1707 introduced a wait_screen_change around the alt-o keypress. This relies on the remote VNC connection, which might already be gone at this time. I guess removing the wait_screen_change can actually fix the problem -> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5915

Actions #51

Updated by cachen over 5 years ago

Thanks for all the suggestions and solutions above!

We tried disable_vnc_stalls by calling prepare_system_shutdown for the ipmi backend, which does not fix the issue 100%.

We will try the solution of removing wait_screen_change.

I feel we are very close to getting this issue fixed :)

Actions #52

Updated by jerrytang over 5 years ago

To prove the race condition, I created a job intended to trigger the half-open issue:

 # after ipmi host boot
 use_ssh_serial_console;
 type_string("reboot\n");
 sleep 4;  # this makes sure the reboot kills ssh first
 save_screenshot;
 power_action('reboot', observe => 1, keepconsole => 1, first_reboot => 1);
 save_screenshot;

I ran it 4 times, and all of the runs hit the half-open problem.
http://10.67.132.86/tests/309#next_previous

I also replaced type_string("reboot\n"); with type_string("shutdown -r 1\n"); to avoid the half-open condition,
and the result was no half-open issue.

The problem happens after send_key 'alt-o', so your PR may not work as expected.

Actions #53

Updated by okurz over 5 years ago

So let me try to phrase in my own words: Pressing 'alt-o' causes the SUT to reboot which will close the socket from one side. This is not a problem if the corresponding ssh connection is killed fast enough but not otherwise, hence the race condition. However, we can not simply call prepare_system_shutdown which would also be triggered by power_action because we still need a connection to the SUT to send the 'alt-o' key. https://github.com/os-autoinst/os-autoinst/pull/902 which was done to detect "half-open sockets" might have introduced exactly that problem. IMHO the backend needs to handle this gracefully, e.g. we press 'alt-o' and then just call power_action but I do not know in which way exactly this could work.
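
The race phrased above comes down to which of two events happens first. A minimal sketch in Python, with timings loosely borrowed from the experiment in note 52: the function and scenario names are hypothetical, and the 1-second figure for how quickly a plain reboot drops the socket is an assumption, not a measurement.

```python
def half_open_triggered(sut_drops_socket_at, ssh_killed_at):
    """True if the SUT closes its end while the ssh console is still monitored."""
    return sut_drops_socket_at < ssh_killed_at


# Seconds after the reboot command is sent; the 1-second figure is an assumption.
SCENARIOS = {
    # "reboot" takes effect quickly, power_action only runs after "sleep 4".
    "reboot, then sleep 4": (1.0, 4.0),
    # "shutdown -r 1" delays the reboot by one minute, so ssh is killed first.
    "shutdown -r 1": (60.0, 4.0),
}


def classify(scenarios):
    """Map each scenario name to whether it would hit the half-open check."""
    return {name: half_open_triggered(*times) for name, times in scenarios.items()}
```

Under these assumed timings, classify(SCENARIOS) marks only the first scenario as hitting the half-open check, matching the observation that delaying the reboot avoided the error.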

Actions #54

Updated by jerrytang over 5 years ago

okurz wrote:

So let me try to phrase in my own words: Pressing 'alt-o' causes the SUT to reboot which will close the socket from one side. This is not a problem if the corresponding ssh connection is killed fast enough but not otherwise, hence the race condition. However, we can not simply call prepare_system_shutdown which would also be triggered by power_action because we still need a connection to the SUT to send the 'alt-o' key. https://github.com/os-autoinst/os-autoinst/pull/902 which was done to detect "half-open sockets" might have introduced exactly that problem. IMHO the backend needs to handle this gracefully, e.g. we press 'alt-o' and then just call power_action but I do not know in which way exactly this could work.

Exactly.
The installation session has

  • no I/O cache policy, and
  • no simple way to reboot within the ssh session.

So, as you can see, my PR uses a second way: hard_reset.
Actions #55

Updated by okurz over 5 years ago

  • Target version deleted (Milestone 16)

M16 has been closed for a long time.

Actions #56

Updated by cachen over 5 years ago

The issue still happened from time to time during RC2 build 0421 acceptance testing; we have to keep watching the tests and be ready to retrigger those that fail because of this issue. Personally I don't like to push too much, but it affects the delivery of the acceptance results a lot, and more ipmi-relevant tests will be added :(

Let me try to summarize the current situation from the comments and PRs:
First of all, I think we are now clear about why and when the half-open condition happens in the installation->reboot_after_installation step, so we are finally on the same page :)
So far there are 2 options:
1) Just reduce how often the error is hit, via Olive's PR#5915 or via prepare_system_shutdown
2) Fix it as in Jerry's PRs by forcibly rebooting the ipmi server with hard_reset (perhaps rebooting the server directly in power_action, specifically for ipmi, would be preferable?)

@coolo, you are the PO of openQA and the expert/author of the ipmi backend; we need your suggestion and decision, or maybe you have a better solution :)

Actions #57

Updated by okurz over 5 years ago

cachen wrote:

So far there are 2 options:
1) Just reduce how often the error is hit, via Olive's PR#5915 or via prepare_system_shutdown

I think this option can be applied regardless. It might only help with the symptoms, but it still helps :)

2) Fix it as in Jerry's PRs by forcibly rebooting the ipmi server with hard_reset (perhaps rebooting the server directly in power_action, specifically for ipmi, would be preferable?)

@coolo, you are the PO of openQA and the expert/author of the ipmi backend; we need your suggestion and decision, or maybe you have a better solution :)

I would still go with #32746#note-53 which is a different approach: Handle the uni-directional socket termination gracefully in the backend.

Actions #58

Updated by jerrytang over 5 years ago

Anyway, I updated my PR following some of the points coolo mentioned, moving all actions into the prepare_system_shutdown function:
https://github.com/os-autoinst/os-autoinst/pull/1026#issuecomment-429277855

However, it still needs backend API support, in my "NOT" graceful way.

I hope this issue can be fixed soon; waiting for a graceful way.

Actions #59

Updated by okurz over 5 years ago

jerrytang wrote:

I hope this issue can be fixed soon; waiting for a graceful way.

Don't wait, better try to fix it yourself. I made a suggestion in
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5778#pullrequestreview-164196772

Actions #60

Updated by okurz over 5 years ago

https://github.com/os-autoinst/os-autoinst/pull/902 merged, https://openqa.opensuse.org/tests/772175 is VR on normal qemu-x86_64, sle-15-SP1-Installer-DVD-x86_64-Build66.2-gi-guest_sles12sp2-on-host-developing-kvm@64bit-ipmi shows that at least the IPMI virtualization tests are also not broken now :)

Please crosscheck if this helps to mitigate the problem, e.g. with a good statistical analysis, monitor jobs triggered after that, etc.

Actions #61

Updated by cachen over 5 years ago

  • Related to action #41330: [functional][y][s390x][investigation][timebox:4h] test fails in welcome - half-open socket in post_fail_hook causing incomplete job added
Actions #62

Updated by cachen over 5 years ago

  • Assignee deleted (jerrytang)

So far, the merged PR#5915 has reduced how often virtualization tests hit the error. Since the remaining fix needs to go deep into the openQA backend, I agree with Jerry to hand it back to the openQA Tools group. The discussion can be followed in the PRs below.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5953
https://github.com/os-autoinst/os-autoinst/pull/1041

Thanks to everyone involved in this ticket for your discussions and solutions.

Actions #63

Updated by okurz over 5 years ago

Hi cachen,

cachen wrote:

So far, the merged PR#5915 has reduced how often virtualization tests hit the error.

I am happy to hear that my PR helped a bit :)

Since the remaining fix needs to go deep into the openQA backend, I agree with Jerry to hand it back to the openQA Tools group […]

Hm, I am not sure this will work. The current tools team does not really have experts in the aforementioned domain and I think Jerry was already on a good track to fix it for good.

Actions #64

Updated by cachen over 5 years ago

okurz wrote:

Hi cachen,

cachen wrote:

So far, the merged PR#5915 has reduced how often virtualization tests hit the error.

I am happy to hear that my PR helped a bit :)

I appreciate it so much ;)

Since the remaining fix needs to go deep into the openQA backend, I agree with Jerry to hand it back to the openQA Tools group […]

Hm, I am not sure this will work. The current tools team does not really have experts in the aforementioned domain and I think Jerry was already on a good track to fix it for good.

Unfortunately, I have to move Jerry back to performance testing.

mitiao (Wei Jiang) knows the whole background; he has been involved a lot in the discussion and testing in the Beijing office :)
I think he can take over the remaining work to optimize the code. Of course, Jerry and the Virtualization group will continue to support him and verify his code.

Actions #65

Updated by okurz over 5 years ago

I understood from xlai that the issue is still present, even though it is less likely. According to xlai, mitiao is on it.

Actions #66

Updated by cachen about 5 years ago

  • Assignee set to mitiao
  • Priority changed from High to Normal

Let me assign the ticket to mitiao, since there has been no response for more than 1 month.
From my understanding, QA-VT still expects the ipmi issue to be 100% fixed in the tool backend, but the priority can be lowered (changed to 'Normal') since Oli's workaround made the issue happen less often.
@Alice, correct me if I am wrong.

Actions #67

Updated by mitiao about 5 years ago

  • Assignee changed from mitiao to xlai

Re-assigning to Alice since I am leaving...

Actions #68

Updated by cachen about 5 years ago

  • Status changed from In Progress to Blocked
  • Assignee deleted (xlai)

Sorry, Alice isn't a member of the Tools group and she isn't responsible for the backend. Let me remove the assignment and mark the status as 'Blocked' due to lack of human resources, since it seems that currently nobody from the Tools group would like to take over.

Actions #69

Updated by okurz about 5 years ago

  • Status changed from Blocked to Workable

Commonly we use the status "Blocked" only in relation with a blocking ticket and a person tracking the blocked status so that this person is also automatically informed about ticket updates and can then update the ticket. Setting back to "Workable" which IMHO is the most suitable for a task that is in principle "workable" but no one picked it up, ok with that?

Actions #70

Updated by cachen about 5 years ago

  • Status changed from Workable to New

okurz wrote:

Commonly we use the status "Blocked" only in relation with a blocking ticket and a person tracking the blocked status so that this person is also automatically informed about ticket updates and can then update the ticket. Setting back to "Workable" which IMHO is the most suitable for a task that is in principle "workable" but no one picked it up, ok with that?

I don't want to judge whether this is 'workable' or not since this seems to be a 'hard' task. Let's leave the status at 'New' until someone can take it; they can then mark it 'workable', 'in progress' or something else :)

Actions #71

Updated by okurz about 5 years ago

  • Related to action #46964: [functional][u][s390x] test fails in the middle of execution (not installation) as incomplete with "half-open socket?" – connection to machine vanished? added
Actions #72

Updated by cachen about 5 years ago

New occurrences in the VT reboot_after_installation step:

https://openqa.nue.suse.com/tests/2503084
https://openqa.nue.suse.com/tests/2504185

Is it caused by the rewrite in commit 023c4c09dca87d17b3cec325f3adb5288525a211?

Actions #73

Updated by cachen about 5 years ago

  • Related to action #48482: [ipmi][functional][u] test fails in reboot_after_installation; The console isn't responding correctly. Maybe half-open socket added
Actions #74

Updated by okurz about 5 years ago

  • Status changed from New to Resolved
  • Assignee set to okurz

https://openqa.nue.suse.com/tests/2503084#next_previous and the last ten jobs in this scenario are all green. #48482 as well as #48260 should have solved this. https://github.com/os-autoinst/os-autoinst/pull/1120 in particular should help by giving better feedback about what went wrong in case of errors. In most cases the "half-open sockets" were caused by problems introduced by the tests, although it was not obvious what caused them. We cannot easily prevent future test code changes from introducing the same symptom again; however, the hint in the error message should make it more obvious what the test writer is missing or has done wrong. In short: the most likely problem is that the tests still try to access a console while the SUT is in reboot or shutdown. This can be prevented by explicitly disabling stall detection on these consoles or by terminating ssh-based console connections before triggering a reboot or shutdown.
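
The rule of thumb from this resolution can be written down as a small pre-reboot check. This is a sketch in Python under stated assumptions, not os-autoinst or test-distribution code: the Console class and both helper functions are hypothetical, and only the console names root-ssh and installation are taken from this ticket.

```python
from dataclasses import dataclass


@dataclass
class Console:
    """Hypothetical model of a console as seen by the test."""
    name: str
    ssh_based: bool
    connected: bool = True
    stall_detection: bool = True


def risky_consoles(consoles):
    """Consoles that could trip a half-open check if the SUT went down now."""
    return [c.name for c in consoles if c.connected and c.stall_detection]


def prepare_consoles_for_power_action(consoles):
    """Apply the two remedies named above before a reboot or shutdown."""
    for console in consoles:
        if not console.connected:
            continue
        if console.ssh_based:
            console.connected = False        # terminate the ssh-based connection
        else:
            console.stall_detection = False  # disable stall detection instead
```

With one ssh-based and one non-ssh console, risky_consoles lists both before prepare_consoles_for_power_action runs and none afterwards. A test could assert on that list right before triggering the power action.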

Actions #75

Updated by szarate over 4 years ago

  • Related to action #60161: [network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link added