action #106056

coordination #125708: [epic] Future ideas for more stable non-qemu backends (open)

[virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment") size:M

Added by xguo almost 3 years ago. Updated over 1 year ago.

Status: Workable
Priority: Low
Assignee: -
Category: Feature requests
Target version: future
Start date: 2022-02-07
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

Refer to the OSD test failure log https://openqa.suse.de/tests/8113762; we fail to connect to openqaipmi5-sp.qa.suse.de in our OSD environment:
[2022-02-07T07:36:27.624742+01:00] [debug] IPMI: Chassis Power Control: Up/On
[2022-02-07T07:36:40.726270+01:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XX chassis power status: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.

But we can connect to openqaipmi5-sp.qa.suse.de successfully from our testing environment:
ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XX chassis power status
Chassis Power is on
ip a
2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether d2:0a:cd:f5:97:40 brd ff:ff:ff:ff:ff:ff
inet 10.161.159.120/20 brd 10.161.159.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::d00a:cdff:fef5:9740/64 scope link
valid_lft forever preferred_lft forever

FYI, refer to https://openqa.suse.de/admin/workers/1207 for more details.

Acceptance criteria

  • AC1: os-autoinst backend::ipmi retries consistently in more cases of network related unavailabilities and instabilities

Suggestions

ipmi power reset
for i in (1 .. 10):
    ipmi power status && break
    echo "Retrying ipmi connection $i of 10 after sleep"
    sleep 10
...
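A minimal Perl sketch of that retry loop, assuming ipmitool is invoked via IPC::Run as backend::ipmi already does (host, credentials, retry count and delay are illustrative):

    use strict;
    use warnings;
    use IPC::Run ();

    # Illustrative only: poll "chassis power status" until it succeeds after a power reset.
    my @cmd = qw(ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XXX chassis power status);
    my $ok = 0;
    for my $try (1 .. 10) {
        my ($stdin, $stdout, $stderr) = ('', '', '');
        # IPC::Run::run returns true only if all sub-commands exited with status 0.
        $ok = IPC::Run::run(\@cmd, \$stdin, \$stdout, \$stderr, IPC::Run::timeout(30));
        last if $ok;
        print "Retrying ipmi connection $try of 10 after sleep\n";
        sleep 10;
    }
    die "IPMI did not become reachable after 10 tries\n" unless $ok;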

Related issues: 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #102650: Organize labs move to new building and SRV2 size:M (Resolved, nicksinger, 2021-11-18 to 2022-05-27)

Related to openQA Infrastructure (public) - action #128654: [sporadic] Fail to create an ipmi session to worker grenache-1:16 (ix64ph1075) in its vlan (Resolved, okurz, 2023-05-04)

Actions #1

Updated by nicksinger almost 3 years ago

  • Due date set to 2022-02-14
  • Status changed from Workable to Feedback

We moved the machine last week from the small qa lab into the big qa lab. After replugging the machine it seems the BMC got stuck in a strange state. chassis power status worked fine but sol activate ran into an error. After I did a chassis power up the sol activate worked again. I did another cold reset of the bmc with bmc reset cold just to be sure. Now it seems to work: https://openqa.suse.de/tests/8114219#live

Please let me know if you see new issues. I will check again in one week and will close this ticket here as "resolved" if no new issues come up.

Actions #2

Updated by xguo almost 3 years ago

  • Description updated (diff)
Actions #3

Updated by xguo almost 3 years ago

nicksinger wrote:

We moved the machine last week from the small qa lab into the big qa lab. After replugging the machine it seems the BMC got stuck in a strange state. chassis power status worked fine but sol activate ran into an error. After I did a chassis power up the sol activate worked again. I did another cold reset of the bmc with bmc reset cold just to be sure. Now it seems to work: https://openqa.suse.de/tests/8114219#live

Please let me know if you see new issues. I will check again in one week and will close this ticket here as "resolved" if no new issues come up.

@Nick, thanks so much for your great help! I am keeping an eye on it now and will report back if new issues show up again.

Actions #4

Updated by xguo almost 3 years ago

@nick,

Short update: the same problem still reproduces on openqaipmi5-sp.qa.suse.de.
Please refer to https://openqa.suse.de/tests/8119345 for more details.

Actions #5

Updated by waynechen55 almost 3 years ago

This issue still persists, and it is very obvious on openqa.suse.de.

All openQA test runs assigned to grenache-1:10 failed due to the error:
Reason: backend died: ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.

Please refer to detailed logs for these tests:
https://openqa.suse.de/tests/8123821
https://openqa.suse.de/tests/8123769
https://openqa.suse.de/tests/8119344
https://openqa.suse.de/tests/8119345
https://openqa.suse.de/tests/8119353
https://openqa.suse.de/tests/8123751

These are just some of them; there are many more failures with the same reason.

I tried on my personal computer:
ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
Chassis Power is on
which succeeded.

But it keeps failing in openqa environment.

Is something wrong with the worker grenache-1:10, or is it a network glitch?

Actions #6

Updated by okurz almost 3 years ago

Please make sure that the password is still correct and the same as included in https://gitlab.nue.suse.com/openqa/salt-pillars-openqa

Actions #7

Updated by waynechen55 almost 3 years ago

okurz wrote:

Please make sure that the password is still correct and the same as included in https://gitlab.nue.suse.com/openqa/salt-pillars-openqa

ipmi hostname/user/password are correct for the involved machine.

And "mc reset cold" does not help. This issue persists.

Actions #8

Updated by okurz almost 3 years ago

  • Subject changed from [virtualization] Fail to connect openqaipmi5-sp.qa.suse.de on our osd environemnt to [virtualization][tools] Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment
  • Target version changed from QE-VT Sprint Future to Ready

I checked on grenache-1 with the command env runs=100 sh <(curl -sS https://raw.githubusercontent.com/okurz/scripts/master/count_fail_ratio) ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XXX chassis power status

and got 100% success rate over 100 runs. Checking the original scenario https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit-ipmi-large-mem&test=gi-guest_developing-on-host_developing-xen&version=15-SP4#next_previous I find the last completed job to not run into this error https://openqa.suse.de/tests/8119300 on grenache-1:17 but grenache-1:10 still shows no success however https://openqa.suse.de/tests/8124460 is currently running and looks fine. So, all good?

Actions #9

Updated by waynechen55 almost 3 years ago

okurz wrote:

I checked on grenache-1 with the command env runs=100 sh <(curl -sS https://raw.githubusercontent.com/okurz/scripts/master/count_fail_ratio) ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XXX chassis power status

and got 100% success rate over 100 runs. Checking the original scenario https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit-ipmi-large-mem&test=gi-guest_developing-on-host_developing-xen&version=15-SP4#next_previous I find the last completed job to not run into this error https://openqa.suse.de/tests/8119300 on grenache-1:17 but grenache-1:10 still shows no success however https://openqa.suse.de/tests/8124460 is currently running and looks fine. So, all good?

Have you done anything proactive to this worker or its related environment? Or have you tried fixing it?
I also noticed a successful run on this grenache-1:10
https://openqa.suse.de/tests/8124460
But it is the only one that has succeeded on grenache-1:10 so far.
So I do not know whether this is sustainable.

Actions #10

Updated by okurz almost 3 years ago

waynechen55 wrote:

okurz wrote:

I checked on grenache-1 with the command env runs=100 sh <(curl -sS https://raw.githubusercontent.com/okurz/scripts/master/count_fail_ratio) ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XXX chassis power status

and got 100% success rate over 100 runs. Checking the original scenario https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit-ipmi-large-mem&test=gi-guest_developing-on-host_developing-xen&version=15-SP4#next_previous I find the last completed job to not run into this error https://openqa.suse.de/tests/8119300 on grenache-1:17 but grenache-1:10 still shows no success however https://openqa.suse.de/tests/8124460 is currently running and looks fine. So, all good?

Have you done anything proactive to this worker or its related environment? Or have you tried fixing it?

Not that I am aware of. I just executed the above-mentioned commands for testing, with no intention to fix anything yet, just to investigate.

I also noticed a successful run on this grenache-1:10
https://openqa.suse.de/tests/8124460
But it is the only one that has succeeded on grenache-1:10 so far.
So I do not know whether this is sustainable.

Sure; as this only affects a single machine, I guess it will take some hours/days to be sure that it works stably.

Actions #11

Updated by nicksinger almost 3 years ago

okurz wrote:

I checked on grenache-1 with the command env runs=100 sh <(curl -sS https://raw.githubusercontent.com/okurz/scripts/master/count_fail_ratio) ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XXX chassis power status

and got 100% success rate over 100 runs. Checking the original scenario https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit-ipmi-large-mem&test=gi-guest_developing-on-host_developing-xen&version=15-SP4#next_previous I find the last completed job to not run into this error https://openqa.suse.de/tests/8119300 on grenache-1:17 but grenache-1:10 still shows no success however https://openqa.suse.de/tests/8124460 is currently running and looks fine. So, all good?

My first observations show that power status seems to be more stable than sol activate. What puzzles me is that the only change on that machine was to move it from the small lab into the big lab (where other IPMI workers are already present and don't show these problems).

Actions #12

Updated by okurz almost 3 years ago

So according to https://github.com/ipmitool/ipmitool/blob/d674bfdc9a2447706ea9b0b943c98bcb0bbe80d2/lib/ipmi_sol.c#L1873 we should be able to use ipmitool ... sol activate usesolkeepalive, so if the above tests still show instability, this could be something to try within the test code.
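For illustration only, appending that option when building the SOL command could look like this (the base arguments are hardcoded here; the backend derives them from worker variables):

    # Illustrative sketch, not the actual backend code.
    my @ipmi_base = qw(ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XXX);
    # 'usesolkeepalive' tells ipmitool to keep the SOL session alive with SOL keepalive packets.
    my @sol_activate = (@ipmi_base, qw(sol activate usesolkeepalive));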

Actions #13

Updated by waynechen55 almost 3 years ago

okurz wrote:

So according to https://github.com/ipmitool/ipmitool/blob/d674bfdc9a2447706ea9b0b943c98bcb0bbe80d2/lib/ipmi_sol.c#L1873 we should be able to use ipmitool ... sol activate usesolkeepalive, so if the above tests still show instability, this could be something to try within the test code.

Unfortunately, the issue is not gone. Massive failures on this single worker with the latest Build95.1
https://openqa.suse.de/tests/8126359
https://openqa.suse.de/tests/8126360
https://openqa.suse.de/tests/8126361
https://openqa.suse.de/tests/8126131
https://openqa.suse.de/tests/8126132
https://openqa.suse.de/tests/8126129
https://openqa.suse.de/tests/8126130

Could you help push up the priority a little bit? I cannot prevent test runs from being assigned to this worker currently.

Actions #14

Updated by waynechen55 almost 3 years ago

@okurz

I found a feasible way to reproduce the problem.
I would cautiously say it might be related to 'power reset', because we already use retries=3 with ipmitool.

This machine might just not be very responsive while handling this scenario. You can use this script to reproduce it from any host:

#!/usr/bin/perl
use strict;
use warnings;
use IPC::Run ();

# Power cycle the machine, then poll the power status repeatedly.
IPC::Run::run("ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power off");
sleep 3;
IPC::Run::run("ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power on");
sleep 3;
IPC::Run::run("ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power reset");
for my $try (1 .. 20) {
    my $ret = IPC::Run::run("ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status");
    print "try $try: return value $ret\n";
}

You can adjust the sleep time or the specific power management command used.
I hope this helps to come up with a solution to fix the problem; of course you may have a different opinion.

Actions #15

Updated by okurz almost 3 years ago

I see. If that is the problem then how about waiting for responsiveness after executing a power reset? Like the following pseudo-code:

ipmi power reset
for i in (1 .. 10):
    ipmi power status && break
    echo "Retrying ipmi connection $i of 10 after sleep"
    sleep 10
...
Actions #16

Updated by waynechen55 almost 3 years ago

okurz wrote:

I see. If that is the problem then how about waiting for responsiveness after executing a power reset? Like the following pseudo-code:

ipmi power reset
for i in (1 .. 10):
    ipmi power status && break
    echo "Retrying ipmi connection $i of 10 after sleep"
    sleep 10
...

It seems that we already have this logic in restart_host in ipmi.pm in os-autoinst? It uses retries=3, right?

Actions #18

Updated by waynechen55 almost 3 years ago

okurz wrote:

correct, in https://github.com/os-autoinst/os-autoinst/blob/2c87fe6b12c61f3927b47bba28ed7d67b8a2dc72/backend/ipmi.pm#L57

Could you help fix this somehow ? I know this looks strange.

Or wait further response from @nicksinger for more info about this machine ?

Actions #19

Updated by okurz almost 3 years ago

waynechen55 wrote:

okurz wrote:

correct, in https://github.com/os-autoinst/os-autoinst/blob/2c87fe6b12c61f3927b47bba28ed7d67b8a2dc72/backend/ipmi.pm#L57

Could you help fix this somehow ? I know this looks strange.

At this point I suggest that you please take this ticket and try on your own.

Or wait further response from @nicksinger for more info about this machine ?

We don't have more information about this machine that could help at this point.

Actions #20

Updated by waynechen55 almost 3 years ago

okurz wrote:

waynechen55 wrote:

okurz wrote:

correct, in https://github.com/os-autoinst/os-autoinst/blob/2c87fe6b12c61f3927b47bba28ed7d67b8a2dc72/backend/ipmi.pm#L57

Could you help fix this somehow ? I know this looks strange.

At this point I suggest that you please take this ticket and try on your own.

Or wait further response from @nicksinger for more info about this machine ?

We don't have more information about this machine that could help at this point.

This looks like a problem specific to one machine, but any change to the backend will have global impact, so the backend needs to be handled prudently.
Moreover, at the current stage the possible root cause has only just emerged and has not been fully identified.
There is already a similar retry mechanism in os-autoinst to avoid such unexpected situations, so we first need to understand why it is not enough.
I also do not think this only happens with the single command "ipmitool power status"; according to my findings it may happen with other commands such as "ipmitool sol deactivate or activate" as well. So there are many more places to be investigated, fixed or enhanced.
We still need help from everyone involved in this ticket.

Actions #21

Updated by nicksinger almost 3 years ago

The machine in question will be moved to SRV2 soon (see https://progress.opensuse.org/issues/102650#note-8). In the meantime we could try to work around the issues in software. @waynechen55, we could also temporarily disable the worker here: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L926-932 if you think implementing a workaround is too complicated for now.

Actions #22

Updated by okurz almost 3 years ago

  • Description updated (diff)

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/383 for temporarily disabling the worker instance. Masked on the machine with sudo systemctl mask --now openqa-worker-auto-restart@10.

Actions #23

Updated by xguo almost 3 years ago

  • Description updated (diff)
  • Target version changed from Ready to QE-VT Sprint Future

okurz wrote:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/383 for temporarily disabling the worker instance. Masked on the machine with sudo systemctl mask --now openqa-worker-auto-restart@10.

@okurz, would you mind helping us merge MR!383 to disable this worker instance? Currently this machine problem blocks some of our OSD test runs.
Thanks so much for your great help.

Actions #24

Updated by waynechen55 almost 3 years ago

  • Target version changed from QE-VT Sprint Future to Ready

nicksinger wrote:

The machine in question will be moved to SRV2 soon (see https://progress.opensuse.org/issues/102650#note-8). In the meantime we could try to work around the issues in software. @waynechen55, we could also temporarily disable the worker here: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L926-932 if you think implementing a workaround is too complicated for now.

I tried to reproduce the problem on my personal openQA. Unfortunately, I could not see the test fail at the "ipmitool power status" command, but it kept failing at another command, "ipmitool sol deactivate", with the same error:
Test died: Unexpected IPMI response: ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting sol deactivate: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.
at /usr/lib/os-autoinst/testapi.pm line 1762.
So it looks to me that test can fail at any ipmitool command on this machine.
Have you got any idea on this problem ? Do you think it is related to unstable environment or just unstable machine ? In other words, do you think you can help fix the problem by relocating the machine and reconnecting it to somewhere ?

By the way, @nicksinger, what is your detailed plan for the lab move? Have you taken the openQA test automation schedule into consideration? According to the original schedule, the significant PublicBeta will be released onto openQA for testing around Feb 24; please refer to https://confluence.suse.com/display/SUSELinuxEnterpriseServer15SP4/Schedule
So it is scary to see massive test run failures due to lack of service :)

Actions #25

Updated by okurz almost 3 years ago

waynechen55 wrote:

nicksinger wrote:

The machine in question will be moved to SRV2 soon (see https://progress.opensuse.org/issues/102650#note-8). In the meantime we could try to work around the issues in software. @waynechen55, we could also temporarily disable the worker here: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L926-932 if you think implementing a workaround is too complicated for now.

I tried to reproduce the problem on my personal openQA. Unfortunately, I could not see the test fail at the "ipmitool power status" command, but it kept failing at another command, "ipmitool sol deactivate", with the same error:
Test died: Unexpected IPMI response: ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting sol deactivate: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.
at /usr/lib/os-autoinst/testapi.pm line 1762.
So it looks to me that test can fail at any ipmitool command on this machine.

Yes, could happen. Good that you can reproduce a stability problem. This is an important step towards fixing the problem. Could you please try to add the parameter -C 3 and see if that makes a difference?

Have you got any idea on this problem ? Do you think it is related to unstable environment or just unstable machine ? In other words, do you think you can help fix the problem by relocating the machine and reconnecting it to somewhere ?

I would be if that helps but let's be optimistic :)

Actions #26

Updated by okurz almost 3 years ago

  • Due date changed from 2022-02-14 to 2022-03-01

coming back to #106056#note-1 we still see problems so the move from one room to another seems to have had no impact. Hence updating due-date

Actions #27

Updated by waynechen55 almost 3 years ago

okurz wrote:

coming back to #106056#note-1 we still see problems so the move from one room to another seems to have had no impact. Hence updating due-date

FYI, the 'I tried to reproduce the problem on my personal openQA' in https://progress.opensuse.org/issues/106056#note-24 happened on last Friday Feb,11,2022.

Actions #28

Updated by waynechen55 almost 3 years ago

okurz wrote:

coming back to #106056#note-1 we still see problems so the move from one room to another seems to have had no impact. Hence updating due-date

FYI, the 'I tried to reproduce the problem on my personal openQA' in https://progress.opensuse.org/issues/106056#note-24 happened on last Friday.

waynechen55 wrote:

okurz wrote:

coming back to #106056#note-1 we still see problems so the move from one room to another seems to have had no impact. Hence updating due-date

FYI, the 'I tried to reproduce the problem on my personal openQA' in https://progress.opensuse.org/issues/106056#note-24 happened on last Friday Feb,11,2022.

Reproducing the problem today, Feb 14, 2022, looks a little different compared with last Friday, Feb 11, 2022:
1) I cannot see the failure at 'ipmitool sol deactivate' anymore.
2) I am starting to see failures at 'ipmitool chassis power status'.
3) The failure in 2) does not happen every time.
4) ipmitool with '-C 3' does not completely solve the problem; I would say there is no obvious improvement compared with my results without it.

Actions #29

Updated by waynechen55 almost 3 years ago

waynechen55 wrote:

nicksinger wrote:

The machine in question will be moved to SRV2 soon (see https://progress.opensuse.org/issues/102650#note-8). In the meantime we could try to work around the issues in software. @waynechen55, we could also temporarily disable the worker here: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L926-932 if you think implementing a workaround is too complicated for now.

I tried to reproduce the problem on my personal openQA. Unfortunately, I could not see the test fail at the "ipmitool power status" command, but it kept failing at another command, "ipmitool sol deactivate", with the same error:
Test died: Unexpected IPMI response: ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting sol deactivate: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.
at /usr/lib/os-autoinst/testapi.pm line 1762.
So it looks to me that test can fail at any ipmitool command on this machine.
Have you got any idea on this problem ? Do you think it is related to unstable environment or just unstable machine ? In other words, do you think you can help fix the problem by relocating the machine and reconnecting it to somewhere ?

By the way, @nicksinger, what is your detailed plan for the lab move? Have you taken the openQA test automation schedule into consideration? According to the original schedule, the significant PublicBeta will be released onto openQA for testing around Feb 24; please refer to https://confluence.suse.com/display/SUSELinuxEnterpriseServer15SP4/Schedule
So it is scary to see massive test run failures due to lack of service :)

@nicksinger, could I know your detailed lab move plan? Or have you already posted it somewhere that I am not aware of? I am just a little concerned about a possible conflict with 15-SP4 PublicBeta testing. Thanks.

Actions #30

Updated by waynechen55 almost 3 years ago

@okurz I found that there might be something wrong with the backend. I added debug code to output auxiliary information about internals during test execution. It looks like the backend ipmitool function does not evaluate the return value from IPC::Run::run very precisely, so it may keep retrying regardless of whether the 'ipmitool command line' returned OK or not.

So the problems are:
1) The backend ipmitool function may return after a failed 'ipmitool command line' although the command may already have succeeded.
2) There is no precise handling of the return value from IPC::Run::run, so the loop here may not work as expected; to some extent we have no control over its behavior.
3) From my previous experiments, it seems that IPC::Run::run('ipmitool command line') always returned the value 1 regardless of whether the 'ipmitool command line' succeeded.

What do you think?

Code:
my @tries = (1 .. $args{tries});
for (@tries) {
    diag("Trying $_ times: @cmd");
    #sleep 30;
    sleep(30);
    $ret = IPC::Run::run(\@cmd, \$stdin, \$stdout, \$stderr, IPC::Run::timeout(30));
    if ($ret == 0) {
        $self->dell_sleep;
        last;
    }
}

autoinst-log:
[2022-02-17T17:03:08.093293+08:00] [debug] Trying 1 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting mc selftest
[2022-02-17T17:03:40.344158+08:00] [debug] Trying 2 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting mc selftest
[2022-02-17T17:04:12.589995+08:00] [debug] Trying 3 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting mc selftest
[2022-02-17T17:04:44.860066+08:00] [debug] IPMI: Selftest: passed
[2022-02-17T17:04:44.860816+08:00] [debug] Trying 1 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:05:17.108662+08:00] [debug] Trying 2 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:05:49.387361+08:00] [debug] Trying 3 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:06:21.641583+08:00] [debug] IPMI: Chassis Power is on
[2022-02-17T17:06:21.642291+08:00] [debug] Trying 1 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power off
[2022-02-17T17:06:53.895791+08:00] [debug] Trying 2 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power off
[2022-02-17T17:07:26.147922+08:00] [debug] Trying 3 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power off
[2022-02-17T17:07:58.405340+08:00] [debug] IPMI: Chassis Power Control: Down/Off
[2022-02-17T17:08:01.406437+08:00] [debug] Trying 1 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:08:33.668797+08:00] [debug] Trying 2 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:09:05.922901+08:00] [debug] Trying 3 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:09:38.183903+08:00] [debug] IPMI: Chassis Power is off
[2022-02-17T17:09:38.184688+08:00] [debug] Trying 1 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power on
[2022-02-17T17:10:10.456215+08:00] [debug] Trying 2 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power on
[2022-02-17T17:10:42.714619+08:00] [debug] Trying 3 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power on
[2022-02-17T17:11:15.034221+08:00] [debug] IPMI: Chassis Power Control: Up/On
[2022-02-17T17:11:18.035415+08:00] [debug] Trying 1 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:11:50.288376+08:00] [debug] Trying 2 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:12:22.540156+08:00] [debug] Trying 3 times: ipmitool -C 3 -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P qatesting chassis power status
[2022-02-17T17:12:54.789326+08:00] [debug] IPMI: Chassis Power is on
GOT GO

Actions #31

Updated by nicksinger almost 3 years ago

waynechen55 wrote:

@nicksinger, could I know your detailed lab move plan? Or have you already posted it somewhere that I am not aware of? I am just a little concerned about a possible conflict with 15-SP4 PublicBeta testing. Thanks.

We plan to conduct the move on the 21st and 22nd of February, so next week. We're aware that this intersects with PublicBeta testing and will do our best to keep the impact as low as possible.

Actions #32

Updated by xlai almost 3 years ago

  • Related to action #102650: Organize labs move to new building and SRV2 size:M added
Actions #33

Updated by okurz almost 3 years ago

waynechen55 wrote:

@okurz I found that there might be something wrong with the backend. I added debug code to output auxiliary information about internals during test execution. It looks like the backend ipmitool function does not evaluate the return value from IPC::Run::run very precisely, so it may keep retrying regardless of whether the 'ipmitool command line' returned OK or not.

So the problems are:
1) The backend ipmitool function may return after a failed 'ipmitool command line' although the command may already have succeeded.
2) There is no precise handling of the return value from IPC::Run::run, so the loop here may not work as expected; to some extent we have no control over its behavior.
3) From my previous experiments, it seems that IPC::Run::run('ipmitool command line') always returned the value 1 regardless of whether the 'ipmitool command line' succeeded.

What do you think?

Code:
my @tries = (1 .. $args{tries});
for (@tries) {
    diag("Trying $_ times: @cmd");
    #sleep 30;
    sleep(30);
    $ret = IPC::Run::run(\@cmd, \$stdin, \$stdout, \$stderr, IPC::Run::timeout(30));

why sleep before even trying?

Actions #34

Updated by waynechen55 almost 3 years ago

nicksinger wrote:

waynechen55 wrote:

@nicksinger, could I know your detailed lab move plan? Or have you already posted it somewhere that I am not aware of? I am just a little concerned about a possible conflict with 15-SP4 PublicBeta testing. Thanks.

We plan to conduct the move on the 21st and 22nd of February, so next week. We're aware that this intersects with PublicBeta testing and will do our best to keep the impact as low as possible.

Thanks for the info.

Actions #35

Updated by waynechen55 almost 3 years ago

okurz wrote:

waynechen55 wrote:

@okurz I found that there might be something wrong with the backend. I added debug code to output auxiliary information about internals during test execution. It looks like the backend ipmitool function does not evaluate the return value from IPC::Run::run very precisely, so it may keep retrying regardless of whether the 'ipmitool command line' returned OK or not.

So the problems are:
1) The backend ipmitool function may return after a failed 'ipmitool command line' although the command may already have succeeded.
2) There is no precise handling of the return value from IPC::Run::run, so the loop here may not work as expected; to some extent we have no control over its behavior.
3) From my previous experiments, it seems that IPC::Run::run('ipmitool command line') always returned the value 1 regardless of whether the 'ipmitool command line' succeeded.

What do you think?

Code:
my @tries = (1 .. $args{tries});
for (@tries) {
    diag("Trying $_ times: @cmd");
    #sleep 30;
    sleep(30);
    $ret = IPC::Run::run(\@cmd, \$stdin, \$stdout, \$stderr, IPC::Run::timeout(30));

why sleep before even trying?

I want to make sure there is a 'real sleep' before each command, including the first try. But I do not think this is crucial.
Back to the main topic and the problems I mentioned in comment#30:
1) If the fix is going to have more retries, then there will be equally the same number of retries for any single ipmitool command line because there is no checkpoint to terminate loop based on successful return value. IPC::Run::run always returns the same value under most circumstances.
2) If the fix is going to have more retries, and at the same time, return immediately when there is successful ipmitool command line, then IPC::Run::run return value should be evaluated more precisely.
3) Or consider employ other method to execute ipmitool command line instead of IPC::Run::run ?
IMHO.

Actions #36

Updated by okurz almost 3 years ago

  • Assignee changed from nicksinger to okurz

Taking over the ticket from nicksinger as the machine is reachable again in general and it just looks like there was no obvious or no immediate connection to the machine move. What you seem to hit looks like something that could have always happened and will continue to happen. However I think we are on a good track towards improving the behaviour in the backend.

waynechen55 wrote:

why sleep before even trying?

I want to make sure there is a 'real sleep' before each command, including the first try. But I do not think this is crucial.

I think the problem is what I am also stating when reviewing os-autoinst-distri-opensuse code that introduces new sleeps: You are using sleep() or are simulating a sleep-like functionality. This is introducing the risk of either unnecessarily wasting time or just shifting race conditions and making them even harder to investigate. You can use sleep but there are only very limited cases where it is the best solution (e.g. yield threads in low-level system code). Here you have the classical problem in software design of synchronization: sleep should not be used in place of proper synchronization methods. https://blogs.msmvps.com/peterritchie/2007/04/26/thread-sleep-is-a-sign-of-a-poorly-designed-program/ is one of multiple references that explains this: "The original problem is likely a timing/synchronization issue, ignoring it by hiding it with Thread.Sleep is only going to delay the problem and make it occur in random, hard to reproduce ways." And if the problem does not occur, at least time is wasted. https://testing.googleblog.com/2021/03/test-flakiness-one-of-main-challenges.html states that as well: "Do NOT add arbitrary delays as these can become flaky again over time and slow down the test unnecessarily."

We are waiting because some condition changes some time ... keyword(s) is/are some time. It is important to know the exact condition that the code should wait for and check for that instead of delaying program execution for an arbitrary amount of time. For openQA tests in general just try to do what you as a human user would also do. You would not look on your watch waiting for exactly 100s before you look at the screen again, right? :) The same applies for IPMI connections.
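A minimal sketch of what "check for the exact condition" could mean here, polling with a deadline instead of a fixed sleep (names and limits are illustrative, not existing backend code):

    use strict;
    use warnings;

    # Illustrative only: poll a condition until it holds or a deadline passes,
    # instead of sleeping for a fixed, arbitrary amount of time.
    sub wait_until {
        my (%args) = @_;    # check => coderef, timeout => seconds, interval => seconds
        my $deadline = time + ($args{timeout} // 60);
        while (time < $deadline) {
            return 1 if $args{check}->();
            sleep($args{interval} // 3);
        }
        return 0;
    }

    # Example: wait until the BMC answers "chassis power status" again.
    # wait_until(check => sub { system('ipmitool ... chassis power status') == 0 }, timeout => 120);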

Back to the main topic and the problems I mentioned in comment#30:
1) If the fix is going to have more retries, then there will be equally the same number of retries for any single ipmitool command line because there is no checkpoint to terminate loop based on successful return value. IPC::Run::run always returns the same value under most circumstances.

We would need to check this further but if this is the case then maybe it makes sense to check the actual output for error messages or successful execution. However IPC::Run::run returns a true value when all subcommands exit with a 0 result code, see https://metacpan.org/pod/IPC::Run#RETURN-VALUES . So I don't know where you think it always returns the same value.

2) If the fix is going to have more retries, and at the same time, return immediately when there is successful ipmitool command line, then IPC::Run::run return value should be evaluated more precisely.
3) Or consider employ other method to execute ipmitool command line instead of IPC::Run::run ?

I think IPC::Run::run is a good choice and can stay.
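A minimal sketch of relying on that return value to stop retrying early (command, retry count and delay are illustrative, not the current backend code):

    use strict;
    use warnings;
    use IPC::Run ();

    # Illustrative only: leave the retry loop as soon as IPC::Run::run reports success
    # and log the captured stderr of failed attempts.
    my @cmd = qw(ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XXX chassis power status);
    my $success = 0;
    for my $try (1 .. 3) {
        my ($stdin, $stdout, $stderr) = ('', '', '');
        # run() returns a true value only if every sub-command exited with status 0.
        $success = IPC::Run::run(\@cmd, \$stdin, \$stdout, \$stderr, IPC::Run::timeout(30));
        last if $success;
        warn "try $try failed: $stderr";
        sleep 3;
    }
    die "giving up on: @cmd\n" unless $success;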

Actions #37

Updated by waynechen55 almost 3 years ago

okurz wrote:

Taking over the ticket from nicksinger as the machine is reachable again in general and it just looks like there was no obvious or no immediate connection to the machine move. What you seem to hit looks like something that could have always happened and will continue to happen. However I think we are on a good track towards improving the behaviour in the backend.

waynechen55 wrote:

why sleep before even trying?

I want to make sure there is a 'real sleep' before each command, including the first try. But I do not think this is crucial.

I think the problem is what I am also stating when reviewing os-autoinst-distri-opensuse code that introduces new sleeps: You are using sleep() or are simulating a sleep-like functionality. This is introducing the risk of either unnecessarily wasting time or just shifting race conditions and making them even harder to investigate. You can use sleep but there are only very limited cases where it is the best solution (e.g. yield threads in low-level system code). Here you have the classical problem in software design of synchronization: sleep should not be used in place of proper synchronization methods. https://blogs.msmvps.com/peterritchie/2007/04/26/thread-sleep-is-a-sign-of-a-poorly-designed-program/ is one of multiple references that explains this: "The original problem is likely a timing/synchronization issue, ignoring it by hiding it with Thread.Sleep is only going to delay the problem and make it occur in random, hard to reproduce ways." And if the problem does not occur, at least time is wasted. https://testing.googleblog.com/2021/03/test-flakiness-one-of-main-challenges.html states that as well: "Do NOT add arbitrary delays as these can become flaky again over time and slow down the test unnecessarily."

We are waiting because some condition changes some time ... keyword(s) is/are some time. It is important to know the exact condition that the code should wait for and check for that instead of delaying program execution for an arbitrary amount of time. For openQA tests in general just try to do what you as a human user would also do. You would not look on your watch waiting for exactly 100s before you look at the screen again, right? :) The same applies for IPMI connections.

Back to the main topic and the problems I mentioned in comment#30:
1) If the fix is going to have more retries, then there will be equally the same number of retries for any single ipmitool command line because there is no checkpoint to terminate loop based on successful return value. IPC::Run::run always returns the same value under most circumstances.

We would need to check this further but if this is the case then maybe it makes sense to check the actual output for error messages or successful execution. However IPC::Run::run returns a true value when all subcommands exit with a 0 result code, see https://metacpan.org/pod/IPC::Run#RETURN-VALUES . So I don't know where you think it always returns the same value.

2) If the fix is going to have more retries, and at the same time, return immediately when there is successful ipmitool command line, then IPC::Run::run return value should be evaluated more precisely.
3) Or consider employ other method to execute ipmitool command line instead of IPC::Run::run ?

I think IPC::Run::run is a good choice and can stay.

Thanks for your reply. Why 'sleep' should be avoided is much clearer to me now. But I have to say I did not intend to introduce any new 'sleep' functionality; I only used it for my experiments.

After looking into IPC::Run::run in more detail, I found the following:
1) On a successful ipmitool command line, IPC::Run::run returns 1. But just as I said above, the ipmi backend loop does not terminate on a successful return. It only returns here:

        if ($ret == 0) {
            $self->dell_sleep;
            last;
        }

I do not know what situation can lead to a return value of 0; at least, I never saw it happen. So all retries are actually executed regardless of whether the ipmitool command succeeded.

2) I found two error scenarios that produce an "empty" return value (the return-value captures referenced as "Return Value As Below/Above" are not preserved here):

Error: Unable to establish IPMI v2 / RMCP+ session

and

Set Session Privilege Level to ADMINISTRATOR failed
Error: Unable to establish IPMI v2 / RMCP+ session

So my suggestions are:
1) Make the ipmi backend code more intelligent, so it handles the real situations it may encounter instead of just checking if ($ret == 0).
2) Retry means retry after failure, so there is no point in running the same ipmitool command again if the previous one already succeeded (see the sketch after this list).
3) If 1) and 2) are done properly, we can try to introduce more retries on a crappy machine like openqaipmi5.
4) There is already 'sleep' functionality in the ipmi backend code; what is your plan to deal with it, improve it or even remove it?
5) We can also explore more advanced usage of IPC::Run::run, for example the timeout passed as a parameter to it, etc.
6) And if a 'delay' is really needed between two consecutive ipmitool commands, how can that be done gracefully? I think this is worth thinking about as well.
IMHO.
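A minimal sketch of points 1) and 2) combined, retrying only on the known transient session error and stopping as soon as a run succeeds (error pattern, limits and the helper name are illustrative, not an agreed design):

    use strict;
    use warnings;
    use IPC::Run ();

    # Illustrative only: retry solely on the known transient session error,
    # stop as soon as a run succeeds, and give up on anything unexpected.
    sub run_ipmi_with_retries {
        my (@cmd) = @_;
        for my $try (1 .. 5) {
            my ($stdin, $stdout, $stderr) = ('', '', '');
            my $ok = IPC::Run::run(\@cmd, \$stdin, \$stdout, \$stderr, IPC::Run::timeout(30));
            # Point 2): do not keep retrying once the command has succeeded.
            return $stdout if $ok;
            # Point 1): only a known transient error is worth another attempt.
            die "ipmitool failed: $stderr" unless $stderr =~ m{Unable to establish IPMI v2 / RMCP\+ session};
            sleep 3;
        }
        die "ipmitool still failing after 5 tries: @cmd";
    }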

Actions #38

Updated by waynechen55 almost 3 years ago

@okurz, may I know of any update or result from your investigation of this issue? Or what is your plan to approach it?

Actions #39

Updated by nicksinger almost 3 years ago

  • Due date deleted (2022-03-01)

@waynechen55 the machine is meanwhile moved into srv2. Could you please try if it is more stable now?

Actions #40

Updated by waynechen55 almost 3 years ago

nicksinger wrote:

@waynechen55 the machine is meanwhile moved into srv2. Could you please try if it is more stable now?

I really gave it a moderate thorough testing. It really gave me a surprise that it rarely fail now.

In 20 testings, including 10 openQA and 10 reproducer, the machine only failed once !

So what had been done to this machine ?

Next I may double check its PXE boot and network config, then I will consider bringing it back online on OSD.

Any other suggestions ? Anything more needs to be done further before bringing it back ? Because 'rare' does not mean 'zero'. @nicksinger @okurz

Actions #41

Updated by okurz almost 3 years ago

waynechen55 wrote:

nicksinger wrote:

@waynechen55 the machine is meanwhile moved into srv2. Could you please try if it is more stable now?

I really gave it a moderate thorough testing. It really gave me a surprise that it rarely fail now.

In 20 testings, including 10 openQA and 10 reproducer, the machine only failed once !

So what had been done to this machine ?

Nothing in particular. Maybe network performance slightly differs or you just hit a lucky time where there is less else going on of whatever would cause the machine to fail more often.

Next I may double check its PXE boot and network config, then I will consider bringing it back online on OSD.

Any other suggestions ? Anything more needs to be done further before bringing it back ? Because 'rare' does not mean 'zero'. @nicksinger @okurz

I agree that "rare" is still enough for us to care to improve that.

My further plans for this ticket would be that at best you can look into improving the stability of the backend based on your suggestions. If you are unable then eventually we would need to plan within SUSE QE Tools to continue with that but this will only happen in some weeks or months.

Actions #42

Updated by waynechen55 almost 3 years ago

okurz wrote:

waynechen55 wrote:

nicksinger wrote:

@waynechen55 the machine is meanwhile moved into srv2. Could you please try if it is more stable now?

I really gave it a moderate thorough testing. It really gave me a surprise that it rarely fail now.

In 20 testings, including 10 openQA and 10 reproducer, the machine only failed once !

So what had been done to this machine ?

Nothing in particular. Maybe network performance slightly differs or you just hit a lucky time where there is less else going on of whatever would cause the machine to fail more often.

Next I may double check its PXE boot and network config, then I will consider bringing it back online on OSD.

Any other suggestions ? Anything more needs to be done further before bringing it back ? Because 'rare' does not mean 'zero'. @nicksinger @okurz

I agree that "rare" is still enough for us to care to improve that.

My further plans for this ticket would be that at best you can look into improving the stability of the backend based on your suggestions. If you are unable then eventually we would need to plan within SUSE QE Tools to continue with that but this will only happen in some weeks or months.

@okurz

  1. I also agree that there is still something wrong with the machine; the symptom is still the same.
  2. I am sorry that I do not have much energy to get the issue fixed at the moment, so I think you are currently still the primary anchor for this ticket. I can also provide help if needed. So I suggest you start planning the relevant task for this issue and make sure it stays on track to be fixed; planned, prolonged work is always better than an unknown schedule.
  3. From my perspective, and after careful thinking, I think the priority of the ticket can be pushed up a little bit, because:

The current ipmi backend power reset functionality does three consecutive power cycles in a row for every test suite utilizing IPMI. I think this will shorten the life of the machine under test to roughly one third, because it gets rebooted too often.

There is a defect in the ipmi backend from my perspective, as per comment #37, so its functionality does not look intact and cannot serve its retry purpose as well as originally expected.

Actions #43

Updated by okurz over 2 years ago

  • Project changed from 204 to openQA Project (public)
  • Subject changed from [virtualization][tools] Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment to [virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment")
  • Category changed from investigation to Feature requests
  • Status changed from Feedback to New
  • Assignee deleted (okurz)
  • Priority changed from Normal to Low
Actions #44

Updated by okurz over 2 years ago

Actions #45

Updated by okurz over 2 years ago

  • Parent task set to #109656
Actions #46

Updated by okurz over 2 years ago

  • Subject changed from [virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment") to [virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment") size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #47

Updated by okurz over 2 years ago

  • Target version changed from Ready to future

We within the SUSE QE Tools team currently do not have the capacity to work on this, removing from backlog.

Actions #48

Updated by okurz almost 2 years ago

  • Parent task changed from #109656 to #125708
Actions #50

Updated by syrianidou_sofia over 1 year ago

We see a lot of errors:

backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H ix64hm1200.qa.suse.de -U admin -P [masked] mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.

from grenache-1:16

Actions #51

Updated by xguo over 1 year ago

syrianidou_sofia wrote:

We see a lot of errors:

backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H ix64hm1200.qa.suse.de -U admin -P [masked] mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.

from grenache-1:16

Short update: indeed, I found the same situation; the assigned worker grenache-1:16 blocks our OSD test runs on the latest 15-SP5 build 96.1.
Refer to https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=96.1&groupid=264 for more details.

Actions #52

Updated by Julie_CAO over 1 year ago

Opened an openQA infra ticket: https://progress.opensuse.org/issues/128654

Actions #53

Updated by okurz over 1 year ago

  • Related to action #128654: [sporadic] Fail to create an ipmi session to worker grenache-1:16 (ix64ph1075) in its vlan added