action #128654
closed[sporadic] Fail to create an ipmi session to worker grenache-1:16 (ix64ph1075) in its vlan
Added by Julie_CAO over 1 year ago. Updated over 1 year ago.
0%
Description
Observation¶
All tests running on grenache-1:16 failed today. It worked 5 days ago.
https://openqa.suse.de/admin/workers/1247
Failure looks like:
[2023-05-04T07:49:10.563049+02:00] [info] [pid:113776] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H ix64hm1200.qa.suse.de -U admin -P [masked] mc guid: IPMI response is NULL. at /usr/lib/os-autoinst/backend/ipmi.pm line 45.
[2023-05-04T07:49:23.682702+02:00] [warn] [pid:113776] !!! backend::baseclass::run_capture_loop: capture loop failed ipmitool -I lanplus -H ix64hm1200.qa.suse.de -U admin -P [masked] chassis power off: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.
I tried "ipmitool -I lanplus -H 10.162.28.200 -U admin -P [masked] ..." in other VLAN, such as on 10.168.192.87 and my laptop with vpn, it did not report failures. However if I run on 10.162.2.99, errors prompted:
fozzie-1:~ # ipmitool -I lanplus -H 10.162.28.200 -U admin -P [masked] -vvv
ipmitool version 1.8.18
...
Get Auth Capabilities error
Error issuing Get Channel Authentication Capabilities request
Error: Unable to establish IPMI v2 / RMCP+ session
Anything wrong with IPMI services or configurations in 10.162.xx vlan or this machine?
Problem¶
- H1 The network in NUE1-SRV2-B rack 1+2 is badly impacted due to switch behaviour -> E1-1 Reset switches in rack 1 and 2 and rerun experiment from #128654#note-7
- H2 IPMI is just unstable in general and needs retries and waits -> E2-1 Increase number of retries (default 4) and timeout (default 1 for lanplas), e.g.
-R 10 -N 10
- H3 ppc64le ipmitool behaves different -> E3-1 Crosscheck experiment on different machines and architectures
Updated by okurz over 1 year ago
- Tags set to infra
- Priority changed from Normal to High
- Target version set to Ready
Updated by okurz over 1 year ago
- Related to action #106056: [virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment") size:M added
Updated by xlai over 1 year ago
@okurz, Hi Oliver, this issue introdued the majority of VT failures in latest build, see https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=96.1&groupid=264. When do you expect this fixed? @Julie_CAO, To exclude the impact, better to consider if it is needed to temporarily set the worker down based on the reply?
Updated by okurz over 1 year ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Due date set to 2023-05-12
- Category set to Support
- Status changed from New to Feedback
- Assignee set to okurz
Right now I see that https://openqa.suse.de/tests/11030545#live worked to boot the system fine so either the problem does not reproduce or is sporadic.
There might be a problem in the ipmitool command line. The command line option
-C 3
might improve stability by forcing an updated cipher suite. I suggest you use that option and compare statistical stability with and without this option.
Updated by Julie_CAO over 1 year ago
okurz wrote:
Right now I see that https://openqa.suse.de/tests/11030545#live worked to boot the system fine so either the problem does not reproduce or is sporadic.
Don't know why this job survived the ipmitool command, though it failed finally. It is the only survivor today. I just run another test on this machine and it failed again, https://openqa.suse.de/tests/11031023#details.
There might be a problem in the ipmitool command line. The command line option
-C 3
might improve stability by forcing an updated cipher suite. I suggest you use that option and compare statistical stability with and without this option.
-C 3
is the default option of ipmitool for IPMIv2.0.
-C <ciphersuite>
The remote server authentication, integrity, and encryption algorithms to use for IPMIv2.0 lanplus con-nections. See table 22-19 in the IPMIv2.0 specification. The default is 3 which specifies RAKP-HMAC-SHA1 authentication, HMAC-SHA1-96 integrity, and AES-CBC-128 encryption algorithms.
Updated by okurz over 1 year ago
- Subject changed from Fail to create an ipmi session to worker grenache-1:16 in its vlan to [sporadic] Fail to create an ipmi session to worker grenache-1:16 (ix64ph1075) in its vlan
Updated by okurz over 1 year ago
- Status changed from Feedback to In Progress
I am testing access with for i in {1..1000000}; do echo "### Run: $i -- $(date -Is)" && ipmitool -I lanplus -C 3 -H ix64hm1200.qa.suse.de -U admin -P [masked] power status; done
to check statistics. The connection between grenache-1 and ix64ph1075 should generally be rather stable as they are in neighboring racks but of course there are at least two switches involved so that might have an impact. So far I see the error "Close Session command failed" after the succesful execution of the IPMI command. But not reproduced the original problem. Let's run this over time:
runs=10000 count_fail_ratio unbuffer ipmitool -I lanplus -C 3 -H ix64hm1200.qa.suse.de -U admin -P [masked] power status 2>&1 | ts -i '%H:%M:%.S' | ts -s '%H:%M:%.S'
Running the command locally finishes each execution much quicker and no error about "Close Session command failed", both ipmitool versions on my local notebook as well as grenache-1 are the same version except for the build architecture: ipmitool-1.8.18.238.gb7adc1d-150400.1.14.ppc64le
If that shows any problems things to try:
- First wait for results
- Add the ipmitool option "-o supermicro" to work around quirks for the specific target machine. Although likely ix64ph1075 is not a SuperMicro machine
- Research about "Close Session command failed"
- Increase retries with ipmitool itself with option "-R N" and N should be selected to be higher than the default 4
- Crosscheck experiment on different machines and architectures
The error message "Close Session command failed" disappeared for some intermediate tests from grenache-1. On my notebook so far the error was never observed, so far 4/1500 fails vs 13/900 fails on grenache-1.
Updated by okurz over 1 year ago
So far I could observe a fail ratio of 1% when calling ipmitool to establish connection from grenache-1 to ix64ph1075 meaning that I can not reproduce the original problem yet.
Result from linux-28d7, my notebook over VPN: count_fail_ratio: Run: 10000. Fails: 21. Fail ratio 0%. Some fails being "Error: Unable to establish IPMI v2 / RMCP+ session", "Error: no response from RAKP 3 message". Also sometime "Close Session command failed", not counted as error as it does not influence the exit code. However this causes a delay of 10s.
Result from grenache-1, still running in screen session on grenache-1, I should take a look tomorrow.
EDIT: 2023-05-05: 06:17:08.253007 00:00:00.004818 ## count_fail_ratio: Run: 10000. Fails: 264. Fail ratio 2%
so the fail ratio decreased a lot over night. I assume conflicting network load or on the host has an impact.
Running the same experiment again with ipmitool options -R 10 -N 10
Updated by Julie_CAO over 1 year ago
the ipmi command begun to work today. I rerun 3 times and all worked. It seems the problem is gone.
Updated by okurz over 1 year ago
Maybe firmware update helps? Logging in over web on http://ix64hm1200.qa.suse.de
Firmware Revision : 03.48 IP address : 010.162.028.200
Firmware Build Time : 2016-01-29 BMC MAC address : 00:25:90:9f:9e:12
but we don't know what machine that is. SoL on boot says https://www.supermicro.com/products/archive/motherboard/x9dri-f . https://www.supermicro.com/en/support/resources/downloadcenter/firmware/MBD-X9DRi-F/BMC says there is a new version 3.62. Downloaded and upgrading over firmware update mode in web BMC interface.
We also reviewed https://github.com/ipmitool/ipmitool/ and in particular https://github.com/ipmitool/ipmitool/releases/tag/IPMITOOL_1_8_19 if there is anything new that might help to improve stability.
However anyway we could update the firmware so that is at least something :)
Firmware Revision : 03.62 IP address : 010.162.028.200
Firmware Build Time : 2020-07-15
Retriggered the ipmitool experiment from grenache-1 where we could immediately observe another occurence of "Close Session command failed" so that did not change in behaviour. Also started again on qamaster.
Also updated my "count_fail_ratio" to calculate floating point precision fail ratios as we are often encountering "below 1%" fail ratios, using bc. See https://github.com/okurz/scripts/commit/6a33126f0107383a5285e5a5ef42ce26adc9ce82
Running the experiment with that again also locally as baseline:
runs=10000 count_fail_ratio unbuffer ipmitool -I lanplus -C 3 -o supermicro -H ix64hm1200.qa.suse.de -U admin -P [masked] mc guid 2>&1 | ts -i '%H:%M:%.S' | ts -s '%H:%M:%.S'
It's already obvious that from linux-28d7, my notebook, I don't get "Close Session command failed" often if at all, but at the same time reproduced often from qamaster and grenache-1, both in same rack. I am sure that ipmitool itself is binary-identical for qamaster and linux-28d7, checked with md5sum $(which ipmitool)
yielding "6dd56e24f18a89a0725af5de7f390f89 /usr/bin/ipmitool". I checked loaded libraries with ldd
and also checked /usr/lib64/libfreeipmi.so.17
On qamaster the error "Close Session command failed" just stopped:
00:13:14.322024 00:00:00.000090 ## Run 73
00:13:24.442356 00:00:10.027833 Close Session command failed
00:13:24.450918 00:00:00.008681 ## count_fail_ratio: Run: 73. Fails: 0. Fail ratio 0%
00:13:24.450958 00:00:00.000039 ## Run 74
00:13:35.575072 00:00:10.031620 Close Session command failed
00:13:35.583449 00:00:00.008413 ## count_fail_ratio: Run: 74. Fails: 0. Fail ratio 0%
00:13:35.583496 00:00:00.000046 ## Run 75
00:13:46.703624 00:00:10.027679 Close Session command failed
00:13:46.712191 00:00:00.008641 ## count_fail_ratio: Run: 75. Fails: 0. Fail ratio 0%
00:13:46.712229 00:00:00.000038 ## Run 76
00:13:47.816701 00:00:00.009188 ## count_fail_ratio: Run: 76. Fails: 0. Fail ratio 0%
00:13:47.816766 00:00:00.000098 ## Run 77
00:13:57.934171 00:00:10.027014 Close Session command failed
00:13:57.942132 00:00:00.007996 ## count_fail_ratio: Run: 77. Fails: 0. Fail ratio 0%
00:13:57.942198 00:00:00.000075 ## Run 78
00:13:58.035694 00:00:00.008616 ## count_fail_ratio: Run: 78. Fails: 0. Fail ratio 0%
00:13:58.035745 00:00:00.000054 ## Run 79
00:13:58.126907 00:00:00.008116 ## count_fail_ratio: Run: 79. Fails: 0. Fail ratio 0%
00:13:58.126960 00:00:00.000038 ## Run 80
00:13:58.220228 00:00:00.008218 ## count_fail_ratio: Run: 80. Fails: 0. Fail ratio 0%
00:13:58.220281 00:00:00.000085 ## Run 81
00:13:58.313960 00:00:00.008244 ## count_fail_ratio: Run: 81. Fails: 0. Fail ratio 0%
00:13:58.314005 00:00:00.000044 ## Run 82
00:13:58.410096 00:00:00.008803 ## count_fail_ratio: Run: 82. Fails: 0. Fail ratio 0%
interesting to see: All showed the error up to run 75, then 76 w/o, 77 error again, 78ff w/o. And now it just seems to start again with run 1960.
But right now fail ratio on grenache-1 is 41 percent. Let's try to restart the switch in rack 2 where ix64ph1075 is located. After the switch was back up again the connection from grenache-1 looks really good.
Updated by okurz over 1 year ago
The connection from grenache-1 to ix64ph1075 is still very stable using the power status
command. Latest experiment showed count_fail_ratio: Run: 30000. Fails: 1. Fail ratio 0%
. I am trying to look into the still present failures and if multiple retries to ipmitool can help.
On linux-28d7 from 10k runs 4 fails, i.e. fail ratio 0.04%. And the errors that showed up:
$ grep -v '##' ipmi_mc_guid_poo125450_2023-05-08T13:35:14+02:00.log
00:00:27.799559 00:00:01.165539 > Error: no response from RAKP 3 message
00:00:28.204694 00:00:00.176309 Error: Unable to establish IPMI v2 / RMCP+ session
00:08:50.399711 00:00:00.173545 Error: Unable to establish IPMI v2 / RMCP+ session
00:14:14.862245 00:00:01.166526 > Error: no response from RAKP 3 message
00:14:27.908729 00:00:13.046506 Set Session Privilege Level to ADMINISTRATOR failed
00:14:27.908814 00:00:00.000104 Error: Unable to establish IPMI v2 / RMCP+ session
00:18:35.836568 00:00:10.030031 Close Session command failed
00:27:17.824404 00:00:01.153423 > Error: no response from RAKP 1 message
00:43:09.868358 00:00:01.126254 > Error: no response from RAKP 1 message
00:43:56.470334 00:00:01.184566 Error: Unable to establish IPMI v2 / RMCP+ session
00:45:38.892198 00:00:10.026831 Close Session command failed
00:47:52.052188 00:00:01.143234 > Error: no response from RAKP 3 message
00:48:20.016474 00:00:01.147754 > Error: no response from RAKP 3 message
00:49:05.664527 00:00:10.029110 Close Session command failed
00:49:45.595523 00:00:01.141644 > Error: no response from RAKP 3 message
00:52:05.778055 00:00:01.178423 > Error: no response from RAKP 3 message
00:52:06.806498 00:00:01.028391 > Error: no response from RAKP 1 message
00:52:36.678707 00:00:01.120133 > Error: no response from RAKP 1 message
00:53:22.582800 00:00:01.160691 > Error: no response from RAKP 3 message
00:55:25.867893 00:00:01.448023 > Error: no response from RAKP 3 message
00:57:24.914749 00:00:01.149647 > Error: no response from RAKP 3 message
00:57:39.460361 00:00:10.028691 Close Session command failed
00:58:26.276220 00:00:02.245563 > Error: no response from RAKP 3 message
those are both the fatal as well as non-fatal errors.
Running again on grenache-1 with
runs=30000 count_fail_ratio unbuffer ipmitool -I lanplus -C 3 -H ix64hm1200.qa.suse.de -U admin -P [masked] mc guid 2>&1 | ts -i '%H:%M:%.S' | ts -s '%H:%M:%.S' | grep --line-buffered -v 'GUID.*:' | tee -a /tmp/ipmi_mc_guid_poo125450_$(date -Is).log
Updated by okurz over 1 year ago
The latest experiment on grenache-1 showed a fail rate of 0.00%, all 30k ipmi calls succeeded. I don't see where we can improve the use of ipmitool then. The only thing I could think of is some kind of monitoring or ipmitool check where we would do the above regularly so that misbehaving switches and such would show up as alert.
Updated by okurz over 1 year ago
- Copied to action #129032: ipmitool monitoring added
Updated by okurz over 1 year ago
- Due date deleted (
2023-05-12) - Status changed from In Progress to Resolved
Created #129032 for the latter idea, rest is considered resolved here.
Updated by okurz over 1 year ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Category deleted (
Support)
Updated by emiler over 1 year ago
- Related to action #128501: [security] [QR] [IPMI] test fails in consoletest_setup on @64bit-ipmi added