action #128654

Updated by okurz over 1 year ago

## Observation 

 All tests running on grenache-1:16 failed today. It worked 5 days ago. 

 https://openqa.suse.de/admin/workers/1247 
 Failure looks like: 
 ``` 
 [2023-05-04T07:49:10.563049+02:00] [info] [pid:113776] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines: 
   ipmitool -I lanplus -H ix64hm1200.qa.suse.de -U admin -P [masked] mc guid: IPMI response is NULL. at /usr/lib/os-autoinst/backend/ipmi.pm line 45. 
 [2023-05-04T07:49:23.682702+02:00] [warn] [pid:113776] !!! backend::baseclass::run_capture_loop: capture loop failed ipmitool -I lanplus -H ix64hm1200.qa.suse.de -U admin -P [masked] chassis power off: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45. 
 ``` 
 I tried "ipmitool -I lanplus -H 10.162.28.200 -U admin -P [masked] ..." from other VLANs, for example from 10.168.192.87 and from my laptop over VPN, and it did not report any failures. However, when I run it from 10.162.2.99, it reports errors: 

 ``` 
 fozzie-1:~ # ipmitool -I lanplus -H 10.162.28.200 -U admin -P [masked] -vvv 
 ipmitool version 1.8.18 
 ... 
 Get Auth Capabilities error 
 Error issuing Get Channel Authentication Capabilities request 
 Error: Unable to establish IPMI v2 / RMCP+ session 
 ``` 

 Is anything wrong with the IPMI services or configuration in the 10.162.xx VLAN or on this machine? 

 ## Problem 
 * **H1** The network in NUE1-SRV2-B racks 1+2 is badly impacted due to switch behaviour -> **E1-1** Reset the switches in racks 1 and 2 and rerun the experiment from #128654#note-7 
 * **H2** IPMI is just unstable in general and needs retries and waits -> **E2-1** Increase the number of retries (default 4) and the timeout (default 1 second for lanplus), e.g. `-R 10 -N 10` 
 * **H3** ipmitool behaves differently on ppc64le -> **E3-1** Crosscheck the experiment on different machines and architectures
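
 A minimal sketch of how E2-1 could be scripted for the crosscheck in E3-1, assuming Python 3 and `ipmitool` are available on each test machine. The helper names `build_ipmitool_cmd` and `run_with_attempts`, the outer attempt count and the wait time are hypothetical and not taken from os-autoinst. Only the `-R` and `-N` options mirror the values proposed above.

```python
import subprocess
import time


def build_ipmitool_cmd(host, user, password, args, retries=10, timeout=10):
    """Build an ipmitool lanplus command line with explicit -R and -N."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-P", password,
        "-R", str(retries),  # number of retries (ipmitool default: 4)
        "-N", str(timeout),  # seconds between retransmissions (lanplus default: 1)
        *args,
    ]


def run_with_attempts(cmd, attempts=3, wait=5.0, runner=subprocess.run):
    """Run cmd up to `attempts` times, sleeping `wait` seconds after a failure.

    Returns the 1-based number of the attempt that succeeded, or 0 if all failed.
    `runner` is injectable so the loop can be exercised without real hardware.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(wait)
    return 0
```

 Running something like `run_with_attempts(build_ipmitool_cmd("10.162.28.200", "admin", password, ["mc", "guid"]))` from hosts in different VLANs and on different architectures would show whether larger retry and timeout values change the outcome. Here `password` is a placeholder for the masked credential.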
