action #106056

open

coordination #125708: [epic] Future ideas for more stable non-qemu backends

[virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment") size:M

Added by xguo about 2 years ago. Updated 12 months ago.

Status: Workable
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2022-02-07
Due date:
% Done: 0%
Estimated time:
Description

Observation

See the OSD test failure log at https://openqa.suse.de/tests/8113762: the backend fails to connect to openqaipmi5-sp.qa.suse.de in our OSD environment.
[2022-02-07T07:36:27.624742+01:00] [debug] IPMI: Chassis Power Control: Up/On
[2022-02-07T07:36:40.726270+01:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XX chassis power status: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.

However, we can connect to openqaipmi5-sp.qa.suse.de successfully from our testing environment:
ipmitool -I lanplus -H openqaipmi5-sp.qa.suse.de -U admin -P XX chassis power status
Chassis Power is on
ip a
2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether d2:0a:cd:f5:97:40 brd ff:ff:ff:ff:ff:ff
inet 10.161.159.120/20 brd 10.161.159.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::d00a:cdff:fef5:9740/64 scope link
valid_lft forever preferred_lft forever

FYI, refer to https://openqa.suse.de/admin/workers/1207 for more details.

Acceptance criteria

  • AC1: os-autoinst backend::ipmi retries consistently in more cases of network-related unavailability and instability
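
One way to read AC1 is that connection-level failures should be told apart from other errors so that only the former are retried. The following is a minimal shell sketch of that idea, not the backend's actual Perl code. The function name is_transient_ipmi_error is hypothetical, and the only pattern it matches is the session error quoted in the observation above; which other messages count as network related is left open.

```shell
# Hypothetical helper: decide whether captured ipmitool output looks like a
# connection problem that is worth retrying.
# Usage: is_transient_ipmi_error "$output"
# Returns 0 (retry) for the session error seen in the log above, 1 otherwise.
is_transient_ipmi_error() {
    case "$1" in
        *"Unable to establish IPMI v2 / RMCP+ session"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller would capture the combined stdout and stderr of the IPMI command and pass it in; any further patterns would need to be collected from real failure logs.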

Suggestions

# "ipmi" is a placeholder for the actual IPMI call
ipmi power reset
for i in $(seq 1 10); do
    ipmi power status && break
    echo "Retrying ipmi connection $i of 10 after sleep"
    sleep 10
done
...
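
The retry loop suggested above could be generalized into a small reusable helper. This is only a sketch in shell to illustrate the control flow; the real change would live in the Perl module backend/ipmi.pm. The function name retry and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have been made.
# Returns 0 on the first success, otherwise the exit status of the last try.
retry() {
    local attempts=$1 delay=$2 i=1 rc=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        rc=$?
        # Only sleep when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            echo "Attempt $i of $attempts failed, retrying after ${delay}s" >&2
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return "$rc"
}

# Example call (host and credentials are placeholders):
# retry 10 10 ipmitool -I lanplus -H "$host" -U "$user" -P "$password" chassis power status
```

Keeping the attempt count and delay as parameters would allow different values for the power status check and for other IPMI commands without duplicating the loop.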

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #102650: Organize labs move to new building and SRV2 size:M | Resolved | nicksinger | 2021-11-18 | 2022-05-27
Related to openQA Infrastructure - action #128654: [sporadic] Fail to create an ipmi session to worker grenache-1:16 (ix64ph1075) in its vlan | Resolved | okurz | 2023-05-04
