action #107917
closed
Recovery of imagetester via IPMI failed size:M
Added by mkittler almost 3 years ago. Updated almost 3 years ago.
0%
Description
Observation
The corresponding GitLab pipeline failed: monitor-o3 | Failed pipeline for master | fee77e0e
$ ssh o3 'ping -q -c 1 imagetester >/dev/null' || ipmitool -I lanplus -C 3 -H 10.160.65.195 -U ADMIN -P $imagetester_ipmi_password power cycle
Error: Unable to establish IPMI v2 / RMCP+ session
Cleaning up project directory and file based variables 00:00
ERROR: Job failed: command terminated with exit code 1
I haven't restarted the job because imagetester seems to be online nevertheless. IPMI being sometimes unavailable is something I also experience when using it manually. We could implement a retry, though.
Suggestions
- Check if imagetester is currently actually online or needs recovery
- Maybe the ping fails but the machine is online?
- Crosscheck credentials and IPMI access
- Retry IPMI if it fails
- Check our wiki because we stated that imagetester does not have a working IPMI anyway
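The retry suggested above could look roughly like the following. This is only a sketch: the `retry` helper, its argument order and the chosen attempt count and delay are assumptions and not what monitor-o3 currently does. Note that retrying `power cycle` is only safe if a failed call really means the command never reached the BMC, as with the session error seen in the job log.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached and
# returns the exit code of the last attempt.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (last exit code $status)" >&2
            return "$status"
        fi
        echo "attempt $attempt failed (exit code $status), retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Possible use in the pipeline (3 attempts and a 10 s delay are arbitrary choices):
# ssh o3 'ping -q -c 1 imagetester >/dev/null' || \
#     retry 3 10 ipmitool -I lanplus -C 3 -H 10.160.65.195 -U ADMIN -P "$imagetester_ipmi_password" power cycle
```

The `if "$@"; then … else status=$?` form keeps the helper usable in scripts running with `set -e`, where a bare failing command would abort the job before the retry logic runs.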
Further info
- The recovery is implemented in https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml
- imagetester has only been added recently: https://gitlab.suse.de/openqa/monitor-o3/-/commit/0a3d0ee4f9543961f8bb368ece1fcf3642b2a6bc
Updated by okurz almost 3 years ago
- Priority changed from Normal to High
- Target version set to Ready
Updated by livdywan almost 3 years ago
- Subject changed from Recovery of imagetester via IPMI failed to Recovery of imagetester via IPMI failed size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler almost 3 years ago
- Description updated (diff)
- Target version deleted (Ready)
Updated by livdywan almost 3 years ago
- Target version set to Ready
I'll assume you lost Ready by accident, since there's no comment.
Updated by mkittler almost 3 years ago
IPMI works but not from the outside, only on the host itself. When following the URL to the management web interface (https://progress.opensuse.org/projects/openqav3/wiki/#Accessing-imagetester) I also don't get a response. So also the alternative path for accessing the management interface seems broken.
So although a retry would be nice in general, I think we have a general problem of accessing the worker.
The following didn't work:
ipmitool -U ADMIN -P … mc reset cold
Maybe 10.160.65.195 is just not reachable due to an interfering firewall rule?
Updated by mkittler almost 3 years ago
I've tried my luck with ipmiutil but couldn't make it work:
transactional update # ipmiutil lan -c -e
ipmiutil ver 3.07
ilan ver 3.07
-- BMC version 2.50, IPMI version 2.0
PEF Control | none
Channel 1 IP address | 10.160.3.100
Channel 1 IP addr src | Static
Channel 1 MAC addr | 00:25:90:a2:b8:48
Channel 1 Subnet mask | 255.255.0.0
Channel 1 Def gateway IP | 10.160.255.254
Channel 1 Def gateway MAC | 00:00:5e:00:01:02
Channel 1 Community string | public
Channel 1 Dest address | IP=0.0.0.0 MAC=00:00:00:00:00:00
SuperMicro Lan Interface | Failover
Channel 1 SOL Enable | enabled
Channel 1 SOL Baud Rate | 115.2k
Channel 1 SOL Payload Access(user1) | enabled
Channel 1 SOL Payload Access(user2) | enabled
Channel 1 SOL Payload Access(user3) | enabled
Channel 1 SOL Payload Access(user4) | enabled
Channel 1 SOL Payload Access(user5) | disabled
Channel 1 User 1 Access | Reserved ()
Channel 1 User 2 Access | IPMI, Admin (ADMIN)
Channel 1 User 3 Access | IPMI, Admin (root)
Channel 1 User 4 Access | IPMI, Admin (devcon)
Channel 1 User 5 Access | Reserved ()
ilan, SetLanEntry for channel 1 ...
SetUser(2), ret = 0
SetLanEntry(2), ret = 0
LAN1 (eth0) ip=10.160.3.100 mac=00:25:90:a2:b8:48
SetLanEntry(4), ret = 0
SetLanEntry(3), ret = 0
SetLanEntry(5), ret = 0
SetLanEntry(6), ret = 0
SetLanEntry(7), ret = 0
SetLanEntry(10,1), ret = 0
SetLanEntry(11), ret = 0
gateway ip=192.168.112.254 mac=52:54:00:5f:2b:b3
WARNING: IP Address and Gateway are not on the same subnet, setting Gateway to previous value
SetLanEntry(12), ret = 0
SetLanEntry(13), ret = 0
SetupSerialOverLan: ret = 0
alert dest address not specified
ipmiutil lan, completed successfully
transactional update #
transactional update #
transactional update #
transactional update #
transactional update # ipmiutil lan -c -e
ipmiutil ver 3.07
ilan ver 3.07
-- BMC version 2.50, IPMI version 2.0
PEF Control | none
Channel 1 IP address | 10.160.3.100
Channel 1 IP addr src | Static
Channel 1 MAC addr | 00:25:90:a2:b8:48
Channel 1 Subnet mask | 255.255.0.0
Channel 1 Def gateway IP | 10.160.255.254
Channel 1 Def gateway MAC | 00:00:5e:00:01:02
Channel 1 Community string | public
Channel 1 Dest address | IP=0.0.0.0 MAC=00:00:00:00:00:00
SuperMicro Lan Interface | Failover
Channel 1 SOL Enable | enabled
Channel 1 SOL Baud Rate | 115.2k
Channel 1 SOL Payload Access(user1) | enabled
Channel 1 SOL Payload Access(user2) | enabled
Channel 1 SOL Payload Access(user3) | enabled
Channel 1 SOL Payload Access(user4) | enabled
Channel 1 SOL Payload Access(user5) | disabled
Channel 1 User 1 Access | Reserved ()
Channel 1 User 2 Access | IPMI, Admin (ADMIN)
Channel 1 User 3 Access | IPMI, Admin (root)
Channel 1 User 4 Access | IPMI, Admin (devcon)
Channel 1 User 5 Access | Reserved ()
ilan, SetLanEntry for channel 1 ...
SetUser(2), ret = 0
SetLanEntry(2), ret = 0
LAN1 (eth0) ip=10.160.3.100 mac=00:25:90:a2:b8:48
SetLanEntry(4), ret = 0
SetLanEntry(3), ret = 0
SetLanEntry(5), ret = 0
SetLanEntry(6), ret = 0
SetLanEntry(7), ret = 0
SetLanEntry(10,1), ret = 0
SetLanEntry(11), ret = 0
gateway ip=192.168.112.254 mac=52:54:00:5f:2b:b3
WARNING: IP Address and Gateway are not on the same subnet, setting Gateway to previous value
SetLanEntry(12), ret = 0
SetLanEntry(13), ret = 0
SetupSerialOverLan: ret = 0
alert dest address not specified
ipmiutil lan, completed successfully
Interestingly, these logs show a different IP address (10.160.3.100). Under that IP I can actually reach a web server but I don't know the login or how to set it.
Btw, the IP also shows up via ipmitool:
transactional update # ipmitool lan print
Set in Progress : Set Complete
Auth Type Support : NONE MD2 MD5 PASSWORD
Auth Type Enable : Callback : MD2 MD5 PASSWORD
: User : MD2 MD5 PASSWORD
: Operator : MD2 MD5 PASSWORD
: Admin : MD2 MD5 PASSWORD
: OEM :
IP Address Source : Static Address
IP Address : 10.160.3.100
Subnet Mask : 255.255.0.0
MAC Address : 00:25:90:a2:b8:48
SNMP Community String : public
IP Header : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control : ARP Responses Enabled, Gratuitous ARP Disabled
Default Gateway IP : 10.160.255.254
Default Gateway MAC : 00:00:5e:00:01:02
Backup Gateway IP : 0.0.0.0
Backup Gateway MAC : 00:00:00:00:00:00
802.1q VLAN ID : Disabled
802.1q VLAN Priority : 0
RMCP+ Cipher Suites : 1,2,3,6,7,8,11,12
Cipher Suite Priv Max : XXXXXXXXXXXXXXX
: X=Cipher Suite Unused
: c=CALLBACK
: u=USER
: o=OPERATOR
: a=ADMIN
: O=OEM
Bad Password Threshold : Not Available
Updated by mkittler almost 3 years ago
When using 10.160.3.100 I can actually connect from the outside (not from o3 itself but locally via VPN):
ipmitool -I lanplus -C 3 -H 10.160.3.100 -U ADMIN -P pass -b 1 -B 1 power status
Error in open session response message : invalid role
Error: Unable to establish IPMI v2 / RMCP+ session
Now the error message indicates some auth issue. However, I couldn't find out what the problem is.
The users have a role and "Link auth" is set similar to other hosts:
imagetester:~ # ipmitool user list 0
ID Name Callin Link Auth IPMI Msg Channel Priv Limit
1 true false false Unknown (0x00)
2 ADMIN true true true ADMINISTRATOR
3 root true true true ADMINISTRATOR
4 devcon true true true ADMINISTRATOR
5 true false false Unknown (0x00)
6 admin true true true ADMINISTRATOR
7 true false false Unknown (0x00)
8 true false false Unknown (0x00)
9 true false false Unknown (0x00)
10 true false false Unknown (0x00)
imagetester:~ # ipmitool user list 1
ID Name Callin Link Auth IPMI Msg Channel Priv Limit
1 true false false Unknown (0x00)
2 ADMIN true true true ADMINISTRATOR
3 root true true true ADMINISTRATOR
4 devcon true true true ADMINISTRATOR
5 true false false Unknown (0x00)
6 admin true true true ADMINISTRATOR
7 true false false Unknown (0x00)
8 true false false Unknown (0x00)
9 true false false Unknown (0x00)
10 true false false Unknown (0x00)
Btw, that's what the LAN settings look like:
imagetester:~ # ipmitool lan print
Set in Progress : Set Complete
Auth Type Support : NONE MD2 MD5 PASSWORD
Auth Type Enable : Callback : MD2 MD5 PASSWORD
: User : MD2 MD5 PASSWORD
: Operator : MD2 MD5 PASSWORD
: Admin : MD2 MD5 PASSWORD
: OEM :
IP Address Source : Static Address
IP Address : 10.160.3.100
Subnet Mask : 255.255.0.0
MAC Address : 00:25:90:a2:b8:48
SNMP Community String : public
IP Header : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control : ARP Responses Enabled, Gratuitous ARP Disabled
Default Gateway IP : 10.160.255.254
Default Gateway MAC : 00:00:5e:00:01:02
Backup Gateway IP : 0.0.0.0
Backup Gateway MAC : 00:00:00:00:00:00
802.1q VLAN ID : Disabled
802.1q VLAN Priority : 0
RMCP+ Cipher Suites : 1,2,3,6,7,8,11,12
Cipher Suite Priv Max : XXXXXXXXXXXXXXX
: X=Cipher Suite Unused
: c=CALLBACK
: u=USER
: o=OPERATOR
: a=ADMIN
: O=OEM
Bad Password Threshold : Not Available
For other workers the IP printed there matches what we have documented. So I assume this really is the IP (and 10.160.65.195 is wrong).
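The cross-check between the documented IP and the one the BMC reports could be scripted along these lines. This is a sketch under assumptions: `bmc_ip` and `check_bmc_ip` are hypothetical helper names, and the parsing relies on the `IP Address : …` line format shown in the `ipmitool lan print` output above.

```shell
#!/bin/sh
# Hypothetical helper: read `ipmitool lan print` output on stdin and
# print the BMC IP address. The pattern deliberately does not match
# the "IP Address Source" line.
bmc_ip() {
    awk -F': *' '/^IP Address *:/ { print $2; exit }'
}

# Hypothetical helper: compare the IP reported on stdin with the documented one.
check_bmc_ip() {
    expected=$1
    actual=$(bmc_ip)
    if [ "$actual" = "$expected" ]; then
        echo "OK: BMC IP is $actual"
    else
        echo "MISMATCH: documented $expected, BMC reports ${actual:-nothing}" >&2
        return 1
    fi
}

# Possible use on the worker itself (needs root and local IPMI access):
# ipmitool lan print | check_bmc_ip 10.160.65.195
```

Running such a check on every worker after changing the documentation would catch an outdated address before the recovery job needs it.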
Updated by mkittler almost 3 years ago
Interestingly, on other workers no channel 0 exists (only channel 1):
openqa-aarch64:~ # ipmitool channel info 0
IPMI command failed: Invalid data field in request
Unable to Get Channel Info
openqa-aarch64:~ # ipmitool channel info 1
Channel 0x1 info:
Channel Medium Type : 802.3 LAN
Channel Protocol Type : IPMB-1.0
Session Support : multi-session
Active Session Count : 0
Protocol Vendor ID : 7154
Volatile(active) Settings
Alerting : disabled
Per-message Auth : enabled
User Level Auth : enabled
Access Mode : always available
Non-Volatile Settings
Alerting : enabled
Per-message Auth : enabled
User Level Auth : enabled
Access Mode : disabled
Maybe channel 0 is interfering?
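To compare hosts quickly, the channel probing above could be wrapped in a small loop. This is just a sketch: `probe_channels` and the `IPMITOOL` override variable are made-up names for illustration, and the helper only looks at the exit code of `ipmitool channel info`.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given IPMI channels exist on the local BMC.
# IPMITOOL may be set to another command (e.g. a wrapper); it defaults to ipmitool.
probe_channels() {
    for ch in "$@"; do
        if "${IPMITOOL:-ipmitool}" channel info "$ch" >/dev/null 2>&1; then
            echo "channel $ch: present"
        else
            echo "channel $ch: not available"
        fi
    done
}

# Possible use on a worker (as root):
# probe_channels 0 1
```

Running this on imagetester and on a worker where IPMI works would show whether the additional channel is really specific to imagetester.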
Updated by livdywan almost 3 years ago
- Status changed from Workable to In Progress
Let's assume work is being done here
Updated by bmwiedemann almost 3 years ago
imagetester-ipmi.suse.de is the internal hostname for 10.160.3.100
It responds to ping and http, so we probably just need to get the IPMI credentials right.
Did you use the new password that okurz set some months ago?
Updated by mkittler almost 3 years ago
I was actually the one who changed the IPMI passwords on all machines last time. So yes, I've been using the current password, and I'm sure that if @okurz had changed it in the meantime he would have updated workerconf.sls. I also tried to set the password to something else to be sure but it didn't work.
Btw, I've accidentally mentioned the password here so I'm going to change the IPMI password on all hosts again. I'm about to do that now. When testing whether all hosts are accessible via IPMI I've also noticed that fsp1-malbec.arch.suse.de cannot be accessed via LAN.
Updated by openqa_review almost 3 years ago
- Due date set to 2022-03-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 3 years ago
- Assignee changed from mkittler to livdywan
I'll give it a go since @mkittler exhausted all options, and confirm whether we can get this sorted or need physical access to reset the machine.
Updated by livdywan almost 3 years ago
I can confirm that the login works on http://imagetester-ipmi.suse.de although ipmitool gives me invalid role, which as per the documentation should mean that the login is wrong. The same happens using the IP reported by the web UI / the generated jnlp config. My suspicion is that the IP is not static, and perhaps there's some re-routing going on but connections via other IPs are not accepted.
Still, I don't get why I can't log in when the password is most definitely correct...
Updated by mkittler almost 3 years ago
Oh, you're right: with the "new new" password (which I set as mentioned in #107917#note-13), login works via the web UI. Maybe the old password and the "first new" password I tried contained too many or unsupported characters. That's already an improvement. I've triggered a reboot of the IPMI device over the web UI. Maybe it helps.
Updated by mkittler almost 3 years ago
- Assignee changed from livdywan to mkittler
It didn't help. I'll try a few other things over the web UI.
Updated by mkittler almost 3 years ago
- Status changed from In Progress to Resolved
I did a factory reset over the web UI, set the password again and now it works.
Updated by mkittler almost 3 years ago
- Status changed from Resolved to Feedback
Ok, there are a few more things to sort out:
Updated by mkittler almost 3 years ago
- Status changed from Feedback to Resolved
The PRs have been merged. I suppose I can now actually resolve the ticket. Note that we can still think of implementing a retry in monitor-o3 but I see it as out of scope for fixing the immediate issue.
Updated by livdywan over 2 years ago
- Copied to action #108671: Resilient IPMI recovery of o3 machines in monitor-o3 size:M added
Updated by okurz over 1 year ago
- Related to action #135137: Bring back imagetester size:M added