action #107917

closed

Recovery of imagetester via IPMI failed size:M

Added by mkittler about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-03-07
Due date: 2022-03-26
% Done: 0%
Estimated time:
Description

Observation

The corresponding GitLab pipeline failed: monitor-o3 | Failed pipeline for master | fee77e0e

$ ssh o3 'ping -q -c 1 imagetester >/dev/null' || ipmitool -I lanplus -C 3 -H 10.160.65.195 -U ADMIN -P $imagetester_ipmi_password power cycle
Error: Unable to establish IPMI v2 / RMCP+ session
Cleaning up project directory and file based variables 00:00
ERROR: Job failed: command terminated with exit code 1

I haven't restarted the job because imagetester seems to be online nevertheless. IPMI being sometimes unavailable is something I also experience when using it manually. We could implement a retry, though.
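
A minimal sketch of such a retry, assuming a POSIX shell in the pipeline job. The function name `retry`, the attempt count, and the delay are illustrative choices, not what monitor-o3 currently does:

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits with status 0 or MAX_ATTEMPTS is reached.
# Hypothetical helper; not part of monitor-o3 or ipmitool.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            # Only print the program name, the arguments may contain a password.
            echo "Command failed after $attempt attempts: $1" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Possible use in the job (imagetester_ipmi_host is a hypothetical variable):
# ssh o3 'ping -q -c 1 imagetester >/dev/null' ||
#     retry 3 10 ipmitool -I lanplus -C 3 -H "$imagetester_ipmi_host" \
#         -U ADMIN -P "$imagetester_ipmi_password" power cycle
```

This only helps with transient failures. If the BMC is unreachable or rejects the credentials on every attempt, the job still fails after the last attempt.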

Suggestions

  • Check if imagetester is currently actually online or needs recovery
  • Maybe the ping fails but the machine is online?
  • Crosscheck credentials and IPMI access
  • Re-try ipmi if it fails
  • Check our wiki because we stated that imagetester does not have a working IPMI anyway

Further info


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #135137: Bring back imagetester size:M (Resolved, okurz, 2023-09-04)
Copied to openQA Infrastructure - action #108671: Resilient IPMI recovery of o3 machines in monitor-o3 size:M (Resolved, mkittler, 2022-03-07)

Actions #1

Updated by okurz about 2 years ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by livdywan about 2 years ago

  • Subject changed from Recovery of imagetester via IPMI failed to Recovery of imagetester via IPMI failed size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler about 2 years ago

  • Description updated (diff)
  • Target version deleted (Ready)
Actions #4

Updated by mkittler about 2 years ago

  • Description updated (diff)
Actions #5

Updated by livdywan about 2 years ago

  • Target version set to Ready

I'll assume you lost Ready by accident, since there's no comment.

Actions #6

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #7

Updated by mkittler about 2 years ago

IPMI works, but only on the host itself, not from the outside. Following the URL to the management web interface (https://progress.opensuse.org/projects/openqav3/wiki/#Accessing-imagetester) also yields no response, so the alternative path for accessing the management interface seems broken as well.

So although a retry would be nice in general, I think we have a general problem of accessing the worker.

The following didn't work:

ipmitool -U ADMIN -P … mc reset cold

Maybe 10.160.65.195 is just not reachable due to an interfering firewall rule?

Actions #8

Updated by mkittler about 2 years ago

I've tried my luck with ipmiutil but couldn't make it work:

transactional update # ipmiutil lan -c -e                 
ipmiutil ver 3.07
ilan ver 3.07 
-- BMC version 2.50, IPMI version 2.0 
PEF Control                             | none 
Channel 1 IP address                    | 10.160.3.100
Channel 1 IP addr src                   | Static
Channel 1 MAC addr                      | 00:25:90:a2:b8:48
Channel 1 Subnet mask                   | 255.255.0.0
Channel 1 Def gateway IP                | 10.160.255.254
Channel 1 Def gateway MAC               | 00:00:5e:00:01:02
Channel 1 Community string              | public 
Channel 1 Dest address                  | IP=0.0.0.0 MAC=00:00:00:00:00:00
SuperMicro Lan Interface                | Failover
Channel 1 SOL Enable                    | enabled
Channel 1 SOL Baud Rate                 | 115.2k
Channel 1 SOL Payload Access(user1)     | enabled
Channel 1 SOL Payload Access(user2)     | enabled
Channel 1 SOL Payload Access(user3)     | enabled
Channel 1 SOL Payload Access(user4)     | enabled
Channel 1 SOL Payload Access(user5)     | disabled
Channel 1 User 1 Access                 | Reserved ()
Channel 1 User 2 Access                 | IPMI, Admin  (ADMIN)
Channel 1 User 3 Access                 | IPMI, Admin  (root)
Channel 1 User 4 Access                 | IPMI, Admin  (devcon)
Channel 1 User 5 Access                 | Reserved ()

ilan, SetLanEntry for channel 1 ...
SetUser(2), ret = 0
SetLanEntry(2), ret = 0
LAN1 (eth0)     ip=10.160.3.100 mac=00:25:90:a2:b8:48
SetLanEntry(4), ret = 0
SetLanEntry(3), ret = 0
SetLanEntry(5), ret = 0
SetLanEntry(6), ret = 0
SetLanEntry(7), ret = 0
SetLanEntry(10,1), ret = 0
SetLanEntry(11), ret = 0
gateway         ip=192.168.112.254 mac=52:54:00:5f:2b:b3
WARNING: IP Address and Gateway are not on the same subnet, setting Gateway to previous value
SetLanEntry(12), ret = 0
SetLanEntry(13), ret = 0
SetupSerialOverLan: ret = 0
alert dest      address not specified
ipmiutil lan, completed successfully
transactional update # 
transactional update # 
transactional update # 
transactional update # 
transactional update # ipmiutil lan -c -e 
ipmiutil ver 3.07
ilan ver 3.07 
-- BMC version 2.50, IPMI version 2.0 
PEF Control                             | none 
Channel 1 IP address                    | 10.160.3.100
Channel 1 IP addr src                   | Static
Channel 1 MAC addr                      | 00:25:90:a2:b8:48
Channel 1 Subnet mask                   | 255.255.0.0
Channel 1 Def gateway IP                | 10.160.255.254
Channel 1 Def gateway MAC               | 00:00:5e:00:01:02
Channel 1 Community string              | public 
Channel 1 Dest address                  | IP=0.0.0.0 MAC=00:00:00:00:00:00
SuperMicro Lan Interface                | Failover
Channel 1 SOL Enable                    | enabled
Channel 1 SOL Baud Rate                 | 115.2k
Channel 1 SOL Payload Access(user1)     | enabled
Channel 1 SOL Payload Access(user2)     | enabled
Channel 1 SOL Payload Access(user3)     | enabled
Channel 1 SOL Payload Access(user4)     | enabled
Channel 1 SOL Payload Access(user5)     | disabled
Channel 1 User 1 Access                 | Reserved ()
Channel 1 User 2 Access                 | IPMI, Admin  (ADMIN)
Channel 1 User 3 Access                 | IPMI, Admin  (root)
Channel 1 User 4 Access                 | IPMI, Admin  (devcon)
Channel 1 User 5 Access                 | Reserved ()

ilan, SetLanEntry for channel 1 ...
SetUser(2), ret = 0
SetLanEntry(2), ret = 0
LAN1 (eth0)     ip=10.160.3.100 mac=00:25:90:a2:b8:48
SetLanEntry(4), ret = 0
SetLanEntry(3), ret = 0
SetLanEntry(5), ret = 0
SetLanEntry(6), ret = 0
SetLanEntry(7), ret = 0
SetLanEntry(10,1), ret = 0
SetLanEntry(11), ret = 0
gateway         ip=192.168.112.254 mac=52:54:00:5f:2b:b3
WARNING: IP Address and Gateway are not on the same subnet, setting Gateway to previous value
SetLanEntry(12), ret = 0
SetLanEntry(13), ret = 0
SetupSerialOverLan: ret = 0
alert dest      address not specified
ipmiutil lan, completed successfully

Interestingly, these logs show a different IP address (10.160.3.100). Under that IP I can actually reach a web server but I don't know the login or how to set it.

Btw, the IP also shows up via ipmitool:

transactional update # ipmitool lan print
Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 PASSWORD 
Auth Type Enable        : Callback : MD2 MD5 PASSWORD 
                        : User     : MD2 MD5 PASSWORD 
                        : Operator : MD2 MD5 PASSWORD 
                        : Admin    : MD2 MD5 PASSWORD 
                        : OEM      : 
IP Address Source       : Static Address
IP Address              : 10.160.3.100
Subnet Mask             : 255.255.0.0
MAC Address             : 00:25:90:a2:b8:48
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Disabled
Default Gateway IP      : 10.160.255.254
Default Gateway MAC     : 00:00:5e:00:01:02
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 1,2,3,6,7,8,11,12
Cipher Suite Priv Max   : XXXXXXXXXXXXXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM
Bad Password Threshold  : Not Available
Actions #9

Updated by mkittler about 2 years ago

When using 10.160.3.100 I can actually connect from the outside (not from o3 itself but locally via VPN):

ipmitool -I lanplus -C 3 -H 10.160.3.100 -U ADMIN -P pass -b 1 -B 1 power status
Error in open session response message : invalid role

Error: Unable to establish IPMI v2 / RMCP+ session

Now the error message indicates some auth issue. However, I couldn't find out what the problem is.

The users have a role and "Link auth" is set similarly to other hosts:

imagetester:~ # ipmitool user list 0
ID  Name             Callin  Link Auth  IPMI Msg   Channel Priv Limit
1                    true    false      false      Unknown (0x00)
2   ADMIN            true    true       true       ADMINISTRATOR
3   root             true    true       true       ADMINISTRATOR
4   devcon           true    true       true       ADMINISTRATOR
5                    true    false      false      Unknown (0x00)
6   admin            true    true       true       ADMINISTRATOR
7                    true    false      false      Unknown (0x00)
8                    true    false      false      Unknown (0x00)
9                    true    false      false      Unknown (0x00)
10                   true    false      false      Unknown (0x00)
imagetester:~ # ipmitool user list 1
ID  Name             Callin  Link Auth  IPMI Msg   Channel Priv Limit
1                    true    false      false      Unknown (0x00)
2   ADMIN            true    true       true       ADMINISTRATOR
3   root             true    true       true       ADMINISTRATOR
4   devcon           true    true       true       ADMINISTRATOR
5                    true    false      false      Unknown (0x00)
6   admin            true    true       true       ADMINISTRATOR
7                    true    false      false      Unknown (0x00)
8                    true    false      false      Unknown (0x00)
9                    true    false      false      Unknown (0x00)
10                   true    false      false      Unknown (0x00)

Btw, this is what the LAN settings look like:

imagetester:~ # ipmitool lan print
Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 PASSWORD 
Auth Type Enable        : Callback : MD2 MD5 PASSWORD 
                        : User     : MD2 MD5 PASSWORD 
                        : Operator : MD2 MD5 PASSWORD 
                        : Admin    : MD2 MD5 PASSWORD 
                        : OEM      : 
IP Address Source       : Static Address
IP Address              : 10.160.3.100
Subnet Mask             : 255.255.0.0
MAC Address             : 00:25:90:a2:b8:48
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Disabled
Default Gateway IP      : 10.160.255.254
Default Gateway MAC     : 00:00:5e:00:01:02
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 1,2,3,6,7,8,11,12
Cipher Suite Priv Max   : XXXXXXXXXXXXXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM
Bad Password Threshold  : Not Available

For other workers the IP printed there matches what we have documented. So I assume this really is the IP (and 10.160.65.195 is wrong).
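
The same comparison can be scripted. The following is a rough sketch, assuming the `ipmitool lan print` output format shown above; the function names are made up for illustration:

```shell
# Print the BMC's own IP address from `ipmitool lan print` output on stdin.
# The pattern matches "IP Address   : ..." but not "IP Address Source : ...".
bmc_ip_from_lan_print() {
    awk '/^IP Address *:/ { print $NF }'
}

# check_documented_bmc_ip DOCUMENTED_IP
# Reads `ipmitool lan print` output on stdin and compares the reported
# address with the documented one. Returns 1 on mismatch.
check_documented_bmc_ip() {
    documented=$1
    actual=$(bmc_ip_from_lan_print)
    if [ "$documented" = "$actual" ]; then
        echo "OK: BMC reports $actual"
    else
        echo "MISMATCH: documented $documented, BMC reports $actual"
        return 1
    fi
}

# Possible use on the worker itself:
# ipmitool lan print | check_documented_bmc_ip 10.160.65.195
```

With the output above this would report a mismatch, since the BMC reports 10.160.3.100 rather than the documented 10.160.65.195.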

Actions #10

Updated by mkittler about 2 years ago

Interestingly, on other workers no channel 0 exists (only channel 1):

openqa-aarch64:~ # ipmitool channel info 0
IPMI command failed: Invalid data field in request
Unable to Get Channel Info
openqa-aarch64:~ # ipmitool channel info 1
Channel 0x1 info:
  Channel Medium Type   : 802.3 LAN
  Channel Protocol Type : IPMB-1.0
  Session Support       : multi-session
  Active Session Count  : 0
  Protocol Vendor ID    : 7154
  Volatile(active) Settings
    Alerting            : disabled
    Per-message Auth    : enabled
    User Level Auth     : enabled
    Access Mode         : always available
  Non-Volatile Settings
    Alerting            : enabled
    Per-message Auth    : enabled
    User Level Auth     : enabled
    Access Mode         : disabled

Maybe channel 0 is interfering?

Actions #11

Updated by livdywan about 2 years ago

  • Status changed from Workable to In Progress

Let's assume work is being done here

Actions #12

Updated by bmwiedemann about 2 years ago

imagetester-ipmi.suse.de is the internal hostname for 10.160.3.100.
It responds to ping and HTTP, so we probably just need to get the IPMI credentials right.

Did you use the new password that okurz set some months ago?

Actions #13

Updated by mkittler about 2 years ago

I was actually the one who changed the IPMI passwords on all machines last time. So yes, I've been using the current password, and I'm sure that if @okurz had changed it in the meantime he'd have updated workerconf.sls. I also tried to set the password to something else to be sure, but it didn't work.

Btw, I've accidentally mentioned the password here so I'm going to change the IPMI password on all hosts again. I'm about to do that now. When testing whether all hosts are accessible via IPMI I've also noticed that fsp1-malbec.arch.suse.de cannot be accessed via LAN.

Actions #14

Updated by openqa_review about 2 years ago

  • Due date set to 2022-03-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by livdywan about 2 years ago

  • Assignee changed from mkittler to livdywan

I'll give it a go since @mkittler exhausted all options, and confirm whether we can get this sorted or need physical access to reset the machine.

Actions #16

Updated by livdywan about 2 years ago

I can confirm that the login works on http://imagetester-ipmi.suse.de, although ipmitool gives me "invalid role", which as per documentation should mean that the login is wrong. The same happens using the IP reported by the web UI / the generated jnlp config. My suspicion is that the IP is not static, and perhaps there's some re-routing going on but connections via other IPs are not accepted.

Still, I don't get why I can't login when the password is most definitely correct...

Actions #17

Updated by mkittler about 2 years ago

Oh, you're right - with the "new new" password (which I've set as mentioned in #107917#note-13) login works via the web UI. Maybe the old password and the "first new" password I've tried contained too many or unsupported characters. That's already an improvement. I've triggered a reboot of the IPMI device over the web UI. Maybe it helps.

Actions #18

Updated by mkittler about 2 years ago

  • Assignee changed from livdywan to mkittler

It didn't help. I'll try a few other things over the web UI.

Actions #19

Updated by mkittler about 2 years ago

  • Status changed from In Progress to Resolved

I did a factory reset over the web UI, set the password again and now it works.

Actions #21

Updated by mkittler about 2 years ago

  • Status changed from Feedback to Resolved

The PRs have been merged. I suppose I can now actually resolve the ticket. Note that we can still think of implementing a retry in monitor-o3, but I consider that out of scope for fixing the immediate issue.

Actions #22

Updated by livdywan about 2 years ago

  • Copied to action #108671: Resilient IPMI recovery of o3 machines in monitor-o3 size:M added
Actions #23

Updated by okurz 8 months ago
