action #80544

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #78206: [epic] 2020-11-18 nbg power outage aftermath

Ensure that IPMI for powerqaworker-qam works reliably

Added by okurz 8 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

The parent ticket #78206 showed that IPMI access to powerqaworker-qam-1.qa was not possible, at least not for okurz when following the command documented in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L547 . We should ensure that IPMI works reliably and that our documentation is correct.

Acceptance criteria

Suggestions

  • Have multiple persons crosscheck whether it works as documented (it already does not work for okurz)
  • Find out what the correct command is
  • Test that this works reliably, e.g. with a for-loop that calls the command multiple times
  • Correct the documentation if necessary
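
The for-loop idea from the suggestions could look roughly like the sketch below. The function name repeat_check is made up for illustration, and the command to repeat is left to the caller; "chassis power status" in the usage comment is only one possible read-only ipmitool call, and the placeholders must be filled in with the documented values.

```shell
# Run a command repeatedly and report how often it failed.
# Usage: repeat_check COUNT COMMAND [ARGS...]
# (repeat_check is a hypothetical helper name, not an existing tool.)
repeat_check() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        # Discard the command's output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures of $count attempts failed"
    # Succeed only if every single attempt succeeded.
    [ "$failures" -eq 0 ]
}

# Example (placeholders to be replaced):
# repeat_check 20 ipmitool -I lanplus -C 3 -H <bmc-host> -P <password> chassis power status
```

A non-zero exit status then signals that at least one attempt failed, which makes the check usable in scripts.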

History

#1 Updated by cdywan 8 months ago

  • Description updated (diff)

Trying with -vvv for higher verbosity I can see this error some of the time:

Error in open session response message : insufficient resources for session

And sometimes:

> RAKP 2 HMAC is invalid

The second error suggests that the password is wrong, so I added the password used with other commands and got this:

RAKP 2 message indicates an error : illegal parameter

Which probably also means wrong credentials.

That said, I tried to change the password via the web interface for testing purposes and it still won't connect - I changed it back afterwards.

A suggestion I found was ipmitool channel setaccess 1 2 link=on ipmi=on callin=on privilege=4; ipmitool user enable 2 to fix the login. This probably needs to be done by someone with access to the machine, though - currently I can't get in at all.

#2 Updated by cdywan 8 months ago

So nicksinger proposed an MR to fix the password. And it seems that I can currently log in without problems... leaving aside that I'm looking at kernel panics...

#3 Updated by cdywan 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

#4 Updated by okurz 8 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/289 merged. Please keep in mind to use up-to-date ipmi commands, e.g. including the "-C 3" parameter. I suggest following https://gitlab.suse.de/openqa/salt-pillars-openqa#get-ipmi-definition-aliases and using the handy shell aliases.

Then I found an interesting command: sol looptest. By default this seems to try connecting 200 times in a row. Maybe good to use that for testing stability.

This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine, whereas ipmi-fsp1-powerqaworker-qam.qa sol looptest failed:

remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
remain loop test counter: 199
Error: No response activating SOL payload
SOL looptest failed: -1

Also I found that next to "ipmitool" there is also "ipmiutil" with a slightly different syntax but maybe we want that in some cases. E.g. https://www.systutorials.com/docs/linux/man/8-isol/ shows that there is a parameter to specify an "input file" for scripted commands on the serial console.

#5 Updated by cdywan 8 months ago

  • Status changed from In Progress to Feedback

okurz wrote:

This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine and ipmi-fsp1-powerqaworker-qam.qa sol looptest:

remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
remain loop test counter: 199
Error: No response activating SOL payload
SOL looptest failed: -1

Sadly looptest only confirms how unreliable fsp1-powerqaworker-qam.qa.suse.de is, and it leaves an active session behind after breaking.

Also I found that next to "ipmitool" there is also "ipmiutil" with a slightly different syntax but maybe we want that in some cases. E.g. https://www.systutorials.com/docs/linux/man/8-isol/ shows that there is a parameter to specify an "input file" for scripted commands on the serial console.

I tried this out, e.g. ipmiutil sol -a -N $HOSTNAME but it seems like on openSUSE it's not built with lanplus 🤔

ipmiutil sol ver 3.16
-- BMC version 7.68, IPMI version 2.0 
2.0 LanPlus module not available, trying 1.5 SOL instead
lanplus not configured
ipmiutil sol, invalid lan parameter

#6 Updated by okurz 8 months ago

cdywan wrote:

Sadly looptest only confirms how unreliable fsp1-powerqaworker-qam.qa.suse.de is, and it leaves an active session behind after breaking.

It's good that looptest shows the unreliability, right? That it leaves a session behind should not be a problem. A good state is just a sol deactivate away :)
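
The "sol deactivate" cleanup could be chained to looptest along the following lines. This is only a sketch: looptest_with_cleanup is a hypothetical helper name, and it expects the full ipmitool command prefix as arguments, because shell aliases are not expanded when passed this way.

```shell
# Run "sol looptest" and, if it fails, deactivate any leftover SOL session.
# Usage: looptest_with_cleanup IPMI_COMMAND [ARGS...]
# (looptest_with_cleanup is a hypothetical helper name.)
looptest_with_cleanup() {
    if "$@" sol looptest; then
        return 0
    fi
    echo "looptest failed, deactivating leftover SOL session" >&2
    # Deactivation can itself fail when no session is active; ignore that.
    "$@" sol deactivate || true
    return 1
}

# Example (placeholders to be replaced):
# looptest_with_cleanup ipmitool -I lanplus -C 3 -H <bmc-host> -P <password>
```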

By the way, there seem to be some ipmi parameters that we might benefit from, maybe timeout, retry, intervals, etc.
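
As a rough sketch of what such parameters could look like with ipmitool: its lan/lanplus interfaces accept -N (seconds between retransmissions) and -R (number of retries). Whether larger values help with this BMC has not been tested here. The helper name ipmi_with_retries and the IPMITOOL override variable are made up for illustration, and -E is used on the assumption that the password is exported as IPMI_PASSWORD.

```shell
# Call ipmitool over lanplus with explicit retransmission settings.
# Usage: ipmi_with_retries HOST TIMEOUT_SECONDS RETRIES SUBCOMMAND...
# (ipmi_with_retries and the IPMITOOL variable are hypothetical names.)
ipmi_with_retries() {
    host=$1
    timeout=$2
    retries=$3
    shift 3
    # -N: seconds between retransmissions, -R: number of retries,
    # -E: read the password from the IPMI_PASSWORD environment variable.
    "${IPMITOOL:-ipmitool}" -I lanplus -C 3 -E -H "$host" \
        -N "$timeout" -R "$retries" "$@"
}

# Example:
# IPMI_PASSWORD=<password> ipmi_with_retries fsp1-powerqaworker-qam.qa.suse.de 5 10 sol looptest
```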

Can you say why you set the ticket to feedback now, what feedback from others do you expect now?

#7 Updated by cdywan 7 months ago

okurz wrote:

Can you say why you set the ticket to feedback now, what feedback from others do you expect now?

It's in Feedback because it's not clear to me what the solution looks like. The commands work but the connection is not reliable, and the AC suggests we're looking for both. So far we've only found more ways to confirm that known-to-work parameters don't work unless lady luck is sitting next to you.

#8 Updated by cdywan 7 months ago

  • Assignee deleted (cdywan)
  • Start date deleted (2020-11-27)

#9 Updated by okurz 7 months ago

  • Status changed from Feedback to Workable

#10 Updated by mkittler 4 months ago

I've just read the ticket description. The line numbers are outdated so I'm not sure which commands don't work exactly. So I've just searched for powerqaworker-qam-1 in workerconf.sls and tested the IPMI command. It works but it is indeed quite slow. I doubt we can do anything about it, though. By the way, it seems that the worker is still stuck in a petitboot shell.

#11 Updated by okurz 4 months ago

mkittler wrote:

So I've […] tested the IPMI command. It works but it is indeed quite slow. I doubt we can do anything about it, though.

Slow is ok as long as it's reliable. And "we can not do anything about it" is not enough to be able to resolve the issue. The most extreme measure would be to say that we get rid of the machine as we can't reliably control it, in contrast to other machines. But I can imagine there could be easier measures to try first, e.g. update the IPMI firmware, try a different or more recent ipmitool client and make sure everyone knows which versions work and which don't. Have you read the previous comments? I'm asking because we already tried out different things. Did you also try sol looptest?

#12 Updated by okurz 4 months ago

Just ran ipmi sol looptest again and got

$ ipmi-fsp1-powerqaworker-qam.qa sol looptest
remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
remain loop test counter: 199
…
[SOL Session operational.  Use ~? for help]
remain loop test counter: 189
[SOL Session operational.  Use ~? for help]
remain loop test counter: 188
Error: Unexpected data length (0) received in payload activation response
SOL looptest failed: -1

So not that reliable. I'm running ipmitool 1.8.18 from openSUSE Leap 15.2, stable package version 1.8.18+git20200204.7ccea28-lp152.1.3
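
To compare runs across ipmitool versions, the looptest output could be condensed into one line per run. The following is a sketch that assumes the output format shown in this ticket; looptest_summary is a hypothetical helper name, and the counts are only meaningful for complete, unabridged output.

```shell
# Summarize "sol looptest" output read from stdin.
# (looptest_summary is a hypothetical helper name.)
looptest_summary() {
    awk '
        # Each successful SOL activation prints this banner once.
        /SOL Session operational/    { activated++ }
        # Remember the most recent value of the countdown.
        /^remain loop test counter:/ { last = $NF }
        # ipmitool prints this line when the loop aborts.
        /^SOL looptest failed/       { failed = 1 }
        END {
            printf "activations=%d last_counter=%s result=%s\n",
                activated, last, (failed ? "failed" : "ok")
        }
    '
}

# Example (placeholders to be replaced):
# ipmitool -I lanplus -C 3 -H <bmc-host> -P <password> sol looptest 2>&1 | looptest_summary
```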

Together with mkittler I started an experiment: both of us ran sol looptest in parallel. For mkittler, looptest continued fine for all 200 iterations and only occasionally showed the info message "SOL payload already de-activated".

My run aborted with:

…
[SOL Session operational.  Use ~? for help]
remain loop test counter: 187
[SOL Session operational.  Use ~? for help]
Info: SOL payload already de-activated
remain loop test counter: 186
Error: Unexpected data length (0) received in payload activation response
SOL looptest failed: -1
Close Session command failed: Unknown (0x80)

#13 Updated by mkittler 4 months ago

Here I could run it 200 times without problems (using ipmitool -I lanplus -C 3 -H fsp1-powerqaworker-qam.qa.suse.de -P admin sol looptest). I'm using ipmitool 1.8.18+git20200916.1245aaa387dc-2.4 as provided by TW.

Btw, when closing the VPN connection I get

ipmitool -I lanplus -C 3 -H fsp1-powerqaworker-qam.qa.suse.de -P admin sol looptest
remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
remain loop test counter: 199
[SOL Session operational.  Use ~? for help]
remain loop test counter: 198
[SOL Session operational.  Use ~? for help]
IPMI LAN send command failed
Error: No response de-activating SOL payload
remain loop test counter: 197
IPMI LAN send command failed
Error: No response activating SOL payload
SOL looptest failed: -1
IPMI LAN send command failed
Close Session command failed

which is different from the error okurz got.


After a 2nd attempt I've got:

[SOL Session operational.  Use ~? for help]
remain loop test counter: 4
Info: SOL payload already active on another session
SOL looptest failed: -1

#14 Updated by okurz 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

To try out a more recent version of ipmitool I did:

sudo zypper ar -p 105 https://download.opensuse.org/repositories/systemsmanagement/openSUSE_Leap_15.2/systemsmanagement.repo
sudo zypper -n in --force ipmitool-1.8.18+git20200916.1245aaa387dc-lp152.3.1

Then I got:

$ ipmi-fsp1-powerqaworker-qam.qa sol looptest
remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
…
remain loop test counter: 178
[SOL Session operational.  Use ~? for help]
remain loop test counter: 177
Info: SOL payload already active on another session
SOL looptest failed: -1

and

$ ipmi-fsp1-powerqaworker-qam.qa sol looptest
remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
…
[SOL Session operational.  Use ~? for help]
remain loop test counter: 149
Info: SOL payload already active on another session
SOL looptest failed: -1

but I have the suspicion that the newer ipmitool version is more stable. As confirmed by mkittler and me, with the updated cipher and an up-to-date ipmitool this is reliable enough.

Next step: Add an entry to our wiki for "best practices".
