action #80544

openQA Project - coordination #80142: [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #78206: [epic] 2020-11-18 nbg power outage aftermath

Ensure that IPMI for powerqaworker-qam works reliably

Added by okurz about 2 months ago. Updated 9 days ago.

Status: Workable
Priority: Normal
Assignee: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

Parent ticket #78206 showed that IPMI access to powerqaworker-qam-1.qa was not possible, at least not for okurz when following https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L547 . We should ensure that IPMI works reliably and that our documentation is correct.

Acceptance criteria

Suggestions

  • Multiple people cross-check whether it works as documented (it already does not work for okurz)
  • Find out what the correct command is
  • Test that this works reliably, e.g. with a for-loop that calls it multiple times
  • Correct the documentation if necessary
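
The for-loop idea from the suggestions could be sketched roughly as below. This is only a sketch under assumptions: repeat_check is a hypothetical helper name, and BMC_HOST, BMC_USER and BMC_PASS in the usage comment are placeholders, not values taken from the pillar.

```shell
# Run a command several times and report how many attempts succeeded.
# Usage: repeat_check ATTEMPTS COMMAND [ARGS...]
# Example (placeholders, adjust to the documented command):
#   repeat_check 20 ipmitool -I lanplus -C 3 -H "$BMC_HOST" \
#       -U "$BMC_USER" -P "$BMC_PASS" chassis power status
repeat_check() {
    attempts=$1
    shift
    ok=0
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Count the attempt as successful only if the command exits with 0.
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok/$attempts succeeded"
    # Return success only if every single attempt worked.
    [ "$ok" -eq "$attempts" ]
}
```

A result below 100% on a cheap command such as chassis power status would already hint that the problem is in the connection and not in the SOL handling.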

History

#1 Updated by cdywan about 1 month ago

  • Description updated (diff)

Trying with -vvv for higher verbosity I can see this error some of the time:

Error in open session response message : insufficient resources for session

And sometimes:

> RAKP 2 HMAC is invalid

The second suggests that the password is wrong, so I added the password used with other commands and got this:

RAKP 2 message indicates an error : illegal parameter

Which probably also means wrong credentials.

That said, I tried to change the password via the web interface for testing purposes and it still won't connect - I changed it back afterwards.

A suggestion I found to fix the login was: ipmitool channel setaccess 1 2 link=on ipmi=on callin=on privilege=4; ipmitool user enable 2. This probably needs to be done by someone with access to the machine, though - currently I can't get in at all.

#2 Updated by cdywan about 1 month ago

So nicksinger proposed an MR to fix the password, and it seems that I can currently log in without problems... leaving aside that I'm looking at kernel panics...

#3 Updated by cdywan about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

#4 Updated by okurz about 1 month ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/289 merged. Please keep in mind to use up-to-date IPMI commands, e.g. to include the "-C 3" parameter. I suggest following https://gitlab.suse.de/openqa/salt-pillars-openqa#get-ipmi-definition-aliases and using the handy shell aliases.

Then I found an interesting command, sol looptest. By default it seems to try connecting 200 times in a row. It may be good to use that for testing stability.

This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine, whereas ipmi-fsp1-powerqaworker-qam.qa sol looptest failed:

remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
remain loop test counter: 199
Error: No response activating SOL payload
SOL looptest failed: -1

I also found that besides "ipmitool" there is "ipmiutil", which has a slightly different syntax but might be useful in some cases. E.g. https://www.systutorials.com/docs/linux/man/8-isol/ shows that there is a parameter to specify an "input file" for scripted commands on the serial console.

#5 Updated by cdywan about 1 month ago

  • Status changed from In Progress to Feedback

okurz wrote:

This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine, whereas ipmi-fsp1-powerqaworker-qam.qa sol looptest failed:

remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
remain loop test counter: 199
Error: No response activating SOL payload
SOL looptest failed: -1

Sadly looptest only confirms how unreliable fsp1-powerqaworker-qam.qa.suse.de is, and it leaves an active session behind after breaking.

I also found that besides "ipmitool" there is "ipmiutil", which has a slightly different syntax but might be useful in some cases. E.g. https://www.systutorials.com/docs/linux/man/8-isol/ shows that there is a parameter to specify an "input file" for scripted commands on the serial console.

I tried this out, e.g. with ipmiutil sol -a -N $HOSTNAME, but it seems that on openSUSE it is not built with lanplus 🤔

ipmiutil sol ver 3.16
-- BMC version 7.68, IPMI version 2.0 
2.0 LanPlus module not available, trying 1.5 SOL instead
lanplus not configured
ipmiutil sol, invalid lan parameter

#6 Updated by okurz about 1 month ago

cdywan wrote:

okurz wrote:

This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine, whereas ipmi-fsp1-powerqaworker-qam.qa sol looptest failed:

remain loop test counter: 200
[SOL Session operational.  Use ~? for help]
remain loop test counter: 199
Error: No response activating SOL payload
SOL looptest failed: -1

Sadly looptest only confirms how unreliable fsp1-powerqaworker-qam.qa.suse.de is, and it leaves an active session behind after breaking.

It is good that the loop test shows it's unreliable, right? That it leaves a session behind should not be a problem. A good state is just a sol deactivate away :)
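
To make the cleanup automatic, the loop test could be wrapped so that a leftover session is deactivated whenever the test breaks. The following is a minimal sketch, assuming the caller passes the full ipmitool command prefix (or one of the shell aliases' underlying commands) as arguments; looptest_with_cleanup is a hypothetical name.

```shell
# Run "sol looptest" through the given ipmitool command prefix and, if it
# fails, deactivate the SOL session it may have left behind.
# Usage: looptest_with_cleanup COMMAND [ARGS...]
looptest_with_cleanup() {
    if "$@" sol looptest; then
        return 0
    fi
    # Best effort: ignore any error reported when no session is active.
    "$@" sol deactivate >/dev/null 2>&1 || true
    return 1
}
```

The function still returns a failure when the loop test breaks, so it stays usable as a stability check while leaving the BMC in a clean state for the next attempt.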

By the way, there seem to be some IPMI parameters that we might benefit from, maybe timeout, retry, intervals, etc.
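
As a starting point for experimenting with such parameters: to my understanding of the ipmitool man page, -N sets the seconds between retransmissions and -R the number of retries for the lan/lanplus interfaces. A sketch of a wrapper, where the function name ipmi, the variables IPMITOOL, IPMI_TIMEOUT, IPMI_RETRIES, BMC_HOST, BMC_USER and BMC_PASS, and the default values 5 and 10 are all assumptions chosen for illustration:

```shell
# Wrapper around ipmitool with adjustable retransmission settings.
# All variable names and default values here are illustrative assumptions.
#   IPMITOOL      binary to call (default: ipmitool)
#   IPMI_TIMEOUT  value for -N, seconds between retransmissions
#   IPMI_RETRIES  value for -R, number of retries
ipmi() {
    "${IPMITOOL:-ipmitool}" -I lanplus -C 3 \
        -N "${IPMI_TIMEOUT:-5}" -R "${IPMI_RETRIES:-10}" \
        -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" "$@"
}
```

With this in place, something like IPMI_TIMEOUT=10 ipmi sol looptest could be compared against the defaults to see whether more generous settings make any difference for fsp1-powerqaworker-qam.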

Can you say why you set the ticket to Feedback now? What feedback from others do you expect?

#7 Updated by cdywan 23 days ago

okurz wrote:

Can you say why you set the ticket to Feedback now? What feedback from others do you expect?

It's Feedback because it's not clear to me what the solution looks like. The commands work but the connection is not reliable, and the AC suggests we're looking for both. So far we've only found more ways to confirm that known-to-work parameters don't work unless lady luck is sitting next to you.

#8 Updated by cdywan 23 days ago

  • Assignee deleted (cdywan)
  • Start date deleted (2020-11-27)

#9 Updated by okurz 9 days ago

  • Status changed from Feedback to Workable
