action #80544
openQA Project - coordination #80142: [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #78206: [epic] 2020-11-18 nbg power outage aftermath
Ensure that IPMI for powerqaworker-qam works reliably
0%
Description
Motivation
The parent ticket #78206 showed that IPMI access to powerqaworker-qam-1.qa was not possible, at least not for okurz when following https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L547 . We should ensure that IPMI works reliably and that our documentation is correct.
Acceptance criteria
- AC1: IPMI commands as specified in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L579 work reliably for more than one person
Suggestions
- Have multiple persons crosscheck whether it works as documented (it already does not work for okurz)
- Find out what the correct command is
- Test that this works reliably, e.g. with a for-loop that calls the command multiple times
- Correct documentation if necessary
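The for-loop suggested above could look roughly like the following sketch. repeat_check is a hypothetical helper name, not something taken from the salt pillars; it runs any command a given number of times and reports how many attempts failed.

```shell
# Run a command COUNT times and report how many attempts failed.
# Usage: repeat_check COUNT COMMAND [ARGS...]
# Returns 0 only if every attempt succeeded.
repeat_check() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        # Discard the command's output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$count attempts failed"
    [ "$failures" -eq 0 ]
}
```

It could be called for example as repeat_check 20 ipmitool -I lanplus -C 3 -H "$IPMI_HOST" -U "$IPMI_USER" -P "$IPMI_PASSWORD" chassis power status, where the three variables are placeholders for the values documented in workerconf.sls.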
History
#1
Updated by cdywan about 1 month ago
- Description updated
Trying with -vvv for higher verbosity I can see this error some of the time:
Error in open session response message : insufficient resources for session
And sometimes:
RAKP 2 HMAC is invalid
The second one suggests that the password is wrong, so I added the password used with other commands and got this:
RAKP 2 message indicates an error : illegal parameter
Which probably also means wrong credentials.
That said, I tried changing the password via the web interface for testing purposes and it still won't connect. I changed it back afterwards.
A suggestion I found to fix the login was ipmitool channel setaccess 1 2 link=on ipmi=on callin=on privilege=4; ipmitool user enable 2
This probably needs to be done by someone with access to the machine, though. Currently I can't get in at all.
#2
Updated by cdywan about 1 month ago
So nicksinger proposed an MR to fix the password. And it seems that I can currently log in without problems... leaving aside that I'm looking at kernel panics...
#3
Updated by cdywan about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to cdywan
#4
Updated by okurz about 1 month ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/289 merged. Please keep in mind to use up-to-date ipmi commands, e.g. to include the "-C 3" parameter. I suggest following https://gitlab.suse.de/openqa/salt-pillars-openqa#get-ipmi-definition-aliases and using the handy shell aliases.
Then I found an interesting command, sol looptest. By default this seems to try connecting 200 times in a row. Maybe good to use that for testing stability.
This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine and ipmi-fsp1-powerqaworker-qam.qa sol looptest did not:
remain loop test counter: 200
[SOL Session operational. Use ~? for help]
remain loop test counter: 199
Error: No response activating SOL payload
SOL looptest failed: -1
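To compare hosts, the looptest output can be condensed into one line per run. The following is a rough sketch that assumes the output format shown above (one "remain loop test counter" entry per started iteration and "SOL looptest failed" on error); summarize_looptest is a hypothetical helper name.

```shell
# Read "sol looptest" output on stdin and print a one-line summary.
# Returns 1 if the output contains a looptest failure, 0 otherwise.
summarize_looptest() {
    awk '
        # Count every counter entry, even if several share one line.
        /remain loop test counter:/ {
            started += gsub(/remain loop test counter: [0-9]+/, "&")
        }
        /SOL looptest failed/ { failed = 1 }
        END {
            if (failed) {
                printf "failed after %d started iterations\n", started
                exit 1
            }
            printf "ok, %d iterations started\n", started
        }
    '
}
```

In an interactive shell with the aliases loaded this might be used as ipmi-fsp1-powerqaworker-qam.qa sol looptest 2>&1 | summarize_looptest.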
I also found that besides "ipmitool" there is "ipmiutil", which has a slightly different syntax but might be useful in some cases. E.g. https://www.systutorials.com/docs/linux/man/8-isol/ shows that there is a parameter to specify an "input file" for scripted commands on the serial console.
#5
Updated by cdywan about 1 month ago
- Status changed from In Progress to Feedback
okurz wrote:
> This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine and ipmi-fsp1-powerqaworker-qam.qa sol looptest did not:
> remain loop test counter: 200
> [SOL Session operational. Use ~? for help]
> remain loop test counter: 199
> Error: No response activating SOL payload
> SOL looptest failed: -1
Sadly looptest only confirms how unreliable fsp1-powerqaworker-qam.qa.suse.de is, and it leaves an active session behind after breaking.
> I also found that besides "ipmitool" there is "ipmiutil", which has a slightly different syntax but might be useful in some cases. E.g. https://www.systutorials.com/docs/linux/man/8-isol/ shows that there is a parameter to specify an "input file" for scripted commands on the serial console.
I tried this out, e.g. ipmiutil sol -a -N $HOSTNAME, but it seems like on openSUSE it's not built with lanplus 🤔
ipmiutil sol ver 3.16
-- BMC version 7.68, IPMI version 2.0
2.0 LanPlus module not available, trying 1.5 SOL instead
lanplus not configured
ipmiutil sol, invalid lan parameter
#6
Updated by okurz about 1 month ago
cdywan wrote:
> okurz wrote:
> > This already showed me that ipmi-openqaworker12-ipmi sol looptest worked fine and ipmi-fsp1-powerqaworker-qam.qa sol looptest did not:
> > remain loop test counter: 200
> > [SOL Session operational. Use ~? for help]
> > remain loop test counter: 199
> > Error: No response activating SOL payload
> > SOL looptest failed: -1
> Sadly looptest only confirms how unreliable fsp1-powerqaworker-qam.qa.suse.de is, and it leaves an active session behind after breaking.
That the loop test shows that it's unreliable is good, right? That it leaves a session behind should not be a problem. A good state is just a sol deactivate away :)
By the way, there seem to be some ipmi parameters that we might benefit from, maybe timeout, retry, intervals, etc.
Can you say why you set the ticket to feedback now, what feedback from others do you expect now?
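Regarding the timeout and retry parameters mentioned above: as far as the ipmitool man page goes, the lan/lanplus interfaces accept -R <count> for the number of retries and -N <seconds> for the interval between retransmissions. A possible sketch of a more patient wrapper follows. ipmi_patient and the variables IPMITOOL, IPMI_RETRIES, IPMI_RETRY_INTERVAL, IPMI_HOST, IPMI_USER and IPMI_PASSWORD are made-up names, and the defaults 8 and 3 are arbitrary starting points, not tested recommendations.

```shell
# Hypothetical wrapper around ipmitool with more retries and a longer
# retransmission interval than the defaults. All variable names and the
# default values below are assumptions for illustration only.
ipmi_patient() {
    "${IPMITOOL:-ipmitool}" -I lanplus -C 3 \
        -R "${IPMI_RETRIES:-8}" \
        -N "${IPMI_RETRY_INTERVAL:-3}" \
        -H "$IPMI_HOST" -U "$IPMI_USER" -P "$IPMI_PASSWORD" \
        "$@"
}
```

With the placeholder variables set, ipmi_patient sol deactivate would clean up a leftover session and ipmi_patient sol looptest would rerun the stability check with the relaxed settings. Whether this actually helps on fsp1-powerqaworker-qam.qa would still need to be tested.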
#7
Updated by cdywan 23 days ago
okurz wrote:
> Can you say why you set the ticket to feedback now, what feedback from others do you expect now?
It's Feedback because it's not clear to me what the solution looks like. The commands work but the connection is not reliable, and the AC suggests we're looking for both. So far we've only found more ways to confirm that known-to-work parameters don't work unless lady luck is sitting next to you.