action #80542
closed
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #78206: [epic] 2020-11-18 nbg power outage aftermath
Configure "automatic power-on" after power loss for openqaworker1 (and all others)
Description
Motivation
During the next power outage we want openqaworker1 and others to automatically power up when power returns.
Acceptance criteria
- AC1: openqaworker1 is configured to start automatically when power returns
- AC2: all other machines are configured like that as well
Suggestion
- Check whether you can reach a firmware configuration menu on the machine "openqaworker1" that allows enabling automatic power-on after power returns
- If necessary, test with the help of EngInfra that this works
Updated by okurz over 3 years ago
- Subject changed from Configure "automatic power-on" after power loss for openqaworker1 to Configure "automatic power-on" after power loss for openqaworker1 (and all others)
- Description updated (diff)
Found out with mkittler that we can use IPMI to get and set this, e.g.
ipmi-openqaworker-arm-3-ipmi chassis status | grep 'Power Restore Policy'
returns
Power Restore Policy : always-on
and we can set it with
ipmi-openqaworker-arm-3-ipmi chassis policy always-on
Updated by okurz over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
We should be able to call sudo ipmitool chassis policy always-on directly on each worker, maybe from salt.
I found https://docs.saltproject.io/en/latest/ref/modules/all/salt.modules.ipmi.html#salt.modules.ipmi.set_power which needs the Python module pyghmi and does not provide a state, so I guess we are better off controlling the commands ourselves.
I checked on openqaworker11 how expensive the operation is:
# time sh -c "ipmitool chassis status | grep 'Power Restore Policy.*always-on' || ipmitool chassis policy always-on"
Set chassis power restore policy to always-on
real 0m0.185s
user 0m0.100s
sys 0m0.033s
openqaworker11:/home/okurz # time sh -c "ipmitool chassis status | grep 'Power Restore Policy.*always-on' || ipmitool chassis policy always-on"
Power Restore Policy : always-on
real 0m0.095s
user 0m0.050s
sys 0m0.022s
So a single check call takes 95ms and the combined check-and-set operation takes 185ms. But salt is not meant to rerun the command every time; it should check a state instead. A proper way would probably be a custom grain Python function to check the state. I guess I will resort to the manual action and document it on a wiki, maybe?
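The check-then-set idea from the timing test above could be wrapped in a small helper, roughly like this. It is a minimal sketch, not what was deployed: the function names and the IPMITOOL override variable are made up for illustration, and it assumes "chassis status" prints the policy in the "Power Restore Policy : always-on" form shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: only change the power restore policy when it differs.
# IPMITOOL is an assumed override hook, e.g. IPMITOOL="sudo ipmitool".
IPMITOOL="${IPMITOOL:-ipmitool}"

current_policy() {
    # Extract the value from a line like "Power Restore Policy : always-on"
    $IPMITOOL chassis status | sed -n 's/^Power Restore Policy *: *//p'
}

ensure_policy() {
    wanted="$1"
    if [ "$(current_policy)" = "$wanted" ]; then
        echo "unchanged: $wanted"
    else
        # Only reached when the policy really needs to change
        $IPMITOOL chassis policy "$wanted" >/dev/null && echo "changed: $wanted"
    fi
}

# Example call (commented out so that sourcing this file has no side effects):
# ensure_policy always-on
```

Because the helper reports "changed" or "unchanged", it could also serve as the basis for a state-like check if this is ever moved into configuration management.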
Updated by okurz over 3 years ago
- Status changed from In Progress to Resolved
Simulating setting the "always-on" policy on all machines (the echo only prints the command that would be run):
IFS=$'\n'; for i in $(sed 's/^alias .*="\(.*\)"/\1/' ~/.openqa_ipmi_aliases); do (eval "$i" chassis status | grep 'Power Restore Policy.*always-on' || echo eval "$i" chassis policy always-on); done
relying on https://gitlab.suse.de/openqa/salt-pillars-openqa/#get-ipmi-definition-aliases
"previous" is likely a better policy in case someone deliberately wants to power off the machines and only power them on one by one, e.g. after some work on the electrical infrastructure. Also, some of our machines do not support the commands, e.g. PowerPC (with LPARs and such). So I guess we are better off with the manual approach for now. Checking current status:
IFS=$'\n'; for i in $(sed 's/^alias .*="\(.*\)"/\1/' ~/.openqa_ipmi_aliases); do echo "# $i" && eval "$i" chassis status | grep 'Power Restore Policy'; done
reveals a mixture of "previous", "always-on" and "always-off". Setting "previous" on all:
IFS=$'\n'; for i in $(sed 's/^alias .*="\(.*\)"/\1/' ~/.openqa_ipmi_aliases); do eval "$i" chassis policy previous; done
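For repeated use, the one-liners above could be turned into a small script with a dry-run switch. This is only a sketch under assumptions: it expects every relevant line of the alias file to have the form alias NAME="COMMAND ...", and ALIAS_FILE, DRY_RUN and the function names are hypothetical, not part of the existing setup.

```shell
#!/bin/sh
# Sketch: print or apply a power restore policy for every IPMI alias.
# ALIAS_FILE and DRY_RUN are assumed knobs, not existing conventions.
ALIAS_FILE="${ALIAS_FILE:-$HOME/.openqa_ipmi_aliases}"
DRY_RUN="${DRY_RUN:-1}"   # 1 only prints the commands, 0 executes them

list_ipmi_commands() {
    # Print the command between the quotes of each alias line, skip other lines
    sed -n 's/^alias [^=]*="\(.*\)"$/\1/p' "$ALIAS_FILE"
}

set_policy_everywhere() {
    policy="$1"
    list_ipmi_commands | while IFS= read -r cmd; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "$cmd chassis policy $policy"
        else
            # eval because the alias body is a complete command line;
            # machines that do not support the command are only reported
            eval "$cmd chassis policy $policy" </dev/null || echo "failed: $cmd" >&2
        fi
    done
}

# Example: DRY_RUN=1 set_policy_everywhere previous
```

Reading the commands line by line avoids changing IFS globally, and a failing machine (e.g. one without IPMI support) does not stop the loop.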
Updated wiki https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Remote-management-with-IPMI