action #80542

closed

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #78206: [epic] 2020-11-18 nbg power outage aftermath

Configure "automatic power-on" after power loss for openqaworker1 (and all others)

Added by okurz over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Target version:
Start date: 2020-11-27
Due date:
% Done: 0%
Estimated time:
Description

Motivation

During the next power outage we want openqaworker1 and others to automatically power up when power returns.

Acceptance criteria

  • AC1: openqaworker1 is configured to start automatically when power returns
  • AC2: all other machines are configured like that as well

Suggestion

  • Check whether you can reach the firmware configuration menu of the machine "openqaworker1" and set automatic power-on after power returns there
  • If necessary, test with the help of EngInfra that this works
Actions #1

Updated by okurz about 3 years ago

  • Target version changed from future to Ready
Actions #2

Updated by okurz about 3 years ago

  • Subject changed from Configure "automatic power-on" after power loss for openqaworker1 to Configure "automatic power-on" after power loss for openqaworker1 (and all others)
  • Description updated (diff)

Found out with mkittler that we can use IPMI to get and set this, e.g.

ipmi-openqaworker-arm-3-ipmi chassis status | grep 'Power Restore Policy'

returns

Power Restore Policy : always-on

and we can set it with

ipmi-openqaworker-arm-3-ipmi chassis policy always-on
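
To use the current policy in a script rather than just read it, the value can be cut out of the chassis status output. A minimal sketch, assuming the output line looks like the one quoted above; the function name power_restore_policy is made up for illustration:

```shell
# Print only the policy value from "chassis status" output read on stdin.
# Assumed input line format: "Power Restore Policy : always-on"
power_restore_policy() {
    sed -n 's/^[[:space:]]*Power Restore Policy[[:space:]]*:[[:space:]]*//p'
}
```

It would be used as e.g. ipmi-openqaworker-arm-3-ipmi chassis status | power_restore_policy.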
Actions #3

Updated by okurz about 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

We should be able to just call that from each worker itself with sudo ipmitool chassis policy always-on, maybe from salt.

I found https://docs.saltproject.io/en/latest/ref/modules/all/salt.modules.ipmi.html#salt.modules.ipmi.set_power which needs the python module pyghmi and does not support a state, so I guess we are better off if we control the commands ourselves.

I checked on openqaworker11 how expensive the operation is:

# time sh -c "ipmitool chassis status | grep 'Power Restore Policy.*always-on' || ipmitool chassis policy always-on"
Set chassis power restore policy to always-on

real    0m0.185s
user    0m0.100s
sys     0m0.033s
openqaworker11:/home/okurz # time sh -c "ipmitool chassis status | grep 'Power Restore Policy.*always-on' || ipmitool chassis policy always-on"
Power Restore Policy : always-on

real    0m0.095s
user    0m0.050s
sys     0m0.022s

So about 95 ms for a check alone and 185 ms for a check followed by the set operation. But salt is not meant to rerun the command every time; it should check a state. A proper way would probably be a custom grain python function to check the state. I guess I will resort to the manual action and document it on a wiki, maybe?
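
The check-then-set one-liner above could be wrapped into a reusable function so the policy is only written when it differs. This is just a sketch: the name ensure_policy and the "unchanged" message are made up, and it assumes the "Power Restore Policy : <value>" output format shown above.

```shell
# ensure_policy WANTED [IPMI_COMMAND...]
# Set the power restore policy only if it is not already WANTED.
# IPMI_COMMAND is an optional command prefix; default is local "ipmitool".
ensure_policy() {
    wanted=$1
    shift
    [ "$#" -gt 0 ] || set -- ipmitool
    if "$@" chassis status | grep -q "Power Restore Policy *: *$wanted\$"; then
        # Already in the wanted state, nothing to write.
        echo "unchanged: $wanted"
    else
        "$@" chassis policy "$wanted"
    fi
}
```

On a worker this would be called as sudo sh -c '. ./ensure_policy.sh; ensure_policy always-on' (file name hypothetical). If it should live in salt after all, a cmd.run state guarded by "unless" with the same grep might be enough, but I have not tried that.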

Actions #4

Updated by okurz about 3 years ago

  • Status changed from In Progress to Resolved

Simulating setting "always-on" policy on all machines:

IFS=$'\n'; for i in $(sed 's/^alias .*="\(.*\)"/\1/' ~/.openqa_ipmi_aliases); do (eval "$i" chassis status | grep 'Power Restore Policy.*always-on' || echo eval "$i" chassis policy always-on); done

relying on https://gitlab.suse.de/openqa/salt-pillars-openqa/#get-ipmi-definition-aliases

Likely "previous" is a better policy in case someone deliberately wants to power off the machines and only power them on one by one, e.g. after some work on the electrical infrastructure. Also, some of our machines do not support the commands, e.g. PowerPC (with LPARs and such). So I guess we are better off with the manual approach for now. Checking current status:

IFS=$'\n'; for i in $(sed 's/^alias .*="\(.*\)"/\1/' ~/.openqa_ipmi_aliases); do echo "# $i" && eval "$i" chassis status | grep 'Power Restore Policy'; done

reveals a mixture of "previous, always-on, always-off". Setting "previous" on all:

IFS=$'\n'; for i in $(sed 's/^alias .*="\(.*\)"/\1/' ~/.openqa_ipmi_aliases); do eval "$i" chassis policy previous; done
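
The three loops above share the same skeleton, so they could be folded into one helper with a dry-run switch. A rough sketch only: the function names and the DRY_RUN variable are invented here, and it assumes every relevant line of the alias file has the form alias NAME="COMMAND" as the sed expression above implies.

```shell
# Print the command behind every alias definition in FILE, one per line.
# Assumes lines of the form: alias NAME="COMMAND"; other lines are skipped.
ipmi_alias_commands() {
    sed -n 's/^alias [^=]*="\(.*\)"$/\1/p' "$1"
}

# for_each_ipmi FILE ARGS...
# Run ARGS (e.g. "chassis status") through every IPMI command from FILE.
# With DRY_RUN=1 the full command lines are only printed.
for_each_ipmi() {
    file=$1
    shift
    ipmi_alias_commands "$file" | while IFS= read -r cmd; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$cmd $*"
        else
            echo "# $cmd"
            # Each line is a complete command string, hence eval.
            # Only use this on a trusted alias file.
            # stdin is redirected so the command cannot eat the loop input.
            eval "$cmd" "$@" </dev/null
        fi
    done
}
```

For example DRY_RUN=1 for_each_ipmi ~/.openqa_ipmi_aliases chassis policy previous to preview, then the same without DRY_RUN=1 to apply. Note that both modes print the full command line, which may include credentials.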

Updated wiki https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Remote-management-with-IPMI
