action #59382

openqaworker-arm-1 is down, was automatically power cycled by grafana+gitlab, no reaction, power cycled again, SOL is unresponsive

Added by okurz 5 months ago. Updated 5 months ago.

Status: Resolved
Start date: 13/11/2019
Priority: Normal
Due date:
Assignee: okurz
% Done:


Target version: openQA Project - Current Sprint


Observation was triggered after the alarm that arm-1 is down on 2019-11-11 but it seems arm-1 never came back according to


We should escalate the alert when recovery fails or at least try again multiple times until we see some response, e.g. extend to add "sol activate" and see if there is any response and ping the host.


#1 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Urgent to Normal

I did a manual power cycle and sol activate but got no response. I then tried power off and power on, again no response … or so I thought. The machine actually came back after power off and power on and a waiting time of about 5 minutes, responding to ping but with no response on SOL. The machine is back up again.

My suggestions:

  • timeout -k 5 600 sh -c "until ping -c1 $MACHINE; do :; done" after the power cycle; if the machine does not respond, try power off and power on and ping again
  • After ping succeeds, probe ssh, e.g. timeout -k 5 300 sh -c "until nc -vz -w 1 $MACHINE 22; do :; done". Consider the recovery successful if the ssh probe succeeds
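
A minimal sketch of how the two suggestions above could be chained into one recovery check. The functions power_cycle and power_off_on are hypothetical placeholders for whatever IPMI commands the real recovery job runs; they are not taken from the actual pipeline. Unlike the one-liners above, the loop sleeps one second between probes instead of busy-looping, which is an assumption about what is desirable here.

```shell
#!/bin/sh
# Sketch only: power_cycle and power_off_on are hypothetical helpers that
# must be provided by the caller (e.g. thin wrappers around IPMI commands).

# Retry a probe command once per second until it succeeds or the time
# limit expires. Returns 0 on success, 124 if timeout(1) had to stop it.
# Usage: wait_for SECONDS COMMAND [ARGS...]
wait_for() {
    limit=$1
    shift
    timeout -k 5 "$limit" \
        sh -c 'until "$@" >/dev/null 2>&1; do sleep 1; done' sh "$@"
}

# Wait up to 10 minutes for the machine to answer ping.
host_pings() { wait_for 600 ping -c1 "$1"; }

# Wait up to 5 minutes for the ssh port to accept connections.
ssh_reachable() { wait_for 300 nc -z -w 1 "$1" 22; }

# Power cycle, fall back to power off + power on if ping never answers,
# then treat an open ssh port as the success criterion.
recover() {
    machine=$1
    power_cycle "$machine"
    if ! host_pings "$machine"; then
        power_off_on "$machine"
        host_pings "$machine" || return 1
    fi
    ssh_reachable "$machine"
}
```

A caller would define the two power helpers and then run e.g. recover "$MACHINE" || echo "recovery failed, escalate", so that a failed recovery can raise the alert instead of passing silently.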

#2 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint

#3 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

merged. Let's see what it brings the next time.
