action #59382

openqaworker-arm-1 is down, was automatically power cycled by grafana+gitlab, no reaction, power cycled again, SOL is unresponsive

Added by okurz 5 months ago. Updated 5 months ago.

Status: Resolved
Start date: 13/11/2019
Priority: Normal
Due date:
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration:

Description

Observation

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/138117 was triggered after the alarm that arm-1 is down on 2019-11-11, but it seems arm-1 never came back, according to https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashoard-openqaworker-arm-1?orgId=1&refresh=1m&from=now-7d&to=now

Suggestions

We should escalate the alert when recovery fails or at least try again multiple times until we see some response, e.g. extend https://gitlab.suse.de/openqa/grafana-webhook-actions to add "sol activate" and see if there is any response and ping the host.
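
A rough sketch of the "try again multiple times, then escalate" idea. The function name and the three command arguments are placeholders made up for illustration and are not taken from grafana-webhook-actions. The real recovery, check and escalation commands would have to be filled in.

```shell
#!/bin/sh
# Hypothetical helper, not part of grafana-webhook-actions.
# Usage: recover_with_retries MAX_ATTEMPTS RECOVER_CMD CHECK_CMD ESCALATE_CMD
# Each *_CMD is a shell command string supplied by the caller.
recover_with_retries() {
    max_attempts=$1
    recover_cmd=$2
    check_cmd=$3
    escalate_cmd=$4
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        echo "recovery attempt $attempt/$max_attempts"
        # A failing recovery command is not fatal; the check decides.
        sh -c "$recover_cmd" || true
        if sh -c "$check_cmd"; then
            echo "host responded after attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "recovery failed after $max_attempts attempts, escalating"
    sh -c "$escalate_cmd"
    return 1
}
```

The check command could be a ping or an ssh probe, and the escalation command whatever raises the alert to a human.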

History

#1 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Urgent to Normal

I did a manual power cycle and sol activate but got no response. I tried power off and power on, no response … or so I thought. The machine actually came back after power off and power on and a waiting time of about 5 minutes, responding to ping but with no response on SOL. The machine is back up again.

My suggestions:

  • Run timeout -k 5 600 sh -c "until ping -c1 $MACHINE; do :; done" after the power cycle. If the machine does not respond, try power off and power on, then ping again
  • After ping succeeds, probe ssh, e.g. timeout -k 5 300 sh -c "until nc -vz -w 1 $MACHINE 22; do :; done". Consider the recovery a success if the ssh probe succeeds
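
The two suggestions above could be combined roughly like this. It is only a sketch: POWER_CYCLE_CMD and POWER_OFF_ON_CMD are assumed placeholders for whatever the webhook script uses to control power, and the function names are made up. Unlike the one-liners above, the loops sleep for a second between probes to avoid a busy loop.

```shell
#!/bin/sh
# Sketch only. POWER_CYCLE_CMD and POWER_OFF_ON_CMD are assumed to be set by
# the caller to the real power control commands; they are placeholders here.
PING_TIMEOUT=${PING_TIMEOUT:-600}  # seconds to wait for ping
SSH_TIMEOUT=${SSH_TIMEOUT:-300}    # seconds to wait for the ssh port

# Wait until the host answers a ping or the timeout expires.
wait_for_ping() {
    timeout -k 5 "$PING_TIMEOUT" sh -c "until ping -c1 '$1' >/dev/null 2>&1; do sleep 1; done"
}

# Wait until TCP port 22 accepts connections or the timeout expires.
wait_for_ssh() {
    timeout -k 5 "$SSH_TIMEOUT" sh -c "until nc -z -w 1 '$1' 22 >/dev/null 2>&1; do sleep 1; done"
}

# Power cycle, fall back to power off and power on, then probe ssh.
# Returns 0 only if the ssh probe succeeds.
recover_machine() {
    machine=$1
    sh -c "$POWER_CYCLE_CMD"
    if ! wait_for_ping "$machine"; then
        echo "no ping response after power cycle, trying power off and power on"
        sh -c "$POWER_OFF_ON_CMD"
        wait_for_ping "$machine" || return 1
    fi
    wait_for_ssh "$machine"
}
```

A non-zero return from recover_machine would be the point to escalate the alert.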

#2 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint

#3 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

merged. Let's see what it brings the next time.
