action #111149

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Recover openqaworker-arm-3

Added by mkittler about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2022-05-16
Due date:
% Done: 0%
Estimated time:

Description

Observation

$ ./ipmi-recover-worker
Attempting to reboot openqaworker-arm-3
Error: Unable to establish IPMI v2 / RMCP+ session
IPMI access to openqaworker-arm-3-ipmi.suse.de failed ==> aborting reboot attempt, needs PDU reset
IPMI based recovery failed ==> trying switched rack PDU power cycling+IPMI
spawn telnet [MASKED]
couldn't execute "telnet": no such file or directory
    while executing
"spawn telnet $hostname"
    (file "control-switched-rack-pdu.exp" line 12)

Re-triggering doesn't help: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/975568

Rollback steps

  • Add openqaworker-arm-3 back to salt control

Related issues

Copied to openQA Infrastructure - action #111156: No effect of remote PDU controls on openqaworker-arm-4.qa and openqaworker-arm-5.qa, check power connections (Resolved, 2022-05-16)

History

#1 Updated by mkittler about 2 months ago

okurz added the missing dependency: https://build.opensuse.org/package/rdiff/home:okurz:container/ipmitool-ping-nc-msmtp-expect?linkrev=base&rev=3

I'll re-try the recovery once the new container is published and also re-try the OSD deployment when arm-3 is up again.
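
A preflight check along the following lines could surface a missing command such as telnet before the recovery script runs. This is only a sketch: the function name missing_tools and the tool list in the usage comment are assumptions for illustration, not part of the actual pipeline.

```shell
# Print every required command that is not found on PATH, one per line.
# Returns non-zero if at least one command is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Hypothetical usage at the start of a recovery job (tool list is an assumption):
# missing_tools ipmitool expect telnet || { echo "missing tools, aborting" >&2; exit 1; }
```

Failing early with the list of missing commands would make an incomplete container image visible in the job log directly, rather than through an error from inside the expect script.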

#2 Updated by okurz about 2 months ago

  • Description updated (diff)

#3 Updated by mkittler about 2 months ago

Now the pipeline works, but we still cannot recover openqaworker-arm-3.suse.de, not even by controlling the power outlet via the web interface.

I moved openqaworker-arm-3.suse.de out of salt and will restart the OSD deployment.

#4 Updated by mkittler about 2 months ago

  • Subject changed from Worker recovery broken due to missing telnet command to Recover openqaworker-arm-3

#5 Updated by okurz about 2 months ago

I updated the configuration in http://qaps06nue.qa.suse.de/olnames.htm to include our known hostnames for the ports. I tried to power cycle outlets 4+5 according to https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/.gitlab-ci.yml#L54 but still fail to ping openqaworker-arm-3-ipmi.suse.de

#6 Updated by okurz about 2 months ago

  • Assignee changed from mkittler to okurz

#7 Updated by okurz about 2 months ago

  • Copied to action #111156: No effect of remote PDU controls on openqaworker-arm-4.qa and openqaworker-arm-5.qa, check power connections added

#8 Updated by okurz about 2 months ago

I updated all names in http://qaps06nue.qa.suse.de/outlctrl.htm so now we have

  • openqaworker-arm-1
  • openqaworker-arm-2 #1
  • openqaworker-arm-2 #2
  • openqaworker-arm-3 #1
  • openqaworker-arm-3 #2
  • openqaworker-arm-4
  • openqaworker-arm-5
  • free

With

watch nmap -v -sn openqaworker-arm-1 openqaworker-arm-1-ipmi openqaworker-arm-2 openqaworker-arm-2-ipmi openqaworker-arm-3 openqaworker-arm-3-ipmi openqaworker-arm-4.qa ipmi.openqaworker-arm-4.qa.suse.de openqaworker-arm-5.qa ipmi.openqaworker-arm-5.qa.suse.de

I could now see that all are up except for openqaworker-arm-3-ipmi. I switched off outlets 4,5,6,7,8 and openqaworker-arm-3 went offline as expected but openqaworker-arm-4 and openqaworker-arm-5 stayed on.

After switching outlets 4+5 for arm-3 on again, openqaworker-arm-3-ipmi is reachable again and produces valid output. So the openqaworker-arm-3 problem is resolved here; openqaworker-arm-4.qa and openqaworker-arm-5.qa are to be handled in #111156
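
As a scriptable alternative to watching the nmap output, the up/down check could be sketched as a small loop like the one below. It is an illustration only: the function name report_hosts and the PROBE variable are assumptions, and the default probe assumes Linux iputils ping, where -W is a timeout in seconds.

```shell
# Probe command, overridable so the loop can be tried without network access.
# Assumption: Linux iputils ping (-c count, -W timeout in seconds).
PROBE=${PROBE:-ping -c 1 -W 2}

# Print "<host> up" or "<host> down" for each host given as an argument.
report_hosts() {
    for host in "$@"; do
        # $PROBE is intentionally unquoted so its options are split into words.
        if $PROBE "$host" >/dev/null 2>&1; then
            printf '%s up\n' "$host"
        else
            printf '%s down\n' "$host"
        fi
    done
}

# Example call with two of the hostnames from above:
# report_hosts openqaworker-arm-3 openqaworker-arm-3-ipmi
```

Running this before and after switching an outlet would show which hosts actually lost power, which is the comparison made manually above.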

#9 Updated by okurz about 2 months ago

  • Status changed from In Progress to Feedback

#10 Updated by okurz about 2 months ago

  • Status changed from Feedback to In Progress

merged. So tickets to SUSE-IT are no longer created automatically from failed recovery attempts; instead, emails go to our own mailing list. I saw that openqaworker-arm-3 has been offline for 2 days, maybe because it's not controlled by salt anymore even though it shows up in grafana. I triggered a power reset over IPMI and will monitor whether it comes up.
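
For reference, a power reset over IPMI can be sketched with ipmitool roughly as follows. The wrapper name ipmi_power and the DRY_RUN and IPMI_USER variables are assumptions for illustration; the exact invocation used for this host is not recorded in the ticket. The -E option makes ipmitool read the password from the IPMI_PASSWORD environment variable.

```shell
# Run (or, with DRY_RUN set, only print) an ipmitool chassis power command.
# $1: IPMI hostname, $2: action such as status, reset or cycle.
ipmi_power() {
    host=$1
    action=$2
    # IPMI_USER must be set by the caller; the password is taken from IPMI_PASSWORD via -E.
    user=${IPMI_USER:?IPMI_USER must be set}
    set -- ipmitool -I lanplus -H "$host" -U "$user" -E chassis power "$action"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical usage:
# IPMI_USER=... IPMI_PASSWORD=... ipmi_power openqaworker-arm-3-ipmi.suse.de reset
```

Setting DRY_RUN prints the command instead of executing it, which allows checking the arguments before touching the machine.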

#11 Updated by okurz about 2 months ago

  • Parent task set to #109743

#12 Updated by okurz about 2 months ago

  • Status changed from In Progress to Resolved

Host is up again, added back to salt, high state applied. No related alerts, all good.
