action #111149
closed
Recover openqaworker-arm-3
Added by mkittler over 2 years ago.
Updated over 2 years ago.
Description
Observation
$ ./ipmi-recover-worker
Attempting to reboot openqaworker-arm-3
Error: Unable to establish IPMI v2 / RMCP+ session
IPMI access to openqaworker-arm-3-ipmi.suse.de failed ==> aborting reboot attempt, needs PDU reset
IPMI based recovery failed ==> trying switched rack PDU power cycling+IPMI
spawn telnet [MASKED]
couldn't execute "telnet": no such file or directory
while executing
"spawn telnet $hostname"
(file "control-switched-rack-pdu.exp" line 12)
Re-triggering doesn't help: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/975568
Rollback steps
- Add openqaworker-arm-3 back to salt control
- Description updated (diff)
The pipeline works now, but we still cannot recover openqaworker-arm-3.suse.de, not even by controlling the power outlet via the web interface.
I moved openqaworker-arm-3.suse.de out of salt and will restart the OSD deployment.
- Subject changed from Worker recovery broken due to missing telnet command to Recover openqaworker-arm-3
- Assignee changed from mkittler to okurz
- Copied to action #111156: No effect of remote PDU controls on openqaworker-arm-4.qa and openqaworker-arm-5.qa, check power connections added
I updated all names in http://qaps06nue.qa.suse.de/outlctrl.htm so now we have:
- openqaworker-arm-1
- openqaworker-arm-2 #1
- openqaworker-arm-2 #2
- openqaworker-arm-3 #1
- openqaworker-arm-3 #2
- openqaworker-arm-4
- openqaworker-arm-5
- free
With
watch nmap -v -sn openqaworker-arm-1 openqaworker-arm-1-ipmi openqaworker-arm-2 openqaworker-arm-2-ipmi openqaworker-arm-3 openqaworker-arm-3-ipmi openqaworker-arm-4.qa ipmi.openqaworker-arm-4.qa.suse.de openqaworker-arm-5.qa ipmi.openqaworker-arm-5.qa.suse.de
I could now see that all are up except for openqaworker-arm-3-ipmi. I switched off outlets 4, 5, 6, 7 and 8, and openqaworker-arm-3 went offline as expected, but openqaworker-arm-4 and openqaworker-arm-5 stayed on.
After switching outlets 4+5 for arm-3 back on, openqaworker-arm-3-ipmi is reachable again and produces valid output. So the openqaworker-arm-3 problem is resolved here; openqaworker-arm-4.qa and openqaworker-arm-5.qa are to be handled in #111156.
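For reference, the reachability check above could also be run as a one-shot script instead of watching nmap output. This is only a minimal sketch: the function name `check_hosts` and the `PROBE` override are made up for illustration, and it assumes a single ICMP ping (Linux iputils `ping`) is a good enough "up" signal, which is not exactly what `nmap -sn` does.

```shell
# Hypothetical helper, not part of the recovery pipeline.
# Prints "<host> up" or "<host> DOWN" for each argument and
# returns non-zero if at least one host did not answer.
# PROBE is an assumed override hook for the probe command.
check_hosts() {
  status=0
  for host in "$@"; do
    # Default probe: one ping with a 2 second timeout
    if ${PROBE:-ping -c 1 -W 2} "$host" >/dev/null 2>&1; then
      echo "$host up"
    else
      echo "$host DOWN"
      status=1
    fi
  done
  return $status
}
```

Something like `check_hosts openqaworker-arm-3 openqaworker-arm-3-ipmi` would then give one line per host, and the exit code can be used in a loop while toggling outlets.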
- Status changed from In Progress to Feedback
- Status changed from Feedback to In Progress
merged. So failed recovery attempts no longer automatically create tickets to SUSE-IT but send emails to our own mailing list instead. I saw that openqaworker-arm-3 has been offline for 2 days, maybe because it's not controlled by salt anymore even though it shows up in grafana. I triggered a power reset over IPMI and will monitor whether it comes up.
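A power reset over IPMI like the one mentioned above can be sketched with `ipmitool` roughly as follows. This is an assumption about the invocation, not the exact command that was used: the wrapper name `ipmi_power`, the `IPMITOOL` override and the `IPMI_USER` variable with its `ADMIN` default are hypothetical. The password is expected in the `IPMI_PASSWORD` environment variable, which `ipmitool -E` reads.

```shell
# Hypothetical wrapper around ipmitool, for illustration only.
# $1: BMC hostname, e.g. openqaworker-arm-3-ipmi.suse.de
# $2: chassis power action (status, reset, cycle, on, off); default: status
# IPMITOOL can be set to "echo" for a dry run that only prints the arguments.
ipmi_power() {
  host=$1
  action=${2:-status}
  # -I lanplus: IPMI v2 / RMCP+ interface; -E: read password from IPMI_PASSWORD
  ${IPMITOOL:-ipmitool} -I lanplus -H "$host" -U "${IPMI_USER:-ADMIN}" -E \
    chassis power "$action"
}
```

For example `ipmi_power openqaworker-arm-3-ipmi.suse.de status` before and after `ipmi_power openqaworker-arm-3-ipmi.suse.de reset` shows whether the BMC accepted the request. If the session cannot be established at all, as in the observation, only the PDU route remains.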
- Parent task set to #109743
- Status changed from In Progress to Resolved
Host is up again, added to salt, high state applied. No related alerts, all good.