action #111149
closed
Recover openqaworker-arm-3
0%
Description
Observation
$ ./ipmi-recover-worker
Attempting to reboot openqaworker-arm-3
Error: Unable to establish IPMI v2 / RMCP+ session
IPMI access to openqaworker-arm-3-ipmi.suse.de failed ==> aborting reboot attempt, needs PDU reset
IPMI based recovery failed ==> trying switched rack PDU power cycling+IPMI
spawn telnet [MASKED]
couldn't execute "telnet": no such file or directory
while executing
"spawn telnet $hostname"
(file "control-switched-rack-pdu.exp" line 12)
Re-triggering doesn't help: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/975568
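The log above fails because `telnet` is not installed in the container that runs the recovery script. A minimal sketch of a pre-flight check that would surface this kind of gap before any recovery attempt is shown here. The function name `missing_tools` and the list of tools are assumptions for illustration, not part of the actual `ipmi-recover-worker` script.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print every required command that is not
# available in PATH, one per line. Prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The tool list is an assumption derived from the log output above.
missing=$(missing_tools ipmitool expect telnet)
if [ -n "$missing" ]; then
    # Join the names on a single line for a compact warning.
    printf 'missing tools: %s\n' "$(printf '%s' "$missing" | tr '\n' ' ')" >&2
fi
```

Running such a check at the start of the pipeline job would report a missing dependency directly instead of failing in the middle of the PDU fallback.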
Rollback steps
- Add openqaworker-arm-3 back to salt control
Updated by mkittler over 2 years ago
@okurz Added the missing dependency: https://build.opensuse.org/package/rdiff/home:okurz:container/ipmitool-ping-nc-msmtp-expect?linkrev=base&rev=3
I'll re-try the recovery once the new container is published and also re-try the OSD deployment when arm-3 is up again.
Updated by mkittler over 2 years ago
Now the pipeline works, but we still cannot recover openqaworker-arm-3.suse.de, not even by controlling the power outlet via the web interface.
I moved openqaworker-arm-3.suse.de out of salt and will restart the OSD deployment.
Updated by mkittler over 2 years ago
- Subject changed from Worker recovery broken due to missing telnet command to Recover openqaworker-arm-3
Updated by okurz over 2 years ago
I updated the configuration in http://qaps06nue.qa.suse.de/olnames.htm to include our known hostnames for the ports. I tried to power cycle outlets 4+5 according to https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/.gitlab-ci.yml#L54 but still fail to ping openqaworker-arm-3-ipmi.suse.de
Updated by okurz over 2 years ago
- Copied to action #111156: No effect of remote PDU controls on openqaworker-arm-4.qa and openqaworker-arm-5.qa, check power connections added
Updated by okurz over 2 years ago
I updated all names in http://qaps06nue.qa.suse.de/outlctrl.htm so now we have
- openqaworker-arm-1
- openqaworker-arm-2 #1
- openqaworker-arm-2 #2
- openqaworker-arm-3 #1
- openqaworker-arm-3 #2
- openqaworker-arm-4
- openqaworker-arm-5
- free
With
watch nmap -v -sn openqaworker-arm-1 openqaworker-arm-1-ipmi openqaworker-arm-2 openqaworker-arm-2-ipmi openqaworker-arm-3 openqaworker-arm-3-ipmi openqaworker-arm-4.qa ipmi.openqaworker-arm-4.qa.suse.de openqaworker-arm-5.qa ipmi.openqaworker-arm-5.qa.suse.de
I could now see that all are up except for openqaworker-arm-3-ipmi. I switched off outlets 4,5,6,7,8 and openqaworker-arm-3 went offline as expected, but openqaworker-arm-4 and openqaworker-arm-5 stayed on.
After switching outlets 4+5 for arm-3 back on, openqaworker-arm-3-ipmi is reachable again and produces valid output. So the openqaworker-arm-3 problem is resolved here; openqaworker-arm-4.qa and openqaworker-arm-5.qa are to be handled in #111156
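The check described above (toggle outlets, then see which hosts drop off the network) can also be scripted for a one-shot report instead of watching `nmap` output. This is a minimal sketch, assuming Linux iputils `ping` with `-c` (count) and `-W` (timeout in seconds). The function names `probe` and `report_hosts` are hypothetical.

```shell
#!/bin/sh
# probe: succeeds if the host answers a single ICMP echo request.
# Assumes Linux iputils ping; other ping variants use different timeout flags.
probe() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# report_hosts: prints "<host>: up" or "<host>: DOWN" for each argument
# and returns non-zero if at least one host did not answer.
report_hosts() {
    rc=0
    for host in "$@"; do
        if probe "$host"; then
            echo "$host: up"
        else
            echo "$host: DOWN"
            rc=1
        fi
    done
    return $rc
}
```

For example, `report_hosts openqaworker-arm-3 openqaworker-arm-3-ipmi` run before and after switching outlets 4+5 would show whether both the host and its IPMI interface are attached to those outlets.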
Updated by okurz over 2 years ago
- Status changed from In Progress to Feedback
Updated by okurz over 2 years ago
- Status changed from Feedback to In Progress
merged. So failed recovery attempts no longer create tickets to SUSE-IT automatically but send emails to our own mailing list instead. I saw that openqaworker-arm-3 has been offline for 2 days, maybe because it's not controlled by salt anymore even though it shows up in grafana. I triggered a power reset over IPMI and will monitor whether it comes up.
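A power reset over IPMI as mentioned above could look like the following sketch. It assumes the `lanplus` interface (suggested by the "IPMI v2 / RMCP+" error in the description) and that credentials are supplied through the environment: `IPMI_USER` is a hypothetical variable name, while `IPMI_PASSWORD` is the variable that `ipmitool -E` reads. The `DRY_RUN` switch is also an assumption, added so the command can be inspected without touching the machine.

```shell
#!/bin/sh
# ipmi_power_reset: trigger a chassis power reset on the given IPMI host.
# With DRY_RUN set to a non-empty value, only print the command.
ipmi_power_reset() {
    host=$1
    # -E makes ipmitool take the password from the IPMI_PASSWORD variable,
    # so it does not appear on the command line.
    set -- ipmitool -I lanplus -H "$host" -U "${IPMI_USER:?set IPMI_USER}" -E chassis power reset
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `DRY_RUN=1 ipmi_power_reset openqaworker-arm-3-ipmi.suse.de` first shows the exact command before it is run for real.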
Updated by okurz over 2 years ago
- Status changed from In Progress to Resolved
Host is up again, added to salt, high state applied. No related alerts, all good.