action #89815
closed
osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically
Added by okurz almost 4 years ago.
Updated over 3 years ago.
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/pipelines shows that the deployments on 2021-03-08 and 2021-03-10 were blocked by openqaworker-arm-3 being offline. https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 shows that openqaworker-arm-3 has been offline since 2021-03-06 and that the automatic actions could not recover it.
Acceptance criteria
- AC1: openqaworker-arm-3 is online again
- AC2: openqaworker-arm-3 can be recovered automatically or a ticket to EngInfra is created automatically as before
- AC3: osd-deployment is recovered
Suggestions
- Remove openqaworker-arm-3 from salt keys
- Trigger osd-deployment again manually
- Check automatic actions for 2021-03-06 and find out if a ticket was created or not, fix if not
- Bring back openqaworker-arm-3 after the above is fixed (or get rid of this unreliable machine and replace it with a better one)
- Status changed from Workable to In Progress
- Assignee set to okurz
- Status changed from In Progress to Workable
- Assignee deleted (okurz)
- Priority changed from Urgent to High
osd-deployment has passed so far. As I have removed the salt key for the worker, I consider this less critical now.
I'll have a look because with only 2 workers we can easily get into a situation where arm jobs aren't processed fast enough, see #90755.
SOL, power reset and power cycle don't work, so it is no surprise that our automatic recovery didn't work either. Maybe I can play around with IPMI a little bit more. Otherwise I suppose we'll have to resort to filing an infra ticket.
Skip to the end if you're not interested in my exploratory use of ipmitool :-)
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis selftest
Self Test Results : passed
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis restart_cause
System restart cause: chassis power control command
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis policy always-on
Set chassis power restore policy to always-on
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis power reset
Set Chassis Power Control to Reset failed: Command not supported in present state
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis power cycle
Set Chassis Power Control to Cycle failed: Command not supported in present state
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis power diag
Set Chassis Power Control to Diag failed: Command not supported in present state
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis power on
Chassis Power Control: Up/On
ipmitool -I lanplus -C 3 -H openqaworker-arm-3-ipmi.suse.de -U ADMIN -P ADMIN chassis power diag
Chassis Power Control: Diag
And something finally shows up on the SOL. It took a while as it conducted a few tests but now the system is booting. It looks like the chassis was turned off (whatever that means). Maybe our automatic recovery should also ensure that the chassis is turned on.
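The idea of making the recovery ensure that the chassis is on could be sketched roughly as follows. This is only an illustration, not the code from the recovery scripts: the function name ensure_chassis_on and the IPMITOOL variable (assumed to hold the full ipmitool call including connection options) are hypothetical.

```shell
# Hypothetical helper, not taken from the actual recovery scripts.
# IPMITOOL is assumed to hold the complete ipmitool call including
# connection options (interface, BMC host, credentials).
IPMITOOL="${IPMITOOL:-ipmitool}"

ensure_chassis_on() {
    # "chassis power status" reports "Chassis Power is on" or "... is off"
    power_state=$($IPMITOOL chassis power status) || return 1
    case "$power_state" in
        *"is on"*)
            echo "chassis already on"
            ;;
        *)
            # Chassis is off: reset and cycle are rejected in this state,
            # so switch the power on instead
            $IPMITOOL chassis power on
            ;;
    esac
}
```

Calling ensure_chassis_on before any reset or cycle attempt would cover the state seen above, where those commands fail with "Command not supported in present state".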
Wow, that's interesting. Can we just add a chassis on in the grafana-webhook actions repo?
If "grafana-webhook" is the place where we do the automatic recovery, then that's what I'd do. Still have to find it, though.
- Status changed from Workable to In Progress
- Due date set to 2021-04-22
Setting due date based on mean cycle time of SUSE QE Tools
The SR has been merged and I've tested whether it works. It does. After power off, the Grafana alert triggers the GitLab pipeline, which then turns the power on again. However, since the power state is apparently not immediately reflected, the subsequent power cycle command doesn't work:
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
Chassis Power Control: Up/On
Set Chassis Power Control to Cycle failed: Command not supported in present state
Not a big deal because it isn't required anyway in that state. If the power was on and the worker was just otherwise stuck, the power on command would be a no-op with a zero return code, so this case should be fine as well.
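If a follow-up command ever did need the updated power state, one option would be to poll the status before continuing. The following is a minimal sketch under that assumption; wait_for_power_on, its default values and the IPMITOOL variable (assumed to hold the full ipmitool call including connection options) are hypothetical and not part of the merged change.

```shell
# Hypothetical helper, not part of the merged change.
# IPMITOOL is assumed to hold the complete ipmitool call including
# connection options.
IPMITOOL="${IPMITOOL:-ipmitool}"

wait_for_power_on() {
    # $1: maximum number of polls, $2: seconds between polls
    # (default values are illustrative only)
    tries="${1:-10}"
    delay="${2:-3}"
    while [ "$tries" -gt 0 ]; do
        # Succeed as soon as the BMC reports the chassis as powered on
        if $IPMITOOL chassis power status | grep -q 'is on'; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    # Power state never changed to "on" within the allowed polls
    return 1
}
```

As noted above, the power cycle is not actually needed right after a power on, so this would only matter for steps that depend on the reported state.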
- Status changed from In Progress to Feedback
discussed in weekly 2021-04-09:
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/blob/master/ipmi-health-check#L22 wastes 10 minutes, and we also found out that https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/12 did more than we expected. I read all comments from #76876 again but have not found any discussion about turning the "do-reboot" steps into a "check-health" script. So mkittler and I decided to go back to the previous approach: trigger a reboot as soon as the script is started, without checking whether the machine might still be up (responding to ping but potentially not fully responsive) and without wasting 10 minutes of waiting if ping comes back successful before triggering any action.
Also I suggest:
# $ipmitool is expected to hold the ipmitool call including connection options
if $ipmitool power status | grep -q 'is on'; then
    $ipmitool power cycle
else
    $ipmitool power on
fi
- Status changed from Feedback to Resolved
Both MRs are merged and considered stable. We won't test the actual ticket creation.
- Related to action #92176: [alert] openqaworker-arm-3 offline and CI pipeline unable to send email but stating "passed" added
- Related to action #94456: no data from any arm host on https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 added
- Related to action #159270: openqaworker-arm-1 is Unreachable size:S added