action #124715
closed
Following the action in the alert email I silenced the Packet loss alert. Here's hoping that's what I wanted to do ;-)
The Grafana webhook initiates the reboot, and it seems that the reboot sometimes actually does something with the host (probably not within the first hour), as it is often enough to resolve the "Packet loss between worker hosts and other hosts" alert. Unfortunately, it also seems that the host does not stay up for long.
For now, I've removed the worker from salt and re-triggered the deployment.
The IPMI connection works, so I'll try to power cycle it manually and follow SOL to see what the problem is.
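A minimal sketch of what the manual power cycle and SOL session could look like with `ipmitool`, wrapped in Python so the commands can be previewed before running them. The BMC hostname is assumed to be the same as the one serving the IPMI web interface, the user name `ADMIN` is hypothetical, and the password is expected in the `IPMI_PASSWORD` environment variable (read by ipmitool's `-E` option). This is not necessarily how the commands were actually invoked here.

```python
import shlex
import subprocess

# Assumption: the BMC answers on the same hostname as its web interface.
BMC_HOST = "openqaworker-arm-1-ipmi.suse.de"
BMC_USER = "ADMIN"  # hypothetical user name


def ipmi_command(host, user, *action):
    """Build an ipmitool argument list for a remote BMC.

    -I lanplus selects the IPMI v2.0 LAN interface that SOL needs;
    -E makes ipmitool read the password from the IPMI_PASSWORD
    environment variable instead of the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *action]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# Power cycle the host, then attach to the serial console to watch it boot.
power_cycle = ipmi_command(BMC_HOST, BMC_USER, "chassis", "power", "cycle")
sol_session = ipmi_command(BMC_HOST, BMC_USER, "sol", "activate")

if __name__ == "__main__":
    run(power_cycle)  # dry run: only prints the command
    run(sol_session)
```

Calling `run(power_cycle, dry_run=False)` would actually execute the command, so the dry run is the default to avoid rebooting the machine by accident.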
- Status changed from New to In Progress
- Assignee set to mkittler
The system was stuck in the GRUB rescue shell when I entered the SOL session. I tried to boot it nevertheless but got interrupted by the system rebooting. Maybe the automatic action triggered a reboot again.
Presumably the machine needs to be re-installed, or we can recover it via rescue media. GRUB shows the following partitions, so presumably not everything has been wiped:
grub> ls
(proc) (md/openqa) (lvm/system-swap) (lvm/system-root) (hd0) (hd0,gpt2)
(hd0,gpt1) (hd1) (hd1,msdos1)
Since the automatic action is interfering with manual recovery, I've silenced the alert in Grafana (which hopefully works; I'm not quite sure since we have migrated that).
I'll have a look at https://openqaworker-arm-1-ipmi.suse.de/index.html to see how one could mount a recovery ISO.
Since the ticket description cannot be changed (due to use of certain characters progress cannot cope with), I'll add rollback steps here:
Rollback steps
- If the worker can be recovered
- Add it back to salt
- Remove silence for alert on automatic actions panel
- Assignee deleted (mkittler)
- Copied to action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added
- Status changed from In Progress to Rejected
- Copied to deleted (action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1)
- Has duplicate action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added