action #124715
QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability (closed)
QA - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones
Failing pipelines because of unreachable machine openqaworker-arm-1
Description
Observation
./ipmi-recover-worker fails in the Grafana webhook actions pipeline.
❯ ssh openqaworker-arm-1
ssh: connect to host openqaworker-arm-1 port 22: Connection timed out
lost connection
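A quick way to reproduce the timeout above without an interactive SSH session is a plain TCP connect to port 22. The following is a minimal sketch, not part of the pipeline; the function name `ssh_port_reachable` and the 5 second default timeout are assumptions made for illustration.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False
```

Calling `ssh_port_reachable("openqaworker-arm-1")` would be expected to return `False` while the host is in the state described here. It only tells whether the port accepts connections, not why the host is down.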
Acceptance criteria
- AC1:
Suggestions
Updated by livdywan almost 2 years ago
Following the action in the alert email I silenced the Packet loss alert. Here's hoping that's what I wanted to do ;-)
Updated by jbaier_cz almost 2 years ago
The Grafana webhook initiates the reboot, and it seems that the reboot sometimes actually has an effect on the host (probably not within the first hour), as it is often enough to resolve the "Packet loss between worker hosts and other hosts" alert. Unfortunately, it also seems that the host does not stay up for long.
Updated by mkittler almost 2 years ago
For now, I've removed the worker from salt and re-triggered the deployment.
The IPMI connection works so I'll try to power cycle it manually and follow SOL to see what's the problem.
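The manual power cycle and SOL session mentioned above can be sketched as a sequence of `ipmitool` calls. This is only an illustration under stated assumptions: it is not how `./ipmi-recover-worker` is implemented, the helper names (`ipmi_command`, `recovery_plan`, `run`) are hypothetical, the user name is a placeholder, and the BMC host name is taken from the IPMI URL referenced in this ticket. The password is expected in the `IPMI_PASSWORD` environment variable, which `ipmitool -E` reads.

```python
import shlex
import subprocess

# ipmitool subcommands used for a manual recovery attempt.
POWER_STATUS = ("chassis", "power", "status")
POWER_CYCLE = ("chassis", "power", "cycle")
SOL_ACTIVATE = ("sol", "activate")


def ipmi_command(bmc_host, user, action, interface="lanplus"):
    """Build the ipmitool argv list for one action (password comes from IPMI_PASSWORD via -E)."""
    return ["ipmitool", "-I", interface, "-H", bmc_host, "-U", user, "-E", *action]


def recovery_plan(bmc_host, user):
    """Check the power state, power cycle, then attach to the serial console."""
    return [ipmi_command(bmc_host, user, a) for a in (POWER_STATUS, POWER_CYCLE, SOL_ACTIVATE)]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)
```

With the defaults, `for cmd in recovery_plan("openqaworker-arm-1-ipmi.suse.de", "PLACEHOLDER_USER"): run(cmd)` only prints the commands. Any automatic reboot action should be paused first, otherwise it can interrupt the SOL session.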
Updated by mkittler almost 2 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler almost 2 years ago
The system was stuck in the GRUB rescue shell when I entered the SOL session. I tried to boot it nevertheless but got interrupted by the system rebooting. Maybe the automatic action triggered a reboot again.
Presumably the machine needs to be re-installed, or we can recover it via rescue media. GRUB shows the following partitions, so apparently not everything has been wiped:
grub> ls
(proc) (md/openqa) (lvm/system-swap) (lvm/system-root) (hd0) (hd0,gpt2)
(hd0,gpt1) (hd1) (hd1,msdos1)
Updated by mkittler almost 2 years ago
Since the automatic action is interfering with manual recovery I've silenced the alert in Grafana (which hopefully works, I'm not quite sure since we have migrated that).
I'll have a look at https://openqaworker-arm-1-ipmi.suse.de/index.html to see how one could mount a recovery ISO.
Updated by mkittler almost 2 years ago
Since the ticket description cannot be changed (it contains certain characters that progress cannot cope with), I'll add rollback steps here:
Rollback steps
- If the worker can be recovered
  - Add it back to salt
  - Remove silence for alert on automatic actions panel
Updated by mkittler almost 2 years ago
- Assignee deleted (mkittler)
Looks like a vKVM session would be possible but I haven't managed to make it work. None of the approaches mentioned on https://progress.opensuse.org/projects/openqav3/wiki#Accessing-old-BMCs-with-Java-iKVM-Viewer-when-ipmitool-does-not-work-eg-imagetester or the next section helped.
Updated by livdywan almost 2 years ago
- Copied to action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added
Updated by livdywan almost 2 years ago
- Status changed from In Progress to Rejected
Updated by jbaier_cz almost 2 years ago
- Copied to deleted (action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1)
Updated by jbaier_cz almost 2 years ago
- Has duplicate action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added