



action #124715


QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA (public) - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

Failing pipelines because of unreachable machine openqaworker-arm-1

Added by livdywan about 2 years ago. Updated about 2 years ago.




./ipmi-recover-worker fails in the grafana web hook actions pipeline.

❯ ssh openqaworker-arm-1
ssh: connect to host openqaworker-arm-1 port 22: Connection timed out                                                   
lost connection

Acceptance criteria

  • AC1:


Related issues 1 (0 open, 1 closed)

Has duplicate openQA Infrastructure (public) - action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 (Resolved, mkittler, 2023-02-08)

Actions #1

Updated by livdywan about 2 years ago

Following the action in the alert email I silenced the Packet loss alert. Here's hoping that's what I wanted to do ;-)

Actions #2

Updated by jbaier_cz about 2 years ago

The Grafana webhook initiates the reboot, and it seems that the reboot sometimes actually does something to the host (probably not within the first hour), as it is often enough to resolve the "Packet loss between worker hosts and other hosts" alert. Unfortunately, it also seems that the host does not stay up for long.

Actions #3

Updated by mkittler about 2 years ago

For now, I've removed the worker from salt and re-triggered the deployment.

The IPMI connection works so I'll try to power cycle it manually and follow SOL to see what's the problem.
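A minimal sketch of what the manual power cycle and SOL session might look like with `ipmitool`. The BMC host name, the `IPMI_USER` default and the `DRY_RUN` switch are assumptions for illustration; the ticket does not state the actual IPMI address or credentials. By default the commands are only printed, not executed.

```shell
#!/bin/bash
# Hypothetical values: the real BMC address and user are not given in this ticket.
BMC_HOST="${BMC_HOST:-openqaworker-arm-1-ipmi.example}"
IPMI_USER="${IPMI_USER:-admin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Run (or, in dry-run mode, print) an ipmitool command against the BMC.
# -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
ipmi() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "ipmitool -I lanplus -H $BMC_HOST -U $IPMI_USER -E $*"
    else
        ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -E "$@"
    fi
}

ipmi chassis power status   # check that the BMC answers at all
ipmi chassis power cycle    # hard power cycle of the host
ipmi sol activate           # attach to serial-over-LAN to watch the boot
```

With `DRY_RUN=0` and `IPMI_PASSWORD` exported, the same three calls would be sent to the BMC; the SOL session then shows whatever the firmware and boot loader print on the serial console.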

Actions #4

Updated by mkittler about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Actions #5

Updated by mkittler about 2 years ago

The system was stuck in the GRUB rescue shell when I entered the SOL session. I tried to boot it nevertheless but got interrupted by the system rebooting. Maybe the automatic action triggered a reboot again.

Presumably the machine needs to be re-installed, or we can recover it via rescue media. GRUB shows the following partitions, so presumably not everything is wiped:

grub> ls
(proc) (md/openqa) (lvm/system-swap) (lvm/system-root) (hd0) (hd0,gpt2) 
(hd0,gpt1) (hd1) (hd1,msdos1)

Actions #6

Updated by mkittler about 2 years ago

Since the automatic action is interfering with manual recovery I've silenced the alert in Grafana (which hopefully works, I'm not quite sure since we have migrated that).

I'll have a look on to see how one could mount a recovery ISO.

Actions #7

Updated by mkittler about 2 years ago

Since the ticket description cannot be changed (due to use of certain characters progress cannot cope with), I'll add rollback steps here:

Rollback steps

  • If the worker can be recovered
    • Add it back to salt
    • Remove silence for alert on automatic actions panel
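The salt part of these steps could look roughly like the sketch below, run on the salt master. The minion ID is an assumption (the ticket only gives the short host name), and the `run` helper and `DRY_RUN` switch are hypothetical; by default the commands are only printed. Removing the Grafana silence is not covered here.

```shell
#!/bin/bash
# Hypothetical minion ID; the ticket only gives the short host name.
MINION_ID="${MINION_ID:-openqaworker-arm-1.example}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Removing the worker from salt would be the reverse operation:
#   run salt-key -y -d "$MINION_ID"

# Rollback: accept the minion key again once the worker is back up,
# then check that the minion responds.
run salt-key -y -a "$MINION_ID"
run salt "$MINION_ID" test.ping
```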

Actions #8

Updated by mkittler about 2 years ago

  • Assignee deleted (mkittler)

Looks like a vKVM session would be possible but I haven't managed to make it work. None of the approaches mentioned on or the next section helped.

Actions #9

Updated by livdywan about 2 years ago

  • Copied to action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added

Actions #10

Updated by livdywan about 2 years ago

  • Status changed from In Progress to Rejected

Actions #11

Updated by jbaier_cz about 2 years ago

  • Copied to deleted (action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1)

Actions #12

Updated by jbaier_cz about 2 years ago

  • Has duplicate action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added
