action #124715

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

Failing pipelines because of unreachable machine openqaworker-arm-1

Added by livdywan about 1 year ago. Updated about 1 year ago.

Status: Rejected
Priority: High
Assignee: -
Category: -
Target version:
Start date: 2023-02-08
Due date:
% Done: 0%
Estimated time:

Description

Observation

./ipmi-recover-worker fails in the Grafana webhook actions pipeline.

❯ ssh openqaworker-arm-1
ssh: connect to host openqaworker-arm-1 port 22: Connection timed out                                                   
lost connection

Acceptance criteria

  • AC1:

Suggestions


Related issues 1 (0 open, 1 closed)

Has duplicate openQA Infrastructure - action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 - Resolved - mkittler - 2023-02-08

Actions #1

Updated by livdywan about 1 year ago

Following the action in the alert email I silenced the Packet loss alert. Here's hoping that's what I wanted to do ;-)

Actions #2

Updated by jbaier_cz about 1 year ago

The Grafana webhook initiates the reboot, and it seems that the reboot sometimes does have an effect on the host (probably not within the first hour), as it is often enough to resolve the "Packet loss between worker hosts and other hosts" alert. Unfortunately, it also seems that the host does not stay up for long.

Actions #3

Updated by mkittler about 1 year ago

For now, I've removed the worker from salt and re-triggered the deployment.

The IPMI connection works so I'll try to power cycle it manually and follow SOL to see what's the problem.
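
As a rough sketch, a manual power cycle with SOL attached could look like the following with ipmitool. The BMC host name is guessed from the worker name, the user name is a placeholder and the password is expected in the IPMI_PASSWORD environment variable (which ipmitool reads with -E); none of these values are confirmed by this ticket. The helper only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: power cycle a worker via its BMC and attach to Serial-over-LAN.
# Assumptions (not from the ticket): BMC host name, user name, and the
# password being exported as IPMI_PASSWORD (read by "ipmitool -E").
BMC_HOST="${BMC_HOST:-openqaworker-arm-1-ipmi.suse.de}"  # assumed BMC host
IPMI_USER="${IPMI_USER:-admin}"                          # placeholder user
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = execute it

# Run (or just print) a single ipmitool command against the BMC.
ipmi() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "ipmitool -I lanplus -H $BMC_HOST -U $IPMI_USER -E $*"
    else
        ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -E "$@"
    fi
}

recover() {
    # Confirm the BMC answers before doing anything disruptive.
    ipmi chassis power status || return 1
    # Drop a stale SOL session, if any; failure here is not fatal.
    ipmi sol deactivate || true
    # Hard power cycle of the host.
    ipmi chassis power cycle || return 1
    # Follow the boot on the serial console.
    ipmi sol activate
}
```

After sourcing the file, calling recover prints the four commands; with DRY_RUN=0 they are executed, and the SOL session can be left again with the `~.` escape sequence.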

Actions #4

Updated by mkittler about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Actions #5

Updated by mkittler about 1 year ago

The system was stuck in the GRUB rescue shell when I entered the SOL session. I tried to boot it nevertheless but got interrupted by the system rebooting. Maybe the automatic action triggered a reboot again.

The machine presumably needs to be re-installed, or we can recover it via a rescue medium. GRUB shows the following devices and partitions, so presumably not everything is wiped:

grub> ls
(proc) (md/openqa) (lvm/system-swap) (lvm/system-root) (hd0) (hd0,gpt2)
(hd0,gpt1) (hd1) (hd1,msdos1)

Actions #6

Updated by mkittler about 1 year ago

Since the automatic action is interfering with manual recovery, I've silenced the alert in Grafana (which hopefully works; I'm not quite sure since we have migrated it).

I'll have a look at https://openqaworker-arm-1-ipmi.suse.de/index.html to see how one could mount a recovery ISO.

Actions #7

Updated by mkittler about 1 year ago

Since the ticket description cannot be changed (due to the use of certain characters that Progress cannot cope with), I'll add rollback steps here:

Rollback steps

  • If the worker can be recovered
    • Add it back to salt
    • Remove silence for alert on automatic actions panel
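
The "add it back to salt" step above could be sketched roughly as follows, run on the salt master. The minion ID is an assumption, as is the idea that the worker was removed by deleting its key, so that the key has to be accepted again once the minion has re-submitted it. Removing the silence in Grafana is left as a manual step. The helper only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch of re-adding a recovered worker to salt, run on the salt master.
# Assumptions (not from the ticket): the minion ID, and that the minion
# key was deleted earlier and is now pending acceptance again.
MINION="${MINION:-openqaworker-arm-1.suse.de}"   # assumed minion ID
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Run (or just print) a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

readd_worker() {
    # Accept the pending minion key without asking for confirmation.
    run salt-key -y -a "$MINION" || return 1
    # Check that the minion responds.
    run salt "$MINION" test.ping || return 1
    # Re-apply the configured states (highstate).
    run salt "$MINION" state.apply
}
```

Calling readd_worker after sourcing the file lists the three commands in order; each step stops the sequence if it fails when executed with DRY_RUN=0.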

Actions #8

Updated by mkittler about 1 year ago

  • Assignee deleted (mkittler)

It looks like a vKVM session would be possible, but I haven't managed to make it work. None of the approaches mentioned at https://progress.opensuse.org/projects/openqav3/wiki#Accessing-old-BMCs-with-Java-iKVM-Viewer-when-ipmitool-does-not-work-eg-imagetester or in the section following it helped.

Actions #9

Updated by livdywan about 1 year ago

  • Copied to action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added

Actions #10

Updated by livdywan about 1 year ago

  • Status changed from In Progress to Rejected

Actions #11

Updated by jbaier_cz about 1 year ago

  • Copied to deleted (action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1)

Actions #12

Updated by jbaier_cz about 1 year ago

  • Has duplicate action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 added