action #116437: Recover qa-power8-5 size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #116437

closed

openQA Project (public) - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

Recover qa-power8-5 size:M

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status: Resolved

Priority: High

Assignee: mkittler

Target version: openQA Project (public) - Ready

Description

Observation¶

Apparently QA-Power8-5-kvm isn't responsive over SSH and causes failures in the osd-deployment pipeline:

QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code

Note that this is not #116113 since other salt minions respond just fine.
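
To tell a single broken minion apart from a general salt outage, the pipeline output above can be scanned for minions reporting "Minion did not return". The following is a minimal sketch, not part of the actual deployment pipeline; the function name `find_unresponsive_minions` is hypothetical, and it assumes the plain-text layout shown above (an unindented minion ID ending in a colon, followed by indented result lines).

```python
import re

NOT_RETURNED = "Minion did not return"


def find_unresponsive_minions(salt_output: str) -> list[str]:
    """Return the IDs of minions whose result block says they did not return."""
    unresponsive = []
    current = None
    for line in salt_output.splitlines():
        # A minion ID is an unindented line without spaces, ending in a colon.
        match = re.fullmatch(r"(\S+):", line)
        if match:
            current = match.group(1)
        elif current and line.startswith((" ", "\t")) and NOT_RETURNED in line:
            unresponsive.append(current)
            current = None
    return unresponsive


sample = """\
QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
"""
print(find_unresponsive_minions(sample))  # prints ['QA-Power8-5-kvm.qa.suse.de']
```

If only one host shows up in the result while the others return normally, the problem is most likely on that host rather than on the salt master.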

Acceptance criteria¶

AC1: QA-Power8-5-kvm is not showing up as a broken salt minion

Suggestions¶

Fix SSH access to QA-Power8-5-kvm
Remove QA-Power8-5-kvm from salt configurations and leave further investigations to #114565

Updated by livdywan over 2 years ago

Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added

Updated by mkittler over 2 years ago

Assignee set to mkittler

Updated by mkittler over 2 years ago

Subject changed from recover qa-power8-4+qa-power8-5 size:M to Recover qa-power8-5 size:M
Status changed from New to In Progress

I'm recovering qa-power8-5. Note that qa-power8-4 is not broken and also not mentioned in the ticket description. Hence I'm removing it from the ticket title.

Updated by mkittler over 2 years ago

Status changed from In Progress to Resolved

This time it wasn't a crash (unlike the last time I had to recover the worker, see #114565#note-40). The worker shut down normally but then didn't come up again. Since there were no log messages in the journal after the shutdown, it must have been stuck somewhere early in the boot. I don't know where exactly because there was no output over SOL. A power reset helped and the machine came back without problems.

This is the journal from shortly before the shutdown up to the start of the next boot:

Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Forwarding Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81634]: Exiting ovs-vswitchd (100958)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-vswitchd.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Forwarding Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-delete-transient-ports.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Delete Transient Ports.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Database Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81658]: Exiting ovsdb-server (100900)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovsdb-server.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Database Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped target Preparation for Network.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping firewalld - dynamic firewall daemon...
-- Boot 4b14fb20a8df443d815bb85c60796d74 --
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=20: shift=20, sllp=0x0130, avpnm=0x00000000, tlbiel=0, penc=2
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Enabling pkeys with max key count 32
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Disabling hardware transactional memory (HTM)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Access Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Execution Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using 1TB segments
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Initializing hash mmu with SLB
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Linux version 5.14.21-150400.24.18-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP Thu Aug 4 14:17:48 UTC 2022 (e9f7bfc)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Secure boot mode disabled
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Found initrd at 0xc000000004470000:0xc000000005ea059d
Sep 12 12:50:33 QA-Power8-5-kvm kernel: OPAL: Found non-mapped LPC bus on chip 0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using PowerNV machine description
Sep 12 12:50:33 QA-Power8-5-kvm kernel: printk: bootconsole [udbg0] enabled
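
The length of the outage can be read off the journal by comparing the last message before a `-- Boot … --` marker with the first message after it. The sketch below does this for the short journal format shown above. It is an illustration only: the helper names are hypothetical, and because this format carries no year, the year is passed in explicitly (2022 is an assumption based on the kernel build date in the log).

```python
import re
from datetime import datetime, timedelta

BOOT_MARKER = re.compile(r"^-- Boot [0-9a-f]+ --$")
STAMP = re.compile(r"^([A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}) ")


def parse_stamp(line: str, year: int):
    """Parse the leading 'Sep 11 03:31:36' timestamp, or return None."""
    match = STAMP.match(line)
    if not match:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def boot_gaps(journal_text: str, year: int) -> list[timedelta]:
    """Return the time between the last message of one boot and the first of the next."""
    gaps = []
    last_seen = None
    awaiting_first = False
    for line in journal_text.splitlines():
        if BOOT_MARKER.match(line.strip()):
            # Only measure a gap if we saw a message before the marker.
            awaiting_first = last_seen is not None
            continue
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if awaiting_first:
            gaps.append(stamp - last_seen)
            awaiting_first = False
        last_seen = stamp
    return gaps


excerpt = """\
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping firewalld - dynamic firewall daemon...
-- Boot 4b14fb20a8df443d815bb85c60796d74 --
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
"""
for gap in boot_gaps(excerpt, year=2022):
    print(gap)  # prints 1 day, 9:18:57
```

For the excerpt above this gives a gap of roughly 33 hours between the last shutdown message and the first kernel message of the next boot.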

I've retriggered the OSD deployment pipeline, which was failing because of this. It has just succeeded, so I suppose we can consider this issue resolved. (At some point we should probably have automatic recovery for all our workers, not just for the ARM machines.)
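
As a rough idea of what such automatic recovery could look like, here is a minimal sketch. It is not the existing ARM recovery mechanism, whose implementation is not shown in this ticket. All function names are hypothetical; the reachability check simply tests whether TCP port 22 accepts connections, and the reset command assumes the machine's BMC is reachable via `ipmitool` with the password supplied in the `IPMI_PASSWORD` environment variable (the `-E` option).

```python
import socket
import time
from typing import Callable


def ssh_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the SSH port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ipmi_reset_command(bmc_host: str, user: str) -> list[str]:
    """Build an ipmitool call for a hard power reset (password via IPMI_PASSWORD)."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "-E", "chassis", "power", "reset"]


def recover_if_down(
    is_reachable: Callable[[], bool],
    power_reset: Callable[[], object],
    attempts: int = 3,
    delay: float = 60.0,
    sleep: Callable[[float], object] = time.sleep,
) -> bool:
    """Trigger power_reset only if every one of `attempts` checks fails.

    Returns True if a reset was triggered, False if the host answered.
    """
    for attempt in range(attempts):
        if is_reachable():
            return False
        if attempt < attempts - 1:
            sleep(delay)  # wait before re-checking to ride out short blips
    power_reset()
    return True
```

A caller would pass something like `lambda: ssh_port_open("QA-Power8-5-kvm.qa.suse.de")` as `is_reachable` and a function running `subprocess.run(ipmi_reset_command(bmc_host, user), check=True)` as `power_reset`, where `bmc_host` and `user` are placeholders for values not given in this ticket. Requiring several consecutive failures avoids resetting a machine that is merely rebooting or briefly unreachable.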

Updated by mkittler over 2 years ago

Related to action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers added

Updated by okurz over 2 years ago

Related to action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup added
