action #116437

openQA Project - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

Recover qa-power8-5 size:M

Added by cdywan 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Observation

Apparently QA-Power8-5-kvm is not reachable over SSH, which causes failures in the osd-deployment pipeline:

QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code

Note that this is not #116113 since other salt minions respond just fine.
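
For triage, the salt output above can be scanned for minions that did not return. This is only a minimal sketch, assuming the output layout quoted in this ticket (minion ID on its own unindented line ending in a colon, followed by an indented message). The function name `unresponsive_minions` is hypothetical and not part of salt.

```python
import re


def unresponsive_minions(salt_output: str) -> list[str]:
    """Return minion IDs whose result block says 'Minion did not return'."""
    missing = []
    current = None
    for line in salt_output.splitlines():
        # Assumption: a minion header is an unindented token ending in a colon,
        # with nothing after it (so "ERROR: ..." lines are not headers).
        header = re.match(r"^(\S+):\s*$", line)
        if header:
            current = header.group(1)
        elif current and "Minion did not return" in line:
            missing.append(current)
    return missing


# Sample taken from the pipeline output quoted in this ticket.
sample = """QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
"""
print(unresponsive_minions(sample))
```

With the sample from this ticket, the only reported minion is QA-Power8-5-kvm.qa.suse.de, which matches the observation that other minions respond fine.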

Acceptance criteria

  • AC1: QA-Power8-5-kvm is not showing up as a broken salt minion

Suggestions

  • Fix SSH access to QA-Power8-5-kvm
  • Remove QA-Power8-5-kvm from salt configurations and leave further investigations to #114565

Related issues

Related to openQA Infrastructure - action #114565: recover qa-power8-4+qa-power8-5 size:M (Blocked, 2022-07-22, 2023-02-10)

Related to openQA Infrastructure - action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers (New, 2022-09-12)

Related to openQA Infrastructure - action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup (Resolved)

History

#1 Updated by cdywan 3 months ago

  • Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added

#2 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#3 Updated by mkittler 3 months ago

  • Subject changed from recover qa-power8-4+qa-power8-5 size:M to Recover qa-power8-5 size:M
  • Status changed from New to In Progress

I'm recovering qa-power8-5. Note that qa-power8-4 is not broken and also not mentioned in the ticket description. Hence I'm removing it from the ticket title.

#4 Updated by mkittler 3 months ago

  • Status changed from In Progress to Resolved

This time it wasn't a crash (unlike the last time I had to recover the worker, see #114565#note-40). The worker shut down normally but then didn't come up again. Since there are no log messages in the journal between the shutdown and the next boot, it must have been stuck somewhere early in the boot. I don't know where exactly because there was no output over SOL. A power reset helped and the machine came back without problems.

This is the journal around the shutdown and the subsequent boot:

Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Forwarding Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81634]: Exiting ovs-vswitchd (100958)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-vswitchd.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Forwarding Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-delete-transient-ports.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Delete Transient Ports.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Database Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81658]: Exiting ovsdb-server (100900)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovsdb-server.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Database Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped target Preparation for Network.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping firewalld - dynamic firewall daemon...
-- Boot 4b14fb20a8df443d815bb85c60796d74 --
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=20: shift=20, sllp=0x0130, avpnm=0x00000000, tlbiel=0, penc=2
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Enabling pkeys with max key count 32
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Disabling hardware transactional memory (HTM)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Access Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Execution Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using 1TB segments
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Initializing hash mmu with SLB
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Linux version 5.14.21-150400.24.18-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP Thu Aug 4 14:17:48 UTC 2022 (e9f7bfc)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Secure boot mode disabled
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Found initrd at 0xc000000004470000:0xc000000005ea059d
Sep 12 12:50:33 QA-Power8-5-kvm kernel: OPAL: Found non-mapped LPC bus on chip 0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using PowerNV machine description
Sep 12 12:50:33 QA-Power8-5-kvm kernel: printk: bootconsole [udbg0] enabled
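
The length of the outage can be read off the journal excerpt above by comparing the last entry before the `-- Boot … --` marker with the first entry after it. The following is a rough sketch, not an existing tool: the function name `boot_gaps` is hypothetical, and the year is an assumption (2022, taken from the kernel build date in the log) because these journal timestamps carry no year.

```python
import re
from datetime import datetime, timedelta

# Line that journalctl prints between two boots, as seen in the excerpt.
BOOT_MARKER = re.compile(r"^-- Boot [0-9a-f]+ --$")
# Leading timestamp such as "Sep 11 03:31:36".
STAMP = re.compile(r"^([A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) ")


def boot_gaps(journal_text: str, year: int = 2022) -> list[timedelta]:
    """Return the time between the last entry of one boot and the first of the next."""
    gaps = []
    last_stamp = None
    after_marker = False
    for line in journal_text.splitlines():
        if BOOT_MARKER.match(line.strip()):
            after_marker = True
            continue
        found = STAMP.match(line)
        if not found:
            continue
        # Assumption: all entries fall in the given year.
        stamp = datetime.strptime(f"{year} {found.group(1)}", "%Y %b %d %H:%M:%S")
        if after_marker and last_stamp is not None:
            gaps.append(stamp - last_stamp)
        after_marker = False
        last_stamp = stamp
    return gaps


# Three lines copied from the excerpt in this comment.
excerpt = """Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping firewalld - dynamic firewall daemon...
-- Boot 4b14fb20a8df443d815bb85c60796d74 --
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using PowerNV machine description
"""
for gap in boot_gaps(excerpt):
    print(gap)  # prints "1 day, 9:18:57"
```

For this excerpt the gap is roughly 33 hours, which covers both the time the machine was stuck and the time until the power reset.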

I've retriggered the OSD deployment pipeline, which had been failing because of this. It has just succeeded, so I suppose we can consider this issue resolved. (At some point we should probably have automatic recovery for all our workers, not just for the ARM machines.)

#5 Updated by mkittler 3 months ago

  • Related to action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers added

#6 Updated by okurz 2 months ago

  • Related to action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup added
