action #116437
closed
openQA Project (public) - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4
Recover qa-power8-5 size:M
Description
Observation
Apparently QA-Power8-5-kvm isn't responsive over SSH and causes failures in the osd-deployment pipeline:
QA-Power8-5-kvm.qa.suse.de:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
Note that this is not #116113 since other salt minions respond just fine.
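To tell a single broken minion apart from a general salt problem, the pipeline output can be scanned for minions that did not return. The following is only a minimal sketch: the function name `unresponsive_minions` is hypothetical, and it assumes the plain-text output format quoted above, where a minion ID line ends with a colon and is followed by its result.

```python
import re

# A minion ID line is unindented and ends with a colon,
# e.g. "QA-Power8-5-kvm.qa.suse.de:" (format assumed from the output quoted in this ticket).
MINION_LINE = re.compile(r"^(\S+):\s*$")


def unresponsive_minions(salt_output: str) -> list[str]:
    """Return the IDs of minions whose result is 'Minion did not return.'"""
    missing = []
    current = None
    for line in salt_output.splitlines():
        match = MINION_LINE.match(line)
        if match:
            # Remember which minion the following result lines belong to.
            current = match.group(1)
        elif current and "Minion did not return" in line:
            missing.append(current)
            current = None
    return missing


sample = """\
QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
"""
print(unresponsive_minions(sample))  # ['QA-Power8-5-kvm.qa.suse.de']
```

If the returned list contains only one host while all other minions answer, the problem is most likely specific to that machine rather than to the salt master.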
Acceptance criteria
- AC1: QA-Power8-5-kvm is not showing up as a broken salt minion
Suggestions
- Fix SSH access to QA-Power8-5-kvm
- Remove QA-Power8-5-kvm from salt configurations and leave further investigations to #114565
Updated by livdywan about 2 years ago
- Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added
Updated by mkittler about 2 years ago
- Subject changed from recover qa-power8-4+qa-power8-5 size:M to Recover qa-power8-5 size:M
- Status changed from New to In Progress
I'm recovering qa-power8-5. Note that qa-power8-4 is not broken and also not mentioned in the ticket description. Hence I'm removing it from the ticket title.
Updated by mkittler about 2 years ago
- Status changed from In Progress to Resolved
This time it isn't a crash (as it was the last time I had to recover the worker, see #114565#note-40). The worker shut down normally but then didn't come up again. Since there are no log messages in the journal after the shutdown, it must have been stuck somewhere early in the boot. I don't know where exactly because there was nothing over SOL. A power reset helped and the machine came back without problems.
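The ticket does not say how the power reset was issued. As a hedged illustration only, assuming the machine's BMC is reachable via IPMI and `ipmitool` is installed, the command line could be assembled like this. The helper name, BMC hostname and user name are hypothetical; `-E` makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable so it does not appear in the argument list.

```python
# Chassis power actions accepted by "ipmitool chassis power".
ALLOWED_ACTIONS = {"status", "on", "off", "cycle", "reset", "soft"}


def ipmi_power_command(bmc_host: str, user: str, action: str = "reset") -> list[str]:
    """Build (but do not run) an ipmitool command line for a chassis power action."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 over LAN
        "-H", bmc_host,    # hypothetical BMC address, not taken from the ticket
        "-U", user,        # hypothetical BMC user
        "-E",              # read the password from IPMI_PASSWORD
        "chassis", "power", action,
    ]


# Example with placeholder values; pass the list to subprocess.run() to execute it.
print(" ".join(ipmi_power_command("qa-power8-5-bmc.example.org", "admin")))
```

Building the argument list separately from running it keeps the sketch easy to review and avoids shell quoting issues.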
This is the journal from shortly before the shutdown up to the start of the next boot:
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Forwarding Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81634]: Exiting ovs-vswitchd (100958)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-vswitchd.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Forwarding Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-delete-transient-ports.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Delete Transient Ports.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Database Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81658]: Exiting ovsdb-server (100900)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovsdb-server.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Database Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped target Preparation for Network.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping firewalld - dynamic firewall daemon...
-- Boot 4b14fb20a8df443d815bb85c60796d74 --
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=20: shift=20, sllp=0x0130, avpnm=0x00000000, tlbiel=0, penc=2
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Enabling pkeys with max key count 32
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Disabling hardware transactional memory (HTM)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Access Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Execution Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using 1TB segments
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Initializing hash mmu with SLB
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Linux version 5.14.21-150400.24.18-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP Thu Aug 4 14:17:48 UTC 2022 (e9f7bfc)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Secure boot mode disabled
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Found initrd at 0xc000000004470000:0xc000000005ea059d
Sep 12 12:50:33 QA-Power8-5-kvm kernel: OPAL: Found non-mapped LPC bus on chip 0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using PowerNV machine description
Sep 12 12:50:33 QA-Power8-5-kvm kernel: printk: bootconsole [udbg0] enabled
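The length of the outage can be read off the journal by comparing the last timestamp before the `-- Boot ... --` marker with the first one after it. Below is a minimal sketch of that calculation. The function names are hypothetical, the year 2022 is an assumption taken from the kernel build date because the short journal format carries no year, and parsing `%b` assumes an English/C locale.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2022  # assumption: the journal's short timestamp format has no year


def parse_stamp(line: str) -> datetime:
    """Parse the leading 'Sep 11 03:31:36' timestamp of a journal line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def boot_gaps(journal_lines: list[str]) -> list[timedelta]:
    """Return the time between the last entry of one boot and the first entry of the next."""
    gaps = []
    previous = None
    after_marker = False
    for line in journal_lines:
        if line.startswith("-- Boot "):
            after_marker = True
            continue
        current = parse_stamp(line)
        if after_marker and previous is not None:
            gaps.append(current - previous)
        after_marker = False
        previous = current
    return gaps


lines = [
    "Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping firewalld - dynamic firewall daemon...",
    "-- Boot 4b14fb20a8df443d815bb85c60796d74 --",
    "Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using PowerNV machine description",
]
print(boot_gaps(lines)[0])  # 1 day, 9:18:57
```

For the excerpt above this gives a gap of roughly 33 hours between the last shutdown message and the first kernel message of the recovered boot.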
I've retriggered the OSD deployment pipeline, which was failing because of the unreachable minion. It has just succeeded, so I suppose we can consider this issue resolved. (At some point we should probably have automatic recovery for all our workers, not just for the ARM machines.)
Updated by mkittler about 2 years ago
- Related to action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers added
Updated by okurz about 2 years ago
- Related to action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup added