This time it isn't a crash (like the last time I had to recover the worker, see #114565#note-40). The worker was shutting down normally but then didn't come up again. Considering that there were no log messages in the journal after the shutdown, it must have gotten stuck somewhere early in the boot. I don't know where exactly because there was nothing over SOL. A power reset helped and the machine came back without problems.
Here is the journal shortly before the shutdown:
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Forwarding Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81634]: Exiting ovs-vswitchd (100958)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-vswitchd.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Forwarding Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovs-delete-transient-ports.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Delete Transient Ports.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping Open vSwitch Database Unit...
Sep 11 03:31:36 QA-Power8-5-kvm ovs-ctl[81658]: Exiting ovsdb-server (100900)..done
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: ovsdb-server.service: Deactivated successfully.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped Open vSwitch Database Unit.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopped target Preparation for Network.
Sep 11 03:31:36 QA-Power8-5-kvm systemd[1]: Stopping firewalld - dynamic firewall daemon...
-- Boot 4b14fb20a8df443d815bb85c60796d74 --
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=20: shift=20, sllp=0x0130, avpnm=0x00000000, tlbiel=0, penc=2
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Enabling pkeys with max key count 32
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Disabling hardware transactional memory (HTM)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Access Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Activating Kernel Userspace Execution Prevention
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using 1TB segments
Sep 12 12:50:33 QA-Power8-5-kvm kernel: hash-mmu: Initializing hash mmu with SLB
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Linux version 5.14.21-150400.24.18-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP Thu Aug 4 14:17:48 UTC 2022 (e9f7bfc)
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Secure boot mode disabled
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Found initrd at 0xc000000004470000:0xc000000005ea059d
Sep 12 12:50:33 QA-Power8-5-kvm kernel: OPAL: Found non-mapped LPC bus on chip 0
Sep 12 12:50:33 QA-Power8-5-kvm kernel: Using PowerNV machine description
Sep 12 12:50:33 QA-Power8-5-kvm kernel: printk: bootconsole [udbg0] enabled
I've retriggered the OSD deployment pipeline, which was failing due to that. It has just succeeded, so I suppose we can consider this issue resolved. (At some point we should likely have automatic recovery for all our workers, not just the ARM machines; a rough sketch of what that could look like follows below.)
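
As a minimal sketch of such automatic recovery: the idea would be to ping the worker and, if it is unreachable, issue an IPMI power cycle via its BMC, i.e. the same thing the manual power reset did here. The hostnames, credentials, and script structure below are placeholders/assumptions, not our existing tooling (the ARM recovery works differently and any real implementation should live in our monitoring/salt setup):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: power-cycle an unreachable worker via its BMC.

Assumes ipmitool is installed; all hosts and credentials are placeholders.
"""
import subprocess

WORKER_HOST = "qa-power8-5-kvm.example.org"   # placeholder worker hostname
BMC_HOST = "qa-power8-5-kvm-bmc.example.org"  # placeholder BMC address
BMC_USER = "ADMIN"                            # placeholder credentials
BMC_PASS = "secret"


def worker_reachable(host: str, timeout: int = 5) -> bool:
    """Return True if the worker answers a single ping within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), host],
        capture_output=True,
    )
    return result.returncode == 0


def power_cycle(bmc: str, user: str, password: str) -> None:
    """Issue an IPMI power cycle, equivalent to the manual power reset."""
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", user, "-P", password,
         "chassis", "power", "cycle"],
        check=True,
    )


if __name__ == "__main__":
    if not worker_reachable(WORKER_HOST):
        power_cycle(BMC_HOST, BMC_USER, BMC_PASS)
```

In practice this would of course need retries, alerting, and a guard against power-cycling a machine that is merely rebooting, but it shows the basic loop: detect unreachability, then do over IPMI what was done by hand here.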