action #116060

Updated by mkittler 5 months ago

arm-1 has been down for almost 10 days: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=now-10d&to=now

As far as I can see, the alert has not been handled so far, so I'm creating a ticket now (hopefully not a duplicate).

The worker is reachable via IPMI but fails on the NVMe setup:

```
[ 51.594064][ T766] nicvf 0002:01:00.2: enabling device (0004 -> 0006)
[ 51.624678][ T766] nicvf 0002:01:00.3: enabling device (0004 -> 0006)
[ 51.655248][ T766] nicvf 0002:01:00.4: enabling device (0004 -> 0006)
[ 51.680192][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/mapper/system-root changed to /dev/dm-1 scanned by systemd-udevd (1097)
[ 51.685843][ T766] nicvf 0002:01:00.5: enabling device (0004 -> 0006)
[ 51.697377][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/dm-1 changed to /dev/mapper/system-root scanned by systemd-udevd (1097)
[ 51.725401][ T766] nicvf 0002:01:00.6: enabling device (0004 -> 0006)
[ 51.755433][ T766] nicvf 0002:01:00.7: enabling device (0004 -> 0006)
[ 51.785576][ T766] nicvf 0002:01:01.0: enabling device (0004 -> 0006)
[ 51.815515][ T766] nicvf 0002:01:01.1: enabling device (0004 -> 0006)
[ 51.845446][ T766] nicvf 0002:01:01.2: enabling device (0004 -> 0006)
[ 51.875885][ T766] nicvf 0002:01:01.3: enabling device (0004 -> 0006)
[FAILED] Failed to start Setup NVMe before mounting it.
See 'systemctl status openqa_nvme_format.service' for details.
[DEPEND] Dependency failed for /var/lib/openqa.ling device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #6.ing device (0004 -> 0006)
[DEPEND] Dependency failed for Prepare NVMe after mounting it.04 -> 0006)
[DEPEND] Dependency failed for openQA Worker #8.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #3.
[DEPEND] Dependency failed for openQA Worker #9.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #5.ing device (0004 -> 0006)
[DEPEND] Dependency failed for Local File Systems.g device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #2.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #4.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #1.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #10.ng device (0004 -> 0006)
[DEPEND] Dependency failed for var-lib-openqa-share.automount.04 -> 0006)
[DEPEND] Dependency failed for openQA Worker #7.ing device (0004 -> 0006)
[ OK ] Reached target Timer Units.:03.2: enabling device (0004 -> 0006)
Starting Restore /run/initramfs on shutdown...ice (0004 -> 0006)
[ OK ] Reached target Host and Network Name Lookups.vice (0004 -> 0006)
[ OK ] Reached target System Time Synchronized.
[ OK ] Reached target Login Prompts. on device 8:2...
[ OK ] Reached target User and Group Name Lookups.device (0004 -> 0006)
[ OK ] Reached target Remote File Systems.nabling device (0004 -> 0006)
[ OK ] Reached target Preparation for Network. 8:2.
[ OK ] Reached target Network.161HJ primary.
[ OK ] Reached target Network is Online.
[ OK ] Reached target Path Units.): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[ OK ] Reached target Socket Units.
[ OK ] Started Emergency Shell.o Complete Device Initialization.
[ OK ] Reached target Emergency Mode.up NV…before mounting it (44s / 5min 1s)
Starting Tell Plymouth To Write Out Runtime Data...abort: 96000210 [#1] PREEMPT SMP
Starting Create Volatile Files and Directories...s_cp437 vfat fat joydev mdio_thunder nicvf cavium_ptp cavium_rng_vf mdio_cavium nicpf thunder_bgx thunderx_edac thunder_xcv cavium_rng ipmi_ssif uio_pdrv_genirq ipmi_devintf efi_[ OK ] Finished Restore /run/initramfs on shutdown.bles x_tables hid_generic usbhid ast i2c_algo_bit drm_vram_helper drm_kms_helper sd_mod xhci_pci syscopyarea sysfillrect xhci_pci_renesas sysimgblt fb_sys_fops xhci_hcd cec crct10dif_[ OK ] Finished Tell Plymouth To Write Out Runtime Data.sha2_ce libahci thunderx_mmc usbcore nvme_core drm sha256_arm64 sha1_ce gpio_keys t10_pi libata usb_common mmc_core i2c_thunderx btrfs blake2b_generic libcrc32c xor xor_neon raid[ OK ] Finished Create Volatile Files and Directories.dge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs aes_ce_blk crypto_simd cryptd aes_ce_cipher
Starting Record System Boot/Shutdown in UTMP...s are loaded
[ OK ] Finished Record System Boot/Shutdown in UTMP./24:1H Tainted: G N 5.14.21-150400.24.18-default #1 SLE15-SP4 6e36986a976ce35b699a98bb4b0ce7312b4d04e2
Starting Record Runlevel Change in UTMP...rX CRB/MT30-GS1, BIOS T19 09/29/2016
[ OK ] Finished Record Runlevel Change in UTMP.timeout_work
You are in emergency mode. After logging in, type "journalctl -xb" to view)
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.: blk_mq_check_expired+0x94/0xc0
Give root password for maintenance00171fbbe0
(or press Control-D to continue): 00171fbbe0 x28: 0000000000000000 x27: ffff000102661120
[ 91.702987][ T1345] x26: 000000000000001c x25: ffff00010d079938 x24: ffff0001162c3928
[ 91.711763][ T1345] x23: ffff0001162c3800 x22: ffff000102730600 x21: ffff80001158d000
[ 91.720551][ T1345] x20: ffff0001060b4000 x19: ffff800014bee01c x18: 0000000000000000
[ 91.729353][ T1345] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 91.738166][ T1345] x14: 0000000000000000 x13: 0000000000000038 x12: 0101010101010101
[ 91.746975][ T1345] x11: 7f7f7f7f7f7f7f7f x10: fefefefefefefeff x9 : ffff8000107e28d4
[ 91.755784][ T1345] x8 : 8080808080808080 x7 : ffffffffffffffff x6 : 0000000000000001
[ 91.764611][ T1345] x5 : 00000000ffffad6b x4 : 0000000000000015 x3 : 0000000000000000
[ 91.773451][ T1345] x2 : ffff800009a287c0 x1 : 0000000000000000 x0 : ffff80001158d628
[ 91.782304][ T1345] Call trace:
[ 91.786430][ T1345] nvme_timeout+0x5c/0x380 [nvme 94e89dee374936950a3d9b7ed2710d061e1acef2]
[ 91.795933][ T1345] blk_mq_check_expired+0x94/0xc0
[ 91.801848][ T1345] bt_iter+0xa0/0xc0
[ 91.806626][ T1345] blk_mq_queue_tag_busy_iter+0x1f8/0x380
[ 91.813255][ T1345] blk_mq_timeout_work+0x11c/0x1c0
[
```
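Because the serial console interleaves kernel messages with systemd status lines, the failed units are easy to miss in a capture like the one above. The following is a minimal sketch, not part of the existing tooling, that pulls the `[FAILED]` and `[DEPEND]` lines out of such a log. The function name `failed_units` and the sample text are hypothetical; the sketch also makes no attempt to strip the leftover characters that the console overwrites leave at the end of some lines.

```python
import re

# Matches systemd console status lines such as
# "[FAILED] Failed to start ..." or "[DEPEND] Dependency failed for ...".
STATUS_RE = re.compile(r"^\[\s*(FAILED|DEPEND)\s*\]\s+(.*)$")


def failed_units(console_log: str) -> list[tuple[str, str]]:
    """Return (status, message) pairs for FAILED and DEPEND lines.

    Hypothetical helper for triage; messages are returned as captured,
    including any residue from overlapping console output.
    """
    results = []
    for line in console_log.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            results.append((match.group(1), match.group(2)))
    return results


# Sample lines copied from the console output in this ticket.
sample = """\
[FAILED] Failed to start Setup NVMe before mounting it.
See 'systemctl status openqa_nvme_format.service' for details.
[DEPEND] Dependency failed for openQA Worker #3.
[ OK ] Reached target Timer Units.
"""

for status, message in failed_units(sample):
    print(status, message)
```

On the log in this ticket, this would surface `openqa_nvme_format.service` ("Setup NVMe before mounting it") as the single failed unit, with the worker slots and `/var/lib/openqa` listed as dependency failures.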

---

DONE:
Once that is fixed, the worker's salt setup should also be verified: it could not be recovered because the worker was offline when I took care of that.
