action #116060

Updated by mkittler over 1 year ago

openqaworker-arm-1 has been down for almost 10 days: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=now-10d&to=now

As far as I can see, the alert has been left unhandled so far, so I'm creating a ticket now (hopefully not a duplicate).

The worker is reachable via IPMI but fails during the NVMe setup on boot:

 ``` 
 [     51.594064][    T766] nicvf 0002:01:00.2: enabling device (0004 -> 0006) 
 [     51.624678][    T766] nicvf 0002:01:00.3: enabling device (0004 -> 0006) 
 [     51.655248][    T766] nicvf 0002:01:00.4: enabling device (0004 -> 0006) 
 [     51.680192][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/mapper/system-root changed to /dev/dm-1 scanned by systemd-udevd (1097) 
 [     51.685843][    T766] nicvf 0002:01:00.5: enabling device (0004 -> 0006) 
 [     51.697377][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/dm-1 changed to /dev/mapper/system-root scanned by systemd-udevd (1097) 
 [     51.725401][    T766] nicvf 0002:01:00.6: enabling device (0004 -> 0006) 
 [     51.755433][    T766] nicvf 0002:01:00.7: enabling device (0004 -> 0006) 
 [     51.785576][    T766] nicvf 0002:01:01.0: enabling device (0004 -> 0006) 
 [     51.815515][    T766] nicvf 0002:01:01.1: enabling device (0004 -> 0006) 
 [     51.845446][    T766] nicvf 0002:01:01.2: enabling device (0004 -> 0006) 
 [     51.875885][    T766] nicvf 0002:01:01.3: enabling device (0004 -> 0006) 
 [FAILED] Failed to start Setup NVMe before mounting it. 
 See 'systemctl status openqa_nvme_format.service' for details. 
 [DEPEND] Dependency failed for /var/lib/openqa.ling device (0004 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #6.ing device (0004 -> 0006) 
 [DEPEND] Dependency failed for Prepare NVMe after mounting it.04 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #8.ing device (0004 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #3. 
 [DEPEND] Dependency failed for openQA Worker #9.ing device (0004 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #5.ing device (0004 -> 0006) 
 [DEPEND] Dependency failed for Local File Systems.g device (0004 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #2.ing device (0004 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #4.ing device (0004 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #1.ing device (0004 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #10.ng device (0004 -> 0006) 
 [DEPEND] Dependency failed for var-lib-openqa-share.automount.04 -> 0006) 
 [DEPEND] Dependency failed for openQA Worker #7.ing device (0004 -> 0006) 
 [    OK    ] Reached target Timer Units.:03.2: enabling device (0004 -> 0006) 
          Starting Restore /run/initramfs on shutdown...ice (0004 -> 0006) 
 [    OK    ] Reached target Host and Network Name Lookups.vice (0004 -> 0006) 
 [    OK    ] Reached target System Time Synchronized. 
 [    OK    ] Reached target Login Prompts. on device 8:2... 
 [    OK    ] Reached target User and Group Name Lookups.device (0004 -> 0006) 
 [    OK    ] Reached target Remote File Systems.nabling device (0004 -> 0006) 
 [    OK    ] Reached target Preparation for Network. 8:2. 
 [    OK    ] Reached target Network.161HJ primary. 
 [    OK    ] Reached target Network is Online. 
 [    OK    ] Reached target Path Units.): Volume was not properly unmounted. Some data may be corrupt. Please run fsck. 
 [    OK    ] Reached target Socket Units. 
 [    OK    ] Started Emergency Shell.o Complete Device Initialization. 
 [    OK    ] Reached target Emergency Mode.up NV…before mounting it (44s / 5min 1s) 
          Starting Tell Plymouth To Write Out Runtime Data...abort: 96000210 [#1] PREEMPT SMP 
          Starting Create Volatile Files and Directories...s_cp437 vfat fat joydev mdio_thunder nicvf cavium_ptp cavium_rng_vf mdio_cavium nicpf thunder_bgx thunderx_edac thunder_xcv cavium_rng ipmi_ssif uio_pdrv_genirq ipmi_devintf efi_[    OK    ] Finished Restore /run/initramfs on shutdown.bles x_tables hid_generic usbhid ast i2c_algo_bit drm_vram_helper drm_kms_helper sd_mod xhci_pci syscopyarea sysfillrect xhci_pci_renesas sysimgblt fb_sys_fops xhci_hcd cec crct10dif_[    OK    ] Finished Tell Plymouth To Write Out Runtime Data.sha2_ce libahci thunderx_mmc usbcore nvme_core drm sha256_arm64 sha1_ce gpio_keys t10_pi libata usb_common mmc_core i2c_thunderx btrfs blake2b_generic libcrc32c xor xor_neon raid[    OK    ] Finished Create Volatile Files and Directories.dge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs aes_ce_blk crypto_simd cryptd aes_ce_cipher 
          Starting Record System Boot/Shutdown in UTMP...s are loaded 
 [    OK    ] Finished Record System Boot/Shutdown in UTMP./24:1H Tainted: G                    N 5.14.21-150400.24.18-default #1 SLE15-SP4 6e36986a976ce35b699a98bb4b0ce7312b4d04e2 
          Starting Record Runlevel Change in UTMP...rX CRB/MT30-GS1, BIOS T19 09/29/2016 
 [    OK    ] Finished Record Runlevel Change in UTMP.timeout_work 
 You are in emergency mode. After logging in, type "journalctl -xb" to view) 
 system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" 
 to boot into default mode.: blk_mq_check_expired+0x94/0xc0 
 Give root password for maintenance00171fbbe0 
 (or press Control-D to continue): 00171fbbe0 x28: 0000000000000000 x27: ffff000102661120 
 [     91.702987][ T1345] x26: 000000000000001c x25: ffff00010d079938 x24: ffff0001162c3928 
 [     91.711763][ T1345] x23: ffff0001162c3800 x22: ffff000102730600 x21: ffff80001158d000 
 [     91.720551][ T1345] x20: ffff0001060b4000 x19: ffff800014bee01c x18: 0000000000000000 
 [     91.729353][ T1345] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 
 [     91.738166][ T1345] x14: 0000000000000000 x13: 0000000000000038 x12: 0101010101010101 
 [     91.746975][ T1345] x11: 7f7f7f7f7f7f7f7f x10: fefefefefefefeff x9 : ffff8000107e28d4 
 [     91.755784][ T1345] x8 : 8080808080808080 x7 : ffffffffffffffff x6 : 0000000000000001 
 [     91.764611][ T1345] x5 : 00000000ffffad6b x4 : 0000000000000015 x3 : 0000000000000000 
 [     91.773451][ T1345] x2 : ffff800009a287c0 x1 : 0000000000000000 x0 : ffff80001158d628 
 [     91.782304][ T1345] Call trace: 
 [     91.786430][ T1345]    nvme_timeout+0x5c/0x380 [nvme 94e89dee374936950a3d9b7ed2710d061e1acef2] 
 [     91.795933][ T1345]    blk_mq_check_expired+0x94/0xc0 
 [     91.801848][ T1345]    bt_iter+0xa0/0xc0 
 [     91.806626][ T1345]    blk_mq_queue_tag_busy_iter+0x1f8/0x380 
 [     91.813255][ T1345]    blk_mq_timeout_work+0x11c/0x1c0 
 [     
 ``` 
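Note that the console output above is interleaved because several boot messages were written over each other on the serial console; the relevant parts are the failed `openqa_nvme_format.service` unit and the kernel oops in `nvme_timeout`. From the emergency shell (reached e.g. via `ipmitool ... sol activate`), the failing unit and the NVMe devices could be inspected roughly as sketched below. This is a dry-run sketch, not what was actually run: by default it only prints the commands, and `nvme list` assumes nvme-cli is installed; set `RUN=1` on the actual worker to execute them.

```shell
# Sketch of emergency-shell diagnostics for the failed NVMe setup.
# By default only prints the commands; set RUN=1 to really execute them.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi; }

run systemctl status openqa_nvme_format.service   # unit named in the boot log
run journalctl -b -u openqa_nvme_format.service   # full log of the failed unit
run lsblk -o NAME,SIZE,TYPE,MOUNTPOINT            # are the NVMe devices visible at all?
run nvme list                                     # assumes nvme-cli is available
```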

 --- 

DONE: Once that is fixed, the worker's salt setup should also be verified; it could not be recovered earlier because the worker was offline when I took care of that.
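Verifying the salt setup from the salt master could look roughly like the sketch below. The minion target `openqaworker-arm-1*` is an assumption based on the host name in the dashboard URL; by default the script only prints the commands, set `RUN=1` to execute them on the actual master.

```shell
# Sketch of salt verification once the worker is back online.
# Minion ID pattern is assumed, not confirmed; dry-run by default.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi; }

run salt-key --list accepted                        # minion key should be accepted
run salt 'openqaworker-arm-1*' test.ping            # minion should respond
run salt 'openqaworker-arm-1*' state.apply test=True  # dry-run the high state
```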
