action #116060
closed — Recover openqaworker-arm-1 size:M
Description
arm-1 has been down for almost 10 days: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=now-10d&to=now
As far as I can see the alert has been left unhandled so far, so I'm creating a ticket now (hopefully not a duplicate).
The worker is reachable via IPMI but fails on the NVMe setup:
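For reference, reattaching to the machine's serial console over IPMI looks roughly like this (a sketch; the BMC host name and credentials below are placeholders, not the real management interface):

```shell
# Attach to the serial-over-LAN console (placeholders for host/credentials):
ipmitool -I lanplus -H <arm-1-bmc-host> -U <user> -P <password> sol activate

# If the console seems wedged, close a stale SOL session first:
ipmitool -I lanplus -H <arm-1-bmc-host> -U <user> -P <password> sol deactivate

# Power-cycle the machine if it does not react at all:
ipmitool -I lanplus -H <arm-1-bmc-host> -U <user> -P <password> chassis power cycle
```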
[ 51.594064][ T766] nicvf 0002:01:00.2: enabling device (0004 -> 0006)
[ 51.624678][ T766] nicvf 0002:01:00.3: enabling device (0004 -> 0006)
[ 51.655248][ T766] nicvf 0002:01:00.4: enabling device (0004 -> 0006)
[ 51.680192][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/mapper/system-root changed to /dev/dm-1 scanned by systemd-udevd (1097)
[ 51.685843][ T766] nicvf 0002:01:00.5: enabling device (0004 -> 0006)
[ 51.697377][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/dm-1 changed to /dev/mapper/system-root scanned by systemd-udevd (1097)
[ 51.725401][ T766] nicvf 0002:01:00.6: enabling device (0004 -> 0006)
[ 51.755433][ T766] nicvf 0002:01:00.7: enabling device (0004 -> 0006)
[ 51.785576][ T766] nicvf 0002:01:01.0: enabling device (0004 -> 0006)
[ 51.815515][ T766] nicvf 0002:01:01.1: enabling device (0004 -> 0006)
[ 51.845446][ T766] nicvf 0002:01:01.2: enabling device (0004 -> 0006)
[ 51.875885][ T766] nicvf 0002:01:01.3: enabling device (0004 -> 0006)
[FAILED] Failed to start Setup NVMe before mounting it.
See 'systemctl status openqa_nvme_format.service' for details.
[DEPEND] Dependency failed for /var/lib/openqa.
[DEPEND] Dependency failed for openQA Worker #6.
[DEPEND] Dependency failed for Prepare NVMe after mounting it.
[DEPEND] Dependency failed for openQA Worker #8.
[DEPEND] Dependency failed for openQA Worker #3.
[DEPEND] Dependency failed for openQA Worker #9.
[DEPEND] Dependency failed for openQA Worker #5.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for openQA Worker #2.
[DEPEND] Dependency failed for openQA Worker #4.
[DEPEND] Dependency failed for openQA Worker #1.
[DEPEND] Dependency failed for openQA Worker #10.
[DEPEND] Dependency failed for var-lib-openqa-share.automount.
[DEPEND] Dependency failed for openQA Worker #7.
[ OK ] Reached target Timer Units.
Starting Restore /run/initramfs on shutdown...
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Reached target System Time Synchronized.
[ OK ] Reached target Login Prompts.
[ OK ] Reached target User and Group Name Lookups.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Preparation for Network.
[ OK ] Reached target Network.
[ OK ] Reached target Network is Online.
[ OK ] Reached target Path Units.
[ OK ] Reached target Socket Units.
[ OK ] Started Emergency Shell.
[ OK ] Reached target Emergency Mode.
(kernel/job fragments interleaved with the above included "…: Volume was not properly unmounted. Some data may be corrupt. Please run fsck." and the start-job spinner for "Setup NV…before mounting it (44s / 5min 1s)")
Starting Tell Plymouth To Write Out Runtime Data...
Starting Create Volatile Files and Directories...
[ OK ] Finished Restore /run/initramfs on shutdown.
[ OK ] Finished Tell Plymouth To Write Out Runtime Data.
[ OK ] Finished Create Volatile Files and Directories.
Starting Record System Boot/Shutdown in UTMP...
[ OK ] Finished Record System Boot/Shutdown in UTMP.
Starting Record Runlevel Change in UTMP...
[ OK ] Finished Record Runlevel Change in UTMP.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
A kernel oops was interleaved with the messages above on the serial console; the fragments that survived the overwriting:
…abort: 96000210 [#1] PREEMPT SMP
Modules linked in: …nls_cp437 vfat fat joydev mdio_thunder nicvf cavium_ptp cavium_rng_vf mdio_cavium nicpf thunder_bgx thunderx_edac thunder_xcv cavium_rng ipmi_ssif uio_pdrv_genirq ipmi_devintf efi_… …bles x_tables hid_generic usbhid ast i2c_algo_bit drm_vram_helper drm_kms_helper sd_mod xhci_pci syscopyarea sysfillrect xhci_pci_renesas sysimgblt fb_sys_fops xhci_hcd cec crct10dif_… sha2_ce libahci thunderx_mmc usbcore nvme_core drm sha256_arm64 sha1_ce gpio_keys t10_pi libata usb_common mmc_core i2c_thunderx btrfs blake2b_generic libcrc32c xor xor_neon raid… …dge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs aes_ce_blk crypto_simd cryptd aes_ce_cipher
…/24:1H Tainted: G N 5.14.21-150400.24.18-default #1 SLE15-SP4 6e36986a976ce35b699a98bb4b0ce7312b4d04e2
Hardware name: …rX CRB/MT30-GS1, BIOS T19 09/29/2016
…timeout_work
…: blk_mq_check_expired+0x94/0xc0
…00171fbbe0 x28: 0000000000000000 x27: ffff000102661120
[ 91.702987][ T1345] x26: 000000000000001c x25: ffff00010d079938 x24: ffff0001162c3928
[ 91.711763][ T1345] x23: ffff0001162c3800 x22: ffff000102730600 x21: ffff80001158d000
[ 91.720551][ T1345] x20: ffff0001060b4000 x19: ffff800014bee01c x18: 0000000000000000
[ 91.729353][ T1345] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 91.738166][ T1345] x14: 0000000000000000 x13: 0000000000000038 x12: 0101010101010101
[ 91.746975][ T1345] x11: 7f7f7f7f7f7f7f7f x10: fefefefefefefeff x9 : ffff8000107e28d4
[ 91.755784][ T1345] x8 : 8080808080808080 x7 : ffffffffffffffff x6 : 0000000000000001
[ 91.764611][ T1345] x5 : 00000000ffffad6b x4 : 0000000000000015 x3 : 0000000000000000
[ 91.773451][ T1345] x2 : ffff800009a287c0 x1 : 0000000000000000 x0 : ffff80001158d628
[ 91.782304][ T1345] Call trace:
[ 91.786430][ T1345] nvme_timeout+0x5c/0x380 [nvme 94e89dee374936950a3d9b7ed2710d061e1acef2]
[ 91.795933][ T1345] blk_mq_check_expired+0x94/0xc0
[ 91.801848][ T1345] bt_iter+0xa0/0xc0
[ 91.806626][ T1345] blk_mq_queue_tag_busy_iter+0x1f8/0x380
[ 91.813255][ T1345] blk_mq_timeout_work+0x11c/0x1c0
[
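From the emergency shell, the failing unit can be inspected with standard systemd commands (unit name taken from the log above):

```shell
# Show why the NVMe setup unit failed:
systemctl status openqa_nvme_format.service
journalctl -u openqa_nvme_format.service -b

# Full log of the current boot, as suggested by the emergency-mode prompt:
journalctl -xb

# Retry the unit manually after investigating:
systemctl restart openqa_nvme_format.service
```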
DONE: After that is fixed, the worker's salt setup should also be verified: it could not be recovered because the worker was offline when I took care of that.
Updated by mkittler over 2 years ago
It gets quite far and even attempts to create the ext2 filesystem. However, it then gets stuck and systemd eventually kills the command:
a-establish-nvme-setup[1036]: mdadm: Defaulting to version 1.2 metadata
a-establish-nvme-setup[1036]: mdadm: array /dev/md/openqa started.
a-establish-nvme-setup[1023]: Status for RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1183]: md127 : active raid0 nvme0n1[0]
a-establish-nvme-setup[1185]: ARRAY /dev/md/openqa metadata=1.2 name=openqawork>
a-establish-nvme-setup[1023]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1186]: mke2fs 1.46.4 (18-Aug-2021)
a-establish-nvme-setup[1186]: [220B blob data]
a-establish-nvme-setup[1186]: Creating filesystem with 195319936 4k blocks and >
a-establish-nvme-setup[1186]: Filesystem UUID: 3828224d-39fb-457e-a242-3a93f67d>
a-establish-nvme-setup[1186]: Superblock backups stored on blocks:
a-establish-nvme-setup[1186]: 32768, 98304, 163840, 229376, 294912, 819>
a-establish-nvme-setup[1186]: 4096000, 7962624, 11239424, 20480000, 238>
a-establish-nvme-setup[1186]: 102400000
a-establish-nvme-setup[1186]: [65B blob data]
md[1]: openqa_nvme_format.service: start operation timed out. Terminating.
md[1]: openqa_nvme_format.service: Main process exited, code=killed, status=15/>
md[1]: openqa_nvme_format.service: State 'final-sigterm' timed out. Killing.
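If the mkfs were merely slow rather than hung, one mitigation would be raising the unit's start timeout via a drop-in (a sketch; 15min is an arbitrary value, and the lockup further down suggests the real problem is not slowness):

```shell
# Sketch: give openqa_nvme_format.service more time before systemd terminates it
mkdir -p /etc/systemd/system/openqa_nvme_format.service.d
cat > /etc/systemd/system/openqa_nvme_format.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=15min
EOF
systemctl daemon-reload
```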
Indeed, it looks like the file system isn't there. Invoking
mdadm --detail --scan | grep openqa
manually now gets stuck, and we get some logging about a workqueue lockup:
openqaworker-arm-1:~ # mdadm --detail --scan | grep openqa
[ 1706.267001][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 34s!
[ 1706.276024][ C0] Showing busy workqueues and worker pools:
[ 1706.281836][ C0] workqueue events: flags=0x0
[ 1706.286421][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.286432][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1706.301198][ C0] workqueue events_power_efficient: flags=0x80
[ 1706.307284][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.307296][ C0] pending: fb_flashcursor
[ 1706.319505][ C0] workqueue kblockd: flags=0x18
[ 1706.324265][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1706.324275][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1706.324284][ C0] pending: blk_mq_timeout_work
[ 1706.342961][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 136
[ 1736.927007][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 64s!
[ 1736.936024][ C0] Showing busy workqueues and worker pools:
[ 1736.941835][ C0] workqueue events: flags=0x0
[ 1736.946428][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[ 1736.946440][ C0] pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper]
[ 1736.962626][ C0] workqueue events_power_efficient: flags=0x80
[ 1736.968703][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1736.968712][ C0] pending: fb_flashcursor
[ 1736.980875][ C0] workqueue kblockd: flags=0x18
[ 1736.987060][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1736.987070][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1736.987077][ C0] pending: blk_mq_timeout_work
[ 1737.008504][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=64s workers=2 idle: 136
[ 1767.797002][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 95s!
[ 1767.806957][ C0] Showing busy workqueues and worker pools:
[ 1767.813713][ C0] workqueue events: flags=0x0
[ 1767.819240][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.819252][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1767.835824][ C0] workqueue events_power_efficient: flags=0x80
[ 1767.842848][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.842858][ C0] pending: fb_flashcursor
[ 1767.856904][ C0] workqueue kblockd: flags=0x18
[ 1767.862590][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1767.862599][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1767.862606][ C0] pending: blk_mq_timeout_work
[ 1767.883999][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=95s workers=2 idle: 136
[ 1798.457002][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 126s!
[ 1798.467028][ C0] Showing busy workqueues and worker pools:
[ 1798.473794][ C0] workqueue events: flags=0x0
[ 1798.479334][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.479345][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1798.495919][ C0] workqueue events_power_efficient: flags=0x80
[ 1798.502928][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.502938][ C0] pending: fb_flashcursor
[ 1798.516949][ C0] workqueue kblockd: flags=0x18
[ 1798.522595][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1798.522605][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1798.522611][ C0] pending: blk_mq_timeout_work
[ 1798.543882][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=126s workers=2 idle: 136
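When a blocked process like this mdadm invocation cannot be killed, the kernel's SysRq interface can at least dump the stacks of tasks stuck in uninterruptible sleep (a standard kernel facility; requires sysrq to be enabled):

```shell
# Enable all SysRq functions, then dump tasks in uninterruptible (blocked) state:
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger

# The resulting stack traces land in the kernel log / on the serial console:
dmesg | tail -n 50
```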
I've seen the same kind of logging when just pressing Ctrl-D on the rescue shell prompt, so that's very likely the source of the issue. Now the question is why this lockup happens.
Updated by mkittler over 2 years ago
It could be a kernel bug. At least it boots fine on the older kernel 5.3.18-150300.59.71-default (instead of 5.14.21-150400.24.18-default). On the other hand, it is strange that the other arm workers don't show this problem as they're running fine on the (same) new version.
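For the record, booting the older kernel once without changing the default selection can be done with SUSE's grub2-once helper (a sketch, assuming the helper is installed; the entry number depends on the local boot menu):

```shell
# List the boot menu entries and their numbers:
grub2-once --list

# Boot entry N only for the next boot, leaving the default untouched:
grub2-once <N> && systemctl reboot
```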
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
Now the worker booted fine on the new kernel version as well. The only thing I did was an update of the system (which did not include any kernel packages). It doesn't look like a problem with our NVMe setup script but rather like a sporadic kernel bug or even a hardware issue (considering the hardware is unstable anyway). So I'll just keep the system running for now and see what happens.
Updated by livdywan over 2 years ago
- Subject changed from Recover openqaworker-arm-1 to Recover openqaworker-arm-1 size:M
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
Looks like the machine has been working as usual so far. There are only two small gaps visible in Grafana, but those are expected.