action #116060


Recover openqaworker-arm-1 size:M

Added by mkittler over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2022-08-31
Due date: -
% Done: 0%
Estimated time: -

Description

openqaworker-arm-1 has been down for almost 10 days: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=now-10d&to=now

As far as I can see, the alert has been left unhandled so far, so I'm creating a ticket now (hopefully not a duplicate).

The worker is reachable via IPMI but fails on the NVMe setup:

[   51.594064][  T766] nicvf 0002:01:00.2: enabling device (0004 -> 0006)
[   51.624678][  T766] nicvf 0002:01:00.3: enabling device (0004 -> 0006)
[   51.655248][  T766] nicvf 0002:01:00.4: enabling device (0004 -> 0006)
[   51.680192][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/mapper/system-root changed to /dev/dm-1 scanned by systemd-udevd (1097)
[   51.685843][  T766] nicvf 0002:01:00.5: enabling device (0004 -> 0006)
[   51.697377][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/dm-1 changed to /dev/mapper/system-root scanned by systemd-udevd (1097)
[   51.725401][  T766] nicvf 0002:01:00.6: enabling device (0004 -> 0006)
[   51.755433][  T766] nicvf 0002:01:00.7: enabling device (0004 -> 0006)
[   51.785576][  T766] nicvf 0002:01:01.0: enabling device (0004 -> 0006)
[   51.815515][  T766] nicvf 0002:01:01.1: enabling device (0004 -> 0006)
[   51.845446][  T766] nicvf 0002:01:01.2: enabling device (0004 -> 0006)
[   51.875885][  T766] nicvf 0002:01:01.3: enabling device (0004 -> 0006)
[FAILED] Failed to start Setup NVMe before mounting it.
See 'systemctl status openqa_nvme_format.service' for details.
[DEPEND] Dependency failed for /var/lib/openqa.ling device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #6.ing device (0004 -> 0006)
[DEPEND] Dependency failed for Prepare NVMe after mounting it.04 -> 0006)
[DEPEND] Dependency failed for openQA Worker #8.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #3.
[DEPEND] Dependency failed for openQA Worker #9.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #5.ing device (0004 -> 0006)
[DEPEND] Dependency failed for Local File Systems.g device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #2.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #4.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #1.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #10.ng device (0004 -> 0006)
[DEPEND] Dependency failed for var-lib-openqa-share.automount.04 -> 0006)
[DEPEND] Dependency failed for openQA Worker #7.ing device (0004 -> 0006)
[  OK  ] Reached target Timer Units.:03.2: enabling device (0004 -> 0006)
         Starting Restore /run/initramfs on shutdown...ice (0004 -> 0006)
[  OK  ] Reached target Host and Network Name Lookups.vice (0004 -> 0006)
[  OK  ] Reached target System Time Synchronized.
[  OK  ] Reached target Login Prompts. on device 8:2...
[  OK  ] Reached target User and Group Name Lookups.device (0004 -> 0006)
[  OK  ] Reached target Remote File Systems.nabling device (0004 -> 0006)
[  OK  ] Reached target Preparation for Network. 8:2.
[  OK  ] Reached target Network.161HJ primary.
[  OK  ] Reached target Network is Online.
[  OK  ] Reached target Path Units.): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[  OK  ] Reached target Socket Units.
[  OK  ] Started Emergency Shell.o Complete Device Initialization.
[  OK  ] Reached target Emergency Mode.up NV…before mounting it (44s / 5min 1s)
         Starting Tell Plymouth To Write Out Runtime Data...abort: 96000210 [#1] PREEMPT SMP
         Starting Create Volatile Files and Directories...s_cp437 vfat fat joydev mdio_thunder nicvf cavium_ptp cavium_rng_vf mdio_cavium nicpf thunder_bgx thunderx_edac thunder_xcv cavium_rng ipmi_ssif uio_pdrv_genirq ipmi_devintf efi_[  OK  ] Finished Restore /run/initramfs on shutdown.bles x_tables hid_generic usbhid ast i2c_algo_bit drm_vram_helper drm_kms_helper sd_mod xhci_pci syscopyarea sysfillrect xhci_pci_renesas sysimgblt fb_sys_fops xhci_hcd cec crct10dif_[  OK  ] Finished Tell Plymouth To Write Out Runtime Data.sha2_ce libahci thunderx_mmc usbcore nvme_core drm sha256_arm64 sha1_ce gpio_keys t10_pi libata usb_common mmc_core i2c_thunderx btrfs blake2b_generic libcrc32c xor xor_neon raid[  OK  ] Finished Create Volatile Files and Directories.dge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs aes_ce_blk crypto_simd cryptd aes_ce_cipher
         Starting Record System Boot/Shutdown in UTMP...s are loaded
[  OK  ] Finished Record System Boot/Shutdown in UTMP./24:1H Tainted: G                  N 5.14.21-150400.24.18-default #1 SLE15-SP4 6e36986a976ce35b699a98bb4b0ce7312b4d04e2
         Starting Record Runlevel Change in UTMP...rX CRB/MT30-GS1, BIOS T19 09/29/2016
[  OK  ] Finished Record Runlevel Change in UTMP.timeout_work
You are in emergency mode. After logging in, type "journalctl -xb" to view)
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.: blk_mq_check_expired+0x94/0xc0
Give root password for maintenance00171fbbe0
(or press Control-D to continue): 00171fbbe0 x28: 0000000000000000 x27: ffff000102661120
[   91.702987][ T1345] x26: 000000000000001c x25: ffff00010d079938 x24: ffff0001162c3928
[   91.711763][ T1345] x23: ffff0001162c3800 x22: ffff000102730600 x21: ffff80001158d000
[   91.720551][ T1345] x20: ffff0001060b4000 x19: ffff800014bee01c x18: 0000000000000000
[   91.729353][ T1345] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   91.738166][ T1345] x14: 0000000000000000 x13: 0000000000000038 x12: 0101010101010101
[   91.746975][ T1345] x11: 7f7f7f7f7f7f7f7f x10: fefefefefefefeff x9 : ffff8000107e28d4
[   91.755784][ T1345] x8 : 8080808080808080 x7 : ffffffffffffffff x6 : 0000000000000001
[   91.764611][ T1345] x5 : 00000000ffffad6b x4 : 0000000000000015 x3 : 0000000000000000
[   91.773451][ T1345] x2 : ffff800009a287c0 x1 : 0000000000000000 x0 : ffff80001158d628
[   91.782304][ T1345] Call trace:
[   91.786430][ T1345]  nvme_timeout+0x5c/0x380 [nvme 94e89dee374936950a3d9b7ed2710d061e1acef2]
[   91.795933][ T1345]  blk_mq_check_expired+0x94/0xc0
[   91.801848][ T1345]  bt_iter+0xa0/0xc0
[   91.806626][ T1345]  blk_mq_queue_tag_busy_iter+0x1f8/0x380
[   91.813255][ T1345]  blk_mq_timeout_work+0x11c/0x1c0
[    
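
For reference, the console output above was captured via IPMI serial-over-LAN. A minimal sketch of how to attach to the console; the BMC hostname and credentials are placeholders:

ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate
# power-cycle the machine if it is completely wedged
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> power cycle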

DONE: Once that is fixed, the worker's Salt setup should also be verified; it could not be recovered earlier because the worker was offline when I took care of that.

Actions #1

Updated by mkittler over 1 year ago

The setup comes quite far and even attempts to create the ext2 filesystem. However, it then gets stuck and systemd eventually kills the command:

a-establish-nvme-setup[1036]: mdadm: Defaulting to version 1.2 metadata
a-establish-nvme-setup[1036]: mdadm: array /dev/md/openqa started.
a-establish-nvme-setup[1023]: Status for RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1183]: md127 : active raid0 nvme0n1[0]
a-establish-nvme-setup[1185]: ARRAY /dev/md/openqa metadata=1.2 name=openqawork>
a-establish-nvme-setup[1023]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1186]: mke2fs 1.46.4 (18-Aug-2021)
a-establish-nvme-setup[1186]: [220B blob data]
a-establish-nvme-setup[1186]: Creating filesystem with 195319936 4k blocks and >
a-establish-nvme-setup[1186]: Filesystem UUID: 3828224d-39fb-457e-a242-3a93f67d>
a-establish-nvme-setup[1186]: Superblock backups stored on blocks:
a-establish-nvme-setup[1186]:         32768, 98304, 163840, 229376, 294912, 819>
a-establish-nvme-setup[1186]:         4096000, 7962624, 11239424, 20480000, 238>
a-establish-nvme-setup[1186]:         102400000
a-establish-nvme-setup[1186]: [65B blob data]
md[1]: openqa_nvme_format.service: start operation timed out. Terminating.
md[1]: openqa_nvme_format.service: Main process exited, code=killed, status=15/>
md[1]: openqa_nvme_format.service: State 'final-sigterm' timed out. Killing.
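
The "start operation timed out" message means openqa_nvme_format.service exceeded systemd's start timeout (the boot console above shows the job limit as "44s / 5min 1s"). If the formatting were merely slow rather than hung, the timeout could be raised with a drop-in; a sketch only, not the fix applied here:

mkdir -p /etc/systemd/system/openqa_nvme_format.service.d
cat > /etc/systemd/system/openqa_nvme_format.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=15min
EOF
systemctl daemon-reload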

Indeed, it looks like the file system isn't there.
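
For reference, the equivalent steps can be reproduced by hand; a rough sketch based on the log above, assuming the single NVMe device /dev/nvme0n1 and the mount point /var/lib/openqa:

mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force /dev/nvme0n1
mkfs.ext2 /dev/md/openqa
mount /dev/md/openqa /var/lib/openqa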

Manually invoking mdadm --detail --scan | grep openqa now gets stuck as well, and we get some kernel logging about a workqueue lockup:

openqaworker-arm-1:~ # mdadm --detail --scan | grep openqa
[ 1706.267001][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 34s!
[ 1706.276024][    C0] Showing busy workqueues and worker pools:
[ 1706.281836][    C0] workqueue events: flags=0x0
[ 1706.286421][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.286432][    C0]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1706.301198][    C0] workqueue events_power_efficient: flags=0x80
[ 1706.307284][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.307296][    C0]     pending: fb_flashcursor
[ 1706.319505][    C0] workqueue kblockd: flags=0x18
[ 1706.324265][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1706.324275][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1706.324284][    C0]     pending: blk_mq_timeout_work
[ 1706.342961][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 136

[ 1736.927007][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 64s!
[ 1736.936024][    C0] Showing busy workqueues and worker pools:
[ 1736.941835][    C0] workqueue events: flags=0x0
[ 1736.946428][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[ 1736.946440][    C0]     pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper]
[ 1736.962626][    C0] workqueue events_power_efficient: flags=0x80
[ 1736.968703][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1736.968712][    C0]     pending: fb_flashcursor
[ 1736.980875][    C0] workqueue kblockd: flags=0x18
[ 1736.987060][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1736.987070][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1736.987077][    C0]     pending: blk_mq_timeout_work
[ 1737.008504][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=64s workers=2 idle: 136
[ 1767.797002][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 95s!
[ 1767.806957][    C0] Showing busy workqueues and worker pools:
[ 1767.813713][    C0] workqueue events: flags=0x0
[ 1767.819240][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.819252][    C0]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1767.835824][    C0] workqueue events_power_efficient: flags=0x80
[ 1767.842848][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.842858][    C0]     pending: fb_flashcursor
[ 1767.856904][    C0] workqueue kblockd: flags=0x18
[ 1767.862590][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1767.862599][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1767.862606][    C0]     pending: blk_mq_timeout_work
[ 1767.883999][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=95s workers=2 idle: 136
[ 1798.457002][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 126s!
[ 1798.467028][    C0] Showing busy workqueues and worker pools:
[ 1798.473794][    C0] workqueue events: flags=0x0
[ 1798.479334][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.479345][    C0]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1798.495919][    C0] workqueue events_power_efficient: flags=0x80
[ 1798.502928][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.502938][    C0]     pending: fb_flashcursor
[ 1798.516949][    C0] workqueue kblockd: flags=0x18
[ 1798.522595][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1798.522605][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1798.522611][    C0]     pending: blk_mq_timeout_work
[ 1798.543882][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=126s workers=2 idle: 136

I've seen the same kind of logging when simply pressing Ctrl-D at the rescue shell prompt, so that's very likely the source of the issue. The question now is why this lockup happens.
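
Should it happen again, the hung tasks could be dumped for more context via sysrq, assuming the console is still responsive:

echo w > /proc/sysrq-trigger   # log blocked (uninterruptible) tasks to the kernel ring buffer
echo l > /proc/sysrq-trigger   # log backtraces of all active CPUs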

Actions #2

Updated by mkittler over 1 year ago

It could be a kernel bug. At least the machine boots fine with the older kernel 5.3.18-150300.59.71-default (instead of 5.14.21-150400.24.18-default). On the other hand, it is strange that the other ARM workers don't show this problem; they're running fine on the same new version.
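
For reference, listing the installed kernels and booting an older one for a single reboot could look like this; the menu entry string is a placeholder, and grub2-reboot assumes GRUB_DEFAULT=saved:

rpm -qa 'kernel-default*'
grub2-reboot '<menu entry of the 5.3.18 kernel>'
reboot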

Actions #3

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

Now the worker booted fine on the new kernel version as well. The only thing I did was update the system (which did not include any kernel packages). So it doesn't look like a problem with our NVMe setup script but rather like a sporadic kernel bug or even a hardware issue (considering the hardware is unstable anyway). I'll just keep the system running for now and see what happens.

Actions #4

Updated by mkittler over 1 year ago

  • Description updated (diff)
Actions #5

Updated by livdywan over 1 year ago

  • Subject changed from Recover openqaworker-arm-1 to Recover openqaworker-arm-1 size:M
Actions #6

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

Looks like the machine has been working as usual so far. There are only two small gaps visible in Grafana, but those are expected.
