action #116060


Recover openqaworker-arm-1 size:M

Added by mkittler over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2022-08-31
Due date: -
% Done: 0%
Estimated time: -

Description

openqaworker-arm-1 has been down for almost 10 days: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?orgId=1&from=now-10d&to=now

As far as I can see, the alert has been left unhandled so far, so I'm creating a ticket now (hopefully not a duplicate).

The worker is reachable via IPMI but fails on the NVMe setup:

[   51.594064][  T766] nicvf 0002:01:00.2: enabling device (0004 -> 0006)
[   51.624678][  T766] nicvf 0002:01:00.3: enabling device (0004 -> 0006)
[   51.655248][  T766] nicvf 0002:01:00.4: enabling device (0004 -> 0006)
[   51.680192][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/mapper/system-root changed to /dev/dm-1 scanned by systemd-udevd (1097)
[   51.685843][  T766] nicvf 0002:01:00.5: enabling device (0004 -> 0006)
[   51.697377][ T1097] BTRFS info (device dm-1): devid 1 device path /dev/dm-1 changed to /dev/mapper/system-root scanned by systemd-udevd (1097)
[   51.725401][  T766] nicvf 0002:01:00.6: enabling device (0004 -> 0006)
[   51.755433][  T766] nicvf 0002:01:00.7: enabling device (0004 -> 0006)
[   51.785576][  T766] nicvf 0002:01:01.0: enabling device (0004 -> 0006)
[   51.815515][  T766] nicvf 0002:01:01.1: enabling device (0004 -> 0006)
[   51.845446][  T766] nicvf 0002:01:01.2: enabling device (0004 -> 0006)
[   51.875885][  T766] nicvf 0002:01:01.3: enabling device (0004 -> 0006)
[FAILED] Failed to start Setup NVMe before mounting it.
See 'systemctl status openqa_nvme_format.service' for details.
[DEPEND] Dependency failed for /var/lib/openqa.ling device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #6.ing device (0004 -> 0006)
[DEPEND] Dependency failed for Prepare NVMe after mounting it.04 -> 0006)
[DEPEND] Dependency failed for openQA Worker #8.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #3.
[DEPEND] Dependency failed for openQA Worker #9.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #5.ing device (0004 -> 0006)
[DEPEND] Dependency failed for Local File Systems.g device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #2.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #4.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #1.ing device (0004 -> 0006)
[DEPEND] Dependency failed for openQA Worker #10.ng device (0004 -> 0006)
[DEPEND] Dependency failed for var-lib-openqa-share.automount.04 -> 0006)
[DEPEND] Dependency failed for openQA Worker #7.ing device (0004 -> 0006)
[  OK  ] Reached target Timer Units.:03.2: enabling device (0004 -> 0006)
         Starting Restore /run/initramfs on shutdown...ice (0004 -> 0006)
[  OK  ] Reached target Host and Network Name Lookups.vice (0004 -> 0006)
[  OK  ] Reached target System Time Synchronized.
[  OK  ] Reached target Login Prompts. on device 8:2...
[  OK  ] Reached target User and Group Name Lookups.device (0004 -> 0006)
[  OK  ] Reached target Remote File Systems.nabling device (0004 -> 0006)
[  OK  ] Reached target Preparation for Network. 8:2.
[  OK  ] Reached target Network.161HJ primary.
[  OK  ] Reached target Network is Online.
[  OK  ] Reached target Path Units.): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[  OK  ] Reached target Socket Units.
[  OK  ] Started Emergency Shell.o Complete Device Initialization.
[  OK  ] Reached target Emergency Mode.up NV…before mounting it (44s / 5min 1s)
         Starting Tell Plymouth To Write Out Runtime Data...abort: 96000210 [#1] PREEMPT SMP
         Starting Create Volatile Files and Directories...s_cp437 vfat fat joydev mdio_thunder nicvf cavium_ptp cavium_rng_vf mdio_cavium nicpf thunder_bgx thunderx_edac thunder_xcv cavium_rng ipmi_ssif uio_pdrv_genirq ipmi_devintf efi_[  OK  ] Finished Restore /run/initramfs on shutdown.bles x_tables hid_generic usbhid ast i2c_algo_bit drm_vram_helper drm_kms_helper sd_mod xhci_pci syscopyarea sysfillrect xhci_pci_renesas sysimgblt fb_sys_fops xhci_hcd cec crct10dif_[  OK  ] Finished Tell Plymouth To Write Out Runtime Data.sha2_ce libahci thunderx_mmc usbcore nvme_core drm sha256_arm64 sha1_ce gpio_keys t10_pi libata usb_common mmc_core i2c_thunderx btrfs blake2b_generic libcrc32c xor xor_neon raid[  OK  ] Finished Create Volatile Files and Directories.dge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs aes_ce_blk crypto_simd cryptd aes_ce_cipher
         Starting Record System Boot/Shutdown in UTMP...s are loaded
[  OK  ] Finished Record System Boot/Shutdown in UTMP./24:1H Tainted: G                  N 5.14.21-150400.24.18-default #1 SLE15-SP4 6e36986a976ce35b699a98bb4b0ce7312b4d04e2
         Starting Record Runlevel Change in UTMP...rX CRB/MT30-GS1, BIOS T19 09/29/2016
[  OK  ] Finished Record Runlevel Change in UTMP.timeout_work
You are in emergency mode. After logging in, type "journalctl -xb" to view)
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.: blk_mq_check_expired+0x94/0xc0
Give root password for maintenance00171fbbe0
(or press Control-D to continue): 00171fbbe0 x28: 0000000000000000 x27: ffff000102661120
[   91.702987][ T1345] x26: 000000000000001c x25: ffff00010d079938 x24: ffff0001162c3928
[   91.711763][ T1345] x23: ffff0001162c3800 x22: ffff000102730600 x21: ffff80001158d000
[   91.720551][ T1345] x20: ffff0001060b4000 x19: ffff800014bee01c x18: 0000000000000000
[   91.729353][ T1345] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   91.738166][ T1345] x14: 0000000000000000 x13: 0000000000000038 x12: 0101010101010101
[   91.746975][ T1345] x11: 7f7f7f7f7f7f7f7f x10: fefefefefefefeff x9 : ffff8000107e28d4
[   91.755784][ T1345] x8 : 8080808080808080 x7 : ffffffffffffffff x6 : 0000000000000001
[   91.764611][ T1345] x5 : 00000000ffffad6b x4 : 0000000000000015 x3 : 0000000000000000
[   91.773451][ T1345] x2 : ffff800009a287c0 x1 : 0000000000000000 x0 : ffff80001158d628
[   91.782304][ T1345] Call trace:
[   91.786430][ T1345]  nvme_timeout+0x5c/0x380 [nvme 94e89dee374936950a3d9b7ed2710d061e1acef2]
[   91.795933][ T1345]  blk_mq_check_expired+0x94/0xc0
[   91.801848][ T1345]  bt_iter+0xa0/0xc0
[   91.806626][ T1345]  blk_mq_queue_tag_busy_iter+0x1f8/0x380
[   91.813255][ T1345]  blk_mq_timeout_work+0x11c/0x1c0
[    
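
For reference, the console output above was captured via IPMI serial-over-LAN. A minimal sketch of how to attach to the console; the BMC hostname and credentials are placeholders:

ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate
# power-cycle the machine if it is completely wedged
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> power cycle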

DONE: Once that is fixed, the worker's Salt setup should also be verified; it could not be recovered earlier because the worker was offline when I took care of that.

Actions #1

Updated by mkittler over 1 year ago

The setup comes quite far and even attempts to create the ext2 filesystem. However, it then gets stuck and systemd eventually kills the command:

a-establish-nvme-setup[1036]: mdadm: Defaulting to version 1.2 metadata
a-establish-nvme-setup[1036]: mdadm: array /dev/md/openqa started.
a-establish-nvme-setup[1023]: Status for RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1183]: md127 : active raid0 nvme0n1[0]
a-establish-nvme-setup[1185]: ARRAY /dev/md/openqa metadata=1.2 name=openqawork>
a-establish-nvme-setup[1023]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1186]: mke2fs 1.46.4 (18-Aug-2021)
a-establish-nvme-setup[1186]: [220B blob data]
a-establish-nvme-setup[1186]: Creating filesystem with 195319936 4k blocks and >
a-establish-nvme-setup[1186]: Filesystem UUID: 3828224d-39fb-457e-a242-3a93f67d>
a-establish-nvme-setup[1186]: Superblock backups stored on blocks:
a-establish-nvme-setup[1186]:         32768, 98304, 163840, 229376, 294912, 819>
a-establish-nvme-setup[1186]:         4096000, 7962624, 11239424, 20480000, 238>
a-establish-nvme-setup[1186]:         102400000
a-establish-nvme-setup[1186]: [65B blob data]
md[1]: openqa_nvme_format.service: start operation timed out. Terminating.
md[1]: openqa_nvme_format.service: Main process exited, code=killed, status=15/>
md[1]: openqa_nvme_format.service: State 'final-sigterm' timed out. Killing.
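
The "start operation timed out" message means openqa_nvme_format.service exceeded systemd's start timeout (the boot console above shows the job limit as "44s / 5min 1s"). If the formatting were merely slow rather than hung, the timeout could be raised with a drop-in; a sketch only, not the fix applied here:

mkdir -p /etc/systemd/system/openqa_nvme_format.service.d
cat > /etc/systemd/system/openqa_nvme_format.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=15min
EOF
systemctl daemon-reload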

Indeed, it looks like the file system isn't there.
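
For reference, the equivalent steps can be reproduced by hand; a rough sketch based on the log above, assuming the single NVMe device /dev/nvme0n1 and the mount point /var/lib/openqa:

mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force /dev/nvme0n1
mkfs.ext2 /dev/md/openqa
mount /dev/md/openqa /var/lib/openqa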

Manually invoking mdadm --detail --scan | grep openqa now gets stuck as well, and we get some kernel logging about a workqueue lockup:

openqaworker-arm-1:~ # mdadm --detail --scan | grep openqa
[ 1706.267001][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 34s!
[ 1706.276024][    C0] Showing busy workqueues and worker pools:
[ 1706.281836][    C0] workqueue events: flags=0x0
[ 1706.286421][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.286432][    C0]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1706.301198][    C0] workqueue events_power_efficient: flags=0x80
[ 1706.307284][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.307296][    C0]     pending: fb_flashcursor
[ 1706.319505][    C0] workqueue kblockd: flags=0x18
[ 1706.324265][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1706.324275][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1706.324284][    C0]     pending: blk_mq_timeout_work
[ 1706.342961][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 136

[ 1736.927007][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 64s!
[ 1736.936024][    C0] Showing busy workqueues and worker pools:
[ 1736.941835][    C0] workqueue events: flags=0x0
[ 1736.946428][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[ 1736.946440][    C0]     pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper]
[ 1736.962626][    C0] workqueue events_power_efficient: flags=0x80
[ 1736.968703][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1736.968712][    C0]     pending: fb_flashcursor
[ 1736.980875][    C0] workqueue kblockd: flags=0x18
[ 1736.987060][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1736.987070][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1736.987077][    C0]     pending: blk_mq_timeout_work
[ 1737.008504][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=64s workers=2 idle: 136
[ 1767.797002][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 95s!
[ 1767.806957][    C0] Showing busy workqueues and worker pools:
[ 1767.813713][    C0] workqueue events: flags=0x0
[ 1767.819240][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.819252][    C0]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1767.835824][    C0] workqueue events_power_efficient: flags=0x80
[ 1767.842848][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.842858][    C0]     pending: fb_flashcursor
[ 1767.856904][    C0] workqueue kblockd: flags=0x18
[ 1767.862590][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1767.862599][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1767.862606][    C0]     pending: blk_mq_timeout_work
[ 1767.883999][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=95s workers=2 idle: 136
[ 1798.457002][    C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 126s!
[ 1798.467028][    C0] Showing busy workqueues and worker pools:
[ 1798.473794][    C0] workqueue events: flags=0x0
[ 1798.479334][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.479345][    C0]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1798.495919][    C0] workqueue events_power_efficient: flags=0x80
[ 1798.502928][    C0]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.502938][    C0]     pending: fb_flashcursor
[ 1798.516949][    C0] workqueue kblockd: flags=0x18
[ 1798.522595][    C0]   pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1798.522605][    C0]     in-flight: 1345:blk_mq_timeout_work
[ 1798.522611][    C0]     pending: blk_mq_timeout_work
[ 1798.543882][    C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=126s workers=2 idle: 136

I've seen the same kind of logging when simply pressing Ctrl-D at the rescue shell prompt, so that's very likely the source of the issue. The question now is why this lockup happens.
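
Should it happen again, the hung tasks could be dumped for more context via sysrq, assuming the console is still responsive:

echo w > /proc/sysrq-trigger   # log blocked (uninterruptible) tasks to the kernel ring buffer
echo l > /proc/sysrq-trigger   # log backtraces of all active CPUs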

Actions #2

Updated by mkittler over 1 year ago

It could be a kernel bug. At least the machine boots fine with the older kernel 5.3.18-150300.59.71-default (instead of 5.14.21-150400.24.18-default). On the other hand, it is strange that the other ARM workers don't show this problem; they're running fine on the same new version.
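
For reference, listing the installed kernels and booting an older one for a single reboot could look like this; the menu entry string is a placeholder, and grub2-reboot assumes GRUB_DEFAULT=saved:

rpm -qa 'kernel-default*'
grub2-reboot '<menu entry of the 5.3.18 kernel>'
reboot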

Actions #3

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

Now the worker booted fine on the new kernel version as well. The only thing I did was update the system (which did not include any kernel packages). So it doesn't look like a problem with our NVMe setup script but rather like a sporadic kernel bug or even a hardware issue (considering the hardware is unstable anyway). I'll just keep the system running for now and see what happens.

Actions #4

Updated by mkittler over 1 year ago

  • Description updated (diff)
Actions #5

Updated by livdywan over 1 year ago

  • Subject changed from Recover openqaworker-arm-1 to Recover openqaworker-arm-1 size:M
Actions #6

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

Looks like the machine has been working as usual so far. There are only two small gaps visible in Grafana, but those are expected.
