https://progress.opensuse.org/
2022-08-31T09:14:24Z
openSUSE Project Management Tool
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549791
2022-08-31T09:14:24Z
mkittler
marius.kittler@suse.com
<p>It gets quite far and even attempts to create the ext2 filesystem. However, it then gets stuck and systemd eventually kills the command:</p>
<pre><code>a-establish-nvme-setup[1036]: mdadm: Defaulting to version 1.2 metadata
a-establish-nvme-setup[1036]: mdadm: array /dev/md/openqa started.
a-establish-nvme-setup[1023]: Status for RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1183]: md127 : active raid0 nvme0n1[0]
a-establish-nvme-setup[1185]: ARRAY /dev/md/openqa metadata=1.2 name=openqawork>
a-establish-nvme-setup[1023]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1186]: mke2fs 1.46.4 (18-Aug-2021)
a-establish-nvme-setup[1186]: [220B blob data]
a-establish-nvme-setup[1186]: Creating filesystem with 195319936 4k blocks and >
a-establish-nvme-setup[1186]: Filesystem UUID: 3828224d-39fb-457e-a242-3a93f67d>
a-establish-nvme-setup[1186]: Superblock backups stored on blocks:
a-establish-nvme-setup[1186]: 32768, 98304, 163840, 229376, 294912, 819>
a-establish-nvme-setup[1186]: 4096000, 7962624, 11239424, 20480000, 238>
a-establish-nvme-setup[1186]: 102400000
a-establish-nvme-setup[1186]: [65B blob data]
md[1]: openqa_nvme_format.service: start operation timed out. Terminating.
md[1]: openqa_nvme_format.service: Main process exited, code=killed, status=15/>
md[1]: openqa_nvme_format.service: State 'final-sigterm' timed out. Killing.
</code></pre>
<p>Indeed, it looks like the filesystem isn't there.</p>
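<p>The <code>start operation timed out. Terminating.</code> messages above indicate the unit hit systemd's start timeout while <code>mke2fs</code> was still blocked. As a diagnostic aid one could raise the timeout via a drop-in; this is only a sketch (the unit name is taken from the log above, the values are arbitrary):</p>
<pre><code># /etc/systemd/system/openqa_nvme_format.service.d/timeout.conf
[Service]
TimeoutStartSec=20min
TimeoutStopSec=5min
</code></pre>
<p>With a longer timeout one could at least distinguish a merely slow format from a genuinely hung one.</p>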
<p>Manually invoking <code>mdadm --detail --scan | grep openqa</code> now gets stuck as well, and the kernel logs a workqueue lockup:</p>
<pre><code>openqaworker-arm-1:~ # mdadm --detail --scan | grep openqa
[ 1706.267001][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 34s!
[ 1706.276024][ C0] Showing busy workqueues and worker pools:
[ 1706.281836][ C0] workqueue events: flags=0x0
[ 1706.286421][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.286432][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1706.301198][ C0] workqueue events_power_efficient: flags=0x80
[ 1706.307284][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.307296][ C0] pending: fb_flashcursor
[ 1706.319505][ C0] workqueue kblockd: flags=0x18
[ 1706.324265][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1706.324275][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1706.324284][ C0] pending: blk_mq_timeout_work
[ 1706.342961][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 136
[ 1736.927007][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 64s!
[ 1736.936024][ C0] Showing busy workqueues and worker pools:
[ 1736.941835][ C0] workqueue events: flags=0x0
[ 1736.946428][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[ 1736.946440][ C0] pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper]
[ 1736.962626][ C0] workqueue events_power_efficient: flags=0x80
[ 1736.968703][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1736.968712][ C0] pending: fb_flashcursor
[ 1736.980875][ C0] workqueue kblockd: flags=0x18
[ 1736.987060][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1736.987070][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1736.987077][ C0] pending: blk_mq_timeout_work
[ 1737.008504][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=64s workers=2 idle: 136
[ 1767.797002][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 95s!
[ 1767.806957][ C0] Showing busy workqueues and worker pools:
[ 1767.813713][ C0] workqueue events: flags=0x0
[ 1767.819240][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.819252][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1767.835824][ C0] workqueue events_power_efficient: flags=0x80
[ 1767.842848][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.842858][ C0] pending: fb_flashcursor
[ 1767.856904][ C0] workqueue kblockd: flags=0x18
[ 1767.862590][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1767.862599][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1767.862606][ C0] pending: blk_mq_timeout_work
[ 1767.883999][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=95s workers=2 idle: 136
[ 1798.457002][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 126s!
[ 1798.467028][ C0] Showing busy workqueues and worker pools:
[ 1798.473794][ C0] workqueue events: flags=0x0
[ 1798.479334][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.479345][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1798.495919][ C0] workqueue events_power_efficient: flags=0x80
[ 1798.502928][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.502938][ C0] pending: fb_flashcursor
[ 1798.516949][ C0] workqueue kblockd: flags=0x18
[ 1798.522595][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1798.522605][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1798.522611][ C0] pending: blk_mq_timeout_work
[ 1798.543882][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=126s workers=2 idle: 136
</code></pre>
<p>I've seen the same kind of logging when just pressing Ctrl-D at the rescue shell prompt, so that's very likely the source of the issue. The question now is why this lockup happens.</p>
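<p>To see at a glance which CPU pool the lockup messages refer to, the relevant fields can be pulled out of the kernel log. A minimal sketch, run here against a pasted excerpt (on the machine itself one would pipe <code>journalctl -k</code> into the same filter):</p>

```shell
# Extract the stuck pool's CPUs from the workqueue-lockup lines.
# The excerpt is inlined for illustration; it is copied from the log above.
log='[ 1706.267001][   C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 34s!'
printf '%s\n' "$log" | grep -o 'cpus=[0-9]*' | sort -u
```

<p>For the log above this prints <code>cpus=24</code>, matching the repeated <code>pool 49: cpus=24</code> lines, i.e. it is always the same pool that is stuck.</p>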
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549824
2022-08-31T10:21:49Z
mkittler
marius.kittler@suse.com
<p>It could be a kernel bug. At least the machine boots fine with the older kernel 5.3.18-150300.59.71-default (instead of 5.14.21-150400.24.18-default). On the other hand, it is strange that the other ARM workers don't show this problem, as they run fine on the same new version.</p>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549908
2022-08-31T13:03:12Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>Now the worker booted fine on the new kernel version as well. The only thing I did was update the system (which did not include any kernel packages). This doesn't look like a problem with our NVMe setup script but rather like a sporadic kernel bug or even a hardware issue (considering the hardware is unstable anyway). So I'll just keep the system running for now and see what happens.</p>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549917
2022-08-31T13:41:43Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/549917/diff?detail_id=519242">diff</a>)</li></ul>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=551947
2022-09-08T09:38:55Z
livdywan
liv.dywan@suse.com
<ul><li><strong>Subject</strong> changed from <i>Recover openqaworker-arm-1</i> to <i>Recover openqaworker-arm-1 size:M</i></li></ul>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=552025
2022-09-08T13:16:42Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>Looks like the machine has been working as usual so far. There are only two small gaps visible in Grafana, but those are expected.</p>