https://progress.opensuse.org/
2022-08-31T09:14:24Z
openSUSE Project Management Tool
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549791
2022-08-31T09:14:24Z
mkittler
marius.kittler@suse.com
<p>It gets quite far and even attempts to create the ext2 filesystem. However, it then gets stuck and systemd eventually kills the command:</p>
<pre><code>a-establish-nvme-setup[1036]: mdadm: Defaulting to version 1.2 metadata
a-establish-nvme-setup[1036]: mdadm: array /dev/md/openqa started.
a-establish-nvme-setup[1023]: Status for RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1183]: md127 : active raid0 nvme0n1[0]
a-establish-nvme-setup[1185]: ARRAY /dev/md/openqa metadata=1.2 name=openqawork>
a-establish-nvme-setup[1023]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
a-establish-nvme-setup[1186]: mke2fs 1.46.4 (18-Aug-2021)
a-establish-nvme-setup[1186]: [220B blob data]
a-establish-nvme-setup[1186]: Creating filesystem with 195319936 4k blocks and >
a-establish-nvme-setup[1186]: Filesystem UUID: 3828224d-39fb-457e-a242-3a93f67d>
a-establish-nvme-setup[1186]: Superblock backups stored on blocks:
a-establish-nvme-setup[1186]: 32768, 98304, 163840, 229376, 294912, 819>
a-establish-nvme-setup[1186]: 4096000, 7962624, 11239424, 20480000, 238>
a-establish-nvme-setup[1186]: 102400000
a-establish-nvme-setup[1186]: [65B blob data]
md[1]: openqa_nvme_format.service: start operation timed out. Terminating.
md[1]: openqa_nvme_format.service: Main process exited, code=killed, status=15/>
md[1]: openqa_nvme_format.service: State 'final-sigterm' timed out. Killing.
</code></pre>
<p>Indeed, it looks like the filesystem isn't there.</p>
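<p>The <code>start operation timed out. Terminating.</code> messages above indicate the unit hit systemd's start timeout while <code>mke2fs</code> was still blocked. As a diagnostic aid one could raise the timeout via a drop-in; this is only a sketch (the unit name is taken from the log above, the values are arbitrary):</p>
<pre><code># /etc/systemd/system/openqa_nvme_format.service.d/timeout.conf
[Service]
TimeoutStartSec=20min
TimeoutStopSec=5min
</code></pre>
<p>With a longer timeout one could at least distinguish a merely slow format from a genuinely hung one.</p>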
<p>Manually invoking <code>mdadm --detail --scan | grep openqa</code> now gets stuck as well, and the kernel logs a workqueue lockup:</p>
<pre><code>openqaworker-arm-1:~ # mdadm --detail --scan | grep openqa
[ 1706.267001][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 34s!
[ 1706.276024][ C0] Showing busy workqueues and worker pools:
[ 1706.281836][ C0] workqueue events: flags=0x0
[ 1706.286421][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.286432][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1706.301198][ C0] workqueue events_power_efficient: flags=0x80
[ 1706.307284][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1706.307296][ C0] pending: fb_flashcursor
[ 1706.319505][ C0] workqueue kblockd: flags=0x18
[ 1706.324265][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1706.324275][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1706.324284][ C0] pending: blk_mq_timeout_work
[ 1706.342961][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=34s workers=2 idle: 136
[ 1736.927007][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 64s!
[ 1736.936024][ C0] Showing busy workqueues and worker pools:
[ 1736.941835][ C0] workqueue events: flags=0x0
[ 1736.946428][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[ 1736.946440][ C0] pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper]
[ 1736.962626][ C0] workqueue events_power_efficient: flags=0x80
[ 1736.968703][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1736.968712][ C0] pending: fb_flashcursor
[ 1736.980875][ C0] workqueue kblockd: flags=0x18
[ 1736.987060][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1736.987070][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1736.987077][ C0] pending: blk_mq_timeout_work
[ 1737.008504][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=64s workers=2 idle: 136
[ 1767.797002][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 95s!
[ 1767.806957][ C0] Showing busy workqueues and worker pools:
[ 1767.813713][ C0] workqueue events: flags=0x0
[ 1767.819240][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.819252][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1767.835824][ C0] workqueue events_power_efficient: flags=0x80
[ 1767.842848][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1767.842858][ C0] pending: fb_flashcursor
[ 1767.856904][ C0] workqueue kblockd: flags=0x18
[ 1767.862590][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1767.862599][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1767.862606][ C0] pending: blk_mq_timeout_work
[ 1767.883999][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=95s workers=2 idle: 136
[ 1798.457002][ C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 126s!
[ 1798.467028][ C0] Showing busy workqueues and worker pools:
[ 1798.473794][ C0] workqueue events: flags=0x0
[ 1798.479334][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.479345][ C0] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 1798.495919][ C0] workqueue events_power_efficient: flags=0x80
[ 1798.502928][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 1798.502938][ C0] pending: fb_flashcursor
[ 1798.516949][ C0] workqueue kblockd: flags=0x18
[ 1798.522595][ C0] pwq 49: cpus=24 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3
[ 1798.522605][ C0] in-flight: 1345:blk_mq_timeout_work
[ 1798.522611][ C0] pending: blk_mq_timeout_work
[ 1798.543882][ C0] pool 49: cpus=24 node=0 flags=0x0 nice=-20 hung=126s workers=2 idle: 136
</code></pre>
<p>I've seen the same kind of logging when just pressing Ctrl-D at the rescue shell prompt, so that's very likely the source of the issue. The question now is why this lockup happens.</p>
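<p>To see at a glance which CPU pool the lockup messages refer to, the relevant fields can be pulled out of the kernel log. A minimal sketch, run here against a pasted excerpt (on the machine itself one would pipe <code>journalctl -k</code> into the same filter):</p>

```shell
# Extract the stuck pool's CPUs from the workqueue-lockup lines.
# The excerpt is inlined for illustration; it is copied from the log above.
log='[ 1706.267001][   C0] BUG: workqueue lockup - pool cpus=24 node=0 flags=0x0 nice=-20 stuck for 34s!'
printf '%s\n' "$log" | grep -o 'cpus=[0-9]*' | sort -u
```

<p>For the log above this prints <code>cpus=24</code>, matching the repeated <code>pool 49: cpus=24</code> lines, i.e. it is always the same pool that is stuck.</p>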
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549824
2022-08-31T10:21:49Z
mkittler
marius.kittler@suse.com
<p>It could be a kernel bug. At least the machine boots fine with the older kernel 5.3.18-150300.59.71-default (instead of 5.14.21-150400.24.18-default). On the other hand, it is strange that the other ARM workers don't show this problem, as they run fine on the same new version.</p>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549908
2022-08-31T13:03:12Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>Now the worker booted fine on the new kernel version as well. The only thing I did was update the system (which did not include any kernel packages). This doesn't look like a problem with our NVMe setup script but rather like a sporadic kernel bug or even a hardware issue (considering the hardware is unstable anyway). So I'll just keep the system running for now and see what happens.</p>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=549917
2022-08-31T13:41:43Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/549917/diff?detail_id=519242">diff</a>)</li></ul>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=551947
2022-09-08T09:38:55Z
livdywan
liv.dywan@suse.com
<ul><li><strong>Subject</strong> changed from <i>Recover openqaworker-arm-1</i> to <i>Recover openqaworker-arm-1 size:M</i></li></ul>
openQA Infrastructure - action #116060: Recover openqaworker-arm-1 size:M
https://progress.opensuse.org/issues/116060?journal_id=552025
2022-09-08T13:16:42Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>Looks like the machine has been working as usual so far. There are only two small gaps visible in Grafana, but those are expected.</p>