action #59855
openqaworker-arm-1 seems to be under serious distress "kernel:[93903.692361] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 42657s!"
Description
Observation
E.g. see https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/139724 :
openqaworker-arm-1.suse.de:
Minion did not return. [No response]
On the machine in a root ssh session:
Message from syslogd@openqaworker-arm-1 at Nov 14 19:00:26 ...
kernel:[108858.247643] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 57612s!
Message from syslogd@openqaworker-arm-1 at Nov 14 19:00:56 ...
kernel:[108888.966978] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 57642s!
Message from syslogd@openqaworker-arm-1 at Nov 14 19:01:27 ...
kernel:[108919.696316] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 57673s!
…
Many processes are stalled on IO and the load is high:
# cat /proc/loadavg
35.53 35.65 35.62 4/877 27365
# ps -weo stat,pid,wchan:32,args | grep '^D\>'
D 362 rcu_exp_wait_wake [kworker/6:1]
D 4189 io_schedule [kworker/u96:9]
D 8636 io_schedule /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D 8893 io_schedule /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D 9272 io_schedule /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D 9654 io_schedule /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D 14271 flush_work /usr/bin/gpg2 --version
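For reference, a minimal one-liner (assuming root and a kernel that exposes /proc/<pid>/stack) to dump the kernel stacks of all D-state processes at once:
# for p in $(ps -eo stat=,pid= | awk '$1 ~ /^D/ {print $2}'); do echo "== PID $p =="; cat /proc/$p/stack; done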
Updated by okurz about 5 years ago
- Status changed from New to In Progress
- Assignee set to okurz
- Priority changed from Urgent to Normal
Triggered systemctl restart openqa-worker-cacheservice-minion.service but, as expected, this does not help as the minion processes are stuck in state D, i.e. blocked on IO.
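For completeness, a minimal check (PID 8636 taken from the ps output above) that SIGKILL cannot take effect while a task sleeps uninterruptibly:
# grep ^State /proc/8636/status
# kill -9 8636
# grep ^State /proc/8636/status
Both greps are expected to keep reporting "D (disk sleep)": the signal is only acted upon once the task leaves uninterruptible sleep, which is why the service restart hangs. The kernel stack of one of the stuck minion workers: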
# cat /proc/8636/stack
[<ffff0000080876b4>] __switch_to+0x9c/0xe0
[<ffff00000810d5b8>] io_schedule+0x20/0x40
[<ffff000008238fa8>] __lock_page+0xf0/0x120
[<ffff00000823a4c8>] pagecache_get_page+0x1d0/0x270
[<ffff0000083a2bf0>] ext4_mb_init_group+0x90/0x278
[<ffff0000083a2f90>] ext4_mb_good_group+0x1b8/0x1c8
[<ffff0000083a6704>] ext4_mb_regular_allocator+0x18c/0x4c0
[<ffff0000083a8a58>] ext4_mb_new_blocks+0x488/0x5c8
[<ffff00000838d5f0>] ext4_ind_map_blocks+0x9c8/0xa70
[<ffff00000839588c>] ext4_map_blocks+0x274/0x5c8
[<ffff000008395c44>] _ext4_get_block+0x64/0x108
[<ffff000008395d28>] ext4_get_block+0x40/0x50
[<ffff000008392348>] ext4_block_write_begin+0x138/0x490
[<ffff00000839a880>] ext4_write_begin+0x160/0x550
[<ffff000008239dc0>] generic_perform_write+0x98/0x188
[<ffff00000823b888>] __generic_file_write_iter+0x158/0x1c8
[<ffff00000838689c>] ext4_file_write_iter+0xa4/0x3b8
[<ffff0000082d4c20>] __vfs_write+0xd0/0x148
[<ffff0000082d60d4>] vfs_write+0xac/0x1b8
[<ffff0000082d779c>] SyS_write+0x54/0xb0
[<ffff000008083c30>] el0_svc_naked+0x44/0x48
[<ffffffffffffffff>] 0xffffffffffffffff
and neither strace nor lsof can tell me anything. Reported bug https://bugzilla.opensuse.org/show_bug.cgi?id=1156813 and force-rebooted the machine over IPMI SOL, sending a break with "~B", then "s", "u", "b". Hint: to use the break signal multiple times for multiple sysrq actions, press "ret" after each command, e.g. "~B", then "s", then "ret", then again "~B", …
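A SOL session can be opened roughly like this (ipmitool invocation sketched; BMC hostname and credentials are placeholders, not from this ticket):
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate
Within the SOL session "~B" sends a serial break, which the kernel treats as the sysrq key, so "~B" + "s" syncs, "~B" + "u" remounts filesystems read-only and "~B" + "b" reboots.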
Updated by okurz about 5 years ago
- Due date set to 2019-11-21
- Status changed from In Progress to Feedback
- Target version set to Current Sprint
The machine is back up. Applied the high state manually from osd with sudo salt -l error --state-output=changes '*arm*1*' state.apply
without problems. Let's watch for a few days whether this reappears and close otherwise. I will follow the bug anyway.
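A quick way to verify from osd that the minion responds again (same target glob as above):
sudo salt '*arm*1*' test.ping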
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved
- Target version changed from Current Sprint to Done
Checked on arm-1, the lockup did not reappear.