action #59855

openqaworker-arm-1 seems to be under serious distress "kernel:[93903.692361] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 42657s!"

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Start date: 14/11/2019
Priority: Normal
Due date: 21/11/2019
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Done
Duration: 6

Description

Observation

For example, see https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/139724 :

openqaworker-arm-1.suse.de:
    Minion did not return. [No response]

In a root SSH session on the machine:

Message from syslogd@openqaworker-arm-1 at Nov 14 19:00:26 ...
 kernel:[108858.247643] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 57612s!

Message from syslogd@openqaworker-arm-1 at Nov 14 19:00:56 ...
 kernel:[108888.966978] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 57642s!

Message from syslogd@openqaworker-arm-1 at Nov 14 19:01:27 ...
 kernel:[108919.696316] BUG: workqueue lockup - pool cpus=32 node=0 flags=0x0 nice=0 stuck for 57673s!
…
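
The "stuck for" value grows in step with the kernel timestamp, so subtracting one from the other should give the uptime at which the worker pool stopped making progress. A minimal sketch of that arithmetic, assuming the message format shown above; the function name lockup_start is hypothetical and not part of any tool:

```shell
# Read kernel log lines on stdin and, for each workqueue lockup message,
# print (kernel timestamp - "stuck for" seconds), truncated to whole seconds.
# lockup_start is a hypothetical helper name used only for this sketch.
lockup_start() {
  awk '
    /workqueue lockup/ {
      # Kernel timestamp such as [108858.247643]
      if (!match($0, /\[ *[0-9]+\.[0-9]+\]/)) next
      ts = substr($0, RSTART + 1, RLENGTH - 2)
      # Duration such as "stuck for 57612s"; "stuck for " is 10 characters
      if (!match($0, /stuck for [0-9]+s/)) next
      stuck = substr($0, RSTART + 10, RLENGTH - 11)
      printf "%d\n", ts - stuck
    }'
}

# Example: journalctl -k | lockup_start
```

Fed the three messages above, this prints 51246 each time, which suggests the pool has been stuck since roughly 14 hours after boot.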

Many processes are stalled on I/O and the load is high:

# cat /proc/loadavg 
35.53 35.65 35.62 4/877 27365
# ps -weo stat,pid,wchan:32,args | grep '^D\>'
D      362 rcu_exp_wait_wake                [kworker/6:1]
D     4189 io_schedule                      [kworker/u96:9]
D     8636 io_schedule                      /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D     8893 io_schedule                      /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D     9272 io_schedule                      /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D     9654 io_schedule                      /usr/bin/perl /usr/share/openqa/script/openqa-workercache minion worker -m production
D    14271 flush_work                       /usr/bin/gpg2 --version
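
To see at a glance where the blocked processes are waiting, the same ps output can be grouped by wait channel. A minimal sketch, assuming the column order stat,pid,wchan,args used above; the function name dstate_by_wchan is hypothetical. Unlike the grep above, it also counts states with a suffix letter such as "Ds":

```shell
# Read `ps -eo stat,pid,wchan:32,args` output on stdin and print one line per
# wait channel, "<count> <wchan>", for processes whose state starts with D.
# dstate_by_wchan is a hypothetical helper name used only for this sketch.
dstate_by_wchan() {
  awk '$1 ~ /^D/ { count[$3]++ }
       END { for (w in count) print count[w], w }' |
    sort -k1,1nr -k2,2   # highest count first, then by channel name
}

# Example: ps -weo stat,pid,wchan:32,args | dstate_by_wchan
```

For the listing above this would report five processes in io_schedule and one each in flush_work and rcu_exp_wait_wake.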

History

#1 Updated by okurz 3 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Urgent to Normal

Triggered systemctl restart openqa-worker-cacheservice-minion.service but, as expected, this does not work because the minion processes are stuck in state D (uninterruptible sleep), i.e. blocked on I/O.

# cat /proc/8636/stack
[<ffff0000080876b4>] __switch_to+0x9c/0xe0
[<ffff00000810d5b8>] io_schedule+0x20/0x40
[<ffff000008238fa8>] __lock_page+0xf0/0x120
[<ffff00000823a4c8>] pagecache_get_page+0x1d0/0x270
[<ffff0000083a2bf0>] ext4_mb_init_group+0x90/0x278
[<ffff0000083a2f90>] ext4_mb_good_group+0x1b8/0x1c8
[<ffff0000083a6704>] ext4_mb_regular_allocator+0x18c/0x4c0
[<ffff0000083a8a58>] ext4_mb_new_blocks+0x488/0x5c8
[<ffff00000838d5f0>] ext4_ind_map_blocks+0x9c8/0xa70
[<ffff00000839588c>] ext4_map_blocks+0x274/0x5c8
[<ffff000008395c44>] _ext4_get_block+0x64/0x108
[<ffff000008395d28>] ext4_get_block+0x40/0x50
[<ffff000008392348>] ext4_block_write_begin+0x138/0x490
[<ffff00000839a880>] ext4_write_begin+0x160/0x550
[<ffff000008239dc0>] generic_perform_write+0x98/0x188
[<ffff00000823b888>] __generic_file_write_iter+0x158/0x1c8
[<ffff00000838689c>] ext4_file_write_iter+0xa4/0x3b8
[<ffff0000082d4c20>] __vfs_write+0xd0/0x148
[<ffff0000082d60d4>] vfs_write+0xac/0x1b8
[<ffff0000082d779c>] SyS_write+0x54/0xb0
[<ffff000008083c30>] el0_svc_naked+0x44/0x48
[<ffffffffffffffff>] 0xffffffffffffffff

Neither strace nor lsof reveals anything further. Reported bug https://bugzilla.opensuse.org/show_bug.cgi?id=1156813 and forced a reboot over IPMI SOL by sending a break with ~B followed by the sysrq keys "s" (sync), "u" (remount read-only) and "b" (reboot). Hint: to send the break signal several times for multiple sysrq actions, press "ret" (Return) after each command, e.g. "~B", then "s", then "ret", then again "~B", …
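
The "s", "u", "b" keys are the Magic SysRq actions for emergency sync, remount read-only and immediate reboot. If a root shell still responds, the same keys can alternatively be written to /proc/sysrq-trigger instead of being sent as serial breaks. This is a sketch of that alternative, not what was done in this ticket; the function name sysrq_sequence, its arguments and the dry-run default are assumptions made for illustration:

```shell
# Write the sysrq keys s, u, b to the kernel trigger file, one at a time.
# Usage: sysrq_sequence [dry-run|armed] [trigger-file]
# Defaults to dry-run so that nothing is written unless "armed" is given.
# sysrq_sequence and its arguments are hypothetical, for illustration only.
sysrq_sequence() {
  mode=${1:-dry-run}
  trigger=${2:-/proc/sysrq-trigger}
  for key in s u b; do
    if [ "$mode" = "armed" ]; then
      printf '%s' "$key" > "$trigger"
      sleep 2   # give sync and remount a moment before the next key
    else
      printf 'would write "%s" to %s\n' "$key" "$trigger"
    fi
  done
}

# Example (as root, reboots the machine immediately): sysrq_sequence armed
```

Called without arguments, it only prints the three writes it would perform.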

#2 Updated by okurz 3 months ago

  • Due date set to 21/11/2019
  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint

The machine is back up. Applied the high state manually from osd with sudo salt -l error --state-output=changes '*arm*1*' state.apply without problems. Let's watch for a few days whether this reappears and close the ticket otherwise. I will follow the bug anyway.

#3 Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

Checked on arm-1; the problem did not reappear.
