action #125885


worker10 crashed triggering systemd-services alert and host-up alert size:M

Added by mkittler over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-03-13
Due date:
2023-03-27
% Done:

0%

Estimated time:
Tags:

Description

Observation

It is up and running again. I've moved the crash dump to /var/crash-bak/2023-03-13-00:16 on that host.

Note that we've recently seen a crash on worker11 (#125207) and worker13 (#125210) as well.
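The dump relocation mentioned above can be sketched as a small helper; the paths are the ones from this ticket, but the function itself is ours, not an existing tool:

```shell
# Hedged sketch: move a kdump crash dump directory out of kdump's own
# purge path so it is preserved. The function name is ours.
crash_bak() {
    dump_dir=$1
    backup_root=$2
    mkdir -p "$backup_root"
    mv "$dump_dir" "$backup_root/"
    echo "moved $(basename "$dump_dir") to $backup_root"
}

# On the worker this would be run as root, e.g.:
#   crash_bak /var/crash/2023-03-13-00:16 /var/crash-bak
```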

Acceptance criteria

  • AC1: We know why certain workers have started to crash in similar ways recently

Suggestion

  • Reach out for help on public channels or to people from the kernel testing squad
  • Search for relevant, already-reported upstream bugs

Files

dmesg.txt (97.1 KB) mkittler, 2023-03-13 11:53
Actions #1

Updated by okurz over 1 year ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by mkittler over 1 year ago

I've attached the dmesg log. It looks similar to the traces from worker11 and worker13. There's also a crash dump from a few months ago in /var/crash-bak/2022-09-26-16:02 but it looks different.

Actions #3

Updated by livdywan over 1 year ago

  • Subject changed from worker10 crashed triggering systemd-services alert and host-up alert to worker10 crashed triggering systemd-services alert and host-up alert size:M
  • Description updated (diff)
  • Status changed from New to Feedback
Actions #4

Updated by mkittler over 1 year ago

  • Tags deleted (alert, infra)
  • Target version deleted (Ready)

I've asked on the public opensuse-factory channel and the internal kernel squad channel. Maybe someone knows more.

Actions #5

Updated by mkittler over 1 year ago

  • Tags set to infra, alert
  • Target version set to Ready
Actions #6

Updated by MDoucha over 1 year ago

I've reported some new workqueue lockup warnings on SLE-15SP4 kernels in bsc#1201188, but this doesn't seem to be the same issue. However, all 3 workers ran a Btrfs rebalance shortly (~20 minutes) before the crash. Can you trigger the crash by running a Btrfs rebalance manually? And if so, can you still trigger it when you boot the worker without the igb driver module?
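For the "without the igb driver module" part, one hedged option (an assumption about how to do it on these hosts; blacklisting the NIC driver takes networking down, so console/IPMI access would be needed):

```shell
# Hedged sketch: persistently blacklist the igb NIC driver via modprobe.d.
# The directory is a parameter only so this can be exercised without root;
# on a real host it would default to /etc/modprobe.d.
blacklist_igb() {
    dir=${1:-/etc/modprobe.d}
    printf 'blacklist igb\n' > "$dir/50-blacklist-igb.conf"
    echo "wrote $dir/50-blacklist-igb.conf"
}

# Alternatively, a one-off boot with the kernel parameter
# "modprobe.blacklist=igb" avoids any persistent change.
```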

Actions #7

Updated by mkittler over 1 year ago

However, all 3 workers ran a Btrfs rebalance shortly (~20 minutes) before the crash.

I've of course noticed that as well but I'm not sure whether it is related.

Can you trigger the crash by running a Btrfs rebalance manually?

Good idea, I'll try that.


EDIT: I've just invoked sudo systemctl start btrfs-balance.service on all three workers (worker10, 11 and 13) to increase the chances of reproducing the issue. So far it hasn't led to any crashes, and the re-balancing has already finished on all hosts. Maybe this isn't an ideal way to reproduce the problem because there was likely not much to re-balance; the re-balancing didn't take very long.

I've just triggered one more re-balancing on all three hosts but it also just exited very quickly with no crashes yet.
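The per-host invocation described above can be sketched as a small loop (host names taken from this ticket; the RUNNER indirection is ours, added only so the loop can be exercised without real hosts):

```shell
# Hedged sketch: start btrfs-balance.service on all three affected workers.
# RUNNER defaults to ssh; overriding it lets the loop run without any
# actual hosts or sudo rights.
start_balance_everywhere() {
    runner=${RUNNER:-ssh}
    for host in worker10 worker11 worker13; do
        $runner "$host" sudo systemctl start btrfs-balance.service
    done
}
```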

Actions #8

Updated by okurz over 1 year ago

  • Due date set to 2023-03-27

We suggest triggering the balance manually, e.g.

sudo btrfs balance start --full-balance /
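A full balance rewrites every chunk and can run for a long time, so it is worth watching its progress. A hedged sketch for waiting on it (the function name and the BTRFS/POLL_SECONDS overrides are ours, added for testability; the loop greps for the "is running" line that btrfs balance status prints while a balance is active):

```shell
# Hedged sketch: poll "btrfs balance status" until no balance is running
# on the given mountpoint. BTRFS can be pointed at a stub so the loop can
# be exercised without a real Btrfs filesystem or root.
wait_for_balance() {
    mnt=${1:-/}
    btrfs_cmd=${BTRFS:-sudo btrfs}
    while $btrfs_cmd balance status "$mnt" | grep -q 'is running'; do
        sleep "${POLL_SECONDS:-60}"
    done
    echo "no balance running on $mnt"
}
```

A running balance can also be aborted with sudo btrfs balance cancel / if it causes trouble.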
Actions #9

Updated by mkittler over 1 year ago

I did 2 re-balances this way on each worker. They indeed took much more time than before. However, so far none of the machines has crashed.

Actions #10

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

There were no further crashes. I'm considering this resolved for now. Maybe it was even a kernel issue that has since been patched. At this point, spending more effort on investigating this is not worth it.
