action #125885
closed
worker10 crashed triggering systemd-services alert and host-up alert size:M
Added by mkittler over 1 year ago.
Updated over 1 year ago.
Description
Observation
It is up and running again. I've moved the crash dump to /var/crash-bak/2023-03-13-00:16 on that host.
Note that we've recently seen crashes on worker11 (#125207) and worker13 (#125210) as well.
Acceptance criteria
- AC1: We know why certain workers have started to crash in similar ways recently
Suggestion
- Reach out for help on public channels or to people from the kernel testing squad
- Search for relevant, already reported upstream bugs
- Priority changed from Normal to High
- Target version set to Ready
I've attached the dmesg log. It looks similar to the traces from worker11 and worker13. There's also a crash dump from a few months ago in /var/crash-bak/2022-09-26-16:02 but it looks different.
- Subject changed from worker10 crashed triggering systemd-services alert and host-up alert to worker10 crashed triggering systemd-services alert and host-up alert size:M
- Description updated (diff)
- Status changed from New to Feedback
- Tags deleted (alert, infra)
- Target version deleted (Ready)
I've asked on the public opensuse-factory channel and the internal kernel squad channel. Maybe someone knows more.
- Tags set to infra, alert
- Target version set to Ready
I've reported some new workqueue lockup warnings on SLE-15SP4 kernels in bsc#1201188 but this doesn't seem to be the same issue. All 3 workers, however, ran a Btrfs rebalance shortly (~20 minutes) before the crash. Can you trigger the crash by running the Btrfs rebalance manually? And if so, can you trigger it again when you boot the worker without the igb driver module?
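As an aside, a minimal sketch of how one could boot a worker without the igb module, assuming the host has another usable NIC driver (otherwise it would come up without network); the file name below is just an example:

# temporary, for a single test boot: append this to the kernel command line
# (e.g. by editing the GRUB entry at boot time)
modprobe.blacklist=igb

# or persistently until reverted: stop udev from auto-loading the module
echo "blacklist igb" | sudo tee /etc/modprobe.d/50-blacklist-igb.conf
sudo reboot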
All 3 workers, however, ran a Btrfs rebalance shortly (~20 minutes) before the crash.
I've of course noticed that as well but I'm not sure whether it is related.
Can you trigger the crash by running the Btrfs rebalance manually?
Good idea, I'll try that.
EDIT: I've just invoked sudo systemctl start btrfs-balance.service on all three workers (worker10, 11 and 13) to increase the chances of reproducing the issue. So far it hasn't led to any crashes, and the re-balancing has already finished on all hosts. Maybe this isn't an ideal way to reproduce the problem because there was likely not much to re-balance; the re-balancing didn't take very long.
I've just triggered one more re-balance on all three hosts, but it also exited very quickly, with no crashes so far.
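For completeness, a rough sketch of how such a run could be triggered and checked across the three hosts, assuming they are reachable over SSH under these short hostnames (which may not match the actual FQDNs):

# kick off the re-balance service on all three workers
for host in worker10 worker11 worker13; do
    ssh "$host" sudo systemctl start btrfs-balance.service
done

# later: check whether the balance has finished and whether the kernel log
# shows anything suspicious
for host in worker10 worker11 worker13; do
    echo "== $host =="
    ssh "$host" "sudo btrfs balance status /; sudo dmesg -T | grep -iE 'lockup|workqueue|btrfs' | tail -n 20"
done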
- Due date set to 2023-03-27
We suggest manually triggering the balance, e.g.
sudo btrfs balance start --full-balance /
I did 2 re-balances this way on each worker. It indeed took much more time than before. However, so far none of the machines has crashed.
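For the record, a sketch of how the full balance and its progress could be watched on a single host while waiting for a potential crash (nothing openQA-specific assumed here):

# start the full balance as a background process so the shell stays usable
sudo btrfs balance start --full-balance --bg /

# check progress from time to time
sudo btrfs balance status /

# and follow the kernel log for lockup warnings while it runs
sudo dmesg -wT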
- Status changed from Feedback to Resolved
There were no further crashes. I'm considering this resolved for now. Maybe it was even a kernel issue that has meanwhile been patched. At this point, spending more effort on investigating this is not worth it.