action #125207

closed

worker11 host up alert - similar as for worker13

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-03-01
Due date:
% Done: 0%
Estimated time:

Description

Observation

Several alert emails about worker11 being (un)available.

Acceptance criteria

  • AC1: No alerts about worker11

Suggestions

  • Follow what has been done for #125210
  • Confirm what happened to the machine
  • Ensure worker11 is stable
  • Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, manually installed packages or other things

Files

dmesg-worker11.log (100 KB), added by mkittler, 2023-03-06 09:56

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) | Closed | mkittler | 2023-03-01 | 2023-03-16

Copied to openQA Infrastructure - action #125210: worker13 host up alert - kernel crash size:M | Resolved | mkittler | 2023-03-01

Actions #1

Updated by livdywan over 1 year ago

  • Copied to action #125210: worker13 host up alert - kernel crash size:M added
Actions #2

Updated by osukup over 1 year ago

The kernel crashed.

Actions #3

Updated by okurz over 1 year ago

  • Tags set to infra, alert, reactive work, osd, worker11, nbg
Actions #4

Updated by mkittler over 1 year ago

  • Related to action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) added
Actions #5

Updated by mkittler over 1 year ago

I'm not sure whether #125210 is related or whether both workers just happened to crash within a short time frame (see #125210#note-5 for the initial investigation).

Actions #6

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #7

Updated by mkittler over 1 year ago

The vanished crash dump was actually moved to /home/osukup/2023-03-01-01:35. So crash reporting worked as intended, as did the systemd service we use to trigger an alert.

We apparently got hardware errors while BTRFS was re-balancing, and the crash happened soon afterwards. I have attached the dmesg logs for details.
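
For comparing against a future crash, a small filter like the following could pull the relevant lines out of a saved dmesg log. This is only a sketch: the keyword pattern and the function names `filter_dmesg` and `filter_dmesg_file` are assumptions for illustration and are not taken from the attached log, so the pattern likely needs adjusting to the messages actually present.

```python
import re

# Assumed keywords only; the real messages in the attached log may differ.
PATTERN = re.compile(r"hardware error|machine check|mce:|btrfs", re.IGNORECASE)


def filter_dmesg(lines):
    """Return the lines (without trailing newline) that match an assumed keyword."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def filter_dmesg_file(path):
    """Read a saved dmesg log and return only the matching lines."""
    # errors="replace" avoids failing on stray non-UTF-8 bytes in kernel logs.
    with open(path, errors="replace") as log:
        return filter_dmesg(log)
```

Calling `filter_dmesg_file("dmesg-worker11.log")` on a local copy of the attachment would return the matching lines in their original order, which makes it easier to see whether the errors precede the crash.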

Actions #8

Updated by okurz over 1 year ago

  • Subject changed from worker11 host up alert to worker11 host up alert - similar as for worker13
  • Description updated (diff)
  • Status changed from New to Resolved

The host is stable right now and well covered by our monitoring and alerting. If a similar case happens again, we have the dmesg log output attached here for comparison.
