action #125207

worker11 host up alert - similar as for worker13

Added by cdywan 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2023-03-01
Due date:
% Done: 0%
Estimated time:

Description

Observation

Several alert emails about worker11 being (un)available.

Acceptance criteria

  • AC1: No alerts about worker11

Suggestions

  • Follow what has been done for #125210
  • Confirm what happened to the machine
  • Ensure worker11 is stable
  • Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, manually installed packages, or other things
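
For the disk space suggestion above, a minimal sketch of a quick check, assuming Python 3 is available on the worker. The function names and the 90% threshold are hypothetical and not taken from this ticket. On BTRFS the figures reported this way can be approximate, so `btrfs filesystem usage` may give a more accurate picture.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of used space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path="/", threshold_percent=90.0):
    """Return True if usage is at or above threshold_percent (assumed default: 90)."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # "/" is only an example mount point; adjust to the filesystem of interest.
    print(f"/ is {disk_usage_percent('/'):.1f}% full")
```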

dmesg-worker11.log (100 KB), mkittler, 2023-03-06 09:56

Related issues

Related to openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) (Closed, 2023-03-01, 2023-03-16)

Copied to openQA Infrastructure - action #125210: worker13 host up alert - kernel crash size:M (Resolved, 2023-03-01)

History

#1 Updated by cdywan 3 months ago

  • Copied to action #125210: worker13 host up alert - kernel crash size:M added

#2 Updated by osukup 3 months ago

The kernel crashed.

#3 Updated by okurz 3 months ago

  • Tags set to infra, alert, reactive work, osd, worker11, nbg

#4 Updated by mkittler 3 months ago

  • Related to action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) added

#5 Updated by mkittler 3 months ago

Not sure whether #125210 is related or whether both workers just crashed coincidentally in a close time-frame (see #125210#note-5 for the initial investigation).

#6 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#7 Updated by mkittler 3 months ago

The vanished crash dump was actually moved to /home/osukup/2023-03-01-01:35. So everything worked as intended regarding the crash reporting and the systemd service we use to trigger an alert.

We apparently got hardware errors while BTRFS was re-balancing, and the crash happened soon afterwards. I have attached the dmesg logs for details.
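
To compare a future dmesg log against the attached one, a rough sketch of how the relevant lines could be pulled out. The regular expressions are assumptions about what hardware error and BTRFS messages typically look like, not strings copied from dmesg-worker11.log, and the function names are hypothetical.

```python
import re

# Assumed patterns; adjust them after looking at the real log.
PATTERNS = {
    "hardware_error": re.compile(r"hardware error|machine check|\bmce\b", re.IGNORECASE),
    "btrfs": re.compile(r"btrfs", re.IGNORECASE),
}


def classify_lines(lines):
    """Map each category name to a list of (line_number, line) matches."""
    hits = {name: [] for name in PATTERNS}
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append((number, line.rstrip("\n")))
    return hits


def classify_file(path):
    """Run classify_lines over a log file, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return classify_lines(handle)
```

Calling `classify_file("dmesg-worker11.log")` returns the matching lines per category with their line numbers, which makes it easier to see whether hardware errors and BTRFS activity appear close together.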

#8 Updated by okurz 3 months ago

  • Subject changed from worker11 host up alert to worker11 host up alert - similar as for worker13
  • Description updated (diff)
  • Status changed from New to Resolved

Host is stable right now and well-covered in our monitoring and alerting. If any similar case happens again we have the dmesg log output attached here for comparison.
