action #125213

Failed systemd services alert due to crash dumps on worker11 and worker13 (except openqa.suse.de)

Added by livdywan almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: High
Assignee:
Category: -
Start date: 2023-03-01
Due date: 2023-03-16
% Done: 0%
Estimated time:

Description

Observation

Several alert emails were received for "Failed systemd services alert (except openqa.suse.de)".

Acceptance criteria

  • AC1: No alerts about failed systemd services on hosts other than OSD (openqa.suse.de)

Suggestions

  • Confirm what actually failed
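One way to confirm what actually failed is to ask systemd on the affected worker for its failed units. The following is only a minimal sketch: the helper names `parse_failed_units` and `list_failed_units` are made up for illustration, and it assumes `systemctl` is available on the host and that the unit name is the first column of the `--plain --no-legend` output.

```python
import subprocess


def parse_failed_units(systemctl_output: str) -> list[str]:
    """Return the unit names from `systemctl --failed --plain --no-legend` output.

    Assumption: each non-empty line starts with the unit name, followed by
    whitespace-separated state columns and the description.
    """
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def list_failed_units() -> list[str]:
    """Query the local systemd instance for failed units (needs systemctl)."""
    result = subprocess.run(
        ["systemctl", "--failed", "--plain", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_failed_units(result.stdout)
```

Calling `list_failed_units()` on a worker would return something like a list containing `check-for-kernel-crash.service` while that unit is in the failed state; the parsing step is kept separate so it can be tried on saved output as well.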

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #125210: worker13 host up alert - kernel crash size:M (Resolved, mkittler, 2023-03-01)

Related to openQA Infrastructure (public) - action #125207: worker11 host up alert - similar as for worker13 (Resolved, mkittler, 2023-03-01)

Actions #1

Updated by mkittler almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #2

Updated by mkittler almost 2 years ago

  • Subject changed from Failed systemd services alert (except openqa.suse.de) to Failed systemd services alert due to crash dumps on worker11 and worker13 (except openqa.suse.de)

The failing unit was check-for-kernel-crash on worker11 and worker13. When I checked these workers, I could no longer find any dumps in /var/crash, and after restarting the service it succeeded again. However, there must have been a crash dump at some point:

worker13:/home/martchus # journalctl -fu check-for-kernel-crash.service
Mar 01 00:42:20 worker13 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 00:42:20 worker13 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 00:42:20 worker13 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 10:25:31 worker13 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 10:25:31 worker13 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 10:25:31 worker13 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 10:25:31 worker13 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 17:02:48 worker13 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 17:02:48 worker13 systemd[1]: check-for-kernel-crash.service: Deactivated successfully.
Mar 01 17:02:48 worker13 systemd[1]: Finished Fail if at least one kernel crash has been recorded under /var/crash.
martchus@worker11:~> sudo journalctl -fu check-for-kernel-crash.service
Mar 01 01:39:39 worker11 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 01:39:39 worker11 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 01:39:39 worker11 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 01:39:39 worker11 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 13:58:05 worker11 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 13:58:05 worker11 systemd[1]: check-for-kernel-crash.service: Deactivated successfully.

It doesn't look like anyone has already moved those dumps to /var/crash-bak, so I'm wondering where they went. Or did the systemd unit that checks /var/crash perhaps raise a false alarm?
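To reason about whether the check could have raised a false alarm, it helps to spell out the contract its description suggests: fail if anything is recorded under /var/crash. The ticket does not show the unit's actual command, so the following is only a sketch under that assumption; `recorded_crashes` and `check_exit_code` are hypothetical names.

```python
from pathlib import Path


def recorded_crashes(crash_dir: str = "/var/crash") -> list[str]:
    """Return the sorted names of all entries found directly in crash_dir.

    A missing directory is treated as "no crashes recorded".
    """
    path = Path(crash_dir)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())


def check_exit_code(crash_dir: str = "/var/crash") -> int:
    """Mimic the assumed contract: 1 if anything is recorded, otherwise 0."""
    return 1 if recorded_crashes(crash_dir) else 0
```

Under this assumption, a check that failed earlier and succeeded after a restart would mean the directory was non-empty at the time of the failures and empty later, which fits the observation that no dumps were left in /var/crash.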

Actions #3

Updated by mkittler almost 2 years ago

Interesting log messages on worker13:

Mar 01 00:03:47 worker13 smartd[1182]: Device: /dev/nvme0, Critical Warning (0x04): Reliability
…
Mar 01 00:34:28 worker13 velociraptor[1183]: [DEBUG] 2023-03-01T00:34:28+01:00 Connection Info {"IdleTime":2005766021,"LocalAddr":{"IP":"10.137.10.13","Port":47482,"Zone":""},"Reused":true,"WasIdle":true}
Mar 01 00:34:28 worker13 velociraptor[1183]: [INFO] 2023-03-01T00:34:28+01:00 Sender: sent 3955 bytes, response with status: 200 OK
Mar 01 00:34:28 worker13 velociraptor[1183]: [INFO] 2023-03-01T00:34:28+01:00 Sender: received 626 bytes
Mar 01 00:34:28 worker13 worker[6074]: [debug] [pid:6074] Uploading artefact tpm2_tools_encrypt-1.txt
Mar 01 00:34:28 worker13 worker[31425]: [debug] [pid:31425] Uploading artefact validate_btrfs-206.txt
-- Boot 8658499653c34f2f808678629b576a99 --
Mar 01 00:42:07 worker13 kernel: Linux version 5.14.21-150400.24.33-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60)
Mar 01 00:42:07 worker13 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.33-default root=UUID=1a83d875-0e24-49a9-8dc8-ddf8cf83f04e console=tty0 console=ttyS1,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
Mar 01 00:42:07 worker13 kernel: random: get_random_u32 called from bsp_init_amd+0x231/0x260 with crng_init=0
Mar 01 00:42:07 worker13 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 01 00:42:07 worker13 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
…

So there was indeed a crash: the log simply ends and then starts again with boot messages.
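The same reading of the journal can be automated by looking for the `-- Boot <id> --` markers and reporting the last message before and the first message after each one. This is a rough sketch that works on journal text already saved as lines; the function name `find_reboots` is made up for illustration, and deciding whether a reboot was a crash (ordinary activity right up to the marker, no shutdown messages) is still left to the reader.

```python
import re

# Marker line that journalctl prints between boots, as seen in the logs above.
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]+) --$")


def find_reboots(journal_lines):
    """Return one dict per boot marker with the neighbouring log lines.

    Each dict has the keys "boot_id", "before" (last line seen before the
    marker, or None) and "after" (first line following it, or None).
    """
    reboots = []
    previous = None
    for raw in journal_lines:
        line = raw.rstrip("\n")
        match = BOOT_MARKER.match(line)
        if match:
            reboots.append(
                {"boot_id": match.group(1), "before": previous, "after": None}
            )
            continue
        # The first regular line after a marker belongs to the new boot.
        if reboots and reboots[-1]["after"] is None:
            reboots[-1]["after"] = line
        previous = line
    return reboots
```

If "before" is a routine message such as an artefact upload and "after" is the kernel's "Linux version" line, the host most likely went down without a clean shutdown.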

On worker11 the logs look similar (although there's no critical warning from smartd):

Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 Sender: sent 1059 bytes, response with status: 200 OK
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 Sender: received 626 bytes
Mar 01 01:34:27 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-1615.txt
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":50,\"WritePointer\":306,\"MaxSize\":1073741874,\"AvailableBytes\":248,\"LeasedBytes\":0}","leased_pointer":50}
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":50,\"WritePointer\":1876,\"MaxSize\":1073741874,\"AvailableBytes\":1810,\"LeasedBytes\":0}","leased_pointer":50}
Mar 01 01:34:27 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-291.txt
Mar 01 01:34:28 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-470.txt
Mar 01 01:34:28 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:28+01:00 Sender: Connected to https://sec-velociraptor.prg.suse.com:8000/control
-- Boot f8bcade8197d40478f65050eb56e0314 --
Mar 01 01:39:25 worker11 kernel: Linux version 5.14.21-150400.24.33-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60)
Mar 01 01:39:25 worker11 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.33-default root=UUID=e5c48ed4-d3ca-4278-8350-46db38cfcf2e console=tty0 console=ttyS1,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
Mar 01 01:39:25 worker11 kernel: random: get_random_u32 called from bsp_init_amd+0x231/0x260 with crng_init=0
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Actions #4

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-03-16

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz almost 2 years ago

  • Tags set to infra, alert, systemd
Actions #6

Updated by mkittler almost 2 years ago

  • Related to action #125210: worker13 host up alert - kernel crash size:M added
Actions #7

Updated by mkittler almost 2 years ago

  • Related to action #125207: worker11 host up alert - similar as for worker13 added
Actions #8

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Closed

The timestamps also match those of the host-up alerts, which are tracked in #125207 and #125210. So I'm closing this issue in favor of those individual tickets.
