action #125213
closed
Failed systemd services alert due to crash dumps on worker11 and worker13 (except openqa.suse.de)
Start date:
2023-03-01
Due date:
2023-03-16
% Done:
0%
Estimated time:
Description
Updated by mkittler over 1 year ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler over 1 year ago
- Subject changed from Failed systemd services alert (except openqa.suse.de) to Failed systemd services alert due to crash dumps on worker11 and worker13 (except openqa.suse.de)
The failing unit was check-for-kernel-crash on worker11 and worker13. When I checked these workers I couldn't find any dumps in /var/crash anymore, and after restarting the service it succeeded again. However, at some point there must have been a crash dump:
worker13:/home/martchus # journalctl -fu check-for-kernel-crash.service
Mar 01 00:42:20 worker13 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 00:42:20 worker13 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 00:42:20 worker13 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 10:25:31 worker13 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 10:25:31 worker13 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 10:25:31 worker13 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 10:25:31 worker13 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 17:02:48 worker13 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 17:02:48 worker13 systemd[1]: check-for-kernel-crash.service: Deactivated successfully.
Mar 01 17:02:48 worker13 systemd[1]: Finished Fail if at least one kernel crash has been recorded under /var/crash.
martchus@worker11:~> sudo journalctl -fu check-for-kernel-crash.service
Mar 01 01:39:39 worker11 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 01:39:39 worker11 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 01:39:39 worker11 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 01:39:39 worker11 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 13:58:05 worker11 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 13:58:05 worker11 systemd[1]: check-for-kernel-crash.service: Deactivated successfully.
It doesn't look like anyone has already moved those dumps to /var/crash-bak, so I'm wondering where they went. Or did the systemd unit checking /var/crash perhaps raise a false alarm?
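To rule out a false alarm, an equivalent check can be re-run by hand. The command that check-for-kernel-crash.service actually runs isn't shown in this ticket, so the following is only a sketch based on the unit description ("Fail if at least one kernel crash has been recorded under /var/crash"). The function name crash_dir_is_clean is made up for illustration, and the sketch relies on GNU find for -quit.

```shell
# Hypothetical stand-in for the check done by check-for-kernel-crash.service.
# Returns 0 when the directory is missing or empty, non-zero otherwise.
crash_dir_is_clean() {
    dir="${1:-/var/crash}"
    # A missing directory cannot contain any dumps.
    [ -d "$dir" ] || return 0
    # Print the first entry below $dir (if any) and stop searching;
    # -mindepth 1 skips the directory itself.
    first_entry="$(find "$dir" -mindepth 1 -print -quit 2>/dev/null)"
    [ -z "$first_entry" ]
}
```

Running `crash_dir_is_clean /var/crash && echo clean || echo "dumps present"` on the worker would show whether anything is still recorded there. The same call with /var/crash-bak would show whether dumps were moved aside.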
Updated by mkittler over 1 year ago
Interesting log messages on worker13:
Mar 01 00:03:47 worker13 smartd[1182]: Device: /dev/nvme0, Critical Warning (0x04): Reliability
…
Mar 01 00:34:28 worker13 velociraptor[1183]: [DEBUG] 2023-03-01T00:34:28+01:00 Connection Info {"IdleTime":2005766021,"LocalAddr":{"IP":"10.137.10.13","Port":47482,"Zone":""},"Reused":true,"WasIdle":true}
Mar 01 00:34:28 worker13 velociraptor[1183]: [INFO] 2023-03-01T00:34:28+01:00 Sender: sent 3955 bytes, response with status: 200 OK
Mar 01 00:34:28 worker13 velociraptor[1183]: [INFO] 2023-03-01T00:34:28+01:00 Sender: received 626 bytes
Mar 01 00:34:28 worker13 worker[6074]: [debug] [pid:6074] Uploading artefact tpm2_tools_encrypt-1.txt
Mar 01 00:34:28 worker13 worker[31425]: [debug] [pid:31425] Uploading artefact validate_btrfs-206.txt
-- Boot 8658499653c34f2f808678629b576a99 --
Mar 01 00:42:07 worker13 kernel: Linux version 5.14.21-150400.24.33-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60)
Mar 01 00:42:07 worker13 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.33-default root=UUID=1a83d875-0e24-49a9-8dc8-ddf8cf83f04e console=tty0 console=ttyS1,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
Mar 01 00:42:07 worker13 kernel: random: get_random_u32 called from bsp_init_amd+0x231/0x260 with crng_init=0
Mar 01 00:42:07 worker13 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 01 00:42:07 worker13 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
…
So there was a crash, indeed (as the log just ends and then starts again with boot messages).
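For reference, the smartd message above reports the NVMe "Critical Warning" byte as 0x04. A small sketch for decoding that byte into its flags follows. The bit meanings are taken from the NVMe SMART / Health Information log page as I understand it, and the function name decode_nvme_critical_warning is made up for illustration.

```shell
# Decode the NVMe SMART "Critical Warning" byte into its individual flags.
# Accepts a decimal or 0x-prefixed hex value as the first argument.
decode_nvme_critical_warning() {
    value=$(( $1 ))
    bit=0
    # The names are listed in bit order, starting with bit 0.
    for name in \
        "available spare below threshold" \
        "temperature threshold exceeded" \
        "reliability degraded" \
        "media in read-only mode" \
        "volatile memory backup failed" \
        "persistent memory region read-only or unreliable"
    do
        # Print the flag if the corresponding bit is set.
        if [ $(( value & (1 << bit) )) -ne 0 ]; then
            echo "bit $bit: $name"
        fi
        bit=$(( bit + 1 ))
    done
}
```

Calling `decode_nvme_critical_warning 0x04` reports only bit 2 (reliability degraded), which matches the "Reliability" label in the smartd line. Whether that warning is related to the crash is not established by the logs here.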
On worker11 the logs look similar (although there's no critical warning from smartd):
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 Sender: sent 1059 bytes, response with status: 200 OK
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 Sender: received 626 bytes
Mar 01 01:34:27 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-1615.txt
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":50,\"WritePointer\":306,\"MaxSize\":1073741874,\"AvailableBytes\":248,\"LeasedBytes\":0}","leased_pointer":50}
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":50,\"WritePointer\":1876,\"MaxSize\":1073741874,\"AvailableBytes\":1810,\"LeasedBytes\":0}","leased_pointer":50}
Mar 01 01:34:27 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-291.txt
Mar 01 01:34:28 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-470.txt
Mar 01 01:34:28 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:28+01:00 Sender: Connected to https://sec-velociraptor.prg.suse.com:8000/control
-- Boot f8bcade8197d40478f65050eb56e0314 --
Mar 01 01:39:25 worker11 kernel: Linux version 5.14.21-150400.24.33-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60)
Mar 01 01:39:25 worker11 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.33-default root=UUID=e5c48ed4-d3ca-4278-8350-46db38cfcf2e console=tty0 console=ttyS1,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
Mar 01 01:39:25 worker11 kernel: random: get_random_u32 called from bsp_init_amd+0x231/0x260 with crng_init=0
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
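The pattern used above (the log just ends and is followed by a "-- Boot … --" marker) can also be pulled out of the journal mechanically. This is a minimal sketch that assumes the marker format seen in the excerpts above; the function name last_lines_before_boot is made up for illustration.

```shell
# Read journalctl text output on stdin and print the last log line seen
# before each "-- Boot <id> --" marker, i.e. the final message of the
# previous boot.
last_lines_before_boot() {
    awk '
        /^-- Boot [0-9a-f]+ --$/ {
            # Emit the line preceding the marker, if there was one.
            if (prev != "") print prev
            prev = ""
            next
        }
        { prev = $0 }
    '
}
```

On a worker this could be fed with `journalctl --no-pager | last_lines_before_boot`. If the printed line is an ordinary service message such as an artefact upload, rather than a shutdown message, that suggests the previous boot did not end with a clean shutdown.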
Updated by openqa_review over 1 year ago
- Due date set to 2023-03-16
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
- Related to action #125210: worker13 host up alert - kernel crash size:M added
Updated by mkittler over 1 year ago
- Related to action #125207: worker11 host up alert - similar as for worker13 added