https://progress.opensuse.org/ openSUSE Project Management Tool 2023-03-01T15:51:03Z
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=607817 2023-03-01T15:51:03Z mkittler marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul>
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=607820 2023-03-01T16:08:16Z mkittler marius.kittler@suse.com
<ul><li><strong>Subject</strong> changed from <i>Failed systemd services alert (except openqa.suse.de)</i> to <i>Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de)</i></li></ul><p>The failing unit was check-for-kernel-crash on worker11 and worker13. When I checked these workers, I could no longer find any dumps in <code>/var/crash</code>, and after restarting the service it succeeded again. However, at some point there must have been a crash dump:</p>
<pre><code>worker13:/home/martchus # journalctl -fu check-for-kernel-crash.service
Mar 01 00:42:20 worker13 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 00:42:20 worker13 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 00:42:20 worker13 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 10:25:31 worker13 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 10:25:31 worker13 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 10:25:31 worker13 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 10:25:31 worker13 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 17:02:48 worker13 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 17:02:48 worker13 systemd[1]: check-for-kernel-crash.service: Deactivated successfully.
Mar 01 17:02:48 worker13 systemd[1]: Finished Fail if at least one kernel crash has been recorded under /var/crash.
</code></pre><pre><code>martchus@worker11:~> sudo journalctl -fu check-for-kernel-crash.service
Mar 01 01:39:39 worker11 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 01:39:39 worker11 systemd[1]: check-for-kernel-crash.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 01:39:39 worker11 systemd[1]: check-for-kernel-crash.service: Failed with result 'exit-code'.
Mar 01 01:39:39 worker11 systemd[1]: Failed to start Fail if at least one kernel crash has been recorded under /var/crash.
Mar 01 13:58:05 worker11 systemd[1]: Starting Fail if at least one kernel crash has been recorded under /var/crash...
Mar 01 13:58:05 worker11 systemd[1]: check-for-kernel-crash.service: Deactivated successfully.
</code></pre>
<p>It doesn't look like anyone has already moved those dumps to <code>/var/crash-bak</code>, so I'm wondering where they went. Or did the systemd unit that checks <code>/var/crash</code> perhaps raise a false alarm?</p>
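<p>For illustration, the kind of check described by the unit ("Fail if at least one kernel crash has been recorded under /var/crash") could be sketched as follows. This is only a minimal sketch: the function name <code>crash_dir_is_clean</code> is hypothetical, and the actual command run by check-for-kernel-crash.service is not shown in this ticket and may differ.</p>

```shell
#!/bin/sh
# Hypothetical helper, NOT the actual implementation of check-for-kernel-crash.
# Returns 0 (success) if the directory holds no entries, non-zero otherwise.
crash_dir_is_clean() {
    dir="${1:-/var/crash}"
    # A missing directory means that nothing has been recorded.
    [ -d "$dir" ] || return 0
    # ls -A lists every entry except . and ..; empty output means no dumps.
    [ -z "$(ls -A "$dir" 2>/dev/null)" ]
}

# Example call (assumed usage):
#   crash_dir_is_clean /var/crash || echo "crash dumps present"
```

<p>A check of this shape only reports the state at the moment it runs, so it would succeed again as soon as the directory has been emptied, regardless of who or what removed the dumps.</p>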
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=607829 2023-03-01T16:30:00Z mkittler marius.kittler@suse.com
<ul></ul><p>Interesting log messages on worker13:</p>
<pre><code>Mar 01 00:03:47 worker13 smartd[1182]: Device: /dev/nvme0, Critical Warning (0x04): Reliability
…
Mar 01 00:34:28 worker13 velociraptor[1183]: [DEBUG] 2023-03-01T00:34:28+01:00 Connection Info {"IdleTime":2005766021,"LocalAddr":{"IP":"10.137.10.13","Port":47482,"Zone":""},"Reused":true,"WasIdle":true}
Mar 01 00:34:28 worker13 velociraptor[1183]: [INFO] 2023-03-01T00:34:28+01:00 Sender: sent 3955 bytes, response with status: 200 OK
Mar 01 00:34:28 worker13 velociraptor[1183]: [INFO] 2023-03-01T00:34:28+01:00 Sender: received 626 bytes
Mar 01 00:34:28 worker13 worker[6074]: [debug] [pid:6074] Uploading artefact tpm2_tools_encrypt-1.txt
Mar 01 00:34:28 worker13 worker[31425]: [debug] [pid:31425] Uploading artefact validate_btrfs-206.txt
-- Boot 8658499653c34f2f808678629b576a99 --
Mar 01 00:42:07 worker13 kernel: Linux version 5.14.21-150400.24.33-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60)
Mar 01 00:42:07 worker13 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.33-default root=UUID=1a83d875-0e24-49a9-8dc8-ddf8cf83f04e console=tty0 console=ttyS1,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
Mar 01 00:42:07 worker13 kernel: random: get_random_u32 called from bsp_init_amd+0x231/0x260 with crng_init=0
Mar 01 00:42:07 worker13 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 01 00:42:07 worker13 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
…
</code></pre>
<p>So there was indeed a crash: the log ends abruptly and then resumes with boot messages.</p>
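<p>To spot such abrupt log endings more quickly, one could print the last line logged before each boot marker. The following is a rough sketch that assumes the journal output contains <code>-- Boot &lt;id&gt; --</code> separator lines as in the excerpt above; the function name <code>last_line_before_boot</code> is made up for this example.</p>

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print the last line
# that was logged before each "-- Boot <id> --" separator.
last_line_before_boot() {
    awk '
        # Boot separator: emit the line remembered from the previous boot.
        /^-- Boot [0-9a-f]+ --$/ { if (prev != "") print prev; prev = ""; next }
        # Any other line: remember it as the most recent one.
        { prev = $0 }
    '
}

# Example call (assumed usage):
#   journalctl --since "2023-03-01" | last_line_before_boot
```

<p>A clean reboot also produces a boot marker, so the printed line still needs to be inspected: an ordinary service message (such as an artefact upload) rather than a shutdown message hints at a crash.</p>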
<p>On worker11 the logs look similar (although there's no critical warning from smartd):</p>
<pre><code>Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 Sender: sent 1059 bytes, response with status: 200 OK
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 Sender: received 626 bytes
Mar 01 01:34:27 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-1615.txt
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":50,\"WritePointer\":306,\"MaxSize\":1073741874,\"AvailableBytes\":248,\"LeasedBytes\":0}","leased_pointer":50}
Mar 01 01:34:27 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:27+01:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":50,\"WritePointer\":1876,\"MaxSize\":1073741874,\"AvailableBytes\":1810,\"LeasedBytes\":0}","leased_pointer":50}
Mar 01 01:34:27 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-291.txt
Mar 01 01:34:28 worker11 worker[10661]: [debug] [pid:10661] Uploading artefact run-470.txt
Mar 01 01:34:28 worker11 velociraptor[1198]: [INFO] 2023-03-01T01:34:28+01:00 Sender: Connected to https://sec-velociraptor.prg.suse.com:8000/control
-- Boot f8bcade8197d40478f65050eb56e0314 --
Mar 01 01:39:25 worker11 kernel: Linux version 5.14.21-150400.24.33-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 13:55:06 UTC 2022 (76cfe60)
Mar 01 01:39:25 worker11 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.33-default root=UUID=e5c48ed4-d3ca-4278-8350-46db38cfcf2e console=tty0 console=ttyS1,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
Mar 01 01:39:25 worker11 kernel: random: get_random_u32 called from bsp_init_amd+0x231/0x260 with crng_init=0
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar 01 01:39:25 worker11 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
</code></pre>
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=608018 2023-03-02T04:11:41Z openqa_review openqa-review@suse.de
<ul><li><strong>Due date</strong> set to <i>2023-03-16</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=608039 2023-03-02T06:16:55Z okurz okurz@suse.com
<ul><li><strong>Tags</strong> set to <i>infra, alert, systemd</i></li></ul>
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=608369 2023-03-02T12:11:16Z mkittler marius.kittler@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/125210">action #125210</a>: worker13 host up alert - kernel crash size:M</i> added</li></ul>
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=608375 2023-03-02T12:11:28Z mkittler marius.kittler@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/125207">action #125207</a>: worker11 host up alert - similar as for worker13</i> added</li></ul>
openQA Infrastructure - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) https://progress.opensuse.org/issues/125213?journal_id=608378 2023-03-02T12:12:47Z mkittler marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Closed</i></li></ul><p>The timestamps also match the times at which the host-up alerts were triggered. Those alerts are tracked in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: worker11 host up alert - similar as for worker13 (Resolved)" href="https://progress.opensuse.org/issues/125207">#125207</a> and <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: worker13 host up alert - kernel crash size:M (Resolved)" href="https://progress.opensuse.org/issues/125210">#125210</a>, so I'm closing this issue in favor of those individual tickets.</p>