https://progress.opensuse.org/
2023-01-23T09:21:49Z
openSUSE Project Management Tool
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=595960
2023-01-23T09:21:49Z
jbaier_cz
jbaier@suse.cz
<ul><li><strong>Copied from</strong> <i><a href="/issues/123151">action #123151</a>: [Alerting] Failed systemd services alert</i> added</li></ul>
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596047
2023-01-23T10:40:35Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul>
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596056
2023-01-23T10:49:49Z
mkittler
marius.kittler@suse.com
<p>openqa-piworker.qa.suse.de has just been moved from the TAM lab to SRV2. It seems to struggle to reboot. That would at least explain the failing service (likely a timeout) and why I currently cannot access it via SSH. Physical access may be required to recover the machine. I could have a look later this week, but maybe it is better if <a href="https://progress.opensuse.org/users/664">@dheidler</a> takes care of this.</p>
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596062
2023-01-23T10:55:53Z
mkittler
marius.kittler@suse.com
<p>On worker2 a kernel crash dump was collected. I have moved it from <code>/var/crash</code> to <code>/var/crash-bak</code> so the alert can clear after restarting <code>check-for-kernel-crash.service</code>. The good news is that our approach for monitoring crashes works in practice. I'll investigate the dump; it is just a single one, created on 2023-01-22-04:21.</p>
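<p>Roughly the steps described above as shell commands, a sketch; the service name and paths come from this comment:</p>
<pre><code># Move the dump aside so the check no longer sees it:
mv /var/crash /var/crash-bak
# Re-run the check so the alert can clear:
systemctl restart check-for-kernel-crash.service
systemctl status check-for-kernel-crash.service
</code></pre>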
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596074
2023-01-23T11:21:29Z
mkittler
marius.kittler@suse.com
<p>Judging by the journal, the worker was shutting down around the time the crash dump was collected. A problem with kernel auditing was logged shortly before the reboot:</p>
<pre><code>Jan 22 03:31:32 worker2 auditd[1516]: Error receiving audit netlink packet (No buffer space available)
Jan 22 03:31:32 worker2 kernel: br0: port 1(eth0) entered disabled state
Jan 22 03:31:32 worker2 kernel: device eth0 left promiscuous mode
Jan 22 03:31:32 worker2 kernel: audit: audit_backlog=65 > audit_backlog_limit=64
Jan 22 03:31:32 worker2 kernel: audit: audit_lost=8 audit_rate_limit=0 audit_backlog_limit=64
Jan 22 03:31:32 worker2 kernel: audit: backlog limit exceeded
</code></pre>
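<p>Should the backlog overflow recur, the limit could be raised. A minimal sketch using the stock audit tooling; the value 8192 is illustrative, not derived from this machine:</p>
<pre><code># Raise the audit backlog limit at runtime (value is illustrative):
auditctl -b 8192
# Persist it by putting "-b 8192" into /etc/audit/rules.d/audit.rules, then:
augenrules --load
</code></pre>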
<p>In the end, very shortly before the reboot, the audit daemon was at least deactivated successfully:</p>
<pre><code>Jan 22 03:31:52 worker2 auditd[1516]: The audit daemon is exiting.
Jan 22 03:31:52 worker2 kernel: audit: type=1305 audit(1674354712.861:465705): op=set audit_pid=0 old=1516 auid=4294967295 ses=4294967295 subj=unconfined res=1
Jan 22 03:31:53 worker2 auditctl[6009]: enabled 0
Jan 22 03:31:53 worker2 auditctl[6009]: failure 1
Jan 22 03:31:53 worker2 auditctl[6009]: pid 0
Jan 22 03:31:53 worker2 auditctl[6009]: rate_limit 0
Jan 22 03:31:53 worker2 auditctl[6009]: backlog_limit 64
Jan 22 03:31:53 worker2 auditctl[6009]: lost 8
Jan 22 03:31:53 worker2 auditctl[6009]: backlog 12
Jan 22 03:31:53 worker2 auditctl[6009]: backlog_wait_time 15000
Jan 22 03:31:53 worker2 auditctl[6009]: backlog_wait_time_actual 5492
Jan 22 03:31:53 worker2 auditctl[6009]: No rules
Jan 22 03:31:53 worker2 kernel: audit: type=1300 audit(1674354712.965:465706): arch=c000003e syscall=59 success=yes exit=0 a0=5616afa6d250 a1=5616af99f550 a2=5616afcf2680 a3=5616afbc8580 items=2 ppid=1 pid=6009 auid=4294967295 uid=0 gi>
Jan 22 03:31:53 worker2 kernel: audit: type=1321 audit(1674354712.965:465706): fver=0 fp=0 fi=0 fe=0 old_pp=000001ffffffffff old_pi=0 old_pe=000001ffffffffff old_pa=0 pp=000001f3fdfeffff pi=0 pe=000001f3fdfeffff pa=0 frootid=0
Jan 22 03:31:53 worker2 kernel: audit: type=1309 audit(1674354712.965:465706): argc=3 a0="/sbin/auditctl" a1="-R" a2="/etc/audit/audit-stop.rules"
Jan 22 03:31:53 worker2 kernel: audit: type=1307 audit(1674354712.965:465706): cwd="/"
Jan 22 03:31:53 worker2 kernel: audit: type=1302 audit(1674354712.965:465706): item=0 name="/sbin/auditctl" inode=12460540 dev=00:2f mode=0100750 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frooti>
Jan 22 03:31:53 worker2 kernel: audit: type=1302 audit(1674354712.965:465706): item=1 name="/lib64/ld-linux-x86-64.so.2" inode=12246242 dev=00:2f mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver>
Jan 22 03:31:53 worker2 kernel: audit: type=1327 audit(1674354712.965:465706): proctitle=2F7362696E2F617564697463746C002D52002F6574632F61756469742F61756469742D73746F702E72756C6573
Jan 22 03:31:53 worker2 kernel: audit: type=1305 audit(1674354713.005:465707): op=set audit_enabled=0 old=1 auid=4294967295 ses=4294967295 subj=unconfined res=1
Jan 22 03:31:53 worker2 kernel: audit: type=1300 audit(1674354713.005:465707): arch=c000003e syscall=44 success=yes exit=60 a0=3 a1=7ffc561031d0 a2=3c a3=0 items=0 ppid=1 pid=6009 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=>
Jan 22 03:31:53 worker2 systemd[1]: auditd.service: Deactivated successfully.
</code></pre>
<p>There are also log messages about kdump itself shortly after the reboot was triggered:</p>
<pre><code>Jan 22 03:30:01 worker2 nrpe[2869]: Caught SIGTERM - shutting down...
Jan 22 03:30:01 worker2 systemd[1]: openqa-worker-auto-restart@18.service: Deactivated successfully.
Jan 22 03:30:01 worker2 nrpe[2869]: Daemon shutdown
Jan 22 03:30:01 worker2 systemd[1]: Stopped openQA Worker #18.
…
Jan 22 03:30:01 worker2 systemd[1]: Stopped openQA Worker #5.
Jan 22 03:30:01 worker2 systemd[1]: var-lib-openqa-share.mount: Deactivated successfully.
Jan 22 03:30:01 worker2 systemd[1]: Unmounted /var/lib/openqa/share.
Jan 22 03:30:01 worker2 systemd[1]: dracut-shutdown.service: Deactivated successfully.
Jan 22 03:30:01 worker2 systemd[1]: Stopped Restore /run/initramfs on shutdown.
Jan 22 03:30:01 worker2 systemd[1]: kdump-early.service: Control process exited, code=exited, status=1/FAILURE
Jan 22 03:30:01 worker2 systemd[1]: kdump-early.service: Failed with result 'exit-code'.
Jan 22 03:30:01 worker2 systemd[1]: Stopped Load kdump kernel early on startup.
Jan 22 03:30:01 worker2 systemd[1]: kdump.service: Deactivated successfully.
Jan 22 03:30:01 worker2 systemd[1]: Stopped Load kdump kernel and initrd.
…
</code></pre>
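<p>For completeness, a sketch of how kdump health can be checked after the next boot (standard systemd/journalctl invocations; nothing here is specific to this ticket):</p>
<pre><code>systemctl status kdump.service kdump-early.service
journalctl -u kdump-early.service -b -1   # logs from the previous boot
ls -l /var/crash                          # collected dumps, if any
</code></pre>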
<p>Otherwise there was nothing interesting going on before the shutdown. Some services (mainly openQA workers) had to be killed forcefully, though.</p>
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596086
2023-01-23T11:59:52Z
mkittler
marius.kittler@suse.com
<p>This is likely the most important information from the dump:</p>
<pre><code>[ 2802.493706] Supported: No, Unsupported modules are loaded
[ 2802.500922] CPU: 18 PID: 12402 Comm: Xvnc Kdump: loaded Tainted: G N 5.14.21-150400.24.33-default #1 SLE15-SP4 018fa6ac9e6418760457307f6579fc12335772aa
[ 2802.518932] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A19 05/15/2015
[ 2802.531253] RIP: 0010:__sbq_wake_up+0x27/0xd0
[ 2802.537491] Code: 44 00 00 41 55 41 54 55 53 8b 47 30 85 c0 74 2e 8b 47 24 4c 8b 47 28 ba 08 00 00 00 48 63 d8 48 c1 e3 06 4c 01 c3 48 8b 73 10 <48> 8d 4b 10 48 39 f1 75 1c 83 c0 01 83 e0 07 83 ea 01 75 de 45 31
[ 2802.559368] RSP: 0000:ffffa9956555f9e8 EFLAGS: 00000286
[ 2802.566496] RAX: 0000000000000004 RBX: ffff932bc79e5300 RCX: ffff932bc79e5310
[ 2802.575546] RDX: 0000000000000008 RSI: ffffa99560d278b0 RDI: ffff932bc71d01a0
[ 2802.584590] RBP: 000000000000003b R08: ffff932bc79e5200 R09: ffff9335cfe3ed00
[ 2802.593653] R10: 0000000000000001 R11: ffff932c41570c00 R12: 0000000000000004
[ 2802.602735] R13: 000000000000003b R14: ffffc9753fb0ef00 R15: ffff932bc728a198
[ 2802.611824] FS: 00007fad3cdab380(0000) GS:ffff934abfc80000(0000) knlGS:0000000000000000
[ 2802.621894] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2802.629628] CR2: 00007fad3317b640 CR3: 000000065afe2004 CR4: 00000000001726e0
[ 2802.638775] Call Trace:
[ 2802.643242] <TASK>
[ 2802.647354] sbitmap_queue_clear+0x3b/0x70
[ 2802.653457] __blk_mq_free_request+0x7f/0xb0
[ 2802.659730] bfq_insert_requests+0x1e8/0x2080
[ 2802.666099] ? btrfs_submit_data_bio+0x150/0x200 [btrfs f3815fd88a158cd7157ae031158e74430ce504fd]
[ 2802.677068] blk_mq_sched_insert_requests+0x5a/0xf0
[ 2802.683984] blk_mq_flush_plug_list+0x116/0x1d0
[ 2802.690546] blk_flush_plug_list+0xc4/0xf0
[ 2802.696686] blk_finish_plug+0x2b/0x40
[ 2802.702460] read_pages+0x159/0x260
[ 2802.707970] ? page_cache_ra_unbounded+0x133/0x220
[ 2802.714762] page_cache_ra_unbounded+0x133/0x220
[ 2802.721359] filemap_fault+0x49f/0xa30
[ 2802.727052] ? filemap_map_pages+0x14e/0x580
[ 2802.733244] __do_fault+0x31/0xc0
[ 2802.738466] __handle_mm_fault+0xd64/0x1220
[ 2802.744521] handle_mm_fault+0xd7/0x290
[ 2802.750219] do_user_addr_fault+0x1eb/0x730
[ 2802.756243] exc_page_fault+0x67/0x150
[ 2802.761800] ? asm_exc_page_fault+0x8/0x30
[ 2802.767678] asm_exc_page_fault+0x1e/0x30
[ 2802.773445] RIP: 0033:0x7fad3036cfb0
[ 2802.778761] Code: 49 8b 3e 41 83 c2 fe 44 89 d2 d1 ea 44 8d 42 01 85 d2 0f 84 d4 00 00 00 45 89 c1 41 81 e1 fe ff ff 7f 31 d2 66 0f 1f 44 00 00 <8b> 06 8b 5e 04 48 0b 04 d7 48 c1 e3 20 48 09 c3 48 89 1c d7 8b 46
[ 2802.800280] RSP: 002b:00007ffdf62ecee0 EFLAGS: 00010246
[ 2802.807220] RAX: 0000000000000000 RBX: 00007fad344b3d88 RCX: 000000000000007c
[ 2802.816049] RDX: 0000000000000000 RSI: 00007fad3317b640 RDI: 0000558771c17040
[ 2802.824847] RBP: 0000000000000008 R08: 0000000000000002 R09: 0000000000000002
[ 2802.833623] R10: 0000000000000002 R11: 0000000000000002 R12: 00007ffdf62ee908
[ 2802.842370] R13: 00007ffdf62ee950 R14: 00007fad344b3d88 R15: 0000000000000000
[ 2802.851094] </TASK>
[ 2802.854852] Kernel panic - not syncing: softlockup: hung tasks
[ 2802.862261] CPU: 18 PID: 12402 Comm: Xvnc Kdump: loaded Tainted: G L N 5.14.21-150400.24.33-default #1 SLE15-SP4 018fa6ac9e6418760457307f6579fc12335772aa
[ 2802.879729] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A19 05/15/2015
[ 2802.891734] Call Trace:
[ 2802.895733] <IRQ>
[ 2802.899267] dump_stack_lvl+0x45/0x5b
[ 2802.904442] panic+0x105/0x2dd
[ 2802.908992] watchdog_timer_fn+0x1f1/0x210
[ 2802.914578] ? softlockup_fn+0x30/0x30
[ 2802.919826] __hrtimer_run_queues+0x10d/0x2a0
[ 2802.925674] hrtimer_interrupt+0xe5/0x250
[ 2802.931173] __sysvec_apic_timer_interrupt+0x5d/0x130
[ 2802.937714] sysvec_apic_timer_interrupt+0x4b/0x90
[ 2802.943991] </IRQ>
[ 2802.947558] <TASK>
[ 2802.951104] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2802.957714] RIP: 0010:__sbq_wake_up+0x27/0xd0
[ 2802.963534] Code: 44 00 00 41 55 41 54 55 53 8b 47 30 85 c0 74 2e 8b 47 24 4c 8b 47 28 ba 08 00 00 00 48 63 d8 48 c1 e3 06 4c 01 c3 48 8b 73 10 <48> 8d 4b 10 48 39 f1 75 1c 83 c0 01 83 e0 07 83 ea 01 75 de 45 31
[ 2802.984607] RSP: 0000:ffffa9956555f9e8 EFLAGS: 00000286
[ 2802.991361] RAX: 0000000000000004 RBX: ffff932bc79e5300 RCX: ffff932bc79e5310
[ 2803.000012] RDX: 0000000000000008 RSI: ffffa99560d278b0 RDI: ffff932bc71d01a0
[ 2803.008663] RBP: 000000000000003b R08: ffff932bc79e5200 R09: ffff9335cfe3ed00
[ 2803.017305] R10: 0000000000000001 R11: ffff932c41570c00 R12: 0000000000000004
[ 2803.025947] R13: 000000000000003b R14: ffffc9753fb0ef00 R15: ffff932bc728a198
[ 2803.034598] sbitmap_queue_clear+0x3b/0x70
[ 2803.040213] __blk_mq_free_request+0x7f/0xb0
[ 2803.046002] bfq_insert_requests+0x1e8/0x2080
[ 2803.051881] ? btrfs_submit_data_bio+0x150/0x200 [btrfs f3815fd88a158cd7157ae031158e74430ce504fd]
[ 2803.062354] blk_mq_sched_insert_requests+0x5a/0xf0
[ 2803.068780] blk_mq_flush_plug_list+0x116/0x1d0
[ 2803.074856] blk_flush_plug_list+0xc4/0xf0
[ 2803.080496] blk_finish_plug+0x2b/0x40
[ 2803.085791] read_pages+0x159/0x260
[ 2803.090830] ? page_cache_ra_unbounded+0x133/0x220
[ 2803.097171] page_cache_ra_unbounded+0x133/0x220
[ 2803.103335] filemap_fault+0x49f/0xa30
[ 2803.108639] ? filemap_map_pages+0x14e/0x580
[ 2803.114434] __do_fault+0x31/0xc0
[ 2803.119251] __handle_mm_fault+0xd64/0x1220
[ 2803.124937] handle_mm_fault+0xd7/0x290
[ 2803.130261] do_user_addr_fault+0x1eb/0x730
[ 2803.135927] exc_page_fault+0x67/0x150
[ 2803.141156] ? asm_exc_page_fault+0x8/0x30
[ 2803.146734] asm_exc_page_fault+0x1e/0x30
[ 2803.152225] RIP: 0033:0x7fad3036cfb0
[ 2803.157275] Code: 49 8b 3e 41 83 c2 fe 44 89 d2 d1 ea 44 8d 42 01 85 d2 0f 84 d4 00 00 00 45 89 c1 41 81 e1 fe ff ff 7f 31 d2 66 0f 1f 44 00 00 <8b> 06 8b 5e 04 48 0b 04 d7 48 c1 e3 20 48 09 c3 48 89 1c d7 8b 46
[ 2803.178322] RSP: 002b:00007ffdf62ecee0 EFLAGS: 00010246
[ 2803.185026] RAX: 0000000000000000 RBX: 00007fad344b3d88 RCX: 000000000000007c
[ 2803.193635] RDX: 0000000000000000 RSI: 00007fad3317b640 RDI: 0000558771c17040
[ 2803.202216] RBP: 0000000000000008 R08: 0000000000000002 R09: 0000000000000002
[ 2803.210807] R10: 0000000000000002 R11: 0000000000000002 R12: 00007ffdf62ee908
[ 2803.219403] R13: 00007ffdf62ee950 R14: 00007fad344b3d88 R15: 0000000000000000
[ 2803.228001] </TASK>
</code></pre>
<p>The full dump can still be found on <code>worker2.oqa.suse.de:/var/crash-bak</code> and I also created a copy on <code>storage.qa.suse.de:/storage/crash-dumps</code>. I'll investigate a little further. However, maybe it makes sense to decide whether we want to spend more effort on this investigation (and possibly handle it as a separate ticket).</p>
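<p>For reference, a minimal sketch of how such a vmcore can be opened with the <code>crash</code> utility. The debuginfo path and dump directory name are assumptions based on SUSE conventions and the kernel version/timestamp above; the matching <code>kernel-default-debuginfo</code> package must be installed:</p>
<pre><code># Paths are illustrative, derived from the kernel version and dump time above:
crash /usr/lib/debug/boot/vmlinux-5.14.21-150400.24.33-default.debug \
      /var/crash-bak/2023-01-22-04:21/vmcore
# Useful commands at the crash prompt:
#   log   # dmesg buffer of the crashed kernel
#   bt    # backtrace of the panicking task (PID 12402, Xvnc)
#   ps    # task list at the time of the panic
</code></pre>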
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596143
2023-01-23T13:05:37Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/596143/diff?detail_id=559564">diff</a>)</li></ul><p>Since the pi worker's logrotate service is failing again, I'm temporarily pausing the alert.</p>
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596161
2023-01-23T13:16:14Z
mkittler
marius.kittler@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/596161/diff?detail_id=559576">diff</a>)</li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>I could now connect to the pi worker. The actual problem was:</p>
<pre><code>Jan 23 14:06:42 openqa-piworker systemd[1]: Starting Rotate log files...
Jan 23 14:06:42 openqa-piworker logrotate[10457]: error: destination /var/log/messages-20230123.xz already exists, skipping rotation
Jan 23 14:06:42 openqa-piworker systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 14:06:42 openqa-piworker systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 23 14:06:42 openqa-piworker systemd[1]: Failed to start Rotate log files.
</code></pre>
<p>Restarting alone didn't help, but moving the existing file out of the way did. Maybe the worker was shut down while log rotation was ongoing, leaving this mess behind. It should now be fixed.</p>
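<p>Roughly the recovery steps, as a sketch (the backup filename is illustrative):</p>
<pre><code># Move the conflicting archive out of logrotate's way:
mv /var/log/messages-20230123.xz /var/log/messages-20230123.xz.bak
# Retry the rotation:
systemctl restart logrotate.service
systemctl status logrotate.service   # should now succeed
</code></pre>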
<p>So I resumed the alert.</p>
<hr>
<p>Not sure what to do about the crash. It isn't clear to me yet why it happened. Maybe we shouldn't put too much effort into it, considering it has only happened a single time so far.</p>
openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert
https://progress.opensuse.org/issues/123493?journal_id=596374
2023-01-24T11:01:09Z
okurz
okurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>mkittler wrote:</p>
<blockquote>
<p>[…] Maybe we shouldn't put too much effort into it considering it only happened a single time so far.</p>
</blockquote>
<p>I agree. Regarding the reported crash dump, I also agree that we shouldn't spend more effort there if it's not a recurring issue. As there are currently no failed services being alerted on, we can consider this ticket resolved.</p>