Project

General

Profile

Actions

action #123493

closed

[Alerting] Failed systemd services alert

Added by jbaier_cz almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-01-16
Due date:
% Done:

0%

Estimated time:

Description

According to the monitoring we have 6 failed services (since the reboot yesterday?), although the currently failed view is showing only

2023-01-23 10:18:00 worker2         check-for-kernel-crash  
2023-01-23 00:00:00 openqa-piworker logrotate

Rollback steps

  • Resume the alert

Related issues 1 (0 open1 closed)

Copied from openQA Infrastructure - action #123151: [Alerting] Failed systemd services alertResolvedmkittler2023-01-16

Actions
Actions #1

Updated by jbaier_cz almost 2 years ago

  • Copied from action #123151: [Alerting] Failed systemd services alert added
Actions #2

Updated by mkittler almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler almost 2 years ago

openqa-piworker.qa.suse.de has just been moved from the TAM lab to SRV2. It seems to struggle rebooting. At least that would explain the failing service (likely timeout) and that I currently cannot access it via SSH. Maybe physical access is required to recover the machine. I could have a look later this week but maybe it is better if @dheidler takes care of this.

Actions #4

Updated by mkittler almost 2 years ago

On worker2 a kernel crash dump was collected. I have moved it from /var/crash to /var/crash-bak so the alert can go of after restarting check-for-kernel-crash.service. The good thing is that our approach for monitoring crashes seems to work in practice. I'll investigate the dump. It is just a single one and was created on 2023-01-22-04:21.

Actions #5

Updated by mkittler almost 2 years ago

Judging by the journal, the worker was shutting down around the time the crash dump was collected. There is a problem with the kernel auditing logged shortly before the reboot:

Jan 22 03:31:32 worker2 auditd[1516]: Error receiving audit netlink packet (No buffer space available)
Jan 22 03:31:32 worker2 kernel: br0: port 1(eth0) entered disabled state
Jan 22 03:31:32 worker2 kernel: device eth0 left promiscuous mode
Jan 22 03:31:32 worker2 kernel: audit: audit_backlog=65 > audit_backlog_limit=64
Jan 22 03:31:32 worker2 kernel: audit: audit_lost=8 audit_rate_limit=0 audit_backlog_limit=64
Jan 22 03:31:32 worker2 kernel: audit: backlog limit exceeded

In the end (very shortly before the reboot) the audit daemon could at least be deactivated successfully:

Jan 22 03:31:52 worker2 auditd[1516]: The audit daemon is exiting.
Jan 22 03:31:52 worker2 kernel: audit: type=1305 audit(1674354712.861:465705): op=set audit_pid=0 old=1516 auid=4294967295 ses=4294967295 subj=unconfined res=1
Jan 22 03:31:53 worker2 auditctl[6009]: enabled 0
Jan 22 03:31:53 worker2 auditctl[6009]: failure 1
Jan 22 03:31:53 worker2 auditctl[6009]: pid 0
Jan 22 03:31:53 worker2 auditctl[6009]: rate_limit 0
Jan 22 03:31:53 worker2 auditctl[6009]: backlog_limit 64
Jan 22 03:31:53 worker2 auditctl[6009]: lost 8
Jan 22 03:31:53 worker2 auditctl[6009]: backlog 12
Jan 22 03:31:53 worker2 auditctl[6009]: backlog_wait_time 15000
Jan 22 03:31:53 worker2 auditctl[6009]: backlog_wait_time_actual 5492
Jan 22 03:31:53 worker2 auditctl[6009]: No rules
Jan 22 03:31:53 worker2 kernel: audit: type=1300 audit(1674354712.965:465706): arch=c000003e syscall=59 success=yes exit=0 a0=5616afa6d250 a1=5616af99f550 a2=5616afcf2680 a3=5616afbc8580 items=2 ppid=1 pid=6009 auid=4294967295 uid=0 gi>
Jan 22 03:31:53 worker2 kernel: audit: type=1321 audit(1674354712.965:465706): fver=0 fp=0 fi=0 fe=0 old_pp=000001ffffffffff old_pi=0 old_pe=000001ffffffffff old_pa=0 pp=000001f3fdfeffff pi=0 pe=000001f3fdfeffff pa=0 frootid=0
Jan 22 03:31:53 worker2 kernel: audit: type=1309 audit(1674354712.965:465706): argc=3 a0="/sbin/auditctl" a1="-R" a2="/etc/audit/audit-stop.rules"
Jan 22 03:31:53 worker2 kernel: audit: type=1307 audit(1674354712.965:465706): cwd="/"
Jan 22 03:31:53 worker2 kernel: audit: type=1302 audit(1674354712.965:465706): item=0 name="/sbin/auditctl" inode=12460540 dev=00:2f mode=0100750 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frooti>
Jan 22 03:31:53 worker2 kernel: audit: type=1302 audit(1674354712.965:465706): item=1 name="/lib64/ld-linux-x86-64.so.2" inode=12246242 dev=00:2f mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver>
Jan 22 03:31:53 worker2 kernel: audit: type=1327 audit(1674354712.965:465706): proctitle=2F7362696E2F617564697463746C002D52002F6574632F61756469742F61756469742D73746F702E72756C6573
Jan 22 03:31:53 worker2 kernel: audit: type=1305 audit(1674354713.005:465707): op=set audit_enabled=0 old=1 auid=4294967295 ses=4294967295 subj=unconfined res=1
Jan 22 03:31:53 worker2 kernel: audit: type=1300 audit(1674354713.005:465707): arch=c000003e syscall=44 success=yes exit=60 a0=3 a1=7ffc561031d0 a2=3c a3=0 items=0 ppid=1 pid=6009 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=>
Jan 22 03:31:53 worker2 systemd[1]: auditd.service: Deactivated successfully.

There are also log messages about kdump itself shortly after the reboot was triggered:

Jan 22 03:30:01 worker2 nrpe[2869]: Caught SIGTERM - shutting down...
Jan 22 03:30:01 worker2 systemd[1]: openqa-worker-auto-restart@18.service: Deactivated successfully.
Jan 22 03:30:01 worker2 nrpe[2869]: Daemon shutdown
Jan 22 03:30:01 worker2 systemd[1]: Stopped openQA Worker #18.
…
Jan 22 03:30:01 worker2 systemd[1]: Stopped openQA Worker #5.
Jan 22 03:30:01 worker2 systemd[1]: var-lib-openqa-share.mount: Deactivated successfully.
Jan 22 03:30:01 worker2 systemd[1]: Unmounted /var/lib/openqa/share.
Jan 22 03:30:01 worker2 systemd[1]: dracut-shutdown.service: Deactivated successfully.
Jan 22 03:30:01 worker2 systemd[1]: Stopped Restore /run/initramfs on shutdown.
Jan 22 03:30:01 worker2 systemd[1]: kdump-early.service: Control process exited, code=exited, status=1/FAILURE
Jan 22 03:30:01 worker2 systemd[1]: kdump-early.service: Failed with result 'exit-code'.
Jan 22 03:30:01 worker2 systemd[1]: Stopped Load kdump kernel early on startup.
Jan 22 03:30:01 worker2 systemd[1]: kdump.service: Deactivated successfully.
Jan 22 03:30:01 worker2 systemd[1]: Stopped Load kdump kernel and initrd.
…

Otherwise there was nothing interesting going on before the shutdown. Some services (mainly openQA workers) had to be killed forcefully, though.

Actions #6

Updated by mkittler almost 2 years ago

This is likely the most important information from the dump:

[ 2802.493706] Supported: No, Unsupported modules are loaded
[ 2802.500922] CPU: 18 PID: 12402 Comm: Xvnc Kdump: loaded Tainted: G                  N 5.14.21-150400.24.33-default #1 SLE15-SP4 018fa6ac9e6418760457307f6579fc12335772aa
[ 2802.518932] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A19 05/15/2015
[ 2802.531253] RIP: 0010:__sbq_wake_up+0x27/0xd0
[ 2802.537491] Code: 44 00 00 41 55 41 54 55 53 8b 47 30 85 c0 74 2e 8b 47 24 4c 8b 47 28 ba 08 00 00 00 48 63 d8 48 c1 e3 06 4c 01 c3 48 8b 73 10 <48> 8d 4b 10 48 39 f1 75 1c 83 c0 01 83 e0 07 83 ea 01 75 de 45 31
[ 2802.559368] RSP: 0000:ffffa9956555f9e8 EFLAGS: 00000286
[ 2802.566496] RAX: 0000000000000004 RBX: ffff932bc79e5300 RCX: ffff932bc79e5310
[ 2802.575546] RDX: 0000000000000008 RSI: ffffa99560d278b0 RDI: ffff932bc71d01a0
[ 2802.584590] RBP: 000000000000003b R08: ffff932bc79e5200 R09: ffff9335cfe3ed00
[ 2802.593653] R10: 0000000000000001 R11: ffff932c41570c00 R12: 0000000000000004
[ 2802.602735] R13: 000000000000003b R14: ffffc9753fb0ef00 R15: ffff932bc728a198
[ 2802.611824] FS:  00007fad3cdab380(0000) GS:ffff934abfc80000(0000) knlGS:0000000000000000
[ 2802.621894] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2802.629628] CR2: 00007fad3317b640 CR3: 000000065afe2004 CR4: 00000000001726e0
[ 2802.638775] Call Trace:
[ 2802.643242]  <TASK>
[ 2802.647354]  sbitmap_queue_clear+0x3b/0x70
[ 2802.653457]  __blk_mq_free_request+0x7f/0xb0
[ 2802.659730]  bfq_insert_requests+0x1e8/0x2080
[ 2802.666099]  ? btrfs_submit_data_bio+0x150/0x200 [btrfs f3815fd88a158cd7157ae031158e74430ce504fd]
[ 2802.677068]  blk_mq_sched_insert_requests+0x5a/0xf0
[ 2802.683984]  blk_mq_flush_plug_list+0x116/0x1d0
[ 2802.690546]  blk_flush_plug_list+0xc4/0xf0
[ 2802.696686]  blk_finish_plug+0x2b/0x40
[ 2802.702460]  read_pages+0x159/0x260
[ 2802.707970]  ? page_cache_ra_unbounded+0x133/0x220
[ 2802.714762]  page_cache_ra_unbounded+0x133/0x220
[ 2802.721359]  filemap_fault+0x49f/0xa30
[ 2802.727052]  ? filemap_map_pages+0x14e/0x580
[ 2802.733244]  __do_fault+0x31/0xc0
[ 2802.738466]  __handle_mm_fault+0xd64/0x1220
[ 2802.744521]  handle_mm_fault+0xd7/0x290
[ 2802.750219]  do_user_addr_fault+0x1eb/0x730
[ 2802.756243]  exc_page_fault+0x67/0x150
[ 2802.761800]  ? asm_exc_page_fault+0x8/0x30
[ 2802.767678]  asm_exc_page_fault+0x1e/0x30
[ 2802.773445] RIP: 0033:0x7fad3036cfb0
[ 2802.778761] Code: 49 8b 3e 41 83 c2 fe 44 89 d2 d1 ea 44 8d 42 01 85 d2 0f 84 d4 00 00 00 45 89 c1 41 81 e1 fe ff ff 7f 31 d2 66 0f 1f 44 00 00 <8b> 06 8b 5e 04 48 0b 04 d7 48 c1 e3 20 48 09 c3 48 89 1c d7 8b 46
[ 2802.800280] RSP: 002b:00007ffdf62ecee0 EFLAGS: 00010246
[ 2802.807220] RAX: 0000000000000000 RBX: 00007fad344b3d88 RCX: 000000000000007c
[ 2802.816049] RDX: 0000000000000000 RSI: 00007fad3317b640 RDI: 0000558771c17040
[ 2802.824847] RBP: 0000000000000008 R08: 0000000000000002 R09: 0000000000000002
[ 2802.833623] R10: 0000000000000002 R11: 0000000000000002 R12: 00007ffdf62ee908
[ 2802.842370] R13: 00007ffdf62ee950 R14: 00007fad344b3d88 R15: 0000000000000000
[ 2802.851094]  </TASK>
[ 2802.854852] Kernel panic - not syncing: softlockup: hung tasks
[ 2802.862261] CPU: 18 PID: 12402 Comm: Xvnc Kdump: loaded Tainted: G             L    N 5.14.21-150400.24.33-default #1 SLE15-SP4 018fa6ac9e6418760457307f6579fc12335772aa
[ 2802.879729] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A19 05/15/2015
[ 2802.891734] Call Trace:
[ 2802.895733]  <IRQ>
[ 2802.899267]  dump_stack_lvl+0x45/0x5b
[ 2802.904442]  panic+0x105/0x2dd
[ 2802.908992]  watchdog_timer_fn+0x1f1/0x210
[ 2802.914578]  ? softlockup_fn+0x30/0x30
[ 2802.919826]  __hrtimer_run_queues+0x10d/0x2a0
[ 2802.925674]  hrtimer_interrupt+0xe5/0x250
[ 2802.931173]  __sysvec_apic_timer_interrupt+0x5d/0x130
[ 2802.937714]  sysvec_apic_timer_interrupt+0x4b/0x90
[ 2802.943991]  </IRQ>
[ 2802.947558]  <TASK>
[ 2802.951104]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2802.957714] RIP: 0010:__sbq_wake_up+0x27/0xd0
[ 2802.963534] Code: 44 00 00 41 55 41 54 55 53 8b 47 30 85 c0 74 2e 8b 47 24 4c 8b 47 28 ba 08 00 00 00 48 63 d8 48 c1 e3 06 4c 01 c3 48 8b 73 10 <48> 8d 4b 10 48 39 f1 75 1c 83 c0 01 83 e0 07 83 ea 01 75 de 45 31
[ 2802.984607] RSP: 0000:ffffa9956555f9e8 EFLAGS: 00000286
[ 2802.991361] RAX: 0000000000000004 RBX: ffff932bc79e5300 RCX: ffff932bc79e5310
[ 2803.000012] RDX: 0000000000000008 RSI: ffffa99560d278b0 RDI: ffff932bc71d01a0
[ 2803.008663] RBP: 000000000000003b R08: ffff932bc79e5200 R09: ffff9335cfe3ed00
[ 2803.017305] R10: 0000000000000001 R11: ffff932c41570c00 R12: 0000000000000004
[ 2803.025947] R13: 000000000000003b R14: ffffc9753fb0ef00 R15: ffff932bc728a198
[ 2803.034598]  sbitmap_queue_clear+0x3b/0x70
[ 2803.040213]  __blk_mq_free_request+0x7f/0xb0
[ 2803.046002]  bfq_insert_requests+0x1e8/0x2080
[ 2803.051881]  ? btrfs_submit_data_bio+0x150/0x200 [btrfs f3815fd88a158cd7157ae031158e74430ce504fd]
[ 2803.062354]  blk_mq_sched_insert_requests+0x5a/0xf0
[ 2803.068780]  blk_mq_flush_plug_list+0x116/0x1d0
[ 2803.074856]  blk_flush_plug_list+0xc4/0xf0
[ 2803.080496]  blk_finish_plug+0x2b/0x40
[ 2803.085791]  read_pages+0x159/0x260
[ 2803.090830]  ? page_cache_ra_unbounded+0x133/0x220
[ 2803.097171]  page_cache_ra_unbounded+0x133/0x220
[ 2803.103335]  filemap_fault+0x49f/0xa30
[ 2803.108639]  ? filemap_map_pages+0x14e/0x580
[ 2803.114434]  __do_fault+0x31/0xc0
[ 2803.119251]  __handle_mm_fault+0xd64/0x1220
[ 2803.124937]  handle_mm_fault+0xd7/0x290
[ 2803.130261]  do_user_addr_fault+0x1eb/0x730
[ 2803.135927]  exc_page_fault+0x67/0x150
[ 2803.141156]  ? asm_exc_page_fault+0x8/0x30
[ 2803.146734]  asm_exc_page_fault+0x1e/0x30
[ 2803.152225] RIP: 0033:0x7fad3036cfb0
[ 2803.157275] Code: 49 8b 3e 41 83 c2 fe 44 89 d2 d1 ea 44 8d 42 01 85 d2 0f 84 d4 00 00 00 45 89 c1 41 81 e1 fe ff ff 7f 31 d2 66 0f 1f 44 00 00 <8b> 06 8b 5e 04 48 0b 04 d7 48 c1 e3 20 48 09 c3 48 89 1c d7 8b 46
[ 2803.178322] RSP: 002b:00007ffdf62ecee0 EFLAGS: 00010246
[ 2803.185026] RAX: 0000000000000000 RBX: 00007fad344b3d88 RCX: 000000000000007c
[ 2803.193635] RDX: 0000000000000000 RSI: 00007fad3317b640 RDI: 0000558771c17040
[ 2803.202216] RBP: 0000000000000008 R08: 0000000000000002 R09: 0000000000000002
[ 2803.210807] R10: 0000000000000002 R11: 0000000000000002 R12: 00007ffdf62ee908
[ 2803.219403] R13: 00007ffdf62ee950 R14: 00007fad344b3d88 R15: 0000000000000000
[ 2803.228001]  </TASK>

Of course you can find the full dump still on worker2.oqa.suse.de:/var/crash-bak and I also created a copy on storage.qa.suse.de:/storage/crash-dumps. I'll investigate a little bit further. However, maybe it makes sense to decide whether we want to spent more effort in this investigation (and possibly then handle it as separate ticket).

Actions #7

Updated by mkittler almost 2 years ago

  • Description updated (diff)

Since the pi worker's logrotate service is failing again I'm temporarily pausing the alert.

Actions #8

Updated by mkittler almost 2 years ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback

I could now connect to the pi worker. The problem was actually:

Jan 23 14:06:42 openqa-piworker systemd[1]: Starting Rotate log files...
Jan 23 14:06:42 openqa-piworker logrotate[10457]: error: destination /var/log/messages-20230123.xz already exists, skipping rotation
Jan 23 14:06:42 openqa-piworker systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 14:06:42 openqa-piworker systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 23 14:06:42 openqa-piworker systemd[1]: Failed to start Rotate log files.

Restarting alone didn't help but moving away the existing file. Maybe the worker was shut down while the log rotation was ongoing leaving this mess. It should now be fixed.

So I resumed the alert again.


Note sure what to do about the crash. It isn't clear to me yet why it happened. Maybe we shouldn't put too much effort into it considering it only happened a single time so far.

Actions #9

Updated by okurz almost 2 years ago

  • Status changed from Feedback to Resolved

mkittler wrote:

[…] Maybe we shouldn't put too much effort into it considering it only happened a single time so far.

I agree. Regarding the reported crash dump I also agree that we shouldn't spend more effort there if it's not a recurring issue. As there are currently no failed services alerted about we can consider this ticket resolved.

Actions

Also available in: Atom PDF