action #137615

closed

[alert] Failed systemd services alert - s390zl12,s390zl13 - kdump-early, kdump, smartd

Added by jbaier_cz 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-10-09
Due date:
% Done: 0%
Estimated time:

Description

Observation

Failed systemd services alert (except openqa.suse.de)

View alert
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/Uk02cifVkz/view?orgId=1

Values

B0=6

Labels
alertname Failed systemd services alert (except openqa.suse.de)
grafana_folder Salt
rule_uid Uk02cifVkz

Annotations
message Check failed systemd services on hosts with systemctl --failed. Hint: Go to parent dashboard
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.

Silence
http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DFailed+systemd+services+alert+%28except+openqa.suse.de%29&matcher=grafana_folder%3DSalt&matcher=rule_uid%3DUk02cifVkz&orgId=1

View dashboard http://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz?orgId=1

View panel
http://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz?orgId=1&viewPanel=6

Observed 9m3s before this notification was delivered, at 2023-10-09 11:37:00 +0200 CEST

2023-10-09 11:51:20 s390zl13    kdump-early, kdump, smartd  3
2023-10-09 11:51:20 s390zl12    kdump-early, kdump, smartd  3
Actions #1

Updated by okurz 8 months ago

  • Tags set to infra, alert, reactive work
  • Assignee set to okurz
  • Target version set to Ready
Actions #3

Updated by okurz 8 months ago

kdump-early and kdump were fixed by a restart after the "crashkernel" option was enabled. For "smartd" I just ran systemctl mask --now smartd on both s390zl12 and s390zl13. mgriessmeier provided credentials for https://zhmc2.suse.de/, through which we can configure, start, stop and debug LPARs as required. I put the credentials into https://gitlab.suse.de/openqa/password/.

Actions #4

Updated by okurz 8 months ago

  • Status changed from New to In Progress
Actions #5

Updated by openqa_review 8 months ago

  • Due date set to 2023-10-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 8 months ago

  • Priority changed from Urgent to High

There have been no related alerts in the past days. I am now planning how to prevent this situation in the future when new workers are set up.

Actions #7

Updated by okurz 8 months ago

Regarding smartd: we don't enable it anywhere in our salt states. I assume it's actually a product issue that an installation automatically enables a SMART service in an environment where it is not applicable. I have now just removed smartmontools on s390zl12 and s390zl13. I don't plan to do more in this direction right now.

This is the output of the failing service that I have now removed:

# journalctl -u smartd
Oct 11 10:22:25 s390zl13 systemd[1]: Starting Self Monitoring and Reporting Technology (SMART) Daemon...
Oct 11 10:22:25 s390zl13 smartd[129471]: smartd 7.2 2021-09-14 r5237 [s390x-linux-5.14.21-150500.55.28-default] (SUSE RPM)
Oct 11 10:22:25 s390zl13 smartd[129471]: Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
Oct 11 10:22:25 s390zl13 smartd[129471]: Opened configuration file /etc/smartd.conf
Oct 11 10:22:25 s390zl13 smartd[129471]: Drive: DEVICESCAN, implied '-a' Directive on line 32 of file /etc/smartd.conf
Oct 11 10:22:25 s390zl13 smartd[129471]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sda, opened
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sda, [IBM      2145             0000], lu id: 0x600507638081855cd80000000000004c, S/N: 00e020615736XX00, 4>
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sda, IE (SMART) not enabled, skip device
Oct 11 10:22:25 s390zl13 smartd[129471]: Try 'smartctl -s on /dev/sda' to turn on SMART features
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sdb, opened
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sdb, [IBM      2145             0000], lu id: 0x600507638081855cd80000000000004c, S/N: 00e020615736XX00, 4>
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sdb, IE (SMART) not enabled, skip device
Oct 11 10:22:25 s390zl13 smartd[129471]: Try 'smartctl -s on /dev/sdb' to turn on SMART features
Oct 11 10:22:25 s390zl13 smartd[129471]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...
Oct 11 10:22:25 s390zl13 systemd[1]: smartd.service: Main process exited, code=exited, status=17/n/a
Oct 11 10:22:25 s390zl13 systemd[1]: smartd.service: Failed with result 'exit-code'.
Oct 11 10:22:25 s390zl13 systemd[1]: Failed to start Self Monitoring and Reporting Technology (SMART) Daemon.

Regarding kdump: the problem is that we set the crashkernel parameter, which needs a reboot to take effect, so kdump only works after a reboot and must not be started directly. I researched this and found https://stackoverflow.com/questions/23660645/how-to-reboot-in-the-middle-of-a-salt-state, which may help.
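As a rough sketch of the reboot dependency (not something from our salt states), a small shell check could tell whether the running kernel was actually booted with a crashkernel= option before anything tries to start kdump. The function names has_crashkernel and kdump_action are hypothetical, and the sketch assumes the boot parameters are visible in /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not part of our salt states):
# succeeds only if the given kernel command line file contains a
# crashkernel= option. Defaults to the running kernel's /proc/cmdline.
has_crashkernel() {
    cmdline_file="${1:-/proc/cmdline}"
    # Match crashkernel= at the start of the line or after whitespace
    grep -Eq '(^|[[:space:]])crashkernel=' "$cmdline_file"
}

# Hypothetical wrapper: prints "start" if kdump could be started now,
# or "reboot-required" if the crashkernel reservation is not active yet.
kdump_action() {
    if has_crashkernel "${1:-/proc/cmdline}"; then
        echo "start"
    else
        echo "reboot-required"
    fi
}
```

A possible use would be [ "$(kdump_action)" = start ] && systemctl restart kdump, so that kdump is only started once the reboot has activated the crashkernel parameter.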

Actions #8

Updated by okurz 8 months ago

  • Due date deleted (2023-10-24)
  • Status changed from In Progress to Resolved

Our salt states correctly specify that the kdump service should only be enabled, not started. I don't know what started the kdump service before the reboot. I crosschecked how the states behave on a clean system: I downloaded a clean micro Tumbleweed VM from http://get.opensuse.org/, booted it locally, cloned our salt states and applied the kdump state, and confirmed that, as specified, kdump.service is enabled but not started. I don't know how to reproduce the problem, so I would leave our code as is. If the problem can be reproduced, then maybe we want to explicitly call for a reboot with

system.reboot:
  module.run:
    - onchanges:
      - file: /etc/default/grub