action #137615
closed
[alert] Failed systemd services alert - s390zl12,s390zl13 - kdump-early, kdump, smartd
0%
Description
Observation
Failed systemd services alert (except openqa.suse.de)
View alert
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/Uk02cifVkz/view?orgId=1
Values
B0=6
Labels
alertname Failed systemd services alert (except openqa.suse.de)
grafana_folder Salt
rule_uid Uk02cifVkz
Annotations
message Check failed systemd services on hosts with systemctl --failed. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.
View dashboard
http://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz?orgId=1
View panel
http://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz?orgId=1&viewPanel=6
Observed 9m3s before this notification was delivered, at 2023-10-09 11:37:00 +0200 CEST
2023-10-09 11:51:20 s390zl13 kdump-early, kdump, smartd 3
2023-10-09 11:51:20 s390zl12 kdump-early, kdump, smartd 3
Updated by okurz about 1 year ago
- Tags set to infra, alert, reactive work
- Assignee set to okurz
- Target version set to Ready
Updated by okurz about 1 year ago
kdump-early and kdump were fixed by a restart after the "crashkernel" option was enabled. For "smartd" I just ran systemctl mask --now smartd on both s390zl12 and s390zl13. mgriessmeier provided credentials for https://zhmc2.suse.de/, over which we can configure/start/stop/debug LPARs as required. I put the credentials into https://gitlab.suse.de/openqa/password/.
Updated by openqa_review about 1 year ago
- Due date set to 2023-10-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 year ago
- Priority changed from Urgent to High
There have been no related alerts in the past days. I am now planning to prevent this situation in the future when new workers are set up.
Updated by okurz about 1 year ago
Regarding SMART: we don't enable SMART anywhere in our salt states. I assume it's actually a product issue that an installation automatically enables a SMART service in an environment where it is not applicable. I have now just removed smartmontools on s390zl12 and s390zl13. I don't plan to do more in this direction right now.
This is the output of the failing service that I have now removed:
# journalctl -u smartd
Oct 11 10:22:25 s390zl13 systemd[1]: Starting Self Monitoring and Reporting Technology (SMART) Daemon...
Oct 11 10:22:25 s390zl13 smartd[129471]: smartd 7.2 2021-09-14 r5237 [s390x-linux-5.14.21-150500.55.28-default] (SUSE RPM)
Oct 11 10:22:25 s390zl13 smartd[129471]: Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
Oct 11 10:22:25 s390zl13 smartd[129471]: Opened configuration file /etc/smartd.conf
Oct 11 10:22:25 s390zl13 smartd[129471]: Drive: DEVICESCAN, implied '-a' Directive on line 32 of file /etc/smartd.conf
Oct 11 10:22:25 s390zl13 smartd[129471]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sda, opened
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sda, [IBM 2145 0000], lu id: 0x600507638081855cd80000000000004c, S/N: 00e020615736XX00, 4>
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sda, IE (SMART) not enabled, skip device
Oct 11 10:22:25 s390zl13 smartd[129471]: Try 'smartctl -s on /dev/sda' to turn on SMART features
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sdb, opened
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sdb, [IBM 2145 0000], lu id: 0x600507638081855cd80000000000004c, S/N: 00e020615736XX00, 4>
Oct 11 10:22:25 s390zl13 smartd[129471]: Device: /dev/sdb, IE (SMART) not enabled, skip device
Oct 11 10:22:25 s390zl13 smartd[129471]: Try 'smartctl -s on /dev/sdb' to turn on SMART features
Oct 11 10:22:25 s390zl13 smartd[129471]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...
Oct 11 10:22:25 s390zl13 systemd[1]: smartd.service: Main process exited, code=exited, status=17/n/a
Oct 11 10:22:25 s390zl13 systemd[1]: smartd.service: Failed with result 'exit-code'.
Oct 11 10:22:25 s390zl13 systemd[1]: Failed to start Self Monitoring and Reporting Technology (SMART) Daemon.
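The failure mode in the log above could be detected automatically before deciding to mask or remove the service. The following is only a rough sketch: the function name smartd_has_no_devices is hypothetical, and the only thing it relies on is the "Unable to monitor any SMART enabled devices" message that smartd printed in the journal above.

```shell
#!/bin/sh
# Sketch: read journal text on stdin and succeed if smartd gave up because
# no device offers SMART. smartd_has_no_devices is a hypothetical helper
# name; the matched message is copied from the journal output in this ticket.
smartd_has_no_devices() {
    grep -q 'Unable to monitor any SMART enabled devices'
}
```

Possible usage on an affected host, assuming that masking is the desired reaction: journalctl -u smartd | smartd_has_no_devices && systemctl mask --now smartd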
Regarding kdump, the problem is that we set the crashkernel parameter, which only takes effect after a reboot, so kdump cannot work when it is started directly. I researched this and found https://stackoverflow.com/questions/23660645/how-to-reboot-in-the-middle-of-a-salt-state , which may help.
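To tell whether a host still needs that reboot, one could check whether the running kernel was booted with the crashkernel parameter. This is a minimal sketch, not something taken from our salt states: has_crashkernel is a hypothetical helper name, and it only looks for a crashkernel= entry on the kernel command line, not at whether the reservation actually succeeded.

```shell
#!/bin/sh
# Sketch: succeed if a kernel command line contains a crashkernel= entry.
# has_crashkernel is a hypothetical helper name, not part of kdump or salt.
has_crashkernel() {
    # Use the given string, or fall back to the running kernel's command line.
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Pad with spaces so the parameter matches only as a whole word.
    case " $cmdline " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Possible usage: has_crashkernel || echo "crashkernel not active yet, reboot before starting kdump"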
Updated by okurz about 1 year ago
- Due date deleted (2023-10-24)
- Status changed from In Progress to Resolved
Our salt states correctly specify that the kdump service should only be enabled, not started. I don't know what started the kdump service before the reboot. I crosschecked how the states behave on a clean system: I downloaded a micro clean Tumbleweed VM from http://get.opensuse.org/, booted it locally, cloned and applied the kdump state from our salt states, and confirmed that, as specified, kdump.service is enabled but not started. I don't know how to reproduce the problem, so I would leave our code as is. If the problem can be reproduced, then maybe we want to explicitly call for a reboot with:
system.reboot:
  module.run:
    - onchanges:
      - file: /etc/default/grub