action #56588
Check failed services on our workers
Status: closed
Description
Use `salt '*' cmd.run 'systemctl --failed'` to find out. For example, kdump failed on some of our ppc workers because the boot parameters lack proper "crashkernel" information on both qa-power8-4-kvm and qa-power8-5-kvm.
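For a quick check whether the crashkernel parameter is set at all, something along these lines should work (a sketch using the same salt pattern as above; the grain match is an assumption):
salt -G 'cpuarch:ppc64le' cmd.run 'grep -o "crashkernel=[^ ]*" /proc/cmdline || echo "no crashkernel parameter set"'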
Looks like the problem with telegraf on ppc64le might be the same as reported in #54128#note-7: the package is simply not provided by devel:languages:go for either aarch64 or ppc64le on Leap 15.1. It is now building in https://build.opensuse.org/project/show/home:okurz:telegraf; maybe we want to simply add the package to devel:openQA:Leap:15.1.
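If we decide to add it, copying the package over from the test project could look roughly like this (a sketch; project and package names taken from above):
osc copypac home:okurz:telegraf telegraf devel:openQA:Leap:15.1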
Updated by okurz over 5 years ago
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
Lucky or not, using a cleanly built new "telegraf" for Leap 15.1 on qa-power8-4-kvm.qa reproduces the same problem:
okurz@QA-Power8-4-kvm:~> sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
/usr/bin/telegraf: error while loading shared libraries: R_PPC64_ADDR16_HA re10dcea92c for symbol `' out of range
Looks like this is related to PIE in the compiler settings? According to https://build.opensuse.org/package/view_file/devel:languages:go/go1.10/go1.10.changes?expand=0 line 64 this should have been fixed, at least in go1.10. According to https://build.opensuse.org/package/live_build_log/home:okurz:telegraf/telegraf/openSUSE_Leap_15.1/ppc64le go1.11 is used and also "gcc-PIE". No further ideas for now.
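To verify whether a given binary was actually built as PIE, one can look at the ELF header; a rough check (the DYN vs. EXEC interpretation is the usual heuristic, nothing telegraf-specific):
readelf -h /usr/bin/telegraf | grep 'Type:'   # "DYN" for a PIE executable, "EXEC" for a non-PIE one
file /usr/bin/telegraf                        # newer versions of file may also label it a "pie executable"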
Updated by nicksinger over 5 years ago
We disabled PIE for ppc64(le) with https://build.opensuse.org/request/show/729378 and it seems to work now. You can check with `salt -G 'cpuarch:ppc64le' cmd.run 'systemctl --failed'`.
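For reference, the usual way to express this in an openSUSE spec file is a BuildIgnore directive; the following is only a sketch of the pattern, the actual change is in the linked request:
# avoid the gcc-PIE profile on ppc64/ppc64le, the Go-built binary fails to load otherwise
%ifarch ppc64 ppc64le
#!BuildIgnore: gcc-PIE
%endif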
Updated by nicksinger over 5 years ago
I disabled the smartd service on malbec and grenache-1. Disks on power are virtualized and attached to malbec over multipath, so it doesn't make much sense to run SMART inside the LPAR anyway.
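For the record, this can be done from the salt master roughly like this (a sketch; the minion IDs are taken from the output further down in this ticket):
salt -L 'malbec.arch.suse.de,grenache-1.qa.suse.de' cmd.run 'systemctl disable --now smartd'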
Updated by nicksinger over 5 years ago
I removed /etc/modules-load.d/kvm.conf from grenache-1 to get rid of the failing systemd-modules-load.service. I was about to create a bugzilla ticket for this but realized this file doesn't belong to any package:
grenache-1:/etc/modules-load.d # cat kvm.conf
kvm_hv
grenache-1:/etc/modules-load.d # rpm -qf kvm.conf
file /etc/modules-load.d/kvm.conf is not owned by any package
grenache-1:/etc/modules-load.d # rpm -qf
kvm.conf sg.conf
grenache-1:/etc/modules-load.d # rpm -qf sg.conf
suse-module-tools-15.1.13-lp151.1.1.ppc64le
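The cleanup itself was roughly the following (a reconstructed sketch, not a verbatim transcript):
rm /etc/modules-load.d/kvm.conf
systemctl restart systemd-modules-load.service
systemctl --failed    # the unit should no longer show up here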
Original text for the BSC:
Kernel module "kvm_hv" should not be loaded on LPAR installations
On one of our power LPARs the service systemd-modules-load.service fails because the module kvm_hv can't be loaded. According to https://wiki.qemu.org/Documentation/Platforms/POWER this is expected since nested virtualization is not supported.
Therefore, the module kvm_hv shouldn't be included in /etc/modules-load.d if the system is installed into an LPAR.
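A possible heuristic for such a fix to detect an LPAR is the platform field in /proc/cpuinfo (a sketch; the exact values are an assumption based on typical ppc64 systems):
grep -i '^platform' /proc/cpuinfo    # "pSeries" suggests an LPAR, "PowerNV" bare metal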
Updated by nicksinger over 5 years ago
Disabled the lm_sensors service on powerqaworker-qam-1 since it's not running on any other worker (no clue who enabled it on qam-1). To get it running again, one could touch /etc/sysconfig/lm_sensors.
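In salt terms this amounts to roughly (a sketch; the minion ID is taken from the output further down):
salt 'powerqaworker-qam-1' cmd.run 'systemctl disable --now lm_sensors'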
Updated by nicksinger over 5 years ago
Restarted worker instance 3 on QA-Power8-5-kvm. According to the logs the worker was restarted but did not respond to SIGTERM (previous logs indicate it was still uploading), so systemd killed it with SIGKILL and left it in the "failed" state. If this happens more often we might have to consider fixing this in openQA itself.
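For checking and clearing such a worker instance manually, something like the following should do (assuming the standard openqa-worker@<n> unit naming):
systemctl status openqa-worker@3
journalctl -u openqa-worker@3 -e     # look for the SIGTERM/SIGKILL sequence
systemctl restart openqa-worker@3    # restarting also clears the "failed" state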
Now only the failing kdump.service on QA-Power8-4-kvm.qa.suse.de, QA-Power8-5-kvm.qa.suse.de and powerqaworker-qam-1 is left.
Updated by okurz over 5 years ago
- Due date set to 2019-09-18
- Status changed from Workable to Feedback
- Assignee set to okurz
Thank you for looking into this.
For kdump, I actually fixed that already on grenache-1 by adding the kernel command line parameter "crashkernel=272M" (roughly as sketched after the output below). However, I realized that all the other workers do not have kdump enabled, so I simply disabled it on the ppc workers as well:
sudo salt '*' cmd.run 'systemctl disable --now kdump'
openqaworker2.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaworker6.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaworker7.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaworker3.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaworker8.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaworker5.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaworker9.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaw2.qa.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaw1.qa.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqaworker13.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqa.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
openqa-monitor.qa.suse.de:
Failed to disable unit: Unit file kdump.service does not exist.
QA-Power8-5-kvm.qa.suse.de:
Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
malbec.arch.suse.de:
Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
QA-Power8-4-kvm.qa.suse.de:
Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
powerqaworker-qam-1:
Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
openqaworker-arm-2.suse.de:
grenache-1.qa.suse.de:
Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
openqaworker-arm-1.suse.de:
openqaworker-arm-3.suse.de:
ERROR: Minions returned with non-zero exit code
and
sudo salt '*' cmd.run 'systemctl reset-failed'
QA-Power8-5-kvm.qa.suse.de:
QA-Power8-4-kvm.qa.suse.de:
openqaworker2.suse.de:
malbec.arch.suse.de:
openqaworker5.suse.de:
openqaworker9.suse.de:
powerqaworker-qam-1:
openqaworker6.suse.de:
openqaworker7.suse.de:
openqaworker8.suse.de:
openqaworker3.suse.de:
grenache-1.qa.suse.de:
openqaw1.qa.suse.de:
openqaw2.qa.suse.de:
openqa-monitor.qa.suse.de:
openqaworker13.suse.de:
openqa.suse.de:
openqaworker-arm-1.suse.de:
openqaworker-arm-2.suse.de:
openqaworker-arm-3.suse.de:
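For reference, the crashkernel parameter on grenache-1 was added roughly like this (a sketch assuming the standard grub2 setup on Leap, not a verbatim copy of the change):
# append crashkernel=272M to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg
# the parameter takes effect after the next reboot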
With this we have no more failed services as of now. I can check again, e.g. next week, and close the ticket if no further big issues are found. A next step after that could be monitoring + alerting for any future failed systemd services.
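Until proper monitoring exists, a quick ad-hoc check from the salt master could look like this (a sketch):
salt '*' cmd.run 'systemctl --failed --no-legend | wc -l'    # should report 0 everywhere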
Updated by okurz over 5 years ago
- Status changed from Feedback to Resolved
No more failed services found