action #56588

Check failed services on our workers

Added by okurz 6 months ago. Updated 5 months ago.

Status: Resolved
Start date: 08/09/2019
Priority: Normal
Due date: 18/09/2019
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration: 8

Description

Use `salt '*' cmd.run 'systemctl --failed'` to find out. For example, kdump failed on some of our ppc workers because the boot parameters lack proper "crashkernel" information on both qa-power8-4-kvm and qa-power8-5-kvm.
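To make the salt output easier to scan, it can be reduced to one "minion unit" pair per failed service. The following is only a sketch: the helper name `failed_units_per_minion` is hypothetical, and it assumes salt's default text output (minion id followed by a colon, command output indented) and that the remote command is run with `--no-legend --plain`, so every indented line starts with a unit name.

```shell
#!/bin/sh
# Hypothetical helper: turn salt's default text output into "minion unit" pairs.
# Assumption: the remote command was 'systemctl --failed --no-legend --plain',
# so each indented, non-empty line begins with the name of a failed unit.
failed_units_per_minion() {
    awk '
        # A non-indented line ending in ":" names the minion whose output follows.
        /^[^ \t].*:$/ { minion = substr($0, 1, length($0) - 1); next }
        # An indented, non-empty line is one failed unit of the current minion.
        /^[ \t]+[^ \t]/ { print minion, $1 }
    '
}

# Usage on the salt master (not executed here):
# salt '*' cmd.run 'systemctl --failed --no-legend --plain' | failed_units_per_minion
```

Minions without failed services produce no indented lines and therefore do not show up in the result.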

Looks like the problem of telegraf on ppc64le might be the same as reported in #54128#note-7: the package is simply not provided by devel:languages:go for either aarch64 or ppc64le on Leap 15.1. It is now building in https://build.opensuse.org/project/show/home:okurz:telegraf , maybe we want to simply add the package to devel:openQA:Leap:15.1

History

#1 Updated by okurz 6 months ago

  • Target version set to Current Sprint

#2 Updated by okurz 6 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

Lucky or not, using a cleanly built new "telegraf" for Leap 15.1 on qa-power8-4-kvm.qa reproduces the same problem:

okurz@QA-Power8-4-kvm:~> sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
/usr/bin/telegraf: error while loading shared libraries: R_PPC64_ADDR16_HA re10dcea92c for symbol `' out of range

Looks like this is related to PIE in the compiler settings? According to https://build.opensuse.org/package/view_file/devel:languages:go/go1.10/go1.10.changes?expand=0 line 64 this should have been fixed, at least in go1.10. According to https://build.opensuse.org/package/live_build_log/home:okurz:telegraf/telegraf/openSUSE_Leap_15.1/ppc64le go1.11 is used and also "gcc-PIE". Dunno further for now.

#3 Updated by coolo 6 months ago

added the arch to devel:languages:go

#4 Updated by nicksinger 6 months ago

We disabled PIE for ppc64(le) with https://build.opensuse.org/request/show/729378 and it seems to work now. You can check with salt -G 'cpuarch:ppc64le' cmd.run 'systemctl --failed'

#5 Updated by nicksinger 6 months ago

I disabled the smartd service on malbec and grenache-1. Disks on power are virtualized and attached to malbec over multipath, so it doesn't make much sense to run SMART inside the LPAR anyway.

#6 Updated by nicksinger 6 months ago

I removed /etc/modules-load.d/kvm.conf from grenache-1 to get rid of the failing systemd-modules-load.service. I was about to create a bugzilla ticket for this but realized this file doesn't belong to any package:

grenache-1:/etc/modules-load.d # cat kvm.conf 
kvm_hv
grenache-1:/etc/modules-load.d # rpm -qf kvm.conf 
file /etc/modules-load.d/kvm.conf is not owned by any package
grenache-1:/etc/modules-load.d # rpm -qf 
kvm.conf  sg.conf   
grenache-1:/etc/modules-load.d # rpm -qf sg.conf 
suse-module-tools-15.1.13-lp151.1.1.ppc64le

Original text for the BSC:
Kernel module "kvm_hv" should not be loaded on LPAR installations

On one of our power LPARs the service systemd-modules-load.service fails because the module kvm_hv can't be loaded. According to https://wiki.qemu.org/Documentation/Platforms/POWER this is expected since nested virtualization is not supported.

Therefore, the module kvm_hv shouldn't be included in /etc/modules-load.d if the system is installed into an LPAR.
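The ownership check done by hand above with rpm -qf can be repeated over a whole directory. This is a minimal sketch, not something that was run for this ticket: the function name `list_unowned_files` and the `OWNER_QUERY` override are hypothetical, and the only thing relied on is that `rpm -qf FILE` exits non-zero when no package owns the file.

```shell
#!/bin/sh
# Hypothetical helper: print every *.conf file in a directory that no package owns.
# OWNER_QUERY is an illustrative override so the loop can be tried without rpm;
# by default 'rpm -qf' is used, which exits non-zero for unowned files.
list_unowned_files() {
    dir=$1
    for f in "$dir"/*.conf; do
        # Skip the literal pattern when the directory has no matching files.
        [ -e "$f" ] || continue
        ${OWNER_QUERY:-rpm -qf} "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}

# Usage on a worker (not executed here):
# list_unowned_files /etc/modules-load.d
```

On grenache-1 this would be expected to report kvm.conf but not sg.conf, matching the transcript above.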

#7 Updated by nicksinger 6 months ago

Disabled service lm_sensors on powerqaworker-qam-1 since it's not running on any other worker (no clue who enabled it on qam-1). To get it running one could touch /etc/sysconfig/lm_sensors.

#8 Updated by nicksinger 6 months ago

Restarted worker-instance 3 on QA-Power8-5-kvm. According to the logs the worker got restarted but didn't respond to SIGTERM (previous logs indicate it was still uploading). Therefore systemd shot it with a SIGKILL and left it in the state failed. If this happens more often we might have to consider fixing this in openQA itself.

Now we only have the failing kdump.service on QA-Power8-4-kvm.qa.suse.de, QA-Power8-5-kvm.qa.suse.de and powerqaworker-qam-1 left.

#9 Updated by okurz 6 months ago

  • Due date set to 18/09/2019
  • Status changed from Workable to Feedback
  • Assignee set to okurz

Thank you for looking into this.

For kdump, I had actually already fixed that on grenache-1 by adding the kernel command line parameter "crashkernel=272M". However, I realized that none of the other workers have kdump enabled, so I simply disabled it on the ppc workers as well:

 sudo salt '*' cmd.run 'systemctl disable --now kdump'
openqaworker2.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker6.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker7.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker3.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker8.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker5.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker9.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaw2.qa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaw1.qa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker13.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqa-monitor.qa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
QA-Power8-5-kvm.qa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
malbec.arch.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
QA-Power8-4-kvm.qa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
powerqaworker-qam-1:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
openqaworker-arm-2.suse.de:
grenache-1.qa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
openqaworker-arm-1.suse.de:
openqaworker-arm-3.suse.de:
ERROR: Minions returned with non-zero exit code

and

sudo salt '*' cmd.run 'systemctl reset-failed'
QA-Power8-5-kvm.qa.suse.de:
QA-Power8-4-kvm.qa.suse.de:
openqaworker2.suse.de:
malbec.arch.suse.de:
openqaworker5.suse.de:
openqaworker9.suse.de:
powerqaworker-qam-1:
openqaworker6.suse.de:
openqaworker7.suse.de:
openqaworker8.suse.de:
openqaworker3.suse.de:
grenache-1.qa.suse.de:
openqaw1.qa.suse.de:
openqaw2.qa.suse.de:
openqa-monitor.qa.suse.de:
openqaworker13.suse.de:
openqa.suse.de:
openqaworker-arm-1.suse.de:
openqaworker-arm-2.suse.de:
openqaworker-arm-3.suse.de:

With this we have no more failed services as of now. I can check again, e.g. next week, and close the ticket if no further big issues are found. The next step after that could be monitoring and alerting for any future failed systemd services.
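As a starting point for such a check, the list of failed units can be turned into a status line and an exit code that a monitoring system could consume. This is only a sketch under assumptions: the function name `check_failed_units` is hypothetical, the exit codes follow the common Nagios-style convention (0 OK, 2 CRITICAL), and nothing here reflects how alerting was eventually set up.

```shell
#!/bin/sh
# Hypothetical check: reads the output of
#   systemctl --failed --no-legend --plain
# on stdin and reports whether any unit is in the failed state.
# Exit codes follow a Nagios-style convention (assumption): 0 = OK, 2 = CRITICAL.
check_failed_units() {
    # The first column of every non-empty line is the unit name.
    units=$(awk 'NF { print $1 }' | tr '\n' ' ')
    if [ -n "$units" ]; then
        printf 'CRITICAL: failed units: %s\n' "${units% }"
        return 2
    fi
    printf 'OK: no failed units\n'
    return 0
}

# Usage on a worker (not executed here):
# systemctl --failed --no-legend --plain | check_failed_units
```

Reading from stdin keeps the parsing separate from the systemctl call, so the same function can be fed from a local run or from output collected remotely.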

#10 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

No more failed services found
