action #56588: Check failed services on our workers - openQA Infrastructure (public) - openSUSE Project Management Tool

action #56588

closed

Check failed services on our workers

Added by okurz over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category:
Target version: openQA Project (public) - Current Sprint
Start date: 2019-09-08
Due date: 2019-09-18
% Done:
Estimated time:

Description

Use `salt '*' cmd.run 'systemctl --failed'` to find out. For example, kdump failed on some of our ppc workers because the boot parameters lack proper "crashkernel" information on both qa-power8-4-kvm and qa-power8-5-kvm.
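To make the output of such a run easier to scan, the salt output could be condensed to one "minion unit" pair per line. This is only a sketch: the function name `failed_units_by_minion` is made up for illustration, and it assumes salt's default text output (minion name followed by a colon, command output indented) and that `systemctl` is called with `--no-legend --plain` so each failed unit is on its own line starting with the unit name.

```shell
# Hypothetical helper: condense salt output to "minion unit" pairs.
# Reads the salt output on stdin.
failed_units_by_minion() {
    awk '
        # An unindented line ending in ":" names the minion
        /^[^[:space:]].*:$/ { minion = substr($0, 1, length($0) - 1); next }
        # An indented, non-empty line is one failed unit; field 1 is its name
        /^[[:space:]]+[^[:space:]]/ { print minion, $1 }
    '
}

# Possible usage (needs a salt master, so it is left commented out):
# sudo salt '*' cmd.run 'systemctl --failed --no-legend --plain' | failed_units_by_minion
```

Minions without failed units produce no indented lines and therefore no output.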

Looks like the problem with telegraf on ppc64le might be the same as reported in #54128#note-7: the package is simply not provided by devel:languages:go for either aarch64 or ppc64le for Leap 15.1. It is now building in https://build.opensuse.org/project/show/home:okurz:telegraf. Maybe we want to simply add the package to devel:openQA:Leap:15.1.

Updated by okurz over 5 years ago

Target version set to Current Sprint

Updated by okurz over 5 years ago

Status changed from Feedback to Workable
Assignee deleted (~~okurz~~)

Lucky or not, a cleanly built new "telegraf" for Leap 15.1 on qa-power8-4-kvm.qa reproduces the same problem:

okurz@QA-Power8-4-kvm:~> sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
/usr/bin/telegraf: error while loading shared libraries: R_PPC64_ADDR16_HA re10dcea92c for symbol `' out of range

Looks like this is related to PIE in the compiler settings? According to https://build.opensuse.org/package/view_file/devel:languages:go/go1.10/go1.10.changes?expand=0 line 64 this should have been fixed, at least in go1.10. According to https://build.opensuse.org/package/live_build_log/home:okurz:telegraf/telegraf/openSUSE_Leap_15.1/ppc64le go1.11 is used and also "gcc-PIE". I don't know more for now.
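One way to confirm whether a given binary was actually built as PIE might be to look at the ELF type reported by `readelf -h`: `DYN` indicates a position-independent object (PIE executable or shared library), `EXEC` a non-PIE executable. The helper name `elf_type` below is hypothetical, and the sketch only assumes the usual `Type:` line in the `readelf -h` output.

```shell
# Hypothetical helper: extract the ELF type (e.g. DYN or EXEC)
# from "readelf -h" output read on stdin.
elf_type() {
    awk '$1 == "Type:" { print $2; exit }'
}

# Possible usage (needs binutils and the binary in question):
# readelf -h /usr/bin/telegraf | elf_type
```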

Updated by coolo over 5 years ago

added the arch to devel:languages:go

Updated by nicksinger over 5 years ago

We disabled PIE for ppc64(le) with https://build.opensuse.org/request/show/729378 and it seems to work now. You can check with `salt -G 'cpuarch:ppc64le' cmd.run 'systemctl --failed'`

Updated by nicksinger over 5 years ago

I disabled the smartd service on malbec and grenache-1. Disks on Power are virtualized and attached to malbec over multipath, so it doesn't make much sense to run SMART inside the LPAR.

Updated by nicksinger over 5 years ago

I removed /etc/modules-load.d/kvm.conf from grenache-1 to get rid of the failing systemd-modules-load.service. I was about to create a bugzilla ticket for this but realized this file doesn't belong to any package:

grenache-1:/etc/modules-load.d # cat kvm.conf 
kvm_hv
grenache-1:/etc/modules-load.d # rpm -qf kvm.conf 
file /etc/modules-load.d/kvm.conf is not owned by any package
grenache-1:/etc/modules-load.d # rpm -qf 
kvm.conf  sg.conf   
grenache-1:/etc/modules-load.d # rpm -qf sg.conf 
suse-module-tools-15.1.13-lp151.1.1.ppc64le

Original text for the BSC:
Kernel module "kvm_hv" should not be loaded on LPAR installations

On one of our power LPARs the service systemd-modules-load.service fails because the module kvm_hv can't be loaded. According to https://wiki.qemu.org/Documentation/Platforms/POWER this is expected since nested virtualization is not supported.

Therefore, the module kvm_hv shouldn't be included in /etc/modules-load.d if the system is installed into an LPAR.
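The manual `rpm -qf` check above could be generalized to list every file in a directory that no package owns. This is a minimal sketch; the function name `list_unowned_files` is made up for illustration, and it relies only on `rpm -qf` exiting non-zero for files not owned by any installed package.

```shell
# Hypothetical helper: print each given file that no installed
# rpm package owns. The body runs in a subshell so "f" stays local.
list_unowned_files() (
    for f in "$@"; do
        # rpm -qf exits non-zero when no package owns the file
        if ! rpm -qf "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
)

# Possible usage on a worker:
# list_unowned_files /etc/modules-load.d/*.conf
```

On grenache-1 as shown above, this would presumably have reported kvm.conf but not sg.conf.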

Updated by nicksinger over 5 years ago

Disabled service lm_sensors on powerqaworker-qam-1 since it's not running on any other worker (no clue who enabled it on qam-1). To get it running one could touch /etc/sysconfig/lm_sensors.

Updated by nicksinger over 5 years ago

Restarted worker instance 3 on QA-Power8-5-kvm. According to the logs the worker was restarted but didn't respond to SIGTERM (previous logs indicate it was still uploading). Therefore systemd killed it with SIGKILL and left the unit in the failed state. If this happens more often we might have to consider fixing this in openQA itself.

Now we only have the failing kdump.service on QA-Power8-4-kvm.qa.suse.de, QA-Power8-5-kvm.qa.suse.de and powerqaworker-qam-1 left.
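Since the ticket description attributes the kdump failures to missing "crashkernel" boot parameters, a quick check of the running kernel command line could help tell affected hosts apart. A minimal sketch, with the hypothetical helper name `crashkernel_value`:

```shell
# Hypothetical helper: read a kernel command line on stdin and print
# the value of the crashkernel= parameter, or nothing if it is absent.
crashkernel_value() {
    tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

# Possible usage on a worker:
# crashkernel_value < /proc/cmdline
```

Empty output would suggest that no memory is reserved for the crash kernel on that host.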

Updated by okurz over 5 years ago

Due date set to 2019-09-18
Status changed from Workable to Feedback
Assignee set to okurz

Thank you for looking into this.

For kdump, I had actually already fixed this on grenache-1 by adding the kernel command line parameter "crashkernel=272M". However, I realized that none of the other workers have kdump enabled, so I simply disabled it on the ppc workers as well:

 sudo salt '*' cmd.run 'systemctl disable --now kdump'
openqaworker2.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker6.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker7.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker3.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker8.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker5.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker9.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaw2.qa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaw1.qa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqaworker13.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
openqa-monitor.qa.suse.de:
    Failed to disable unit: Unit file kdump.service does not exist.
QA-Power8-5-kvm.qa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
malbec.arch.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
QA-Power8-4-kvm.qa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
powerqaworker-qam-1:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
openqaworker-arm-2.suse.de:
grenache-1.qa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
    Removed /etc/systemd/system/multi-user.target.wants/kdump-early.service.
openqaworker-arm-1.suse.de:
openqaworker-arm-3.suse.de:
ERROR: Minions returned with non-zero exit code

and

sudo salt '*' cmd.run 'systemctl reset-failed'
QA-Power8-5-kvm.qa.suse.de:
QA-Power8-4-kvm.qa.suse.de:
openqaworker2.suse.de:
malbec.arch.suse.de:
openqaworker5.suse.de:
openqaworker9.suse.de:
powerqaworker-qam-1:
openqaworker6.suse.de:
openqaworker7.suse.de:
openqaworker8.suse.de:
openqaworker3.suse.de:
grenache-1.qa.suse.de:
openqaw1.qa.suse.de:
openqaw2.qa.suse.de:
openqa-monitor.qa.suse.de:
openqaworker13.suse.de:
openqa.suse.de:
openqaworker-arm-1.suse.de:
openqaworker-arm-2.suse.de:
openqaworker-arm-3.suse.de:

With this we have no failed services left as of now. I can check again, e.g. next week, and close the ticket if no further big issues are found. A next step after that could be monitoring and alerting for any future failed systemd services.
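The "Unit file kdump.service does not exist" errors and the non-zero exit in the first run above come from hosts where kdump is not installed. A guarded variant could skip those hosts instead. This is only a sketch: the function name `disable_if_present` is hypothetical, and it assumes `systemctl cat` exits non-zero when no unit file is found.

```shell
# Hypothetical helper: disable and stop a unit only if its unit file exists.
# The body runs in a subshell so "unit" stays local.
disable_if_present() (
    unit=$1
    # "systemctl cat" fails when the unit file cannot be found
    if systemctl cat "$unit" >/dev/null 2>&1; then
        systemctl disable --now "$unit"
    else
        printf 'skipping %s: unit not installed\n' "$unit"
    fi
)

# Possible usage on a worker (as root):
# disable_if_present kdump.service
```

How to distribute such a helper to all minions via salt is left out of this sketch.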

Updated by okurz over 5 years ago

Status changed from Feedback to Resolved

No more failed services found
