action #53471
closed
machine aarch64.o.o often unresponsive and needs power-cycle
0%
Description
Observation
aarch64.o.o has been unresponsive multiple times in the past months, possibly already since the Leap 15.0 days rather than only after the upgrade to Leap 15.1.
I encountered this again on 2019-06-23 at 09:10 CEST and power-cycled the machine as a remedy:
ipmitool -I lanplus -H openqa-aarch64-ipmi.suse.de -U ADMIN -P XXX power cycle
Checking /var/log/messages, I can only find:
2019-06-22T03:30:00.073653+02:00 openqa-aarch64 worker[3005]: [info] quit due to signal TERM
2019-06-23T09:16:28.561376+02:00 openqa-aarch64 systemd[1]: systemd 234 running in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTIL
So it looks like the openQA worker is terminated at 03:30, probably by the daily reboot command, and the machine then never comes back up.
Checking further log files with
zgrep -B 1 'systemd .* running in system mode' /var/log/messages* | sed 's/systemd.*running in system mode.*$/BOOTUP/' | less
I can find:
/var/log/messages-20190203.xz:2019-01-27T03:30:00.117876+01:00 openqa-aarch64 worker[2436]: [info] quit due to signal TERM
/var/log/messages-20190203.xz:2019-01-28T10:46:49.969565+01:00 openqa-aarch64 BOOTUP
which seems to be the first occurrence where the machine shut down but did not come up automatically. I assume someone power-cycled it manually the next day.
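The manual comparison of shutdown and bootup timestamps can be automated. The following is only a minimal sketch, assuming the log lines look like the excerpts above (an ISO 8601 timestamp, then either the worker's "quit due to signal TERM" message or the systemd "running in system mode" banner or its BOOTUP replacement). The function name find_long_outages and the one hour threshold are hypothetical choices for illustration, not part of openQA.

```python
import re
from datetime import datetime, timedelta

# ISO 8601 timestamp as it appears in the syslog excerpts of this ticket.
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?[+-]\d{2}:\d{2}"
)


def find_long_outages(lines, threshold=timedelta(hours=1)):
    """Return (shutdown, bootup, gap) for boots later than `threshold`.

    `threshold` is an assumed value: a healthy nightly reboot should
    come back within minutes, so anything longer is reported.
    """
    outages = []
    last_term = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue
        stamp = datetime.fromisoformat(match.group(0))
        if "quit due to signal TERM" in line:
            # Remember the most recent worker shutdown.
            last_term = stamp
        elif "running in system mode" in line or line.rstrip().endswith("BOOTUP"):
            # A boot banner closes the currently open shutdown, if any.
            if last_term is not None and stamp - last_term > threshold:
                outages.append((last_term, stamp, stamp - last_term))
            last_term = None
    return outages


# The two zgrep result lines quoted in this ticket.
SAMPLE = [
    "/var/log/messages-20190203.xz:2019-01-27T03:30:00.117876+01:00 "
    "openqa-aarch64 worker[2436]: [info] quit due to signal TERM",
    "/var/log/messages-20190203.xz:2019-01-28T10:46:49.969565+01:00 "
    "openqa-aarch64 BOOTUP",
]

for shutdown, bootup, gap in find_long_outages(SAMPLE):
    print(f"down from {shutdown} until {bootup} ({gap})")
```

Fed with the output of the zgrep pipeline (without the trailing less), this would list every reboot after which the machine stayed down, rather than only the first one spotted by eye.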
Looking at /var/log/zypp/history, I do not find any suspicious package changes, such as a kernel update, that could explain the system not coming up:
2019-01-27 00:42:46|command|root@openqa-aarch64|'zypper' '-R' '/tmp/tmp.pFe9WfeYHw' 'up' '-y' '--auto-agree-with-product-licenses'|
2019-01-27 00:42:47|install|openQA-common|4.6.1548532175.d411287e-lp150.1115.1|noarch||devel_openQA|7a31821029e964a30afb96c283d392f664f6d836d571eca98c82296caa91f756|
2019-01-27 00:42:47|install|openQA-client|4.6.1548532175.d411287e-lp150.1115.1|noarch||devel_openQA|fe261bc53a6b3ef0deb7ea1e24848a63dcdef07691601f716903ee3e3f7da893|
# 2019-01-27 00:42:47 openQA-worker-4.6.1548532175.d411287e-lp150.1115.1.noarch.rpm installed ok
# Additional rpm output:
# Running in chroot, ignoring request.
# Running in chroot, ignoring request.
#
2019-01-27 00:42:47|install|openQA-worker|4.6.1548532175.d411287e-lp150.1115.1|noarch||devel_openQA|7b6ad19005249fbc679f6f23d8940806942660380840af39f421cb14e2956cf3|
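The check for kernel updates in the history can be sketched as a small filter. This assumes the pipe-separated layout visible in the excerpt above (timestamp, action, package name, version, ...) and that rpm script output is prefixed with "#". The helper name kernel_installs and the "kernel-" name prefix are assumptions made for illustration.

```python
def kernel_installs(history_lines):
    """Return (timestamp, package, version) for installed kernel packages.

    Assumes the pipe-separated layout seen in the excerpt:
    timestamp|action|name|version|...
    """
    found = []
    for line in history_lines:
        if line.startswith("#"):
            continue  # rpm script output, not a history record
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4 or fields[1] != "install":
            continue  # e.g. the 'command' records
        timestamp, _action, name, version = fields[:4]
        if name.startswith("kernel-"):  # assumed naming of kernel packages
            found.append((timestamp, name, version))
    return found


# Records copied from the excerpt above (checksums shortened with '...').
EXCERPT = [
    "2019-01-27 00:42:46|command|root@openqa-aarch64|'zypper' 'up' '-y'|",
    "2019-01-27 00:42:47|install|openQA-common|4.6.1548532175.d411287e-lp150.1115.1|noarch||devel_openQA|...|",
    "# Running in chroot, ignoring request.",
    "2019-01-27 00:42:47|install|openQA-worker|4.6.1548532175.d411287e-lp150.1115.1|noarch||devel_openQA|...|",
]

print(kernel_installs(EXCERPT))
```

For the records quoted here the result is an empty list, which matches the observation that no kernel update happened right before the first failed reboot.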
Problem
I suspect a problem during bootup causing, at best, a kernel crash, which we could catch by installing kdump …