powerqaworker-qam-1 seems to have just become unresponsive for an unknown reason.
https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?tab=alert&orgId=1&refresh=1m showed that the system has not returned data since 2022-07-26 12:17 CEST.
- Re-add to salt
- Unpause alerts
I recovered the worker via IPMI and it seems good again. The journal doesn't show anything from before the last boot:
martchus@powerqaworker-qam-1:~> sudo journalctl --since '2 hours ago'
Journal file /firstname.lastname@example.org~ is truncated, ignoring file.
Jul 26 14:31:22 powerqaworker-qam-1 kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 196608MB)
Jul 26 14:31:22 powerqaworker-qam-1 kernel: hash-mmu: Page sizes from device-tree:
The warning likely means that the worker crashed, leaving a truncated journal file. So it is hard to tell what went wrong. Let's hope the server will remain stable.
I resumed the alert to know if the machine would crash again.
I also added the machine back to salt because for now it is a one-time problem.
- Status changed from Feedback to In Progress
The units are not "normally" masked; the masking seems to be the result of something going wrong rather than intentional. So I'm trying to unmask them, but it isn't easily possible. So far I found one broken symlink and I'm trying to reboot now.
Some unit files from the openQA-worker package were empty files on disk. Re-installing the worker package fixed the problem and the units no longer appear masked. (Apparently an empty unit file is simply considered a masked unit without further warnings.)
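For reference, a minimal sketch of how such zero-length unit files could be spotted. The directory `/usr/lib/systemd/system` and the `openqa-worker*` name pattern are assumptions about where the package installs its units, not something verified on this host, and `find_empty_units` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list zero-length regular files matching a unit name pattern.
# Usage: find_empty_units [DIR] [PATTERN]
find_empty_units() {
    dir="${1:-/usr/lib/systemd/system}"   # assumed unit directory
    pattern="${2:-openqa-worker*}"        # assumed unit name pattern
    # -size 0 matches files with no content; -type f skips symlinks,
    # so intentional masks (symlinks to /dev/null) are not reported.
    find "$dir" -maxdepth 1 -type f -name "$pattern" -size 0 | sort
}
```

Any path printed this way can be mapped back to its owning package with `rpm -qf <path>` before re-installing.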
However, it now looks like several other files on disk are broken as well. I'll try re-installing all packages.
#8 Updated by mkittler about 2 months ago
- Status changed from In Progress to Resolved
After rebuilding the rpm database, re-installing all packages (¹) and installing updates, all services run fine again. I've also rebooted the machine yet another time and it survived the reboot with no failing services. Since everything works again and the machine hasn't crashed again, I suppose the ticket can be considered resolved.
I also checked the disks with smartctl. It reports that the health is OK, but apparently invoking self-tests is not supported by the hardware.
¹ I suppose only a very small number of packages were affected (all seemed to be from our devel repo), but this way it is ensured that everything is installed as it should be. (I basically conducted a forced reinstall of all packages that were installed on the host.)
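A rough sketch of what such a rebuild and forced reinstall might look like on a zypper-based host. These are not necessarily the exact commands that were run here: the use of zypper, the `gpg-pubkey` filter and the helper names `run` and `reinstall_all_packages` are assumptions for illustration. By default the script only prints the commands; set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
# Print each command; only execute it when DRY_RUN=0 (the default is a dry run).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical helper, not the exact procedure used on the host.
reinstall_all_packages() {
    # Rebuild the rpm database first so that package queries are reliable.
    run sudo rpm --rebuilddb || return 1
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Names of all installed packages; gpg-pubkey entries are imported
        # signing keys, not packages that can be reinstalled.
        pkgs=$(rpm -qa --qf '%{NAME}\n' | grep -v '^gpg-pubkey$' | sort -u)
    else
        pkgs='<installed package names>'   # placeholder shown in dry-run output
    fi
    # Word splitting of $pkgs is intended: one argument per package name.
    run sudo zypper --non-interactive install --force $pkgs || return 1
    # Finally install pending updates.
    run sudo zypper --non-interactive update
}
```

Calling `reinstall_all_packages` without setting `DRY_RUN` prints the three commands for review before anything is changed on the machine.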