action #114685
closed
powerqaworker-qam-1 seems to have just gone unresponsive due to unknown reason
0%
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?tab=alert&orgId=1&refresh=1m showed that the system has not returned any data since 2022-07-26 12:17 CEST
Rollback steps
- Re-add to salt
- Unpause alerts
Updated by mkittler about 2 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by mkittler about 2 years ago
I recovered the worker via IPMI and it seems good again. The journal doesn't show anything from before the last boot:
martchus@powerqaworker-qam-1:~> sudo journalctl --since '2 hours ago'
Journal file /var/log/journal/f0fcce0e26d256c4cf363f4b59fb2556/system@0005e4b47c57223b-6e8ac86fc6f49890.journal~ is truncated, ignoring file.
Jul 26 14:31:22 powerqaworker-qam-1 kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 196608MB)
Jul 26 14:31:22 powerqaworker-qam-1 kernel: hash-mmu: Page sizes from device-tree:
The warning likely means that the worker crashed, leaving a truncated journal file, so it is hard to tell what went wrong. Let's hope the server remains stable.
I resumed the alert to know if the machine would crash again.
I also added the machine back to salt because for now it is a one-time problem.
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
Updated by mkittler about 2 years ago
It still looks good but the worker units are masked. I'm asking on #eng-testing whether that's how it is supposed to be.
Updated by mkittler about 2 years ago
- Status changed from Feedback to In Progress
The units are not "normally" masked; their state seems to be the result of something going wrong rather than intentional masking. So I'm trying to unmask them but it isn't easily possible. So far I found one broken symlink and I'm trying to reboot now.
Updated by mkittler about 2 years ago
Some unit files from the openQA-worker package were empty files on disk. Re-installing the worker package fixed the problem and the units no longer appear masked. (Apparently an empty unit file is simply considered a masked unit without further warnings.)
However, it now looks like several other files on disk are broken as well. I'll try re-installing all packages.
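For reference, a minimal sketch of how empty unit files like the ones described above could be located. The function name `find_empty_units` and the default directory `/usr/lib/systemd/system` are assumptions for illustration; the ticket does not say which command was actually used. Symlinks (for example intentional masks pointing to `/dev/null`) are deliberately not reported.

```shell
# List zero-byte regular files directly inside a unit directory.
# Usage: find_empty_units [DIR]   (DIR defaults to an assumed vendor unit path)
find_empty_units() {
    dir="${1:-/usr/lib/systemd/system}"
    # -type f skips symlinks, so units masked on purpose via /dev/null are ignored.
    find "$dir" -maxdepth 1 -type f -size 0 | sort
}
```

Each reported path can then be passed to `rpm -qf` to see which package owns it and likely needs to be reinstalled.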
Updated by okurz about 2 years ago
That sounds scary, like a filesystem check gone wrong or inconsistent integrity. If you find multiple files "broken" or empty then it might be more reasonable to reinstall.
Updated by mkittler about 2 years ago
- Status changed from In Progress to Resolved
After rebuilding the rpm database, re-installing all packages (¹) and installing updates, all services run fine again. I've also rebooted the machine yet another time and it survived the reboot with no failing services. Since everything works again and the machine hasn't crashed again I suppose the ticket can be considered resolved.
I also checked the disks with smartctl. It reports the health is ok but apparently invoking tests is not supported by the hardware.
¹ I suppose only a very small number of packages were affected (all seemed to be from our devel repo) but this way it is ensured everything is installed as it should be. (I basically conducted a forced reinstall of all packages that were installed on the host.)
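The exact commands are not recorded in this ticket. The following is only a sketch of what a database rebuild followed by a forced reinstall of all installed packages might look like on a zypper-based host. The function name `reinstall_all_packages` is hypothetical, and the function is assumed to be run as root.

```shell
# Rebuild the RPM database, then force-reinstall every installed package.
# Assumes a zypper-based host and root privileges; not the commands from the ticket.
reinstall_all_packages() {
    rpm --rebuilddb || return 1
    # gpg-pubkey entries are imported signing keys, not installable packages.
    pkgs=$(rpm -qa --qf '%{NAME}\n' | grep -v '^gpg-pubkey$' | sort -u)
    [ -n "$pkgs" ] || return 1
    # Word splitting on $pkgs is intentional: one argument per package name.
    # shellcheck disable=SC2086
    zypper --non-interactive install --force $pkgs
}
```

Running `rpm -Va` beforehand would show which installed files differ from their packages, which could narrow the reinstall down to the affected packages instead of all of them.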