Project

General

Profile

action #114685

powerqaworker-qam-1 seems to have just gone unresponsive due to unknown reason

Added by okurz 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2022-07-26
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?tab=alert&orgId=1&refresh=1m showed the system to not have returned data since 2022-07-26 12:17CEST

Rollback steps

  • Re-add to salt
  • Unpause alerts

History

#1 Updated by mkittler 2 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

#2 Updated by mkittler 2 months ago

I recovered the worker via IPMI and it seems good again. The journal doesn't go show anything from before the last boot:

martchus@powerqaworker-qam-1:~> sudo journalctl --since '2 hours ago' 
Journal file /var/log/journal/f0fcce0e26d256c4cf363f4b59fb2556/system@0005e4b47c57223b-6e8ac86fc6f49890.journal~ is truncated, ignoring file.

Jul 26 14:31:22 powerqaworker-qam-1 kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 196608MB)
Jul 26 14:31:22 powerqaworker-qam-1 kernel: hash-mmu: Page sizes from device-tree:

The warning likely means that the worker has crashed leaving a truncated journal file. So it is hard to tell what was going wrong. Let's hope the server will remain stable.

I resumed the alert to know if the machine would crash again.

I also added the machine back to salt because for now it is a one-time problem.

#3 Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback

#4 Updated by mkittler 2 months ago

It looks still good but worker units are masked. I'm askinng on #eng-testing whether that's how it is supposed to be.

#5 Updated by mkittler 2 months ago

  • Status changed from Feedback to In Progress

The units are not "normally" mask and seem to be rather a result of something going wrong than intentional masking. So I'm trying to unmask them but it isn't easily possible. So far I found one broken symlink and I'm trying to reboot now.

#6 Updated by mkittler 2 months ago

Some unit files from the openQA-worker package where empty files on disk. Re-installing the worker package fixed the problem and units no longer appear masked. (Apparently an empty unit file is simply considered a masked unit without further warnings.)

However, now it looks that several files on disk are broken as well. I'll try re-installing all packages.

#7 Updated by okurz 2 months ago

that sounds scary, like a filesystem check gone wrong or inconsistent integrity. If you find multiple files "broken" or empty then it might be more reasonable to reinstall.

#8 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Resolved

After rebuilding the rpm database, re-installing all packages (¹) and installing updates all services run fine again. I've also rebooted the machine yet another time and it survived the reboot with no failing services. Since everything works again and the machine hasn't crashed again I suppose the ticket can be considered resolved.

I also checked the disks with smartctl. It reports the heath is ok but apparently invoking tests is not supported by the hardware.


¹ I suppose only a very small amount of packages were affected (all seemed from our devel repo) but this way it is ensured everything is installed as it should be. (I basically conducted a forced reinstall of all packages that were installed on the host.)

Also available in: Atom PDF