action #114685: powerqaworker-qam-1 seems to have just gone unresponsive due to unknown reason - openQA Infrastructure (public) - openSUSE Project Management Tool

action #114685

closed

powerqaworker-qam-1 seems to have just gone unresponsive due to unknown reason

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category:
Target version: openQA Project (public) - Ready
Start date: 2022-07-26
Due date:
% Done:
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?tab=alert&orgId=1&refresh=1m showed that the system had not returned any data since 2022-07-26 12:17 CEST.

Rollback steps

Re-add to salt
Unpause alerts

Updated by mkittler almost 3 years ago

Status changed from New to In Progress
Assignee set to mkittler

Updated by mkittler almost 3 years ago

I recovered the worker via IPMI and it seems good again. The journal doesn't show anything from before the last boot:

martchus@powerqaworker-qam-1:~> sudo journalctl --since '2 hours ago' 
Journal file /var/log/journal/f0fcce0e26d256c4cf363f4b59fb2556/system@0005e4b47c57223b-6e8ac86fc6f49890.journal~ is truncated, ignoring file.

Jul 26 14:31:22 powerqaworker-qam-1 kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 196608MB)
Jul 26 14:31:22 powerqaworker-qam-1 kernel: hash-mmu: Page sizes from device-tree:

The warning likely means that the worker has crashed leaving a truncated journal file. So it is hard to tell what was going wrong. Let's hope the server will remain stable.
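As a small illustration of the warning discussed above, the following sketch extracts the paths of journal files that journalctl reports as truncated. The function name `truncated_journal_files` is made up for this example, and it assumes the warning has exactly the wording shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read journalctl output on stdin and print the paths of
# journal files that were reported as truncated. Assumes the warning wording
# matches the line quoted in this ticket.
truncated_journal_files() {
    sed -n 's/^Journal file \(.*\) is truncated, ignoring file\.$/\1/p'
}

# Possible use on the worker (stderr is merged in case the warning goes there):
#   sudo journalctl --since '2 hours ago' 2>&1 | truncated_journal_files
```

Running `journalctl --list-boots` next to it shows which boots the remaining journal files still cover.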

I resumed the alert to know if the machine would crash again.

I also added the machine back to salt because for now it is a one-time problem.

Updated by mkittler almost 3 years ago

Status changed from In Progress to Feedback

Updated by mkittler almost 3 years ago

It still looks good, but the worker units are masked. I'm asking on #eng-testing whether that's how it is supposed to be.

Updated by mkittler almost 3 years ago

Status changed from Feedback to In Progress

The units are not "normally" masked; the masking seems to be the result of something going wrong rather than intentional. So I'm trying to unmask them, but that isn't easily possible. So far I found one broken symlink and I'm trying a reboot now.

Updated by mkittler almost 3 years ago

Some unit files from the openQA-worker package were empty files on disk. Re-installing the worker package fixed the problem and the units no longer appear masked. (Apparently an empty unit file is simply considered a masked unit without further warnings.)

However, it now looks like several other files on disk are broken as well. I'll try re-installing all packages.
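A minimal sketch of how such empty unit files could be spotted, assuming the packaged units live in /usr/lib/systemd/system (the path is an assumption, and the function name `find_empty_units` is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: list zero-byte unit files directly inside a directory.
# The default directory is an assumption about where packaged units are installed.
find_empty_units() {
    dir="${1:-/usr/lib/systemd/system}"
    # -type f skips symlinks, so units masked on purpose (links to /dev/null)
    # are not reported; -size 0 matches only files without any content.
    find "$dir" -maxdepth 1 -type f -size 0 \
        \( -name '*.service' -o -name '*.target' -o -name '*.timer' \
           -o -name '*.path' -o -name '*.slice' \) | sort
}

# Possible follow-up: map the hits to their owning packages, e.g.
#   find_empty_units | xargs -r rpm -qf | sort -u
```

Restricting the search to regular files keeps deliberate masks out of the result, so whatever is left points at damaged files rather than at configuration.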

Updated by okurz almost 3 years ago

That sounds scary, like a filesystem check gone wrong or a loss of filesystem integrity. If you find multiple files "broken" or empty, then it might be more reasonable to reinstall.

Updated by mkittler almost 3 years ago

Status changed from In Progress to Resolved

After rebuilding the rpm database, re-installing all packages (¹) and installing updates, all services run fine again. I've also rebooted the machine once more and it came back up with no failing services. Since everything works again and the machine hasn't crashed again, I suppose the ticket can be considered resolved.

I also checked the disks with smartctl. It reports that the health is OK, but apparently running self-tests is not supported by the hardware.

¹ I suppose only a very small number of packages were affected (all seemingly from our devel repo), but this way it is ensured that everything is installed as it should be. (I basically conducted a forced reinstall of all packages that were installed on the host.)
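The forced reinstall described above could look roughly like the sketch below. It is not the exact set of commands used on the worker; the helper name `build_reinstall_cmd` is made up for this example, and it only prints the zypper command instead of running it so the package list can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: read package names on stdin (one per line) and print,
# without executing it, a zypper command that force-reinstalls all of them.
build_reinstall_cmd() {
    # De-duplicate the names and join them with single spaces.
    pkgs=$(sort -u | tr '\n' ' ')
    printf 'zypper --non-interactive install --force %s\n' "${pkgs% }"
}

# Possible sequence on the affected host (an assumption, not taken from the ticket):
#   sudo rpm --rebuilddb                             # rebuild the rpm database
#   rpm -qa --qf '%{NAME}\n' | build_reinstall_cmd   # review, then run via sudo
#   sudo rpm -Va                                     # verify installed files afterwards
```

The output of `rpm -qa` also contains entries such as gpg-pubkey that are not installable packages, so the list would probably need filtering before the printed command is run.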
