action #116740

closed

[alert] openqaworker14: host up alert

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2022-09-19
Due date:
% Done: 0%
Estimated time:
Description

Observation

https://monitor.qa.suse.de/d/WDopenqaworker14/worker-dashboard-openqaworker14?orgId=1&from=1663513460231&to=1663548197462&viewPanel=65105 shows that openqaworker14 has been reported as down since 2022-09-18 21:35.

ipmi-openqaworker14-ipmi sol activate reveals:

Give root password for maintenance
(or press Control-D to continue): 

So the machine is stuck during boot, waiting at the maintenance prompt.
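
A minimal sketch of how this state could be spotted automatically in captured serial-console text. The function name and the idea of checking only the last two lines are assumptions for illustration; the prompt string is taken from the output quoted above.

```python
# Hypothetical helper: not part of any existing monitoring setup.
MAINTENANCE_PROMPT = "Give root password for maintenance"


def stuck_at_maintenance_prompt(console_output: str) -> bool:
    """Return True if the captured console text ends at the maintenance prompt."""
    # Only inspect the tail: a prompt that was already dismissed earlier
    # in the log should not count as "stuck".
    tail = console_output.strip().splitlines()[-2:]
    return any(MAINTENANCE_PROMPT in line for line in tail)
```

Such a check would only be as good as the console capture feeding it; how the serial-over-LAN output is collected is left open here.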

Rollback steps

  • Unpause alert "host up"

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M (Resolved, mkittler, 2022-09-18)

Copied to openQA Infrastructure (public) - action #116743: [alert] QA-Power8-5-kvm: host up alert (Resolved, nicksinger, 2022-09-19, 2022-10-04)

Actions #1

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz about 2 years ago

  • Copied to action #116743: [alert] QA-Power8-5-kvm: host up alert added
Actions #3

Updated by okurz about 2 years ago

  • Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added
Actions #4

Updated by nicksinger about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #5

Updated by nicksinger about 2 years ago

I wasn't able to log in to the rescue shell because none of the passwords I know worked. Pressing Ctrl+D resulted in some OOM messages from systemd-udev (which is strange). Because I couldn't do anything in the recovery console, I just rebooted the machine and it came up again perfectly fine. I have now changed the root password to the old default password (which seems to be used on other workers too).
I am rebooting 3 more times to see whether a stable boot can be confirmed.
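
The repeated-reboot check described above could be scripted roughly as follows. This is a sketch under assumptions, not what was actually run: `boots_reliably`, `reboot` and `is_up` are hypothetical names, and the caller has to supply the two callables (for example an SSH reboot command and a ping or SSH probe).

```python
import time
from typing import Callable


def boots_reliably(reboot: Callable[[], None],
                   is_up: Callable[[], bool],
                   rounds: int = 3,
                   timeout_s: float = 600.0,
                   poll_s: float = 10.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Reboot `rounds` times; return False as soon as one boot times out."""
    for _ in range(rounds):
        reboot()
        deadline = clock() + timeout_s
        # Caveat: right after reboot() the host may still answer for a moment,
        # so is_up() should only report True for the *new* boot
        # (for example by comparing a boot ID). That part is left to the caller.
        while not is_up():
            if clock() >= deadline:
                return False
            sleep(poll_s)
    return True
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; in normal use the defaults would be kept.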

Actions #6

Updated by nicksinger about 2 years ago

  • Status changed from In Progress to Resolved

The machine rebooted successfully 3 times in a row. The "host up" alert is enabled again.
