action #116740

closed

[alert] openqaworker14: host up alert

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-09-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/WDopenqaworker14/worker-dashboard-openqaworker14?orgId=1&from=1663513460231&to=1663548197462&viewPanel=65105 shows that openqaworker14 has been reported as down since 2022-09-18 21:35.
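
The `from` and `to` query parameters in the dashboard link are Grafana time-range bounds given in milliseconds since the Unix epoch. As a rough sketch (the helper name `grafana_time_range` is made up for illustration and is not part of any existing tooling), they can be decoded into UTC timestamps to see which window the panel covers:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

DASHBOARD_URL = (
    "https://monitor.qa.suse.de/d/WDopenqaworker14/worker-dashboard-openqaworker14"
    "?orgId=1&from=1663513460231&to=1663548197462&viewPanel=65105"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def grafana_time_range(url):
    """Return (start, end) of a Grafana link as timezone-aware UTC datetimes.

    Assumes absolute `from`/`to` values in epoch milliseconds, as in the
    link above; relative values such as `now-6h` are not handled.
    """
    query = parse_qs(urlparse(url).query)

    def to_utc(key):
        # Integer millisecond arithmetic avoids float rounding.
        return EPOCH + timedelta(milliseconds=int(query[key][0]))

    return to_utc("from"), to_utc("to")


if __name__ == "__main__":
    start, end = grafana_time_range(DASHBOARD_URL)
    print(start.isoformat(), end.isoformat())
```

For the link above this gives a window from 2022-09-18 15:04:20 UTC to 2022-09-19 00:43:17 UTC.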

ipmi-openqaworker14-ipmi sol activate reveals:

Give root password for maintenance
(or press Control-D to continue): 

So the machine is stuck during boot, waiting at the maintenance prompt.
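
A check for this condition could be automated by scanning captured serial console output for the prompt quoted above. This is a minimal sketch that assumes the console output is already available as text; the function name and the way the text is captured are hypothetical.

```python
import re

# The prompt exactly as quoted from the serial console above. Whitespace
# between the two lines is matched loosely, because serial-over-LAN output
# may contain "\r\n" line endings.
MAINTENANCE_PROMPT = re.compile(
    r"Give root password for maintenance\s*"
    r"\(or press Control-D to continue\):"
)


def is_stuck_at_maintenance_prompt(console_output: str) -> bool:
    """Return True if the console text contains the maintenance prompt."""
    return MAINTENANCE_PROMPT.search(console_output) is not None


if __name__ == "__main__":
    sample = (
        "Give root password for maintenance\r\n"
        "(or press Control-D to continue): "
    )
    print(is_stuck_at_maintenance_prompt(sample))
```

The check is deliberately a pure function over text, so it does not depend on how the console session is opened.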

Rollback steps

  • Unpause alert "host up"

Related issues: 2 (0 open, 2 closed)

  • Related to openQA Infrastructure - action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M (Resolved, mkittler, 2022-09-18)
  • Copied to openQA Infrastructure - action #116743: [alert] QA-Power8-5-kvm: host up alert (Resolved, nicksinger, 2022-09-19, 2022-10-04)

Actions #1

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #2

Updated by okurz over 1 year ago

  • Copied to action #116743: [alert] QA-Power8-5-kvm: host up alert added
Actions #3

Updated by okurz over 1 year ago

  • Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added
Actions #4

Updated by nicksinger over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #5

Updated by nicksinger over 1 year ago

I wasn't able to log in to the rescue shell because none of the passwords I know worked. Pressing Ctrl+D resulted in some OOM messages from systemd-udev (which is strange). Because I couldn't do anything in the recovery console, I just rebooted the machine and it came up again perfectly fine. I have now changed the root password to the old default password (which seems to be used on other workers too).
Rebooting 3 more times to see whether a stable boot can be demonstrated.
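
The repeated-reboot check could be scripted roughly as follows. This is only a sketch under stated assumptions: `reboot` is a caller-supplied, hypothetical callable (for example something that triggers a reboot over SSH or IPMI), "up" is approximated by a single answered ping using the Linux iputils `ping` flags `-c` and `-W`, and all timeouts are arbitrary placeholders rather than values from this ticket.

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if one ICMP echo request is answered (Linux `ping`)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until(condition: Callable[[], bool], timeout_s: float,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it is true or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def verify_stable_boot(reboot: Callable[[], None], is_up: Callable[[], bool],
                       rounds: int = 3, down_timeout_s: float = 300,
                       up_timeout_s: float = 900, interval_s: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Reboot `rounds` times; True only if the host came back every time."""
    for _ in range(rounds):
        reboot()
        # First wait for the host to actually go down, so that a reboot
        # which never started is not mistaken for a successful one.
        if not wait_until(lambda: not is_up(), down_timeout_s, interval_s, sleep):
            return False
        if not wait_until(is_up, up_timeout_s, interval_s, sleep):
            return False
    return True
```

A caller would pass its own reboot function and something like `lambda: ping_once("openqaworker14")` as `is_up`. Injecting both callables keeps the loop testable without touching a real machine.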

Actions #6

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Resolved

Machine successfully rebooted 3 times in a row. "host up" alert is enabled again.
