action #58346

o3 openqaworker1 and openqaworker4 are completely down on 2019-10-18

Added by okurz 6 months ago. Updated 5 months ago.

Status: Resolved
Start date: 18/10/2019
Priority: Low
Due date:
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration:

Related issues

Duplicated by openQA Infrastructure - action #58403: openqaworker1 and w4 are repeatedly down | Rejected | 20/10/2019 | 10/11/2019

History

#1 Updated by okurz 6 months ago

  • Priority changed from Urgent to Normal

Checked the responsiveness of both hosts over IPMI SOL, but there was no output. Power status was on. Power cycled both machines; both are up again. Side effect: the only x86_64 workers that were up were imagetester:1 and :2, and they did not seem very stable: https://openqa.opensuse.org/tests/1059689#next_previous shows two "random" failures in a row.

#2 Updated by okurz 5 months ago

  • Due date set to 03/11/2019
  • Status changed from In Progress to Feedback
  • Priority changed from Normal to Low

I will check whether this happens again to see what I can do about debugging. I could apply the same monitor+reboot check as is done for aarch64.o.o.

openqaworker1 and w4 were down on 2019-10-19, and potentially once more in the past few days.

Oct 19 03:30:37 openqaworker4 systemd-journald[777]: Journal stopped
-- Reboot --
Oct 20 09:21:15 openqaworker4 kernel: microcode: microcode updated early to revision 0x43, date = 2019-03-01

The host only came back after a forced power cycle. I suspect a recent kernel upgrade.
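To quantify such outages from the journal, the gap between the last line before a "-- Reboot --" marker and the first line after it can be computed. The following is only a sketch, assuming short-format journal output as in the excerpt above; the function names are made up for illustration, and the year has to be passed in because the timestamps do not carry one.

```python
from datetime import datetime, timedelta

REBOOT_MARKER = "-- Reboot --"


def parse_stamp(line: str, year: int) -> datetime:
    """Parse the leading "Mon DD HH:MM:SS" of a short-format journal line."""
    stamp = " ".join(line.split()[:3])
    # The year is not part of the line, so it is supplied by the caller.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def reboot_gaps(lines, year: int):
    """Return (last_seen, first_seen, gap) for every reboot marker found."""
    gaps = []
    previous = None
    after_marker = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == REBOOT_MARKER:
            after_marker = True
            continue
        current = parse_stamp(line, year)
        if after_marker and previous is not None:
            gaps.append((previous, current, current - previous))
        after_marker = False
        previous = current
    return gaps


if __name__ == "__main__":
    excerpt = [
        "Oct 19 03:30:37 openqaworker4 systemd-journald[777]: Journal stopped",
        "-- Reboot --",
        "Oct 20 09:21:15 openqaworker4 kernel: microcode: microcode updated "
        "early to revision 0x43, date = 2019-03-01",
    ]
    for last_seen, first_seen, gap in reboot_gaps(excerpt, 2019):
        print(f"silent from {last_seen} to {first_seen} ({gap})")
```

For the excerpt above this gives a gap of 1 day, 5:50:38, i.e. openqaworker4 logged nothing for almost 30 hours before the power cycle.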

#3 Updated by okurz 5 months ago

  • Duplicated by action #58403: openqaworker1 and w4 are repeatedly down added

#4 Updated by okurz 5 months ago

  • Due date deleted (03/11/2019)
  • Status changed from Feedback to Resolved

Added recovery to okurz's crontab on lord.arch, the same as for aarch64.o.o. Let's see if these trigger at all and how often.
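For reference, a monitor+reboot check of this kind could look roughly like the sketch below. This is not the actual crontab entry, which is not shown in this ticket. It assumes the worker answers ping when healthy, that `ipmitool` can reach the BMC over `lanplus`, and that the password is exported as `IPMI_PASSWORD` (read via `-E`). The BMC hostnames and the user name are placeholders.

```python
import subprocess

# Placeholder mapping of worker host to BMC host; not the real addresses.
WORKERS = {
    "openqaworker1": "openqaworker1-bmc.example",
    "openqaworker4": "openqaworker4-bmc.example",
}


def is_reachable(host: str, timeout_s: int = 5) -> bool:
    """Send a single ICMP echo request (Linux iputils ping options)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ipmi_command(bmc_host: str, user: str, *action: str) -> list:
    """Build an ipmitool call; -E takes the password from IPMI_PASSWORD."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", *action]


def decide_action(reachable: bool, power_status: str):
    """Pick the chassis action for a host, or None if nothing is needed."""
    if reachable:
        return None
    # "chassis power status" reports e.g. "Chassis Power is on".
    if "is on" in power_status.lower():
        return ("chassis", "power", "cycle")
    return ("chassis", "power", "on")


def check_and_recover(host: str, bmc_host: str, user: str) -> None:
    reachable = is_reachable(host)
    status = ""
    if not reachable:
        status = subprocess.run(
            ipmi_command(bmc_host, user, "chassis", "power", "status"),
            capture_output=True, text=True,
        ).stdout
    action = decide_action(reachable, status)
    if action is not None:
        print(f"{host} unreachable, running: {' '.join(action)}")
        subprocess.run(ipmi_command(bmc_host, user, *action), check=False)


if __name__ == "__main__":
    for worker, bmc in WORKERS.items():
        check_and_recover(worker, bmc, user="ADMIN")  # user is a placeholder
```

Run from cron every few minutes, the printed lines would also answer how often the recovery triggers. A single missed ping is a rather aggressive threshold; requiring several consecutive failures before power cycling would probably be safer.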
