action #64737

openqaworker-arm-3 is down since 2020-03-16, also IPMI unresponsive

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
Start date: 2020-03-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&from=now-7d&to=now shows that openqaworker-arm-3 has been unresponsive since 2020-03-16, and there is no response over IPMI either.


Related issues

Blocks openQA Infrastructure - action #61844: auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others) - Resolved - 2020-01-07

History

#1 Updated by okurz over 1 year ago

  • Status changed from In Progress to Blocked

Created https://infra.nue.suse.com/SelfService/Display.html?id=166330 with text

openqaworker-arm-3 has been unresponsive since 2020-03-16, with no response over IPMI either, similar to https://infra.nue.suse.com/SelfService/Display.html?id=164419 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 previously. Please recover it, e.g. reset IPMI and the machine, but please *also* look into why the management interface breaks down repeatedly.

Reference: https://progress.opensuse.org/issues/64737

#2 Updated by okurz over 1 year ago

Removed the salt key for openqaworker-arm-3 for now to prevent failures when applying salt states.

#3 Updated by okurz over 1 year ago

  • Blocks action #61844: auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others) added

#4 Updated by okurz over 1 year ago

Asked aneisner in chat whether there is anything I could do to help.

#5 Updated by okurz over 1 year ago

Unfortunately no response from aneisner. Escalated the issue with SUSE-IT repeatedly, no action so far. I asked runger for help in escalation.

#6 Updated by okurz over 1 year ago

  • Status changed from Blocked to In Progress

jdsn helped recover the machine by rebooting it. He wrote:

The machine
* did not react to IPMI
* did not show any output on the VGA port
* did not react to key presses on a USB keyboard

I asked if he can help to find out why the management interface repeatedly disappears.

I accepted the salt key with salt-key -A on osd. I am now upgrading with zypper dup and applying the high state with salt -l error --no-color -C '*arm-3*' state.apply.
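
For reference, the salt steps above could be wrapped in a small script. This is only a sketch: the function name recover_worker and the RUN dry-run variable are made up for illustration and are not part of what was actually run. The salt-key and salt invocations are the ones quoted above. By default the script only prints the commands; it is assumed to be run on the salt master (osd).

```shell
#!/bin/sh
# RUN is a hypothetical dry-run switch: it defaults to "echo", so commands are
# only printed. Export RUN="" to really execute them on the salt master.
RUN="${RUN-echo}"

# recover_worker TARGET
# TARGET is a salt compound matcher, e.g. '*arm-3*' as used in this ticket.
recover_worker() {
    # Accept all pending minion keys (salt-key asks for confirmation).
    $RUN salt-key -A
    # Apply the high state to the matching minion(s), logging errors only.
    $RUN salt -l error --no-color -C "$1" state.apply
    # Note: "zypper dup" is run on the worker itself and is left out here,
    # because the ticket does not say how it was invoked.
}

recover_worker '*arm-3*'
```

Keeping the dry run as the default means that running the script by accident on the master does not accept keys or touch any minion.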

#7 Updated by okurz over 1 year ago

  • Status changed from In Progress to Resolved

jdsn recommends looking into the logs of the management interface and does not see himself able to debug vendor problems, which is understandable :)

I have enabled email notifications for alerts on https://openqaworker-arm-3-ipmi.suse.de/index.html, same as on arm-1 and arm-2.

The worker is upgraded and jobs are being picked up and executed, e.g. https://stackoverflow.com/a/10236104 is fine.
