action #70969

openqaworker-arm-2 stuck in system management menu after reboot

Added by okurz 5 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

For openqaworker-arm-2 I see on startup a message:

WARNING: ********************************************************
WARNING: * This is debug mode when the default.dtb file is there.
WARNING: * [Restore factory defaults] can return the normal mode.
WARNING: ********************************************************

and eventually the system does not boot a full OS but gets stuck in

       Aptio Setup Utility - Copyright (C) 2017 American Megatrends, Inc.       
    Main  Advanced  Security  Boot  Save & Exit  Server Mgmt                    
┌────────────────────────────────────────────────────┬─────────────────────────┐
│  BIOS Information                                  │Memory Slot Information. │
│  Access Level            Administrator            █│                         │
│  Project Name            MT60-SC4-00              █│                         │
│  Project Version         T32                      █│                         │
│  Build Date and Time     03/03/2017 13:09:58      █│                         │
│                                                   █│                         │
│  BMC Information                                  █│                         │
│  BMC Firmware Version    07.68                    █│                         │
│  SDR Version             00.04                    █│                         │
│  FRU Version             01.00                    ░├─────────────────────────┤
│                                                   ░│><: Select Screen        │
│  Processor Information                            ░│^v: Select Item          │
│  CPU 0 : CN8890-2000BG2601-ST-Y-G                 ░│Enter: Select            │
│  CPU 1 : CN8890-2000BG2601-CP-Y-G                 ░│+/-: Change Opt.         │
│  Max CPU Speed           2000 MHz                 ░│F1: General Help         │
│  CPU Data Cache          32 KB                    ░│F3: Previous Values      │
│  CPU Instruction Cache   78 KB                    ░│F9: Optimized Defaults   │
│                                                    │F10: Save & Exit         │
│                                                    │ESC: Exit                │
└────────────────────────────────────────────────────┴─────────────────────────┘
        Version 2.18.1264. Copyright (C) 2017 American Megatrends, Inc.        
                                                                             AB

Exiting the menu lets the machine boot, but this should happen automatically. I don't know what I would need to change here.


Related issues

Copied from openQA Infrastructure - action #69727: reduce heat in NUE-SRV2 (Resolved; start date 2020-08-07, due date 2020-09-04)

History

#1 Updated by okurz 5 months ago

#2 Updated by nicksinger 4 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

The BMC is not reachable anymore, so I created an infra ticket (#176913):


"Dear Colleague,

Thank you for your report of: "[openqa] openqaworker-arm-2.suse.de down - please reboot"
assigned reference number: "176913"

Someone from the designate team will contact you about
your request as soon as we can.

If you have additional comments or questions, you can
follow up to the ticket here at :

https://infra.nue.suse.com/Ticket/Display.html?id=176913

Regards,
The Engineering Infrastructure Team"
arm-ticket@suse.de


The original message:

Hey Toni, all,

unfortunately we again lost one of our workers. This time openqaworker-arm-2 is
affected. Could you please hard-power cycle the machine?
Also, would it be possible to access the power socket on our own so we don't
need to open a ticket all the time?

Best and thanks in advance,
Nick

#3 Updated by nicksinger 4 months ago

Hi Nick,

machine has been reset.

I checked down there for PDU ports, but it does not look good.

The next one has 1 port free and is 2 racks away; the other one is 4 racks
away with a total of 2 ports free. However, we do not have power cables of this
length, and I think these are completely used by other teams (one is SES, the
other one seems to be QA CSS). As arm2 and arm3 have the same issues and
have 2 power outlets each, it would not be enough, and we do not have a spare
PDU, at least not that I know of.

Maybe we need to think of a different solution or let our manager think about
one.

So the machine is back and running again. I will now check whether it survives a reboot. I will also create a follow-up ticket to evaluate whether we want to invest in a PDU to power cycle these machines automatically.

#4 Updated by nicksinger 4 months ago

  • Status changed from In Progress to Resolved

The machine came up fine after a reboot without getting stuck in the BIOS. I could imagine somebody (maybe even me) accidentally ran ipmitool chassis bootdev bios.
I improved the automatic reboot a little bit with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/7 to enforce booting from disk.
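
For illustration, here is a minimal sketch of what "enforce booting from disk" before a power cycle could look like. This is not the content of the merge request above, which is not quoted in this ticket. The function name, the BMC host name, the credentials and the use of the lanplus interface are all assumptions; only the ipmitool subcommands chassis bootdev disk and chassis power cycle are standard.

```python
import shlex
from typing import List


def build_recovery_commands(bmc_host: str, user: str, password: str) -> List[List[str]]:
    """Return the ipmitool invocations for a disk-enforced power cycle.

    Hypothetical helper: it only builds the argument lists and does not run them.
    """
    # Common connection arguments; "lanplus" (IPMI v2.0 over LAN) is an assumption.
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-P", password]
    return [
        # Select the disk as boot device so the next boot does not enter BIOS setup.
        base + ["chassis", "bootdev", "disk"],
        # Then power cycle the machine.
        base + ["chassis", "power", "cycle"],
    ]


if __name__ == "__main__":
    # Placeholder host and credentials, for display only.
    for argv in build_recovery_commands("bmc.example.invalid", "ADMIN", "secret"):
        print(shlex.join(argv))
```

Setting the boot device first means a stale bootdev bios setting would be overridden before the machine restarts. To actually run the commands, each argument list could be passed to subprocess.run(argv, check=True).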

#5 Updated by okurz 4 months ago

  • Status changed from Resolved to Feedback
  • Assignee changed from nicksinger to okurz

thx. I added the salt key for openqaworker-arm-2 again and re-enabled telegraf on the machine and have checked that all alerts are enabled. The alerts have not recovered yet so I will monitor this.

#6 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved
  • Assignee changed from okurz to nicksinger

all good now

#7 Updated by okurz 4 months ago

  • Status changed from Resolved to In Progress
  • Assignee changed from nicksinger to okurz

#8 Updated by okurz 4 months ago

  • Status changed from In Progress to Resolved

The system is again stuck in the system management menu, even though the fix for automatic recovery should already be in place in the gitlab CI pipeline. https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs suggests there may have been a short window in which a reboot of openqaworker-arm-2 was triggered before the fixed approach was in place. I retriggered a job in gitlab CI, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/257134

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/257134#L18 shows that no boot device was selected, and commit 191a23ba shows that this is the old state. That can explain it.

Another trigger for openqaworker-arm-3 just 1h ago was fine, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/256966#L19, which does the boot device selection.
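
The manual comparison of the two job logs could be automated with a small check like the sketch below. The job logs themselves are not quoted in this ticket, so the sample log text is made up. The default marker assumes the line that ipmitool prints after a successful chassis bootdev call ("Set Boot Device to ..."), and the function name is hypothetical.

```python
def selects_boot_device(log_text: str, marker: str = "Set Boot Device to") -> bool:
    """Return True if any line of the job log contains the boot device marker.

    The default marker is an assumption about ipmitool's output after
    'chassis bootdev'; adjust it to whatever the job actually prints.
    """
    return any(marker in line for line in log_text.splitlines())


if __name__ == "__main__":
    # Made-up log excerpts for illustration, not copied from the real jobs.
    old_job = "Chassis Power Control: Cycle\n"
    fixed_job = "Set Boot Device to disk\nChassis Power Control: Cycle\n"
    print(selects_boot_device(old_job))
    print(selects_boot_device(fixed_job))
```

A job whose log fails this check was most likely run from the old state, before the boot device selection was added.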

So I handled the reboot of openqaworker-arm-2 manually one more time. The machine is now booted and alerts are recovering.
