action #70969
openqaworker-arm-2 stuck in system management menu after reboot (closed)
Description
Observation
On startup, openqaworker-arm-2 shows this message:
WARNING: ********************************************************
WARNING: * This is debug mode when the default.dtb file is there.
WARNING: * [Restore factory defaults] can return the normal mode.
WARNING: ********************************************************
and eventually the system does not boot a full OS but gets stuck in
Aptio Setup Utility - Copyright (C) 2017 American Megatrends, Inc.
Main Advanced Security Boot Save & Exit Server Mgmt
┌────────────────────────────────────────────────────┬─────────────────────────┐
│ BIOS Information                                   │Memory Slot Information. │
│ Access Level           Administrator               │                         │
│ Project Name           MT60-SC4-00                 │                         │
│ Project Version        T32                         │                         │
│ Build Date and Time    03/03/2017 13:09:58         │                         │
│                                                    │                         │
│ BMC Information                                    │                         │
│ BMC Firmware Version   07.68                       │                         │
│ SDR Version            00.04                       │                         │
│ FRU Version            01.00                       ├─────────────────────────┤
│                                                    │><: Select Screen        │
│ Processor Information                              │^v: Select Item          │
│ CPU 0 : CN8890-2000BG2601-ST-Y-G                   │Enter: Select            │
│ CPU 1 : CN8890-2000BG2601-CP-Y-G                   │+/-: Change Opt.         │
│ Max CPU Speed          2000 MHz                    │F1: General Help         │
│ CPU Data Cache         32 KB                       │F3: Previous Values      │
│ CPU Instruction Cache  78 KB                       │F9: Optimized Defaults   │
│                                                    │F10: Save & Exit         │
│                                                    │ESC: Exit                │
└────────────────────────────────────────────────────┴─────────────────────────┘
Version 2.18.1264. Copyright (C) 2017 American Megatrends, Inc.
Exiting the menu lets the machine boot, but this should happen automatically. I don't know what I would need to change here.
Updated by okurz over 4 years ago
- Copied from action #69727: reduce heat in NUE-SRV2 added
Updated by nicksinger over 4 years ago
- Status changed from New to In Progress
- Assignee set to nicksinger
BMC is not reachable anymore, I created an infra ticket (#176913):
"Dear Colleague,
Thank you for your report of: "[openqa] openqaworker-arm-2.suse.de down - please reboot"
assigned reference number: "176913"
Someone from the designate team will contact you about
your request as soon as we can.
If you have additional comments or questions, you can
follow up to the ticket here at :
https://infra.nue.suse.com/Ticket/Display.html?id=176913
Regards,
The Engineering Infrastructure Team"
arm-ticket@suse.de
The original message:
Hey Toni, all,
unfortunately, we lost one of our workers again. This time openqaworker-arm-2 is
affected. Could you please hard-power cycle the machine?
Also, would it be possible to access the power socket on our own so we don't
need to open a ticket all the time?
Best and thanks in advance,
Nick
Updated by nicksinger over 4 years ago
Hi Nick,
the machine has been reset.
I checked down there for PDU ports, but it does not look good.
The nearest PDU has 1 port free and is 2 racks away; the other one is 4 racks
away, with a total of 2 ports free. We do not have power cables of that
length, and I think these PDUs are completely used by other teams (one is SES,
the other seems to be QA CSS). Since arm2 and arm3 have the same issue and
have 2 power outlets each, that would not be enough, and we do not have a
spare PDU, at least not that I know of.
Maybe we need to think of a different solution or let our manager think about
one.
So the machine is back and running again. I will now check whether it survives a reboot. I will also create a follow-up ticket to evaluate whether we want to invest in a PDU to power cycle these machines automatically.
Updated by nicksinger over 4 years ago
- Status changed from In Progress to Resolved
The machine came up fine after a reboot without getting stuck in the BIOS. I could imagine that somebody (maybe even me) accidentally ran ipmitool chassis bootdev bios.
I improved the automatic reboot a little bit with https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/7 to enforce booting from disk.
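For illustration, enforcing a boot from disk before a power cycle could look roughly like the following. This is only a sketch and not the content of the merge request above: the function names and the IPMI_HOST, IPMI_USER and DRY_RUN variables are hypothetical, and it assumes the BMC is reachable over the lanplus interface with the password supplied in IPMI_PASSWORD (read by ipmitool via -E).

```shell
#!/bin/sh
# Sketch only: select the disk as boot device, then power cycle the machine.
# IPMI_HOST, IPMI_USER and DRY_RUN are hypothetical names used for illustration.

ipmi() {
    # With DRY_RUN=1 only print the command instead of contacting the BMC.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ipmitool -I lanplus -H $IPMI_HOST -U $IPMI_USER -E $*"
    else
        # -E makes ipmitool read the password from the IPMI_PASSWORD variable.
        ipmitool -I lanplus -H "$IPMI_HOST" -U "$IPMI_USER" -E "$@"
    fi
}

reboot_from_disk() {
    # Overrides a stale boot device selection (such as "bios") for the next
    # boot, and only power cycles if that selection succeeded.
    ipmi chassis bootdev disk && ipmi chassis power cycle
}
```

Running reboot_from_disk with DRY_RUN=1 set only prints the two ipmitool invocations, which makes it possible to review them before touching a real machine.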
Updated by okurz over 4 years ago
- Status changed from Resolved to Feedback
- Assignee changed from nicksinger to okurz
thx. I added the salt key for openqaworker-arm-2 again and re-enabled telegraf on the machine and have checked that all alerts are enabled. The alerts have not recovered yet so I will monitor this.
Updated by okurz over 4 years ago
- Status changed from Feedback to Resolved
- Assignee changed from okurz to nicksinger
all good now
Updated by okurz over 4 years ago
- Status changed from Resolved to In Progress
- Assignee changed from nicksinger to okurz
hm, https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&panelId=6&fullscreen&edit&tab=alert triggered. Maybe we still have a problem here? Checking …
Updated by okurz over 4 years ago
- Status changed from In Progress to Resolved
The system is again stuck in the system management menu even though the fix for automatic recovery should already be in place in the gitlab CI pipeline. Judging from https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs there may have been a short window in which a reboot of openqaworker-arm-2 was triggered before the fixed approach was in effect. Retriggered a job in gitlab CI, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/257134
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/257134#L18 shows that no boot device was selected, and commit 191a23ba shows that this is the old state. That can explain it.
Another trigger for openqaworker-arm-3 just 1h ago was fine, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/256966#L19 which does select the boot device.
So I handled the reboot for openqaworker-arm-2 manually one more time. The machine is now booted and alerts are recovering.
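The manual comparison of the two job logs above could also be scripted. The following is a minimal sketch that assumes the job log was saved to a local file and that a successful boot device selection appears in it as a line containing "Set Boot Device to". That pattern is an assumption about what ipmitool prints, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: report whether a saved job log shows a boot device selection.
# The grep pattern is an assumption about the message printed by ipmitool.

log_selects_boot_device() {
    # $1: path to a locally saved copy of the job log (hypothetical)
    grep -q 'Set Boot Device to' "$1"
}
```

A call such as log_selects_boot_device job.log returns success only if the pattern is found, so a missing selection like the one in job 257134 would show up as a non-zero exit status.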