action #69727
closed
reduce heat in NUE-SRV2
0%
Description
AC is running with reduced performance. We should shut down machines that we do not urgently need. Walking over racktables entries from QA I can find that for example openqaworker-arm-1, -2, -3 are in SRV2.
Updated by okurz over 4 years ago
- Priority changed from Normal to Urgent
I paused alerts for openqaworker-arm-2 and openqaworker-arm-3 and triggered a poweroff with
sudo salt -l error --state-output=changes 'openqaworker-arm-[23]*' cmd.run 'systemctl poweroff'
but kept openqaworker-arm-1 running so that at least one aarch64 machine stays available. There are also ppc64le machines in SRV2, but I do not dare to touch them based on my previous experiences :)
Updated by okurz over 4 years ago
- Due date changed from 2020-08-11 to 2020-09-02
- Priority changed from Urgent to Normal
I have not heard whether the situation was resolved, but https://openqa.suse.de/tests/ shows neither "stuck" jobs nor a very long list, so I guess we can keep the state as-is until after the August vacation period.
I have removed the keys for both machines with
sudo salt-key -y -d openqaworker-arm-[23]\*
The keys should be re-added once the AC problem has been resolved.
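Re-adding the keys later could look roughly like the sketch below. It assumes the workers have been powered on again and their minions have re-submitted their keys to the salt master. The `run` helper, the `DRY_RUN` variable and the `readd_keys` function are hypothetical names used for illustration only; by default the script only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper (not from this ticket): print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Re-accept the salt keys of the two workers that were powered off.
readd_keys() {
    pattern='openqaworker-arm-[23]*'
    run sudo salt-key -L                         # list accepted and pending keys
    run sudo salt-key -y -a "$pattern"           # accept the pending keys
    run sudo salt -l error "$pattern" test.ping  # check that the minions respond
}

readd_keys
```

Setting `DRY_RUN=0` before calling `readd_keys` would execute the commands instead of printing them.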
Updated by okurz over 4 years ago
- Due date changed from 2020-09-02 to 2020-09-04
- Priority changed from Normal to High
There is no official resolution of the A/C problem yet; we are still running with reduced capacity. I asked EngInfra TL Evzenie Sujskaja in https://chat.suse.de/channel/suse-it-ama?msg=X53ws27XhyA8nhomZ and she responded that the situation should be resolved today. We can check again tomorrow. However, nsinger already triggered a start of openqaworker-arm-2 and openqaworker-arm-3. I cannot access these machines over either ssh or ipmi. I triggered a power cycle of openqaworker-arm-2, and sol activate showed the machine stuck in a system management window. I selected to exit the menu to boot cleanly and the kernel booted.
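The power cycle and serial-over-LAN check described above can be sketched as follows, assuming `ipmitool` with the `lanplus` interface is used. The BMC hostname and user are placeholders and not values from this ticket; the `run`, `ipmi` and `power_cycle_and_watch` names are hypothetical. By default the script only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper (not from this ticket): print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Placeholder connection details (assumptions, replace with the real BMC data).
IPMI_HOST="${IPMI_HOST:-openqaworker-arm-2-ipmi.example}"
IPMI_USER="${IPMI_USER:-ADMIN}"

ipmi() {
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment variable
    run ipmitool -I lanplus -H "$IPMI_HOST" -U "$IPMI_USER" -E "$@"
}

power_cycle_and_watch() {
    ipmi chassis power status   # current power state
    ipmi chassis power cycle    # power cycle the machine
    ipmi sol activate           # attach to the serial console to watch the boot
}

power_cycle_and_watch
```

Watching the serial console right after the power cycle is what makes a machine stuck in a firmware menu visible, since neither ssh nor salt can reach it in that state.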
EDIT: 2020-09-04: Asked again in #suse-it after receiving no update, see https://chat.suse.de/channel/suse-it-ama?msg=PjPWuRhzJW9ePuYiK
Updated by okurz over 4 years ago
- Status changed from Feedback to In Progress
Got an update in https://chat.suse.de/channel/suse-it-ama?msg=nde4vgeR4P5wgO0t3 that the NBG SRV2 AC is fully healthy now. We can start all machines that are still missing again.
Updated by okurz over 4 years ago
- Related to action #70966: ipmi management interface of openqaworker-arm-3 is inaccessible added
Updated by okurz over 4 years ago
openqaworker-arm-2 and openqaworker-arm-3 are in SRV2, not ext. Both should be controllable over IPMI, but once again I cannot reach the IPMI interface of openqaworker-arm-3, as has already happened 5 times in the past, see #70966. I think some "PowerPC" machines are still missing, but I do not know what state they are in, where they are, or what would need to be done. Next is openqaworker-arm-2, which I power cycled yesterday but which is missing again. I power cycled it again and should confirm that it works, then enable alerts with automatic recovery.
EDIT: For openqaworker-arm-2 I see on startup a message:
WARNING: ********************************************************
WARNING: * This is debug mode when the default.dtb file is there.
WARNING: * [Restore factory defaults] can return the normal mode.
WARNING: ********************************************************
and eventually the system does not boot a full OS but gets stuck in
Aptio Setup Utility - Copyright (C) 2017 American Megatrends, Inc.
Main Advanced Security Boot Save & Exit Server Mgmt
┌────────────────────────────────────────────────────┬─────────────────────────┐
│ BIOS Information                                   │Memory Slot Information. │
│ Access Level            Administrator             █│                         │
│ Project Name            MT60-SC4-00               █│                         │
│ Project Version         T32                       █│                         │
│ Build Date and Time     03/03/2017 13:09:58       █│                         │
│                                                   █│                         │
│ BMC Information                                   █│                         │
│ BMC Firmware Version    07.68                     █│                         │
│ SDR Version             00.04                     █│                         │
│ FRU Version             01.00                     ░│─────────────────────────┤
│                                                   ░│><: Select Screen        │
│ Processor Information                             ░│: Select Item            │
│ CPU 0 : CN8890-2000BG2601-ST-Y-G                  ░│Enter: Select            │
│ CPU 1 : CN8890-2000BG2601-CP-Y-G                  ░│+/-: Change Opt.         │
│ Max CPU Speed           2000 MHz                  ░│F1: General Help         │
│ CPU Data Cache          32 KB                     ░│F3: Previous Values      │
│ CPU Instruction Cache   78 KB                     ░│F9: Optimized Defaults   │
│                                                    │F10: Save & Exit         │
│                                                    │ESC: Exit                │
└────────────────────────────────────────────────────┴─────────────────────────┘
Version 2.18.1264. Copyright (C) 2017 American Megatrends, Inc.
AB
Just exiting the menu lets the machine boot, but this should happen automatically. I don't know what I would need to change here. Reported as #70969
Updated by okurz over 4 years ago
- Copied to action #70969: openqaworker-arm-2 stuck in system management menu after reboot added
Updated by okurz over 4 years ago
Currently all three arm machines are back up. #70966 regarding openqaworker-arm-3 was resolved. I added the salt key and enabled alerts for openqaworker-arm-3 again. I also triggered a high state with salt:
salt -l error --no-color -C 'openqaworker-arm-3*' state.apply test=True
For openqaworker-arm-2, due to #70969, I have neither added the salt key back nor enabled the alerts again yet. However, I have paused the alerts for median job age until the job queue is back to sane levels.
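In Salt, `test=True` turns `state.apply` into a dry run that only reports what would change. A minimal sketch of previewing first and then applying could look like this; the `run` and `highstate` names and the `apply` mode argument are hypothetical, and by default the script only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper (not from this ticket): print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# highstate TARGET [apply]: preview the high state, or apply it when "apply" is given.
highstate() {
    target=$1
    mode=${2:-preview}
    if [ "$mode" = "apply" ]; then
        run salt -l error --no-color -C "$target" state.apply
    else
        # test=True reports pending changes without applying them
        run salt -l error --no-color -C "$target" state.apply test=True
    fi
}

highstate 'openqaworker-arm-3*'         # dry run
highstate 'openqaworker-arm-3*' apply   # real run
```

Running the preview first keeps unexpected changes visible before they reach a worker that has just come back online.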
Updated by okurz over 4 years ago
- Status changed from In Progress to Resolved
openqaworker-arm-2 is back. I added the salt key and unpaused alerts. The job age alert is also unpaused and green. All seems to be back to normal now.