reduce heat in NUE-SRV2
The AC is running with reduced performance. We should shut down machines that we do not urgently need. Walking over the racktables entries from QA, I can find that, for example, openqaworker-arm-1, -2 and -3 are in SRV2.
- Priority changed from Normal to Urgent
I paused alerts for openqaworker-arm-2 and openqaworker-arm-3 and triggered a poweroff with
sudo salt -l error --state-output=changes 'openqaworker-arm-*' cmd.run 'systemctl poweroff'
but kept openqaworker-arm-1 running to have at least one aarch64 machine. There are also ppc64le machines in SRV2 but I do not dare to touch them based on my previous experiences :)
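A minimal sketch of how the poweroff could be limited explicitly to the machines that should go down. Taken literally, the glob 'openqaworker-arm-*' also matches openqaworker-arm-1, so a list target (salt's -L option) is one way to spell out the intent. The helper workers_except is hypothetical, and the sketch assumes the salt minion IDs equal the short hostnames, which may not hold on the real setup. The command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated salt list target containing
# every given worker except the one that should keep running.
workers_except() {
    keep=$1
    shift
    list=
    for w in "$@"; do
        # Skip the machine we want to keep up.
        [ "$w" = "$keep" ] && continue
        list="${list:+$list,}$w"
    done
    printf '%s\n' "$list"
}

# Assumption: minion IDs are the short hostnames used in this ticket.
targets=$(workers_except openqaworker-arm-1 \
    openqaworker-arm-1 openqaworker-arm-2 openqaworker-arm-3)

# Printed instead of executed so the sketch is safe to run anywhere.
echo "sudo salt -l error --state-output=changes -L '$targets' cmd.run 'systemctl poweroff'"
```

Running the printed command by hand would then only touch the listed minions.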
- Due date changed from 2020-08-11 to 2020-09-02
- Priority changed from Urgent to Normal
I have not heard any update on whether the situation was resolved, but https://openqa.suse.de/tests/ shows neither "stuck" jobs nor a very long list, so I guess we can keep the state as-is until after the August vacation period.
I have removed the keys for both machines with
sudo salt-key -y -d openqaworker-arm-\*
The keys should be re-added once the AC problem has been resolved.
- Due date changed from 2020-09-02 to 2020-09-04
- Priority changed from Normal to High
There is no official resolution of the A/C problem yet; we are still running with reduced capacity. I asked EngInfra TL Evzenie Sujskaja in https://chat.suse.de/channel/suse-it-ama?msg=X53ws27XhyA8nhomZ and she responded that the situation should be resolved today. We can check again tomorrow. However, nsinger already triggered a start of openqaworker-arm-2 and openqaworker-arm-3. I cannot access these machines over either ssh or ipmi. I triggered a power cycle of openqaworker-arm-2 and
sol activate showed the machine stuck in a system management window. I selected to exit the menu to boot cleanly and the kernel booted.
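The power cycle and serial-over-LAN session mentioned above could look roughly like the following with ipmitool. This is a sketch under assumptions: the BMC hostname and user are placeholders that do not come from this ticket, and the wrapper function ipmi is hypothetical. By default it only prints the commands (DRY_RUN=1). The -E option makes ipmitool read the password from the IPMI_PASSWORD environment variable, so it never appears on the command line.

```shell
#!/bin/sh
# Placeholder BMC address and user; replace with the real values.
IPMI_HOST=${IPMI_HOST:-openqaworker-arm-2-bmc.example.org}
IPMI_USER=${IPMI_USER:-admin}
# Set DRY_RUN=0 to really talk to the BMC (needs IPMI_PASSWORD exported).
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper around ipmitool using the lanplus interface.
ipmi() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "ipmitool -I lanplus -H $IPMI_HOST -U $IPMI_USER -E $*"
    else
        ipmitool -I lanplus -H "$IPMI_HOST" -U "$IPMI_USER" -E "$@"
    fi
}

ipmi chassis power status   # check the current power state first
ipmi chassis power cycle    # hard power cycle of the machine
ipmi sol activate           # attach to the serial console to watch the boot
```

Watching the console via sol activate is what revealed the machine sitting in the firmware menu instead of booting.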
EDIT: 2020-09-04: Asked again in #suse-it after receiving no update, see https://chat.suse.de/channel/suse-it-ama?msg=PjPWuRhzJW9ePuYiK
- Status changed from Feedback to In Progress
Got an update in https://chat.suse.de/channel/suse-it-ama?msg=nde4vgeR4P5wgO0t3 that the NBG SRV2 AC is fully healthy now. We can start all missing machines again.
openqaworker-arm-2 and openqaworker-arm-3 are in SRV2, not ext. Both should be controllable over IPMI, although I again can't reach the IPMI interface of openqaworker-arm-3, as has happened 5 times already in the past, see #70966. I think some "PowerPC" machines are still missing, but I do not know what state they are in, where they are or what would need to be done. Next is openqaworker-arm-2, which I power cycled yesterday but which is missing again. I power cycled it again and should confirm that it works and then enable alerts with automatic recovery.
EDIT: For openqaworker-arm-2 I see on startup a message:
WARNING: ********************************************************
WARNING: * This is debug mode when the default.dtb file is there.
WARNING: * [Restore factory defaults] can return the normal mode.
WARNING: ********************************************************
and eventually the system does not boot a full OS but gets stuck in
Aptio Setup Utility - Copyright (C) 2017 American Megatrends, Inc.
Main  Advanced  Security  Boot  Save & Exit  Server Mgmt
-------------------------------------------+-------------------------
BIOS Information                           | Memory Slot Information.
Access Level          Administrator        |
Project Name          MT60-SC4-00          |
Project Version       T32                  |
Build Date and Time   03/03/2017 13:09:58  |
                                           |
BMC Information                            |
BMC Firmware Version  07.68                |
SDR Version           00.04                |
FRU Version           01.00                |-------------------------
                                           | ><: Select Screen
Processor Information                      | : Select Item
CPU 0 : CN8890-2000BG2601-ST-Y-G           | Enter: Select
CPU 1 : CN8890-2000BG2601-CP-Y-G           | +/-: Change Opt.
Max CPU Speed         2000 MHz             | F1: General Help
CPU Data Cache        32 KB                | F3: Previous Values
CPU Instruction Cache 78 KB                | F9: Optimized Defaults
                                           | F10: Save & Exit
                                           | ESC: Exit
-------------------------------------------+-------------------------
Version 2.18.1264. Copyright (C) 2017 American Megatrends, Inc. AB
Just exiting the menu lets the machine boot, but this should happen automatically. I don't know what I would need to change here. Reported as #70969
Currently all three arm machines are back up. #70966 regarding openqaworker-arm-3 was resolved. I added the salt key and enabled alerts for openqaworker-arm-3 again. I also triggered a high state with salt:
salt -l error --no-color -C 'openqaworker-arm-3*' state.apply test=True
For openqaworker-arm-2, due to #70969, I have not yet added the salt key back nor enabled the alerts again. However, I have paused the alerts for median job age until the job queue is back to sane levels.
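For reference, a rough sketch of the steps for bringing a worker back under salt control once it is stable. The function readd_worker is hypothetical and only prints the commands. It assumes the minion has resubmitted its key after the earlier deletion, so the key is pending on the master. Note that state.apply with test=True is a dry run that reports what would change without applying it.

```shell
#!/bin/sh
# Hypothetical helper: print the salt commands to re-add one worker.
readd_worker() {
    host=$1
    # Accept the pending key of the minion (globs are supported).
    echo "sudo salt-key -y -a '$host*'"
    # Check that the minion responds before touching its state.
    echo "sudo salt -l error --no-color -C '$host*' test.ping"
    # Dry run of the high state; drop test=True to really apply it.
    echo "sudo salt -l error --no-color -C '$host*' state.apply test=True"
}

readd_worker openqaworker-arm-3
```

The same sequence would apply to openqaworker-arm-2 once #70969 is sorted out.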