action #69727

reduce heat in NUE-SRV2

Added by okurz 12 months ago. Updated 11 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2020-08-07
Due date:
2020-09-04
% Done:

0%

Estimated time:
Tags:

Description

AC is running with reduced performance. We should shut down machines that we do not urgently need. Walking over racktables entries from QA I can find that for example openqaworker-arm-1, -2, -3 are in SRV2.


Related issues

Related to openQA Infrastructure - action #70966: ipmi management interface of openqaworker-arm-3 is inaccessible (Resolved, 2020-07-16)

Copied to openQA Infrastructure - action #70969: openqaworker-arm-2 stuck in system management menu after reboot (Resolved)

History

#2 Updated by okurz 12 months ago

  • Priority changed from Normal to Urgent

I paused alerts for openqaworker-arm-2 and openqaworker-arm-3 and triggered a poweroff with sudo salt -l error --state-output=changes 'openqaworker-arm-[23]*' cmd.run 'systemctl poweroff', but kept openqaworker-arm-1 running so that we have at least one aarch64 machine. There are also ppc64le machines in SRV2, but I do not dare to touch them given my previous experiences :)

#3 Updated by okurz 12 months ago

  • Due date changed from 2020-08-11 to 2020-09-02
  • Priority changed from Urgent to Normal

I have not heard any update on whether the situation was resolved, but https://openqa.suse.de/tests/ shows neither "stuck" jobs nor a very long list, so I guess we can keep the state as-is until after the August vacation period.

I have removed the keys for both machines with sudo salt-key -y -d openqaworker-arm-[23]\*. The keys should be re-added once the AC problem has been resolved.
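
A minimal sketch of the key removal above and the later re-add, for whoever picks this up once the AC is fixed. DRY_RUN and the run helper are hypothetical, added here only so the commands can be previewed before they touch the salt master; the glob is the one used in this comment. It assumes the minions are powered on and running salt-minion, so that their keys show up as pending before they are accepted.

```shell
#!/bin/sh
# Sketch only: DRY_RUN and run() are illustrative helpers, not part of salt.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Preview mode: print the command instead of executing it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Delete the accepted keys of all minions matching the glob in $1.
remove_keys() { run sudo salt-key -y -d "$1"; }

# Accept the pending keys of all minions matching the glob in $1.
readd_keys() { run sudo salt-key -y -a "$1"; }
```

For example, remove_keys 'openqaworker-arm-[23]*' prints the command by default; with DRY_RUN=0 it is executed for real.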

#4 Updated by okurz 11 months ago

  • Due date changed from 2020-09-02 to 2020-09-04
  • Priority changed from Normal to High

There has been no official resolution of the A/C problem yet; we are still running with reduced capacity. I asked EngInfra TL Evzenie Sujskaja in https://chat.suse.de/channel/suse-it-ama?msg=X53ws27XhyA8nhomZ and she responded that the situation should be resolved today. We can check again tomorrow. However, nsinger already triggered a start of openqaworker-arm-2 and openqaworker-arm-3. I cannot access these machines over either ssh or ipmi. I triggered a power cycle of openqaworker-arm-2, and sol activate showed the machine stuck in a system management window. I selected to exit the menu to boot cleanly and the kernel booted.
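
A rough sketch of the power cycle and serial-over-LAN steps mentioned above, using ipmitool. The BMC host name passed as the first argument, the IPMI_USER variable and its ADMIN default are assumptions and not taken from this ticket; the password is read from the IPMI_PASSWORD environment variable via ipmitool's -E option. DRY_RUN and run are hypothetical helpers so the commands can be previewed first.

```shell
#!/bin/sh
# Sketch only: DRY_RUN and run() are illustrative helpers.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Preview mode: print the command instead of executing it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: BMC host name (placeholder), remaining arguments: ipmitool subcommand.
# -E makes ipmitool take the password from the IPMI_PASSWORD env variable.
ipmi() {
    bmc="$1"
    shift
    run ipmitool -I lanplus -H "$bmc" -U "${IPMI_USER:-ADMIN}" -E "$@"
}

# Hard power cycle of the machine behind the given BMC.
power_cycle() { ipmi "$1" chassis power cycle; }

# Attach to the serial console to watch the boot (or the firmware menu).
serial_console() { ipmi "$1" sol activate; }
```

Watching the boot with serial_console right after power_cycle is how a machine stuck in a firmware setup screen becomes visible.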

EDIT: 2020-09-04: Asked again in #suse-it after receiving no update, see https://chat.suse.de/channel/suse-it-ama?msg=PjPWuRhzJW9ePuYiK

#5 Updated by okurz 11 months ago

  • Status changed from Feedback to In Progress

Got an update in https://chat.suse.de/channel/suse-it-ama?msg=nde4vgeR4P5wgO0t3 that the NBG SRV2 AC is fully healthy now. We can start all machines that are still missing again.

#6 Updated by okurz 11 months ago

  • Related to action #70966: ipmi management interface of openqaworker-arm-3 is inaccessible added

#7 Updated by okurz 11 months ago

openqaworker-arm-2 and openqaworker-arm-3 are in SRV2, not ext. Both should be controllable over IPMI, although I again can't reach the IPMI interface of openqaworker-arm-3, as has already happened 5 times in the past, see #70966. I think some "PowerPC" machines are still missing, but I do not know what state they are in, where they are, or what would need to be done. Next is openqaworker-arm-2, which I power cycled yesterday but which is missing again. I power cycled it again and should confirm it is working, then enable alerts with automatic recovery.

EDIT: For openqaworker-arm-2 I see on startup a message:

WARNING: ********************************************************
WARNING: * This is debug mode when the default.dtb file is there.
WARNING: * [Restore factory defaults] can return the normal mode.
WARNING: ********************************************************

and eventually the system does not boot a full OS but gets stuck in

       Aptio Setup Utility - Copyright (C) 2017 American Megatrends, Inc.       
    Main  Advanced  Security  Boot  Save & Exit  Server Mgmt                    
┌────────────────────────────────────────────────────┬─────────────────────────┐
│  BIOS Information                                  │Memory Slot Information. │
│  Access Level            Administrator            █│                         │
│  Project Name            MT60-SC4-00              █│                         │
│  Project Version         T32                      █│                         │
│  Build Date and Time     03/03/2017 13:09:58      █│                         │
│                                                   █│                         │
│  BMC Information                                  █│                         │
│  BMC Firmware Version    07.68                    █│                         │
│  SDR Version             00.04                    █│                         │
│  FRU Version             01.00                    ░│─────────────────────────│
│                                                   ░│><: Select Screen        │
│  Processor Information                            ░│: Select Item            │
│  CPU 0 : CN8890-2000BG2601-ST-Y-G                 ░│Enter: Select            │
│  CPU 1 : CN8890-2000BG2601-CP-Y-G                 ░│+/-: Change Opt.         │
│  Max CPU Speed           2000 MHz                 ░│F1: General Help         │
│  CPU Data Cache          32 KB                    ░│F3: Previous Values      │
│  CPU Instruction Cache   78 KB                    ░│F9: Optimized Defaults   │
│                                                    │F10: Save & Exit         │
│                                                    │ESC: Exit                │
└────────────────────────────────────────────────────┴─────────────────────────┘
        Version 2.18.1264. Copyright (C) 2017 American Megatrends, Inc.        
                                                                             AB

Exiting the menu lets the machine boot, but this should happen automatically. I don't know what I would need to change here. Reported as #70969.

#8 Updated by okurz 11 months ago

  • Copied to action #70969: openqaworker-arm-2 stuck in system management menu after reboot added

#9 Updated by okurz 11 months ago

Currently all three arm machines are back up. #70966 regarding openqaworker-arm-3 was resolved. I added the salt key and enabled alerts for openqaworker-arm-3 again. I also triggered a high state with salt -l error --no-color -C 'openqaworker-arm-3*' state.apply test=True. For openqaworker-arm-2, due to #70969, I have neither added the salt key back nor enabled the alerts again yet. However, I have paused the alerts for median job age until the job queue is back to sane levels.
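
Note that test=True makes state.apply a dry run that only reports what would change. A small sketch of running the dry run first and the real apply afterwards follows; the highstate function, its apply argument, DRY_RUN and run are hypothetical names for illustration, while the salt options are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: DRY_RUN and run() are illustrative helpers, not part of salt.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Preview mode: print the command instead of executing it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: compound target, $2: "apply" to really apply, anything else = test run.
highstate() {
    target="$1"
    mode="${2:-test}"
    if [ "$mode" = apply ]; then
        run salt -l error --no-color -C "$target" state.apply
    else
        # test=True only reports the changes salt would make.
        run salt -l error --no-color -C "$target" state.apply test=True
    fi
}
```

Calling highstate 'openqaworker-arm-3*' and reviewing the reported changes before calling highstate 'openqaworker-arm-3*' apply keeps surprises on a freshly re-added worker small.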

#10 Updated by okurz 11 months ago

  • Status changed from In Progress to Resolved

openqaworker-arm-2 is back. I added the salt key and unpaused alerts. The job age alert is also unpaused and green. All seems to be back to normal now.
