action #132500

closed

NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date:
Due date: 2023-07-27
% Done: 0%
Estimated time:

Description

Observation

Jobs on openqaworker-arm* are not running. See conversation in Slack.

Acceptance criteria

Suggestions

  • Follow up on the situation with SRV2
  • Once the climate issue is resolved, ensure OSD jobs are running properly again
  • Ensure that monitor.qa.suse.de is good again

Rollback steps


Related issues 4 (0 open, 4 closed)

Related to QA (public) - action #132488: gitlab CI shows showing no logs or are getting stuck (was: qem-bot sync aggregates gitlab CI job times out after 2h) size:M (Resolved, okurz, 2023-07-10)

Related to openQA Tests (public) - action #132476: [qe-core][qem]test fails in samba_adcli, is DNS server down? (Resolved, rfan1, 2023-07-10)

Related to openQA Infrastructure (public) - action #132812: [alert] openqaw5-xen host up alert + infrastructure ping size:M (Resolved, nicksinger, 2023-07-16)

Copied to openQA Infrastructure (public) - action #132947: Bring back ada.qe.suse.de and fix it properly (Resolved, nicksinger)

Actions #1

Updated by okurz over 1 year ago

  • Related to action #132488: gitlab CI shows showing no logs or are getting stuck (was: qem-bot sync aggregates gitlab CI job times out after 2h) size:M added
Actions #3

Updated by osukup over 1 year ago

SRV2 also hosts qamaster.qa.suse.de, openqaworker5-xen, grenache, qanet, and the openqa-power-8 workers ...

Actions #4

Updated by okurz over 1 year ago

  • Subject changed from aarch64 workers offline due to heat-related SRV2 shutdown to NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown
  • Priority changed from Normal to Urgent
Actions #5

Updated by okurz over 1 year ago

  • Tags set to infra, alert, outage, ac, heat, nue1, SRV2
Actions #6

Updated by osukup over 1 year ago

Because the GitLab workers are also in SRV2, all CI jobs are timing out --> set CI on openqaBot|bot-ng to inactive until SRV2 is resolved

Actions #7

Updated by rfan1 over 1 year ago

  • Related to action #132476: [qe-core][qem]test fails in samba_adcli, is DNS server down? added
Actions #8

Updated by okurz over 1 year ago

  • Subject changed from NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown to NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M
  • Description updated (diff)
Actions #9

Updated by okurz over 1 year ago

  • Status changed from Feedback to In Progress

I wrote a message in https://suse.slack.com/archives/C02CANHLANP/p1689188645969769

@here The climate control outage in NUE1-SRV2 was mitigated and we could reinstate the bare-bones LSG QE infrastructure. Expect that multiple individual machines will still be off or show problems. We will be looking into each openQA-related machine as well as some other services. If you encounter other outage-related problems, such as unavailable or unreachable machines, please reach out.

Actions #11

Updated by okurz over 1 year ago

From OSD salt:

qesapworker-prg4.qa.suse.cz:
    Minion did not return. [Not connected]
QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
malbec.arch.suse.de:
    Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230712191308295607
powerqaworker-qam-1.qa.suse.de:
    Minion did not return. [Not connected]
QA-Power8-4-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code

So those machines should be looked into, in particular the ppc ones, as without them we have no ppc-related testing.
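To turn a salt run like the one above into a list of machines to follow up on, the output can be parsed with a few lines of Python. This is only a rough sketch: the function name `unresponsive_minions` and the regular expression are made up for illustration, and it assumes the plain-text layout shown in this comment (an unindented `hostname:` line followed by an indented `Minion did not return. [reason]` line).

```python
import re

# Matches the status line salt printed above for a minion that did not answer,
# e.g. "    Minion did not return. [Not connected]".
NO_RETURN = re.compile(r"^\s+Minion did not return\. \[(?P<reason>[^\]]+)\]\s*$")


def unresponsive_minions(salt_output: str) -> dict[str, str]:
    """Map each minion that did not return to the reason salt reported.

    Hypothetical helper, assuming the text layout pasted in this ticket.
    """
    result: dict[str, str] = {}
    current = None
    for line in salt_output.splitlines():
        # An unindented line ending in ":" names the minion for the lines below it.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]
            continue
        match = NO_RETURN.match(line)
        if match and current is not None:
            result[current] = match.group("reason")
    return result


# Excerpt of the output quoted in this comment.
SAMPLE = """\
QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
    Minion did not return. [No response]
ERROR: Minions returned with non-zero exit code
"""

for host, reason in unresponsive_minions(SAMPLE).items():
    print(f"{host}: {reason}")
```

Keeping the reason alongside the hostname makes it easy to tell machines that are not connected at all from ones that are connected but did not respond in time.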

Actions #12

Updated by openqa_review over 1 year ago

  • Due date set to 2023-07-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by okurz over 1 year ago

  • Assignee changed from livdywan to nicksinger
Actions #14

Updated by nicksinger over 1 year ago

openqaworker-arm-{1,2,3} are online and execute jobs successfully. "qamaster" and therefore schort and grafana are reachable and working as well

Actions #15

Updated by nicksinger over 1 year ago

  • Description updated (diff)
Actions #16

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #17

Updated by nicksinger over 1 year ago

  • Description updated (diff)
Actions #18

Updated by pdostal over 1 year ago

If the AC works again, can ada please be powered back on? People have VMs over there…

Actions #19

Updated by okurz over 1 year ago

  • Copied to action #132947: Bring back ada.qe.suse.de and fix it properly added
Actions #20

Updated by okurz over 1 year ago

  • Related to action #132812: [alert] openqaw5-xen host up alert + infrastructure ping size:M added
Actions #21

Updated by nicksinger over 1 year ago

malbec is connected and running again

Actions #22

Updated by nicksinger over 1 year ago

  • Description updated (diff)
Actions #23

Updated by livdywan over 1 year ago

  • Status changed from In Progress to Feedback

The Core Maintenance Updates seem good, no suspicious incompletes, no relevant silences. Anything else left?

Actions #24

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Resolved

cdywan wrote:

The Core Maintenance Updates seem good, no suspicious incompletes, no relevant silences. Anything else left?

Thanks for checking. No, we have all ACs covered - thanks for giving me a helping hand here!
