action #132500
closed
NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M
Added by livdywan over 1 year ago.
Updated over 1 year ago.
Description
Observation
Jobs on openqaworker-arm* are not running. See conversation in Slack.
Acceptance criteria
Suggestions
- Follow up on the situation with SRV2
- If the climate issue is resolved, ensure OSD jobs are good again
- Ensure that monitor.qa.suse.de is good again
Rollback steps
- Related to action #132488: gitlab CI shows showing no logs or are getting stuck (was: qem-bot sync aggregates gitlab CI job times out after 2h) size:M added
SRV2 also hosts qamaster.qa.suse.de, openqaworker5-xen, grenache, qanet, and openqa-power-8 workers ...
- Subject changed from aarch64 workers offline due to heat-related SRV2 shutdown to NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown
- Priority changed from Normal to Urgent
- Tags set to infra, alert, outage, ac, heat, nue1, SRV2
Because the GitLab workers are also in SRV2, all CI jobs time out --> set CI on openqaBot|bot-ng to inactive until SRV2 is resolved
- Related to action #132476: [qe-core][qem]test fails in samba_adcli, is DNS server down? added
- Subject changed from NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown to NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M
- Description updated (diff)
- Status changed from Feedback to In Progress
I wrote a message in https://suse.slack.com/archives/C02CANHLANP/p1689188645969769
@here The climate control outage in NUE1-SRV2 was mitigated and we could reinstate the bare-bones LSG QE infrastructure. Expect that multiple individual machines will still be off or show problems. We will be looking into each openQA related machine as well as some other services. If you encounter other outage-related problems like unavailable or unreachable machines, please reach out.
From OSD salt:
qesapworker-prg4.qa.suse.cz:
Minion did not return. [Not connected]
QA-Power8-5-kvm.qa.suse.de:
Minion did not return. [Not connected]
malbec.arch.suse.de:
Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230712191308295607
powerqaworker-qam-1.qa.suse.de:
Minion did not return. [Not connected]
QA-Power8-4-kvm.qa.suse.de:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
so those machines should be looked into, in particular the ppc ones as without those we don't have any ppc related testing.
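To turn output like the above into a checklist of machines to look at, a small parser can help. The following is a minimal sketch in Python, assuming the plain-text layout shown in this comment (a line with the minion ID ending in a colon, followed by a "Minion did not return. [reason]" line). The function name `unresponsive_minions` and the `SAMPLE` excerpt are illustrative and not part of salt or of any existing tooling.

```python
import re

# Excerpt of the salt output quoted in this ticket, used as sample input.
SAMPLE = """\
qesapworker-prg4.qa.suse.cz:
    Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
    Minion did not return. [No response]
The minions may not have all finished running. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230712191308295607
malbec.arch.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
"""

NOT_RETURNED = re.compile(r"Minion did not return\. \[(.+)\]")


def unresponsive_minions(salt_output: str) -> dict[str, str]:
    """Map each minion ID to the reason it did not return (hypothetical helper)."""
    result: dict[str, str] = {}
    current = None  # minion ID seen on the previous line, if any
    for line in salt_output.splitlines():
        stripped = line.strip()
        match = NOT_RETURNED.fullmatch(stripped)
        if match and current is not None:
            result[current] = match.group(1)
            current = None
        elif stripped.endswith(":") and " " not in stripped:
            # Assumption: a minion ID line is a single token ending in ":".
            current = stripped[:-1]
        else:
            current = None
    return result


if __name__ == "__main__":
    for minion, reason in unresponsive_minions(SAMPLE).items():
        print(f"{minion}: {reason}")
```

Requiring the ID line to be a single token keeps the informational "run the following command:" line from being mistaken for a minion. Filtering the result for the ppc hosts would then give the subset that blocks ppc testing.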
- Due date set to 2023-07-27
Setting due date based on mean cycle time of SUSE QE Tools
- Assignee changed from livdywan to nicksinger
openqaworker-arm-{1,2,3} are online and execute jobs successfully. "qamaster" and therefore schort and grafana are reachable and working as well
- Description updated (diff)
- Description updated (diff)
- Description updated (diff)
If the AC works again, can ada please be powered back on? People have VMs over there…
- Copied to action #132947: Bring back ada.qe.suse.de and fix it properly added
- Related to action #132812: [alert] openqaw5-xen host up alert + infrastructure ping size:M added
malbec connected and running again
- Description updated (diff)
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved