action #132500
closed
NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M
0%
Description
Observation
Jobs on openqaworker-arm* are not running. See conversation in Slack.
Acceptance criteria
- AC1: openqaworker-arm*, openqaw5-xen, etc., are able to run jobs
- AC2: https://monitor.qa.suse.de is reachable again
- AC3: No unhandled alerts on https://monitor.qa.suse.de
Suggestions
- Follow up on the situation with SRV2
- If the climate issue is resolved, ensure OSD jobs are good again
- Ensure that monitor.qa.suse.de is good again
Rollback steps
- Unsilence https://monitor.qa.suse.de/alerting/silence/4a6e0759-792d-4b1b-b885-3a6d9d1928b8/edit?alertmanager=grafana once malbec is powered back on
- Add salt-keys for at least
  - malbec
  - openqaworker14.qa.suse.cz DONE: https://progress.opensuse.org/issues/131249#note-38
  - qesapworker-prg4.qa.suse.cz
Updated by okurz over 1 year ago
- Related to action #132488: gitlab CI shows showing no logs or are getting stuck (was: qem-bot sync aggregates gitlab CI job times out after 2h) size:M added
Updated by livdywan over 1 year ago
Also this Slack thread and live updates here.
Updated by osukup over 1 year ago
Also in SRV2 are qamaster.qa.suse.de, openqaworker5-xen, grenache, qanet, and the openqa-power-8 workers ...
Updated by okurz over 1 year ago
- Subject changed from aarch64 workers offline due to heat-related SRV2 shutdown to NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown
- Priority changed from Normal to Urgent
Updated by okurz over 1 year ago
- Tags set to infra, alert, outage, ac, heat, nue1, SRV2
Updated by osukup over 1 year ago
Because the GitLab workers are also in SRV2, all CI jobs time out --> set CI on openqaBot|bot-ng to inactive until the SRV2 situation is resolved
Updated by rfan1 over 1 year ago
- Related to action #132476: [qe-core][qem]test fails in samba_adcli, is DNS server down? added
Updated by okurz over 1 year ago
- Subject changed from NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown to NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M
- Description updated (diff)
Updated by okurz over 1 year ago
- Status changed from Feedback to In Progress
I wrote a message in https://suse.slack.com/archives/C02CANHLANP/p1689188645969769
@here The climate control outage in NUE1-SRV2 was mitigated and we could reinstate the barebone LSG QE infrastructure. Expect that multiple individual machines will still be off or show problems. We will be looking into each openQA-related machine as well as some other services. If you encounter other outage-related problems like unavailable or unreachable machines, please reach out.
Updated by okurz over 1 year ago
From OSD salt:
qesapworker-prg4.qa.suse.cz:
    Minion did not return. [Not connected]
QA-Power8-5-kvm.qa.suse.de:
    Minion did not return. [Not connected]
malbec.arch.suse.de:
    Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
    Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230712191308295607
powerqaworker-qam-1.qa.suse.de:
    Minion did not return. [Not connected]
QA-Power8-4-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
So those machines should be looked into, in particular the ppc ones, as without them we don't have any ppc-related testing.
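For triage, salt output like the above could be grouped by failure reason with a small script. This is only a minimal sketch, assuming the output format is exactly as pasted here (a minion ID line ending in a colon, followed by a "Minion did not return. [reason]" line). The function name `unresponsive_minions` and the `SAMPLE` text are made up for illustration and are not part of salt.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Matches the failure line that appears below each minion ID in the pasted output.
FAILURE = re.compile(r"^Minion did not return\. \[(?P<reason>[^\]]+)\]$")

# Hypothetical sample, shortened from the output pasted in this comment.
SAMPLE = """\
malbec.arch.suse.de:
    Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
    Minion did not return. [No response]
The minions may not have all finished running. To look up the return data, run the following command:
salt-run jobs.lookup_jid 20230712191308295607
QA-Power8-4-kvm.qa.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
"""


def unresponsive_minions(output: str) -> dict[str, list[str]]:
    """Group minion IDs by the reason given for the missing return."""
    by_reason: dict[str, list[str]] = defaultdict(list)
    current = None  # minion ID seen on the previous line, if any
    for raw in output.splitlines():
        line = raw.strip()
        # A minion ID line ends with ':' and contains no spaces;
        # this skips the interleaved "run the following command:" notice.
        if line.endswith(":") and " " not in line:
            current = line[:-1]
            continue
        match = FAILURE.match(line)
        if match and current is not None:
            by_reason[match.group("reason")].append(current)
        current = None  # any other line ends the current minion block
    return dict(by_reason)


if __name__ == "__main__":
    for reason, minions in unresponsive_minions(SAMPLE).items():
        print(f"{reason}: {', '.join(minions)}")
```

Grouping by reason separates machines that are most likely still powered off ("Not connected") from ones that are connected but did not answer in time ("No response"); that interpretation is an assumption on my side and should be checked per machine.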
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 1 year ago
openqaworker-arm-{1,2,3} are online and execute jobs successfully. "qamaster" and therefore schort and grafana are reachable and working as well
Updated by pdostal over 1 year ago
If the AC works again, can ada please be powered back on? People have VMs over there…
Updated by okurz over 1 year ago
- Copied to action #132947: Bring back ada.qe.suse.de and fix it properly added
Updated by okurz over 1 year ago
- Related to action #132812: [alert] openqaw5-xen host up alert + infrastructure ping size:M added
Updated by livdywan over 1 year ago
- Status changed from In Progress to Feedback
The Core Maintenance Updates seem good, no suspicious incompletes, no relevant silences. Anything else left?
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
cdywan wrote:
The Core Maintenance Updates seem good, no suspicious incompletes, no relevant silences. Anything else left?
Thanks for checking. No, we have all ACs covered - thanks for giving me a helping hand here!