action #132500
closed
NUE1-SRV2, .qa.suse.de, aarch64 workers offline due to heat-related SRV2 shutdown size:M
0%
Description
Observation
Jobs on openqaworker-arm* are not running. See conversation in Slack.
Acceptance criteria
- AC1: openqaworker-arm*, openqaw5-xen, etc., are able to run jobs
- AC2: https://monitor.qa.suse.de is reachable again
- AC3: No unhandled alerts on https://monitor.qa.suse.de
Suggestions
- Follow up on the situation with SRV2
- If the climate issue is resolved, ensure OSD jobs are good again
- Ensure that monitor.qa.suse.de is good again
Rollback steps
- Unsilence https://monitor.qa.suse.de/alerting/silence/4a6e0759-792d-4b1b-b885-3a6d9d1928b8/edit?alertmanager=grafana once malbec is powered back on
- Add salt-keys for at least
  - malbec
  - openqaworker14.qa.suse.cz DONE: https://progress.opensuse.org/issues/131249#note-38
  - qesapworker-prg4.qa.suse.cz
Updated by okurz 10 months ago
- Related to action #132488: gitlab CI shows showing no logs or are getting stuck (was: qem-bot sync aggregates gitlab CI job times out after 2h) size:M added
Updated by rfan1 10 months ago
- Related to action #132476: [qe-core][qem]test fails in samba_adcli, is DNS server down? added
Updated by okurz 10 months ago
- Status changed from Feedback to In Progress
I wrote a message in https://suse.slack.com/archives/C02CANHLANP/p1689188645969769
@here The climate control outage in NUE1-SRV2 was mitigated and we could reinstate the bare-bones LSG QE infrastructure. Expect that multiple individual machines will still be off or show problems. We will be looking into each openQA-related machine as well as some other services. If you encounter other outage-related problems such as unavailable or unreachable machines, please reach out.
Updated by okurz 10 months ago
From OSD salt:
qesapworker-prg4.qa.suse.cz:
Minion did not return. [Not connected]
QA-Power8-5-kvm.qa.suse.de:
Minion did not return. [Not connected]
malbec.arch.suse.de:
Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230712191308295607
powerqaworker-qam-1.qa.suse.de:
Minion did not return. [Not connected]
QA-Power8-4-kvm.qa.suse.de:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
So those machines should be looked into, in particular the ppc ones, as without them we don't have any ppc-related testing.
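To keep track of which machines still need attention, the salt output above can be reduced to a list of unresponsive minions. The following is only a minimal sketch, assuming the output has been saved as plain text in exactly the format shown above; the function name `unresponsive_minions` and the `SAMPLE` variable are hypothetical and not part of salt or any existing tooling.

```python
import re

# Hypothetical sample: the salt output quoted in this comment, pasted as plain text.
SAMPLE = """\
qesapworker-prg4.qa.suse.cz:
Minion did not return. [Not connected]
QA-Power8-5-kvm.qa.suse.de:
Minion did not return. [Not connected]
malbec.arch.suse.de:
Minion did not return. [Not connected]
openqaworker14.qa.suse.cz:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230712191308295607
powerqaworker-qam-1.qa.suse.de:
Minion did not return. [Not connected]
QA-Power8-4-kvm.qa.suse.de:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
"""


def unresponsive_minions(salt_output: str) -> dict[str, str]:
    """Map each minion that did not return to the reason printed in brackets."""
    failed = {}
    current = None
    for raw in salt_output.splitlines():
        line = raw.strip()
        # A minion header is a single token followed by a colon,
        # e.g. "malbec.arch.suse.de:". Lines containing spaces do not match.
        header = re.fullmatch(r"(\S+):", line)
        if header:
            current = header.group(1)
            continue
        reason = re.fullmatch(r"Minion did not return\. \[(.+)\]", line)
        if reason and current:
            failed[current] = reason.group(1)
        # Any non-header line ends the current minion's section.
        current = None
    return failed


result = unresponsive_minions(SAMPLE)
for minion, why in sorted(result.items()):
    print(f"{minion}: {why}")
```

Grouping by reason would separate machines that are simply powered off ("Not connected") from those that are connected but not answering ("No response"), which likely need different follow-up.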
Updated by openqa_review 10 months ago
- Due date set to 2023-07-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 10 months ago
openqaworker-arm-{1,2,3} are online and execute jobs successfully. "qamaster" and therefore schort and grafana are reachable and working as well
Updated by okurz 10 months ago
- Copied to action #132947: Bring back ada.qe.suse.de and fix it properly added
Updated by okurz 10 months ago
- Related to action #132812: [alert] openqaw5-xen host up alert + infrastructure ping size:M added
Updated by livdywan 10 months ago
- Status changed from In Progress to Feedback
The Core Maintenance Updates seem good, no suspicious incompletes, no relevant silences. Anything else left?
Updated by nicksinger 10 months ago
- Status changed from Feedback to Resolved
cdywan wrote:
The Core Maintenance Updates seem good, no suspicious incompletes, no relevant silences. Anything else left?
Thanks for checking. No, we have all ACs covered - thanks for giving me a helping hand here!