action #123028
A/C broken in TAM lab size:M (closed)
Description
Observation
The A/C is broken right now (see https://suse.slack.com/archives/C02CANHLANP/p1673525005712329) and we need to shut down our machines.
Acceptance criteria
- AC1: All machines moved to SRV2 work normally again (e.g. executing openQA jobs)
Suggestions
Rollback steps
- Make sure the A/C is working again
- Restore power to all relevant machines (once racktables is working again)
- Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480
Updated by livdywan over 1 year ago
- Blocks action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M added
Updated by livdywan over 1 year ago
- Assignee set to nicksinger
- Target version set to Ready
We're working on it together, but I'll still have to put one name there.
Updated by mkittler over 1 year ago
- Tags deleted (infra)
MR for disabling IPMI worker slots: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480
The o3 worker slot rebel:6 has also been disabled and openqaworker1 has been shut down completely.
Updated by nicksinger over 1 year ago
- Description updated (diff)
- Target version deleted (Ready)
Updated by openqa_review over 1 year ago
- Due date set to 2023-01-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Related to action #122458: O3 ipmi worker rebel:5 is broken size:M added
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
- Priority changed from Immediate to High
Updated by okurz over 1 year ago
- Status changed from Feedback to Blocked
I also created https://sd.suse.com/servicedesk/customer/portal/1/SD-109284 for what I think is "Facilities"
Updated by okurz over 1 year ago
- Tags changed from infra to infra, reactive work
Updated by okurz over 1 year ago
- Copied to action #123226: Temperature monitoring in SUSE QE lab size:M added
Updated by okurz over 1 year ago
- Related to action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M added
Updated by okurz over 1 year ago
- Due date deleted (2023-01-27)
- Status changed from Blocked to Feedback
The lab NUE-2.2.14-B was disassembled. Most machines are prepared for moving to FC labs. I moved all machines in racktables from the racks into the "storage" virtual rack: https://racktables.suse.de/index.php?page=row&tab=default&row_id=18094 ready for transport, see #119551. Other machines have been moved to server room 2, e.g. unarmed, openqaw5-xen, fozzie, quinn, amd-zen2-gpu-sut1 (o3), pi cluster.
We should confirm that all of the above-mentioned hosts are usable from their new location:
- unarmed: According to https://suse.slack.com/archives/C02CANHLANP/p1674553662285099 usable, confirmed by Felix Niederwanger
- openqaw5-xen: I confirmed that I could power it on and log in over ssh. Services are running fine and I could make progress with xen tests, although they eventually failed in a later step. I reported that as a separate issue in #123571
- fozzie: DONE, see #123028#note-22
- quinn: DONE, see #123028#note-20
- amd-zen2-gpu-sut1 (o3): DONE, see #123028#note-22
- pi cluster: Apparently usable as confirmed by #123493#note-8
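For hosts that accept SSH logins, the reachability part of this checklist could be scripted roughly as follows. This is only a sketch: the function names probe_host and report_hosts are made up for illustration, and apart from fozzie.qa.suse.de the host names in the usage comment are assumed to follow the same domain pattern. A successful login does not replace a passing openQA job on the machine.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if a non-interactive SSH login to "$1" works.
# -n keeps ssh from reading stdin, BatchMode avoids password prompts and
# ConnectTimeout keeps an unreachable host from stalling the loop.
probe_host() {
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Hypothetical helper: print one status line per host given as argument.
report_hosts() {
    for host in "$@"; do
        if probe_host "$host"; then
            echo "$host: reachable"
        else
            echo "$host: NOT reachable"
        fi
    done
}

# Example call (host names other than fozzie.qa.suse.de are assumptions):
# report_hosts fozzie.qa.suse.de quinn.qa.suse.de
```

Machines that are only controlled over IPMI would need a different probe, so probe_host is kept separate and can be swapped out.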
Updated by nicksinger over 1 year ago
- Subject changed from A/C broken in TAM lab to A/C broken in TAM lab size:M
- Description updated (diff)
- Status changed from Feedback to In Progress
Updated by okurz over 1 year ago
Related MR https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/484 has been merged. Please also enable amd-zen2-gpu-sut1 (o3) and verify that all machines are usable by having corresponding openQA jobs run on them, either by triggering jobs explicitly for those machines or by waiting for regularly scheduled jobs to be picked up by them.
Updated by livdywan over 1 year ago
- Due date changed from 2023-02-03 to 2023-02-10
Bumping due date due to hackweek.
Updated by xlai over 1 year ago
- Related to action #123984: [boot][pxe][sut] Machine fozzie can not boot from pxe added
Updated by nicksinger over 1 year ago
quinn can execute jobs successfully: https://openqa.suse.de/tests/10432865
Updated by nicksinger over 1 year ago
Verification run on o3 for amd-zen2-gpu-sut1: https://openqa.opensuse.org/tests/3103141 after @okurz started the worker instance on rebel:6 again.
Updated by okurz over 1 year ago
For fozzie I manually added a worker class for grenache-1:14 pointing to fozzie.qa.suse.de and then ran:
openqa-clone-job --within-instance https://openqa.suse.de/tests/10288644 WORKER_CLASS=fozzie.qa.suse.de SCHEDULE=tests/boot/boot_from_pxe BUILD= _GROUP=0 TEST=okurz_poo123028_fozzie_verification
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/488 merged for fozzie.
Created job #10443245: sle-15-SP5-Online-x86_64-Build64.1-prj4_guest_upgrade_sles12sp5_on_sles12sp5-kvm@64bit-ipmi -> https://openqa.suse.de/t10443245
For amd-zen2-gpu-sut1 I ran: systemctl enable openqa-worker-auto-restart@6.service
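The clone command used for fozzie could be reused for other machines along these lines. This is a sketch under assumptions: verification_clone_cmd is a hypothetical helper, and it only prints the command instead of running it, so the result can be reviewed first. The settings are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an openqa-clone-job command that
# pins a cloned job to one worker class, following the fozzie example.
# Arguments: job URL, worker class, schedule, test name.
verification_clone_cmd() {
    job_url=$1
    worker_class=$2
    schedule=$3
    test_name=$4
    # BUILD= and _GROUP=0 are passed as in the fozzie command, so the clone
    # gets an empty build and is not assigned to a job group.
    printf '%s\n' "openqa-clone-job --within-instance $job_url WORKER_CLASS=$worker_class SCHEDULE=$schedule BUILD= _GROUP=0 TEST=$test_name"
}

# Reproduces the command used for fozzie:
verification_clone_cmd https://openqa.suse.de/tests/10288644 \
    fozzie.qa.suse.de tests/boot/boot_from_pxe \
    okurz_poo123028_fozzie_verification
```

Printing first and executing only after a look at the output avoids scheduling jobs against the wrong worker class by accident.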
Updated by okurz over 1 year ago
- Due date deleted (2023-02-10)
- Status changed from In Progress to Resolved
We confirmed all rollback steps are done.
Updated by openqa_review over 1 year ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: jeos-containers-docker@svirt-vmware65
https://openqa.suse.de/tests/10654219#step/bootloader_svirt/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
Deleted the reference in the last failed job after seeing that the scenario can succeed: https://openqa.suse.de/tests/10730677 and confirming that it was indeed a different issue: https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1678699736.004789