A/C broken in TAM lab size:M
The A/C is broken right now (see https://suse.slack.com/archives/C02CANHLANP/p1673525005712329) and we need to shut down our machines.
- AC1: All machines moved to SRV2 work normally again (e.g. executing openQA jobs)
- Make sure the A/C is working again
- Restore power to all relevant machines (once racktables is working again)
- Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480
#1 Updated by cdywan 2 months ago
- Blocks action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M added
#3 Updated by mkittler 2 months ago
- Tags deleted
MR for disabling IPMI worker slots: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480
The o3 worker slot rebel:6 has also been disabled and openqaworker1 has been shut down completely.
#4 Updated by nicksinger 2 months ago
- Description updated (diff)
- Target version deleted
#5 Updated by nicksinger 2 months ago
- Target version set to Ready
#7 Updated by openqa_review 2 months ago
- Due date set to 2023-01-27
Setting due date based on mean cycle time of SUSE QE Tools
#8 Updated by okurz 2 months ago
- Related to action #122458: O3 ipmi worker rebel:5 is broken size:M added
#10 Updated by okurz 2 months ago
- Status changed from Feedback to Blocked
I also created https://sd.suse.com/servicedesk/customer/portal/1/SD-109284 for what I think is "Facilities"
#12 Updated by okurz 2 months ago
- Copied to action #123226: Temperature monitoring in SUSE QE lab size:M added
#13 Updated by okurz 2 months ago
- Related to action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M added
#14 Updated by okurz 2 months ago
- Due date deleted
- Status changed from Blocked to Feedback
The lab NUE-2.2.14-B was disassembled. Most machines are prepared for moving to FC labs. I moved all machines in racktables from the racks into the "storage" virtual rack: https://racktables.suse.de/index.php?page=row&tab=default&row_id=18094 ready for transport, see #119551. Other machines have been moved to server room 2, e.g. unarmed, openqaw5-xen, fozzie, quinn, amd-zen2-gpu-sut1 (o3), pi cluster.
We should confirm that all above-mentioned hosts are usable from their new location:
- unarmed: According to https://suse.slack.com/archives/C02CANHLANP/p1674553662285099 usable, confirmed by Felix Niederwanger
- openqaw5-xen: I confirmed that I could power it on and log in over ssh. Services are running fine and I could progress with xen tests, although they eventually failed in another step. I reported that as a separate issue, #123571
- fozzie: DONE, see #123028#note-22
- quinn: DONE, see #123028#note-20
- amd-zen2-gpu-sut1 (o3): DONE, see #123028#note-22
- pi cluster: Apparently usable as confirmed by #123493#note-8
#16 Updated by nicksinger 2 months ago
- Subject changed from A/C broken in TAM lab to A/C broken in TAM lab size:M
- Description updated (diff)
- Status changed from Feedback to In Progress
#17 Updated by okurz about 2 months ago
Related MR https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/484, merged. Please also enable the use of amd-zen2-gpu-sut1 (o3) and ensure all machines are usable by having corresponding openQA jobs run on them, either by triggering jobs explicitly for those machines or by waiting for regularly scheduled jobs to be picked up by them.
#18 Updated by cdywan about 2 months ago
- Due date changed from 2023-02-03 to 2023-02-10
Bumping due date due to hackweek.
#19 Updated by xlai about 2 months ago
- Related to action #123984: [boot][pxe][sut] Machine fozzie can not boot from pxe added
#20 Updated by nicksinger about 2 months ago
quinn can execute jobs successfully: https://openqa.suse.de/tests/10432865
#21 Updated by nicksinger about 2 months ago
Verification run on o3 for amd-zen2-gpu-sut1: https://openqa.opensuse.org/tests/3103141, after okurz started the worker instance on rebel:6 again.
#22 Updated by okurz about 2 months ago
For fozzie I manually added a worker class for grenache-1:14 pointing to fozzie.qa.suse.de and then ran:
openqa-clone-job --within-instance https://openqa.suse.de/tests/10288644 WORKER_CLASS=fozzie.qa.suse.de SCHEDULE=tests/boot/boot_from_pxe BUILD= _GROUP=0 TEST=okurz_poo123028_fozzie_verification
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/488 merged for fozzie.
Created job #10443245: sle-15-SP5-Online-x86_64-Build64.1-prj4_guest_upgrade_sles12sp5_on_sles12sp5-kvm@64bit-ipmi -> https://openqa.suse.de/t10443245
For amd-zen2-gpu-sut1 I ran:
systemctl enable firstname.lastname@example.org
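The fozzie verification above pins a cloned job to a single worker class so that it can only be picked up by the machine under test. A minimal sketch of the same call as a reusable dry-run helper follows. The function name `clone_for_verification` and its argument order are hypothetical; the `openqa-clone-job` arguments are copied from the command above, and the helper only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): print an openqa-clone-job call
# that pins a cloned job to one worker class for a verification run.
# Arguments: <job URL> <worker class> <test name>
clone_for_verification() {
    job_url=$1
    worker_class=$2
    test_name=$3
    # Dry run: only print the command; the settings mirror the fozzie call above.
    printf '%s\n' "openqa-clone-job --within-instance $job_url WORKER_CLASS=$worker_class SCHEDULE=tests/boot/boot_from_pxe BUILD= _GROUP=0 TEST=$test_name"
}

# Values taken from the fozzie verification in this comment.
clone_for_verification https://openqa.suse.de/tests/10288644 \
    fozzie.qa.suse.de okurz_poo123028_fozzie_verification
```

Printing rather than executing keeps the sketch safe to try on a host without openQA client tooling; the printed line can be reviewed and pasted once the worker class is known to exist.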
#23 Updated by okurz about 2 months ago
- Due date deleted
- Status changed from In Progress to Resolved
We confirmed all rollback steps are done.
#24 Updated by openqa_review 17 days ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: jeos-containers-docker@svirt-vmware65
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
#25 Updated by nicksinger 6 days ago
- Status changed from Feedback to Resolved
Deleted the reference in the last failed job after seeing that the scenario can succeed (https://openqa.suse.de/tests/10730677) and confirming that it was indeed a different issue: https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1678699736.004789