action #123028

A/C broken in TAM lab size:M

Added by nicksinger 2 months ago. Updated 6 days ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2023-01-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

The A/C is broken right now (see https://suse.slack.com/archives/C02CANHLANP/p1673525005712329) and we need to shut down our machines.

Acceptance criteria

  • AC1: All machines moved to SRV2 work normally again (e.g. executing openQA jobs)

Suggestions

Rollback steps

  1. Make sure the A/C is working again
  2. Restore power to all relevant machines (once racktables is working again)
  3. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480

Related issues

Related to openQA Project - action #122458: O3 ipmi worker rebel:5 is broken size:M (Resolved, 2022-12-26)

Related to QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M (Resolved, 2023-03-10, 2023-03-25)

Related to openQA Infrastructure - action #123984: [boot][pxe][sut] Machine fozzie can not boot from pxe (Resolved, 2023-02-07)

Blocks openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M (Feedback, 2023-01-02)

Copied to openQA Infrastructure - action #123226: Temperature monitoring in SUSE QE lab size:M (New)

History

#1 Updated by cdywan 2 months ago

  • Blocks action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M added

#2 Updated by cdywan 2 months ago

  • Assignee set to nicksinger
  • Target version set to Ready

We're working on it together, but I'll still have to put one name there.

#3 Updated by mkittler 2 months ago

  • Tags deleted (infra)

MR for disabling IPMI worker slots: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480
The o3 worker slot rebel:6 has also been disabled and openqaworker1 has been shut down completely.

#4 Updated by nicksinger 2 months ago

  • Description updated (diff)
  • Target version deleted (Ready)

#5 Updated by nicksinger 2 months ago

  • Target version set to Ready

#6 Updated by mkittler 2 months ago

  • Tags set to infra

#7 Updated by openqa_review 2 months ago

  • Due date set to 2023-01-27

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by okurz 2 months ago

  • Related to action #122458: O3 ipmi worker rebel:5 is broken size:M added

#9 Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Immediate to High

#10 Updated by okurz 2 months ago

  • Status changed from Feedback to Blocked

I also created https://sd.suse.com/servicedesk/customer/portal/1/SD-109284 for what I think is "Facilities"

#11 Updated by okurz 2 months ago

  • Tags changed from infra to infra, reactive work

#12 Updated by okurz 2 months ago

  • Copied to action #123226: Temperature monitoring in SUSE QE lab size:M added

#13 Updated by okurz 2 months ago

  • Related to action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M added

#14 Updated by okurz 2 months ago

  • Due date deleted (2023-01-27)
  • Status changed from Blocked to Feedback

The lab NUE-2.2.14-B was disassembled. Most machines are prepared for moving to FC labs. I moved all machines in racktables from the racks into the "storage" virtual rack: https://racktables.suse.de/index.php?page=row&tab=default&row_id=18094 ready for transport, see #119551. Other machines have been moved to server room 2, e.g. unarmed, openqaw5-xen, fozzie, quinn, amd-zen2-gpu-sut1 (o3), pi cluster.

We should confirm that all of the above-mentioned hosts are usable from their new location.

#15 Updated by okurz 2 months ago

  • Due date set to 2023-02-03

#16 Updated by nicksinger 2 months ago

  • Subject changed from A/C broken in TAM lab to A/C broken in TAM lab size:M
  • Description updated (diff)
  • Status changed from Feedback to In Progress

#17 Updated by okurz about 2 months ago

Related MR https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/484 has been merged. Please also enable the use of amd-zen2-gpu-sut1 (o3) and ensure all machines are usable by confirming that openQA jobs run on them, either jobs triggered explicitly for those machines or regular automatic jobs that get scheduled on them anyway.

#18 Updated by cdywan about 2 months ago

  • Due date changed from 2023-02-03 to 2023-02-10

Bumping due date due to hackweek.

#19 Updated by xlai about 2 months ago

  • Related to action #123984: [boot][pxe][sut] Machine fozzie can not boot from pxe added

#20 Updated by nicksinger about 2 months ago

quinn can execute jobs successfully: https://openqa.suse.de/tests/10432865

#21 Updated by nicksinger about 2 months ago

Verification run on o3 for amd-zen2-gpu-sut1: https://openqa.opensuse.org/tests/3103141, done after okurz started the worker instance on rebel:6 again.

#22 Updated by okurz about 2 months ago

For fozzie I manually added a worker class for grenache-1:14 pointing to fozzie.qa.suse.de and then did

openqa-clone-job --within-instance https://openqa.suse.de/tests/10288644 WORKER_CLASS=fozzie.qa.suse.de SCHEDULE=tests/boot/boot_from_pxe BUILD= _GROUP=0 TEST=okurz_poo123028_fozzie_verification

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/488 merged for fozzie.

Created job #10443245: sle-15-SP5-Online-x86_64-Build64.1-prj4_guest_upgrade_sles12sp5_on_sles12sp5-kvm@64bit-ipmi -> https://openqa.suse.de/t10443245

For amd-zen2-gpu-sut1 I ran systemctl enable openqa-worker-auto-restart@6.service
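
The fozzie verification above could be repeated for other moved hosts with a small helper. The following is only a sketch: the function name clone_cmd_for_host, the TEST naming scheme, and the choice to print the command instead of running it are assumptions made for illustration, not part of the ticket. It also omits the SCHEDULE setting used for fozzie, and it presumes a worker class matching the host name has already been configured as described above.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not execute) an openqa-clone-job command
# that pins a cloned job to one specific host via WORKER_CLASS.
clone_cmd_for_host() {
    host="$1"          # fully qualified host name, used as the worker class
    job_url="$2"       # URL of an existing job to clone
    name="${host%%.*}" # short host name, e.g. "fozzie"
    # BUILD= and _GROUP=0 keep the verification job out of regular builds
    # and job groups, mirroring the manual command used for fozzie.
    printf 'openqa-clone-job --within-instance %s WORKER_CLASS=%s BUILD= _GROUP=0 TEST=poo123028_%s_verification\n' \
        "$job_url" "$host" "$name"
}

# Dry run: show the command for fozzie, based on the job cloned above.
clone_cmd_for_host fozzie.qa.suse.de https://openqa.suse.de/tests/10288644
```

Printing the command first allows a review before anything is scheduled; piping the output to sh would then run it.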

#23 Updated by okurz about 2 months ago

  • Due date deleted (2023-02-10)
  • Status changed from In Progress to Resolved

We confirmed all rollback steps are done.

#24 Updated by openqa_review 17 days ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-containers-docker@svirt-vmware65
https://openqa.suse.de/tests/10654219#step/bootloader_svirt/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#25 Updated by nicksinger 6 days ago

  • Status changed from Feedback to Resolved

Deleted the reference in the last failed job after seeing that the scenario can succeed (https://openqa.suse.de/tests/10730677) and confirming that it was indeed a different issue (https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1678699736.004789).
