action #123028

closed

A/C broken in TAM lab size:M

Added by nicksinger almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-01-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

The A/C is broken right now (see https://suse.slack.com/archives/C02CANHLANP/p1673525005712329) and we need to shut down our machines.

Acceptance criteria

  • AC1: All machines moved to SRV2 work normally again (e.g. executing openQA jobs)

Suggestions

Rollback steps

  1. Make sure the A/C is working again
  2. Restore power to all relevant machines (once racktables is working again)
  3. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480

Related issues 5 (1 open, 4 closed)

Related to openQA Project - action #122458: O3 ipmi worker rebel:5 is broken size:M (Resolved, mkittler, 2022-12-26)
Related to QA - action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M (Resolved, nicksinger, 2023-03-10)
Related to openQA Infrastructure - action #123984: [boot][pxe][sut] Machine fozzie can not boot from pxe (Resolved, okurz, 2023-02-07)
Blocks openQA Infrastructure - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M (Resolved, mkittler, 2023-01-02, 2023-05-12)
Copied to openQA Infrastructure - action #123226: Temperature monitoring in SUSE QE lab size:M (New)

Actions #1

Updated by livdywan almost 2 years ago

  • Blocks action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M added
Actions #2

Updated by livdywan almost 2 years ago

  • Assignee set to nicksinger
  • Target version set to Ready

We're working on it together, but I'll still have to put one name there.

Actions #3

Updated by mkittler almost 2 years ago

  • Tags deleted (infra)

MR for disabling IPMI worker slots: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/480
The o3 worker slot rebel:6 has also been disabled and openqaworker1 has been shut down completely.

Actions #4

Updated by nicksinger almost 2 years ago

  • Description updated (diff)
  • Target version deleted (Ready)
Actions #5

Updated by nicksinger almost 2 years ago

  • Target version set to Ready
Actions #6

Updated by mkittler almost 2 years ago

  • Tags set to infra
Actions #7

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-01-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by okurz almost 2 years ago

  • Related to action #122458: O3 ipmi worker rebel:5 is broken size:M added
Actions #9

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from Immediate to High
Actions #10

Updated by okurz almost 2 years ago

  • Status changed from Feedback to Blocked

I also created https://sd.suse.com/servicedesk/customer/portal/1/SD-109284 for what I think is "Facilities"

Actions #11

Updated by okurz almost 2 years ago

  • Tags changed from infra to infra, reactive work
Actions #12

Updated by okurz almost 2 years ago

  • Copied to action #123226: Temperature monitoring in SUSE QE lab size:M added
Actions #13

Updated by okurz almost 2 years ago

  • Related to action #119551: Move QA labs NUE-2.2.14-B to Frankencampus labs - bare-metal openQA workers size:M added
Actions #14

Updated by okurz almost 2 years ago

  • Due date deleted (2023-01-27)
  • Status changed from Blocked to Feedback

The lab NUE-2.2.14-B was disassembled. Most machines are prepared for moving to FC labs. I moved all machines in racktables from the racks into the "storage" virtual rack: https://racktables.suse.de/index.php?page=row&tab=default&row_id=18094 ready for transport, see #119551. Other machines have been moved to server room 2, e.g. unarmed, openqaw5-xen, fozzie, quinn, amd-zen2-gpu-sut1 (o3), pi cluster.

We should confirm that all above mentioned hosts are usable from their new location:
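One way to keep track of that confirmation is a small per-host checklist. The following is only a minimal sketch: the helper names (`new_checklist`, `record_verification`, `pending_hosts`) and the example job URL are hypothetical, and the host list is taken from the comment above (the pi cluster is left out because it is not a single host name).

```python
from typing import Dict, List, Optional

# Hosts named in the comment above as moved to server room 2.
# The "pi cluster" is omitted because no individual host names are given.
RELOCATED_HOSTS = ["unarmed", "openqaw5-xen", "fozzie", "quinn", "amd-zen2-gpu-sut1"]


def new_checklist(hosts: List[str]) -> Dict[str, Optional[str]]:
    """Map every host to None, meaning no verification job is recorded yet."""
    return {host: None for host in hosts}


def record_verification(checklist: Dict[str, Optional[str]], host: str, job_url: str) -> None:
    """Store the URL of a job that ran successfully on the given host."""
    if host not in checklist:
        raise KeyError(f"unknown host: {host}")
    checklist[host] = job_url


def pending_hosts(checklist: Dict[str, Optional[str]]) -> List[str]:
    """Return the hosts that still lack a recorded verification job."""
    return [host for host, job_url in checklist.items() if job_url is None]


checklist = new_checklist(RELOCATED_HOSTS)
# Hypothetical job URL, used purely for illustration.
record_verification(checklist, "quinn", "https://openqa.example/tests/1")
print(pending_hosts(checklist))
```

The ticket could be considered verified once `pending_hosts` returns an empty list.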

Actions #15

Updated by okurz almost 2 years ago

  • Due date set to 2023-02-03
Actions #16

Updated by nicksinger almost 2 years ago

  • Subject changed from A/C broken in TAM lab to A/C broken in TAM lab size:M
  • Description updated (diff)
  • Status changed from Feedback to In Progress
Actions #17

Updated by okurz almost 2 years ago

Related MR https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/484, merged. Please also enable the use of amd-zen2-gpu-sut1 (o3) and ensure all machines are usable by having corresponding openQA jobs run on them, either by triggering jobs explicitly for those machines or by waiting for regular automatic jobs to be picked up by them.

Actions #18

Updated by livdywan almost 2 years ago

  • Due date changed from 2023-02-03 to 2023-02-10

Bumping due date due to hackweek.

Actions #19

Updated by xlai over 1 year ago

  • Related to action #123984: [boot][pxe][sut] Machine fozzie can not boot from pxe added
Actions #20

Updated by nicksinger over 1 year ago

quinn can execute jobs successfully: https://openqa.suse.de/tests/10432865

Actions #21

Updated by nicksinger over 1 year ago

Verification run on o3 for amd-zen2-gpu-sut1: https://openqa.opensuse.org/tests/3103141, after @okurz started the worker instance on rebel:6 again.
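Verification jobs like the ones linked above could also be checked programmatically instead of by opening the web UI. This is a sketch under assumptions: it relies on the openQA REST route `/api/v1/jobs/<id>` returning a JSON object with a `job` entry that carries `state` and `result` fields, which matches my understanding of the API but is not confirmed anywhere in this ticket. The helper names are hypothetical, and openqa.suse.de is probably only reachable from the internal network.

```python
import json
import re
import urllib.request


def job_api_url(test_url: str) -> str:
    """Turn a web UI job URL (.../tests/<id>) into the assumed API URL."""
    match = re.fullmatch(r"(https?://[^/]+)/tests/(\d+)", test_url)
    if match is None:
        raise ValueError(f"not an openQA job URL: {test_url}")
    return f"{match.group(1)}/api/v1/jobs/{match.group(2)}"


def job_passed(payload: dict) -> bool:
    """Return True if the payload describes a finished job that passed.

    Assumes the shape {"job": {"state": ..., "result": ...}}.
    """
    job = payload.get("job", {})
    return job.get("state") == "done" and job.get("result") == "passed"


def fetch_job(test_url: str, timeout: float = 30.0) -> dict:
    """Download the job description; requires network access to the instance."""
    with urllib.request.urlopen(job_api_url(test_url), timeout=timeout) as response:
        return json.load(response)


print(job_api_url("https://openqa.opensuse.org/tests/3103141"))
```

With network access, `job_passed(fetch_job("https://openqa.opensuse.org/tests/3103141"))` would report whether the verification run passed.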

Actions #22

Updated by okurz over 1 year ago

For fozzie I manually added a worker class for grenache-1:14 pointing to fozzie.qa.suse.de and then did

openqa-clone-job --within-instance https://openqa.suse.de/tests/10288644 WORKER_CLASS=fozzie.qa.suse.de SCHEDULE=tests/boot/boot_from_pxe BUILD= _GROUP=0 TEST=okurz_poo123028_fozzie_verification

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/488 merged for fozzie.

Created job #10443245: sle-15-SP5-Online-x86_64-Build64.1-prj4_guest_upgrade_sles12sp5_on_sles12sp5-kvm@64bit-ipmi -> https://openqa.suse.de/t10443245

For amd-zen2-gpu-sut1 did systemctl enable openqa-worker-auto-restart@6.service
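For repeating this kind of verification on other machines, the clone command used for fozzie can be assembled from a job URL plus a set of setting overrides. The sketch below only builds and prints the command line; `clone_job_command` is a hypothetical helper, while the `openqa-clone-job` invocation, the `--within-instance` flag and the settings are copied from the command above.

```python
import shlex
from typing import Dict, List


def clone_job_command(job_url: str, overrides: Dict[str, str]) -> List[str]:
    """Build the argument list for cloning a job with overridden settings."""
    args = ["openqa-clone-job", "--within-instance", job_url]
    # Settings are passed as KEY=VALUE; an empty value (e.g. BUILD=) is kept as is.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# Overrides taken from the fozzie verification command in this comment.
FOZZIE_OVERRIDES = {
    "WORKER_CLASS": "fozzie.qa.suse.de",
    "SCHEDULE": "tests/boot/boot_from_pxe",
    "BUILD": "",
    "_GROUP": "0",
    "TEST": "okurz_poo123028_fozzie_verification",
}

command = clone_job_command("https://openqa.suse.de/tests/10288644", FOZZIE_OVERRIDES)
# Print a shell-safe rendering; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Keeping the command as a list avoids shell quoting problems if it is later executed with `subprocess.run(command, check=True)`.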

Actions #23

Updated by okurz over 1 year ago

  • Due date deleted (2023-02-10)
  • Status changed from In Progress to Resolved

We confirmed all rollback steps are done.

Actions #24

Updated by openqa_review over 1 year ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-containers-docker@svirt-vmware65
https://openqa.suse.de/tests/10654219#step/bootloader_svirt/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #25

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Resolved

Deleted the reference in the last failed job after seeing that the scenario can succeed: https://openqa.suse.de/tests/10730677 and confirming that it was indeed a different issue: https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1678699736.004789
