action #121282

closed

Recover storage.qa.suse.de size:S

Added by mkittler over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2022-12-01
Due date:
% Done: 0%
Estimated time:

Description

Observation

The server cannot be reached via SSH and the host-up alert in our monitoring fired (paused it for now). I wasn't able to reach the IPMI host (via jumpy@qe-jumpy.suse.de as documented in pillars).

Acceptance criteria

  • AC1: Machine is racked again
  • AC2: Racktables is updated including MAC/connections/rack
  • AC3: No related alerts for that machine are firing

Rollback steps


Related issues 4 (0 open, 4 closed)

  • Related to openQA Infrastructure - action #88546: Make use of the new "Storage Server", e.g. complete OSD backup (Resolved, okurz)
  • Related to QA - action #120267: Conduct the migration of openqa-ses aka. "storage.qa.suse.de" size:M (Resolved, mkittler, 2022-09-15)
  • Related to openQA Infrastructure - action #69577: Handle installation of the new "Storage Server" (Resolved, nicksinger, 2020-08-04)
  • Related to openQA Infrastructure - action #123082: backup of o3 to storage.qa.suse.de was not conducted by rsnapshot since 2021-12 size:M (Resolved, mkittler, 2023-01-13, 2023-02-24)

Actions #1

Updated by okurz over 1 year ago

  • Tags changed from alert to alert, reactive work
Actions #2

Updated by dheidler over 1 year ago

Unable to find that hostname in racktables:
https://racktables.nue.suse.com/?page=search&last_page=search&last_tab=default&q=storage.qa.suse.de

Does anyone know where this machine is located?

Actions #3

Updated by okurz over 1 year ago

  • Related to action #88546: Make use of the new "Storage Server", e.g. complete OSD backup added
Actions #4

Updated by okurz over 1 year ago

  • Related to action #120267: Conduct the migration of openqa-ses aka. "storage.qa.suse.de" size:M added
Actions #5

Updated by okurz over 1 year ago

funny story: https://infra.nue.suse.com/SelfService/Display.html?id=175645#txn-2575010

Guess where the racktables link leads to: https://racktables.suse.de/index.php?page=object&tab=default&object_id=13558 "openqa-ses", the seemingly "orphaned" machine whose purpose we didn't know :D

So next task: Decide if we should move openqa-ses back to SRV1 or find a new home at FC

Actions #6

Updated by mkittler over 1 year ago

I suppose having it in FC will be fine. That means we'd likely mount it somewhere in SRV2 or the lab on the 2nd floor until FC is ready.

Where's the machine now, btw?

Next time we could at least check what services run on a machine before pulling the plug. In this case it would have been very obvious that it's just the storage server.

Actions #7

Updated by nicksinger over 1 year ago

  • Tags changed from alert, reactive work to alert, next-office-day

I've updated https://racktables.suse.de/index.php?page=object&tab=edit&object_id=13558 to reflect the actual FQDN by which we know the host. I also added the serial number from the attached picture so we can avoid this issue in the future by just looking up the serial number.

Actions #8

Updated by nicksinger over 1 year ago

  • Subject changed from Recover storage.qa.suse.de to Recover storage.qa.suse.de size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by nicksinger over 1 year ago

  • Related to action #69577: Handle installation of the new "Storage Server" added
Actions #10

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #11

Updated by okurz over 1 year ago

  • Assignee set to dheidler

At the next opportunity, someone near Nuremberg please create an EngInfra ticket via sd.suse.com/ with the Jira SD group "OSD Admins" added, make an appointment to take the machine from the 2.2.14 (TAM) QA lab, where it is currently located, bring it back to SRV1 and mount it where it was, update racktables and make sure the machine is reachable. Migrating the machine into the new network zone should be done in #120267.

As discussed in daily 2022-12-07 dheidler will pick this up.

Actions #12

Updated by dheidler over 1 year ago

  • Status changed from Workable to Blocked
Actions #13

Updated by okurz over 1 year ago

  • Priority changed from Urgent to Normal

In the ticket, gschlotter informed us that they likely won't be at Maxtorhof anymore this year, so we will have to wait.

Actions #14

Updated by okurz over 1 year ago

  • Tags changed from alert, next-office-day, infra to alert, next-office-day, infra, reactive work
Actions #15

Updated by nicksinger over 1 year ago

  • Status changed from Blocked to In Progress

We plugged the machine back in where power8 was previously sitting. Gerhard configured the switch ports to be in VLAN2 again. IPMI and the OS are running and pingable again, and racktables is updated.
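
A minimal sketch of how the "reachable again" check could be scripted, assuming SSH listens on the default port 22 (the ticket does not state the port). The helper name `tcp_reachable` is hypothetical, and this is not the monitoring check that fired the host-up alert, just a quick manual probe.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumed default, not taken from the ticket.
    """
    try:
        # create_connection resolves the name and attempts a TCP connect;
        # the socket is closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False
```

Calling `tcp_reachable("storage.qa.suse.de")` from a host inside the same network would then report whether the SSH port accepts connections. It says nothing about IPMI, which would need a separate check against the IPMI hostname.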

Actions #16

Updated by nicksinger over 1 year ago

alerts enabled again. Checking in 5m if they come up green

Actions #17

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Resolved

alerts are good again.

Actions #18

Updated by nicksinger over 1 year ago

  • Assignee changed from dheidler to nicksinger
Actions #19

Updated by okurz over 1 year ago

Great work on resolving this and taking care of all the mentioned alerts, appreciated. Now we can look into the previously blocked #120267.

Actions #20

Updated by okurz over 1 year ago

  • Related to action #123082: backup of o3 to storage.qa.suse.de was not conducted by rsnapshot since 2021-12 size:M added