action #121282
closed
Recover storage.qa.suse.de size:S
0%
Description
Observation
The server cannot be reached via SSH and the host-up alert in our monitoring fired (I paused it for now). I also wasn't able to reach the IPMI host (via jumpy@qe-jumpy.suse.de as documented in pillars).
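The SSH reachability check from the observation can be reproduced with a short script. This is only a minimal sketch: it assumes SSH listens on the default port 22, and the helper name `is_port_open` is hypothetical, not part of any existing tooling.

```python
import socket


def is_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts
        return False


if __name__ == "__main__":
    host = "storage.qa.suse.de"
    # Port 22 is an assumption (default SSH port)
    state = "reachable" if is_port_open(host) else "NOT reachable"
    print(f"{host}: SSH port 22 is {state}")
```

A successful TCP connect only shows that something is listening on the port; it does not prove that an SSH login works.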
Acceptance criteria
- AC1: Machine is racked again
- AC2: Racktables is updated including mac/connections/rack
- AC3: No related alerts for that machine are firing
Rollback steps
- Enable the host-up alert for storage again
- Enable "Packet loss between worker hosts and other hosts" https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?editPanel=4&tab=alert&orgId=1 again
Updated by okurz almost 2 years ago
- Tags changed from alert to alert, reactive work
Updated by dheidler almost 2 years ago
Unable to find that hostname in racktables:
https://racktables.nue.suse.com/?page=search&last_page=search&last_tab=default&q=storage.qa.suse.de
Does anyone know where this machine is located?
Updated by okurz almost 2 years ago
- Related to action #88546: Make use of the new "Storage Server", e.g. complete OSD backup added
Updated by okurz almost 2 years ago
- Related to action #120267: Conduct the migration of openqa-ses aka. "storage.qa.suse.de" size:M added
Updated by okurz almost 2 years ago
funny story: https://infra.nue.suse.com/SelfService/Display.html?id=175645#txn-2575010
guess where the racktables link leads to: https://racktables.suse.de/index.php?page=object&tab=default&object_id=13558 "openqa-ses", the seemingly "orphaned" machine whose purpose we didn't know :D
So next task: Decide if we should move openqa-ses back to SRV1 or find a new home at FC
Updated by mkittler almost 2 years ago
I suppose having it in FC will be fine. That means we'd likely mount it somewhere in SRV2 or the lab on the 2nd floor until FC is ready.
Where's the machine now, btw?
Next time we could at least check what services run on a machine before pulling the plug. In this case it would have been very obvious that it's just the storage server.
Updated by nicksinger almost 2 years ago
- Tags changed from alert, reactive work to alert, next-office-day
I've updated https://racktables.suse.de/index.php?page=object&tab=edit&object_id=13558 to reflect the actual FQDN we know the host by. I also added the serial number from the attached picture, so we can avoid this issue in the future by just looking up the serial number.
Updated by nicksinger almost 2 years ago
- Subject changed from Recover storage.qa.suse.de to Recover storage.qa.suse.de size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger almost 2 years ago
- Related to action #69577: Handle installation of the new "Storage Server" added
Updated by okurz almost 2 years ago
- Assignee set to dheidler
At the next opportunity, someone near Nuremberg please create an EngInfra ticket over sd.suse.com/ with the Jira SD group "OSD Admins" added, make an appointment to take the machine from the 2.2.14 (TAM) QA lab location where it is currently located, bring it back to SRV1 and mount it back where it was, update racktables and make sure the machine is reachable. Migrating the machine into the new network zone should be done in #120267
As discussed in daily 2022-12-07 dheidler will pick this up.
Updated by dheidler almost 2 years ago
- Status changed from Workable to Blocked
Updated by okurz over 1 year ago
- Priority changed from Urgent to Normal
In the ticket gschlotter informed us that they likely won't be at Maxtorhof anymore this year, so we will have to wait
Updated by okurz over 1 year ago
- Tags changed from alert, next-office-day, infra to alert, next-office-day, infra, reactive work
Updated by nicksinger over 1 year ago
- Status changed from Blocked to In Progress
We plugged the machine back in where power8 was previously sitting. Gerhard configured the switch-ports to be in VLAN2 again. IPMI and OS are running and pingable again, and racktables is updated
Updated by nicksinger over 1 year ago
Alerts enabled again. Checking in 5 minutes whether they come up green
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Resolved
alerts are good again.
Updated by nicksinger over 1 year ago
- Assignee changed from dheidler to nicksinger
Updated by okurz over 1 year ago
Great work on resolving this and taking care of all the mentioned alerts, appreciated. Now we can look into the previously blocked #120267
Updated by okurz over 1 year ago
- Related to action #123082: backup of o3 to storage.qa.suse.de was not conducted by rsnapshot since 2021-12 size:M added