action #121282
Recover storage.qa.suse.de size:S
0%
Description
Observation
The server cannot be reached via SSH and the host-up alert in our monitoring fired (I paused it for now). I also wasn't able to reach the IPMI host (via jumpy@qe-jumpy.suse.de as documented in pillars).
Acceptance criteria
- AC1: Machine is racked again
- AC2: Racktables is updated including mac/connections/rack
- AC3: No related alerts for that machine are firing
Rollback steps
- Enable the host-up alert for storage again
- Enable "Packet loss between worker hosts and other hosts" https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?editPanel=4&tab=alert&orgId=1 again
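Before re-enabling the alerts, it may help to confirm that the host answers on SSH again. The following is only a minimal sketch, assuming the SSH daemon listens on the default port 22 (the ticket does not state the port); `tcp_reachable` is a hypothetical helper and not part of any existing tooling.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; port 22 is an assumed default.
    """
    try:
        # create_connection resolves the name and attempts the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures
        return False
```

For example, `tcp_reachable("storage.qa.suse.de")` would return `True` once the SSH port accepts connections. This only checks the TCP handshake, not a successful login, so it does not replace the monitoring alerts themselves.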
Related issues
History
#2
Updated by dheidler 2 months ago
Unable to find that hostname in racktables:
https://racktables.nue.suse.com/?page=search&last_page=search&last_tab=default&q=storage.qa.suse.de
Does anyone know where this machine is located?
#3
Updated by okurz 2 months ago
- Related to action #88546: Make use of the new "Storage Server", e.g. complete OSD backup added
#5
Updated by okurz 2 months ago
funny story: https://infra.nue.suse.com/SelfService/Display.html?id=175645#txn-2575010
guess where the racktables link leads to: https://racktables.suse.de/index.php?page=object&tab=default&object_id=13558 "openqa-ses", the seemingly "orphaned" machine whose purpose we didn't know :D
So next task: Decide if we should move openqa-ses back to SRV1 or find a new home at FC
#6
Updated by mkittler 2 months ago
I suppose having it in FC will be fine. That means we'd likely mount it somewhere in SRV2 or the lab on the 2nd floor until FC is ready.
Where's the machine now, btw?
Next time we could at least check what services run on a machine before pulling the plug. In this case it would have been very obvious that it's just the storage server.
#7
Updated by nicksinger 2 months ago
- Tags changed from alert, reactive work to alert, next-office-day
I've updated https://racktables.suse.de/index.php?page=object&tab=edit&object_id=13558 to reflect the actual FQDN we know the host by, and also added the serial number from the attached picture so we can avoid this issue in the future by just looking up the serial number.
#8
Updated by nicksinger 2 months ago
- Subject changed from Recover storage.qa.suse.de to Recover storage.qa.suse.de size:S
- Description updated (diff)
- Status changed from New to Workable
#9
Updated by nicksinger 2 months ago
- Related to action #69577: Handle installation of the new "Storage Server" added
#11
Updated by okurz 2 months ago
- Assignee set to dheidler
At the next opportunity, someone near Nuremberg please create an EngInfra ticket over sd.suse.com/ with the Jira SD group "OSD Admins" added, make an appointment to take the machine from the 2.2.14 (TAM) QA lab, where it is currently located, bring it back to SRV1 and mount it back where it was, update racktables and make sure the machine is reachable. Migrating the machine into the new network zone should be done in #120267.
As discussed in daily 2022-12-07 dheidler will pick this up.
#13
Updated by okurz about 2 months ago
- Priority changed from Urgent to Normal
In the ticket, gschlotter informed us that they likely won't be at Maxtorhof anymore this year, so we will have to wait.
#15
Updated by nicksinger 26 days ago
- Status changed from Blocked to In Progress
We plugged the machine back in where power8 was previously sitting. Gerhard configured the switch ports to be in VLAN2 again. IPMI and OS are running and pingable again, and racktables is updated.
#16
Updated by nicksinger 26 days ago
Alerts enabled again. Checking in 5 minutes whether they come up green.
#17
Updated by nicksinger 26 days ago
- Status changed from In Progress to Resolved
alerts are good again.
#18
Updated by nicksinger 26 days ago
- Assignee changed from dheidler to nicksinger
#20
Updated by okurz 23 days ago
- Related to action #123082: backup of o3 to storage.qa.suse.de was not conducted by rsnapshot since 2021-12 size:M added