action #157441
Updated by okurz 9 months ago
## Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2398227
```
Date: Sun, 17 Mar 2024 05:49:14 +0000
Date: Mon, 18 Mar 2024 05:49:42 +0000
qesapworker-prg5.qa.suse.cz:
2184 Minion did not return. [Not connected]
```
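To spot affected hosts quickly in a long deployment log, the salt output can be filtered for minions that did not return. This is a minimal sketch, not part of the deployment pipeline: the function name `unresponsive_minions` is hypothetical, and it assumes the host name is printed on the line directly above the "Minion did not return" message, as in the excerpt above.

```shell
# Hypothetical helper: read salt output on stdin and print the names of
# minions that are reported as "Minion did not return".
# Assumes the host name line (ending in ":") directly precedes the message.
unresponsive_minions() {
    grep -B1 'Minion did not return' \
        | grep -E '^[^[:space:]].*:$' \
        | sed 's/:$//'
}
```

For example, piping the saved job log through `unresponsive_minions` should print one host name per line.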
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_qesapworker-prg5/view?orgId=1
The worker seems to have hung: there was no login prompt on the serial tty.
Rebooted via IPMI.
Worker came up, but a systemd service failed: …
It seems the NVMe disk is no longer found. Maybe it died and the system subsequently froze.
## Acceptance criteria
* **AC1:** osd-deployment passed again
* **AC2:** qesapworker-prg5.qa.suse.cz back in production again
## Suggestions
* *DONE* Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
* *DONE* Remove qesapworker-prg5.qa.suse.cz from production `ssh osd "sudo salt-key -y -d qesapworker-prg5.qa.suse.cz"`
* Retrigger failed osd deployment CI pipeline
* Investigate the specific issue on qesapworker-prg5.qa.suse.cz
* Fix any potential hardware issue, e.g. with hardware replacement
* Ensure qesapworker-prg5.qa.suse.cz is back in production
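As a starting point for the investigation, one could check whether the kernel still sees any NVMe controller after the reboot. The following is a minimal sketch under stated assumptions: the function name `list_nvme_controllers` is hypothetical, it relies on the Linux sysfs directory `/sys/class/nvme`, and the optional directory argument exists only so the function can be tried against a test directory.

```shell
# Hypothetical helper: print the NVMe controllers known to the kernel.
# Returns non-zero if the sysfs directory is missing or contains no controller.
list_nvme_controllers() {
    sysfs_dir="${1:-/sys/class/nvme}"
    [ -d "$sysfs_dir" ] || return 1
    found=1
    for ctrl in "$sysfs_dir"/nvme*; do
        # Skip the unexpanded glob pattern when nothing matches
        [ -e "$ctrl" ] || continue
        basename "$ctrl"
        found=0
    done
    return "$found"
}
```

Run on the worker itself, an empty result would support the suspicion that the disk is gone. Standard tools such as `systemctl --failed`, `lsblk` and `journalctl -k` can then show which service failed and whether the kernel logged errors for the device.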
## Rollback steps
* Bring the machine back into production: https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production `hostname=qesapworker-prg5.qa.suse.cz; ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"`