action #157441
Updated by okurz about 1 year ago
## Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2398227

```
Date: Sun, 17 Mar 2024 05:49:14 +0000
Date: Mon, 18 Mar 2024 05:49:42 +0000
qesapworker-prg5.qa.suse.cz: 2184 Minion did not return. [Not connected]
```

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_qesapworker-prg5/view?orgId=1

The worker seems to have hung. There was no login prompt on the serial tty, so I rebooted it via IPMI. The worker came up, but a systemd service failed:

…

It seems the NVMe disk is no longer found. Maybe it died and the system subsequently froze.

## Acceptance criteria

* **AC1:** osd-deployment passed again
* **AC2:** qesapworker-prg5.qa.suse.cz back in production again

## Suggestions

* *DONE* Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
* *DONE* Remove qesapworker-prg5.qa.suse.cz from production: `ssh osd "sudo salt-key -y -d qesapworker-prg5.qa.suse.cz"`
* Retrigger failed osd deployment CI pipeline
* Investigate the specific issue on qesapworker-prg5.qa.suse.cz
* Fix any potential hardware issue, e.g. with hardware replacement
* Ensure qesapworker-prg5.qa.suse.cz is back in production

## Rollback steps

* https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production: `hostname=qesapworker-prg5.qa.suse.cz; ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"`
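
To spot other affected hosts when investigating, the "Minion did not return" line from the Observation log can be filtered out of salt output. The following is only a sketch: `unresponsive_minions` is a hypothetical helper name, and it assumes the host name appears at the start of a line followed by a colon, either on the same line as the message or on the line before it.

```shell
# Hypothetical helper: print every host that salt reported as not returning.
# Reads salt (or CI job log) output on stdin.
# Assumption: the host name is the first token of a line and ends with ":".
unresponsive_minions() {
    awk '
        # Remember the most recent "name:" token at the start of a line.
        /^[^[:space:]:]+:([[:space:]]|$)/ { host = $1; sub(/:$/, "", host) }
        # Report the remembered host when the failure message shows up.
        /Minion did not return/ && host != "" { print host }
    '
}
```

It could be fed from something like `ssh osd "sudo salt '*' test.ping" | unresponsive_minions`, assuming salt words the failure the same way there as in the deployment log.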