action #157441

Updated by okurz about 2 months ago

## Observation 

 https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2398227 
 ``` 
 Date: Sun, 17 Mar 2024 05:49:14 +0000 
 Date: Mon, 18 Mar 2024 05:49:42 +0000 

 qesapworker-prg5.qa.suse.cz: 
 2184      Minion did not return. [Not connected] 
 ``` 


 https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_qesapworker-prg5/view?orgId=1 

 The worker appears to have hung: there was no login prompt on the serial tty. 
 Rebooted via IPMI. 
 Worker came up, but a systemd service failed: … 

 The NVMe disk no longer seems to be detected. It may have died, causing the system to freeze. 

 ## Acceptance criteria 
 * **AC1:** osd-deployment passed again 
 * **AC2:** qesapworker-prg5.qa.suse.cz back in production again 

 ## Suggestions 
 * *DONE* Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production 
 * *DONE* Remove qesapworker-prg5.qa.suse.cz from production `ssh osd "sudo salt-key -y -d qesapworker-prg5.qa.suse.cz"` 
 * Retrigger failed osd deployment CI pipeline 
 * Investigate the specific issue on qesapworker-prg5.qa.suse.cz 
 * Fix any potential hardware issue, e.g. with hardware replacement 
 * Ensure qesapworker-prg5.qa.suse.cz is back in production 
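
 A minimal sketch of a first check for the suspected NVMe loss, to be run on the worker itself after the reboot. The helper names `has_nvme` and `check_worker_nvme` are hypothetical, and the sketch assumes the disk would normally show up as an `nvme*` block device; `lsblk`, `systemctl --failed` and `journalctl` are standard tools:

```shell
# Hypothetical helper: succeeds if any device name read from stdin starts with "nvme".
has_nvme() {
    grep -q '^nvme'
}

# Assumed to be run on the affected worker.
check_worker_nvme() {
    # -d: whole disks only, -n: no header line, -o NAME: device names only
    if lsblk -d -n -o NAME | has_nvme; then
        echo "NVMe block device present"
    else
        echo "no NVMe block device found"
        # List failed units and kernel messages mentioning NVMe from the current boot
        systemctl --failed --no-pager
        journalctl -k -b --no-pager | grep -i nvme
    fi
}
```

 If no NVMe device is listed, the kernel log of the current boot is the next place to look before deciding on a hardware replacement.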

 ## Rollback steps 
 * https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production `hostname=qesapworker-prg5.qa.suse.cz; ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"`
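
 To check AC2 after the rollback, the minion can be pinged from osd. This is only a sketch: the helper name `rollback_cmds` is hypothetical, it prints the commands instead of running them so they can be reviewed first, and it assumes passwordless `sudo` on osd. `test.ping` is a standard salt function:

```shell
# Hypothetical helper: print the commands that re-add a worker and verify it responds.
rollback_cmds() {
    host="$1"
    echo "sudo salt-key -y -a $host"
    echo "sudo salt --state-output=changes $host state.apply"
    # test.ping only confirms that the minion answers; it does not validate the applied state
    echo "sudo salt $host test.ping"
}

# Review the output first; then, assuming ssh access to osd:
#   rollback_cmds qesapworker-prg5.qa.suse.cz | ssh osd sh -e
```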
