action #162293

Updated by okurz about 1 month ago

## Observation 
 While struggling with w31, which upgraded itself to Leap 15.6 and then crashed multiple times after booting into kernel 6.4, we observed SMART errors early during bootup. These errors might explain the kernel crashes or might be a separate problem. We downgraded to Leap 15.5 for now and took the machine out of production, but it still runs as an openQA worker. 

 ## Acceptance criteria 
 * **AC1:** w31 boots up fine without SMART errors 

 ## Steps to reproduce 
 * reboot worker31 and then follow the output on `ssh -t jumpy@qe-jumpy.prg2.suse.org "ipmitool -I lanplus -H openqaworker31.qe-ipmi-ur -U … -P … sol activate"` 
 * observe SMART errors very early during firmware initialization 
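
The firmware messages scroll by quickly, so it can help to capture the serial-over-LAN output to a file and search it afterwards. The following is a minimal sketch, not part of the original steps: the function name `smart_lines` and the log file name are hypothetical, and since the ticket does not quote the exact firmware message, the pattern (matching "SMART" or "S.M.A.R.T.", case-insensitive) is a guess.

```shell
#!/bin/sh
# Print all lines mentioning SMART from a captured serial console log,
# prefixed with their line numbers.
# Control characters (carriage returns, escape bytes) that firmware screens
# tend to emit are stripped first; tabs and newlines are kept.
# $1: path to the captured log file (hypothetical name, e.g. sol.log)
smart_lines() {
    tr -d '\000-\010\013-\037' < "$1" | grep -n -i -E 'S\.?M\.?A\.?R\.?T'
}
```

One way to produce such a log is to append `| tee sol.log` to the `ssh` command from the first step, then run `smart_lines sol.log` after the reboot has finished.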

 ## Suggestions 
 * Check the content of /var/crash and clean up after investigation 
 * Check the status of SMART from the running Linux system and then also the messages on bootup 
 * Crosscheck the SMART status on other salt-controlled machines 
 * Consider replacing defective hardware 
 * Ensure that no services are in a failed state anymore 
 * Bring back the system into production 
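
For checking the SMART status from the running system, a sketch along the following lines might be a starting point. It assumes that `smartctl` from smartmontools is installed and that the functions are run as root. The function names `smart_verdict` and `check_all_disks` are hypothetical, and the matched strings are the health lines that `smartctl -H` commonly prints for ATA/NVMe and SCSI/SAS devices; other device types may need different patterns.

```shell
#!/bin/sh
# Classify the output of `smartctl -H <device>` read from stdin.
# Prints PASSED, FAILED, or UNKNOWN (no recognized health line found).
smart_verdict() {
    awk '
        # ATA and NVMe devices: "... test result: PASSED" or "FAILED!"
        /SMART overall-health self-assessment test result:/ {
            verdict = ($NF == "PASSED") ? "PASSED" : "FAILED"
        }
        # SCSI/SAS devices: "SMART Health Status: OK" or a failure text
        /SMART Health Status:/ {
            verdict = ($NF == "OK") ? "PASSED" : "FAILED"
        }
        END { print (verdict == "" ? "UNKNOWN" : verdict) }
    '
}

# Report the health verdict for every device listed by `smartctl --scan`.
check_all_disks() {
    # Each scan line looks like "/dev/sda -d scsi # comment"; drop the comment.
    smartctl --scan | sed 's/[[:space:]]*#.*$//' | while read -r dev_args; do
        # Word splitting of $dev_args is intended so that "-d TYPE" is passed on.
        # shellcheck disable=SC2086
        printf '%s: %s\n' "$dev_args" "$(smartctl -H $dev_args | smart_verdict)"
    done
}
```

Calling `check_all_disks` on w31 prints one line per device. For the crosscheck on other machines, the same `smartctl -H` call could presumably be run through salt's `cmd.run` from osd, in the same way as the rollback steps use it. For the failed services, `systemctl --failed` lists units in a failed state.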

 ## Rollback steps 
 * `hostname=worker31.oqa.prg2.suse.org; ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"` 
 * `ssh osd "sudo salt 'worker31.oqa.prg2.suse.org' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"`
