action #68050
openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed
Status:
closed
Start date:
2020-06-14
Due date:
2020-07-07
% Done:
0%
Added by okurz over 4 years ago. Updated over 4 years ago.
I could recover by calling
mdadm --stop /dev/md127
and exiting emergency mode, after which the boot continued. We should crosscheck the config in /etc/mdadm.conf.
I ran
mdadm --detail --scan >> /etc/mdadm.conf
and adjusted the entries manually so that the RAID holding the / filesystem is preserved.
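Appending the scan output with >> can duplicate ARRAY entries that are already in the file, which is probably why manual adjustment was needed. A minimal sketch of a more careful append, assuming a helper named append_missing_arrays (hypothetical, not part of mdadm or of our salt states) that only adds ARRAY lines not already present verbatim:

```shell
#!/bin/sh
# Hypothetical helper: read "mdadm --detail --scan" style output on stdin and
# append only those ARRAY lines that are not already in the given config file.
append_missing_arrays() {
    conf=$1
    while IFS= read -r line; do
        # Ignore anything that is not an ARRAY definition.
        case $line in
            ARRAY*) ;;
            *) continue ;;
        esac
        # -x: match the whole line, -F: fixed string, -q: no output.
        grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done
}

# Intended use on the worker (as root), after taking a backup:
#   cp -a /etc/mdadm.conf /etc/mdadm.conf.bak
#   mdadm --detail --scan | append_missing_arrays /etc/mdadm.conf
```

This only catches exact duplicates. An entry for the same array written differently (for example under another device name) would still need the manual review described above.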
Removed the worker machine's salt key for now to fix the osd deployment. I am not yet sure whether the above is the right fix, particularly because we don't include that in salt. I will test more, with multiple reboots.
Merged the MR, which was applied to all workers. Brought back openqaworker3 with
sudo systemctl unmask openqa-worker.target salt-minion telegraf
sudo systemctl enable --now openqa-worker.target salt-minion telegraf
on openqaworker3, and on osd with
sudo salt-key -y -A openqaworker3\*
sudo salt -l error --state-output=changes -C 'G@roles:worker and openqaworker3*' state.apply
openqaworker3 is again stuck in openqa_nvme_format.service. Working on it again.
All workers rebooted fine over the weekend, no failed services.
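A "no failed services" check after a reboot could be scripted roughly as follows. This is only a sketch: count_failed_units is a hypothetical helper name, and it simply counts the non-empty lines that systemctl prints for units in the failed state.

```shell
#!/bin/sh
# Hypothetical helper: count non-empty lines on stdin, one line per failed unit.
count_failed_units() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function from reporting that as an error.
    grep -c . || true
}

# Intended use on a worker:
#   systemctl list-units --state=failed --no-legend --plain | count_failed_units
# A result of 0 means no unit is in the failed state.
```

Reading from stdin keeps the helper independent of where the unit list comes from, so the same function works on output collected from several workers.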