action #68050

openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
Start date: 2020-06-14
Due date: 2020-07-07
% Done: 0%
Estimated time:

Description

Suggestions

Debug over an IPMI SOL (Serial over LAN) connection.


Related issues

Related to openQA Infrastructure - action #78010: unreliable reboots on openqaworker3, likely due do openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) Resolved 2020-11-16 2021-04-21

Copied to openQA Infrastructure - action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) Resolved 2020-06-14

History

#1 Updated by okurz over 1 year ago

  • Copied to action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) added

#2 Updated by okurz over 1 year ago

  • Status changed from New to Workable

I could recover by calling mdadm --stop /dev/md127 and then exiting emergency mode, after which the boot continued. We should cross-check the config in /etc/mdadm.conf.
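Before stopping an array it can help to see which md arrays the kernel has assembled and in what state. The following is only a sketch, not what was run here: list_md_arrays is a hypothetical helper that reads /proc/mdstat (or a saved copy passed as an argument) and prints each array name with its state. This ticket does not record what state md127 was in.

```shell
# Hypothetical helper, not part of mdadm: print "<array> <state>" for every
# md array listed in mdstat-formatted text.
# $1: path to the text to parse; defaults to /proc/mdstat.
list_md_arrays() {
    # Array lines look like "md127 : inactive nvme0n1[0](S)" or
    # "md1 : active raid1 sda2[0] sdb2[1]"; other lines are skipped.
    awk '$1 ~ /^md/ && $2 == ":" { print $1, $3 }' "${1:-/proc/mdstat}"
}
```

Called without an argument in the emergency shell, this would read /proc/mdstat directly. An unexpected entry in the output would be a candidate for mdadm --stop as described above.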

#3 Updated by okurz over 1 year ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz
  • Priority changed from Urgent to Normal

I did mdadm --detail --scan >> /etc/mdadm.conf and adjusted the entries manually so that the RAID entry for the / filesystem is preserved.
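Appending the scan output can leave two ARRAY lines for the same array, which is why manual adjustment was needed. As a rough sketch of how that adjustment could be scripted (an assumption, not what was done here), the hypothetical helper merge_mdadm_arrays below keeps every line of the existing config and adds only those scanned ARRAY lines whose UUID is not already present. It assumes ARRAY is written out in full and that each array carries a UUID= field.

```shell
# Hypothetical helper: merge scan output into an existing mdadm.conf without
# duplicating arrays. Prints the merged config on stdout; modifies nothing.
# $1: existing config file, $2: file holding "mdadm --detail --scan" output.
merge_mdadm_arrays() {
    awk '
        # Return the lower-cased "uuid=..." field of an ARRAY line, or "".
        function uuid_of(line,    i, n, f, w) {
            n = split(line, f, /[ \t]+/)
            for (i = 1; i <= n; i++) {
                w = tolower(f[i])
                if (w ~ /^uuid=/) return w
            }
            return ""
        }
        # First file: keep every line and remember the UUIDs it declares.
        FILENAME == ARGV[1] {
            print
            if ($1 == "ARRAY") { u = uuid_of($0); if (u != "") seen[u] = 1 }
            next
        }
        # Second file: add only ARRAY lines with a UUID not seen so far.
        $1 == "ARRAY" {
            u = uuid_of($0)
            if (u == "" || !(u in seen)) { print; if (u != "") seen[u] = 1 }
        }
    ' "$1" "$2"
}
```

Because existing lines win over scanned ones, a hand-maintained entry for the / filesystem array would be left untouched. The output should still be reviewed before it replaces /etc/mdadm.conf.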

#4 Updated by okurz over 1 year ago

Removed the worker machine's salt key for now to fix the osd deployment. I am not yet sure whether the above is the right fix, particularly because we don't include that in salt. I will test more, including multiple reboots.

#6 Updated by okurz about 1 year ago

  • Status changed from Feedback to Resolved

Merged the MR; it was applied to all workers. Brought back openqaworker3 with

sudo systemctl unmask openqa-worker.target salt-minion telegraf
sudo systemctl enable --now openqa-worker.target salt-minion telegraf

on openqaworker3 and on osd

sudo salt-key -y -A openqaworker3\*
sudo salt -l error --state-output=changes -C 'G@roles:worker and openqaworker3*' state.apply

#7 Updated by okurz about 1 year ago

  • Status changed from Resolved to In Progress

openqaworker3 is again stuck in openqa_nvme_format.service. Working on it again.

#8 Updated by okurz about 1 year ago

  • Due date set to 2020-07-07
  • Status changed from In Progress to Feedback

#9 Updated by okurz about 1 year ago

  • Status changed from Feedback to Resolved

All workers rebooted fine over the weekend, no failed services.
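A check for failed services after a reboot could look roughly like the sketch below. It is an illustration only, not the check that was actually used: failed_units is a hypothetical helper that reads the output of systemctl list-units --state=failed --plain --no-legend on stdin, prints the unit names, and returns non-zero if any unit is listed.

```shell
# Hypothetical helper: read "systemctl list-units --state=failed --plain
# --no-legend" output on stdin, print the failed unit names, and return 1
# if there is at least one, 0 otherwise.
failed_units() {
    awk 'NF { print $1; n++ } END { exit (n > 0) }'
}
```

On a worker this might be used as: systemctl list-units --state=failed --plain --no-legend | failed_units && echo "no failed services".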

#10 Updated by okurz 10 months ago

  • Related to action #78010: unreliable reboots on openqaworker3, likely due do openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) added
