action #68050

closed

openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
2020-06-14
Due date:
2020-07-07
% Done:

0%

Estimated time:

Description

Suggestions

Debug over an IPMI SOL (Serial over LAN) connection.
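
A minimal sketch of how such an SOL session could be opened with ipmitool. The BMC hostname and user are hypothetical placeholders, not values from this ticket; the password is read from the IPMI_PASSWORD environment variable via -E.

```shell
# Open a Serial-over-LAN console on a machine's BMC.
# $1: BMC hostname (hypothetical placeholder)
# $2: IPMI user (hypothetical placeholder)
sol_connect() {
    ipmitool -I lanplus -H "$1" -U "$2" -E sol activate
}

# Usage (assumed names):
#   IPMI_PASSWORD=... sol_connect <bmc-host> <user>
# A stale session can be closed by running the same ipmitool call
# with "sol deactivate" in place of "sol activate".
```

The function is only defined here, so nothing connects until it is called explicitly.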


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #78010: unreliable reboots on openqaworker3, likely due do openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) | Resolved | okurz | 2020-11-16 | 2021-04-21

Copied to openQA Infrastructure - action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) | Resolved | okurz | 2020-06-14

Actions #1

Updated by okurz almost 4 years ago

  • Copied to action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) added
Actions #2

Updated by okurz almost 4 years ago

  • Status changed from New to Workable

I could recover by calling mdadm --stop /dev/md127 and then exiting emergency mode, after which the boot continued. We should cross-check the config in /etc/mdadm.conf.

Actions #3

Updated by okurz almost 4 years ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz
  • Priority changed from Urgent to Normal

I ran mdadm --detail --scan >> /etc/mdadm.conf and adjusted the entries manually so that the RAID for the / filesystem is preserved.
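
As an illustration only, and not what was actually run here: one way to avoid the manual cleanup is to append only those ARRAY lines whose UUID is not yet in the config, so that existing entries (such as the one for the / RAID) stay untouched. The helper name append_new_arrays is made up for this sketch.

```shell
# Read "mdadm --detail --scan" output on stdin and append only ARRAY
# lines whose UUID is not already present in the given config file.
# append_new_arrays is a hypothetical helper, not part of mdadm.
append_new_arrays() {
    conf=$1
    while IFS= read -r line; do
        case $line in
            ARRAY\ *) ;;        # only ARRAY definitions are of interest
            *) continue ;;
        esac
        # Extract the UUID=... value from the scan line
        uuid=$(printf '%s\n' "$line" | sed -n 's/.*UUID=\([^ ]*\).*/\1/p')
        if [ -n "$uuid" ] && grep -q "UUID=$uuid" "$conf" 2>/dev/null; then
            continue            # already configured, keep the existing entry
        fi
        printf '%s\n' "$line" >> "$conf"
    done
}

# Usage (as root, after taking a backup):
#   cp -a /etc/mdadm.conf /etc/mdadm.conf.bak
#   mdadm --detail --scan | append_new_arrays /etc/mdadm.conf
```

Matching on the UUID rather than the device name is a design choice of this sketch, since names like /dev/md127 can change between boots.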

Actions #4

Updated by okurz almost 4 years ago

Removed the worker machine's salt key for now to fix the osd deployment. I am not yet sure whether the above is the right fix, particularly because we don't include that in salt. I will test further with multiple reboots.

Actions #6

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved

Merged the MR, which was applied to all workers. Brought back openqaworker3 with

sudo systemctl unmask openqa-worker.target salt-minion telegraf
sudo systemctl enable --now openqa-worker.target salt-minion telegraf

on openqaworker3 and on osd

sudo salt-key -y -A openqaworker3\*
sudo salt -l error --state-output=changes -C 'G@roles:worker and openqaworker3*' state.apply
Actions #7

Updated by okurz almost 4 years ago

  • Status changed from Resolved to In Progress

openqaworker3 is again stuck in openqa_nvme_format.service. Working on it again.

Actions #8

Updated by okurz almost 4 years ago

  • Due date set to 2020-07-07
  • Status changed from In Progress to Feedback
Actions #9

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved

All workers rebooted fine over the weekend, no failed services.
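
A small sketch of how "no failed services" could be checked on a worker after a reboot. This is an assumption about a possible check, not necessarily the one used here; the function name check_failed_units is made up.

```shell
# Print failed systemd units and return 1 if there are any,
# otherwise report that none failed.
# check_failed_units is a hypothetical helper name.
check_failed_units() {
    failed=$(systemctl list-units --state=failed --no-legend --plain)
    if [ -n "$failed" ]; then
        printf '%s\n' "$failed"
        return 1
    fi
    echo "no failed units"
}

# Usage on a worker:
#   check_failed_units
```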

Actions #10

Updated by okurz over 3 years ago

  • Related to action #78010: unreliable reboots on openqaworker3, likely due do openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN) added