action #81058

[tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now

Added by nicksinger 10 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-12-15
Due date: 2021-04-16
% Done: 0%
Estimated time:

Description

I think we are facing a product bug in Leap that prevents the Power8 workers from booting properly. Affected machines and the corresponding tickets:
malbec: #80656#note-9
QA-Power8-4-kvm: #81020#note-3
QA-Power8-5-kvm: #80482
powerqaworker-qam-1: #68053

I disabled rebootmgr on these machines for now with systemctl --now disable rebootmgr. I also created an MR against our salt repo so that the service does not get enabled again: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/421
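To double-check that the service really stays off on a worker, something like the following sketch could be run on the machine itself. The function name rebootmgr_is_off is made up for illustration; it only relies on the standard systemctl is-enabled and systemctl is-active subcommands.

```shell
#!/bin/sh
# Hypothetical helper (not part of salt-states-openqa): succeed only if
# rebootmgr.service is neither enabled nor currently running.
rebootmgr_is_off() {
    # Both subcommands print the state and exit non-zero for "off" states,
    # so ignore the exit code and look at the printed state instead.
    enabled=$(systemctl is-enabled rebootmgr.service 2>/dev/null || true)
    active=$(systemctl is-active rebootmgr.service 2>/dev/null || true)
    case $enabled in
        enabled*) return 1 ;;   # enabled or enabled-runtime
    esac
    [ "$active" != "active" ]
}

# Example use on a worker:
#   rebootmgr_is_off && echo "rebootmgr is off" || echo "rebootmgr is still on"
```

A masked or disabled unit counts as "off" here; only an enabled or running unit makes the check fail.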

If you really, really need to reboot a machine (or it becomes unavailable in the meantime), you can use these commands in the petitboot shell (over IPMI) to boot it once:

malbec: kexec -l /var/petitboot/mnt/dev/sdb1/boot/vmlinux-5.3.18-lp152.57-default --initrd=/var/petitboot/mnt/dev/sdb1/boot/initrd-5.3.18-lp152.57-default --command-line="root=UUID=ae18adf5-d27e-4fa1-93a1-6ab55263c29d nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
QA-Power8-4-kvm: kexec -l /var/petitboot/mnt/dev/sdb2/boot/vmlinux-5.3.18-lp152.57-default --initrd=/var/petitboot/mnt/dev/sdb2/boot/initrd-5.3.18-lp152.57-default --command-line="root=UUID=eebe647f-e867-416e-a0fa-7a6732bfcf9d nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
QA-Power8-5-kvm: kexec -l /var/petitboot/mnt/dev/sda2/boot/vmlinux-5.3.18-lp152.57-default --initrd=/var/petitboot/mnt/dev/sda2/boot/initrd-5.3.18-lp152.57-default --command-line="root=UUID=89ca2dff-86af-478b-8d4c-2a45ca689fd5 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
powerqaworker-qam-1: kexec -l /var/petitboot/mnt/dev/sda2/boot/vmlinux-5.3.18-lp152.57-default --initrd=/var/petitboot/mnt/dev/sda2/boot/initrd-5.3.18-lp152.57-default --command-line="root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
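The four commands above differ only in the partition name and the root filesystem UUID. As a rough sketch, the command line could be generated on an admin workstation and then pasted into the petitboot shell. The function name petitboot_kexec_cmd is hypothetical; the kernel version and kernel parameters are copied from the commands above and would need adjusting after a kernel update.

```shell
#!/bin/sh
# Hypothetical helper: print the one-time petitboot boot command.
# Usage: petitboot_kexec_cmd <partition> <root-uuid>
#   <partition> is the device name under /var/petitboot/mnt/dev (e.g. sdb1)
petitboot_kexec_cmd() {
    if [ "$#" -ne 2 ]; then
        echo "usage: petitboot_kexec_cmd <partition> <root-uuid>" >&2
        return 1
    fi
    part=$1
    uuid=$2
    # Assumption: same kernel version and parameters as in the commands above.
    kver=5.3.18-lp152.57-default
    boot=/var/petitboot/mnt/dev/$part/boot
    params="nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M"
    printf 'kexec -l %s/vmlinux-%s --initrd=%s/initrd-%s --command-line="root=UUID=%s %s" && kexec -e\n' \
        "$boot" "$kver" "$boot" "$kver" "$uuid" "$params"
}

# Example: print the command for malbec
#   petitboot_kexec_cmd sdb1 ae18adf5-d27e-4fa1-93a1-6ab55263c29d
```

The helper only prints text and never calls kexec itself, so it is safe to run anywhere.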


Related issues

Related to openQA Infrastructure - action #80656: OSD deployment failed at 2020-12-02 because 'malbec.arch.suse.de' is down (Resolved, 2020-12-02)

Related to openQA Infrastructure - action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup (Workable)

Related to openQA Infrastructure - action #81020: QA-Power8-4-kvm start failed since reboot on 2020-12-13 (Resolved, 2020-12-14)

Related to openQA Infrastructure - action #88474: All workers on powerqaworker-qam-1 are offline (Resolved, 2021-02-08)

Related to openQA Infrastructure - action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) (Resolved, 2020-06-14)

History

#1 Updated by nicksinger 10 months ago

  • Related to action #80656: OSD deployment failed at 2020-12-02 because 'malbec.arch.suse.de' is down added

#2 Updated by nicksinger 10 months ago

  • Related to action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup added

#3 Updated by nicksinger 10 months ago

  • Related to action #81020: QA-Power8-4-kvm start failed since reboot on 2020-12-13 added

#4 Updated by okurz 10 months ago

  • Target version set to Ready

#5 Updated by okurz 10 months ago

  • Description updated (diff)

Changed the progress ticket's description to use Redmine-internal links.

#6 Updated by nicksinger 10 months ago

  • Description updated (diff)

#7 Updated by nicksinger 10 months ago

  • Description updated (diff)

#8 Updated by Xiaojing_liu 10 months ago

On 2021-01-04, powerqaworker-qam-1 did not boot successfully. Executed:

kexec -l /var/petitboot/mnt/dev/sdb2/boot/vmlinux-5.3.18-lp152.57-default --initrd=/var/petitboot/mnt/dev/sdb2/boot/initrd-5.3.18-lp152.57-default --command-line="root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e

#9 Updated by cdywan 8 months ago

  • Description updated (diff)

#10 Updated by cdywan 8 months ago

  • Description updated (diff)

#11 Updated by cdywan 8 months ago

  • Related to action #88474: All workers on powerqaworker-qam-1 are offline added

#12 Updated by okurz 8 months ago

  • Related to action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) added

#13 Updated by okurz 8 months ago

  • Description updated (diff)

#14 Updated by okurz 7 months ago

  • Status changed from Feedback to In Progress
  • Assignee changed from nicksinger to okurz

After the progress in #68053, I am running a reboot check on all PowerPC OSD machines.

On OSD:

for run in {01..10}; do
  for host in QA-Power8-4-kvm.qa QA-Power8-5-kvm.qa powerqaworker-qam-1 malbec.arch grenache-1.qa; do
    echo -n "run: $run, $host: ping .. " &&
      timeout -k 5 600 sh -c "until ping -c30 $host >/dev/null; do :; done" &&
      echo -n "ok, ssh .. " &&
      timeout -k 5 600 sh -c "until nc -z -w 1 $host 22; do :; done" &&
      echo -n "ok, salt .. " &&
      timeout -k 5 600 sh -c "until salt --timeout=300 --no-color $host\* test.ping >/dev/null; do :; done" &&
      echo -n "ok, uptime/reboot: " &&
      salt $host\* cmd.run "uptime && systemctl disable --now openqa-worker-cacheservice.service >/dev/null" &&
      salt $host\* system.reboot 1 || break
  done || break
done
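The loop repeats the same "retry until it works or a time limit is hit" pattern for ping, ssh and salt. A minimal sketch of that pattern as a reusable function is shown here. The name wait_until is hypothetical, and unlike the timeout -k 5 600 sh -c "until ...; do :; done" form used in the loop, it counts one-second retries instead of enforcing a wall-clock limit, so it will not interrupt a single probe that hangs.

```shell
#!/bin/sh
# Hypothetical helper: wait_until <max-retries> <command> [args...]
# Re-runs the command, sleeping one second between attempts, until it
# succeeds. Returns 0 on success, 1 once <max-retries> retries are used up.
wait_until() {
    limit=$1
    shift
    waited=0
    until "$@" >/dev/null 2>&1; do
        # Give up when the retry budget is exhausted.
        [ "$waited" -ge "$limit" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example: wait for ssh on a host, using the same probe as the loop above
#   wait_until 600 nc -z -w 1 "$host" 22 && echo "ssh ok"
```

Compared with the inline form, this keeps the probe command outside of a quoted sh -c string, which avoids a second layer of shell quoting.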

#15 Updated by openqa_review 7 months ago

  • Due date set to 2021-04-16

Setting due date based on mean cycle time of SUSE QE Tools

#16 Updated by okurz 7 months ago

  • Status changed from In Progress to Resolved

The above experiment was successful. Machines came up just fine after reboot.
I unpaused the alerts "Broken workers alert", "Failed systemd services alert (except openqa.suse.de)", "Failed systemd services alert (except openqa.suse.de)" and confirmed that all expected services on these machines are active and the machines are working on openQA jobs.

Unmasked rebootmgr.service again and created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/472
to enable the rebootmgr service again everywhere.

I also enabled rebootmgr on storage.qa where auto-update was already active but not rebootmgr. Tested that reboot works fine as well.
