action #93683

osd-deployment failed due to storage.qa.suse.de not reachable by salt

Added by okurz 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2021-06-09
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/453045 shows

storage.qa.suse.de:
    Minion did not return. [Not connected]

ipmi-ipmi.storage.qa sol activate shows

storage login: root
Password: 
Have a lot of fun...
-bash-4.4#

so no proper PS1. And there is no systemd running.

Acceptance criteria

  • AC1: storage.qa is back
  • AC2: osd deployment continues after storage.qa is back

Suggestions

  • Remove storage.qa from salt control with ssh osd 'sudo salt-key -y -d storage.qa.suse.de'
  • Try to reboot storage.qa and see what happens
  • Check reboot stability of storage.qa
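
Checking reboot stability could be scripted roughly as follows. This is a minimal sketch and not something taken from the ticket: probe and wait_for_host are hypothetical helper names, and the default probe assumes that a non-interactive, key-based ssh login to the host works once it is up.

```shell
#!/bin/sh
# probe HOST: hypothetical, overridable reachability check.
# The default tries a non-interactive ssh login with a short timeout.
probe() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# wait_for_host HOST [TRIES] [DELAY]
# Poll until probe succeeds; return 1 once all tries are used up.
wait_for_host() {
    host="$1"
    tries="${2:-30}"
    delay="${3:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if probe "$host" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

After triggering a reboot, something like wait_for_host storage.qa.suse.de 30 10 would then report whether the machine came back within roughly five minutes. Repeating that a few times gives a rough stability check.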

Out of scope

  • Monitoring for storage.qa: #91779

Rollback

  • Add storage.qa back to salt with ssh osd 'sudo salt-key -y -a storage.qa.suse.de'

Related issues

Related to openQA Infrastructure - action #90629: administration of the new "Storage Server" (Resolved, 2020-08-04)

Related to openQA Infrastructure - action #69577: Handle installation of the new "Storage Server" (Resolved, 2020-08-04)

Related to openQA Infrastructure - action #66709: Storage server for OSD and monitoring (Resolved, 2020-05-12)

History

#1 Updated by okurz 5 months ago

I triggered a reboot with echo b >/proc/sysrq-trigger and now the machine is stuck in a grub command line

#2 Updated by okurz 5 months ago

  • Description updated (diff)

As storage.qa.suse.de is now stuck at the grub prompt, I ran ssh osd 'sudo salt-key -y -d storage.qa.suse.de' to be able to continue with the deployment: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/151428

#3 Updated by cdywan 5 months ago

Since I made the mistake of thinking Oli was investigating this issue earlier: The above is just the MR to drop storage.qa.suse.de from salt.

  • A naive ssh storage.qa.suse.de appears to time out here, so something is responding but not really
  • IPMI command from workerconf.sls connects successfully
  • exiting grub landed me in UEFI
  • After waiting a little while for "Checking Media Presence" to do something I am back in grub
  • local puts me back in the grub command line
  • normal doesn't work

Do we have any documentation on what should be booted here? Devices used?

#4 Updated by okurz 5 months ago

cdywan wrote:

Since I made the mistake of thinking Oli was investigating this issue earlier: The above is just the MR to drop storage.qa.suse.de from salt.

It was not a "MR", just a command

Do we have any documentation on what should be booted here? Devices used?

Search for previous tickets about the host, e.g. look for "storage". nicksinger conducted the installation of the host. I assume no profile for automated installation exists, e.g. no autoyast profile, just a manual installation. That said, I think for all newly installed machines we should aim for a completely automated installation from autoyast.

#5 Updated by nicksinger 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

The machine is in a really strange state. I've booted a rescue medium and chrooted into the rootfs (on nvme0n1p2), where many things seem to be missing. E.g. zypper is missing (the whole /usr/bin folder is missing) and snapper is not present either. However, the files are present in snapshots. I will check if I can restore one of the snapshots and investigate what blew up the OS there.

#6 Updated by nicksinger 4 months ago

After playing around with the btrfs on there I realized that the system is mounting snapshot 1 as default (described as "first root filesystem" by snapper). This snapshot really contains almost nothing. I was able to manually mount snapshot 33 (the most recent one) with mount /dev/nvme0n1p2 /mnt/mychroot -o subvol=@/.snapshots/33/snapshot and chroot into that snapshot. It was ro (as snapshots always are) but snapper was installed in there, so I could do a snapper rollback 33, which created snapshot 35 (a writable copy of #33) and set it as default. After a mount -a inside the chroot I executed grub2-mkconfig -o /boot/grub2/grub.cfg just to make sure the correct grub files are generated again. With these changes I was able to reboot the machine normally again. However, for some reason polkit.service fails to come up now, which (I assume) results in no network. I am investigating further whether this can easily be fixed.
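
The recovery steps described above can be sketched as a small script. This is only an illustration under assumptions, not a verified procedure: the device, snapshot number and mount point come from the comment, the mount is done with plain mount and the subvol option (btrfs-progs has no mount subcommand), the bind mounts of /dev, /proc and /sys are an assumption about what the chroot needs, and DRY_RUN, run and recover_snapshot are hypothetical names. By default the script only prints the commands.

```shell
#!/bin/sh
# Values from the ticket comment; override via environment if needed.
DEV="${DEV:-/dev/nvme0n1p2}"
SNAP="${SNAP:-33}"
MNT="${MNT:-/mnt/mychroot}"
# Hypothetical safety switch: 1 prints the commands, anything else runs them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_snapshot() {
    # Mount the chosen read-only snapshot instead of the broken default.
    run mount -o "subvol=@/.snapshots/$SNAP/snapshot" "$DEV" "$MNT"
    # Assumption: the chroot needs the usual pseudo filesystems.
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$MNT/$fs"
    done
    # Create a writable copy of the snapshot and make it the default.
    run chroot "$MNT" snapper rollback "$SNAP"
    # Mount the remaining filesystems from fstab, then regenerate grub.cfg.
    run chroot "$MNT" mount -a
    run chroot "$MNT" grub2-mkconfig -o /boot/grub2/grub.cfg
}
```

Calling recover_snapshot with the defaults prints the seven commands for review. Setting DRY_RUN=0 would execute them as root from the rescue system.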

#7 Updated by nicksinger 4 months ago

Got polkit (and cron) back up and running by manually creating /var/lib/polkit/ and /var/spool/cron/. Did another system upgrade and rebooted 3 times without issues. I'm declaring the machine stable again, but I have absolutely no clue how this could happen without somebody manually deleting files.
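
Recreating the missing directories could look roughly like this. It is a minimal sketch: ensure_state_dirs and the ROOT prefix are hypothetical names (ROOT only exists so the function can be tried on a scratch directory), and the ticket does not state which owner or mode the directories need, so none is set here.

```shell
#!/bin/sh
# Hypothetical prefix; empty means the real root filesystem.
ROOT="${ROOT:-}"

# Create the two directories named in the ticket if they are missing
# and report each one that had to be created.
ensure_state_dirs() {
    for dir in /var/lib/polkit /var/spool/cron; do
        if [ ! -d "$ROOT$dir" ]; then
            mkdir -p "$ROOT$dir" && echo "created $ROOT$dir"
        fi
    done
}
```

A second call prints nothing if both directories already exist, so the output doubles as a quick check after each reboot.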

#8 Updated by okurz 4 months ago

Hi Nick, this is great. Thank you for the quick reaction and detailed update.

cdywan wrote:

Do we have any documentation on what should be booted here? Devices used?

can you remind us about that context? Should we consider cheap redeploys based on autoyast profiles?

#9 Updated by nicksinger 4 months ago

  • Status changed from In Progress to Resolved

I've added the machine back to salt on OSD and ran a highstate to see if maybe this caused the destruction. But everything went smoothly and another reboot showed that the machine can be considered stable for now:

openqa:~ # salt-key -y -a storage.qa.suse.de
The following keys are going to be accepted:
Unaccepted Keys:
storage.qa.suse.de
Key for minion storage.qa.suse.de accepted.
openqa:~ # salt 'storage.qa.suse.de' test.ping
storage.qa.suse.de:
    True
openqa:~ # salt 'storage.qa.suse.de' state.highstate
storage.qa.suse.de:

Summary for storage.qa.suse.de
--------------
Succeeded: 192
Failed:      0
--------------
Total states run:     192
Total run time:     7.146 s
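
A check like the one above could be automated by parsing the summary. The following is a minimal sketch that assumes the summary format shown in the transcript and output for a single minion. highstate_failed and highstate_ok are hypothetical helper names.

```shell
#!/bin/sh
# Read a highstate summary on stdin and print the number from its
# "Failed:" line; exit non-zero if no such line is found.
highstate_failed() {
    awk '$1 == "Failed:" { print $2; found = 1 } END { if (!found) exit 1 }'
}

# Succeed only if the summary reports exactly zero failed states.
highstate_ok() {
    [ "$(highstate_failed)" = 0 ]
}
```

One possible use is piping the output of salt --no-color 'storage.qa.suse.de' state.highstate into highstate_ok and acting on the exit status.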

#10 Updated by nicksinger 4 months ago

okurz wrote:

Hi Nick, this is great. Thank you for the quick reaction and detailed update.

cdywan wrote:

Do we have any documentation on what should be booted here? Devices used?

can you remind us about that context? Should we consider cheap redeploys based on autoyast profiles?

I think it could help even though it masks problems which I don't like. However, instead of autoyast I'd take a look into the yomi project to have an installation based on salt.

#11 Updated by okurz 4 months ago

  • Related to action #90629: administration of the new "Storage Server" added

#12 Updated by okurz 4 months ago

  • Related to action #69577: Handle installation of the new "Storage Server" added

#13 Updated by okurz 4 months ago

  • Related to action #66709: Storage server for OSD and monitoring added

#14 Updated by okurz 4 months ago

nicksinger wrote:

okurz wrote:

Hi Nick, this is great. Thank you for the quick reaction and detailed update.

cdywan wrote:

Do we have any documentation on what should be booted here? Devices used?

can you remind us about that context?

Maybe you missed this question. I linked three tickets for reference.

Should we consider cheap redeploys based on autoyast profiles?

I think it could help even though it masks problems which I don't like.

It should be as simple as https://w3.nue.suse.com/~okurz/ay-openqa-worker.xml as the rest is done by salt.

However, instead of autoyast I'd take a look into the yomi project to have an installation based on salt.

yes, I would like that as well.
