action #93683: osd-deployment failed due to storage.qa.suse.de not reachable by salt - openQA Infrastructure (public) - openSUSE Project Management Tool

action #93683

closed

osd-deployment failed due to storage.qa.suse.de not reachable by salt

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: nicksinger
Category:
Target version: openQA Project (public) - Ready
Start date: 2021-06-09
Due date:
% Done:
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/453045 shows

storage.qa.suse.de:
    Minion did not return. [Not connected]

ipmi-ipmi.storage.qa sol activate shows

storage login: root
Password: 
Have a lot of fun...
-bash-4.4#

So there is no proper PS1, and systemd is not running.
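
A minimal sketch of how the missing systemd could be confirmed from that shell. It relies on the fact that the directory /run/systemd/system exists only while systemd is the running init system, which is the same test sd_booted(3) uses. The function names and the optional root-prefix argument are hypothetical; the prefix exists only so the check can be tried against a scratch directory.

```shell
# Sketch: report whether systemd is the running init system.
# is_systemd_running and describe_init are hypothetical helpers, not commands
# from the ticket. The optional first argument is a root prefix (assumption,
# useful for trying the check on a scratch directory).
is_systemd_running() {
    root="${1:-}"
    # systemd creates /run/systemd/system early during boot;
    # sd_booted(3) performs the same directory test.
    [ -d "$root/run/systemd/system" ]
}

describe_init() {
    if is_systemd_running "${1:-}"; then
        echo "systemd is running"
    else
        echo "systemd is NOT running"
    fi
}
```

Calling describe_init without arguments checks the live system. On a host in the state described here it would be expected to report that systemd is not running; cat /proc/1/comm additionally shows which process is PID 1.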

Acceptance criteria

AC1: storage.qa is back
AC2: osd deployment continues after storage.qa is back

Suggestions

Remove storage.qa from salt control with ssh osd 'sudo salt-key -y -d storage.qa.suse.de'
Try to reboot storage.qa and see what happens
Check reboot stability of storage.qa

Out of scope

Monitoring for storage.qa: #91779

Rollback

Add storage.qa back to salt with ssh osd 'sudo salt-key -y -a storage.qa.suse.de'

Related issues 3 (0 open — 3 closed)

Updated by okurz almost 4 years ago

I triggered a reboot with echo b >/proc/sysrq-trigger and now the machine is stuck at the grub command line.

Updated by okurz almost 4 years ago

Description updated (diff)

As storage.qa.suse.de is now stuck at the grub prompt, I ran ssh osd 'sudo salt-key -y -d storage.qa.suse.de' to be able to continue with the deployment: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/151428

Updated by livdywan almost 4 years ago

Since I made the mistake of thinking Oli was investigating this issue earlier: The above is just the MR to drop storage.qa.suse.de from salt.

A naive ssh storage.qa.suse.de appears to time out here, so something is responding but not really
IPMI command from workerconf.sls connects successfully
exiting grub landed me in UEFI
After waiting a little while for "Checking Media Presence" to do something I am back in grub
local puts me back in the grub command line
normal doesn't work

Do we have any documentation on what should be booted here? Devices used?

Updated by okurz almost 4 years ago

cdywan wrote:

Since I made the mistake of thinking Oli was investigating this issue earlier: The above is just the MR to drop storage.qa.suse.de from salt.

It was not an MR, just a command.

Do we have any documentation on what should be booted here? Devices used?

Search for previous tickets about the host, e.g. look for "storage". nicksinger conducted the installation of the host. I assume no profile for automated installation exists, e.g. no autoyast profile, just a manual installation. That said, I think for all newly installed machines we should aim for a completely automated installation from autoyast.

Updated by nicksinger almost 4 years ago

Status changed from Workable to In Progress
Assignee set to nicksinger

The machine is in a really strange state. I've booted a rescue medium and chrooted into the rootfs (on nvme0n1p2), where many things seem to be missing. E.g. zypper is missing (the whole /usr/bin folder is missing) and snapper is not present either. However, the files are present in snapshots. I will check whether I can restore one of the snapshots and investigate what blew up the OS there.

Updated by nicksinger almost 4 years ago

After playing around with the btrfs filesystem on there I realized that the system mounts snapshot 1 as default (described as "first root filesystem" by snapper). This snapshot contains almost nothing. I was able to manually mount snapshot 33 (the most recent one) with mount /dev/nvme0n1p2 /mnt/mychroot -o subvol=@/.snapshots/33/snapshot and chroot into that snapshot. It was read-only (as snapper snapshots normally are) but snapper was installed in there, so I could do a snapper rollback 33, which created snapshot 35 (a writable copy of #33) and set it as default. After a mount -a inside the chroot I executed grub2-mkconfig -o /boot/grub2/grub.cfg to make sure the correct grub files were generated again. With these changes I was able to reboot the machine normally. However, for some reason polkit.service now fails to come up, which (I assume) results in no network. I am investigating whether this can easily be fixed.
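
The recovery described above can be sketched as the following sequence. The helper only prints the commands rather than running them, since they need root on a rescue system and should be reviewed first. The device, mount point and snapshot number are the ones named in this comment; the function names are hypothetical, and the snapshot is mounted with the standard mount command and its subvol option. btrfs subvolume get-default is a generic way to see which subvolume is mounted by default; it is not mentioned in the ticket.

```shell
# Hypothetical helpers sketching the recovery steps. Nothing here touches a
# real disk: the plan is only printed for review.

# Extract the snapper snapshot number from a line of
# `btrfs subvolume get-default <mountpoint>` output read on stdin.
# Prints nothing if the default subvolume is not a snapper snapshot.
default_snapshot_number() {
    sed -n 's|.*path .*\.snapshots/\([0-9][0-9]*\)/snapshot$|\1|p'
}

# Print the commands from the comment for a given device, mount point and
# snapshot number (the defaults are the values named in the comment).
# Everything after the chroot line is meant to be typed inside the chroot.
print_recovery_plan() {
    dev="${1:-/dev/nvme0n1p2}"
    mnt="${2:-/mnt/mychroot}"
    snap="${3:-33}"
    cat <<EOF
mount -o subvol=@/.snapshots/$snap/snapshot $dev $mnt
chroot $mnt
snapper rollback $snap
mount -a
grub2-mkconfig -o /boot/grub2/grub.cfg
EOF
}
```

As an illustration (the input line is made up, not captured from storage.qa), piping "ID 266 gen 1234 top level 265 path @/.snapshots/1/snapshot" into default_snapshot_number prints 1, which would match the observation that snapshot 1 was the default.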

Updated by nicksinger almost 4 years ago

Got polkit (and cron) back up and running by manually creating /var/lib/polkit/ and /var/spool/cron/. Did another system upgrade and rebooted 3 times without issues. I'm declaring the machine stable again, but I have absolutely no clue how this could happen without somebody manually deleting files.
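
A small sketch of the directory fix, assuming nothing beyond the two paths named above. The function name and the optional root-prefix argument are hypothetical. Ownership and permissions are not stated in the ticket, so the sketch leaves them at the mkdir defaults; on a real host they should be compared against what the owning packages expect.

```shell
# Hypothetical helper: recreate the two state directories named in the ticket.
# The optional argument is a root prefix so the sketch can be tried on a
# scratch directory first. Ownership and modes are NOT given in the ticket
# and are left at the mkdir defaults here.
restore_state_dirs() {
    root="${1:-}"
    for dir in /var/lib/polkit /var/spool/cron; do
        if [ -d "$root$dir" ]; then
            echo "exists:  $root$dir"
        else
            mkdir -p "$root$dir" && echo "created: $root$dir"
        fi
    done
}
```

On the affected host this would be run as root without an argument, followed by a restart of polkit.service and a look at systemctl --failed to see whether any other unit is still broken.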

Updated by okurz almost 4 years ago

Hi Nick, this is great. Thank you for the quick reaction and detailed update.

cdywan wrote:

Do we have any documentation on what should be booted here? Devices used?

Can you remind us about that context? Should we consider cheap redeploys based on autoyast profiles?

Updated by nicksinger almost 4 years ago

Status changed from In Progress to Resolved

I've added the machine back to salt on OSD and ran a highstate to see whether this might have caused the destruction. But everything went smoothly, and another reboot showed that the machine can be considered stable for now:

openqa:~ # salt-key -y -a storage.qa.suse.de
The following keys are going to be accepted:
Unaccepted Keys:
storage.qa.suse.de
Key for minion storage.qa.suse.de accepted.
openqa:~ # salt 'storage.qa.suse.de' test.ping
storage.qa.suse.de:
    True
openqa:~ # salt 'storage.qa.suse.de' state.highstate
storage.qa.suse.de:

Summary for storage.qa.suse.de
--------------
Succeeded: 192
Failed:      0
--------------
Total states run:     192
Total run time:     7.146 s
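
For scripted checks, the summary above can be evaluated with a small filter. This is only a sketch: highstate_ok is a hypothetical helper that reads highstate output on stdin and relies solely on the "Failed:" summary lines shown in the transcript.

```shell
# Hypothetical helper: read `salt ... state.highstate` output on stdin and
# succeed only if at least one "Failed:" summary line is present and every
# such line reports 0. Output without a summary (for example a minion that
# did not return) counts as a failure.
highstate_ok() {
    awk '
        /^Failed:/ { seen = 1; if ($2 + 0 != 0) bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}
```

A possible use, following the ssh osd convention from the description: ssh osd "sudo salt 'storage.qa.suse.de' state.highstate" | highstate_ok && echo "highstate clean".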

#10

Updated by nicksinger almost 4 years ago

okurz wrote:

Hi Nick, this is great. Thank you for the quick reaction and detailed update.

cdywan wrote:

Do we have any documentation on what should be booted here? Devices used?

Can you remind us about that context? Should we consider cheap redeploys based on autoyast profiles?

I think it could help even though it masks problems, which I don't like. However, instead of autoyast I'd take a look into the yomi project to have an installation based on salt.

#11

Updated by okurz almost 4 years ago

Related to action #90629: administration of the new "Storage Server" added

#12

Updated by okurz almost 4 years ago

Related to action #69577: Handle installation of the new "Storage Server" added

#13

Updated by okurz almost 4 years ago

Related to action #66709: Storage server for OSD and monitoring added

#14

Updated by okurz almost 4 years ago

nicksinger wrote:

okurz wrote:

Hi Nick, this is great. Thank you for the quick reaction and detailed update.

cdywan wrote:

Do we have any documentation on what should be booted here? Devices used?

Can you remind us about that context?

Maybe you missed this question. I linked three tickets for reference.

Should we consider cheap redeploys based on autoyast profiles?

I think it could help even though it masks problems, which I don't like.

It should be as simple as https://w3.nue.suse.com/~okurz/ay-openqa-worker.xml as the rest is done by salt.

However, instead of autoyast I'd take a look into the yomi project to have an installation based on salt.

Yes, I would like that as well.
