action #77011

openqaworker7 (o3) is stuck in "recovery mode" as visible over IPMI SoL

Added by okurz 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-11-05
Due date:
% Done: 0%
Estimated time:

Description

Observation

openqaworker7 (o3) is not reachable over ssh and is stuck in "recovery mode", as visible over IPMI SoL.

Acceptance criteria

  • AC1: openqaworker7 is working on openQA tests again

Suggestions

  • call ipmi-openqaworker7-ipmi sol activate and fix

Further details

Hint, use the IPMI aliases from https://gitlab.suse.de/openqa/salt-pillars-openqa


Related issues

Related to openQA Infrastructure - action #49694: openqaworker7 lost one NVMe (Resolved, 2019-03-26)

History

#1 Updated by cdywan 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

#2 Updated by cdywan 9 months ago

  • Status changed from In Progress to Feedback

After a reboot the machine seems responsive again and jobs are being processed, e.g. https://openqa.opensuse.org/tests/1460574#live (unfortunately I can't link to the query of active workers on the machine directly).

#3 Updated by cdywan 9 months ago

  • Status changed from Feedback to Resolved

#4 Updated by favogt 9 months ago

  • Status changed from Resolved to Workable
  • Priority changed from Immediate to Urgent

It seems like this happened again. Over the remote console it was visible that dependency services for var-lib-openqa.mount failed.
openqa_nvme_format.service does a grep openqa /proc/mdstat || ... mdadm --create. The condition check looked broken, as /proc/mdstat only showed the numeric id, i.e. md127, and so it tried to mdadm --create on active devices. Additionally, the way || and | are mixed in ExecStart means that the condition is ignored anyway.
The result is that the service only succeeds (and the system boots) if /dev/md127 is (auto) assembled after the service ran.
I tried to fix that, but then the console froze and the system had to be reset. It came up properly that time.
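
To illustrate the precedence point: in POSIX sh a pipeline binds tighter than || and &&, while || and && have equal precedence and associate left to right. A list of the shape A || B | C && D therefore runs D whenever A succeeds. The following is only a sketch; array_exists, root_not_on_nvme and create_array are hypothetical stand-ins, not the commands from the unit, and the elided part of the real ExecStart stays elided.

```shell
#!/bin/sh
# Hypothetical stand-ins for the real commands in the unit.
array_exists()     { return 0; }   # pretend the array is already assembled
root_not_on_nvme() { echo "/"; }   # pretend the listing ends with "/"
create_array()     { echo "mdadm --create would run"; }

# Parsed as: (array_exists || (root_not_on_nvme | grep ...)) && create_array
ungrouped() {
    array_exists || root_not_on_nvme | grep -q "/$" && create_array
}

# Explicit grouping: nothing after || runs when the array already exists.
grouped() {
    array_exists || { root_not_on_nvme | grep -q "/$" && create_array; }
}

echo "ungrouped: $(ungrouped)"
echo "grouped: $(grouped)"
```

With the ungrouped form the create step runs even though the first check succeeded; the braces restore the intended "only if the array is missing" behaviour.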

#5 Updated by cdywan 9 months ago

The system is operational right now. But it seems we need salt-states-openqa to be deployed 🤔

There are some failures in the pipelines but no logs. Running pipelines manually on master atm

#6 Updated by cdywan 9 months ago

Pipelines/deployment seems to have succeeded

#7 Updated by okurz 9 months ago

cdywan wrote:

The system is operational right now. But it seems we need salt-states-openqa to be deployed 🤔

What do you mean with that and why? Also for that we would need #43934 first.
openqaworker7 is part of o3 and not touched by triggering any gitlab CI pipelines.

But as we reopened this ticket please keep in mind that we should look for at least two improvements.

#8 Updated by cdywan 9 months ago

okurz wrote:

cdywan wrote:

The system is operational right now. But it seems we need salt-states-openqa to be deployed 🤔

What do you mean with that and why? Also for that we would need #43934 first.
openqaworker7 is part of o3 and not touched by triggering any gitlab CI pipelines.

The mitigation done by favogt consisted of manually fixing the mount points. The above salt change fixes the mount points, but the pipeline was failing at the time, so I investigated that. Note how I didn't touch the ticket state or draw any conclusion so far; I just transferred ideas and steps from IRC to a non-temporary place ;-)

#9 Updated by okurz 9 months ago

#10 Updated by okurz 9 months ago

cdywan wrote:

[…] The above salt change fixes the mount points, but the pipeline was failing at the time, so I investigated that

yeah but please understand that the changes in the salt repo are only applied to OSD and have no impact on O3 infrastructure unless you manually copy paste the instructions from the salt git repo to the manually maintained files on o3 machines.

#11 Updated by okurz 9 months ago

  • Status changed from Workable to Feedback
  • Priority changed from Urgent to Normal

I assume fvogt applied manually the slightly different variant:

ExecStart=/bin/sh -c 'test -e /dev/md/openqa || lsblk -n | grep -v nvme | grep "/$" && mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1 || mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3'

which should also work for now. As the machine is up since another automatic reboot today I think the original problem was resolved, hence reducing prio and setting "Feedback". Now, following our best practice of "think of at least a second improvement", cdywan what can you think of? :)
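
As a side note on the --raid-devices=$(ls /dev/nvme?n1 | wc -l) part of that line: the ? in the glob matches exactly one character, so partitions such as nvme0n1p3 are not counted. A minimal sketch of the same count without parsing ls output, run against a scratch directory with made-up file names instead of /dev (count_nvme is a hypothetical helper):

```shell
#!/bin/sh
# Scratch directory with fake device nodes; nothing under /dev is touched.
devdir=$(mktemp -d)
touch "$devdir/nvme0n1" "$devdir/nvme1n1" "$devdir/nvme0n1p3"

# count_nvme DIR: print how many entries in DIR match nvme?n1.
count_nvme() {
    set -- "$1"/nvme?n1
    # If nothing matched, the pattern is left unexpanded and does not exist.
    [ -e "$1" ] || { echo 0; return; }
    echo "$#"
}

count_nvme "$devdir"
```

On the sample directory this prints 2, because only nvme0n1 and nvme1n1 match the pattern.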

#12 Updated by cdywan 8 months ago

okurz wrote:

cdywan wrote:

[…] The above salt change fixes the mount points, but the pipeline was failing at the time, so I investigated that

yeah but please understand that the changes in the salt repo are only applied to OSD and have no impact on O3 infrastructure unless you manually copy paste the instructions from the salt git repo to the manually maintained files on o3 machines.

Sure. But you asked why I was looking into that.

I assume fvogt applied manually the slightly different variant:

ExecStart=/bin/sh -c 'test -e /dev/md/openqa || lsblk -n | grep -v nvme | grep "/$" && mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1 || mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3'

which should also work for now. As the machine is up since another automatic reboot today I think the original problem was resolved, hence reducing prio and setting "Feedback". Now, following our best practice of "think of at least a second improvement", cdywan what can you think of? :)

How about documenting this in the wiki:

https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Mitigation-of-boot-failure-or-disk-issues

#13 Updated by okurz 8 months ago

  • Status changed from Feedback to Resolved

Yes, good idea. Thanks for that. Well, and we also have the ticket to use salt for o3 as well. So I guess enough learned :)

#14 Updated by favogt 8 months ago

It broke again two days ago. It seems like openqa_nvme_format.service was not failing, but got triggered in an endless loop recreating the raid and formatting it over and over again.

The mdadm --stop makes the source of var-lib-openqa.mount disappear, so systemd stopped it. The creation made it appear again, so systemd might schedule yet another start of openqa_nvme_format.service. I commented out the first ExecStart to avoid that, but it showed what the actual issue is. The unit is activated way too early due to DefaultDependencies=no:

Nov 23 10:15:36 openqaworker7 systemd[1]: systemd 234 running in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 -IDN default-hierarchy=hybrid)
Nov 23 10:15:36 openqaworker7 systemd[1]: Detected architecture x86-64.
Nov 23 10:15:36 openqaworker7 systemd[1]: Set hostname to <openqaworker7>.
Nov 23 10:15:38 openqaworker7 sh[935]: grep: /proc/mdstat: No such file or directory
Nov 23 10:15:38 openqaworker7 kernel: BTRFS info (device sda1): disk space caching is enabled
Nov 23 10:15:38 openqaworker7 systemd[1]: Started Load Kernel Modules.
Nov 23 10:15:38 openqaworker7 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 23 10:15:38 openqaworker7 systemd[1]: Failed to start Setup NVMe before mounting it.
Nov 23 10:15:38 openqaworker7 systemd[1]: Dependency failed for /var/lib/openqa.
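
The "No such file or directory" line above comes from grep itself. grep exits with 1 when nothing matches and with 2 on an error such as a missing file, so a check can tell "array not listed" apart from "cannot tell yet". A small sketch, assuming a hypothetical helper mdstat_state and an illustrative one-line sample in place of the real /proc/mdstat:

```shell
#!/bin/sh
# mdstat_state PATTERN FILE: classify the grep result instead of
# treating every non-zero status as "array missing".
mdstat_state() {
    grep -q "$1" "$2" 2>/dev/null
    case $? in
        0) echo present ;;
        1) echo absent ;;
        *) echo unknown ;;   # e.g. the file does not exist yet
    esac
}

scratch=$(mktemp -d)
# Illustrative content only, not copied from openqaworker7.
printf 'md127 : active raid0 nvme1n1[1] nvme0n1[0]\n' > "$scratch/mdstat"

mdstat_state md127  "$scratch/mdstat"    # array listed under its numeric id
mdstat_state openqa "$scratch/mdstat"    # name not listed
mdstat_state openqa "$scratch/missing"   # file not there at all
```

The second call also mirrors the earlier observation that searching for "openqa" does not match when the array only shows up as md127.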

Unfortunately the system stopped responding to input for some reason, so it had to go through another slow reboot... I added

Requires=dev-md-openqa.device
After=dev-md-openqa.device

to the unit as it only does the formatting now. After yet another reboot it's now up, AFAICT race-free.
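
For reference, the same two directives could also be carried in a systemd drop-in instead of editing the unit file in place. This is only a sketch of that alternative, not what was done on the machine: the drop-in file name 10-wait-for-md.conf is made up, and DROPIN_ROOT defaults to a scratch directory so nothing on a real host is modified (on a real system it would presumably be /etc/systemd/system).

```shell
#!/bin/sh
# Assumption: DROPIN_ROOT stands in for /etc/systemd/system; it defaults
# to a temporary directory so the sketch is safe to run anywhere.
DROPIN_ROOT="${DROPIN_ROOT:-$(mktemp -d)}"
dropin_dir="$DROPIN_ROOT/openqa_nvme_format.service.d"
mkdir -p "$dropin_dir"

# Hypothetical file name; the directives are the ones quoted above.
cat > "$dropin_dir/10-wait-for-md.conf" <<'EOF'
[Unit]
Requires=dev-md-openqa.device
After=dev-md-openqa.device
EOF

cat "$dropin_dir/10-wait-for-md.conf"
# On a real host this would be followed by: systemctl daemon-reload
```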

#15 Updated by okurz 8 months ago

  • Status changed from Resolved to Workable
  • Assignee changed from cdywan to favogt
  • Target version changed from Ready to future

I appreciate your efforts but that mdadm call was there for a reason.
Certainly the approach was not perfect but just removing it is changing what it was "designed" for. Yes, it's only used if the RAID does not assemble automatically. This happens if the same service definition is used on fresh installs as well as when NVMe devices change, which is what happens and will happen again. I will comment the suggestion in the ticket, reopen it and hope you can bring back the mdadm call in a way that fulfills the original requirements as well as convince you that the design is not "broken" anymore.

#16 Updated by favogt 8 months ago

okurz wrote:

I appreciate your efforts but that mdadm call was there for a reason.

Well, in the vast majority of boots it did more harm (break completely) than good (recreate the raid if necessary), so the current state is arguably much better.

Certainly the approach was not perfect but just removing it is changing what it was "designed" for. Yes, it's only used if the RAID does not assemble automatically. This happens if the same service definition is used on fresh installs as well as when NVMe devices change, which is what happens and will happen again.

That's only manually triggered though and happens maybe once a year at most?

I will comment the suggestion in the ticket,

Which ticket?

reopen it and hope you can bring back the mdadm call in a way that fulfills the original requirements as well as convince you that the design is not "broken" anymore.

I split the removed parts into a new openqa_nvme_create.service. It seems like the "array works" case works, but the "array needs (re)creation" case isn't tested yet.

#17 Updated by okurz 8 months ago

favogt wrote:

I split the removed parts into a new openqa_nvme_create.service. It seems like the "array works" case works, but the "array needs (re)creation" case isn't tested yet.

Feel free to test this part on one of the o3 workers as well by destroying/recreating the RAID and file system on top, no problem as this is only the cache+pool of openQA workers which holds only temporary data.

#18 Updated by favogt 8 months ago

  • Status changed from Workable to Resolved

okurz wrote:

favogt wrote:

I split the removed parts into a new openqa_nvme_create.service. It seems like the "array works" case works, but the "array needs (re)creation" case isn't tested yet.

Feel free to test this part on one of the o3 workers as well by destroying/recreating the RAID and file system on top, no problem as this is only the cache+pool of openQA workers which holds only temporary data.

After the system was idle, I ran sgdisk --zap on /dev/nvme{0,1}n1 and also added nofail to the mountpoint in /etc/fstab to ensure that even if it fails to create or mount it, the system is reachable over ssh, then triggered a reboot.
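
The nofail change can be sketched as follows. The fstab line here is hypothetical (the actual entry on openqaworker7, including the file system type, is not shown in this ticket), and the edit is applied to a temporary copy rather than /etc/fstab.

```shell
#!/bin/sh
# Hypothetical fstab entry in a scratch file; /etc/fstab is not touched.
fstab=$(mktemp)
printf '%s\n' '/dev/md/openqa /var/lib/openqa ext2 defaults 0 0' > "$fstab"

# Append ",nofail" to the options (4th field) of the /var/lib/openqa
# entry, unless the option is already present.
awk '$2 == "/var/lib/openqa" && $4 !~ /(^|,)nofail(,|$)/ { $4 = $4 ",nofail" }
     { print }' "$fstab" > "$fstab.new" && mv "$fstab.new" "$fstab"

cat "$fstab"
```

With nofail set, a failure to mount /var/lib/openqa does not block the boot, so the machine stays reachable over ssh for debugging.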

When looking at the serial console I was shocked when it tried to do PXE, but that's apparently just the configured boot order...

On the first boot it didn't bother to create the array as a || true was missing and so it aborted early. After fixing that and trying again it failed because mdadm --create didn't trigger dev-md-openqa.device to be up, due to a missing udev event. I added a workaround and after zapping again it worked as expected after a reboot.
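
The effect of the missing || true can be reproduced in isolation. The sketch below assumes the service body behaves like a script run with sh -e, where the first failing command ends execution; the exact layout of the unit is not shown in this ticket, and false merely stands in for a check that legitimately returns non-zero when the array is absent.

```shell
#!/bin/sh
# Without the guard: the failing check aborts before the create step.
without_guard=$(sh -ec 'false; echo "create step reached"' 2>/dev/null)

# With the guard: the non-zero status is swallowed and execution continues.
with_guard=$(sh -ec 'false || true; echo "create step reached"')

echo "without guard: '$without_guard'"
echo "with guard: '$with_guard'"
```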

Feel free to copy /etc/systemd/system/openqa_nvme_{create,format,prepare}.service into the git repo if you think they're ok.
