action #77011
openqaworker7 (o3) is stuck in "recovery mode" as visible over IPMI SoL
Status: closed
Description
Observation
openqaworker7 (o3) is not reachable over ssh and is stuck in "recovery mode", as visible over IPMI SoL
Acceptance criteria
- AC1: openqaworker7 is working on openQA tests again
Suggestions
- call ipmi-openqaworker7-ipmi sol activate and fix whatever keeps the machine in recovery mode
Further details
Hint: use the IPMI aliases from https://gitlab.suse.de/openqa/salt-pillars-openqa
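The alias from that pillars repository presumably wraps a plain ipmitool SoL call; a minimal sketch, assuming the usual lanplus invocation (BMC hostname, user and password are placeholders, not the real values):

ipmitool -I lanplus -H <openqaworker7-bmc-host> -U <user> -P <password> sol activate
# if a stale SoL session is still attached, drop it first:
ipmitool -I lanplus -H <openqaworker7-bmc-host> -U <user> -P <password> sol deactivate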
Updated by livdywan almost 4 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Updated by livdywan almost 4 years ago
- Status changed from In Progress to Feedback
After a reboot the machine seems responsive again and jobs are being processed, e.g. https://openqa.opensuse.org/tests/1460574#live (unfortunately I can't link to the query of active workers on the machine directly)
Updated by favogt almost 4 years ago
- Status changed from Resolved to Workable
- Priority changed from Immediate to Urgent
It seems like this happened again. Over the remote console it was visible that dependency services for var-lib-openqa.mount failed.
openqa_nvme_format.service does a grep openqa /proc/mdstat || ... mdadm --create. The condition check looked broken as /proc/mdstat only showed the numeric id, i.e. md127, and so it tried to run mdadm --create on active devices. Additionally, the way || and | are mixed in ExecStart means that the condition is ignored anyway.
The result is that the service only succeeds (and the system boots) if /dev/md127 is (auto-)assembled after the service ran.
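To see why the condition gets ignored: in a POSIX shell, | binds tighter than || and &&, while || and && chain left to right with equal precedence, so a successful guard in front of || still lets the command after && run. A minimal demonstration with placeholder commands (echo standing in for mdadm --create):

sh -c 'true || echo some-output | grep -q x && echo "create runs although the guard succeeded" || echo "fallback"'
# prints "create runs although the guard succeeded":
# the line parses as { true || (echo ... | grep -q x); } && echo ... || echo ...
# 'true' succeeds, the pipeline is skipped, the combined exit status is 0,
# so the && branch executes anyway.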
I tried to fix that, but then the console froze and the system had to be reset. It came up properly that time.
Updated by livdywan almost 4 years ago
The system is operational right now. But it seems we need salt-states-openqa to be deployed 🤔
There are some failures in the pipelines but no logs. Running pipelines manually on master atm
Updated by livdywan almost 4 years ago
Pipelines/deployment seems to have succeeded
Updated by okurz almost 4 years ago
cdywan wrote:
The system is operational right now. But it seems we need salt-states-openqa to be deployed 🤔
What do you mean by that and why? Also for that we would need #43934 first.
openqaworker7 is part of o3 and not touched by triggering any gitlab CI pipelines.
But as we reopened this ticket please keep in mind that we should look for at least two improvements.
Updated by livdywan almost 4 years ago
okurz wrote:
cdywan wrote:
The system is operational right now. But it seems we need salt-states-openqa to be deployed 🤔
What do you mean by that and why? Also for that we would need #43934 first.
openqaworker7 is part of o3 and not touched by triggering any gitlab CI pipelines.
The mitigation done by @favogt consisted of manually fixing the mount points. The above salt change fixes the mount points, but the pipeline was failing at the time, so I investigated that. Note how I didn't touch the ticket state or draw any conclusion so far; I just transferred ideas and steps from IRC to a non-temporary place ;-)
Updated by okurz almost 4 years ago
- Related to action #49694: openqaworker7 lost one NVMe added
Updated by okurz almost 4 years ago
cdywan wrote:
[…] The above salt change fixes the mount points, but the pipeline was failing at the time, so I investigated that
yeah but please understand that the changes in the salt repo are only applied to OSD and have no impact on O3 infrastructure unless you manually copy paste the instructions from the salt git repo to the manually maintained files on o3 machines.
Updated by okurz almost 4 years ago
- Status changed from Workable to Feedback
- Priority changed from Urgent to Normal
I assume fvogt manually applied the slightly different variant:
ExecStart=/bin/sh -c 'test -e /dev/md/openqa || lsblk -n | grep -v nvme | grep "/$" && mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1 || mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3'
which should also work for now. As the machine has been up since another automatic reboot today, I think the original problem was resolved, hence reducing prio and setting "Feedback". Now, following our best practice of "think of at least a second improvement", @cdywan what can you think of? :)
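Written out as an explicit script, one possible reading of what that one-liner intends (an interpretation only, not the deployed unit) looks like this:

# skip everything if the array already exists
if [ ! -e /dev/md/openqa ]; then
    # only act when the root filesystem sits on a non-NVMe device
    if lsblk -n | grep -v nvme | grep -q "/$"; then
        # RAID0 across all whole NVMe devices ...
        mdadm --create /dev/md/openqa --level=0 --force \
            --raid-devices="$(ls /dev/nvme?n1 | wc -l)" --run /dev/nvme?n1 \
        || mdadm --create /dev/md/openqa --level=0 --force \
            --raid-devices=1 --run /dev/nvme0n1p3   # ... or fall back to a single partition
    fi
fi
# Because of the ||/| precedence issue described above, the actual ExecStart does not
# behave like this when /dev/md/openqa already exists: mdadm --create still runs.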
Updated by livdywan almost 4 years ago
okurz wrote:
cdywan wrote:
[…] The above salt change fixes the mount points, but the pipeline was failing at the time, so I investigated that
yeah but please understand that the changes in the salt repo are only applied to OSD and have no impact on O3 infrastructure unless you manually copy paste the instructions from the salt git repo to the manually maintained files on o3 machines.
Sure. But you asked why I was looking into that.
I assume fvogt manually applied the slightly different variant:
ExecStart=/bin/sh -c 'test -e /dev/md/openqa || lsblk -n | grep -v nvme | grep "/$" && mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1 || mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3'
which should also work for now. As the machine has been up since another automatic reboot today, I think the original problem was resolved, hence reducing prio and setting "Feedback". Now, following our best practice of "think of at least a second improvement", @cdywan what can you think of? :)
How about documenting this in the wiki:
https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Mitigation-of-boot-failure-or-disk-issues
Updated by okurz almost 4 years ago
- Status changed from Feedback to Resolved
Yes, good idea. Thanks for that. And we also have the ticket to use salt for o3. So I guess enough learned :)
Updated by favogt almost 4 years ago
It broke again two days ago. It seems like openqa_nvme_format.service was not failing, but got triggered in an endless loop, recreating the raid and formatting it over and over again.
The mdadm --stop makes the source of var-lib-openqa.mount disappear, so systemd stopped it. The creation made it appear again, so systemd might schedule yet another start of openqa_nvme_format.service. I commented out the first ExecStart to avoid that, but it showed what the actual issue is. The unit is activated way too early due to DefaultDependencies=no:
Nov 23 10:15:36 openqaworker7 systemd[1]: systemd 234 running in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 -IDN default-hierarchy=hybrid)
Nov 23 10:15:36 openqaworker7 systemd[1]: Detected architecture x86-64.
Nov 23 10:15:36 openqaworker7 systemd[1]: Set hostname to <openqaworker7>.
Nov 23 10:15:38 openqaworker7 sh[935]: grep: /proc/mdstat: No such file or directory
Nov 23 10:15:38 openqaworker7 kernel: BTRFS info (device sda1): disk space caching is enabled
Nov 23 10:15:38 openqaworker7 systemd[1]: Started Load Kernel Modules.
Nov 23 10:15:38 openqaworker7 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 23 10:15:38 openqaworker7 systemd[1]: Failed to start Setup NVMe before mounting it.
Nov 23 10:15:38 openqaworker7 systemd[1]: Dependency failed for /var/lib/openqa.
Unfortunately the system stopped responding to input for some reason, so it had to go through another slow reboot... I added
Requires=dev-md-openqa.device
After=dev-md-openqa.device
to the unit as it only does the formatting now. After yet another reboot it's now up, AFAICT race-free.
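For reference, the same ordering problem can be inspected and the fix applied with standard systemd tooling; a sketch, assuming one prefers a drop-in over editing the unit file in place:

# inspect the unit's ordering and what the mount waits for
systemctl show -p DefaultDependencies -p After -p Requires openqa_nvme_format.service
systemd-analyze critical-chain var-lib-openqa.mount

# add the device dependency as a drop-in instead of editing the unit directly
mkdir -p /etc/systemd/system/openqa_nvme_format.service.d
cat > /etc/systemd/system/openqa_nvme_format.service.d/wait-for-md.conf <<'EOF'
[Unit]
Requires=dev-md-openqa.device
After=dev-md-openqa.device
EOF
systemctl daemon-reload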
Updated by okurz almost 4 years ago
- Status changed from Resolved to Workable
- Assignee changed from livdywan to favogt
- Target version changed from Ready to future
I appreciate your efforts but that mdadm call was there for a reason.
Certainly the approach was not perfect but just removing it is changing what it was "designed" for. Yes, it's only used if the RAID does not assemble automatically. This happens if the same service definition is used on fresh installs as well as when NVMe devices change, which is what happens and will happen again. I will comment the suggestion in the ticket, reopen it and hope you can bring back the mdadm call in a way that fulfills the original requirements as well as convince you that the design is not "broken" anymore.
Updated by favogt almost 4 years ago
okurz wrote:
I appreciate your efforts but that mdadm call was there for a reason.
Well, in the vast majority of boots it did more harm (breaking completely) than good (recreating the raid if necessary), so the current state is arguably much better.
Certainly the approach was not perfect but just removing it is changing what it was "designed" for. Yes, it's only used if the RAID does not assemble automatically. This happens if the same service definition is used on fresh installs as well as when NVMe devices change, which is what happens and will happen again.
That's only manually triggered though and happens maybe once a year at most?
I will comment the suggestion in the ticket,
Which ticket?
reopen it and hope you can bring back the mdadm call in a way that fulfills the original requirements as well as convince you that the design is not "broken" anymore.
I split the removed parts into a new openqa_nvme_create.service. It seems like the "array works" case works, but the "array needs (re)creation" case isn't tested yet.
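Purely as an illustration of that split (the real openqa_nvme_create.service lives on the o3 workers and may differ), the create step boils down to something like:

# (re)create the array only when it is not there yet; the formatting stays
# in openqa_nvme_format.service
[ -e /dev/md/openqa ] && exit 0
mdadm --create /dev/md/openqa --level=0 --force \
    --raid-devices="$(ls /dev/nvme?n1 | wc -l)" --run /dev/nvme?n1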
Updated by okurz almost 4 years ago
favogt wrote:
I split the removed parts into a new openqa_nvme_create.service. It seems like the "array works" case works, but the "array needs (re)creation" case isn't tested yet.
Feel free to test this part on one of the o3 workers as well by destroying/recreating the RAID and file system on top, no problem as this is only the cache+pool of openQA workers which holds only temporary data.
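A sketch of such a destructive test (device names and filesystem details are assumptions; everything on the array is lost):

umount /var/lib/openqa          # free the filesystem on top of the array
mdadm --stop /dev/md/openqa     # stop the array
sgdisk --zap /dev/nvme0n1       # wipe the partition structures on both NVMes
sgdisk --zap /dev/nvme1n1
# a 'nofail' option on the /var/lib/openqa entry in /etc/fstab keeps the boot
# from hanging in case the recreation fails
systemctl reboot
# after the reboot, verify that the array and the mount came back
cat /proc/mdstat
findmnt /var/lib/openqa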
Updated by favogt almost 4 years ago
- Status changed from Workable to Resolved
okurz wrote:
favogt wrote:
I split the removed parts into a new openqa_nvme_create.service. It seems like the "array works" case works, but the "array needs (re)creation" case isn't tested yet.
Feel free to test this part on one of the o3 workers as well by destroying/recreating the RAID and file system on top, no problem as this is only the cache+pool of openQA workers which holds only temporary data.
After the system was idle, I ran sgdisk --zap on /dev/nvme{0,1}n1 and also added nofail to the mountpoint in /etc/fstab to ensure that even if it fails to create or mount it, the system is reachable over ssh, then triggered a reboot.
When looking at the serial console I was shocked when it tried to do PXE, but that's apparently just the configured boot order...
On the first boot it didn't bother to create the array as a || true was missing and so it aborted early. After fixing that and trying again it failed because mdadm --create didn't trigger dev-md-openqa.device to be up, due to a missing udev event. I added a workaround and after zapping again it worked as expected after a reboot.
Feel free to copy /etc/systemd/system/openqa_nvme_{create,format,prepare}.service into the git repo if you think they're ok.