action #160098
openQA Project - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
After the upgrade to Leap 15.6 osiris showed no proper mount points again for libvirt VMs size:S
0%
Description
Observation
Similar to #125087: after I upgraded osiris to Leap 15.6, virt-manager only showed a single VM called "first-test-vm" and none of the other machines such as "okurz". I recovered manually by logging in over ssh and running
systemctl stop libvirtd
drbdadm up r0
systemctl restart etc-libvirt.mount
systemctl start libvirtd
However, we should ensure this does not happen again, and libvirtd should not even be started while those dependencies are unfulfilled.
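One way to avoid starting libvirtd without its storage would be a small guard that checks the mount table first. The following is only a sketch: the function name `missing_mounts` and the `MOUNTS_FILE` override are made up for illustration, and the example paths are the two mount points backed by the DRBD device on osiris.

```shell
#!/bin/sh
# Print every given mount point that is NOT listed in the mount table.
# MOUNTS_FILE is a hypothetical override for trying this out; the default
# is the kernel's view of the current mounts.
missing_mounts() {
    mounts_file="${MOUNTS_FILE:-/proc/self/mounts}"
    for mp in "$@"; do
        # Field 2 of each mount table line is the mount point.
        if ! awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"; then
            echo "$mp"
        fi
    done
}

# Example guard (not executed here):
#   missing=$(missing_mounts /etc/libvirt /var/lib/libvirt/images)
#   [ -z "$missing" ] || { echo "not starting libvirtd, missing: $missing" >&2; exit 1; }
```

Such a check only reports the symptom. Expressing the dependency in the systemd units themselves is likely the more robust fix.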
Acceptance criteria
- AC1: osiris shows expected production VMs consistently after multiple reboots
Suggestions
- Look into what happened and what we did in the past in related tickets
- Check if this is reproducible on reboots
- Take a look into logs of drbd to see what the problem was (storage shared with seth)
- Maybe restarts of systemd services can be enough
Rollback actions
Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences alertname=Failed systemd services alert (not openqa)
Updated by okurz 6 months ago
- Copied from action #157996: Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.6 added
Updated by ybonatakis 6 months ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by ybonatakis 6 months ago
The problem was reproducible after each reboot.
I found some complaints about a dependency.
I checked drbd.service and it can't start. From the logs:
May 10 17:14:39 osiris-1 (drbd)[2493]: drbd.service: Failed at step EXEC spawning /lib/drbd/scripts/drbd: No such file or directory
The directory has been moved:
iob@osiris-1:~> ls /lib/drbd.rpmmoved/scripts/
drbd drbd-service-shim.sh drbd-wait-promotable.sh ocf.ra.wrapper.sh
iob@osiris-1:~> ls -la /lib/drbd.rpmmoved
total 4
drwxr-xr-x 1 root root 14 Sep 18 2023 .
drwxr-xr-x 1 root root 898 May 8 19:38 ..
lrwxrwxrwx 1 root root 21 Sep 18 2023 scripts -> /usr/lib/drbd/scripts
iob@osiris-1:~> ls /usr/lib/drbd/scripts/
drbd drbd-service-shim.sh drbd-wait-promotable.sh ocf.ra.wrapper.sh
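A failure like the "Failed at step EXEC" above could be caught before a reboot by checking that the executables referenced in a unit actually exist. This is a rough sketch under assumptions: `missing_exec_paths` is a hypothetical helper, it only looks at the first word of each `Exec*=` line, and it expects the unit text in a file (for example saved from `systemctl cat drbd.service`).

```shell
#!/bin/sh
# Print the executable of every Exec*= line in a unit file that does not exist.
# Leading systemd prefixes such as "-", "@", "+", "!" or ":" are stripped first.
missing_exec_paths() {
    unit_file="$1"
    sed -n 's/^Exec[A-Za-z]*=[-@:+!]*\([^ ]*\).*/\1/p' "$unit_file" |
    while IFS= read -r exe; do
        [ -n "$exe" ] || continue
        # -e follows symlinks, so a dangling symlink is reported as missing too.
        [ -e "$exe" ] || echo "$exe"
    done
}

# Example (not executed here):
#   systemctl cat drbd.service > /tmp/drbd.unit
#   missing_exec_paths /tmp/drbd.unit
```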
Updated by livdywan 6 months ago
2024-05-10 18:07:00 osiris-1 drbd 1
This is what I see on the systemd services panel, with the alert still firing.
Updated by openqa_review 6 months ago
- Due date set to 2024-05-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 6 months ago
May 14 13:44:28 osiris-1 drbd[3897]: /usr/lib/drbd/scripts/drbd: line 148: /var/lib/linstor/loop_device_mapping: No such file or directory
According to Yannis this is safe to ignore due to an unrelated upstream bug.
We addressed the following errors by modifying /etc/drbd.d/qsf-cluster.res:
May 14 13:44:28 osiris-1 drbd[3940]: outdated-wfc-timeout has to be shorter than degr-wfc-timeout
May 14 13:44:28 osiris-1 drbd[3940]: outdated-wfc-timeout implicitly set to degr-wfc-timeout (10s)
->
wfc-timeout 30;
degr-wfc-timeout 15;
outdated-wfc-timeout 10;
However, it's not clear whether this addresses the underlying issue.
After the last test reboot neither the mounts nor drbd were loaded, yet nothing failed. What, if anything, should load the devices/drbd/mounts? The mount units are ordered After=drbd. We researched the presumably related systemd error but couldn't find what it refers to:
-- Boot 2356cb9e267c474b8e5f8bb375e70678 --
May 14 14:13:10 osiris-1 systemd[1]: Dependency failed for Mount DRBD device with libvirtd configs.
May 14 14:13:10 osiris-1 systemd[1]: etc-libvirt.mount: Job etc-libvirt.mount/start failed with result 'dependency'.
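The constraint named in the drbd warning above (outdated-wfc-timeout shorter than degr-wfc-timeout) could be checked against the resource file before restarting the service. The sketch below assumes each setting sits on its own line in the form `name value;`. The helper names `drbd_setting` and `check_wfc_timeouts` are invented for illustration and only cover this one rule, not DRBD's full validation.

```shell
#!/bin/sh
# Print the value of a "<name> <value>;" setting from a DRBD resource file.
# $1 = file, $2 = setting name; prints the first match or nothing.
drbd_setting() {
    awk -v key="$2" '$1 == key { sub(/;$/, "", $2); print $2; exit }' "$1"
}

# Check that outdated-wfc-timeout is shorter than degr-wfc-timeout.
check_wfc_timeouts() {
    degr=$(drbd_setting "$1" degr-wfc-timeout)
    outdated=$(drbd_setting "$1" outdated-wfc-timeout)
    if [ -z "$degr" ] || [ -z "$outdated" ]; then
        echo "skipped: a timeout is not set explicitly"
        return 0
    fi
    if [ "$outdated" -lt "$degr" ]; then
        echo "ok"
    else
        echo "outdated-wfc-timeout ($outdated) is not shorter than degr-wfc-timeout ($degr)"
        return 1
    fi
}

# Example (not executed here):
#   check_wfc_timeouts /etc/drbd.d/qsf-cluster.res
```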
Updated by ybonatakis 6 months ago
- Description updated (diff)
- Status changed from In Progress to Workable
- Priority changed from Urgent to High
An alert silence was added.
Updated by ybonatakis 6 months ago
The only thing I have actually done is fix the exec path of drbd, but that still doesn't seem to fix the problem.
For some reason the service doesn't seem to run on reboot.
Updated by nicksinger 6 months ago
I think the only missing part was systemctl enable drbd, which is - despite the warning - required. More on that can be read here: https://progress.opensuse.org/issues/125087#note-13
Updated by ybonatakis 6 months ago · Edited
- Status changed from Workable to Resolved
So the machine now mounts /var/lib/libvirt/images after reboot:
/dev/drbd0 on /etc/libvirt type btrfs (rw,relatime,discard=async,space_cache,subvolid=261,subvol=/libvirt-configs)
/dev/drbd0 on /var/lib/libvirt/images type btrfs (rw,relatime,discard=async,space_cache,subvolid=257,subvol=/libvirt)
Marking the ticket as resolved.
The silence on Grafana was also removed.
Updated by okurz 6 months ago
- Has duplicate action #160493: Failed systemd services alert (osiris-1 drbd) added
Updated by nicksinger 6 months ago
- Assignee changed from ybonatakis to nicksinger
Updated by nicksinger 6 months ago
I found an old bug of mine describing the same problems: https://bugzilla.opensuse.org/show_bug.cgi?id=1215462 - I updated it with the newest findings. A first test of mine to use the drbd@r0.service failed because the packaged script (/lib/drbd/scripts/drbd-service-shim.sh) fails to access another file:
drbd-r0[7168]: /lib/drbd/scripts/drbd-service-shim.sh: line 46: /usr/sbin/drbdsetup: No such file or directory
Guess I will try to add something along the lines of https://superuser.com/a/1322035 now, explicitly depending on these directories being present. It makes no sense because all files/scripts/symlinks are on the same btrfs subvolume, but it is still worth a shot.
Updated by nicksinger 6 months ago
- Status changed from In Progress to Feedback
I used systemctl edit drbd to add:
[Unit]
RequiresMountsFor=/usr/lib/drbd/scripts/
RequiresMountsFor=/lib/drbd/scripts/
as described before. While checking reboot stability I realized that when libvirtd is accessed too early during boot (e.g. with virsh list), the mounts may not be present yet. To solve that, I adjusted /etc/systemd/system/var-lib-libvirt-images.mount and /etc/systemd/system/etc-libvirt.mount and added Before=virtqemud.service in the [Unit] section and RequiredBy=virtqemud.socket in the [Install] section. I still have to check reboot stability multiple times.
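To make the drbd drop-in reproducible without an interactive editor, it could be written by a script. This is a sketch, not what was run on osiris: `write_drbd_dropin` and the `DROPIN_DIR` override are hypothetical, and the default target is the override.conf location that `systemctl edit drbd` uses for drbd.service.

```shell
#!/bin/sh
# Write the drop-in that `systemctl edit drbd` would otherwise create interactively.
# DROPIN_DIR is a hypothetical override so the function can be tried in a scratch directory.
write_drbd_dropin() {
    dropin_dir="${DROPIN_DIR:-/etc/systemd/system/drbd.service.d}"
    mkdir -p "$dropin_dir" || return 1
    printf '%s\n' \
        '[Unit]' \
        'RequiresMountsFor=/usr/lib/drbd/scripts/' \
        'RequiresMountsFor=/lib/drbd/scripts/' \
        > "$dropin_dir/override.conf"
}

# Example (not executed here, needs root):
#   write_drbd_dropin && systemctl daemon-reload
```

Unlike `systemctl edit`, writing the file directly does not reload systemd, so `systemctl daemon-reload` is needed afterwards.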
Updated by nicksinger 6 months ago
- Status changed from Feedback to Resolved
I validated that machines are up and running after osiris is rebooted.