action #160098

closed

openQA Project - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

After the upgrade to Leap 15.6 osiris showed no proper mount points again for libvirt VMs size:S

Added by okurz about 1 month ago. Updated 22 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-05-08
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Similar to #125087: after I upgraded osiris to Leap 15.6, virt-manager showed only a single VM called "first-test-vm" and no other machines such as "okurz". I recovered manually by logging in over ssh and running:

systemctl stop libvirtd
drbdadm up r0
systemctl restart etc-libvirt.mount
systemctl start libvirtd

However, we should ensure this does not happen again, and libvirtd should not even attempt to start if these dependencies are not fulfilled.
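
A minimal sketch of such a guard, assuming the two mount points that appear later in this ticket (/etc/libvirt and /var/lib/libvirt/images). The function name all_mounted and the MOUNTS_FILE override are hypothetical and exist only for illustration:

```shell
# Return 0 only if every given path is a mount point in the mounts table.
# MOUNTS_FILE is a hypothetical override for testing; the default is the kernel's table.
all_mounted() {
    mounts_file="${MOUNTS_FILE:-/proc/self/mounts}"
    for target in "$@"; do
        # Field 2 of each line in the mounts table is the mount point.
        if ! awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$mounts_file"; then
            echo "missing mount: $target" >&2
            return 1
        fi
    done
    return 0
}
```

It could be used as: all_mounted /etc/libvirt /var/lib/libvirt/images && systemctl start libvirtd. An ordering dependency in systemd would be the more native way to express the same condition; the script is only meant to make the check explicit.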

Acceptance criteria

  • AC1: osiris shows expected production VMs consistently after multiple reboots

Suggestions

  • Look into what happened and what we did in the past in related tickets
  • Check if this is reproducible on reboots
  • Take a look into logs of drbd to see what the problem was (storage shared with seth)
  • Maybe restarts of systemd services can be enough

Rollback actions

Remove the silence for alertname=Failed systemd services alert (not openqa) from https://stats.openqa-monitor.qa.suse.de/alerting/silences


Related issues 2 (0 open, 2 closed)

Has duplicate openQA Infrastructure - action #160493: Failed systemd services alert (osiris-1 drbd) | Rejected | okurz | 2024-05-17
Copied from openQA Project - action #157996: Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.6 | Resolved | okurz

Actions #1

Updated by okurz about 1 month ago

  • Copied from action #157996: Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.6 added
Actions #3

Updated by okurz about 1 month ago

  • Subject changed from After upgrade to Leap 15.6 osiris again showed no prober mount points for libvirt VMs to After upgrade to Leap 15.6 osiris again showed no proper mount points for libvirt VMs
Actions #4

Updated by okurz about 1 month ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (Regressions/Crashes)
Actions #5

Updated by okurz about 1 month ago

  • Subject changed from After upgrade to Leap 15.6 osiris again showed no proper mount points for libvirt VMs to After the upgrade to Leap 15.6 osiris showed no proper mount points again for libvirt VMs size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by ybonatakis about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #7

Updated by ybonatakis about 1 month ago

The problem was reproducible after each reboot.

I found some complaints about a dependency.
I checked drbd.service and it can't start. From the logs:

May 10 17:14:39 osiris-1 (drbd)[2493]: drbd.service: Failed at step EXEC spawning /lib/drbd/scripts/drbd: No such file or directory

The directory has been moved:

iob@osiris-1:~> ls /lib/drbd.rpmmoved/scripts/
drbd                     drbd-service-shim.sh     drbd-wait-promotable.sh  ocf.ra.wrapper.sh
iob@osiris-1:~> ls -la /lib/drbd.rpmmoved
total 4
drwxr-xr-x 1 root root  14 Sep 18  2023 .
drwxr-xr-x 1 root root 898 May  8 19:38 ..
lrwxrwxrwx 1 root root  21 Sep 18  2023 scripts -> /usr/lib/drbd/scripts
iob@osiris-1:~> ls /usr/lib/drbd/scripts/
drbd                     drbd-service-shim.sh     drbd-wait-promotable.sh  ocf.ra.wrapper.sh
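
The same check can be scripted so that a missing or relocated script path is noticed right after an upgrade. This is only a sketch; the helper name resolve_or_fail is hypothetical:

```shell
# Print the fully resolved location of a path, or fail if it does not exist.
resolve_or_fail() {
    target_path="$1"
    # -e follows symlinks, so a dangling link counts as missing too.
    if [ ! -e "$target_path" ]; then
        echo "missing: $target_path" >&2
        return 1
    fi
    readlink -f "$target_path"
}
```

For example, resolve_or_fail /lib/drbd/scripts/drbd would fail in the state shown in the log above, while resolve_or_fail /lib/drbd.rpmmoved/scripts/drbd would print the location under /usr/lib/drbd/scripts.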
Actions #8

Updated by livdywan about 1 month ago

2024-05-10 18:07:00 osiris-1 drbd 1

This is what I see on the systemd services panel, with the alert still firing.

Actions #9

Updated by openqa_review about 1 month ago

  • Due date set to 2024-05-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz about 1 month ago

  • Priority changed from High to Urgent

Raising to urgent due to the new alert. Please consider adding an alert silence for the time being.

Actions #11

Updated by livdywan about 1 month ago

May 14 13:44:28 osiris-1 drbd[3897]: /usr/lib/drbd/scripts/drbd: line 148: /var/lib/linstor/loop_device_mapping: No such file or directory

According to Yannis this is safe to ignore due to an unrelated upstream bug.

We addressed the following errors by modifying /etc/drbd.d/qsf-cluster.res:

May 14 13:44:28 osiris-1 drbd[3940]: outdated-wfc-timeout has to be shorter than degr-wfc-timeout
May 14 13:44:28 osiris-1 drbd[3940]: outdated-wfc-timeout implicitly set to degr-wfc-timeout (10s)

->

        wfc-timeout  30;
        degr-wfc-timeout 15;
        outdated-wfc-timeout 10;

However, it is not clear whether this addresses the underlying issue.
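
The ordering constraint from the log message can be checked before a reboot. The sketch below takes the wording of the message literally (outdated-wfc-timeout strictly shorter than degr-wfc-timeout); whether drbd itself accepts equal values is not established in this ticket. The function names are hypothetical, and the parsing assumes one "key value;" setting per line as in the snippet above:

```shell
# Extract the numeric value of a "<key> <n>;" setting from a DRBD resource file.
get_timeout() {
    awk -v k="$1" '$1 == k { gsub(/;/, "", $2); print $2; exit }' "$2"
}

# Return 0 if outdated-wfc-timeout is strictly shorter than degr-wfc-timeout,
# 1 if it is not, and 2 if either setting is absent from the file.
check_wfc_timeouts() {
    res_file="$1"
    degr=$(get_timeout degr-wfc-timeout "$res_file")
    outdated=$(get_timeout outdated-wfc-timeout "$res_file")
    if [ -z "$degr" ] || [ -z "$outdated" ]; then
        echo "timeouts not both set in $res_file" >&2
        return 2
    fi
    [ "$outdated" -lt "$degr" ]
}
```

On osiris this would be called as check_wfc_timeouts /etc/drbd.d/qsf-cluster.res.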

After the last test reboot, neither the mounts nor drbd were loaded, yet nothing failed. What, if anything, should load the devices/drbd/mounts? The mount units are ordered After=drbd. Researching the presumably related systemd error, we couldn't find what it refers to:

-- Boot 2356cb9e267c474b8e5f8bb375e70678 --
May 14 14:13:10 osiris-1 systemd[1]: Dependency failed for Mount DRBD device with libvirtd configs.
May 14 14:13:10 osiris-1 systemd[1]: etc-libvirt.mount: Job etc-libvirt.mount/start failed with result 'dependency'.
Actions #12

Updated by ybonatakis about 1 month ago

  • Description updated (diff)
  • Status changed from In Progress to Workable
  • Priority changed from Urgent to High

An alert silence was added.

Actions #13

Updated by ybonatakis about 1 month ago

The only thing I have actually done is fix the exec path of drbd, but that still doesn't seem to fix the problem.
For some reason the service doesn't seem to run on reboot.

Actions #14

Updated by nicksinger about 1 month ago

I think the only missing part was running systemctl enable drbd, which is required despite the warning. More on that can be read here: https://progress.opensuse.org/issues/125087#note-13
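
A small sketch of making that step repeatable, for example from a post-upgrade checklist. The function name ensure_enabled and the SYSTEMCTL override are hypothetical; the override exists only so the logic can be exercised without touching a real system:

```shell
# Enable a unit only if systemd does not already report it as enabled.
# SYSTEMCTL is a hypothetical override for testing; the default is the real systemctl.
ensure_enabled() {
    unit="$1"
    systemctl_cmd="${SYSTEMCTL:-systemctl}"
    if "$systemctl_cmd" is-enabled --quiet "$unit"; then
        echo "$unit already enabled"
    else
        "$systemctl_cmd" enable "$unit"
    fi
}
```

Usage would be ensure_enabled drbd.service, run as root.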

Actions #15

Updated by ybonatakis about 1 month ago · Edited

  • Status changed from Workable to Resolved

So the machine now mounts /etc/libvirt and /var/lib/libvirt/images after reboot:

/dev/drbd0 on /etc/libvirt type btrfs (rw,relatime,discard=async,space_cache,subvolid=261,subvol=/libvirt-configs)
/dev/drbd0 on /var/lib/libvirt/images type btrfs (rw,relatime,discard=async,space_cache,subvolid=257,subvol=/libvirt)

Marking the ticket as resolved.
The silence on Grafana was also removed.

Actions #16

Updated by okurz about 1 month ago

  • Due date deleted (2024-05-25)
Actions #17

Updated by okurz 29 days ago

  • Has duplicate action #160493: Failed systemd services alert (osiris-1 drbd) added
Actions #18

Updated by okurz 29 days ago

  • Category set to Regressions/Crashes
  • Status changed from Resolved to Workable
Actions #19

Updated by nicksinger 29 days ago

  • Assignee changed from ybonatakis to nicksinger
Actions #20

Updated by nicksinger 29 days ago

  • Status changed from Workable to In Progress
Actions #21

Updated by nicksinger 29 days ago

I found an old bug report of mine describing the same problems: https://bugzilla.opensuse.org/show_bug.cgi?id=1215462. I updated it with the newest findings. My first attempt to use drbd@r0.service failed because the packaged script (/lib/drbd/scripts/drbd-service-shim.sh) fails to access another file:

drbd-r0[7168]: /lib/drbd/scripts/drbd-service-shim.sh: line 46: /usr/sbin/drbdsetup: No such file or directory

I guess I will now try to add something along the lines of https://superuser.com/a/1322035, explicitly depending on these directories being present. It makes no sense because all files/scripts/symlinks are on the same btrfs subvolume, but it is still worth a shot.

Actions #22

Updated by nicksinger 28 days ago

  • Status changed from In Progress to Feedback

I used systemctl edit drbd to add:

[Unit]
RequiresMountsFor=/usr/lib/drbd/scripts/
RequiresMountsFor=/lib/drbd/scripts/

as described before. While checking reboot stability I realized that when libvirtd is accessed too early during boot (e.g. with virsh list), the mounts may not be present yet. To solve that, I adjusted /etc/systemd/system/var-lib-libvirt-images.mount and /etc/systemd/system/etc-libvirt.mount, adding Before=virtqemud.service in the [Unit] section and RequiredBy=virtqemud.socket in the [Install] section. I still have to check reboot stability multiple times.
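
Whether both mount units actually carry these directives can be verified with a small helper. This is a sketch only: the function name unit_has is hypothetical, and it matches exact "Key=Value" lines, so a directive listing several values on one line would not be detected:

```shell
# Return 0 if FILE contains the exact line KEY=VALUE inside [SECTION].
# Arguments: FILE SECTION KEY=VALUE
unit_has() {
    awk -v s="[$2]" -v kv="$3" '
        /^\[/ { in_section = ($0 == s) }       # track which section we are in
        in_section && $0 == kv { found = 1 }   # exact match within that section
        END { exit !found }
    ' "$1"
}
```

For example: unit_has /etc/systemd/system/etc-libvirt.mount Unit Before=virtqemud.service. Note that systemd applies [Install] settings such as RequiredBy= only when the unit is enabled, so the mount units need to be enabled (or re-enabled) after the edit.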

Actions #23

Updated by nicksinger 22 days ago

  • Status changed from Feedback to Resolved

I validated that machines are up and running after osiris is rebooted.
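
A sketch of how such a validation could be repeated after each reboot. The function name missing_vms is hypothetical, and the VM names in the usage line are the two mentioned in the description, not a complete list of production VMs:

```shell
# Read actual VM names from stdin (one per line) and print every expected
# name that is absent. Returns 0 if none are missing, 1 otherwise.
missing_vms() {
    actual=$(cat)
    rc=0
    for vm in "$@"; do
        # -x matches whole lines, -F treats the name as a fixed string.
        if ! printf '%s\n' "$actual" | grep -qxF -- "$vm"; then
            echo "$vm"
            rc=1
        fi
    done
    return $rc
}
```

It could be fed from libvirt as: virsh list --all --name | missing_vms first-test-vm okurz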
