action #125783

closed

[jeos] Test fails in kdump_and_crash on SLE 12sp5 and 15sp4 XEN after worker migration from SLES to Leap 15.4

Added by pdostal almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Start date:
2023-03-10
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-extratest@svirt-xen-hvm fails in
kdump_and_crash

Test suite description

The test has been failing since the worker was updated from SLE 15sp2 to Leap 15.4.

Reproducible

Fails since (at least) Build 20230307-1

Expected result

Last good: 20230306-1 (or more recent)

Hypotheses

  • H1: crashkernel command line options on the dom0 host, changed via salt, cause the regression
  • H2: the SLE15-SP2->SLE15-SP4 upgrade causes the regression
  • H3: the SLE15-SP2->Leap 15.4 upgrade causes the regression
  • H4: Problem in the product SUT OS version -> crosscheck with other versions

Obstacles

  1. Having no video is making investigation harder; why was MAX_JOB_TIME=10800 chosen, which disables video?
  2. openqa-investigate jobs would have helped to tell whether test or product differences matter. IMHO you wasted a couple of hours due to the wish of QE Kernel to have the job_done_hooks disabled for the kernel tests in particular
  3. the svirt backend makes investigation a lot harder. https://openqa.suse.de/tests/10630883/logfile?filename=serial0.txt from an x86_64 qemu job shows the line "[ 5.058618] reboot: Restarting system", likely coming from the crashkernel, so we know that the crashkernel was actually running properly and doing something assumed useful for 5s. For svirt-xen we don't have that line even in the "last good" case, so we don't even know if the crashkernel was running properly; we just see the effect in the dump files when we recreate a new VM using the same disk image and boot it again
  4. Tests like https://openqa.suse.de/tests/10654962# sle-15-SP4-JeOS-for-kvm-and-xen-Updates-x86_64-Build20230309-1-jeos-extratest@uefi-virtio-vga don't include kdump so I can't use that for crosschecking. Why is kdump_and_crash not enabled on SLE15-SP4?
  5. On Tumbleweed the kdump_and_crash module in https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=extra_tests_textmode&version=Tumbleweed , latest https://openqa.opensuse.org/tests/3166236, has apparently been affected by a bug preventing proper operation since 2021 (!), namely https://bugzilla.opensuse.org/show_bug.cgi?id=1190434

Suggestions

  1. If this is seen as critical we can roll back the OS installation and ask someone to crosscheck with different hypervisor OS versions before upgrading again
  2. We keep openqaw5-xen as is and skip the test module kdump_and_crash until someone finds a better solution

I would go with option 2 and also wait for https://bugzilla.opensuse.org/show_bug.cgi?id=1190434

Further details

Always latest result in this scenario: latest


Related issues: 2 (0 open, 2 closed)

Related to openQA Tests (public) - action #116644: [qe-core][functional][sle15sp5] test fails in bootloader_svirt, the test is using different network bridge 'ovs-system' rather than 'br0' (Resolved, rfan1, 2022-09-16)

Related to openQA Tests (public) - action #126647: [qe-core] test fails in bootloader_start - we should use br0 not ovs-system (Resolved, rfan1, 2023-03-27)

Actions #1

Updated by pdostal almost 2 years ago

  • Project changed from openQA Tests (public) to 208
  • Category deleted (Bugs in existing tests)
Actions #2

Updated by pcervinka almost 2 years ago

System hangs:

[2023-03-09T17:14:10.902374+01:00] [debug] [pid:28087] <<< backend::baseclass::run_ssh_cmd(cmd="! virsh  dominfo openQA-SUT-4 | grep -w 'shut off'", wantarray=0, keep_open=1)
[2023-03-09T17:14:10.902610+01:00] [debug] [pid:28087] <<< backend::baseclass::run_ssh(cmd="! virsh  dominfo openQA-SUT-4 | grep -w 'shut off'", wantarray=0, keep_open=1)
[2023-03-09T17:14:10.902792+01:00] [debug] [pid:28087] <<< backend::baseclass::new_ssh_connection(wantarray=0, blocking=1, keep_open=1)
[2023-03-09T17:14:10.904202+01:00] [debug] [pid:28087] Use existing SSH connection (key:hostname=openqaw5-xen.qa.suse.de,username=root,port=22)
[2023-03-09T17:14:11.108949+01:00] [debug] [pid:28087] [run_ssh_cmd(! virsh  dominfo openQA-SUT-4 | grep -w 'shut off')] exit-code: 0
[2023-03-09T17:14:12.110794+01:00] [debug] [pid:28087] <<< backend::baseclass::run_ssh_cmd(cmd="! virsh  dominfo openQA-SUT-4 | grep -w 'shut off'", keep_open=1, wantarray=0)
[2023-03-09T17:14:12.111099+01:00] [debug] [pid:28087] <<< backend::baseclass::run_ssh(cmd="! virsh  dominfo openQA-SUT-4 | grep -w 'shut off'", keep_open=1, wantarray=0)
[2023-03-09T17:14:12.111354+01:00] [debug] [pid:28087] <<< backend::baseclass::new_ssh_connection(wantarray=0, keep_open=1, blocking=1)
[2023-03-09T17:14:12.113991+01:00] [debug] [pid:28087] Use existing SSH connection (key:hostname=openqaw5-xen.qa.suse.de,username=root,port=22)
[2023-03-09T17:14:12.333940+01:00] [debug] [pid:28087] [run_ssh_cmd(! virsh  dominfo openQA-SUT-4 | grep -w 'shut off')] exit-code: 0
[2023-03-09T17:14:13.335776+01:00] [debug] [pid:28087] <<< backend::baseclass::run_ssh_cmd(cmd="! virsh  dominfo openQA-SUT-4 | grep -w 'shut off'", keep_open=1, wantarray=0)
[2023-03-09T17:14:13.335939+01:00] [debug] [pid:28087] <<< backend::baseclass::run_ssh(cmd="! virsh  dominfo openQA-SUT-4 | grep -w 'shut off'", keep_open=1, wantarray=0)
[2023-03-09T17:14:13.336079+01:00] [debug] [pid:28087] <<< backend::baseclass::new_ssh_connection(wantarray=0, keep_open=1, blocking=1)

The last OK job ran on Monday and the first failure was on Tuesday. The kernel version is the same and there were no changes in the test code.

The issue is most probably related to the migration of the workers to Leap 15.4.

Actions #3

Updated by pcervinka almost 2 years ago

  • Subject changed from Test fails in kdump_and_crash on SLE 12sp5 and 15sp4 XEN to Test fails in kdump_and_crash on SLE 12sp5 and 15sp4 XEN after worker migration from SLES to Leap 15.4
Actions #4

Updated by pcervinka almost 2 years ago

  • Priority changed from Normal to High
Actions #5

Updated by pcervinka almost 2 years ago

  • Priority changed from High to Urgent
Actions #6

Updated by pcervinka almost 2 years ago

@okurz could you please check?

Actions #8

Updated by okurz almost 2 years ago

I don't know what you want me to check there.

Actions #9

Updated by okurz almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready

Trying to reproduce with a smaller test run:

openqa-clone-job --within-instance https://openqa.suse.de/tests/10660683 _GROUP=0 BUILD= TEST+="-poo125783-okurz" SCHEDULE=tests/installation/bootloader_svirt,tests/installation/bootloader_uefi,tests/jeos/firstrun,tests/console/kdump_and_crash

Created job #10661334: sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-BuildPDOSTAL-jeos-extratest@svirt-xen-hvm -> https://openqa.suse.de/t10661334

That fails because the SLE system isn't registered. Going with

openqa-clone-job --within-instance https://openqa.suse.de/tests/10660683 _GROUP=0 BUILD= TEST+="-poo125783-okurz" CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse#fix/poo125783_xen_vm_still_running_after_triggered_crash SCHEDULE=tests/installation/bootloader_svirt,tests/installation/bootloader_uefi,tests/jeos/firstrun,tests/console/poo125783_trigger_crash

Created job #10661416: sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-BuildPDOSTAL-jeos-extratest@svirt-xen-hvm -> https://openqa.suse.de/t10661416

using a simplified test module for investigation.

While the job was running I used tail -f autoinst-log.txt on the worker instance. On openqaw5-xen.qa.suse.de I was running while sleep 1; do date; virsh dominfo openQA-SUT-3 | grep '^State'; done, which eventually showed "shut off", so that looks like what the test expected.

Specifically what is visible in logs:

[2023-03-10T14:49:47.799205+01:00] [debug] [pid:25280] <<< testapi::type_string(string="echo c > /proc/sysrq-trigger\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[  135.403885] sysrq: SysRq : Trigger a crash

and in parallel:

2023-03-10T14:49:47 CET
State:          running
2023-03-10T14:49:48 CET
State:          running
2023-03-10T14:49:49 CET
State:          shut off

So it takes only 2s to power down; that can't be the problem. It seems I will need to rerun the original test case, or find a better selection for the subset.

openqa-clone-job --within-instance https://openqa.suse.de/tests/10660683 _GROUP=0 BUILD= TEST+="-poo125783-okurz" CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse#fix/poo125783_xen_vm_still_running_after_triggered_crash SCHEDULE=tests/installation/bootloader_svirt,tests/installation/bootloader_uefi,tests/jeos/firstrun,tests/console/suseconnect_scc,tests/console/kdump_and_crash

Created job #10661442: sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-BuildPDOSTAL-jeos-extratest@svirt-xen-hvm -> https://openqa.suse.de/t10661442

That one reproduces the problem and took 13m which is acceptable.

While the test was running I looked up open ports on openqaw5-xen.qa.suse.de with ss -ltp4n and found openQA-SUT-5, the relevant VM here, to be offering VNC on 0.0.0.0:5905. So I connected from my local machine as well with vncviewer -Shared openqaw5-xen.qa.suse.de:5905 and could follow the test, but as soon as the kernel crash was triggered the VNC connection was terminated, which supports the hypothesis that the VM is actually shut down.

I found in the output of virsh list --all that the corresponding VM loses its "Id" when the kernel crash is triggered but the state does not go to "shut off" as expected by the test. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-managing_guest_virtual_machines_with_virsh-retrieving_guest_virtual_machine_information actually mentions something similar for the domid command:

Note, domid returns - for guest virtual machines that are in shut off state

So maybe we can just look for the corresponding state with

! (virsh dominfo $sut | grep -w 'shut off' || virsh domid $sut | grep -q '^-')  
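A rough illustration of the intended polling logic as a standalone shell loop; the $sut variable and domain name are illustrative only, this is not the actual backend code:

sut=openQA-SUT-5   # illustrative domain name
# wait until the domain is reported as shut off, either via dominfo or via a '-' domid
while ! { virsh dominfo "$sut" | grep -qw 'shut off' || virsh domid "$sut" | grep -q '^-'; }; do
    sleep 1
done
echo "$sut reported as shut off"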

I hotpatched /usr/lib/os-autoinst/backend/svirt.pm on worker2.oqa.suse.de and retriggered another test, https://openqa.suse.de/tests/10662086, which is also running on instance worker2:49 with openQA-SUT-5. My web research regarding that change of behaviour did not yield any results, and there is nothing on bugzilla.opensuse.org either, so maybe this is a new bug.

The change is effective, as visible in https://openqa.suse.de/tests/10662103#step/kdump_and_crash/58 detecting the system as "shut off", but in a later step in https://openqa.suse.de/tests/10662103#step/kdump_and_crash/76 no crash dump is detected, so maybe we terminated the VM prematurely.
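As a side note, a quick manual way to check whether kdump actually produced a dump on the SUT would be something like the following; /var/crash is the usual SUSE default for KDUMP_SAVEDIR and is an assumption here, not verified on this SUT:

# run on the SUT after the crash/reboot cycle
ls -l /var/crash/
# a successful kdump run normally leaves a timestamped directory containing the vmcore and dmesg output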

I also asked in #discuss-virtualization in https://suse.slack.com/archives/C02AY3MUZD5/p1678462049789819

Hi, after upgrading a Xen hypervisor host from SLE15-SP2 to SLE15-SP4 and now Leap 15.4, triggering a deliberate crash+kdump in a VM causes the VM to be listed as still "running" in virsh list --all, in contrast to previously "shut off". Interestingly the "Id" vanishes in that list and is replaced with -. If I just trigger a crash in a VM that does not have crashkernel configured the system goes to "shut off" after 2s as expected. Does that sound familiar to anyone?

As I don't see any output from the crashkernel execution I wonder if that can be made visible. Does the corresponding qemu test case show it?

We still have the option to roll back the migration SLE->Leap or SLE15-SP2 to SLE15-SP4, see

openqaw5-xen:~ # snapper list
   # | Type   | Pre # | Date                     | User | Cleanup | Description             | Userdata
-----+--------+-------+--------------------------+------+---------+-------------------------+--------------
  0  | single |       |                          | root |         | current                 |
  1* | single |       | 2016-08-26T10:36:24 CEST | root |         | first root filesystem   |
580  | pre    |       | 2021-12-06T15:56:11 CET  | root | number  | zypp(zypper)            | important=yes
581  | post   |   580 | 2021-12-06T16:00:42 CET  | root | number  |                         | important=yes
582  | pre    |       | 2022-01-31T14:31:56 CET  | root | number  | zypp(zypper)            | important=yes
583  | post   |   582 | 2022-01-31T14:32:25 CET  | root | number  |                         | important=yes
584  | pre    |       | 2022-01-31T15:32:40 CET  | root | number  | zypp(zypper)            | important=yes
585  | post   |   584 | 2022-01-31T15:35:04 CET  | root | number  |                         | important=yes
599  | pre    |       | 2023-03-07T16:54:58 CET  | root | number  | before online migration | important=yes
600  | post   |   599 | 2023-03-07T16:55:20 CET  | root | number  | after online migration  | important=yes
601  | pre    |       | 2023-03-07T16:56:27 CET  | root | number  | before online migration | important=yes
602  | post   |   601 | 2023-03-07T17:24:28 CET  | root | number  | after online migration  | important=yes
Actions #10

Updated by pcervinka almost 2 years ago

We verified the test scenario: there were no test-related changes, the kernel versions are the same, and it fails on two different products. It is not related to maintenance updates (also tested without them). There is only one change, the migration from SLES to Leap, so logically we are asking the person who participated in the migration, which is the most likely root cause (it started to fail after it).

Actions #11

Updated by pdostal almost 2 years ago

Here is a run without maintenance updates https://openqa.suse.de/t10660683

Actions #12

Updated by okurz almost 2 years ago

pdostal wrote:

Here is a run without maintenance updates https://openqa.suse.de/t10660683

I confirmed the same in tests like https://openqa.suse.de/t10662103, the pending SLE maintenance updates don't have an impact.

To have a video I called

openqa-clone-job --within-instance https://openqa.suse.de/tests/10660683 _GROUP=0 BUILD= TEST+="-poo125783-okurz" CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse#fix/poo125783_xen_vm_still_running_after_triggered_crash SCHEDULE=tests/installation/bootloader_svirt,tests/installation/bootloader_uefi,tests/jeos/firstrun,tests/console/suseconnect_scc,tests/console/kdump_and_crash MAX_JOB_TIME= NOVIDEO=0

Created job #10663267: sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-BuildPDOSTAL-jeos-extratest@svirt-xen-hvm -> https://openqa.suse.de/t10663267

with the resulting video
https://openqa.suse.de/tests/10663267/video?filename=video.ogv
This shows that the kernel crash backtrace is displayed from the point when the kernel crash is triggered until the end of test execution.

I also ran xl during test execution and as soon as the kernel crash is deliberately triggered the xl output changes to

(null)                                     363     0     1     --p--d

What is that supposed to mean? An undefined state?
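For reference, not taken from this job: the xl(1) manpage documents the six state flags shown by xl list as r (running), b (blocked), p (paused), s (shutdown), c (crashed) and d (dying), so the column could be read roughly like this:

# "--p--d" in the State column of xl list would indicate a domain that is paused and dying
xl list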

Both of the above support the hypothesis that the VM crashes completely and is unable to load the crashkernel.

I tried to look for SLE15 jobs for comparison in the database using select * from jobs where machine~'svirt-xen' and version ~ '15' and test='jeos-extratest' and t_started >= '2023-01-01' order by t_started desc; but found only jobs which are missing that test module. Why is kdump_and_crash not tested on SLE15?

ok, trying with 15-SP4 jobs anyway with a custom schedule:

for i in 10661812 10661804; do openqa-clone-job --within-instance https://openqa.suse.de/tests/$i _GROUP=0 BUILD= TEST+="-poo125783-okurz" CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse#fix/poo125783_xen_vm_still_running_after_triggered_crash SCHEDULE=tests/installation/bootloader_svirt,tests/jeos/firstrun,tests/console/suseconnect_scc,tests/console/kdump_and_crash MAX_JOB_TIME= NOVIDEO=0; done

Created job #10663653: sle-15-SP4-JeOS-for-kvm-and-xen-Updates-x86_64-Build20230309-1-jeos-extratest@svirt-xen-pv -> https://openqa.suse.de/t10663653
Created job #10663654: sle-15-SP4-JeOS-for-kvm-and-xen-Updates-x86_64-Build20230309-1-jeos-extratest@svirt-xen-hvm -> https://openqa.suse.de/t10663654
to have both hvm and pv.

https://openqa.suse.de/tests/10663653#step/kdump_and_crash/417 shows kdump not being supported on PV; that job actually passed. https://openqa.suse.de/tests/10663651#step/kdump_and_crash/449 shows HVM kdump failing on SLE15-SP4, same as in the originally reported job, and the same in https://openqa.suse.de/tests/10663654#step/kdump_and_crash/448, also SLE15-SP4.

I called

diff -Naur /.snapshots/580/snapshot/boot/grub2/grub.cfg /boot/grub2/grub.cfg

to check for differences in the boot config in addition to the version differences. Maybe it's the crashkernel on dom0?

The relevant section is likely

-                       multiboot2      /boot/xen-4.13.3_04-3.37.gz placeholder dom0_mem=4G dom0_max_vcpus=4 loglvl=all guest_loglvl=all loglvl=all guest_loglvl=all crashkernel=185M\<4G ${xen_rm_opts}
-                       echo    'Loading Linux 5.3.18-24.86-default ...'
-                       module2 /boot/vmlinuz-5.3.18-24.86-default placeholder root=UUID=5cbe801d-dfc3-4f39-b3d3-70b3e1a4eaa1  ${extra_cmdline} console=tty console=ttyS0,115200n splash=silent showopts crashkernel=113M,high crashkernel=72M,low
+                       multiboot2      /boot/xen-4.16.3_04-150400.4.22.gz placeholder dom0_mem=4G dom0_max_vcpus=4 loglvl=all guest_loglvl=all loglvl=all guest_loglvl=all crashkernel=185M\<4G ${xen_rm_opts}
+                       echo    'Loading Linux 5.14.21-150400.24.46-default ...'
+                       module2 /boot/vmlinuz-5.14.21-150400.24.46-default placeholder root=UUID=5cbe801d-dfc3-4f39-b3d3-70b3e1a4eaa1  ${extra_cmdline} console=tty console=ttyS0,115200n splash=silent showopts crashkernel=113M,high crashkernel=72M,low crashkernel=210M

I could try to run without crashkernel=210M. I removed that and rebooted. I should also disable velociraptor-client to see if that has an impact.
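For the record, a sketch of how such an option is typically removed on an (open)SUSE host, assuming it was added via GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub; how it actually ended up in grub.cfg on this host is not confirmed here:

grep crashkernel /etc/default/grub        # check where the option comes from
# edit GRUB_CMDLINE_LINUX_DEFAULT to drop crashkernel=210M, then regenerate the config
grub2-mkconfig -o /boot/grub2/grub.cfg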

But for now I broke it completely, as in https://openqa.suse.de/tests/10666066, with error details that are not at all obvious.

Might have to do with #116644?

Actions #13

Updated by szarate almost 2 years ago

Following https://suse.slack.com/archives/C02CANHLANP/p1678484051238939:

So on one end, it looks like the kvm module is not loaded. Not sure if Xen needs it too, but libvirt is complaining (I did modprobe it):

journalctl  --no-pager | grep kvm
Mar 10 20:59:26 openqaw5-xen libvirtd[2617]: 2634: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 21:08:47 openqaw5-xen libvirtd[8261]: 8327: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 21:11:43 openqaw5-xen libvirtd[10207]: 10224: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 21:25:45 openqaw5-xen libvirtd[2845]: 2863: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 21:46:29 openqaw5-xen libvirtd[16316]: 16333: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 21:56:37 openqaw5-xen libvirtd[2709]: 2727: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 22:16:40 openqaw5-xen libvirtd[2476]: 2493: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 22:22:06 openqaw5-xen libvirtd[2719]: 2737: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 22:37:51 openqaw5-xen libvirtd[7520]: 7537: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory
Mar 10 23:58:04 openqaw5-xen libvirtd[8606]: 8623: error : virHostCPUGetCPUID:1320 : Unable to open /dev/kvm: No such file or directory

I restarted https://openqa.suse.de/tests/10666124

Mar 11 00:04:28 openqaw5-xen root[9249]: /etc/xen/scripts/vif-openvswitch: Writing backend/vif/7/0/hotplug-error ovs-vsctl --timeout=30 -- --if-exists del-port vif7.0 -- add-port ovs-system vif7.0 -- set interface vif7.0 external-ids:"xen-vm-name"="openQA-SUT-1" -- set interface vif7.0 external-ids:"xen-vm-uuid"="f53564e0-6ee1-4a56-b460-06d8f029dfdc" -- set interface vif7.0 external-ids:"attached-mac"="00:16:3e:3f:0d:44" failed backend/vif/7/0/hotplug-status error to xenstore.

Looking into cat /var/log/xen/xen-hotplug.log (got the hint from: https://wiki.xenproject.org/wiki/Xen_Common_Problems#Xen_complains_about_.22hotplug_scripts_not_working.22)

tail xen-hotplug.log

ovs-vsctl: no bridge named ovs-system
ovs-vsctl: no bridge named ovs-system
ovs-vsctl: no bridge named ovs-system
ovs-vsctl: no bridge named ovs-system

The openvswitch setup is missing, or at least I have the hunch that it is the source of the problem (not sure about the root cause though), but it could be that the migration of the database didn't work (product bug) and well... :)
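A couple of commands that could help confirm that hunch, i.e. whether any openvswitch bridge exists at all on the host (illustrative, not output captured from openqaw5-xen):

ovs-vsctl show       # overall openvswitch configuration, including bridges and ports
ovs-vsctl list-br    # lists existing bridges; empty output would match the "no bridge named ovs-system" error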

Mar 11 00:04:28 openqaw5-xen wickedd[1100]: __ni_netdev_event(vif7.0, idx=12, device-create)
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: __ni_netdev_event(vif7.0, idx=12, device-create)
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: vif7.0[12]: received interface event: device-create
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp4[792]: __ni_netdev_event(vif7.0, idx=12, device-create)
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: Created dhcp6 device for 'vif7.0' and index 12
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp4[792]: Created dhcp4 device for 'vif7.0' and index 12
Mar 11 00:04:28 openqaw5-xen wickedd-auto4[791]: __ni_netdev_event(vif7.0, idx=12, device-create)
Mar 11 00:04:28 openqaw5-xen wickedd-auto4[791]: Created autoip4 device for 'vif7.0' and index 12
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: process event signal deviceCreate from /org/opensuse/Network/Interface/12; uuid=<>
Mar 11 00:04:28 openqaw5-xen systemd-udevd[9183]: Using default interface naming scheme 'sle15-sp4'.
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: received new non-ready device vif7.0 (/org/opensuse/Network/Interface/12)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: ni_fsm_process_event: No ready fsm worker for /org/opensuse/Network/Interface/12 found to process deviceCreate event
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: waiting for 0 devices to become ready (0 explicitly requested)
Mar 11 00:04:28 openqaw5-xen wickedd[1100]: __ni_netdev_event(vif7.0, idx=12, device-ready)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: process event signal deviceReady from /org/opensuse/Network/Interface/12; uuid=<>
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: received new ready device vif7.0 (/org/opensuse/Network/Interface/12)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0: <device> identified device as "eth0"
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0: resolved reference ovs-bridge/ports/port/device to subordinate device eth0
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0: linkUp transition requires eth0 worker to be in state device-up..max
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: Config device hierarchy structure:
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: lo [1]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: eth1 [3]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0 [5]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]:  ?-- eth0 [2]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: ovs-system [4]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]:  *-- eth0 [2]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: vif2.0 [7]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: vif7.0 [12]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: new device vif7.0, class managed-netif
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: vif7.0: processed event deviceReady; state=stopped, policy=<none>
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: waiting for 0 devices to become ready (0 explicitly requested)
Mar 11 00:04:28 openqaw5-xen root[9215]: /etc/xen/scripts/vif-openvswitch: online type_if=vif XENBUS_PATH=backend/vif/7/0
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: __ni_netdev_event(vif7.0, idx=12, device-change)
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: vif7.0[12]: received interface event: device-change
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp4[792]: __ni_netdev_event(vif7.0, idx=12, device-change)
Mar 11 00:04:28 openqaw5-xen wickedd[1100]: __ni_netdev_event(vif7.0, idx=12, device-change)
Mar 11 00:04:28 openqaw5-xen wickedd-auto4[791]: __ni_netdev_event(vif7.0, idx=12, device-change)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: process event signal deviceChange from /org/opensuse/Network/Interface/12; uuid=<>
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: vif7.0: processed event deviceChange; state=stopped, policy=<none>
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: waiting for 0 devices to become ready (0 explicitly requested)
Mar 11 00:04:28 openqaw5-xen ovs-vsctl[9248]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=30 -- --if-exists del-port vif7.0 -- add-port ovs-system vif7.0 -- set interface vif7.0 "external-ids:\"xen-vm-name\"=\"openQA-SUT-1\"" -- set interface vif7.0 "external-ids:\"xen-vm-uuid\"=\"f53564e0-6ee1-4a56-b460-06d8f029dfdc\"" -- set interface vif7.0 "external-ids:\"attached-mac\"=\"00:16:3e:3f:0d:44\""
Mar 11 00:04:28 openqaw5-xen ovs-vsctl[9248]: ovs|00002|db_ctl_base|ERR|no bridge named ovs-system
Mar 11 00:04:28 openqaw5-xen root[9249]: /etc/xen/scripts/vif-openvswitch: Writing backend/vif/7/0/hotplug-error ovs-vsctl --timeout=30 -- --if-exists del-port vif7.0 -- add-port ovs-system vif7.0 -- set interface vif7.0 external-ids:"xen-vm-name"="openQA-SUT-1" -- set interface vif7.0 external-ids:"xen-vm-uuid"="f53564e0-6ee1-4a56-b460-06d8f029dfdc" -- set interface vif7.0 external-ids:"attached-mac"="00:16:3e:3f:0d:44" failed backend/vif/7/0/hotplug-status error to xenstore.
Mar 11 00:04:28 openqaw5-xen root[9251]: /etc/xen/scripts/vif-openvswitch: ovs-vsctl --timeout=30 -- --if-exists del-port vif7.0 -- add-port ovs-system vif7.0 -- set interface vif7.0 external-ids:"xen-vm-name"="openQA-SUT-1" -- set interface vif7.0 external-ids:"xen-vm-uuid"="f53564e0-6ee1-4a56-b460-06d8f029dfdc" -- set interface vif7.0 external-ids:"attached-mac"="00:16:3e:3f:0d:44" failed
Mar 11 00:04:28 openqaw5-xen wickedd[1100]: __ni_netdev_event(vif7.0, idx=12, device-delete)
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: __ni_netdev_event(vif7.0, idx=12, device-delete)
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: vif7.0[12]: received interface event: device-delete
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: vif7.0: Destroying dhcp6 device with index 12
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp6[793]: vif7.0: Deleting dhcp6 device with index 12
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp4[792]: __ni_netdev_event(vif7.0, idx=12, device-delete)
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp4[792]: vif7.0: Destroying dhcp4 device with index 12
Mar 11 00:04:28 openqaw5-xen wickedd-dhcp4[792]: vif7.0: Deleting dhcp4 device with index 12
Mar 11 00:04:28 openqaw5-xen wickedd-auto4[791]: __ni_netdev_event(vif7.0, idx=12, device-delete)
Mar 11 00:04:28 openqaw5-xen wickedd-auto4[791]: vif7.0: Destroying autoip4 device with index 12
Mar 11 00:04:28 openqaw5-xen wickedd-auto4[791]: vif7.0: Deleting autoip4 device with index 12
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: process event signal deviceDelete from /org/opensuse/Network/Interface/12; uuid=<>
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: vif7.0: received event "deviceDelete" from "/org/opensuse/Network/Interface/12" (not a managed device)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: ni_fsm_destroy_worker(vif7.0)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: ni_ifworker_device_delete(vif7.0)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0: <device> identified device as "eth0"
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0: resolved reference ovs-bridge/ports/port/device to subordinate device eth0
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0: linkUp transition requires eth0 worker to be in state device-up..max
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: Config device hierarchy structure:
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: lo [1]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: eth1 [3]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: br0 [5]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]:  ?-- eth0 [2]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: ovs-system [4]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]:  *-- eth0 [2]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: vif2.0 [7]
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: deleted worker device with object path /org/opensuse/Network/Interface/12
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: ni_managed_device_get_worker: no corresponding worker for ifindex 12
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: ni_managed_device_free(anon): obj=(nil)
Mar 11 00:04:28 openqaw5-xen wickedd-nanny[1115]: waiting for 0 devices to become ready (0 explicitly requested)
Mar 11 00:04:28 openqaw5-xen root[9266]: /etc/xen/scripts/vif-openvswitch: offline type_if=vif XENBUS_PATH=backend/vif/7/0
Mar 11 00:04:28 openqaw5-xen ovs-vsctl[9275]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=30 -- --if-exists del-port vif7.0
Mar 11 00:04:28 openqaw5-xen root[9279]: /etc/xen/scripts/vif-openvswitch: ip link set vif7.0 down failed
Mar 11 00:04:28 openqaw5-xen root[9288]: /etc/xen/scripts/vif-openvswitch: Successful vif-openvswitch offline for vif7.0.
Actions #14

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by okurz almost 2 years ago

  • Description updated (diff)
Actions #16

Updated by okurz almost 2 years ago

Hi szarate, thank you for looking into this.

szarate wrote:

[…]

ovs-vsctl: no bridge named ovs-system
ovs-vsctl: no bridge named ovs-system
ovs-vsctl: no bridge named ovs-system
ovs-vsctl: no bridge named ovs-system

The openvswitch setup is missing, or at least I have the hunch that it is the source of the problem (not sure about the root cause though), but it could be that the migration of the database didn't work (product bug) and well... :)

I have seen the error message as well. I am not sure what "migration of the database" could entail here. Note that Xen tests were basically working after the upgrade SLE15-SP2->SLE15-SP4->Leap15.4 but stopped after I did some changes, and I don't remember which detail could have caused the problems observed. I assume some crashes or unclean shutdowns of Xen VMs caused this, maybe even the originally observed problem that VMs did not properly shut down after an induced kernel crash.

Unfortunately I could not find any documentation or description or tickets that would explain how openvswitch plays into running openQA tests on Xen.

I did what was mentioned in #116644#note-4 and called virsh iface-destroy ovs-system, and newly triggered openQA jobs defined the networks correctly again, pointing to br0 in /var/lib/libvirt/images (compared to openQA-SUT-3.xml.org) instead of ovs-system, which comes from openvswitch. Maybe I triggered the generation of ovs-system by starting some services on the system which are not supposed to run.
https://openqa.suse.de/tests/10667474#live
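Roughly, the check and cleanup applied here, written out as a sketch rather than the exact command history:

virsh iface-list --all           # shows whether libvirt still reports an 'ovs-system' interface
virsh iface-destroy ovs-system   # the call from #116644#note-4, after which new jobs point to br0 again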

Actions #17

Updated by szarate almost 2 years ago

  • Related to action #116644: [qe-core][functional][sle15sp5]test fails in bootloader_svirt, the test is using different network bridge 'ovs-system' rather than 'br0' added
Actions #18

Updated by okurz almost 2 years ago

  • Tags set to jeos, virtualization, xen, regression, kdump, crash, openQA
  • Project changed from 208 to openQA Tests (public)
  • Subject changed from Test fails in kdump_and_crash on SLE 12sp5 and 15sp4 XEN after worker migration from SLES to Leap 15.4 to [jeos] Test fails in kdump_and_crash on SLE 12sp5 and 15sp4 XEN after worker migration from SLES to Leap 15.4
  • Category set to Bugs in existing tests
Actions #19

Updated by okurz almost 2 years ago

  • Due date deleted (2023-03-25)
  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to High
Actions #20

Updated by xlai almost 2 years ago

Sorry for the late reply. I finally found time today to look into the long discussion in this ticket and the reported bug. Glad to spot, in the end, that dev already found a potential fix in https://bugzilla.opensuse.org/show_bug.cgi?id=1209245#c5. Silly me :(.

Anyway, I could save my long analysis comment, but I'd still like to give @okurz a big thumbs-up :). I think Oliver did a very good job in https://progress.opensuse.org/issues/125783#note-9 and in deciding to go to the VT devs for help in #discuss-virtualization. From the virtualization side, to tell whether this regression comes from the host upgrade or not, I was planning to give my support by manually reproducing it with some quick shorter-path trigger similar to what Oliver did in that comment. And he did even more than that and tried quite a few other things. Nice!

Actions #21

Updated by okurz almost 2 years ago

  • Status changed from Blocked to In Progress

thank you!

In the meantime https://build.suse.de/package/show/home:olh:bug1209245:sle15sp4/xen has updated packages which I will try

Upgrade from that repo contains:

The following 4 packages are going to be downgraded:
  xen         
    4.16.3_04-150400.4.22.1 -> 4.16.3_04-bug1209245.150400.4.25.1  x86_64  home:olh:bug1209245:sle15sp4 (SLE_15)
    SUSE LLC <https://www.suse.com/> -> obs://build.suse.de/home:olh
  xen-doc-html
    4.16.3_04-150400.4.22.1 -> 4.16.3_04-bug1209245.150400.4.25.1  x86_64  home:olh:bug1209245:sle15sp4 (SLE_15)
    SUSE LLC <https://www.suse.com/> -> obs://build.suse.de/home:olh
  xen-libs    
    4.16.3_04-150400.4.22.1 -> 4.16.3_04-bug1209245.150400.4.25.1  x86_64  home:olh:bug1209245:sle15sp4 (SLE_15)
    SUSE LLC <https://www.suse.com/> -> obs://build.suse.de/home:olh
  xen-tools   
    4.16.3_04-150400.4.22.1 -> 4.16.3_04-bug1209245.150400.4.25.1  x86_64  home:olh:bug1209245:sle15sp4 (SLE_15)
    SUSE LLC <https://www.suse.com/> -> obs://build.suse.de/home:olh
Actions #22

Updated by okurz almost 2 years ago

https://openqa.suse.de/tests/10707400# passed, so the fix looks good!

Actions #23

Updated by okurz almost 2 years ago

  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

The patch is deployed on openqaw5-xen so further openQA jobs on this machine should work fine again. I commented in the bug and will wait for an official update, then remove the special repository, bring the machine back into salt and so on.

Actions #24

Updated by szarate over 1 year ago

It looks like it's showing up again:

openqaw5-xen:~ # virsh iface-list --all
 Name         State    MAC Address
------------------------------------
 ovs-system   active  

But at the moment I don't have much of an idea how to debug this.

Actions #25

Updated by szarate over 1 year ago

  • Related to action #126647: [qe-core] test fails in bootloader_start - we should use br0 not ovs-system added
Actions #26

Updated by okurz over 1 year ago

https://build.suse.de/package/rdiff/home:olh:bug1209245:sle15sp4/xen?opackage=xen&oproject=SUSE%3ASLE-15-SP4%3AUpdate&rev=2 still shows a link diff, so the fix is not yet available in a SLE update. I called

zypper al -m 'boo#1209245' 'xen*'

on the host openqaw5-xen.qa.suse.de to prevent the packages from being upgraded and ran zypper dup.
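To double-check the lock before running the dist-upgrade, something like the following can be used (a sketch, not the exact commands run here):

zypper ll            # list package locks; the 'xen*' lock with comment boo#1209245 should show up
zypper dup --dry-run # confirm that no xen* package would be touched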

I made the mistake of adding roles: worker to /etc/salt/grains and calling salt to apply a high state, which pulls in too much. I called snapper rollback 697 && reboot to go back to a snapshot from before I messed things up.

The machine came back fine after the reboot and looks clean again.

Actions #27

Updated by okurz over 1 year ago

The zypper lock seems to have vanished, likely because of my snapper rollback call. So I called again:

zypper al -m 'boo#1209245' 'xen*'

and

zypper dup

looks clean. Back to waiting for the official bug fix release for https://bugzilla.opensuse.org/show_bug.cgi?id=1209245
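A simple way to notice once the fixed package becomes available in the official update repository would be (illustrative):

zypper refresh
zypper info xen      # compares the installed version against the best candidate from the repositories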

Actions #28

Updated by okurz over 1 year ago

  • Status changed from Blocked to In Progress

Finally https://bugzilla.suse.com/show_bug.cgi?id=1209245#c12 states "SUSE-RU-2023:2366-1: An update that has three recommended fixes can now be installed."

Removed the lock and applied the upgrade:

The following 6 packages are going to be upgraded:
  release-notes-sles   15.4.20230301-150400.3.13.2 -> 15.4.20230511-150400.3.18.1           noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  velociraptor-client  0.6.7.4~git53.0e85855-lp154.2.17 -> 0.6.7.5~git81.01be570-lp154.3.1  x86_64  SUSE Velociraptor (15.4)                                      SUSE LINUX Products GmbH, Nuernberg, Germany
  xen                  4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1        x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>
  xen-doc-html         4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1        x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>
  xen-libs             4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1        x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>
  xen-tools            4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1        x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>

The following 4 packages are going to change vendor:
  xen           4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1  x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>
  xen-doc-html  4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1  x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>
  xen-libs      4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1  x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>
  xen-tools     4.16.3_04-bug1209245.150400.4.25.1 -> 4.16.4_02-150400.4.28.1  x86_64  Update repository with updates from SUSE Linux Enterprise 15  obs://build.suse.de/home:olh -> SUSE LLC <https://www.suse.com/>

The following NEW package is going to be installed:
  system-user-velociraptor  1.0.0-lp154.3.1  x86_64  SUSE Velociraptor (15.4)  SUSE LINUX Products GmbH, Nuernberg, Germany

The following package requires a system reboot:
  xen  4.16.4_02-150400.4.28.1  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>

6 packages to upgrade, 1 new, 4  to change vendor.
Overall download size: 46.1 MiB. Already cached: 0 B. After the operation, additional 32.8 MiB will be used.

I applied the upgrade, rebooted and retriggered the "link to latest" job from the ticket description: https://openqa.suse.de/tests/11257451

EDIT: That job actually does not have the "kdump" test module. I found another JeOS-based, kdump-specific scenario, so I triggered https://openqa.suse.de/tests/11257525 and am monitoring it.

Actions #29

Updated by okurz over 1 year ago

  • Status changed from In Progress to Resolved

https://openqa.suse.de/tests/11257525# completed fine. I removed all zypper locks and temporary fix repos. zypper dup is clean and the system is in salt just fine. I set the bug to VERIFIED FIXED with a corresponding comment; we are good now.
