action #117196

closed

virsh domain XML does not specify VM ID which is required from VNC over WebSockets

Added by nanzhang over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-09-26
Due date:
% Done: 0%
Estimated time:

Description

http://10.67.129.66/tests/1407/logfile?filename=autoinst-log.txt

virsh domain XML does not specify VM ID which is required from VNC over WebSockets at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 549.

http://10.67.129.66/tests/1407/file/vars.json

"VMWARE_VNC_OVER_WS": "1",
"VMWARE_VNC_OVER_WS_INSECURE": "1",

I started the VM manually and saw that a domain ID was generated:

qa2-dhcp-66:~ # virsh -c esx://root@10.67.131.2/?no_verify=1\&authfile=/tmp/0xEKsSIBNo  start openQA-SUT-1
2022-09-26 06:37:49.994+0000: 5366: info : libvirt version: 7.1.0
2022-09-26 06:37:49.994+0000: 5366: info : hostname: qa2-dhcp-66
2022-09-26 06:37:49.994+0000: 5366: warning : esxUtil_ParseUri:147 : Ignoring unexpected query parameter 'authfile'
Domain 'openQA-SUT-1' started

qa2-dhcp-66:~ # virsh -c esx://root@10.67.131.2/?no_verify=1\&authfile=/tmp/0xEKsSIBNo  dumpxml openQA-SUT-1
2022-09-26 06:37:59.720+0000: 5369: info : libvirt version: 7.1.0
2022-09-26 06:37:59.720+0000: 5369: info : hostname: qa2-dhcp-66
2022-09-26 06:37:59.720+0000: 5369: warning : esxUtil_ParseUri:147 : Ignoring unexpected query parameter 'authfile'

<domain type='vmware' id='41' xmlns:vmware='http://libvirt.org/schemas/domain/vmware/1.0'>
  <name>openQA-SUT-1</name>
  <uuid>9b70f025-c4d9-4698-83fd-fe81bde0e435</uuid>
...
...
Actions #1

Updated by nanzhang over 1 year ago

  • Description updated (diff)
Actions #2

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress

Does it happen often? If it always happens but not on your manual attempt, then your manual attempt must differ from what the os-autoinst backend does.

Unfortunately the logs don't contain what the XML config actually contained (only the file on disk is dumped). So I've created https://github.com/os-autoinst/os-autoinst/pull/2182 to include the XML config in the logs if an error happens. Maybe we'll learn something from that (e.g. the XML structure sometimes differs from what the code can handle or there's an error message contained).

Actions #3

Updated by mkittler over 1 year ago

I also wanted to compare against the last successful job. However, it looks like this scenario has never actually worked before.

I don't know the particular web UI instance and worker host but maybe you can apply the patch from my PR manually on the relevant worker hosts¹?

¹ You can patch a worker like this:

cd /tmp
wget https://github.com/os-autoinst/os-autoinst/pull/2182.patch
sudo patch -d /usr/lib/os-autoinst -p1 -i /tmp/2182.patch

Of course you can also experiment yourself by modifying some of the backend code within /usr/lib/os-autoinst and its subdirectories. You can always restore the default state by simply re-installing the os-autoinst package.

Actions #4

Updated by okurz over 1 year ago

  • Target version set to Ready
Actions #5

Updated by openqa_review over 1 year ago

  • Due date set to 2022-10-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by nanzhang over 1 year ago

I can't see any output after applying this patch, and the last command, 'virsh dumpxml', failed. However, I can't reproduce this error manually, even when using the same XML file to define and start the VM.

http://10.67.129.66/tests/1412/logfile?filename=autoinst-log.txt

[2022-09-27T13:06:03.181517+08:00] [debug] [run_ssh_cmd(virsh -c esx://root@10.67.131.2/?no_verify=1\&authfile=/tmp/ilPcNHDHp6  dumpxml openQA-SUT-1)] stdout:


[2022-09-27T13:06:03.181708+08:00] [debug] [run_ssh_cmd(virsh -c esx://root@10.67.131.2/?no_verify=1\&authfile=/tmp/ilPcNHDHp6  dumpxml openQA-SUT-1)] stderr:
  2022-09-27 05:06:03.067+0000: 15623: info : libvirt version: 7.1.0
  2022-09-27 05:06:03.067+0000: 15623: info : hostname: qa2-dhcp-66
  2022-09-27 05:06:03.067+0000: 15623: warning : esxUtil_ParseUri:147 : Ignoring unexpected query parameter 'authfile'
  error: configuration file syntax error: memory conf:36: numbers not allowed in VMX format
Actions #7

Updated by nanzhang over 1 year ago

I'm not sure whether there is an issue with my local openQA worker; I can't see this issue on OSD. We could probably rerun an OSD job to verify it after the PR (https://github.com/os-autoinst/os-autoinst/pull/2180) is merged.

https://openqa.suse.de/tests/9564774/logfile?filename=autoinst-log.txt

[2022-09-22T10:56:12.406952+02:00] [debug] [run_ssh_cmd(virsh -c esx://root@esxi7.qa.suse.cz/?no_verify=1\&authfile=/tmp/Y33UgkNWHY  dumpxml openQA-SUT-2)] stdout:
  <domain type='vmware' id='25' xmlns:vmware='http://libvirt.org/schemas/domain/vmware/1.0'>
    <name>openQA-SUT-2</name>
    <uuid>79026f57-2272-4d78-85e4-15c8361d7a84</uuid>
....
....
[2022-09-22T10:56:12.407102+02:00] [debug] [run_ssh_cmd(virsh -c esx://root@esxi7.qa.suse.cz/?no_verify=1\&authfile=/tmp/Y33UgkNWHY  dumpxml openQA-SUT-2)] stderr:
  2022-09-22 08:56:11.929+0000: 3241: info : libvirt version: 6.0.0
  2022-09-22 08:56:11.929+0000: 3241: info : hostname: openqaw5-xen
  2022-09-22 08:56:11.929+0000: 3241: warning : esxUtil_ParseUri:149 : Ignoring unexpected query parameter 'authfile'

[2022-09-22T10:56:12.407175+02:00] [debug] [run_ssh_cmd(virsh -c esx://root@esxi7.qa.suse.cz/?no_verify=1\&authfile=/tmp/Y33UgkNWHY  dumpxml openQA-SUT-2)] exit-code: 0
Actions #8

Updated by mkittler over 1 year ago

It looks like the patch has worked but the domain config XML is simply empty as only the newline is logged:

  virsh domain XML does not specify VM ID which is required from VNC over WebSockets:

   at /usr/lib/os-autoinst/backend/console_proxy.pm line 46.

That means the error must occur somewhere earlier; the fact that we cannot find the ID is just a symptom.
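
To illustrate why an empty dumpxml output ends in this particular error, here is a minimal Python sketch. It is not the Perl code from os-autoinst; the helper name domain_id is hypothetical, and the sample XML is a shortened version of the output quoted in the description. The ID is read from the id attribute of the root domain element, so an empty stdout has nothing to read it from:

```python
import xml.etree.ElementTree as ET


def domain_id(domain_xml: str):
    """Return the 'id' attribute of the root <domain> element, or None.

    Hypothetical helper for illustration only; it mirrors the idea of the
    check, not the actual os-autoinst implementation.
    """
    if not domain_xml.strip():
        # Empty stdout (e.g. because 'virsh dumpxml' failed): no ID available.
        return None
    root = ET.fromstring(domain_xml)
    return root.get("id")


# Shortened from the manual 'virsh dumpxml' output in the description.
sample = (
    "<domain type='vmware' id='41' "
    "xmlns:vmware='http://libvirt.org/schemas/domain/vmware/1.0'>"
    "<name>openQA-SUT-1</name>"
    "</domain>"
)

print(domain_id(sample))  # 41
print(domain_id("\n"))    # None, which corresponds to the logged error
```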


The following log messages look suspicious:

  2022-09-27 05:06:02.426+0000: 15619: warning : esxUtil_ParseUri:147 : Ignoring unexpected query parameter 'authfile'

However, "warning" and "Ignoring" don't sound like this should cause the config to be rejected completely. Besides, those messages are also logged in http://10.67.129.66/tests/1397/logfile?filename=autoinst-log.txt which softfailed.

There's actually another error at the end of:

[2022-09-27T13:06:03.181517+08:00] [debug] [run_ssh_cmd(virsh -c esx://root@10.67.131.2/?no_verify=1\&authfile=/tmp/ilPcNHDHp6  dumpxml openQA-SUT-1)] stdout:


[2022-09-27T13:06:03.181708+08:00] [debug] [run_ssh_cmd(virsh -c esx://root@10.67.131.2/?no_verify=1\&authfile=/tmp/ilPcNHDHp6  dumpxml openQA-SUT-1)] stderr:
  2022-09-27 05:06:03.067+0000: 15623: info : libvirt version: 7.1.0
  2022-09-27 05:06:03.067+0000: 15623: info : hostname: qa2-dhcp-66
  2022-09-27 05:06:03.067+0000: 15623: warning : esxUtil_ParseUri:147 : Ignoring unexpected query parameter 'authfile'
  error: configuration file syntax error: memory conf:36: numbers not allowed in VMX format

That's not logged in the supposedly good tests. It sounds like something is misconfigured on the VMware side (which you might be more familiar with than I am).

Actions #9

Updated by nanzhang over 1 year ago

OK, the error indicates a syntax issue at line 36. After checking the vmx file, I can see that the property is missing double quotes around its value.

/vmfs/volumes/5ffc09f2-d9e8dde4-1604-0cc47ac51e38/openQA/openQA-SUT-1.vmx
Line 36: bios.bootDelay = 10000
consoles/sshVirtsh.pm
Line 528: $self->run_cmd("echo bios.bootDelay = \"10000\" >> $vmx", domain => 'sshVMwareServer');
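
A quick way to spot such entries could look like the following Python sketch. It assumes that every VMX value has to be enclosed in double quotes, which is what the libvirt error above suggests but is not verified here; the function name find_unquoted_values and the two-line sample are made up for illustration:

```python
import re

# Assumption: every VMX entry has the form  key = "value".
VMX_ENTRY = re.compile(r'^\s*([^=\s]+)\s*=\s*(.*?)\s*$')


def find_unquoted_values(vmx_text: str) -> list[tuple[int, str, str]]:
    """Return (line number, key, value) for entries whose value is not quoted."""
    problems = []
    for number, line in enumerate(vmx_text.splitlines(), start=1):
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = VMX_ENTRY.match(line)
        if match is None:
            continue
        key, value = match.groups()
        quoted = len(value) >= 2 and value.startswith('"') and value.endswith('"')
        if not quoted:
            problems.append((number, key, value))
    return problems


# Made-up sample; only the bios.bootDelay line is taken from this ticket.
sample = 'displayName = "openQA-SUT-1"\nbios.bootDelay = 10000\n'
print(find_unquoted_values(sample))  # [(2, 'bios.bootDelay', '10000')]
```
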
Actions #10

Updated by mkittler over 1 year ago

So does it help to change that line to e.g. $self->run_cmd("echo bios.bootDelay = \\\"10000\\\" >> $vmx", domain => 'sshVMwareServer');? (I haven't tested it yet, maybe you need to play around with it yourself.)
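
The reason the extra escaping should matter: after Perl has processed the original string, the shell receives echo bios.bootDelay = "10000" and removes the quotes itself before echo runs. A small Python sketch of that effect, assuming a POSIX sh is available locally (the helper shell_echo is hypothetical and only runs the command locally, not via SSH on the ESXi host):

```python
import subprocess


def shell_echo(command: str) -> str:
    """Run a command through sh and return its stdout without the trailing newline."""
    result = subprocess.run(
        ["sh", "-c", command], capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip("\n")


# What the shell sees from the original Perl string "echo bios.bootDelay = \"10000\"".
unescaped = 'echo bios.bootDelay = "10000"'
# What the shell sees from the proposed "echo bios.bootDelay = \\\"10000\\\"".
escaped = 'echo bios.bootDelay = \\"10000\\"'

print(shell_echo(unescaped))  # bios.bootDelay = 10000
print(shell_echo(escaped))    # bios.bootDelay = "10000"
```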

Actions #11

Updated by nanzhang over 1 year ago

You are right. The syntax error was fixed after updating the code - http://10.67.129.66/tests/1418

Created PR https://github.com/os-autoinst/os-autoinst/pull/2184

Actions #12

Updated by nanzhang over 1 year ago

Got a few more successful runs:
default_install_svirt - http://10.67.129.66/tests/1421 [passed]
online_upgrade_sles15sp4_vmware - http://10.67.129.66/tests/1419 [passed]
textmode_svirt - http://10.67.129.66/tests/1420 [passed]

Actions #13

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Resolved

Nice. The PR has also already been merged. So I suppose this issue can be considered resolved. (Of course the changes will also be deployed on o3/OSD automatically.)

Actions #14

Updated by openqa_review over 1 year ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: default_install_svirt@svirt-vmware70
https://openqa.suse.de/tests/9634381#step/bootloader_svirt/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #15

Updated by okurz over 1 year ago

  • Due date deleted (2022-10-11)
  • Priority changed from Normal to High
Actions #16

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

https://openqa.suse.de/tests/9634381#step/bootloader_svirt/1 is 13 days old and reproduced the original problem when the fix was not yet deployed to OSD. In the meantime, more recent jobs in the same scenario have confirmed that the problem no longer appears. https://openqa.suse.de/tests/9693253 is the most recent test in a more recent build that passed. openqa-review likely detected the old build 24.1 as the last finished one because some tests were still running in the later build 25.1, maybe due to human operators manually retriggering tests in production.

Actions #17

Updated by openqa_review over 1 year ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode_svirt@svirt-vmware70
https://openqa.suse.de/tests/9920510#step/bootloader_svirt/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #18

Updated by livdywan over 1 year ago

  • Status changed from Feedback to Resolved

openqa_review wrote:

This bug is still referenced in a failing openQA test: textmode_svirt@svirt-vmware70
https://openqa.suse.de/tests/9920510#step/bootloader_svirt/1

Error connecting to <root@esxi7.qa.suse.cz>: Connection timed out

This doesn't seem related, since it is a timed-out call trying to run rm -f /vmfs/volumes/Datastore2/openQA/*openQA-SUT-1*, so I dropped the ticket reference.
