action #103575

[virtualization][3rd party hypervisor] Worker openqaw8-vmware.qa.suse.de is not reachable

Added by nanzhang almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: -
Target version: Ready
Start date: 2021-12-07
Due date: -
% Done: 0%
Estimated time: -
Description

The following OSD jobs failed at bootloader_svirt because the worker openqaw8-vmware.qa.suse.de could not be reached or logged in to.

https://openqa.nue.suse.com/tests/7794974
https://openqa.nue.suse.com/tests/7799143
https://openqa.nue.suse.com/tests/7799141
https://openqa.nue.suse.com/tests/7799136

Actions #2

Updated by okurz almost 3 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version set to Ready

I think this is related to a recent IPMI firmware change done by bmwiedemann from EngInfra. I just powered on the host with ipmitool -I lanplus -H sp.openqaw8-vmware.qa.suse.de -U ADMIN -P $password power on and will check. Also https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/372 is related. IPMI SoL suggests that the machine is up, so I cloned https://openqa.suse.de/tests/7794713#step/bootloader_svirt/7 as https://openqa.suse.de/tests/7802844#live and will monitor.
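
For reference, the full ipmitool sequence for such a recovery would look roughly like this (a sketch; same BMC host and credentials as in the command above, with "power status" to verify the state and "sol activate" to watch the console over Serial-over-LAN):

ipmitool -I lanplus -H sp.openqaw8-vmware.qa.suse.de -U ADMIN -P $password power status
ipmitool -I lanplus -H sp.openqaw8-vmware.qa.suse.de -U ADMIN -P $password power on
ipmitool -I lanplus -H sp.openqaw8-vmware.qa.suse.de -U ADMIN -P $password sol activate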

https://openqa.suse.de/tests/7802844#step/bootloader_svirt/44 shows that we reached the SUT, so I went ahead and executed

openqa-label-all --verbose --openqa-host https://openqa.suse.de --label '* bootloader_svirt: https://progress.opensuse.org/issues/103575' --module bootloader_svirt

with openqa-label-all from the package openQA-python-scripts.

The complete output, including all jobs that were triggered:

Actions #3

Updated by openqa_review almost 3 years ago

  • Due date set to 2021-12-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by nanzhang almost 3 years ago

Thank you for the fix!
The latest run looks good. - https://openqa.nue.suse.com/tests/7806054

Actions #5

Updated by okurz almost 3 years ago

  • Due date deleted (2021-12-22)
  • Status changed from In Progress to Resolved

All referenced jobs seem to have at least passed the initial step.

Actions #6

Updated by jlausuch almost 3 years ago

Thanks for the help!

Actions #7

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-fips@svirt-vmware65
https://openqa.suse.de/tests/7888369

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions #8

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-fips@svirt-vmware65
https://openqa.suse.de/tests/7924906

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions #9

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-fs_stress@svirt-vmware65
https://openqa.suse.de/tests/7998874

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions #10

Updated by okurz almost 3 years ago

  • Status changed from Resolved to New
  • Assignee deleted (okurz)

We apparently need to check this again; see the openQA test references.

Actions #11

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler

Actions #12

Updated by mkittler almost 3 years ago

  • Status changed from New to Feedback

The tests which fail now look like VMware tests, but they are actually (successfully) connecting to openqaw5-xen.qa.suse.de (instead of openqaw8-vmware.qa.suse.de). Judging by its hostname, I assume openqaw5-xen.qa.suse.de only works for Xen tests, so this test setup seems simply wrong. So the failures have nothing to do with openqaw8-vmware.qa.suse.de being unreachable. In fact, I can connect to that host just fine (via VPN).

Actions #13

Updated by mkittler almost 3 years ago

Ok, it is actually using the VMware host. However, the host still seems to be reachable and responding to SSH commands:

[2022-01-26T18:33:11.275678+01:00] [debug] SSH connection to root@openqaw8-vmware.qa.suse.de established
[2022-01-26T18:33:11.364395+01:00] [debug] [run_ssh_cmd(set -x; rm -f /vmfs/volumes/datastore1/openQA/*openQA-SUT-3*)] stderr:
  + rm -f /vmfs/volumes/datastore1/openQA/SLES15-SP1-JeOS.x86_64-15.1-VMware-Build37.8.53_openQA-SUT-3_thinfile-flat.vmdk /vmfs/volumes/datastore1/openQA/SLES15-SP1-JeOS.x86_64-15.1-VMware-Build37.8.53_openQA-SUT-3_thinfile.vmdk /vmfs/volumes/datastore1/openQA/openQA-SUT-3.vmsd /vmfs/volumes/datastore1/openQA/openQA-SUT-3.vmx

[2022-01-26T18:33:11.367458+01:00] [debug] [run_ssh_cmd(set -x; rm -f /vmfs/volumes/datastore1/openQA/*openQA-SUT-3*)] exit-code: 0
…
[2022-01-26T18:33:11.831266+01:00] [debug] <<< backend::baseclass::new_ssh_connection(keep_open=1, username="root", hostname="openqaw8-vmware.qa.suse.de", password="SECRET", blocking=1, wantarray=1)
[2022-01-26T18:33:11.945655+01:00] [debug] Use existing SSH connection (key:hostname=openqaw8-vmware.qa.suse.de,username=root,port=22)
[2022-01-26T18:33:33.357052+01:00] [debug] [run_ssh_cmd(find /vmfs/volumes/openqa/hdd /vmfs/volumes/openqa/hdd/fixed -name SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz | head -n1 | awk 1 ORS='')] stdout:
  /vmfs/volumes/openqa/hdd/SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz
[2022-01-26T18:33:33.359679+01:00] [debug] [run_ssh_cmd(find /vmfs/volumes/openqa/hdd /vmfs/volumes/openqa/hdd/fixed -name SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz | head -n1 | awk 1 ORS='')] exit-code: 0
[2022-01-26T18:33:33.518226+01:00] [debug] Image found: /vmfs/volumes/openqa/hdd/SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz
[2022-01-26T18:33:33.518469+01:00] [debug] tests/installation/bootloader_svirt.pm:137 called bootloader_svirt::search_image_on_svirt_host -> tests/installation/bootloader_svirt.pm:49 called testapi::enter_cmd
[2022-01-26T18:33:33.518686+01:00] [debug] <<< testapi::type_string(string="# Copying image SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz...", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-26T18:33:36.283921+01:00] [debug] tests/installation/bootloader_svirt.pm:142 called backend::console_proxy::__ANON__
[2022-01-26T18:33:36.284202+01:00] [debug] <<< backend::console_proxy::__ANON__(wrapped_call={
    "function" => "run_cmd",
    "console" => "svirt",
    "wantarray" => "",
    "args" => [
                "test -e /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk",
                "domain",
                "sshVMwareServer"
              ]
  })
[2022-01-26T18:33:36.285235+01:00] [debug] <<< backend::baseclass::run_ssh_cmd(cmd="test -e /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk", wantarray=0, keep_open=1, username="root", password="SECRET", hostname="openqaw8-vmware.qa.suse.de")

The problem is apparently that some file exists which shouldn't:

[2022-01-26T18:34:41.051751+01:00] [debug] Use existing SSH connection (key:hostname=openqaw8-vmware.qa.suse.de,username=root,port=22)
[2022-01-26T18:34:41.064536+01:00] [debug] [run_ssh_cmd(xz --decompress --keep --verbose /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz)] stderr:
  xz: /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk: File exists

[2022-01-26T18:34:41.067476+01:00] [debug] [run_ssh_cmd(xz --decompress --keep --verbose /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz)] exit-code: 1
[2022-01-26T18:34:41.299347+01:00] [info] ::: basetest::runtest: # Test died: Image decompress in datastore failed!
   at sle/tests/installation/bootloader_svirt.pm line 146.
    bootloader_svirt::run(bootloader_svirt=HASH(0x560cf842d6e0)) called at /usr/lib/os-autoinst/basetest.pm line 360
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354
    basetest::runtest(bootloader_svirt=HASH(0x560cf842d6e0)) called at /usr/lib/os-autoinst/autotest.pm line 372
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 372
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 242
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 242
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 296
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x560cfa2895f0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x560cfa2895f0), CODE(0x560cfa73a5a0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 488
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x560cfa2895f0)) called at /usr/lib/os-autoinst/autotest.pm line 298
    autotest::start_process() called at /usr/bin/isotovideo line 261
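
One way out would be to remove or move the stale target file, or to let xz overwrite it (a sketch; --force is a standard xz option that deletes an existing target file before decompressing):

ssh root@openqaw8-vmware.qa.suse.de 'xz --decompress --keep --force /vmfs/volumes/datastore1/openQA/SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz'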

Actions #14

Updated by mkittler almost 3 years ago

Maybe

[root@openqaw8-vmware:~] mv /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz.bak

helped, but I cannot retry the openQA job to verify because its assets are missing.

Ensuring such details within the test setup is also likely something the test writers should handle.

I've been removing the wrong bugrefs from the jobs.
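
To spot other stale leftovers, one can simply list the datastore directory (same path as in the logs above):

ls -lh /vmfs/volumes/datastore1/openQA/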

Actions #15

Updated by mloviska almost 3 years ago

mkittler wrote:

> Maybe
>
> [root@openqaw8-vmware:~] mv /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz.bak
>
> helped, but I cannot retry the openQA job to verify because its assets are missing.
>
> Ensuring such details within the test setup is also likely something the test writers should handle.
>
> I've been removing the wrong bugrefs from the jobs.

Seems like you are trying to use assets that we are not testing anymore. :)

Actions #16

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Resolved

I am not testing anything. I am only taking care of this ticket, which was reopened due to these failing jobs. However, it turns out to be unrelated, so I'm resolving the ticket again. Of course it would make sense to avoid creating those jobs if they are not relevant anymore. (Maybe that's already the case; the last job is from 7 days ago.)
