action #103575

[virtualization][3rd party hypervisor] Worker openqaw8-vmware.qa.suse.de is not reachable

Added by nanzhang 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Target version: Ready
Start date: 2021-12-07
Due date:
% Done: 0%
Estimated time:
Description

The following OSD jobs failed at bootloader_svirt because the worker openqaw8-vmware.qa.suse.de could not be reached or logged in to.

https://openqa.nue.suse.com/tests/7794974
https://openqa.nue.suse.com/tests/7799143
https://openqa.nue.suse.com/tests/7799141
https://openqa.nue.suse.com/tests/7799136

History

#2 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version set to Ready

I think this is related to a recent IPMI firmware change done by bmwiedemann from EngInfra. I just powered on the host with ipmitool -I lanplus -H sp.openqaw8-vmware.qa.suse.de -U ADMIN -P $password power on and will check. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/372 is also related. IPMI SoL suggests the machine is up, so I cloned https://openqa.suse.de/tests/7794713#step/bootloader_svirt/7 as https://openqa.suse.de/tests/7802844#live and will monitor.
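For completeness, the same invocation style can be used to verify the result (service processor hostname and ADMIN user as above; $password stands for the real credential, which is not reproduced here):

  # query the current chassis power state
  ipmitool -I lanplus -H sp.openqaw8-vmware.qa.suse.de -U ADMIN -P $password power status
  # attach to the Serial-over-LAN console to watch the boot
  ipmitool -I lanplus -H sp.openqaw8-vmware.qa.suse.de -U ADMIN -P $password sol activate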

https://openqa.suse.de/tests/7802844#step/bootloader_svirt/44 shows that we reached the SUT, so I went ahead and executed

openqa-label-all --verbose --openqa-host https://openqa.suse.de --label '* bootloader_svirt: https://progress.opensuse.org/issues/103575' --module bootloader_svirt

using openqa-label-all from the package openQA-python-scripts.

#3 Updated by openqa_review 5 months ago

  • Due date set to 2021-12-22

Setting due date based on mean cycle time of SUSE QE Tools

#4 Updated by nanzhang 5 months ago

Thank you for the fix!
The latest run looks good. - https://openqa.nue.suse.com/tests/7806054

#5 Updated by okurz 5 months ago

  • Due date deleted (2021-12-22)
  • Status changed from In Progress to Resolved

All referenced jobs seem to have at least passed the initial step.

#6 Updated by jlausuch 5 months ago

Thanks for the help!

#7 Updated by openqa_review 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-fips@svirt-vmware65
https://openqa.suse.de/tests/7888369

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234 (see the example below)
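For reference, such a label can be set or replaced by commenting on the referenced job, for example with openqa-cli (the comment text shown is the placeholder from option 3, not a real bug reference):

  # post a comment carrying the replacement bugref label on the referenced job
  openqa-cli api --host https://openqa.suse.de -X POST jobs/7888369/comments text='label:wontfix:boo1234'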

#8 Updated by openqa_review 4 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-fips@svirt-vmware65
https://openqa.suse.de/tests/7924906

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#9 Updated by openqa_review 4 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-fs_stress@svirt-vmware65
https://openqa.suse.de/tests/7998874

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#10 Updated by okurz 4 months ago

  • Status changed from Resolved to New
  • Assignee deleted (okurz)

We apparently need to check this again; see the openQA test references.

#11 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#12 Updated by mkittler 3 months ago

  • Status changed from New to Feedback

The tests which fail now look like VMware tests but they're actually (successfully) connecting to openqaw5-xen.qa.suse.de (instead of openqaw8-vmware.qa.suse.de). Judging by its hostname, I assume openqaw5-xen.qa.suse.de only works for Xen tests, so this test setup seems simply wrong. Hence the failures have nothing to do with openqaw8-vmware.qa.suse.de being unreachable; in fact, I can connect to that host just fine (via VPN).
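As a minimal sketch, such a reachability check from a VPN-connected machine would look roughly like this (hostname as above):

  # basic network reachability
  ping -c 3 openqaw8-vmware.qa.suse.de
  # verify that an SSH login works, as the svirt backend requires
  ssh root@openqaw8-vmware.qa.suse.de 'uname -a'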

#13 Updated by mkittler 3 months ago

OK, it is actually using the VMware host. However, that host still seems to be reachable and responds to SSH commands:

[2022-01-26T18:33:11.275678+01:00] [debug] SSH connection to root@openqaw8-vmware.qa.suse.de established
[2022-01-26T18:33:11.364395+01:00] [debug] [run_ssh_cmd(set -x; rm -f /vmfs/volumes/datastore1/openQA/*openQA-SUT-3*)] stderr:
  + rm -f /vmfs/volumes/datastore1/openQA/SLES15-SP1-JeOS.x86_64-15.1-VMware-Build37.8.53_openQA-SUT-3_thinfile-flat.vmdk /vmfs/volumes/datastore1/openQA/SLES15-SP1-JeOS.x86_64-15.1-VMware-Build37.8.53_openQA-SUT-3_thinfile.vmdk /vmfs/volumes/datastore1/openQA/openQA-SUT-3.vmsd /vmfs/volumes/datastore1/openQA/openQA-SUT-3.vmx

[2022-01-26T18:33:11.367458+01:00] [debug] [run_ssh_cmd(set -x; rm -f /vmfs/volumes/datastore1/openQA/*openQA-SUT-3*)] exit-code: 0
…
[2022-01-26T18:33:11.831266+01:00] [debug] <<< backend::baseclass::new_ssh_connection(keep_open=1, username="root", hostname="openqaw8-vmware.qa.suse.de", password="SECRET", blocking=1, wantarray=1)
[2022-01-26T18:33:11.945655+01:00] [debug] Use existing SSH connection (key:hostname=openqaw8-vmware.qa.suse.de,username=root,port=22)
[2022-01-26T18:33:33.357052+01:00] [debug] [run_ssh_cmd(find /vmfs/volumes/openqa/hdd /vmfs/volumes/openqa/hdd/fixed -name SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz | head -n1 | awk 1 ORS='')] stdout:
  /vmfs/volumes/openqa/hdd/SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz
[2022-01-26T18:33:33.359679+01:00] [debug] [run_ssh_cmd(find /vmfs/volumes/openqa/hdd /vmfs/volumes/openqa/hdd/fixed -name SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz | head -n1 | awk 1 ORS='')] exit-code: 0
[2022-01-26T18:33:33.518226+01:00] [debug] Image found: /vmfs/volumes/openqa/hdd/SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz
[2022-01-26T18:33:33.518469+01:00] [debug] tests/installation/bootloader_svirt.pm:137 called bootloader_svirt::search_image_on_svirt_host -> tests/installation/bootloader_svirt.pm:49 called testapi::enter_cmd
[2022-01-26T18:33:33.518686+01:00] [debug] <<< testapi::type_string(string="# Copying image SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz...", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-26T18:33:36.283921+01:00] [debug] tests/installation/bootloader_svirt.pm:142 called backend::console_proxy::__ANON__
[2022-01-26T18:33:36.284202+01:00] [debug] <<< backend::console_proxy::__ANON__(wrapped_call={
    "function" => "run_cmd",
    "console" => "svirt",
    "wantarray" => "",
    "args" => [
                "test -e /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk",
                "domain",
                "sshVMwareServer"
              ]
  })
[2022-01-26T18:33:36.285235+01:00] [debug] <<< backend::baseclass::run_ssh_cmd(cmd="test -e /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk", wantarray=0, keep_open=1, username="root", password="SECRET", hostname="openqaw8-vmware.qa.suse.de")

The problem is apparently that a leftover file exists which shouldn't:

[2022-01-26T18:34:41.051751+01:00] [debug] Use existing SSH connection (key:hostname=openqaw8-vmware.qa.suse.de,username=root,port=22)
[2022-01-26T18:34:41.064536+01:00] [debug] [run_ssh_cmd(xz --decompress --keep --verbose /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz)] stderr:
  xz: /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk: File exists

[2022-01-26T18:34:41.067476+01:00] [debug] [run_ssh_cmd(xz --decompress --keep --verbose /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz)] exit-code: 1
[2022-01-26T18:34:41.299347+01:00] [info] ::: basetest::runtest: # Test died: Image decompress in datastore failed!
   at sle/tests/installation/bootloader_svirt.pm line 146.
    bootloader_svirt::run(bootloader_svirt=HASH(0x560cf842d6e0)) called at /usr/lib/os-autoinst/basetest.pm line 360
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354
    basetest::runtest(bootloader_svirt=HASH(0x560cf842d6e0)) called at /usr/lib/os-autoinst/autotest.pm line 372
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 372
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 242
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 242
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 296
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x560cfa2895f0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x560cfa2895f0), CODE(0x560cfa73a5a0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 488
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x560cfa2895f0)) called at /usr/lib/os-autoinst/autotest.pm line 298
    autotest::start_process() called at /usr/bin/isotovideo line 261
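The exit code 1 matches documented xz behaviour: when decompressing, xz refuses to overwrite an existing target file unless forced. A minimal reproduction/workaround sketch (file names as in the log above, paths shortened):

  # fails with "File exists" while the stale target is present
  xz --decompress --keep --verbose SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz
  # either force the overwrite ...
  xz --decompress --keep --force --verbose SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz
  # ... or remove the stale file first
  rm -f SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk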

#14 Updated by mkittler 3 months ago

Maybe

[root@openqaw8-vmware:~] mv /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz.bak

helped but I cannot retry the openQA job to test that because assets are missing.

Taking care of such details within the test setup is likely also something better handled by the test writers.
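A sketch of what such a defensive setup step could look like on the VMware host (paths as in the log above; whether this belongs into bootloader_svirt.pm is up to the test maintainers):

  # remove a possibly stale decompressed image before extracting
  rm -f /vmfs/volumes/datastore1/openQA/SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk
  xz --decompress --keep --verbose /vmfs/volumes/datastore1/openQA/SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz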

I've been removing the wrong bugrefs from the jobs.

#15 Updated by mloviska 3 months ago

mkittler wrote:

Maybe

[root@openqaw8-vmware:~] mv /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz /vmfs/volumes/datastore1/openQA//SLES15-SP2-JeOS.x86_64-15.2-VMware-Build15.106.vmdk.xz.bak

helped but I cannot retry the openQA job to test that because assets are missing.

Taking care of such details within the test setup is likely also something better handled by the test writers.

I've been removing the wrong bugrefs from the jobs.

Seems like you are trying to use assets that we are not testing anymore. :)

#16 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

I am not testing anything; I am only taking care of this ticket, which was reopened due to these failing jobs. However, the failures turned out to be unrelated, so I'm resolving the ticket again. Of course it would make sense to avoid creating those jobs if they are not relevant anymore. (Maybe that's already the case; the last job is from 7 days ago.)
