action #105690

s390x svirt jobs incomplete with auto_review:"unable to extract assets:.*/var/lib/libvirt/images/a.img":retry

Added by okurz 4 months ago. Updated 4 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2022-01-28
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Jobs end up incomplete with the error "qemu-img: Could not open '/var/lib/libvirt/images/a.img': Could not open '/var/lib/libvirt/images/a.img': No such file or directory", for instance https://openqa.suse.de/tests/8047803 :

[2022-01-28T09:34:25.924322+01:00] [debug] running `nice ionice qemu-img convert -p -O qcow2 /var/lib/libvirt/images/a.img assets_public/sle-12-SP4-s390x-4.12.14-41.1.g82b276a-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-sle12-with-ltp.qcow2 -c`
[2022-01-28T09:34:25.930323+01:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 398978 and exit status: 0
[2022-01-28T09:34:25.930519+01:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 399590 and exit status: 0
[2022-01-28T09:34:25.934036+01:00] [debug]     (0.00/100%)
qemu-img: Could not open '/var/lib/libvirt/images/a.img': Could not open '/var/lib/libvirt/images/a.img': No such file or directory

[2022-01-28T09:34:25.934973+01:00] [warn] !!! bmwqemu::serialize_state: unable to extract assets: runcmd 'nice ionice qemu-img convert -p -O qcow2 /var/lib/libvirt/images/a.img assets_public/sle-12-SP4-s390x-4.12.14-41.1.g82b276a-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-sle12-with-ltp.qcow2 -c' failed with exit code 1: '    (0.00/100%)
qemu-img: Could not open '/var/lib/libvirt/images/a.img': Could not open '/var/lib/libvirt/images/a.img': No such file or directory
  ' at /usr/lib/os-autoinst/osutils.pm line 109.
    osutils::runcmd("nice", "ionice", "qemu-img", "convert", "-p", "-O", "qcow2", "/var/lib/libvirt/images/a.img", ...) called at /usr/lib/os-autoinst/backend/svirt.pm line 189
    backend::svirt::do_extract_assets(backend::svirt=HASH(0x100359af2d8), HASH(0x100349611d8)) called at /usr/lib/os-autoinst/backend/driver.pm line 80
    backend::driver::extract_assets(backend::driver=HASH(0x1003621a9f8), HASH(0x100349611d8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 163
    eval {...} called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 163
    OpenQA::Isotovideo::Utils::handle_generated_assets(OpenQA::Isotovideo::CommandHandler=HASH(0x100359c7600), 1) called at /usr/bin/isotovideo line 409
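
As an illustration of how the auto_review expression from the ticket title relates to the log above, here is a minimal Python sketch. The pattern is copied from the title. The first sample line is a shortened copy of the [warn] line above, the second is a hypothetical line with a properly named disk image, and the helper name matches_known_issue is made up for this example. It is not the actual openqa-label-known-issues implementation.

```python
import re

# Pattern copied from the auto_review field in the ticket title.
AUTO_REVIEW_PATTERN = r"unable to extract assets:.*/var/lib/libvirt/images/a.img"

# Shortened copy of the [warn] line from the failing job log above.
failing_line = (
    "[warn] !!! bmwqemu::serialize_state: unable to extract assets: runcmd "
    "'nice ionice qemu-img convert -p -O qcow2 /var/lib/libvirt/images/a.img ...' "
    "failed with exit code 1"
)

# Hypothetical line with a properly named disk image, for contrast.
other_line = (
    "unable to extract assets: runcmd 'qemu-img convert -p -O qcow2 "
    "/var/lib/libvirt/images/openQA-SUT-3a.img ...' failed"
)


def matches_known_issue(text: str) -> bool:
    """Return True if the text contains a match for the auto_review pattern."""
    return re.search(AUTO_REVIEW_PATTERN, text) is not None


if __name__ == "__main__":
    print(matches_known_issue(failing_line))
    print(matches_known_issue(other_line))
```

The pattern only matches when the image path is literally ".../images/a.img", so a failure involving a correctly named image would not be picked up under this ticket.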

It seems that https://github.com/os-autoinst/os-autoinst/pull/1936 is the culprit. Complete changelog since the last good job:

  • Thu Jan 27 2022 okurz@suse.com
  • Update to version 4.6.1643299616.01abba344:
  • Add more perl signatures
  • Add more perl signatures
  • Simplify code for assigning job settings in create_from_settings
  • Use constant for referring to a job's main settings
  • Allow changing job settings via restart API
  • Check whether clones created by the restart API take over the group
  • Prevent error when restarting jobs with skip_parents=1
  • Use tidyall instead of custom implementation

  • Tue Jan 25 2022 okurz@suse.com
  • Update to version 4.6.1643089984.09669586b:
  • Dependency cron 2022-01-22
  • Add signatures to OpenQA::App and OpenQA::BuildResults
  • templates: Fix wording for the "VNC display number"
  • Show number of restarts of a job within info box on details page

os-autoinst changes:

  • Thu Jan 27 2022 okurz@suse.com
  • Update to version 4.6.1643273407.65ca16b7:
  • svirt: Store vmname early for use after test run

  • Tue Jan 25 2022 okurz@suse.com
  • Update to version 4.6.1643061641.d319802b:
  • Continue further checks in fullstack test after one fails
  • Add OBS workflow
  • Simplify base inheritance statement with Mojo::Base everywhere
  • Exclude 29-backend-driver.t from OBS checks
  • Simplify string concatenation in log.pm
  • Extract all log functions into new module "log"
  • Add test for defining/starting VM via VMware in svirt backend
  • Add test for generating XML file with UEFI loader in svirt backend
  • Use tidyall for faster tidying
  • Fix single, unnecessary UTF8 character in consoles::VNC
  • git subrepo pull (merge) external/os-autoinst-common

Reproducible

Always reproducible on s390x svirt jobs since that change was deployed.

Expected result

Last good https://openqa.suse.de/tests/8019621/logfile?filename=autoinst-log.txt

showing:

[2022-01-23T01:41:54.907320+01:00] [debug] <<< testapi::type_string(string="nice ionice qemu-img convert -p -O qcow2 /var/lib/libvirt/images/openQA-SUT-3a.img /var/lib/libvirt/images/sle-12-SP4-s390x-4.12.14-40.1.g1475601-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-sle12-with-ltp.qcow2 && echo OK", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-23T01:42:02.749003+01:00] [debug] tests/kernel/../shutdown/svirt_upload_assets.pm:60 called svirt_upload_assets::extract_assets -> tests/kernel/../shutdown/svirt_upload_assets.pm:32 called testapi::assert_screen
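
The last good job converts /var/lib/libvirt/images/openQA-SUT-3a.img, while the failing job tries to open /var/lib/libvirt/images/a.img. The ticket does not spell out the root cause, but the difference would be consistent with the VM name being empty at the time the path was built. The following Python sketch only illustrates that hypothesis. The function names and the path layout (directory, VM name, disk letter, ".img") are assumptions for this example and are not taken from os-autoinst, which is written in Perl.

```python
IMAGE_DIR = "/var/lib/libvirt/images"


def disk_image_path(vmname: str, disk_id: str = "a") -> str:
    """Build an image path as <dir>/<vmname><disk_id>.img (assumed layout)."""
    return f"{IMAGE_DIR}/{vmname}{disk_id}.img"


def checked_disk_image_path(vmname: str, disk_id: str = "a") -> str:
    """Same as disk_image_path, but fail early when the VM name is missing."""
    if not vmname:
        raise ValueError("VM name is empty, refusing to build an image path")
    return disk_image_path(vmname, disk_id)


if __name__ == "__main__":
    # Path as seen in the last good job.
    print(disk_image_path("openQA-SUT-3"))
    # An empty VM name collapses to the path seen in the failing job.
    print(disk_image_path(""))
```

Failing early on an empty name, as in the checked variant, would turn the misleading "No such file or directory" into an error that points at the missing VM name.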

Workaround

Revert https://github.com/os-autoinst/os-autoinst/pull/1936 and retrigger tests. OSD deployment was rolled back with https://gitlab.suse.de/openqa/osd-deployment/-/jobs/813165


Related issues

Related to openQA Project - action #104520: Move svirt extract_asset code from os-autoinst-distri-opensuse to os-autoinst/backend/svirt.pm size:M auto_review:"unable to extract assets: Can't call method.+name.+on an undefined value":retry (Workable, 2021-12-29)

Copied to openQA Project - coordination #105699: [epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S (Blocked, 2021-12-29)

History

#1 Updated by okurz 4 months ago

  • Description updated (diff)

Was brought up by mdoucha in https://suse.slack.com/archives/C02CANHLANP/p1643362926351179 . I have rolled back the deployment on OSD and proposed a revert https://github.com/os-autoinst/os-autoinst/pull/1941 and will inform mdoucha.

#2 Updated by okurz 4 months ago

  • Related to action #104520: Move svirt extract_asset code from os-autoinst-distri-opensuse to os-autoinst/backend/svirt.pm size:M auto_review:"unable to extract assets: Can't call method.+name.+on an undefined value":retry added

#3 Updated by okurz 4 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Cris Dywan: Interesting. Those didn't show up on host=openqa.suse.de ./openqa-monitor-incompletes which I was checking just a few minutes ago
Oliver Kurz: Yes, because they received a label "missing_asset" already. Something we might want to make less greedy. I am currently running a locally patched "openqa-monitor-incompletes" which is also looking at already commented jobs

#4 Updated by cdywan 4 months ago

  • Description updated (diff)

#5 Updated by cdywan 4 months ago

okurz wrote:

Was brought up by mdoucha in https://suse.slack.com/archives/C02CANHLANP/p1643362926351179 . I have rolled back the deployment on OSD and proposed a revert https://github.com/os-autoinst/os-autoinst/pull/1941 and will inform mdoucha.

The rollback went through, but it brought back https://github.com/os-autoinst/os-autoinst/pull/1895 rather than the previous state.

okurz now proposed a revert of both https://github.com/os-autoinst/os-autoinst/pull/1942

#6 Updated by okurz 4 months ago

  • Description updated (diff)

#7 Updated by cdywan 4 months ago

  • Copied to coordination #105699: [epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S added

#8 Updated by cdywan 4 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback

cdywan wrote:

okurz wrote:

Was brought up by mdoucha in https://suse.slack.com/archives/C02CANHLANP/p1643362926351179 . I have rolled back the deployment on OSD and proposed a revert https://github.com/os-autoinst/os-autoinst/pull/1941 and will inform mdoucha.

The rollback went through, but it brought back https://github.com/os-autoinst/os-autoinst/pull/1895 rather than the previous state.

okurz now proposed a revert of both https://github.com/os-autoinst/os-autoinst/pull/1942

Packages are built, deployment triggered.

See #105699 for a proposal to conduct 5 WHYs.

#9 Updated by okurz 4 months ago

cdywan wrote:

okurz wrote:

Was brought up by mdoucha in https://suse.slack.com/archives/C02CANHLANP/p1643362926351179 . I have rolled back the deployment on OSD and proposed a revert https://github.com/os-autoinst/os-autoinst/pull/1941 and will inform mdoucha.

The rollback went through, but it brought back https://github.com/os-autoinst/os-autoinst/pull/1895 rather than the previous state.

I assumed that the issue, which was only observed in jobs as of today, was also introduced by today's deployment, but I was wrong. https://openqa.suse.de/tests/8047803#next_previous shows the last good job 6 days ago and the first bad job in the scenario, https://openqa.suse.de/tests/8045433 , already at 2022-01-28 00:53, so before this morning's deployment. I had assumed that after the last rollback https://gitlab.suse.de/openqa/osd-deployment/-/jobs/809450 two days ago you would have made sure that an actually fixed state was prepared and merged, but it was never ensured that all issues were fixed after https://github.com/os-autoinst/os-autoinst/pull/1936 .

okurz now proposed a revert of both https://github.com/os-autoinst/os-autoinst/pull/1942

That was merged and is currently planned to be deployed by an extraordinary deployment https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/308409

As discussed in chat, cdywan and I agreed to hold a lessons-learned session with a 5 whys analysis of the incident. See #105699

#10 Updated by okurz 4 months ago

  • Description updated (diff)

#11 Updated by okurz 4 months ago

deployment succeeded in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/813761 . The changelog includes the double-revert commits. I triggered another run of openqa-label-known-issues and will monitor https://openqa.suse.de/tests/8053716 as an example run.

#12 Updated by cdywan 4 months ago

  • Status changed from Feedback to Resolved

okurz wrote:

deployment succeeded in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/813761 . The changelog includes the double-revert commits. I triggered another run of openqa-label-known-issues and will monitor https://openqa.suse.de/tests/8053716 as an example run.

Job passed and uploaded all assets successfully.
